cs.AI, cs.LG

TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

arXiv:2605.10194v1 Announce Type: new
Abstract: On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the…