On-Policy Distillation with Best-of-N Teacher Rollout Selection
arXiv:2605.09725v2 Announce Type: replace
Abstract: On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward depend…