Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
arXiv:2605.05040v1 Announce Type: new
Abstract: On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level training signals. However, its reliance on a stronger external teacher has driven recent work on o…
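For context, below is a minimal sketch of the token-level KL-matching objective that the title says this work moves beyond. This is the standard on-policy distillation loss, not the paper's preference-based, reward-regularized variant (the abstract is truncated before the method is described); the function name `token_level_kl`, the reverse-KL direction, and the toy shapes are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def token_level_kl(student_logits, teacher_logits, mask):
    # Per-token KL(student || teacher): the "dense token-level training
    # signal" the abstract refers to. Every sequence position contributes
    # its own gradient, unlike the single scalar reward per rollout in RL.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)  # (batch, seq)
    # Average over non-padding tokens only.
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage: batch of 2 sequences, length 5, vocabulary of 100.
student = torch.randn(2, 5, 100, requires_grad=True)
teacher = torch.randn(2, 5, 100)  # frozen external teacher's logits
mask = torch.ones(2, 5)
loss = token_level_kl(student, teacher, mask)
loss.backward()
```

In on-policy distillation the sequences scored here would be sampled from the student itself, with the teacher providing per-token targets; the dependence on a stronger frozen teacher in that setup is the limitation the abstract raises.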