cs.AI, cs.LG

HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

arXiv:2603.23871v1 Announce Type: new
Abstract: Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all – “cliff” prompts – the RL gradient vanis…
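
The vanishing signal on "cliff" prompts can be illustrated with a minimal sketch. It assumes a binary correctness reward and a group-mean baseline, neither of which is stated in the truncated abstract, and the helper name `group_advantages` is hypothetical. Because a policy-gradient estimator weights each rollout's log-probability gradient by its advantage, a group whose advantages are all zero contributes no update.

```python
from statistics import mean


def group_advantages(rewards):
    """Mean-baseline advantages for one prompt's group of sampled rollouts.

    Assumption: the baseline is the group mean reward. The abstract does not
    say which baseline or RL algorithm the paper uses.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# A "cliff" prompt: every sampled solution is wrong, so every reward is 0.
cliff_rewards = [0.0, 0.0, 0.0, 0.0]
# A prompt the model sometimes solves: one of four rollouts is correct.
mixed_rewards = [0.0, 1.0, 0.0, 0.0]

cliff_adv = group_advantages(cliff_rewards)
mixed_adv = group_advantages(mixed_rewards)

print(cliff_adv)  # all zeros: no rollout is preferred, so no learning signal
print(mixed_adv)  # non-zero: the correct rollout is pushed up, the rest down
```

Under these assumptions the same conclusion holds without a baseline: if every reward is 0, each rollout's gradient term is multiplied by 0.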