Selective Off-Policy Reference Tuning with Plan Guidance
arXiv:2605.11505v2 Announce Type: replace
Abstract: Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without chan…
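
The stall on all-fail prompts can be illustrated with a minimal sketch of group-relative advantage normalization, the step commonly used in GRPO-style training. It is not SORT's repair update, which the truncated abstract above does not spell out. The function name `group_relative_advantages` and the `eps` constant are hypothetical, chosen only for this example, and the rewards are assumed to be binary verifier outcomes.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: this mirrors the usual GRPO-style formula
    (r - mean) / (std + eps); `eps` is an illustrative constant.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# A prompt where some rollouts pass the verifier: advantages differ in sign,
# so the policy gradient has a direction to move in.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A hard prompt where every rollout fails: all rewards equal the group mean,
# so every advantage is zero and the prompt contributes no learning signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_fail)
```

Under these assumptions, the all-fail group yields identically zero advantages, which is the situation the abstract describes as stalling on hard prompts.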