cs.AI

Selective Off-Policy Reference Tuning with Plan Guidance

arXiv:2605.11505v2 Announce Type: replace
Abstract: Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without chan…
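
As a rough illustration of the stall the abstract describes (not of SORT itself, whose repair update is not detailed in the text available here), the sketch below assumes the commonly used GRPO-style group normalization: each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. The function names, the `eps` constant, and the binary 0/1 verifier reward are hypothetical choices made for this example.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    a common GRPO-style formulation; eps only guards against division by zero.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def has_learning_signal(rewards):
    """True if at least one rollout in the group gets a nonzero advantage."""
    return any(a != 0.0 for a in group_relative_advantages(rewards))


# Hypothetical verifier rewards: 1.0 = correct answer, 0.0 = failed rollout.
all_fail = [0.0, 0.0, 0.0, 0.0]   # hard prompt: every sampled rollout fails
mixed = [1.0, 0.0, 0.0, 0.0]      # easier prompt: one rollout succeeds

print(group_relative_advantages(all_fail))
print(has_learning_signal(all_fail), has_learning_signal(mixed))
```

When every rollout in a group fails, all rewards equal the group mean, so every advantage is zero and the prompt contributes no policy-gradient signal. Under these assumptions, that is the gap a separate update for all-fail groups would have to fill.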