Duc Anh Le, Tien-Phat Nguyen, Thien Huu Nguyen, Linh Ngo Van, Trung Le

Selective Off-Policy Reference Tuning with Plan Guidance

May 14, 2026

arXiv:2605.11505v2
Abstract: Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without chan…
