cs.LG

Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error

arXiv:2510.26109v4 Announce Type: replace
Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly boosted the reasoning capability of language models (LMs). However, existing RLVR approaches train LMs based on their own on-po…