Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
arXiv:2605.14539v1 Announce Type: new
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by s…