Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
arXiv:2605.06650v1 Announce Type: new
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for enhancing the reasoning ability of large language models (LLMs), owing to its deterministic verification. The communit…