cs.AI, cs.LG

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

arXiv:2605.11491v1 Announce Type: new
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning ability of large language models. However, widely used RLVR algorithms, such as GRPO, of…