Huimin Xu, Shuai Zhao, Xiaobao Wu, Anh Tuan Luu

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

May 13, 2026

arXiv:2605.11491v1
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning ability of large language models. However, widely used RLVR algorithms, such as GRPO, of…
