ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models
arXiv:2603.28204v1
Abstract: Reinforcement learning from verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models. However, standard Group Relative Policy Optimization (GRPO) typically…