cs.AI, cs.LG

ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models

arXiv:2603.28204v1 Announce Type: new
Abstract: Reinforcement learning from verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models. However, standard Group Relative Policy Optimization (GRPO) typically…