cs.AI, cs.CL, cs.LG

Learning to Hint for Reinforcement Learning

arXiv:2604.00698v1 Announce Type: cross
Abstract: Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same …