Learning to Hint for Reinforcement Learning
arXiv:2604.00698v1 Announce Type: cross
Abstract: Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same …