Sharpness-Guided Group Relative Policy Optimization via Probability Shaping
arXiv:2511.00066v4
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improving large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer …