Smooth Gate Functions for Soft Advantage Policy Optimization
arXiv:2602.19345v2 Announce Type: replace-cross
Abstract: Group Relative Policy Optimization (GRPO) has significantly advanced the training of large language models and enhanced their reasoning capabilities, but it remains susceptible to instability…