cs.AI, cs.LG

Smooth Gate Functions for Soft Advantage Policy Optimization

arXiv:2602.19345v2 Announce Type: replace-cross
Abstract: Group Relative Policy Optimization (GRPO) has significantly advanced the training of large language models and enhanced their reasoning capabilities, while it remains susceptible to instability…