Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning
arXiv:2505.07527v5 Announce Type: replace
Abstract: The advantage function is a central concept in reinforcement learning (RL) that helps reduce variance in policy gradient estimates. For language modeling, Group Relative Policy Optimization (GRPO) was proposed to use the wit…
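
To make the variance-reduction role of the advantage concrete, here is a minimal sketch of a group-relative advantage as GRPO is commonly described: each sampled response's reward is centered on the mean reward of its group and scaled by the group's standard deviation. The function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration. This sketch does not implement the Kalman filter enhancement named in the title, whose details are not given in the text above.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every reward in the group is identical.
    """
    mean = statistics.fmean(rewards)   # group mean acts as the baseline
    std = statistics.pstdev(rewards)   # group spread used for scaling
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so the policy gradient is driven by relative quality within the group rather than by the raw reward scale.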