PrAg-PO: Prompt Augmented Policy Optimization for Robust and Diverse Mathematical Reasoning
arXiv:2602.03190v3
Abstract: Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have shown strong potential for improving the mathematical reasoning capabilities of large language models. W…