cs.AI, cs.LG

Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

arXiv:2605.12380v1 Announce Type: new
Abstract: Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model traini…