Rasool Fakoor, Murdock Aubry, Nicholas Stranges, Alexander J. Smola

Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

Rasool Fakoor, Murdock Aubry, Nicholas Stranges, Alexander J. Smola / May 13, 2026

arXiv:2605.12380v1
Abstract: Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model traini…
