Dong Shu, Denghui Zhang, Jessica Hullman

Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training

Dong Shu, Denghui Zhang, Jessica Hullman / April 3, 2026

arXiv:2604.01597v1 Announce Type: new
Abstract: Traditional RL algorithms like Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that all generated episodes provide a beneficial optimizatio…
