Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
arXiv:2604.01597v1 Announce Type: new
Abstract: Traditional RL algorithms such as Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, assuming that all generated episodes provide a beneficial optimizatio…