Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL
arXiv:2605.05802v1 Announce Type: new
Abstract: Group-relative RL training (GRPO) samples a small group of parallel rollouts for every training prompt and uses their within-group reward spread to compute per-trajectory advantages. In agentic environme…
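As a rough illustration of the group-relative advantage computation mentioned in the abstract, the sketch below normalizes each rollout's reward by its group's mean and standard deviation, which is the commonly used GRPO formulation. The function name `group_relative_advantages` and the `eps` parameter are hypothetical, and the exact normalization used in this paper is not visible in the truncated abstract, so treat this as an assumption rather than the authors' method.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per rollout in a single prompt's group.

    Assumption: advantages are (reward - group mean) / (group std + eps),
    the common GRPO form. `eps` (hypothetical) avoids division by zero
    when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four parallel rollouts for one prompt: two succeed, two fail.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group with no reward spread yields zero advantage for every rollout.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In this formulation, a group whose rollouts all receive the same reward produces zero advantages and therefore contributes no gradient signal for that prompt.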