A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
arXiv:2605.06200v1 Announce Type: new
Abstract: Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls wi…