Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

arXiv:2603.24709v2 Announce Type: replace-cross

Abstract: Multi-step tool orchestration remains challenging for LLMs, as state-of-the-art models frequently fail at full-sequence execution due to parameter errors. Training for these workflows faces two obstacles: the lack of environments supporting complex real-world API dependencies, and sparse binary rewards that provide no signal for partial correctness. We propose a reinforcement learning framework addressing both challenges. First, we construct a deterministic environment backed by a large-scale cache of real API responses, enabling constrained synthesis of valid multi-step traces with controllable complexity. Second, we introduce a graduated reward that decomposes correctness into atomic validity (call-level correctness at increasing granularity) and orchestration consistency (correct sequencing with dependency respect). On ComplexFuncBench, our approach substantially improves turn accuracy, with ablations confirming that both reward components are essential. Cross-benchmark evaluation on BFCL v4 shows that the learned orchestration skills transfer to entirely different API ecosystems (e.g., agentic web search and memory management), yielding consistent gains while maintaining stable single-step performance. Code is available at https://github.com/horizon-rl/ToolOrchestrationReward
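The graduated reward described above could be sketched as follows. This is a minimal illustrative reconstruction, not the paper's implementation: all function names, the three-level granularity (tool name, parameter names, parameter values), the dependency-pair scoring, and the 0.5/0.5 weighting are assumptions made for the example.

```python
# Hypothetical sketch of a graduated reward for multi-step tool calls.
# Assumption: a "call" is a dict {"name": str, "params": dict}; a trace is
# a list of calls. Weights and granularity levels are illustrative only.

def atomic_validity(pred_call, gold_call):
    """Score one call at increasing granularity:
    correct tool name -> correct parameter names -> correct values."""
    if pred_call["name"] != gold_call["name"]:
        return 0.0
    score = 1.0 / 3.0  # credit for selecting the right tool
    gold_params, pred_params = gold_call["params"], pred_call["params"]
    if gold_params:
        name_hits = sum(k in pred_params for k in gold_params) / len(gold_params)
        value_hits = sum(pred_params.get(k) == v
                         for k, v in gold_params.items()) / len(gold_params)
    else:
        name_hits = value_hits = 1.0
    return score + name_hits / 3.0 + value_hits / 3.0


def orchestration_consistency(pred_trace, gold_trace):
    """Fraction of ordered call pairs from the gold trace whose relative
    order is preserved in the prediction (a proxy for dependency respect;
    this simple version assumes distinct tool names per trace)."""
    gold_names = [c["name"] for c in gold_trace]
    pred_names = [c["name"] for c in pred_trace]
    pairs = [(a, b) for i, a in enumerate(gold_names) for b in gold_names[i + 1:]]
    if not pairs:
        return 1.0
    kept = sum(1 for a, b in pairs
               if a in pred_names and b in pred_names
               and pred_names.index(a) < pred_names.index(b))
    return kept / len(pairs)


def graduated_reward(pred_trace, gold_trace, w_atomic=0.5, w_orch=0.5):
    """Dense reward: partial credit per call plus sequencing consistency,
    instead of a sparse all-or-nothing binary signal."""
    n = max(len(gold_trace), 1)
    atomic = sum(atomic_validity(p, g)
                 for p, g in zip(pred_trace, gold_trace)) / n
    return w_atomic * atomic + w_orch * orchestration_consistency(pred_trace, gold_trace)
```

A trace that gets the first call's tool and parameter names right but the value wrong, and omits the second call, still earns a nonzero reward, which is the kind of partial-correctness signal a binary reward cannot provide.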
