been running qwen 32b, gemma 9b, and command r 32b on my M4 mac mini for a few weeks for agent tasks, and there is a specific failure mode nobody talks about enough. all three start losing context around tool call 8 or 9 in a chain.
the symptom: first 6 or 7 tool calls go fine. then around call 8 the model starts returning the first tool's arguments with the last tool's name. by call 10 it is just making up tool names that do not exist. you would think it is a context window issue, but I am running with 16k context and the chain is nowhere near that.
what I think is actually happening: the attention gets spread thin across the accumulated tool call history. the schemas for earlier tools crowd out the later ones in the attention budget. the context window has room, but the effective attention weight per tool drops below some threshold.
what has helped, roughly in order of impact:
prune old tool results from the context. keep the system prompt plus the last 4 tool calls and drop the rest. you lose memory of what happened earlier, but it buys 2 or 3 more reliable calls.
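the pruning step is simple to implement on top of openai-style message dicts (which is what most local stacks speak). this is a rough sketch of what I do, assuming the usual role/tool_calls message shape; adjust for your framework:

```python
def prune_history(messages: list[dict], keep_last: int = 4) -> list[dict]:
    """keep the system prompt, all plain user/assistant text, and only
    the last `keep_last` tool call/result pairs. assumes openai-style
    message dicts (role, optional tool_calls, role == "tool" results)."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # indices of assistant turns that issued tool calls
    call_idxs = [i for i, m in enumerate(rest)
                 if m["role"] == "assistant" and m.get("tool_calls")]
    if len(call_idxs) <= keep_last:
        return messages  # nothing old enough to prune
    cutoff = call_idxs[-keep_last]
    kept = []
    for i, m in enumerate(rest):
        if i >= cutoff:
            kept.append(m)          # recent tool traffic stays intact
        elif m["role"] == "tool":
            continue                # drop old tool results
        elif m["role"] == "assistant" and m.get("tool_calls"):
            continue                # drop the old tool call turns too
        else:
            kept.append(m)          # keep user/assistant prose
    return system + kept
```

dropping the old assistant tool-call turns along with their results matters: an orphaned tool_calls message with no matching result will make some servers reject the request.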
reset tool schemas between phases. if the agent finishes a search phase and moves to an action phase, explicitly remove the search tools from the schema for the action phase. smaller tool menu means better attention per tool.
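the phase reset amounts to keeping one full schema registry and subsetting it per request, so the model only ever sees the active phase's tools. minimal sketch, with hypothetical phase names and tools:

```python
# hypothetical phase -> tool-name mapping; the point is that the
# "action" request never carries the search schemas at all
PHASES = {
    "search": ["find_in_docs", "grep_repo"],
    "action": ["write_file", "run_tests"],
}

def tools_for_phase(phase: str, all_schemas: dict[str, dict]) -> list[dict]:
    """subset the full schema registry down to the current phase's tools,
    preserving the phase's declared ordering."""
    return [all_schemas[name] for name in PHASES[phase]]
```

then each request sends `tools=tools_for_phase(current_phase, registry)` instead of the whole menu. a 2-tool schema block is also just fewer prompt tokens, which compounds with the history pruning.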
name tools so they diverge early in the name instead of sharing a prefix. search_docs and search_code confuse models more than find_in_docs and grep_repo. specific, distinct verbs help a lot.
if you can run a 70b model the problem basically disappears. but on 32b and below, the attention pruning matters.
the frustrating thing: benchmarks do not catch this because they usually test 1 to 3 tool calls. real workloads run 15-call chains for any meaningful agent task.
anyone else seeing this pattern? and is there a specific setting in vllm or llama.cpp that helps with sustained tool call reliability?