How do you objectively tell if your custom agent tools are actually better?

I've been running Qwen3.6-35B-A3B locally in pi agent and hit the `cat` spam problem. The agent just ignores the read tool and gets stuck reading the same file 3-4 times using `cat`, or dumping entire 2k-line logs instead of grepping.

I wrote a custom tool as a replacement. It feels like it helped: the agent makes fewer calls, doesn't blindly re-read the same file, and tasks seem to finish faster.
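One cheap first step (a sketch, not a full eval): log every tool call and count repeats. The transcript format below is hypothetical — a list of `(tool_name, target)` tuples — so adapt it to whatever your agent actually logs:

```python
from collections import Counter

def tool_call_stats(events):
    """Summarize an agent transcript: total calls and wasted re-reads.

    `events` is a hypothetical format: a list of (tool_name, target)
    tuples pulled from the agent's log. Adjust to your harness.
    """
    # Count how often each file was read (via `cat` or a read tool)
    reads = Counter(target for name, target in events if name in ("cat", "read"))
    return {
        "total_calls": len(events),
        # Every read of a file beyond the first is a repeat
        "repeat_reads": sum(count - 1 for count in reads.values() if count > 1),
    }

transcript = [
    ("cat", "server.py"),
    ("cat", "server.py"),   # blind re-read
    ("grep", "timeout"),
    ("cat", "server.py"),   # and again
]
print(tool_call_stats(transcript))
```

Comparing `repeat_reads` before and after the tool swap at least turns "feels like it helped" into a number, even before any real benchmark.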

But I have zero objective way to know if it's actually better.

Maybe I'm just cherry-picking the tasks where it works.

So I'm curious — how do you test whether your tool set is genuinely improving things? Do you write benchmarks?
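The simplest benchmark I can imagine is an A/B harness: fix a task set, run each tool set several times per task, and compare success rate and call count. Everything here is a sketch — `run_task` is a placeholder for your own agent loop, and the metric names are made up:

```python
import statistics

def compare_toolsets(tasks, run_task, toolsets, trials=3):
    """Run every task under each tool set and aggregate simple metrics.

    `run_task(task, toolset)` is a hypothetical hook into your agent
    harness; it should return a dict like
    {"success": bool, "tool_calls": int}.
    """
    results = {}
    for name, toolset in toolsets.items():
        runs = [run_task(task, toolset) for task in tasks for _ in range(trials)]
        results[name] = {
            "success_rate": sum(r["success"] for r in runs) / len(runs),
            "mean_calls": statistics.mean(r["tool_calls"] for r in runs),
        }
    return results
```

Running the same fixed task list against both tool sets (instead of whatever task is in front of you that day) is what removes the cherry-picking worry, though you'd still want enough trials to smooth out sampling noise from the model.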

submitted by /u/Own_Suspect5343
