Same 9B Qwen weights: 19.1% in Aider vs 45.6% with a scaffold adapted to small local models

I spent the past week testing a simple question:

Small local models often look weak inside coding agents. But how much of that is actually model weakness, and how much is scaffold mismatch?

So I held the model fixed and changed only the scaffold.

Same Qwen3.5-9B Q4 weights in both conditions.

Same Aider Polyglot benchmark.

Full 225 exercises.

Results:

- vanilla Aider: 19.11%

- little-coder: 45.56% mean pass@2 across two full runs

little-coder is not a new model. It is a scaffold I adapted to the behavioral profile of a ~10B local model: bounded reasoning budget, a Write guard that refuses to overwrite existing files, explicit workspace discovery, and small per-turn skill injections instead of one huge static preamble.
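To make the Write-guard idea concrete, here is a minimal sketch of what such a guard could look like. This is my illustration, not little-coder's actual code; the class name, method names, and refusal message are all assumptions.

```python
from pathlib import Path

class WriteGuard:
    """Hypothetical sketch of a Write guard for a small-model coding
    scaffold: refuse to overwrite files that already exist, pushing the
    model toward targeted edits instead of whole-file rewrites.
    (Illustrative only; not the actual little-coder implementation.)"""

    def __init__(self, workspace: str):
        self.workspace = Path(workspace)

    def write(self, relative_path: str, content: str) -> str:
        target = self.workspace / relative_path
        if target.exists():
            # Small models often "rewrite" a whole file and silently drop
            # code in the process; blocking overwrites here forces them to
            # use a dedicated edit tool for existing files.
            return f"REFUSED: {relative_path} already exists; use the edit tool."
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        return f"WROTE: {relative_path}"
```

The point of a guard like this is behavioral, not security: it converts a common sub-10B failure mode (lossy full-file rewrites) into an explicit tool-use decision.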

This is not a conference paper. There are obvious things a proper paper would still want:

- more replications

- component ablations

- more model families

- maybe a second benchmark

But the effect size was large enough that I thought it was worth sharing now (unfortunately, I don’t have time to do the above).

My takeaway is fairly narrow:

At this scale, coding-agent benchmark results are not just properties of model weights. They are also properties of scaffold–model fit.

I suspect sub-10B local models may have been written off too early in coding-agent evaluation.

Full write-up, code, and numbers here: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-agent

Would be very interested in replication attempts, failure cases, or reasons you think this would not generalize.

submitted by /u/Creative-Regular6799