Detailed Article: https://autobe.dev/articles/local-llm-benchmark-about-backend-generation.html
Five months ago I posted the "Hardcore function calling benchmark in backend coding agent" thread here. As I wrote in that post, it was an uncontrolled measurement — useful for showing whether each model could fill our complex recursive-union AST schemas at all, but not really a benchmark in any rigorous sense.
This post is the proper version, with controlled variables and a real scoring rubric.
Three findings worth sharing
The function calling harness has effectively closed the frontier-vs-local gap on backend generation. gpt-5.4's DB/API design ≈ qwen3.5-35b-a3b's. claude-sonnet-4.6's logic ≈ qwen3.5-27b's.
This is the last round we include frontier models. Running them every month is genuinely too expensive for an open-source project — one shopping-mall run is ~200–300M tokens (~$1,000–$1,500 per model on GPT 5.5 pricing). From next month, the comparison set is limited to OpenRouter endpoints under $0.25/M, or models that fit on a 64GB unified-memory laptop.
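The per-model cost above is simple token arithmetic. A minimal sketch, where the ~$5/M blended rate is an assumption back-solved from the quoted figures, not a published price:

```python
def run_cost_usd(tokens_millions: float, price_per_million_usd: float) -> float:
    """Cost of one benchmark run: total tokens (in millions of tokens)
    times the blended price per 1M tokens."""
    return tokens_millions * price_per_million_usd

# ~200-300M tokens at an assumed ~$5/M blended rate reproduces
# the $1,000-$1,500 per-model range quoted above.
for tokens in (200, 300):
    print(f"{tokens}M tokens -> ${run_cost_usd(tokens, 5.00):,.0f}")
```

At $5/M this prints $1,000 for 200M tokens and $1,500 for 300M, matching the range above; the real blended rate depends on the input/output token split.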
Frontend automation joins the benchmark in two or three months. The SDK that AutoBe already emits is enough to drive a working AI-built frontend end-to-end (visuals rough, but every function works). The June/July round will cover backend + auto-generated frontend together.
Three inversions, still investigating
A few results I'm honestly not sure how to read yet:
- openai/gpt-5.4 actually scores below its own mini sibling.
- deepseek-v4-pro lands one notch below qwen3.5-35b-a3b, and barely separates from its own Flash sibling.
- Within the Qwen family, dense 27B beats every MoE variant, even 397B-A17B.
Two readings I want to investigate before claiming anything:
- CoT-compliance phenomenon: larger, more frontier-tier models tend to skip procedural instructions, which our harness enforces strictly.
- Benchmark defects: only n=4 reference projects, a narrow score band, and our own harness scoring our own pipeline.
I'll report back in a future round once we've dug more.
Recommendations welcome
Three candidates we're locked in on so far:
- openai/gpt-5.4-nano at $0.25/M
- qwen/qwen3.6-27b at $0.195/M
- deepseek/deepseek-v4-flash at $0.14/M
If you know other small models that meet either condition (under $0.25/M on OpenRouter, or runnable on a 64GB unified-memory laptop) and handle function calling cleanly, please drop a comment.
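The two conditions can be sketched as a toy filter. This is purely illustrative (the function and its parameters are hypothetical, not any real OpenRouter API); the price cap is treated as inclusive, since gpt-5.4-nano sits exactly at $0.25/M:

```python
from typing import Optional

def eligible(price_per_million_usd: Optional[float], fits_64gb_laptop: bool) -> bool:
    """True if a model costs at most $0.25/M on OpenRouter (cap treated as
    inclusive), or runs locally on a 64GB unified-memory laptop."""
    cheap_enough = price_per_million_usd is not None and price_per_million_usd <= 0.25
    return cheap_enough or fits_64gb_laptop

# The three locked-in candidates all pass on price alone;
# a local-only model with no hosted price can still qualify via hardware.
print(eligible(0.25, False))   # gpt-5.4-nano, exactly at the cap
print(eligible(0.14, False))   # deepseek-v4-flash
print(eligible(None, True))    # hosted nowhere, but fits in 64GB
```

Note that function calling quality is the real gate here, and that part you can only check by running the model.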
r/LocalLLaMA tends to spot these faster than we do, and recommendations from this thread will fill out a big chunk of next month's comparison set.