Local LLM Benchmark about Backend Generation by Function Calling (GLM vs Qwen vs DeepSeek)

Detailed Article: https://autobe.dev/articles/local-llm-benchmark-about-backend-generation.html


Five months ago I posted the "Hardcore function calling benchmark in backend coding agent" thread here. As I wrote in that post, it was an uncontrolled measurement — useful for showing whether each model could fill our complex recursive-union AST schemas at all, but not really a benchmark in any rigorous sense.
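
For anyone who missed that thread, "recursive-union AST schema" means roughly the shape below. This is a heavily simplified, hypothetical sketch; the real AutoBe schemas are much larger, and every name here is illustrative:

```typescript
// Heavily simplified, hypothetical sketch of a recursive-union AST type.
// The real AutoBe schemas are far larger; every name here is illustrative.
type Expression =
  | { kind: "literal"; value: string | number | boolean }
  | { kind: "column"; table: string; name: string }
  | { kind: "binary"; op: "+" | "-" | "and" | "or"; left: Expression; right: Expression }
  | { kind: "call"; function: string; args: Expression[] };
```

The harness exposes a union like this as a function-calling parameter schema (JSON Schema `oneOf` with recursive `$ref`s), so the earlier thread was essentially checking which models could emit deeply nested values of that shape without breaking validation.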

This post is the proper version, with controlled variables and a real scoring rubric.

Three findings worth sharing

  1. The function calling harness has effectively closed the frontier-vs-local gap on backend generation. gpt-5.4's DB/API design ≈ qwen3.5-35b-a3b's. claude-sonnet-4.6's logic ≈ qwen3.5-27b's.

  2. This is the last round that includes frontier models. Running them every month is genuinely too expensive for an open-source project: one shopping-mall run is ~200–300M tokens, which at GPT 5.5 pricing works out to ~$1,000–$1,500 per model (roughly $5 per million tokens, blended). From next month on, the comparison set is limited to OpenRouter endpoints under $0.25/M, or models that fit on a 64GB unified-memory laptop.

  3. Frontend automation joins the benchmark in two or three months. The SDK that AutoBe already emits is enough to drive a working AI-built frontend end to end (visuals are rough, but every function works); there's a minimal sketch of what that looks like just below. The June/July round will cover backend + auto-generated frontend together.
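
For a concrete picture of item 3, here is a minimal hypothetical sketch of driving a backend through the generated SDK. The package name, endpoint path, and DTO fields are all made up; treat this as a sketch of the pattern, not the exact surface AutoBe emits:

```typescript
// Hypothetical sketch: calling an AutoBe-generated SDK from frontend code.
// The package name, endpoint path, and DTO fields below are placeholders;
// the real SDK exposes one fully typed function per generated API endpoint.
import api, { IConnection } from "@your-org/your-project-api";

async function demo(): Promise<void> {
  const connection: IConnection = { host: "http://localhost:37001" };

  // Request and response bodies are both typed, so an AI-built frontend
  // gets compile-time checking instead of hand-rolled fetch() calls.
  const product = await api.functional.shopping.products.create(connection, {
    body: { name: "Sample product", price: 10_000 },
  });

  console.log(product.id);
}

demo().catch(console.error);
```

That typed surface is what lets an LLM wire up a working UI without guessing request shapes.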

Three inversions, still investigating

A few results I'm honestly not sure how to read yet:

  • openai/gpt-5.4 actually scores below its own mini sibling.
  • deepseek-v4-pro lands one notch below qwen3.5-35b-a3b, and barely separates from its own Flash sibling.
  • Within the Qwen family, dense 27B beats every MoE variant — even 397B-A17B.

Two readings I want to investigate before claiming anything:

  1. A CoT-compliance effect: bigger, more frontier-tier models tend to skip procedural instructions, and our harness enforces those strictly.
  2. Benchmark defects: only n=4 reference projects, a narrow score band, and our own harness scoring our own pipeline.

I'll report back in a future round once we've dug deeper.

Recommendations welcome

Three candidates we're locked in on so far:

  • openai/gpt-5.4-nano — $0.25/M
  • qwen/qwen3.6-27b — $0.195/M
  • deepseek/deepseek-v4-flash — $0.14/M

If you know other small models that meet either condition (under $0.25/M on OpenRouter, or runnable on a 64GB unified-memory laptop) and handle function calling cleanly, please drop a comment.
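
As a back-of-envelope check for the laptop condition (the ~0.5 bytes/parameter figure assumes a 4-bit quant; this is a rule of thumb on my part, not part of the benchmark):

```typescript
// Rough fit check for "runs on a 64GB unified-memory laptop".
// Assumption: ~0.5 bytes per parameter at a 4-bit quant, plus headroom
// for KV cache, runtime, and OS. A rule of thumb, not a guarantee.
function fitsInUnifiedMemory(paramsBillions: number, memoryGB = 64): boolean {
  const weightsGB = paramsBillions * 0.5; // 4-bit quant ≈ 0.5 bytes/param
  const headroomGB = 8;                   // KV cache, runtime, OS (rough)
  return weightsGB + headroomGB <= memoryGB;
}

console.log(fitsInUnifiedMemory(27));  // true:  ~13.5 GB of weights
console.log(fitsInUnifiedMemory(120)); // false: ~60 GB of weights alone
```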

r/LocalLLaMA tends to spot these faster than we do, and recommendations from this thread will fill out a big chunk of next month's comparison set.

