Local coding models have crossed the threshold of being feasible for real work

We ran open-weight 27B–32B models on Terminal-Bench 2.0 (89 tasks, terminal-bench-2.git @ 69671fb) through our agent harness. The best result was Qwen 3.6-27B at 38.2% (34/89) under the default per-task timeout, the same constraint the public leaderboard uses (Qwen's official post uses a more relaxed config). We deliberately kept the official leaderboard's default setup because we wanted an apples-to-apples number against the verified leaderboard.

https://preview.redd.it/zqlzk1303uxg1.png?width=1800&format=png&auto=webp&s=42c0526b2ce9377cad927ef68e24fae1a89181c6
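For readers who haven't run TB: the timeout constraint matters because a slow local model can run out of wall-clock time on a task it would otherwise solve. A minimal sketch of what a per-task timeout looks like in a harness (the script name and the 900-second budget are assumptions for illustration, not Terminal-Bench internals):

```python
import subprocess

# Hypothetical per-task runner: enforce a hard wall-clock budget per task,
# mirroring the benchmark's default per-task timeout. The entry-point script
# and the 900 s budget are illustrative assumptions, not TB 2.0 internals.
TASK_TIMEOUT_S = 900

def run_task(task_id: str) -> bool:
    try:
        result = subprocess.run(
            ["./run_task.sh", task_id],  # hypothetical task entry point
            capture_output=True,
            timeout=TASK_TIMEOUT_S,      # hard per-task limit, no retries
        )
        return result.returncode == 0    # the task's own checks decide pass/fail
    except subprocess.TimeoutExpired:
        return False                     # a timeout is scored as a failure
```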

One interesting finding: MoE models still hold roughly an order-of-magnitude advantage in inference speed.

https://preview.redd.it/wbmsuq704uxg1.png?width=1000&format=png&auto=webp&s=17db5694f34a2e869e9a4b66696d4986f90a982b
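A back-of-envelope for why: single-stream decoding is usually memory-bandwidth-bound, so throughput scales with the bytes of weights touched per token, and an MoE model only activates a small fraction of its parameters per token. A toy estimate in Python (every number below is an assumption for illustration, not a measurement from our runs):

```python
# Toy decode-throughput model: bandwidth-bound decoding streams every active
# weight once per token, so tokens/s ~ bandwidth / active weight bytes.
# All numbers are illustrative assumptions, not measurements.
BANDWIDTH_GBPS = 1000  # assumed single-GPU memory bandwidth, GB/s

def tokens_per_sec(active_params_b: float, bytes_per_param: float = 2.0) -> float:
    active_gb = active_params_b * bytes_per_param  # weight bytes read per token
    return BANDWIDTH_GBPS / active_gb

dense_32b = tokens_per_sec(32)  # dense: all 32B params active per token
moe_3b = tokens_per_sec(3)      # MoE: ~3B active params per token (assumed)

print(f"dense 32B:      {dense_32b:6.1f} tok/s")  # ~15.6 tok/s
print(f"MoE ~3B active: {moe_3b:6.1f} tok/s")     # ~166.7 tok/s, roughly 10x
```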

The number that matters isn't 38.2% in absolute terms; current verified SOTA is ~80% (GPT-5.5 / Opus 4.6 / Gemini 3.1 Pro). The interesting part is what 38.2% maps to in time.

Anchoring on the release dates of verified leaderboard entries:

  • Terminus 2 + Claude Opus 4.1 (released Aug 2025): 38.0%
  • Terminus 2 + GPT-5.1-Codex (Nov 2025): 36.9%
  • Claude Code + Sonnet 4.5 (Sep 2025): 40.1%
  • Codex CLI + GPT-5-Codex (Sep 2025): 44.3%

So today's best runnable-offline coding model lands roughly where the hosted frontier was in late 2025, about a 6–8 month lag. That's the first time the gap has been small enough to matter for real deployments (regulated environments, air-gapped setups, on-prem CI, batch workloads).
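The lag figure is simple date arithmetic against the release months listed above; a quick sketch (the "today" anchor is an assumption, taken from the blog post's date):

```python
from datetime import date

# Months between a frontier model's release and "today". Release months are
# from the list above; the anchor date is an assumption (the blog post date).
def months_between(a: date, b: date) -> int:
    return (b.year - a.year) * 12 + (b.month - a.month)

today = date(2026, 4, 24)  # assumed anchor
releases = {
    "Claude Opus 4.1": date(2025, 8, 1),
    "Sonnet 4.5 / GPT-5-Codex": date(2025, 9, 1),
    "GPT-5.1-Codex": date(2025, 11, 1),
}
for name, released in releases.items():
    print(f"{name}: {months_between(released, today)} months")
# -> 8, 7, and 5 months from the assumed anchor date
```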

https://preview.redd.it/ykkbj61o3uxg1.png?width=1284&format=png&auto=webp&s=8af000a5095c41a917bfc2c7098571a50dfd013d

More details on our blog: https://antigma.ai/blog/2026/04/24/offline-coding-models

submitted by /u/Exciting-Camera3226
