A week or two ago, an open-source project called ATLAS made the rounds for scoring 74.6% on LiveCodeBench with a frozen 9B model on a single consumer GPU, outperforming Claude Sonnet 4.5 (71.4%).
As I watched it make the rounds, a common response was that it was either designed around the benchmark or could never work in a real codebase, and I agreed.
Well, V3.0.1 just shipped, and it proved me completely wrong. The same verification pipeline that scored 74.6% now runs as a full coding assistant, and on a smaller 9B Qwen model instead of the 14B it used before.
The model emits structured tool calls: read, write, edit, delete, run commands, search files. For complex files, the V3 pipeline kicks in: it generates diverse implementation approaches, tests each candidate in a sandbox, scores them with a (now working) energy-based verifier, and writes the best one. If they all fail, it repairs them and retries.
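For the curious, here's a toy sketch of that generate / sandbox-test / score / repair loop. All names (`sandbox_passes`, `energy_score`, `best_candidate`) are my own illustrative stand-ins, not ATLAS's actual API, and the "energy" here is a placeholder heuristic rather than a learned verifier:

```python
import subprocess
import sys
import tempfile

def sandbox_passes(code: str, test: str) -> bool:
    """Run a candidate plus its test in a subprocess; pass iff exit code 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test + "\n")
        path = f.name
    return subprocess.run([sys.executable, path], capture_output=True).returncode == 0

def energy_score(code: str) -> float:
    """Stand-in for the energy-based verifier: lower energy = better.
    A real verifier would be a learned model; here we just prefer shorter code."""
    return float(len(code))

def best_candidate(candidates, test, repair, max_rounds=2):
    """Generate-test-verify loop: keep sandbox survivors, pick the
    lowest-energy one; if nothing passes, repair every candidate and retry."""
    for _ in range(max_rounds):
        passing = [c for c in candidates if sandbox_passes(c, test)]
        if passing:
            return min(passing, key=energy_score)
        candidates = [repair(c) for c in candidates]
    return None
```

The key design point, as I read the post, is that correctness is checked by actually executing code rather than trusting the model's first answer, with the verifier only ranking candidates that already pass.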
It builds multi-file projects across Python, Rust, Go, C, and Shell. The whole stack runs in Docker Compose, so anyone with an NVIDIA GPU can spin it up.
Still one GPU. Still no cloud. Still ~$0.004/task in electricity... but now marginally better at real-world coding.
ATLAS remains a stark reminder that it's not about whether small models are capable. It's about whether anyone would build the right infrastructure to prove it.
[link] [comments]