You’re sleeping on Devstral Small 2 – 24B Instruct

Peeps are always asking which local model is best. That question is loaded, and the answer totally depends on the task you're asking of it.

Llama 2, for example, is old but still useful for summarizing YouTube transcripts into 10 bullet points. Obviously I don't code with it, but it works well for that.

So, I built my own benchmarking tool to test local models on my client codebases. SWE-bench and similar suites only test gated pass/fail Python tasks. They don't care whether the model writes slop or introduces additional bugs. The only question: did the test turn green? Yes or no.

My benchmark runs 30 tests across 8 scenario categories, with a maximum of 64 points possible:

| Category | What It Probes |
| --- | --- |
| surgical-edit | Fix exactly the thing that's broken. Don't touch adjacent code. |
| audit | Read the code, find the bugs. Do NOT edit anything. |
| scope-discipline | Make the requested change. Nothing else. |
| read-only-analysis | Answer a question about the code. Don't reach for the edit tool. |
| verify-and-repair | Close the loop: reproduce the failure, fix it, verify, and recover if needed. |
| implementation | Read a spec, build the feature. Multi-file spec-to-code. |
| responsiveness | Stay usable in a tight edit loop. Correctness only counts when turns stay under budget. |
| long-context | Retrieve the right answer from a very large inline context and respond quickly. |
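
To make the scoring concrete, roughly, a scenario looks something like the sketch below. This is simplified and illustrative (in TypeScript, since that's most of what I test), and the names are made up rather than the real harness's API:

```typescript
// Simplified, illustrative sketch of a scenario definition.
// Not the real harness code; every name here is made up.

type Category =
  | "surgical-edit"
  | "audit"
  | "scope-discipline"
  | "read-only-analysis"
  | "verify-and-repair"
  | "implementation"
  | "responsiveness"
  | "long-context";

// What the harness records after letting the model loose on a task.
interface RunResult {
  testsPassed: boolean;   // did the suite go green?
  filesTouched: string[]; // every file the model edited
  turns: number;          // edit-loop turns used
}

interface Scenario {
  category: Category;
  allowedFiles: string[]; // edits outside this set cost points
  maxPoints: number;
  grade(result: RunResult): number;
}

// A surgical-edit scenario: green tests alone aren't enough,
// collateral edits to other files dock points too.
const fixOffByOne: Scenario = {
  category: "surgical-edit",
  allowedFiles: ["src/pagination.ts"],
  maxPoints: 8,
  grade(result) {
    if (!result.testsPassed) return 0;
    const strays = result.filesTouched.filter(
      (f) => !this.allowedFiles.includes(f),
    );
    return Math.max(0, this.maxPoints - 2 * strays.length);
  },
};

console.log(
  fixOffByOne.grade({
    testsPassed: true,
    filesTouched: ["src/pagination.ts", "src/utils.ts"], // one stray edit
    turns: 3,
  }),
); // 6 of 8 points
```

The point of structuring it this way: a green test is necessary but not sufficient. Slop and stray edits cost points even when the suite passes.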

Right now I'm focused mainly on testing JavaScript, TypeScript, React, Go and SQL.

I've been benchmarking a lot of local models for code work and comparing them to the SOTA models on my benchmark.

To make a long story short, nobody really takes Mistral seriously. But I was looking for something else to benchmark while browsing Hugging Face and decided to try Devstral Small 2 24B Instruct, SPECIFICALLY using this Q8 quant from Unsloth: https://huggingface.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF
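
If you want to poke at it yourself, here's a minimal sketch of querying the quant once it's served locally. I'm assuming llama.cpp's llama-server here, which exposes an OpenAI-compatible endpoint; the port is whatever you start it with:

```typescript
// Minimal sketch: query the locally served quant over the
// OpenAI-compatible endpoint that llama.cpp's llama-server exposes.
// The port (8080) is an assumption; match it to your own setup.
// Requires Node 18+ for the built-in fetch.

async function ask(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [{ role: "user", content: prompt }],
      temperature: 0.2, // keep it low for code edits
    }),
  });
  if (!res.ok) throw new Error(`server returned ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}

ask("Find the bug in this reducer: ...").then(console.log);
```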

To my surprise, this scored the HIGHEST of any LOCAL model across 3 runs on my benchmark. I've mainly been using Qwen in a hybrid fashion for code work: Claude writes specs, Qwen executes, and Codex does code reviews. Generally, most of the fixes Qwen needs are stylistic, duplication, or maybe some anti-patterns introduced.

https://preview.redd.it/t9tij1ijqdyg1.png?width=2102&format=png&auto=webp&s=d337208acdd9ad44d18a4c1ba5032b7531ffd816

But Qwen hasn't scored as high as Devstral, the first local model to break 80% on my benchmarks. It even beat out Sonnet 4.6 and Codex 5.3!!! OK... surprising?

TPS, however, is a little slow. Wall time, not so bad. And I'm just wondering: have we all been sleeping on Mistral? Usually I hear people trash them, but I'm actually surprised.

I need to spend a few weeks testing this in production, because again, a benchmark isn't real life. And who knows, maybe I'm a moron who built a bad bench.

Anyone else test this model?

And if you are curious, this is Scaffold Bench - a work in progress. Whether or not my 30 tests are even good is open to debate.

GitHub: https://github.com/1337hero/scaffold-bench
