Quick specs: this is a workstation that morphed into something LocalLLaMA-friendly over time:
Ryzen 9 3950X
96GB DDR4 (dual channel, running at 3000 MHz)
W6800 + RX 6800 (48GB of VRAM at ~512GB/s)
most tests done with ~20k context; KV cache at q8_0
llama.cpp main branch with ROCm
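For anyone wanting to reproduce the setup, here's a sketch of the kind of llama.cpp invocation that matches the settings above. The model path, `-ngl` value, and tensor split are placeholders/assumptions, not my actual command:

```shell
# Hypothetical llama-server launch matching the setup above (ROCm build);
# model path and split values are placeholders.
./llama-server \
  -m /models/model-UD-IQ2_M.gguf \
  -c 20480 \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99 \
  --tensor-split 24,24
# -c 20480        : ~20k context
# -ctk/-ctv q8_0  : quantize the KV cache to q8_0
# -ngl 99         : offload as many layers as fit in the 48GB of VRAM
# --tensor-split  : balance layers across the W6800 + RX 6800
```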
The model used was the UD_IQ2_M weights from Unsloth, which is ~122GB on disk. I have not had success with Q2 levels of quantization since Qwen3-235B, so I assumed this test would be a throwaway like all of my recent tests, but it turns out it's REALLY good and somewhat usable.
Performance: after allowing it to warm up (2-3 minutes of token gen), I'm getting:
~11 tokens/second token generation
~43 tokens/second prompt processing for shorter prompts (I did not record PP speeds on very long agentic workflows to see what caching benefits might look like)
That prompt-processing speed is a bit under the bar for interactive coding sessions, but for the 24/7 agent loops I run, it can get a lot done.
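A quick back-of-envelope check on what those numbers mean per request. This is just arithmetic on the ~43 tok/s and ~11 tok/s figures above; the cache-hit scenario is an assumption, since I didn't measure cached PP speeds:

```python
# Back-of-envelope latency for one request at the measured speeds.
PP_TOK_S = 43.0   # prompt processing, tokens/second (measured, short prompts)
GEN_TOK_S = 11.0  # token generation, tokens/second (measured)

def turn_seconds(prompt_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Seconds for one request; cached_tokens are prompt tokens already in the KV cache."""
    return (prompt_tokens - cached_tokens) / PP_TOK_S + output_tokens / GEN_TOK_S

# A cold 20k-token prompt plus a 1k-token reply:
cold = turn_seconds(20_000, 1_000)           # ~556 s, a bit over 9 minutes
# Same turn if all but the last 2k prompt tokens hit the KV cache (hypothetical):
warm = turn_seconds(20_000, 1_000, 18_000)   # ~137 s
print(f"cold: {cold/60:.1f} min, warm: {warm/60:.1f} min")
```

That ~9-minute cold-start per 20k-token turn is why this works for unattended agent loops but feels sluggish interactively.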
Output quality: it codes incredibly well and is beating Qwen3.5 27B (full), Qwen3.5 122B (Q4), MiniMax M2.5 (Q4), GPT-OSS-120B (full), and Gemma 4 31B (full) in coding and knowledge tasks (I keep a long set of trivia questions that can have different levels of correctness). I can catch hallucinations in the reasoning output (I don't think any Q2 is immune to this), but it quickly steers itself back on course. I had some fun using it without a reasoning budget as well, but then it cannot correct any hallucinations, so I wouldn't advise using it without reasoning tokens.
The point of this post: basically everything at Q2 and under has been unusable for me for the last several months. I wanted to point a few people towards Qwen3.5-397B and recommend giving it a chance. It's suddenly the strongest model my system can run, and it might be good for you too.