TL;DR
- best setup I tested on a RTX 3090 24 GB:
ik_llama.cpp+Qwen3.6-27B-MTP-IQ4_KS.gguf 156kcontext,q8_0/q8_0KV, MTP, vision on CPU- benchmark result on a
~5.9kprompt +1koutput: about1261 tok/sprefill,72.9 tok/sdecode llama.cppwas a good start, BeeLlama worth testing, butik_llama.cppperformed the best
What was tested
- upstream
llama.cpp: easy baseline and a good place to start beellama.cpp: promising on paper, but I could not reproduce the expected speed on my setupik_llama.cpp: best decode/prefill, best VRAM fit
I also spent time with vLLM / club-3090, but I am leaving it out of the table because I did not finish a clean apples-to-apples run in this batch. We were seeing about 78 tok/s on responses, but the high-context OOM cliffs were too flaky, so I dropped it until that is fixed. I have not tested it recently, but the repo still flags the single-card long-context issue as unresolved.
The benchmark
One-shot chat-completion task:
- prompt size: about
5.9ktokens - output size:
1024tokens - task shape: a code-review / migration note over local setup files
So it mostly tests:
- prefill speed on a medium-large real prompt
- decode speed on a sustained
1k-token generation
So that is not best-case tok/s, but closer to reality.
The setup I kept
This is the profile I kept as my default:
- backend:
ikawrakow/ik_llama.cpp - current tested build:
4507 (c35189d8) - model:
ubergarm/Qwen3.6-27B-GGUF - direct model file:
Qwen3.6-27B-MTP-IQ4_KS.gguf
High-level launch shape:
--ctx-size 156000--cache-type-k q8_0--cache-type-v q8_0--flash-attn on--multi-token-prediction--draft-max 4--draft-p-min 0.0--merge-qkv--merge-up-gate-experts--cache-ram 32768--ctx-checkpoints 32--reasoning on--reasoning-format deepseek--chat-template-kwargs '{"preserve_thinking":true}'--no-mmproj-offload
Notes:
- built-in MTP in
ik_llama.cppworked better for me than the other speculative paths q8_0KV was good quality; you can opt intoq4, but there is plenty of VRAM headroom withIQ4_KS
Why IQ4_KS
- much smaller than Unsloth
UD-Q4_K_XL - quality stayed high enough that I did not feel a real penalty
- on a
24 GBcard, those saved GiB matter once you start pushing context and sane u-batch sizes - to be fair, there is probably room for a higher quant, maybe
q5; I have not tested that yet Qwen-3.6 quantsdiscussion #1663
TLDR:
Qwen 3.6quantizes very well inIQ4_KSikawrakowmeasuredIQ4_KSas very close to, or better than,UD_Q4_XL- Unsloth
UD-Q4_K_XLneeds about2.8 GiBmore to land in the same neighborhood
If you want the background on the quant family itself:
Vision
- projector on CPU by default:
--mmproj ...+--no-mmproj-offload - move it to GPU if you want faster image processing and are willing to spend roughly
1.5 GiBmore VRAM - if that OOMs, lower context or switch to
q4KV
GPU Stuff
This was on Linux with the desktop on the iGPU and the RTX 3090 used only for LLMs.
- power limit:
330 W - memory OC:
+600 - undervolt: flattened at about
1875 MHz @ 868 mV(LACTnow has a curve editor)
Some experiments did not make the default setup better
--spec-autotuneonik_llama.cpp: no meaningful gain on this workload--mtp-requantize-output-tensor q6_K: sometimes faster, but inconsistent and costs about1 GiBextra VRAM, so I did not keep it- BeeLlama DFlash precision quickstart: loaded fine, but was much slower here than expected
- upstream
llama.cppMTP paths: good baseline, but slower thanik_llama.cppin my tests
BeeLlama and vLLM are still worth exploring. I just did not land on a setup there that beat the ik_llama.cpp profile for my workload.
Results
These are the useful comparison points from the same real prompt / 1024-token output benchmark.
| Backend | Model / quant | Spec path | Context | KV cache | Prefill tok/s | Decode tok/s | Wall time | Notes |
|---|---|---|---|---|---|---|---|---|
ik_llama.cpp | Qwen3.6-27B-MTP-IQ4_KS | built-in MTP | 156k | q8_0/q8_0 | 1260.95 | 72.93 | 18.79s | best overall default profile |
llama.cpp upstream | Qwen3.6-27B-UD-Q4_K_XL | draft-mtp | 32k | q4_0/q4_0 | 1247.65 | 51.20 | 24.80s | easiest starting point |
llama.cpp upstream tuned | Qwen3.6-27B-UD-Q4_K_XL | draft-mtp | 32k | q8_0/q8_0 | 1242.81 | 56.66 | 22.88s | old-like flags helped, still slower |
beellama.cpp | Q5_K_S + DFlash Q4_K_M | DFlash | 122.8k | turbo4/turbo3_tcq | 1117.66 | 36.32 | 33.55s | text-only quickstart-style run |
Flags tested:
--spec-autotunedid not produce better results on this workload--mtp-requantize-output-tensor q6_Khad occasional upside, about+5 tok/sdecode in the best run, but it was not stable enough to justify the extra~1 GiBVRAM
Flag comparison
These are the high-level config differences that mattered most.
| Backend | Quant(s) | Draft / spec mode | Key draft params | KV cache | Other notable flags |
|---|---|---|---|---|---|
ik_llama.cpp | target IQ4_KS MTP | built-in --multi-token-prediction | --draft-max 4, --draft-p-min 0.0 | q8_0/q8_0 | --merge-qkv, --merge-up-gate-experts, --ctx-checkpoints 32, CPU mmproj |
llama.cpp upstream | target UD-Q4_K_XL | draft-mtp | --spec-draft-n-max 6, --spec-draft-p-min 0.75 | q4_0/q4_0 default, q8_0/q8_0 tuned | --flash-attn on, --jinja |
beellama.cpp | target Q5_K_S, draft Q4_K_M | dflash | --spec-dflash-cross-ctx 1024 | turbo4/turbo3_tcq | --kv-unified, -b 2048, -ub 256, text-only in my run |
Links
ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cppExLlamaV3: https://github.com/turboderp-org/exllamav3- BeeLlama: https://github.com/Anbeeld/beellama.cpp
- BeeLlama Qwen 3.6 quickstart: https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-qwen36-dflash.md
club-3090: https://github.com/noonghunna/club-3090IQ4_KSwith MTP: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/blob/main/Qwen3.6-27B-MTP-IQ4_KS.ggufQwen-3.6 quantsdiscussion: https://github.com/ikawrakow/ik_llama.cpp/discussions/1663IQ4_KSquant family discussion: https://github.com/ikawrakow/ik_llama.cpp/discussions/8
This is the best 24 GB setup I found so far, but things are moving fast and I do not think this is settled yet.
The point of this thread is to compare real single-3090 / 24 GB results: backend choice, quants, flags, and what stays stable under actual use.
I would like this to become a useful reference thread for 24 GB cards: what works, what breaks, and what is actually worth running day to day. I have not tested ExLlamaV3 yet, and there may be other setups that are better.
Also, thanks to everyone building this stuff: backend authors, quant makers, template tinkerers, and the people doing the boring debugging work that makes local LLMs usable.
[link] [comments]