Qwen 3.6 27B on 24 GB VRAM: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vLLM)

TL;DR

best setup I tested on an RTX 3090 24 GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf
156k context, q8_0/q8_0 KV, MTP, vision on CPU
benchmark result on a ~5.9k prompt + 1k output: about 1261 tok/s prefill, 72.9 tok/s decode
llama.cpp was a good start and BeeLlama is worth testing, but ik_llama.cpp performed best

What was tested

upstream llama.cpp: easy baseline and a good place to start
beellama.cpp: promising on paper, but I could not reproduce the expected speed on my setup
ik_llama.cpp: best decode/prefill, best VRAM fit

I also spent time with vLLM / club-3090, but I am leaving it out of the table because I did not finish a clean apples-to-apples run in this batch. I was seeing about 78 tok/s on responses, but the OOM cliffs at high context were too flaky, so I dropped it until that is fixed. I have not retested it recently, but the repo still flags the single-card long-context issue as unresolved.

The benchmark

One-shot chat-completion task:

prompt size: about 5.9k tokens
output size: 1024 tokens
task shape: a code-review / migration note over local setup files

So it mostly tests:

prefill speed on a medium-large real prompt
decode speed on a sustained 1k-token generation

So these are not best-case tok/s numbers, but they are closer to real use.
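
As a rough sanity check on what this benchmark measures, a one-shot run can be modeled as two phases: prefill time plus decode time. This is a minimal sketch under that assumption; it ignores HTTP, tokenization and sampling overhead, and the function name is only illustrative. The inputs are the headline ik_llama.cpp numbers from this post.

```python
def estimate_wall_time(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Two-phase estimate: returns (prefill, decode, total) in seconds."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s, prefill_s + decode_s


# Headline numbers from this post; the prompt size is approximate ("about 5.9k").
prefill_s, decode_s, total_s = estimate_wall_time(5900, 1024, 1260.95, 72.93)
print(f"prefill ~{prefill_s:.1f}s, decode ~{decode_s:.1f}s, total ~{total_s:.1f}s")
```

The estimate comes out a little under the measured 18.79 s wall time, and decode accounts for roughly three quarters of it, so decode speed is what dominates this workload.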

The setup I kept

This is the profile I kept as my default:

backend: ikawrakow/ik_llama.cpp
current tested build: 4507 (c35189d8)
model: ubergarm/Qwen3.6-27B-GGUF
direct model file: Qwen3.6-27B-MTP-IQ4_KS.gguf

High-level launch shape:

--ctx-size 156000
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
--multi-token-prediction
--draft-max 4
--draft-p-min 0.0
--merge-qkv
--merge-up-gate-experts
--cache-ram 32768
--ctx-checkpoints 32
--reasoning on
--reasoning-format deepseek
--chat-template-kwargs '{"preserve_thinking":true}'
--no-mmproj-offload
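
To keep a profile like this reproducible, the flags can live in one place and be turned into a launch command. The sketch below assumes the server binary is called `llama-server` and uses hypothetical paths for the binary and the model file; only the flags themselves come from the list above. The `--mmproj` argument is left out because the projector path depends on your download.

```python
import shlex

# Hypothetical paths: adjust to your own build and download locations.
SERVER_BIN = "./build/bin/llama-server"
MODEL_PATH = "models/Qwen3.6-27B-MTP-IQ4_KS.gguf"

# Flags as listed in this post; None marks a flag that takes no value.
FLAGS = [
    ("--ctx-size", "156000"),
    ("--cache-type-k", "q8_0"),
    ("--cache-type-v", "q8_0"),
    ("--flash-attn", "on"),
    ("--multi-token-prediction", None),
    ("--draft-max", "4"),
    ("--draft-p-min", "0.0"),
    ("--merge-qkv", None),
    ("--merge-up-gate-experts", None),
    ("--cache-ram", "32768"),
    ("--ctx-checkpoints", "32"),
    ("--reasoning", "on"),
    ("--reasoning-format", "deepseek"),
    ("--chat-template-kwargs", '{"preserve_thinking":true}'),
    ("--no-mmproj-offload", None),
]


def build_command(server_bin, model_path, flags):
    """Flatten (flag, value) pairs into an argv list."""
    argv = [server_bin, "--model", model_path]
    for name, value in flags:
        argv.append(name)
        if value is not None:
            argv.append(value)
    return argv


# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(build_command(SERVER_BIN, MODEL_PATH, FLAGS)))
```

Printing the command instead of running it makes it easy to diff two profiles before launching anything.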

Notes:

built-in MTP in ik_llama.cpp worked better for me than the other speculative paths
q8_0 KV gave good quality; q4 KV is an option, but IQ4_KS leaves enough VRAM headroom that I did not need it
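
For a sense of what the q4 option would buy, the per-element cost of each cache type follows from the GGML block layouts: q8_0 stores 32 values in 34 bytes (32 int8 values plus an fp16 scale) and q4_0 stores 32 values in 18 bytes. The sketch below assumes "q4" means q4_0 and only compares relative sizes; the absolute KV size depends on the model's attention layout, which is not modeled here.

```python
# Bytes per 32-element block for common llama.cpp KV cache types.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}
BLOCK_SIZE = 32


def bits_per_element(cache_type):
    """Average storage cost of one cached value, in bits."""
    return BLOCK_BYTES[cache_type] * 8 / BLOCK_SIZE


def relative_size(cache_type, baseline="q8_0"):
    """Cache size relative to the baseline type at the same context length."""
    return BLOCK_BYTES[cache_type] / BLOCK_BYTES[baseline]


for name in ("f16", "q8_0", "q4_0"):
    print(f"{name}: {bits_per_element(name):.1f} bits/element, "
          f"{relative_size(name):.2f}x the size of q8_0")
```

So moving both K and V from q8_0 to q4_0 would cut KV cache memory by a bit under half at the same context length.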

Why `IQ4_KS`

much smaller than Unsloth UD-Q4_K_XL
quality stayed high enough that I did not feel a real penalty
on a 24 GB card, those saved GiB matter once you start pushing context and sane u-batch sizes
to be fair, there is probably room for a higher quant, maybe q5; I have not tested that yet
source for the comparison: Qwen-3.6 quants discussion #1663 in the ik_llama.cpp repo

TL;DR of that discussion:

Qwen 3.6 quantizes very well in IQ4_KS
ikawrakow measured IQ4_KS as very close to, or better than, UD-Q4_K_XL
Unsloth UD-Q4_K_XL needs about 2.8 GiB more to land in the same neighborhood

If you want the background on the quant family itself:

New quantization types IQ2_K, IQ3_K, IQ4_K, IQ5_K discussion #8

Vision

projector on CPU by default: --mmproj ... + --no-mmproj-offload
move it to GPU if you want faster image processing and are willing to spend roughly 1.5 GiB more VRAM
if that OOMs, lower context or switch to q4 KV

GPU stuff

This was on Linux with the desktop on the iGPU and the RTX 3090 used only for LLMs.

power limit: 330 W
memory OC: +600
undervolt: flattened at about 1875 MHz @ 868 mV (LACT now has a curve editor)

Some experiments did not make the default setup better

--spec-autotune on ik_llama.cpp: no meaningful gain on this workload
--mtp-requantize-output-tensor q6_K: sometimes faster, but inconsistent and costs about 1 GiB extra VRAM, so I did not keep it
BeeLlama DFlash precision quickstart: loaded fine, but was much slower here than expected
upstream llama.cpp MTP paths: good baseline, but slower than ik_llama.cpp in my tests

BeeLlama and vLLM are still worth exploring. I just did not land on a setup there that beat the ik_llama.cpp profile for my workload.

Results

These are the useful comparison points from the same real prompt / 1024-token output benchmark.

| Backend | Model / quant | Spec path | Context | KV cache | Prefill tok/s | Decode tok/s | Wall time | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `ik_llama.cpp` | `Qwen3.6-27B-MTP-IQ4_KS` | built-in MTP | `156k` | `q8_0/q8_0` | `1260.95` | `72.93` | `18.79s` | best overall default profile |
| `llama.cpp` upstream | `Qwen3.6-27B-UD-Q4_K_XL` | `draft-mtp` | `32k` | `q4_0/q4_0` | `1247.65` | `51.20` | `24.80s` | easiest starting point |
| `llama.cpp` upstream tuned | `Qwen3.6-27B-UD-Q4_K_XL` | `draft-mtp` | `32k` | `q8_0/q8_0` | `1242.81` | `56.66` | `22.88s` | old-like flags helped, still slower |
| `beellama.cpp` | `Q5_K_S` + DFlash `Q4_K_M` | DFlash | `122.8k` | `turbo4/turbo3_tcq` | `1117.66` | `36.32` | `33.55s` | text-only quickstart-style run |
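
To read the table as relative numbers, each row can be expressed against the fastest decode result. This is a small sketch using the decode and wall-time columns above; the rows differ in quant, context size and KV settings, so the ratios describe whole profiles, not backends in isolation.

```python
# (profile, decode tok/s, wall time in seconds) copied from the results table.
RESULTS = [
    ("ik_llama.cpp", 72.93, 18.79),
    ("llama.cpp upstream", 51.20, 24.80),
    ("llama.cpp upstream tuned", 56.66, 22.88),
    ("beellama.cpp", 36.32, 33.55),
]


def compare_to_best(results):
    """Return the fastest-decoding profile and per-row ratios against it."""
    best_name, best_decode, best_wall = max(results, key=lambda r: r[1])
    rows = [
        (name, decode / best_decode, wall / best_wall)
        for name, decode, wall in results
    ]
    return best_name, rows


best, rows = compare_to_best(RESULTS)
for name, decode_fraction, wall_ratio in rows:
    print(f"{name}: decode at {decode_fraction:.0%} of {best}, "
          f"wall time {wall_ratio:.2f}x")
```

On these numbers, upstream llama.cpp decodes at roughly 70 to 78 percent of the ik_llama.cpp rate, and the BeeLlama run at about half.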

Flags tested:

--spec-autotune did not produce better results on this workload
--mtp-requantize-output-tensor q6_K had occasional upside, about +5 tok/s decode in the best run, but it was not stable enough to justify the extra ~1 GiB VRAM

Flag comparison

These are the high-level config differences that mattered most.

| Backend | Quant(s) | Draft / spec mode | Key draft params | KV cache | Other notable flags |
| --- | --- | --- | --- | --- | --- |
| `ik_llama.cpp` | target `IQ4_KS` MTP | built-in `--multi-token-prediction` | `--draft-max 4`, `--draft-p-min 0.0` | `q8_0/q8_0` | `--merge-qkv`, `--merge-up-gate-experts`, `--ctx-checkpoints 32`, CPU `mmproj` |
| `llama.cpp` upstream | target `UD-Q4_K_XL` | `draft-mtp` | `--spec-draft-n-max 6`, `--spec-draft-p-min 0.75` | `q4_0/q4_0` default, `q8_0/q8_0` tuned | `--flash-attn on`, `--jinja` |
| `beellama.cpp` | target `Q5_K_S`, draft `Q4_K_M` | `dflash` | `--spec-dflash-cross-ctx 1024` | `turbo4/turbo3_tcq` | `--kv-unified`, `-b 2048`, `-ub 256`, text-only in my run |

Links

ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp
ExLlamaV3: https://github.com/turboderp-org/exllamav3
BeeLlama: https://github.com/Anbeeld/beellama.cpp
BeeLlama Qwen 3.6 quickstart: https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-qwen36-dflash.md
club-3090: https://github.com/noonghunna/club-3090
IQ4_KS with MTP: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/blob/main/Qwen3.6-27B-MTP-IQ4_KS.gguf
Qwen-3.6 quants discussion: https://github.com/ikawrakow/ik_llama.cpp/discussions/1663
IQ4_KS quant family discussion: https://github.com/ikawrakow/ik_llama.cpp/discussions/8

This is the best 24 GB setup I found so far, but things are moving fast and I do not think this is settled yet.

The point of this thread is to compare real single-3090 / 24 GB results: backend choice, quants, flags, and what stays stable under actual use.

I would like this to become a useful reference thread for 24 GB cards: what works, what breaks, and what is actually worth running day to day. I have not tested ExLlamaV3 yet, and there may be other setups that are better.

Also, thanks to everyone building this stuff: backend authors, quant makers, template tinkerers, and the people doing the boring debugging work that makes local LLMs usable.

submitted by /u/VolandBerlioz