Qwen 3.6 27B on 24 GB VRAM: backend comparison, quant choice, and settings (llama.cpp, ik_llama.cpp, BeeLlama, vLLM)

TL;DR

  • best setup I tested on an RTX 3090 24 GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf
  • 156k context, q8_0/q8_0 KV, MTP, vision on CPU
  • benchmark result on a ~5.9k-token prompt + 1k-token output: about 1261 tok/s prefill, 72.9 tok/s decode
  • llama.cpp was a good start and BeeLlama is worth testing, but ik_llama.cpp performed best

What was tested

  • upstream llama.cpp: easy baseline and a good place to start
  • beellama.cpp: promising on paper, but I could not reproduce the expected speed on my setup
  • ik_llama.cpp: best decode/prefill, best VRAM fit

I also spent time with vLLM / club-3090, but I am leaving it out of the table because I did not finish a clean apples-to-apples run in this batch. I was seeing about 78 tok/s on responses, but it hit OOM cliffs at high context too unpredictably, so I dropped it until that is fixed. I have not retested it recently, but the repo still flags the single-card long-context issue as unresolved.

The benchmark

One-shot chat-completion task:

  • prompt size: about 5.9k tokens
  • output size: 1024 tokens
  • task shape: a code-review / migration note over local setup files

So it mostly tests:

  • prefill speed on a medium-large real prompt
  • decode speed on a sustained 1k-token generation

This is not a best-case tok/s number, but it is closer to real use.
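To make the two phases concrete, here is a minimal sketch of how prefill and decode rates combine into wall time for a one-shot request. It assumes wall time is just prompt processing plus generation and ignores HTTP and sampling overhead; the function name is mine, and the inputs are the approximate TL;DR figures.

```python
def estimate_wall_time(prompt_tokens: int, output_tokens: int,
                       prefill_tps: float, decode_tps: float) -> float:
    """Rough wall time in seconds: prefill phase plus decode phase."""
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughput must be positive")
    prefill_s = prompt_tokens / prefill_tps   # time to ingest the prompt
    decode_s = output_tokens / decode_tps     # time to generate the reply
    return prefill_s + decode_s


# Approximate figures from the TL;DR: ~5.9k prompt tokens, 1024 output tokens.
seconds = estimate_wall_time(5900, 1024, prefill_tps=1261.0, decode_tps=72.9)
print(f"estimated wall time: {seconds:.1f} s")
```

With these inputs the estimate comes out at roughly 18.7 s, of which about 14 s is decode, so for this workload the decode rate matters far more than the prefill rate.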

The setup I kept

This is the profile I kept as my default. High-level launch shape:

  • --ctx-size 156000
  • --cache-type-k q8_0
  • --cache-type-v q8_0
  • --flash-attn on
  • --multi-token-prediction
  • --draft-max 4
  • --draft-p-min 0.0
  • --merge-qkv
  • --merge-up-gate-experts
  • --cache-ram 32768
  • --ctx-checkpoints 32
  • --reasoning on
  • --reasoning-format deepseek
  • --chat-template-kwargs '{"preserve_thinking":true}'
  • --no-mmproj-offload
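For anyone scripting the launch, the flag list above can be assembled into a command line like this. It is only a sketch: the binary name `llama-server`, the `--model` flag, and the model path are my assumptions about a typical ik_llama.cpp install, not part of the list above, and the script only prints the command rather than running it.

```python
import json
import shlex

# Assumptions (not from the flag list): binary name, --model flag, model location.
SERVER_BINARY = "llama-server"
MODEL_PATH = "Qwen3.6-27B-MTP-IQ4_KS.gguf"


def build_launch_args(model_path: str = MODEL_PATH, ctx_size: int = 156000) -> list:
    """Return the argv list for the default profile described in the post."""
    template_kwargs = json.dumps({"preserve_thinking": True}, separators=(",", ":"))
    return [
        SERVER_BINARY,
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--cache-type-k", "q8_0",
        "--cache-type-v", "q8_0",
        "--flash-attn", "on",
        "--multi-token-prediction",
        "--draft-max", "4",
        "--draft-p-min", "0.0",
        "--merge-qkv",
        "--merge-up-gate-experts",
        "--cache-ram", "32768",
        "--ctx-checkpoints", "32",
        "--reasoning", "on",
        "--reasoning-format", "deepseek",
        "--chat-template-kwargs", template_kwargs,
        "--no-mmproj-offload",
    ]


args = build_launch_args()
print(shlex.join(args))  # shell-quoted, ready to paste into a terminal
```

Keeping the flags in a list rather than one long string makes it easier to swap a single value, for example a smaller `ctx_size` when testing the projector on GPU.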

Notes:

  • built-in MTP in ik_llama.cpp worked better for me than the other speculative paths
  • q8_0 KV gave good quality; q4 KV is an option, but IQ4_KS leaves plenty of VRAM headroom, so it is not required

Why IQ4_KS

  • much smaller than Unsloth UD-Q4_K_XL
  • quality stayed high enough that I did not feel a real penalty
  • on a 24 GB card, those saved GiB matter once you start pushing context and sane u-batch sizes
  • to be fair, there is probably room for a higher quant, maybe q5; I have not tested that yet
  • reference: Qwen-3.6 quants discussion #1663

TL;DR of that discussion:

  • Qwen 3.6 quantizes very well in IQ4_KS
  • ikawrakow measured IQ4_KS as very close to, or better than, UD-Q4_K_XL
  • Unsloth UD-Q4_K_XL needs about 2.8 GiB more to land in the same neighborhood

If you want the background on the quant family itself:

Vision

  • projector on CPU by default: --mmproj ... + --no-mmproj-offload
  • move it to GPU if you want faster image processing and are willing to spend roughly 1.5 GiB more VRAM
  • if that OOMs, lower context or switch to q4 KV
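To get a feel for how much the "lower context or switch to q4 KV" options buy back, here is a rough KV-cache size estimator. The block sizes for q8_0 and q4_0 follow ggml's layouts (34 and 18 bytes per 32 values). The architecture numbers are HYPOTHETICAL placeholders, not the real Qwen 3.6 27B config; read the real values from the GGUF metadata, and note that only layers that actually keep a KV cache should be counted.

```python
# Bytes per 32 values. q8_0: fp16 scale + 32 int8. q4_0: fp16 scale + 16 packed bytes.
# f16 is unquantized (2 bytes per value), expressed per 32 values for uniformity.
BYTES_PER_32 = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_gib(n_ctx: int, n_attn_layers: int, n_kv_heads: int, head_dim: int,
                 k_type: str = "q8_0", v_type: str = "q8_0") -> float:
    """Approximate KV cache size in GiB; ignores compute buffers and padding."""
    values_per_side = n_ctx * n_attn_layers * n_kv_heads * head_dim
    k_bytes = values_per_side * BYTES_PER_32[k_type] / 32
    v_bytes = values_per_side * BYTES_PER_32[v_type] / 32
    return (k_bytes + v_bytes) / 2**30


# HYPOTHETICAL architecture values, for illustration only.
HYPO = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)

q8 = kv_cache_gib(156000, **HYPO)
q4 = kv_cache_gib(156000, k_type="q4_0", v_type="q4_0", **HYPO)
print(f"q8_0/q8_0: {q8:.2f} GiB, q4_0/q4_0: {q4:.2f} GiB, saved: {q8 - q4:.2f} GiB")
```

Whatever the real architecture is, the ratio holds: q4_0 KV takes 18/34 of the q8_0 footprint, so switching frees a bit under half of the cache, and the size scales linearly with context.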

GPU Stuff

This was on Linux with the desktop on the iGPU and the RTX 3090 used only for LLMs.

  • power limit: 330 W
  • memory OC: +600
  • undervolt: flattened at about 1875 MHz @ 868 mV (LACT now has a curve editor)

Some experiments did not make the default setup better

  • --spec-autotune on ik_llama.cpp: no meaningful gain on this workload
  • --mtp-requantize-output-tensor q6_K: sometimes faster, but inconsistent and costs about 1 GiB extra VRAM, so I did not keep it
  • BeeLlama DFlash precision quickstart: loaded fine, but was much slower here than expected
  • upstream llama.cpp MTP paths: good baseline, but slower than ik_llama.cpp in my tests

BeeLlama and vLLM are still worth exploring. I just did not land on a setup there that beat the ik_llama.cpp profile for my workload.

Results

These are the useful comparison points from the same real prompt / 1024-token output benchmark.

| Backend | Model / quant | Spec path | Context | KV cache | Prefill tok/s | Decode tok/s | Wall time | Notes |
|---|---|---|---|---|---|---|---|---|
| ik_llama.cpp | Qwen3.6-27B-MTP-IQ4_KS | built-in MTP | 156k | q8_0/q8_0 | 1260.95 | 72.93 | 18.79s | best overall default profile |
| llama.cpp upstream | Qwen3.6-27B-UD-Q4_K_XL | draft-mtp | 32k | q4_0/q4_0 | 1247.65 | 51.20 | 24.80s | easiest starting point |
| llama.cpp upstream tuned | Qwen3.6-27B-UD-Q4_K_XL | draft-mtp | 32k | q8_0/q8_0 | 1242.81 | 56.66 | 22.88s | old-like flags helped, still slower |
| beellama.cpp | Q5_K_S + DFlash Q4_K_M | DFlash | 122.8k | turbo4/turbo3_tcq | 1117.66 | 36.32 | 33.55s | text-only quickstart-style run |
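A quick way to read the table is to normalize decode speed against the plain upstream llama.cpp run. The sketch below does that with the numbers copied from the results table; picking upstream as the baseline is my choice, and since the rows differ in quant, context size, and KV type, the ratios describe whole profiles rather than the backends alone.

```python
# (backend, prefill tok/s, decode tok/s, wall seconds), copied from the results table.
RESULTS = [
    ("ik_llama.cpp", 1260.95, 72.93, 18.79),
    ("llama.cpp upstream", 1247.65, 51.20, 24.80),
    ("llama.cpp upstream tuned", 1242.81, 56.66, 22.88),
    ("beellama.cpp", 1117.66, 36.32, 33.55),
]
BASELINE = "llama.cpp upstream"


def decode_speedups(results, baseline):
    """Map each backend to its decode tok/s divided by the baseline's decode tok/s."""
    base_decode = next(decode for name, _, decode, _ in results if name == baseline)
    return {name: decode / base_decode for name, _, decode, _ in results}


speedups = decode_speedups(RESULTS, BASELINE)
for name, ratio in speedups.items():
    print(f"{name:26s} {ratio:.2f}x decode vs {BASELINE}")
```

On these numbers the ik_llama.cpp profile decodes about 1.42x as fast as the upstream baseline, while prefill differs by only about 1% between the two.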

Flags tested:

  • --spec-autotune did not produce better results on this workload
  • --mtp-requantize-output-tensor q6_K had occasional upside, about +5 tok/s decode in the best run, but it was not stable enough to justify the extra ~1 GiB VRAM

Flag comparison

These are the high-level config differences that mattered most.

| Backend | Quant(s) | Draft / spec mode | Key draft params | KV cache | Other notable flags |
|---|---|---|---|---|---|
| ik_llama.cpp | target IQ4_KS | MTP built-in --multi-token-prediction | --draft-max 4, --draft-p-min 0.0 | q8_0/q8_0 | --merge-qkv, --merge-up-gate-experts, --ctx-checkpoints 32, CPU mmproj |
| llama.cpp upstream | target UD-Q4_K_XL | draft-mtp | --spec-draft-n-max 6, --spec-draft-p-min 0.75 | q4_0/q4_0 default, q8_0/q8_0 tuned | --flash-attn on, --jinja |
| beellama.cpp | target Q5_K_S, draft Q4_K_M | dflash | --spec-dflash-cross-ctx 1024 | turbo4/turbo3_tcq | --kv-unified, -b 2048, -ub 256, text-only in my run |

This is the best 24 GB setup I found so far, but things are moving fast and I do not think this is settled yet.

The point of this thread is to compare real single-3090 / 24 GB results: backend choice, quants, flags, and what stays stable under actual use.

I would like this to become a useful reference thread for 24 GB cards: what works, what breaks, and what is actually worth running day to day. I have not tested ExLlamaV3 yet, and there may be other setups that are better.

Also, thanks to everyone building this stuff: backend authors, quant makers, template tinkerers, and the people doing the boring debugging work that makes local LLMs usable.

submitted by /u/VolandBerlioz