Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec with 80%+ draft acceptance rate on the benchmark found here: https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py
This is on an RTX 4070 Super, so results with other cards might vary.
To run llama.cpp with MTP support, you need to build it from source and apply a draft PR that hasn't yet been merged into the master branch. You can find a very nice guide on how to do that here, and also download the Qwen3.6 MTP GGUF: https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF
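For reference, checking out an unmerged PR and building with CUDA generally looks something like the sketch below. The PR number is a placeholder (use the one from the guide above), and the build flags assume a standard CUDA setup:

```shell
# Clone llama.cpp and fetch the unmerged MTP draft PR into a local branch.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# <PR_NUMBER> is a placeholder -- substitute the MTP PR number from the guide.
git fetch origin pull/<PR_NUMBER>/head:mtp
git checkout mtp

# Configure and build with CUDA support.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```

The resulting `llama-server` binary lands under `build/bin/`.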
llama.cpp command:
```shell
llama-server \
  -m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
  -fitt 1536 \
  -c 131072 \
  -n 32768 \
  -fa on \
  -np 1 \
  -ctk q8_0 \
  -ctv q8_0 \
  -ctkd q8_0 \
  -ctvd q8_0 \
  -ctxcp 64 \
  --no-mmap \
  --mlock \
  --no-warmup \
  --spec-type mtp \
  --spec-draft-n-max 2 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0
```

The most important parameter here is `-fitt 1536`. Since part of the model is offloaded to the CPU because of its size, this tells llama.cpp to properly balance the load across your GPU and CPU for the best possible performance, while leaving 1536 MB of free VRAM for the MTP draft model and KV cache. Since I'm running my dGPU as a secondary GPU (monitor plugged into the iGPU), I can use the full 12 GB of VRAM for inference. 1536 might be too small if your dGPU is also your primary GPU.
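Once the server is up, you can talk to it over its OpenAI-compatible `/v1/chat/completions` endpoint. A minimal sketch (stdlib only, no extra dependencies) that mirrors the sampler settings from the command above — the model name, prompt, and port are just illustrative assumptions:

```python
import json
import urllib.request

# Sampler settings mirroring the server flags above. llama-server reads
# extra fields like top_k and min_p from the request body; values sent
# per-request override the server-side defaults.
payload = {
    "model": "Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL",  # illustrative name
    "messages": [{"role": "user", "content": "Write a haiku about VRAM."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "max_tokens": 512,
}

def query(url="http://localhost:8080/v1/chat/completions"):
    """Send the request to a running llama-server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

This is handy for quick sanity checks that MTP drafting hasn't changed output quality at the recommended Qwen3 sampler values.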
Benchmark results:
```
mtp-bench.py
code_python       pred= 192  draft= 132  acc= 125  rate=0.947  tok/s=80.8
code_cpp          pred=  58  draft=  40  acc=  37  rate=0.925  tok/s=81.8
explain_concept   pred= 192  draft= 152  acc= 114  rate=0.750  tok/s=70.0
summarize         pred=  53  draft=  40  acc=  32  rate=0.800  tok/s=75.4
qa_factual        pred= 192  draft= 144  acc= 119  rate=0.826  tok/s=77.8
translation       pred=  22  draft=  16  acc=  13  rate=0.812  tok/s=81.9
creative_short    pred= 192  draft= 160  acc= 111  rate=0.694  tok/s=69.2
stepwise_math     pred= 192  draft= 144  acc= 119  rate=0.826  tok/s=76.5
long_code_review  pred= 192  draft= 148  acc= 117  rate=0.790  tok/s=73.2
```

If you have any questions, feel free to ask :)
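Reading the numbers: `rate` appears to be the per-task draft acceptance rate `acc / draft` (e.g. 125/132 ≈ 0.947 for `code_python`). If you want an overall figure across tasks, a small sketch that parses the benchmark's output format (shown with the first three result lines as sample input) could look like this:

```python
import re

# Sample lines in the format printed by mtp-bench.py (full results above).
RESULTS = """\
code_python pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=80.8
code_cpp pred= 58 draft= 40 acc= 37 rate=0.925 tok/s=81.8
explain_concept pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=70.0
"""

LINE_RE = re.compile(
    r"(?P<task>\w+)\s+pred=\s*(?P<pred>\d+)\s+draft=\s*(?P<draft>\d+)"
    r"\s+acc=\s*(?P<acc>\d+)\s+rate=(?P<rate>[\d.]+)\s+tok/s=(?P<tps>[\d.]+)"
)

def summarize(text):
    """Return (overall acceptance rate, mean tok/s) across all parsed tasks.

    The overall rate pools accepted and drafted tokens across tasks
    rather than averaging the per-task rates, so long tasks weigh more.
    """
    rows = [m.groupdict() for m in map(LINE_RE.match, text.splitlines()) if m]
    total_acc = sum(int(r["acc"]) for r in rows)
    total_draft = sum(int(r["draft"]) for r in rows)
    mean_tps = sum(float(r["tps"]) for r in rows) / len(rows)
    return total_acc / total_draft, mean_tps
```

On the three sample lines this pools 276 accepted out of 324 drafted tokens, about 0.85 overall.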
Cheers.