Qwen3.6-27B MTP depth benchmark — RTX 3090Ti

Hardware: RTX 3090Ti, 64GB RAM

Model: unsloth/Qwen3.6-27B-UD-Q4_K_XL (MTP version)

Prompt: "make a flappy bird in html". Fresh chat each run; all numbers pulled directly from the llama.cpp output stats (open-webui).

Launch args

@echo off
cd "C:\Llama CPP"
set GGML_CUDA_NO_PEER_COPY=1

llama-server.exe ^
--host 0.0.0.0 ^
--port 8082 ^
-ngl 99 ^
--threads 2 ^
-b 2048 ^
-ub 512 ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
-fa on ^
-c 32768 ^
--mlock ^
--cont-batching ^
--spec-type draft-mtp ^
--spec-draft-n-max 1 ^
--reasoning off ^
--no-mmproj ^
--models-max 1 ^
--sleep-idle-seconds 480
pause
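To repeat the run at several depths without hand-editing the batch file, the argument list can be generated per depth. This is only a sketch in Python: it assumes that the depth in the results corresponds to the value passed to --spec-draft-n-max, and that the "No MTP" baseline means leaving out both speculative flags. `BASE_ARGS` and `build_args` are made-up names for illustration; every flag is copied from the script above.

```python
# Flags shared by every run, copied from the launch script above.
BASE_ARGS = [
    "llama-server.exe", "--host", "0.0.0.0", "--port", "8082",
    "-ngl", "99", "--threads", "2", "-b", "2048", "-ub", "512",
    "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
    "-fa", "on", "-c", "32768", "--mlock", "--cont-batching",
    "--reasoning", "off", "--no-mmproj",
    "--models-max", "1", "--sleep-idle-seconds", "480",
]


def build_args(depth):
    """Return the argument list for one benchmark run.

    depth=0 is the assumed "No MTP" baseline: the speculative flags are
    left out entirely. Any positive depth is passed to --spec-draft-n-max.
    """
    if depth < 0:
        raise ValueError("depth must be 0 or greater")
    args = list(BASE_ARGS)
    if depth > 0:
        args += ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(depth)]
    return args


# Print one command line per depth (0 = baseline, then MTP 1 to 4).
for depth in range(5):
    print(" ".join(build_args(depth)))
```

Each printed line can be pasted into the batch file in place of the llama-server.exe call; the environment variable and the cd line stay as they are.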

Full results

Depth  | Gen Speed | Prefill Speed | Acceptance Rate | Draft Accepted | vs Baseline
No MTP | 41.1 t/s  | 175.9 t/s     | n/a             | n/a            | 1x
MTP 1  | 52.5 t/s  | 164.1 t/s     | 95.5%           | 1436/1503      | 1.28x
MTP 2  | 73.5 t/s  | 105.0 t/s     | 91.3%           | 1698/1860      | 1.79x
MTP 3  | 75.2 t/s  | 152.6 t/s     | 86.9%           | 2105/2421      | 1.83x
MTP 4  | 7.93 t/s  | 48.1 t/s      | 79.1%           | 1738/2196      | 0.19x

Still not sure why prefill drops at MTP 2 and goes back to 152 t/s at MTP 3.
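The two derived columns in the table above can be recomputed from the raw figures: acceptance rate is accepted drafts divided by total drafts, and "vs Baseline" is generation speed divided by the No MTP generation speed. A small Python sketch, with the numbers copied from the table (`RUNS` and `summarize` are names made up for this example):

```python
# Raw figures from the results table:
# (label, generation speed in t/s, drafts accepted, drafts total).
RUNS = [
    ("No MTP", 41.1, None, None),
    ("MTP 1", 52.5, 1436, 1503),
    ("MTP 2", 73.5, 1698, 1860),
    ("MTP 3", 75.2, 2105, 2421),
    ("MTP 4", 7.93, 1738, 2196),
]


def summarize(runs):
    """Return (label, acceptance %, speedup vs the first row) for each run."""
    baseline = runs[0][1]  # the first row is treated as the baseline
    rows = []
    for label, gen_tps, accepted, drafted in runs:
        if drafted is None:
            acceptance = None  # no drafting in the baseline run
        else:
            acceptance = round(100 * accepted / drafted, 1)
        speedup = round(gen_tps / baseline, 2)
        rows.append((label, acceptance, speedup))
    return rows


for label, acceptance, speedup in summarize(RUNS):
    acc = "n/a" if acceptance is None else f"{acceptance}%"
    print(f"{label:<7} acceptance={acc:<6} speedup={speedup}x")
```

The output reproduces the percentages and multipliers in the table, including the 0.19x at MTP 4.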

This was without any context, so all of those responses started instantly. Once there is context (system prompt, tools, etc.), prefill runs at 1000 t/s with MTP 3.
I need to run more tests, so don't read too much into the prefill speeds.

Also tested other creative writing tasks and the speed-up is also noticeable!

This really shrinks the speed gap between the 35B MoE and the dense model, so I might just main one model now.

submitted by /u/iChrist