TLDR: The hype is real! 1.5x speedup. Up to 2x speedup with tensor parallelism!
After reading the PR I immediately hunted for MTP-compatible Q4_1 quants (Q4_1 offers a small speedup on these compute-starved older cards) but couldn't find any.
Luckily I came across this post, which shows how to graft the MTP tensors onto your own quants, so I attached them to the Bartowski quant I already had.
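If you do the graft yourself, it's worth checking that the MTP tensors actually landed in the output file before benchmarking. A minimal sketch; the `mtp`/`nextn` name substrings are my assumption about how these tensors are named, not something taken from the PR, and with a real file you'd feed in the names from `gguf.GGUFReader(path).tensors` (pip install gguf):

```python
def mtp_tensor_names(names):
    """Pick out tensor names that look like MTP (multi-token-prediction) weights."""
    return [n for n in names if "mtp" in n.lower() or "nextn" in n.lower()]

# Toy list standing in for GGUFReader(path).tensors names:
sample = ["blk.0.attn_q.weight", "blk.47.nextn.eh_proj.weight", "output.weight"]
print(mtp_tensor_names(sample))  # -> ['blk.47.nextn.eh_proj.weight']
```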
Setup
- CachyOS (Arch Linux)
- ROCm 7.2
Built the llama.cpp fork https://github.com/skyne98/llama.cpp-gfx906 with https://github.com/ggml-org/llama.cpp/pull/22673 and ran the following command with the included PR benchmark script:
```
llama-server -m ~/models/Qwen3.6-27B-MTP-Q4_1.gguf \
    --temp 1.0 --min-p 0.0 --top-k 20 --top-p 0.95 \
    --jinja --presence-penalty 1.5 \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    -ub 2048 -b 2048 \
    -fa 1 -np 1 \
    --no-mmap --no-warmup \
    -dev ROCm0,ROCm1 --fit on -fitt 256
```
Script Benchmark
Stock:
```
code_python      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.2
code_cpp         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.2
explain_concept  pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.3
summarize        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.4
qa_factual       pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.4
translation      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.4
creative_short   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.4
stepwise_math    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.3
long_code_review pred= 192 draft=   0 acc=   0 rate=n/a tok/s=26.0
```
With MTP on (`--spec-type mtp --spec-draft-n-max 2`):
```
code_python      pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=39.6
code_cpp         pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=36.5
explain_concept  pred= 192 draft= 154 acc= 113 rate=0.734 tok/s=36.7
summarize        pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=40.7
qa_factual       pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=39.4
translation      pred= 192 draft= 152 acc= 115 rate=0.757 tok/s=37.5
creative_short   pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=36.6
stepwise_math    pred= 192 draft= 146 acc= 118 rate=0.808 tok/s=39.0
long_code_review pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=37.8
```
Aggregate:
```
{
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1340,
  "total_draft_accepted": 1046,
  "aggregate_accept_rate": 0.7806,
  "wall_s_total": 51.42
}
```
With tensor parallelism on (`-sm tensor`):
```
code_python      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=35.0
code_cpp         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.8
explain_concept  pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.6
summarize        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.6
qa_factual       pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.7
translation      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.7
creative_short   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.7
stepwise_math    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.6
long_code_review pred= 192 draft=   0 acc=   0 rate=n/a tok/s=34.3
```
Combining MTP and tensor parallelism:
```
code_python      pred= 192 draft= 142 acc= 120 rate=0.845 tok/s=59.8
code_cpp         pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=56.6
explain_concept  pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=56.8
summarize        pred=  53 draft=  42 acc=  31 rate=0.738 tok/s=54.5
qa_factual       pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=56.8
translation      pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=57.3
creative_short   pred= 192 draft= 154 acc= 114 rate=0.740 tok/s=54.8
stepwise_math    pred= 192 draft= 140 acc= 121 rate=0.864 tok/s=59.6
long_code_review pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=56.2
```
Aggregate:
```
{
  "n_requests": 9,
  "total_predicted": 1589,
  "total_draft": 1214,
  "total_draft_accepted": 970,
  "aggregate_accept_rate": 0.799,
  "wall_s_total": 32.24
}
```
Real-world benchmark
The numbers above look absolutely insane; in the real world, however, the speedup dwindles quickly, and there is also a prefill-speed regression that is currently being worked on. I ran an 18k-token coding prompt, and it's clear the 60 t/s is only observable on very short prompts. Still, combining MTP and tensor parallelism does net a hefty 2x speedup.
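For reference, the headline ratios from the script benchmark above, just recomputed from the reported figures (nothing re-measured):

```python
# Values copied from the script-benchmark tables above (code_python task).
stock, mtp, combined = 26.2, 39.6, 59.8   # tok/s
print(round(mtp / stock, 2))              # -> 1.51  MTP alone
print(round(combined / stock, 2))         # -> 2.28  MTP + tensor parallelism

# Aggregate accept rate, matching the JSON above:
print(round(1046 / 1340, 4))              # -> 0.7806
```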
Stock:
```
prompt eval time =   53173.24 ms / 19191 tokens (    2.77 ms per token,   360.91 tokens per second)
       eval time =  337695.94 ms /  7791 tokens (   43.34 ms per token,    23.07 tokens per second)
      total time =  390869.18 ms / 26982 tokens
```
With MTP on:
```
prompt eval time =   84388.11 ms / 19191 tokens (    4.40 ms per token,   227.41 tokens per second)
       eval time =  260732.83 ms /  8408 tokens (   31.01 ms per token,    32.25 tokens per second)
      total time =  345120.94 ms / 27599 tokens
```
With tensor parallelism:
```
prompt eval time =   41925.27 ms / 19191 tokens (    2.18 ms per token,   457.74 tokens per second)
       eval time =  253262.25 ms /  8104 tokens (   31.25 ms per token,    32.00 tokens per second)
      total time =  295187.53 ms / 27295 tokens
```
Combining MTP and tensor parallelism:
```
prompt eval time =   49696.04 ms / 19191 tokens (    2.59 ms per token,   386.17 tokens per second)
       eval time =  155821.64 ms /  7440 tokens (   20.94 ms per token,    47.75 tokens per second)
      total time =  205517.69 ms / 26631 tokens
```
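Same arithmetic for the long-prompt run, again just recomputing the timings above:

```python
# Real-world 18k-prompt run, values copied from the timing blocks above.
print(round(47.75 / 23.07, 2))          # -> 2.07  decode speedup, MTP + TP vs stock
print(round(390869.18 / 205517.69, 2))  # -> 1.9   end-to-end wall-clock speedup
print(round(227.41 / 360.91, 2))        # -> 0.63  prefill ratio with MTP alone (the regression)
```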