I recently published MTP quants of Qwen 3.6 27B and was surprised by reports, here on Reddit and on HF, from users experiencing worse speed with speculative inference than without. That did not match what I was seeing, but when I reproduced their exact usage, it confirmed what they were reporting.
I tried to analyse the problem, made a few conjectures that later turned out to be false, and then started a full-blown systematic analysis: 300+ tests and benchmarks, collecting and comparing results as I changed various parameters. This is what I found:
F16 + MTP nearly triples coding speed. Q4_K_M + MTP slows down creative writing. Same feature, same model, same settings, opposite results.
I did not test every quant size (or I would still be running benchmarks days from now), but restricted myself to 5 significant ones. The other parameters I varied were task type (4 types), temperature (0.0, 0.3, 0.7), and the quantisation of the MTP layer (Q8 vs matching the model quant). Temperature and MTP-layer quant have very little impact on the outcome.
Cumulative average decode speeds (tok/s) with MTP, compared to the no-MTP baseline, varying model quant and task type:
| quant | baseline (no MTP) | code | factual | analysis | creative |
|---|---|---|---|---|---|
| Q4_K_M | 15.1 | 19.7 | 17.5 | 14.9 | 13.7 |
| Q5_K_M | 13.1 | 19.2 | 16.5 | 14.7 | 12.6 |
| Q6_K | 13.4 | 20.1 | 17.6 | 15.2 | 13.4 |
| Q8_0 | 11.4 | 25.4 | 21.7 | 18.6 | 16.9 |
| F16 | 6.6 | 17.9 | 14.9 | 12.6 | 11.0 |
Memory bandwidth dictates how much the model can benefit from speculative decoding. F16 at 51GB crawls at 6.6 tok/s because every token means dragging the full model through memory; accepted MTP drafts skip that pass. Q4_K_M at 16GB is already fast enough that the draft overhead is barely worth it on anything less predictable than code.
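As a sanity check that decode is bandwidth-bound, here is a back-of-the-envelope calculation. The ~400 GB/s figure is the M2 Max's nominal memory bandwidth, an assumption on my part rather than something I measured:

```python
# Bandwidth-bound decode ceiling: each generated token streams
# every model weight through memory exactly once.
mem_bw = 400.0  # GB/s, assumed M2 Max nominal bandwidth

for name, size_gb, measured in [("F16", 51, 6.6), ("Q4_K_M", 16, 15.1)]:
    ceiling = mem_bw / size_gb  # tok/s upper bound
    print(f"{name}: ceiling ~{ceiling:.1f} tok/s, measured {measured} tok/s")
```

F16's measured 6.6 tok/s sits just under its ~7.8 tok/s theoretical ceiling, which is what a fully memory-bound decode looks like; Q4_K_M sits well below its ~25 tok/s ceiling, so each skipped weight pass saves proportionally less.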
What controls the draft-token acceptance rate:
| task | acceptance rate | examples |
|---|---|---|
| code | 79-89% | writing functions, debugging, refactoring |
| factual | 62-70% | definitions, translation, math proofs |
| analysis | 48-56% | tradeoff breakdowns, technical comparisons |
| creative | 39-48% | stories, poetry, brainstorming, roleplay |
40 points from code to creative. I tried three temperatures and five quants; the numbers barely changed. Roughly 4 out of 5 draft tokens are accepted on coding tasks, but not even 1 in 2 on creative tasks. Nothing else comes close to mattering as much as what you're generating.
I also tested the optimal number of draft tokens for this model in all the above scenarios. 3 is the sweet spot: go higher and acceptance falls faster than the extra drafts can compensate. F16 is the exception: N=4 beats N=3 (17.9 vs 16.2 tok/s) because at a 6.6 tok/s baseline every surviving draft token is worth the lower hit rate.
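A toy model makes the tradeoff concrete. Assume each draft token is accepted independently with probability p, the verification pass always yields one token (a correction, or the bonus token when every draft survives), and one draft step costs a fraction r of a full decode pass. The i.i.d. assumption and the r values below are illustrative guesses, not measurements:

```python
def mtp_tps(base_tps: float, p: float, n: int, r: float) -> float:
    """Toy throughput model for speculative decoding with an MTP head.

    base_tps: decode speed without MTP (one full weight pass per token)
    p:        per-token draft acceptance probability (assumed i.i.d.)
    n:        number of draft tokens per verification pass
    r:        cost of one draft step relative to a full decode pass
    """
    # Expected tokens per cycle: the verify pass always yields one token,
    # plus the run of accepted drafts before the first rejection:
    # E = 1 + p + p^2 + ... + p^n = (1 - p**(n+1)) / (1 - p)
    expected = (1 - p ** (n + 1)) / (1 - p)
    cycle_cost = 1 + n * r  # one full verify pass plus n draft steps
    return base_tps * expected / cycle_cost

# The MTP head costs roughly the same in absolute terms at every quant,
# so its *relative* cost r is high for fast quants, low for slow ones.
for label, base, p, r in [
    ("Q4_K_M creative", 15.1, 0.45, 0.30),
    ("F16 code", 6.6, 0.85, 0.05),
]:
    for n in (3, 4):
        print(f"{label}, N={n}: {mtp_tps(base, p, n, r):.1f} tok/s")
```

With those guesses the toy model reproduces both headline effects: Q4_K_M creative lands below its 15.1 tok/s baseline, and F16 coding prefers N=4, because a near-free draft step makes attempting one more (less likely) token a good bet.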
| use case | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | F16 |
|---|---|---|---|---|---|
| coding | 🟢 +31% | 🟢 +47% | 🟢 +50% | 🟢 +123% | 🟢 +171% |
| factual QA | 🟡 +16% | 🟢 +26% | 🟢 +31% | 🟢 +90% | 🟢 +125% |
| analysis | 🔴 -1% | 🟡 +12% | 🟡 +13% | 🟢 +64% | 🟢 +91% |
| creative | 🔴 -9% | 🔴 -4% | 🔴 -1% | 🟢 +48% | 🟢 +67% |
🟢 speeds up, 🟡 marginal gain, 🔴 slowdown.
- Q8_0 and F16: always use speculative decoding with the MTP layer.
- Coding tasks at any quant: keep it on.
- Q4_K_M (and below) on creative tasks: keep it off (see the helper sketch below).
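If you script your launches, the whole decision table collapses to a lookup. A hypothetical helper (the function name, threshold, and task labels are mine; the numbers come straight from the summary table above):

```python
# Measured speed change with MTP, in %, from the summary table above.
SPEEDUP = {
    "Q4_K_M": {"coding": 31,  "factual": 16,  "analysis": -1, "creative": -9},
    "Q5_K_M": {"coding": 47,  "factual": 26,  "analysis": 12, "creative": -4},
    "Q6_K":   {"coding": 50,  "factual": 31,  "analysis": 13, "creative": -1},
    "Q8_0":   {"coding": 123, "factual": 90,  "analysis": 64, "creative": 48},
    "F16":    {"coding": 171, "factual": 125, "analysis": 91, "creative": 67},
}

def should_enable_mtp(quant: str, task: str, min_gain_pct: float = 5.0) -> bool:
    """True if MTP gave at least min_gain_pct speedup in these benchmarks."""
    return SPEEDUP[quant][task] >= min_gain_pct

print(should_enable_mtp("Q8_0", "creative"))    # True  (+48%)
print(should_enable_mtp("Q4_K_M", "creative"))  # False (-9%)
```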
One last observation: with thinking mode turned on for coding tasks, Q8_0 draft-token acceptance drops from 87% to 73%. Still a +94% speedup, just not the full +123%.
Test environment: Apple Silicon M2 Max 96GB, manual llama.cpp build with the MTP PR applied, Qwen3.6-27B with MTP layers preserved.