MTP benchmark results: the nature of the generative task dictates whether you get faster inference (coding) or slower inference (creative) from speculative decoding. No other factor comes close.

I recently published MTP quants of Qwen 3.6 27B and I was surprised by reports here on reddit, and on HF, from users who were seeing worse speeds with speculative decoding than without. This did not match what I was seeing, but when I reproduced their exact usage, I confirmed their results.

I tried to analyse the problem, made a few conjectures which later turned out to be false, and started a full-blown systematic analysis: 300+ tests and benchmarks, collecting and comparing the results of varying several parameters. This is what I found:

F16 + MTP nearly triples coding speed. Q4_K_M + MTP slows down creative writing. Same feature, same model, same settings, opposite results.

I did not test all quant sizes, otherwise I would still be here in a few days, but restricted myself to 5 significant ones. The other parameters I varied were task type (4 types), temperature (0.0, 0.3, 0.7), and quantisation of the MTP layer (q8 vs matching the model quant). Temperature and MTP quant had very little impact on the outcome.

Cumulative average decode speeds (tok/s) with MTP, compared to the no-MTP baseline, across model quants and task types:

| quant | baseline (no MTP) | code | factual | analysis | creative |
|---|---|---|---|---|---|
| Q4_K_M | 15.1 | 19.7 | 17.5 | 14.9 | 13.7 |
| Q5_K_M | 13.1 | 19.2 | 16.5 | 14.7 | 12.6 |
| Q6_K | 13.4 | 20.1 | 17.6 | 15.2 | 13.4 |
| Q8_0 | 11.4 | 25.4 | 21.7 | 18.6 | 16.9 |
| F16 | 6.6 | 17.9 | 14.9 | 12.6 | 11.0 |
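The speedup percentages in the table further down follow directly from these numbers. A quick sketch to recompute them (values copied from the table above; expect occasional ±1-point differences, since the tok/s figures are themselves rounded):

```python
# Recompute the MTP speedup percentages from the decode speeds above.
speeds = {
    "Q4_K_M": {"base": 15.1, "code": 19.7, "factual": 17.5, "analysis": 14.9, "creative": 13.7},
    "Q8_0":   {"base": 11.4, "code": 25.4, "factual": 21.7, "analysis": 18.6, "creative": 16.9},
    "F16":    {"base": 6.6,  "code": 17.9, "factual": 14.9, "analysis": 12.6, "creative": 11.0},
}

for quant, row in speeds.items():
    base = row["base"]
    gains = {task: round((toks / base - 1) * 100)
             for task, toks in row.items() if task != "base"}
    print(quant, gains)  # e.g. Q8_0 code: +123%, F16 code: +171%
```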

Memory bandwidth dictates how much a model can benefit from speculative decoding. F16 at 51GB crawls at 6.6 tok/s because every token means dragging the full model through memory; accepted MTP drafts skip that pass. Q4_K_M at 16GB is already fast enough that the draft overhead is barely worth it on anything less predictable than code.
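A back-of-the-envelope check, assuming the M2 Max's ~400 GB/s memory bandwidth and that every decoded token reads all model weights once (both are my simplifying assumptions, not measurements):

```python
# Rough bandwidth-bound decode ceiling: tok/s ≈ bandwidth / model size.
bandwidth_gb_s = 400  # M2 Max unified memory bandwidth spec

# Model sizes from the text: F16 = 51 GB, Q4_K_M = 16 GB.
ceiling = {size_gb: bandwidth_gb_s / size_gb for size_gb in (51, 16)}
print(ceiling)  # F16 ceiling ~7.8 tok/s; the measured 6.6 sits close to it
```

The measured F16 baseline (6.6 tok/s) is near the ~7.8 tok/s ceiling, i.e. clearly bandwidth-bound; Q4_K_M's 15.1 tok/s is well under its ~25 tok/s ceiling, so other overheads dominate there and drafts buy less.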

What controls the draft-token acceptance rate:

| task | acceptance | examples |
|---|---|---|
| code | 79-89% | writing functions, debugging, refactoring |
| factual | 62-70% | definitions, translation, math proofs |
| analysis | 48-56% | tradeoff breakdowns, technical comparisons |
| creative | 39-48% | stories, poetry, brainstorming, roleplay |

That's a 40-point spread from code to creative. I tried three temperatures and five quants; the numbers barely changed. Four out of five draft tokens are accepted on coding tasks; fewer than one in two on creative tasks. Nothing else comes close to mattering as much as what you're generating.
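To see why the acceptance rate dominates, a toy model helps: if each draft token is accepted independently with probability p (my simplifying assumption, not measured) and the target model always contributes one token per verification pass, then N drafts yield on average (1 − p^(N+1)) / (1 − p) tokens per pass:

```python
def expected_tokens(p: float, n: int) -> float:
    """Average tokens emitted per verification pass: n draft tokens,
    each accepted independently with probability p, plus the target
    model's one guaranteed token."""
    return (1 - p ** (n + 1)) / (1 - p)

# Mid-range acceptance rates from the table above, with N=3 drafts.
print(expected_tokens(0.85, 3))  # coding: ~3.19 tokens per pass
print(expected_tokens(0.44, 3))  # creative: ~1.72 tokens per pass
```

Coding gets almost every pass fully accepted; creative barely clears the one token the model would have produced anyway, so the drafting overhead eats the gain.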

I also tested the optimal number of draft tokens for this model across all the above scenarios: N=3 is the sweet spot. Go higher and acceptance falls faster than the extra drafts compensate. F16 is the exception: N=4 beats N=3 (17.9 vs 16.2 tok/s) because at a 6.6 tok/s baseline every surviving draft token is worth the lower hit rate.
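The N=3 vs N=4 tradeoff also drops out of a toy model: if one draft step costs a fraction c of a full verification pass, per-pass speedup is roughly the expected tokens divided by 1 + N·c. The cheaper drafting is relative to verification (the F16 case), the higher the optimal N. A sketch with p and c values chosen purely for illustration, and ignoring that acceptance also degrades at deeper draft positions:

```python
def speedup(p: float, n: int, c: float) -> float:
    # Expected tokens per pass (1 - p**(n+1)) / (1 - p), divided by
    # the pass cost 1 + n*c, where c is the relative cost of one draft.
    return (1 - p ** (n + 1)) / (1 - p) / (1 + n * c)

for c in (0.15, 0.05):  # expensive vs cheap drafts relative to verify
    best = max(range(1, 7), key=lambda n: speedup(0.6, n, c))
    print(f"c={c}: optimal N={best}")  # optimum shifts up as c shrinks
```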

| use case | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | F16 |
|---|---|---|---|---|---|
| coding | 🟢 +31% | 🟢 +47% | 🟢 +50% | 🟢 +123% | 🟢 +171% |
| factual QA | 🟡 +16% | 🟢 +26% | 🟢 +31% | 🟢 +90% | 🟢 +125% |
| analysis | 🔴 -1% | 🟡 +12% | 🟡 +13% | 🟢 +64% | 🟢 +91% |
| creative | 🔴 -9% | 🔴 -4% | 🔴 -1% | 🟢 +48% | 🟢 +67% |

🟢 speeds up, 🟡 marginal gain, 🔴 slowdown.

  • Q8_0 and F16: always use speculative decoding with the MTP layer.
  • Coding tasks at any quant: keep it on.
  • Q4_K_M (and below), creative tasks: keep it off.

One last observation: with thinking mode turned on for coding tasks, Q8_0 draft-token acceptance drops from 87% to 73%. Still a +94% speedup, just not the full +123%.

Test environment: Apple Silicon M2 Max 96GB, llama.cpp manual build with the MTP PR, Qwen3.6-27B with MTP layers preserved.

submitted by /u/ex-arman68