PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs.
Qwen3.6 27B, single-stream chat, temperature 0, median of 5 runs (n = --spec-draft-n-max where shown):
Strix Halo (Framework Desktop, ROCm 7.0.2):
- Q4_K_M: 11.7 → 21.2 tok/s (1.81×)
- Q8_0: 7.4 → 18.1 tok/s (2.44×)
Single RTX 3090 @ 450W (CUDA 12.9, driver 590.26):
- Q4_K_M: 38.7 → 59.5 tok/s (1.54×, n=2)
Dual RTX 3090, layer-split:
- Q8_0: 25.7 → 55.9 tok/s (2.17×, n=3)
Qwen3.6 35B-A3B (MoE):
- Strix Halo: 49.5 → 69.4 tok/s (1.40×)
- 3090: 120.0 → 148.3 tok/s (1.24×)
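If you want to sanity-check the multipliers, here is a minimal Python sketch that recomputes them from the tok/s pairs listed above. The dictionary labels and the `speedup` helper are my own naming for illustration, not anything from llama.cpp or the raw YAML.

```python
# Baseline and MTP tok/s, copied from the list above as (baseline, mtp).
RESULTS = {
    "27B Q4_K_M, Strix Halo": (11.7, 21.2),
    "27B Q8_0, Strix Halo": (7.4, 18.1),
    "27B Q4_K_M, single 3090": (38.7, 59.5),
    "27B Q8_0, dual 3090": (25.7, 55.9),
    "35B-A3B, Strix Halo": (49.5, 69.4),
    "35B-A3B, 3090": (120.0, 148.3),
}


def speedup(baseline_tps, mtp_tps):
    """Ratio of MTP throughput to baseline throughput."""
    return mtp_tps / baseline_tps


for label, (base, mtp) in RESULTS.items():
    print(f"{label}: {speedup(base, mtp):.2f}x")
```

Because the tok/s figures here are already rounded to one decimal, a recomputed ratio can differ from the posted multiplier by 0.01.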
Enable with --spec-type draft-mtp --spec-draft-n-max N. Output is byte-identical to baseline at the same seed and temperature.
MTP helps MoE less because only ~3B of the 35B params run per token: each forward pass is already cheap, so saving N-1 of them is a smaller win. The sweet-spot N also depends on the rig: the uncapped 3090 prefers n=2 at Q4, while the capped 3090 and Strix Halo prefer n=3.
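To make the sweet-spot point concrete, here is a rough Python sketch of the textbook speculative-decoding model, not llama.cpp's implementation and not fitted to my runs. It assumes each drafted token is accepted independently with probability `accept_rate`, and that each extra drafted token adds `extra_cost` to the step, measured as a fraction of one baseline forward pass. Both parameters and all function names are hypothetical.

```python
def expected_tokens_per_step(accept_rate, n_draft):
    """Expected tokens emitted per verify pass: 1 + p + p^2 + ... + p^n."""
    if accept_rate >= 1.0:
        return float(n_draft + 1)
    return (1 - accept_rate ** (n_draft + 1)) / (1 - accept_rate)


def modeled_speedup(accept_rate, n_draft, extra_cost):
    """Tokens per step divided by step cost, relative to plain decoding.

    extra_cost is an assumed per-drafted-token overhead (draft plus verify),
    as a fraction of one baseline forward pass.
    """
    step_cost = 1 + n_draft * extra_cost
    return expected_tokens_per_step(accept_rate, n_draft) / step_cost


def best_n(accept_rate, extra_cost, max_n=8):
    """Draft length in 1..max_n with the highest modeled speedup."""
    return max(range(1, max_n + 1),
               key=lambda n: modeled_speedup(accept_rate, n, extra_cost))


# Made-up inputs, only to show how the optimum moves with overhead.
for cost in (0.15, 0.25):
    n = best_n(accept_rate=0.7, extra_cost=cost)
    print(f"extra_cost={cost}: best n={n}, "
          f"modeled speedup={modeled_speedup(0.7, n, cost):.2f}x")
```

In this toy model, raising the per-token overhead pushes the best N down, which is one plausible reading of why different rigs land on different N. Real acceptance rates vary by position and prompt, so treat it as intuition only.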
Couple of follow-ups from the last thread:
- The 3090 numbers in my earlier post were undercut by an undisclosed 200W cap (breaker-popping issue with 4 cards on one circuit). I re-benched 26 of the 3090 models at 350W and 450W; dense 27-32B models gained 70% to 113%. Writeup with the curve and full table: https://calebcoffie.com/blog/how-much-do-power-limits-affect-llm-benchmark-tok-s
- Prompt-processing tok/s and prompt-token columns are now on every row of the benchmarks page.
MTP writeup with both rigs side-by-side, build commands, and per-shape tables: https://calebcoffie.com/blog/benchmarking-llama-cpp-mtp-on-strix-halo
Raw YAML per run: https://github.com/CCoffie/CalebCoffie.com/tree/main/content/benchmarks/runs