MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon

TLDR: 28 tok/s → 63 tok/s on Qwen3.6-27B on a MacBook Pro M5 Max. 2.24× faster at real temperature 0.6.

Works for coding, creative writing, and chat

[Demo GIF: https://i.redd.it/i9x794c0q7zg1.gif]

  • Works on ANY MTP model: No external drafter, no extra memory usage. Uses the model's own built-in MTP heads and works on any model that ships them.
  • Not greedy: Mathematically exact temperature sampling via rejection sampling, with adjustable temperature for any task. Every other speculative decoding project on Apple Silicon is greedy-only.
  • Custom kernel: Built on a patched MLX fork with custom Metal kernels, compiled verify graphs, innovation-tape GDN rollback, and a draft-only requantised LM head.
  • Full CLI: mtplx start wizard, model download, model inspection with four-tier MTP compatibility detection, configurable depth 2-7+, OpenAI/Anthropic API server, browser chat, terminal chat, benchmarking suite, health diagnostics, crash-safe fan control with idle-aware auto-restore, and a 562-test suite.
  • Full serving stack: OpenAI + Anthropic compatible API, browser chat UI, terminal chat. Point your editor at localhost and go.

What Is MTPLX?

MTPLX uses a model's built-in MTP heads as speculative drafters to increase decode speeds on LLMs by up to 2.25x, all while preserving the model's default sampling settings, so it is usable for real coding and creative-writing work rather than greedy-only decoding.

QWEN 3.6 27B @ 63 TPS on a MacBook Pro M5 Max

Using MTPLX I increased decode speeds on Qwen 3.6 27B 4-bit MLX from 28 tok/s → 63 tok/s on a MacBook Pro M5 Max at temperature 0.6 with top_p 0.95 and top_k 20, the exact sampling settings Qwen recommends for coding.

Qwen 3.6 27B ships with built-in MTP heads that support up to depth 5. I ran a sweep across D2, D3, D4, and D5 to find the optimal depth for this model on this hardware:

[Figure: depth sweep results for D2-D5: https://preview.redd.it/erim8d4rq7zg1.png?width=1200&format=png&auto=webp&s=0fd76cbffd9bbfcb67acac16ef4c302e1310d8e9]

D3 was the sweet spot: the acceptance-to-verify-time ratio was high enough that TPS increased the most. D4 and D5 still accept well at the early positions, but the deeper positions start costing more in verify time than they save in accepted tokens.
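
For intuition, here is a rough back-of-envelope model of that trade-off (plain Python; the acceptance rates and timings below are made-up placeholders, not my measured numbers): each deeper position only lands if every earlier one was accepted, while verify cost keeps growing with depth.

```python
# Toy expected-throughput model for MTP speculative decoding.
# All numbers are illustrative placeholders, not MTPLX measurements.

def expected_tokens_per_cycle(accept_rates):
    """1 guaranteed token (corrected or bonus) + draft tokens that survive."""
    expected, survive = 1.0, 1.0
    for a in accept_rates:
        survive *= a          # position i only counts if every earlier position accepted
        expected += survive
    return expected

def estimated_tps(accept_rates, draft_ms, verify_ms):
    return expected_tokens_per_cycle(accept_rates) / ((draft_ms + verify_ms) / 1000.0)

# Hypothetical per-position acceptance and per-cycle costs at each depth:
profiles = {
    2: ([0.85, 0.75], 6.0, 38.0),
    3: ([0.85, 0.75, 0.65], 8.0, 42.0),
    4: ([0.85, 0.75, 0.65, 0.50], 10.0, 50.0),
    5: ([0.85, 0.75, 0.65, 0.50, 0.40], 12.0, 60.0),
}
for depth, (rates, d_ms, v_ms) in profiles.items():
    print(f"D{depth}: ~{estimated_tps(rates, d_ms, v_ms):.1f} tok/s")
```

With numbers in this ballpark the estimate peaks around D3 and falls off beyond it, which matches the shape of the sweep.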

These results are at real temperature 0.6 with exact probability-ratio rejection sampling and residual correction.

This means you can actually use Qwen 3.6 27B for real coding work with a 2.25x speed increase without sacrificing output quality.

How Is This Different From DFlash / DDTree?

[Figure: MTPLX vs DFlash / DDTree comparison: https://preview.redd.it/ycxf4qptq7zg1.png?width=1200&format=png&auto=webp&s=8591cd1acfb3ff7d20801cd5bbca5339ff977e6d]

DFlash MLX has greater absolute speed, but it is restricted to greedy (temperature 0) sampling, which severely limits its real-world use. It also requires an external drafter model, which costs additional memory and has to be created for every new model that is released.

DDTree adds tree-based verification on top of DFlash so it inherits the same limitations: greedy only, external drafter required.

The reason for this comes down to how each system drafts. MTP heads draft sequentially. Each token sees the previous draft tokens, so every position produces a real probability distribution. DFlash drafts all 16 tokens simultaneously in a parallel diffusion pass. Token 8 does not know what token 7 is. Without that sequential dependency, there is no per-token probability distribution, which means you cannot do the rejection sampling maths that makes temperature work.
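
For reference, the per-token rule this is pointing at looks roughly like this (standard speculative sampling in the Leviathan/Chen style, sketched with NumPy rather than taken from MTPLX): it is exactly this step that needs a real per-position draft distribution q.

```python
import numpy as np

def accept_or_resample(p, q, drafted_token, rng=np.random.default_rng()):
    """p: target distribution, q: draft distribution at this position (1-D, sum to 1).
    Accept with probability min(1, p/q); on rejection, sample from the residual
    (p - q)+ so the emitted token is still an exact sample from p."""
    if rng.random() < min(1.0, p[drafted_token] / q[drafted_token]):
        return True, int(drafted_token)
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return False, int(rng.choice(len(p), p=residual))
```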

MTPLX works with any model that retains its MTP heads, lets you choose how many MTP heads to use, and runs any locally saved or HuggingFace model that ships them.

Architecture

[Figure: MTPLX architecture overview: https://preview.redd.it/q0m2sjwyq7zg1.png?width=1200&format=png&auto=webp&s=696b2e35abe190815b42ef350dfb4288ce794439]

Layer 0: MLX Runtime

MTPLX runs on a patched MLX fork. Stock MLX's quantised matrix-vector kernel is tuned for large M (prefill). During MTP verify, M is 3 to 6, one position per draft token. Stock stalls at these shapes. The patch: wider simdgroups, loop unrolling, 10 lines of Metal. Exact, 0.0 diff against stock.
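
If you want to see the shape sensitivity yourself, here is a rough timing sketch using stock MLX's quantised matmul (defaults of group size 64 / 4 bits; dimensions and exact numbers are illustrative and will vary by machine):

```python
import time
import mlx.core as mx

def time_quantized_matmul(M, N=5120, K=5120, iters=50):
    w = mx.random.normal((N, K))
    wq, scales, biases = mx.quantize(w, group_size=64, bits=4)
    x = mx.random.normal((M, K))
    mx.eval(wq, scales, biases, x)
    start = time.perf_counter()
    for _ in range(iters):
        y = mx.quantized_matmul(x, wq, scales, biases, transpose=True,
                                group_size=64, bits=4)
        mx.eval(y)
    return (time.perf_counter() - start) / iters * 1e3   # ms per call

for m in (1, 4, 16, 512):   # plain decode, MTP verify, dflash-style, prefill-like
    print(f"M={m}: {time_quantized_matmul(m):.3f} ms")
```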

On top of the fork sit four custom Metal kernels registered as MLX primitives:

  • Innovation-tape GDN capture: records KB-scale (token, gate, state-delta) tuples during draft. On rejection, replays from the tape instead of restoring full recurrent state. Replaces hundreds of MB of state snapshots with tiny deltas. Bit-exact against reference.
  • GraphBank: a cache of mx.compile-compiled verify graphs keyed by (suffix_length, depth, profile). Each verify shape gets one compiled graph reused across all cycles. Capture-commit overhead: 0.073 ms per cycle versus 47 ms verify per cycle, nearly three orders of magnitude smaller than the work it manages (see the sketch after this list).
  • Draft-only requantised LM head: the target's lm_head stays at model precision. A separate 4-bit LM head is built in memory for draft-only use. Cuts draft time by 29% without touching target accuracy.
  • Small-M verify qmv: direct successor of dflash-mlx's M=16 approach, retuned for MTPLX's M=3 to 6 verify shapes.
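
A minimal sketch of the GraphBank pattern (assuming some uncompiled verify_forward function; the real cache keys on more state than this):

```python
import mlx.core as mx

class GraphBank:
    """Cache of compiled verify graphs, one per (suffix_length, depth, profile)."""

    def __init__(self, verify_forward):
        self._verify_forward = verify_forward   # hypothetical uncompiled verify step
        self._graphs = {}

    def get(self, suffix_length, depth, profile):
        key = (suffix_length, depth, profile)
        if key not in self._graphs:
            # mx.compile traces the graph once; later cycles reuse it as-is
            self._graphs[key] = mx.compile(self._verify_forward)
        return self._graphs[key]

# usage (hypothetical names):
#   graph = bank.get(suffix_length=4, depth=3, profile="default")
#   target_logits = graph(draft_tokens, cache_state)
```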

Layer 1: Single-model runtime

One checkpoint. The target model and drafter are the same model. Qwen3.6-27B ships native MTP heads and MTPLX uses them. Zero RAM for a second model. The trunk's KV cache uses a committed-history contract verified against the vLLM CUDA reference at cosine > 0.9998 through depth 5.
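
A toy illustration of what the committed-history contract means in practice (hypothetical names, not MTPLX's code): the cache only ever grows by accepted tokens, and speculative entries are trimmed back on rejection instead of the whole cache being reset.

```python
class CommittedKVCache:
    """Toy committed-history cache: entries before `committed` are permanent,
    entries after it are speculative and may be rolled back."""

    def __init__(self):
        self.entries = []     # stand-in for per-layer K/V tensors
        self.committed = 0

    def append_speculative(self, kv):
        self.entries.append(kv)

    def commit(self, n_accepted):
        # accepted draft positions become permanent history
        self.committed += n_accepted
        # drop only the rejected tail; never touch committed history
        del self.entries[self.committed:]
```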

Layer 2: Speculative cycle (the hot loop)

Per cycle: the MTP head drafts K tokens, each seeing the previous draft. The target verifies all K in one batched forward via a compiled GraphBank path. Probability-ratio acceptance (Leviathan-Chen) decides per position in fp32. Residual correction (p - q)+ emits a clean replacement on rejection. A bonus token falls out free when all K accept. The innovation tape commits accepted GDN state deltas and rolls back rejected ones.
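
In pseudocode the cycle looks roughly like this (a sketch with hypothetical helper names, reusing the accept_or_resample rule from above; the real hot loop also handles the cache bookkeeping that is elided here):

```python
def speculative_cycle(model, mtp_head, cache, tape, last_token, K=3):
    # 1) draft K tokens sequentially with the model's own MTP head
    draft_tokens, draft_probs = mtp_head.draft(last_token, cache, depth=K)

    # 2) verify all K positions in one batched forward (compiled GraphBank path)
    target_probs = model.verify(draft_tokens, cache)        # K+1 distributions

    # 3) probability-ratio acceptance per position; residual sample on first rejection
    emitted, n_accepted = [], 0
    for i, tok in enumerate(draft_tokens):
        ok, out = accept_or_resample(target_probs[i], draft_probs[i], tok)
        emitted.append(out)
        if not ok:
            break
        n_accepted += 1
    else:
        emitted.append(sample(target_probs[K]))              # free bonus token

    # 4) commit state deltas for accepted positions, roll back the rest
    tape.commit(n_accepted)
    return emitted
```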

Layer 3: Serving stack

Real API server. OpenAI-compatible /v1/chat/completions and /v1/completions with streaming SSE. Anthropic-compatible /v1/messages. /v1/models, /health, /metrics. Engine sessions with per-chat KV state. Session Bank preserves warm-prefix exact state across turns, verified at logits max_abs_diff = 0.0 against fresh forwards. Browser chat UI at localhost with live tok/s, markdown rendering, code-block copy, and stop button. Terminal chat via mtplx chat.
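
For example, pointing the standard OpenAI Python client at the local server (the port and model id here are placeholders; use whatever mtplx prints when the server starts):

```python
from openai import OpenAI

# placeholder port / model id: substitute what your mtplx server reports
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="qwen3.6-27b-mtplx",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```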

What I Had To Solve

[Figure: the four stacked problems: https://preview.redd.it/qc80pu52r7zg1.png?width=1200&format=png&auto=webp&s=f28b17e1c061cb4c623b02995970591132b05485]

Native MTP on Apple Silicon did not work by default. There were four stacked problems:

1) Recursive depth collapse

When MTP is run recursively, accuracy collapses after depth 1: 91% → 63% → 44% → 27% → 17%.

Everyone who tried native MTP saw this and gave up. I SSH'd into my 2x3090 PC running vLLM with MTP-5, traced the exact MTP execution, and compared it against MLX token-by-token. The finding: MLX was resetting the MTP attention KV cache every speculative cycle. vLLM does not. It persists MTP history across cycles. One contract fix: depth 2 acceptance jumped from 49% to 74%.

2) Precision mismatch

Every project was using BF16 MTP heads on quantised 4-bit trunks. The MTP head is more precise than the hidden states it receives, which amplifies quantisation noise through recursive prediction. I grafted calibrated INT4 MTP weights onto the trunk, matching MTP precision to trunk precision. Depth 3 jumped from 30% to 88%.
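
In MLX terms the fix is roughly this shape (a sketch only; it assumes the MTP head is exposed as an nn.Module attribute called mtp_head, and the real grafting uses calibrated INT4 weights rather than a naive requantisation):

```python
import mlx.nn as nn

def match_mtp_precision(model, group_size=64, bits=4):
    """Quantise only the MTP head so its precision matches the 4-bit trunk,
    instead of running a BF16 head on already-quantised hidden states."""
    nn.quantize(model.mtp_head, group_size=group_size, bits=bits)
    return model
```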

3) MLX verify bottleneck

Even with high acceptance, stock MLX's verify pass was so expensive that MTP was slower than plain autoregressive decode. MLP operations accounted for 51% of verify time.

I patched MLX's Metal qmv shader for the small verify shapes MTP produces (10 lines, wider simdgroups + loop unrolling), built an innovation-tape GDN capture system for efficient state rollback, batched target probability distributions into a single MLX eval boundary, and deferred MTP history materialisation.
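
The "single MLX eval boundary" point is about MLX's lazy evaluation: evaluating after every drafted position forces one graph execution per token, while gathering the lazy arrays and crossing the eval boundary once lets MLX schedule all of them together. A minimal sketch:

```python
import mlx.core as mx

def per_token_eval(lazy_arrays):
    # anti-pattern: one graph execution per drafted position
    for a in lazy_arrays:
        mx.eval(a)

def single_eval(lazy_arrays):
    # one eval boundary: all K positions evaluated in a single pass
    mx.eval(*lazy_arrays)
```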

Those four stacked optimisations cut verify cycle time from ~90 ms to ~47 ms per call, taking MTP from slower than plain autoregressive to 2.24× faster.

4) TPS decay

On long responses (8k+ tokens), throughput collapsed. I spent 16 hours trying to figure out why TPS would decay from 50 to 25, a 50% decrease, investigating 24 different profiles: lazy-eval graph accumulation, cache growth, state provenance, paged attention, owned recurrent caches, two-pass Metal SDPA.

None of them solved it.

The problem was hilariously simple. It turns out the speculative decode loop sustains significantly heavier GPU load than normal autoregressive. Every cycle runs a full batched verify forward plus draft computation plus MTP history maintenance.

The additional sustained workload was pushing the M5 Max SoC to 103°C, and macOS's default fan curve ramps far too late. By the time the fans respond, the GPU has already downclocked.

I introduced a MAX mode into the CLI. Using ThermalForge, fans are locked at full speed before generation starts, with a detached watchdog that restores fans to auto if the process dies for any reason. TPS decay dropped from 50% to 6.7%, and GPU clock retention went from 85.6% to 97.1%.
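
The crash-safety pattern is the part worth copying: the watchdog must not share the generator's fate. A generic sketch of the idea using only the standard library (the restore command is a placeholder for whatever fan-control tool you use, not a ThermalForge API):

```python
import os
import sys
import subprocess

WATCHDOG = r"""
import os, sys, time, subprocess
parent = int(sys.argv[1])
while True:
    try:
        os.kill(parent, 0)             # raises once the generator process is gone
    except OSError:
        subprocess.run(sys.argv[2:])   # restore fans to auto, however the parent died
        break
    time.sleep(2)
"""

def spawn_fan_watchdog(restore_cmd):
    """Launch a detached watchdog that restores fans if this process dies.
    restore_cmd is a placeholder, e.g. ["my-fan-tool", "--auto"]."""
    subprocess.Popen(
        [sys.executable, "-c", WATCHDOG, str(os.getpid()), *restore_cmd],
        start_new_session=True,          # keeps running even if the parent is killed
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```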

16 hours of kernel debugging, solved by a fan controller.

Caveats

  1. The 63 TPS figure was achieved on a 160-token high-acceptance prompt. Real workflows on an M5 Max will most likely see 50-55 TPS.
  2. I am currently working on the thermal issue by optimising the kernel. If you do not run MAX mode (100% fan mode) you will see significant TPS decline on long prompts due to thermal throttling.
  3. Unsurprisingly, most MLX quants have MTP heads stripped since they used to be pointless on MLX. Many MLX models are incompatible with MTPLX for now. I am hoping my work with MTPLX will drive more people to create MLX quants with MTP heads present and optimised for inference.

In the meantime you can run my official Qwen 3.6 27B MTPLX Optimised model from HuggingFace. The CLI makes it easy to set up and download.

If you publish MLX quants, please keep the MTP heads. They are around 200MB on a 27B model, cost almost nothing in memory, and are now worth a 2.25× speedup.

Really looking forward to everyone's thoughts and contributions to this project. Making local LLMs on MLX faster and more viable for everyone.

GitHub: https://github.com/youssofal/MTPLX
