A few days ago I posted early results from a native MLX implementation of DFlash. Since then I rewrote the benchmark methodology, fixed numerical issues, and open sourced the whole thing.

How it works: a small draft model generates 16 tokens in parallel via block diffusion, and the target verifies them in one forward pass. Every emitted token is verified against the target model before being committed, so the output is lossless (sketch of the verify step below the table). Stock MLX, no fork.

**Setup:** M5 Max, 64GB, MLX 0.31.1. Baseline is stock `mlx_lm.stream_generate`, not a custom loop. 3 runs, median reported, 10s cooldown between runs.

**Results @ 2048 tokens**

| Model | Baseline | DFlash | Speedup | Acceptance |
|---|---|---|---|---|
| Qwen3.5-4B | 53.74 tok/s | 219.83 tok/s | 4.10x | 89.3% |
| Qwen3.5-9B | 30.96 tok/s | 127.07 tok/s | 4.13x | 89.4% |
| Qwen3.5-27B-4bit | 32.35 tok/s | 62.78 tok/s | 1.90x | 89.1% |
| Qwen3.5-35B-A3B-4bit | 142.12 tok/s | 240.21 tok/s | 1.69x | 88.7% |

Full results at 1024/2048/4096 in the repo.
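For anyone curious how the lossless guarantee works, the accept step is conceptually simple. Here is a minimal sketch of the verify-and-commit step under greedy decoding; it is illustrative only (the function name `verify_block` and the assumption that the target is a plain callable returning logits are mine, not the repo's API):

```python
import mlx.core as mx

def verify_block(target, prompt_ids, proposed):
    """Verify a draft-proposed block against the target in one forward pass.

    target:     a causal LM callable returning logits of shape (1, L, vocab)
    prompt_ids: list[int] of tokens committed so far
    proposed:   list[int], the 16 tokens the block-diffusion draft proposed
    Returns the tokens that can be committed losslessly (greedy decoding).
    """
    block = len(proposed)
    # Single target pass over prompt + proposed block.
    logits = target(mx.array([prompt_ids + proposed]))
    # The position at index len(prompt_ids) - 1 + i predicts proposed token i.
    start = len(prompt_ids) - 1
    preds = mx.argmax(logits[0, start:start + block], axis=-1).tolist()

    # Accept the longest prefix where draft and target agree, then commit one
    # extra token from the target itself, so each cycle emits >= 1 token and
    # every committed token is exactly what the target would have generated.
    n = 0
    while n < block and proposed[n] == preds[n]:
        n += 1
    bonus = [preds[n]] if n < block else []
    return proposed[:n] + bonus
```

A real implementation keeps a KV cache and only rolls back the rejected suffix; this sketch recomputes from scratch for clarity.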
**What changed since last post**

**What I learned**

On unified memory, everything is bandwidth-bound. Custom Metal kernels (batched GEMV, fused gated SiLU, custom SDPA) all came back slower than stock MLX; the wins came from numerical precision, not compute optimization.

The 27B-4bit speedup is lower because the quantized target is already fast, which makes the bf16 draft the bottleneck. That's a structural limitation of speculative decoding on bandwidth-bound hardware with quantized targets.

This is built specifically for Qwen3.5's hybrid GatedDeltaNet + attention architecture. Pure attention models (Qwen3, Gemma) work, but without the tape-replay benefits.
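To see why a faster (quantized) target eats into the speedup, a back-of-envelope throughput model helps. Everything below is a toy calculation with invented cost ratios, not measured numbers from the benchmark; the point is only the shape of the curve, which matches the ~4x vs ~1.9x gap in the table:

```python
def expected_speedup(avg_accepted, draft_cost=0.4, verify_cost=1.0):
    """Toy throughput model for block speculative decoding.

    avg_accepted: average tokens committed per draft/verify cycle
    draft_cost:   cost of drafting one 16-token block, in units of one
                  target forward pass (assumed ratio, not measured)
    verify_cost:  cost of the single verification pass (~one target pass)

    The baseline spends one target pass per token, so the speedup is
    baseline cost per cycle divided by speculative cost per cycle.
    """
    return (avg_accepted * 1.0) / (draft_cost + verify_cost)

# Hypothetical numbers: a 4-bit target makes the verify pass cheaper while
# the bf16 draft stays the same size, so the *relative* draft cost grows
# and the speedup shrinks, even at identical acceptance rates.
print(expected_speedup(avg_accepted=6.0, draft_cost=0.4))  # ~4.3x (bf16 target)
print(expected_speedup(avg_accepted=6.0, draft_cost=2.0))  # ~2.0x (4-bit target)
```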
**Roadmap**