Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results

Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset.

Setup:

  • Hardware: 1x H100 80GB
  • Runtime: vLLM
  • Dataset: SPEED-Bench qualitative
  • Prompts: 880 total, 80 per category across 11 categories
  • Models: google/gemma-4-31B-it and google/gemma-4-26B-A4B-it
  • MTP drafts: Google's matching Gemma 4 assistant models
  • DFlash drafts: z-lab's matching Gemma 4 DFlash models
  • MTP used num_speculative_tokens=8
  • DFlash used num_speculative_tokens=15
  • Context length / max model length: 32768
  • Temperature: 0
  • Prefix caching: disabled (a minimal vLLM config sketch for this setup follows the list)
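
For reference, here is a minimal sketch of what this setup roughly looks like with vLLM's offline LLM API, assuming the speculative_config dict interface. The draft model IDs and the "method" values are placeholders for illustration only, not confirmed identifiers; check the repo and model cards for the exact ones.

```python
# Minimal vLLM sketch of the benchmark setup (illustrative only).
# Draft model ID and the "method" value are placeholders, not confirmed
# identifiers -- swap in the actual draft checkpoint / method name.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-31B-it",
    max_model_len=32768,            # context length used in the runs
    enable_prefix_caching=False,    # prefix caching was disabled
    speculative_config={
        # MTP run: 8 draft tokens per step
        "model": "google/gemma-4-31B-it-assistant",  # placeholder draft model ID
        "method": "mtp",                             # placeholder method name
        "num_speculative_tokens": 8,
        # For the DFlash run, the draft model and method change and
        # num_speculative_tokens becomes 15.
    },
)

params = SamplingParams(temperature=0.0, max_tokens=1024)  # greedy, as in the benchmark
out = llm.generate(["Explain speculative decoding in two sentences."], params)
print(out[0].outputs[0].text)
```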

Results:

  • For Gemma 4 31B dense, MTP was 3.11x faster and DFlash was 3.03x faster than baseline decoding at concurrency 1. Baseline hit 40.3 output tok/s, MTP hit 125.3 output tok/s, and DFlash hit 122.1 output tok/s. At concurrency 16, baseline reached 375 tok/s, MTP reached 953 tok/s, and DFlash reached 725 tok/s.

https://preview.redd.it/4zyyt58j7p0h1.png?width=2571&format=png&auto=webp&s=930d3a8383fb7fe40749217867f4f3ab9877b4a4

  • For Gemma 4 26B-A4B MoE, the result flipped. DFlash was 1.73x faster and MTP was 1.49x faster than baseline decoding at concurrency 1. Baseline hit 177.1 output tok/s, MTP hit 264.2 output tok/s, and DFlash hit 306.4 output tok/s. At concurrency 16, baseline reached 975 tok/s, MTP reached 1808 tok/s, and DFlash reached 1957 tok/s.

  • The MoE speedups were smaller than the dense-model speedups because the baseline MoE target is already relatively cheap to run. Gemma 4 26B-A4B has 25.2B total parameters, but only 3.8B active parameters during inference. That means speculative decoding has less target-model compute to remove compared with the dense 31B model.

https://preview.redd.it/twdqm7pk7p0h1.png?width=2596&format=png&auto=webp&s=71b388e143bd384fec08e299b3996ba8337e42f8

  • The gains were not uniform across workloads. Coding, math, STEM, and reasoning benefited more because these tasks often have more predictable token patterns. Writing, summarization, and roleplay improved less because there are many valid ways for the model to continue the text.

  • Higher per-position acceptance did not automatically mean higher throughput. MTP accepted more draft tokens, but DFlash showed better throughput on the MoE model. Acceptance is only one side of the trade-off: DFlash drafts the whole block in a single forward pass, while MTP drafts token by token. When the target model is this cheap to run, the cheaper draft path can matter more than a slightly higher acceptance rate (see the toy cost model after the last chart).

  • Most accepted draft tokens came from the first few positions. Position-1 acceptance was around 80% for MTP and 75% for DFlash, but by position 8 it dropped to under 20% for both.

https://preview.redd.it/di8n1c3m7p0h1.png?width=2615&format=png&auto=webp&s=e769d24d5ae9ad4722270437eef1f26a998ac6e8
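
To make the acceptance-vs-cost point concrete, here is a toy back-of-envelope model. It is my own simplification with made-up acceptance curves and step costs, not measurements from these runs: it sums per-position acceptance rates into an expected accepted-token count, then divides tokens per step by step cost, and shows how a cheap block draft can beat higher per-token acceptance once the target forward pass is cheap.

```python
# Toy model of speculative decoding step economics.
# All numbers below are illustrative assumptions, not measurements from these runs.

def expected_accepted(position_acceptance_rates):
    # Each entry: fraction of decode steps in which the draft token at that
    # position was accepted. Acceptance stops at the first rejection, so the
    # expected number of accepted tokens per step is simply the sum.
    return sum(position_acceptance_rates)

def speedup(position_acceptance_rates, draft_cost, verify_cost):
    # Baseline emits 1 token per target forward pass (cost = verify_cost).
    # Speculation emits (accepted + 1) tokens per (draft + verify) step.
    tokens_per_step = expected_accepted(position_acceptance_rates) + 1.0
    return (tokens_per_step / (draft_cost + verify_cost)) * verify_cost

# Assumed acceptance curves: high at position 1, falling off quickly,
# with the MTP-like drafter accepting slightly more overall.
mtp_acc    = [0.80, 0.60, 0.45, 0.34, 0.28, 0.23, 0.20, 0.18]               # 8 positions
dflash_acc = [0.75, 0.52, 0.37, 0.27, 0.20, 0.15, 0.11, 0.08] + [0.03] * 7  # 15 positions

mtp_draft    = 8 * 0.06  # token-by-token drafting: cost grows with draft length
dflash_draft = 0.10      # block drafting: one forward pass regardless of length

for verify_cost, label in [(4.0, "expensive target (dense-like)"),
                           (1.0, "cheap target (MoE-like)")]:
    print(label,
          "MTP %.2fx" % speedup(mtp_acc, mtp_draft, verify_cost),
          "DFlash %.2fx" % speedup(dflash_acc, dflash_draft, verify_cost))
```

With these assumed numbers the model reproduces the qualitative flip: the token-by-token drafter wins when the target forward pass is expensive, and the block drafter wins when it is cheap, even though it accepts fewer tokens per position.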

For a real deployment, try both approaches on your own setup and workload instead of assuming one will always be better. The results can change with the model, prompts, hardware, and serving configuration. Hope these numbers give people a useful reference point.
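
If you just want a quick sanity check on your own stack before digging into the full scripts, something like the sketch below measures output tok/s at a fixed concurrency against a vLLM OpenAI-compatible endpoint. This is not the repo's benchmark script; the endpoint URL, model name, and prompts are placeholders.

```python
# Minimal async throughput probe against an OpenAI-compatible vLLM server.
# Not the benchmark repo's script -- endpoint, model name, and prompts are placeholders.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "google/gemma-4-31B-it"          # whatever name the server registered
PROMPTS = ["Write a short bash script that tails a log file."] * 64  # your own prompt set
CONCURRENCY = 16

async def one_request(sem, prompt):
    async with sem:
        resp = await client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=512,
        )
        return resp.usage.completion_tokens

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    start = time.perf_counter()
    completion_tokens = await asyncio.gather(*(one_request(sem, p) for p in PROMPTS))
    elapsed = time.perf_counter() - start
    print(f"output tok/s at concurrency {CONCURRENCY}: {sum(completion_tokens) / elapsed:.1f}")

asyncio.run(main())
```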

All the benchmark setup and the scripts used to reproduce these results are in the GitHub repository.

You can find more results and in-depth analysis in our blog post: https://jarvislabs.ai/blog/gemma-4-mtp-vs-dflash-benchmark
