Model Overview
Nemotron-Labs-Diffusion is a tri-mode language model: a single set of weights supports both autoregressive (AR) decoding and diffusion-based parallel decoding, and switching between them only requires changing the attention pattern at inference time. Combining the two modes enables a third mode, self-speculation, in which the same model drafts tokens in parallel with diffusion and verifies them with AR decoding over a shared KV cache, achieving high acceptance lengths and decoding efficiency. Because mode switching only changes the attention pattern, one model can run efficiently at different concurrency levels across varying deployment scenarios.
https://preview.redd.it/mwyq7b7hx42h1.png?width=3915&format=png&auto=webp&s=744bd87267338a6236269a8d915b185cff8a82d2
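To make "switching the attention pattern" concrete, here is a minimal sketch of two masks that one set of weights could run under. The post does not specify the diffusion-mode pattern, so the block-wise bidirectional mask below is an assumption borrowed from common block-diffusion designs, and the names `causal_mask`, `block_diffusion_mask`, and `block_size` are hypothetical.

```python
def causal_mask(n):
    """AR mode: query position i may attend to key positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_diffusion_mask(n, block_size):
    """Assumed diffusion-mode pattern (not confirmed by the post):
    a position attends to every earlier block and, bidirectionally,
    to all positions inside its own block."""
    return [
        [(j // block_size) <= (i // block_size) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    for name, mask in [
        ("causal", causal_mask(4)),
        ("block (size 2)", block_diffusion_mask(4, 2)),
    ]:
        print(name)
        for row in mask:
            # Print 1 where attention is allowed, 0 where it is masked out.
            print(" ".join("1" if allowed else "0" for allowed in row))
```

Under this assumption the weights are identical in both modes; only the boolean mask passed to attention differs, which is why a mode switch needs no second model.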
Highlights
- A SOTA family of dense LMs at 3B, 8B, and 14B (base, instruct, and vision-language variants) that supports AR, diffusion, and self-speculation decoding, with a focus on decode efficiency.
- Generation shifts from a memory-bound regime toward a compute-bound one, because model weights are loaded once and reused to compute multiple tokens.
- Self-speculation uses diffusion for drafting and AR for verification, providing a stronger alternative to multi-token prediction (MTP) approaches:
  - 3x higher acceptance length and 2.2x speed-up vs. Qwen3-8B-Eagle3 in SGLang.
  - 5.9x as many tokens per forward pass as Qwen3-8B (no MTP), at the same accuracy.
- Real-device speed-up across platforms:
  - DGX Spark (8B, concurrency 1): 2.7x faster than AR (112 tok/sec vs. 41.8 tok/sec) using w4a16.
  - GB200 (8B, concurrency 1): 3.3x faster than AR (850 tok/sec vs. 253 tok/sec AR and 360 tok/sec Eagle3). Custom CUDA kernels raise this to 1015 tok/sec (4x).
- A diffusion speedup-of-light analysis shows that single-user throughput could be doubled again (vs. the current best) with better sampling, which is left to future research.
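The acceptance length quoted in the self-speculation bullet can be illustrated with a generic greedy verification rule from speculative decoding. This is a sketch under assumptions, not the model's confirmed acceptance rule: `accept_draft` is a hypothetical helper, the draft is taken to come from the diffusion pass, and the verifier tokens from one AR pass over the same positions.

```python
def accept_draft(draft, verified):
    """Greedy speculative verification (generic rule, assumed here).

    draft:    k tokens proposed in parallel by the drafting pass.
    verified: k + 1 greedy tokens from the AR verification pass, where
              verified[i] is the AR choice given the prefix plus draft[:i].

    Returns the tokens committed this step: the longest draft prefix that
    the verifier agrees with, followed by one verifier token (a correction
    at the first mismatch, or a bonus token if the whole draft matched).
    """
    if len(verified) != len(draft) + 1:
        raise ValueError("verified must contain one more token than draft")
    n = 0
    while n < len(draft) and draft[n] == verified[n]:
        n += 1
    return draft[:n] + [verified[n]]


if __name__ == "__main__":
    # Hypothetical token ids: the first two draft tokens match, the third does not.
    committed = accept_draft([5, 7, 9], [5, 7, 2, 4])
    print(committed, "tokens committed this step:", len(committed))
```

Every step commits at least one token, so output under this rule matches plain greedy AR decoding; the more draft tokens the verifier accepts, the more tokens each forward pass yields.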
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-VLM-8B
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B-Base
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B-Base
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-3B-Base
https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-3B