KLD for INTs and NVFP4s (chart: https://preview.redd.it/c76w57d1yexg1.png?width=1482&format=png&auto=webp&s=1164d8bc3e2e8a4157f26dd5583238a736474932)

AS ALWAYS - Use Case is important: accuracy versus speed versus native kernels on your GPUs. Things to note again:

- This is done in vLLM, with REAL logits. My repo (https://github.com/phaelon74/vllm/tree/feature/score-mode-ppl-kld) makes changes to the vLLM "hot path", so it's real, it's on GPU, and a run takes ~3-5 minutes on RTX 6000s (a rough stock-vLLM approximation is sketched after this list)
- KLD does not lie; it's just raw math against the logits (the math is sketched in the second code block after this list)
- KLD tells a story of divergence.
- Evals are still important for use-case-specific testing
- A quant can have a worse KLD and still score better on a given eval than a quant with a better KLD. This is bench-maxing, and it's real. Choose the Quant for your Use-Case.
- FP8 has worse quality than INT8
- This is expected: FP8 here is W8A8, so activations are quantized to 8 bits as well
- FP8 (W8A8) stays in 8-bit end to end, meaning it should be faster than INT8
- The NVFP4 cake, as always, is a lie.
- But, similar to FP8, NVFP4 (W4A4) stays in FP4 end to end and "should" be faster than an INT4
- NVFP4A16 keeps activations at 16 bits and will generally have higher quality/accuracy than NVFP4A4, but remember, this may come at a cost in speed
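For anyone who wants to poke at this without the fork, here is a minimal sketch of pulling per-token distributions from stock vLLM. The model paths are placeholders, and stock vLLM only exposes top-k logprobs via SamplingParams(prompt_logprobs=k), which is exactly why the fork hooks the hot path for full logits; treat this as an approximation, not the fork's actual method.

```python
from vllm import LLM, SamplingParams

# Placeholder paths - swap in your own baseline and quantized checkpoints.
# vLLM auto-detects quantization (fp8, compressed-tensors, etc.) from the
# checkpoint config, so pre-quantized models need no extra flags.
BASELINE = "path/to/baseline-bf16"
QUANT = "path/to/quantized-checkpoint"

prompts = ["The capital of France is"]
# prompt_logprobs=k returns the top-k logprobs at every prompt position;
# the fork captures the full-vocab logits in the hot path instead.
params = SamplingParams(max_tokens=1, prompt_logprobs=20)

def collect(model_path: str):
    llm = LLM(model=model_path)
    outs = llm.generate(prompts, params)
    # outs[0].prompt_logprobs: one {token_id: Logprob} dict per position
    # (None for the first token, which has no preceding context).
    return outs[0].prompt_logprobs

# In practice, run each model in its own process so GPU memory is freed
# between the baseline and the quant.
base_dists = collect(BASELINE)
quant_dists = collect(QUANT)
```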
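And the raw math itself, as a minimal PyTorch sketch. Assume base_logits and quant_logits are [num_tokens, vocab_size] tensors produced by the same prompts; the tensor names and helper function are mine, not the fork's.

```python
import torch
import torch.nn.functional as F

def mean_kld(base_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    """Mean per-token KL(P_base || Q_quant), in nats."""
    log_p = F.log_softmax(base_logits.float(), dim=-1)   # full-precision reference
    log_q = F.log_softmax(quant_logits.float(), dim=-1)  # quantized model
    # F.kl_div with log_target=True computes exp(log_p) * (log_p - log_q)
    # elementwise; summing over the vocab gives KL per token position.
    per_token = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(dim=-1)
    return per_token.mean().item()

# Sanity check: identical logits give ~0; perturbed logits, a small positive value.
x = torch.randn(4, 32000)
print(mean_kld(x, x))                              # ~0.0
print(mean_kld(x, x + 0.1 * torch.randn_like(x)))  # small positive
```

Lower is better: 0 means the quant reproduces the baseline distribution exactly.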