Qwen3.6-35B-A3B KLDs – INTs and NVFPs

[Figure: KLD for INTs and NVFP4s — https://preview.redd.it/c76w57d1yexg1.png?width=1482&format=png&auto=webp&s=1164d8bc3e2e8a4157f26dd5583238a736474932]

AS ALWAYS - use case is important: accuracy versus speed versus native kernel support on your GPUs.

Things to note again:

  • This is done in vLLM, with REAL logits. My repo (https://github.com/phaelon74/vllm/tree/feature/score-mode-ppl-kld) makes changes in the vLLM "hot path", so it's real, it runs on GPU, and it takes ~3-5 minutes on RTX 6000s
    • KLD does not lie; it's just raw math against the logits
  • KLD tells a story of divergence.
    • Evals are still important for use-case-specific testing
    • A quant can have a worse KLD and still score better on a given eval than a quant with a better KLD. This is bench-maxing, and it's real. Choose the quant for your use case.
  • FP8 has worse quality than INT8
    • This is expected, since W8A8 quantizes activations to 8 bits as well
    • FP8 (W8A8) should stay in 8-bit end to end, meaning it should be faster than INT8
  • The NVFP4 cake, as always, is a lie.
    • But similar to FP8, NVFP4 (W4A4) should stay in FP4 and "should" be faster than an INT4
    • NVFP4A16 keeps activations at 16-bit and will generally have higher quality/accuracy than NVFP4A4, but remember, this may come at a cost.
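To make "raw math against the logits" concrete, here is a minimal sketch of per-token KLD between a reference model and a quantized model. This is not the code from the linked repo; it assumes you have already captured full-vocabulary logits from both models for the same token positions, and `per_token_kld` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def per_token_kld(ref_logits: torch.Tensor, quant_logits: torch.Tensor) -> torch.Tensor:
    """Per-token KL(P_ref || P_quant) computed from raw logits.

    ref_logits, quant_logits: [num_tokens, vocab_size]
    Returns: [num_tokens] tensor of KL divergences in nats.
    """
    # Normalize in float32 for numerical stability before comparing.
    ref_logprobs = F.log_softmax(ref_logits.float(), dim=-1)
    quant_logprobs = F.log_softmax(quant_logits.float(), dim=-1)
    # KL(P || Q) = sum_v P(v) * (log P(v) - log Q(v))
    return (ref_logprobs.exp() * (ref_logprobs - quant_logprobs)).sum(dim=-1)

# Identical logits diverge by exactly zero; any quantization error shows up as KLD > 0.
logits = torch.randn(4, 32000)
kld = per_token_kld(logits, logits)
```

Averaging this tensor over a held-out corpus gives the single KLD number per quant that plots like the one above compare.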
submitted by /u/Phaelon74
