KLD for INTs and NVFP4s (chart: https://preview.redd.it/c76w57d1yexg1.png?width=1482&format=png&auto=webp&s=1164d8bc3e2e8a4157f26dd5583238a736474932)

AS ALWAYS - Use Case is important: accuracy versus speed versus native kernels on your GPUs. Things to note again:

- This is done in vLLM, with REAL logits. My repo (https://github.com/phaelon74/vllm/tree/feature/score-mode-ppl-kld) makes changes to the vLLM "hot path", so it's real, it's on GPU, and a run takes ~3-5 minutes on RTX 6000s (a rough stock-vLLM approximation is sketched after this list)
- KLD does not lie; it's just raw math against the logits (the math is sketched in the second code block after this list)
- KLD tells a story of divergence.
- Evals are still important for use-case-specific testing
- A quant can have a worse KLD and still score better on a given eval than a quant with a better KLD. This is bench-maxing, and it's real. Choose the Quant for your Use-Case.
- FP8 has worse quality than INT8
- This is expected: FP8 here is W8A8, so activations are quantized to 8 bits as well
- FP8 (W8A8) stays in 8-bit end to end, meaning it should be faster than INT8
- The NVFP4 cake, as always, is a lie.
- But, similar to FP8, NVFP4 (W4A4) stays in FP4 end to end and "should" be faster than an INT4
- NVFP4A16 keeps activations at 16 bits and will generally have higher quality/accuracy than NVFP4A4, but remember, this may come at a cost in speed
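For anyone who wants to poke at this without the fork, here is a minimal sketch of pulling per-token distributions from stock vLLM. The model paths are placeholders, and stock vLLM only exposes top-k logprobs via SamplingParams(prompt_logprobs=k), which is exactly why the fork hooks the hot path for full logits; treat this as an approximation, not the fork's actual method.

```python
from vllm import LLM, SamplingParams

# Placeholder paths - swap in your own baseline and quantized checkpoints.
# vLLM auto-detects quantization (fp8, compressed-tensors, etc.) from the
# checkpoint config, so pre-quantized models need no extra flags.
BASELINE = "path/to/baseline-bf16"
QUANT = "path/to/quantized-checkpoint"

prompts = ["The capital of France is"]
# prompt_logprobs=k returns the top-k logprobs at every prompt position;
# the fork captures the full-vocab logits in the hot path instead.
params = SamplingParams(max_tokens=1, prompt_logprobs=20)

def collect(model_path: str):
    llm = LLM(model=model_path)
    outs = llm.generate(prompts, params)
    # outs[0].prompt_logprobs: one {token_id: Logprob} dict per position
    # (None for the first token, which has no preceding context).
    return outs[0].prompt_logprobs

# In practice, run each model in its own process so GPU memory is freed
# between the baseline and the quant.
base_dists = collect(BASELINE)
quant_dists = collect(QUANT)
```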
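And the raw math itself, as a minimal PyTorch sketch. Assume base_logits and quant_logits are [num_tokens, vocab_size] tensors produced by the same prompts; the tensor names and helper function are mine, not the fork's.

```python
import torch
import torch.nn.functional as F

def mean_kld(base_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    """Mean per-token KL(P_base || Q_quant), in nats."""
    log_p = F.log_softmax(base_logits.float(), dim=-1)   # full-precision reference
    log_q = F.log_softmax(quant_logits.float(), dim=-1)  # quantized model
    # F.kl_div with log_target=True computes exp(log_p) * (log_p - log_q)
    # elementwise; summing over the vocab gives KL per token position.
    per_token = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(dim=-1)
    return per_token.mean().item()

# Sanity check: identical logits give ~0; perturbed logits, a small positive value.
x = torch.randn(4, 32000)
print(mean_kld(x, x))                              # ~0.0
print(mean_kld(x, x + 0.1 * torch.randn_like(x)))  # small positive
```

Lower is better: 0 means the quant reproduces the baseline distribution exactly.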