Greetings from TurboQuant's former biggest defender, now a middle-sized, niche-aware TurboQuant defender. Today I'm presenting the results of thoroughly exploring the world of PPL and KLD benchmarks on my single RTX 3090 using BeeLlama v0.1.2, with some backstory of unsuccessfully trying other tests and then re-exploring PPL and KLD even more thoroughly to compensate.
Tests were done with Qwen 3.6 27B (Q5_K_S and IQ4_XS) at 64k and 128k context, so a decent model with decent quants at a decent context length. Basically the setup we 24 GB VRAM folks are actually using, which keeps the results grounded. I'm not in any position to talk shit about the vLLM study, but it really looked like a "how to invest and become rich if you already have $1,000,000" book to me, with regular 4-bit and 5-bit quants missing from the comparison.
Here are my findings:
- PPL Hides the Tail, KLD Exposes It. Through q4_0, the entire PPL range stays under 0.01 above bf16. Even turbo3_tcq only adds ~0.02 PPL. But 99.9% KL divergence tells a different story: while q5_0 (at 34.4% of bf16) is obviously behind q8_0, it's still not bad. But then q4_0's tail KLD is 32% worse than q5_0's. Now this is what breaks your tool calls and JSON structure.
- Rotation closed the gap at 4 bits. llama.cpp already applies random rotation to KV vectors before quantizing, which is the same basic trick TurboQuant uses. At 4 bits, turbo4 has no quality advantage over q4_0, saves almost no memory, and runs 17% slower. TurboQuant's value is at 2-3 bits where it has no alternatives anyways.
- TCQ saves the low end. turbo3_tcq is consistently much better than plain turbo3, and turbo2_tcq is much better than turbo2. They are a legit solution for cases where you need aggressive compression. Now what is TCQ, you might ask? Luckily, the article covers this as well!
- Asymmetric KV beats symmetric at the same size. q5_0/q4_0 is the same memory as q4_1/q4_1 but beats it across all test configs in 99.9% precision. After K reaches q5_0, the next useful bit goes to V, not to q5_1 K.
- Higher model precision means more cache damage. Q5_K_S took 3-5% more 99.9% precision damage than IQ4_XS at the same cache quant. Model and KV cache quants are not independent, and it's better to balance their quants rather than focusing on only one or the other, as they both feed from the same VRAM pool.
- q8 is mostly a luxury tier, unless you have spare VRAM. q8_0/q5_0 at 43.8% of bf16 KV keeps 99.9% precision at 93.7-98.2% across configs, so full q8_0/q8_0 at 53.1% is mostly validation when you don't struggle with VRAM anyways.
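If you want to sanity-check the "% of bf16" numbers yourself, here is a rough Python sketch. It assumes the standard ggml block layouts (32 values per block, with the scale bytes counted in) and that K and V hold the same number of elements, which may not match every model. The function names are made up for this example, and the turbo types are left out because I'm not going to guess their bit layouts here.

```python
# Bytes per 32-value block in ggml's classic quant formats, scales included.
# Assumption: these are the standard ggml layouts (e.g. q4_0 = 2-byte scale + 16 bytes of nibbles).
BLOCK_BYTES = {"q4_0": 18, "q4_1": 20, "q5_0": 22, "q5_1": 24, "q8_0": 34}
BLOCK_SIZE = 32
BF16_BITS = 16.0


def bits_per_value(cache_type):
    """Effective bits stored per cached value, including per-block overhead."""
    if cache_type == "bf16":
        return BF16_BITS
    return BLOCK_BYTES[cache_type] * 8 / BLOCK_SIZE


def kv_percent_of_bf16(k_type, v_type):
    """Size of a K/V pair relative to bf16/bf16, assuming K and V have equal element counts."""
    avg_bits = (bits_per_value(k_type) + bits_per_value(v_type)) / 2
    return 100 * avg_bits / BF16_BITS


if __name__ == "__main__":
    pairs = [("q8_0", "q8_0"), ("q8_0", "q5_0"), ("q5_0", "q5_0"),
             ("q5_0", "q4_0"), ("q4_1", "q4_1")]
    for k, v in pairs:
        print(f"{k}/{v}: {kv_percent_of_bf16(k, v):.3f}% of bf16")
```

Under those assumptions q5_0/q4_0 and q4_1/q4_1 both average 5.0 bits per value, which is why they count as the same size, and q5_0/q5_0, q8_0/q5_0 and q8_0/q8_0 come out at 34.375%, 43.75% and 53.125%, in line with the rounded figures above.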
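To show why PPL can look fine while tail KLD looks bad, here is a toy Python sketch with completely synthetic numbers (not data from the article). It assumes KLD is computed per token position as KL(reference || quantized) and that the tail is taken as a nearest-rank 99.9th percentile; the real benchmark tooling may define both differently. Most positions get a tiny nudge, and 20 out of 10,000 get badly damaged.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(true_token_probs):
    """exp of the mean negative log-probability given to the correct token."""
    mean_nll = -sum(math.log(x) for x in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Synthetic setup: 4-token vocabulary, the correct token is always index 0.
reference = [0.70, 0.10, 0.10, 0.10]   # made-up bf16 cache output
slightly_off = [0.69, 0.11, 0.10, 0.10]  # typical position with a quantized cache
badly_off = [0.30, 0.50, 0.10, 0.10]     # rare position where the top token flips

positions = [slightly_off] * 9980 + [badly_off] * 20

ppl_ref = perplexity([reference[0]] * len(positions))
ppl_quant = perplexity([q[0] for q in positions])
klds = [kl_divergence(reference, q) for q in positions]
mean_kld = sum(klds) / len(klds)
tail_kld = percentile(klds, 99.9)

if __name__ == "__main__":
    print(f"PPL delta:    {ppl_quant - ppl_ref:.4f}")
    print(f"mean KLD:     {mean_kld:.6f}")
    print(f"99.9% KLD:    {tail_kld:.6f}")
```

The PPL delta stays small because it is an average over every position, while the 99.9% KLD lands squarely on the few positions where the top token changed. Those rare flips are the kind of thing that can break a closing brace or a tool name.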
Here's the article, with all the data and quite a bit of analysis:
https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context