Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM

Greetings from TurboQuant's former biggest defender, now a middle-sized, niche-aware TurboQuant defender. Today I'm presenting the results of thoroughly exploring the world of PPL and KLD benchmarks on my single RTX 3090 using BeeLlama v0.1.2, with some backstory of unsuccessfully trying other tests and then re-exploring PPL and KLD even more thoroughly to compensate.

Tests were done with Qwen 3.6 27B (Q5_K_S and IQ4_XS) at 64k and 128k context, so a decent model with decent quants at a decent context length. Basically the setup we 24 GB VRAM folks are actually using, which keeps the results grounded. I'm not in any position to talk shit about the vLLM study, but it really looked like a "how to invest and become rich if you already have $1,000,000" book to me, with regular 4-bit and 5-bit quants missing from the comparison.

Here are my findings:

  • PPL hides the tail, KLD exposes it. From bf16 down through q4_0, every config stays less than 0.01 PPL above bf16. Even turbo3_tcq only adds ~0.02 PPL. But 99.9% KL divergence tells a different story: q5_0 (at 34.4% of bf16) is obviously behind q8_0, but it's still not bad. q4_0's tail KLD, though, is 32% worse than q5_0's. This is what breaks your tool calls and JSON structure.
  • Rotation closed the gap at 4 bits. llama.cpp already applies random rotation to KV vectors before quantizing, which is the same basic trick TurboQuant uses. At 4 bits, turbo4 has no quality advantage over q4_0, saves almost no memory, and runs 17% slower. TurboQuant's value is at 2-3 bits, where it has no alternatives anyway.
  • TCQ saves the low end. turbo3_tcq is consistently much better than plain turbo3, and turbo2_tcq is much better than turbo2. They are a legit solution for cases where you need aggressive compression. Now what is TCQ, you might ask? Luckily, the article covers this as well!
  • Asymmetric KV beats symmetric at the same size. q5_0/q4_0 is the same memory as q4_1/q4_1 but beats it across all test configs in 99.9% precision. After K reaches q5_0, the next useful bit goes to V, not to q5_1 K.
  • Higher model precision means more cache damage. Q5_K_S took 3-5% more 99.9% precision damage than IQ4_XS at the same cache quant. Model and KV cache quants are not independent, and it's better to balance their quants rather than focusing on only one or the other, as they both feed from the same VRAM pool.
  • q8 is mostly a luxury tier, worth it only if you have spare VRAM. q8_0/q5_0 at 43.8% of bf16 KV keeps 99.9% precision at 93.7-98.2% across configs, so full q8_0/q8_0 at 53.1% is mostly validation for when you don't struggle with VRAM anyway.
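
The "% of bf16" numbers above can be reproduced from the ggml block layouts. This is a minimal sketch, not code from the article: the helper name `kv_fraction_of_bf16` is made up, and it assumes K and V hold the same number of elements. The turbo types are left out because the post does not give their block sizes.

```python
# Bytes per 32-element block for the standard ggml cache types.
# Each quantized block stores fp16 scale (and min for *_1) plus packed values.
BLOCK_BYTES = {
    "bf16": 64,  # 32 values * 2 bytes, no per-block overhead
    "q8_0": 34,  # 2-byte scale + 32 int8 values
    "q5_1": 24,  # 2-byte scale + 2-byte min + 4 bytes of high bits + 16 bytes
    "q5_0": 22,  # 2-byte scale + 4 bytes of high bits + 16 bytes
    "q4_1": 20,  # 2-byte scale + 2-byte min + 16 bytes
    "q4_0": 18,  # 2-byte scale + 16 bytes
}


def kv_fraction_of_bf16(k_type, v_type):
    """Size of a K/V pair relative to bf16/bf16.

    Assumption: K and V contain the same number of elements.
    """
    quantized = BLOCK_BYTES[k_type] + BLOCK_BYTES[v_type]
    return quantized / (2 * BLOCK_BYTES["bf16"])


if __name__ == "__main__":
    pairs = [
        ("q8_0", "q8_0"),
        ("q8_0", "q5_0"),
        ("q5_0", "q5_0"),
        ("q5_0", "q4_0"),
        ("q4_1", "q4_1"),
    ]
    for k, v in pairs:
        print(f"{k}/{v}: {kv_fraction_of_bf16(k, v) * 100:.2f}% of bf16")
```

Under those assumptions q5_0/q4_0 and q4_1/q4_1 both come out at 10 of 32 bits per element pair, which is why they can be compared head to head at the same memory.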
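
To see why PPL can look fine while tail KLD does not, here is a toy sketch with made-up logits, not the benchmark data. It assumes "99.9% KLD" means the 99.9th percentile of per-token KL divergence against the bf16 baseline; the function names and the 1000-token setup are hypothetical. Two tokens out of 1000 get their top prediction flipped by the "quantized" cache, and the rest are untouched.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(dists, targets):
    """exp of the mean negative log-likelihood of the target tokens."""
    nll = [-math.log(d[t]) for d, t in zip(dists, targets)]
    return math.exp(sum(nll) / len(nll))


def percentile(values, pct):
    """Nearest-rank percentile."""
    ordered = sorted(values)
    rank = min(len(ordered), max(1, math.ceil(pct / 100 * len(ordered))))
    return ordered[rank - 1]


# Toy data: 1000 positions, vocabulary of 4, correct token is always index 0.
n_tokens = 1000
targets = [0] * n_tokens
good = softmax([4.0, 1.0, 0.0, 0.0])     # baseline prediction
flipped = softmax([1.0, 4.0, 0.0, 0.0])  # damaged prediction: wrong token on top

baseline = [good] * n_tokens
quantized = [good] * n_tokens
quantized[100] = flipped  # only two damaged positions
quantized[700] = flipped

ppl_base = perplexity(baseline, targets)
ppl_quant = perplexity(quantized, targets)
klds = [kl_divergence(p, q) for p, q in zip(baseline, quantized)]
mean_kld = sum(klds) / len(klds)
tail_kld = percentile(klds, 99.9)

if __name__ == "__main__":
    print(f"PPL ratio quantized/baseline: {ppl_quant / ppl_base:.4f}")
    print(f"mean KLD: {mean_kld:.5f}")
    print(f"99.9th percentile KLD: {tail_kld:.3f}")
```

The average-based numbers barely move because 998 of the 1000 positions are identical, while the percentile lands directly on a damaged position. A single flipped token is all it takes to break a tool call or a JSON bracket, so the tail is the part worth watching.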

Here's the article, with all the data and quite a bit of analysis:
https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context

submitted by /u/Anbeeld