I've been using Qwen3.6-27B-Q5_K_M with the turbo3 KV cache ever since it was released, and I haven't had any issues at all (no loops, no memory loss, etc.). However, I'm also aware that K cache compression is not generally recommended.
So I wanted to check whether that holds, and I learned that llama-perplexity.exe is the right tool for this test. I'm using TheTom's turboquant_plus built on my machine - AFAIK you can download a pre-built release by now as well. I have a 3090 eGPU and I'm running a 200k context.
This is how I used the tool:
First I ran it without KV cache quantization (PowerShell): `.\llama-perplexity.exe -m models/unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q5_K_M.gguf -f wiki.test.raw`. After around 7-8 minutes it prints a result like `Final estimate: PPL = 6.9233 +/- 0.04564`.
Then you can repeat it with your quant values, like `.\llama-perplexity.exe -m models/unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q5_K_M.gguf -f wiki.test.raw --cache-type-k turbo3 --cache-type-v turbo3`.
(wiki.test.raw is the WikiText-2 test set, the standard file for this benchmark; you can download it from many places.)
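If you run several cache settings in a row, the final line of each run can be pulled out with a small script. A minimal Python sketch, assuming the `Final estimate: PPL = ... +/- ...` output format shown above:

```python
import re

def parse_ppl(output: str):
    """Extract PPL and its standard error from llama-perplexity's final line."""
    m = re.search(r"Final estimate: PPL = ([\d.]+) \+/- ([\d.]+)", output)
    if m is None:
        raise ValueError("no 'Final estimate' line found")
    return float(m.group(1)), float(m.group(2))

line = "Final estimate: PPL = 6.9233 +/- 0.04564"
ppl, err = parse_ppl(line)
print(ppl, err)  # 6.9233 0.04564
```

Feed it the captured stdout of each run and you can build the comparison table automatically.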
And the results were something I didn't expect at all. All quants are performing well within the limits. Since I'm quite new to local LLMs, I tried to understand how it was possible and as far as I could understand, if you have a dense model above 20B params and above Q4, then it is intelligent enough to be less sensitive to KV cache quants. I can confirm, that turbo3 was not working well for me with 35B and also, probably all small models would be totally confused with a highly compressed V cache.
Let me switch to AI from here on: I pasted my results into Gemini and it came up with a nicely formatted post based on our conversation, and I'm happy to use it, since English is not my first language.
What is Perplexity (PPL)?
For those new to benchmarking, Perplexity is a measure of how "surprised" a model is by a sequence of text.

* Lower is better.
* A score under 10.0 on Wikitext is generally the mark of a very coherent, "smart" model. Edit: might not be true in some cases - see comments
* We are looking at the Delta (change). If a quantization setting increases PPL by more than 0.1–0.2, you'll likely start seeing "drunken" behavior or loops in long conversations.
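Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each token. A minimal sketch of the definition (not how llama-perplexity is implemented internally):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean of per-token natural-log probabilities)."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# A model that assigns every token probability 1/7 has PPL of exactly 7,
# i.e. it is as "surprised" as a uniform 7-way guess at each step.
logps = [math.log(1 / 7)] * 100
print(round(perplexity(logps), 4))  # 7.0
```

This is why lower is better: a perfect model (probability 1 on every true token) would score PPL = 1.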
Results
The results blew me away. The "common wisdom" that Q4 is unusable appears to be a myth for the 27B+ dense class.
| KV Cache Setting | Perplexity (PPL) | Delta vs. F16 | Verdict |
|---|---|---|---|
| F16 (Baseline) | 6.9233 | - | Reference |
| Q8_0 | 6.9193 | -0.0040 | Identical (Margin of Error) |
| Q4_0 | 6.9381 | +0.0148 | Transparent (Highly Recommended) |
| Turbo4 (4-bit) | 6.9483 | +0.0250 | Excellent |
| Turbo3 (3-bit) | 7.0121 | +0.0888 | Great for Extreme Context |
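The "within the margin of error" reading of this table can be checked numerically: a delta only matters if it exceeds the baseline's reported standard error. A quick sketch using the numbers above:

```python
def significant(ppl_base, err_base, ppl_quant):
    """Return (delta, flag): flag is True only if the PPL delta
    exceeds the baseline's reported standard error."""
    delta = ppl_quant - ppl_base
    return delta, abs(delta) > err_base

base, err = 6.9233, 0.04564  # F16 baseline and its +/- from llama-perplexity
for name, ppl in [("Q8_0", 6.9193), ("Q4_0", 6.9381), ("Turbo3", 7.0121)]:
    delta, sig = significant(base, err, ppl)
    print(f"{name}: delta={delta:+.4f} significant={sig}")
```

By this criterion Q8_0 and Q4_0 land inside the noise, and only Turbo3's +0.0888 clears the error bar, which matches the verdicts in the table.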
Observations & Recommendations
1. **The Q4 "Sweet Spot".** The jump from F16 to Q4_0 is only +0.0148. To put that in perspective, the margin of error for the test was 0.045, so Q4_0 is statistically indistinguishable from the uncompressed cache. If you aren't using Q4 or Q8 on a 3090, you're just wasting VRAM.
2. **When to use Turbo3?** I've been using Turbo3 for a week on programming tasks. It allows a 200k context window on a single 3090 without breaking a sweat. While the PPL hit is measurable (+0.0888), it's still well within the "safe zone."
3. **The MoE Exception.** While this dense 27B model handles Turbo3 perfectly, I noticed that 35B MoE models tend to loop or error out with a 3-bit cache. It seems the router in MoE architectures is much more sensitive to the noise introduced by heavy quantization.
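The VRAM savings behind point 2 come from simple arithmetic: the K+V cache grows linearly with context length, layer count, and bits per element. A rough sketch; the layer/head/dim numbers below are placeholders for a mid-size GQA model, not the actual Qwen3.6-27B architecture (read yours from the GGUF metadata), and the bits-per-element figures for each format are approximations:

```python
def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, bits_per_element):
    """Approximate K+V cache size in GiB:
    2 (K and V) * layers * kv_heads * head_dim * ctx elements."""
    total_bits = 2 * n_layers * n_kv_heads * head_dim * ctx * bits_per_element
    return total_bits / 8 / 2**30

# Placeholder dims at a 200k context; swap in your model's real values.
for label, bits in [("F16", 16), ("Q8_0", 8.5), ("Q4_0", 4.5), ("3-bit", 3.5)]:
    print(f"{label}: ~{kv_cache_gib(200_000, 48, 8, 128, bits):.1f} GiB")
```

Halving the bits roughly halves the cache, which is where the 200k-context headroom on a single card comes from.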
The "Needle in a Haystack" Test
To be 100% sure your setup is safe for production work, try this "Needle in a Haystack" test:

1. Paste a long piece of code (e.g., 50k tokens).
2. In the middle, hide a very specific, weird comment like `// The password is: BANANA-123`.
3. Ask the model: "What was the hidden password in the code I gave you?"
4. If it finds it instantly, your 200k context is working perfectly.
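The steps above can be automated with a small generator script. A minimal sketch; the filler lines, line count, and needle string are all placeholders you'd tune to hit your target token count:

```python
def make_haystack(n_lines=2000, needle="// The password is: BANANA-123"):
    """Build a long filler-code file with the needle hidden at the midpoint."""
    lines = [f"int var_{i} = {i * 3 + 1};" for i in range(n_lines)]
    lines.insert(n_lines // 2, needle)  # bury it in the middle
    return "\n".join(lines)

text = make_haystack()
with open("haystack.txt", "w") as f:
    f.write(text)
```

Paste the resulting file into your chat, ask for the password, and compare the answer against the needle you planted.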
TL;DR: Don't fear KV quantization on 27B+ models. Q4_0 is a "free lunch," and Turbo3 is a game-changer for repo-level coding if you need the 200k+ context.