LocalLLaMA

I found why KV cache INT4 breaks on some models (Qwen2-7B: ΔPPL +238) and built a 4-line fix no training, no calibration, 12 models tested up to 40B

I've been playing with KV cache INT4 quantization and noticed something weird: it works perfectly on some models and completely destroys others. Examples: Falcon-40B: ΔPPL +0.08 ✅ (basically free compression) OPT-13B: ΔPPL +0.28 ✅ Qwen2-7B: ΔPPL +…