Results from combining two KV-cache reduction methods in llama.cpp on AMD/HIP:
- TurboQuant KV cache compression (turbo3): ~5.1× reduction
- TriAttention KV cache pruning (75% retention): ~1.33× reduction (1/0.75)
- Combined: ~6.8× total KV reduction
At 131K context: f16 KV = 8.2 GiB → combo ≈ 1.2 GiB.
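For anyone who wants to sanity-check how the two reductions stack, here is a minimal sketch of the arithmetic (function names are mine; real sizes also depend on head counts and ggml padding):

```c
/* Sketch of the stacked-reduction math: independent KV-cache
 * reductions compose multiplicatively. */
static double combined_reduction(double quant_x, double prune_x) {
    /* e.g. 5.1 (TurboQuant) * 1.33 (TriAttention) ~= 6.8 */
    return quant_x * prune_x;
}

static double compressed_gib(double f16_gib, double reduction) {
    /* e.g. 8.2 GiB f16 KV at 131K context / ~6.8 ~= 1.2 GiB */
    return f16_gib / reduction;
}
```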
TurboQuant numbers (Qwen3.5-27B, RX 7900 XTX):
- GSM8K: 72.0% on 1319 problems (vs 66% f16)
- NIAH: 28/28 up to 64K context
- Tool calling: 26/26
- PPL: +0.02% at 4K, -0.9% at 16K
- Speed overhead: ~1-2%
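For intuition about where the ~5.1× comes from, here is an illustrative blockwise absmax quantizer (the general family TurboQuant belongs to; this is NOT the actual TurboQuant algorithm, and the block size and 4-bit width are my assumptions for the sketch):

```c
#include <math.h>
#include <stdint.h>

/* Illustrative blockwise absmax quantization: each block of 32 floats
 * stores one f32 scale plus 32 signed 4-bit codes. NOT the real
 * TurboQuant scheme -- just the shape of blockwise KV quantization. */
#define QBLOCK 32

typedef struct {
    float  scale;      /* absmax / 7 for the signed 4-bit range [-7, 7] */
    int8_t q[QBLOCK];  /* codes; a real kernel would pack two per byte */
} qblock4;

static void quantize_block(const float *x, qblock4 *out) {
    float amax = 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    out->scale = amax / 7.0f;
    float inv = out->scale > 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        float v = x[i] * inv;
        out->q[i] = (int8_t)(v < 0.0f ? v - 0.5f : v + 0.5f); /* round */
    }
}

static void dequantize_block(const qblock4 *in, float *x) {
    for (int i = 0; i < QBLOCK; i++)
        x[i] = in->q[i] * in->scale;
}
```

The round trip loses at most half a quantization step per value, which is why PPL deltas stay small at these bit widths.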
TriAttention is based on the recent NVIDIA/MIT paper (arXiv:2604.04921). My implementation is in C/ggml; no Python is needed at runtime. Pre-built calibration stats for the Qwen3 family are included.
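To make the 75%-retention idea concrete, here is a hypothetical sketch of the keep/evict mechanics for score-based KV pruning (the actual TriAttention ranking criterion is in the paper; `prune_kv` and the per-position `score` array are my illustrative names):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: keep the top `retain` fraction of cached
 * positions ranked by an accumulated importance score, evict the rest.
 * Only illustrates the mechanics, not TriAttention's actual criterion. */
static int cmp_desc(const void *a, const void *b) {
    float d = *(const float *)b - *(const float *)a;
    return (d > 0) - (d < 0);
}

/* Fills keep_mask[i] with 1 for kept positions; returns how many were kept. */
static int prune_kv(const float *score, int n, float retain, int *keep_mask) {
    int n_keep = (int)(retain * n + 0.5f);
    if (n_keep < 1) n_keep = 1;

    float *sorted = malloc(n * sizeof(float));
    memcpy(sorted, score, n * sizeof(float));
    qsort(sorted, n, sizeof(float), cmp_desc);
    float thresh = sorted[n_keep - 1];  /* score of the weakest survivor */
    free(sorted);

    int kept = 0;
    for (int i = 0; i < n; i++) {
        keep_mask[i] = (score[i] >= thresh && kept < n_keep) ? 1 : 0;
        if (keep_mask[i]) kept++;
    }
    return kept;
}
```

With `retain = 0.75`, a quarter of cached positions are dropped, which is where the ~1.33× reduction comes from.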
As far as I know, this is currently the only HIP/ROCm TurboQuant implementation for llama.cpp and the only C/ggml implementation of TriAttention.
Repos:
- TurboQuant (HIP): llama.cpp-turboquant-hip
- TriAttention (C/ggml): triattention-ggml
- llama.cpp discussion: #20969
Three users are currently testing on Strix Halo (gfx1201) and RDNA3 (gfx1100). Feedback and testing results are welcome.