Results from combining two KV-cache reduction methods in llama.cpp on AMD/HIP:
- TurboQuant KV cache compression (turbo3): ~5.1× reduction
- TriAttention KV cache pruning (75% retention): ~1.33× reduction (1/0.75)
- Combined: ~6.8× total KV reduction
At 131K context: f16 KV = 8.2 GiB → combo ≈ 1.2 GiB.
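For anyone who wants to sanity-check how the two reductions stack, here is a minimal sketch of the arithmetic (function names are mine; real sizes also depend on head counts and ggml padding):

```c
/* Sketch of the stacked-reduction math: independent KV-cache
 * reductions compose multiplicatively. */
static double combined_reduction(double quant_x, double prune_x) {
    /* e.g. 5.1 (TurboQuant) * 1.33 (TriAttention) ~= 6.8 */
    return quant_x * prune_x;
}

static double compressed_gib(double f16_gib, double reduction) {
    /* e.g. 8.2 GiB f16 KV at 131K context / ~6.8 ~= 1.2 GiB */
    return f16_gib / reduction;
}
```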
TurboQuant numbers (Qwen3.5-27B, RX 7900 XTX):
- GSM8K: 72.0% on 1319 problems (vs 66% f16)
- NIAH: 28/28 up to 64K context
- Tool calling: 26/26
- PPL: +0.02% at 4K, -0.9% at 16K
- Speed overhead: ~1-2%
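For intuition about where the ~5.1× comes from, here is an illustrative blockwise absmax quantizer (the general family TurboQuant belongs to; this is NOT the actual TurboQuant algorithm, and the block size and 4-bit width are my assumptions for the sketch):

```c
#include <math.h>
#include <stdint.h>

/* Illustrative blockwise absmax quantization: each block of 32 floats
 * stores one f32 scale plus 32 signed 4-bit codes. NOT the real
 * TurboQuant scheme -- just the shape of blockwise KV quantization. */
#define QBLOCK 32

typedef struct {
    float  scale;      /* absmax / 7 for the signed 4-bit range [-7, 7] */
    int8_t q[QBLOCK];  /* codes; a real kernel would pack two per byte */
} qblock4;

static void quantize_block(const float *x, qblock4 *out) {
    float amax = 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    out->scale = amax / 7.0f;
    float inv = out->scale > 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        float v = x[i] * inv;
        out->q[i] = (int8_t)(v < 0.0f ? v - 0.5f : v + 0.5f); /* round */
    }
}

static void dequantize_block(const qblock4 *in, float *x) {
    for (int i = 0; i < QBLOCK; i++)
        x[i] = in->q[i] * in->scale;
}
```

The round trip loses at most half a quantization step per value, which is why PPL deltas stay small at these bit widths.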
TriAttention is based on the recent NVIDIA/MIT paper (arXiv:2604.04921). My implementation is in C/ggml; no Python is needed at runtime. Pre-built calibration stats for the Qwen3 family are included.
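To make the 75%-retention idea concrete, here is a hypothetical sketch of the keep/evict mechanics for score-based KV pruning (the actual TriAttention ranking criterion is in the paper; `prune_kv` and the per-position `score` array are my illustrative names):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: keep the top `retain` fraction of cached
 * positions ranked by an accumulated importance score, evict the rest.
 * Only illustrates the mechanics, not TriAttention's actual criterion. */
static int cmp_desc(const void *a, const void *b) {
    float d = *(const float *)b - *(const float *)a;
    return (d > 0) - (d < 0);
}

/* Fills keep_mask[i] with 1 for kept positions; returns how many were kept. */
static int prune_kv(const float *score, int n, float retain, int *keep_mask) {
    int n_keep = (int)(retain * n + 0.5f);
    if (n_keep < 1) n_keep = 1;

    float *sorted = malloc(n * sizeof(float));
    memcpy(sorted, score, n * sizeof(float));
    qsort(sorted, n, sizeof(float), cmp_desc);
    float thresh = sorted[n_keep - 1];  /* score of the weakest survivor */
    free(sorted);

    int kept = 0;
    for (int i = 0; i < n; i++) {
        keep_mask[i] = (score[i] >= thresh && kept < n_keep) ? 1 : 0;
        if (keep_mask[i]) kept++;
    }
    return kept;
}
```

With `retain = 0.75`, a quarter of cached positions are dropped, which is where the ~1.33× reduction comes from.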
As far as I know, this is currently the only HIP/ROCm TurboQuant implementation for llama.cpp and the only C/ggml implementation of TriAttention.
Repos:
- TurboQuant (HIP): llama.cpp-turboquant-hip
- TriAttention (C/ggml): triattention-ggml
- llama.cpp discussion: #20969
Three users are currently testing on Strix Halo (gfx1201) and RDNA3 (gfx1100). Feedback and testing results are welcome.