Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark

Just got Gemma 4 31B running at full 256K context on a single RTX 5090 using TurboQuant KV cache compression.

System Specs

| Component | Spec |
| --- | --- |
| GPU | NVIDIA GeForce RTX 5090 (32GB VRAM) |
| CPU | AMD Ryzen 9 9950X3D (16-core) |
| RAM | 64GB DDR5 |
| OS | Windows 11 |

Setup

  • Model: gemma-4-31B-it-UD-Q4_K_XL from Unsloth (17.46 GiB)
  • Build: TheTom/llama-cpp-turboquant branch feature/turboquant-kv-cache, merged with latest upstream master for Gemma 4 support
  • KV Cache: turbo3 (3-bit PolarQuant + Hadamard rotation, ~4.5x compression vs f16)
  • Config: --n-gpu-layers 99 --no-mmap --flash-attn on --cache-type-k turbo3 --cache-type-v turbo3
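For reference, the config above translates into a launch command along these lines (binary name and model path are illustrative; the flags are the ones listed):

```shell
# Illustrative launch; adjust paths for your build of the fork.
./llama-server -m gemma-4-31B-it-UD-Q4_K_XL.gguf \
    --ctx-size 262144 \
    --n-gpu-layers 99 --no-mmap --flash-attn on \
    --cache-type-k turbo3 --cache-type-v turbo3
```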

Benchmark Results

| Test | Speed (t/s) |
| --- | --- |
| pp4096 | 3,362.71 |
| pp16384 | 3,047.00 |
| pp65536 | 2,077.96 |
| pp131072 | 1,428.80 |
| pp262144 | 899.55 |
| tg128 | 61.51 |

  • VRAM usage at 262K: 27.7 GB / 32 GB (4.3 GB headroom)
  • GPU temp: 78-80°C at 575W (some thermal throttling occurred during 262K runs, actual unthrottled speed likely ~950+ t/s... maybe)
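For anyone curious how the prompt-processing numbers scale, the per-step slowdowns implied by the table can be computed directly:

```python
# Slowdown between consecutive context sizes, using the pp numbers
# from the benchmark table above.
speeds = {
    4096:   3362.71,
    16384:  3047.00,
    65536:  2077.96,
    131072: 1428.80,
    262144: 899.55,
}
ctxs = sorted(speeds)
for a, b in zip(ctxs, ctxs[1:]):
    factor = b // a  # context growth factor for this step (4x or 2x)
    ratio = speeds[b] / speeds[a]  # fraction of previous speed retained
    print(f"{a:>7} -> {b:>7} ({factor}x ctx): speed x{ratio:.2f}")
```

The falloff is mild at small contexts and steepens as the O(n²) attention share grows.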

Key Takeaways

  1. 256K full context fits on a single 5090 — The turbo3 cache stores K/V at effectively ~3.5 bits (3-bit PolarQuant plus quantization overhead), about 4.5x smaller than f16, with near-zero quality loss (per the TurboQuant paper, arXiv 2504.19874). Without it, 256K would be impossible on 32GB VRAM.

  2. Prompt processing slows predictably with context — throughput falls from ~3,363 t/s at 4K to ~900 t/s at 256K, approaching a 2x slowdown per 4x context at the largest sizes as the O(n²) attention cost comes to dominate.

  3. Token generation is constant — 61.5 t/s regardless of context length. Memory bandwidth bound.

  4. Gemma 4 support required fixes — Had to fix an MSVC bug in llama.cpp where std::transform with (const bool*) fails to correctly read GGUF bool arrays beyond ~48 elements in Release builds. This breaks the SWA (sliding window attention) layer pattern for Gemma 4's hybrid attention architecture. Fix: replace with manual uint8_t* loop.
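A back-of-envelope on why turbo3 is the difference between fitting and not fitting: the sketch below uses assumed, illustrative model dimensions (these are NOT the actual Gemma 4 31B config, and it ignores the savings from SWA layers, which only need a sliding-window cache), just to show the scale of the numbers involved:

```python
# Rough KV-cache sizing at 256K context. Dimensions below are
# ASSUMPTIONS for illustration, not the real Gemma 4 31B config.
n_layers   = 48       # assumed layer count
n_kv_heads = 8        # assumed KV heads (GQA)
head_dim   = 128      # assumed head dimension
n_ctx      = 262_144

# K and V each store n_kv_heads * head_dim values per token per layer.
elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim

f16_gib    = elems * 2 / 2**30   # f16 = 2 bytes per element
turbo3_gib = f16_gib / 4.5       # ~4.5x compression vs f16
print(f"f16 KV cache:    {f16_gib:.1f} GiB")
print(f"turbo3 KV cache: {turbo3_gib:.1f} GiB")
```

Under these assumptions an f16 cache alone would exceed the card's 32 GB, while the turbo3 cache fits comfortably alongside the 17.46 GiB model weights.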

Build Notes (Windows/MSVC)

If you're building TheTom's TurboQuant fork on Windows:

  1. ggml-turbo-quant.c — Add #define _USE_MATH_DEFINES before #include <math.h> (MSVC doesn't define M_PI by default)
  2. ggml-cpu/ops.cpp — Add extern "C" int turbo3_cpu_wht_group_size; at file scope (C/C++ linkage mismatch)
  3. llama-model-loader.cpp — Replace the std::transform((const bool*)...) in get_arr() with a manual uint8_t* loop (MSVC optimization bug with bool pointer casting)
  4. Build with -DBUILD_SHARED_LIBS=OFF to avoid DLL symbol export issues with the turbo globals
  5. Use -DCMAKE_CUDA_ARCHITECTURES=120a for RTX 5090 (sm_120a required for MXFP4 tensor core instructions)
submitted by /u/PerceptionGrouchy187