Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark

Just got Gemma 4 31B running at full 256K context on a single RTX 5090 using TurboQuant KV cache compression.

System Specs

| Component | Spec |
| --- | --- |
| GPU | NVIDIA GeForce RTX 5090 (32GB VRAM) |
| CPU | AMD Ryzen 9 9950X3D (16-core) |
| RAM | 64GB DDR5 |
| OS | Windows 11 |

Setup

  • Model: gemma-4-31B-it-UD-Q4_K_XL from Unsloth (17.46 GiB)
  • Build: TheTom/llama-cpp-turboquant branch feature/turboquant-kv-cache, merged with latest upstream master for Gemma 4 support
  • KV Cache: turbo3 (3-bit PolarQuant + Hadamard rotation, ~4.5x compression vs f16)
  • Config: --n-gpu-layers 99 --no-mmap --flash-attn on --cache-type-k turbo3 --cache-type-v turbo3
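For reference, the config above translates into a launch command along these lines (binary name and model path are illustrative; the flags are the ones listed):

```shell
# Illustrative launch; adjust paths for your build of the fork.
./llama-server -m gemma-4-31B-it-UD-Q4_K_XL.gguf \
    --ctx-size 262144 \
    --n-gpu-layers 99 --no-mmap --flash-attn on \
    --cache-type-k turbo3 --cache-type-v turbo3
```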

Benchmark Results

| Test | Speed (t/s) |
| --- | --- |
| pp4096 | 3,362.71 |
| pp16384 | 3,047.00 |
| pp65536 | 2,077.96 |
| pp131072 | 1,428.80 |
| pp262144 | 899.55 |
| tg128 | 61.51 |

  • VRAM usage at 262K: 27.7 GB / 32 GB (4.3 GB headroom)
  • GPU temp: 78-80°C at 575W (some thermal throttling occurred during 262K runs, actual unthrottled speed likely ~950+ t/s... maybe)
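For anyone curious how the prompt-processing numbers scale, the per-step slowdowns implied by the table can be computed directly:

```python
# Slowdown between consecutive context sizes, using the pp numbers
# from the benchmark table above.
speeds = {
    4096:   3362.71,
    16384:  3047.00,
    65536:  2077.96,
    131072: 1428.80,
    262144: 899.55,
}
ctxs = sorted(speeds)
for a, b in zip(ctxs, ctxs[1:]):
    factor = b // a  # context growth factor for this step (4x or 2x)
    ratio = speeds[b] / speeds[a]  # fraction of previous speed retained
    print(f"{a:>7} -> {b:>7} ({factor}x ctx): speed x{ratio:.2f}")
```

The falloff is mild at small contexts and steepens as the O(n²) attention share grows.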

Key Takeaways

  1. 256K full context fits on a single 5090 — The turbo3 cache stores K/V at effectively ~3.5 bits (3-bit PolarQuant plus quantization overhead), about 4.5x smaller than f16, with near-zero quality loss (per the TurboQuant paper, arXiv 2504.19874). Without it, 256K would be impossible on 32GB VRAM.

  2. Prompt processing slows predictably with context — throughput falls from ~3,363 t/s at 4K to ~900 t/s at 256K, approaching a 2x slowdown per 4x context at the largest sizes as the O(n²) attention cost comes to dominate.

  3. Token generation is constant — 61.5 t/s regardless of context length. Memory bandwidth bound.

  4. Gemma 4 support required fixes — Had to fix an MSVC bug in llama.cpp where std::transform with (const bool*) fails to correctly read GGUF bool arrays beyond ~48 elements in Release builds. This breaks the SWA (sliding window attention) layer pattern for Gemma 4's hybrid attention architecture. Fix: replace with manual uint8_t* loop.
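A back-of-envelope on why turbo3 is the difference between fitting and not fitting: the sketch below uses assumed, illustrative model dimensions (these are NOT the actual Gemma 4 31B config, and it ignores the savings from SWA layers, which only need a sliding-window cache), just to show the scale of the numbers involved:

```python
# Rough KV-cache sizing at 256K context. Dimensions below are
# ASSUMPTIONS for illustration, not the real Gemma 4 31B config.
n_layers   = 48       # assumed layer count
n_kv_heads = 8        # assumed KV heads (GQA)
head_dim   = 128      # assumed head dimension
n_ctx      = 262_144

# K and V each store n_kv_heads * head_dim values per token per layer.
elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim

f16_gib    = elems * 2 / 2**30   # f16 = 2 bytes per element
turbo3_gib = f16_gib / 4.5       # ~4.5x compression vs f16
print(f"f16 KV cache:    {f16_gib:.1f} GiB")
print(f"turbo3 KV cache: {turbo3_gib:.1f} GiB")
```

Under these assumptions an f16 cache alone would exceed the card's 32 GB, while the turbo3 cache fits comfortably alongside the 17.46 GiB model weights.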

Build Notes (Windows/MSVC)

If you're building TheTom's TurboQuant fork on Windows:

  1. ggml-turbo-quant.c — Add #define _USE_MATH_DEFINES before #include <math.h> (MSVC doesn't define M_PI by default)
  2. ggml-cpu/ops.cpp — Add extern "C" int turbo3_cpu_wht_group_size; at file scope (C/C++ linkage mismatch)
  3. llama-model-loader.cpp — Replace the std::transform((const bool*)...) in get_arr() with a manual uint8_t* loop (MSVC optimization bug with bool pointer casting)
  4. Build with -DBUILD_SHARED_LIBS=OFF to avoid DLL symbol export issues with the turbo globals
  5. Use -DCMAKE_CUDA_ARCHITECTURES=120a for RTX 5090 (sm_120a required for MXFP4 tensor core instructions)
submitted by /u/PerceptionGrouchy187