| I’ve been testing TurboQuant this week on two machines and wanted to share the actual numbers. Why this matters: TurboQuant compresses the KV cache, not the model weights. On long contexts, KV cache can take several GB of memory, so reducing it can make a big difference even when throughput stays similar. In the setup I tested, K stays at q8_0 and V goes to turbo3 (~3-bit). That asymmetric tradeoff makes sense because errors in the keys affect attention routing more directly, while values often tolerate heavier compression better. Benchmark 1: Mac Mini M4 16GB — Qwen3-14B Q4_K_M at 8K context → Without TurboQuant: KV cache 1280 MiB, K (f16): 640 MiB, V (f16): 640 MiB — 9.95 t/s → With TurboQuant: KV cache 465 MiB, K (q8_0): 340 MiB, V (turbo3): 125 MiB — 9.25 t/s Almost 3x compression, with pretty similar speed. Benchmark 2: M3 Max 48GB — Qwen3.5 35B A3B Q6 at 128K context → Without TurboQuant: KV cache 2560 MiB, K (f16): 1280 MiB, V (f16): 1280 MiB — 45.34 t/s → With TurboQuant: KV cache 930 MiB, K (q8_0): 680 MiB, V (turbo3): 250 MiB — 42.88 t/s How to run it This uses the community fork by TheTom, which includes Metal kernels for Apple Silicon. It’s not in mainline llama.cpp yet, although PRs are open. # Clone the TurboQuant fork (not in mainline llama.cpp yet) git clone https://github.com/TheTom/llama-cpp-turboquant.git cd llama-cpp-turboquant git checkout feature/turboquant-kv-cache # Configure with Metal (Apple Silicon GPU) cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release # Compile using all CPU cores cmake --build build -j$(sysctl -n hw.ncpu) # Run with TurboQuant: keys at q8_0, values compressed with turbo3 ./build/bin/llama-server Full walkthrough on YouTube soon. [link] [comments] |