[llama.cpp] 3.1x Q8_0 speedup on Intel Arc GPUs – reorder optimization fix (PR submitted)

TL;DR: Q8_0 quantization on Intel Xe2 (Battlemage/Arc B-series) GPUs was achieving only 21% of theoretical memory bandwidth. My AI Agent and I found the root cause and submitted a fix that brings it to 66% - a 3.1x speedup in token generation.

The problem:

On Intel Arc Pro B70, Q8_0 models ran at 4.88 t/s while Q4_K_M ran at 20.56 t/s, a ~4x gap that shouldn't exist, since Q8_0 only has ~1.7x more data. After ruling out VRAM pressure, drivers, and backend issues, we traced it to the SYCL kernel dispatch path.

Root cause:

llama.cpp's SYCL backend has a "reorder" optimization that separates quantization scale factors from weight data for coalesced GPU memory access. This was implemented for Q4_0, Q4_K, and Q6_K - but Q8_0 was never added. Q8_0's 34-byte blocks (not power-of-2) make the non-reordered layout especially bad for GPU cache performance.

Sooo, the fix:

~200 lines of code extending the existing reorder framework to Q8_0. The most critical bug was actually a single line - Q8_0 tensors weren't getting the "extra" struct allocated during buffer init, so the reorder flag was silently never set.
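The shape of that one-line bug, in hypothetical form (the names below are illustrative, not the actual llama.cpp SYCL code): the buffer-init path only allocated the reorder side-car struct for types already on its list, so Q8_0 fell through and the dispatch path never saw a reorderable tensor.

```cpp
enum ggml_type { GGML_TYPE_Q4_0, GGML_TYPE_Q4_K, GGML_TYPE_Q6_K, GGML_TYPE_Q8_0 };

// Hypothetical stand-in for the buffer-init check: which quantization types
// get the "extra" reorder struct allocated during buffer initialization.
bool needs_reorder_extra(ggml_type t) {
    return t == GGML_TYPE_Q4_0 || t == GGML_TYPE_Q4_K || t == GGML_TYPE_Q6_K
        || t == GGML_TYPE_Q8_0;  // the fix: Q8_0 was missing from this list
}
```

Without the `extra` struct, the reorder flag stays unset, so everything downstream silently takes the slow non-reordered path.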

Results on Qwen3.5-27B (Intel Arc Pro B70):

  • Q8_0 before: 4.88 t/s (21% bandwidth)
  • **Q8_0 after: 15.24 t/s (66% bandwidth) - 3.1x faster**
  • Q4_K_M: 20.12 t/s (unchanged)
  • Q6_K: 13.83 t/s (no reorder)

Q8_0 is now faster than Q6_K (15.24 vs 13.83 t/s) in my testing, while providing higher quality.

Validation: Before writing the fix, we binary-patched Intel's closed-source IPEX-LLM to run on my GPU (it doesn't support B70's PCI device ID). Their optimized Q8_0 kernels achieved 61% bandwidth, confirming the problem was solvable. My open-source implementation achieves 66%.

PR: https://github.com/ggml-org/llama.cpp/pull/21527

Issue: https://github.com/ggml-org/llama.cpp/issues/21517

Hardware: Intel Arc Pro B70, 32 GB GDDR6, 608 GB/s bandwidth

submitted by /u/Katostrofik
