Nvidia RTX 3090 vs Intel Arc Pro B70 llama.cpp Benchmarks

Just sharing the results from experimenting with the Arc Pro B70 on my setup.

These results compare three llama.cpp execution paths on the same machine:

  • RTX 3090 (Vulkan) on NixOS host, using main llama.cpp repo (compiled on 4/21/2026)
  • Arc Pro B70 (Vulkan) on NixOS host, using main llama.cpp repo (compiled on 4/21/2026)
  • Arc Pro B70 (SYCL) inside an Ubuntu 24.04 Docker container, using a separate SYCL-enabled llama-bench build from the aicss-genai/llama.cpp fork

Prompt processing (pp512)

All throughput values are in tokens/second (mean ± std from llama-bench); higher is better. "B70 best vs 3090" compares the faster of the two B70 backends against the RTX 3090.

| Model | RTX 3090 (Vulkan) | Arc Pro B70 (Vulkan) | Arc Pro B70 (SYCL) | B70 best vs 3090 | B70 SYCL vs B70 Vulkan |
|---|---|---|---|---|---|
| TheBloke/Llama-2-7B-GGUF:Q4_K_M | 4550.27 ± 10.90 | 1236.65 ± 3.19 | 1178.54 ± 5.74 | -72.8% | -4.7% |
| unsloth/gemma-4-E2B-it-GGUF:Q4_K_XL | 9359.15 ± 168.11 | 2302.80 ± 5.26 | 3462.19 ± 36.07 | -63.0% | +50.3% |
| unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_M | 3902.28 ± 21.37 | 1126.28 ± 6.17 | 945.89 ± 17.53 | -71.1% | -16.0% |
| unsloth/gemma-4-31B-it-GGUF:Q4_K_XL | 991.47 ± 1.73 | 295.66 ± 0.60 | 268.50 ± 0.65 | -70.2% | -9.2% |
| ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF:Q8_0 | 4740.04 ± 13.78 | 1176.34 ± 1.68 | 1192.99 ± 5.75 | -74.8% | +1.4% |
| ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF:Q8_0 | oom | 990.32 ± 5.34 | 552.37 ± 5.76 | n/a | -44.2% |
| Qwen/Qwen3-8B-GGUF:Q8_0 | 4195.89 ± 41.31 | 1048.39 ± 2.66 | 1098.90 ± 1.02 | -73.8% | +4.8% |
| unsloth/Qwen3.5-4B-GGUF:Q4_K_XL | 5233.55 ± 8.29 | 1430.72 ± 9.68 | 1767.21 ± 21.27 | -66.2% | +23.5% |
| unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M | 3357.03 ± 18.47 | 886.39 ± 6.14 | 445.56 ± 7.46 | -73.6% | -49.7% |
| unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M | 3417.76 ± 17.84 | 878.15 ± 5.32 | 442.01 ± 6.51 | -74.3% | -49.7% |
| Average (excluding oom) | | | | -71.1% | |

Token generation (tg128)

| Model | RTX 3090 (Vulkan) | Arc Pro B70 (Vulkan) | Arc Pro B70 (SYCL) | B70 best vs 3090 | B70 SYCL vs B70 Vulkan |
|---|---|---|---|---|---|
| TheBloke/Llama-2-7B-GGUF:Q4_K_M | 137.92 ± 0.41 | 58.61 ± 0.09 | 92.39 ± 0.30 | -33.0% | +57.6% |
| unsloth/gemma-4-E2B-it-GGUF:Q4_K_XL | 207.21 ± 2.00 | 89.33 ± 0.60 | 70.65 ± 0.84 | -56.9% | -20.9% |
| unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_M | 131.33 ± 0.14 | 42.00 ± 0.01 | 37.75 ± 0.32 | -68.0% | -10.1% |
| unsloth/gemma-4-31B-it-GGUF:Q4_K_XL | 31.49 ± 0.05 | 14.49 ± 0.04 | 18.30 ± 0.05 | -41.9% | +26.3% |
| ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF:Q8_0 | 98.96 ± 0.56 | 21.30 ± 0.03 | 55.37 ± 0.02 | -44.1% | +160.0% |
| ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF:Q8_0 | oom | 37.69 ± 0.03 | 28.58 ± 0.09 | n/a | -24.2% |
| Qwen/Qwen3-8B-GGUF:Q8_0 | 92.29 ± 0.17 | 19.78 ± 0.01 | 50.74 ± 0.02 | -45.0% | +156.5% |
| unsloth/Qwen3.5-4B-GGUF:Q4_K_XL | 162.58 ± 0.76 | 60.45 ± 0.06 | 79.09 ± 0.05 | -51.4% | +30.8% |
| unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M | 148.01 ± 0.38 | 43.30 ± 0.05 | 37.93 ± 0.89 | -70.7% | -12.4% |
| unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M | 148.64 ± 0.53 | 43.46 ± 0.02 | 36.87 ± 0.42 | -70.8% | -15.2% |
| Average (excluding oom) | | | | -53.5% | |
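The percentage columns are plain relative throughput deltas. As a sanity check, here is a small Python sketch (not from the original post) that recomputes the two delta columns from the raw means of the Llama-2-7B tg128 row:

```python
def rel_delta(new: float, baseline: float) -> float:
    """Relative throughput change of `new` vs `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0

# Llama-2-7B tg128 means from the table above (tokens/second)
rtx3090_vulkan = 137.92
b70_vulkan = 58.61
b70_sycl = 92.39

# "B70 best" takes the faster of the two B70 backends
b70_best = max(b70_vulkan, b70_sycl)

print(f"B70 best vs 3090:       {rel_delta(b70_best, rtx3090_vulkan):+.1f}%")  # -33.0%
print(f"B70 SYCL vs B70 Vulkan: {rel_delta(b70_sycl, b70_vulkan):+.1f}%")      # +57.6%
```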

Commands used

Host Vulkan runs

For each model, the host benchmark commands were:

llama-bench -hf <MODEL> -dev Vulkan0
llama-bench -hf <MODEL> -dev Vulkan2

Where:

  • Vulkan0 = RTX 3090
  • Vulkan2 = Arc Pro B70

Container SYCL runs

For each model, the SYCL benchmark was run inside the Docker container with:

./build/bin/llama-bench -hf <MODEL> -dev SYCL0 

Where:

  • SYCL0 = Arc Pro B70
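Since the same invocation was repeated per model, a small shell loop could drive the whole sweep. This is a hypothetical wrapper (not from the original post): the model list is truncated to two entries, and `echo` prefixes the command so the sketch prints what would run instead of executing it.

```shell
# Assumed model list (truncated); device name matches the post
DEV=SYCL0
MODELS="TheBloke/Llama-2-7B-GGUF:Q4_K_M Qwen/Qwen3-8B-GGUF:Q8_0"

# Print one llama-bench command per model; drop `echo` to actually run them
for M in $MODELS; do
  echo ./build/bin/llama-bench -hf "$M" -dev "$DEV"
done
```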

Test machine

  • CPU: AMD Ryzen Threadripper 2970WX 24-Core Processor
    • 24 cores / 48 threads
    • 1 socket
    • 2.2 GHz min / 3.0 GHz max
  • RAM: 128 GiB total
  • GPUs:
    • NVIDIA GeForce RTX 3090, 24 GiB
    • NVIDIA GeForce RTX 3090, 24 GiB
    • Intel Arc Pro B70, 32 GiB
submitted by /u/tovidagaming