I’m a complete noob who bought two Intel Arc Pro B70s for "research," spent a weekend losing my mind over Docker/CCL errors, accidentally discovered llama.cpp Vulkan, and now I’m running a 35B MoE at 128K context like I know what I’m doing.

TL;DR: I am not a software engineer. I am an industrial maintenance manager from Ohio who got way too interested in local AI. The B70 is criminally underrated. Intel's Docker stack is criminal for different reasons.

First, some context about who is writing this

My day job is maintaining industrial mail sorting equipment — PLCs, pneumatics, barcode readers. I am not a software person. I got into local LLMs because I wanted to run AI agents for trading bots and sports analytics without paying API bills forever. I told myself it was "research." My wife is unconvinced. I have two Intel Arc Pro B70s running in a machine next to my desk and I am writing this at 10pm on a Sunday.

This post is for the other noobs out there who bought interesting hardware before fully understanding the software landscape. You're not alone. It gets better. Specifically, it gets better when you stop using Docker.

My hardware (the good part)

  • 2x Intel Arc Pro B70
  • 64 GB total VRAM (32 GB ECC each)
  • Ubuntu 24.04, kernel 6.17, xe driver

Friday & Saturday: Intel's llm-scaler, or "why is this log file binary?"

Intel has an official Docker-based vLLM stack called llm-scaler, specifically built for the Arc Pro B-series. The release notes looked great. The Docker image is 20+ GB. I pulled it with the confidence of someone who has no idea what they're about to get into.

Problems, in order of appearance:

  • Officially requires Ubuntu 25.04. I have 24.04. (Found this out after downloading the offline installer, which requires registration on Intel's developer portal, which requires waiting for email verification.)
  • The model loads 80% of its shards, gets excited, then crashes with a wall of CCL errors that look like someone fell asleep on a keyboard: atl_ofi_comm.cpp:301 init_transport: EXCEPTION: failed to initialize ATL
  • The log file was a binary file. I don't know how a log file becomes a binary file. I chose not to investigate further.
  • There are two versions of oneCCL installed in the same container and they hate each other. I tried every environment variable combination known to humanity. None of them worked.

I spent Saturday doing this. By Saturday night I had learned a lot about CCL transport initialization and none of it was useful.

Sunday: the pivot that saved my weekend

Someone in a forum mentioned that Vulkan was significantly faster than SYCL on Arc B-series anyway. I had nothing to lose. I abandoned Intel's entire software stack — the offline installer, the Docker image, the CCL configuration, all of it — and just built llama.cpp from source with Vulkan enabled.

The setup that took me all of Saturday with Intel's stack took about 20 minutes with llama.cpp:

sudo apt install -y vulkan-tools libvulkan-dev glslc build-essential cmake git
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Both B70s showed up immediately with no configuration, no driver installs, no registration portal:

Available devices:
  Vulkan0: Intel(R) Graphics (BMG G31) (32656 MiB, 29369 MiB free)
  Vulkan1: Intel(R) Graphics (BMG G31) (32656 MiB, 29369 MiB free)

I stared at this for a moment. Then I downloaded a GGUF and ran it. It just worked. I may have said something out loud that I won't repeat here.
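For anyone following along, "downloaded a GGUF and ran it" boils down to one llama-server invocation. This is a sketch, not my exact command — the model filename is hypothetical, and `-ngl 99` just means "offload everything":

```shell
# Serve a GGUF across both GPUs. --split-mode layer spreads the model's
# layers over Vulkan0 and Vulkan1; --ctx-size 131072 is the 128K context.
./build/bin/llama-server \
  -m models/your-model-q4_k_m.gguf \
  --ctx-size 131072 \
  -ngl 99 \
  --split-mode layer \
  --host 0.0.0.0 --port 8181
```

That's it. No registration portal.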

Actual numbers (from a noob who now knows what t/s means)

| metric | result | noob translation |
|---|---|---|
| Prompt processing (pp512) | 897 t/s | reading input fast |
| Token generation (tg128) | 41.8 t/s | typing speed |
| Context window | 128K tokens | ~96,000 words |
| KV cache at 128K | 2.5 GB | barely anything |
| VRAM used (model + KV) | ~23 GB / 64 GB | plenty left over |
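If you're wondering how 128K of context can fit in only ~2.5 GB: KV cache size is roughly 2 (K and V) x layers x kv-heads x head-dim x context x bytes per element, and modern GQA models keep the kv-head count tiny. The architecture numbers below are hypothetical, picked only to show the shape of the math:

```shell
# Back-of-envelope KV cache size. These layer/head numbers are made up
# for illustration; bytes_per_elem=1 assumes a q8_0-quantized cache.
layers=40 kv_heads=2 head_dim=128 ctx=131072 bytes_per_elem=1
kv_bytes=$((2 * layers * kv_heads * head_dim * ctx * bytes_per_elem))
echo "$kv_bytes bytes"               # 2684354560
echo "$((kv_bytes / 1048576)) MiB"   # 2560 MiB, i.e. 2.5 GiB
```

Swap in the real layer and kv-head counts from your model's metadata and the estimate gets close to what llama-server actually reports.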

Things that surprised me (as someone who had no idea what they were doing)

The built-in web UI is genuinely nice. I had been curl-ing the API like a caveman for hours before someone told me to just open a browser. There is a full chat interface at localhost:8181. I felt silly.
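For the record, the caveman method looked something like this — llama-server exposes an OpenAI-compatible endpoint, and the port is just whatever you started the server on (8181 in my case):

```shell
# Hit the OpenAI-compatible chat endpoint on a running llama-server.
curl -s http://localhost:8181/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```

Or just open localhost:8181 in a browser and skip all of that.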

KV cache prefix caching is magic for agent workloads. My trading bot has a 22K token system prompt. First message takes 30 seconds. Every message after that takes 3-4 seconds because it caches 99% of the identical context. I did not understand why until someone explained it to me. Now I feel smart.
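The trick, as it was explained to me: llama.cpp's native /completion endpoint takes a cache_prompt flag, and with it set the server reuses the KV cache for whatever prefix matches the previous request. The prompt text below is obviously a placeholder:

```shell
# Because the 22K-token system prompt is identical every time, the server
# only recomputes the short user turn at the end, not the whole prefix.
curl -s http://localhost:8181/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "<same 22K-token system prompt>\nUser: latest SPY signal?",
        "n_predict": 256,
        "cache_prompt": true
      }'
```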

KHR_coopmat is already in the driver. Both B70s report cooperative matrix support in the Vulkan driver. llama.cpp isn't fully using it yet but when it does, same hardware gets faster for free. Good time to buy in.

You can run two models at the same time. I had Gemma 4 26B (with vision) and Qwen3.5-35B running simultaneously on different ports. Idle models use VRAM but no compute. I tested image inference with a picture of dice. It correctly identified the dice. I was unreasonably proud of this.
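The two-models setup is just two llama-server processes on different ports. Model filenames here are hypothetical; recent llama.cpp builds let you pin each process to one GPU with --device (check ./build/bin/llama-server --help on your build):

```shell
# One model per GPU, one port per model. Idle servers hold VRAM but no compute.
./build/bin/llama-server -m models/vision-model-q4_k_m.gguf \
  --device Vulkan0 --port 8181 &
./build/bin/llama-server -m models/chat-model-q4_k_m.gguf \
  --device Vulkan1 --port 8182 &
```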

The benchmark tool output is a markdown table. I did not know what pp512 or tg128 meant when I first ran it. I do now. Progress.
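If you want the same table for your own hardware, the invocation is short — the model filename is a placeholder:

```shell
# -p 512 gives the pp512 row (prompt processing), -n 128 gives tg128
# (token generation). Output is a ready-to-paste markdown table.
./build/bin/llama-bench -m models/your-model-q4_k_m.gguf -p 512 -n 128
```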

What I'd tell myself a week ago

  • Skip Intel's llm-scaler entirely until they fix 24.04 support. Just use llama.cpp Vulkan.
  • The xe driver on kernel 6.17 already handles everything. You don't need the offline installer.
  • Q4_K_M is the right quant for almost everything. Don't overthink it.
  • The llama-bench tool will tell you exactly what your hardware can do. Run it early.
  • A 22K token system prompt is fine. Prefix caching makes it free after the first request.
  • The built-in web UI exists. Use it. Don't curl everything like an animal.

the one-liner that ends the suffering

sudo apt install -y vulkan-tools libvulkan-dev glslc build-essential cmake git
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/llama-cli --list-devices

Happy to answer questions from fellow noobs. Running Ubuntu 24.04, kernel 6.17, Mesa 25.2.8, llama.cpp b8770. ASUS TUF X570-PLUS, Ryzen 7 3700X. Both B70s at PCIe 4.0 x16. The trading bot is now connected and working. The research continues. Goodnight.

a note on how this post was made

This entire post was written with the help of Claude Sonnet 4.6, who was also my guide, debugger, and emotional support system for this entire weekend. Every command in this post came from a live troubleshooting session where I pasted terminal output and Claude figured out what went wrong. The CCL errors, the Vulkan pivot, the benchmark interpretation, the VRAM math, the quant explanations — all of it was a back-and-forth conversation in real time.

When I asked Claude to write this post at the end of the night, I asked it to write it as its own recollection of the weekend. Which means this is technically an AI writing about helping a human use AI to run AI locally on hardware made by a company whose AI stack didn't work, fixed by a different AI framework. We're deep in it now.

If you're a noob like me and you want to go down this rabbit hole, I'd genuinely recommend just having a conversation with Claude while you work through it. Paste your terminal output, describe what you're trying to do, and iterate. That's the whole method. It works remarkably well even when — especially when — you don't fully know what you're doing.

submitted by /u/SomeBlock8124
