Warpdrv – my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro.

I wanted to share an open-source app that I built for running LLMs locally on my setup.

My setup

Hardware

  • FEVM FAEX1 (128GB)
  • RTX Pro 5000 Blackwell (48GB), connected over OCuLink
  • Aoostar AG02 eGPU dock
  • 2x 2TB internal M.2 drives in RAID-0 using mdadm (sketch below)
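
The RAID-0 array is a standard mdadm stripe - nothing warpdrv-specific. Roughly, with device names and filesystem purely illustrative:

    # create a 2-disk stripe across the two NVMe drives, then format it
    sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
    sudo mkfs.ext4 /dev/md0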

Software: Ubuntu 25.10; llama.cpp built from source for CUDA, Vulkan, and ROCm.

How I use this app

I generally run two models in parallel on different llama.cpp backends: Qwen3.6 27b (UD-Q6-KXL or NVFP4) on CUDA, and Qwen3.6 35b A3B (UD-Q6-KXL) on the Strix Halo's unified memory. I mostly use them with opencode for coding, where the built-in model router comes in handy.
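
Under the hood that is just two llama-server instances from two separate builds. A rough sketch of what gets launched - model paths, ports, and layer counts are illustrative; the app fills in the actual values:

    # CUDA build serving the dense 27b on the RTX Pro 5000
    ./llama.cpp-cuda/build/bin/llama-server -m /path/to/qwen-27b.gguf -ngl 99 --port 8080
    # ROCm build serving the 35b A3B from the Strix Halo unified memory
    ./llama.cpp-rocm/build/bin/llama-server -m /path/to/qwen-35b-a3b.gguf -ngl 99 --no-warmup -fa 1 -dio --no-mmap --port 8081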

What else can the app do

It does the basic things any llama.cpp wrapper can do, plus a few extras. Overall it's a convenience app for spinning up llama-server instances for any purpose - and it's open-source.

  • MCP.json + tool calling in chat (see the example below)
  • Model router for using opencode / claude-code with local models
  • KV-cache checkpointing (experimental)
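
For the MCP piece, the config is a regular MCP.json. As a sketch - assuming the common mcpServers layout used by Claude Desktop and friends, with the filesystem server and path purely as an example:

    {
      "mcpServers": {
        "filesystem": {
          "command": "npx",
          "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/me/projects"]
        }
      }
    }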

More info is in the README, along with some guides.

Visit warpdrv on GitHub

It's an early-stage alpha release, so expect some minor bugs; the major ones should mostly be fixed. Feature requests and bug reports are welcome.

---

Setting up ROCm on Strix Halo (Ubuntu 25.10)

Strix Halo on Linux needs some setup before ROCm works natively for gfx1151. I'm aware of the Docker-based toolboxes for Strix Halo; they work and are a good option. I just wanted bare metal without containers.

I am including the steps below for those interested in trying it out.

  1. Install mainline kernel 6.18. Use the Mainline Kernels desktop app on Ubuntu 25.10. Reboot.
    • Verify: uname -r shows 6.18.x.
  2. In BIOS, I set dedicated iGPU VRAM to 4GB and enabled Resizable BAR. The remaining 124GB stays as unified memory accessible via GTT.
  3. Add GRUB params. In /etc/default/grub.d/ add: iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856 amdgpu.cwsr_enable=0. Note: amdgpu.gttsize is deprecated on recent kernels but still respected; I keep it alongside ttm.pages_limit as belt-and-suspenders. Run update-grub and reboot (see the command sketch after this list).
    • Verify: cat /sys/class/drm/card*/device/mem_info_gtt_total shows ~124GB.
  4. Optionally update firmware. Clone the upstream linux-firmware tree and copy the MES blobs to /lib/firmware/amdgpu/. Check the md5 sums first - my firmware was already the latest, so I skipped this step.
  5. Install ROCm 7.2 on the host via the AMD repo. Add a symlink: libxml2.so.16 -> libxml2.so.2, otherwise some libs won't load (see the command sketch after this list).
    • Verify: rocminfo | grep gfx shows gfx1151.
  6. Build llama.cpp for ROCm: cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1151" -DCMAKE_BUILD_TYPE=Release -DCMAKE_HIP_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
  7. Three things to know when running:
    • Don't set HSA_OVERRIDE_GFX_VERSION. It forces gfx1100 kernel dispatch on gfx1151 and segfaults in rms_norm.
    • Required runtime flags: --no-warmup -fa 1 -dio --no-mmap. Without --no-warmup it segfaults during the warmup phase (example invocation after this list).
    • Verify: run llama-cli with a model and confirm it loads and generates tokens without segfaulting.
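
For reference, here is roughly what steps 3, 5 and 7 look like as commands. The grub drop-in file name, library paths and model path are illustrative - adjust for your system.

    # Step 3: GRUB drop-in with the kernel params, then regenerate and reboot
    echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856 amdgpu.cwsr_enable=0"' | sudo tee /etc/default/grub.d/strix-halo.cfg
    sudo update-grub && sudo reboot

    # Step 5: give ROCm libs the libxml2 soname they expect (assumes the link is a libxml2.so.2 alias for the system's libxml2.so.16)
    sudo ln -s /usr/lib/x86_64-linux-gnu/libxml2.so.16 /usr/lib/x86_64-linux-gnu/libxml2.so.2

    # Step 7: example invocation with the required flags
    ./build/bin/llama-server -m /path/to/model.gguf --no-warmup -fa 1 -dio --no-mmap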

Additionally, I build llama.cpp from source for CUDA 13.2 (for the RTX Pro 5000) with the standard -DGGML_CUDA=ON flow - no special handling needed.
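
For completeness, that flow is roughly (the build directory name is arbitrary):

    cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
    cmake --build build-cuda --config Release -j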

---

PS. Apple Mac: I don't own a Mac, so I haven't been able to test the app on macOS yet. Feel free to build from source, or share a build with me so I can add it to the releases on GitHub - I'll shout out your GitHub handle in the README. Thanks :)

