I wanted to share an open-source app that I built for running LLMs locally on my setup.
My setup
Hardware
- FEVM FAEX1 (128GB)
- RTX Pro 5000 Blackwell (48GB), connected over OCuLink
- Aoostar AG02
- 2x2TB internal M.2 drives in RAID-0 using mdadm (a setup sketch follows below)
Software: Ubuntu 25.10, llama.cpp built from source for CUDA, Vulkan, and ROCm.
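For anyone curious about the RAID-0 part, here's a rough sketch of the mdadm setup - the device names, filesystem, and mount point are illustrative, not necessarily my exact layout:

```bash
# Create a striped (RAID-0) array from two NVMe drives (device names assumed)
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

# Format and mount it (ext4 and the mount point are just examples)
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /mnt/models
sudo mount /dev/md0 /mnt/models

# Persist the array definition across reboots
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
sudo update-initramfs -u
```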
How I use this app
I generally run two models in parallel on different llama.cpp backends - Qwen3.6 27b UD-Q6-KXL or NVFP4 on CUDA, and Qwen3.6 35b A3B UD-Q6-KXL on the Strix Halo unified memory. I mostly use them with opencode for coding, and the built-in model router comes in handy.
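For illustration, those two instances boil down to something like the following - binary paths, model filenames, and ports are placeholders, and the app spins these up for you:

```bash
# Dense model on the RTX Pro 5000 via the CUDA build
./build-cuda/bin/llama-server -m models/qwen3.6-27b-ud-q6-kxl.gguf \
    --host 127.0.0.1 --port 8080 -ngl 99

# MoE model in Strix Halo unified memory via the ROCm build
# (runtime flags per the ROCm notes further down in this post)
./build-rocm/bin/llama-server -m models/qwen3.6-35b-a3b-ud-q6-kxl.gguf \
    --host 127.0.0.1 --port 8081 -ngl 99 --no-warmup -fa 1 -dio --no-mmap
```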
What else can the app do
It does the basic things any llama.cpp wrapper can do, plus some other things. Overall it's a convenience app for spinning up llama-server instances for any purpose. And it's open-source.
- MCP.json + tool calling in chat (example config sketch after this list)
- Model Router for using opencode / claude-code locally.
- KV-cache checkpointing (experimental).
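For the MCP.json item above, here is a hypothetical config using the common mcpServers layout - the server name, command, and file location are illustrative; check the README for the exact schema and path the app expects:

```bash
cat > MCP.json <<'EOF'
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/me/projects"]
    }
  }
}
EOF
```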
More info in the README, along with some guides.
Visit warpdrv on GitHub
It's an early-stage alpha release, so expect some minor bugs - I have mostly fixed the major ones. Feature requests as well as bug reports are welcome.
---
Setting up ROCm on Strix Halo (Ubuntu 25.10)
Strix Halo on Linux needs some setup before ROCm works natively for gfx1151. I am aware of the Docker-based toolboxes for Strix Halo - they work and are a good option, but I wanted bare metal without containers.
I am including the steps below for those interested in trying it out; a consolidated command sketch follows the list.
- Install mainline kernel 6.18. Use the Mainline Kernels desktop app on Ubuntu 25.10. Reboot.
- Verify: uname -r shows 6.18.x.
- In BIOS, I set dedicated iGPU VRAM to 4GB and enabled Resizable BAR. The remaining 124GB stays as unified memory accessible via GTT.
- Add GRUB params. In /etc/default/grub.d/ add: iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856 amdgpu.cwsr_enable=0. Note: amdgpu.gttsize is deprecated on recent kernels but still respected; kept alongside ttm.pages_limit as belt-and-suspenders. Run update-grub and reboot.
- Verify: cat /sys/class/drm/card*/device/mem_info_gtt_total shows ~124GB.
- Optionally update firmware. Clone the upstream linux-firmware tree and copy the MES blobs to /lib/firmware/amdgpu/. Check md5 first - my firmware was already the latest, so I didn't run this step.
- Install ROCm 7.2 on the host via the AMD repo. Add symlink: libxml2.so.16 -> libxml2.so.2, otherwise some libs won't load.
- Verify: rocminfo | grep gfx shows gfx1151.
- Build llama.cpp for ROCm: cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1151" -DCMAKE_BUILD_TYPE=Release -DCMAKE_HIP_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
- Three things to know when running:
  - Don't set HSA_OVERRIDE_GFX_VERSION. It forces gfx1100 kernel dispatch on gfx1151 and segfaults in rms_norm.
  - Required runtime flags: --no-warmup -fa 1 -dio --no-mmap. Without --no-warmup it segfaults during the warmup phase.
  - Verify: run llama-cli with a model and confirm it loads and generates tokens without a segfault.
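Pulling the steps above together as commands - file names, build directories, library paths, and the model path are illustrative, so adapt them to your layout:

```bash
# Kernel command line: drop-in file under /etc/default/grub.d/ (file name is arbitrary)
sudo tee /etc/default/grub.d/strix-halo.cfg >/dev/null <<'EOF'
GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856 amdgpu.cwsr_enable=0"
EOF
sudo update-grub && sudo reboot

# After reboot: GTT should report ~124GB
cat /sys/class/drm/card*/device/mem_info_gtt_total

# libxml2 compatibility symlink as described above (library directory assumed)
sudo ln -s /usr/lib/x86_64-linux-gnu/libxml2.so.2 /usr/lib/x86_64-linux-gnu/libxml2.so.16

# Build llama.cpp for ROCm / gfx1151
cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1151" \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_HIP_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
cmake --build build-rocm --config Release -j

# Smoke test with the runtime flags listed above (model path illustrative)
./build-rocm/bin/llama-cli -m models/qwen3.6-35b-a3b-ud-q6-kxl.gguf \
      --no-warmup -fa 1 -dio --no-mmap -ngl 99 -p "Hello"
```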
Additionally, I build llama.cpp from source for CUDA 13.2 (for RTX Pro 5000) with the standard -DGGML_CUDA=ON flow, no special handling.
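For completeness, that build is roughly the following (build directory name is just a placeholder):

```bash
cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-cuda --config Release -j
```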
---
PS. Apple Mac: I don't own a Mac, so I am unable to test the app on macOS yet. Feel free to build from source, or share the build with me so I can add it to the releases on GitHub - I'll shout out your GitHub handle in the README, thanks :)