I have a Docker stack with a bunch of AI services, and llama.cpp server is the brain.
I've got a working Vulkan yml snippet for llama.cpp, but out of curiosity I flipped it to ROCm (latest build) and did not see ANY performance improvement. In fact, I noticed that for the SAME model, SAME context setting, and SAME KV cache quant (Q8_0), the ROCm version consumed 29.1 GB of VRAM vs 25.3 GB with Vulkan.
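For context, the two compose variants look roughly like this (a sketch, not my exact config: the image tags, model path, and context size here are illustrative, and the device passthrough is what I understand each backend to need):

```yaml
# Vulkan variant: only needs the DRM render nodes
llama-vulkan:
  image: ghcr.io/ggml-org/llama.cpp:server-vulkan   # illustrative tag, check upstream
  devices:
    - /dev/dri:/dev/dri
  volumes:
    - ./models:/models
  command: >
    --model /models/model.gguf
    --ctx-size 32768
    --cache-type-k q8_0 --cache-type-v q8_0
    -ngl 99

# ROCm variant: additionally needs the kernel fusion driver (/dev/kfd)
llama-rocm:
  image: ghcr.io/ggml-org/llama.cpp:server-rocm     # illustrative tag, check upstream
  devices:
    - /dev/kfd:/dev/kfd
    - /dev/dri:/dev/dri
  group_add:
    - video
  volumes:
    - ./models:/models
  command: >
    --model /models/model.gguf
    --ctx-size 32768
    --cache-type-k q8_0 --cache-type-v q8_0
    -ngl 99
```

Everything else (model, context, KV cache quant, layers offloaded) is held constant between the two; only the image and device mappings change.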
Am I missing something here? Is this behavior specific to my GPU, or is some other variable in my setup (hardware or software) responsible?