Linux – Why does llama.cpp ROCm consume SO much VRAM for KV cache compared to Vulkan?

I have a Docker stack with a bunch of AI services, and llama.cpp server is the brain.

I've got a working Vulkan YAML snippet for llama.cpp, but out of curiosity I flipped it to ROCm (latest build) and did not see ANY performance improvement. In fact, I noticed that for the SAME model, the SAME context setting, and the same KV cache quant (Q8_0), the ROCm version consumed 29.1 GB of VRAM vs. 25.3 GB with Vulkan.

Am I missing something here? Is this phenomenon unique to my GPU, or is it down to some other variable in my setup, hardware, or software?
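For reference, here is how the KV cache size itself can be estimated, independent of backend. This is a hedged sketch: the model dimensions below (80 layers, 8 KV heads of dim 128, 32k context) are hypothetical stand-ins, not taken from the post, and the 34-bytes-per-32-element figure assumes llama.cpp's Q8_0 block layout (32 int8 values plus one fp16 scale). If the cache math comes out the same either way, any VRAM difference between backends would have to come from somewhere else (compute buffers, allocator granularity, etc.).

```python
# Estimate KV-cache VRAM for a transformer served by llama.cpp.
# Assumption: Q8_0 stores blocks of 32 elements in 34 bytes
# (32 int8 quants + 1 fp16 scale per block).
Q8_0_BLOCK_SIZE = 32
Q8_0_BYTES_PER_BLOCK = 34

def kv_cache_bytes_q8_0(n_layers: int, n_ctx: int,
                        n_kv_heads: int, head_dim: int) -> int:
    """Bytes needed for the K and V caches quantized to Q8_0."""
    # 2 = one K tensor + one V tensor per layer
    elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return elems // Q8_0_BLOCK_SIZE * Q8_0_BYTES_PER_BLOCK

# Hypothetical large GQA model -- substitute the real values
# from your GGUF metadata:
size = kv_cache_bytes_q8_0(n_layers=80, n_ctx=32768,
                           n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.2f} GiB")  # -> 5.31 GiB for these made-up dims
```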

submitted by /u/Jorlen
