Something I'm noticing that I don't think I've hit before. I've been testing Gemma 4 31B on a machine with 32GB of VRAM and 64GB of DDR5. I can load the UD_Q5_K_XL Unsloth quant with about 100k context and plenty of VRAM headroom, but what ends up killing me is that after sending a few prompts, system RAM fills up and the process gets terminated. It's not a GPU or CUDA OOM, it's Linux's OOM killer stepping in because llama.cpp was using 63GB of system RAM.
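If anyone wants to confirm it's system RAM (and not VRAM) that's filling up, you can watch the process's resident set size straight from `/proc` before the kernel kills it. A minimal sketch; the 60GB threshold is just a placeholder matching my 64GB box, adjust for yours:

```python
import time
from pathlib import Path


def rss_gb(pid) -> float:
    """Read a process's resident set size from /proc/<pid>/status, in GB."""
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("VmRSS:"):
            kb = int(line.split()[1])  # VmRSS is reported in kB
            return kb / 1024**2
    return 0.0


def watch(pid, limit_gb: float = 60.0, interval_s: float = 5.0) -> None:
    """Poll RSS and warn before the OOM killer is likely to step in."""
    while True:
        gb = rss_gb(pid)
        print(f"llama.cpp RSS: {gb:.1f} GB")
        if gb > limit_gb:
            print("Nearing system RAM limit, expect the OOM killer soon")
            break
        time.sleep(interval_s)
```

Point `watch()` at llama.cpp's PID (e.g. from `pgrep llama`) and you can see exactly how much each prompt adds.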
I've since switched to another, slower PC with a bunch of older GPUs and 128GB of DDR4. Even with heaps of spare VRAM there, it still eats into system RAM; the extra capacity just gives me a bigger buffer before large prompts kill the process, so it's more usable. That said, I've had a process running for a while now that has sent a few ~25k token prompts, and I'm sitting at 80GB of system RAM and climbing, so I don't think it'll make it anywhere near 100k.
I even tried switching to the Q4, which only used ~23GB of my 32GB of VRAM, but still: throw a few large prompts at it and system RAM fills up quickly and llama.cpp gets killed.
I'm using the latest llama.cpp as of 2 hours ago and am seeing the same thing across a couple of different machines.
It's weird that I would need to lower the model's context so that it only takes up ~18GB of my 32GB of VRAM, just because my system RAM isn't big enough, right?
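For scale, the usual back-of-envelope for KV cache size is 2 (K and V) × layers × kv_heads × head_dim × context × bytes per element. A sketch with made-up hyperparameters, since I don't know Gemma's actual layer/head counts, and treating q8_0 as roughly 1 byte per element:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float) -> float:
    """KV cache = 2 tensors (K and V) per layer, each n_kv_heads*head_dim wide
    and n_ctx tokens long."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical hyperparameters -- NOT Gemma's real config, just for scale.
gb = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                    n_ctx=102_400, bytes_per_elem=1.0) / 1024**3
print(f"~{gb:.1f} GB of KV cache at 100k context")
```

The point is that the KV cache alone is on the order of single-digit GB even quantized to q8_0, so it's plausible to fit it in VRAM, which makes the tens of GB of *system* RAM growth the surprising part.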
Running with params: `-ngl 999 -c 102400 -fa on --cache-type-k q8_0 --cache-type-v q8_0 --temp 1.0 --top-k 64 --top-p 0.95`