I've got a 128GB Strix Halo box. Yesterday I wanted to try out Step-3.5-flash, a model that barely fits on my system as is - I found a bartowski IQ4_XS that's 105GB. With about 150K context it takes it to about 108GB. That leaves about 20GB, minus what Linux is taking, so more like 17GB left.

I ran opencode --continue so I could try this model on a previous conversation. What I noticed was that with each query the memory (monitored in htop) bumped up but never fully dropped back to its previous level. So after a while it was up to 120GB. I figured that maybe doing a /compact would free up some of that memory, but no, it stayed at 120GB. I unloaded the model before the system ran out of memory.
I would have thought that the memory use (weights + KV cache for the context) would be mostly fixed, so it would stay under about 110GB. This gradually increasing memory use seems indicative of a memory leak.
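To pin down whether it's a real leak or just page cache, I've been sampling memory like this - a minimal sketch (assumes Linux, reads /proc/meminfo directly; htop's "used" figure can include reclaimable cache, whereas MemAvailable accounts for it):

```python
def used_mib():
    # Parse /proc/meminfo into a dict of {field: kiB value}
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, val = line.split(":")
            info[key] = int(val.split()[0])  # values are reported in kiB
    # MemTotal - MemAvailable excludes reclaimable caches, unlike MemTotal - MemFree
    used_kib = info["MemTotal"] - info["MemAvailable"]
    return used_kib // 1024

baseline = used_mib()
# ... run a query against the model, then check the delta ...
print(f"delta since baseline: {used_mib() - baseline} MiB")
```

If the delta after each query ratchets up by roughly the same amount and never comes back down even after /compact, that points at a leak rather than caching.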
I'm using the llama.cpp 2.13.0 Vulkan backend through LM Studio.