Qwen3.6 35B A3B UD IQ4_NL_XL. 512k total context shared across 4 parallel slots, with the K cache quantized to q8_0 and the V cache to q4_0.
I estimated that full VRAM plus ~18GB of my RAM would be used, but I'm not sure, and fuckass Windows is showing 50.1GB of memory committed (against 32GB physical), though that figure includes every other app and committed memory isn't necessarily in use.
I've already set --mlock for llama-server, but I also want to make sure other apps stay out of the page file ~99% of the time, since I don't think it's worth wearing out my SSD in the long run. I won't be using my desktop for anything else while it's running.
How do I estimate the total memory usage? Am I being unrealistic with my hardware, and am I torturing it with a model and context this large?
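For what it's worth, a rough KV-cache estimate can be sketched from the usual formula: bytes = 2 tensors (K and V) × n_layers × n_kv_heads × head_dim × n_ctx × bytes-per-element, where llama.cpp's block quants cost about 34 bytes per 32 values for q8_0 and 18 bytes per 32 for q4_0. The layer/head/dim numbers below are hypothetical placeholders, not this model's real shape; the real values are in the GGUF metadata. Note that in llama-server, `-c` is the total context shared by all `--parallel` slots, so 4 slots don't multiply it.

```python
# Rough KV-cache size estimate for a llama.cpp-style cache.
# Per-element byte costs include block overhead:
#   q8_0: 34 bytes per 32 values, q4_0: 18 bytes per 32 values.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, k_type: str = "q8_0", v_type: str = "q4_0") -> int:
    """Total KV-cache size in bytes for one context of n_ctx tokens."""
    elems_per_token = n_layers * n_kv_heads * head_dim  # per tensor (K or V)
    k_bytes = elems_per_token * BYTES_PER_ELEM[k_type]
    v_bytes = elems_per_token * BYTES_PER_ELEM[v_type]
    return int(n_ctx * (k_bytes + v_bytes))

# Hypothetical GQA shape: 48 layers, 4 KV heads, head_dim 128, 512k context.
size = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, n_ctx=512 * 1024)
print(f"{size / 2**30:.1f} GiB")  # ~19.5 GiB for this assumed shape
```

Total memory is then roughly the GGUF file size (weights) + this KV cache + a few GiB of compute/scratch buffers, compared against VRAM plus whatever llama.cpp spills to RAM.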