Anyone running Kimi on low VRAM + offloading to RAM? (I'm sure most are)

I'm curious how much output token throughput benefits from something smaller like a 12GB Tesla T4, with the remainder of the model offloaded to RAM.
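For context, this is the kind of partial-offload setup I mean. A hypothetical llama.cpp invocation (the model filename is illustrative; `-ngl`, `--no-mmap`, and `-t` are real llama.cpp flags, but the specific values are guesses to tune per system):

```shell
# Sketch of partial GPU offload with llama.cpp (not a verified working config).
# -ngl controls how many transformer layers go to the GPU; everything else
# stays in system RAM. With ~12GB of VRAM, only a handful of layers of a
# large Kimi quant will fit.
./llama-cli -m ./Kimi-K2-Instruct-Q4_K_M.gguf \
  -ngl 8 \      # layers offloaded to the GPU; raise until VRAM is full
  --no-mmap \   # load weights into RAM instead of paging from disk
  -t 48         # one thread per physical core
```

Even a small `-ngl` mostly helps prompt processing; generation speed is usually bound by how fast the CPU side can stream the resident weights.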

I get about ~1.6 t/s output and ~20 t/s input CPU-only, which is obviously terrible. I'm using NUMA; I have dual Xeon Platinum 24-core CPUs (so 48c/96t) and 1.5TB of RAM.
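For anyone comparing dual-socket setups, here's roughly what my NUMA configuration looks like. This is a sketch, assuming llama.cpp (`--numa distribute` is a real llama.cpp option, and `numactl --interleave=all` is the standard way to spread page allocation across nodes); exact values will vary:

```shell
# NUMA-aware CPU inference sketch for a dual-socket box (llama.cpp assumed).
# Interleaving page allocation keeps weights from piling up on one socket's
# memory controller; --numa distribute spreads compute across both nodes.
numactl --interleave=all ./llama-cli \
  -m ./Kimi-K2-Instruct-Q8_0.gguf \
  --numa distribute \
  -t 48   # one thread per physical core across both sockets
```

Cross-socket memory bandwidth is often the bottleneck here, which may also be part of why quant size doesn't map cleanly to speed on these systems.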

Strangely enough, the Q8 model from Unsloth runs slightly faster than the Q4 model on my system.

submitted by /u/Creative-Type9411
