Please help me improve CPU-only inference speed

This is a request for help from people who want to run very large models locally at Q8 or better quants, whatever the cost; in my case, the cost is inference speed.

So I have 512 GB of DDR4-2666 ECC RAM with a Threadripper PRO 3945WX, which gives me ca. 5-7 tok/s for MiniMax-2.7 on the llama.cpp CPU backend. Yes, it probably feels like torture for the ADHD generation, but I'm using it for processing LARGE specs and for planning, and it steers a Qwen-3.6-27B for implementation and testing. Of course I tried low-bit quants first, but the drop in output quality wasn't worth the marginal gain in speed.
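
For reference, here is roughly how I collect those tok/s numbers; a minimal sketch driving llama-bench from Python (the model path is a placeholder, and I'm assuming the llama-bench binary from a llama.cpp build is on PATH):

```python
import subprocess

# Sweep thread counts with llama-bench before assuming more cores == more
# speed. The model path below is a placeholder, not my actual file.
MODEL = "/models/minimax-q8_0.gguf"

# llama-bench takes a comma-separated list of thread counts (-t) and
# reports pp (prompt processing) and tg (token generation) rates.
result = subprocess.run(
    ["llama-bench", "-m", MODEL, "-t", "6,8,10,12", "-p", "512", "-n", "128"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```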

So I was wondering if anyone has "tricks", unmerged PRs, or hidden gems (I get that CPU-only inference is not the most popular topic right now, but maybe there are some half-forgotten GitHub repos out there) to maximize inference throughput without sacrificing weight precision.
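
To be concrete about what I've already tried: the usual CPU-side knobs, shown here as a minimal sketch via the llama-cpp-python bindings (the same options exist as llama-server/llama-cli flags; the path and values are placeholders for my 12-core box, not tuned recommendations):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Standard CPU-side tuning knobs; values are examples, not recommendations.
llm = Llama(
    model_path="/models/minimax-q8_0.gguf",  # placeholder path
    n_ctx=32768,          # large context for spec processing
    n_threads=12,         # generation threads ~= physical cores
    n_threads_batch=12,   # prompt-processing threads
    n_batch=512,          # logical batch size for prompt processing
    use_mmap=True,        # mmap the GGUF instead of copying it
    use_mlock=True,       # pin weights in RAM, avoid page-outs
)

out = llm("Summarize the attached spec section:", max_tokens=128)
print(out["choices"][0]["text"])
```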

Another topic of interest is upgrading the bottom-of-the-barrel CPU to a 5975WX. Everyone emphatically says inference speed is memory-bandwidth-bound, yet I see all cores at 100% load during prompt processing and during generation. Even the cloud models give contradictory answers here, from "no significant increase" to "double the speed". I really want to hear it from someone who has actually done this.
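
My back-of-the-envelope math for why I'm unsure the upgrade helps, assuming token generation streams every active weight from RAM once per token; the channel count, DIMM speed, and active-parameter figure below are my assumptions, not measurements:

```python
# Rough ceiling for token generation under the assumption that each
# active weight is read from RAM once per token.
channels = 8            # assumption: 8-channel DDR4 on Threadripper PRO
mt_per_s = 2666e6       # assumption: my ECC DIMMs run at DDR4-2666
bytes_per_beat = 8      # 64-bit channel width
bandwidth = channels * mt_per_s * bytes_per_beat   # ~170 GB/s theoretical

active_params = 10e9    # placeholder: active params per token (MoE)
bytes_per_param = 1.0   # ~1 byte/weight at Q8_0, ignoring small overhead

ceiling_tps = bandwidth / (active_params * bytes_per_param)
print(f"theoretical peak: {bandwidth / 1e9:.0f} GB/s, "
      f"tg ceiling: {ceiling_tps:.1f} tok/s")
# If both CPUs sit on the same 8-channel DDR4 platform, a 5975WX adds
# cores (12 -> 32) but little bandwidth, so generation gains may be
# modest while compute-bound prompt processing could scale much better.
```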

submitted by /u/HumanDrone8721
