As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build because of that stat line, and also because of an unusual part, Intel Optane Persistent Memory, which I haven’t seen anyone use in an LLM inference build before.

Optane PMem is a DIMM form factor memory module that sits somewhere between DRAM and an SSD in terms of speed and capacity. Intel has discontinued the line, and I found sticks on the secondhand market for much less than the equivalent DRAM capacity would cost. It is this large PMem capacity (768GB) that lets me host such large models on my system. For my build I ran the PMem in Memory Mode, where the PMem is presented to the system as ordinary RAM and the DRAM sticks act as a cache in front of it.

Kimi K2.5’s mixture-of-experts architecture makes it an ideal test model for this build. To get these results I used hybrid GPU/CPU inference with llama.cpp. With Unsloth’s Q2_K_XL quant, Kimi K2.5’s attention weights, the dense layer, the shared expert in each MoE layer, and the routing components all fit on my 12GB GPU using llama.cpp’s “override-tensor” flag, although I also got pretty good results just using llama.cpp’s “ngl auto” and “cmoe” flags and letting llama.cpp decide tensor placement on its own. Either way, the sparse experts’ weights (the bulk of the model size) live on PMem/DRAM and get processed from there as needed. (There’s a rough sketch of the kind of command I mean at the end of this post.)

The end result from my testing with this setup is around 4 tokens per second for generation! Given that this is a trillion parameter frontier-class model running on such a limited hardware budget, I consider that a great success.

It’s a shame Intel discontinued Optane Persistent Memory, because the current direction of some local inference innovation, including SSD offloading and broader memory tiering approaches, could have been really interesting with this specific kind of memory tier on modern hardware platforms.

Overall I was pleased with this Optane PMem-centric build: it lets me run very big models at surprisingly acceptable speeds, and the process was highly educational.

Parts:

- Intel Xeon Gold 6246 CPU
- TYAN S5630GMRE-CGN motherboard
- ASUS Dual GeForce RTX 3060 OC 12GB GPU
- 6x 32GB Samsung 2666MHz DDR4 ECC DRAM sticks
- 6x 128GB Intel Optane DCPMM PC4-2666 NMA1XBD128GQS persistent memory modules
- Western Digital WD SN850X 2TB M.2 2280 NVMe SSD
- ASRock Steel Legend SL-850G 850W 80 PLUS GOLD & Cybenetics PLATINUM Full Modular Power Supply
- Silverstone SST-GD08B (Black) Grandia Series Home Theater PC Case

I hope you enjoyed this rundown. There is a lot more detail that I didn’t include here, so I’m happy to answer questions about the build, the configuration, or the reasoning behind any of the component choices in the comments. Also, if anyone else has explored similarly unusual hardware or builds for LLM inference, I’d love to discuss!
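For anyone curious about the llama.cpp side, here is a rough, hypothetical sketch of the kind of invocation described above. The model path, context size, and thread count are placeholders, the tensor-name regex is the pattern commonly shared for keeping MoE expert tensors on CPU rather than something verified against this exact quant, and flag spellings can differ between llama.cpp versions, so treat it as illustrative rather than a copy-paste recipe:

```
# Hypothetical sketch -- placeholder paths and values, not my exact command.
# Idea: keep the huge sparse MoE expert tensors on the CPU side (which, with
# PMem in Memory Mode, means PMem-backed system RAM with DRAM as a cache),
# while attention, the dense layer, shared experts, and routing weights are
# offloaded to the 12GB GPU.
./llama-cli \
  -m ./kimi-k2.5-unsloth-q2_k_xl.gguf \
  -ngl 99 \
  --override-tensor "\.ffn_.*_exps\.=CPU" \
  -c 8192 \
  -t 24 \
  -p "Hello"

# The simpler alternative mentioned above: let llama.cpp choose placement and
# push the MoE expert tensors to CPU with its convenience flag (the "cmoe"
# flag I referred to; in recent builds the long form is --cpu-moe).
./llama-cli -m ./kimi-k2.5-unsloth-q2_k_xl.gguf -ngl 99 --cpu-moe -c 8192
```

Either way, the idea is the same: the weights touched on every token stay on the GPU, while the much larger sparse expert weights sit in PMem-backed system memory and are read through the CPU as the router selects them.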