moonshotai/Kimi-K2.6 int4 @ 9.7 tok/s (output of 136 tok) and 263 tok/s (input of 14,564 tok) on vllm-gfx906-mobydick

GitHub link of the vLLM fork: https://github.com/ai-infos/vllm-gfx906-mobydick

Power draw: ~640 W (idle) / ~4800 W (peak inference)

Is it worth it? No, unless you've got solar panels or free energy…

Setup details: https://github.com/ai-infos/guidances-setup-16-mi50-deepseek-v32

Command I run: the script "openai_server_kimi.py" is just based on the official vLLM torchrun example, modified to support the OpenAI API (and not really optimized; the default vLLM command that uses torchrun didn't work for me, and that needs more investigation to debug). I can share it on GitHub too if there's any interest, but it needs more optimization first.

PS: I still haven't done a full setup guide for this because I'm not quite satisfied with the performance. First, this setup runs at PCIe gen3 x8 and gen4 x4; all links are supposed to reach ~7 GB/s, but one only gets 3.5 GB/s (due to riser instability…).

Theoretically, if I manage a new setup with max PCIe bandwidth (28 GB/s at x16, or 14 GB/s at x8), in TP8 PP4 (or TP4 PP8), and with an optimized vLLM software stack, I believe we could jump to 600-1000 tok/s prompt processing and 9-12 tok/s token generation (without MTP). At that point this setup might be interesting compared to a hybrid setup (DDR5 + RTX 6000 Pro, etc.), but I think I'm done with all of it and might just enjoy small models, which are much faster on smaller setups.

Feel free to ask any questions and/or share any comments.
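For anyone weighing the "is it worth it" question, here's a back-of-envelope energy-per-token calculation using only the figures quoted above (peak ~4800 W, 9.7 tok/s output); it's a rough upper bound since the rig won't sit at peak draw for the whole generation:

```python
# Rough energy cost per generated token, using the post's figures.
PEAK_W = 4800.0    # peak inference power draw (watts = joules/second)
TG_TOK_S = 9.7     # measured token-generation rate (tok/s)

joules_per_token = PEAK_W / TG_TOK_S               # J per output token
wh_per_1k_tokens = joules_per_token * 1000 / 3600  # 1 Wh = 3600 J

print(f"~{joules_per_token:.0f} J/token, ~{wh_per_1k_tokens:.0f} Wh per 1k output tokens")
```

At typical residential electricity prices that works out to a few cents per thousand output tokens, which is the "solar panels or free energy" point above in numbers.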
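To sanity-check the PCIe figures mentioned above, a quick sketch of the theoretical per-link bandwidth from the per-lane signaling rates (gen3 is 8 GT/s with 128b/130b encoding, gen4 doubles that; these are link-layer ceilings, and real DMA throughput lands a bit lower, so the rounder numbers in the post are conservative):

```python
# Theoretical one-direction PCIe payload bandwidth per link.
GT_PER_S = {3: 8.0, 4: 16.0}   # giga-transfers/s per lane, by PCIe generation
ENCODING = 128 / 130           # 128b/130b line-coding efficiency (gen3+)

def pcie_gb_s(gen: int, lanes: int) -> float:
    """Raw bandwidth in GB/s for a given generation and lane count."""
    return GT_PER_S[gen] * ENCODING * lanes / 8  # /8: bits -> bytes

print(f"gen3 x8 : {pcie_gb_s(3, 8):.2f} GB/s")   # ~7.9, the post's ~7 GB/s links
print(f"gen4 x4 : {pcie_gb_s(4, 4):.2f} GB/s")   # same ceiling as gen3 x8
print(f"gen4 x16: {pcie_gb_s(4, 16):.2f} GB/s")  # the full-bandwidth target
```

This also shows why the one riser negotiating down to 3.5 GB/s hurts: in TP, every all-reduce is gated by the slowest link.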