MiniMax M2.7 AWQ-4bit on 2x Spark vs 2x RTX 6000 96GB - performance and energy efficiency

Hello,

This model/quant is my daily driver, and I wanted some reference benchmarks to compare my setup against one that is about 3x more expensive and 4x more power hungry.

Results first, methodology after, and a link to all the results at the end.

Model: cyankiwi/MiniMax-M2.7-AWQ-4bit

Results (c1)

https://preview.redd.it/dzp6qzfc0pyg1.png?width=858&format=png&auto=webp&s=368debb16760ecaaf8d5bd4013bfeaa5ef940a69

https://preview.redd.it/2gziemld0pyg1.png?width=859&format=png&auto=webp&s=84e2f3c389013854734fecf89a25d1dd095f4d62

(tried to upload the table as text, didn't work as expected)

So to my surprise, the Spark cluster isn't that far behind. On average the 2x RTX 6000 is 2.7x faster on prompt processing and 4.88x faster on token generation, for a price difference of around 2.9x.
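To put the speed and price ratios side by side, here is a minimal sketch of the arithmetic. The purchase prices are the approximate "to own" figures from the config sections of this post (~$20K and ~$7K); the function name is only for illustration.

```python
# Approximate purchase prices from the config sections of this post.
RTX_COST_USD = 20_000
SPARK_COST_USD = 7_000

# Average speedups of the 2x RTX 6000 over the 2x Spark cluster (c1).
PREFILL_SPEEDUP = 2.7
GENERATION_SPEEDUP = 4.88


def speedup_per_price(speedup: float, price_ratio: float) -> float:
    """Return how much speedup you get per unit of extra price.

    A value above 1 means the faster setup is also the better deal
    on that metric; below 1 means the cheaper setup wins per dollar.
    """
    return speedup / price_ratio


price_ratio = RTX_COST_USD / SPARK_COST_USD

print(f"price ratio:            {price_ratio:.2f}x")
print(f"prefill per dollar:     {speedup_per_price(PREFILL_SPEEDUP, price_ratio):.2f}")
print(f"generation per dollar:  {speedup_per_price(GENERATION_SPEEDUP, price_ratio):.2f}")
```

Read this way, the Spark cluster is roughly on par per dollar for prefill, while the RTX setup comes out ahead per dollar for token generation. This ignores the rest of the machine, rental pricing, and resale value.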

Energy consumption is very close once normalized to 1M tokens, and at $0.10/kWh, you get:

(you can change your energy price on the link I added)
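The cost per 1M tokens is just power multiplied by the time needed to produce 1M tokens. Below is a minimal sketch of that calculation. The power figures are the ones from the config sections (1450W estimated for the RTX box, 365W average for the Sparks), but the Spark throughput is a made-up placeholder and not a measured result, so plug in the numbers from the tables.

```python
def energy_cost_per_million_tokens(
    power_watts: float,
    tokens_per_second: float,
    price_per_kwh: float = 0.10,
) -> float:
    """Electricity cost (in the currency of price_per_kwh) for 1M tokens."""
    seconds = 1_000_000 / tokens_per_second
    kwh = power_watts * seconds / 3_600_000  # watt-seconds -> kWh
    return kwh * price_per_kwh


RTX_POWER_W = 1450    # estimated, from the RunPod config section
SPARK_POWER_W = 365   # average, from the Spark config section

# HYPOTHETICAL throughput, for illustration only (not a benchmark result).
HYPOTHETICAL_SPARK_TPS = 20.0
# Apply the average 4.88x generation speedup reported in this post.
hypothetical_rtx_tps = HYPOTHETICAL_SPARK_TPS * 4.88

spark_cost = energy_cost_per_million_tokens(SPARK_POWER_W, HYPOTHETICAL_SPARK_TPS)
rtx_cost = energy_cost_per_million_tokens(RTX_POWER_W, hypothetical_rtx_tps)

print(f"Spark: ${spark_cost:.3f} per 1M generated tokens")
print(f"RTX:   ${rtx_cost:.3f} per 1M generated tokens")
print(f"RTX / Spark energy ratio: {rtx_cost / spark_cost:.2f}")
```

The ratio on the last line does not depend on the placeholder throughput: it is the power ratio divided by the speed ratio. With the average figures from this post, that works out to roughly 0.8x for token generation (4.88x faster) and roughly 1.5x for prefill (2.7x faster), which is why the two setups end up close overall.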

Results (c2)

https://preview.redd.it/eid3d8rm0pyg1.png?width=858&format=png&auto=webp&s=471f80aa92fc9968177e40e53b6bb000eb3a214d

https://preview.redd.it/drz219on0pyg1.png?width=859&format=png&auto=webp&s=eac3cd8e3617a90b4887090a32282fbacd6af923

https://preview.redd.it/voqn4fro0pyg1.png?width=1741&format=png&auto=webp&s=06c656bb1ef7826480db3595b9eb32adf130be13

At two requests in parallel, it gets a bit weird (all benchmarks at each context size are run 3 times and averaged).

Well, I don't have all the explanations, so tell me if I'm doing something wrong haha. But with parallel requests at high context, we're hitting the limit of what the KV cache can hold at once, so requests get throttled and that destroys performance.
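As a rough illustration of that KV-cache limit, here is a minimal sketch that counts how many requests of a given depth fit in the cache at once. The capacity value is a hypothetical placeholder (the real one depends on the hardware, the model, and `--gpu-memory-utilization`), and the helper name is made up for this example. It also treats the benchmark depth as the whole prompt, which is a simplification.

```python
def requests_that_fit(
    kv_capacity_tokens: int,
    prompt_tokens: int,
    generated_tokens: int,
) -> int:
    """Number of requests whose full context fits in the KV cache at once."""
    tokens_per_request = prompt_tokens + generated_tokens
    return kv_capacity_tokens // tokens_per_request


# HYPOTHETICAL KV-cache capacity in tokens, for illustration only.
HYPOTHETICAL_KV_CAPACITY = 200_000

# Depths and generation length taken from the benchmark command in this post.
DEPTHS = (0, 4096, 8192, 16384, 32768, 65536, 131072)
GENERATED_TOKENS = 512

for depth in DEPTHS:
    fit = requests_that_fit(HYPOTHETICAL_KV_CAPACITY, depth, GENERATED_TOKENS)
    note = "" if fit >= 2 else "  <- 2 parallel requests no longer fit"
    print(f"depth {depth:>6}: {fit} concurrent request(s){note}")
```

With this placeholder capacity, two requests stop fitting at the 131072 depth, at which point the server would have to queue or preempt one of them instead of running both together. Whether that is what actually happens in these runs is my guess, not something I verified.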

RunPod config

GPUs: 2xRTX PRO 6000 96GB
Cost: rent $3.78/hour (cheaper options exist) (or ~$20K to own)
Image: vLLM Latest (vllm/vllm-openai:latest)
Time to get the model running: ~5-10 minutes (depends mostly on the 130GB to download from HF)
Storage: only "Container disk" at 160GB, others at 0 (no need for persistent storage, which is very expensive)
"Container start command" (to reproduce)

cyankiwi/MiniMax-M2.7-AWQ-4bit --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 --gpu-memory-utilization=0.95 --trust-remote-code --kv-cache-dtype fp8_e4m3 --enable-auto-tool-choice --tool-call-parser minimax_m2

Power consumption (estimated): 1450W (I may have overshot this and am happy to correct it; it assumes some kind of Threadripper CPU)

Spark config

2x Asus Ascent GX10
Cost: ~$7K to own (rent options limited)
Power consumption: 365W average (idles at 100W with model ready to go - which is quite bad imo)

Benchmark

uvx llama-benchy --base-url https://{pod_id}-8000.proxy.runpod.net/v1 --depth 0 4096 8192 16384 32768 65536 131072 --latency-mode generation --concurrency 1 2 --tg 512

(I tested with more concurrency, but I focused my analysis on 1 and 2 concurrent requests, results available here: https://nicefox.net/benchmarks/minimax-m2.7-awq-4bit/benchmarks_concurrency.md )

Conclusion

Well... Prefill is only 2.7x faster and token generation is 4.9x faster, and both setups display similar energy efficiency. My bet is that the Max-Q version would be very energy efficient.

The main difference is that the Spark cluster is my daily driver, so I spent time making it better and ensuring I had the best setup possible, while for the RTX 6000 I "just" launched the vLLM image from RunPod with the same parameters. I know there is optimization left to be done there.

I'm very interested in the 2x RTX 6000 setup because I'm working with a small company to set it up properly on-prem for their devs, so I'm happy to re-bench with other params if people give me a better setup for it.

You can find more details here (it's just the data compiled): https://nicefox.net/benchmarks/minimax-m2.7-awq-4bit/

submitted by /u/t4a8945