Inference engine used (vLLM fork): https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main
Hugging Face quant used: cyankiwi/MiniMax-M2.7-AWQ-4bit
Relevant commands to run:
Start the container:

docker run -it --name vllm-gfx906-mobydick-mixa3607 \
  -v ~/llm/models:/models \
  --network host \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --group-add $(getent group render | cut -d: -f3) \
  --ipc=host \
  mixa3607/vllm-gfx906:0.19.1-rocm-7.2.1-aiinfos-20260405173349

Inside the container, launch the server:

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG NCCL_DEBUG=INFO \
vllm serve /llm/models/MiniMax-M2.7-AWQ-4bit \
  --served-model-name MiniMax-M2.7-AWQ-4bit \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --trust-remote-code \
  --max-model-len 196608 \
  --gpu-memory-utilization 0.94 \
  --enable-log-requests \
  --enable-log-outputs \
  --log-error-stack \
  --dtype float16 \
  --tensor-parallel-size 8 \
  --port 8000 2>&1 | tee log.txt

Then run the benchmark:

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG \
vllm bench serve \
  --dataset-name random \
  --random-input-len 10000 \
  --random-output-len 1000 \
  --num-prompts 4 \
  --request-rate 10000 \
  --ignore-eos 2>&1 | tee logb.txt
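Once the server is up, you can smoke-test it over vLLM's OpenAI-compatible API. A minimal sketch, assuming the server is reachable on localhost:8000 as configured above (the payload below only builds the request body; the curl line in the comment shows how it would actually be sent):

```python
import json

# Build a chat-completions payload matching --served-model-name above.
payload = {
    "model": "MiniMax-M2.7-AWQ-4bit",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32,
}
body = json.dumps(payload)

# POST it to the server, e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$body"
print(body)
```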
RESULTS
8xMI50 32GB setup
============ Serving Benchmark Result ============
Successful requests:                     4
Failed requests:                         0
Request rate configured (RPS):           10000.00
Benchmark duration (s):                  125.90
Total input tokens:                      40000
Total generated tokens:                  4000
Request throughput (req/s):              0.03
Output token throughput (tok/s):         31.77
Peak output token throughput (tok/s):    64.00
Peak concurrent requests:                4.00
Total token throughput (tok/s):          349.48
---------------Time to First Token----------------
Mean TTFT (ms):                          37281.45
Median TTFT (ms):                        37480.25
P99 TTFT (ms):                           58355.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          88.39
Median TPOT (ms):                        88.22
P99 TPOT (ms):                           109.47
---------------Inter-token Latency----------------
Mean ITL (ms):                           88.39
Median ITL (ms):                         66.85
P99 ITL (ms):                            73.62
==================================================
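The headline throughput numbers follow directly from the token counts and duration, and the mean ITL gives the effective per-stream decode rate. A quick sanity check of the table's arithmetic:

```python
# Sanity-check the reported numbers from the benchmark table above.
duration_s = 125.90
input_tokens = 40_000
output_tokens = 4_000
mean_itl_ms = 88.39

output_tps = output_tokens / duration_s                   # reported: 31.77 tok/s
total_tps = (input_tokens + output_tokens) / duration_s   # reported: 349.48 tok/s
per_stream_tps = 1000 / mean_itl_ms                       # ~11.31 tok/s per request

print(f"{output_tps:.2f} {total_tps:.2f} {per_stream_tps:.2f}")
```

So with 4 concurrent requests, each stream decodes at roughly 11 tok/s, which aggregates to the ~32 tok/s output throughput shown.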
FINAL NOTES:
To me, performance is "acceptable" for agentic coding use cases, and the output quality is quite good for the model's size. This setup might be a reliable alternative to a 3090-based build (it's much cheaper) or to a CPU/GPU hybrid setup (it's faster in both prefill and decode). Don't hesitate to ask any questions.