MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — 127.7 tok/s C=1, 2800 peak C=128

Ran a full sweep on Luke Alonso's M2.7 NVFP4 quant. Writing it down for anyone shopping the same setup.

**Hardware:** AsRock Rack B650D4U-2L2T, EPYC 4564P, 128GB DDR5 ECC, 2x RTX PRO 6000 Blackwell (96GB, 600W) behind a C-Payne PM50100 PLX Gen5 switch (PIX topology).

**Software:** SGLang via voipmonitor/sglang:cu130 docker (b12x 0.8.3), modelopt_fp4, bf16 KV, TP=2, Luke's default recipe.

**Decode throughput (ctx=0, 3x mean, 30s/cell):**

| C | agg tok/s | per-req tok/s |
|---|-----------|---------------|
| 1 | 127.7 | 127.7 |
| 8 | 471.6 | 59.0 |
| 32 | 1078.9 | 33.7 |
| 64 | 1695.4 | 26.5 |
| 128 | 2800.2 | 21.9 |

**Prefill (C=1):**

| ctx | TTFT | tok/s |
|-----|------|-------|
| 8K | 0.50s | 17,286 |
| 16K | 0.99s | 16,926 |
| 32K | 2.09s | 15,861 |
| 64K | 4.94s | 13,319 |
| 128K | 13.25s | 9,908 |

No speculative decoding — there's no NEXTN drafter for M2.7 yet. When one ships, expect a meaningful jump at low concurrency.

Long-context cells skip at high concurrency (the KV pool is ~83K tokens with bf16 KV at TP=2). 16K is fine up to about C=8 per request before queue contention kicks in; 128K is C=1-only territory.

Full methodology and caveats:

Thanks to Luke for the kernels + quant, and to Jon for the recent calibration data update on the M2.7 NVFP4 weights.
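For anyone eyeballing the decode table: the per-req column is just aggregate throughput divided by concurrency, and dividing again by C times the C=1 rate gives the fraction of ideal linear scaling you're getting from batching. A small sketch using the numbers from the table above (the variable names are mine, not part of any tooling):

```python
# Decode sweep from the post: concurrency -> aggregate tok/s (ctx=0).
sweep = {1: 127.7, 8: 471.6, 32: 1078.9, 64: 1695.4, 128: 2800.2}

for c, agg in sweep.items():
    per_req = agg / c                  # per-request decode speed
    scaling = agg / (c * sweep[1])     # fraction of ideal linear scaling vs C=1
    print(f"C={c:<3} per-req={per_req:6.1f} tok/s  scaling={scaling:5.1%}")
```

At C=128 this works out to roughly a fifth of ideal scaling, which is the usual shape for a memory-bandwidth-bound MoE decode: aggregate keeps climbing but each request slows down.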