MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — 127.7 tok/s C=1, 2800 peak C=128

Ran a full sweep on Luke Alonso's M2.7 NVFP4 quant. Writing it down for anyone shopping the same setup.

**Hardware:** AsRock Rack B650D4U-2L2T, EPYC 4564P, 128GB DDR5 ECC, 2x RTX PRO 6000 Blackwell (96GB, 600W) behind a C-Payne PM50100 PLX Gen5 switch (PIX topology).

**Software:** SGLang via voipmonitor/sglang:cu130 docker (b12x 0.8.3), modelopt_fp4, bf16 KV, TP=2, Luke's default recipe.

**Decode throughput (ctx=0, 3x mean, 30s/cell):**

| C | agg tok/s | per-req tok/s |
|---|-----------|---------------|
| 1 | 127.7 | 127.7 |
| 8 | 471.6 | 59.0 |
| 32 | 1078.9 | 33.7 |
| 64 | 1695.4 | 26.5 |
| 128 | 2800.2 | 21.9 |

**Prefill (C=1):**

| ctx | TTFT | tok/s |
|-----|------|-------|
| 8K | 0.50s | 17,286 |
| 16K | 0.99s | 16,926 |
| 32K | 2.09s | 15,861 |
| 64K | 4.94s | 13,319 |
| 128K | 13.25s | 9,908 |

No speculative decoding — there's no NEXTN drafter for M2.7 yet. When one ships, expect a meaningful jump at low concurrency.

Long-context cells skip at high concurrency (the KV pool is ~83K tokens with bf16 KV at TP=2). 16K is fine up to about C=8 per request before queue contention kicks in; 128K is C=1-only territory.

Full methodology and caveats:

Thanks to Luke for the kernels + quant, and to Jon for the recent calibration data update on the M2.7 NVFP4 weights.
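For anyone eyeballing the decode table: the per-req column is just aggregate throughput divided by concurrency, and dividing again by C times the C=1 rate gives the fraction of ideal linear scaling you're getting from batching. A small sketch using the numbers from the table above (the variable names are mine, not part of any tooling):

```python
# Decode sweep from the post: concurrency -> aggregate tok/s (ctx=0).
sweep = {1: 127.7, 8: 471.6, 32: 1078.9, 64: 1695.4, 128: 2800.2}

for c, agg in sweep.items():
    per_req = agg / c                  # per-request decode speed
    scaling = agg / (c * sweep[1])     # fraction of ideal linear scaling vs C=1
    print(f"C={c:<3} per-req={per_req:6.1f} tok/s  scaling={scaling:5.1%}")
```

At C=128 this works out to roughly a fifth of ideal scaling, which is the usual shape for a memory-bandwidth-bound MoE decode: aggregate keeps climbing but each request slows down.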