After some sglang patching and countless experiments, I managed to get the REAP-ed NVFP4 version running stable and FAST on 4x RTX 6000 Pros (power-limited to 350W). Very happy with both performance and quality. Inference software is still under-optimized for these cards, and I think we will see their true potential unfold this year or early next year.

Throughput by Context Depth

Prefilled context	PP@4096 (tok/s)	TG@512 (tok/s)
0	2229.0	42.03
4K	1943.6	41.41
16K	1558.9	39.72
32K	1234.2	38.19
64K	863.5	35.87
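
For anyone who wants to turn the table into wall-clock numbers, here is a rough sketch. It assumes PP@4096 means processing a 4096-token prompt and TG@512 means generating 512 tokens, each on top of the prefilled context, and that 4K/16K/32K/64K are powers of two. The `summarize` helper and its field names are made up for illustration and are not from any benchmark tool.

```python
# Figures copied from the table above:
# (prefilled context in tokens, prompt processing tok/s, token generation tok/s)
RESULTS = [
    (0, 2229.0, 42.03),
    (4_096, 1943.6, 41.41),
    (16_384, 1558.9, 39.72),
    (32_768, 1234.2, 38.19),
    (65_536, 863.5, 35.87),
]

PROMPT_TOKENS = 4096  # assumption: taken from the "PP@4096" column name
GEN_TOKENS = 512      # assumption: taken from the "TG@512" column name


def summarize(results):
    """Convert throughput into seconds and into a fraction of zero-depth speed."""
    _, base_pp, base_tg = results[0]
    rows = []
    for depth, pp, tg in results:
        rows.append({
            "depth": depth,
            # Time for the 4096-token prompt only, not the prefilled context.
            "prefill_s": PROMPT_TOKENS / pp,
            # Time to generate 512 tokens at this depth.
            "gen_s": GEN_TOKENS / tg,
            # Share of zero-depth throughput that survives at this depth.
            "pp_retained": pp / base_pp,
            "tg_retained": tg / base_tg,
        })
    return rows


if __name__ == "__main__":
    for row in summarize(RESULTS):
        print(
            f"{row['depth']:>6} ctx: "
            f"prefill {row['prefill_s']:.2f}s, gen {row['gen_s']:.2f}s, "
            f"PP {row['pp_retained']:.0%}, TG {row['tg_retained']:.0%} of zero-depth"
        )
```

The takeaway is that prompt processing falls off much faster with depth than generation does: at 64K, PP keeps roughly 39% of its zero-depth speed while TG keeps roughly 85%.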

TG Peak (burst throughput)

Prefilled context	TG peak (tok/s)
0	43.00
4K	42.00
16K	40.00
32K	39.00
64K	37.00

The overall experience with opencode is pretty close to Sonnet + Claude Code. Sessions of 100-200K tokens are stable.

Will play with different concurrency settings this weekend.

Anyone seen better performance on this hardware?

submitted by /u/val_in_tech

GLM 5.1 Locally: 40tps, 2000+ pp/s
