GLM 5.1 Locally: 40tps, 2000+ pp/s

After some SGLang patching and countless experiments, I managed to get the REAP-pruned NVFP4 version running stably and fast on 4 x RTX 6000 Pros (power-limited to 350W). I'm very happy with the performance and quality. Inference software is still under-optimized for these cards; I think we will see their true potential unfold later this year or early next year.
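For anyone wanting to reproduce the power cap, here is a rough sketch of how it could be scripted. It assumes the 350W limit is applied per card, that the four GPUs are indices 0-3, and that `nvidia-smi` is on the PATH; the helper names are made up for illustration. It only uses the standard `nvidia-smi -i <index> -pl <watts>` flags, and actually applying the limit normally needs root.

```python
import subprocess


def power_limit_commands(num_gpus=4, watts=350):
    """Build one nvidia-smi invocation per GPU index (assumed indices 0..num_gpus-1)."""
    return [["nvidia-smi", "-i", str(i), "-pl", str(watts)] for i in range(num_gpus)]


def apply_power_limits(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False (usually requires root)."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default so nothing changes until the commands have been checked.
    apply_power_limits(power_limit_commands())
```

The limit set this way does not survive a reboot, so it would need to go into a startup script or service if it should be permanent.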

Throughput by Context Depth

Prefilled context   PP@4096 (tok/s)   TG@512 (tok/s)
0                   2229.0            42.03
4K                  1943.6            41.41
16K                 1558.9            39.72
32K                 1234.2            38.19
64K                 863.5             35.87
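To put these numbers in perspective, here is a small sketch that works out how much throughput is retained as the context grows, and roughly how long one turn would take. It assumes PP@4096 means prefill speed over a 4096-token prompt and TG@512 means generation speed over 512 output tokens, that "4K" means 4096 tokens (and so on), and that a turn is simply prompt processing followed by generation with no other overhead. The function names are illustrative only.

```python
# Figures copied from the table: (prefilled context tokens, PP tok/s, TG tok/s).
# Context sizes assume "4K" = 4096 tokens, "16K" = 16384, and so on.
RESULTS = [
    (0, 2229.0, 42.03),
    (4096, 1943.6, 41.41),
    (16384, 1558.9, 39.72),
    (32768, 1234.2, 38.19),
    (65536, 863.5, 35.87),
]


def retention(results):
    """Return (context, PP %, TG %) relative to the empty-context row."""
    _, base_pp, base_tg = results[0]
    return [
        (ctx, round(100 * pp / base_pp, 1), round(100 * tg / base_tg, 1))
        for ctx, pp, tg in results
    ]


def turn_seconds(pp, tg, prompt_tokens=4096, output_tokens=512):
    """Rough time for one turn: prefill the prompt, then generate the reply."""
    return prompt_tokens / pp + output_tokens / tg


if __name__ == "__main__":
    for (ctx, pp, tg), (_, pp_pct, tg_pct) in zip(RESULTS, retention(RESULTS)):
        secs = turn_seconds(pp, tg)
        print(f"{ctx:>6} ctx: PP {pp_pct:5.1f}%  TG {tg_pct:5.1f}%  ~{secs:.1f}s per turn")
```

The takeaway is that prefill speed falls off much faster with context depth than generation speed does, so long sessions mostly cost you on prompt processing.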

TG Peak (burst throughput, tok/s): 43.00, 42.00, 40.00, 39.00, 37.00

The overall experience with opencode is pretty close to Sonnet + Claude Code. Sessions with 100-200k tokens of context are stable.

Will play with different concurrency settings this weekend.

Anyone seen better performance on this hardware?

submitted by /u/val_in_tech