Google published Ironwood inference benchmarks in their AI-Hypercomputer/tpu-recipes repo. Nvidia has InferenceMAX numbers for B200. Nobody has compared them head-to-head under identical conditions. Ironwood skipped MLPerf v6.0, so there's no neutral standard either.
I rented B200s on Vast.ai and ran exactly the same FP8 configs Google published, on two models: Qwen3-32B (dense) and Qwen3-Coder-480B-A35B (MoE). Same quantization (FP8 e4m3 weights + activations + KV cache), same sequence lengths, same concurrency, same prompt count, same seed — every argument copied from Google's recipe YAML.
The finding: whichever chip is "faster per chip" depends entirely on the model.
Why the flip (speculation):
- 32B dense is essentially one big matmul pipeline — Ironwood's mature TPU kernels excel at this, and at TP=2 across the two TensorCores inside a single chip there's almost no collective traffic.
- 480B MoE is 128 experts / 8 active per token — most runtime is expert routing + dispatch. SGLang on B200 uses DeepGEMM + NVLink all-to-all; vLLM-on-TPU dispatches through XLA's HLO. The 80% B200 win at 8k/1k looks like SGLang's MoE dispatch being meaningfully better optimized, not a raw-hardware gap.
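To make the routing claim concrete, here's a minimal NumPy sketch of top-k expert gating for a 128-expert / 8-active MoE layer — not the actual SGLang/DeepGEMM or XLA code, just the step whose surrounding token dispatch/gather is where the two stacks differ:

```python
import numpy as np

def topk_route(hidden, gate_w, k=8):
    """Top-k expert routing: per token, pick k of E experts and softmax
    their gate logits. Returns expert ids and mixing weights."""
    logits = hidden @ gate_w                              # [tokens, E]
    idx = np.argpartition(logits, -k, axis=-1)[:, -k:]    # k expert ids/token
    picked = np.take_along_axis(logits, idx, axis=-1)
    w = np.exp(picked - picked.max(-1, keepdims=True))    # stable softmax
    w /= w.sum(-1, keepdims=True)
    return idx, w

rng = np.random.default_rng(0)
tokens, d_model, n_experts = 64, 32, 128
idx, w = topk_route(rng.standard_normal((tokens, d_model)),
                    rng.standard_normal((d_model, n_experts)))
# Each token activates exactly 8 of the 128 experts; shipping tokens to
# (and gathering results from) those experts is the all-to-all that
# dominates MoE serving cost at this scale.
```

How well that dispatch is fused and overlapped with compute is exactly the kind of software gap that could produce an 80% delta on identical weights.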
Config match (identical unless noted):
- Same HF FP8 checkpoints (Qwen/Qwen3-32B-FP8, Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8)
- FP8 e4m3 weights + activations + KV cache on both sides
- Benchmark args: random dataset, --random-range-ratio 0.8, --num-prompts 320, --max-concurrency 64, --seed 100, --ignore-eos
- Serving stacks: SGLang 0.5.10 on B200 (state-of-the-art for Blackwell per the vLLM team + InferenceMAX collaborators), vLLM-on-TPU on Ironwood (Google's default for this workload)
- 32B: 1 chip vs 1 GPU. 480B: 4 chips vs 4 GPUs.
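Since the 480B runs use 4 chips vs 4 GPUs, any headline comparison has to be normalized per chip. A trivial helper for the arithmetic (the numbers in the usage line are placeholders, NOT measured results):

```python
def per_chip(tok_per_s: float, n_chips: int) -> float:
    """Normalize aggregate decode throughput to a per-chip figure,
    so a 4-chip Ironwood slice and a 4-GPU B200 node compare 1:1."""
    return tok_per_s / n_chips

def speedup(a_tok_s: float, a_chips: int, b_tok_s: float, b_chips: int) -> float:
    """Relative per-chip speedup of system A over system B."""
    return per_chip(a_tok_s, a_chips) / per_chip(b_tok_s, b_chips)

# Placeholder values purely to show the arithmetic (not measurements):
print(per_chip(4000.0, 4))  # 1000.0 tokens/s per chip
```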
What this isn't:
- Not TensorRT-LLM + FP4 + EAGLE speculative decoding on B200 — that's the real production ceiling; it would widen B200's lead further.
- Not pod-scale. Ironwood's ICI 3D torus shines above the NVL72 B200 domain (~72 GPUs). This is 1-chip and 4-chip slices.
- vLLM-on-TPU MoE routing likely has headroom Google hasn't unlocked.
Reproducible (B200 side):
python3 -m sglang.launch_server \
  --model-path <model> --host 0.0.0.0 --port 8000 \
  --tp {1|4} --trust-remote-code \
  --mem-fraction-static {0.9|0.8} \
  --kv-cache-dtype fp8_e4m3

python3 -m sglang.bench_serving \
  --backend sglang --model <model> \
  --dataset-name random \
  --random-input-len {1024|1024|8192} \
  --random-output-len {1024|8192|1024} \
  --random-range-ratio 0.8 \
  --num-prompts 320 --max-concurrency 64 --seed 100
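The {1024|1024|8192} / {1024|8192|1024} notation expands to three (input, output) length pairs. A small sketch that builds the corresponding bench_serving command strings — <model> is left as a placeholder exactly as above:

```python
# Expand the braced length notation into the three concrete sweeps.
PAIRS = [(1024, 1024), (1024, 8192), (8192, 1024)]

def bench_cmd(input_len: int, output_len: int, model: str = "<model>") -> str:
    """Mirror the bench_serving invocation above for one length pair."""
    return (
        "python3 -m sglang.bench_serving --backend sglang "
        f"--model {model} --dataset-name random "
        f"--random-input-len {input_len} --random-output-len {output_len} "
        "--random-range-ratio 0.8 --num-prompts 320 "
        "--max-concurrency 64 --seed 100"
    )

for i, o in PAIRS:
    print(bench_cmd(i, o))
```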
Ironwood side is Google's published recipe: github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/ironwood/vLLM