Google published Ironwood inference benchmarks in their AI-Hypercomputer/tpu-recipes repo. Nvidia has InferenceMAX numbers for B200. Nobody has compared them head-to-head under identical conditions. Ironwood skipped MLPerf v6.0, so there's no neutral standard either.
I rented B200s on Vast.ai and ran exactly the same FP8 configs Google published, on two models: Qwen3-32B (dense) and Qwen3-Coder-480B-A35B (MoE). Same quantization (FP8 e4m3 weights + activations + KV cache), same sequence lengths, same concurrency, same prompt count, same seed — every argument copied from Google's recipe YAML.
The finding: whichever chip is "faster per chip" depends entirely on the model.
Why the flip (speculation):
- 32B dense is essentially one big matmul pipeline — Ironwood's mature TPU kernels excel at this, and at TP=2 across the two TensorCores inside a single chip there's almost no collective traffic.
- 480B MoE is 128 experts / 8 active per token — most runtime is expert routing + dispatch. SGLang on B200 uses DeepGEMM + NVLink all-to-all; vLLM-on-TPU dispatches through XLA's HLO. The 80% B200 win at 8k/1k looks like SGLang's MoE dispatch being meaningfully better optimized, not a raw-hardware gap.
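To make the routing claim concrete, here's a minimal NumPy sketch of top-k expert gating for a 128-expert / 8-active MoE layer — not the actual SGLang/DeepGEMM or XLA code, just the step whose surrounding token dispatch/gather is where the two stacks differ:

```python
import numpy as np

def topk_route(hidden, gate_w, k=8):
    """Top-k expert routing: per token, pick k of E experts and softmax
    their gate logits. Returns expert ids and mixing weights."""
    logits = hidden @ gate_w                              # [tokens, E]
    idx = np.argpartition(logits, -k, axis=-1)[:, -k:]    # k expert ids/token
    picked = np.take_along_axis(logits, idx, axis=-1)
    w = np.exp(picked - picked.max(-1, keepdims=True))    # stable softmax
    w /= w.sum(-1, keepdims=True)
    return idx, w

rng = np.random.default_rng(0)
tokens, d_model, n_experts = 64, 32, 128
idx, w = topk_route(rng.standard_normal((tokens, d_model)),
                    rng.standard_normal((d_model, n_experts)))
# Each token activates exactly 8 of the 128 experts; shipping tokens to
# (and gathering results from) those experts is the all-to-all that
# dominates MoE serving cost at this scale.
```

How well that dispatch is fused and overlapped with compute is exactly the kind of software gap that could produce an 80% delta on identical weights.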
Config match (identical unless noted):
- Same HF FP8 checkpoints (Qwen/Qwen3-32B-FP8, Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8)
- FP8 e4m3 weights + activations + KV cache on both sides
- Benchmark args: random dataset, --random-range-ratio 0.8, --num-prompts 320, --max-concurrency 64, --seed 100, --ignore-eos
- Serving stacks: SGLang 0.5.10 on B200 (state-of-the-art for Blackwell per the vLLM team + InferenceMAX collaborators), vLLM-on-TPU on Ironwood (Google's default for this workload)
- 32B: 1 chip vs 1 GPU. 480B: 4 chips vs 4 GPUs.
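Since the 480B runs use 4 chips vs 4 GPUs, any headline comparison has to be normalized per chip. A trivial helper for the arithmetic (the numbers in the usage line are placeholders, NOT measured results):

```python
def per_chip(tok_per_s: float, n_chips: int) -> float:
    """Normalize aggregate decode throughput to a per-chip figure,
    so a 4-chip Ironwood slice and a 4-GPU B200 node compare 1:1."""
    return tok_per_s / n_chips

def speedup(a_tok_s: float, a_chips: int, b_tok_s: float, b_chips: int) -> float:
    """Relative per-chip speedup of system A over system B."""
    return per_chip(a_tok_s, a_chips) / per_chip(b_tok_s, b_chips)

# Placeholder values purely to show the arithmetic (not measurements):
print(per_chip(4000.0, 4))  # 1000.0 tokens/s per chip
```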
What this isn't:
- Not TensorRT-LLM + FP4 + EAGLE speculative decoding on B200 — that's the real production ceiling; it would widen B200's lead further.
- Not pod-scale. Ironwood's ICI 3D torus shines above the NVL72 B200 domain (~72 GPUs). This is 1-chip and 4-chip slices.
- vLLM-on-TPU MoE routing likely has headroom Google hasn't unlocked.
Reproducible (B200 side):
python3 -m sglang.launch_server \
  --model-path <model> --host 0.0.0.0 --port 8000 \
  --tp {1|4} --trust-remote-code \
  --mem-fraction-static {0.9|0.8} \
  --kv-cache-dtype fp8_e4m3

python3 -m sglang.bench_serving \
  --backend sglang --model <model> \
  --dataset-name random \
  --random-input-len {1024|1024|8192} \
  --random-output-len {1024|8192|1024} \
  --random-range-ratio 0.8 \
  --num-prompts 320 --max-concurrency 64 --seed 100
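The {1024|1024|8192} / {1024|8192|1024} notation expands to three (input, output) length pairs. A small sketch that builds the corresponding bench_serving command strings — <model> is left as a placeholder exactly as above:

```python
# Expand the braced length notation into the three concrete sweeps.
PAIRS = [(1024, 1024), (1024, 8192), (8192, 1024)]

def bench_cmd(input_len: int, output_len: int, model: str = "<model>") -> str:
    """Mirror the bench_serving invocation above for one length pair."""
    return (
        "python3 -m sglang.bench_serving --backend sglang "
        f"--model {model} --dataset-name random "
        f"--random-input-len {input_len} --random-output-len {output_len} "
        "--random-range-ratio 0.8 --num-prompts 320 "
        "--max-concurrency 64 --seed 100"
    )

for i, o in PAIRS:
    print(bench_cmd(i, o))
```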
Ironwood side is Google's published recipe: github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/ironwood/vLLM