GB200 NVL72 vs B200 on DeepSeek R1 671B: Up to 4.4x Throughput per GPU at 125 tok/s/user

DeepSeek R1 FP4 1k/1k. NVL72's 72-GPU NVLink scale-up fabric lets decode run wide EP up to EP=32, where B200's 8-GPU NVLink island caps out at EP=8 over RoCEv2

SemiAnalysis · 10 min read
Tags: benchmark, gpu, inference, deepseek, nvidia, gb200, b200, nvl72, trtllm, dynamo, wide-ep, disagg

On DeepSeek R1 0528 FP4 1k/1k with Dynamo TRT-LLM + MTP and disaggregated prefill/decode on both SKUs, GB200 NVL72 delivers up to 4.39x throughput per GPU vs B200 at iso-interactivity — peaking at 125 tok/s/user (4,130 tok/s/GPU on GB200 NVL72 vs 941 tok/s/GPU on B200).

NVIDIA GB200 NVL72 connects all 72 GPUs over NVLink 5 at 900 GB/s per GPU uni-directional (1.8 TB/s in Jensen math, counting bidirectional rx + tx). A B200 server connects only 8 GPUs over NVLink; once decode EP needs more than 8 ranks, the all-to-all has to leave the NVLink island and cross ConnectX-7 RoCEv2 Ethernet at 400 Gbit/s per GPU. The per-GPU bandwidth available to any wider-than-8 EP collective therefore drops from 900 GB/s to 50 GB/s, an 18x reduction. DeepSeek R1's 256 routed experts amortize beautifully when the all-to-all stays on NVLink end-to-end across 16 or 32 ranks.
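The 18x figure is plain unit conversion on the nominal link rates quoted above. A minimal sketch in Python (the variable names are illustrative, and protocol overhead on either fabric is ignored):

```python
# Nominal per-GPU link rates quoted in the text; protocol overhead is ignored.
NVLINK5_GBYTES_PER_S = 900.0   # NVLink 5, uni-directional, per GPU
CX7_GBITS_PER_S = 400.0        # ConnectX-7 RoCEv2 Ethernet, per GPU

# Convert the Ethernet line rate from Gbit/s to GB/s (8 bits per byte).
cx7_gbytes_per_s = CX7_GBITS_PER_S / 8.0

# Ratio of scale-up (NVLink) to scale-out (RoCEv2) bandwidth per GPU.
bandwidth_ratio = NVLINK5_GBYTES_PER_S / cx7_gbytes_per_s

print(f"RoCEv2: {cx7_gbytes_per_s:.0f} GB/s per GPU")
print(f"NVLink 5 / RoCEv2: {bandwidth_ratio:.0f}x")
```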

GB200 NVL72 42U rack layout. 18 compute trays each holding 4 GPUs (72 GPUs total), 9 non-scalable NVSwitch5 trays in the middle of the rack stitching the 72 GPUs into a single NVLink-5 scale-up domain, four 33 kW power shelves, IPMI management blades, and a drip tray.
GB200 NVL72 rack layout — 18 compute trays × 4 GPUs each = 72 GPUs in one NVLink-5 scale-up domain, stitched together by 9 NVSwitch5 trays. The whole rack runs the same fabric the GPUs inside an HGX B200 node use; B200 multinode disagg crosses InfiniBand or RoCEv2 Ethernet between racks at 18x lower per-GPU bandwidth.
DeepSeek R1 0528 FP4 1k/1k tok/s/GPU vs interactivity. GB200 NVL72 (Dynamo TRT, MTP) lighter green and B200 (Dynamo TRT, MTP) darker green. Each curve point labeled with its decode TP value.
DeepSeek R1 0528 FP4 1k/1k Pareto frontier. GB200 NVL72 vs B200, both Dynamo TRT-LLM + MTP, both disaggregated prefill/decode. Measured on InferenceX 2026-05-22. Point labels denote decode TP.

DeepSeek R1 0528 is the 671B-parameter MoE that DeepSeek released in May 2025. It uses Multi-head Latent Attention (MLA) for KV-cache compression, 256 routed experts with 8 active per token plus 1 shared expert, and 61 transformer layers. Every MoE layer fires a routed all-to-all dispatch followed by an all-to-all combine on each forward pass, which adds up to roughly 120 all-to-alls per forward pass. That collective volume is exactly what NVLink-class scale-up bandwidth is for.

Why GB200 NVL72 Wins in the Middle of the Curve

In the middle of the curve — roughly 75–175 tok/s/user on this workload — decode becomes network-bound on the EP dispatch and combine collectives. Each MoE layer fires two all-to-all collectives per token: a dispatch that routes each token to the 8 of 256 experts it was assigned to (which generally live on remote ranks under wide EP), and a combine that gathers the expert outputs back to each token's home rank. Across DeepSeek R1's ~60 MoE layers that is roughly 120 collectives per forward pass.
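The collective count and the expert placement under wide EP can be worked out from the figures above. The sketch below is a back-of-the-envelope model, not anything taken from the TRT-LLM runtime: the function names are hypothetical, the layer count is the article's rounded ~60, and the remote-traffic estimate assumes perfectly uniform routing, which real routers do not guarantee.

```python
NUM_ROUTED_EXPERTS = 256        # from the text
COLLECTIVES_PER_MOE_LAYER = 2   # one dispatch + one combine
APPROX_MOE_LAYERS = 60          # the article's rounded "~60 MoE layers"


def collectives_per_forward_pass(num_moe_layers: int) -> int:
    """All-to-all collectives issued in one forward pass."""
    return num_moe_layers * COLLECTIVES_PER_MOE_LAYER


def experts_per_rank(ep: int) -> int:
    """Routed experts hosted on each rank when experts are split evenly."""
    if NUM_ROUTED_EXPERTS % ep != 0:
        raise ValueError("EP degree must divide the number of routed experts")
    return NUM_ROUTED_EXPERTS // ep


def expected_remote_fraction(ep: int) -> float:
    """Share of expert assignments that land on another rank.

    ASSUMPTION: routing is uniform across experts.
    """
    return (ep - 1) / ep


print("collectives per pass:", collectives_per_forward_pass(APPROX_MOE_LAYERS))
for ep in (8, 16, 32):
    print(f"EP={ep}: {experts_per_rank(ep)} experts/rank, "
          f"{expected_remote_fraction(ep):.1%} of assignments remote")
```

Under the uniform-routing assumption, even EP=8 already sends most expert assignments off-rank, so widening EP mainly changes how many ranks each all-to-all has to span rather than whether traffic leaves the GPU.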

When the network is fast enough, the runtime overlaps each dispatch and combine with the matmul compute it is serving: issue the dispatch, start the expert GEMM on tokens that have already arrived, finish the GEMM in roughly the time it takes for the remaining bytes to land, then issue the combine. The collective latency mostly disappears from the critical path because the GPU was busy doing useful compute throughout.

On ConnectX-7 RoCEv2 Ethernet at 50 GB/s per GPU — 18x less per-rank bandwidth than NVLink — that overlap collapses. The same collective takes up to 18x longer per byte moved, no longer fits inside the GEMM time budget, and exposes itself as raw communication time.
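A toy model makes the overlap argument concrete. It treats a collective as a pure bandwidth transfer and counts only the part that outlasts the GEMM it overlaps with. The payload size and GEMM time are hypothetical values chosen for illustration, not measurements from this benchmark, and the model ignores link latency, congestion, and kernel launch overhead.

```python
def exposed_comm_us(payload_bytes: float, bw_gbytes_per_s: float, gemm_us: float) -> float:
    """Communication time left on the critical path after overlapping with a GEMM.

    Pure bandwidth model: transfer time = bytes / bandwidth, no latency term.
    """
    comm_us = payload_bytes / (bw_gbytes_per_s * 1e9) * 1e6
    return max(0.0, comm_us - gemm_us)


# HYPOTHETICAL workload numbers, chosen only to illustrate the effect.
PAYLOAD_BYTES = 4e6   # bytes each rank moves per collective (assumed)
GEMM_US = 20.0        # expert GEMM time available for overlap (assumed)

nvlink_exposed = exposed_comm_us(PAYLOAD_BYTES, 900.0, GEMM_US)  # NVLink 5
roce_exposed = exposed_comm_us(PAYLOAD_BYTES, 50.0, GEMM_US)     # RoCEv2

print(f"NVLink exposed: {nvlink_exposed:.1f} us")
print(f"RoCEv2 exposed: {roce_exposed:.1f} us")
```

With these made-up inputs, the NVLink transfer (about 4.4 us) hides entirely behind the 20 us GEMM, while the RoCEv2 transfer (80 us) leaves about 60 us exposed on every collective.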

The Numbers

All rows are DeepSeek R1 0528 FP4 at ISL 1024 / OSL 1024, Dynamo TRT-LLM with MTP enabled, disaggregated prefill/decode on both SKUs, multinode in both cases, measured on InferenceX on 2026-05-22 (run 26306422380). Cost per million total tokens is computed as TCO_$/GPU/hr / (3600 × tput_per_gpu / 1e6), with B200 at $1.95/GPU/hr and GB200 NVL72 at $2.21/GPU/hr per the SemiAnalysis AI Cloud TCO Model.
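The cost formula above translates directly into a few lines of Python. This is a sketch of the stated formula only; the function and dictionary names are illustrative, and the two throughput inputs are the headline 125 tok/s/user figures quoted earlier in the article.

```python
# TCO rates in $/GPU/hr, as stated in the text.
TCO_USD_PER_GPU_HR = {"B200": 1.95, "GB200 NVL72": 2.21}


def cost_per_million_tokens(sku: str, tok_per_s_per_gpu: float) -> float:
    """$/M tokens = TCO_$/GPU/hr / (3600 * tput_per_gpu / 1e6)."""
    million_tokens_per_gpu_hr = 3600 * tok_per_s_per_gpu / 1e6
    return TCO_USD_PER_GPU_HR[sku] / million_tokens_per_gpu_hr


gb200_cost = cost_per_million_tokens("GB200 NVL72", 4130)
b200_cost = cost_per_million_tokens("B200", 941)

print(f"GB200 NVL72: ${gb200_cost:.2f} per million tokens")
print(f"B200:        ${b200_cost:.2f} per million tokens")
```

Because the published throughputs are rounded, values recomputed this way can differ from the published cost figures in the last digit.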

GB200 NVL72 (Dynamo TRT, MTP), DeepSeek R1 FP4 1k/1k disagg:

Conc | Prefill | Decode | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tok
4 | 4 GPU, TP=4 | 32 GPU, EP=8 | 60.7 | 286.40 | 3.49 | $10.12
8 | 4 GPU, TP=4 | 32 GPU, EP=8 | 111.8 | 272.64 | 3.67 | $5.49
12 | 4 GPU, TP=4 | 32 GPU, EP=8 | 165.2 | 257.11 | 3.89 | $3.72
24 | 4 GPU, TP=4 | 32 GPU, EP=8 | 274.8 | 222.28 | 4.50 | $2.23
48 | 4 GPU, TP=4 | 32 GPU, EP=8 | 363.3 | 207.30 | 4.82 | $1.69
180 | 4 GPU, TP=4 | 32 GPU, EP=32 | 1,149.1 | 164.37 | 6.08 | $0.53
2,253 | 12 GPU, TP=12 | 32 GPU, EP=32 | 7,698.0 | 90.99 | 10.99 | $0.08
4,301 | 8 GPU, TP=8 | 16 GPU, EP=16 | 12,659.7 | 43.29 | 23.10 | $0.05
16,130 | 12 GPU, TP=12 | 20 GPU, EP=4 | 14,659.4 | 17.82 | 56.11 | $0.04

B200 (Dynamo TRT, MTP), DeepSeek R1 FP4 1k/1k disagg multinode:

Conc | Prefill | Decode | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tok
6 | 4 GPU, TP=4 | 40 GPU, EP=8 | 49.3 | 309.17 | 3.23 | $10.99
10 | 4 GPU, TP=4 | 40 GPU, EP=8 | 118.7 | 277.39 | 3.61 | $4.56
15 | 4 GPU, TP=4 | 40 GPU, EP=8 | 168.9 | 261.09 | 3.83 | $3.21
25 | 4 GPU, TP=4 | 40 GPU, EP=8 | 242.4 | 224.59 | 4.45 | $2.23
45 | 4 GPU, TP=4 | 40 GPU, EP=8 | 369.9 | 191.18 | 5.23 | $1.46
90 | 4 GPU, TP=4 | 40 GPU, EP=8 | 577.3 | 150.56 | 6.64 | $0.94
180 | 4 GPU, TP=4 | 40 GPU, EP=8 | 897.9 | 126.42 | 7.91 | $0.60
875 | 4 GPU, TP=4 | 40 GPU, EP=8 | 2,832.9 | 101.79 | 9.82 | $0.19
1,214 | 4 GPU, TP=4 | 16 GPU, EP=8 | 7,111.4 | 74.04 | 13.51 | $0.08
4,968 | 12 GPU, TP=12 | 32 GPU, EP=8 | 9,660.7 | 56.35 | 17.75 | $0.06
10,860 | 12 GPU, TP=12 | 20 GPU, EP=4 | 12,515.7 | 21.34 | 46.86 | $0.04

Iso-Interactivity Throughput Comparison

Interactivity (tok/s/user) | GB200 NVL72 tok/s/GPU | B200 tok/s/GPU | GB200 NVL72 / B200
25 | 14,125 | 12,292 | 1.15x
45 | 12,508 | 10,853 | 1.15x
60 | 11,017 | 9,185 | 1.20x
75 | 9,379 | 6,968 | 1.35x
90 | 7,796 | 4,512 | 1.73x
100 | 6,781 | 3,047 | 2.23x
125 | 4,130 | 941 | 4.39x
150 | 1,922 | 583 | 3.30x
175 | 826 | 429 | 1.93x
200 | 432 | 332 | 1.30x
225 | 262 | 241 | 1.09x
250 | 186 | 193 | 0.97x
275 | 103 | 126 | 0.82x
300 | unreachable | 67 | (B200 wins)

And the same comparison normalized to cost per million tokens, which dilutes the GB200 NVL72 advantage by its 13% per-GPU TCO premium ($2.21 vs $1.95 per GPU-hour):

Interactivity (tok/s/user) | GB200 NVL72 $/M tok | B200 $/M tok | B200 / GB200 NVL72
25 | $0.0435 | $0.0441 | 1.01x
45 | $0.0491 | $0.0499 | 1.02x
60 | $0.0557 | $0.0590 | 1.06x
75 | $0.0655 | $0.0777 | 1.19x
100 | $0.0905 | $0.1778 | 1.96x
125 | $0.1486 | $0.5755 | 3.87x
150 | $0.3194 | $0.9292 | 2.91x
175 | $0.7430 | $1.2638 | 1.70x
200 | $1.4215 | $1.6314 | 1.15x
225 | $2.3450 | $2.2454 | 0.96x
250 | $3.2962 | $2.8067 | 0.85x (B200 wins)

The 4.39x throughput peak (3.87x cost gap) at 125 tok/s/user is where wide EP across the NVLink fabric is doing the most work.
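The link between the throughput ratio and the cost ratio is a single division by the TCO premium. The sketch below uses the $/GPU/hr rates stated in the article; the function name is illustrative.

```python
B200_TCO = 1.95    # $/GPU/hr, from the text
GB200_TCO = 2.21   # $/GPU/hr, from the text

# GB200 NVL72 costs this much more per GPU-hour (about 1.13x).
TCO_PREMIUM = GB200_TCO / B200_TCO


def cost_advantage(throughput_ratio: float) -> float:
    """B200 $/M tok divided by GB200 NVL72 $/M tok, given the throughput ratio."""
    return throughput_ratio / TCO_PREMIUM


print(f"TCO premium: {TCO_PREMIUM:.3f}x")
print(f"4.39x throughput -> {cost_advantage(4.39):.2f}x cost advantage")
print(f"1.09x throughput -> {cost_advantage(1.09):.2f}x cost advantage")
```

The premium also sets the break-even point: GB200 NVL72 needs roughly a 1.13x throughput lead to win on cost, which is why its 1.09x throughput edge at 225 tok/s/user still comes out at 0.96x on cost.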

Live chart, pre-filtered to B200 and GB200 NVL72 Dynamo TRT MTP on DeepSeek R1 FP4 1k/1k for the 2026-05-22 run.

When Each SKU Wins

  • GB200 NVL72 Dynamo TRT is the right choice for everything in the 75 to 200 tok/s/user band where wide EP across the 72-GPU NVLink fabric is the dominant factor. The cost gap peaks at 3.87x in favor of GB200 NVL72 at 125 tok/s/user — chat-style and reasoning serving at production interactivity targets land squarely inside this band.

NVIDIA's SGLang GB200 NVL72 results show the same scale-up fabric advantage on the SGLang stack. AMD's MI300/MI355X have no rack-scale UALoE72 equivalent shipping until H2 2026 engineering samples per the inferencex-v2 launch piece, so there is no rack scale comparator on the AMD side yet for this workload.

Acknowledgments

Thanks to NVIDIA's Dynamo and TensorRT-LLM teams, including Jatin Gangani, Kedar Potdar, Sridhar Ramaswamy, Ishan Dhanani, and Sahithi Chigurupati, for shipping the disagg recipes on both B200 multinode RoCEv2 and GB200 NVL72. Check out our other blog post on GB200 NVL72 vs B200 on Kimi K2.5.

All articles and posts are © SemiAnalysis. All rights reserved. The AGPL-3.0 license covering the application source code does not apply to article content.