On DeepSeek R1 0528 FP4 1k/1k with Dynamo TRT-LLM + MTP and disaggregated prefill/decode on both SKUs, GB200 NVL72 delivers up to 4.39x throughput per GPU vs B200 at iso-interactivity — peaking at 125 tok/s/user (4,130 tok/s/GPU on GB200 NVL72 vs 941 tok/s/GPU on B200).

NVIDIA GB200 NVL72 connects all 72 GPUs over NVLink 5 at 900 GB/s per GPU uni-directional (1.8 TB/s jensen math bidi rx + tx). A B200 server connects only 8 GPUs over NVLink; once decode EP needs more than 8 ranks, the all-to-all has to leave the NVLink island and cross ConnectX-7 RoCEv2 Ethernet at 400 Gbit/s per GPU. So per-GPU bandwidth available to any wider-than-8 EP collective drops from 900 GB/s to 50 GB/s, 18x. DeepSeek R1's 256 routed experts amortize beautifully when the all-to-all stays on NVLink end-to-end across 16 or 32 ranks.

GB200 NVL72 42U rack layout. 18 compute trays each holding 4 GPUs (72 GPUs total), 9 non-scalable NVSwitch5 trays in the middle of the rack stitching the 72 GPUs into a single NVLink-5 scale-up domain, 4 33 kW power shelves, IPMI management blades, and a drip tray. — GB200 NVL72 rack layout — 18 compute trays × 4 GPUs each = 72 GPUs in one NVLink-5 scale-up domain, stitched together by 9 NVSwitch5 trays. The whole rack runs the same fabric the GPUs inside an HGX B200 node use; B200 multinode disagg crosses InfiniBand or RoCEv2 Ethernet between racks at 18x lower per-GPU bandwidth.

Click to see the full InferenceX dashboard →

DeepSeek R1 0528 FP4 1k/1k tok/s/GPU vs interactivity. GB200 NVL72 (Dynamo TRT, MTP) lighter green and B200 (Dynamo TRT, MTP) darker green. Each curve point labeled with its decode TP value. — DeepSeek R1 0528 FP4 1k/1k Pareto frontier. GB200 NVL72 vs B200, both Dynamo TRT-LLM + MTP, both disaggregated prefill/decode. Measured on InferenceX 2026-05-22. Point labels denote decode TP.

DeepSeek R1 0528 is the 671B-parameter MoE that DeepSeek released in May 2025 — Multi-head Latent Attention (MLA) for KV-cache compression, 256 routed experts with 8 active per token plus 1 shared expert, and 61 transformer layers. Every MoE layer fires a routed all-to-all dispatch followed by an all-to-all combine on each forward pass: roughly 120 all-to-alls per token. That collective volume is exactly what NVLink-class scale-up bandwidth is for.

Why GB200 NVL72 Wins in the Middle of the Curve

In the middle of the curve — roughly 75–175 tok/s/user on this workload — decode becomes network-bound on the EP dispatch and combine collectives. Each MoE layer fires two all-to-all collectives per token: a dispatch that routes each token to the 8 of 256 experts it was assigned to (which generally live on remote ranks under wide EP), and a combine that gathers the expert outputs back to each token's home rank. Across DeepSeek R1's ~60 MoE layers that is roughly 120 collectives per forward pass.

When the network is fast enough, the runtime overlaps each dispatch and combine with the matmul compute it is serving: issue the dispatch, start the expert GEMM on tokens that have already arrived, finish the GEMM in roughly the time it takes for the remaining bytes to land, then issue the combine. The collective latency mostly disappears from the critical path because the GPU was busy doing useful compute throughout.

On ConnectX-7 RoCEv2 Ethernet at 50 GB/s per GPU — 18x less per-rank bandwidth than NVLink — that overlap collapses. The same collective takes up to 18x longer per byte moved, no longer fits inside the GEMM time budget, and exposes itself as raw communication time.

The Numbers

All rows are DeepSeek R1 0528 FP4 at ISL 1024 / OSL 1024, Dynamo TRT-LLM with MTP enabled, disaggregated prefill/decode on both SKUs, multinode in both cases, measured on InferenceX on 2026-05-22 (run 26306422380). Cost per million total tokens is computed as TCO_$/GPU/hr / (3600 × tput_per_gpu / 1e6), with B200 at $1.95/GPU/hr and GB200 NVL72 at $2.21/GPU/hr per the SemiAnalysis AI Cloud TCO Model.

GB200 NVL72 (Dynamo TRT, MTP), DeepSeek R1 FP4 1k/1k disagg:

Conc	Prefill	Decode	tok/s/GPU	tok/s/user	TPOT (ms)	$/M tok
4	4 GPU, TP=4	32 GPU, EP=8	60.7	286.40	3.49	$10.12
8	4 GPU, TP=4	32 GPU, EP=8	111.8	272.64	3.67	$5.49
12	4 GPU, TP=4	32 GPU, EP=8	165.2	257.11	3.89	$3.72
24	4 GPU, TP=4	32 GPU, EP=8	274.8	222.28	4.50	$2.23
48	4 GPU, TP=4	32 GPU, EP=8	363.3	207.30	4.82	$1.69
180	4 GPU, TP=4	32 GPU, EP=32	1,149.1	164.37	6.08	$0.53
2,253	12 GPU, TP=12	32 GPU, EP=32	7,698.0	90.99	10.99	$0.08
4,301	8 GPU, TP=8	16 GPU, EP=16	12,659.7	43.29	23.10	$0.05
16,130	12 GPU, TP=12	20 GPU, EP=4	14,659.4	17.82	56.11	$0.04

B200 (Dynamo TRT, MTP), DeepSeek R1 FP4 1k/1k disagg multinode:

Conc	Prefill	Decode	tok/s/GPU	tok/s/user	TPOT (ms)	$/M tok
6	4 GPU, TP=4	40 GPU, EP=8	49.3	309.17	3.23	$10.99
10	4 GPU, TP=4	40 GPU, EP=8	118.7	277.39	3.61	$4.56
15	4 GPU, TP=4	40 GPU, EP=8	168.9	261.09	3.83	$3.21
25	4 GPU, TP=4	40 GPU, EP=8	242.4	224.59	4.45	$2.23
45	4 GPU, TP=4	40 GPU, EP=8	369.9	191.18	5.23	$1.46
90	4 GPU, TP=4	40 GPU, EP=8	577.3	150.56	6.64	$0.94
180	4 GPU, TP=4	40 GPU, EP=8	897.9	126.42	7.91	$0.60
875	4 GPU, TP=4	40 GPU, EP=8	2,832.9	101.79	9.82	$0.19
1,214	4 GPU, TP=4	16 GPU, EP=8	7,111.4	74.04	13.51	$0.08
4,968	12 GPU, TP=12	32 GPU, EP=8	9,660.7	56.35	17.75	$0.06
10,860	12 GPU, TP=12	20 GPU, EP=4	12,515.7	21.34	46.86	$0.04

Iso-Interactivity Throughput Comparison

Interactivity (tok/s/user)	GB200 NVL72 tok/s/GPU	B200 tok/s/GPU	GB200 NVL72 / B200
25	14,125	12,292	1.15x
45	12,508	10,853	1.15x
60	11,017	9,185	1.20x
75	9,379	6,968	1.35x
90	7,796	4,512	1.73x
100	6,781	3,047	2.23x
125	4,130	941	4.39x
150	1,922	583	3.30x
175	826	429	1.93x
200	432	332	1.30x
225	262	241	1.09x
250	186	193	0.97x
275	103	126	0.82x
300	unreachable	67	∞ (B200 wins)

And the same comparison normalized to cost per million tokens, which dilutes the GB200 NVL72 advantage by its 13% per-GPU TCO premium ($2.21 vs $1.95 per GPU-hour):

Interactivity (tok/s/user)	GB200 NVL72 $/M tok	B200 $/M tok	B200 / GB200 NVL72
25	$0.0435	$0.0441	1.01x
45	$0.0491	$0.0499	1.02x
60	$0.0557	$0.0590	1.06x
75	$0.0655	$0.0777	1.19x
100	$0.0905	$0.1778	1.96x
125	$0.1486	$0.5755	3.87x
150	$0.3194	$0.9292	2.91x
175	$0.7430	$1.2638	1.70x
200	$1.4215	$1.6314	1.15x
225	$2.3450	$2.2454	0.96x
250	$3.2962	$2.8067	0.85x (B200 wins)

The 4.39x throughput peak (3.87x cost gap) at 125 tok/s/user is where wide EP across the NVLink fabric is doing the most work.

Live chart, pre-filtered to B200 and GB200 NVL72 Dynamo TRT MTP on DeepSeek R1 FP4 1k/1k for the 2026-05-22 run.

When Each SKU Wins

GB200 NVL72 Dynamo TRT is the right choice for everything in the 75 to 200 tok/s/user band where wide EP across the 72-GPU NVLink fabric is the dominant factor. The cost gap peaks at 3.87x in favor of GB200 NVL72 at 125 tok/s/user — chat-style and reasoning serving at production interactivity targets land squarely inside this band.

NVIDIA's SGLang GB200 NVL72 results show the same scale-up fabric advantage on the SGLang stack. AMD's MI300/MI355X have no rack-scale UALoE72 equivalent shipping until H2 2026 engineering samples per the inferencex-v2 launch piece, so there is no rack scale comparator on the AMD side yet for this workload.

Acknowledgments

Thanks to NVIDIA's Dynamo and TensorRT-LLM teams — including Jatin Gangani, Kedar Potdar, Sridhar Ramaswamy, Ishan Dhanani, and Sahithi Chigurupati — for shipping the disagg recipes on both B200 multinode RoCEv2 and GB200 NVL72. Checkout our other blog post on GB200 NVL72 vs B200 Kimi K2.5 post.

Click to see the full InferenceX dashboard →