Articles

Insights on AI inference benchmarking, GPU performance, and ML infrastructure.

·29 min read

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time — Huawei, GB300 NVL72, MI355X, B200

Day 0 Inference Performance, InferenceX, 100x performance improvement in 26 Days, Cost per Million Tokens, Huawei 950DT Inference Trace Analysis

benchmarkgpuinferencedeepseeknvidiaamdhuaweigb300b300b200mi355xh200sglangvllmtrtllmcann
·11 min read

GB300 NVL72 vs GB200 NVL72 Inference Performance & Perf per Dollar - on DeepSeek-V4-Pro 1.6T: Up to 2.83x Throughput

DSv4-Pro FP4 8K/1K, Dynamo+vLLM, disaggregated on both racks. GB300's 50% extra HBM (288 vs 192 GB/GPU) unlocks a wider prefill+decode recipe GB200 can't fit — lifting middle-of-curve perf/$ by 2.31x despite a 20% per-GPU TCO premium.

benchmarkgpuinferencedeepseeknvidiagb300gb200nvl72vllmdynamowide-epdisagg
·14 min read

B200 NVFP4 vs H100 FP8 on MiniMax-M2.5: Up to 8.2x Better Performance per Dollar with vLLM

vLLM PR #36307 unlocks the trtllm-gen FP8 MoE kernel for MiniMax on B200; combined with NVFP4, perf/$ scales from 4.0x at 22 tok/s/user to 8.2x at 110 on 8K/1K

benchmarkgpuinferenceminimaxnvidiab200h100vllmfp4
·13 min read

B200 NVFP4 vs H200 INT4 on Kimi K2.5/K2.6: Up to 2.95x Better Performance per Dollar

On vLLM 8K/1K the NVFP4 path on B200 is 2.71x–2.95x cheaper per million tokens than H200 INT4 across the entire 30–90 tok/s/user serving band, and 2.45x–2.74x cheaper than B200 INT4 on the same silicon. Both factors decompose cleanly into B200's HBM bandwidth, HBM capacity, and NVFP4 tensor cores

benchmarkgpuinferencekiminvidiab200h200vllmnvfp4
·7 min read

GB200 NVL72 vs B200 on Kimi K2.5: 3.1x from Wide EP vLLM

Rack scale NVLink on NVL72 lets Dynamo vLLM run Kimi K2.5 wide EP up to Decode EP 16, taking peak throughput from 4,021 to 12,587 tok/s/GPU on 8k/1k NVFP4

benchmarkgpuinferencekiminvidiagb200b200vllmnvl72wide-ep
·7 min read

AMD MI355X Kimi K2.5 Inference: 7.7x Throughput, Up To 15x Interactivity in 25 Days on vLLM

vLLM PR #35850 Fixed AITER MLA Dispatch on MI355X CDNA4, Unlocking Kimi K2.5 Inference Performance at TP=8, Shipped in vLLM 0.18

benchmarkgpuinferencekimiamdvllmrocmmi355x