Articles

Insights on AI inference benchmarking, GPU performance, and ML infrastructure.


June 9, 2026·29 min read

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time — Huawei, GB300 NVL72, MI355X, B200

Day 0 Inference Performance, InferenceX, 100x performance improvement in 26 Days, Cost per Million Tokens, Huawei 950DT Inference Trace Analysis

benchmark gpu inference deepseek nvidia amd huawei gb300 b300 b200 mi355x h200 sglang vllm trtllm cann

May 26, 2026·12 min read

B200 NVFP4 vs H200 FP8 on GLM-5: Up to 3.65x Better Performance per Dollar with SGLang MTP

Both SKUs run SGLang EAGLE MTP; the Blackwell generation lifts perf/$ by ~1.2x at the peak and the NVIDIA GLM-5-NVFP4 checkpoint on FlashInfer TRT-LLM sparse MLA stacks another ~2.4–3.0x on 8K/1K

benchmark gpu inference glm5 nvidia b200 h200 sglang fp4

May 26, 2026·14 min read

B200 NVFP4 vs H100 FP8 on MiniMax-M2.5: Up to 8.2x Better Performance per Dollar with vLLM

vLLM PR #36307 unlocks the trtllm-gen FP8 MoE kernel for MiniMax on B200; combined with NVFP4, perf/$ scales from 4.0x at 22 tok/s/user to 8.2x at 110 tok/s/user on 8K/1K

benchmark gpu inference minimax nvidia b200 h100 vllm fp4

May 26, 2026·13 min read

B200 NVFP4 vs H200 INT4 on Kimi K2.5/K2.6: Up to 2.95x Better Performance per Dollar

On vLLM 8K/1K the NVFP4 path on B200 is 2.71x–2.95x cheaper per million tokens than H200 INT4 across the entire 30–90 tok/s/user serving band, and 2.45x–2.74x cheaper than B200 INT4 on the same silicon. Both factors decompose cleanly into B200's HBM bandwidth, HBM capacity, and NVFP4 tensor cores

benchmark gpu inference kimi nvidia b200 h200 vllm nvfp4

May 25, 2026·8 min read

AMD MI355X GLM-5 Inference: Up to 40% Cheaper per Million Tokens than B200 on SGLang FP8

14 weeks after GLM-5 launched, AMD landed both MTP and non-MTP SGLang FP8 recipes on MI355X: fused MLA plus an FP8 KV cache via TileLang flips the single-node FP8 cost curve in AMD's favor across most of the performance Pareto frontier

benchmark gpu inference glm5 amd nvidia mi355x b200 sglang rocm

May 23, 2026·10 min read

GB200 NVL72 vs B200 on DeepSeek R1 670B: Up to 4.4x Throughput per GPU at 125 tok/s/user

DeepSeek R1 FP4 1k/1k. NVL72's 72-GPU NVLink scale-up fabric lets decode run wide EP up to EP=32, where B200's 8-GPU NVLink island caps out at EP=8 over RoCEv2

benchmark gpu inference deepseek nvidia gb200 b200 nvl72 trtllm dynamo wide-ep disagg

May 2, 2026·5 min read

SGLang 0.5.6 on B200 DeepSeek R1 FP4: Up to 1.8x at Low Concurrency

Piecewise CUDA graphs for DeepSeek V3, a unified event loop, and JIT kernels push 8k/1k throughput from 508 to 907 tok/s/GPU on the same 16-GPU B200 pool

benchmark inference gpu nvidia b200 deepseek sglang fp4

April 23, 2026·7 min read

GB200 NVL72 vs B200 on Kimi K2.5: 3.1x from Wide EP vLLM

Rack-scale NVLink on NVL72 lets Dynamo vLLM run Kimi K2.5 wide EP up to decode EP=16, taking peak throughput from 4,021 to 12,587 tok/s/GPU on 8k/1k NVFP4

benchmark gpu inference kimi nvidia gb200 b200 vllm nvl72 wide-ep