Articles

Insights on AI inference benchmarking, GPU performance, and ML infrastructure.

June 9, 2026·29 min read

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time — Huawei, GB300 NVL72, MI355X, B200

Day 0 Inference Performance, InferenceX, 100x performance improvement in 26 Days, Cost per Million Tokens, Huawei 950DT Inference Trace Analysis

benchmark gpu inference deepseek nvidia amd huawei gb300 b300 b200 mi355x h200 sglang vllm trtllm cann

May 27, 2026·11 min read

GB300 NVL72 vs GB200 NVL72 Inference Performance & Perf per Dollar - on DeepSeek-V4-Pro 1.6T: Up to 2.83x Throughput

DSv4-Pro FP4 8K/1K, Dynamo+vLLM, disaggregated on both racks. GB300's 50% extra HBM (288 vs 192 GB/GPU) unlocks a wider prefill+decode recipe GB200 can't fit — lifting middle-of-curve perf/$ by 2.31x despite a 20% per-GPU TCO premium.

benchmark gpu inference deepseek nvidia gb300 gb200 nvl72 vllm dynamo wide-ep disagg

May 26, 2026·15 min read

MI355X DeepSeek-V4-Pro on SGLang: 110.5x Throughput per GPU in 26 Days

The amd/deepseek_v4 side branch shipped a TileLang attention indexer, Triton sparse MLA, fused RoPE/Hadamard, FlyDSL MoE, and FP4 weights across 31 performance-optimization PRs, lifting throughput from a first-light 20 tok/s/GPU at 2.4 tok/s/user to 2,256 tok/s/GPU at 9.4 tok/s/user on 8K/1K, so throughput and interactivity climbed together.

benchmark gpu inference deepseek amd mi355x sglang rocm fp4

May 23, 2026·10 min read

GB200 NVL72 vs B200 on DeepSeek R1 670B: Up to 4.4x Throughput per GPU at 125 tok/s/user

DeepSeek R1 FP4 1k/1k. NVL72's 72-GPU NVLink scale-up fabric lets decode run wide EP up to EP=32, where B200's 8-GPU NVLink island caps out at EP=8 over RoCEv2

benchmark gpu inference deepseek nvidia gb200 b200 nvl72 trtllm dynamo wide-ep disagg

May 2, 2026·5 min read

SGLang 0.5.6 on B200 DeepSeek R1 FP4: Up to 1.8x at Low Concurrency

Piecewise CUDA graphs for DeepSeek V3, a unified event loop, and JIT kernels push 8k/1k throughput from 508 to 907 tok/s/GPU on the same 16-GPU B200 pool.

benchmark inference gpu nvidia b200 deepseek sglang fp4