Articles

Insights on AI inference benchmarking, GPU performance, and ML infrastructure.

May 26, 2026 · 15 min read

MI355X DeepSeek-V4-Pro on SGLang: 110.5x Throughput per GPU in 26 Days

The amd/deepseek_v4 side branch shipped a TileLang attention indexer, Triton sparse MLA, fused RoPE/Hadamard, FlyDSL MoE, and FP4 weights across 31 performance-optimization PRs, lifting the first-light 20 tok/s/GPU at 2.4 tok/s/user to 2,256 tok/s/GPU at 9.4 tok/s/user on 8K/1K, with throughput and interactivity climbing together.

benchmark gpu inference deepseek amd mi355x sglang rocm fp4

May 25, 2026 · 8 min read

AMD MI355X GLM-5 Inference: Up to 40% Cheaper per Million Tokens than B200 on SGLang FP8

14 weeks after GLM-5 launched, AMD landed both MTP and non-MTP SGLang FP8 recipes on MI355X: fused MLA plus an FP8 KV cache via TileLang flips the single-node FP8 cost curve in AMD's favor across most of the performance Pareto frontier.

benchmark gpu inference glm5 amd nvidia mi355x b200 sglang rocm

May 25, 2026 · 7 min read

AMD MI355X Qwen3.5 397B-A17B Inference: Up to 19x Throughput per GPU in 3 Months on SGLang FP8

From v0.5.8 (Feb) → v0.5.10rc0 (Apr) → v0.5.12 (May), three AITER kernel landings on MI355X plus a TP=8 → TP=2/TP=4 retune push Qwen3.5 8K/1K peak throughput from 1.3k to 6.4k tok/s/GPU and extend the curve out to 75 tok/s/user.

benchmark gpu inference qwen amd mi355x sglang rocm

April 22, 2026 · 7 min read

AMD MI355X Kimi K2.5 Inference: 7.7x Throughput, Up to 15x Interactivity in 25 Days on vLLM

vLLM PR #35850 fixed AITER MLA dispatch on MI355X (CDNA4), unlocking Kimi K2.5 inference performance at TP=8; the fix shipped in vLLM 0.18.

benchmark gpu inference kimi amd vllm rocm mi355x