14 weeks after GLM-5's release, AMD MI355X SGLang FP8 undercuts NVIDIA B200 SGLang FP8 on cost per million tokens across most of the single-node Pareto frontier on the 8k/1k workload (from ~10 to ~77 tok/s/user; B200 noses back ahead above ~90 tok/s/user). The peak gap is 1.41x at 18 tok/s/user with MTP ($0.30/M on B200 vs $0.22/M on MI355X — a 40% reduction) and 1.36x at 10 tok/s/user without MTP ($0.31/M vs $0.23/M). Both runs are on SGLang v0.12, where the MI355X ROCm stack is now feature-matched with the CUDA stack on B200: both MTP and non-MTP recipes, both with FP8 KV cache, both on SGLang's latest TileLang-backed MLA path.

This is the cadence that matters. GLM-5 dropped, and within a quarter AMD shipped an upstream SGLang kernel (sgl-project/sglang PR #21511) along with other optimizations plus the matching InferenceX recipes (InferenceX PR #1440) that flipped the FP8 single-node cost curve on the model. Speed is the moat.

Click to see the full InferenceX dashboard →

GLM-5 is ZAI's (Zhipu) MoE flagship, released 2026-02-11 — exactly 14 weeks before the InferenceX run in this post. It's a 744B-parameter sparse MoE with ~40B activated per token: 256 experts with top-8 routing (~5.9% sparsity) plus shared experts. The published architecture name is glm_moe_dsa — the model integrates DeepSeek Sparse Attention (DSA) on the decode path, the same sparse-attention pattern DeepSeek introduced in V3.2 and that SGLang's TileLang backend was built around, paired with Multi-head Latent Attention (MLA) for KV-cache compression to support its 200K context window.

On MI355X, the equivalent capability landed via SGLang's TileLang backend in mid-April, and the resulting decode throughput moved enough that MI355X's lower per-GPU TCO ($1.48/GPU/hr vs B200 at $1.95/GPU/hr per the SemiAnalysis AI Cloud TCO Model) now compounds into a real cost-per-token advantage instead of being swamped by software gaps.

What Shipped to Make This Happen

One of the headline performance optimizations on the AMD side is sgl-project/sglang PR #21511 by HaiShaw, merged 2026-04-03. The PR enables FP8 KV cache and an FP8 attention kernel on MI300/MI355 using SGLang's TileLang backend (tested on both DeepSeek-V3.2 and GLM-5), with a different fusion strategy per generation:

On MI355, the PR reuses the existing fused_qk_rope_cat_and_cache_mla kernel for both Q and KV FP8 quantization. The QK rope concat, MLA cache write, and FP8 quant for both Q and KV all collapse into one kernel pass per decode step — no extra HBM round-trips, no separate quantization launches.

The TileLang dependency was bumped to enable FP8 GEMM on AMD, and a new sparse_mla_fwd_decode_partial_fp8 kernel was added for the partial-decode reduction path. The PR reports throughput gains of greater than 5% on MI355 (greater than 10% on MI300), no accuracy regression on gsm8k (DeepSeek-V3.2 0.945 → 0.946; GLM-5 0.946 → 0.950), and activates with --kv-cache-dtype fp8_e4m3 alongside the TileLang prefill/decode backends.

The Numbers

All rows are GLM-5 FP8 at ISL 8192 / OSL 1024 on a single non-disaggregated node, measured on InferenceX on 2026-05-20 on SGLang v0.12 for both CUDA (B200) and ROCm (MI355X). Cost per million total tokens is computed as TCO_$/GPU/hr / (3600 × tput_per_gpu / 1e6), with B200 at $1.95/GPU/hr and MI355X at $1.48/GPU/hr.

Container images used:

B200: lmsysorg/sglang:v0.5.12-cu130
MI355X: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517

B200 SGLang FP8 MTP, TP=8 on 8 GPUs:

Conc	tok/s/GPU	tok/s/user	TPOT (ms)	$/M tokens
4	417.0	100.85	9.92	$1.30
8	650.1	77.82	12.85	$0.83
16	952.7	56.93	17.57	$0.57
32	1,296.8	38.16	26.21	$0.42
64	1,619.3	23.56	42.45	$0.34
128	1,929.5	13.78	72.59	$0.28
256	1,947.3	11.88	84.15	$0.28

MI355X SGLang FP8 MTP, TP=4 on 4 GPUs (the Pareto-anchor recipe):

Conc	tok/s/GPU	tok/s/user	TPOT (ms)	$/M tokens
4	625.5	76.80	13.02	$0.66
8	911.7	54.59	18.32	$0.45
16	1,208.1	35.82	27.92	$0.34
32	1,707.4	24.83	40.27	$0.24
64	1,895.0	18.19	54.99	$0.22
128	1,911.7	18.05	55.40	$0.22

MI355X SGLang FP8 MTP, TP=8 on 8 GPUs (the high-interactivity arm):

Conc	tok/s/GPU	tok/s/user	TPOT (ms)	$/M tokens
4	373.4	90.43	11.06	$1.10
8	534.2	65.05	15.37	$0.77

B200 SGLang FP8 non-MTP, TP=8 on 8 GPUs:

Conc	tok/s/GPU	tok/s/user	TPOT (ms)	$/M tokens
4	231.3	54.25	18.43	$2.34
8	382.4	46.07	21.71	$1.42
16	613.2	36.65	27.28	$0.88
32	933.7	27.47	36.40	$0.58
64	1,291.8	18.42	54.28	$0.42
128	1,669.1	11.87	84.23	$0.32
256	1,746.1	10.72	93.27	$0.31

MI355X SGLang FP8 non-MTP, TP=4 on 4 GPUs:

Conc	tok/s/GPU	tok/s/user	TPOT (ms)	$/M tokens
4	358.8	42.03	23.79	$1.15
8	579.6	34.68	28.83	$0.71
16	870.8	25.86	38.67	$0.47
32	1,274.0	18.57	53.86	$0.32
64	1,660.1	11.83	84.56	$0.25
128	2,071.4	7.33	136.36	$0.20
256	2,189.4	6.69	149.45	$0.19

Iso-Interactivity Cost Comparison

Interpolating both Pareto frontiers at matched interactivity. For MI355X MTP the Pareto frontier is the lower of TP=4 and TP=8 at each interactivity — TP=4 dominates up to ~77 tok/s/user, with TP=8 conc 4 taking over at the high-interactivity end (~90 tok/s/user) where TP=4 can't reach.

MTP:

Interactivity (tok/s/user)	B200 SGLang MTP $/M tok	MI355X SGLang MTP $/M tok	B200 / MI355X
18	$0.30	$0.22	1.41x
24	$0.34	$0.24	1.40x
35	$0.40	$0.34	1.17x
55	$0.55	$0.45	1.22x
77	$0.82	$0.66	1.25x
90	$1.08	$1.10	0.98x

Non-MTP:

Interactivity (tok/s/user)	B200 SGLang $/M tok	MI355X SGLang $/M tok	B200 / MI355X
15	$0.37	$0.28	1.31x
20	$0.45	$0.35	1.27x
30	$0.66	$0.58	1.14x
40	$1.07	$1.03	1.05x

GLM-5 FP8 8k/1k Cost per Million Total Tokens vs Interactivity, B200 SGLang and MI355X SGLang, with and without MTP speculative decoding — GLM-5 FP8 8k/1k. Cost per million total tokens vs interactivity. B200 SGLang and MI355X SGLang, both with and without MTP. Labels denote GPU count per config.

Live chart, pre-filtered to GLM-5 FP8 on B200 and MI355X SGLang for the 2026-05-20 run.

What's Next for MI355X on GLM-5

This result is single-node, aggregated, FP8 only. Two gaps still need closing:

FP4 composability. B200 in this comparison is FP8 on a CUDA nightly. B200 NVFP4 SGLang for GLM-5 is now shipping and will further compress B200's cost curve. MI355X MXFP4 GLM-5.1 SGLang shipped via InferenceX PR #1098 on 2026-04-21, but the FP4 + MTP composition on MI355X is not yet at parity with the FP8 + MTP recipe shown here.
Disaggregation and wide expert parallelism. MI355X GLM-5 has no disagg or wide-EP recipe yet. NVIDIA's GB200 NVL72 Dynamo TRT-LLM and Dynamo vLLM recipes on Kimi K2.5 already demonstrated a ~3x throughput-per-GPU advantage from rack-scale wide EP. AMD has still not shipped disagg for GLM-5 yet.

Acknowledgments

This recipe loop moved fast because Anush Elangovan, HaiShaw, and the broader AMD AI team landed both the upstream SGLang TileLang fused MLA + FP8 KV kernel in a 14-week window after GLM-5 dropped. The SGLang maintainers reviewed and shipped the kernel within days of submission. Speed of the upstream-to-benchmark loop is the moat.

Click to see the full InferenceX dashboard →