AMD MI355X Qwen3.5 397B-A17B Inference: Up to 19x Throughput per GPU in 3 Months on SGLang FP8

From v0.5.8 (Feb) → v0.5.10rc0 (Apr) → v0.5.12 (May), three AITER kernel landings on MI355X plus a TP=8 → TP=2/TP=4 retune push Qwen3.5 8k/1k peak from 1.3k to 6.4k tok/s/GPU and extend the curve out to 75 tok/s/user

SemiAnalysis · 7 min read · Tags: benchmark, gpu, inference, qwen, amd, mi355x, sglang, rocm

Thirteen weeks after Alibaba released Qwen3.5-397B-A17B on 2026-02-16, AMD MI355X SGLang FP8 throughput per GPU on the 8k/1k workload is up by as much as 19.0x at iso-interactivity, measured at 40 tok/s/user (192 → 3,660 tok/s/GPU between the 2026-02-20 v0.5.8.post1 baseline and the 2026-05-19 v0.5.12 run, on the dashboard's monotone-cubic-Hermite Pareto interpolation). The gains compound across three SGLang releases: three AITER kernel landings plus the TP=8 → TP=2/TP=4 retune drove most of the move, and the v0.5.10rc0 → v0.5.12 image bump in May added another ~1.5x on top.

The story is software-only — same MI355X CDNA4 silicon at $1.48/GPU/hr the whole time. The receipts: sgl-project/sglang#20736, sgl-project/sglang#21188, and sgl-project/sglang#21421, all merged Mar–Apr and all gated on SGLANG_USE_AITER=1. Speed of the upstream-to-benchmark loop is the moat.

Figure: Qwen3.5-397B-A17B FP8 8k/1k on MI355X SGLang, tok/s/GPU vs interactivity (tok/s/user). Three runs over 3 months: v0.5.8.post1 (Feb 20, TP=8), v0.5.10rc0 (Apr 16, TP=2/4), v0.5.12 (May 19, TP=2/4). Point labels denote the TP value used for that config.

Qwen3.5-397B-A17B is Alibaba's MoE flagship, released 2026-02-16. It has 397B total parameters with 17B activated per token across 512 experts (top-K routing), and a hybrid attention stack that interleaves Gated DeltaNet and Gated Attention layers. The first InferenceX benchmark ran on MI355X four days after the release.

What Shipped to Make This Happen

Some of the performance optimizations that led to these gains are:

  • sgl-project/sglang PR #20736 by zhentaocc (with co-author yichiche), merged 2026-04-15 — fuses the shared expert with routed experts in Qwen2 MoE and Qwen3.5 MoE. When shared_expert_intermediate_size == moe_intermediate_size, the shared expert is treated as an additional expert (top-K + 1) inside a single AITER MoE dispatch. One fewer kernel launch per MoE layer, fewer HBM round-trips for the shared-expert weights. Reported +4.6% total throughput, −4% TPOT on Qwen3.5 at concurrency 16; FP8 accuracy initially required an AITER split-K fix before being enabled.
  • sgl-project/sglang PR #21188 by yichiche, merged 2026-03-23 — adds a forward_hip path to GemmaRMSNorm so AMD GPUs use fused RMSNorm kernels (AITER fused_add_rms_norm / rms_norm) instead of the native fallback. The native path was scalar-bound on MI355X; the fused path absorbs the Gemma-style weight + 1.0 offset into the kernel. Reported on 8x MI355X at conc 1, 8k/1k: −23.1% median E2E latency, +30.0% total throughput, −17.0% median TTFT, with GSM8K accuracy rising from 0.943 to 0.955.
  • sgl-project/sglang PR #21421 by zhentaocc, merged 2026-03-26 — integrates AITER's fused_topk kernel into SGLang's fused_topk for softmax-scored MoE top-K selection. Auto-dispatches to aiter.fused_moe.fused_topk when AITER is enabled. Kernel microbenchmarks: ~1.31x to 6.29x faster than the sgl-kernel baseline on Qwen3.5 shapes (E=512, top-K=10), with the largest gains at high token counts. End-to-end bs=64 1k/1k: +1.9% total throughput, GSM8K within ±0.001 of baseline.
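To make the first and third items concrete, here is a minimal pure-Python sketch of softmax top-K routing and of folding the shared expert into the same dispatch as an extra (top-K + 1) slot. It is not the AITER or SGLang implementation: the toy experts, the function names, and the `shared_weight` default of 1.0 are assumptions for illustration, and the real models may gate or renormalize the weights differently. The only point is that appending the shared expert as expert id E gives the same result as running it as a separate pass.

```python
import math

def softmax_topk(logits, k):
    """Softmax over router logits, then keep the k largest probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return ids, [probs[i] for i in ids]

def dispatch(x, experts, ids, weights):
    """One 'MoE dispatch': weighted sum of the selected experts' outputs."""
    out = [0.0] * len(x)
    for i, w in zip(ids, weights):
        y = experts[i](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out

def moe_unfused(x, logits, routed, shared, k, shared_weight=1.0):
    """Routed dispatch plus a separate shared-expert pass."""
    ids, weights = softmax_topk(logits, k)
    routed_out = dispatch(x, routed, ids, weights)
    shared_out = shared(x)
    return [r + shared_weight * s for r, s in zip(routed_out, shared_out)]

def moe_fused(x, logits, routed, shared, k, shared_weight=1.0):
    """Single dispatch over E + 1 experts with top-K + 1 slots."""
    ids, weights = softmax_topk(logits, k)
    experts = routed + [shared]          # shared expert becomes expert id E
    ids = ids + [len(routed)]            # always selected
    weights = weights + [shared_weight]  # assumed weight for the shared slot
    return dispatch(x, experts, ids, weights)

def make_expert(scale):
    """Toy stand-in for an expert MLP: scales its input (hypothetical)."""
    return lambda x: [scale * v for v in x]

routed = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
shared = make_expert(10.0)
x = [1.0, -2.0]
logits = [0.1, 2.0, -1.0, 0.5]
unfused = moe_unfused(x, logits, routed, shared, k=2)
fused = moe_fused(x, logits, routed, shared, k=2)
```

In this toy setting `fused` and `unfused` agree, which is the property the fusion relies on; the saving in the real kernel comes from launching one dispatch instead of two, not from changing the math.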

The Numbers

All rows are Qwen3.5-397B-A17B FP8 at ISL 8192 / OSL 1024 on a single non-disaggregated MI355X node, measured on InferenceX. Cost per million total tokens is computed as TCO_$/GPU/hr / (3600 × tput_per_gpu / 1e6) with MI355X TCO at $1.48/GPU/hr per the SemiAnalysis AI Cloud TCO Model.
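The cost formula above can be checked against the tables with a few lines of Python. This is a sketch of the stated formula only; the function and constant names are ours, not part of InferenceX.

```python
MI355X_TCO_PER_GPU_HR = 1.48  # $/GPU/hr, as quoted in the article

def cost_per_million_tokens(tput_per_gpu, tco_per_gpu_hr=MI355X_TCO_PER_GPU_HR):
    """Dollars per million total tokens for a given tok/s/GPU."""
    million_tokens_per_gpu_hr = 3600 * tput_per_gpu / 1e6
    return tco_per_gpu_hr / million_tokens_per_gpu_hr

# Lowest- and highest-throughput rows from the tables in this section.
feb_conc4 = round(cost_per_million_tokens(171.9), 2)     # Feb, TP=8, conc 4
may_conc256 = round(cost_per_million_tokens(6409.1), 2)  # May, TP=4, conc 256
```

Because TCO per GPU-hour is held fixed, cost per token is inversely proportional to tok/s/GPU, so every throughput ratio in this article is also a cost ratio.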

Container images per date:

  • 2026-02-20: rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260218
  • 2026-04-16: lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260414
  • 2026-05-19: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
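For readers who want to try a similar setup inside one of these images, the sketch below builds an SGLang launch command with the AITER gate set. It is not the InferenceX benchmark recipe: the model path is a hypothetical placeholder, the helper name is ours, and the benchmark runs may well pass additional flags that are not shown here. Only `SGLANG_USE_AITER=1` and the TP values come from this article.

```python
import os

def build_launch(model_path, tp_size, port=30000):
    """Return (env, argv) for an SGLang server with AITER kernels enabled."""
    env = dict(os.environ, SGLANG_USE_AITER="1")  # gate named in the article
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,   # hypothetical local path, see below
        "--tp-size", str(tp_size),    # 2 or 4 in the April and May runs
        "--port", str(port),
    ]
    return env, argv

# Placeholder path: substitute wherever the FP8 checkpoint actually lives.
env, argv = build_launch("/models/Qwen3.5-397B-A17B-FP8", tp_size=2)

# To actually start the server (requires SGLang and the GPUs):
# import subprocess; subprocess.run(argv, env=env, check=True)
```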

2026-02-20, MI355X SGLang FP8, TP=8 on 8 GPUs (baseline):

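| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
|---|---|---|---|---|
| 4 | 171.9 | 40.86 | 24.47 | $2.39 |
| 8 | 312.1 | 37.66 | 26.55 | $1.32 |
| 16 | 568.0 | 35.47 | 28.19 | $0.72 |
| 32 | 917.8 | 28.48 | 35.11 | $0.45 |
| 64 | 1,288.0 | 19.22 | 52.03 | $0.32 |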

2026-04-16, MI355X SGLang FP8, TP=2 on 2 GPUs (post-retune + AITER PRs):

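| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
|---|---|---|---|---|
| 4 | 1,074.3 | 63.89 | 15.65 | $0.38 |
| 8 | 1,704.6 | 50.98 | 19.61 | $0.24 |
| 16 | 2,571.9 | 38.50 | 26.51 | $0.16 |
| 32 | 3,567.8 | 26.22 | 38.15 | $0.12 |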

2026-04-16, MI355X SGLang FP8, TP=4 on 4 GPUs (high-throughput arm):

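| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
|---|---|---|---|---|
| 32 | 2,584.9 | 38.56 | 25.94 | $0.16 |
| 64 | 3,426.6 | 24.84 | 40.25 | $0.12 |
| 128 | 4,263.2 | 15.38 | 65.01 | $0.10 |
| 256 | 5,099.3 | 9.20 | 108.64 | $0.08 |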

2026-05-19, MI355X SGLang FP8, TP=2 on 2 GPUs (v0.5.12 bump):

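| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
|---|---|---|---|---|
| 4 | 1,267.5 | 75.22 | 13.29 | $0.32 |
| 8 | 2,008.1 | 59.67 | 16.76 | $0.20 |
| 16 | 3,175.6 | 46.73 | 21.40 | $0.13 |
| 32 | 4,346.8 | 31.91 | 31.34 | $0.09 |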

2026-05-19, MI355X SGLang FP8, TP=4 on 4 GPUs (v0.5.12 bump):

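| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
|---|---|---|---|---|
| 32 | 3,171.8 | 46.82 | 21.36 | $0.13 |
| 64 | 4,113.4 | 29.83 | 33.53 | $0.10 |
| 128 | 5,019.6 | 18.09 | 55.27 | $0.08 |
| 256 | 6,409.1 | 11.56 | 86.53 | $0.06 |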

Iso-Interactivity Throughput Comparison

Each date is interpolated on its Pareto frontier (the higher of TP=2 and TP=4 throughput at each interactivity for the April and May runs; TP=8 only for the Feb baseline). Ratios are throughput-per-GPU at matched tok/s/user:

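| Interactivity (tok/s/user) | Feb v0.5.8 tok/s/GPU | Apr v0.5.10rc0 tok/s/GPU | May v0.5.12 tok/s/GPU | May / Feb | May / Apr |
|---|---|---|---|---|---|
| 20 | 1,259 | 3,906 | 4,861 | 3.86x | 1.24x |
| 30 | 859 | 3,278 | 4,449 | 5.18x | 1.36x |
| 35 | 612 | 2,867 | 4,114 | 6.72x | 1.44x |
| 40 | 192 | 2,476 | 3,660 | 19.0x | 1.48x |
| 50 | unreachable | 1,765 | 2,959 | n/a | 1.68x |
| 60 | unreachable | 1,244 | 1,985 | n/a | 1.60x |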
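The interpolation step can be sketched in pure Python as follows. This is an illustration under assumptions, not the dashboard's code: it uses a PCHIP-style (Fritsch-Carlson) tangent rule, which is one common way to build a monotone cubic Hermite interpolant, and the dashboard may use a different tangent rule, so the values it produces will not necessarily match the table. Function and variable names are ours; the data points are copied from the per-date tables above as (tok/s/user, tok/s/GPU).

```python
def pchip_slopes(xs, ys):
    """Monotonicity-preserving tangents (weighted harmonic mean of secants)."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    d = [(ys[i + 1] - ys[i]) / h[i] for i in range(n - 1)]
    m = [d[0]] + [0.0] * (n - 2) + [d[-1]]  # one-sided tangents at the ends
    for i in range(1, n - 1):
        if d[i - 1] * d[i] > 0:  # same sign: keep a nonzero tangent
            w1 = 2 * h[i] + h[i - 1]
            w2 = h[i] + 2 * h[i - 1]
            m[i] = (w1 + w2) / (w1 / d[i - 1] + w2 / d[i])
    return m

def interpolate(points, x):
    """Cubic Hermite value at x, or None if x is outside the measured range."""
    pts = sorted(points)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    if x < xs[0] or x > xs[-1]:
        return None  # "unreachable": no extrapolation
    m = pchip_slopes(xs, ys)
    i = max(j for j in range(len(xs) - 1) if xs[j] <= x)
    h = xs[i + 1] - xs[i]
    t = (x - xs[i]) / h
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * ys[i] + h10 * h * m[i] + h01 * ys[i + 1] + h11 * h * m[i + 1]

def frontier(arms, x):
    """Higher of the available arms at interactivity x, or None."""
    vals = [v for v in (interpolate(a, x) for a in arms) if v is not None]
    return max(vals) if vals else None

FEB_TP8 = [(40.86, 171.9), (37.66, 312.1), (35.47, 568.0),
           (28.48, 917.8), (19.22, 1288.0)]
MAY_TP2 = [(75.22, 1267.5), (59.67, 2008.1), (46.73, 3175.6), (31.91, 4346.8)]
MAY_TP4 = [(46.82, 3171.8), (29.83, 4113.4), (18.09, 5019.6), (11.56, 6409.1)]

feb_40 = frontier([FEB_TP8], 40.0)
may_40 = frontier([MAY_TP2, MAY_TP4], 40.0)
ratio_40 = may_40 / feb_40           # iso-interactivity ratio at 40 tok/s/user
feb_50 = frontier([FEB_TP8], 50.0)   # None: the Feb curve stops at 40.86
```

Returning `None` outside the measured range, rather than extrapolating, is what produces the "unreachable" cells. It also shows why the ratio at 40 tok/s/user is sensitive: the Feb curve is falling steeply between its last two points there, so small changes in the tangent rule move the denominator noticeably.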

The 19x peak at 40 tok/s/user is partly a regime extension: the Feb TP=8 recipe hit a 24.5 ms TPOT floor at conc 4 (40.86 tok/s/user) and could not serve any faster per user on this workload, so the comparison band tops out where the old recipe was already in collapse. By 50 tok/s/user the v0.5.8 curve doesn't exist at all; by 75 tok/s/user only the v0.5.12 curve still has a point. The May v0.5.12 image alone adds 1.24x to 1.68x on top of the April baseline across the entire shared band, a clean version-bump win.
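The reachability argument reduces to comparing a target interactivity against each run's fastest measured point. A small sketch, with names of our own choosing and the per-run maxima taken from the conc 4 rows of the tables above:

```python
# Highest measured tok/s/user per run (conc 4 row of each run's fastest arm).
RUN_MAX_INTERACTIVITY = {
    "v0.5.8.post1 (Feb)": 40.86,
    "v0.5.10rc0 (Apr)": 63.89,
    "v0.5.12 (May)": 75.22,
}

def reachable_runs(target_tok_s_user):
    """Runs whose measured curve extends at least to the target."""
    return [name for name, peak in RUN_MAX_INTERACTIVITY.items()
            if peak >= target_tok_s_user]

def tpot_ms_to_interactivity(tpot_ms):
    """Approximate tok/s/user implied by a time-per-output-token in ms."""
    return 1000.0 / tpot_ms

at_40 = reachable_runs(40)
at_50 = reachable_runs(50)
at_75 = reachable_runs(75)
feb_floor = tpot_ms_to_interactivity(24.47)
```

The reciprocal relation between TPOT and tok/s/user is only approximate here: the tables report the two columns separately, and they do not always agree to the last digit.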

Live chart, pre-filtered to MI355X SGLang Qwen3.5 FP8 across all three runs.

What's Next for MI355X on Qwen3.5

  • Disaggregated Serving. Qwen3.5's 512-expert pool is exactly the regime where a disaggregated prefill/decode split should shine. There is no MI355X Qwen3.5 disaggregated recipe yet: AMD has not shipped disaggregated serving for this model.

Acknowledgments

This 3-month curve move is the work of zhentaocc (Todd Chen) and yichiche (Jacky Cheng) at AMD, who authored all three upstream SGLang PRs, with HaiShaw reviewing and merging. Speed of the upstream-to-benchmark loop is the moat.

All articles and posts are © SemiAnalysis. All rights reserved. The AGPL-3.0 license covering the application source code does not apply to article content.