Qwen 3.5 397B-A17B — B200 vs MI355X
Head-to-head AI inference benchmark comparison of B200 (NVIDIA Blackwell) and MI355X (AMD CDNA 4) on Qwen 3.5 397B-A17B. Latency, throughput, and cost across LLM workloads. Use the chart controls below to switch sequences, precisions, and metrics — same interactions as the main inference chart.
At 81 tok/s/user interactivity on Qwen 3.5 397B-A17B, B200 delivers 2426 tok/s/GPU at $0.22 per million tokens; MI355X delivers 1625 tok/s/GPU at $0.26. MI355X costs 15% more per token, and B200 delivers 49% more tok/s/GPU at this point.
B200 posts 1057 tok/s/GPU for $0.51 per million tokens at 133 tok/s/user on Qwen 3.5 397B-A17B; MI355X posts 913 tok/s/GPU for $0.45. B200 costs 13% more per token but delivers 16% more tok/s/GPU.
Throughput at 185 tok/s/user on Qwen 3.5 397B-A17B: B200 hits 510 tok/s/GPU, MI355X hits 510. Per-million-token costs land at $1.02 and $0.78 respectively. B200 costs 31% more per token; throughput per GPU is essentially tied. (Numbers reflect the default 1k/1k, fp8 selection for this URL; the table and chart below update if you change sequence, precision, or model in the controls.)
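Percentage comparisons like the ones above depend on which value is taken as the baseline: the gap between $0.223 and $0.257 is about 15% measured against the lower price and about 13% measured against the higher one. The following is a minimal sketch of both conventions, using the unrounded values from the table on this page. The helper names `pct_more` and `pct_less` are illustrative and not part of any benchmark tooling.

```python
def pct_more(a: float, b: float) -> float:
    """Percent by which a exceeds b (baseline is b)."""
    return (a / b - 1.0) * 100.0


def pct_less(a: float, b: float) -> float:
    """Percent by which a falls below b (baseline is b)."""
    return (1.0 - a / b) * 100.0


# (interactivity tok/s/user,
#  B200 tok/s/GPU, MI355X tok/s/GPU,
#  B200 $/M tok,   MI355X $/M tok)
POINTS = [
    (81, 2425.9, 1625.2, 0.223, 0.257),
    (133, 1057.1, 913.0, 0.512, 0.453),
    (185, 510.1, 510.4, 1.021, 0.781),
]

if __name__ == "__main__":
    for inter, tput_b200, tput_mi, cost_b200, cost_mi in POINTS:
        # Throughput gap, expressed relative to MI355X.
        tput_gap = pct_more(tput_b200, tput_mi)
        # Cost gap, expressed both ways around the cheaper GPU.
        if cost_b200 < cost_mi:
            cheaper, lo, hi = "B200", cost_b200, cost_mi
        else:
            cheaper, lo, hi = "MI355X", cost_mi, cost_b200
        print(
            f"{inter} tok/s/user: B200 throughput {tput_gap:+.1f}% vs MI355X; "
            f"{cheaper} is {pct_less(lo, hi):.1f}% cheaper "
            f"(the other costs {pct_more(hi, lo):.1f}% more)"
        )
```

Stating the baseline explicitly ("costs X% more than" versus "is Y% cheaper than") avoids mixing the two figures, which diverge as the gap grows.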
| Metric | GPU | 81 tok/s/user | 133 tok/s/user | 185 tok/s/user |
|---|---|---|---|---|
| Throughput (tok/s/GPU) | B200 | 2425.9 | 1057.1 | 510.1 |
| Throughput (tok/s/GPU) | MI355X | 1625.2 | 913.0 | 510.4 |
| Cost ($/M tok) | B200 | $0.223 | $0.512 | $1.021 |
| Cost ($/M tok) | MI355X | $0.257 | $0.453 | $0.781 |
| tok/s/MW | B200 | 1117918 | 487154 | 235082 |
| tok/s/MW | MI355X | 613278 | 344529 | 192596 |
| Concurrency | B200 | ~65 | ~18 | ~6 |
| Concurrency | MI355X | ~41 | ~14 | ~8 |
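The tok/s/MW row can be related to the per-GPU throughput row by simple division: (tok/s/MW) / (tok/s/GPU) gives the number of GPUs per megawatt implied by the table. The sketch below performs that arithmetic on the values above. The page does not say which power figure (chip TDP, all-in rack power, or something else) underlies the tok/s/MW metric, so the resulting per-GPU kilowatt number is an inference from the table, not a published specification. The function names are illustrative.

```python
# (tok/s/GPU, tok/s/MW) pairs at 81, 133, and 185 tok/s/user, from the table.
ROWS = {
    "B200": [(2425.9, 1117918), (1057.1, 487154), (510.1, 235082)],
    "MI355X": [(1625.2, 613278), (913.0, 344529), (510.4, 192596)],
}


def gpus_per_mw(tok_s_gpu: float, tok_s_mw: float) -> float:
    """GPUs per megawatt implied by the two throughput figures."""
    return tok_s_mw / tok_s_gpu


def implied_kw_per_gpu(tok_s_gpu: float, tok_s_mw: float) -> float:
    """Implied power budget per GPU in kW (1 MW = 1000 kW)."""
    return 1000.0 / gpus_per_mw(tok_s_gpu, tok_s_mw)


if __name__ == "__main__":
    for gpu, pairs in ROWS.items():
        for tok_s_gpu, tok_s_mw in pairs:
            print(
                f"{gpu}: {gpus_per_mw(tok_s_gpu, tok_s_mw):.1f} GPUs/MW, "
                f"{implied_kw_per_gpu(tok_s_gpu, tok_s_mw):.2f} kW per GPU"
            )
```

The ratio comes out nearly constant across the three interactivity points for each GPU (roughly 2.17 kW per B200 and 2.65 kW per MI355X), which suggests tok/s/MW is derived from per-GPU throughput using a fixed per-GPU power figure.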
Inference Performance
Inference performance metrics across different models, hardware configurations, and serving parameters.