MI325X BF16: MTP vs Off Speculative Decoding
This page compares speculative decoding with MTP (multi-token prediction) against speculation turned off, on MI325X BF16 (AMD CDNA 3) running Qwen 3.5 397B-A17B. It covers throughput, cost, and interactivity differences across LLM workloads. Use the chart controls below to switch sequences and metrics; the interactions are the same as on the main inference chart.
MTP acceptance-rate implementations differ across inference engines, so points from different engines are not directly comparable on the same curve. Throughput and cost at matched interactivity may reflect engine-level differences rather than pure speculative decoding gains. Interpret cross-engine comparisons with caution.
At 55 tok/s/user on Qwen 3.5 397B-A17B (MI325X BF16), MTP and Off both post 392 tok/s/GPU at $0.90 per million tokens, so cost per token and throughput per GPU are essentially tied. Draft-token acceptance rates determine whether speculative decoding helps or hurts at a given concurrency level.
At 70 tok/s/user on Qwen 3.5 397B-A17B (MI325X BF16), MTP reaches 274 tok/s/GPU and Off reaches 275, at $1.30 and $1.29 per million tokens respectively. Cost per token and throughput per GPU are again essentially tied. Speculative decoding trades extra compute on draft tokens for fewer decoding steps; the payoff depends on sequence length and batch size.
Toward the upper edge of the 41–98 tok/s/user interactivity band, at 84 tok/s/user on Qwen 3.5 397B-A17B (MI325X BF16), MTP runs 168 tok/s/GPU at $2.11/M tokens and Off runs 170 at $2.09/M. Off is about 1% ahead on both metrics, which is still close to a tie. Gains from speculative decoding vary by workload; short-output prompts tend to benefit less. (Numbers reflect this page's pinned 1k/1k, BF16 workload. Changing the sequence or model updates both the table and the chart; the table stays pinned to this page's precision, so precision toggles in the controls affect the chart only.)
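
The trade-off described above (extra draft compute in exchange for fewer decoding steps) can be illustrated with a simple textbook model. This is a rough sketch, not how any particular engine implements MTP: it assumes each draft token is accepted independently with the same probability `alpha`, that a verification step costs the same as a normal decode step, and that each draft token adds a relative cost `draft_cost`. The function names and all three parameters are hypothetical and are not taken from the benchmark.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes k draft tokens, each accepted independently with probability
    alpha; the first rejection ends the step, and the target model always
    contributes one token of its own (a correction or a bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per unit of compute relative to plain decoding.

    draft_cost is an assumed cost of producing one draft token, expressed
    as a fraction of one normal decode step.
    """
    step_cost = 1.0 + k * draft_cost
    return expected_tokens_per_step(alpha, k) / step_cost


if __name__ == "__main__":
    # Illustrative inputs only; these are not measured acceptance rates.
    for alpha in (0.1, 0.5, 0.8):
        print(alpha, round(estimated_speedup(alpha, k=2, draft_cost=0.2), 3))
```

With these illustrative inputs the estimate drops below 1 at `alpha = 0.1` and rises above 1 at `alpha = 0.5`, which is the sense in which acceptance rate decides whether speculation helps or hurts. At larger batch sizes verification tends to become compute-bound, which this model can only approximate through a higher `draft_cost`.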

| Metric | 55 tok/s/user | 70 tok/s/user | 84 tok/s/user |
|---|---|---|---|
| Throughput (tok/s/GPU) | MTP: 392.4 / Off: 392.0 | MTP: 274.3 / Off: 275.0 | MTP: 168.0 / Off: 169.6 |
| Cost ($/M tok) | MTP: $0.900 / Off: $0.901 | MTP: $1.296 / Off: $1.291 | MTP: $2.113 / Off: $2.092 |
| tok/s/MW | MTP: 179995 / Off: 179800 | MTP: 125832 / Off: 126151 | MTP: 77057 / Off: 77807 |
| Concurrency | MTP: ~29 / Off: ~29 | MTP: ~16 / Off: ~16 | MTP: ~8 / Off: ~8 |
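
The rows of the table appear to be tied together by simple arithmetic, which the following sketch checks. It assumes that cost per million tokens is a fixed GPU hourly price divided by tokens generated per hour, and that tok/s/MW is per-GPU throughput divided by a fixed per-GPU power figure. The page states neither the price nor the power figure, so the values computed here are implied by the table rather than quoted from it, and the helper names are hypothetical.

```python
# Values copied from the table: interactivity (tok/s/user) ->
# variant -> (tok/s/GPU, $ per million tokens, tok/s/MW).
ROWS = {
    55: {"MTP": (392.4, 0.900, 179995), "Off": (392.0, 0.901, 179800)},
    70: {"MTP": (274.3, 1.296, 125832), "Off": (275.0, 1.291, 126151)},
    84: {"MTP": (168.0, 2.113, 77057), "Off": (169.6, 2.092, 77807)},
}


def implied_gpu_hour_price(tok_per_s_gpu: float, cost_per_mtok: float) -> float:
    """GPU price in $/hour implied by throughput and cost (assumed model)."""
    tokens_per_hour = tok_per_s_gpu * 3600
    return cost_per_mtok * tokens_per_hour / 1e6


def implied_watts_per_gpu(tok_per_s_gpu: float, tok_per_s_mw: float) -> float:
    """Power per GPU in watts implied by the tok/s/MW row (assumed model)."""
    return 1e6 * tok_per_s_gpu / tok_per_s_mw


def pct_delta(mtp: float, off: float) -> float:
    """Signed difference of MTP relative to Off, in percent."""
    return 100.0 * (mtp - off) / off


if __name__ == "__main__":
    for interactivity, variants in ROWS.items():
        for name, (tput, cost, per_mw) in variants.items():
            price = implied_gpu_hour_price(tput, cost)
            watts = implied_watts_per_gpu(tput, per_mw)
            print(f"{interactivity} tok/s/user {name}: "
                  f"${price:.3f}/GPU-hour, {watts:.0f} W/GPU")
        tput_gap = pct_delta(variants["MTP"][0], variants["Off"][0])
        cost_gap = pct_delta(variants["MTP"][1], variants["Off"][1])
        print(f"  MTP vs Off: throughput {tput_gap:+.2f}%, cost {cost_gap:+.2f}%")
```

Under these assumptions every cell implies roughly the same hourly price and the same per-GPU power, which suggests that the cost and tok/s/MW rows are rescalings of the throughput row rather than independent measurements. The percentage gaps stay near or below 1% at all three interactivity levels.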