MI325X FP8: MTP vs Off Speculative Decoding
This page compares speculative decoding with MTP (multi-token prediction) against speculative decoding turned off on MI325X FP8 (AMD CDNA 3) running GLM 5/5.1, covering throughput, cost, and interactivity differences across LLM workloads. Use the chart controls below to switch sequences and metrics; they work the same way as on the main inference chart.
MTP acceptance-rate implementations differ across inference engines. Points from different engines are not directly comparable on the same curve — throughput and cost at matched interactivity may reflect engine-level differences rather than pure speculative decoding gains. Interpret cross-engine comparisons with caution.
At 21 tok/s/user on GLM 5/5.1 (MI325X FP8), MTP reaches 186 tok/s/GPU and Off reaches 185 tok/s/GPU, at $1.86 and $1.87 per million tokens respectively. Both cost per token and throughput per GPU are essentially tied. Speculative decoding spends extra compute on draft tokens in exchange for fewer decoding steps, so the payoff depends on sequence length and batch size.
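The trade-off between extra draft compute and fewer decoding steps can be sketched with a simplified textbook model. This is an illustration under stated assumptions, not the method used by any engine measured here: it assumes each draft token is accepted independently with the same probability, and the names `accept_prob`, `draft_len`, and `step_cost_ratio` are hypothetical parameters, not values reported on this page.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step of the target model.

    Assumption: each of the draft_len draft tokens is accepted independently
    with probability accept_prob, and every step yields one additional token
    from the target model (the correction or bonus token).
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        # Every draft token is accepted, plus the one extra token.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def net_speedup(accept_prob: float, draft_len: int, step_cost_ratio: float) -> float:
    """Speedup over plain decoding in this simplified model.

    step_cost_ratio (hypothetical) is the cost of one speculative step
    (drafting plus verification) relative to one plain decoding step.
    """
    if step_cost_ratio <= 0.0:
        raise ValueError("step_cost_ratio must be positive")
    return expected_tokens_per_step(accept_prob, draft_len) / step_cost_ratio


if __name__ == "__main__":
    # Illustrative inputs only; these are not measurements from this page.
    print(expected_tokens_per_step(0.8, 3))
    print(net_speedup(0.5, 1, 1.5))
```

In this model, a net speedup near 1.0 means the extra cost per step roughly cancels the extra tokens accepted per step, which is one way a near-tie between the two modes can arise.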
Near the middle of the 17–34 tok/s/user interactivity band, at 26 tok/s/user on GLM 5/5.1 (MI325X FP8), MTP runs at 100 tok/s/GPU and $3.58 per million tokens, while Off runs at 100 tok/s/GPU and $3.56 per million tokens. Both cost per token and throughput per GPU are essentially tied. Gains from speculative decoding vary by workload; short-output prompts tend to benefit less.
At 30 tok/s/user on GLM 5/5.1 (MI325X FP8), MTP delivers 62 tok/s/GPU at $5.69 per million tokens; Off delivers 62 tok/s/GPU at $5.66. Both cost per token and throughput per GPU are essentially tied. Speculative decoding accepts draft tokens to reduce per-token latency, and the gains vary by workload and prompt distribution.
Note: these numbers reflect this URL's pinned 1k/1k · fp8 workload. Changing the sequence or model updates both the table and the chart. The table stays pinned to this page's precision, so precision toggles in the controls affect the chart only.

| Metric | 21 tok/s/user | 26 tok/s/user | 30 tok/s/user |
|---|---|---|---|
| Throughput (tok/s/GPU) | MTP: 186.0; Off: 185.3 | MTP: 99.5; Off: 99.9 | MTP: 61.7; Off: 62.0 |
| Cost ($/M tok) | MTP: $1.864; Off: $1.874 | MTP: $3.578; Off: $3.564 | MTP: $5.691; Off: $5.661 |
| tok/s/MW | MTP: 85315; Off: 85020 | MTP: 45647; Off: 45803 | MTP: 28305; Off: 28433 |
| Concurrency | MTP: ~39; Off: ~38 | MTP: ~16; Off: ~16 | MTP: ~9; Off: ~9 |

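
To put a number on "essentially tied", the short sketch below computes the relative difference of MTP against Off for each metric. The values are copied from the table, with the interactivity levels (21, 26, 30 tok/s/user) taken from the prose above; the names `TABLE`, `pct_delta`, and `summarize` are illustrative helpers, not part of any benchmark tooling.

```python
# (MTP, Off) pairs copied from the table, keyed by interactivity in tok/s/user.
TABLE = {
    21: {"throughput": (186.0, 185.3), "cost": (1.864, 1.874), "tok_s_mw": (85315, 85020)},
    26: {"throughput": (99.5, 99.9), "cost": (3.578, 3.564), "tok_s_mw": (45647, 45803)},
    30: {"throughput": (61.7, 62.0), "cost": (5.691, 5.661), "tok_s_mw": (28305, 28433)},
}


def pct_delta(mtp: float, off: float) -> float:
    """Relative difference of MTP versus Off, in percent of the Off value."""
    return 100.0 * (mtp - off) / off


def summarize(table: dict) -> list:
    """Return one row per interactivity level with percent deltas per metric."""
    rows = []
    for level in sorted(table):
        row = {"interactivity": level}
        for metric, (mtp, off) in table[level].items():
            row[metric] = round(pct_delta(mtp, off), 2)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in summarize(TABLE):
        print(row)
```

On these figures every difference comes out below 1% in magnitude, and the sign flips between interactivity levels: MTP is slightly ahead on throughput at 21 tok/s/user and slightly behind at 26 and 30.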
Inference Performance
Inference performance metrics across different models, hardware configurations, and serving parameters.