H200 FP8: Speculative Decoding, MTP vs. Off
A comparison of speculative decoding with MTP (multi-token prediction) against speculative decoding turned off (Off), serving Qwen 3.5 397B-A17B in FP8 on NVIDIA H200 (Hopper) GPUs. It covers differences in throughput, cost, and interactivity across LLM workloads.
Inference engines implement MTP draft-token acceptance differently, so acceptance rates differ from engine to engine. Points from different engines on the same curve are therefore not directly comparable: at matched interactivity, throughput and cost gaps may reflect engine-level differences rather than speculative decoding gains alone. Interpret cross-engine comparisons with caution.
At 55 tok/s/user on Qwen 3.5 397B-A17B (H200 FP8), MTP delivers 549 tok/s/GPU at $0.71 per million tokens; Off delivers 342 tok/s/GPU at $1.14. MTP is 38% cheaper per token (equivalently, Off costs 62% more) and delivers 61% more tok/s/GPU. Speculative decoding cuts per-token latency by having the full model verify several draft tokens in one forward pass and keep the ones it agrees with; gains vary by workload and prompt distribution.
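To make "accepting draft tokens" concrete, here is a minimal sketch of greedy draft-and-verify decoding. It is not any engine's implementation: `draft` and `target` are hypothetical toy functions standing in for the MTP drafter and the full model, acceptance is an exact match against the target's greedy token (sampling-based engines use a rejection-sampling rule instead), and the target's check is written as a loop where a real engine runs one batched forward pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """One greedy draft-and-verify step; returns the tokens emitted this step."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2. The target checks each proposed position (one batched pass in practice).
    emitted: List[int] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's own token and discard the rest.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
        ctx.append(tok)
    # 3. Every draft was accepted, so the verification pass yields one bonus token.
    emitted.append(target(ctx))
    return emitted


def generate(prefix: List[int], draft: NextToken, target: NextToken, k: int, n_new: int) -> Tuple[List[int], int]:
    """Generate n_new tokens; return (sequence, number of verification steps)."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft, target, k))
        steps += 1
    return out[: len(prefix) + n_new], steps


# Toy models: the target counts up mod 10; the drafter is wrong at some positions.
def target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10 if len(ctx) % 4 != 3 else 0
```

With these toy models, `generate([1], draft, target, 3, 8)` returns the same eight tokens that plain greedy decoding with `target` would, but in 3 verification steps instead of 8. The output is identical because every emitted token is either confirmed or produced by the target; only the number of full-model passes changes.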
MTP posts 458 tok/s/GPU for $0.86 per million tokens at 75 tok/s/user on Qwen 3.5 397B-A17B (H200 FP8); Off posts 247 tok/s/GPU for $1.56. MTP is 45% cheaper per token (equivalently, Off costs 82% more) and delivers 85% more tok/s/GPU. Draft-token acceptance rates determine whether speculative decoding helps or hurts at a given concurrency level.
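Why acceptance rate decides the outcome can be seen from a common back-of-the-envelope model. It assumes each of k draft tokens is accepted independently with probability α (real acceptance is not strictly independent), which gives an expected 1 + α + … + α^k tokens per verification step. A speculative step also costs more than a plain decode step, so the net gain is roughly that expectation divided by the relative step cost. `step_cost_ratio` below is a hypothetical parameter, not a value measured in these benchmarks; it plausibly grows with batch size as decoding becomes compute-bound.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-and-verify step.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha, stopping at the first rejection, and that the
    verification pass always contributes one token from the full model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Geometric series 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, step_cost_ratio: float) -> float:
    """Decode throughput relative to running without speculation.

    step_cost_ratio (hypothetical) = cost of one speculative step
    (draft + verify) / cost of one plain decode step. A result below
    1.0 means speculation is a net loss at that operating point.
    """
    if step_cost_ratio <= 0:
        raise ValueError("step_cost_ratio must be positive")
    return expected_tokens_per_step(alpha, k) / step_cost_ratio
```

With illustrative numbers, α = 0.8 and one draft token give about 1.8 tokens per step, so a step that costs 1.2× a plain one still yields roughly a 1.5× speedup. Drop α to 0.5 and raise the step cost to 1.6×, and the estimate falls to about 0.94, a net slowdown.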
At 96 tok/s/user on Qwen 3.5 397B-A17B (H200 FP8), MTP reaches 378 tok/s/GPU and Off reaches 197, at $1.04 and $1.98 per million tokens respectively. MTP is 48% cheaper per token (equivalently, Off costs 91% more) and delivers 92% more tok/s/GPU. Speculative decoding spends extra compute on draft tokens in exchange for fewer decoding steps; the payoff depends on sequence length and batch size. (All figures on this page are for the 1k/1k sequence-length workload at FP8 precision.)
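Two ways of stating a cost gap are easy to conflate: "MTP is X% cheaper" is 1 − cost_MTP / cost_Off, while "Off costs X% more" is cost_Off / cost_MTP − 1. The sketch below recomputes both, plus the throughput gain, from the table values on this page.

```python
# Throughput (tok/s/GPU) and cost ($/M tokens) copied from the table,
# keyed by interactivity in tok/s/user.
TABLE = {
    55: {"MTP": (549.4, 0.706), "Off": (342.0, 1.143)},
    75: {"MTP": (457.7, 0.857), "Off": (247.2, 1.559)},
    96: {"MTP": (378.5, 1.035), "Off": (197.1, 1.980)},
}


def compare(mtp: tuple[float, float], off: tuple[float, float]) -> tuple[float, float, float]:
    """Return (throughput gain, MTP cost saving, Off cost premium) as fractions."""
    mtp_tput, mtp_cost = mtp
    off_tput, off_cost = off
    throughput_gain = mtp_tput / off_tput - 1  # "MTP delivers X% more tok/s/GPU"
    cost_saving = 1 - mtp_cost / off_cost      # "MTP is X% cheaper per token"
    off_premium = off_cost / mtp_cost - 1      # "Off costs X% more per token"
    return throughput_gain, cost_saving, off_premium


for interactivity, row in TABLE.items():
    gain, saving, premium = compare(row["MTP"], row["Off"])
    print(f"{interactivity} tok/s/user: MTP +{gain:.0%} tok/s/GPU, "
          f"{saving:.0%} cheaper; Off costs {premium:.0%} more")
```

This prints savings of 38%, 45%, and 48% alongside Off premiums of 62%, 82%, and 91%. Because cost per token here is close to inversely proportional to tok/s/GPU, Off's cost premium tracks MTP's throughput gain (62% vs. 61%, 82% vs. 85%, 91% vs. 92%), while the saving is always smaller.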

| Metric | 55 tok/s/user | 75 tok/s/user | 96 tok/s/user |
|---|---|---|---|
| Throughput, MTP (tok/s/GPU) | 549.4 | 457.7 | 378.5 |
| Throughput, Off (tok/s/GPU) | 342.0 | 247.2 | 197.1 |
| Cost, MTP ($/M tokens) | $0.706 | $0.857 | $1.035 |
| Cost, Off ($/M tokens) | $1.143 | $1.559 | $1.980 |
| Power efficiency, MTP (tok/s/MW) | 317,594 | 264,575 | 218,781 |
| Power efficiency, Off (tok/s/MW) | 197,685 | 142,862 | 113,948 |
| Concurrency, MTP | ~40 | ~24 | ~16 |
| Concurrency, Off | ~25 | ~13 | ~9 |
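
A quick consistency check on the tok/s/MW row: dividing it by tok/s/GPU gives the same ratio, about 578 GPUs per MW (roughly 1.73 kW per GPU), in every cell. That suggests, though the page does not say so, that tok/s/MW is tok/s/GPU scaled by a fixed per-GPU power figure, so it ranks MTP and Off exactly as throughput does. What that figure includes (GPU only, or server and facility overhead) is not stated; the sketch below only derives it from the table.

```python
# (config, interactivity in tok/s/user, tok/s/GPU, tok/s/MW) copied from the table.
CELLS = [
    ("MTP", 55, 549.4, 317594), ("Off", 55, 342.0, 197685),
    ("MTP", 75, 457.7, 264575), ("Off", 75, 247.2, 142862),
    ("MTP", 96, 378.5, 218781), ("Off", 96, 197.1, 113948),
]


def implied_watts_per_gpu(tok_s_gpu: float, tok_s_mw: float) -> float:
    """Power per GPU implied by the two throughput columns.

    tok_s_mw / tok_s_gpu is GPUs per megawatt, so its inverse, in watts,
    is the power budget the benchmark appears to assign to each GPU.
    """
    gpus_per_mw = tok_s_mw / tok_s_gpu
    return 1e6 / gpus_per_mw


for config, interactivity, tok_s_gpu, tok_s_mw in CELLS:
    watts = implied_watts_per_gpu(tok_s_gpu, tok_s_mw)
    print(f"{config} @ {interactivity} tok/s/user: ~{watts:.0f} W per GPU")
```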