H100 FP8: MTP vs Off Speculative Decoding
Speculative decoding comparison of MTP (multi-token prediction) versus Off (speculative decoding disabled) on H100 FP8 (NVIDIA Hopper) running Qwen 3.5 397B-A17B, covering throughput, cost, and interactivity differences across LLM workloads.
MTP implementations and their draft-token acceptance rates differ across inference engines, so points from different engines are not directly comparable on the same curve. Throughput and cost at matched interactivity may reflect engine-level differences rather than pure speculative decoding gains. Interpret cross-engine comparisons with caution.
Near the low end of the 60–157 tok/s/user interactivity band, at 84 tok/s/user on Qwen 3.5 397B-A17B (H100 FP8), MTP runs 361 tok/s/GPU at $1.00/M tokens and Off runs 214 tok/s/GPU at $1.68/M. MTP costs 40% less per token (equivalently, Off costs 67% more), and MTP delivers 69% more tok/s/GPU. Gains from speculative decoding vary by workload; short-output prompts tend to benefit less.
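A cost gap can be stated in two directions: how much cheaper MTP is relative to Off, or how much more Off costs relative to MTP. The two percentages differ and are easy to swap by mistake. The short Python sketch below recomputes both, plus the throughput uplift, from the table's values for all three interactivity points; the function names are illustrative only.

```python
def pct_cheaper(cost_new: float, cost_base: float) -> float:
    """Percent reduction in cost per token when moving from base to new."""
    return (cost_base - cost_new) / cost_base * 100


def pct_more(value_new: float, value_base: float) -> float:
    """Percent by which value_new exceeds value_base."""
    return (value_new / value_base - 1) * 100


# (tok/s/user, MTP tok/s/GPU, Off tok/s/GPU, MTP $/M tok, Off $/M tok), from the table
POINTS = [
    (84, 360.9, 213.9, 1.005, 1.677),
    (108, 291.9, 160.2, 1.219, 2.290),
    (133, 240.5, 101.5, 1.505, 3.593),
]

for interactivity, mtp_tps, off_tps, mtp_cost, off_cost in POINTS:
    print(
        f"{interactivity} tok/s/user: "
        f"MTP {pct_cheaper(mtp_cost, off_cost):.0f}% cheaper, "
        f"Off {pct_more(off_cost, mtp_cost):.0f}% more expensive, "
        f"MTP {pct_more(mtp_tps, off_tps):.0f}% more tok/s/GPU"
    )
```

At 84 tok/s/user this gives 40% cheaper versus 67% more expensive: the same gap, measured against different baselines.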
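At 108 tok/s/user on Qwen 3.5 397B-A17B (H100 FP8), MTP delivers 292 tok/s/GPU at $1.22 per million tokens; Off delivers 160 tok/s/GPU at $2.29. MTP costs 47% less per token (equivalently, Off costs 88% more), and MTP delivers 82% more tok/s/GPU. In speculative decoding, draft tokens (here proposed by MTP) are verified by the full model in a single forward pass, and every accepted draft token saves a decoding step, which reduces per-token latency. Gains vary by workload and prompt distribution.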
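How many tokens each verification pass yields depends on the acceptance rate. Under the simplifying assumption used in the speculative decoding analysis of Leviathan et al. (2023), each of k draft tokens is accepted independently with probability alpha and verification stops at the first rejection. Under that assumption, the expected number of tokens emitted per target-model forward pass is (1 - alpha^(k+1)) / (1 - alpha). The sketch below evaluates that formula. The alpha and k values in the loop are hypothetical, because this page does not report the acceptance rate or draft length used.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha and that verification stops at the first rejection.
    The target model always contributes one token of its own (a correction
    after a rejection, or a bonus token when every draft is accepted).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return k + 1.0  # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


# Hypothetical acceptance rates and draft lengths, for illustration only.
for alpha in (0.5, 0.7, 0.9):
    for k in (1, 3):
        print(f"alpha={alpha}, k={k}: {expected_tokens_per_step(alpha, k):.2f} tokens/step")
```

With alpha = 0 the result is exactly one token per step, no better than ordinary decoding. The formula also ignores the cost of running the drafter itself.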
At 133 tok/s/user on Qwen 3.5 397B-A17B (H100 FP8), MTP posts 240 tok/s/GPU at $1.50 per million tokens; Off posts 102 tok/s/GPU at $3.59. MTP costs 58% less per token (equivalently, Off costs 139% more), and MTP delivers 137% more tok/s/GPU. Draft-token acceptance rates determine whether speculative decoding helps or hurts at a given concurrency level. All figures on this page reflect the 1k/1k (input/output tokens) FP8 workload.
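A net speedup also depends on the cost of drafting and verification. Leviathan et al. (2023) give the expected walltime improvement factor as (1 - alpha^(k+1)) / ((1 - alpha)(k*c + 1)), where c is the cost of one draft step relative to one target forward pass. The formula assumes that verifying k + 1 tokens costs about as much as one ordinary decode step. That assumption is most plausible when decoding is memory-bandwidth bound, as at low concurrency. As batches grow and the GPU becomes compute bound, verification is no longer nearly free, which is one way acceptance rate and concurrency interact. A minimal sketch follows; all parameter values are hypothetical.

```python
def walltime_improvement(alpha: float, k: int, c: float) -> float:
    """Expected walltime improvement factor of speculative decoding.

    alpha: per-token draft acceptance rate (assumed i.i.d.)
    k:     draft tokens proposed per verification step
    c:     cost of one draft step relative to one target forward pass
    Assumes verifying k + 1 tokens costs the same as one target step.
    A result above 1.0 means speculative decoding helps; below 1.0, it hurts.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        tokens_per_step = k + 1.0
    else:
        tokens_per_step = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Each step pays for k draft passes (k * c) plus one target pass (1).
    return tokens_per_step / (k * c + 1)


# Hypothetical settings: a cheap, accurate drafter vs. a costly, inaccurate one.
print(f"{walltime_improvement(0.8, 1, 0.05):.2f}")  # above 1.0: helps
print(f"{walltime_improvement(0.1, 4, 0.3):.2f}")   # below 1.0: hurts
```

This model is consistent with the table, though it does not prove anything about it: MTP's throughput advantage is largest at 133 tok/s/user, where concurrency is lowest.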

| Metric | 84 tok/s/user | 108 tok/s/user | 133 tok/s/user |
|---|---|---|---|
| Throughput (tok/s/GPU), MTP | 360.9 | 291.9 | 240.5 |
| Throughput (tok/s/GPU), Off | 213.9 | 160.2 | 101.5 |
| Cost ($/M tokens), MTP | $1.005 | $1.219 | $1.505 |
| Cost ($/M tokens), Off | $1.677 | $2.290 | $3.593 |
| Throughput per MW (tok/s/MW), MTP | 208,590 | 168,728 | 139,002 |
| Throughput per MW (tok/s/MW), Off | 123,652 | 92,574 | 58,681 |
| Concurrency, MTP | ~18 | ~11 | ~8 |
| Concurrency, Off | ~11 | ~6 | ~3 |
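The page does not say how the cost and tok/s/MW rows are derived. Suppose both are simple rescalings of throughput, with a fixed price per GPU-hour and a fixed power budget per GPU (an assumption, not a documented method). In that case, backing those constants out of each cell should give nearly the same number everywhere. The sketch below performs that check. `ROWS` is transcribed from the table, and the helper names are hypothetical.

```python
# (tok/s/user, config, tok/s/GPU, $/M tokens, tok/s/MW), transcribed from the table
ROWS = [
    (84, "MTP", 360.9, 1.005, 208_590),
    (84, "Off", 213.9, 1.677, 123_652),
    (108, "MTP", 291.9, 1.219, 168_728),
    (108, "Off", 160.2, 2.290, 92_574),
    (133, "MTP", 240.5, 1.505, 139_002),
    (133, "Off", 101.5, 3.593, 58_681),
]


def implied_gpu_hour_price(tok_s_gpu: float, usd_per_m_tok: float) -> float:
    # Assumption: $/M tokens = GPU-hour price / (tokens per GPU-hour / 1e6)
    tokens_per_gpu_hour = tok_s_gpu * 3600
    return usd_per_m_tok * tokens_per_gpu_hour / 1e6


def implied_watts_per_gpu(tok_s_gpu: float, tok_s_mw: float) -> float:
    # Assumption: tok/s/MW = tok/s/GPU * (1e6 W / watts per GPU)
    return 1e6 * tok_s_gpu / tok_s_mw


for interactivity, config, tps, cost, tps_mw in ROWS:
    print(
        f"{interactivity:>3} tok/s/user {config}: "
        f"~${implied_gpu_hour_price(tps, cost):.2f}/GPU-hour, "
        f"~{implied_watts_per_gpu(tps, tps_mw):.0f} W/GPU"
    )
```

On these values the implied price lands between roughly $1.28 and $1.32 per GPU-hour, and the implied power is near 1,730 W per GPU in every cell. The small spread may come from rounding or from how each curve is sampled at matched interactivity, though that is also a guess. A power figure that high would be an all-in number rather than GPU board power alone.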