H200 FP8: MTP vs Off Speculative Decoding
Speculative decoding comparison of MTP versus Off (speculative decoding disabled) on H200 FP8 (NVIDIA Hopper) running GLM 5/5.1, covering throughput, cost, and interactivity differences across LLM workloads. Use the chart controls below to switch sequences and metrics; they work the same way as on the main inference chart.
MTP acceptance-rate implementations differ across inference engines. Points from different engines are not directly comparable on the same curve — throughput and cost at matched interactivity may reflect engine-level differences rather than pure speculative decoding gains. Interpret cross-engine comparisons with caution.
At 41 tok/s/user on GLM 5/5.1 (H200 FP8), MTP reaches 329 tok/s/GPU and Off reaches 145, at $1.19 and $2.72 per million tokens respectively. MTP delivers 128% more tok/s/GPU and costs about 56% less per token (equivalently, Off costs about 128% more). Speculative decoding trades extra compute on draft tokens for fewer decoding steps, so the payoff depends on sequence length and batch size.
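"Percent more throughput" and "percent cheaper" are different quantities even when they come from the same ratio, which is easy to mix up. The sketch below recomputes both for the 41 tok/s/user point from the table values; the helper names `pct_more` and `pct_less` are illustrative and not part of any benchmark tooling.

```python
def pct_more(new: float, base: float) -> float:
    """Percent by which `new` exceeds `base`."""
    return (new / base - 1.0) * 100.0


def pct_less(new: float, base: float) -> float:
    """Percent by which `new` falls below `base`."""
    return (1.0 - new / base) * 100.0


# Table values at 41 tok/s/user (GLM 5/5.1, H200 FP8).
mtp_tput, off_tput = 329.5, 144.6   # tok/s/GPU
mtp_cost, off_cost = 1.190, 2.716   # $ per million tokens

throughput_gain = pct_more(mtp_tput, off_tput)  # MTP throughput relative to Off
cost_saving = pct_less(mtp_cost, off_cost)      # MTP cost relative to Off
off_premium = pct_more(off_cost, mtp_cost)      # Off cost relative to MTP

print(f"MTP throughput gain: {throughput_gain:.0f}%")
print(f"MTP cost saving:     {cost_saving:.0f}%")
print(f"Off cost premium:    {off_premium:.0f}%")
```

A cost reduction can never exceed 100%, so a figure such as 128% only makes sense as the extra cost of the more expensive option.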
Around the middle of the 34–60 tok/s/user interactivity band, at 47 tok/s/user on GLM 5/5.1 (H200 FP8), MTP runs 286 tok/s/GPU at $1.36/M tokens and Off runs 113 at $3.45/M. MTP delivers 153% more tok/s/GPU and costs about 60% less per token (equivalently, Off costs about 153% more). Gains from speculative decoding vary by workload; short-output prompts tend to benefit less.
At 54 tok/s/user on GLM 5/5.1 (H200 FP8), MTP delivers 247 tok/s/GPU at $1.58 per million tokens; Off delivers 82 tok/s/GPU at $4.90. MTP delivers about 3.0x the tok/s/GPU (roughly 200% more) and costs about 68% less per token (equivalently, Off costs about 210% more). Speculative decoding accepts draft tokens to reduce per-token latency, so gains vary by workload and prompt distribution. (Numbers reflect this URL's pinned 1k/1k · fp8 workload. Changing the sequence or model updates both the table and the chart; the table stays pinned to this page's precision, so precision toggles in the controls affect the chart only.)
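To make the "fewer decoding steps" trade-off concrete, here is a minimal sketch of the usual back-of-the-envelope model for speculative decoding. It assumes each draft token is accepted independently with a fixed probability, which is a simplification; this page does not report acceptance rates or draft lengths for GLM 5/5.1, so the values in the usage lines are hypothetical.

```python
def expected_tokens_per_step(accept_prob: float, num_draft: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Simplified model (an assumption, not a measured property of any engine):
    each of the `num_draft` draft tokens is accepted independently with
    probability `accept_prob`, acceptance stops at the first rejection, and
    the target model always contributes one token of its own.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if num_draft < 0:
        raise ValueError("num_draft must be non-negative")
    if accept_prob == 1.0:
        return float(num_draft + 1)
    # Geometric series: 1 + p + p^2 + ... + p^num_draft
    return (1.0 - accept_prob ** (num_draft + 1)) / (1.0 - accept_prob)


# Hypothetical acceptance rates, chosen only for illustration.
for p in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_step(p, num_draft=1)
    print(f"accept_prob={p:.1f}: {tokens:.2f} tokens per step")
```

Under this model the benefit grows with the acceptance rate, and the extra draft and verification compute still has to be paid on every step. That is one reason measured gains differ across batch sizes and workloads.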

| Metric | At 41 tok/s/user | At 47 tok/s/user | At 54 tok/s/user |
|---|---|---|---|
| Throughput (tok/s/GPU) | MTP: 329.5 / Off: 144.6 | MTP: 285.9 / Off: 112.9 | MTP: 247.3 / Off: 81.8 |
| Cost ($/M tok) | MTP: $1.190 / Off: $2.716 | MTP: $1.363 / Off: $3.447 | MTP: $1.577 / Off: $4.895 |
| tok/s/MW | MTP: 190452 / Off: 83561 | MTP: 165244 / Off: 65238 | MTP: 142970 / Off: 47261 |
| Concurrency | MTP: ~34 / Off: ~14 | MTP: ~24 / Off: ~10 | MTP: ~19 / Off: ~6 |

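
The cost and tok/s/MW rows look like rescalings of the throughput row. The sketch below back-computes the constants those rows would imply, assuming cost is a fixed GPU-hour price divided by throughput and tok/s/MW is throughput times a fixed number of GPUs per megawatt. Both relationships are inferred from the numbers in the table and are not stated on this page, so treat the resulting constants as estimates.

```python
# (interactivity tok/s/user, mode, tok/s/GPU, $/M tok, tok/s/MW), from the table.
ROWS = [
    (41, "MTP", 329.5, 1.190, 190452),
    (41, "Off", 144.6, 2.716, 83561),
    (47, "MTP", 285.9, 1.363, 165244),
    (47, "Off", 112.9, 3.447, 65238),
    (54, "MTP", 247.3, 1.577, 142970),
    (54, "Off", 81.8, 4.895, 47261),
]


def implied_gpu_hour_price(tok_s_gpu: float, cost_per_m: float) -> float:
    """GPU-hour price implied by: cost_per_m = price / (tok_s_gpu * 3600) * 1e6."""
    return cost_per_m * tok_s_gpu * 3600 / 1e6


def implied_gpus_per_mw(tok_s_gpu: float, tok_s_mw: float) -> float:
    """GPUs per megawatt implied by: tok_s_mw = tok_s_gpu * gpus_per_mw."""
    return tok_s_mw / tok_s_gpu


prices = [implied_gpu_hour_price(t, c) for _, _, t, c, _ in ROWS]
gpus = [implied_gpus_per_mw(t, m) for _, _, t, _, m in ROWS]

for (inter, mode, *_), price, g in zip(ROWS, prices, gpus):
    print(f"{inter} tok/s/user {mode}: ~${price:.2f}/GPU-hour, ~{g:.0f} GPUs/MW")
```

If the inferred relationships hold, the implied values should be nearly constant across rows. Small differences are expected because the table values are rounded and may be interpolated to the matched interactivity levels.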
Inference Performance
Inference performance metrics across different models, hardware configurations, and serving parameters.