B200 BF16: MTP vs Off Speculative Decoding
Speculative decoding comparison of MTP (multi-token prediction) versus Off (speculative decoding disabled) on B200 BF16 (NVIDIA Blackwell) running Qwen 3.5 397B-A17B, covering throughput, cost, and interactivity differences across LLM workloads.
MTP acceptance-rate implementations differ across inference engines. Points from different engines are not directly comparable on the same curve — throughput and cost at matched interactivity may reflect engine-level differences rather than pure speculative decoding gains. Interpret cross-engine comparisons with caution.
Near the low end of the 103–168 tok/s/user interactivity band, at 119 tok/s/user on Qwen 3.5 397B-A17B (B200 BF16): MTP runs 1058 tok/s/GPU at $0.52/M tokens, Off runs 270 tok/s/GPU at $2.02/M. Off costs 286% more per token (equivalently, MTP is about 74% cheaper), and MTP delivers 291% more tok/s/GPU. Gains from speculative decoding vary by workload; short-output prompts tend to benefit less.
At 135 tok/s/user on Qwen 3.5 397B-A17B (B200 BF16), MTP delivers 828 tok/s/GPU at $0.65 per million tokens; Off delivers 214 tok/s/GPU at $2.49. Off costs 286% more per token (MTP is about 74% cheaper), and MTP delivers 288% more tok/s/GPU. Speculative decoding drafts several tokens ahead and has the target model verify them in a single pass, so each accepted draft token reduces per-token latency; gains vary by workload and prompt distribution.
MTP posts 732 tok/s/GPU for $0.73 per million tokens at 152 tok/s/user on Qwen 3.5 397B-A17B (B200 BF16); Off posts 178 tok/s/GPU for $2.98. Off costs about 311% more per token (MTP is roughly 76% cheaper), and MTP delivers 312% more tok/s/GPU. Draft-token acceptance rates determine whether speculative decoding helps or hurts at a given concurrency level. (Numbers reflect this page's pinned 1k/1k sequence, BF16 workload.)
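
To make the role of acceptance rate concrete, here is a minimal sketch of the textbook speculative decoding model. It assumes each draft token is accepted independently with the same probability `alpha`, and that `k` tokens are drafted per verification pass. Both names are hypothetical illustration parameters, not values measured on this page, and no engine benchmarked here is claimed to compute acceptance this way.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes (for illustration only) that each of the k draft tokens is
    accepted independently with probability alpha. The pass always emits
    one token itself, plus every draft token accepted before the first
    rejection, giving the geometric sum 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted; avoid dividing by zero below.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical acceptance rates, not measurements from this page.
    for alpha in (0.3, 0.6, 0.9):
        print(f"alpha={alpha}: {expected_tokens_per_step(alpha, k=3):.2f} tokens/step")
```

Under this model a low acceptance rate yields barely more than one token per pass, so the extra drafting and verification work may not pay for itself, which is one way speculative decoding can hurt rather than help.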

| Metric | 119 tok/s/user | 135 tok/s/user | 152 tok/s/user |
|---|---|---|---|
| Throughput (tok/s/GPU) | MTP: 1058.2; Off: 270.4 | MTP: 828.2; Off: 213.6 | MTP: 731.8; Off: 177.8 |
| Cost ($/M tok) | MTP: $0.523; Off: $2.019 | MTP: $0.645; Off: $2.492 | MTP: $0.726; Off: $2.980 |
| tok/s/MW | MTP: 487664; Off: 124609 | MTP: 381637; Off: 98447 | MTP: 337218; Off: 81926 |
| Concurrency | MTP: ~40; Off: ~10 | MTP: ~27; Off: ~7 | MTP: ~22; Off: ~5 |
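
The percentage comparisons in the prose can be reproduced from the table with a few lines of arithmetic. The sketch below uses the throughput and cost figures from the table at 119, 135, and 152 tok/s/user; the helper names `pct_more` and `pct_less` are hypothetical. Because the table values are rounded, results may differ from the prose by about one percentage point.

```python
# Throughput (tok/s/GPU) and cost ($/M tokens) copied from the table,
# keyed by interactivity in tok/s/user.
ROWS = {
    119: {"mtp_tput": 1058.2, "off_tput": 270.4, "mtp_cost": 0.523, "off_cost": 2.019},
    135: {"mtp_tput": 828.2, "off_tput": 213.6, "mtp_cost": 0.645, "off_cost": 2.492},
    152: {"mtp_tput": 731.8, "off_tput": 177.8, "mtp_cost": 0.726, "off_cost": 2.980},
}


def pct_more(a: float, b: float) -> float:
    """How much larger a is than b, as a percentage of b."""
    return (a / b - 1.0) * 100.0


def pct_less(a: float, b: float) -> float:
    """How much smaller a is than b, as a percentage of b (at most 100)."""
    return (1.0 - a / b) * 100.0


if __name__ == "__main__":
    for interactivity, r in ROWS.items():
        print(
            f"{interactivity} tok/s/user: "
            f"MTP throughput +{pct_more(r['mtp_tput'], r['off_tput']):.0f}%, "
            f"Off cost +{pct_more(r['off_cost'], r['mtp_cost']):.0f}%, "
            f"MTP cost -{pct_less(r['mtp_cost'], r['off_cost']):.0f}%"
        )
```

Note the distinction between the two cost figures: "Off costs 286% more" and "MTP is 74% cheaper" describe the same ratio from opposite baselines, and a price can never be more than 100% cheaper.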