B200 FP4: MTP vs Off Speculative Decoding
This page compares speculative decoding with MTP (multi-token prediction) enabled versus disabled (Off) on B200 FP4 (NVIDIA Blackwell) running Qwen 3.5 397B-A17B, covering throughput, cost, and interactivity differences across LLM workloads. Use the chart controls below to switch sequences and metrics; they work the same way as on the main inference chart.
MTP acceptance-rate implementations differ across inference engines. Points from different engines are not directly comparable on the same curve — throughput and cost at matched interactivity may reflect engine-level differences rather than pure speculative decoding gains. Interpret cross-engine comparisons with caution.
At 77 tok/s/user on Qwen 3.5 397B-A17B (B200 FP4), MTP reaches 7266 tok/s/GPU and Off reaches 2670 tok/s/GPU. Costs land at $0.075 and $0.199 per million tokens respectively. MTP delivers 172% more tok/s/GPU; Off costs 165% more per token, which makes MTP about 62% cheaper. Speculative decoding trades extra compute on draft tokens for fewer decoding steps, so the payoff depends on sequence length and batch size.
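The trade-off between draft compute and decoding steps can be illustrated with a minimal sketch. It assumes a simplified model that is not taken from this page: each draft token is accepted independently with a fixed probability, and acceptance stops at the first rejection. The function name, the acceptance probability, and the draft length are all hypothetical and are not measurements for Qwen 3.5 or any particular engine.

```python
def expected_tokens_per_step(accept_prob: float, num_draft: int) -> float:
    """Expected tokens emitted per target-model step.

    Simplifying assumption (not from the benchmark): every draft token is
    accepted independently with probability `accept_prob`, and the first
    rejection discards the rest of the draft. The target model always
    contributes one token of its own, hence the i = 0 term.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if num_draft < 0:
        raise ValueError("num_draft must be non-negative")
    # 1 + p + p^2 + ... + p^k: one guaranteed token plus the accepted prefix.
    return sum(accept_prob ** i for i in range(num_draft + 1))


if __name__ == "__main__":
    # Illustrative values only: sweep a few acceptance probabilities.
    for p in (0.5, 0.8, 0.95):
        print(f"p={p:.2f}, 3 draft tokens -> "
              f"{expected_tokens_per_step(p, 3):.3f} tokens/step")
```

Under this model, a higher acceptance probability means more tokens per verification step and therefore fewer steps per response. The observed speedup also depends on how much the draft and verification work costs at a given batch size, which this sketch does not model.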
Near the middle of the 41–186 tok/s/user interactivity band, at 114 tok/s/user on Qwen 3.5 397B-A17B (B200 FP4), MTP runs 4613 tok/s/GPU at $0.12/M tokens and Off runs 998 tok/s/GPU at $0.54/M. MTP delivers 362% more tok/s/GPU; Off costs 364% more per token, which makes MTP about 78% cheaper. Gains from speculative decoding vary by workload; short-output prompts tend to benefit less.
At 150 tok/s/user on Qwen 3.5 397B-A17B (B200 FP4), MTP delivers 2640 tok/s/GPU at $0.20 per million tokens; Off delivers 515 tok/s/GPU at $1.05. MTP delivers 412% more tok/s/GPU; Off costs 416% more per token, which makes MTP about 81% cheaper. Speculative decoding lowers per-token latency when draft tokens are accepted, so gains vary by workload and prompt distribution. (Numbers reflect this URL's pinned 1k/1k · fp4 workload. Changing sequence or model updates both the table and chart; the table stays pinned to this page's precision, so precision toggles in the controls affect the chart only.)

| Metric | 77 tok/s/user | 114 tok/s/user | 150 tok/s/user |
|---|---|---|---|
| Throughput (tok/s/GPU) | MTP: 7265.5 / Off: 2670.4 | MTP: 4613.1 / Off: 998.3 | MTP: 2639.9 / Off: 515.3 |
| Cost ($/M tok) | MTP: $0.075 / Off: $0.199 | MTP: $0.116 / Off: $0.537 | MTP: $0.204 / Off: $1.053 |
| tok/s/MW | MTP: 3348176 / Off: 1230587 | MTP: 2125852 / Off: 460036 | MTP: 1216561 / Off: 237461 |
| Concurrency | MTP: ~406 / Off: ~67 | MTP: ~191 / Off: ~26 | MTP: ~92 / Off: ~8 |
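
The relative figures quoted in the prose can be recomputed from the table above. The sketch below is one way to do that, using the table's rounded values, so results may differ by a point from percentages derived from unrounded data. The helper names are hypothetical. It reports cost in two conventions, because "Off costs X% more" and "MTP is Y% cheaper" are different ratios, and the second can never exceed 100%.

```python
def pct_more(a: float, b: float) -> float:
    """How much larger a is than b, as a percentage of b."""
    return (a / b - 1.0) * 100.0


def pct_cheaper(cost_a: float, cost_b: float) -> float:
    """How much lower cost_a is than cost_b, as a percentage of cost_b."""
    return (1.0 - cost_a / cost_b) * 100.0


# Rounded values copied from the table, keyed by interactivity (tok/s/user).
ROWS = {
    77: {"tput": (7265.5, 2670.4), "cost": (0.075, 0.199)},
    114: {"tput": (4613.1, 998.3), "cost": (0.116, 0.537)},
    150: {"tput": (2639.9, 515.3), "cost": (0.204, 1.053)},
}

if __name__ == "__main__":
    for interactivity, row in ROWS.items():
        mtp_tput, off_tput = row["tput"]
        mtp_cost, off_cost = row["cost"]
        print(
            f"{interactivity} tok/s/user: "
            f"MTP throughput +{pct_more(mtp_tput, off_tput):.0f}%, "
            f"Off cost +{pct_more(off_cost, mtp_cost):.0f}%, "
            f"MTP cost -{pct_cheaper(mtp_cost, off_cost):.0f}%"
        )
```

Running the script prints one line per interactivity point, which makes it easy to check the prose against the table after either is updated.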
Inference Performance
Inference performance metrics across different models, hardware configurations, and serving parameters.