B300 BF16: MTP vs Off Speculative Decoding
This page compares speculative decoding with MTP (multi-token prediction) against speculative decoding turned off (Off) on B300 BF16 (NVIDIA Blackwell) running Qwen 3.5 397B-A17B. It covers throughput, cost, and interactivity differences across LLM workloads. Use the chart controls below to switch sequences and metrics; they work the same way as on the main inference chart.
MTP acceptance-rate implementations differ across inference engines. Points from different engines are not directly comparable on the same curve — throughput and cost at matched interactivity may reflect engine-level differences rather than pure speculative decoding gains. Interpret cross-engine comparisons with caution.
At 97 tok/s/user on Qwen 3.5 397B-A17B (B300 BF16), MTP delivers 1475 tok/s/GPU at $0.44 per million tokens; Off delivers 506 tok/s/GPU at $1.25. MTP is about 65% cheaper per token (equivalently, Off costs 183% more) and delivers 191% more tok/s/GPU. Speculative decoding verifies several draft tokens in a single pass of the target model, which reduces per-token latency when the drafts are accepted; gains vary by workload and prompt distribution.
MTP posts 1053 tok/s/GPU for $0.62 per million tokens at 122 tok/s/user on Qwen 3.5 397B-A17B (B300 BF16); Off posts 376 tok/s/GPU for $1.73. MTP is about 64% cheaper per token (equivalently, Off costs 179% more) and delivers 180% more tok/s/GPU. Draft-token acceptance rates determine whether speculative decoding helps or hurts at a given concurrency level.
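
The dependence on acceptance rate can be illustrated with a simplified textbook model of speculative decoding. This is a minimal sketch, not how any particular engine implements MTP: it assumes each of `k` draft tokens is accepted independently with probability `alpha`, and that producing one draft token costs a fraction `draft_cost` of a target-model step. All three parameter names and the sample values are hypothetical, chosen for illustration rather than measured in this benchmark.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumption: each of the k draft tokens is accepted independently with
    probability alpha, and every step emits one additional token from the
    target model (the correction or bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per unit time relative to plain decoding (1.0 = break-even).

    Assumption: one step costs one target pass plus k draft passes, each
    draft pass costing draft_cost target passes.
    """
    return expected_tokens_per_step(alpha, k) / (1.0 + k * draft_cost)


if __name__ == "__main__":
    # Hypothetical acceptance rates; k and draft_cost are illustrative only.
    for alpha in (0.2, 0.5, 0.8):
        print(alpha, round(estimated_speedup(alpha, k=3, draft_cost=0.2), 2))
```

Under these assumptions, a low acceptance rate combined with a relatively expensive draft gives a ratio below 1.0 (speculation hurts), while a high acceptance rate gives a ratio well above 1.0. The model ignores batching effects, so it does not capture how the break-even point shifts with concurrency.
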
Throughput at 146 tok/s/user on Qwen 3.5 397B-A17B (B300 BF16): MTP hits 803 tok/s/GPU, Off hits 256. Per-million-token costs land at $0.80 and $2.56 respectively. MTP is about 69% cheaper per token (equivalently, Off costs 219% more) and delivers 214% more tok/s/GPU. Speculative decoding trades extra compute on draft tokens for fewer decoding steps; the payoff depends on sequence length and batch size. (Numbers reflect this URL's pinned 1k/1k · bf16 workload. Changing the sequence or model updates both the table and the chart. The table stays pinned to this page's precision, so precision toggles in the controls affect the chart only.)
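
The relative figures can be recomputed from the table values. The sketch below distinguishes "percent more" (relative to the smaller baseline) from "percent less" (relative to the larger baseline, so it can never exceed 100% for positive values). The function and field names are hypothetical; the numbers are copied from the table on this page.

```python
def pct_more(a: float, b: float) -> float:
    """Percent by which a exceeds b."""
    return (a / b - 1.0) * 100.0


def pct_less(a: float, b: float) -> float:
    """Percent by which a falls short of b."""
    return (1.0 - a / b) * 100.0


# (tok/s/user, MTP tok/s/GPU, Off tok/s/GPU, MTP $/M tok, Off $/M tok)
ROWS = [
    (97, 1475.1, 506.2, 0.441, 1.247),
    (122, 1053.4, 376.3, 0.620, 1.729),
    (146, 802.6, 255.8, 0.801, 2.559),
]


def summarize(rows=ROWS):
    """Return per-interactivity relative differences between MTP and Off."""
    out = []
    for inter, mtp_tput, off_tput, mtp_cost, off_cost in rows:
        out.append({
            "interactivity": inter,
            "throughput_gain_pct": pct_more(mtp_tput, off_tput),
            "mtp_cost_saving_pct": pct_less(mtp_cost, off_cost),
            "off_cost_premium_pct": pct_more(off_cost, mtp_cost),
        })
    return out


if __name__ == "__main__":
    for row in summarize():
        print({k: round(v, 1) for k, v in row.items()})
```

For the 97 tok/s/user column this gives roughly 191% more throughput for MTP, a cost saving of roughly 65%, and an Off cost premium of roughly 183%.
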

| Metric | 97 tok/s/user | 122 tok/s/user | 146 tok/s/user |
|---|---|---|---|
| Throughput (tok/s/GPU) | MTP: 1475.1; Off: 506.2 | MTP: 1053.4; Off: 376.3 | MTP: 802.6; Off: 255.8 |
| Cost ($/M tok) | MTP: $0.441; Off: $1.247 | MTP: $0.620; Off: $1.729 | MTP: $0.801; Off: $2.559 |
| tok/s/MW | MTP: 679771; Off: 233268 | MTP: 485449; Off: 173429 | MTP: 369860; Off: 117889 |
| Concurrency | MTP: ~64; Off: ~11 | MTP: ~41; Off: ~6 | MTP: ~24; Off: ~4 |

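
The table rows can be cross-checked against each other. The sketch below is speculative: it assumes the tok/s/MW row comes from a fixed per-GPU power figure and the cost row from a flat hourly GPU price. The page states neither assumption, and the function names are hypothetical.

```python
def implied_kw_per_gpu(tok_s_per_gpu: float, tok_s_per_mw: float) -> float:
    """Per-GPU power implied by the throughput and tok/s/MW rows.

    Assumption: tok/s/MW = tok/s/GPU * (GPUs per MW).
    """
    gpus_per_mw = tok_s_per_mw / tok_s_per_gpu
    return 1000.0 / gpus_per_mw  # 1 MW = 1000 kW


def implied_usd_per_gpu_hour(tok_s_per_gpu: float, usd_per_mtok: float) -> float:
    """Hourly GPU price implied by the throughput and cost rows.

    Assumption: $/M tok = (hourly price) / (tokens per GPU-hour) * 1e6.
    """
    tokens_per_hour = tok_s_per_gpu * 3600.0
    return usd_per_mtok * tokens_per_hour / 1e6


if __name__ == "__main__":
    # Values from the 97 tok/s/user column (MTP, then Off).
    for tput, per_mw, cost in ((1475.1, 679771, 0.441), (506.2, 233268, 1.247)):
        print(round(implied_kw_per_gpu(tput, per_mw), 2),
              round(implied_usd_per_gpu_hour(tput, cost), 2))
```

With the table's values, both configurations imply about 2.17 kW per GPU, and the cost row implies roughly $2.3 per GPU-hour. The implied price varies slightly between columns, which may reflect rounding or interpolation in the published figures.
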
Inference Performance
Inference performance metrics across different models, hardware configurations, and serving parameters.