B300 FP4: MTP vs Off Speculative Decoding
This page compares speculative decoding with multi-token prediction (MTP) against decoding with speculation disabled (Off) on B300 FP4 (NVIDIA Blackwell) running Qwen 3.5 397B-A17B. It covers throughput, cost, and interactivity differences across LLM workloads. Use the chart controls below to switch sequences and metrics; they behave the same as on the main inference chart.
MTP acceptance-rate implementations differ across inference engines. Points from different engines are not directly comparable on the same curve — throughput and cost at matched interactivity may reflect engine-level differences rather than pure speculative decoding gains. Interpret cross-engine comparisons with caution.
Near the low end of the 63–184 tok/s/user interactivity band, at 93 tok/s/user on Qwen 3.5 397B-A17B (B300 FP4): MTP runs 4729 tok/s/GPU at $0.14/M tokens, Off runs 1584 tok/s/GPU at $0.41/M tokens. Off costs 202% more per token, so MTP is about 67% cheaper; MTP delivers 199% more tok/s/GPU. Gains from speculative decoding vary by workload; short-output prompts tend to benefit less.
At 124 tok/s/user on Qwen 3.5 397B-A17B (B300 FP4), MTP delivers 3194 tok/s/GPU at $0.20 per million tokens; Off delivers 966 tok/s/GPU at $0.67. Off costs 229% more per token, so MTP is about 70% cheaper; MTP delivers 231% more tok/s/GPU. Speculative decoding drafts several tokens cheaply and lets the target model verify them in a single pass, so each accepted draft token reduces per-token latency; gains vary by workload and prompt distribution.
MTP posts 2158 tok/s/GPU for $0.30 per million tokens at 154 tok/s/user on Qwen 3.5 397B-A17B (B300 FP4); Off posts 614 tok/s/GPU for $1.06. Off costs 250% more per token, so MTP is about 71% cheaper; MTP delivers 251% more tok/s/GPU. Draft-token acceptance rates determine whether speculative decoding helps or hurts at a given concurrency level. (Numbers reflect this URL's pinned 1k/1k · fp4 workload. Changing sequence or model updates both the table and chart; the table stays pinned to this page's precision, so precision toggles in the controls affect the chart only.)
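
The point about acceptance rates can be made concrete with a common simplified model of speculative decoding. The sketch below is not this benchmark's methodology, and the page does not report acceptance rates for these runs. It assumes each draft token is accepted independently with the same probability, and `step_overhead` is a hypothetical parameter for the extra cost of a draft-plus-verify step relative to a plain decode step.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification step.

    ASSUMPTION: each draft token is accepted independently with probability
    `acceptance_rate`, and the target model always contributes one token per
    step (the correction after a rejection, or a bonus token if all pass).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return sum(acceptance_rate ** i for i in range(draft_len + 1))


def estimated_speedup(acceptance_rate: float, draft_len: int,
                      step_overhead: float) -> float:
    """Tokens per step divided by the relative cost of one step.

    `step_overhead` is hypothetical: 0.0 means drafting and verifying cost
    nothing extra; 0.3 means a step costs 1.3x a plain decode step.
    A result below 1.0 means speculation is slower than decoding without it.
    """
    if step_overhead < 0.0:
        raise ValueError("step_overhead must be non-negative")
    return expected_tokens_per_step(acceptance_rate, draft_len) / (1.0 + step_overhead)


if __name__ == "__main__":
    # Illustrative inputs only; none of these values come from the benchmark.
    for rate in (0.1, 0.5, 0.8):
        speedup = estimated_speedup(rate, draft_len=3, step_overhead=0.3)
        print(f"acceptance={rate:.1f} estimated speedup={speedup:.2f}x")
```

Under this model a low acceptance rate with non-trivial overhead gives a ratio below 1.0, which is the "hurts" case. Verification overhead typically grows as batches fill up, which is one plausible reason the outcome depends on concurrency.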

| Metric | Mode | 93 tok/s/user | 124 tok/s/user | 154 tok/s/user |
|---|---|---|---|---|
| Throughput (tok/s/GPU) | MTP | 4729.0 | 3193.8 | 2157.6 |
| Throughput (tok/s/GPU) | Off | 1583.6 | 966.3 | 614.2 |
| Cost ($/M tok) | MTP | $0.137 | $0.205 | $0.301 |
| Cost ($/M tok) | Off | $0.414 | $0.673 | $1.055 |
| tok/s/MW | MTP | 2179271 | 1471810 | 994302 |
| tok/s/MW | Off | 729785 | 445283 | 283034 |
| Concurrency | MTP | ~64 | ~63 | ~16 |
| Concurrency | Off | ~37 | ~16 | ~8 |

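
The relative figures quoted in the summaries can be recomputed from the table with simple ratios. The sketch below does that, and distinguishes "Off costs X% more" from "MTP is Y% cheaper", which are different numbers for the same pair of prices. The helper `implied_gpu_hour_price` rests on an assumption this page does not state: that cost per million tokens is derived from a fixed hourly GPU price divided by tokens produced per GPU-hour.

```python
# (MTP, Off) values copied from the table, keyed by interactivity in tok/s/user.
THROUGHPUT = {93: (4729.0, 1583.6), 124: (3193.8, 966.3), 154: (2157.6, 614.2)}
COST = {93: (0.137, 0.414), 124: (0.205, 0.673), 154: (0.301, 1.055)}


def pct_more(a: float, b: float) -> float:
    """How much larger a is than b, as a percentage of b."""
    return (a / b - 1.0) * 100.0


def pct_less(a: float, b: float) -> float:
    """How much smaller a is than b, as a percentage of b (at most 100)."""
    return (1.0 - a / b) * 100.0


def implied_gpu_hour_price(cost_per_m_tok: float, tok_per_s_per_gpu: float) -> float:
    """ASSUMPTION: cost ($/M tok) = hourly GPU price / (M tokens per GPU-hour)."""
    million_tokens_per_gpu_hour = tok_per_s_per_gpu * 3600.0 / 1e6
    return cost_per_m_tok * million_tokens_per_gpu_hour


def summarize() -> list[dict]:
    """One row of derived comparisons per interactivity level."""
    rows = []
    for level in sorted(THROUGHPUT):
        mtp_tput, off_tput = THROUGHPUT[level]
        mtp_cost, off_cost = COST[level]
        rows.append({
            "interactivity": level,
            "mtp_more_throughput_pct": pct_more(mtp_tput, off_tput),
            "off_costs_more_pct": pct_more(off_cost, mtp_cost),
            "mtp_cheaper_pct": pct_less(mtp_cost, off_cost),
            "implied_price_mtp": implied_gpu_hour_price(mtp_cost, mtp_tput),
            "implied_price_off": implied_gpu_hour_price(off_cost, off_tput),
        })
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(row)
```

Because the table rounds cost to three decimals, recomputed percentages can differ by a point from the summaries. If the fixed-price assumption holds, the implied hourly prices for MTP and Off should come out close to each other at every interactivity level.
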
Inference Performance
Inference performance metrics across different models, hardware configurations, and serving parameters.