B300 FP4: MTP vs Off Speculative Decoding
Speculative decoding comparison of MTP (multi-token prediction) versus Off on B300 FP4 (NVIDIA Blackwell) running GLM 5/5.1, covering throughput, cost, and interactivity differences across LLM workloads. Use the chart controls below to switch sequences and metrics; they behave the same way as the controls on the main inference chart.
MTP acceptance-rate implementations differ across inference engines. Points from different engines are not directly comparable on the same curve: throughput and cost at matched interactivity may reflect engine-level differences rather than pure speculative decoding gains. Interpret cross-engine comparisons with caution.
Near the low end of the 27–92 tok/s/user interactivity band, at 43 tok/s/user on GLM 5/5.1 (B300 FP4): MTP runs 2310 tok/s/GPU at $0.29/M tokens, Off runs 984 tok/s/GPU at $0.65/M. MTP is about 56% cheaper per token (Off costs about 127% more) and delivers 135% more tok/s/GPU. Gains from speculative decoding vary by workload; short-output prompts tend to benefit less.
At 59 tok/s/user on GLM 5/5.1 (B300 FP4), MTP delivers 1606 tok/s/GPU at $0.40 per million tokens; Off delivers 519 tok/s/GPU at $1.25. MTP is about 68% cheaper per token (Off costs about 209% more) and delivers 210% more tok/s/GPU. Speculative decoding proposes draft tokens and keeps the ones the target model accepts, which reduces per-token latency; gains vary by workload and prompt distribution.
MTP posts 1128 tok/s/GPU for $0.58 per million tokens at 76 tok/s/user on GLM 5/5.1 (B300 FP4); Off posts 259 tok/s/GPU for $2.53. MTP is about 77% cheaper per token (Off costs about 338% more) and delivers 336% more tok/s/GPU. Draft-token acceptance rates determine whether speculative decoding helps or hurts at a given concurrency level. (Numbers reflect this URL's pinned 1k/1k, FP4 workload. Changing the sequence or model updates both the table and the chart; the table stays pinned to this page's precision, so precision toggles in the controls affect the chart only.)
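A relative difference depends on which side is taken as the baseline, so "X% cheaper" and "costs X% more" are different numbers for the same pair of prices. The sketch below recomputes both from the table values for the three interactivity points. The names `POINTS`, `pct_more`, `pct_cheaper`, and `summarize` are illustrative and not part of any benchmark tooling.

```python
# Values copied from the comparison table (GLM 5/5.1, B300 FP4, 1k/1k).
# Keys are interactivity in tok/s/user; tuples are (tok/s/GPU, $/M tokens).
POINTS = {
    43: {"mtp": (2310.4, 0.288), "off": (983.9, 0.653)},
    59: {"mtp": (1605.5, 0.404), "off": (518.7, 1.250)},
    76: {"mtp": (1128.4, 0.576), "off": (258.8, 2.525)},
}


def pct_more(value: float, baseline: float) -> float:
    """How much larger `value` is than `baseline`, in percent of baseline."""
    return (value / baseline - 1.0) * 100.0


def pct_cheaper(price: float, baseline: float) -> float:
    """How much lower `price` is than `baseline`, in percent of baseline."""
    return (1.0 - price / baseline) * 100.0


def summarize(points: dict) -> dict:
    """Compute throughput gain and both cost framings for each point."""
    summary = {}
    for interactivity, modes in points.items():
        mtp_tput, mtp_cost = modes["mtp"]
        off_tput, off_cost = modes["off"]
        summary[interactivity] = {
            # MTP throughput relative to Off.
            "throughput_gain_pct": pct_more(mtp_tput, off_tput),
            # MTP price relative to Off: "MTP is N% cheaper".
            "mtp_cheaper_pct": pct_cheaper(mtp_cost, off_cost),
            # Off price relative to MTP: "Off costs N% more".
            "off_premium_pct": pct_more(off_cost, mtp_cost),
        }
    return summary


SUMMARY = summarize(POINTS)

if __name__ == "__main__":
    for interactivity, row in SUMMARY.items():
        print(
            f"{interactivity} tok/s/user: "
            f"+{row['throughput_gain_pct']:.0f}% throughput, "
            f"MTP {row['mtp_cheaper_pct']:.0f}% cheaper, "
            f"Off {row['off_premium_pct']:.0f}% more expensive"
        )
```

A saving expressed against the higher price can never exceed 100%, which is why the "cheaper" figure is always smaller than the matching "more expensive" figure.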

| Metric | At 43 tok/s/user | At 59 tok/s/user | At 76 tok/s/user |
|---|---|---|---|
| Throughput (tok/s/GPU) | MTP: 2310.4, Off: 983.9 | MTP: 1605.5, Off: 518.7 | MTP: 1128.4, Off: 258.8 |
| Cost ($/M tok) | MTP: $0.288, Off: $0.653 | MTP: $0.404, Off: $1.250 | MTP: $0.576, Off: $2.525 |
| tok/s/MW | MTP: 1064712, Off: 453413 | MTP: 739878, Off: 239048 | MTP: 519978, Off: 119278 |
| Concurrency | MTP: ~129, Off: ~47 | MTP: ~55, Off: ~18 | MTP: ~31, Off: ~7 |

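The tok/s/MW row appears to be the per-GPU throughput row multiplied by a constant. The sketch below checks that by dividing one row by the other for all six cells and inverting the result into an implied power draw per GPU. The page does not state a power figure, so the resulting number is an inference from the table, and the names `TABLE`, `RATIOS`, `IMPLIED_GPUS_PER_MW`, and `IMPLIED_WATTS_PER_GPU` are illustrative.

```python
# Values copied from the comparison table.
# Keys are interactivity in tok/s/user; tuples are (tok/s/GPU, tok/s/MW).
TABLE = {
    43: {"mtp": (2310.4, 1064712), "off": (983.9, 453413)},
    59: {"mtp": (1605.5, 739878), "off": (518.7, 239048)},
    76: {"mtp": (1128.4, 519978), "off": (258.8, 119278)},
}

# tok/s/MW divided by tok/s/GPU gives the number of GPUs per megawatt
# that the table implicitly assumes (an inference, not a stated figure).
RATIOS = [
    per_mw / per_gpu
    for modes in TABLE.values()
    for per_gpu, per_mw in modes.values()
]

IMPLIED_GPUS_PER_MW = sum(RATIOS) / len(RATIOS)

# One megawatt is 1,000,000 W, so this is the implied draw per GPU.
IMPLIED_WATTS_PER_GPU = 1_000_000 / IMPLIED_GPUS_PER_MW

if __name__ == "__main__":
    spread = max(RATIOS) - min(RATIOS)
    print(f"GPUs per MW: {IMPLIED_GPUS_PER_MW:.1f} (spread {spread:.2f})")
    print(f"Implied power per GPU: {IMPLIED_WATTS_PER_GPU:.0f} W")
```

Because the ratio is nearly identical in every cell, tok/s/MW carries the same MTP-versus-Off ordering as tok/s/GPU and adds no independent information to the comparison.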
Inference Performance
Inference performance metrics across different models, hardware configurations, and serving parameters.