SGLang 0.5.6 on B200 DeepSeek R1 FP4: Up to 1.8x at Low Concurrency

Piecewise CUDA graphs for DeepSeek V3, a unified event loop, and JIT kernels push 8k/1k throughput from 508 to 907 tok/s/GPU on the same 16 GPU B200 pool

SemiAnalysis · 5 min read · benchmark · inference · gpu · nvidia · b200 · deepseek · sglang · fp4

B200 running SGLang 0.5.6 on DeepSeek R1 NVFP4 reaches 907 tok/s/GPU at concurrency 4 on the 8k/1k workload, up 1.79x from 508 tok/s/GPU on 0.5.5. Both runs use the same 16 GPU pool at TP 4 / EP 4. The only change was the Docker image being updated from lmsysorg/sglang:v0.5.5-cu129-amd64 to lmsysorg/sglang:v0.5.6-cu129-amd64.

SGLang 0.5.6 shipped on 2025-12-03, and the InferenceX benchmark caught the full effect 28 days later, on 2025-12-31, the same day the image bump landed. This is exactly why we built InferenceX's automated benchmark loop: to catch software-driven performance changes on the same hardware as soon as they land.

The performance gain is largest at low concurrency. At concurrency 4 and 8 the decode loop spends a meaningful fraction of each step in Python scheduler and kernel dispatch code rather than in matmuls, so the 0.5.6 scheduler and graph changes apply most directly. At high concurrency the tensor cores are near saturation, and the smaller gains (1.03x at conc 64, 1.16x at conc 128) come from the refactored attention kernel path.

What Shipped in SGLang 0.5.6

Three release items in 0.5.6 apply to the low-concurrency throughput gains. Piecewise CUDA graph support was extended to DeepSeek V3 and the MLA attention path, reducing the per-step Python cost of constructing and replaying graphs. The event loop was unified across the PD-disaggregated, overlap, and DP-attention serving modes, cutting inner-loop overhead. JIT kernels were introduced, reducing startup cost and letting kernel compilation specialize for the shapes seen at run time.
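As a point of reference for why graph replay cuts the fixed cost, the sketch below shows the generic capture-and-replay pattern in plain PyTorch. It is not SGLang's piecewise implementation; the shapes and the two-GEMM stand-in for a decode step are invented for illustration.

```python
import torch

# Generic CUDA-graph capture/replay sketch (plain PyTorch, not SGLang's code).
# The Python and launch cost of issuing every kernel is paid once at capture;
# each later decode step is a single replay() call.
device = torch.device("cuda")
x = torch.randn(4, 4096, device=device)      # static input buffer, batch of 4
w1 = torch.randn(4096, 4096, device=device)
w2 = torch.randn(4096, 4096, device=device)

def step(inp):
    # Stand-in for one decode step's per-token work (two GEMMs + activation).
    return torch.relu(inp @ w1) @ w2

# Warm up on a side stream, then capture one step into a graph (standard recipe).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = step(x)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    y = step(x)                              # y becomes the static output buffer

# Decode loop: refresh the static input in place, then replay the whole graph.
for _ in range(1024):
    x.copy_(torch.randn(4, 4096, device=device))
    graph.replay()                           # one Python call instead of one per kernel
    # y now holds this step's output
```

Piecewise capture extends the same idea to graph-safe segments of the forward pass rather than one monolithic graph, which is what lets it coexist with the dynamic pieces of the step.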

Three other 0.5.6 changes affect the attention kernel path. The MHA and MLA KV caches were refactored to support FP4, the FlashInfer TRTLLM GEN MHA path was re-enabled, and FlashInfer was bumped to 0.5.2. These matter at high concurrency, where the KV cache is large and attention dominates the step; the 1.16x at concurrency 128 comes from this path.
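For scale, here is a rough sizing sketch of what an FP4 KV cache is worth per sequence. The dimensions are DeepSeek-V3/R1's published MLA config (61 layers, a 512-wide compressed KV latent plus 64 RoPE dims per token); scale factors and page alignment are ignored, so these are order-of-magnitude figures, not SGLang's actual allocation.

```python
# Back-of-envelope MLA KV-cache sizing for DeepSeek R1 (assumed published dims;
# FP4 scale factors and paging overhead ignored).
layers = 61
latent_per_token = 512 + 64        # compressed KV latent + decoupled RoPE keys

def kv_bytes_per_token(bytes_per_elem):
    return layers * latent_per_token * bytes_per_elem

seq_len = 8192 + 1024              # full 8k/1k context at the end of decode
for name, width in [("FP8", 1.0), ("FP4", 0.5)]:
    per_tok = kv_bytes_per_token(width)
    per_seq = per_tok * seq_len
    print(f"{name}: {per_tok / 1024:.1f} KiB/token, {per_seq / 2**20:.0f} MiB per sequence")
# FP8: ~34.3 KiB/token, ~309 MiB per 9k-token sequence
# FP4: ~17.2 KiB/token, ~154 MiB per 9k-token sequence
```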

The Numbers

All rows are DeepSeek R1 NVFP4 at ISL 8192 / OSL 1024 on InferenceX. 0.5.5 data is from the 2025-12-15 run on the image set by InferenceX PR #204, which moved the B200 SGLang configs from v0.5.3rc1-cu129-b200 to v0.5.5-cu129-amd64 on 2025-11-10. 0.5.6 data is from the 2025-12-31 run, triggered by InferenceX PR #276 which bumped the Docker image to v0.5.6-cu129-amd64 with no other configuration change.

B200 SGLang, DeepSeek R1 NVFP4, TP 4 / EP 4 decode, 16 GPU non-disaggregated pool. The recipe follows the SGLang DeepSeek V3/R1 deployment guide.

| Version | Conc | tok/s/GPU | TPOT (ms) | tok/s/user | Gain |
|---------|------|-----------|-----------|------------|------|
| 0.5.5 | 4 | 508 | 9.2 | 108.4 | baseline |
| 0.5.5 | 8 | 903 | 11.6 | 86.5 | baseline |
| 0.5.5 | 16 | 1,471 | 15.7 | 63.8 | baseline |
| 0.5.5 | 32 | 2,302 | 22.2 | 45.1 | baseline |
| 0.5.5 | 64 | 3,323 | 33.7 | 29.6 | baseline |
| 0.5.5 | 128 | 4,430 | 54.9 | 18.2 | baseline |
| **0.5.6** | **4** | **907** | **9.2** | **108.5** | **1.79x** |
| 0.5.6 | 8 | 1,437 | 11.6 | 86.0 | 1.59x |
| 0.5.6 | 16 | 1,500 | 15.5 | 64.6 | 1.02x |
| 0.5.6 | 32 | 3,063 | 22.0 | 45.6 | 1.33x |
| 0.5.6 | 64 | 3,419 | 32.9 | 30.4 | 1.03x |
| 0.5.6 | 128 | 5,145 | 53.7 | 18.6 | 1.16x |

The bolded row is the headline: 907 tok/s/GPU on 0.5.6 at concurrency 4 vs 508 on 0.5.5, a 1.79x lift on identical hardware and recipe. Interactivity at matched concurrency is almost identical across versions: TPOT at each point is unchanged within rounding, and 0.5.6 serves more users per GPU at the same per-user token rate.
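The columns are linked in a simple way, which makes the table easy to sanity check: tok/s/user is the inverse of TPOT, and the gain column is the ratio of tok/s/GPU at matched concurrency. The values below are copied from the table; small deviations from the printed columns are rounding.

```python
# Sanity check on the table: tok/s/user = 1000 / TPOT(ms), gain = 0.5.6 / 0.5.5.
tpot_ms_056 = {4: 9.2, 8: 11.6, 16: 15.5, 32: 22.0, 64: 32.9, 128: 53.7}
tput_055    = {4: 508, 8: 903, 16: 1471, 32: 2302, 64: 3323, 128: 4430}
tput_056    = {4: 907, 8: 1437, 16: 1500, 32: 3063, 64: 3419, 128: 5145}

for conc in sorted(tpot_ms_056):
    tok_per_user = 1000.0 / tpot_ms_056[conc]      # ms per output token -> tokens/s/user
    gain = tput_056[conc] / tput_055[conc]
    print(f"conc {conc:>3}: {tok_per_user:6.1f} tok/s/user, gain {gain:.2f}x")
# conc 4 prints 108.7 tok/s/user and a 1.79x gain, matching the headline row.
```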

Chart: B200 DeepSeek R1 NVFP4 8k/1k, SGLang 0.5.5 (2025-12-15) vs SGLang 0.5.6 (2025-12-31), throughput per GPU across the concurrency sweep at TP 4 / EP 4.

Live chart, pre-filtered to B200 SGLang DeepSeek R1 across the 0.5.5 and 0.5.6 runs.

Where Each Improvement Lands on the Curve

Decode on a TP 4 / EP 4 DeepSeek R1 NVFP4 deployment has a fixed per-step cost. Kernel launches, Python scheduler work, and graph construction are the main contributors alongside attention and MoE GEMMs. At concurrency 4 the GEMMs are small enough that fixed cost is a meaningful slice of the step. Reducing fixed cost speeds up the step directly, which is why the biggest ratios (1.79x at conc 4, 1.59x at conc 8) appear at low concurrency. Piecewise CUDA graphs and JIT kernels are the relevant release items.
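A toy step-time model makes the shape of this concrete. The parameters below are invented for illustration and are not fitted to the measured TPOTs; the only point is that the same absolute cut in fixed per-step cost buys a shrinking relative gain as the batch grows.

```python
# Toy decode-step model: every in-flight sequence emits one token per step, so
# instance throughput = concurrency / step_time. All parameters are made up.
def tok_per_s(conc, overhead_ms, base_ms=5.0, per_seq_ms=0.35):
    step_ms = overhead_ms + base_ms + per_seq_ms * conc   # compute grows with batch size
    return conc * 1000.0 / step_ms

for conc in (4, 8, 32, 128):
    speedup = tok_per_s(conc, overhead_ms=1.5) / tok_per_s(conc, overhead_ms=4.0)
    print(f"conc {conc:>3}: cutting fixed overhead 4.0 -> 1.5 ms gives {speedup:.2f}x")
# ~1.32x, 1.27x, 1.14x, 1.05x: the identical overhead cut is worth less as the
# batch grows, which is the qualitative pattern in the measured gain column.
```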

At concurrency 128 the KV cache is large and attention is the dominant cost per step. The refactored MHA and MLA KV caches for FP4 and the re-enabled FlashInfer TRTLLM GEN MHA path produce a 1.16x ratio at conc 128 even though the scheduler-overhead reduction has flattened at that point. At middle concurrencies (16, 32, 64) neither effect is dominant and the throughput gain is smaller and less stable (1.02x, 1.33x, 1.03x).
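The same back-of-envelope MLA sizing as above (same assumed dimensions, same caveats) shows why the kernel path rather than the scheduler is what matters at this end of the sweep: with 128 sequences holding their full 8k/1k context, the KV cache runs to tens of gigabytes, and attention has to stream those bytes every step.

```python
# Aggregate KV footprint at concurrency 128 with full context resident, before
# accounting for how it is sharded or replicated across the TP/EP group.
bytes_per_token_fp8 = 61 * (512 + 64) * 1.0   # same assumed MLA dims as above
tokens_live = 128 * (8192 + 1024)
for name, scale in [("FP8", 1.0), ("FP4", 0.5)]:
    gib = tokens_live * bytes_per_token_fp8 * scale / 2**30
    print(f"{name}: {gib:.1f} GiB of KV cache across 128 in-flight sequences")
# ~38.6 GiB at FP8 vs ~19.3 GiB at FP4.
```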
