5 min read
SGLang 0.5.6 on B200 DeepSeek R1 FP4: Up to 1.8x at Low Concurrency
Piecewise CUDA graphs for DeepSeek V3, a unified event loop, and JIT kernels push 8k/1k throughput from 508 to 907 tok/s/GPU on the same 16-GPU B200 pool
benchmark, inference, gpu, nvidia, b200, deepseek, sglang, fp4