@PyTorch: While SGLang provided Day-0 support for DeepSeek-V4, the collaboration between the @lmsysorg and @NVIDIAAI engineering …
Summary
SGLang provided Day-0 support for DeepSeek-V4, and a collaboration between the LMSys and NVIDIA engineering teams then raised its production throughput by up to 5x, as shown on the public SemiAnalysis InferenceX dashboard.
Cached at: 06/24/26, 03:57 AM
While SGLang provided Day-0 support for DeepSeek-V4, the collaboration between the @lmsysorg and @NVIDIAAI engineering teams has taken its production performance to the next level.
According to the public SemiAnalysis InferenceX dashboard, the GB300 disaggregated lane (DeepSeek-V4 Pro, FP4, 8K/1K) saw a 5x throughput increase, rising from ~2,200 to ~11,200 tok/s/GPU at identical interactivity levels. The updates also sustain high throughput much deeper into the interactivity ranges that most deployments target, and they drive a 2.9x lift on the Blackwell Ultra aggregated lane.
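As a quick sanity check on the headline figure, the ratio can be recomputed from the two approximate throughput values quoted above. This is a minimal sketch: the helper name `speedup` and the variable names are hypothetical, and the inputs are the rounded "~" numbers from the post, not exact dashboard readings.

```python
def speedup(before: float, after: float) -> float:
    """Return the throughput ratio after/before (hypothetical helper)."""
    if before <= 0:
        raise ValueError("baseline throughput must be positive")
    return after / before


# Approximate per-GPU throughput quoted in the post (tok/s/GPU).
baseline_tok_s_gpu = 2200.0
optimized_tok_s_gpu = 11200.0

ratio = speedup(baseline_tok_s_gpu, optimized_tok_s_gpu)
print(f"{ratio:.2f}x")  # prints 5.09x
```

The rounded inputs give a ratio just above 5, consistent with the "5x" in the post. The comparison only holds at a fixed interactivity level, as the post specifies.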
Find the full technical breakdown in the comments below: