@PyTorch: While SGLang provided Day-0 support for DeepSeek-V4, the collaboration between the @lmsysorg and @NVIDIAAI engineering …

X AI KOLs Following 06/23/26, 04:00 PM News

deepseek-v4 sglang nvidia lmsys throughput inference production

Summary

SGLang provided Day-0 support for DeepSeek-V4, and a collaboration between the LMSys and NVIDIA engineering teams delivered up to a 5x throughput increase in production, as shown on the public SemiAnalysis InferenceX dashboard.

While SGLang provided Day-0 support for DeepSeek-V4, the collaboration between the @lmsysorg and @NVIDIAAI engineering teams has taken its production performance to the next level. According to the public SemiAnalysis InferenceX dashboard, the GB300 disaggregated lane (DeepSeek-V4 Pro, FP4, 8K/1K) saw a 5x throughput increase, rising from ~2,200 to ~11,200 tok/s/GPU at identical interactivity levels. These updates sustain high throughput much deeper into the interactivity ranges most deployments target, while also driving a 2.9x lift on the Blackwell Ultra aggregated lane. Find the full technical breakdown in the comments below:

Cached at: 06/24/26, 03:57 AM

Similar Articles

@Ex0byt: Update: the road to GLM-5.2: we're getting there, folks! non-quantized, non-pruned DeepSeek-v4-Flash. 11tok/s on a sing…

@modal: We worked with @lmsysorg and http://z-lab.ai to - integrate DFlash spec into @sgl_project - make it faster with overlap…

I have (even faster) DeepSeek V4 Pro at home

@0xSero: Deepseek-V4-Flash helping me setup Nvidia's Dynamo for disaggregated inference. I have really gotten this model to be a…

@h100envy: Ying Sheng co-wrote SGLang, the inference engine now serving Grok at xAI on a hundred thousand GPUs. She also built Fle…
