sglang

#sglang

@Ex0byt: Update: the road to GLM-5.2: we're getting there, folks! non-quantized, non-pruned DeepSeek-v4-Flash. 11tok/s on a sing…

X AI KOLs Timeline ↗ · 2d ago Cached

Update on running a non-quantized, non-pruned DeepSeek-v4-Flash model at 11 tok/s on a single DGX Spark using SGLang and a custom mega-kernel, as a step toward GLM-5.2.

#sglang

@PyTorch: While SGLang provided Day-0 support for DeepSeek-V4, the collaboration between the @lmsysorg and @NVIDIAAI engineering …

X AI KOLs Following ↗ · 2d ago Cached

SGLang provided Day-0 support for DeepSeek-V4, and collaboration between the LMSYS and NVIDIA engineering teams achieved up to a 5x throughput increase in production, with improvements shown on the SemiAnalysis InferenceX dashboard.

#sglang

@vanstriendaniel: It's raining OCR models again! @Baidu_Inc's Unlimited-OCR is one of the more interesting. You can try it without much e…

X AI KOLs Following ↗ · 2d ago Cached

This post shows how to serve Baidu's Unlimited-OCR model as a temporary, OpenAI-compatible endpoint on Hugging Face Jobs, enabling multi-page document parsing with features like table-to-HTML and equation-to-LaTeX extraction.

#sglang

@Mayhem4Markets: https://x.com/Mayhem4Markets/status/2069090022117019928

X AI KOLs Following ↗ · 3d ago Cached

A detailed technical comparison of two dominant LLM serving frameworks, SGLang and vLLM, covering architectural differences in KV cache management (RadixAttention vs PagedAttention), throughput, latency, and deployment considerations for self-hosted environments.

#sglang

@TheAhmadOsman: Why do I focus on Inference Engines/Software Stacks for your hardware? - 2x RTX 3090s: ~14.5 tok/s → ~64 tok/s moving t…

X AI KOLs Following ↗ · 4d ago Cached

Compares inference engine performance across hardware: on 2x RTX 3090s, moving to vLLM with TP=2 raises throughput from ~14.5 tok/s to ~64 tok/s; on an RTX PRO 6000, moving to SGLang raises it from ~32 tok/s to ~110 tok/s. Recommends vLLM or SGLang for CUDA and multi-GPU setups, and llama.cpp for edge devices.
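
As a quick check on the figures quoted above, the speedup factors can be worked out directly. This is a minimal sketch: the tok/s numbers are the approximate values from the post, and the `speedup` helper and the dictionary labels are illustrative names, not taken from the source. The two rows describe different hardware, so the factors are not comparable with each other.

```python
def speedup(before_tok_s: float, after_tok_s: float) -> float:
    """Return the throughput ratio after/before."""
    return after_tok_s / before_tok_s


# Approximate figures quoted in the post (tok/s before, tok/s after).
reported = {
    "2x RTX 3090 -> vLLM, TP=2": (14.5, 64.0),
    "RTX PRO 6000 -> SGLang": (32.0, 110.0),
}

for setup, (before, after) in reported.items():
    print(f"{setup}: ~{speedup(before, after):.1f}x")
# prints:
# 2x RTX 3090 -> vLLM, TP=2: ~4.4x
# RTX PRO 6000 -> SGLang: ~3.4x
```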

#sglang

@h100envy: Ying Sheng co-wrote SGLang, the inference engine now serving Grok at xAI on a hundred thousand GPUs. She also built Fle…

X AI KOLs Timeline ↗ · 6d ago Cached

Ying Sheng co-wrote SGLang, the inference engine now serving Grok at xAI on a hundred thousand GPUs, achieving 5x cost cuts over DeepSeek's API; she also built FlexGen and helped build Chatbot Arena.

#sglang

@didier_lopes: Incredible how Z. ai literally has their RL infrastructure open source. The entire OPD post-training of GLM-5.2 took on…

X AI KOLs Following ↗ · 2026-06-19 Cached

Z.ai has open-sourced its RL infrastructure, the slime framework, which enabled efficient OPD post-training of GLM-5.2 in about two days. slime is an LLM post-training framework for RL scaling that integrates Megatron and SGLang, and has been battle-tested by frontier models like GLM, Qwen, DeepSeek, and Llama.

#sglang

My GLM-5.2-FP8 HGX-H200 SGLang docker deploy config

Reddit r/LocalLLaMA ↗ · 2026-06-17

A user shares their Docker deployment configuration for running the GLM-5.2-FP8 model on HGX-H200 hardware using SGLang, achieving 262k context and 70 tokens/s.

#sglang

What is Speculative Decoding? (trending on paperswithco.de) [R]

Reddit r/MachineLearning ↗ · 2026-06-17

Speculative decoding is an inference optimization technique that uses a fast draft model to propose future tokens verified in parallel by a larger model, improving LLM generation speed. The article highlights its trending status on Papers with Code and a recent SGLang blog post about state-of-the-art latencies using DFlash models.
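
The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration of greedy speculative decoding only, not SGLang's or DFlash's implementation: `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are plain functions over integer tokens, and verification is written as a loop where a real engine would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding; output equals target-only greedy decoding."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target model checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # 3. First mismatch: keep the agreed prefix plus the
                #    target's own token, and drop the rest of the draft.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every drafted token matched the target.
            out.extend(proposal)
    return out


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_decode(toy_target, toy_draft, prompt=[0], max_new=8))
# prints [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Because every accepted token is one the target would have produced anyway, the result is identical to decoding with the target alone; the gain comes from verifying several positions per target pass when the draft is usually right.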

#sglang

@charles_irl: Many are belatedly realizing that intelligence must be open. For open intelligence to succeed, developers must work tog…

X AI KOLs Following ↗ · 2026-06-15 Cached

A collaboration between Modal, SGLang, and Z Lab integrates DFlash speculation into SGLang, achieving up to 4.3x throughput improvement for Alibaba's Qwen 397B-A17B model, advancing open intelligence.

#sglang

@lmsysorg: New blog: The next generation of speculative decoding: DFlash and Spec V2 DFlash + Spec V2 hit >4.3X baseline throughpu…

X AI KOLs Following ↗ · 2026-06-15 Cached

A new blog post on the DFlash and Spec V2 speculative decoding methods reports more than 4.3x baseline throughput for LLM inference; they are released as the default speculative decoding engine in SGLang.

#sglang

@zhijianliu_: This is what DFlash was built for. Our block-diffusion drafter + KV injection, now running at frontier scale — thanks t…

X AI KOLs Following ↗ · 2026-06-15 Cached

DFlash, a block-diffusion drafter with KV injection, is now running at frontier scale, achieving up to 4.3x greater throughput over baseline, integrated with Modal and SGLang for Qwen 397B.

#sglang

@TheAhmadOsman: How to go about learning all of this? 1st: Start with the serving engine view - vLLM: PagedAttention, continuous batchi…

X AI KOLs Following ↗ · 2026-06-08 Cached

A detailed guide on learning AI inference engine internals, covering serving engines like vLLM and SGLang, low-level GPU kernel programming with Triton and CUTLASS, and a sequence of mini-projects to build hands-on expertise.

#sglang

@charles_irl: congrats to my colleague @nanjiangwill on getting this important technique merged into slime!

X AI KOLs Following ↗ · 2026-05-30 Cached

Delta-compressed weight sync technique merged into slime, enabling lossless delta sync for Megatron ↔ SGLang disaggregation, enhancing reinforcement learning at scale.

#sglang

@grapeot: How does an LLM inference system actually work? The SGLang Omni team recently published a rare article that lays out the complete decision-making chain of a top inference system team. I followed the original text and wrote an accessible explainer, starting from autoregressive decoding, KV cache, continuous batching...

X AI KOLs Timeline ↗ · 2026-05-30

Based on the SGLang Omni team's article about its decision-making process, this post explains how LLM inference systems work in an accessible way, starting from basic concepts such as autoregressive decoding, KV cache, and continuous batching.

#sglang

@wey_gu: It's such a pleasure to hear him speak — Zhu Banghua: SGLang, Reinforcement Learning, NVIDIA Acquisition, Second Startup, Tsinghua, Berkeley, LMSYS, Chatbot Arena, Good at Letting Go

X AI KOLs Timeline ↗ · 2026-05-22 Cached

Sharing an in-depth interview with SGLang/RadixArk CTO Zhu Banghua, covering his journey from Tsinghua to Berkeley, founding NexusFlow (later acquired by NVIDIA), and his second startup RadixArk, touching on topics like SGLang, reinforcement learning, and the NVIDIA acquisition.

#sglang

Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster

Reddit r/LocalLLaMA ↗ · 2026-05-17

This article benchmarks vLLM, SGLang, and llama.cpp on a mixed Blackwell/Ada GPU cluster for long context prefill, finding vLLM significantly outperforms others on heterogeneous setups while SGLang crashes with Ada cards due to FP4 support limitations.

#sglang

Boosting multimodal inference performance by >10% with a single Python dict

Hacker News Top ↗ · 2026-05-06 Cached

Modal engineers profiled SGLang's scheduler on multimodal VLM workloads and found that replacing expensive GPU memory bookkeeping with a simple Python dict cache improved throughput by 16% and reduced latency by over 13%, with the fix merged into SGLang v0.5.10.

#sglang

z-lab/gemma-4-31B-it-DFlash

Hugging Face Models Trending ↗ · 2026-04-30 Cached

Z-lab released DFlash, a speculative decoding drafter model for Gemma-4-31B-it that uses lightweight block diffusion to draft multiple tokens in parallel, achieving up to a 5.8x speedup over an autoregressive baseline.

#sglang

@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…

X AI KOLs Timeline ↗ · 2026-04-21 Cached

A quantized 478B-parameter GLM-5.1 model runs on 4×RTX Pro 6000 GPUs via SGLang, delivering 370k-token context at up to 45 tok/s decode and 1340 tok/s prefill, and is demoed driving Figma.
