An update on running a non-quantized DeepSeek-V4-Flash model at 11 tok/s on a single DGX Spark using SGLang and a custom mega-kernel, as a step toward running GLM-5.2.
SGLang provided Day-0 support for DeepSeek-V4, and a collaboration between the LMSys and NVIDIA engineering teams achieved up to a 5x throughput increase in production, with the improvements shown on the SemiAnalysis InferenceX dashboard.
This post shows how to serve Baidu's Unlimited-OCR model as a temporary, OpenAI-compatible endpoint on Hugging Face Jobs, enabling multi-page document parsing with features like table-to-HTML and equation-to-LaTeX extraction.
A detailed technical comparison of two dominant LLM serving frameworks, SGLang and vLLM, covering architectural differences in KV cache management (RadixAttention vs PagedAttention), throughput, latency, and deployment considerations for self-hosted environments.
A comparison of inference engine performance across hardware: on 2x RTX 3090s, moving from the baseline to vLLM with TP=2 raises throughput from ~14.5 tok/s to ~64 tok/s, and on an RTX PRO 6000, moving to SGLang raises it from ~32 tok/s to ~110 tok/s. The post recommends vLLM or SGLang for CUDA and multi-GPU setups and llama.cpp for edge devices.
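For scale, the quoted figures can be turned into speedup ratios. This is only arithmetic on the approximate numbers in the summary above; the labels and the `speedup` helper are illustrative names, not anything from the original post.

```python
def speedup(before_tok_s: float, after_tok_s: float) -> float:
    """Return how many times faster the 'after' throughput is."""
    return after_tok_s / before_tok_s


# Approximate throughput figures (tok/s) as quoted in the summary.
results = {
    "2x RTX 3090, baseline -> vLLM TP=2": (14.5, 64.0),
    "RTX PRO 6000, baseline -> SGLang": (32.0, 110.0),
}

for label, (before, after) in results.items():
    print(f"{label}: ~{speedup(before, after):.1f}x")
```

Since the inputs are rounded, the ratios (roughly 4.4x and 3.4x) should be read as approximate too.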
Ying Sheng co-wrote SGLang, the inference engine now serving Grok at xAI on a hundred thousand GPUs, achieving 5x cost cuts over DeepSeek's API; she also built FlexGen and helped build Chatbot Arena.
Z.ai has open-sourced its RL infrastructure, the slime framework, which enabled efficient OPD post-training of GLM-5.2 in about two days. slime is an LLM post-training framework for RL scaling that integrates Megatron and SGLang, and it has been battle-tested on frontier models such as GLM, Qwen, DeepSeek, and Llama.
A user shares their Docker deployment configuration for running the GLM-5.2-FP8 model on HGX-H200 hardware using SGLang, achieving 262k context and 70 tokens/s.
Speculative decoding is an inference optimization technique in which a fast draft model proposes future tokens and a larger model verifies them in parallel, improving LLM generation speed. The article highlights its trending status on Papers with Code and a recent SGLang blog post about state-of-the-art latencies using DFlash models.
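The draft-then-verify loop can be sketched with toy models. This is a minimal greedy-decoding illustration, not SGLang's or DFlash's implementation: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names, and the verification is written as a plain loop, whereas a real engine would score all drafted positions in one batched forward pass of the large model.

```python
from typing import Callable, List

# A "model" here maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Draft k tokens, then keep the longest run the target agrees with."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposed position (batched in a real engine).
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. All drafts accepted: the target's next token comes for free.
    accepted.append(target(ctx))
    return accepted


def generate(target: Model, draft: Model, prefix: List[int], n_tokens: int, k: int = 3) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prefix) + n_tokens]


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # counts 0..9 cyclically


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 4, where it guesses wrong.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(generate(toy_target, toy_draft, [0], n_tokens=12))
```

With greedy verification like this, the output matches what the target model would produce on its own; the draft model only changes how many target calls are needed, not the result.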
A collaboration between Modal, SGLang, and Z Lab integrates DFlash speculation into SGLang, achieving up to 4.3x throughput improvement for Alibaba's Qwen 397B-A17B model, advancing open intelligence.
New research on the DFlash and Spec V2 speculative decoding methods achieves more than 4.3x baseline throughput for LLM inference; the work is released as the default speculative decoding engine in SGLang.
DFlash, a block-diffusion drafter with KV injection, is now running at frontier scale, achieving up to 4.3x greater throughput over baseline, integrated with Modal and SGLang for Qwen 397B.
A detailed guide on learning AI inference engine internals, covering serving engines like vLLM and SGLang, low-level GPU kernel programming with Triton and CUTLASS, and a sequence of mini-projects to build hands-on expertise.
A delta-compressed weight sync technique has been merged into slime, enabling lossless delta sync for Megatron ↔ SGLang disaggregation and improving reinforcement learning at scale.
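The general idea of a lossless delta sync can be illustrated with a small sketch. This is not slime's actual algorithm, which the summary does not describe; it assumes a simple scheme (byte-wise XOR against the previous weights, then zlib compression), and `pack_weights`, `encode_delta`, and `apply_delta` are hypothetical helper names.

```python
import struct
import zlib
from typing import List


def pack_weights(values: List[float]) -> bytes:
    """Serialize weights as little-endian float32 bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def encode_delta(old: bytes, new: bytes) -> bytes:
    """XOR new against old and compress; unchanged bytes become zeros."""
    if len(old) != len(new):
        raise ValueError("weight buffers must have the same size")
    xored = bytes(a ^ b for a, b in zip(old, new))
    return zlib.compress(xored)


def apply_delta(old: bytes, blob: bytes) -> bytes:
    """Reconstruct the new weights exactly from old weights plus the delta."""
    xored = zlib.decompress(blob)
    return bytes(a ^ b for a, b in zip(old, xored))


# Toy update: only every 100th weight changes between two training steps.
old_vals = [i * 0.001 for i in range(10_000)]
new_vals = [v + 0.5 if i % 100 == 0 else v for i, v in enumerate(old_vals)]
old, new = pack_weights(old_vals), pack_weights(new_vals)

blob = encode_delta(old, new)
assert apply_delta(old, blob) == new  # bit-exact round trip
print(f"full sync: {len(new)} bytes, delta sync: {len(blob)} bytes")
```

Because XOR is its own inverse, the receiver recovers the new weights bit for bit, and the payload shrinks when few weights change between syncs.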
Based on the SGLang Omni team's internal decision-making article, this post introduces the operating principles of LLM inference systems in an accessible way, starting from basic concepts such as autoregressive decoding, KV cache, and continuous batching.
Sharing an in-depth interview with SGLang/RadixArk CTO Zhu Banghua, covering his journey from Tsinghua to Berkeley, founding NexusFlow (later acquired by NVIDIA), and his second startup RadixArk, touching on topics like SGLang, reinforcement learning, and the NVIDIA acquisition.
This article benchmarks vLLM, SGLang, and llama.cpp on a mixed Blackwell/Ada GPU cluster for long-context prefill, finding that vLLM significantly outperforms the others on heterogeneous setups while SGLang crashes on Ada cards due to FP4 support limitations.
Modal engineers profiled SGLang's scheduler on multimodal VLM workloads and found that replacing expensive GPU memory bookkeeping with a simple Python dict cache improved throughput by 16% and reduced latency by over 13%, with the fix merged into SGLang v0.5.10.
Z-lab released DFlash, a speculative decoding drafter model for Gemma-4-31B-it that uses lightweight block diffusion to draft multiple tokens in parallel, achieving up to a 5.8x speedup over the autoregressive baseline.
A quantized 478B-parameter GLM-5.1 model runs on 4×RTX Pro 6000 GPUs via SGLang, delivering 370k-token context at up to 45 tok/s decode and 1340 tok/s prefill, and is demoed driving Figma.