@modal: New replicas of @vllm_project and @sgl_project servers start up 3-10x faster on Modal. Read the article to learn how – from GPU health management to CUDA context checkpointing. https://t.co/ugAreYxcGD
Summary
Modal has announced that replicas of vLLM and SGLang servers now start up 3-10x faster, leveraging improvements in GPU health management and CUDA context checkpointing.
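For context, a vLLM server on Modal is typically a decorated Python function whose replicas Modal scales up and down; the sketch below loosely follows Modal's public vLLM examples, with the app name, GPU type, and model id as placeholder assumptions rather than details from the article.

```python
import modal

# Hedged sketch of a vLLM server replica on Modal. App name, GPU type, and
# model id are placeholders; the fast-start machinery the article describes
# is not shown here.
image = modal.Image.debian_slim(python_version="3.12").pip_install("vllm")

app = modal.App("vllm-openai-server", image=image)

@app.function(gpu="H100")
@modal.web_server(port=8000)
def serve():
    import subprocess

    # Start an OpenAI-compatible vLLM server; Modal routes traffic to the
    # port declared in @modal.web_server and scales replicas of this function.
    subprocess.Popen(
        ["vllm", "serve", "Qwen/Qwen2.5-7B-Instruct", "--port", "8000"]
    )
```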
Similar Articles
Boosting multimodal inference performance by >10% with a single Python dict
Modal engineers profiled SGLang's scheduler on multimodal VLM workloads and found that replacing expensive GPU memory bookkeeping with a simple Python dict cache improved throughput by 16% and reduced latency by over 13%, with the fix merged into SGLang v0.5.10.
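The exact SGLang patch isn't shown in the summary; as a rough illustration of the pattern it describes, a plain dict can memoize the result of an expensive per-input GPU bookkeeping step (all names below are hypothetical):

```python
# Hypothetical illustration of the pattern described above: cache the result of
# an expensive per-item GPU bookkeeping call in a plain Python dict so repeated
# multimodal inputs skip the device round-trip. The actual SGLang v0.5.10
# change is not reproduced here.
from typing import Any, Callable

_embedding_cache: dict[str, Any] = {}

def get_image_embedding(image_hash: str, compute_on_gpu: Callable[[str], Any]) -> Any:
    """Return a cached embedding if this image has been seen before."""
    cached = _embedding_cache.get(image_hash)
    if cached is not None:
        return cached                      # O(1) dict hit, no GPU work
    result = compute_on_gpu(image_hash)    # expensive path, runs once per image
    _embedding_cache[image_hash] = result
    return result
```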
@charles_irl: Inference isn't everything, but it does require a new stack -- not Kubernetes, not SLURM. At @modal, we dove deep to bu…
Modal engineers detail their approach to achieving truly serverless GPUs for AI inference, combining cloud buffers, a custom content-addressed filesystem, and CPU/GPU checkpoint/restore to scale replicas in tens of seconds instead of minutes.
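As a rough illustration of one of those pieces, content addressing means blobs are stored and looked up by a hash of their contents, so identical data (container layers, model weights) is deduplicated and cacheable anywhere; the sketch below shows the concept only, not Modal's implementation:

```python
# Minimal sketch of content addressing: blobs are keyed by the hash of their
# bytes, so identical content is stored once and can be fetched from any cache
# that has it. Illustration of the concept only, not Modal's filesystem.
import hashlib

class ContentAddressedStore:
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)   # identical content stored once
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

store = ContentAddressedStore()
addr = store.put(b"model shard 0")
assert store.get(addr) == b"model shard 0"
```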
vllm-project/vllm v0.19.1
vLLM v0.19.1 release - a fast and easy-to-use open-source library for LLM inference and serving with state-of-the-art throughput, supporting 200+ model architectures and diverse hardware including NVIDIA/AMD GPUs and CPUs.
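For reference, vLLM's basic offline-inference API looks like the following; the model id is a placeholder and nothing here is specific to the v0.19.1 release:

```python
# Minimal vLLM offline-inference example. LLM and SamplingParams are part of
# the public API; the model id below is just a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")   # any supported HF model id
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Explain KV caching in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```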
@binsquares: omg, GPU acceleration on smolvm works way better than I thought. can run llama.cpp inside the smol machine with close t…
User @binsquares reports that GPU acceleration on smolvm achieves nearly 90% of host performance when running llama.cpp via the Vulkan backend.
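The tweet's VM setup can't be reproduced from the summary alone; as a rough sketch, running llama.cpp with full GPU offload through the llama-cpp-python bindings looks like this, where selecting the Vulkan backend at build time is an assumption about packaging rather than a detail from the tweet:

```python
# Hedged sketch: this shows llama.cpp with full GPU offload via the
# llama-cpp-python bindings, not the smolvm micro-VM setup from the tweet.
# Picking the Vulkan backend is a build-time choice for the bindings
# (assumed here, e.g. CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_gpu_layers=-1)  # -1: offload every layer
result = llm("Q: Why is the sky blue? A:", max_tokens=32)
print(result["choices"][0]["text"])
```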
@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…
A quantized 478B-parameter GLM-5.1 model runs on 4×RTX Pro 6000 GPUs via SGLang, delivering 370k-token context at up to 45 tok/s decode and 1340 tok/s prefill, and is demoed driving Figma.
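A launch command roughly matching that setup, sketched with a placeholder checkpoint path and under the assumption that standard SGLang server flags were used:

```python
# Hedged sketch of an SGLang multi-GPU launch similar to the setup above:
# tensor parallelism across 4 GPUs and an extended context window. The
# checkpoint path is a placeholder; the tweet's exact flags are not known.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "/models/GLM-5.1-478B-NVFP4",  # placeholder path
    "--tp", "4",                                    # tensor-parallel over 4 GPUs
    "--context-length", "370000",
    "--port", "30000",
])
```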