@modal: New replicas of @vllm_project and @sgl_project servers start up 3-10x faster on Modal. Read the article to learn how – from GPU health management to CUDA context checkpointing. https://t.co/ugAreYxcGD
Summary
Modal has announced that replicas of vLLM and SGLang servers now start up 3-10x faster, leveraging improvements in GPU health management and CUDA context checkpointing.
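For context, a vLLM server on Modal is typically a decorated Python function whose replicas Modal scales up and down; the sketch below loosely follows Modal's public vLLM examples, with the app name, GPU type, and model id as placeholder assumptions rather than details from the article.

```python
import modal

# Hedged sketch of a vLLM server replica on Modal. App name, GPU type, and
# model id are placeholders; the fast-start machinery the article describes
# is not shown here.
image = modal.Image.debian_slim(python_version="3.12").pip_install("vllm")

app = modal.App("vllm-openai-server", image=image)

@app.function(gpu="H100")
@modal.web_server(port=8000)
def serve():
    import subprocess

    # Start an OpenAI-compatible vLLM server; Modal routes traffic to the
    # port declared in @modal.web_server and scales replicas of this function.
    subprocess.Popen(
        ["vllm", "serve", "Qwen/Qwen2.5-7B-Instruct", "--port", "8000"]
    )
```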
Similar Articles
Boosting multimodal inference performance by >10% with a single Python dict
Modal engineers profiled SGLang's scheduler on multimodal VLM workloads and found that replacing expensive GPU memory bookkeeping with a simple Python dict cache improved throughput by 16% and reduced latency by over 13%, with the fix merged into SGLang v0.5.10.
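The exact SGLang patch isn't shown in the summary; as a rough illustration of the pattern it describes, a plain dict can memoize the result of an expensive per-input GPU bookkeeping step (all names below are hypothetical):

```python
# Hypothetical illustration of the pattern described above: cache the result of
# an expensive per-item GPU bookkeeping call in a plain Python dict so repeated
# multimodal inputs skip the device round-trip. The actual SGLang v0.5.10
# change is not reproduced here.
from typing import Any, Callable

_embedding_cache: dict[str, Any] = {}

def get_image_embedding(image_hash: str, compute_on_gpu: Callable[[str], Any]) -> Any:
    """Return a cached embedding if this image has been seen before."""
    cached = _embedding_cache.get(image_hash)
    if cached is not None:
        return cached                      # O(1) dict hit, no GPU work
    result = compute_on_gpu(image_hash)    # expensive path, runs once per image
    _embedding_cache[image_hash] = result
    return result
```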
@charles_irl: Inference isn't everything, but it does require a new stack -- not Kubernetes, not SLURM. At @modal, we dove deep to bu…
Modal engineers detail their approach to achieving truly serverless GPUs for AI inference, combining cloud buffers, a custom content-addressed filesystem, and CPU/GPU checkpoint/restore to scale replicas in tens of seconds instead of minutes.
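As a rough illustration of one of those pieces, content addressing means blobs are stored and looked up by a hash of their contents, so identical data (container layers, model weights) is deduplicated and cacheable anywhere; the sketch below shows the concept only, not Modal's implementation:

```python
# Minimal sketch of content addressing: blobs are keyed by the hash of their
# bytes, so identical content is stored once and can be fetched from any cache
# that has it. Illustration of the concept only, not Modal's filesystem.
import hashlib

class ContentAddressedStore:
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)   # identical content stored once
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

store = ContentAddressedStore()
addr = store.put(b"model shard 0")
assert store.get(addr) == b"model shard 0"
```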
vllm-project/vllm v0.19.1
vLLM v0.19.1 release - a fast and easy-to-use open-source library for LLM inference and serving with state-of-the-art throughput, supporting 200+ model architectures and diverse hardware including NVIDIA/AMD GPUs and CPUs.
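For reference, vLLM's basic offline-inference API looks like the following; the model id is a placeholder and nothing here is specific to the v0.19.1 release:

```python
# Minimal vLLM offline-inference example. LLM and SamplingParams are part of
# the public API; the model id below is just a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")   # any supported HF model id
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Explain KV caching in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```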
@binsquares: omg, GPU acceleration on smolvm works way better than I thought. can run llama.cpp inside the smol machine with close t…
User @binsquares reports that GPU acceleration on smolvm achieves nearly 90% of host performance when running llama.cpp via the Vulkan backend.
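The tweet's VM setup can't be reproduced from the summary alone; as a rough sketch, running llama.cpp with full GPU offload through the llama-cpp-python bindings looks like this, where selecting the Vulkan backend at build time is an assumption about packaging rather than a detail from the tweet:

```python
# Hedged sketch: this shows llama.cpp with full GPU offload via the
# llama-cpp-python bindings, not the smolvm micro-VM setup from the tweet.
# Picking the Vulkan backend is a build-time choice for the bindings
# (assumed here, e.g. CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_gpu_layers=-1)  # -1: offload every layer
result = llm("Q: Why is the sky blue? A:", max_tokens=32)
print(result["choices"][0]["text"])
```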
@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…
A quantized 478B-parameter GLM-5.1 model runs on 4×RTX Pro 6000 GPUs via SGLang, delivering 370k-token context at up to 45 tok/s decode and 1340 tok/s prefill, and is demoed driving Figma.
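A launch command roughly matching that setup, sketched with a placeholder checkpoint path and under the assumption that standard SGLang server flags were used:

```python
# Hedged sketch of an SGLang multi-GPU launch similar to the setup above:
# tensor parallelism across 4 GPUs and an extended context window. The
# checkpoint path is a placeholder; the tweet's exact flags are not known.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "/models/GLM-5.1-478B-NVFP4",  # placeholder path
    "--tp", "4",                                    # tensor-parallel over 4 GPUs
    "--context-length", "370000",
    "--port", "30000",
])
```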