Tag
A blog post explaining continuous batching, a technique for improving LLM serving throughput by dynamically adding new requests to a batch as old ones finish, keeping the GPU busy and reducing idle time.
Recommends an introduction to LLM serving, inference basics, and vLLM, covering paged attention and continuous batching.
Continuous batching has been added to TRL for GRPO, improving speed and VRAM usage without needing vLLM. The tweet explains how it works and when to use it.
zml/llmd now runs fully on Apple's Metal API, serving 8 simultaneous requests at full bf16 precision, with continuous batching and other modern features.
Based on an internal decision-making article from the SGLang Omni team, this post gives an accessible introduction to how LLM inference systems work, starting from basic concepts such as autoregressive decoding, the KV cache, and continuous batching.
dlmserve is the first open-source serving engine for diffusion language models, providing an OpenAI-compatible API, continuous batching, and 2.5x the throughput of Hugging Face, all within 12 GB of VRAM.
oMLX 0.3.9rc1, an LLM inference server optimized for Apple Silicon Macs, adds low-memory stability, chunked prefill, multi-tasking admin chat, and more.
This article explains how to implement asynchronous continuous batching for LLM inference, overlapping CPU batch preparation with GPU computation to maximize utilization and reduce idle time.
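The entries above describe continuous batching as refilling a batch with new requests as soon as old ones finish. The toy simulation below is a minimal sketch of that idea, not the implementation used by any of the linked projects. It assumes every decode step yields one token per running request and costs the same time. The function names and the `lengths` workload are hypothetical, chosen only for illustration.

```python
from collections import deque


def continuous_batching_steps(lengths, max_batch_size):
    """Count decode steps when free slots are refilled after every step."""
    waiting = deque(lengths)  # tokens still to generate, per queued request
    running = []              # remaining tokens for requests in the batch
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before the next step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token for every running request.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Retire finished requests so their slots free up immediately.
        running = [remaining for remaining in running if remaining > 0]
    return steps


def static_batching_steps(lengths, max_batch_size):
    """Count decode steps when each batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), max_batch_size):
        steps += max(lengths[start:start + max_batch_size])
    return steps


# Hypothetical workload: two long requests mixed with six short ones.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, max_batch_size=4)
continuous_steps = continuous_batching_steps(lengths, max_batch_size=4)
print(static_steps, continuous_steps)
```

On this workload the static schedule needs 16 steps and the continuous one needs 10, because slots freed by the short requests are reused while the long requests are still decoding. A real serving engine also has to handle prefill, KV-cache memory limits, and preemption, none of which this sketch models.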