@vllm_project: vLLM v0.21.0 is out! 367 commits from 202 contributors (49 new). Highlights: KV Offload + HMA, spec decode with thinkin…
Summary
vLLM v0.21.0 has been released with KV Offload + HMA, speculative decoding with a thinking budget for reasoning models, TOKENSPEED_MLA on Blackwell for DSR1 and Kimi K2.5, Mooncake distributed KV, DeepSeek V4 pipeline parallelism, and a new C++20 + Transformers v5 baseline.
Cached at: 05/16/26, 09:23 PM
vLLM v0.21.0 is out! 367 commits from 202 contributors (49 new).
Highlights: KV Offload + HMA, spec decode with thinking budget (reasoning models), TOKENSPEED_MLA on Blackwell for DSR1 / Kimi K2.5, Mooncake distributed KV, DeepSeek V4 pipeline parallelism. C++20 + Transformers v5 baseline.
Similar Articles
vllm-project/vllm v0.20.0rc1
vLLM v0.20.0rc1 has been released with major enhancements to throughput, quantization, speculative decoding, and multi-hardware support for scalable LLM serving.
vllm-project/vllm v0.21.0rc1
vLLM v0.21.0rc1 is a pre-release update for the high-performance LLM inference and serving library, featuring optimizations for throughput, quantization, and hardware support.
vllm-project/vllm v0.20.1
vLLM v0.20.1 is a minor version update for the popular open-source LLM inference and serving library, maintaining its focus on high-throughput and efficient memory management.
vllm-project/vllm v0.20.0
vLLM v0.20.0 has been released. vLLM is an open-source library for high-throughput LLM inference and serving, featuring PagedAttention and support for various hardware architectures.
vllm-project/vllm v0.19.1
vLLM v0.19.1 has been released. vLLM is a fast and easy-to-use open-source library for LLM inference and serving with state-of-the-art throughput, supporting 200+ model architectures and diverse hardware including NVIDIA/AMD GPUs and CPUs.