This paper introduces sparse prefix caching for hybrid and recurrent LLMs, which stores recurrent states at a limited set of checkpoint positions to avoid dense caching while minimizing recomputation. The method outperforms standard heuristics on real-world data, especially when requests share substantial but non-identical prefixes.
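A minimal sketch of the checkpoint idea, assuming evenly spaced checkpoints and a generic recurrent step function; the names (CHECKPOINT_STRIDE, step, prefill_with_checkpoints, resume_from_checkpoint) are illustrative and not the paper's interface:

```python
# Illustrative sketch of sparse prefix caching for a recurrent model:
# recurrent states are saved only at checkpoint positions, and a new
# request restores the latest checkpoint at or before its shared prefix,
# then recomputes the remaining tokens.
from bisect import bisect_right

CHECKPOINT_STRIDE = 256  # save a state every 256 tokens (assumed policy)

def prefill_with_checkpoints(tokens, step, init_state, cache):
    """Run prefill, storing recurrent states sparsely in `cache`.

    `cache` maps a token-prefix tuple -> recurrent state;
    `step(state, token)` advances the recurrence by one token.
    """
    state = init_state
    for i, tok in enumerate(tokens, start=1):
        state = step(state, tok)
        if i % CHECKPOINT_STRIDE == 0:
            cache[tuple(tokens[:i])] = state  # sparse checkpoint
    return state

def resume_from_checkpoint(tokens, match_len, step, init_state, cache):
    """Reuse the latest checkpoint at or before the shared prefix length."""
    positions = sorted(
        len(k) for k in cache if k == tuple(tokens[:len(k)])
    )
    usable = positions[:bisect_right(positions, match_len)]
    start = usable[-1] if usable else 0
    state = cache[tuple(tokens[:start])] if start else init_state
    # Recompute only the tokens after the checkpoint.
    for tok in tokens[start:]:
        state = step(state, tok)
    return state
```

The trade-off the checkpoints encode: denser checkpoints reduce recomputation for partially shared prefixes, while sparser ones keep the memory cost far below dense per-token caching.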
vLLM v0.20.1 is a minor version update for the popular open-source LLM inference and serving library, maintaining its focus on high-throughput and efficient memory management.
vLLM v0.20.2rc0 release candidate adds a shutdown() method to the LLM serving library.
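A hedged usage sketch: this assumes the new shutdown() method lands on the offline vllm.LLM entrypoint, which the note implies but does not spell out; the model name is only an example.

```python
# Sketch of explicitly releasing engine resources with the new shutdown()
# method; placement on vllm.LLM is assumed from the release note.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)

llm.shutdown()  # assumed semantics: tear down workers and free GPU memory deterministically
```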
vLLM version 0.20.1rc0 is released, adding a system_fingerprint field to OpenAI-compatible API responses for better request tracking.
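A short sketch of reading the new field through the standard OpenAI Python client, assuming a local vLLM OpenAI-compatible server (e.g. started with `vllm serve <model>`) and that this release populates system_fingerprint; the model name is illustrative.

```python
# Read system_fingerprint from a vLLM OpenAI-compatible server to trace
# which server build/configuration handled a given request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.system_fingerprint)  # populated by the server as of this release (assumed)
```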
Researchers propose Prefill-as-a-Service (PrfaaS), a system that offloads long-context prefill to remote compute-dense clusters and streams KVCache over commodity Ethernet, enabling independent scaling and 32-54% higher throughput for a 1T-parameter hybrid model.
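The endpoint, payload format, and engine interface below are assumptions used only to illustrate the prefill-offload flow; they are not PrfaaS's actual API.

```python
# Purely illustrative sketch of the disaggregated flow described above:
# the decode side ships prompt tokens to a compute-dense prefill cluster
# and streams the resulting KV cache back over plain HTTP/Ethernet.
import json
import urllib.request

PREFILL_ENDPOINT = "http://prefill-cluster.internal:9000/prefill"  # hypothetical

def remote_prefill(prompt_token_ids):
    """Ask the remote prefill service for the serialized KV cache."""
    body = json.dumps({"token_ids": prompt_token_ids}).encode()
    req = urllib.request.Request(
        PREFILL_ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # raw KV-cache bytes streamed over commodity Ethernet

def decode_with_imported_kv(kv_bytes, prompt_token_ids, decode_engine):
    """Hand the imported KV cache to a local decode engine (interface assumed)."""
    decode_engine.load_kv_cache(prompt_token_ids, kv_bytes)
    return decode_engine.decode(prompt_token_ids)

# Usage (requires a running prefill service and a decode engine object):
# kv = remote_prefill(token_ids)
# text = decode_with_imported_kv(kv, token_ids, engine)
```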
vLLM v0.19.1 release - a fast and easy-to-use open-source library for LLM inference and serving with state-of-the-art throughput, supporting 200+ model architectures and diverse hardware including NVIDIA/AMD GPUs and CPUs.
vLLM v0.19.2rc0 release candidate includes a bugfix for k_proj's bias handling in GLM-ASR models, addressing a specific compatibility issue in the LLM serving framework.
This paper introduces PagedAttention, an algorithm inspired by virtual memory paging, and vLLM, a serving system that significantly improves LLM throughput by reducing memory fragmentation in key-value caches.
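A toy illustration of the block-table bookkeeping behind PagedAttention, simplified for clarity and not vLLM's actual implementation: the KV cache is carved into fixed-size blocks, and each sequence maps its logical blocks to whatever physical blocks are free, so no large contiguous region has to be reserved up front.

```python
# Simplified block-table sketch: per-sequence logical-to-physical block
# mapping over a shared pool, allocating blocks only as tokens arrive.
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV cache blocks")
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        """Grow the sequence by one token, allocating a block only when needed."""
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()

# Example: two sequences share one small pool without fragmentation.
alloc = BlockAllocator(num_blocks=8)
a, b = Sequence(alloc), Sequence(alloc)
for _ in range(20):
    a.append_token()   # uses 2 blocks (ceil(20 / 16))
for _ in range(5):
    b.append_token()   # uses 1 block
print(a.block_table, b.block_table, alloc.free)
```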