This paper introduces sparse prefix caching for hybrid and recurrent LLMs, which stores recurrent states at a limited set of checkpoint positions to avoid dense caching while minimizing recomputation. The method outperforms standard heuristics on real-world data, especially when requests share substantial but non-identical prefixes.
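A minimal sketch of the checkpoint idea, assuming evenly spaced checkpoints and a generic recurrent step function; the names (CHECKPOINT_STRIDE, step, prefill_with_checkpoints, resume_from_checkpoint) are illustrative and not the paper's interface:

```python
# Illustrative sketch of sparse prefix caching for a recurrent model:
# recurrent states are saved only at checkpoint positions, and a new
# request restores the latest checkpoint at or before its shared prefix,
# then recomputes the remaining tokens.
from bisect import bisect_right

CHECKPOINT_STRIDE = 256  # save a state every 256 tokens (assumed policy)

def prefill_with_checkpoints(tokens, step, init_state, cache):
    """Run prefill, storing recurrent states sparsely in `cache`.

    `cache` maps a token-prefix tuple -> recurrent state;
    `step(state, token)` advances the recurrence by one token.
    """
    state = init_state
    for i, tok in enumerate(tokens, start=1):
        state = step(state, tok)
        if i % CHECKPOINT_STRIDE == 0:
            cache[tuple(tokens[:i])] = state  # sparse checkpoint
    return state

def resume_from_checkpoint(tokens, match_len, step, init_state, cache):
    """Reuse the latest checkpoint at or before the shared prefix length."""
    positions = sorted(
        len(k) for k in cache if k == tuple(tokens[:len(k)])
    )
    usable = positions[:bisect_right(positions, match_len)]
    start = usable[-1] if usable else 0
    state = cache[tuple(tokens[:start])] if start else init_state
    # Recompute only the tokens after the checkpoint.
    for tok in tokens[start:]:
        state = step(state, tok)
    return state
```

The trade-off the checkpoints encode: denser checkpoints reduce recomputation for partially shared prefixes, while sparser ones keep the memory cost far below dense per-token caching.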
vLLM v0.20.1 is a minor version update for the popular open-source LLM inference and serving library, maintaining its focus on high-throughput and efficient memory management.
vLLM v0.20.2rc0 release candidate adds a shutdown() method to the LLM serving library.
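A hedged usage sketch: this assumes the new shutdown() method lands on the offline vllm.LLM entrypoint, which the note implies but does not spell out; the model name is only an example.

```python
# Sketch of explicitly releasing engine resources with the new shutdown()
# method; placement on vllm.LLM is assumed from the release note.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)

llm.shutdown()  # assumed semantics: tear down workers and free GPU memory deterministically
```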
vLLM version 0.20.1rc0 is released, adding a system_fingerprint field to OpenAI-compatible API responses for better request tracking.
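A short sketch of reading the new field through the standard OpenAI Python client, assuming a local vLLM OpenAI-compatible server (e.g. started with `vllm serve <model>`) and that this release populates system_fingerprint; the model name is illustrative.

```python
# Read system_fingerprint from a vLLM OpenAI-compatible server to trace
# which server build/configuration handled a given request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.system_fingerprint)  # populated by the server as of this release (assumed)
```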
Researchers propose Prefill-as-a-Service (PrfaaS), a system that offloads long-context prefill to remote compute-dense clusters and streams KVCache over commodity Ethernet, enabling independent scaling and 32-54% higher throughput for a 1T-parameter hybrid model.
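The endpoint, payload format, and engine interface below are assumptions used only to illustrate the prefill-offload flow; they are not PrfaaS's actual API.

```python
# Purely illustrative sketch of the disaggregated flow described above:
# the decode side ships prompt tokens to a compute-dense prefill cluster
# and streams the resulting KV cache back over plain HTTP/Ethernet.
import json
import urllib.request

PREFILL_ENDPOINT = "http://prefill-cluster.internal:9000/prefill"  # hypothetical

def remote_prefill(prompt_token_ids):
    """Ask the remote prefill service for the serialized KV cache."""
    body = json.dumps({"token_ids": prompt_token_ids}).encode()
    req = urllib.request.Request(
        PREFILL_ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # raw KV-cache bytes streamed over commodity Ethernet

def decode_with_imported_kv(kv_bytes, prompt_token_ids, decode_engine):
    """Hand the imported KV cache to a local decode engine (interface assumed)."""
    decode_engine.load_kv_cache(prompt_token_ids, kv_bytes)
    return decode_engine.decode(prompt_token_ids)

# Usage (requires a running prefill service and a decode engine object):
# kv = remote_prefill(token_ids)
# text = decode_with_imported_kv(kv, token_ids, engine)
```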
vLLM v0.19.1 release - a fast and easy-to-use open-source library for LLM inference and serving with state-of-the-art throughput, supporting 200+ model architectures and diverse hardware including NVIDIA/AMD GPUs and CPUs.
vLLM v0.19.2rc0 release candidate includes a bugfix for k_proj's bias handling in GLM-ASR models, addressing a specific compatibility issue in the LLM serving framework.
This paper introduces PagedAttention, an algorithm inspired by virtual memory paging, and vLLM, a serving system that significantly improves LLM throughput by reducing memory fragmentation in key-value caches.
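A toy illustration of the block-table bookkeeping behind PagedAttention, simplified for clarity and not vLLM's actual implementation: the KV cache is carved into fixed-size blocks, and each sequence maps its logical blocks to whatever physical blocks are free, so no large contiguous region has to be reserved up front.

```python
# Simplified block-table sketch: per-sequence logical-to-physical block
# mapping over a shared pool, allocating blocks only as tokens arrive.
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV cache blocks")
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        """Grow the sequence by one token, allocating a block only when needed."""
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()

# Example: two sequences share one small pool without fragmentation.
alloc = BlockAllocator(num_blocks=8)
a, b = Sequence(alloc), Sequence(alloc)
for _ in range(20):
    a.append_token()   # uses 2 blocks (ceil(20 / 16))
for _ in range(5):
    b.append_token()   # uses 1 block
print(a.block_table, b.block_table, alloc.free)
```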