@m_sirovatka: KV Cache re-use is the most important thing for agentic rollouts. We've integrated Mooncake Store into prime-rl with vL…
Summary
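Mooncake Store has been integrated into prime-rl with vLLM as a distributed KV cache. It works as a drop-in replacement for native CPU/disk offloading and enables cross-node prefix cache reuse, which suits agentic workloads whose long traces share most of their prefix.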
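To make the idea of cross-node prefix reuse concrete, here is a minimal toy sketch in Python. It is not the vLLM, Mooncake Store, or prime-rl API: the names `block_hashes`, `LocalCache`, and `serve`, the block size of 16, and the use of a plain `set` as the "shared store" are all assumptions made for illustration. It only models which prompt blocks could be reused, not the KV tensors themselves.

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash full blocks so each hash identifies the whole prefix up to it."""
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class LocalCache:
    """Tiny LRU of block hashes, standing in for one instance's local KV cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def __contains__(self, h):
        if h in self.blocks:
            self.blocks.move_to_end(h)  # mark as recently used
            return True
        return False

    def add(self, h):
        self.blocks[h] = True
        self.blocks.move_to_end(h)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used


def serve(tokens, local, shared=None):
    """Return how many prompt tokens had reusable KV blocks (local or shared)."""
    reused = 0
    hit_prefix = True
    for h in block_hashes(tokens):
        if hit_prefix and (h in local or (shared is not None and h in shared)):
            reused += BLOCK_SIZE
        else:
            hit_prefix = False  # the first miss ends the reusable prefix
        local.add(h)
        if shared is not None:
            shared.add(h)  # hypothetical stand-in for a distributed store
    return reused


if __name__ == "__main__":
    turn1 = list(range(64))                   # first agent turn
    turn2 = turn1 + list(range(1000, 1032))   # next turn extends the same prefix

    # Local caches only: the second instance has never seen the prefix.
    a, b = LocalCache(capacity=2), LocalCache(capacity=2)
    serve(turn1, a)
    print("local only, other instance:", serve(turn2, b))

    # With a shared store, the second instance can reuse the first one's blocks.
    shared = set()
    a, b = LocalCache(capacity=2), LocalCache(capacity=2)
    serve(turn1, a, shared)
    print("with shared store:", serve(turn2, b, shared))
```

The small local capacity is deliberate: it mimics a local cache that evicts a long trace's early blocks, so even routing the next turn back to the same instance would miss unless the blocks are also held in a shared store.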
Cached at: 06/03/26, 09:55 PM
KV Cache re-use is the most important thing for agentic rollouts. We’ve integrated Mooncake Store into prime-rl with vLLM. You can now use it as a drop-in replacement for native CPU/Disk offloading, giving you cross-node prefix cache reuse to make your agents go brrr🚀
vLLM (@vllm_project): 🚀 New on the @vllm_project blog: Serving Agentic Workloads at Scale with vLLM x Mooncake.
Agentic traces grow to 80K+ tokens with 94%+ reusable prefixes, but local KV caches evict them and cross-instance routing misses them.
By integrating Mooncake Store as a distributed KV
Similar Articles
KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.
ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse
ObjectCache proposes using S3-compatible object storage for LLM KV cache reuse to reduce cost and increase capacity, with a co-designed storage protocol and transfer schedule that minimizes latency overhead. Experiments show it adds only 5.6% latency over local DRAM for 64K contexts.
@pallavishekhar_: KV Cache in LLMs Read here: https://outcomeschool.com/blog/kv-cache-in-llms…
This article explains the concept of KV Cache in Large Language Models, detailing how it optimizes text generation by storing and reusing key-value pairs to avoid redundant computations during inference.
Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist
A developer has implemented a proof-of-concept PR for llama.cpp that adds dynamic KV cache quantization via an HTTP endpoint, allowing users to requantize their KV cache on-demand without fully reloading the model. The post also outlines a wishlist including load-on-demand mmproj/MTP swapping and an automatic --fit flag for context optimization.
llama.cpp has a clever trick for speeding up KV cache decode
A setting in llama.cpp's webUI re-sends generated tokens to the KV cache to significantly reduce prompt processing latency, improving responsiveness for long generations or tool calls without apparent trade-offs.