Tag
This paper introduces EpiKV, a KV cache eviction method that scores token importance via changes in internal representations (epiphany score) instead of attention weights, avoiding the need to materialize the attention matrix. It achieves competitive performance on reasoning benchmarks while enabling up to 16× longer context lengths.
Explores the growing memory bottleneck of the KV cache in transformer inference, explaining why alternative architectures with fixed-size memory, such as Mamba and RWKV, are gaining renewed attention.
Dustin introduces a sparse verification framework for speculative decoding that leverages draft model signals and sparse attention head scoring to overcome the KV cache verification bottleneck, achieving up to 27.85x speedup in self-attention and 9.17x end-to-end decoding speedup on long-context tasks with negligible accuracy loss.
DualPath is a system that breaks the storage bandwidth bottleneck in agentic LLM inference by introducing a dual-path KV-cache loading mechanism, improving throughput by up to 1.87x offline and 1.96x online.
The article discusses how rising DDR5 memory prices signal a broader memory bottleneck in AI, particularly the KV cache in softmax attention for LLMs, and highlights post-transformer architectures like linear attention and state space models that aim to reduce memory usage.
Proposes Block-GTQ, a RoPE-aware bit allocation method for key-value cache quantization that improves long-context performance and memory efficiency by allocating more bits to high-energy RoPE blocks.
Introduces Nexus Sampling, a training-free KV-cache eviction method that uses weighted reservoir sampling instead of deterministic top-k selection. It improves long-context LLM inference under fixed memory budgets and matches dense-attention performance at 80% eviction.
LMCache is an open-source library that makes KV cache persistent and shareable across requests, eliminating recomputation in RAG and multi-turn chat workloads, achieving up to 15x throughput gain and 3-10x reduction in time-to-first-token.
The author maps the Kullback-Leibler divergence of KV cache quantization for the Qwen3.6-35B-A3B and Gemma4-E2B QAT models.
This article explains vLLM's weight syncing API for reinforcement learning, covering how it handles weight updates and KV cache recomputation during RL training, with a focus on reducing complexity for training frameworks.
PaddlePaddle releases Unlimited-OCR, a new OCR model that uses Reference Sliding Window Attention (R-SWA) to keep the KV cache at a constant size during decoding, achieving 93% on OmniDocBench and a 6% improvement over previous methods.
Unlimited OCR introduces Reference Sliding Window Attention to eliminate growing memory consumption in long-sequence OCR tasks, enabling efficient transcription of multiple pages in a single forward pass.
An open, in-progress handbook explaining LLM inference internals including GPU memory hierarchy, KV cache, batching, and popular inference engines like vLLM and TensorRT-LLM.
LMCache is a KV cache management layer that accelerates large model inference and reduces VRAM consumption by storing and reusing KV cache contents. It has received 9.2K stars, has joined the PyTorch Foundation, and is integrated into NVIDIA Dynamo.
NVFP4 KV cache quantization on sm120 significantly improves memory efficiency for large language models, enabling 32GB VRAM systems to achieve ~60 tok/sec inference at 196k context size with Qwen3.6-27B.
Proposes Distance-Adaptive Representation (DAR) which reduces key-value dimensionality for distant tokens while preserving full dimensionality for nearby tokens, improving KV cache efficiency without performance loss.
An empirical study investigating how long, semantically dense benign text can shift a model's latent space trajectory, diluting initial system prompts and bypassing post-training alignment constraints, as observed in both closed and open-source models.
A detailed blog post explaining how vLLM works, including PagedAttention, KV cache management, and continuous batching for efficient LLM serving.
This paper proposes that the KV cache in transformers acts as a notebook of memoized conclusions, enabling surgical editing and composition without full recomputation. The method achieves significant latency reductions while preserving decision equivalence across model scales.
A custom FPGA implementation of a Transformer with KV cache achieves 56,000 tokens per second at 80 MHz, running microGPT on a tiny LCD.
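Several of the entries above turn on how quickly the KV cache grows with context length. The sketch below works through that arithmetic for a standard transformer that stores one key and one value vector per token, per layer, per KV head. The function name `kv_cache_bytes` and the model dimensions are hypothetical, chosen for illustration only, and are not taken from any of the listed articles.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to fp16/bf16 storage; quantization
    metadata such as scales is ignored here.
    """
    elems_per_token = 2 * num_layers * num_kv_heads * head_dim
    return elems_per_token * seq_len * bytes_per_elem * batch_size


# Hypothetical model shape (an assumption, not a real model config).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (4096, 32768, 196608):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"context {ctx:>7} tokens -> {gib:.1f} GiB of KV cache")
```

Under these assumed dimensions each token costs 128 KiB, so the cache scales linearly from 0.5 GiB at 4k tokens to 24 GiB at 196k tokens. That linear growth is the cost that eviction, quantization, and fixed-size-memory architectures each try to reduce.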