Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity
Summary
This paper explores how an exponentially decaying memory module from RAT+ can improve query-aware sparse inference methods for long-context language models, demonstrating consistent accuracy gains across various sparse budgets on needle-in-a-haystack tasks.
Cached at: 06/08/26, 03:16 PM
Source: https://huggingface.co/papers/2605.28640
Abstract
RAT+ memory module enhances query-aware sparse inference methods by improving accuracy in long-context language models across various sparse budgets.
Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
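To make the two ingredients in the abstract concrete, here is a minimal pure-Python sketch; it is not the paper's implementation. It assumes a simple recurrence of the form m_t = decay * m_{t-1} + (1 - decay) * x_t for the exponentially decaying memory, and a page-level selection rule in the spirit of Quest (per-page min/max key bounds, keep the top `budget` pages). Applying the memory to keys and values before selection is one hypothetical way to combine the two; this page does not say how RAT+ does it. All function names, the `decay` value, and the toy data are illustrative assumptions.

```python
import math


def decaying_memory(vectors, decay):
    """Running summary m_t = decay * m_{t-1} + (1 - decay) * x_t, one output per position."""
    memory = [0.0] * len(vectors[0])
    summaries = []
    for x in vectors:
        memory = [decay * m + (1.0 - decay) * xi for m, xi in zip(memory, x)]
        summaries.append(memory)
    return summaries


def page_upper_bounds(query, keys, page_size):
    """Per page, an upper bound on max(q . k) from element-wise min/max of the page's keys."""
    bounds = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        lo = [min(col) for col in zip(*page)]
        hi = [max(col) for col in zip(*page)]
        bounds.append(sum(max(q * l, q * h) for q, l, h in zip(query, lo, hi)))
    return bounds


def select_pages(query, keys, page_size, budget):
    """Return the indices of the `budget` pages with the largest bounds, in position order."""
    bounds = page_upper_bounds(query, keys, page_size)
    ranked = sorted(range(len(bounds)), key=lambda p: bounds[p], reverse=True)
    return sorted(ranked[:budget])


def sparse_attention(query, keys, values, page_size, budget):
    """Softmax attention restricted to the tokens of the selected pages."""
    pages = select_pages(query, keys, page_size, budget)
    idx = [i for p in pages
           for i in range(p * page_size, min((p + 1) * page_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in idx]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
           for d in range(len(values[0]))]
    return out, idx


# Toy data: six tokens, two-dimensional keys, one-dimensional values.
keys = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
values = [[float(i)] for i in range(len(keys))]
query = [1.0, 0.0]

# Plain keys/values under a budget of one page.
out, kept = sparse_attention(query, keys, values, page_size=2, budget=1)

# Hypothetical combination: smooth keys/values with the decaying memory first.
mem_keys = decaying_memory(keys, decay=0.5)
mem_values = decaying_memory(values, decay=0.5)
mem_out, mem_kept = sparse_attention(query, mem_keys, mem_values, page_size=2, budget=1)
```

In this sketch, each retained position in the memory-augmented variant carries a decayed summary of the tokens before it, so a token dropped by the selection step can still contribute through its neighbours. That is only an intuition suggested by the toy setup, not a restatement of the two hypotheses the paper proposes.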
Similar Articles
SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference
SparDA proposes a decoupled sparse attention architecture that adds a lightweight 'Forecast' projection to predict future KV cache needs, enabling lookahead prefetching from CPU to GPU and reducing selection overhead. On 8B sparse-pretrained models, it achieves up to 1.25× prefill and 1.7× decode speedup, with up to 5.3× higher decode throughput over non-offload baselines.
Memory
Explains why LLM inference is increasingly memory-bandwidth bound due to the KV cache scaling with context length and concurrent users, and how systems like vLLM and PagedAttention improve memory utilization.
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
This paper introduces a learned global retention-based KV cache eviction method that improves long-context reasoning by selectively retaining useful tokens and reducing attention dilution, while significantly lowering memory usage.
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
This paper introduces ReST-KV, a novel method for robust KV cache eviction in large language models that uses layer-wise output reconstruction and spatial-temporal smoothing to improve efficiency. The method significantly reduces decoding latency and outperforms state-of-the-art baselines on long-context benchmarks like LongBench and RULER.
EntmaxKV: Support-Aware Decoding for Entmax Attention
EntmaxKV introduces a support-aware sparse decoding framework for entmax attention that reduces KV-cache memory traffic by exploiting sparsity before loading pages, achieving significant speedups on long-context benchmarks while maintaining output quality.