Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Hugging Face Daily Papers

Summary

This paper explores how an exponentially decaying memory module from RAT+ can improve query-aware sparse inference methods for long-context language models, demonstrating consistent accuracy gains across various sparse budgets on needle-in-a-haystack tasks.

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
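
For readers unfamiliar with the term, the block below writes out what an "exponentially decaying memory" means in its most generic form. It is an illustrative assumption, not the RAT+ parameterization, which this page does not specify: the symbols m_t (memory state), x_t (input at step t), and the scalar decay rate λ are all hypothetical names introduced here.

```latex
% Generic exponentially decaying memory (illustrative; not taken from the paper).
% m_t: memory state, x_t: input at step t, \lambda: assumed scalar decay rate.
m_t = \lambda\, m_{t-1} + (1-\lambda)\, x_t, \qquad 0 < \lambda < 1, \quad m_0 = 0
\quad\Longrightarrow\quad
m_t = (1-\lambda) \sum_{s=1}^{t} \lambda^{\,t-s}\, x_s
```

In the unrolled form, the contribution of an input that lies t - s steps in the past is scaled by λ^(t-s), so older inputs fade geometrically while recent ones dominate the state.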

Cached at: 06/08/26, 03:16 PM

Source: https://huggingface.co/papers/2605.28640

Abstract

RAT+ memory module enhances query-aware sparse inference methods by improving accuracy in long-context language models across various sparse budgets.

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
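
To make "query-aware sparse inference" under a "sparse budget" concrete, here is a minimal NumPy sketch of page-level KV selection in the spirit of Quest: each page of keys is summarized by its per-channel minimum and maximum, the current query scores every page with an upper bound on the attention logit, and only the top-scoring pages are attended to. This is a single-head illustration under stated assumptions, not the implementation evaluated in the paper; the function names `select_pages` and `sparse_attention` and the parameters `page_size` and `budget_pages` are hypothetical.

```python
import numpy as np


def select_pages(query, keys, page_size, budget_pages):
    """Score each page of keys against the query and keep the top pages.

    The score is a per-channel upper bound on q . k for any key in the page,
    computed from the page's elementwise min and max (a Quest-style criterion).
    Names and parameters here are illustrative assumptions.
    """
    n, _ = keys.shape
    n_pages = (n + page_size - 1) // page_size
    scores = np.empty(n_pages)
    for p in range(n_pages):
        page = keys[p * page_size:(p + 1) * page_size]
        k_min, k_max = page.min(axis=0), page.max(axis=0)
        # For each channel, the larger of q_i * min_i and q_i * max_i bounds q_i * k_i.
        scores[p] = np.maximum(query * k_min, query * k_max).sum()
    top = np.argsort(scores)[::-1][:budget_pages]
    return np.sort(top), scores


def sparse_attention(query, keys, values, page_size, budget_pages):
    """Softmax attention restricted to the tokens of the selected pages."""
    n, d = keys.shape
    pages, _ = select_pages(query, keys, page_size, budget_pages)
    idx = np.concatenate(
        [np.arange(p * page_size, min((p + 1) * page_size, n)) for p in pages]
    )
    logits = keys[idx] @ query / np.sqrt(d)
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return weights @ values[idx], idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 16, 64
    q = rng.normal(size=d)
    K = 0.1 * rng.normal(size=(n, d))
    V = rng.normal(size=(n, d))
    K[37] = 10 * q / np.linalg.norm(q)  # plant a "needle" key aligned with the query
    out, idx = sparse_attention(q, K, V, page_size=8, budget_pages=2)
    print("attended token indices:", idx)
```

Because the page score is an upper bound rather than an exact logit, a page holding a strongly matching key cannot be scored below that key's true logit, which is why this style of selection tends to do well on needle retrieval. The budget (`budget_pages` here) sets how much of the KV cache is actually read per query.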

Similar Articles

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

arXiv cs.CL

SparDA proposes a decoupled sparse attention architecture that adds a lightweight 'Forecast' projection to predict future KV cache needs, enabling lookahead prefetching from CPU to GPU and reducing selection overhead. On 8B sparse-pretrained models, it achieves up to 1.25× prefill and 1.7× decode speedup, with up to 5.3× higher decode throughput over non-offload baselines.

Memory

Reddit r/artificial

Explains why LLM inference is increasingly memory-bandwidth bound due to the KV cache scaling with context length and concurrent users, and how systems like vLLM and PagedAttention improve memory utilization.

EntmaxKV: Support-Aware Decoding for Entmax Attention

arXiv cs.LG

EntmaxKV introduces a support-aware sparse decoding framework for entmax attention that reduces KV-cache memory traffic by exploiting sparsity before loading pages, achieving significant speedups on long-context benchmarks while maintaining output quality.