Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Hugging Face Daily Papers 05/27/26, 12:00 AM Papers

attention kv-cache sparse-inference long-context language-models memory-module recurrent-attention

Summary

This paper explores how an exponentially decaying memory module from RAT+ can improve query-aware sparse inference methods for long-context language models, demonstrating consistent accuracy gains across various sparse budgets on needle-in-a-haystack tasks.

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.

Original Article

Cached at: 06/08/26, 03:16 PM

Source: https://huggingface.co/papers/2605.28640

Abstract

RAT+ memory module enhances query-aware sparse inference methods by improving accuracy in long-context language models across various sparse budgets.

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
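To make the two ingredients in the abstract concrete, here is a minimal sketch in plain Python. It is not the RAT+ implementation and not the code of Quest, MoBA, or SnapKV: the function names `decayed_memory` and `select_blocks`, the moving-average form of the recurrence, and the min/max score bound are illustrative assumptions. The first function shows what an exponentially decaying memory over a sequence can look like; the second shows one way a query can pick a small budget of key blocks to attend to.

```python
from typing import List

Vector = List[float]


def decayed_memory(values: List[Vector], decay: float) -> List[Vector]:
    """Running state m_t = decay * m_{t-1} + (1 - decay) * v_t (assumed form)."""
    state = [0.0] * len(values[0])
    out = []
    for v in values:
        # Older inputs are scaled by `decay` once per step, so they fade exponentially.
        state = [decay * s + (1.0 - decay) * x for s, x in zip(state, v)]
        out.append(state)
    return out


def select_blocks(query: Vector, keys: List[Vector],
                  block_size: int, budget: int) -> List[int]:
    """Return the indices of the `budget` key blocks with the highest score bound."""
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        bound = 0.0
        for d, q in enumerate(query):
            column = [k[d] for k in block]
            # Upper bound on q_d * k_d within the block, from the per-channel min and max.
            bound += max(q * min(column), q * max(column))
        scores.append((bound, start // block_size))
    top = sorted(scores, reverse=True)[:budget]
    return sorted(idx for _, idx in top)


if __name__ == "__main__":
    print(decayed_memory([[1.0], [1.0]], decay=0.5))
    keys = [[0.1, 0.0], [0.2, 0.0], [0.9, 0.0], [0.3, 0.0], [-1.0, 0.0], [-2.0, 0.0]]
    print(select_blocks([1.0, 0.0], keys, block_size=2, budget=1))
```

The sketch keeps the two pieces separate on purpose: how the memory module interacts with block selection is the subject of the paper's hypotheses and is not specified in this abstract.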

Similar Articles

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

@omarsar0: NEW paper worth reading. (bookmark it) The basic idea is to pair a compressive recurrent state with a small exact memor…

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Information-Aware KV Cache Compression for Long Reasoning
