Dynamic Linear Attention

Hugging Face Daily Papers Papers

Summary

DLA introduces adaptive state merging and capacity-bounded memory modeling for multi-state linear attention, improving long-context LLM performance.

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts, recent approaches organize memory in a multi-state manner. However, existing multi-state linear attention methods rely on fixed state merging policies that cannot adapt to dynamically varying token importance, irreversibly obscuring critical tokens and causing severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states to control memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate on 16 datasets across three categories. Experimental results demonstrate the superiority of DLA over state-of-the-art.
Original Article
View Cached Full Text

Cached at: 06/10/26, 05:45 AM

Paper page - Dynamic Linear Attention

Source: https://huggingface.co/papers/2606.10650

Abstract

DLA addresses limitations in long-context LLMs by introducing adaptive state merging and capacity-bounded memory modeling for improved multi-state linear attention.

The scalability ofLarge Language Models(LLMs) to long contexts is fundamentally constrained by the quadratic complexity ofstandard attention, motivating the adoption oflinear attention mechanismswith sub-quadratic cost. To improve representation capacity under long contexts, recent approaches organize memory in a multi-state manner. However, existingmulti-state linear attentionmethods rely on fixed state merging policies that cannot adapt to dynamically varyingtoken importance, irreversibly obscuring critical tokens and causing severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework formulti-state linear attention. DLA introduces (i)Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii)Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states to control memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate on 16 datasets across three categories. Experimental results demonstrate the superiority of DLA over state-of-the-art.

View arXiv pageView PDFAdd to collection

Get this paper in your agent:

hf papers read 2606\.10650

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.10650 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.10650 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.10650 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

Dynamic Linear Attention

arXiv cs.CL

This paper proposes DLA, a dynamic memory modeling framework for multi-state linear attention that adaptively merges states based on token information variation and maintains a fixed-size state cache, enabling better long-context representation without the quadratic complexity of standard attention.

Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

arXiv cs.LG

This paper introduces Variational Linear Attention (VLA), a method that stabilizes memory states in linear attention mechanisms for long-context transformers. VLA reframes memory updates as an online regularized least-squares problem, proving bounded state norms and demonstrating significant speedups and improved retrieval accuracy over standard linear attention and DeltaNet.

Memory

Reddit r/artificial

Explains why LLM inference is increasingly memory-bandwidth bound due to the KV cache scaling with context length and concurrent users, and how systems like vLLM and PagedAttention improve memory utilization.

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

arXiv cs.CL

SparDA proposes a decoupled sparse attention architecture that adds a lightweight 'Forecast' projection to predict future KV cache needs, enabling lookahead prefetching from CPU to GPU and reducing selection overhead. On 8B sparse-pretrained models, it achieves up to 1.25× prefill and 1.7× decode speedup, with up to 5.3× higher decode throughput over non-offload baselines.