Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Summary
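Gated DeltaNet-2 introduces separate channel-wise erase and write gates for linear attention. At 1.3B parameters it reports the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants on language modeling, commonsense reasoning, and retrieval, with the largest gains on long-context retrieval.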
Cached at: 05/22/26, 02:29 AM
Source: https://huggingface.co/papers/2605.22791
Abstract
Gated DeltaNet-2 improves upon existing linear attention models by separating erase and write operations through distinct channel-wise gates, achieving superior performance in long-context language modeling and retrieval tasks.
Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
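To make the erase/write split concrete, here is a minimal NumPy sketch of one recurrent step. The abstract does not give the update equation, so the form used here is an assumption: the state S (d_k x d_v) is decayed channel-wise, the current read k^T S is erased through (b ⊙ k), and the new value is written as k (w ⊙ v)^T. The function names `gated_delta2_step` and `kda_step`, the argument names, and the shapes are all hypothetical and are not taken from the paper or the NVlabs repository. The sketch only illustrates the reduction stated in the abstract: when both gates collapse to the same scalar, the step matches a scalar-gated delta rule with channel-wise decay.

```python
import numpy as np

def gated_delta2_step(S, k, v, alpha, b, w):
    """One recurrent step of an ASSUMED decoupled erase/write delta rule.

    S:     (d_k, d_v) fixed-size recurrent state
    k, v:  (d_k,), (d_v,) current key and value
    alpha: (d_k,) channel-wise decay
    b:     (d_k,) channel-wise erase gate (key side)
    w:     (d_v,) channel-wise write gate (value side)
    """
    S_decayed = alpha[:, None] * S        # Diag(alpha) @ S
    read = k @ S_decayed                  # current read k^T S, shape (d_v,)
    erase = np.outer(b * k, read)         # (b * k) (k^T S)
    write = np.outer(k, w * v)            # k (w * v)^T
    return S_decayed - erase + write

def kda_step(S, k, v, alpha, beta):
    """Scalar-gated reference: one beta controls both erase and write."""
    S_decayed = alpha[:, None] * S
    read = k @ S_decayed
    return S_decayed - beta * np.outer(k, read) + beta * np.outer(k, v)

# Toy inputs (sizes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)                    # unit-norm key
v = rng.standard_normal(d_v)
alpha = rng.uniform(0.8, 1.0, d_k)
beta = 0.5

# Collapse both gates to the same scalar and compare with the reference.
S_new = gated_delta2_step(S, k, v, alpha, np.full(d_k, beta), np.full(d_v, beta))
print(np.allclose(S_new, kda_step(S, k, v, alpha, beta)))
```

Under these assumptions, setting b and w independently lets a step erase strongly on some key channels while committing only part of the value, which a single scalar gate cannot express. The state stays (d_k, d_v) regardless of sequence length, which is the constant-memory decoding property the abstract describes.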
Similar Articles
Nemotron-3-Super-120B-A12B (hybrid Mamba+MoE) holds perfect needle retrieval to 504K tokens on 4×3090
NVIDIA's Nemotron-3-Super-120B-A12B, a hybrid Mamba and mixture-of-experts model, achieves perfect needle-in-haystack retrieval at 504K tokens using only four RTX 3090 GPUs.
OpenBioRQ: AI Agents Cite Wrong Papers 15.9% of the Time
A new benchmark paper, OpenBioRQ, reveals that AI agents rarely fabricate citations but often cite papers that do not support the claim, with 15.9% of citations being mismatched in biomedical contexts.
Information-Aware KV Cache Compression for Long Reasoning
This paper proposes InfoKV, an entropy-aware KV cache compression framework that combines token-level predictive uncertainty with attention scores to improve long-context reasoning efficiency. Experiments show it outperforms existing attention-based methods on Llama-3.1, Llama-3.2, and DeepSeek-R1.
Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention
Proposes Erase-then-Delta Attention (EDA), a memory update rule for linear attention that decouples erase and write addresses to selectively suppress stale information before writing new content. Experiments on 2.5B dense and 25B MoE models demonstrate consistent gains in standard and long-context evaluations.
SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning
Introduces Hankel Reduced order Model (HRM) adapter, an SSM-based residual module initialized via Balanced Truncation for parameter-efficient fine-tuning, outperforming LoRA on long-context tasks.