Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention
Summary
Proposes Erase-then-Delta Attention (EDA), a memory update rule for linear attention that decouples erase and write addresses to selectively suppress stale information before writing new content. Experiments on 2.5B dense and 25B MoE models demonstrate consistent gains in standard and long-context evaluations.
# Decoupling Erase and Write Addresses in Delta-Rule Linear Attention
Source: [https://arxiv.org/html/2606.26560](https://arxiv.org/html/2606.26560)
Xiao Li1,2, Chengruidong Zhang1, Hao Luo1, Xi Lin1,3, Zekun Wang1, Zihan Qiu1, Yunfei Mao1, Langshi Chen1, Man Yuan1, Minmin Sun1, Huiqiang Jiang1, Siqi Zhang1, Rui Men1, Wei Hu2, Gong Cheng2, Bo Zheng1†, Dayiheng Liu1†, Jingren Zhou1
1 Qwen Team, 2 Nanjing University, 3 Zhejiang University. † Corresponding authors
###### Abstract
Delta-rule linear attention improves recurrent memory updates by correcting what is already stored at the current write address before writing new content. However, the active correction is still anchored to that same write address. As a result, stale information stored at a different address cannot be actively removed before new content is written elsewhere. We propose Erase-then-Delta Attention (EDA), a memory update rule that decouples where to erase from where to write. The key insight is that recurrent memory models should not only correct the current write, but also selectively suppress outdated memory at an independently chosen address. Concretely, our method first applies a targeted erase step along a learned erase direction, and then performs the standard delta-style corrective write along the current write direction. This preserves the corrective behavior of delta-rule updates while expanding their memory-management capacity. Language-model pretraining experiments across dense 2.5B and MoE 25B-A2.8B model families show that EDA performs best in both settings. The gain persists after 80B-token long-context midtraining of the MoE models, where EDA also performs best in long-context evaluations from 4k to 128k contexts. A compact update analysis and memory-state probes suggest why: EDA keeps the delta-rule corrective write intact while allocating an additional cleanup path most strongly when passive decay is weak. These results suggest that recurrent memory models should decide not only what to write, but also what stale information to erase and where.
## 1 Introduction
Autoregressive Transformers (Vaswani et al., [2017](https://arxiv.org/html/2606.26560#bib.bib6)) have become the foundation of modern language modeling, in part because softmax-based self-attention enables efficient parallel computation. This mechanism achieves strong performance on in-context learning and long-context retrieval by maintaining an explicit key–value cache. However, it also introduces fundamental bottlenecks at inference time: quadratic time complexity and linearly growing memory overhead that limit scalability for long-sequence tasks and agentic reasoning trajectories. To address these constraints, a growing body of work has explored efficient alternatives that maintain constant memory and $\mathcal{O}(1)$ inference time while preserving the expressive power of attention.
Recurrent models based on linear attention (Katharopoulos et al., [2020](https://arxiv.org/html/2606.26560#bib.bib7)) and state space models (Gu et al., [2022](https://arxiv.org/html/2606.26560#bib.bib8); Gu and Dao, [2024](https://arxiv.org/html/2606.26560#bib.bib9)) offer a principled solution: they compress contextual information into a fixed-size state, enabling constant memory and linear-time training. Early variants such as Linformer (Wang et al., [2020](https://arxiv.org/html/2606.26560#bib.bib10)) and RetNet (Sun et al., [2023](https://arxiv.org/html/2606.26560#bib.bib11)) lacked data-dependent memory control and underperformed softmax attention. Subsequent models introduced dynamic gating mechanisms (Yang et al., [2025](https://arxiv.org/html/2606.26560#bib.bib1); Dao and Gu, [2024](https://arxiv.org/html/2606.26560#bib.bib13); Beck et al., [2024](https://arxiv.org/html/2606.26560#bib.bib14)), allowing selective forgetting and significantly narrowing the performance gap. However, additive gated updates still write new content into a finite state without explicitly correcting the association currently stored at the write address.
A more recent line of work replaces additive updates with the *delta rule* (Schlag et al., [2021](https://arxiv.org/html/2606.26560#bib.bib15)), which treats the recurrent state as a learnable associative memory that corrects itself toward the current key–value mapping. Gated DeltaNet (GDN) (Yang et al., [2025](https://arxiv.org/html/2606.26560#bib.bib1)) combines this corrective write with a head-wise forget gate, and recent channel-wise variants further refine this gate into a diagonal decay that gives each key feature its own retention rate (Team et al., [2025](https://arxiv.org/html/2606.26560#bib.bib2)). GDN-2 further separates the scalar delta gate into key-side erase and value-side write gates, but the active edit remains organized around the current write key (Hatamizadeh et al., [2026](https://arxiv.org/html/2606.26560#bib.bib3)). We build on this channel-wise gated delta setting, also known as diagonal-plus-low-rank (DPLR), which combines GDN's hardware-efficient delta-rule structure with finer-grained channel-wise forgetting. Despite this progress, a structural limitation remains unaddressed: the active delta correction still uses the current write direction $\mathbf{k}_t$ as its only address. This coupling means the model can only suppress memory at the address it is currently writing to; stale information stored elsewhere must either persist or decay through channel-wise but address-agnostic forgetting.
This limitation has tangible consequences. In language modeling and state-tracking tasks, useful memory updates require not only writing new content but also removing obsolete information that would otherwise interfere with future reads and writes. When earlier information must be invalidated (for example, by a variable reassignment, a fact correction, or a context shift), the model has no direct mechanism to remove the old content before committing the new one. The core missing capability is therefore not stronger forgetting, but *targeted deletion of outdated memory at an address chosen independently of the current write*.
We address this problem with Erase-then-Delta Attention (EDA), a memory update rule that decouples erasure from writing. Instead of tying memory suppression to the current write address, EDA first removes stale content at an independently selected address and then performs the usual delta-style corrective write at the current write address. Intuitively, the erase step actively clears obsolete memory, while the delta step preserves the corrective writing behavior that makes delta-rule models effective. This yields a strictly richer update rule: the model can erase at one address and write at another within the same recurrent step.
We show that this simple modification has three important consequences. First, it provides a cleaner memory-management view of channel-wise gated delta recurrence by separating diagonal decay, independently addressed erasure, and write-coupled correction. Second, empirical analysis reveals that the model learns a near-orthogonal separation between erase and write addressing, indicating that the two operations serve genuinely different roles. Third, language-model pretraining experiments show that EDA improves over a DPLR-style gated delta baseline and compares favorably with several strong update-rule variants.
In summary, we introduce EDA, a gated delta-rule linear-attention update that decouples erase and write addresses while preserving the standard delta corrective write. We analyze the resulting erase-then-delta update and evaluate it through language-model pretraining, long-context evaluation, and memory-state probes, showing that the extra address acts as a conditional cleanup path rather than merely stronger forgetting.
## 2 Preliminary
We briefly introduce the recurrent memory notation and the channel-wise gated delta update most relevant to our method. The key point is that a diagonal forget gate already provides fine-grained decay, but the active correction and writing remain tied to the same address.
### 2.1 Notation and Linear Associative Memory
We consider a recurrent memory state $\mathbf{S}_t \in \mathbb{R}^{d_k \times d_v}$ updated at each step $t$. The key $\mathbf{k}_t \in \mathbb{R}^{d_k}$ serves as a write address, the value $\mathbf{v}_t \in \mathbb{R}^{d_v}$ is the content to store, and the query $\mathbf{q}_t \in \mathbb{R}^{d_k}$ reads from memory through $\mathbf{S}_t^\top \mathbf{q}_t \in \mathbb{R}^{d_v}$.
Standard linear attention updates memory additively:
$$\mathbf{S}_t = \mathbf{S}_{t-1} + \mathbf{k}_t \mathbf{v}_t^\top, \qquad \mathbf{o}_t = \mathbf{S}_t^\top \mathbf{q}_t. \tag{1}$$
This rule is efficient but does not explicitly decide what stale information to suppress.
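As a minimal sketch of Eq. (1), assuming NumPy and small random inputs chosen purely for illustration, the recurrence can be run step by step and compared against the equivalent causal form $\mathbf{o}_t = \sum_{i \le t} (\mathbf{q}_t^\top \mathbf{k}_i)\,\mathbf{v}_i$. The function name `linear_attention_recurrent` is hypothetical and not taken from the paper.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Run Eq. (1) step by step; Q and K are (T, d_k), V is (T, d_v)."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))          # recurrent memory state S_0 = 0
    outputs = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(K[t], V[t])  # additive write at address k_t
        outputs[t] = S.T @ Q[t]       # read with query q_t
    return outputs, S

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 4))
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 3))
out, S_final = linear_attention_recurrent(Q, K, V)

# Equivalent causal (masked) form: o_t = sum_{i <= t} (q_t . k_i) v_i
causal = np.tril(Q @ K.T) @ V
```

Nothing in this loop ever removes content from `S`: every write accumulates, which is the behavior the following subsections set out to correct.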
### 2.2 Coupled Erasure and Corrective Writing
DeltaNet (Schlag et al., [2021](https://arxiv.org/html/2606.26560#bib.bib15); Yang et al., [2024](https://arxiv.org/html/2606.26560#bib.bib24)) replaces additive writing with a corrective update derived from the reconstruction loss
$$\mathcal{L}_t^{\mathrm{delta}}(\mathbf{S}) = \frac{1}{2}\lVert \mathbf{S}^\top \mathbf{k}_t - \mathbf{v}_t \rVert^2. \tag{2}$$
Taking a gradient step with learning rate $\beta_t$ gives
$$\mathbf{S}_t = (\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)\,\mathbf{S}_{t-1} + \beta_t \mathbf{k}_t \mathbf{v}_t^\top. \tag{3}$$
Rather than simply accumulating $\mathbf{k}_t \mathbf{v}_t^\top$, DeltaNet first corrects what memory currently returns at address $\mathbf{k}_t$ and then writes the new content at that same address.
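The gradient-step reading of Eq. (3) can be checked numerically. The sketch below assumes NumPy, a unit-norm key, and illustrative random values; `delta_step` is a hypothetical helper name. It writes the update in its "prediction error" form $\mathbf{S} + \beta\,\mathbf{k}(\mathbf{v} - \mathbf{S}^\top\mathbf{k})^\top$ and compares it with the explicit factor form of Eq. (3).

```python
import numpy as np

def delta_step(S, k, v, beta):
    """One DeltaNet update, Eq. (3), written as a correction of the readout at k."""
    pred = S.T @ k                           # what memory currently returns at k
    return S + beta * np.outer(k, v - pred)  # move that readout toward v

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)                       # unit-norm key (assumption of this sketch)
v = rng.standard_normal(d_v)

beta = 0.5
explicit = (np.eye(d_k) - beta * np.outer(k, k)) @ S + beta * np.outer(k, v)
stepped = delta_step(S, k, v, beta)

# With beta = 1 and a unit-norm key, reading at k returns v exactly.
S_full = delta_step(S, k, v, 1.0)
readout_at_k = S_full.T @ k

# A query orthogonal to k is left untouched by the update.
q_perp = rng.standard_normal(d_k)
q_perp -= (q_perp @ k) * k
```

The last two checks make the coupling concrete: the update changes memory only along $\mathbf{k}_t$, so an association stored at an orthogonal address is neither corrected nor removed.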
Gated DeltaNet (GDN) (Yang et al., [2025](https://arxiv.org/html/2606.26560#bib.bib1)) augments this rule with a head-wise scalar forget gate $\alpha_t \in (0,1)$:
$$\mathbf{S}_t = \alpha_t (\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)\,\mathbf{S}_{t-1} + \beta_t \mathbf{k}_t \mathbf{v}_t^\top. \tag{4}$$
Here $\alpha_t$ provides uniform decay within a head, while $(\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)$ provides address-specific correction. However, the erase-and-write behavior is still coupled: the same key $\mathbf{k}_t$ determines both where memory is strongly modified and where new content is written. As a result, GDN can strongly suppress only the address it is currently writing to.
Following Kimi Delta Attention (KDA) (Team et al., [2025](https://arxiv.org/html/2606.26560#bib.bib2)), we use the channel-wise version of this GDN design, replacing the head-wise scalar forget gate with a diagonal decay $\mathbf{D}_t = \operatorname{Diag}(\boldsymbol{\alpha}_t)$:
$$\mathbf{S}_t = (\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)\,\mathbf{D}_t \mathbf{S}_{t-1} + \beta_t \mathbf{k}_t \mathbf{v}_t^\top. \tag{5}$$
The diagonal gate gives each key channel its own retention rate and makes the transition compatible with a diagonal-plus-low-rank view. This improves how strongly different channels are preserved or decayed, but it does not change the addressing structure of the delta update itself: the corrective modification is still anchored to the current write key. Therefore, even with channel-wise gating, stale information stored at a different address cannot be explicitly erased before writing new content elsewhere.
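To illustrate the limitation just described, here is a small sketch of the channel-wise update in Eq. (5), assuming NumPy. The addresses are standard basis vectors and the retention values are made up for the example; `gated_delta_step` is a hypothetical name. A stale association sits at one address, the write happens at an orthogonal one, and the stale readout is only passively decayed.

```python
import numpy as np

def gated_delta_step(S, k, v, beta, alpha):
    """Eq. (5): S <- (I - beta k k^T) Diag(alpha) S + beta k v^T."""
    S_hat = alpha[:, None] * S                     # channel-wise decay D_t S_{t-1}
    return S_hat + beta * np.outer(k, v - S_hat.T @ k)

d_k = 4
e_old = np.eye(d_k)[0]                  # address holding a stale association
k_new = np.eye(d_k)[1]                  # current write address, orthogonal to e_old
v_old = np.array([1.0, -2.0, 0.5])
v_new = np.array([0.3, 0.1, -1.0])
alpha = np.array([0.9, 0.5, 0.8, 0.7])  # illustrative per-channel retention factors

S_prev = np.outer(e_old, v_old)
S_next = gated_delta_step(S_prev, k_new, v_new, beta=1.0, alpha=alpha)

stale_readout = S_next.T @ e_old        # survives, scaled only by alpha[0]
fresh_readout = S_next.T @ k_new        # the newly written value
```

In this toy setting the only way to shrink `stale_readout` is to lower `alpha[0]`, which would also decay every other association that uses that key channel.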
GDN-2 addresses a closely related coupling by separating the scalar delta gate into key-side erase and value-side write gates (Hatamizadeh et al., [2026](https://arxiv.org/html/2606.26560#bib.bib3)):
$$\mathbf{S}_t = \left(\mathbf{I} - \mathbf{k}_t \widetilde{\mathbf{e}}_t^\top\right)\mathbf{D}_t \mathbf{S}_{t-1} + \mathbf{k}_t \boldsymbol{z}_t^\top, \qquad \widetilde{\mathbf{e}}_t = \boldsymbol{b}_t \odot \mathbf{k}_t, \quad \boldsymbol{z}_t = \boldsymbol{w}_t \odot \mathbf{v}_t. \tag{6}$$
This decouples the channel-wise erase and write strengths inside the delta residual. However, the erase/read direction $\widetilde{\mathbf{e}}_t$ is still constructed from the current write key $\mathbf{k}_t$, and the correction is still committed along $\mathbf{k}_t$. Thus GDN-2 relaxes the gate-level coupling, while the address-level coupling between erasure and writing remains.
This coupling is the limitation we target. If stale information is stored at an address different from the current write address, the diagonal gate can decay feature channels but cannot selectively remove that stale association before writing elsewhere.
### 2.3 Relation to Recent Delta-Style Variants
Recent linear-recurrent models often improve performance by enriching the transition rule or embedding delta-style memory updates inside stronger architectures. DeltaProduct (Siems et al., [2026](https://arxiv.org/html/2606.26560#bib.bib4)) increases transition expressivity through multiple Householder-like factors per step, while RWKV-7 (Peng et al., [2025](https://arxiv.org/html/2606.26560#bib.bib16)) and Comba (Hu et al., [2026](https://arxiv.org/html/2606.26560#bib.bib5)) adopt richer structured transition parameterizations. Recent hybrid architectures further demonstrate that strong designs built around expressive channel-wise gated delta components can be highly competitive with full attention (Team et al., [2025](https://arxiv.org/html/2606.26560#bib.bib2)).
Our goal is different. We do not primarily seek a globally richer transition; instead, we introduce a missing memory-management capability: erasing stale memory at one address before performing the standard delta-style corrective write at another. In that sense, our method is best viewed as orthogonal to transition-enrichment approaches and potentially compatible with stronger channel-wise gated delta backbones.
## 3 Method
### 3.1 Overview
Our goal is to extend gated delta-rule linear attention with a missing memory-management capability: selectively deleting stale memory at an address different from the current write address. To do this, we revisit the DPLR-style update rule and identify a structural coupling between active correction and writing. We then introduce Erase-then-Delta Attention (EDA), a sequential update rule that adds an independently addressed erase step before the standard delta-style corrective write. This section first formalizes the limitation of the decay-gated delta baseline, then derives the new rule, and finally discusses its algebraic structure and stability properties.
### 3.2 Erase-Write Coupling in Gated Delta Updates
We consider a recurrent memory state $\mathbf{S}_t$ updated by a gated delta rule with diagonal decay:
$$\mathbf{S}_t = (\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)\,\mathbf{D}_t \mathbf{S}_{t-1} + \beta_t \mathbf{k}_t \mathbf{v}_t^\top, \qquad \mathbf{D}_t = \operatorname{Diag}(\boldsymbol{\alpha}_t). \tag{7}$$
Here $\mathbf{D}_t$ is a diagonal decay matrix with retention factors $\boldsymbol{\alpha}_t$, $\beta_t$ controls the delta-style correction strength, $\mathbf{k}_t$ is the current write direction, and $\mathbf{v}_t$ is the value vector written into memory. This update is effective because it is not a naive additive write: after diagonal decay, the factor $(\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)$ corrects the memory response along the current write direction, and the additive term $\beta_t \mathbf{k}_t \mathbf{v}_t^\top$ writes the new content at the same address.
Equation ([7](https://arxiv.org/html/2606.26560#S3.E7)) already contains both fine-grained decay and address-specific correction. The diagonal gate $\mathbf{D}_t$ decides which key channels persist, while the rank-1 term $(\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)$ induces stronger correction along the current write direction. GDN-2 relaxes the scalar-gate version of this coupling by separating key-side erase and value-side write gates (Hatamizadeh et al., [2026](https://arxiv.org/html/2606.26560#bib.bib3)). However, the active correction remains structurally coupled to writing: the edit is still constructed from the current write key, and the correction is still committed along $\mathbf{k}_t$. Consequently, these updates can only strongly suppress memory through the address they are currently writing to.
This coupling is the core limitation we address. If stale information is stored at an address different from the current write direction, the model has no direct mechanism to remove it selectively before performing the current write. Instead, it must rely on the decay gate $\mathbf{D}_t$, which is not tied to a specific stale address, or wait until future writes happen to revisit that address. Our central design question is therefore: can a delta-rule memory model erase at one address and write at another within the same recurrent step?
### 3.3 Erase-then-Delta: Decoupled Erase-Write Addressing
EDA decouples cleanup from writing by inserting an independently addressed erase operator before the standard delta write:
$$\mathbf{S}_t = (\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)(\mathbf{I} - \gamma_t \mathbf{e}_t \mathbf{e}_t^\top)\,\mathbf{D}_t \mathbf{S}_{t-1} + \beta_t \mathbf{k}_t \mathbf{v}_t^\top. \tag{8}$$
The factors in Eq. ([8](https://arxiv.org/html/2606.26560#S3.E8)) are applied from right to left. The diagonal decay $\mathbf{D}_t$ first attenuates retained key coordinates, the erase factor $(\mathbf{I} - \gamma_t \mathbf{e}_t \mathbf{e}_t^\top)$ contracts the decayed memory along a learned cleanup address $\mathbf{e}_t$, and the usual delta factor then performs corrective forgetting and writing at the current write key $\mathbf{k}_t$. This order is part of the update rule: for diagonal decay, $\mathbf{D}_t$ generally does not commute with the rank-1 erase operator unless $\mathbf{D}_t$ degenerates to a scalar decay or $\mathbf{e}_t$ lies in an equal-decay subspace.
To see what the new operator actually erases, let
$$\widehat{\mathbf{S}}_t = \mathbf{D}_t \mathbf{S}_{t-1} \tag{9}$$
denote the memory after diagonal decay and before address-selective cleanup. EDA defines the erase address through the online objective
$$\mathcal{L}^{\mathrm{erase}}_t(\widehat{\mathbf{S}}_t) = \frac{1}{2}\lVert \widehat{\mathbf{S}}_t^\top \mathbf{e}_t \rVert^2. \tag{10}$$
This objective penalizes the content currently returned when the decayed memory is queried at $\mathbf{e}_t$. A gradient step with learning rate $\gamma_t$ gives
$$\widetilde{\mathbf{S}}_t = (\mathbf{I} - \gamma_t \mathbf{e}_t \mathbf{e}_t^\top)\,\widehat{\mathbf{S}}_t, \tag{11}$$
where $\mathbf{e}_t$ is L2-normalized. Thus $\mathbf{e}_t$ is not merely an extra projection: it is the address whose current memory response is explicitly pushed toward zero.
This readout-level view clarifies why the new direction is more targeted than stronger decay. For any query direction $\mathbf{q}$, the erased memory reads out as
$$\widetilde{\mathbf{S}}_t^\top \mathbf{q} = \widehat{\mathbf{S}}_t^\top \mathbf{q} - \gamma_t (\mathbf{q}^\top \mathbf{e}_t)\,\widehat{\mathbf{S}}_t^\top \mathbf{e}_t. \tag{12}$$
When $\mathbf{q} = \mathbf{e}_t$, Eq. ([12](https://arxiv.org/html/2606.26560#S3.E12)) suppresses the response at the erase address by a factor of $1 - \gamma_t$. When $\mathbf{q}$ is orthogonal to $\mathbf{e}_t$, the erase step leaves that readout unchanged before the later delta update. The decay gate $\mathbf{D}_t$ controls retention by key coordinate; in contrast, Eq. ([12](https://arxiv.org/html/2606.26560#S3.E12)) subtracts the content currently returned at a learned memory address, scaled by how much the query aligns with that address.
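The three statements around Eq. (12) can be verified directly. This is a minimal NumPy sketch with random illustrative values and a unit-norm erase address; `erase_step` is a hypothetical helper name for the operator in Eq. (11).

```python
import numpy as np

def erase_step(S_hat, e, gamma):
    """Eq. (11): contract the decayed memory along the unit-norm address e."""
    return S_hat - gamma * np.outer(e, S_hat.T @ e)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S_hat = rng.standard_normal((d_k, d_v))   # stands in for D_t S_{t-1}
e = rng.standard_normal(d_k)
e /= np.linalg.norm(e)                    # L2-normalized erase address
gamma = 0.7
S_tilde = erase_step(S_hat, e, gamma)

# Generic query: both sides of Eq. (12).
q = rng.standard_normal(d_k)
lhs = S_tilde.T @ q
rhs = S_hat.T @ q - gamma * (q @ e) * (S_hat.T @ e)

# Query equal to the erase address: response shrinks by (1 - gamma).
at_erase = S_tilde.T @ e

# Query orthogonal to the erase address: response is unchanged.
q_perp = q - (q @ e) * e
at_perp = S_tilde.T @ q_perp
```

Unlike a smaller decay factor, which scales whole key coordinates, the subtraction here depends on the overlap $\mathbf{q}^\top \mathbf{e}_t$, so queries that do not point toward the erase address are unaffected.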
After this cleanup, EDA applies the standard delta-style corrective write to the erased memory:
$$\mathbf{S}_t = (\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)\,\widetilde{\mathbf{S}}_t + \beta_t \mathbf{k}_t \mathbf{v}_t^\top. \tag{13}$$
Substituting Eq. ([11](https://arxiv.org/html/2606.26560#S3.E11)) into Eq. ([13](https://arxiv.org/html/2606.26560#S3.E13)) recovers Eq. ([8](https://arxiv.org/html/2606.26560#S3.E8)). The delta correction and write at $\mathbf{k}_t$ are therefore unchanged; the new degree of freedom is that stale memory can be suppressed at $\mathbf{e}_t$ before new content is written at $\mathbf{k}_t$. If $\mathbf{e}_t$ collapses to $\mathbf{k}_t$, EDA reduces to a stronger same-address correction; when the two directions differ, cleanup and writing are no longer forced to use the same address. This also distinguishes EDA from gate-level erase/write separation, where the residual can be reweighted by gates but remains organized around the current write key.
The resulting rule separates memory management into three levels of specificity: diagonal decay through $\mathbf{D}_t$, independent directional erasure through $\gamma_t \mathbf{e}_t$, and write-coupled correction through $\beta_t \mathbf{k}_t$. In this sense, EDA adds the missing degree of freedom needed to suppress stale memory at one address before performing a corrective write at another. Figure [1](https://arxiv.org/html/2606.26560#S3.F1) illustrates the full EDA layer architecture.
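The three stages (decay, erase, delta write) can be sketched as one recurrent step. This is a naive per-step reference in NumPy under illustrative assumptions: basis-vector addresses, no passive decay, and gates set to 1. It is not the paper's kernel, and `eda_step` is a hypothetical name.

```python
import numpy as np

def eda_step(S, k, v, e, beta, gamma, alpha):
    """One EDA update, Eq. (8), computed in the order decay -> erase -> delta."""
    S_hat = alpha[:, None] * S                              # Eq. (9): diagonal decay
    S_tilde = S_hat - gamma * np.outer(e, S_hat.T @ e)      # Eq. (11): erase at e
    return S_tilde + beta * np.outer(k, v - S_tilde.T @ k)  # Eq. (13): delta write at k

d_k = 4
e_old = np.eye(d_k)[0]                  # address of the stale association
k_new = np.eye(d_k)[1]                  # current write address
v_old = np.array([1.0, -2.0, 0.5])
v_new = np.array([0.3, 0.1, -1.0])
alpha = np.ones(d_k)                    # no passive decay, to isolate the erase path

S_prev = np.outer(e_old, v_old)

# EDA: erase at e_old and write at k_new within the same step.
S_eda = eda_step(S_prev, k_new, v_new, e_old, beta=1.0, gamma=1.0, alpha=alpha)

# Baseline: gamma = 0 switches the erase path off and recovers Eq. (7).
S_base = eda_step(S_prev, k_new, v_new, e_old, beta=1.0, gamma=0.0, alpha=alpha)

# Cross-check against the explicit matrix product in Eq. (8).
I = np.eye(d_k)
S_explicit = ((I - np.outer(k_new, k_new)) @ (I - np.outer(e_old, e_old))
              @ np.diag(alpha) @ S_prev + np.outer(k_new, v_new))
```

With these settings `S_eda.T @ e_old` is zero while `S_base.T @ e_old` still returns the stale value, and both versions return `v_new` at the write address.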
Figure 1: Architecture of an EDA layer. The input is projected into query, key, value, output gate, erase gate ($\gamma$), delta gate ($\beta$), decay parameters ($\alpha$), and erase address ($\mathbf{e}$). The query, key, and erase address are L2-normalized; the erase address uses a low-rank projection. All signals feed into the EDA kernel, whose output is normalized and gated before a final linear projection.
#### Safe gate for bounded decay.
The diagonal decay in Eq. ([8](https://arxiv.org/html/2606.26560#S3.E8)) is parameterized in log space. Let $\mathbf{D}_t = \operatorname{Diag}(\exp(\boldsymbol{g}_t))$, $\mathbf{A} = \exp(\mathbf{A}_{\log}) > 0$, $\boldsymbol{u}_t = \boldsymbol{a}_t + \boldsymbol{b}_{\Delta}$, and $\boldsymbol{\Delta}_t = \operatorname{softplus}(\boldsymbol{u}_t)$, where $\boldsymbol{a}_t$ is the decay projection and $\boldsymbol{b}_{\Delta}$ is a learned bias. The Mamba2/GDN-style log-space gate (Dao and Gu, [2024](https://arxiv.org/html/2606.26560#bib.bib13); Yang et al., [2025](https://arxiv.org/html/2606.26560#bib.bib1)) uses
$$\boldsymbol{g}_t^{\mathrm{log}} = -\mathbf{A} \odot \boldsymbol{\Delta}_t, \tag{14}$$
which guarantees $\exp(\boldsymbol{g}_t) \leq 1$ but leaves the log-decay unbounded below. KDA computes its safe gate as $\boldsymbol{g}_t^{\mathrm{KDA}} = \ell\,\sigma(\mathbf{A} \odot \boldsymbol{u}_t)$ with $\ell < 0$, mapping each log-decay coordinate into $(\ell, 0)$ (Team et al., [2025](https://arxiv.org/html/2606.26560#bib.bib2)). EDA instead uses a bounded safe gate with the same lower log-decay limit and maximum value $0$:
$$\boldsymbol{g}_t = \ell + (-\ell)\exp\!\left(-\frac{\mathbf{A}}{\lvert \ell \rvert} \odot \boldsymbol{\Delta}_t\right), \tag{15}$$
where the exponential is applied elementwise. Since $\exp(-x) \in (0, 1]$ for $x \geq 0$, this parameterization keeps $\boldsymbol{g}_t \in (\ell, 0]$ and therefore bounds each decay coordinate by $\exp(\ell) < \alpha_{t,i} \leq 1$.
Comparing Eq. ([14](https://arxiv.org/html/2606.26560#S3.E14)) and Eq. ([15](https://arxiv.org/html/2606.26560#S3.E15)) shows why this bounded form is useful beyond numerical clipping. The Mamba2/GDN-style log-space gate separates two roles: $\mathbf{A}$ controls the decay magnitude, while $\boldsymbol{\Delta}_t = \operatorname{softplus}(\boldsymbol{u}_t)$ acts as a ReLU-like nonnegative switch for whether a coordinate should decay. Our safe gate preserves this amplitude–switch decomposition near the active region: elementwise, a Taylor expansion around $\Delta_{t,i} = 0$ gives $g_{t,i} = -A_i \Delta_{t,i} + \mathcal{O}(A_i^2 \Delta_{t,i}^2 / \lvert \ell \rvert)$. It therefore behaves like the log-space gate for small decay inputs, but saturates for large inputs instead of driving the log-decay toward $-\infty$. By contrast, the KDA sigmoid gate bounds the log-decay by applying a sigmoid directly to the affine decay signal, so $\mathbf{A}$ mainly changes the sigmoid slope and saturation rather than acting as a separate decay-amplitude parameter. In practice we set $\ell = -5$, making the smallest per-step decay factor $\exp(\ell) \approx 6.7 \times 10^{-3}$, well within the normal range of half-precision formats. This prevents decay factors from becoming subnormal or zero, allowing decay-weighted chunk tensors to remain in half precision and preserving Tensor-Core-friendly dense matrix multiplications.
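The two gates in Eq. (14) and Eq. (15) can be compared numerically. The sketch below assumes NumPy and a scalar amplitude `A = 1.5` picked only for illustration; the function names are hypothetical. It checks the bound $(\ell, 0]$, the agreement with the log-space gate for small $\Delta$, and the saturation for large inputs.

```python
import numpy as np

def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def log_space_gate(A, u):
    """Eq. (14): unbounded-below log-decay."""
    return -A * softplus(u)

def safe_gate(A, u, ell=-5.0):
    """Eq. (15): log-decay bounded in (ell, 0]."""
    delta = softplus(u)
    return ell + (-ell) * np.exp(-(A / abs(ell)) * delta)

A = 1.5                                   # illustrative decay amplitude
u = np.linspace(-10.0, 10.0, 201)         # illustrative range of gate inputs
g_safe = safe_gate(A, u)
g_log = log_space_gate(A, u)

# Small-input regime: the two gates agree to second order in A * delta.
gap_small = abs(safe_gate(A, -8.0) - log_space_gate(A, -8.0))

# Large-input regime: the log-space gate keeps falling, the safe gate saturates.
g_log_large = log_space_gate(A, 50.0)
g_safe_large = safe_gate(A, 50.0)

# Smallest per-step decay factor for ell = -5, compared with the smallest
# normal float16 value.
min_decay = np.exp(-5.0)
fp16_smallest_normal = np.finfo(np.float16).tiny
```

One caveat of this sketch: in floating point, an extremely large `A * delta` makes the exponential underflow to zero, so the computed gate can equal `ell` exactly even though the analytic bound is strict.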
#### Cross-term structure and update order.
The order in Eq. ([8](https://arxiv.org/html/2606.26560#S3.E8)), erase first and then delta, is essential. Expanding the product of the two rank-1 operators reveals why:
$$(\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)(\mathbf{I} - \gamma_t \mathbf{e}_t \mathbf{e}_t^\top) = \mathbf{I} - \gamma_t \mathbf{e}_t \mathbf{e}_t^\top - \beta_t \mathbf{k}_t \mathbf{k}_t^\top + \gamma_t \beta_t (\mathbf{k}_t^\top \mathbf{e}_t)\,\mathbf{k}_t \mathbf{e}_t^\top. \tag{16}$$
The final term, the cross-term, is proportional to the cosine similarity $c_t = \mathbf{e}_t^\top \mathbf{k}_t$ between the erase and write directions (both are L2-normalized). It quantifies the "leakage" that occurs when the two directions are not orthogonal: the erase operation can influence the subsequent write-address correction through $\mathbf{k}_t \mathbf{e}_t^\top$. Reversing the order (delta first, then erase) would apply the erase operator after the write, allowing memory cleanup to suppress newly written content. By applying erasure first, our rule ensures that cleanup acts on old content before the new corrective write is committed.
When the model learns a near-orthogonal separation between $\mathbf{e}_t$ and $\mathbf{k}_t$ (mean $\lvert c_t \rvert \approx 0.105$, see Figure [2](https://arxiv.org/html/2606.26560#S4.F2)(c)), the cross-term becomes small and the update is well-approximated by two independent corrections acting on orthogonal subspaces. In this regime, the sequential rule is stable and the first-order effects dominate.
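Both the expansion in Eq. (16) and the effect of reversing the order can be checked on a small hand-built example. The sketch assumes NumPy, omits the diagonal decay for simplicity, and uses made-up vectors whose cosine similarity is 0.6 so that the cross-term is clearly visible.

```python
import numpy as np

d_k = 4
I = np.eye(d_k)
k = np.array([1.0, 0.0, 0.0, 0.0])        # unit-norm write key
e = np.array([0.6, 0.8, 0.0, 0.0])        # unit-norm erase address, cosine 0.6 with k
beta, gamma = 1.0, 0.5
c = float(e @ k)                          # cosine similarity c_t

delta_op = I - beta * np.outer(k, k)
erase_op = I - gamma * np.outer(e, e)

# Left- and right-hand sides of Eq. (16).
product = delta_op @ erase_op
expanded = (I - gamma * np.outer(e, e) - beta * np.outer(k, k)
            + gamma * beta * c * np.outer(k, e))

# Compare update orders on a fixed memory (decay omitted in this sketch).
S_prev = np.arange(12, dtype=float).reshape(d_k, 3) / 10.0
v = np.array([1.0, 2.0, 3.0])
write = beta * np.outer(k, v)

S_eda = delta_op @ erase_op @ S_prev + write      # erase, then delta (Eq. 8)
S_rev = erase_op @ (delta_op @ S_prev + write)    # delta, then erase (reversed)

read_eda = S_eda.T @ k                    # readout at the write key, EDA order
read_rev = S_rev.T @ k                    # readout at the write key, reversed order
```

In the EDA order the delta write is applied last, so with $\beta_t = 1$ the readout at $\mathbf{k}_t$ equals $\mathbf{v}_t$ regardless of the overlap. In the reversed order the erase operator removes a $\gamma_t c_t$-weighted share of the freshly written content.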
### 3.4 EDA with Chunk-wise Parallel
Referring to Eq. ([8](https://arxiv.org/html/2606.26560#S3.E8)), the EDA state is multiplied by two rank-1 correction factors per step. To reuse existing DPLR chunk-wise kernels, we interleave the erase and delta sub-steps into a doubled sequence of length $2t$. Let
$$[\mathbf{q}'_\tau, \mathbf{k}'_\tau, \mathbf{v}'_\tau, \beta'_\tau, \boldsymbol{\alpha}'_\tau] = \begin{cases} [\mathbf{0}, \mathbf{e}_t, \mathbf{0}, \gamma_t, \boldsymbol{\alpha}_t], & \tau = 2t - 1 \\ [\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t, \beta_t, \mathbf{1}], & \tau = 2t \end{cases} \tag{17}$$
Each original step $t$ maps to two sub-steps in the doubled sequence: the odd sub-step applies the erase operator with decay, and the even sub-step applies the delta correction with identity decay. This reduces EDA to a standard DPLR recurrence over twice as many steps. We can rewrite Eq. ([8](https://arxiv.org/html/2606.26560#S3.E8)) as:
$$\begin{aligned}
\mathbf{S}_t &= (\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)(\mathbf{I} - \gamma_t \mathbf{e}_t \mathbf{e}_t^\top)\,\mathbf{D}_t \mathbf{S}_{t-1} + \beta_t \mathbf{k}_t \mathbf{v}_t^\top \\
&= (\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)\,\mathbf{I}\left((\mathbf{I} - \gamma_t \mathbf{e}_t \mathbf{e}_t^\top)\,\mathbf{D}_t \mathbf{S}_{t-1} + 0\right) + \beta_t \mathbf{k}_t \mathbf{v}_t^\top \\
&= (\mathbf{I} - \beta'_{2t} \mathbf{k}'_{2t} \mathbf{k}_{2t}^{\prime\top})\,\mathbf{D}'_{2t}\left((\mathbf{I} - \beta'_{2t-1} \mathbf{k}'_{2t-1} \mathbf{k}_{2t-1}^{\prime\top})\,\mathbf{D}'_{2t-1} \mathbf{S}_{t-1} + \beta'_{2t-1} \mathbf{k}'_{2t-1} \mathbf{v}_{2t-1}^{\prime\top}\right) + \beta'_{2t} \mathbf{k}'_{2t} \mathbf{v}_{2t}^{\prime\top}
\end{aligned} \tag{18}$$
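The reduction in Eq. (17) and Eq. (18) can be sanity-checked with a naive recurrent implementation. The sketch below assumes NumPy, random unit-norm keys and erase addresses, and a zero initial state. It stands in for the idea only, with a plain Python loop in place of a chunk-wise kernel, and all function names are hypothetical.

```python
import numpy as np

def dplr_step(S, k, v, beta, alpha):
    """Generic gated delta step: (I - beta k k^T) Diag(alpha) S + beta k v^T."""
    S_hat = alpha[:, None] * S
    return S_hat + beta * np.outer(k, v - S_hat.T @ k)

def eda_reference(Q, K, V, E, beta, gamma, alpha):
    """Direct evaluation of Eq. (8), one original step at a time."""
    T, d_k = K.shape
    S = np.zeros((d_k, V.shape[1]))
    I = np.eye(d_k)
    out = np.zeros_like(V)
    for t in range(T):
        erase = I - gamma[t] * np.outer(E[t], E[t])
        delta = I - beta[t] * np.outer(K[t], K[t])
        S = delta @ erase @ (alpha[t][:, None] * S) + beta[t] * np.outer(K[t], V[t])
        out[t] = S.T @ Q[t]
    return out, S

def eda_doubled(Q, K, V, E, beta, gamma, alpha):
    """Interleave erase and delta sub-steps as in Eq. (17), then run dplr_step."""
    T, d_k = K.shape
    d_v = V.shape[1]
    Qp = np.zeros((2 * T, d_k)); Qp[1::2] = Q          # erase sub-steps read nothing
    Kp = np.empty((2 * T, d_k)); Kp[0::2] = E; Kp[1::2] = K
    Vp = np.zeros((2 * T, d_v)); Vp[1::2] = V          # erase sub-steps write nothing
    bp = np.empty(2 * T); bp[0::2] = gamma; bp[1::2] = beta
    ap = np.ones((2 * T, d_k)); ap[0::2] = alpha       # delta sub-steps use identity decay
    S = np.zeros((d_k, d_v))
    out = np.zeros((2 * T, d_v))
    for tau in range(2 * T):
        S = dplr_step(S, Kp[tau], Vp[tau], bp[tau], ap[tau])
        out[tau] = S.T @ Qp[tau]
    return out[1::2], S                                # keep only delta sub-step outputs

rng = np.random.default_rng(0)
T, d_k, d_v = 5, 4, 3
unit = lambda X: X / np.linalg.norm(X, axis=-1, keepdims=True)
Q = rng.standard_normal((T, d_k))
K = unit(rng.standard_normal((T, d_k)))
E = unit(rng.standard_normal((T, d_k)))
V = rng.standard_normal((T, d_v))
beta = rng.uniform(0.1, 1.0, T)
gamma = rng.uniform(0.1, 1.0, T)
alpha = rng.uniform(0.5, 1.0, (T, d_k))

out_ref, S_ref = eda_reference(Q, K, V, E, beta, gamma, alpha)
out_dbl, S_dbl = eda_doubled(Q, K, V, E, beta, gamma, alpha)
```

Note that 0-based array indices shift the parity: the paper's odd sub-step $\tau = 2t - 1$ lands at the even index `2 * t - 2`, which is why the erase inputs are written to `[0::2]`.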
By partially expanding the recurrence for Eq. ([18](https://arxiv.org/html/2606.26560#S3.E18)) into a chunk-wise formulation, we have:
$$\mathbf{S}_t = \underbrace{\left(\prod_{i=1}^{2t}\left(\mathbf{I} - \beta'_i \mathbf{k}'_i \mathbf{k}_i^{\prime\intercal}\right)\mathbf{D}'_i\right)}_{:=\mathbf{P}} \mathbf{S}_0 + \underbrace{\sum_{i=1}^{2t}\left(\prod_{j=i+1}^{2t}\left(\mathbf{I} - \beta'_j \mathbf{k}'_j \mathbf{k}_j^{\prime\intercal}\right)\mathbf{D}'_j\right)\beta'_i \mathbf{k}'_i \mathbf{v}_i^{\prime\intercal}}_{:=\mathbf{H}} \tag{19}$$
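Eq. (19) says that a run of sub-steps acts on the incoming state as an affine map, $\mathbf{S}_0 \mapsto \mathbf{P}\mathbf{S}_0 + \mathbf{H}$. The sketch below checks this with NumPy on random illustrative inputs, assuming the products are ordered with later sub-steps multiplying on the left (the ordering implied by the recurrence). `transition` and `summarize` are hypothetical helper names, and the accumulation is a plain loop and not the WY-based computation.

```python
import numpy as np

def transition(k, beta, alpha):
    """(I - beta k k^T) Diag(alpha) for one sub-step of the doubled sequence."""
    return (np.eye(k.shape[0]) - beta * np.outer(k, k)) * alpha[None, :]

def summarize(Kp, Vp, bp, ap):
    """Accumulate P and H of Eq. (19) over a run of sub-steps."""
    d_k, d_v = Kp.shape[1], Vp.shape[1]
    P = np.eye(d_k)
    H = np.zeros((d_k, d_v))
    for k, v, b, a in zip(Kp, Vp, bp, ap):
        T_i = transition(k, b, a)
        P = T_i @ P                        # later sub-steps multiply on the left
        H = T_i @ H + b * np.outer(k, v)   # earlier writes pass through later transitions
    return P, H

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 4, 3                      # n sub-steps, e.g. 2t with t = 3
Kp = rng.standard_normal((n, d_k))
Kp /= np.linalg.norm(Kp, axis=1, keepdims=True)
Vp = rng.standard_normal((n, d_v))
bp = rng.uniform(0.1, 1.0, n)
ap = rng.uniform(0.5, 1.0, (n, d_k))

P, H = summarize(Kp, Vp, bp, ap)

# Run the plain recurrence from an arbitrary nonzero initial state.
S0 = rng.standard_normal((d_k, d_v))
S = S0.copy()
for k, v, b, a in zip(Kp, Vp, bp, ap):
    S = transition(k, b, a) @ S + b * np.outer(k, v)
```

Because `P` and `H` do not depend on `S0`, they can be computed per chunk independently of the incoming state, which is the property chunk-wise algorithms rely on.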
Following the chunk-wise algorithm of KDA (Team et al., [2025](https://arxiv.org/html/2606.26560#bib.bib2)), we apply the WY representation to pack a series of updates into a single compact representation:
$$
\begin{aligned}
\mathbf{w}_{2t} &= \beta'_{2t}\left(\underbrace{\left(\prod_{i=1}^{2t}\mathbf{D}'_i\right)}_{:=\mathbf{D}'_{1\to 2t}}\mathbf{k}'_{2t}-\sum_{i=1}^{2t-1}\mathbf{w}_i\left(\mathbf{k}_i^{\prime\intercal}\underbrace{\left(\prod_{j=i}^{2t}\mathbf{D}'_j\right)}_{:=\mathbf{D}'_{i\to 2t}}\mathbf{k}'_{2t}\right)\right) \\
\mathbf{u}_{2t} &= \beta'_{2t}\left(\mathbf{v}'_{2t}-\sum_{i=1}^{2t-1}\mathbf{u}_i\left(\mathbf{k}_i^{\prime\intercal}\mathbf{D}'_{i\to 2t}\mathbf{k}'_{2t}\right)\right) \\
\mathbf{P} &= \mathbf{D}'_{1\to 2t}-\sum_{i=1}^{2t}\mathbf{D}'_{i\to 2t}\mathbf{k}'_i\mathbf{w}_i^{\intercal} \\
\mathbf{H} &= \sum_{i=1}^{2t}\mathbf{D}'_{i\to 2t}\mathbf{k}'_i\mathbf{u}_i^{\intercal}
\end{aligned}
\tag{20}
$$
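To make the WY idea concrete, the sketch below works through the decay-free special case $\mathbf{D}'_i=\mathbf{I}$, which is a deliberate simplification of Eq. (20) rather than the full gated form. In that case the recursion reduces to $\mathbf{w}_n=\beta_n(\mathbf{k}_n-\sum_{i<n}\mathbf{w}_i(\mathbf{k}_i^{\intercal}\mathbf{k}_n))$ and likewise for $\mathbf{u}_n$, with $\mathbf{P}=\mathbf{I}-\sum_i\mathbf{k}_i\mathbf{w}_i^{\intercal}$ and $\mathbf{H}=\sum_i\mathbf{k}_i\mathbf{u}_i^{\intercal}$. Variable names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, dk, dv = 6, 5, 3
K = rng.normal(size=(n, dk)); K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.normal(size=(n, dv))
beta = rng.uniform(0.1, 0.9, size=n)

# WY recursion with D'_i = I (simplifying assumption, not the gated form).
W = np.zeros((n, dk))
U = np.zeros((n, dv))
for t in range(n):
    overlap = K[:t] @ K[t]                   # k_i^T k_t for every i < t
    W[t] = beta[t] * (K[t] - overlap @ W[:t])
    U[t] = beta[t] * (V[t] - overlap @ U[:t])

P_wy = np.eye(dk) - K.T @ W                  # I - sum_i k_i w_i^T
H_wy = K.T @ U                               # sum_i k_i u_i^T

# Reference: apply the rank-1 factors one at a time.
P_ref, H_ref = np.eye(dk), np.zeros((dk, dv))
for t in range(n):
    step = np.eye(dk) - beta[t] * np.outer(K[t], K[t])
    P_ref = step @ P_ref
    H_ref = step @ H_ref + beta[t] * np.outer(K[t], V[t])

wy_gap = float(max(np.abs(P_wy - P_ref).max(), np.abs(H_wy - H_ref).max()))
print(f"max WY reconstruction error = {wy_gap:.2e}")
```

The point of the packed form is that a chunk of rank-1 updates is summarized by two thin matrices, so the chunk transition can be applied with a few matrix multiplications instead of a token-by-token loop.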
We then apply the UT transform to reduce non-matmul FLOPs:
$$
\begin{aligned}
\mathbf{A}_{1\to 2t} &= \left[\begin{array}{c|c|c|c}\mathrm{diag}(\mathbf{D}'_{1\to 1}) & \mathrm{diag}(\mathbf{D}'_{1\to 2}) & \cdots & \mathrm{diag}(\mathbf{D}'_{1\to 2t})\end{array}\right] \\
\mathbf{A}_{i\to 2t} &= \left[\begin{array}{c|c|c|c}\mathrm{diag}(\mathbf{D}'_{1\to 2t}) & \mathrm{diag}(\mathbf{D}'_{2\to 2t}) & \cdots & \mathrm{diag}(\mathbf{D}'_{2t\to 2t})\end{array}\right] \\
\mathbf{M} &= \left(\mathbf{I}+\mathrm{StrictTril}\left(\mathrm{Diag}(\beta')\left(\mathbf{A}_{1\to 2t}\odot\mathbf{K}'\right)\left(\frac{\mathbf{K}'}{\mathbf{A}_{1\to 2t}}\right)^{\intercal}\right)\right) \\
\mathbf{W} &= \mathbf{M}\left(\mathbf{A}_{1\to 2t}\odot\mathbf{K}'\right) \\
\mathbf{U} &= \mathbf{M}\mathbf{V}'
\end{aligned}
\tag{21}
$$
Finally, the state and output can be computed in a chunk\-wise manner using the matrix form:
$$
\begin{aligned}
\mathbf{S}_t &= \mathbf{D}'_{1\to 2t}\mathbf{S}_0+\left(\mathbf{A}_{i\to 2t}\odot\mathbf{K}'\right)^{\intercal}(\mathbf{U}-\mathbf{W}\mathbf{S}_0) \\
\mathbf{O} &= \left(\mathbf{A}_{1\to 2t}\odot\mathbf{Q}'\right)\mathbf{S}_0+\mathrm{Tril}\left(\left(\mathbf{A}_{1\to 2t}\odot\mathbf{Q}'\right)\left(\frac{\mathbf{K}'}{\mathbf{A}_{1\to 2t}}\right)^{\intercal}\right)(\mathbf{U}-\mathbf{W}\mathbf{S}_0)
\end{aligned}
\tag{22}
$$
This formulation reduces EDA’s two\-factor update to the standard DPLR chunk\-wise recurrence\.
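For intuition about the matrix form, the following sketch again uses the decay-free special case (all $\mathbf{D}'_i=\mathbf{I}$), so the decay-weighted factors $\mathbf{A}_{1\to 2t}$ and $\mathbf{A}_{i\to 2t}$ of Eqs. (21) and (22) drop out. It obtains $\mathbf{W}$ and $\mathbf{U}$ by solving the unit lower-triangular system that the WY recursion implies for the plain delta rule, then forms the chunk state and outputs with matrix products. The readout convention $\mathbf{o}_t=\mathbf{S}_t^{\intercal}\mathbf{q}_t$ after each update, the use of a general solver instead of a triangular one, and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, dk, dv = 6, 5, 3
K = rng.normal(size=(n, dk)); K /= np.linalg.norm(K, axis=1, keepdims=True)
Q = rng.normal(size=(n, dk))
V = rng.normal(size=(n, dv))
beta = rng.uniform(0.1, 0.9, size=n)
S0 = rng.normal(size=(dk, dv))

# Solve (I + StrictTril(Diag(beta) K K^T)) X = Diag(beta) [K, V].
A = np.eye(n) + np.tril(beta[:, None] * (K @ K.T), k=-1)
W = np.linalg.solve(A, beta[:, None] * K)
U = np.linalg.solve(A, beta[:, None] * V)

# Chunk-level state and outputs (decay-free analogue of the matrix form).
R = U - W @ S0
S_chunk = S0 + K.T @ R
O_chunk = Q @ S0 + np.tril(Q @ K.T) @ R

# Token-by-token reference, reading the state after each update.
S, O_ref = S0.copy(), np.zeros((n, dv))
for t in range(n):
    S = S - beta[t] * np.outer(K[t], K[t] @ S) + beta[t] * np.outer(K[t], V[t])
    O_ref[t] = Q[t] @ S

chunk_gap = float(max(np.abs(S_chunk - S).max(), np.abs(O_chunk - O_ref).max()))
print(f"max |chunk form - sequential| = {chunk_gap:.2e}")
```

In the EDA setting the rows of `K`, `V`, and `beta` would hold the interleaved erase and write sub-steps, and only the rows belonging to write sub-steps would be read out; that bookkeeping is omitted here.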
### 3\.5Efficiency Analysis
Because each token expands into two sub-steps (erase, then write), the chunk-wise parallel formulation above increases the per-chunk sequence length, which raises the compute workload during prefill. However, the only additional inputs to the kernel are the erase address $\mathbf{e}$ and the scalar gate $\gamma$, so the increase in HBM traffic remains modest after kernel fusion. Since the chunk-forward pass of channel-wise gated delta models is inherently memory-bound, the wall-clock overhead remains moderate in practice. During autoregressive decoding the effect is smaller still, as the dominant cost is reading and writing the recurrent state rather than computing the rank-1 updates. Moreover, linear-attention layers typically account for a minor fraction of end-to-end model latency, further limiting the overall impact. Optimized kernel implementations will be released at [https://github.com/QwenLM/FlashQLA](https://github.com/QwenLM/FlashQLA).
## 4Experiments
### 4\.1Experimental Setup
We evaluate EDA under two matched pretraining scales: a dense 2.5B model family and a larger MoE 25B-A2.8B family. The goal is to test whether the proposed erase-then-delta update improves the recurrent component in both a standard dense setting and a sparse-activated large-model setting. Within each scale, the compared models share the same training setup; detailed architecture hyperparameters and parameter counts are listed in Appendix [A](https://arxiv.org/html/2606.26560#A1).
#### Compared models\.
For the dense comparison, we compare a full-attention Transformer baseline with GDN, GDN-2, KDA, and EDA. For the MoE comparison, we compare GDN, KDA, and EDA under the same sparse-activation backbone. Except for the Transformer baseline, all compared linear attention models are hybrid architectures with three linear-attention layers followed by one full-attention Transformer layer, corresponding to a 3:1 linear-to-full attention ratio. This ratio is not tuned specifically for EDA; it follows the common hybrid configuration used in Qwen3.5-style and Kimi Linear architectures (Team, [2025](https://arxiv.org/html/2606.26560#bib.bib21); Team et al., [2025](https://arxiv.org/html/2606.26560#bib.bib2)).
#### Training setup\.
All models were pretrained for 400B tokens with sequence length 4096 and global batch size 1024. The dense models used a learning rate decayed from $4\times 10^{-3}$ to $3\times 10^{-5}$, while the MoE 25B-A2.8B models used a learning rate decayed from $2\times 10^{-3}$ to $3\times 10^{-5}$. We additionally report MoE checkpoints after an 80B-token midtraining stage initialized from the 400B-token pretrained MoE checkpoints. The midtraining stage used sequence length 32k.
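For a sense of scale, the stated pretraining budget can be converted into optimizer steps. The short calculation below assumes that the global batch size counts sequences, so each step consumes 4096 × 1024 tokens; the midtraining batch size is not stated, so that stage is left out.

```python
seq_len = 4096
global_batch = 1024            # assumed to be measured in sequences
pretrain_tokens = 400e9

tokens_per_step = seq_len * global_batch
pretrain_steps = pretrain_tokens / tokens_per_step
print(f"{tokens_per_step:,} tokens per step, about {pretrain_steps:,.0f} steps")
```

Under that assumption, pretraining corresponds to roughly 95 thousand steps of about 4.2M tokens each.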
#### Evaluation setup\.
For downstream evaluation, we report MMLU, MMLU-Pro, GSM8K, MATH, BBH, and EvalPlus. Unless otherwise stated, entries are percentages averaged over two evaluation runs of the same checkpoint, and the Avg. column denotes the unweighted mean over the displayed benchmarks. Brief descriptions of the downstream benchmarks are provided in Appendix [B](https://arxiv.org/html/2606.26560#A2).
### 4\.2Model Results
Table 1: Evaluation results after 400B-token pretraining. Values are percentages averaged over two evaluation runs; Avg. is the unweighted mean over the six benchmark columns. Within each model family, best results are bold and second-best results are underlined.

At the dense 2.5B scale, EDA achieves the strongest average score among all dense models. Compared with KDA, which shares the same channel-wise gated delta backbone but lacks the independent erase address, EDA improves the Avg. score by 0.63 points.
The larger MoE 25B\-A2\.8B setting gives a clearer picture of the scaling behavior\. EDA performs best on most benchmarks and improves the overall evaluation performance across knowledge\-heavy, reasoning\-heavy, and code\-oriented tasks\. This larger\-scale result suggests that address\-level erase/write decoupling provides a broadly useful memory\-management degree of freedom: the model can preserve the delta\-rule correction at the current write key while using a separate learned address to suppress stale content elsewhere\.
### 4\.3Midtraining Results
Table [2](https://arxiv.org/html/2606.26560#S4.T2) reports the same benchmark suite after the MoE 25B-A2.8B checkpoints were further trained for 80B tokens at 32k sequence length.

Table 2: Evaluation results for MoE 25B-A2.8B checkpoints after 400B-token pretraining followed by 80B-token midtraining at 32k sequence length. Values are percentages averaged over two evaluation runs; Avg. is the unweighted mean over the six benchmark columns. Best results are bold and second-best results are underlined.

Midtraining tests whether the pretraining-stage advantage survives a harder adaptation setting rather than only appearing at the original 4k training length. After the 80B-token long-context stage, EDA continues to provide the strongest overall performance, with especially clear gains on knowledge and reasoning benchmarks such as MMLU, MMLU-Pro, MATH, and BBH. This persistence is important because long-context midtraining changes the operating regime of the recurrent state: the model must maintain useful information over longer spans while still removing outdated content that can interfere with later reads.
Combined with the 400B\-token pretraining results, the midtraining result strengthens the main conclusion: decoupling erase and write addresses remains useful after the model is further trained for longer contexts, suggesting that the erase path is compatible with, rather than fragile under, subsequent sequence\-length adaptation\.
### 4\.4Long\-Context Evaluation
We evaluate the midtrained MoE checkpoints on the RULER benchmark from 4k to 128k context length. Since midtraining used 32k sequences, the 64k and 128k settings evaluate length extrapolation beyond the training context. Table [3](https://arxiv.org/html/2606.26560#S4.T3) reports the RULER score at each context length, aggregated over all sub-tasks and four evaluation runs. EDA outperforms both GDN and KDA in the short-context regime from 4k to 16k, and remains close to the two baselines from 32k to 128k.

Table 3: RULER (Hsieh et al., [2024](https://arxiv.org/html/2606.26560#bib.bib32)) long-context results for MoE 25B-A2.8B checkpoints after 400B-token pretraining and 80B-token midtraining at 32k sequence length. Values are percentages averaged over four evaluation runs; 64k and 128k are length-extrapolation settings. Avg. is the unweighted mean over the six displayed context lengths. Best results are bold and second-best results are underlined.
### 4\.5Memory\-State Analysis
The benchmark gains above do not by themselves explain why an additional erase address helps, since the delta update already has two ways to reduce old content: diagonal decay $\mathbf{D}_t=\operatorname{Diag}(\boldsymbol{\alpha}_t)$ and write-coupled correction $(\mathbf{I}-\beta_t\mathbf{k}_t\mathbf{k}_t^{\top})$. Throughout this subsection we analyze a fixed layer and attention head unless stated otherwise: $\mathbf{k}_t\in\mathbb{R}^{d_k}$ is the L2-normalized write key at token $t$, $d_k$ is the per-head key dimension, $\mathbf{I}\in\mathbb{R}^{d_k\times d_k}$ is the identity matrix, $\boldsymbol{\alpha}_t\in(0,1]^{d_k}$ is the per-channel retention vector, and $\beta_t\in(0,1)$ is the delta correction gate. We therefore ask a narrower mechanistic question: when the recurrent state must remove stale content, does the model use the new erase address in a way that cannot be explained by these two existing contraction paths alone?
We first measure a *gate-strength allocation*, not exact removed state energy. Recall that the diagonal decay is parameterized in log space as $\mathbf{D}_t=\operatorname{Diag}(\exp(\boldsymbol{g}_t))$, where $\boldsymbol{g}_t\in\mathbb{R}^{d_k}$ is the per-channel log-retention vector. Therefore $\boldsymbol{\alpha}_t=\exp(\boldsymbol{g}_t)$ is the per-channel retention factor applied before the erase and delta operators. For token $t$ and head $h$, let
$$
\bar{\alpha}_{t,h}=\frac{1}{d_k}\sum_{j=1}^{d_k}\alpha_{t,h,j}
\tag{23}
$$

be the mean retention factor of the diagonal decay within that head, where $\alpha_{t,h,j}$ is the $j$-th key-channel retention value and the sum averages over all $d_k$ key channels. Below we write $\alpha=\bar{\alpha}_{t,h}$ for compactness. This averaging deliberately collapses the diagonal operator to a scalar summary, so it should not be used to compare the full operator rank or total energy removed by $\mathbf{D}_t$ against the two rank-1 contractions. Its purpose is narrower: under the average retained scale of a head, we ask how the learned gates allocate contraction strength among the decay path, the write-key correction, and the independent erase path. Since both rank-1 operators act after $\mathbf{D}_t$, we define the unnormalized scores
$$
b_D=1-\alpha,\qquad b_\Delta=\alpha\beta_t,\qquad b_E=\alpha\gamma_t,
\tag{24}
$$

for diagonal decay, same-address correction, and independent erase, respectively; here $\gamma_t\in(0,1)$ is the erase gate. For readability, after fixing head $h$, we omit the head index on $\beta_{t,h}$ and $\gamma_{t,h}$ and write them as $\beta_t$ and $\gamma_t$. Here $1-\alpha$ is the average decay removal fraction obtained after summarizing the diagonal retention vector by its mean, rather than an exact operator-level decomposition of $\mathbf{D}_t$. We plot the normalized share $q_m=b_m/(b_D+b_\Delta+b_E)$ for mechanism $m\in\{D,\Delta,E\}$. This definition is appropriate for the allocation question because $\beta_t$ and $\gamma_t$ are exactly the contraction factors of the two rank-1 readouts, while the multiplier $\alpha$ accounts for the fact that both contractions operate on the state retained after decay. It should not be read as an exact or fully fair state-energy decomposition: the actual content removed also depends on the anisotropic diagonal decay, the current state projections onto $\mathbf{e}_t$ and $\mathbf{k}_t$, and the overlap between the two addresses.
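The allocation in Eqs. (23) and (24) is simple enough to state as a few lines of code. The sketch below computes the normalized shares $(q_D,q_\Delta,q_E)$ for a single token and head; the function name `gate_allocation` and the gate values in the two examples are hypothetical and chosen only to contrast a fast-decay head with a nearly persistent one.

```python
import numpy as np

def gate_allocation(alpha_channels, beta, gamma):
    """Return (q_D, q_Delta, q_E) for one token and head."""
    alpha = float(np.mean(alpha_channels))              # Eq. (23): mean retention
    b = np.array([1.0 - alpha, alpha * beta, alpha * gamma])  # Eq. (24)
    return b / b.sum()                                  # normalized shares q_m

# Hypothetical gate values, not measurements from the paper.
fast_decay = gate_allocation(np.full(128, 0.20), beta=0.5, gamma=0.5)
slow_decay = gate_allocation(np.full(128, 0.95), beta=0.2, gamma=0.6)
print("fast decay (q_D, q_Delta, q_E):", np.round(fast_decay, 3))
print("slow decay (q_D, q_Delta, q_E):", np.round(slow_decay, 3))
```

With strong decay the $1-\alpha$ term dominates by construction, whereas for $\alpha$ near one the shares are governed almost entirely by how the model sets $\beta_t$ against $\gamma_t$.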
As a boundary check, we also evaluate raw write\-key recall, which asks whether older hidden values can be read back from their original write keys\. KDA performs better under this strict probe, so EDA’s advantage should not be interpreted as uniformly better historical recall\. This motivates focusing on cleanup allocation and erase\-address structure rather than raw recall alone\.
To test whether the learned erase direction is structured, we use two address-level diagnostics. First, we compare the readout-level control induced by the actual erase address $\mathbf{e}_t\in\mathbb{R}^{d_k}$ with counterfactual directions: random unit vectors, head-shuffled learned erase directions, and the degenerate same-address choice $\mathbf{e}_t=\mathbf{k}_t$. For each direction strategy, we replay the recurrent state sequence with the same gates and measure the local effect of the erase step: $\mathbf{o}_t^{-}$ is the readout just before erase at token $t$, and $\delta\mathbf{o}_t$ is the readout change caused by that erase step. We compute the collateral perturbation score $\lVert\delta\mathbf{o}_t\rVert_2/\lVert\mathbf{o}_t^{-}\rVert_2$ over layers; the raw means are 0.064 for Actual, 0.143 for Random, 0.115 for Shuffle, and 0.223 for $\mathbf{e}_t=\mathbf{k}_t$. Figure [2](https://arxiv.org/html/2606.26560#S4.F2)(b) plots the same data as a layerwise fold change relative to Actual, with the raw means annotated. A smaller score does not mean “no erase”; rather, under the same erase-gate budget, it means that the chosen address changes the currently readable state less than an alternative address. This probe therefore does not measure task benefit directly, but asks whether replacing the learned erase address causes larger collateral changes to the current readout. Second, we measure $|\cos(\mathbf{e}_t,\mathbf{k}_t)|$, the absolute cosine similarity between the L2-normalized erase address and write key, on GSM8K few-shot prompts. The independent reference for this geometry check is the analytic mean of $|\mathbf{u}^{\top}\mathbf{r}|$ for two independent random unit directions $\mathbf{u},\mathbf{r}\in\mathbb{R}^{128}$, where 128 is the per-head key dimension in this model.
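The analytic reference has a closed form. For independent directions drawn uniformly from the unit sphere in $\mathbb{R}^d$, $\mathbb{E}\,|\mathbf{u}^{\top}\mathbf{r}|=\Gamma(d/2)/(\sqrt{\pi}\,\Gamma((d+1)/2))\approx\sqrt{2/(\pi d)}$, which is roughly 0.07 for $d=128$. The sketch below evaluates that expression and cross-checks it by Monte Carlo; it assumes the uniform-sphere model for "independent random unit directions", and the function name and sample count are illustrative.

```python
import math
import numpy as np

def mean_abs_cosine(d):
    """E|u^T r| for independent uniform unit vectors in R^d."""
    log_ratio = math.lgamma(d / 2) - math.lgamma((d + 1) / 2)
    return math.exp(log_ratio) / math.sqrt(math.pi)

reference = mean_abs_cosine(128)

# Monte Carlo cross-check using normalized Gaussian samples.
rng = np.random.default_rng(0)
u = rng.normal(size=(20000, 128)); u /= np.linalg.norm(u, axis=1, keepdims=True)
r = rng.normal(size=(20000, 128)); r /= np.linalg.norm(r, axis=1, keepdims=True)
estimate = float(np.abs(np.sum(u * r, axis=1)).mean())

print(f"analytic = {reference:.4f}, Monte Carlo = {estimate:.4f}")
```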
Figure 2: EDA uses an independent cleanup path. (a) Gate-strength allocation by mean-retention bin. Independent erase becomes dominant when decay is weak ($\bar{\alpha}$ close to one); red percentages above bars denote the erase share. (b) Under the same erase gates, counterfactual erase directions cause larger local readout perturbations than the learned direction; bars show layerwise fold change relative to Actual, and $\mu$ denotes the raw mean perturbation score. (c) The erase address stays close to the independent-direction reference; same-address collapse would give $|\cos(e,k)|=1$.

Figure [2](https://arxiv.org/html/2606.26560#S4.F2) shows where the new erase degree of freedom is used. The allocation in Figure [2](https://arxiv.org/html/2606.26560#S4.F2)(a) shows that diagonal decay through $\mathbf{D}_t$, same-address correction through $(\mathbf{I}-\beta_t\mathbf{k}_t\mathbf{k}_t^{\top})$, and independent erase through $(\mathbf{I}-\gamma_t\mathbf{e}_t\mathbf{e}_t^{\top})$ account for 35.4%, 31.8%, and 32.8% of the global share, respectively. More importantly, the allocation shifts with decay speed in the expected direction: when $\bar{\alpha}<0.3$, decay already supplies most of the contraction strength, while for nearly persistent heads with $\bar{\alpha}\geq 0.9$, independent erase contributes 69.1% and is about $3.0\times$ the same-address correction contribution. This high-retention regime is exactly where stale content would otherwise survive $\mathbf{D}_t$, so the model assigns the extra cleanup budget to $\gamma_t\mathbf{e}_t$ rather than forcing it through the current write key.
The learned erase direction is also controlled at the readout level rather than arbitrary. Figure [2](https://arxiv.org/html/2606.26560#S4.F2)(b) shows that replacing $\mathbf{e}_t$ with random, shuffled, or same-address alternatives increases local readout perturbation by about $2.4\times$, $1.9\times$, and $3.4\times$, respectively, after normalizing each analyzed layer by its Actual score. Because the normalization is applied per layer, these fold changes need not coincide with the ratios of the raw means reported above (roughly $2.2\times$, $1.8\times$, and $3.5\times$). Thus, the learned erase address is not just an additional direction for removing state; under the same erase gates, it changes the current readout less than alternative directions, suggesting a more controlled cleanup operation. Figure [2](https://arxiv.org/html/2606.26560#S4.F2)(c) provides a complementary geometry check: the observed mean $|\cos(\mathbf{e}_t,\mathbf{k}_t)|$ stays around 0.105 across layers, close to the independent-direction reference and far from the value near one expected under same-address collapse. Together with the raw-recall boundary check above, these probes support the address-decoupling interpretation of EDA while clarifying its limitation: independent erase is a conditional cleanup mechanism, not a uniformly better historical-recall mechanism.
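The collateral perturbation score can be illustrated for a single erase step. The sketch below assumes the readout convention $\mathbf{o}=\mathbf{S}^{\top}\mathbf{q}$ and applies the erase operator $(\mathbf{I}-\gamma\,\mathbf{e}\mathbf{e}^{\top})$ to a random state; the function name, shapes, and the two toy directions are hypothetical and do not reproduce the paper's replay procedure or its numbers.

```python
import numpy as np

def erase_perturbation(S, q, direction, gamma):
    """||delta o||_2 / ||o^-||_2 for one erase step along a unit direction."""
    o_before = S.T @ q                                    # readout before erase
    S_after = S - gamma * np.outer(direction, direction @ S)
    return float(np.linalg.norm(S_after.T @ q - o_before) / np.linalg.norm(o_before))

rng = np.random.default_rng(5)
dk, dv = 16, 8
S = rng.normal(size=(dk, dv))
q = rng.normal(size=dk)

rand_dir = rng.normal(size=dk); rand_dir /= np.linalg.norm(rand_dir)
# A direction orthogonal to q: erasing along it leaves this readout unchanged.
orth_dir = rand_dir - (rand_dir @ q) / (q @ q) * q
orth_dir /= np.linalg.norm(orth_dir)

score_random = erase_perturbation(S, q, rand_dir, gamma=0.5)
score_orth = erase_perturbation(S, q, orth_dir, gamma=0.5)
print(f"random direction: {score_random:.3f}, orthogonal direction: {score_orth:.1e}")
```

Since $\delta\mathbf{o}=-\gamma(\mathbf{e}^{\top}\mathbf{q})\,\mathbf{S}^{\top}\mathbf{e}$, the score under a fixed gate depends on how strongly the erase direction overlaps both the query and the stored content, which is why the same gate budget can produce very different readout changes for different addresses.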
## 5Related Work
#### Delta\-rule and gated linear memory models\.
DeltaNet (Schlag et al., [2021](https://arxiv.org/html/2606.26560#bib.bib15); Yang et al., [2024](https://arxiv.org/html/2606.26560#bib.bib24)) reinterprets recurrent state updates as online gradient descent on a reconstruction loss, replacing naive additive writes with corrective writes that depend on what is already stored at the current address. Gated DeltaNet (GDN) (Yang et al., [2025](https://arxiv.org/html/2606.26560#bib.bib1)) extends this with a head-wise forget gate, and recent channel-wise gated variants further replace that head-wise gate with diagonal decay for finer retention control (Team et al., [2025](https://arxiv.org/html/2606.26560#bib.bib2)). GDN-2 is the closest motivation-side comparison: it also argues that erase and write should be decoupled in delta-rule memory, but it targets a different axis of coupling (Hatamizadeh et al., [2026](https://arxiv.org/html/2606.26560#bib.bib3)). Specifically, GDN-2 separates key-side erase and value-side write gates, allowing the model to assign different strengths to erasing and writing inside the delta residual. The active edit, however, remains organized around the current write key. EDA targets the complementary address-level coupling: it keeps the corrective delta write at $\mathbf{k}_t$ while adding an independently addressed erase direction before the write. The two designs are therefore orthogonal in spirit: GDN-2 decouples how strongly erase and write are applied, while EDA decouples where erasing and writing are applied.
#### Expressive state\-transition mechanisms for linear RNNs\.
A growing body of work seeks to enrich the state transition in linear RNNs beyond the single-step delta correction. DeltaProduct (Siems et al., [2026](https://arxiv.org/html/2606.26560#bib.bib4)) applies a sequence of Householder reflections per step, enabling smooth interpolation between diagonal and dense transitions. RWKV-7 (Peng et al., [2025](https://arxiv.org/html/2606.26560#bib.bib16)) adopts a diagonal-plus-low-rank (DPLR) parameterization with vector-valued gating, improving state-tracking capacity. Comba (Hu et al., [2026](https://arxiv.org/html/2606.26560#bib.bib5)) proposes a scalar-plus-low-rank (SPLR) form motivated by closed-loop control theory, adding output correction alongside state feedback. These approaches increase the expressive power of state evolution globally. Our method is complementary but different in purpose: rather than enriching the transition matrix, we introduce a specific memory-management capability, namely selectively deleting stale memory at one address before performing a corrective write at another, while preserving the delta-rule structure.
#### Hybrid architectures and inference efficiency\.
The computational bottleneck of softmax attention at inference time has motivated hybrid architectures that combine full attention with linear recurrent layers. Models such as Jamba (Lieber et al., [2024](https://arxiv.org/html/2606.26560#bib.bib19)) and Jet-Nemotron (Gu et al., [2025](https://arxiv.org/html/2606.26560#bib.bib20)) interleave sparse full-attention layers with predominantly linear recurrent layers, achieving a practical trade-off between quality and efficiency. Recent channel-wise gated delta hybrids demonstrate that this design can match or exceed full-attention quality while reducing KV cache usage substantially (Team et al., [2025](https://arxiv.org/html/2606.26560#bib.bib2)). EDA is orthogonal to these architectural choices: it improves the recurrent component itself, making it a candidate drop-in replacement for channel-wise gated delta layers in hybrid designs.
## 6Conclusion
We introduced Erase\-then\-Delta Attention, an address\-level modification to delta\-rule linear attention that separates where the model erases from where it writes\. Instead of relying only on diagonal decay or same\-address delta correction to remove stale content, EDA first applies a learned erase operation at an independent address and then performs the corrective delta write at the current write key\. This keeps the core delta\-rule update intact while giving the recurrent state a more direct way to clean up memory that is not aligned with the current write\.
Across dense 2\.5B and MoE 25B\-A2\.8B pretraining, EDA achieves the strongest average performance among the compared models, and the advantage persists after long\-context midtraining of the MoE checkpoints\. The memory\-state analysis further supports the intended mechanism: the learned erase path is used most strongly when passive decay is weak, and counterfactual erase directions cause larger readout changes under the same erase gates\. These results suggest that recurrent memory models benefit from deciding not only what to write, but also where stale information should be removed\.
#### Limitations\.
Our work has several limitations\. Introducing the independent erase step reduces raw write\-key recall, so the erase path should be understood as a conditional cleanup mechanism rather than a uniform improvement to memory fidelity\. Additionally, the current probes measure gate allocation and readout perturbation but do not directly trace individual erase events to specific downstream prediction improvements\.
## References
- M. Beck, K. Pöppel, M. Spanring, A. Auer, O. Prudnikova, M. K. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter (2024) xLSTM: extended long short-term memory. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=ARAxPPIAhq)
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. ArXiv abs/2110.14168. External Links: [Link](https://api.semanticscholar.org/CorpusID:239998651)
- T. Dao and A. Gu (2024) Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. In Forty-first International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=ztn8FCR1td)
- A\. Gu and T\. Dao \(2024\)Mamba: linear\-time sequence modeling with selective state spaces\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=tEYskw1VY2)Cited by:[§1](https://arxiv.org/html/2606.26560#S1.p2.1)\.
- A\. Gu, K\. Goel, and C\. Re \(2022\)Efficiently modeling long sequences with structured state spaces\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=uYLFoz1vlAC)Cited by:[§1](https://arxiv.org/html/2606.26560#S1.p2.1)\.
- Y\. Gu, Q\. Hu, H\. Xi, J\. Chen, S\. Yang, S\. Han, and H\. Cai \(2025\)Jet\-nemotron: efficient language model with post neural architecture search\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=WZQXaTNYEB)Cited by:[§5](https://arxiv.org/html/2606.26560#S5.SS0.SSS0.Px3.p1.1)\.
- A\. Hatamizadeh, Y\. Choi, and J\. Kautz \(2026\)Gated deltanet\-2: decoupling erase and write in linear attention\.External Links:2605\.22791,[Link](https://arxiv.org/abs/2605.22791)Cited by:[§1](https://arxiv.org/html/2606.26560#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.26560#S2.SS2.p4.4),[§3\.2](https://arxiv.org/html/2606.26560#S3.SS2.p2.3),[§5](https://arxiv.org/html/2606.26560#S5.SS0.SSS0.Px1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. X\. Song, and J\. Steinhardt \(2020\)Measuring massive multitask language understanding\.ArXivabs/2009\.03300\.External Links:[Link](https://api.semanticscholar.org/CorpusID:221516475)Cited by:[Appendix B](https://arxiv.org/html/2606.26560#A2.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the MATH dataset\.InThirty\-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track \(Round 2\),External Links:[Link](https://openreview.net/forum?id=7Bywt2mQsCe)Cited by:[Appendix B](https://arxiv.org/html/2606.26560#A2.p1.1)\.
- C\. Hsieh, S\. Sun, S\. Kriman, S\. Acharya, D\. Rekesh, F\. Jia, and B\. Ginsburg \(2024\)RULER: what’s the real context size of your long\-context language models?\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=kIoBbc76Sy)Cited by:[Table 3](https://arxiv.org/html/2606.26560#S4.T3)\.
- J\. Hu, Y\. Pan, J\. Du, D\. Lan, X\. Tang, Q\. Wen, Y\. Liang, and W\. Sun \(2026\)Improving bilinear RNN with closed\-loop control\.InNeural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=jlJaRXDzCE)Cited by:[§2\.3](https://arxiv.org/html/2606.26560#S2.SS3.p1.1),[§5](https://arxiv.org/html/2606.26560#S5.SS0.SSS0.Px2.p1.1)\.
- A\. Katharopoulos, A\. Vyas, N\. Pappas, and F\. Fleuret \(2020\)Transformers are rnns: fast autoregressive transformers with linear attention\.InInternational Conference on Machine Learning,External Links:[Link](https://api.semanticscholar.org/CorpusID:220250819)Cited by:[§1](https://arxiv.org/html/2606.26560#S1.p2.1)\.
- O\. Lieber, B\. Lenz, H\. Bata, G\. Cohen, J\. Osin, I\. Dalmedigos, E\. Safahi, S\. Meirom, Y\. Belinkov, S\. Shalev\-Shwartz, O\. Abend, R\. Alon, T\. Asida, A\. Bergman, R\. Glozman, M\. Gokhman, A\. Manevich, N\. Ratner, N\. Rozen, E\. Shwartz, M\. Zusman, and Y\. Shoham \(2024\)Jamba: a hybrid transformer\-mamba language model\.External Links:2403\.19887,[Link](https://arxiv.org/abs/2403.19887)Cited by:[§5](https://arxiv.org/html/2606.26560#S5.SS0.SSS0.Px3.p1.1)\.
- J\. Liu, C\. S\. Xia, Y\. Wang, and L\. Zhang \(2023\)Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=1qvx610Cu7)Cited by:[Appendix B](https://arxiv.org/html/2606.26560#A2.p1.1)\.
- B\. Peng, R\. Zhang, D\. Goldstein, E\. Alcaide, X\. Du, H\. Hou, J\. Lin, J\. Liu, J\. Lu, W\. Merrill, G\. Song, K\. Tan, S\. Utpala, N\. Wilce, J\. S\. Wind, T\. Wu, D\. Wuttke, and C\. Zhou\-Zheng \(2025\)RWKV\-7 "goose" with expressive dynamic state evolution\.External Links:2503\.14456,[Link](https://arxiv.org/abs/2503.14456)Cited by:[§2\.3](https://arxiv.org/html/2606.26560#S2.SS3.p1.1),[§5](https://arxiv.org/html/2606.26560#S5.SS0.SSS0.Px2.p1.1)\.
- Z\. Qiu, Z\. Wang, B\. Zheng, Z\. Huang, K\. Wen, S\. Yang, R\. Men, L\. Yu, F\. Huang, S\. Huang, D\. Liu, J\. Zhou, and J\. Lin \(2025\)Gated attention for large language models: non\-linearity, sparsity, and attention\-sink\-free\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=1b7whO4SfY)Cited by:[Appendix A](https://arxiv.org/html/2606.26560#A1.p1.1)\.
- I\. Schlag, K\. Irie, and J\. Schmidhuber \(2021\)Linear transformers are secretly fast weight programmers\.InInternational Conference on Machine Learning,External Links:[Link](https://api.semanticscholar.org/CorpusID:235377069)Cited by:[§1](https://arxiv.org/html/2606.26560#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.26560#S2.SS2.p1.4),[§5](https://arxiv.org/html/2606.26560#S5.SS0.SSS0.Px1.p1.1)\.
- J\. Siems, T\. Carstensen, A\. Zela, F\. Hutter, M\. Pontil, and R\. Grazzi \(2026\)DeltaProduct: improving state\-tracking in linear RNNs via householder products\.InNeural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=SoRiaijTGr)Cited by:[§2\.3](https://arxiv.org/html/2606.26560#S2.SS3.p1.1),[§5](https://arxiv.org/html/2606.26560#S5.SS0.SSS0.Px2.p1.1)\.
- Y\. Sun, L\. Dong, S\. Huang, S\. Ma, Y\. Xia, J\. Xue, J\. Wang, and F\. Wei \(2023\) Retentive network: a successor to transformer for large language models\. ArXiv abs/2307\.08621\. External Links: [Link](https://api.semanticscholar.org/CorpusID:259937453) Cited by: [§1](https://arxiv.org/html/2606.26560#S1.p2.1)\.
- M\. Suzgun, N\. Scales, N\. Schärli, S\. Gehrmann, Y\. Tay, H\. W\. Chung, A\. Chowdhery, Q\. Le, E\. Chi, D\. Zhou, and J\. Wei \(2023\) Challenging BIG\-bench tasks and whether chain\-of\-thought can solve them\. In Findings of the Association for Computational Linguistics: ACL 2023, A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 13003–13051\. External Links: [Link](https://aclanthology.org/2023.findings-acl.824/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.824) Cited by: [Appendix B](https://arxiv.org/html/2606.26560#A2.p1.1)\.
- K\. Team, Y\. Zhang, Z\. Lin, X\. Yao, J\. Hu, F\. Meng, C\. Liu, X\. Men, S\. Yang, Z\. Li, W\. Li, E\. Lu, W\. Liu, Y\. Chen, W\. Xu, L\. Yu, Y\. Wang, Y\. Fan, L\. Zhong, E\. Yuan, D\. Zhang, Y\. Zhang, T\. Y\. Liu, H\. Wang, S\. Fang, W\. He, S\. Liu, Y\. Li, J\. Su, J\. Qiu, B\. Pang, J\. Yan, Z\. Jiang, W\. Huang, B\. Yin, J\. You, C\. Wei, Z\. Wang, C\. Hong, Y\. Chen, G\. Chen, Y\. Wang, H\. Zheng, F\. Wang, Y\. Liu, M\. Dong, Z\. Zhang, S\. Pan, W\. Wu, Y\. Wu, L\. Guan, J\. Tao, G\. Fu, X\. Xu, Y\. Wang, G\. Lai, Y\. Wu, X\. Zhou, Z\. Yang, and Y\. Du \(2025\) Kimi linear: an expressive, efficient attention architecture\. External Links: 2510\.26692, [Link](https://arxiv.org/abs/2510.26692) Cited by: [§1](https://arxiv.org/html/2606.26560#S1.p3.1), [§2\.2](https://arxiv.org/html/2606.26560#S2.SS2.p3.1), [§2\.3](https://arxiv.org/html/2606.26560#S2.SS3.p1.1), [§3\.3](https://arxiv.org/html/2606.26560#S3.SS3.SSS0.Px1.p1.11), [§3\.4](https://arxiv.org/html/2606.26560#S3.SS4.p7.1), [§4\.1](https://arxiv.org/html/2606.26560#S4.SS1.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.26560#S5.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.26560#S5.SS0.SSS0.Px3.p1.1)\.
- Q\. Team \(2025\) Qwen3 technical report\. External Links: 2505\.09388, [Link](https://arxiv.org/abs/2505.09388) Cited by: [§4\.1](https://arxiv.org/html/2606.26560#S4.SS1.SSS0.Px1.p1.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, L\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. In Neural Information Processing Systems, External Links: [Link](https://api.semanticscholar.org/CorpusID:13756489) Cited by: [§1](https://arxiv.org/html/2606.26560#S1.p1.1)\.
- S\. Wang, B\. Z\. Li, M\. Khabsa, H\. Fang, and H\. Ma \(2020\) Linformer: self\-attention with linear complexity\. ArXiv abs/2006\.04768\. External Links: [Link](https://api.semanticscholar.org/CorpusID:219530577) Cited by: [§1](https://arxiv.org/html/2606.26560#S1.p2.1)\.
- Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, T\. Li, M\. Ku, K\. Wang, A\. Zhuang, R\. Fan, X\. Yue, and W\. Chen \(2024\) MMLU\-pro: a more robust and challenging multi\-task language understanding benchmark\. In The Thirty\-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, External Links: [Link](https://openreview.net/forum?id=y10DM6R2r3) Cited by: [Appendix B](https://arxiv.org/html/2606.26560#A2.p1.1)\.
- S\. Yang, J\. Kautz, and A\. Hatamizadeh \(2025\) Gated delta networks: improving mamba2 with delta rule\. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=r8H7xhYPwz) Cited by: [§1](https://arxiv.org/html/2606.26560#S1.p2.1), [§1](https://arxiv.org/html/2606.26560#S1.p3.1), [§2\.2](https://arxiv.org/html/2606.26560#S2.SS2.p2.1), [§3\.3](https://arxiv.org/html/2606.26560#S3.SS3.SSS0.Px1.p1.6), [§5](https://arxiv.org/html/2606.26560#S5.SS0.SSS0.Px1.p1.1)\.
- S\. Yang, B\. Wang, Y\. Zhang, Y\. Shen, and Y\. Kim \(2024\) Parallelizing linear transformers with the delta rule over sequence length\. In The Thirty\-eighth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=y8Rm4VNRPH) Cited by: [§2\.2](https://arxiv.org/html/2606.26560#S2.SS2.p1.4), [§5](https://arxiv.org/html/2606.26560#S5.SS0.SSS0.Px1.p1.1)\.
## Appendix A Model Configurations
The model configurations used in the evaluation are summarized in Tables [4](https://arxiv.org/html/2606.26560#A1.T4) and [5](https://arxiv.org/html/2606.26560#A1.T5)\. All evaluated models use the same vocabulary size \(248,320\); pretraining used 4096\-token sequences, and the MoE midtraining stage used 32k\-token sequences\. Training used bfloat16 precision with the AdamW optimizer\. The models use SiLU activations in the FFN/MoE blocks and RMSNorm with $\epsilon=10^{-6}$\. The hybrid models use one full\-attention Transformer layer in every four layers, placed after three linear\-attention layers\. The full\-attention layers in both the Transformer baseline and the hybrid models use Gated Attention \(Qiu et al\., [2025](https://arxiv.org/html/2606.26560#bib.bib33)\)\. For parameter alignment, the dense Transformer baseline uses 8/4/4 query/key/value heads in its full\-attention layers\.
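The 3:1 interleaving described above can be sketched as follows. This is only an illustration of the stated layout rule: the function name `hybrid_layer_pattern`, the labels `"linear"` and `"full"`, and the 8\-layer depth are hypothetical and are not taken from Table 4\.

```python
def hybrid_layer_pattern(num_layers: int, period: int = 4) -> list[str]:
    """Label each layer so that every `period`-th layer is full attention.

    With period=4, each group of four layers is three linear-attention
    layers followed by one full-attention layer.
    """
    return [
        "full" if (i + 1) % period == 0 else "linear"
        for i in range(num_layers)
    ]


# Hypothetical 8-layer stack, used only to show the pattern.
pattern = hybrid_layer_pattern(8)
print(pattern)
# ['linear', 'linear', 'linear', 'full', 'linear', 'linear', 'linear', 'full']
```

Under this rule a stack whose depth is a multiple of four contains exactly one quarter full\-attention layers\.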
Table 4: Scale\-level architecture hyperparameters\. “Layers” reports total layers with linear/full\-attention counts in parentheses\. “Attn/KV” denotes the query and key/value head counts in hybrid full\-attention layers\. “LA K/V” denotes the number of key/value heads in the linear\-attention layers, and “LA dim” denotes their per\-head dimensions\. For the MoE scale, the FFN/expert column reports the expert width\.

Table 5: Total and active parameter counts for evaluated model variants\. Dense models activate all parameters; MoE models report both total parameters and the parameters active per token\.

For parameter efficiency, variants with channel\-wise forget gates use rank\-16 \(per\-head\) low\-rank projections for the gate generator\. The EDA MoE configuration uses a rank\-16 \(per\-head\) erase\-address projection and a safe gate with lower bound $-5$\.
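To give a rough sense of why a rank\-16 per\-head projection is parameter\-efficient, the sketch below compares parameter counts for a full projection and a low\-rank factorization\. The factorization shape \(a shared down\-projection to `num_heads * rank` followed by a per\-head up\-projection\) and all dimensions are assumptions chosen for illustration; the paper's exact parameterization and sizes are not given in this section\.

```python
def full_proj_params(d_model: int, num_heads: int, head_dim: int) -> int:
    """Parameters of a dense projection d_model -> num_heads * head_dim."""
    return d_model * num_heads * head_dim


def low_rank_proj_params(
    d_model: int, num_heads: int, head_dim: int, rank: int = 16
) -> int:
    """Parameters of an assumed per-head low-rank factorization.

    down: d_model -> num_heads * rank
    up:   one (rank x head_dim) matrix per head
    """
    down = d_model * num_heads * rank
    up = num_heads * rank * head_dim
    return down + up


# Hypothetical sizes, not taken from Table 4.
d_model, num_heads, head_dim = 2048, 16, 128
full = full_proj_params(d_model, num_heads, head_dim)
low = low_rank_proj_params(d_model, num_heads, head_dim, rank=16)
print(full, low, round(low / full, 3))
```

With these illustrative sizes the low\-rank variant needs roughly an eighth of the parameters of the dense projection\.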
## Appendix B Evaluation Benchmarks
We evaluate the pretrained checkpoints on a compact set of standard language\-model benchmarks\. MMLU measures broad multitask knowledge across academic and professional subjects \(Hendrycks et al\., [2020](https://arxiv.org/html/2606.26560#bib.bib26)\), while MMLU\-Pro increases the difficulty with more challenging questions and larger answer sets \(Wang et al\., [2024](https://arxiv.org/html/2606.26560#bib.bib27)\)\. GSM8K evaluates grade\-school mathematical reasoning with word problems \(Cobbe et al\., [2021](https://arxiv.org/html/2606.26560#bib.bib28)\), and MATH evaluates more advanced competition\-style mathematical problem solving \(Hendrycks et al\., [2021](https://arxiv.org/html/2606.26560#bib.bib29)\)\. BBH covers difficult reasoning tasks selected from BIG\-Bench \(Suzgun et al\., [2023](https://arxiv.org/html/2606.26560#bib.bib30)\)\. EvalPlus evaluates code generation with stricter test cases beyond the original HumanEval/MBPP\-style checks \(Liu et al\., [2023](https://arxiv.org/html/2606.26560#bib.bib31)\)\.