# Variational Linear Attention: Stable Associative Memory for Long-Context Transformers
Source: [https://arxiv.org/html/2605.11196](https://arxiv.org/html/2605.11196)
Vishal Pandey, Independent Researcher, London, UK (pandeyvishal.mlprof@gmail.com) · Gopal Singh, Metriqual, Athens, GR (gopal@metriqual.com)
###### Abstract
Linear attention reduces the quadratic cost of softmax attention to $\mathcal{O}(T)$, but its memory state grows as $\mathcal{O}(T)$ in Frobenius norm, causing progressive interference between stored associations. We introduce Variational Linear Attention (VLA), which reframes the memory update as an online regularised least-squares problem with an adaptive penalty matrix maintained via the Sherman-Morrison rank-1 formula. We prove that normalising the write direction to unit length gives the recurrence Jacobian spectral norm exactly 1 for all sequence lengths and head dimensions (Proposition 2), and that the state norm is self-limiting under bounded inputs (Proposition 1). Empirically, VLA reduces $\|S_t\|_F$ by $109\times$ relative to standard linear attention at $T=1{,}000$, achieves near-perfect exact-match accuracy on multi-query associative recall within the effective per-head memory regime ($n_{\text{pairs}} < d_h$), maintains substantially higher retrieval performance than DeltaNet and standard linear attention under increasing memory load, and retains 62% accuracy at the per-head capacity boundary. A Triton-fused kernel achieves a $14\times$ speedup over sequential Python with $\mathcal{O}(T)$ scaling, crossing below softmax attention latency at approximately 43,000 tokens.
*Keywords:* linear attention · associative memory · Sherman-Morrison update · fast-weight programmers · long-context transformers · recursive least squares · sequence modeling
## 1 Introduction

Long-context sequence modeling has emerged as a central challenge in natural language processing, yet the dominant solutions remain unsatisfying at scale. Transformer attention [9] requires $O(T^2)$ time and $O(Td)$ memory, making deployment over sequences of $10^5$ tokens prohibitively expensive. Linear attention [5] removes the quadratic bottleneck but introduces a different failure mode: the internal memory state grows without bound, producing a Frobenius norm that scales as $O(T)$ and degrades associative retrieval accuracy at long range. We argue that this is not a computational problem but a *memory stability* problem, and we address it directly.
### 1.1 Why linear attention fails as associative memory

Linear attention maintains a running state $S_t = S_{t-1} + v_t k_t^\top$. This update is unconditional: every new key-value pair is accumulated with equal weight regardless of what $S$ already stores. Over a sequence of $T$ tokens, $\mathbb{E}[\|S_T\|_F] = O(T)$ for random inputs; we verify empirically that $\|S_T\|_F$ reaches 1,600 at $T=1{,}000$ while our method stays below 15 (Figure 2b). This unbounded growth causes stored associations to interfere with one another, degrading retrieval as context length increases.

DeltaNet [10] partially alleviates this with a gated update $S_t = \beta_t S_{t-1} + (v_t - S_{t-1}k_t)k_t^\top$, where a learned scalar gate $\beta_t \in (0,1)$ decays the full state uniformly. The scalar gate limits capacity: it forgets *all* directions at once rather than selectively retiring only the directions most recently overwritten. This distinction matters under high memory load; we show that DeltaNet degrades to near-random accuracy at $n_{\text{pairs}}=24$ while our method retains near-perfect recall (Figure 2c).
### 1.2 Variational memory geometry

We propose Variational Linear Attention (VLA), which reformulates the memory update as an online regularised least-squares problem with an adaptive penalty matrix $M_t$. Minimising the resulting objective yields the update

$$S_t = S_{t-1} + (v_t - S_{t-1}\hat{k}_t)\,\hat{\alpha}_t^\top,\qquad \hat{\alpha}_t = A_t\hat{k}_t \,/\, \|A_t\hat{k}_t\|, \tag{1}$$

where $A_t = M_t^{-1}$ is maintained exactly using the Sherman-Morrison rank-1 formula at $O(d^2)$ cost per step. Intuitively, $A_t$ accumulates outer products of penalty directions $u_t u_t^\top$, so subspaces that were recently written receive smaller updates, a matrix-valued analogue of DeltaNet's scalar gate. This per-direction selectivity is the mechanism that preserves old associations while integrating new ones.

The formulation connects linear attention with classical Recursive Least Squares (RLS) adaptive filters [4], providing both a principled derivation and a stability theory absent from prior fast-weight approaches.
### 1.3 Contributions

- **Architecture.** We introduce VLA, which replaces linear attention's unconditional accumulation with a residual-error update governed by a $d \times d$ adaptive penalty inverse $A_t$, maintained via Sherman-Morrison updates (§3).
- **Theory.** We prove two stability results (§4): (1) $\|S_T\|_F$ is self-limiting under bounded inputs (Proposition 1); and (2) the recurrence Jacobian $\partial S_t/\partial S_{t-1} = I - \hat{\alpha}_t\hat{k}_t^\top$ has spectral norm exactly 1 for all $t$, guaranteeing stable gradient flow at arbitrary depth (Proposition 2).
- **Efficiency.** We derive a parallel Blelloch-scan formulation and a fused Triton kernel that achieves a $14\times$ speedup over sequential Python and $O(T)$ scaling, crossing below softmax attention latency at approximately 43,000 tokens (§6).
- **Empirical results.** On multi-query associative recall (MQAR), VLA maintains near-perfect accuracy at $n_{\text{pairs}}=24$ (below the per-head capacity $d_h=32$) while DeltaNet collapses to 0.010 and standard linear attention to 0.074. The state norm is $100\times$ lower than linear attention's at $T=1{,}000$, confirming the stability theory empirically (§6).

Together, these results demonstrate that controlling the geometry of the memory update, rather than only its computational cost, is essential for reliable long-context performance.
## 2 Background

### 2.1 Transformer attention as associative memory

Given a sequence $\{x_t\}_{t=1}^T$, attention computes queries, keys, and values via learned projections $q_t = W_q x_t$, $k_t = W_k x_t$, $v_t = W_v x_t$, and produces

$$o_t = \sum_{s \le t} \frac{\exp(q_t^\top k_s/\sqrt{d})}{\sum_{r \le t} \exp(q_t^\top k_r/\sqrt{d})}\, v_s. \tag{2}$$

This is content-addressable retrieval: values $v_s$ are recalled in proportion to how closely their keys $k_s$ match the current query [6]. Exact computation requires $\mathcal{O}(T^2)$ operations, making it infeasible at the sequence lengths that motivate this work.
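For concreteness, a minimal NumPy sketch of this causal softmax attention for a single head follows; the function name and the per-token loop are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def causal_softmax_attention(Q, K, V):
    """Per-token causal attention, Eq. (2). Q, K, V: (T, d) arrays."""
    T, d = Q.shape
    out = np.zeros_like(V)
    for t in range(T):
        scores = Q[t] @ K[: t + 1].T / np.sqrt(d)   # attend to positions <= t
        weights = np.exp(scores - scores.max())     # numerically stable softmax
        weights /= weights.sum()
        out[t] = weights @ V[: t + 1]               # weighted sum of values
    return out

# Toy usage: T=16 tokens, head dimension d=8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
print(causal_softmax_attention(Q, K, V).shape)  # (16, 8)
```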
### 2.2 Linear attention and kernel feature maps

Katharopoulos et al. [5] replace the softmax kernel with a positive feature map $\phi$ satisfying $\exp(q^\top k) \approx \phi(q)^\top \phi(k)$, enabling the output to be written as a ratio of two recurrences:

$$S_t = S_{t-1} + v_t\,\phi(k_t)^\top,\qquad z_t = z_{t-1} + \phi(k_t),\qquad o_t = \frac{S_t\,\phi(q_t)}{\phi(q_t)^\top z_t}. \tag{3}$$

This reduces complexity to $\mathcal{O}(Td^2)$ with $O(d^2)$ state. The limitation is the additive update: every token writes to $S_t$ with equal weight regardless of what is already stored. Formally, $\mathbb{E}[\|S_T\|_F] = \mathcal{O}(T)$ for random inputs, causing interference between stored associations to grow unboundedly with sequence length. We demonstrate this empirically in Section 6; a minimal sketch of the recurrence appears below.
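As a reference point for the updates that follow, here is a minimal NumPy sketch of the linear-attention recurrence in Eq. (3) with the ELU+1 feature map; names and shapes are illustrative assumptions, not the released code.

```python
import numpy as np

def elu_plus_one(x):
    """ELU(x) + 1: a positive feature map, as used throughout the paper."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Additive-state linear attention, Eq. (3). Q, K, V: (T, d) arrays."""
    T, d = Q.shape
    S = np.zeros((d, d))        # memory state: value outer feature-of-key
    z = np.zeros(d)             # running normaliser
    out = np.zeros_like(V)
    for t in range(T):
        phi_k, phi_q = elu_plus_one(K[t]), elu_plus_one(Q[t])
        S += np.outer(V[t], phi_k)          # unconditional accumulation
        z += phi_k
        out[t] = S @ phi_q / max(phi_q @ z, 1e-4)
    return out, S

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
out, S = linear_attention(Q, K, V)
print(out.shape, np.linalg.norm(S))   # the state norm keeps growing with T
```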
Multiple feature maps have been proposed, including random Fourier features [3] and ELU+1 [5]. VLA is compatible with any positive feature map; we use ELU+1 throughout.
### 2.3 DeltaNet and fast-weight programmers

Schmidhuber [8] showed that recurrent networks can learn to program a separate fast-weight memory matrix. Schlag et al. [7] later demonstrated that linear transformers are a special case of this framework. DeltaNet [10] makes the connection explicit with an error-corrective update:

$$S_t = \beta_t S_{t-1} + \bigl(v_t - S_{t-1}k_t\bigr)k_t^\top, \tag{4}$$

where $\beta_t \in (0,1)$ is a per-step scalar gate that decays the existing state. Correcting the prediction error $v_t - S_{t-1}k_t$ reduces interference relative to pure accumulation, and the scalar gate provides soft forgetting (a one-step sketch appears below).
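A minimal sketch of one DeltaNet step under Eq. (4); the scalar gate is fixed here for illustration rather than learned, as in the actual model.

```python
import numpy as np

def deltanet_step(S, k, v, beta=0.9):
    """One DeltaNet update, Eq. (4): scalar-gated decay plus residual-error write."""
    error = v - S @ k                 # prediction residual for the current key
    return beta * S + np.outer(error, k)

rng = np.random.default_rng(0)
S = np.zeros((8, 8))
for _ in range(100):
    k = rng.standard_normal(8); k /= np.linalg.norm(k)
    v = rng.standard_normal(8)
    S = deltanet_step(S, k, v)
print(np.linalg.norm(S))  # bounded, but every direction is decayed by beta
```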
The key limitation of the scalar gate is that it scales all directions of $S_t$ equally: it cannot preferentially forget directions that were recently overwritten while preserving directions that store older, stable associations. VLA replaces the scalar gate with a $d \times d$ matrix $A_t$, enabling direction-selective memory updates (§3).
### 2.4 Recursive least squares

Recursive Least Squares [4] maintains an estimate $S_t$ that minimises the accumulated squared prediction error:

$$S_t = \operatorname*{arg\,min}_S \sum_{s=1}^{t} \|v_s - S k_s\|^2 + \operatorname{tr}(S M_t S^\top). \tag{5}$$

The optimal update uses the Sherman-Morrison formula to maintain the inverse covariance $A_t = M_t^{-1}$ exactly:

$$A_t = A_{t-1} - \frac{A_{t-1}u_t u_t^\top A_{t-1}}{1 + u_t^\top A_{t-1}u_t},\qquad S_t = S_{t-1} + \bigl(v_t - S_{t-1}k_t\bigr)(A_t k_t)^\top, \tag{6}$$

at $\mathcal{O}(d^2)$ per step with no matrix inversion. The inverse covariance $A_t$ contracts in directions that have accumulated large penalty, automatically regulating which subspaces receive large updates. In §3, we instantiate this framework within a multi-head attention layer to derive VLA; a numerical check of the Sherman-Morrison recursion appears below.
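A short NumPy check, under the assumption $u_t = k_t$ with unit-norm directions, that the rank-1 Sherman-Morrison recursion in Eq. (6) tracks the explicit inverse $(\lambda_0 I + \sum_s u_s u_s^\top)^{-1}$ without calling a matrix-inverse routine inside the loop:

```python
import numpy as np

d, T, lam0 = 8, 50, 0.1
rng = np.random.default_rng(0)

A = np.eye(d) / lam0                 # A_0 = (lambda_0 I)^{-1}
M = lam0 * np.eye(d)                 # explicit penalty matrix, for comparison only
for _ in range(T):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    z = A @ u
    A -= np.outer(z, z) / (1.0 + u @ z)   # Sherman-Morrison rank-1 downdate
    M += np.outer(u, u)                   # accumulate the penalty directly

print(np.max(np.abs(A - np.linalg.inv(M))))  # agrees up to floating-point error
```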
## 3 Variational Linear Attention

We propose VLA by framing the linear attention memory state as the solution to an online regularised least-squares problem, then deriving its exact recursive update via the Sherman-Morrison formula.
### 3.1 Problem formulation

Let $\{x_t\}_{t=1}^T$ be an input sequence with per-head projections $k_t = W_k x_t$, $v_t = W_v x_t$, $q_t = W_q x_t \in \mathbb{R}^{d_h}$. We seek a memory matrix $S_t \in \mathbb{R}^{d_h \times d_h}$ that minimises the penalised prediction error over all tokens seen so far:

$$S_t^* = \operatorname*{arg\,min}_S \;\sum_{s=1}^{t} \bigl\|v_s - S\hat{k}_s\bigr\|^2 + \operatorname{tr}\bigl(S M_t S^\top\bigr), \tag{7}$$

where $\hat{k}_s = k_s/\|k_s\|$ and $M_t \succ 0$ is a time-varying penalty matrix that encodes the geometry of previously seen keys. The trace term penalises $S$ in directions where $M_t$ is large, giving the model direct control over which memory subspaces are protected from overwriting.
### 3.2 Variational penalty geometry

We define $M_t$ as a running sum of rank-1 outer products of learned penalty directions:

$$M_t = \lambda_0 I + \sum_{s=1}^{t} u_s u_s^\top,\qquad u_s = \operatorname{L2\text{-}norm}\bigl(f_\theta(k_s)\bigr), \tag{8}$$

where $f_\theta$ is a learned linear projection applied to the raw key before the feature map, and $\lambda_0 > 0$ is the initialisation regulariser (we use $\lambda_0 = 0.1$, giving $A_0 = 10I$). Because $u_s$ is unit-normalised, each rank-1 update depletes $A_t = M_t^{-1}$ in exactly one direction by a bounded amount. Subspaces that accumulate large penalty mass receive smaller future updates; untouched subspaces remain at full magnitude. This is the mechanism by which VLA writes new associations into directions the current state has not yet exploited.
### 3.3 Recursive update rule

The solution to (7) can be maintained exactly in $\mathcal{O}(d_h^2)$ per step. Applying the Sherman-Morrison formula to (8):

$$z_t = A_{t-1}u_t,\qquad \delta_t = 1 + u_t^\top z_t,\qquad A_t = A_{t-1} - \frac{z_t z_t^\top}{\delta_t}. \tag{9}$$

Since $A_{t-1} \succ 0$, we have $\delta_t \ge 1$ always; the division is unconditionally safe. With $A_t$ in hand, the memory update follows:

$$e_t = v_t - S_{t-1}\hat{k}_t,\qquad \hat{\alpha}_t = \frac{A_t\hat{k}_t}{\|A_t\hat{k}_t\|},\qquad S_t = S_{t-1} + e_t\,\hat{\alpha}_t^\top. \tag{10}$$

The normalisation of both $\hat{k}_t$ and $\hat{\alpha}_t$ to unit vectors is essential: we prove in §4 (Proposition 2) that the Jacobian $\partial S_t/\partial S_{t-1} = I - \hat{\alpha}_t\hat{k}_t^\top$ has spectral norm exactly 1 when both are unit-normalised, guaranteeing that gradients neither explode nor vanish through the recurrence. Without this normalisation, the spectral norm grows as $d_h/\lambda_0$; at $d_h = 96$ this produces gradient magnification of $\sim 10^{32}$ after 25 steps, consistent with the NaN losses we observed in earlier ablations. The output at each position is

$$o_t = \frac{S_t\,\phi(q_t)}{\max\bigl(\phi(q_t)^\top z_t^{\mathrm{key}},\;\varepsilon\bigr)},\qquad z_t^{\mathrm{key}} = \sum_{s \le t}\phi(k_s), \tag{11}$$

where $\phi = \mathrm{ELU}(\cdot) + 1$ is the standard linear-attention feature map and $\varepsilon = 10^{-4}$ prevents division by zero. Equations (9)–(11) constitute a complete attention head with no matrix inversions; a step-by-step sketch follows.
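The following NumPy sketch strings Eqs. (9)–(11) together into a single per-token step. It is a minimal reading of the equations rather than the released implementation; in particular we assume $u_t = \hat{k}_t$ in place of the learned projection $f_\theta$.

```python
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def vla_step(S, A, z_key, k, v, q, eps=1e-4):
    """One VLA head step following Eqs. (9)-(11).

    S: (d, d) memory state, A: (d, d) inverse penalty, z_key: (d,) normaliser.
    Assumption: the penalty direction u_t equals the normalised key."""
    k_hat = k / np.linalg.norm(k)
    u = k_hat                                   # assumption: f_theta = identity
    z = A @ u
    A = A - np.outer(z, z) / (1.0 + u @ z)      # Eq. (9): Sherman-Morrison update
    e = v - S @ k_hat                           # Eq. (10): prediction residual
    alpha = A @ k_hat
    alpha /= np.linalg.norm(alpha)              # unit-norm write direction
    S = S + np.outer(e, alpha)
    phi_k, phi_q = elu_plus_one(k), elu_plus_one(q)
    z_key = z_key + phi_k
    o = S @ phi_q / max(phi_q @ z_key, eps)     # Eq. (11): output
    return S, A, z_key, o

d, lam0 = 32, 0.1
S, A, z_key = np.zeros((d, d)), np.eye(d) / lam0, np.zeros(d)
rng = np.random.default_rng(0)
for _ in range(200):
    k, v, q = rng.standard_normal((3, d))
    S, A, z_key, o = vla_step(S, A, z_key, k, v, q)
print(np.linalg.norm(S))   # state norm; compare with the additive update's growth
```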
### 3.4 Parallel formulation

The $S_t$ recurrence (10) is a *linear recurrence*:

$$S_t = F_t S_{t-1} + G_t,\qquad F_t = I - \hat{\alpha}_t\hat{k}_t^\top,\qquad G_t = e_t\,\hat{\alpha}_t^\top. \tag{12}$$

The pair $(F, G)$ is associative under the composition $(F_r, G_r)\circ(F_l, G_l) = (F_r F_l,\; F_r G_l + G_r)$, enabling a Blelloch parallel prefix scan [2] in $\mathcal{O}(\log T)$ parallel steps with $\mathcal{O}(T)$ total work (a small sketch of the combine operator follows). The $A_t$ loop has a data-dependent denominator $\delta_t$ that prevents direct parallelism; we instead fuse all $T$ steps into a single Triton kernel, eliminating the per-token kernel-dispatch overhead that dominates latency in naive Python implementations. The resulting VLA-Triton kernel achieves a $14\times$ speedup over sequential Python at $T = 4096$ (§6).
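To make the composition rule concrete, a small NumPy check that the stated combine $(F_r,G_r)\circ(F_l,G_l)=(F_rF_l,\;F_rG_l+G_r)$ reproduces sequential evaluation of a generic linear recurrence $S_t = F_tS_{t-1}+G_t$; the $F_t, G_t$ here are random placeholders, with the VLA-specific terms substituted in the actual kernel.

```python
import numpy as np

def combine(left, right):
    """Associative composition of two linear-recurrence segments."""
    F_l, G_l = left
    F_r, G_r = right
    return F_r @ F_l, F_r @ G_l + G_r

rng = np.random.default_rng(0)
d, T = 4, 8
Fs = rng.standard_normal((T, d, d)) * 0.3
Gs = rng.standard_normal((T, d, d))

# Sequential evaluation of S_t = F_t S_{t-1} + G_t starting from S_0 = 0.
S = np.zeros((d, d))
for F, G in zip(Fs, Gs):
    S = F @ S + G

# Tree-style reduction with the associative combine (a Blelloch scan applies
# the same operator in O(log T) parallel steps).
segments = list(zip(Fs, Gs))
while len(segments) > 1:
    segments = [combine(segments[i], segments[i + 1])
                for i in range(0, len(segments), 2)]
F_total, G_total = segments[0]
print(np.allclose(F_total @ np.zeros((d, d)) + G_total, S))  # True
```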
### 3.5 Complexity and relationship to prior work

VLA has identical $\mathcal{O}(Td_h^2)$ time and $\mathcal{O}(d_h^2)$ memory to standard linear attention and DeltaNet; the constant factor includes five additional operations per token for the SM update. Table 1 summarises the model family.

Table 1: Complexity comparison per attention head. $T$ = sequence length, $d_h$ = head dimension ($d_h = d/H$). *Gate* is the per-step forgetting mechanism. VLA and linear attention share $O(Td_h^2)$ asymptotic time; VLA's constant factor is $\approx 5\times$ larger due to the Sherman-Morrison update (mitigated by the Triton kernel, §3). †Softmax attention's KV-cache grows with $T$; at $T = 100\text{K}$, $d = 4096$, $L = 32$ layers this is $\approx 52$ GB. All other models use $O(d_h^2)$ fixed state independent of $T$.

VLA strictly generalises both prior models. Setting $u_t = 0$ and fixing $A_t = \lambda_0^{-1}I$ collapses (9)–(10) to standard linear attention (additive accumulation, no geometry). Replacing $\hat{\alpha}_t$ with a scalar gate and discarding $A_t$ recovers DeltaNet. The single architectural departure is the $d_h \times d_h$ matrix $A_t$, which provides per-direction selectivity: it can simultaneously protect the subspace encoding an old association and open a fresh direction for a new one. A scalar gate must trade one off against the other. This distinction underlies VLA's advantage under high memory load, which we demonstrate empirically in §6.
## 4 Theoretical Properties

We establish two core properties of VLA: (1) the memory state norm is self-limiting, and (2) the recurrence Jacobian has unit spectral norm, guaranteeing stable gradient flow. We then derive capacity and long-context behaviour as corollaries.
### 4.1 Bounded state dynamics

###### Proposition 1 (Bounded State Growth).

Let $\|v_t\| \le C_v$ for all $t$. Under the VLAv3 update $S_t = S_{t-1} + e_t\,\hat{\alpha}_t^\top$ with $\|\hat{\alpha}_t\| = 1$, the Frobenius norm satisfies

$$\|S_t\|_F \;\le\; \|S_0\|_F + \sum_{s=1}^{t}\|e_s\|, \tag{13}$$

and $\|S_t\|_F$ converges to a finite plateau when inputs are bounded.
###### Proof.

Since $\hat{\alpha}_t$ is unit-normalised, the update has Frobenius norm $\|e_t\hat{\alpha}_t^\top\|_F = \|e_t\|\,\|\hat{\alpha}_t\| = \|e_t\|$. By the triangle inequality, $\|S_t\|_F \le \|S_{t-1}\|_F + \|e_t\|$, which telescopes to (13). As $S_t$ increasingly fits the stored associations, the prediction residual $e_t = v_t - S_{t-1}\hat{k}_t$ decreases toward zero by the Widrow-Hoff convergence property of the LMS filter [4]: once $S_{t-1}\hat{k}_s \approx v_s$ for observed keys, subsequent updates are near-zero and $\|S_t\|_F$ plateaus. ∎
**Contrast with standard linear attention.** Under the additive update $S_t = S_{t-1} + v_t k_t^\top$, the norm satisfies $\|S_t\|_F \le \|S_0\|_F + \sum_{s=1}^{t}\|v_s\|\,\|k_s\|$, which grows as $\mathcal{O}(T)$ for bounded inputs; there is no vanishing residual to arrest accumulation. We verify this empirically in Figure 2(b): at $T = 1{,}000$, standard linear attention reaches $\|S\|_F \approx 1{,}600$ while VLAv3 remains below 15. A small simulation sketch of this comparison follows.
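A minimal NumPy simulation of this contrast on random unit-norm inputs, assuming $u_t = \hat{k}_t$ for the penalty direction; it only prints the two state-norm trajectories, and exact magnitudes depend on the input distribution and feature map, so this is a qualitative illustration rather than a reproduction of Figure 2(b).

```python
import numpy as np

d, T, lam0 = 32, 1000, 0.1
rng = np.random.default_rng(0)

S_lin = np.zeros((d, d))                 # additive linear-attention state
S_vla = np.zeros((d, d))                 # VLA state
A = np.eye(d) / lam0                     # VLA inverse penalty

for t in range(1, T + 1):
    k = rng.standard_normal(d); k /= np.linalg.norm(k)
    v = rng.standard_normal(d); v /= np.linalg.norm(v)
    # additive update: unconditional accumulation
    S_lin += np.outer(v, k)
    # VLA update, Eqs. (9)-(10), with u_t = k_hat_t
    z = A @ k
    A -= np.outer(z, z) / (1.0 + k @ z)
    e = v - S_vla @ k
    alpha = A @ k; alpha /= np.linalg.norm(alpha)
    S_vla += np.outer(e, alpha)
    if t in (100, 500, 1000):
        print(t, np.linalg.norm(S_lin), np.linalg.norm(S_vla))
```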
### 4.2 Unit Jacobian spectral norm

###### Proposition 2 (Unit Jacobian).

Let $\hat{k}_t, \hat{\alpha}_t \in \mathbb{R}^{d_h}$ be unit vectors. The Jacobian of the $S_t$ recurrence with respect to $S_{t-1}$ is

$$J_t = \frac{\partial S_t}{\partial S_{t-1}} = I - \hat{\alpha}_t\hat{k}_t^\top. \tag{14}$$

This matrix satisfies $\|J_t\|_2 = 1$ for all $\hat{k}_t$, $\hat{\alpha}_t$.
###### Proof.

$J_t = I - \hat{\alpha}_t\hat{k}_t^\top$ is a rank-1 perturbation of the identity. Its eigenvalues are 1 with multiplicity $d_h - 1$ (on the $(d_h - 1)$-dimensional complement of $\hat{k}_t$) and $1 - \hat{k}_t^\top\hat{\alpha}_t$ with eigenvector $\hat{\alpha}_t$. Since $\|\hat{k}_t\| = \|\hat{\alpha}_t\| = 1$, Cauchy-Schwarz gives $|\hat{k}_t^\top\hat{\alpha}_t| \le 1$, so the latter eigenvalue has modulus in $[0, 2]$. For the rank-1 matrix $P = \hat{\alpha}_t\hat{k}_t^\top$ with $\|P\|_2 = \|\hat{\alpha}_t\|\,\|\hat{k}_t\| = 1$, the generic bound is $\|I - P\|_2 \le 1 + \|P\|_2 = 2$; the singular values of $I - P$ satisfy $\sigma_{\max}(I - P) = 1$ when both vectors are unit-normalised (the update is a non-expansive projection step). (Footnote 1: Formally, $\|J_t x\|^2 = \|x\|^2 - 2(\hat{k}_t^\top x)(\hat{\alpha}_t^\top x) + (\hat{k}_t^\top x)^2$; maximising over $\|x\| = 1$ and applying Cauchy-Schwarz gives $\sigma_{\max}^2 = 1$.) ∎
###### Corollary 1 (Stable gradient flow).

The gradient of a scalar loss $\mathcal{L}$ through the full $T$-step recurrence satisfies $\|\partial\mathcal{L}/\partial S_0\|_F \le \|\partial\mathcal{L}/\partial S_T\|_F$. Gradients neither explode nor vanish.
**Empirical confirmation.** Without normalisation ($\|\hat{k}_t\|, \|\hat{\alpha}_t\| \ne 1$), the Jacobian spectral norm grows as $d_h/\lambda_0$: at $d_h = 96$, $\|J_t\|_2 \approx 20.5$, producing gradient magnification of $\approx 10^{32}$ after 25 steps. This explains the NaN losses observed when running unnormalised VLA at $d_h \ge 96$. Table 2 reports empirically measured spectral norms for both formulations.

Table 2: Jacobian spectral norm $\|J_t\|_2$ under unnormalised (VLAv2) and normalised (VLAv3) formulations. Gradient magnification after $T = 25$ steps is $\|J_t\|_2^{25}$.
### 4.3 Associative memory capacity

Under bounded inputs and exact arithmetic, the RLS objective (7) recovers stored associations exactly when the keys are linearly independent:

###### Proposition 3 (Exact Recovery).

Let $\{\hat{k}_i\}_{i=1}^n$ be linearly independent with $n \le d_h$. After processing all $n$ pairs, $S_n\hat{k}_i = v_i$ for all $i \le n$.

The per-head capacity is therefore $d_h$ associations. VLA does not increase this bound relative to linear attention or DeltaNet; all three have a $d_h \times d_h$ state matrix. The advantage of VLA is that it *maintains* this capacity under sequential overwriting: because $A_t$ routes new writes into directions orthogonal to recently used subspaces, old associations are not uniformly diluted as they are under additive accumulation. This explains the empirical result in Figure 2(c): at $n_{\text{pairs}} = 24 < d_h = 32$, VLA retains perfect recall while DeltaNet collapses to 0.010.
### 4.4 Long-context behaviour and reduction to linear attention

Propositions 1 and 2 together imply that VLA can process sequences of arbitrary length $T$ with $\mathcal{O}(d_h^2)$ fixed memory and stable gradients, in contrast to the $\mathcal{O}(Td_h)$ KV-cache of softmax attention and the diverging state of standard linear attention. As $T \to \infty$, $A_t$ continues shrinking along directions that receive repeated penalty mass, causing update magnitudes $\|e_t\hat{\alpha}_t^\top\|_F$ to diminish; the state stabilises rather than drifting.
## 5 Experimental Setup

We compare VLA against three baselines within a shared Transformer backbone to isolate the effect of the attention mechanism.

### 5.1 Models

Four attention mechanisms are evaluated under an identical two-layer Transformer (see Table 3): **Softmax attention** [9] ($\mathcal{O}(T^2)$ baseline); **Linear attention** [5] with ELU+1 feature map ($\mathcal{O}(T)$, additive accumulation); **DeltaNet** [10] with residual error and scalar gate ($\mathcal{O}(T)$, scalar forgetting); and **VLA** (this work, $\mathcal{O}(T)$, matrix-valued adaptive gate $A_t$). Model definitions and equations for the baselines appear in §2; VLA's update rule is defined in §3. All other components (residual connections, layer normalisation, FFN, token embedding, weight tying) are shared identically across the four models.
### 5.2 Tasks

#### Copy task:

The model receives a length-$T$ sequence and must reproduce the second half after a separator token. All models solve this task within 200 steps; it serves as a training-stability sanity check.

#### Multi-Query Associative Recall (MQAR):

Following Arora et al. [1], we construct sequences of the form shown in Figure 1. The context contains $n$ key-value pairs; the query section presents the $n$ keys in shuffled order and the model must output the matching value for each. We evaluate two variants: (a) *capacity curve*: fixed $T = 3n+1$, varying $n \in \{4, 8, 16, 24, 32, 48, 64, 96\}$, to measure how accuracy degrades past the per-head capacity $d_h = 32$; and (b) *long-context*: fixed $n = 8$, varying $T \in \{64, 128, 256, 512\}$, to measure retention over longer sequences. A sketch of the sequence construction appears after Figure 1.
Figure 1: MQAR task structure. The context encodes $n$ key-value pairs ($k_1 v_1 \cdots k_n v_n$, $2n$ tokens), followed by a SEP token and a query section ($n$ tokens) presenting the keys in a shuffled permutation $\sigma$; the model must retrieve the corresponding value at each query position. Loss is computed only at query positions.
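A minimal sketch of how such sequences can be generated; the vocabulary split and token layout are our reading of the task description, not the authors' released generator.

```python
import numpy as np

def make_mqar_example(n_pairs, vocab=128, rng=None):
    """Build one MQAR sequence: k1 v1 ... kn vn SEP k_sigma(1) ..., plus targets."""
    rng = rng or np.random.default_rng()
    sep = vocab - 1                                   # assumption: last id is SEP
    keys = rng.choice(vocab // 2 - 1, size=n_pairs, replace=False)       # key ids
    values = rng.choice(np.arange(vocab // 2, vocab - 1), size=n_pairs)  # value ids
    context = np.empty(2 * n_pairs, dtype=np.int64)
    context[0::2], context[1::2] = keys, values
    perm = rng.permutation(n_pairs)                   # shuffled query order
    sequence = np.concatenate([context, [sep], keys[perm]])
    targets = values[perm]                            # labels at query positions only
    return sequence, targets

seq, tgt = make_mqar_example(8, rng=np.random.default_rng(0))
print(len(seq), tgt)   # 2*8 + 1 + 8 = 25 tokens
```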
### 5.3 Metrics

For the within-capacity MQAR regime ($n_{\text{pairs}} \le 24$), we report mean $\pm$ std over three random seeds $\{42, 123, 999\}$. For the overload regime ($n \in \{32, 48\}$), compute constraints limited us to seed 42, so we report single-seed values explicitly and mark them as such. For stability, we report $\|S_t\|_F$ and $\|A_t\|_F$ as functions of sequence position at inference time. For efficiency, we measure forward-pass latency (ms) and throughput (tokens/s) with $T \in \{128, 256, 512, 1024, 2048\}$.
### 5.4 Implementation and training

Table 3 summarises hyperparameters. The VLA-specific settings are $\lambda_0 = 0.1$ (so $A_0 = 10I$), periodic identity refresh every 20 steps with magnitude $10^{-3}$, and $\varepsilon = 10^{-4}$ for all denominators. All four models use AdamW with cosine learning-rate decay and identical gradient clipping.

Table 3: Hyperparameters shared across all models. VLA-specific settings appear in the bottom block.
## 6 Results

Figure 2 summarises all four experiments. We report exact-match accuracy for MQAR, $\|S_t\|_F$ for stability, forward-pass latency for scaling, and mean $\pm$ std over three seeds (Appendix B gives per-seed breakdowns).

Figure 2: Experimental evaluation of VLA against three baselines. (a) Forward latency (log-log): VLA-Python scales as $O(T)$ but carries a higher constant than standard linear attention due to the SM update; see §6.1 for the Triton comparison. (b) Memory state norm: $\|S_t\|_F$ grows as $O(T)$ for linear attention (1,600 at $T = 1{,}000$); VLA remains below 15 throughout. (c) MQAR capacity: at $n_{\text{pairs}} = 24 < d_h = 32$, VLA retains 1.000 exact-match while DeltaNet drops to 0.010 and linear attention to 0.08. (d) Long-context MQAR ($n = 8$, varying $T$): VLA is flat at 1.000; all baselines plateau below 0.16.

### 6.1 Scaling behaviour

Figure 3: Forward latency (left) and throughput (right) vs. sequence length. All linear-time models ($O(T)$) are distinguished from softmax ($O(T^2)$) by slope on the log-log plot. VLA-Python carries a $\sim 3\times$ constant overhead vs. standard linear attention due to the SM update loop; this is eliminated by the Triton kernel (§3), which achieves a $14\times$ speedup over VLA-Python and crosses below softmax latency at $\sim 43$K tokens.

Figure 3 plots forward-pass latency across $T \in \{128, 256, 512, 1024, 2048\}$. Softmax attention ($O(T^2)$) is fastest at $T = 128$ due to CUDA-optimised kernels, but its slope steepens distinctly on the log-log axes. VLA-Python, linear attention, and DeltaNet all maintain $O(T)$ slopes, confirming linear-time scaling. VLA-Python is $\sim 3\times$ slower than linear attention and DeltaNet at matched $T$ because the Sherman-Morrison update requires five additional batched matrix-vector operations per token beyond DeltaNet's three. This overhead is not inherent to the algorithm: the Triton-fused kernel (§3) fuses all $T$ SM steps into a single GPU kernel launch, achieving a $14\times$ speedup over VLA-Python at $T = 4{,}096$.
### 6.2 Stability analysis

Figure 4: Left: $\|S_t\|_F$ over $T = 1{,}000$ tokens. Linear attention grows linearly (1,630 at $t = 1{,}000$); both DeltaNet and VLA remain bounded (VLA below 15). Right: $\|A_t\|_F$ for VLA only, showing exponential decay from $\sim 56$ to $\sim 10$ as the penalty matrix accumulates mass. The decaying $A_t$ norm is the mechanistic signature of the SM update: directions that receive repeated penalty become progressively less influential.

Figure 4 tracks state norms at inference on random inputs. Standard linear attention reaches $\|S_{1000}\|_F \approx 1{,}630$, growing at a constant rate of $\sim 1.6$ per step, consistent with the $O(T)$ bound derived in §4. VLA stays below 15 throughout, a $\sim 110\times$ reduction, matching Proposition 1. DeltaNet is also bounded (its scalar gate prevents unbounded growth), but its curve is indistinguishable from VLA's in Figure 4 because both remain near zero on the shared scale; we do not overstate this result.

The right panel shows VLA's $\|A_t\|_F$ decaying from $\sim 56$ to $\sim 10$ over 1,000 steps. This is the mechanism of Proposition 1 made visible: as $A_t$ accumulates penalty mass, update magnitudes $\|e_t\hat{\alpha}_t^\top\|_F = \|e_t\|$ shrink because $e_t = v_t - S_{t-1}\hat{k}_t$ decreases as $S_t$ fits stored associations. The decreasing $\|A_t\|_F$ confirms the model is actively suppressing redundant writes.
### 6.3 Copy task

All four models reach 100% accuracy on the copy task by step $\sim 150$ (Figure 7 in Appendix C). Training loss curves are indistinguishable across models, confirming that optimisation is stable and that all implementations are correct. Differences observed in subsequent MQAR experiments are therefore attributable to the attention mechanism, not to training instability.
### 6.4 MQAR capacity curve

Figure 5: MQAR eval accuracy vs. $n_{\text{pairs}}$ (mean over 3 seeds). All experiments use $n_{\text{pairs}} \le 24 < d_h = 32$; the regime beyond capacity ($n > d_h$) is reported in Table 4 (§6.6). At $n = 24$, VLA retains 1.000 while DeltaNet drops to 0.010 and softmax/linear attention plateau near 0.07–0.08.

Figure 5 shows eval accuracy as a function of $n_{\text{pairs}}$. At $n = 4$, VLA reaches 1.000 and DeltaNet 0.97, while softmax and linear attention reach 0.26 and 0.27. As $n$ increases, DeltaNet declines sharply (0.73 at $n = 8$, 0.10 at $n = 12$, 0.010 at $n = 24$) while VLA remains flat at 1.000. Linear attention and softmax both plateau near 0.08–0.15, above the random baseline of $1/128 \approx 0.008$ only because the cross-entropy head exploits embedding geometry.

Two qualifications apply. First, all tested values satisfy $n_{\text{pairs}} \le 24 < d_h = 32$; by Proposition 3, VLA stores up to $d_h$ associations without interference, so 1.000 accuracy in this regime is mathematically expected: it confirms that the implementation is correct, not that VLA solves arbitrarily large recall. The overload regime ($n > d_h$) is reported in §6.6. Second, the DeltaNet collapse is the substantive finding: despite a residual-error update, the scalar gate cannot protect old associations when recent writes dominate the same directions. VLA's $d_h \times d_h$ matrix gate routes new writes orthogonally, preserving all prior associations up to the capacity boundary.
### 6.5 Long-context MQAR

Figure 6: MQAR eval accuracy with $n = 8$ pairs and increasing $T \in \{64, 128, 256, 512\}$. VLA maintains 1.000 at all sequence lengths. Linear attention and softmax attention plateau at $\approx 0.14$–$0.15$; DeltaNet collapses to $\approx 0.01$ at all lengths tested.

Figure 6 holds $n = 8 < d_h = 32$ fixed and increases $T$. VLA is flat at 1.000 from $T = 64$ to $T = 512$. Linear attention and softmax attention are flat at $\approx 0.14$–$0.15$, well above random ($1/128 \approx 0.008$) but well below VLA. DeltaNet is near 0.01 at all sequence lengths, which reveals a model-specific failure: DeltaNet's scalar gate decays older associations before the query section arrives, so stored pairs are partially forgotten regardless of sequence length. VLA's bounded state (Proposition 1) and unit-Jacobian training stability (Proposition 2) jointly explain the flat 1.000: no association is overwritten before its query appears, and the model trains without gradient instability at all $T$.
### 6.6 Ablation studies

Table 4 reports two additional experiments.

#### Overload ($n > d_h$):

We extend the capacity curve beyond the theoretical boundary $d_h = 32$ to $n \in \{32, 48, 64, 96\}$. Accuracy at each operating point is reported in the "Overload" column of Table 4. All models degrade past $n = d_h$; the question is *how*. VLA degrades more gradually than linear attention because $A_t$ preferentially overwrites recently written directions, partially preserving older associations through the overload regime.
#### Component ablation:

We remove individual VLA components and report MQAR accuracy at $n = 16$. Removing key normalisation ($\hat{k}_t \leftarrow k_t$) produces NaN losses for $d \ge 96$ (Jacobian explosion, cf. Table 2). Fixing $A_t = 10I$ (no SM update) reduces accuracy to match standard linear attention, confirming that the penalty geometry, not just the residual error, is the source of the improvement. Full results appear in Table 4.

Table 4: Ablation results at $n_{\text{pairs}} = 16$, $d = 128$, $H = 4$. *MQAR* column: eval accuracy at $n = 16$ ($0.5\times$ capacity). *Overload* column: eval accuracy at $n = 48$ ($1.5\times$ capacity). All runs use seed 42, 1,000 steps. †single seed; ∗predicted from Remark 1 (not independently run).

| Variant | MQAR $n=16$ | Overload $n=48$ | vs. VLA gap | What this tests |
|---|---|---|---|---|
| VLA (full) | 0.990† | 0.044† | — | full model |
| VLA, $A_t = 10I$ fixed | $\approx 0.091$∗ | $\approx 0.043$∗ | $-0.899$ | SM update necessary |
| VLA, no $\hat{k}$ norm | NaN ($d \ge 96$) | — | — | normalisation necessary |
| DeltaNet [10] | 0.009† | 0.008† | $-0.981$ | scalar gate vs. matrix |
| Linear attention [5] | 0.091† | 0.043† | $-0.899$ | residual update necessary |
| Random baseline | $1/128 \approx 0.008$ | — | — | — |

∗Fixing $A_t = 10I$ removes the SM update; by Remark 1 this collapses VLA to normalised linear attention with residual correction, which performs identically to standard linear attention on this task. Confirmed by proxy: linear attention accuracy at $n = 16$ is 0.091 (measured). †Single seed due to compute constraints; multi-seed results for the full VLA at $n \le 24$ appear in Table 12. The NaN at $d \ge 96$ without normalisation follows analytically from Table 2 and Proposition 2.
## 7 Analysis

### 7.1 Why VLA maintains stable memory: the key-space whitening view

The stability of VLA has a geometric interpretation that goes beyond the Widrow-Hoff convergence argument in Proposition 1. Consider the inverse penalty matrix after $t$ steps:

$$A_t \;=\; \Bigl(\lambda_0 I + \sum_{s=1}^{t} u_s u_s^\top\Bigr)^{-1}. \tag{15}$$

This is the inverse of a running covariance of penalty directions $\{u_s\}$. When applied to a new key $\hat{k}_t$, the vector $A_t\hat{k}_t$ is small in directions where $A_t$ has contracted (i.e., directions frequently seen by the penalty) and large in directions $A_t$ has not yet depleted. In the special case where $u_s = \hat{k}_s$ (penalty direction aligned with the key), $A_t$ approximates a whitening transform of the observed key distribution: $A_t \approx (\hat{K}_{t-1}\hat{K}_{t-1}^\top + \lambda_0 I)^{-1}$. Under this regime, if $\hat{k}_t$ lies in the span of previously seen keys, $A_t\hat{k}_t$ is small and the $S$ update is suppressed; if $\hat{k}_t$ introduces a genuinely new direction, $A_t\hat{k}_t$ is large and the update proceeds at full magnitude. This is the mechanism that prevents VLA from overwriting old associations when new keys are correlated with stored ones; the sketch below makes the effect concrete.
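A small NumPy illustration of this suppression under the $u_s = \hat{k}_s$ special case: after several keys confined to a subspace, $\|A_t\hat{k}\|$ is much smaller for an in-span key than for a novel orthogonal one. The subspace and values are illustrative, not from the paper.

```python
import numpy as np

d, lam0 = 8, 0.1
rng = np.random.default_rng(0)
A = np.eye(d) / lam0

# Write several keys confined to the span of the first two coordinates.
for _ in range(20):
    k = np.zeros(d)
    k[:2] = rng.standard_normal(2)
    k /= np.linalg.norm(k)
    z = A @ k
    A -= np.outer(z, z) / (1.0 + k @ z)          # Sherman-Morrison, u_s = k_hat_s

in_span = np.zeros(d); in_span[:2] = 1.0; in_span /= np.linalg.norm(in_span)
novel = np.zeros(d); novel[5] = 1.0
print(np.linalg.norm(A @ in_span), np.linalg.norm(A @ novel))
# in-span key: heavily suppressed; novel key: full magnitude (1/lam0 = 10)
```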
Additive linear attention has no such mechanism: every update adds $v_t\hat{k}_t^\top$ at unit scale regardless of overlap with prior keys, causing the progressive dilution visible in Figure 4. DeltaNet applies a scalar gate that decays the entire state uniformly: it suppresses all directions equally when it forgets, rather than preserving directions that have not been recently overwritten.
### 7.2 Interference mitigation and approximate orthogonalisation

The whitening view implies a form of approximate key orthogonalisation. In classical linear regression, the RLS update with inverse covariance $A_t$ corresponds to projecting each new key onto the subspace *not yet spanned* by prior keys. For two correlated keys $\hat{k}_1, \hat{k}_2$ with $\hat{k}_1^\top\hat{k}_2 = \rho$, the effective write direction for $\hat{k}_2$ under $A_1$ is $A_1\hat{k}_2 \propto \hat{k}_2 - \rho\hat{k}_1$: the component of $\hat{k}_2$ orthogonal to $\hat{k}_1$. The stored association for $\hat{k}_1$ is therefore not diluted by the $\hat{k}_2$ update.

This approximate orthogonalisation explains the empirical result in Figure 5: at $n_{\text{pairs}} = 24$, all $n < d_h = 32$ associations coexist in the state without interference, giving VLA 1.000 exact-match while DeltaNet collapses to 0.010. DeltaNet's scalar gate cannot achieve this: it decays all directions simultaneously and has no mechanism to route new writes into directions orthogonal to previously stored ones. The sketch below demonstrates the two-key case numerically.
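A two-key NumPy check of this claim; with a small $\lambda_0$, $A_1\hat{k}_2$ is close to (though not exactly) $\hat{k}_2 - \rho\hat{k}_1$, so the retrieval $S\hat{k}_1$ is essentially unchanged by the second write under the VLA-style update, whereas the additive update shifts it by roughly $\rho\, v_2$.

```python
import numpy as np

d, lam0, rho = 8, 0.01, 0.8
rng = np.random.default_rng(0)
k1 = np.eye(d)[0]
k2 = rho * k1 + np.sqrt(1 - rho**2) * np.eye(d)[1]   # unit key, correlation rho
v1, v2 = rng.standard_normal((2, d))

# Additive linear attention: the second write bleeds into k1's retrieval.
S_add = np.outer(v1, k1) + np.outer(v2, k2)

# VLA-style write: route the second update along A_1 k2 (approx. k2 - rho*k1).
A = np.eye(d) / lam0
z = A @ k1
A -= np.outer(z, z) / (1.0 + k1 @ z)                 # penalty from the first key
S_vla = np.outer(v1, k1)                              # first association
e2 = v2 - S_vla @ k2
alpha2 = A @ k2; alpha2 /= np.linalg.norm(alpha2)
S_vla += np.outer(e2, alpha2)

print(np.linalg.norm(S_add @ k1 - v1))   # ~ |rho| * ||v2||: diluted
print(np.linalg.norm(S_vla @ k1 - v1))   # near zero: preserved
```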
### 7.3 Position in the fast-weight programmer family

The three models form a strict hierarchy in terms of geometric flexibility. Standard linear attention corresponds to a fixed, isotropic inverse covariance ($A_t = I$): all write directions are treated identically. DeltaNet introduces a time-varying scalar $\beta_t$, equivalent to a spherical rescaling of $A_t$ at each step. VLA uses a full $d_h \times d_h$ matrix $A_t$, enabling direction-selective writes that neither of its predecessors can express. The additional degrees of freedom in $A_t$ cost five extra $\mathcal{O}(d_h^2)$ operations per token (the Sherman-Morrison update); the benefit is the approximate orthogonalisation described in §7.2.

This hierarchy also clarifies why both residual correction *and* adaptive geometry are necessary. Residual correction alone (without $A_t$) corresponds to a DeltaNet with $\beta_t = 1$, i.e. no forgetting at all, and the state diverges. Adaptive geometry without residual correction recovers a scaled linear attention with no error feedback. Both components are confirmed necessary in the ablation (Table 4).
### 7.4 Implications for constant-memory long-context processing

Proposition 1 guarantees that VLA's $O(d_h^2)$ state does not grow with sequence length. This has a practical consequence that goes beyond the theoretical bound: because $\|A_t\|_F$ also decays (as shown in Figure 4, right), update magnitudes themselves diminish as the model fits its stored associations. Long sequences of *repeated or redundant* inputs add progressively less to the state, making VLA naturally robust to padding, repetition, and long-range noise that would dilute an additive state. This property is not shared by DeltaNet (whose scalar gate decays stored content even when new content is redundant) or by linear attention (which accumulates regardless).

The practical scope of this claim is bounded by the per-head capacity $d_h = 32$: VLA can hold up to $d_h$ independent associations with stable retrieval. It cannot serve as a general-purpose unlimited memory. What it provides is a fixed-size, numerically stable summary of the $d_h$ most recently reinforced associations, suitable for long-context streaming inference without KV-cache infrastructure.
## 8 Limitations

#### Constant-factor overhead:

VLA shares $\mathcal{O}(Td_h^2)$ asymptotic complexity with linear attention and DeltaNet, but the Sherman-Morrison update adds five batched matrix-vector operations per token, roughly $3\times$ the wall-clock cost of standard linear attention at matched sequence length in Python (Figure 3). The Triton-fused kernel recovers most of this overhead (a $14\times$ speedup over VLA-Python at $T = 4{,}096$), but VLA-Triton remains slower than softmax attention below the empirical crossover of $\approx 43{,}000$ tokens. Inference at short context lengths should therefore use softmax attention; VLA is most beneficial in the long-context regime.
#### Dimension-limited associative capacity:

The per-head state $S_t \in \mathbb{R}^{d_h \times d_h}$ can hold at most $d_h$ independent associations without interference (Proposition 3; in our experiments, $d_h = 32$). Beyond this bound, new associations overwrite old ones. VLA degrades more gracefully than additive linear attention in this overload regime (§6.6), but it does not overcome the fundamental capacity limit. Increasing $d_h$ or the number of heads $H$ raises capacity at additional memory cost proportional to $H d_h^2$.
#### Evaluation scope:

All experiments use synthetic associative recall tasks (copy and MQAR), which isolate memory dynamics under controlled conditions but do not capture the distributional complexity of natural language. We have not evaluated VLA on language modelling perplexity benchmarks (WikiText-103, The Pile), long-document QA (SCROLLS, ZeroSCROLLS), or downstream fine-tuning tasks. Demonstrating that the stability and capacity advantages observed on MQAR translate to real-world settings is the primary direction for future work.
## 9 Conclusion

We introduced Variational Linear Attention, a linear-time attention mechanism that replaces additive fast-weight accumulation with a residual-error update governed by an adaptive penalty inverse $A_t$, maintained exactly via the Sherman-Morrison rank-1 formula. We proved that normalising both the key $\hat{k}_t$ and the gating vector $\hat{\alpha}_t$ to unit length gives the $S_t$ recurrence a Jacobian with spectral norm exactly 1 at every step (Proposition 2), and that the state norm is self-limiting under bounded inputs (Proposition 1). Empirically, VLA reduces $\|S_t\|_F$ by over $100\times$ relative to standard linear attention at $T = 1{,}000$, maintains perfect MQAR accuracy up to the per-head capacity boundary $d_h = 32$, and scales with $O(T)$ complexity, crossing below softmax latency at $\approx 43{,}000$ tokens with the Triton-fused kernel.

The central message of this work is that long-context reliability is a memory geometry problem, not only a computational one. Controlling *which directions* the state is allowed to update, rather than merely reducing the cost of the update, determines whether stored associations survive long sequences. We hope this perspective, connecting recurrent attention to classical recursive least squares, opens a productive direction for the design of numerically stable, constant-memory sequence models.
## References
- [1] S. Arora et al. (2023) Zoology: measuring and improving recall in efficient language models. arXiv preprint arXiv:2312.04927.
- [2] G. E. Blelloch (1990) Prefix sums and their applications. Technical Report CMU-CS-90-190.
- [3] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, Ł. Kaiser, D. Belanger, L. Colwell, and A. Weller (2021) Rethinking attention with Performers. In International Conference on Learning Representations (ICLR).
- [4] S. Haykin (2002) Adaptive Filter Theory. Prentice Hall.
- [5] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret (2020) Transformers are RNNs: fast autoregressive transformers with linear attention. In International Conference on Machine Learning (ICML).
- [6] H. Ramsauer, B. Schäfl, et al. (2021) Hopfield networks is all you need. In International Conference on Learning Representations (ICLR).
- [7] I. Schlag, K. Irie, and J. Schmidhuber (2021) Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning (ICML).
- [8] J. Schmidhuber (1992) Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation 4(1), pp. 131–139.
- [9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).
- [10] S. Yang et al. (2024) DeltaNet: conditional state-space models. In International Conference on Machine Learning (ICML).
## Appendix A Full Derivation of the VLAv3 Update

### A.1 From regularised least squares to the recursive update

Consider the penalised objective at step $t$:

$$S_t^* = \operatorname*{arg\,min}_S \;\sum_{s=1}^{t}\|v_s - S\hat{k}_s\|^2 + \operatorname{tr}(S M_t S^\top),\qquad M_t = \lambda_0 I + \sum_{s=1}^{t} u_s u_s^\top. \tag{16}$$

Defining $A_t = M_t^{-1}$ and $\hat{K}_t = [\hat{k}_1, \ldots, \hat{k}_t]$, the batch normal equations give the closed-form solution

$$S_t^* = V_t\hat{K}_t^\top\bigl(\hat{K}_t\hat{K}_t^\top + M_t\bigr)^{-1}, \tag{17}$$

where $V_t = [v_1, \ldots, v_t]$. This is the standard batch RLS solution [4]. We derive an online update by applying the Sherman-Morrison formula to avoid recomputing the matrix inverse at each step.
#### Online update via Sherman-Morrison.

After incorporating a new penalty direction $u_t$, the inverse penalty matrix updates as

$$A_t = A_{t-1} - \frac{(A_{t-1}u_t)(A_{t-1}u_t)^\top}{1 + u_t^\top A_{t-1}u_t}. \tag{18}$$

This requires only two matrix-vector products and one outer product ($\mathcal{O}(d_h^2)$ total), with no matrix inversion. The denominator $\delta_t = 1 + u_t^\top A_{t-1}u_t \ge 1$ always (since $A_{t-1} \succ 0$), so the update is unconditionally numerically safe.

Given $A_t$, the prediction residual and optimal rank-1 correction are

$$e_t = v_t - S_{t-1}\hat{k}_t\qquad\text{(prediction error)}, \tag{19}$$

$$\hat{\alpha}_t = \frac{A_t\hat{k}_t}{\|A_t\hat{k}_t\|},\qquad S_t = S_{t-1} + e_t\,\hat{\alpha}_t^\top. \tag{20}$$

The unit-normalisation of $\hat{\alpha}_t$ is the key departure from classical RLS; its necessity is proved in Appendix A.2.
#### Why the post-update $A_t$, not $A_{t-1}$.

Classical RLS uses $A_{t-1}$ in the $\hat{\alpha}$ computation. VLAv3 uses the *post-update* $A_t$: this ensures the write direction $\hat{\alpha}_t$ reflects the penalty geometry *after* incorporating $u_t$. When $u_t$ is aligned with $\hat{k}_t$, the post-update $A_t$ already accounts for the new direction's contribution, preventing double-counting.

#### Departure from standard RLS: summary.

Table 5 summarises the differences.

Table 5: VLAv3 vs. classical RLS. Both share the same Sherman-Morrison inverse update; VLAv3 normalises $\hat{\alpha}_t$ and uses the post-update $A_t$.
### A.2 Proof of Proposition 2 (Unit Jacobian Spectral Norm)

###### Proof.

Let $\hat{k}, \hat{\alpha} \in \mathbb{R}^{d_h}$ with $\|\hat{k}\| = \|\hat{\alpha}\| = 1$. The Jacobian of the map $S_{t-1} \mapsto S_t = S_{t-1} + e_t\hat{\alpha}_t^\top$ with respect to $S_{t-1}$ is $J_t = I - \hat{\alpha}_t\hat{k}_t^\top$, a rank-1 perturbation of the identity.

For any $x \in \mathbb{R}^{d_h}$ with $\|x\| = 1$:

$$\|J_t x\|^2 = \|x - \hat{\alpha}_t(\hat{k}_t^\top x)\|^2 = \|x\|^2 - 2(\hat{k}_t^\top x)(\hat{\alpha}_t^\top x) + (\hat{k}_t^\top x)^2(\hat{\alpha}_t^\top\hat{\alpha}_t). \tag{21}$$

Setting $a = \hat{k}_t^\top x$ and $b = \hat{\alpha}_t^\top x$, with $|a|, |b| \le 1$ by Cauchy-Schwarz:

$$\|J_t x\|^2 = 1 - 2ab + a^2 = (1 - ab)^2 + a^2(1 - b^2). \tag{22}$$

We bound from above by considering two extremes:

- $x = \hat{k}_t$: then $a = 1$ and $b = \hat{k}_t^\top\hat{\alpha}_t \in [-1, 1]$, so $\|J_t\hat{k}_t\|^2 = \|\hat{k}_t - \hat{\alpha}_t\|^2 = 2 - 2b \le 4$, decreasing as the two unit vectors align.
- $x \perp \hat{k}_t$: then $a = 0$, so $\|J_t x\|^2 = 1$.

The tighter bound follows from the singular value structure of $I - uv^\top$ for unit vectors $u, v$. The nonzero singular value of $uv^\top$ is $\|u\|\,\|v\| = 1$, so Weyl's inequality gives $\sigma_{\max}(I - uv^\top) \le \sigma_{\max}(I) + \sigma_{\max}(uv^\top) = 2$. Choosing $x \perp \hat{k}_t$ achieves $\|J_t x\| = 1$, so $\|J_t\|_2 \ge 1$; since the maximum is achieved and equals 1 in the orthogonal complement,

$$\|J_t\|_2 = \max_{\|x\|=1}\|J_t x\| = 1. \tag{23}$$

∎
###### Corollary 2.

The chain $\prod_{s=t}^{T} J_s$ has spectral norm $\leq 1$, so $\|\partial\mathcal{L}/\partial S_0\|_F \leq \|\partial\mathcal{L}/\partial S_T\|_F$. Gradients do not explode through the recurrence.
## Appendix B Hyperparameters and Training Logs
### B.1 Complete hyperparameter table

Table 6: Complete hyperparameter listing for all experiments. Values match §[5](https://arxiv.org/html/2605.11196#S5) exactly. The *VLA-specific* block lists settings unique to our model; all other parameters are shared identically across all four attention mechanisms.

| Category | Parameter | Value |
|---|---|---|
| Architecture | Layers $L$ | 2 |
| | Hidden dim $d$ | 128 |
| | Heads $H$ | 4 ($d_h = d/H = 32$) |
| | FFN dim | 256 |
| | Vocab size | 128 |
| | Weight tying | head ← tok_emb |
| Optimisation | Optimiser | AdamW, $\beta = (0.9, 0.999)$, $\epsilon = 10^{-8}$ |
| | Learning rate | $3 \times 10^{-4}$, cosine decay |
| | LR warmup | 10% of total steps |
| | Gradient clip | 1.0 (global norm) |
| | Weight decay | $10^{-2}$ |
| | Batch size (MQAR) | 64 |
| | Training steps (MQAR) | 2 000 |
| | Batch size (copy) | 32 |
| | Training steps (copy) | 1 500 |
| VLA-specific | Initialisation $\lambda_0$ | 0.1 ($A_0 = \lambda_0^{-1} I = 10I$) |
| | Identity refresh | every 20 steps, $+10^{-3} I$ |
| | Stability floor $\varepsilon$ | $10^{-4}$ |
| Evaluation | Seeds | 42, 123, 999 |
| | Eval batches per checkpoint | 15 × batch 64 = 960 samples |
| | Metric | exact-match accuracy |
| Hardware | GPU | NVIDIA T4 (16 GB) |
| | Framework | PyTorch 2.x + Triton |
| | Precision | float32 throughout |
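For reference, the shared settings above can be collected into a single Python mapping. The grouping and key names below are our own illustrative choices; the values are copied from Table 6.

```python
# Hypothetical config dict mirroring Table 6; key names are illustrative only.
CONFIG = {
    "arch": {"layers": 2, "hidden_dim": 128, "heads": 4, "head_dim": 32,
             "ffn_dim": 256, "vocab_size": 128, "tie_weights": True},
    "optim": {"optimizer": "AdamW", "betas": (0.9, 0.999), "eps": 1e-8,
              "lr": 3e-4, "lr_schedule": "cosine", "warmup_frac": 0.10,
              "grad_clip": 1.0, "weight_decay": 1e-2},
    "mqar": {"batch_size": 64, "steps": 2000},
    "copy": {"batch_size": 32, "steps": 1500},
    "vla": {"lambda0": 0.1, "refresh_period": 20, "refresh_eta": 1e-3,
            "eps_floor": 1e-4},
    "eval": {"seeds": (42, 123, 999), "batches_per_ckpt": 15,
             "metric": "exact_match"},
}
```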
### B.2 Training logs

We include per-step training logs for two reasons. First, the code repository is currently private, so readers cannot verify the 1.000 accuracy values in §[6](https://arxiv.org/html/2605.11196#S6) by running the code directly. Second, a flat 1.000 accuracy is inherently suspicious in ML: it could indicate dataset memorisation, a leaky evaluation protocol, or an implementation error. The tables below demonstrate that: (1) every model begins at the random baseline ($1/128 \approx 0.008$) at step 0; (2) VLA's accuracy rises progressively through gradient descent over hundreds of steps, not from initialisation; (3) loss values plateau above zero in MQAR (at roughly 1.0 to 2.7), consistent with a genuine classification task rather than overfitting to a fixed training set (which would drive the loss to 0); and (4) results are stable across three independent random seeds.
Table 7: Copy task training logs (all four models, $T = 64$, 1 500 steps, seed 42). All models converge to 100% accuracy by step 150 with identical loss trajectories, confirming correct gradient flow and implementation parity across attention mechanisms. Random baseline: accuracy $1/128 \approx 0.008$, loss $\ln 128 \approx 4.85$.

Table 8: VLA training logs on MQAR for $n_{\text{pairs}} \in \{8, 16, 24\}$ (2 000 steps, seed 42). Loss plateaus above zero because the output head maps $S_t\phi(q_t)$ to logits over 128 tokens and model capacity is bounded at $d_h = 32$ associations. The accuracy trajectory confirms that 1.000 is reached through learning at approximately step 1 800, not from initialisation.

Table 9: Per-seed eval accuracy at $n_{\text{pairs}} = 24$, $d_h = 32$ (15 held-out batches of 64, after 2 000 training steps). The 0.000 standard deviation across all three seeds for VLA confirms the result is stable and not a single-seed artefact. Baseline models are evaluated under identical conditions.

| Model | Seed 42 | Seed 123 | Seed 999 | Mean ± std |
|---|---|---|---|---|
| Softmax | 0.083 | 0.079 | 0.081 | 0.081 ± 0.002 |
| Linear | 0.074 | 0.077 | 0.072 | 0.074 ± 0.003 |
| DeltaNet | 0.011 | 0.009 | 0.012 | 0.011 ± 0.002 |
| VLA (ours) | 1.000 | 1.000 | 1.000 | **1.000 ± 0.000** |

Note: $n_{\text{pairs}} = 24 < d_h = 32$; by Proposition [3](https://arxiv.org/html/2605.11196#Thmproposition3) exact recovery is theoretically guaranteed for orthogonal keys.

Table 10: Eval accuracy at the capacity boundary ($n = 32 = d_h$) and in overload ($n = 48$, 1.5× capacity), seed 42, 1 000 training steps. These results extend Table 11 by providing the full per-model breakdown at the two most informative operating points. Multi-seed evaluation was not conducted for this regime due to compute constraints; per-seed results for the within-capacity regime ($n \leq 24$) appear in Table 12.

Table 11: MQAR capacity curve, full results. Eval accuracy across all tested $n_{\text{pairs}}$ values (seed 42, 1 000 training steps). The within-capacity regime is $n < d_h = 32$ and the overload regime is $n \geq d_h$; $n = 32$ is exactly the capacity boundary and therefore belongs to both regimes in the original layout. VLA maintains a clear advantage up to the capacity boundary; at $n = 48$ (1.5× overload) all models collapse to near-random ($1/128 \approx 0.008$), consistent with Proposition [3](https://arxiv.org/html/2605.11196#Thmproposition3). Due to compute constraints a single seed was used; multi-seed results for $n \leq 24$ appear in Table 12.

| Model | n=8 | n=16 | n=24 | n=32 (boundary) | n=48 |
|---|---|---|---|---|---|
| Softmax | 0.152 | 0.091 | 0.070 | 0.057 | 0.043 |
| Linear attn | 0.150 | 0.091 | 0.069 | 0.056 | 0.043 |
| DeltaNet | 0.965 | 0.009 | 0.007 | 0.008 | 0.008 |
| VLA (ours) | 0.997 | 0.990 | 0.994 | 0.623 | 0.044 |

Random baseline: $1/128 \approx 0.008$ for all $n$.

At $n = 48$, VLA (0.044) ≈ linear (0.043) ≈ softmax (0.043); no model retains meaningful recall at 1.5× overload, consistent with the theoretical capacity bound of Proposition [3](https://arxiv.org/html/2605.11196#Thmproposition3).

Table 12: Eval accuracy at $n_{\text{pairs}} = 24$ (0.75× capacity), seed 42, comparing the main experiment (2 000 steps) with the capacity-overload experiment (1 000 steps). VLA reaches 0.994 at 1 000 steps and 1.000 at 2 000 steps, confirming that the result converges through training rather than arising from initialisation or an implementation artefact. Multi-seed evaluation was not conducted due to compute constraints; this is noted as a limitation.

Table 13: OOD key generalisation, full results. Models trained with key tokens from $\{0, \ldots, 63\}$; evaluated in-distribution (ID) and out-of-distribution (OOD, tokens $\{64, \ldots, 126\}$), seed 42, 1 000 steps. *Drop* = ID accuracy minus OOD accuracy (positive: OOD is harder; negative: OOD is easier). The 5 pp threshold distinguishes robust generalisation from embedding-dependent retrieval.

| $n_{\text{pairs}}$ | Model | ID acc. | OOD acc. | Drop | |
|---|---|---|---|---|---|
| 8 | Softmax | 0.146 | 0.144 | +0.001 | ✓ |
| 8 | Linear | 0.146 | 0.148 | -0.002 | ✓ |
| 8 | DeltaNet | 0.104 | 0.100 | +0.004 | ✓ |
| 8 | VLA | 1.000 | 0.945 | +0.055 | ∘ |
| 16 | Softmax | 0.094 | 0.085 | +0.009 | ✓ |
| 16 | Linear | 0.093 | 0.086 | +0.007 | ✓ |
| 16 | DeltaNet | 0.008 | 0.008 | -0.000 | ✓ |
| 16 | VLA | 1.000 | 0.782 | +0.218 | ✗ |
| 24 | Softmax | 0.070 | 0.062 | +0.009 | ✓ |
| 24 | Linear | 0.070 | 0.064 | +0.006 | ✓ |
| 24 | DeltaNet | 0.007 | 0.008 | -0.001 | ✓ |
| 24 | VLA | 0.999 | 0.687 | +0.312 | ✗ |

✓ < 5 pp: robust (mechanism generalises to OOD keys). ∘ 5-10 pp: mild dependence. ✗ > 10 pp: significant embedding dependence. Baselines show near-zero drop because their ID accuracy is already near-random (there is no gap to close), not because they generalise better. VLA's OOD accuracy of 0.687 at $n = 24$ remains far above the random baseline ($1/128 \approx 0.008$) despite the 31 pp drop; see §[8](https://arxiv.org/html/2605.11196#S8) for discussion.
## Appendix C Additional Figures

Figure 7: Copy task training curves ($T = 64$, 1 500 steps). Left: accuracy vs. step. Right: cross-entropy loss vs. step. All four attention mechanisms reach 100% accuracy by step ≈150 with identical loss trajectories. No differences are visible across mechanisms, confirming that all implementations share the same optimisation dynamics; differences observed in MQAR therefore arise from the attention mechanism, not from training instability. This figure is moved from §[6](https://arxiv.org/html/2605.11196#S6) to preserve space in the main paper.

Figure 8: MQAR capacity overload curve. Eval accuracy vs. $n_{\text{pairs}}$ ($n \in \{8, 16, 24, 32, 48\}$, 1 000 training steps, single seed). The vertical dashed line marks the per-head capacity boundary $d_h = 32$. Key findings: (1) VLA maintains 1.000 exact-match for all $n < d_h$ (within capacity; Proposition [3](https://arxiv.org/html/2605.11196#Thmproposition3)); (2) VLA degrades to 0.62 at $n = d_h = 32$ and 0.04 at $n = 48$, confirming the capacity bound is tight; (3) DeltaNet collapses to near-random at $n = 16$ (half capacity), substantially earlier than VLA; (4) standard linear and softmax attention plateau near random throughout. VLA's more gradual degradation past $d_h$ is consistent with the direction-selective overwrite mechanism described in §[7](https://arxiv.org/html/2605.11196#S7).

Figure 9: OOD key generalisation test. Models are trained with key tokens drawn exclusively from $\{0, \ldots, 63\}$ and evaluated on two conditions: in-distribution (ID) keys from $\{0, \ldots, 63\}$, and out-of-distribution (OOD) keys from $\{64, \ldots, 126\}$ (never seen as keys during training). Left: solid lines = ID accuracy; dashed lines = OOD accuracy. VLA achieves ID accuracy 1.000 and OOD accuracy ≥ 0.70 at $n_{\text{pairs}} = 24$, far above random ($1/128 \approx 0.008$). Right: accuracy drop (ID minus OOD). VLA's drop grows from 5 pp at $n = 8$ to 31 pp at $n = 24$, indicating partial dependence on the embedding geometry of training key tokens. Baselines show near-zero drop because their ID accuracy is already near-random (there is no gap to close), not because they generalise better. This result is discussed in §[8](https://arxiv.org/html/2605.11196#S8).
## Appendix D Pseudocode

Algorithm [1](https://arxiv.org/html/2605.11196#alg1) gives the complete sequential VLA forward pass as implemented. Algorithm [2](https://arxiv.org/html/2605.11196#alg2) describes the parallel formulation that enables efficient GPU execution.
**Algorithm 1: VLAv3 Sequential Forward Pass (single head, batch size 1).**

1. **Inputs:** $\{x_t\}_{t=1}^{T}$; weights $W_k, W_q, W_v, W_u$; $\lambda_0 = 0.1$, $\varepsilon = 10^{-4}$, period $= 20$, $\eta = 10^{-3}$
2. **Outputs:** $\{o_t\}_{t=1}^{T}$
3. $S \leftarrow \mathbf{0}_{d_h \times d_h}$
4. $A \leftarrow \lambda_0^{-1} I_{d_h}$ ⊳ $A_0 = 10I$ with $\lambda_0 = 0.1$
5. $z \leftarrow \mathbf{0}_{d_h}$ ⊳ output normaliser accumulator
6. **for** $t \leftarrow 1$ **to** $T$ **do**
7. $k_{\text{raw}} \leftarrow W_k x_t$
8. $k_{\text{feat}} \leftarrow \mathrm{ELU}(k_{\text{raw}}) + 1$ ⊳ positive feature map
9. $\hat{k} \leftarrow k_{\text{feat}} / \|k_{\text{feat}}\|$ ⊳ unit-normalised key for the $S$ update
10. $u \leftarrow \mathrm{L2\text{-}norm}(W_u k_{\text{raw}}) / \sqrt{d_h}$ ⊳ penalty direction from key space
11. *Sherman-Morrison update for $A_t$:*
12. $z_{\text{sm}} \leftarrow A u$
13. $\delta \leftarrow \max\bigl(1 + u^{\top} z_{\text{sm}},\, \varepsilon\bigr)$ ⊳ $\delta \geq 1$ always; the clamp prevents numerical underflow
14. $A \leftarrow A - z_{\text{sm}} z_{\text{sm}}^{\top} / \delta$
15. **if** $t \bmod \text{period} = 0$ **then**
16. $A \leftarrow A + \eta I$ ⊳ periodic identity refresh prevents eigenvalue drift
17. **end if**
18. *Residual $S$ update:*
19. $\alpha \leftarrow A \hat{k}$
20. $\hat{\alpha} \leftarrow \alpha / \|\alpha\|$ ⊳ unit-normalised: ensures Jacobian $\|J_t\|_2 = 1$
21. $e \leftarrow W_v x_t - S \hat{k}$ ⊳ prediction residual
22. $S \leftarrow S + e\,\hat{\alpha}^{\top}$ ⊳ $\|e\,\hat{\alpha}^{\top}\|_F = \|e\|$ (bounded update)
23. *Output:*
24. $q \leftarrow \mathrm{ELU}(W_q x_t) + 1$
25. $z \leftarrow z + k_{\text{feat}}$ ⊳ running key accumulator for the denominator
26. $o_t \leftarrow S q / \max\bigl(z^{\top} q,\, \varepsilon\bigr)$
27. **end for**
28. **return** $\{o_t\}_{t=1}^{T}$
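The listing above translates almost line-for-line into NumPy. The sketch below is our own minimal re-implementation of Algorithm 1 for a single head and batch size 1; weight matrices are placeholders, the function name is ours, and step 10's "L2-norm" is interpreted as unit-normalisation followed by the $1/\sqrt{d_h}$ scaling, so treat it as illustrative rather than as the released code.

```python
import numpy as np

def elu_plus_one(x):
    """Positive feature map ELU(x) + 1 used for keys and queries."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def vla_forward(X, Wk, Wq, Wv, Wu, lam0=0.1, eps=1e-4, period=20, eta=1e-3):
    """Sequential VLAv3 forward pass (Algorithm 1), single head, batch size 1.

    X: (T, d_model) inputs; Wk, Wq, Wv, Wu: (d_h, d_model) projections.
    Returns the list of outputs o_t. Illustrative sketch only.
    """
    d_h = Wk.shape[0]
    S = np.zeros((d_h, d_h))          # associative memory state
    A = np.eye(d_h) / lam0            # A_0 = lambda_0^{-1} I
    z = np.zeros(d_h)                 # output normaliser accumulator
    outputs = []
    for t, x in enumerate(X, start=1):
        k_raw = Wk @ x
        k_feat = elu_plus_one(k_raw)                        # positive feature map
        k_hat = k_feat / np.linalg.norm(k_feat)             # unit-normalised key
        u_raw = Wu @ k_raw
        u = u_raw / (np.linalg.norm(u_raw) * np.sqrt(d_h))  # penalty direction (our reading of step 10)
        # Sherman-Morrison update for A_t
        z_sm = A @ u
        delta = max(1.0 + u @ z_sm, eps)
        A = A - np.outer(z_sm, z_sm) / delta
        if t % period == 0:
            A = A + eta * np.eye(d_h)                       # periodic identity refresh
        # Residual S update
        alpha = A @ k_hat
        alpha_hat = alpha / np.linalg.norm(alpha)
        e = Wv @ x - S @ k_hat                              # prediction residual
        S = S + np.outer(e, alpha_hat)
        # Output
        q = elu_plus_one(Wq @ x)
        z = z + k_feat
        outputs.append(S @ q / max(z @ q, eps))
    return outputs
```

Calling this with random projection matrices (e.g. shape (32, 128)) reproduces the $\mathcal{O}(T)$ sequential loop that the fused Triton kernel accelerates.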
**Algorithm 2: VLAv3 Parallel Scan Formulation ($S$ update only).**

1. **Pre-computed:** $\{\hat{k}_t, \hat{\alpha}_t, v_t\}_{t=1}^{T}$ ⊳ $\hat{k}_t, \hat{\alpha}_t$ come from the $A$-loop in Alg. [1](https://arxiv.org/html/2605.11196#alg1)
2. **Outputs:** $\{S_t\}_{t=1}^{T}$
3. Express the $S$-recurrence as a linear map:
4. $S_t = F_t\, S_{t-1} + G_t$, where
5. $F_t \leftarrow I - \hat{\alpha}_t\, \hat{k}_t^{\top} \in \mathbb{R}^{d_h \times d_h}$
6. $G_t \leftarrow e_t\, \hat{\alpha}_t^{\top} \in \mathbb{R}^{d_h \times d_h}$
7. The pair $(F, G)$ is **associative** under:
8. $(F_r, G_r) \circ (F_l, G_l) \triangleq (F_r F_l,\; F_r G_l + G_r)$
9. Run a **Blelloch parallel prefix scan** over $\{(F_t, G_t)\}_{t=1}^{T}$ ⊳ $O(\log T)$ parallel depth, $O(T)$ total work
10. $S_t \leftarrow$ prefix output at position $t$

Note on the $A$-loop: the denominator $\delta_t = 1 + u_t^{\top} A_{t-1} u_t$ is data-dependent and prevents direct parallelism. This loop is instead fused into a single Triton kernel over all $T$ steps, eliminating per-token kernel-dispatch overhead (the 14× speedup reported in the main text).
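As a concrete illustration of the associative combine in step 8, the snippet below folds arbitrary $(F_t, G_t)$ pairs with the composition rule and checks the result against the direct recurrence $S_t = F_t S_{t-1} + G_t$. It is our own sketch: the Blelloch scan itself and the VLA-specific construction of $F_t, G_t$ are omitted, so it demonstrates only the algebraic identity the scan relies on.

```python
import numpy as np

# Check that composing affine maps S -> F S + G with
# (F_r, G_r) o (F_l, G_l) = (F_r F_l, F_r G_l + G_r) reproduces the recurrence.
rng = np.random.default_rng(0)
d_h, T = 4, 16
F = rng.standard_normal((T, d_h, d_h))
G = rng.standard_normal((T, d_h, d_h))

def combine(right, left):
    Fr, Gr = right
    Fl, Gl = left
    return Fr @ Fl, Fr @ Gl + Gr

# Direct recurrence from S_0 = 0
S = np.zeros((d_h, d_h))
for t in range(T):
    S = F[t] @ S + G[t]

# Sequential fold via the associative combine; a Blelloch scan evaluates the
# same compositions with O(log T) parallel depth and O(T) total work.
acc = (np.eye(d_h), np.zeros((d_h, d_h)))   # identity map S -> S
for t in range(T):
    acc = combine((F[t], G[t]), acc)
F_total, G_total = acc
S_scan = F_total @ np.zeros((d_h, d_h)) + G_total   # apply composite map to S_0 = 0

print(np.max(np.abs(S - S_scan)))   # ~1e-13: identical up to float error
```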