Blurry Window Attention
Summary
Introduces Blurry Window Attention (BLA), a novel attention method with bounded-memory control that reconstructs a blurry KV history via Dirichlet kernel interpolation, achieving 8x state efficiency over Sliding Window Attention on the Multi-Query Associate Recall task.
# Blurry Window Attention
Source: [https://arxiv.org/html/2606.09862](https://arxiv.org/html/2606.09862)
Abbreviations: SSM, State\-Space Model; LA, Linear Attention; BLA, Blurry Window Attention; GLA, Gated Linear Attention; GSA, Gated Slot Attention; ABC, Attention with Bounded\-memory Control; FLA, Flash Linear Attention; GDN, Gated DeltaNet; SWA, Sliding Window Attention; MQAR, Multi\-Query Associate Recall; AI, Artificial Intelligence; LM, Language Model; RNN, Recurrent Neural Network\.
\(1 Huawei, Zurich, Switzerland; 2 Huawei Advanced Computing and Storage Lab, Shenzhen, China\)
###### Abstract
The Softmax Attention operation in Transformer language models has quadratic complexity in the sequence length and a state, the KV cache, whose size grows with the sequence, which becomes a bottleneck in long\-context scenarios\. To overcome this limitation, alternative architectures with linear complexity and finite state size have been introduced, such as [State\-Space Models](https://arxiv.org/html/2606.09862#id1.1.id1), [Linear Attention](https://arxiv.org/html/2606.09862#id2.2.id2) \([LA](https://arxiv.org/html/2606.09862#id2.2.id2)\), and [Attention with Bounded\-memory Control](https://arxiv.org/html/2606.09862#id6.6.id6) \([ABC](https://arxiv.org/html/2606.09862#id6.6.id6)\)\. Although linear models achieve language perplexity similar to that of Transformers, they still lag behind on tasks that require retrieval or recall of specific information\. In this work, we introduce [Blurry Window Attention](https://arxiv.org/html/2606.09862#id3.3.id3) \([BLA](https://arxiv.org/html/2606.09862#id3.3.id3)\), a novel [ABC](https://arxiv.org/html/2606.09862#id6.6.id6) method inspired by [SSMs](https://arxiv.org/html/2606.09862#id1.1.id1)\. [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) stores a frequency window from which a blurry KV history is reconstructed via interpolation using Dirichlet kernels\. [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) can be understood as a generalization of [Sliding Window Attention](https://arxiv.org/html/2606.09862#id9.9.id9) \([SWA](https://arxiv.org/html/2606.09862#id9.9.id9)\), depending on the resolution of the Dirichlet kernels, or as a special case of [Gated Slot Attention](https://arxiv.org/html/2606.09862#id5.5.id5) \([GSA](https://arxiv.org/html/2606.09862#id5.5.id5)\) in which the decay factor is implemented with Dirichlet kernels\. We describe in detail the theory and an efficient implementation of [BLA](https://arxiv.org/html/2606.09862#id3.3.id3)\.
On the [Multi\-Query Associate Recall](https://arxiv.org/html/2606.09862#id10.10.id10) \([MQAR](https://arxiv.org/html/2606.09862#id10.10.id10)\) synthetic task, we show that the state efficiency of [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) is $8\times$ better than that of [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) and is competitive with popular linear attention models\. On the RegBench synthetic task, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) and [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) are the only models, among the linear models we tested, whose performance improves as the state size grows\.
## 1 Introduction
The Transformer architecture \[[1](https://arxiv.org/html/2606.09862#bib.bib1)\] and its attention mechanism are among the principal workhorses of large [Language Models](https://arxiv.org/html/2606.09862#id12.12.id12)\. The strength of attention comes from its ability to be parallelized over the sequence length and from the all\-to\-all connection pathway between tokens, which enables direct interaction between distant time points\. However, the computing cost of this interaction is quadratic in the sequence length and becomes the main bottleneck when the context length grows past the model dimension, which commonly occurs in scenarios such as agentic AI or long chains of thought\. In addition, Transformers require a growing KV cache during inference, and each new token needs proportionally more compute\. [Sliding Window Attention](https://arxiv.org/html/2606.09862#id9.9.id9) \([SWA](https://arxiv.org/html/2606.09862#id9.9.id9)\) overcomes the quadratic complexity by truncating the KV history to a finite time window\. While stacking layers in principle increases the receptive field beyond the window size, the effect is not additive \[[2](https://arxiv.org/html/2606.09862#bib.bib2)\], and full attention layers are still required to maintain long\-range interaction\.
To mitigate the quadratic complexity bottleneck of attention while still allowing for long\-range performance, alternative architectures with linear sequence complexity have been designed\. The most prominent ones include [State\-Space Models](https://arxiv.org/html/2606.09862#id1.1.id1) \([SSMs](https://arxiv.org/html/2606.09862#id1.1.id1)\) \[[3](https://arxiv.org/html/2606.09862#bib.bib3),[4](https://arxiv.org/html/2606.09862#bib.bib4),[5](https://arxiv.org/html/2606.09862#bib.bib5)\], [Linear Attention](https://arxiv.org/html/2606.09862#id2.2.id2) \([LA](https://arxiv.org/html/2606.09862#id2.2.id2)\) \[[6](https://arxiv.org/html/2606.09862#bib.bib6),[7](https://arxiv.org/html/2606.09862#bib.bib7),[8](https://arxiv.org/html/2606.09862#bib.bib8)\], and [ABCs](https://arxiv.org/html/2606.09862#id6.6.id6) \[[9](https://arxiv.org/html/2606.09862#bib.bib9),[10](https://arxiv.org/html/2606.09862#bib.bib10)\]\. Like Transformers, these neural networks are parallelizable over the sequence, but unlike Transformers, their linear sequence\-mixing operations use a finite state instead of a growing KV cache\. This gives linear [LMs](https://arxiv.org/html/2606.09862#id12.12.id12) a computational advantage over Transformers in long\-context scenarios\. However, recent research points out that linear [LMs](https://arxiv.org/html/2606.09862#id12.12.id12) fall short of attention variants on specific tasks where long\-range information recall is needed \[[11](https://arxiv.org/html/2606.09862#bib.bib11),[12](https://arxiv.org/html/2606.09862#bib.bib12)\], casting doubt on the long\-term viability of purely linear [LMs](https://arxiv.org/html/2606.09862#id12.12.id12) for text processing\.
In this work, we present [Blurry Window Attention](https://arxiv.org/html/2606.09862#id3.3.id3) \([BLA](https://arxiv.org/html/2606.09862#id3.3.id3)\), a novel linear attention architecture that aims to combine the accurate retrieval of [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) with the long\-range dependencies of traditional [SSMs](https://arxiv.org/html/2606.09862#id1.1.id1) and [LA](https://arxiv.org/html/2606.09862#id2.2.id2) models\. While the state of linear attention stores key\-value associations in an outer\-product format, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) maintains separate key and value states, which makes [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) more similar to [ABC](https://arxiv.org/html/2606.09862#id6.6.id6) and [SWA](https://arxiv.org/html/2606.09862#id9.9.id9)\. However, unlike [ABC](https://arxiv.org/html/2606.09862#id6.6.id6) methods, the writing mechanism of [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) can be seen as a generalization of [SWA](https://arxiv.org/html/2606.09862#id9.9.id9)\. This is achieved by multiplying and accumulating incoming keys and values independently across a finite set of Fourier modes, similar to an [SSM](https://arxiv.org/html/2606.09862#id1.1.id1) like S4D \[[13](https://arxiv.org/html/2606.09862#bib.bib13)\]\. Such a state\-space representation allows for a lossy interpolation in the time domain up to a period using Dirichlet kernels\. The current query is then used to compute softmax attention over the interpolated keys and values\. In the following, we first present the theory of [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) and then evaluate its performance on recall\-intensive synthetic tasks\. We show that [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) has $8\times$ better state efficiency than [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) and comes close to popular linear models on the [MQAR](https://arxiv.org/html/2606.09862#id10.10.id10) task\.
In addition, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) achieves performance similar to full attention on the RegBench task, in contrast to [Gated Linear Attention](https://arxiv.org/html/2606.09862#id4.4.id4) \([GLA](https://arxiv.org/html/2606.09862#id4.4.id4)\) and [Gated DeltaNet](https://arxiv.org/html/2606.09862#id8.8.id8) \([GDN](https://arxiv.org/html/2606.09862#id8.8.id8)\), and performs better than [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) for small state sizes\.
## 2 Background
We start by briefly recalling the operations of vanilla causal Softmax attention \[[1](https://arxiv.org/html/2606.09862#bib.bib1)\] and its linear variants, considering a single head and batch element for simplicity\.
### 2\.1 Softmax Attention
Given a sequence of $D$\-dimensional vectors $\mathbf{X} \in \mathbb{R}^{L \times D}$ with sequence length $L$, Softmax attention projects the input to query, key, and value sequences $\mathbf{Q} = \boldsymbol{W}_q \mathbf{X}$, $\mathbf{K} = \boldsymbol{W}_k \mathbf{X}$, $\mathbf{V} = \boldsymbol{W}_v \mathbf{X} \in \mathbb{R}^{L \times D}$ using projection matrices $\boldsymbol{W}_q, \boldsymbol{W}_k, \boldsymbol{W}_v \in \mathbb{R}^{D \times D}$\. The output is then given by the formula:

$$\mathbf{O} = \mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{D}} + \mathbf{M}\right)\mathbf{V} \quad \in \mathbb{R}^{L \times D}, \tag{1}$$

where the softmax is applied row\-wise\. $\mathbf{M} \in \{-\infty, 0\}^{L \times L}$ is the causal mask that prevents a query $\boldsymbol{q}_t$ from querying future key vectors $\boldsymbol{k}_{t' > t}$\. The softmax term is an $L \times L$ matrix called the attention mask and is responsible for the $O(L^{2}D)$ complexity of vanilla attention, which is quadratic in the sequence length\. In the case of [Sliding Window Attention](https://arxiv.org/html/2606.09862#id9.9.id9) \([SWA](https://arxiv.org/html/2606.09862#id9.9.id9)\) with a window size $w$, the query $\boldsymbol{q}_t$ only attends to the keys $\boldsymbol{k}_{t'}$ of a sliding window with $t' \in [t-w, t]$, which brings the complexity down to $O(LwD)$ at the cost of dropping long\-range interaction between vectors\.
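The causal mask and the sliding window can be made concrete with a few lines of plain Python. This is only an illustrative sketch for a single head with already\-projected inputs; the function name `attention` and its `window` argument are hypothetical and not part of the paper.

```python
import math

def attention(Q, K, V, window=None):
    """Causal softmax attention for one head (Eq. 1); each row is a time step.

    With `window=w`, query t only attends to keys t-w..t (sliding window).
    """
    D = len(Q[0])
    out = []
    for t, q in enumerate(Q):
        first = 0 if window is None else max(0, t - window)
        visible = range(first, t + 1)  # causal mask: never look at t' > t
        scores = [sum(a * b for a, b in zip(q, K[s])) / math.sqrt(D) for s in visible]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append([sum(w * V[s][d] for w, s in zip(weights, visible)) / norm
                    for d in range(D)])
    return out

# Toy input, used as queries, keys, and values at once.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
full = attention(X, X, X)            # every query sees its whole past
swa = attention(X, X, X, window=1)   # every query sees at most 2 keys
```

With `window=1` the last query no longer sees the first two tokens, so its output differs from full attention, while the first two outputs are unchanged.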
### 2\.2 Attention with Bounded\-memory Control
[Attention with Bounded\-memory Control](https://arxiv.org/html/2606.09862#id6.6.id6) \([ABC](https://arxiv.org/html/2606.09862#id6.6.id6)\) \[[9](https://arxiv.org/html/2606.09862#bib.bib9)\] introduces a cumulative softmax write gate $\boldsymbol{\phi}_t$ that allows multiple tokens to be stored in a fixed\-size memory slot:

$$\widetilde{\mathbf{K}}_t = \widetilde{\mathbf{K}}_{t-1} + \boldsymbol{\phi}_t \otimes \mathbf{k}_t, \quad \widetilde{\mathbf{V}}_t = \widetilde{\mathbf{V}}_{t-1} + \boldsymbol{\phi}_t \otimes \mathbf{v}_t. \tag{2}$$

The gate $\boldsymbol{\phi}_t$ is obtained via a normalized exponential of token features, giving a data\-dependent, FIFO\-like memory update while retaining the softmax attention over slots\. This formulation can be expressed as a two\-pass linear attention, enabling hardware\-efficient chunkwise training with a small recurrent state\.
[Gated Slot Attention](https://arxiv.org/html/2606.09862#id5.5.id5) \([GSA](https://arxiv.org/html/2606.09862#id5.5.id5)\) \[[10](https://arxiv.org/html/2606.09862#bib.bib10)\] builds on the ABC mechanism by adding a data\-dependent gating scalar $\alpha_i \in [0, 1]$ for each memory slot\. At each step, the key and value slots are updated with a gated recurrence

$$\widetilde{\mathbf{K}}_t = \operatorname{Diag}(\boldsymbol{\alpha}_t)\,\widetilde{\mathbf{K}}_{t-1} + (1 - \boldsymbol{\alpha}_t) \otimes \mathbf{k}_t \tag{3}$$

\(and analogously for $\widetilde{\mathbf{V}}_t$\), which lets the model forget stale information and introduces a recency bias, addressing ABC's inability to discard old tokens and its bias toward early tokens\. This update can be written as a two\-pass Gated Linear Attention, enabling the same hardware\-efficient chunkwise training used for linear attention while providing a compact recurrent state and improved inference efficiency\.
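The difference between the two write rules of Eqs\. \(2\) and \(3\) can be seen on a toy memory. The sketch below stores the memory as one row per slot and takes the gates as given numbers; in the actual models they are computed from the tokens, and the function names are hypothetical.

```python
def abc_update(slots, phi, k):
    """Eq. (2): add the key to every slot, weighted by the write gate phi."""
    return [[s + p * x for s, x in zip(slot, k)] for slot, p in zip(slots, phi)]

def gsa_update(slots, alpha, k):
    """Eq. (3): slot i keeps a fraction alpha[i] of itself and writes (1 - alpha[i]) of k."""
    return [[a * s + (1.0 - a) * x for s, x in zip(slot, k)]
            for slot, a in zip(slots, alpha)]

slots = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # three slots of dimension 2
k = [3.0, 5.0]                                # incoming key

abc = abc_update(slots, [0.0, 0.5, 1.0], k)   # purely additive: old content always stays
gsa = gsa_update(slots, [1.0, 0.5, 0.0], k)   # alpha = 0 overwrites the slot entirely
```

In this toy case the ABC slots can only accumulate, whereas the GSA slot with $\alpha = 0$ forgets its previous content, which is the recency mechanism described above.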
### 2\.3 State\-Space Models
The [State\-Space Model](https://arxiv.org/html/2606.09862#id1.1.id1) \([SSM](https://arxiv.org/html/2606.09862#id1.1.id1)\) literature can be traced back to the Legendre Memory Unit \[[14](https://arxiv.org/html/2606.09862#bib.bib14)\] and the HiPPO theory \[[3](https://arxiv.org/html/2606.09862#bib.bib3)\]\. The original question addressed by [SSMs](https://arxiv.org/html/2606.09862#id1.1.id1) can be summarized as follows: given an incoming 1\-D continuous signal $x(t)$ and a finite $N$\-dimensional storage space, how can we retain the most information about the signal? The [SSM](https://arxiv.org/html/2606.09862#id1.1.id1) theory shows that, given some desired measure on the signal, we can project the signal onto a basis and maintain a set of coordinates from which it can be approximately reconstructed\. Ignoring the discretization step, the equations of a discrete [SSM](https://arxiv.org/html/2606.09862#id1.1.id1) are as follows:

$$\begin{split} \boldsymbol{h}(t+1) &= A\boldsymbol{h}(t) + Bx(t),\\ y(t) &= C\boldsymbol{h}(t) + Dx(t). \end{split} \tag{4}$$

Here $\boldsymbol{h}(t)$ is an $N$\-dimensional state\-space representation of the signal $x(t)$\. The matrices $A, B, C, D$ are the parameters of the [SSM](https://arxiv.org/html/2606.09862#id1.1.id1)\. As the equation shows, the state update is a linear recurrence, which allows for efficient parallelization over the sequence length provided that the $A$ matrix is diagonalizable\. Early [SSMs](https://arxiv.org/html/2606.09862#id1.1.id1) project the input signal onto Legendre polynomials or truncated Fourier modes \[[14](https://arxiv.org/html/2606.09862#bib.bib14),[3](https://arxiv.org/html/2606.09862#bib.bib3)\], which corresponds to using specific parameter matrices\. Early [SSMs](https://arxiv.org/html/2606.09862#id1.1.id1) and linear RNNs such as LRU or S4\-FouT used $A$ matrices with complex eigenvalues \[[15](https://arxiv.org/html/2606.09862#bib.bib15),[3](https://arxiv.org/html/2606.09862#bib.bib3)\] for better expressivity, since any real matrix $A$ is almost surely diagonalizable in $\mathbb{C}$\. The trend later shifted toward diagonal real $A$ matrices learned from data, preceded by a short convolution to improve recall \[[16](https://arxiv.org/html/2606.09862#bib.bib16),[13](https://arxiv.org/html/2606.09862#bib.bib13),[5](https://arxiv.org/html/2606.09862#bib.bib5),[17](https://arxiv.org/html/2606.09862#bib.bib17)\]\. Interestingly, Mamba 3 returns to complex eigenvalues \[[18](https://arxiv.org/html/2606.09862#bib.bib18)\]\.
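To make the role of a diagonal $A$ concrete, the following sketch runs Eq\. \(4\) with each state coordinate evolving independently and compares the final state with the unrolled sum. The eigenvalues, the input, and the function name `diagonal_ssm` are arbitrary choices made for illustration only.

```python
import cmath

def diagonal_ssm(x, a, b, c, d=0.0):
    """Run Eq. (4) with A = diag(a); returns the outputs y and the final state h."""
    h = [0j] * len(a)
    ys = []
    for xt in x:
        ys.append(sum(ci * hi for ci, hi in zip(c, h)) + d * xt)  # y(t) = C h(t) + D x(t)
        h = [ai * hi + bi * xt for ai, hi, bi in zip(a, h, b)]    # h(t+1) = A h(t) + B x(t)
    return ys, h

# Hypothetical parameters: two unit-modulus complex eigenvalues, B and C full of ones.
a = [cmath.exp(0.3j), cmath.exp(0.9j)]
x = [1.0, -2.0, 0.5, 3.0]
ys, h = diagonal_ssm(x, a, b=[1.0, 1.0], c=[1.0, 1.0])

# Unrolling the recurrence: coordinate n ends at sum_s a_n^(L-1-s) * x_s.
L = len(x)
closed = [sum(an ** (L - 1 - s) * xs for s, xs in enumerate(x)) for an in a]
```

Because every coordinate is a scalar recurrence whose solution is a weighted sum over the inputs, the states can be computed for all time steps at once, which is the parallelization property mentioned above.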
## 3 Theory
We now describe the theory of our novel [Blurry Window Attention](https://arxiv.org/html/2606.09862#id3.3.id3) \([BLA](https://arxiv.org/html/2606.09862#id3.3.id3)\) framework\. We first introduce it in a way that resembles traditional [SSMs](https://arxiv.org/html/2606.09862#id1.1.id1) and highlights the similarity of [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) with the [SSM](https://arxiv.org/html/2606.09862#id1.1.id1) literature\. We then show a more efficient implementation that does not require a convolution, exploits the permutation invariance of softmax attention, and more closely resembles [ABC](https://arxiv.org/html/2606.09862#id6.6.id6)\. Finally, we show how a state decay similar to that of [GSA](https://arxiv.org/html/2606.09862#id5.5.id5) can be implemented in [BLA](https://arxiv.org/html/2606.09862#id3.3.id3), which makes it look like a more general version of [SWA](https://arxiv.org/html/2606.09862#id9.9.id9)\.

Figure 1: Overview of the Blurry Window Attention mechanism\. Left: The state of [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) is a convolution of the keys and values with the cosine and sine components of a set of $M$ Fourier modes parameterized by a period $T$\. Right: In the specific case where $T = 2M-1$, keys and values in the $(2M-1)$ time window can be exactly recovered through trigonometric interpolation\. When $T > 2M-1$, the $2M-1$ keys and values are interpolated from the $T$ window\. When the sequence exceeds $T$, the interpolated keys and values contain anterior patterns due to periodicity\.

### 3\.1 Blurry Window Attention
Like other linear [LMs](https://arxiv.org/html/2606.09862#id12.12.id12), [Blurry Window Attention](https://arxiv.org/html/2606.09862#id3.3.id3) \([BLA](https://arxiv.org/html/2606.09862#id3.3.id3)\) maintains a finite state\. However, unlike [LA](https://arxiv.org/html/2606.09862#id2.2.id2) variants, which maintain an outer\-product state of keys and values, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) keeps separate key and value states $\mathbf{K}^{\mathrm{cos}}, \mathbf{V}^{\mathrm{cos}} \in \mathbb{R}^{D \times M}$ and $\mathbf{K}^{\mathrm{sin}}, \mathbf{V}^{\mathrm{sin}} \in \mathbb{R}^{D \times (M-1)}$, where $M$ is the number of Fourier modes and is the main parameter of [BLA](https://arxiv.org/html/2606.09862#id3.3.id3)\. These states are initialized to zero\. Defining $\omega = 2\pi/(2M-1)$, the recurrent update of the key state is written as:

$$\begin{split} \mathbf{K}^{\mathrm{cos}}_{m,t+1} &= \cos(m\omega)\,\mathbf{K}^{\mathrm{cos}}_{m,t} - \sin(m\omega)\,\mathbf{K}^{\mathrm{sin}}_{m,t} + \boldsymbol{k}_t,\\ \mathbf{K}^{\mathrm{sin}}_{m,t+1} &= \sin(m\omega)\,\mathbf{K}^{\mathrm{cos}}_{m,t} + \cos(m\omega)\,\mathbf{K}^{\mathrm{sin}}_{m,t} \end{split} \tag{5}$$

for $m \in [0, \dots, M-1]$\. The value state is updated similarly\. These equations can be written more compactly in complex notation and consist of linear recurrences with coefficients $e^{\mathrm{i}m\omega}$, which is similar to a diagonal [SSM](https://arxiv.org/html/2606.09862#id1.1.id1) with a complex diagonal $A$ matrix, like S4D \[[3](https://arxiv.org/html/2606.09862#bib.bib3),[13](https://arxiv.org/html/2606.09862#bib.bib13)\], and a $B$ matrix full of ones \(Eq\. \([4](https://arxiv.org/html/2606.09862#S2.E4)\)\)\. We adopt real notation throughout to closely follow how the algorithm is implemented in practice with real data types\. Solving the recurrence in Eq\. \([5](https://arxiv.org/html/2606.09862#S3.E5)\) gives the closed\-form formula for the states:

$$\begin{split} \mathbf{K}^{\mathrm{cos}}_{m,t} &= \sum_{t'=1}^{t} \cos(m\omega(t-t'))\,\boldsymbol{k}_{t'},\\ \mathbf{K}^{\mathrm{sin}}_{m,t} &= \sum_{t'=1}^{t} \sin(m\omega(t-t'))\,\boldsymbol{k}_{t'}, \end{split} \tag{6}$$

which are the keys convolved with the cosine and sine functions\. We defer the derivation to Appendix [B](https://arxiv.org/html/2606.09862#A2)\. The left plot in Figure [1](https://arxiv.org/html/2606.09862#S3.F1) illustrates the convolution of the keys and values with the trigonometric functions\. By combining the two states with the appropriate coefficients, we can propagate the memory of the keys and values into the future with an arbitrary function\. In particular, we consider a $(2M-1)$\-periodic continuous function $f$ with highest mode $(M-1)\omega$\. We can write $f$ in terms of its coordinates on the Fourier basis functions $(t \mapsto 1,\ t \mapsto \cos(m\omega t),\ t \mapsto \sin(m\omega t),\ m < M)$:

$$f(t) = \sum_{m=0}^{M-1} a_m \cos(m\omega t) + b_m \sin(m\omega t), \tag{7}$$

setting $b_0 = 0$ by convention\. At a time $t$, even if we do not have access to past keys and values for $t' < t$, we can write a convolution of past keys and values with any such function $f$:

$$\begin{split} (f \ast \boldsymbol{k})(t) &= \sum_{t'=1}^{t} f(t-t')\,\boldsymbol{k}_{t'}\\ &= \sum_{t'=1}^{t} \sum_{m=0}^{M-1} \left(a_m \cos(m\omega(t-t')) + b_m \sin(m\omega(t-t'))\right) \boldsymbol{k}_{t'}\\ &= \sum_{m=0}^{M-1} a_m \sum_{t'=1}^{t} \cos(m\omega(t-t'))\,\boldsymbol{k}_{t'} + b_m \sum_{t'=1}^{t} \sin(m\omega(t-t'))\,\boldsymbol{k}_{t'}\\ &= \sum_{m=0}^{M-1} a_m \mathbf{K}^{\mathrm{cos}}_{m,t} + b_m \mathbf{K}^{\mathrm{sin}}_{m,t}\\ &= \mathbf{K}^{\mathrm{cos}}_{t} \cdot \mathbf{a} + \mathbf{K}^{\mathrm{sin}}_{t} \cdot \mathbf{b}, \end{split} \tag{8}$$

where $\mathbf{a} = (a_0, \dots, a_{M-1})^{\top}$ and $\mathbf{b} = (b_0, \dots, b_{M-1})^{\top}$\. We used the state formula of Eq\. \([6](https://arxiv.org/html/2606.09862#S3.E6)\) and exchanged the sums over time and over modes\. In the [SSM](https://arxiv.org/html/2606.09862#id1.1.id1) formalism, this step corresponds to the state readout using the $C$ matrix \(Eq\. \([4](https://arxiv.org/html/2606.09862#S2.E4)\)\)\. Looking back at Eq\. \([8](https://arxiv.org/html/2606.09862#S3.E8)\), one interesting function $f$ to consider is the so\-called Dirichlet kernel \[[19](https://arxiv.org/html/2606.09862#bib.bib19)\]:

$$D_M(t) = \frac{1}{2M-1} + \frac{2}{2M-1} \sum_{m=1}^{M-1} \cos(m\omega t). \tag{9}$$

This corresponds to setting $a_0 = 1/(2M-1)$, $a_m = 2/(2M-1)$ for $m \geq 1$, and $b_m = 0$ for all $m$\. The kernel $D_M$ is such that $D_M(0) = 1$ and $D_M(t) = 0$ for integers $t \in [1, \dots, 2M-2]$ \(Fig\. [1](https://arxiv.org/html/2606.09862#S3.F1) top\-right\)\. Applying the result of Eq\. \([8](https://arxiv.org/html/2606.09862#S3.E8)\) with those coefficients gives:

$$\begin{split} (D_M \ast \boldsymbol{k})(t) &= \sum_{t'=1}^{t} D_M(t-t')\,\boldsymbol{k}_{t'}\\ &= \sum_{t'=1}^{t} \delta(t \equiv t'~[2M-1])\,\boldsymbol{k}_{t'}\\ &= \sum_{t' \equiv t~[2M-1]} \boldsymbol{k}_{t'}. \end{split} \tag{10}$$

Here $[2M-1]$ denotes congruence modulo $2M-1$\. In particular, if the sequence is shorter than $2M-1$, the result is the latest key\. We can also consider Dirichlet kernels translated by any integer $\Delta t \in [0, \dots, 2M-2]$:

$$D_{M,\Delta t}(t) = D_M(t - \Delta t) \tag{11}$$

We show in Appendix [B](https://arxiv.org/html/2606.09862#A2) that the Fourier coefficients of $D_{M,\Delta t}$ are:

$$\begin{aligned} \mathbf{A}_{\Delta t,0} &= \frac{1}{2M-1}, & \mathbf{B}_{\Delta t,0} &= 0,\\ \mathbf{A}_{\Delta t,m>0} &= \frac{2\cos(m\omega\Delta t)}{2M-1}, & \mathbf{B}_{\Delta t,m>0} &= \frac{2\sin(m\omega\Delta t)}{2M-1}. \end{aligned} \tag{12}$$

We gather those coefficients in two matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{(2M-1) \times M}$\. We can then define $\tilde{\mathbf{K}}_t \in \mathbb{R}^{D \times (2M-1)}$ \(respectively $\tilde{\mathbf{V}}_t$\) as:

$$\tilde{\mathbf{K}}_t = \mathbf{K}^{\mathrm{cos}}_{t} \mathbf{A}^{\top} + \mathbf{K}^{\mathrm{sin}}_{t} \mathbf{B}^{\top}. \tag{13}$$

When the sequence length is less than $2M-1$, $\tilde{\mathbf{K}}_t$ contains the keys of the previous $2M-1$ time steps\. When the sequence length exceeds $2M-1$, the keys are added modulo $2M-1$ following Eq\. \([10](https://arxiv.org/html/2606.09862#S3.E10)\)\. More generally, we can set $\omega = 2\pi/T$, where $T$ \($T \geq 2M-1$\) is the period of the fundamental frequency and is the second most important parameter of our model\. This interpolates keys and values instead of indexing them, which effectively blurs the sequence \(Fig\. [1](https://arxiv.org/html/2606.09862#S3.F1) bottom right\)\. The output of [Blurry Window Attention](https://arxiv.org/html/2606.09862#id3.3.id3) \([BLA](https://arxiv.org/html/2606.09862#id3.3.id3)\) is computed as:

$$\mathbf{O} = \mathrm{Softmax}\left(\frac{\mathbf{Q}\tilde{\mathbf{K}}^{\top}}{\sqrt{D}} + \mathbf{M}_{\mathrm{BLA}}\right)\tilde{\mathbf{V}}. \tag{14}$$

The mask $\mathbf{M}_{\mathrm{BLA}}$ is not added to enforce causality but to prevent the model from attending to the zero\-valued keys and values before the beginning of the sequence\. The reason is that the model sees the sequence of keys and values from the present time back into the past, so keys and values before the beginning of the sequence should not be attended to\.
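The chain from the recurrence of Eq\. \(5\) to the reconstructed window of Eq\. \(13\) can be checked numerically. The sketch below is a minimal illustration for scalar keys and $T = 2M-1$; the function names are hypothetical, and it assumes that the state already includes the most recent key, which is the indexing of the closed form in Eq\. \(6\).

```python
import math

def bla_key_state(keys, M):
    """Rotation recurrence of Eq. (5) for scalar keys; returns (K_cos, K_sin).

    Assumption: the returned state already includes the last key, which is
    the indexing used by the closed form in Eq. (6).
    """
    omega = 2 * math.pi / (2 * M - 1)
    rot = [(math.cos(m * omega), math.sin(m * omega)) for m in range(M)]
    k_cos, k_sin = [0.0] * M, [0.0] * M
    for k in keys:
        # Rotate every mode by m*omega, then add the new key to the cosine part.
        k_cos, k_sin = (
            [c * k_cos[m] - s * k_sin[m] + k for m, (c, s) in enumerate(rot)],
            [s * k_cos[m] + c * k_sin[m] for m, (c, s) in enumerate(rot)],
        )
    return k_cos, k_sin

def reconstruct(k_cos, k_sin, M):
    """Eq. (13) with the coefficients of Eq. (12): slot dt is the key seen dt steps ago."""
    N = 2 * M - 1
    omega = 2 * math.pi / N
    window = []
    for dt in range(N):
        value = k_cos[0] / N
        for m in range(1, M):
            value += 2 / N * (math.cos(m * omega * dt) * k_cos[m]
                              + math.sin(m * omega * dt) * k_sin[m])
        window.append(value)
    return window

M = 3  # 2M - 1 = 5 slots
short = reconstruct(*bla_key_state([1.0, 2.0, 3.0], M), M)
longer = reconstruct(*bla_key_state([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], M), M)
```

Up to floating\-point error, `short` is `[3, 2, 1, 0, 0]`: the three keys appear from newest to oldest, and the two trailing zeros are the slots that $\mathbf{M}_{\mathrm{BLA}}$ masks out. With six keys, `longer` is `[7, 5, 4, 3, 2]`: the first key wraps around and is added to the newest one, which is the modulo\-$(2M-1)$ summation of Eq\. \(10\).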

Figure 2: Comparison of Vanilla Attention and Blurry Attention\. a\) Vanilla Attention mask with causal masking\. b\) Blurry attention mask when $T = L = 2M-1$ is identical to vanilla attention due to exact interpolation \(compare red boxes\)\. When $L > T = 2M-1$, the [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) mask becomes oblong and contains a superposition of KV modulo $T$ for $t \in [T, L]$\. c\) When $L = T > 2M-1$, the mask has lower token resolution \(compare blue boxes\)\.

In Fig\. [2](https://arxiv.org/html/2606.09862#S3.F2), we compare the attention masks of full attention and [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) obtained with Eq\. \([14](https://arxiv.org/html/2606.09862#S3.E14)\) for $M = 8$, in the cases $T = 2M-1$ and $T > 2M-1$\. We see that [BLA](https://arxiv.org/html/2606.09862#id3.3.id3), in contrast to many other [SSMs](https://arxiv.org/html/2606.09862#id1.1.id1) and similarly to [ABC](https://arxiv.org/html/2606.09862#id6.6.id6) methods, can exhibit “sharp” attention matrices, which is important for retrieval \[[20](https://arxiv.org/html/2606.09862#bib.bib20)\]\. When $T = 2M-1$ and the sequence length is smaller than $2M-1$, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) has the same attention mask as vanilla attention \(red boxes in Fig\. [2](https://arxiv.org/html/2606.09862#S3.F2)a,b\)\. However, when the sequence length exceeds $2M-1$, the tokens from previous windows are summed modulo $2M-1$ instead of being dropped as in [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) \(bottom left corner in Fig\. [2](https://arxiv.org/html/2606.09862#S3.F2)a,b\)\. In theory, this allows [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) to capture longer\-range dependencies than [SWA](https://arxiv.org/html/2606.09862#id9.9.id9), at the risk of having the state diverge; we show later how a decay mechanism can be added to [BLA](https://arxiv.org/html/2606.09862#id3.3.id3)\. 
Finally, when $T > 2M-1$, the attention mask has different query and key time scales, leading to a blurry attention mask\. [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) is also similar to the [ABC](https://arxiv.org/html/2606.09862#id6.6.id6) model with $S = 2M-1$ slots\. However, in contrast to [ABC](https://arxiv.org/html/2606.09862#id6.6.id6), and owing to its grounding in Fourier theory, the different modes/slots can be combined to generate an arbitrary function $f$ \(Eq\. \([7](https://arxiv.org/html/2606.09862#S3.E7)\)\)\.
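One way to see where the blur comes from is to evaluate the Dirichlet kernel of Eq\. \(9\) at integer lags for two periods. The sketch below assumes that the only change for $T > 2M-1$ is $\omega = 2\pi/T$, with the same $1/(2M-1)$ normalization; the source does not spell out the normalization in that case, and the helper name `dirichlet` is hypothetical.

```python
import math

def dirichlet(M, T, t):
    """Eq. (9) evaluated with omega = 2*pi/T (assumed: same 1/(2M-1) normalization)."""
    omega = 2 * math.pi / T
    return (1 + 2 * sum(math.cos(m * omega * t) for m in range(1, M))) / (2 * M - 1)

M = 4  # 2M - 1 = 7
sharp = [dirichlet(M, 7, t) for t in range(7)]    # T = 2M-1: one-hot at lag 0
blurry = [dirichlet(M, 14, t) for t in range(7)]  # T = 2(2M-1): weight leaks to neighbours
```

For $T = 2M-1$ the kernel is 1 at lag 0 and numerically 0 at every other integer lag, so each slot indexes a single token. For $T = 14$ the value at lag 1 is still above 0.5 in this example, so neighbouring tokens are mixed into the same slot, and the kernel only returns to 1 after a full period $T$.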
### 3\.2 A More Efficient Implementation
In Section [3\.1](https://arxiv.org/html/2606.09862#S3.SS1), we introduced [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) from the standpoint of [SSM](https://arxiv.org/html/2606.09862#id1.1.id1) theory\. As a result, the state formula in Eq\. \([6](https://arxiv.org/html/2606.09862#S3.E6)\) has the form of a convolution\. We show here that, owing to the permutation invariance of softmax attention, we can instead write the state of [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) as a cumulative sum\. We begin by rewriting the blurred key state for a specific slot $\Delta t \in [0, \dots, 2M-2]$ at a time step $t$\. Using Eqs\. \([8](https://arxiv.org/html/2606.09862#S3.E8)\), \([10](https://arxiv.org/html/2606.09862#S3.E10)\), and \([13](https://arxiv.org/html/2606.09862#S3.E13)\), we have:

$$\begin{split} \tilde{\mathbf{K}}_t[\Delta t] &= \sum_{t'=1}^{t} D_{M,\Delta t}(t-t')\,\boldsymbol{k}_{t'} = \sum_{t'=1}^{t} D_{M,0}(t-t'-\Delta t)\,\boldsymbol{k}_{t'}\\ &= \sum_{t'=1}^{t} D_{M,0}(t'-(t-\Delta t))\,\boldsymbol{k}_{t'} = \sum_{t'=1}^{t} D_{M,t-\Delta t}(t')\,\boldsymbol{k}_{t'}\\ &= \sum_{t'=1}^{t} D_{M,\Delta\tau}(t')\,\boldsymbol{k}_{t'}, \end{split} \tag{15}$$

where we use the fact that $D_{M,0}$ is an even function \(Eq\. \([9](https://arxiv.org/html/2606.09862#S3.E9)\)\) and define $\Delta\tau \equiv t - \Delta t~[2M-1]$\. Therefore, we can apply the manipulation in the opposite direction and find that:

$$\tilde{\mathbf{K}}_t[\Delta\tau] = \sum_{t'=1}^{t} D_{M,\Delta t}(t')\,\boldsymbol{k}_{t'}. \tag{16}$$

As a result, we can simply first compute the quantity:

$$D_{M,\Delta t}(t') = \cos(\omega t' \mathbf{m})\,\mathbf{A}^{\top} + \sin(\omega t' \mathbf{m})\,\mathbf{B}^{\top}, \tag{17}$$

where $\mathbf{m} = (0, \dots, M-1)$, and multiply the current key with it\. Accumulating this product over time computes $\tilde{\mathbf{K}}_t[\Delta\tau]$ \(Eq\. \([16](https://arxiv.org/html/2606.09862#S3.E16)\)\), which is a time\-rolling column of $\tilde{\mathbf{K}}_t$\. Since softmax attention is permutation invariant, the output is unchanged, and we can use $\tilde{\mathbf{K}}_t[\Delta\tau]$ instead of $\tilde{\mathbf{K}}_t[\Delta t]$\. With this manipulation, we see that [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) can be computed with a cumulative sum operation similar to linear attention\. We provide the efficient algorithm for the recurrent mode in Alg\. [1](https://arxiv.org/html/2606.09862#alg1) and for the chunk mode in Alg\. [3](https://arxiv.org/html/2606.09862#alg3) of Appendix [C](https://arxiv.org/html/2606.09862#A3)\.
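The claim that the convolution form and the cumulative\-sum form hold the same slots in a different order can be verified directly. The sketch below uses scalar keys, $T = 2M-1$, and 0\-based time indices (the text uses 1\-based indices); the function names are hypothetical.

```python
import math

def dirichlet(M, t):
    """Dirichlet kernel of Eq. (9) with omega = 2*pi/(2M-1)."""
    N = 2 * M - 1
    return (1 + 2 * sum(math.cos(2 * math.pi * m * t / N) for m in range(1, M))) / N

def slots_by_convolution(keys, M):
    """Slot dt = sum over t' of D_M(t - t' - dt) * k_t' (first line of Eq. 15)."""
    t = len(keys) - 1  # current time step
    return [sum(dirichlet(M, t - s - dt) * k for s, k in enumerate(keys))
            for dt in range(2 * M - 1)]

def slots_by_cumsum(keys, M):
    """Slot j accumulates D_M(t' - j) * k_t' as keys arrive (Eq. 16); nothing is rotated."""
    slots = [0.0] * (2 * M - 1)
    for s, k in enumerate(keys):
        for j in range(len(slots)):
            slots[j] += dirichlet(M, s - j) * k
    return slots

M, keys = 3, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
conv, cum = slots_by_convolution(keys, M), slots_by_cumsum(keys, M)
t, N = len(keys) - 1, 2 * M - 1
# Slot dt of the convolution form is slot (t - dt) mod N of the cumulative sum.
same = all(abs(conv[dt] - cum[(t - dt) % N]) < 1e-9 for dt in range(N))
```

The two lists contain the same numbers and differ only by the rolling permutation $\Delta\tau \equiv t - \Delta t~[2M-1]$. Since the softmax in Eq\. \(14\) sums over slots, applying the same permutation to keys and values leaves the output unchanged, which is why the cheaper cumulative sum can be used.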
Algorithm 1: Efficient recurrent [BLA](https://arxiv.org/html/2606.09862#id3.3.id3)

- Input: $\mathbf{Q},\mathbf{K},\mathbf{V}\in\mathbb{R}^{L\times H\times D}$, period $T\in\mathbb{R}^{H}$, interpolation matrices $\mathbf{A}\in\mathbb{R}^{(2M-1)\times M}$, $\mathbf{B}\in\mathbb{R}^{(2M-1)\times M}$
- Clamp period: $T\leftarrow\max(T,\,2\cdot M-1)$
- Compute dilated time grid: $\mathrm{dilated\_time}\leftarrow\mathrm{round}\left(\frac{[0,\dots,2M-1-1]}{2M-1}\otimes T\right)$
- $\omega\leftarrow 2\pi/T$
- Compute modes: $\mathrm{modes}\leftarrow\omega\otimes[0,1,\dots,M-1]$
- Initialize compressed KV states: $\mathbf{K}_{\mathrm{prev}},\mathbf{V}_{\mathrm{prev}}\leftarrow\mathbf{0}^{H\times D\times(2M-1)}$
- Initialize output: $\mathbf{O}\leftarrow\mathbf{0}^{L\times H\times D}$
- for $t=0$ to $L-1$ do
  - Compute interpolation coefficients: $\mathrm{interpolate}\leftarrow\cos(t\cdot\mathrm{modes})\cdot\mathbf{A}^{\top}+\sin(t\cdot\mathrm{modes})\cdot\mathbf{B}^{\top}$
  - Update compressed KV states: $\tilde{\mathbf{K}}\leftarrow\mathbf{K}_{\mathrm{prev}}+\mathbf{K}[t]\otimes\mathrm{interpolate}$
  - $\tilde{\mathbf{V}}\leftarrow\mathbf{V}_{\mathrm{prev}}+\mathbf{V}[t]\otimes\mathrm{interpolate}$
  - $\mathbf{K}_{\mathrm{prev}}\leftarrow\tilde{\mathbf{K}}$
  - $\mathbf{V}_{\mathrm{prev}}\leftarrow\tilde{\mathbf{V}}$
  - Compute attention weights: $\mathbf{M}_{\mathrm{BLA}}\leftarrow\mathrm{where}(t\geq\mathrm{dilated\_time},\,0.0,\,-\infty)$
  - $\mathbf{O}[t]\leftarrow\mathrm{Softmax}\left(\frac{\mathbf{Q}[t]\cdot\tilde{\mathbf{K}}}{\sqrt{D}}+\mathbf{M}_{\mathrm{BLA}}\right)\cdot\tilde{\mathbf{V}}$
- end for
- Return: $\mathbf{O}$
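Algorithm 1 can be sketched in plain Python for a single head. This is a minimal illustration, not the authors' implementation: the construction of $\mathbf{A}$ and $\mathbf{B}$ in `make_interpolation` is an assumption (cosine and sine of the mode frequencies on the dilated grid, weighted as in the Dirichlet kernel of Appendix B), the period is a scalar rather than one value per head, and all function names are hypothetical. A practical version would be vectorized over heads with an array library.

```python
import math
import random

def make_interpolation(M, T):
    """Assumed A, B: A[j][m] = c_m cos(m*omega*dt_j), B[j][m] = c_m sin(m*omega*dt_j)."""
    N = 2 * M - 1
    omega = 2.0 * math.pi / T
    grid = [round(j * T / N) for j in range(N)]            # dilated time grid
    c = [1.0 / N] + [2.0 / N] * (M - 1)                    # Dirichlet kernel weights
    A = [[c[m] * math.cos(m * omega * dt) for m in range(M)] for dt in grid]
    B = [[c[m] * math.sin(m * omega * dt) for m in range(M)] for dt in grid]
    return A, B, grid, omega

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]             # exp(-inf) is 0.0 for masked columns
    total = sum(exps)
    return [e / total for e in exps]

def bla_recurrent(Q, K, V, M, T):
    """Single-head recurrent BLA. Q, K, V are lists of L vectors of length D."""
    N, D = 2 * M - 1, len(Q[0])
    T = max(T, N)                                          # clamp period
    A, B, grid, omega = make_interpolation(M, T)
    K_state = [[0.0] * N for _ in range(D)]                # D x (2M-1)
    V_state = [[0.0] * N for _ in range(D)]
    out = []
    for t in range(len(Q)):
        cos_t = [math.cos(t * omega * m) for m in range(M)]
        sin_t = [math.sin(t * omega * m) for m in range(M)]
        interp = [sum(cos_t[m] * A[j][m] + sin_t[m] * B[j][m] for m in range(M))
                  for j in range(N)]
        for d in range(D):                                 # rank-one state update
            for j in range(N):
                K_state[d][j] += K[t][d] * interp[j]
                V_state[d][j] += V[t][d] * interp[j]
        scores = [sum(Q[t][d] * K_state[d][j] for d in range(D)) / math.sqrt(D)
                  if t >= grid[j] else -math.inf for j in range(N)]
        w = softmax(scores)
        out.append([sum(w[j] * V_state[d][j] for j in range(N)) for d in range(D)])
    return out

def causal_attention(Q, K, V):
    """Reference: full causal softmax attention."""
    D, out = len(Q[0]), []
    for t in range(len(Q)):
        w = softmax([sum(q * k for q, k in zip(Q[t], K[s])) / math.sqrt(D)
                     for s in range(t + 1)])
        out.append([sum(w[s] * V[s][d] for s in range(t + 1)) for d in range(D)])
    return out

random.seed(0)
M, D = 3, 4
L = 2 * M - 1                                              # sequence fits in the window
Q, K, V = ([[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(L)] for _ in range(3))
bla_out = bla_recurrent(Q, K, V, M, T=L)
ref_out = causal_attention(Q, K, V)
max_err = max(abs(a - b) for ra, rb in zip(bla_out, ref_out) for a, b in zip(ra, rb))
```

Under these assumptions, with $T = 2M-1$ and a sequence no longer than $2M-1$ tokens, each state column holds exactly one key (or value), so the sketch should agree with full causal attention up to floating-point error. For longer sequences or larger $T$ the columns mix several tokens, which is the "blurry" regime.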
### 3.3 Adding State Decay to [BLA](https://arxiv.org/html/2606.09862#id3.3.id3)
One potential issue of [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) is that the states $\tilde{\mathbf{K}}$ and $\tilde{\mathbf{V}}$ continuously accumulate keys and values over time, as shown in Eqs. ([5](https://arxiv.org/html/2606.09862#S3.E5)) and ([16](https://arxiv.org/html/2606.09862#S3.E16)). For long sequences, this unbounded growth can lead to numerical instabilities. A natural way to mitigate this issue is to introduce a decay mechanism into the state update rule, as in other [SSMs](https://arxiv.org/html/2606.09862#id1.1.id1) [[5](https://arxiv.org/html/2606.09862#bib.bib5), [7](https://arxiv.org/html/2606.09862#bib.bib7)]. A principled choice for this decay is to multiply the previous state by $(1-D_{M,\Delta t}(t'))$, with $D_{M,\Delta t}(t')$ as introduced in Eq. ([17](https://arxiv.org/html/2606.09862#S3.E17)). This formulation is particularly advantageous because it allows us to generalize [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) and to recover the classic [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) when the period $T$ is set to $2M-1$. As established in Eq. ([10](https://arxiv.org/html/2606.09862#S3.E10)), when the sequence length is less than or equal to $2M-1$, the matrices $\tilde{\mathbf{K}}$ and $\tilde{\mathbf{V}}$ contain the keys and values from the preceding $2M-1$ time steps. By incorporating this decay, we can implement a controlled forgetting mechanism: it “flushes” the previous exact value when $T=2M-1$, similar to [SWA](https://arxiv.org/html/2606.09862#id9.9.id9), or maintains a decaying history of previous values when $T>2M-1$.
By setting the decay term to $1-D_{M,\Delta t}(t')$, we get almost the same formulation as [GSA](https://arxiv.org/html/2606.09862#id5.5.id5), which allows us to reuse the efficient implementations of the FLA repository (a double GLA pass, implemented in https://github.com/fla-org/flash-linear-attention/blob/main/fla/ops/gsa/chunk.py).
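The "flushing" behaviour at $T = 2M-1$ can be illustrated with scalar keys. The sketch below is a toy example under stated assumptions: the decay is applied per state column as `(1 - gate) * state + gate * key`, the gate is the Dirichlet kernel row for $T = 2M-1$, and the helper names are hypothetical. It is not the chunked implementation referenced above.

```python
import math

def dirichlet_row(M, t):
    """D_{M,j}(t) for columns j = 0..2M-2 with T = 2M-1: one-hot at j = t mod (2M-1)."""
    N = 2 * M - 1
    return [(1.0 + 2.0 * sum(math.cos(2.0 * math.pi * m * (t - j) / N)
                             for m in range(1, M))) / N for j in range(N)]

def decayed_update(state, x, gate):
    """Per-column update with decay: state[j] <- (1 - gate[j]) * state[j] + gate[j] * x."""
    return [(1.0 - g) * s + g * x for s, g in zip(state, gate)]

def plain_update(state, x, gate):
    """Update without decay, as in Alg. 1: state[j] <- state[j] + gate[j] * x."""
    return [s + g * x for s, g in zip(state, gate)]

M = 3                                        # 2M - 1 = 5 columns
with_decay = [0.0] * (2 * M - 1)
without_decay = [0.0] * (2 * M - 1)
for t in range(12):                          # scalar "keys" 0, 1, ..., 11
    gate = dirichlet_row(M, t)
    with_decay = decayed_update(with_decay, float(t), gate)
    without_decay = plain_update(without_decay, float(t), gate)

window = sorted(round(s) for s in with_decay)        # the last five keys
aliased = [round(s) for s in without_decay]          # sums of keys five steps apart
```

With decay, each column is overwritten every $2M-1$ steps, so the state holds the last five keys as a sliding window would. Without decay, keys that are $2M-1$ steps apart pile up in the same column.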
## 4 Experiments
We perform experiments to evaluate how [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) performs as a function of its state size. We mainly compare [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) with popular linear models from the [Flash Linear Attention](https://arxiv.org/html/2606.09862#id7.7.id7) ([FLA](https://arxiv.org/html/2606.09862#id7.7.id7)) repository [[21](https://arxiv.org/html/2606.09862#bib.bib21)], namely [GLA](https://arxiv.org/html/2606.09862#id4.4.id4) [[7](https://arxiv.org/html/2606.09862#bib.bib7)], [GDN](https://arxiv.org/html/2606.09862#id8.8.id8) [[22](https://arxiv.org/html/2606.09862#bib.bib22)], [GSA](https://arxiv.org/html/2606.09862#id5.5.id5) [[10](https://arxiv.org/html/2606.09862#bib.bib10)] and [SWA](https://arxiv.org/html/2606.09862#id9.9.id9). For a fair comparison, we match the state sizes of the different models according to Table [1](https://arxiv.org/html/2606.09862#S4.T1). We choose the [Multi-Query Associate Recall](https://arxiv.org/html/2606.09862#id10.10.id10) ([MQAR](https://arxiv.org/html/2606.09862#id10.10.id10)) [[23](https://arxiv.org/html/2606.09862#bib.bib23)] and RegBench [[24](https://arxiv.org/html/2606.09862#bib.bib24)] synthetic benchmarks to evaluate [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) because they require retrieval abilities. The hyperparameters are given in Appendix [D](https://arxiv.org/html/2606.09862#A4).
Table 1: The state sizes of different linear [LMs](https://arxiv.org/html/2606.09862#id12.12.id12) as a function of model parameters.

### 4.1 Multi-Query Associative Recall
We first evaluate [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) without the decay mechanism on the [MQAR](https://arxiv.org/html/2606.09862#id10.10.id10) task. In this task, the model is presented with a sequence of key-value pairs and is trained to output the correct values for multiple keys. We use a challenging setting with a sequence length of 512 and 64 key-value pairs. We produce the Pareto frontier shown in Fig. [3](https://arxiv.org/html/2606.09862#S4.F3)a by measuring the maximum validation accuracy over a sweep of learning rates, seeds, and parameters controlling the state sizes (Table [1](https://arxiv.org/html/2606.09862#S4.T1)). We found that [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) uses its state size more efficiently than [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) (by $8\times$) and [GSA](https://arxiv.org/html/2606.09862#id5.5.id5), and comes close to [GLA](https://arxiv.org/html/2606.09862#id4.4.id4) and [GDN](https://arxiv.org/html/2606.09862#id8.8.id8). We found that a short convolution was required for [GLA](https://arxiv.org/html/2606.09862#id4.4.id4), [GDN](https://arxiv.org/html/2606.09862#id8.8.id8) and [GSA](https://arxiv.org/html/2606.09862#id5.5.id5) to achieve non-trivial performance, while it did not help for [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) and [SWA](https://arxiv.org/html/2606.09862#id9.9.id9). In addition, for [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) we introduce the token resolution as the quantity:
$$
\frac{T}{2M-1}. \tag{18}
$$

We found that the performance on [MQAR](https://arxiv.org/html/2606.09862#id10.10.id10) depends strongly on the token resolution (Fig. [3](https://arxiv.org/html/2606.09862#S4.F3)b), with a sharp maximum at a resolution of 2 tokens. We hypothesize that this is due to the format of the task, where key-value pairs correspond to pairs of tokens.
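As a small worked example of Eq. (18), the sketch below computes the token resolution and the dilated time grid of Alg. 1 for a few settings. The function names are hypothetical, the values of $M$ and $T$ are chosen only for illustration, and Python's `round` rounds halves to even, which may differ from the rounding used in the authors' code.

```python
def token_resolution(T, M):
    """Eq. (18): number of tokens covered by one state column."""
    return T / (2 * M - 1)

def dilated_time_grid(T, M):
    """Grid from Alg. 1: round(j * T / (2M-1)) for j = 0..2M-2."""
    N = 2 * M - 1
    return [round(j * T / N) for j in range(N)]

M = 4                                    # 2M - 1 = 7 state columns
res_one = token_resolution(7, M)         # T = 2M-1: one column per token
res_two = token_resolution(14, M)        # T = 2(2M-1): one column per two tokens
grid_one = dilated_time_grid(7, M)
grid_two = dilated_time_grid(14, M)
```

In this example a resolution of 2 places the seven columns at every second time step, so the same state covers 14 tokens instead of 7.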
Finally, we tested whether [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) can model longer-range dependencies than [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) for a given window size. To do so, we increased the model dimension while matching the “window sizes” of [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) and [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) ($2M-1$ and $w$, respectively). We observed that the performance of [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) increased with higher model dimensions while that of [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) did not (Fig. [3](https://arxiv.org/html/2606.09862#S4.F3)c). This suggests that even though keys and values end up overlapping due to the periodicity $T$, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) can leverage orthogonality in the head dimension to enable retrieval beyond the window size.

Figure 3: Results on Multi-Query Associative Recall [[23](https://arxiv.org/html/2606.09862#bib.bib23)]. a) Pareto frontier of [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) compared to other linear models. [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) improves the Pareto frontier of [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) by $8\times$. b) The period parameter controlling the token resolution has a high impact on performance, with a clear optimum at a resolution of 2 tokens. c) [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) can leverage bigger model dimensions to store more information and improve performance, while [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) cannot.
### 4.2 RegBench

Figure 4: Results on the RegBench task [[24](https://arxiv.org/html/2606.09862#bib.bib24)] using 5000 DFAs. Accuracy of different models as the state size increases. We report the best test accuracy out of three different seeds. In contrast to the other linear models, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) improves as the state size increases. With a token resolution of two, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) both performs better at small state sizes and reaches performance similar to full attention as the state size increases.

The RegBench benchmark is another task where linear architectures struggle to match the performance of Transformers [[12](https://arxiv.org/html/2606.09862#bib.bib12), [24](https://arxiv.org/html/2606.09862#bib.bib24)]. In this task, the objective is to infer the underlying structure of a grammar rule from a set of deterministic finite automata (DFAs). We train and evaluate different models on a dataset of 5,000 DFAs, including [Sliding Window Attention](https://arxiv.org/html/2606.09862#id9.9.id9) ([SWA](https://arxiv.org/html/2606.09862#id9.9.id9)) with varying window sizes, [Gated Linear Attention](https://arxiv.org/html/2606.09862#id4.4.id4) ([GLA](https://arxiv.org/html/2606.09862#id4.4.id4)), [Gated DeltaNet](https://arxiv.org/html/2606.09862#id8.8.id8) ([GDN](https://arxiv.org/html/2606.09862#id8.8.id8)), [Gated Slot Attention](https://arxiv.org/html/2606.09862#id5.5.id5) ([GSA](https://arxiv.org/html/2606.09862#id5.5.id5)), and our proposed [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) model with state decay across different token resolutions (Eq. ([18](https://arxiv.org/html/2606.09862#S4.E18))). To isolate the effect of the different [SSM](https://arxiv.org/html/2606.09862#id1.1.id1) modules, we run all models without the short 1D convolution.
Our findings (Fig. [4](https://arxiv.org/html/2606.09862#S4.F4)) indicate that, in contrast to [GLA](https://arxiv.org/html/2606.09862#id4.4.id4) and [GDN](https://arxiv.org/html/2606.09862#id8.8.id8), [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) demonstrates a distinct state scaling advantage. As the state size increases (by varying the number of modes for [BLA](https://arxiv.org/html/2606.09862#id3.3.id3), the number of slots for [GSA](https://arxiv.org/html/2606.09862#id5.5.id5), the key expansion ratio for [GLA](https://arxiv.org/html/2606.09862#id4.4.id4), the value expansion ratio for [GDN](https://arxiv.org/html/2606.09862#id8.8.id8), and the window size for [SWA](https://arxiv.org/html/2606.09862#id9.9.id9)), our model’s performance continues to improve. This suggests that [BLA](https://arxiv.org/html/2606.09862#id3.3.id3)’s architecture is more effective at leveraging its state capacity on this benchmark. Of particular interest is the comparison with [GSA](https://arxiv.org/html/2606.09862#id5.5.id5): without the trigonometric interpolation kernel, the model fails to increase its performance as the number of slots increases. Furthermore, consistent with observations on the [MQAR](https://arxiv.org/html/2606.09862#id10.10.id10) task, we note a performance boost for [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) at smaller state sizes when the token resolution is set to 2, outperforming [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) in this regime.
## 5 Discussion
We presented [Blurry Window Attention](https://arxiv.org/html/2606.09862#id3.3.id3) ([BLA](https://arxiv.org/html/2606.09862#id3.3.id3)), a novel linear attention model that can generate “sharp” attention masks and bridges the gap between [SWA](https://arxiv.org/html/2606.09862#id9.9.id9) and [ABC](https://arxiv.org/html/2606.09862#id6.6.id6). We showed that our model can utilize its state size better than [SWA](https://arxiv.org/html/2606.09862#id9.9.id9), particularly in the small state size regime, and that its performance scales better with state size than that of the other linear models. With an appropriate decay mechanism and period parameter, the [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) model can recover an exact implementation of [SWA](https://arxiv.org/html/2606.09862#id9.9.id9), and can therefore be understood as a more general version of [SWA](https://arxiv.org/html/2606.09862#id9.9.id9).
A large body of work has been done to mitigate the quadratic complexity of full attention. As explained in Section [3.1](https://arxiv.org/html/2606.09862#S3.SS1), [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) is motivated by the early [SSM](https://arxiv.org/html/2606.09862#id1.1.id1) theory [[14](https://arxiv.org/html/2606.09862#bib.bib14), [3](https://arxiv.org/html/2606.09862#bib.bib3)]. However, while [SSMs](https://arxiv.org/html/2606.09862#id1.1.id1) were introduced to compress continuous signals and involve a discretization step, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) uses the theory from the standpoint of discrete interpolation, bypassing the need for discretization. [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) is also related to the Attention with Bounded-memory Control theory [[9](https://arxiv.org/html/2606.09862#bib.bib9)], which keeps separate key and value states. This choice, however, sacrifices some state efficiency: to store $D$ KV associations of dimension $D$, [LA](https://arxiv.org/html/2606.09862#id2.2.id2) models need $D^{2}$ space, while a model keeping keys and values separate needs $2D^{2}$ space. We hypothesize that this difference explains why [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) does not fully match the efficiency of [LA](https://arxiv.org/html/2606.09862#id2.2.id2) models on [MQAR](https://arxiv.org/html/2606.09862#id10.10.id10), falling short by roughly a factor of 2 (Fig. [3](https://arxiv.org/html/2606.09862#S4.F3)a). We give a more complete discussion of related work in Appendix [A](https://arxiv.org/html/2606.09862#A1).
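The factor of 2 can be made explicit with a rough per-head count. The symbols $N$, $S_{\mathrm{LA}}$ and $S_{\mathrm{sep}}$ are introduced here only for illustration: $N$ is the number of stored columns ($N = 2M-1$ for BLA, following the state shapes in Alg. 1), and the count ignores any additional per-model parameters or gates.

$$
S_{\mathrm{LA}} = D\cdot D = D^{2},
\qquad
S_{\mathrm{sep}} = \underbrace{D\cdot N}_{\text{keys}} + \underbrace{D\cdot N}_{\text{values}} = 2DN,
\qquad
N = D \;\Rightarrow\; S_{\mathrm{sep}} = 2D^{2} = 2\,S_{\mathrm{LA}}.
$$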
Like all methods, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) has limitations that open up avenues for future research. For example, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) is sensitive to the choice of the period and the number of modes, since the optimal hyperparameters depend on the sequence length of the task at hand. As a rule of thumb, we find that more modes are better, and that the period should be chosen to give a token resolution of 1 or 2. Another limitation is that the state capacity scales with $2D^{2}$ and not $D^{2}$ like other [LA](https://arxiv.org/html/2606.09862#id2.2.id2)/[SSM](https://arxiv.org/html/2606.09862#id1.1.id1) models. One potential solution is to keep a single latent representation for both keys and values instead of two separate states, similar to multi-latent attention [[25](https://arxiv.org/html/2606.09862#bib.bib25)]. Furthermore, given the impact of the token resolution on performance, designing a smarter interpolation mechanism seems like a promising direction for improving the model.
## References
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. *Advances in neural information processing systems*, 30, 2017.
- Xiao [2025] Guangxuan Xiao. Why stacking sliding windows can’t see very far. [https://guangxuanx.com/blog/stacking-swa.html](https://guangxuanx.com/blog/stacking-swa.html), 2025.
- Gu et al. [2020] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. *Advances in neural information processing systems*, 33:1474–1487, 2020.
- Gu et al. [2021] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. *arXiv preprint arXiv:2111.00396*, 2021.
- Gu and Dao [2023] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. *arXiv preprint arXiv:2312.00752*, 2023.
- Katharopoulos et al. [2020] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In *International conference on machine learning*, pages 5156–5165. PMLR, 2020.
- Yang et al. [2023] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. *arXiv preprint arXiv:2312.06635*, 2023.
- Yang et al. [2024a] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. *arXiv preprint arXiv:2412.06464*, 2024a.
- Peng et al. [2022] Hao Peng, Jungo Kasai, Nikolaos Pappas, Dani Yogatama, Zhaofeng Wu, Lingpeng Kong, Roy Schwartz, and Noah A Smith. Abc: Attention with bounded-memory control. In *Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 7469–7483, 2022.
- Zhang et al. [2024a] Yu Zhang, Songlin Yang, Rui-Jie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, et al. Gated slot attention for efficient linear-time sequence modeling. *Advances in Neural Information Processing Systems*, 37:116870–116898, 2024a.
- Bick et al. [2025] Aviv Bick, Eric Xing, and Albert Gu. Understanding the skill gap in recurrent language models: The role of the gather-and-aggregate mechanism. *arXiv preprint arXiv:2504.18574*, 2025.
- von Oswald et al. [2025] Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, et al. Mesanet: Sequence modeling by locally optimal test-time training. *arXiv preprint arXiv:2506.05233*, 2025.
- Gu et al. [2022] Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. *Advances in Neural Information Processing Systems*, 35:35971–35983, 2022.
- Voelker et al. [2019] Aaron Voelker, Ivana Kajić, and Chris Eliasmith. Legendre memory units: Continuous-time representation in recurrent neural networks. *Advances in neural information processing systems*, 32, 2019.
- Orvieto et al. [2023] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In *International Conference on Machine Learning*, pages 26670–26698. PMLR, 2023.
- Fu et al. [2022] Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models. *arXiv preprint arXiv:2212.14052*, 2022.
- De et al. [2024] Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. *arXiv preprint arXiv:2402.19427*, 2024.
- Lahoti et al. [2026] Aakash Lahoti, Kevin Li, Berlin Chen, Caitlin Wang, Aviv Bick, J Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles. In *The Fourteenth International Conference on Learning Representations*, 2026. URL [https://openreview.net/forum?id=HwCvaJOiCj](https://openreview.net/forum?id=HwCvaJOiCj).
- Edwards [1979] RE Edwards. The dirichlet and fejér kernels. cesàro summability. In *Fourier Series: A Modern Introduction Volume 1*, pages 78–86. Springer, 1979.
- Zhang et al. [2024b] Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. *arXiv preprint arXiv:2402.04347*, 2024b.
- Yang and Zhang [2024] Songlin Yang and Yu Zhang. Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024. URL [https://github.com/fla-org/flash-linear-attention](https://github.com/fla-org/flash-linear-attention).
- Yang et al. [2024b] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. *Advances in neural information processing systems*, 37:115491–115522, 2024b.
- Arora et al. [2023] Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Ré. Zoology: Measuring and improving recall in efficient language models. *arXiv preprint arXiv:2312.04927*, 2023.
- Akyürek et al. [2024] Ekin Akyürek, Bailin Wang, Yoon Kim, and Jacob Andreas. In-context language learning: Architectures and algorithms. In *International Conference on Machine Learning*, pages 787–812. PMLR, 2024.
- Liu et al. [2024] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. *arXiv preprint arXiv:2405.04434*, 2024.
- Choromanski et al. [2020] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. *arXiv preprint arXiv:2009.14794*, 2020.
- Peng et al. [2021] Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A Smith, and Lingpeng Kong. Random feature attention. *arXiv preprint arXiv:2103.02143*, 2021.
- Arora et al. [2024] Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. *arXiv preprint arXiv:2402.18668*, 2024.
- Dao and Gu [2024] Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. *arXiv preprint arXiv:2405.21060*, 2024.
- Schlag et al. [2021] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In *International conference on machine learning*, pages 9355–9366. PMLR, 2021.
- Team et al. [2025] Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture. *arXiv preprint arXiv:2510.26692*, 2025.
- Von Oswald et al. [2023a] Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In *International Conference on Machine Learning*, pages 35151–35174. PMLR, 2023a.
- Von Oswald et al. [2023b] Johannes Von Oswald, Maximilian Schlegel, Alexander Meulemans, Seijin Kobayashi, Eyvind Niklasson, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Max Vladymyrov, et al. Uncovering mesa-optimization algorithms in transformers. *arXiv preprint arXiv:2309.05858*, 2023b.
- Zhang et al. [2018] Jiong Zhang, Yibo Lin, Zhao Song, and Inderjit Dhillon. Learning long term dependencies via fourier recurrent units. In *International Conference on Machine Learning*, pages 5815–5823. PMLR, 2018.
- Dangovski et al. [2019] Rumen Dangovski, Li Jing, Preslav Nakov, Mićo Tatalović, and Marin Soljačić. Rotational unit of memory: a novel representation unit for rnns with scalable applications. *Transactions of the Association for Computational Linguistics*, 7:121–138, 2019.
- Lee-Thorp et al. [2021] James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon. Fnet: Mixing tokens with fourier transforms. *arXiv preprint arXiv:2105.03824*, 2021.
- Ma et al. [2024] Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, and Chunting Zhou. Megalodon: Efficient llm pretraining and inference with unlimited context length. *Advances in Neural Information Processing Systems*, 37:71831–71854, 2024.
- Qin et al. [2022] Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosformer: Rethinking softmax in attention. *arXiv preprint arXiv:2202.08791*, 2022.
- Li et al. [2025] Runchao Li, Yao Fu, Mu Sheng, Xianxuan Long, Haotian Yu, and Pan Li. Faedkv: Infinite-window fourier transform for unbiased kv cache compression. *arXiv preprint arXiv:2507.20030*, 2025.
- Scribano et al. [2023] Carmelo Scribano, Giorgia Franchini, Marco Prato, and Marko Bertogna. Dct-former: Efficient self-attention with discrete cosine transform. *Journal of Scientific Computing*, 94(3):67, 2023.
## Appendix A Related Work
A wide range of efficient attention models have emerged, each offering distinct strategies to scale linear architectures and compete with full attention. Similar to [BLA](https://arxiv.org/html/2606.09862#id3.3.id3), several approaches aim to enhance linear attention by enriching the feature map $\phi$ or the decay/gating mechanism of the state update, or by leveraging Fourier theory and complex-valued representations. Beyond the [LA](https://arxiv.org/html/2606.09862#id2.2.id2) and [SSM](https://arxiv.org/html/2606.09862#id1.1.id1) literature, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3), similar to [ABC](https://arxiv.org/html/2606.09862#id6.6.id6) and [GSA](https://arxiv.org/html/2606.09862#id5.5.id5), can also be viewed as a method for compressing the key-value (KV) cache, aligning it with a number of efficient attention mechanisms.
Performer [[26](https://arxiv.org/html/2606.09862#bib.bib26)] and Random Feature Attention [[27](https://arxiv.org/html/2606.09862#bib.bib27)] approximate the softmax kernel with random feature maps. Hedgehog [[20](https://arxiv.org/html/2606.09862#bib.bib20)] uses a learnable MLP as a feature map to generate sharper attention masks and improve retrieval. The Based architecture [[28](https://arxiv.org/html/2606.09862#bib.bib28)] uses a Taylor expansion feature map, which improves retrieval but expands the head dimension significantly. Gated Linear Attention, similar to Mamba and Mamba2 [[5](https://arxiv.org/html/2606.09862#bib.bib5), [29](https://arxiv.org/html/2606.09862#bib.bib29)], uses a simpler feature map and adds data- and feature-dependent decay to Linear Attention [[7](https://arxiv.org/html/2606.09862#bib.bib7)]. Schlag et al. [[30](https://arxiv.org/html/2606.09862#bib.bib30)] introduce the Delta rule to pack the state of Linear Attention more efficiently. Gated DeltaNet adds gating to the Delta rule [[8](https://arxiv.org/html/2606.09862#bib.bib8), [22](https://arxiv.org/html/2606.09862#bib.bib22)], and is one of the leading linear models [[31](https://arxiv.org/html/2606.09862#bib.bib31)]. Models built on [LA](https://arxiv.org/html/2606.09862#id2.2.id2) can be formulated in terms of test-time training, where the key-value association can be viewed as an online learning objective [[32](https://arxiv.org/html/2606.09862#bib.bib32)]. MesaNet [[33](https://arxiv.org/html/2606.09862#bib.bib33), [12](https://arxiv.org/html/2606.09862#bib.bib12)] makes this online learning objective depend on the whole trajectory to derive performance improvements.
Similar to [BLA](https://arxiv.org/html/2606.09862#id3.3.id3), many models use Fourier theory, and more generally complex numbers, to improve sequence modeling. Fourier recurrent units [[34](https://arxiv.org/html/2606.09862#bib.bib34)] summarize the recurrent states along the temporal dimension with Fourier basis functions. The rotational unit of memory [[35](https://arxiv.org/html/2606.09862#bib.bib35)] uses unitary matrices to mitigate vanishing gradients. FNet replaces attention with a Fourier transform to mix the tokens [[36](https://arxiv.org/html/2606.09862#bib.bib36)]. The linear recurrent unit [[15](https://arxiv.org/html/2606.09862#bib.bib15)] uses a carefully initialized complex diagonal matrix to model long-range dependencies. Megalodon [[37](https://arxiv.org/html/2606.09862#bib.bib37)] introduces a complex exponential moving average to design powerful linear models. CosFormer uses modulated cosine and sine states to add a locality bias to Linear Attention [[38](https://arxiv.org/html/2606.09862#bib.bib38)].
Finally, [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) is also related to works using Fourier theory to compress the KV cache. Examples include FAEDKV, which compresses the KV cache into the frequency domain using an Infinite-Window Fourier Transform [[39](https://arxiv.org/html/2606.09862#bib.bib39)], and DCT-Former, which uses the discrete cosine transform to compress the sequence and reduce the complexity of attention [[40](https://arxiv.org/html/2606.09862#bib.bib40)]. Those methods differ from [BLA](https://arxiv.org/html/2606.09862#id3.3.id3) because they compress not only the keys and values but also the query sequence.
## Appendix B Proofs
State formula. We prove by induction the closed-form formulas for the states:

$$
\begin{aligned}
\mathbf{K}_{m,t}^{\mathrm{cos}} &= \sum_{t'=1}^{t}\cos(m\omega(t-t'))\,\boldsymbol{k}_{t'},\\
\mathbf{K}_{m,t}^{\mathrm{sin}} &= \sum_{t'=1}^{t}\sin(m\omega(t-t'))\,\boldsymbol{k}_{t'}.
\end{aligned}
\tag{19}
$$

The state equations hold for $t=0$, as empty sums are zero. For the induction step, we substitute the formula into the recurrence and use trigonometric identities. For $\mathbf{K}_{m,t+1}^{\mathrm{cos}}$ we have:
$$
\begin{aligned}
\mathbf{K}_{m,t+1}^{\mathrm{cos}}
&= \cos(m\omega)\,\mathbf{K}_{m,t}^{\mathrm{cos}} - \sin(m\omega)\,\mathbf{K}_{m,t}^{\mathrm{sin}} + \boldsymbol{k}_{t+1} \\
&= \cos(m\omega)\sum_{t'=1}^{t}\cos(m\omega(t-t'))\,\boldsymbol{k}_{t'} - \sin(m\omega)\sum_{t'=1}^{t}\sin(m\omega(t-t'))\,\boldsymbol{k}_{t'} + \boldsymbol{k}_{t+1} \\
&= \sum_{t'=1}^{t}\Big(\cos(m\omega)\cos(m\omega(t-t')) - \sin(m\omega)\sin(m\omega(t-t'))\Big)\,\boldsymbol{k}_{t'} + \boldsymbol{k}_{t+1} \\
&= \sum_{t'=1}^{t}\cos(m\omega(t+1-t'))\,\boldsymbol{k}_{t'} + \boldsymbol{k}_{t+1} \\
&= \sum_{t'=1}^{t+1}\cos(m\omega(t+1-t'))\,\boldsymbol{k}_{t'}
\end{aligned}
\tag{20}
$$

where the last step absorbs $\boldsymbol{k}_{t+1}$ into the sum because the cosine of the $t'=t+1$ term is one. Similarly for $\mathbf{K}_{m,t+1}^{\mathrm{sin}}$:
$$
\begin{aligned}
\mathbf{K}_{m,t+1}^{\mathrm{sin}} &= \sin(m\omega)\,\mathbf{K}_{m,t}^{\mathrm{cos}}+\cos(m\omega)\,\mathbf{K}_{m,t}^{\mathrm{sin}}\\
&= \sin(m\omega)\sum_{t'=1}^{t}\cos(m\omega(t-t'))\,\bm{k}_{t'}+\cos(m\omega)\sum_{t'=1}^{t}\sin(m\omega(t-t'))\,\bm{k}_{t'}\\
&= \sum_{t'=1}^{t}\left(\sin(m\omega)\cos(m\omega(t-t'))+\cos(m\omega)\sin(m\omega(t-t'))\right)\bm{k}_{t'}\\
&= \sum_{t'=1}^{t}\sin(m\omega(t+1-t'))\,\bm{k}_{t'}
\end{aligned}
\tag{21}
$$

$$
{}= \sum_{t'=1}^{t+1}\sin(m\omega(t+1-t'))\,\bm{k}_{t'}
\tag{22}
$$

where extending the sum to $t'=t+1$ changes nothing because the last sine term is zero.
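As a quick numerical sanity check of the induction above, the following sketch compares the closed-form sums of Eq. (19) with the rotation recurrence of Eqs. (20)-(22). It uses scalar keys and an arbitrary period $T=7$ for brevity; the function names `closed_form_states` and `recurrent_states` are illustrative and do not come from the paper.

```python
import math


def closed_form_states(keys, m, omega):
    """Eq. (19): direct sums over the history k_1..k_t (keys[0] is k_1)."""
    t = len(keys)
    k_cos = sum(math.cos(m * omega * (t - tp)) * keys[tp - 1] for tp in range(1, t + 1))
    k_sin = sum(math.sin(m * omega * (t - tp)) * keys[tp - 1] for tp in range(1, t + 1))
    return k_cos, k_sin


def recurrent_states(keys, m, omega):
    """Rotation recurrence of Eqs. (20)-(22), starting from the zero state."""
    c, s = math.cos(m * omega), math.sin(m * omega)
    k_cos, k_sin = 0.0, 0.0
    for k in keys:
        # Both updates read the previous state, hence the tuple assignment.
        k_cos, k_sin = c * k_cos - s * k_sin + k, s * k_cos + c * k_sin
    return k_cos, k_sin


keys = [0.5, -1.0, 2.0, 0.25, -0.75]  # scalar keys for brevity
omega = 2 * math.pi / 7               # illustrative period T = 7
for m in range(4):
    direct = closed_form_states(keys, m, omega)
    rec = recurrent_states(keys, m, omega)
    assert all(abs(a - b) < 1e-12 for a, b in zip(direct, rec))
```

The tuple assignment matters: updating the cosine state first and then reusing it in the sine update would break the rotation.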
Fourier coefficients of translated Dirichlet kernels. Consider the Dirichlet kernel for an integer $M\geq 1$,
$$
D_{M}(t)=\frac{1}{2M-1}+\frac{2}{2M-1}\sum_{m=1}^{M-1}\cos\left(\frac{2\pi mt}{2M-1}\right)
\tag{23}
$$

Then $D_{M}$ satisfies $D_{M}(0)=1$ and $D_{M}(t)=0$ for integer $t\in[1,\dots,2M-2]$. We can also translate $D_{M}$ by an integer time $\Delta t\in[0,2M-2]$:
$$
D_{M}(t-\Delta t)=\frac{1}{2M-1}+\frac{2}{2M-1}\sum_{m=1}^{M-1}\cos\left(\frac{2\pi m(t-\Delta t)}{2M-1}\right)
\tag{24}
$$

$$
D_{M}(t-\Delta t)=\frac{1}{2M-1}+\frac{2}{2M-1}\sum_{m=1}^{M-1}\cos\left(\frac{2\pi mt}{2M-1}-\frac{2\pi m\Delta t}{2M-1}\right)
\tag{25}
$$

$$
\begin{aligned}
{}&=\frac{1}{2M-1}+\frac{2}{2M-1}\sum_{m=1}^{M-1}\cos\left(\frac{2\pi mt}{2M-1}\right)\cos\left(\frac{2\pi m\Delta t}{2M-1}\right)\\
&\quad+\frac{2}{2M-1}\sum_{m=1}^{M-1}\sin\left(\frac{2\pi mt}{2M-1}\right)\sin\left(\frac{2\pi m\Delta t}{2M-1}\right)
\end{aligned}
\tag{26}
$$
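The kernel properties and the expansion in Eq. (26) can be checked numerically. The sketch below is a minimal illustration; the helper names (`dirichlet`, `translated_coefficients`, `from_coefficients`) are hypothetical, and only integer shifts are tested.

```python
import math


def dirichlet(M, t):
    """Eq. (23): D_M(t) with period 2M - 1."""
    n = 2 * M - 1
    return 1 / n + 2 / n * sum(math.cos(2 * math.pi * m * t / n) for m in range(1, M))


def translated_coefficients(M, dt):
    """Cosine and sine coefficients of D_M(t - dt), read off Eq. (26)."""
    n = 2 * M - 1
    a = [1 / n] + [2 / n * math.cos(2 * math.pi * m * dt / n) for m in range(1, M)]
    b = [0.0] + [2 / n * math.sin(2 * math.pi * m * dt / n) for m in range(1, M)]
    return a, b


def from_coefficients(M, a, b, t):
    """Evaluate the trigonometric series with coefficients a (cos) and b (sin)."""
    n = 2 * M - 1
    return sum(a[m] * math.cos(2 * math.pi * m * t / n)
               + b[m] * math.sin(2 * math.pi * m * t / n) for m in range(M))


M = 4
assert abs(dirichlet(M, 0) - 1) < 1e-12
assert all(abs(dirichlet(M, t)) < 1e-12 for t in range(1, 2 * M - 1))
for dt in range(2 * M - 1):
    a, b = translated_coefficients(M, dt)
    for t in range(2 * M - 1):
        assert abs(from_coefficients(M, a, b, t) - dirichlet(M, t - dt)) < 1e-12
```

On the integer grid the translated kernel acts as an indicator of $t=\Delta t$, which is what allows $M$ cosine and $M$ sine modes to address $2M-1$ distinct positions.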
## Appendix C Pseudo Code
Algorithm 2 Naive recurrent [BLA](https://arxiv.org/html/2606.09862#id3.3.id3)
Input: $\mathbf{Q},\mathbf{K},\mathbf{V}\in\mathbb{R}^{L\times D}$, period $T\in\mathbb{N}$, interpolation matrices $\mathbf{A}\in\mathbb{R}^{(2M-1)\times M},\mathbf{B}\in\mathbb{R}^{(2M-1)\times M}$
Compute dilated time grid:
$\mathrm{dilated\_time}\leftarrow\mathrm{round}\left(\frac{[0,\dots,2M-2]}{2M-1}\otimes T\right)$
$\omega\leftarrow 2\pi/T$
Compute trigonometric components:
$\mathrm{modes}\leftarrow\omega\otimes[0,1,\dots,M-1]$
$\cos\leftarrow\cos(\mathrm{modes})$
$\sin\leftarrow\sin(\mathrm{modes})$
Initialize the KV states:
$\mathbf{K}^{\cos},\mathbf{K}^{\sin},\mathbf{V}^{\cos},\mathbf{V}^{\sin}\leftarrow\mathbf{0}^{D\times M}$
Initialize output:
$\mathbf{O}\leftarrow\mathbf{0}^{L\times D}$
for $t=0$ to $L-1$ do
Update the KV state (extra variables omitted):
$\mathbf{K}^{\cos}\leftarrow(\cos\otimes\mathbf{K}^{\cos})-(\sin\otimes\mathbf{K}^{\sin})+\mathbf{K}[t]$
$\mathbf{K}^{\sin}\leftarrow(\sin\otimes\mathbf{K}^{\cos})+(\cos\otimes\mathbf{K}^{\sin})$
$\mathbf{V}^{\cos}\leftarrow(\cos\otimes\mathbf{V}^{\cos})-(\sin\otimes\mathbf{V}^{\sin})+\mathbf{V}[t]$
$\mathbf{V}^{\sin}\leftarrow(\sin\otimes\mathbf{V}^{\cos})+(\cos\otimes\mathbf{V}^{\sin})$
Compute interpolated keys and values:
$\tilde{\mathbf{K}}\leftarrow\mathbf{A}\cdot\mathbf{K}^{\cos}+\mathbf{B}\cdot\mathbf{K}^{\sin}\quad\in\mathbb{R}^{D\times(2M-1)}$
$\tilde{\mathbf{V}}\leftarrow\mathbf{A}\cdot\mathbf{V}^{\cos}+\mathbf{B}\cdot\mathbf{V}^{\sin}$
Update output:
$\mathbf{M}_{\mathrm{BLA}}\leftarrow\mathrm{where}(t\geq\mathrm{dilated\_time},0.0,-\infty)$
$\mathbf{O}[t]\leftarrow\mathrm{Softmax}\left(\frac{\mathbf{Q}[t]\cdot\tilde{\mathbf{K}}^{\top}}{\sqrt{D}}+\mathbf{M}_{\mathrm{BLA}}\right)\cdot\tilde{\mathbf{V}}$
end for
Return: $\mathbf{O}$
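A direct single-head transcription of Algorithm 2 in plain Python may help make the shapes and update order concrete. This is a sketch under stated assumptions, not the authors' implementation: the interpolation matrices are inputs to Algorithm 2 and their definition lies outside this appendix, so `interpolation_matrices` builds a hypothetical pair from the cosine and sine coefficients of Eq. (26) with integer shifts $0,\dots,2M-2$. States are stored as $M$ lists of length $D$, and all helper names are illustrative.

```python
import math


def interpolation_matrices(M):
    """Hypothetical A, B from Eq. (26) with integer shifts j = 0..2M-2."""
    n = 2 * M - 1
    A = [[(1 if m == 0 else 2) / n * math.cos(2 * math.pi * m * j / n) for m in range(M)]
         for j in range(n)]
    B = [[(0 if m == 0 else 2) / n * math.sin(2 * math.pi * m * j / n) for m in range(M)]
         for j in range(n)]
    return A, B


def rotate(state_cos, state_sin, x, cos_m, sin_m):
    """One recurrence step; both outputs are built from the previous states."""
    new_cos = [[c * a - s * b + xi for a, b, xi in zip(sc, ss, x)]
               for c, s, sc, ss in zip(cos_m, sin_m, state_cos, state_sin)]
    new_sin = [[s * a + c * b for a, b in zip(sc, ss)]
               for c, s, sc, ss in zip(cos_m, sin_m, state_cos, state_sin)]
    return new_cos, new_sin


def interpolate(A, B, state_cos, state_sin):
    """Blurry history: one length-D vector per slot j."""
    D = len(state_cos[0])
    return [[sum(A[j][m] * state_cos[m][d] + B[j][m] * state_sin[m][d]
                 for m in range(len(state_cos))) for d in range(D)]
            for j in range(len(A))]


def naive_bla(Q, K, V, T, M):
    L, D = len(Q), len(Q[0])
    n = 2 * M - 1
    A, B = interpolation_matrices(M)
    dilated_time = [round(j * T / n) for j in range(n)]
    omega = 2 * math.pi / T
    cos_m = [math.cos(omega * m) for m in range(M)]
    sin_m = [math.sin(omega * m) for m in range(M)]
    zeros = lambda: [[0.0] * D for _ in range(M)]
    k_cos, k_sin, v_cos, v_sin = zeros(), zeros(), zeros(), zeros()
    out = []
    for t in range(L):
        k_cos, k_sin = rotate(k_cos, k_sin, K[t], cos_m, sin_m)
        v_cos, v_sin = rotate(v_cos, v_sin, V[t], cos_m, sin_m)
        k_tilde = interpolate(A, B, k_cos, k_sin)
        v_tilde = interpolate(A, B, v_cos, v_sin)
        # Mask slots whose dilated time has not been reached at step t.
        scores = [sum(q * k for q, k in zip(Q[t], k_tilde[j])) / math.sqrt(D)
                  if t >= dilated_time[j] else -math.inf for j in range(n)]
        top = max(scores)  # slot 0 is always visible, so top is finite
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v_tilde[j][d] for j, w in enumerate(weights)) / total
                    for d in range(D)])
    return out


Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = naive_bla(Q, K, V, T=5, M=3)
```

Under these assumptions, with $T=2M-1$ and $L\leq 2M-1$, slot $j$ at step $t$ reconstructs $\mathbf{K}[t-j]$ exactly, so the output coincides with causal softmax attention. For longer sequences in this sketch, older tokens alias into the same slots because the kernel is periodic.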
Algorithm 3 Efficient chunk [BLA](https://arxiv.org/html/2606.09862#id3.3.id3)
Input: $\mathbf{Q},\mathbf{K},\mathbf{V}\in\mathbb{R}^{L\times H\times D}$, period $T\in\mathbb{R}^{H}$, interpolation matrices $\mathbf{A}\in\mathbb{R}^{(2M-1)\times M},\mathbf{B}\in\mathbb{R}^{(2M-1)\times M}$, chunk size $C\in[L]$
Clamp period:
$T\leftarrow\max(T,2\cdot M-1)$
Compute dilated time grid:
$\mathrm{dilated\_time}\leftarrow\mathrm{round}\left(\frac{[0,\dots,2M-2]}{2M-1}\otimes T\right)$
$\omega\leftarrow 2\pi/T$
Compute modes:
$\mathrm{modes}\leftarrow\omega\otimes[0,1,\dots,M-1]$
Initialize output:
$\mathbf{O}\leftarrow\mathbf{0}^{L\times H\times D}$
Divide $\mathbf{Q},\mathbf{K},\mathbf{V},\mathbf{O}$ into $N=\frac{L}{C}$ blocks $\{\mathbf{Q}_{[1]}\dots\mathbf{Q}_{[N]}\}$, $\{\mathbf{K}_{[1]}\dots\mathbf{K}_{[N]}\}$, $\{\mathbf{V}_{[1]}\dots\mathbf{V}_{[N]}\}$, $\{\mathbf{O}_{[1]}\dots\mathbf{O}_{[N]}\}$ of size $C\times H\times D$ each
Initialize compressed KV states:
$\mathbf{K}_{\mathrm{prev}},\mathbf{V}_{\mathrm{prev}}\leftarrow\mathbf{0}^{H\times D\times(2M-1)}$
for $n=1$ to $N$ do
Compute chunk interpolation coefficients:
$t\leftarrow[(n-1)C,(n-1)C+1,\dots,nC-1]$
$\mathrm{interpolate}\leftarrow\cos(t\cdot\mathrm{modes})\cdot\mathbf{A}^{\top}+\sin(t\cdot\mathrm{modes})\cdot\mathbf{B}^{\top}\in\mathbb{R}^{C\times H\times(2M-1)}$
Update compressed KV states:
$\tilde{\mathbf{K}}\leftarrow\mathbf{K}_{[n]}\otimes\mathrm{interpolate}$
$\tilde{\mathbf{K}}\leftarrow\mathrm{cumsum}(\tilde{\mathbf{K}})$ over $C$
$\tilde{\mathbf{V}}\leftarrow\mathbf{V}_{[n]}\otimes\mathrm{interpolate}$
$\tilde{\mathbf{V}}\leftarrow\mathrm{cumsum}(\tilde{\mathbf{V}})$ over $C$
$\tilde{\mathbf{K}}\leftarrow\mathbf{K}_{\mathrm{prev}}+\tilde{\mathbf{K}}$
$\tilde{\mathbf{V}}\leftarrow\mathbf{V}_{\mathrm{prev}}+\tilde{\mathbf{V}}$
$\mathbf{K}_{\mathrm{prev}}\leftarrow\tilde{\mathbf{K}}[C-1]$
$\mathbf{V}_{\mathrm{prev}}\leftarrow\tilde{\mathbf{V}}[C-1]$
Compute attention weights:
$\mathbf{M}_{\mathrm{BLA}}\leftarrow\mathrm{where}(t\geq\mathrm{dilated\_time},0.0,-\infty)$
$\mathbf{O}_{[n]}\leftarrow\mathrm{Softmax}\left(\frac{\mathbf{Q}_{[n]}\cdot\tilde{\mathbf{K}}}{\sqrt{D}}+\mathbf{M}_{\mathrm{BLA}}\right)\cdot\tilde{\mathbf{V}}$
end for
Return: $\mathbf{O}$
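The period clamp and dilated time grid at the top of Algorithm 3 determine which slots are visible at each step. The sketch below, for a single head with a scalar period, is only illustrative: the names `dilated_time_grid` and `bla_mask` are hypothetical, and Python's built-in `round` (round half to even) stands in for the rounding rule, which the pseudocode does not specify.

```python
def dilated_time_grid(M, T):
    """Spread the 2M - 1 slots over one period T, clamped to at least 2M - 1."""
    n = 2 * M - 1
    T = max(T, n)
    return [round(j * T / n) for j in range(n)]


def bla_mask(t, grid):
    """Additive mask: 0.0 for slots already reached at step t, -inf otherwise."""
    return [0.0 if t >= g else float("-inf") for g in grid]


grid = dilated_time_grid(M=3, T=10)  # five slots spread over a period of 10 steps
mask = bla_mask(2, grid)             # mask applied to the attention scores at t = 2
```

For an integer period no rounding ties occur, since $jT/(2M-1)$ with odd denominator $2M-1$ is never a half-integer. With $T=2M-1$ the grid reduces to consecutive steps.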
## Appendix D Hyperparameters
### D.1 MQAR experiment
Table 2: The hyperparameters used for the MQAR experiment
### D.2 RegBench experiment
Similar to \[[12](https://arxiv.org/html/2606.09862#bib.bib12),[24](https://arxiv.org/html/2606.09862#bib.bib24)\], we train models across a small search space and report the network with the best validation performance across three different network initializations. As mentioned in the main text, we remove the short convolution from all of our models, and we use only the RegBench dataset with 5000 DFAs. For more details, see Table [3](https://arxiv.org/html/2606.09862#A4.T3).
Table 3: The hyperparameters used for the RegBench experiment