Flexformer: Flexible Linear Transformer with Learnable Attention Kernel
Summary
Flexformer proposes a flexible linear Transformer with fully learnable attention kernels using random Fourier features, achieving linear complexity while matching or exceeding softmax attention performance on language modeling and sequence classification tasks.
View Cached Full Text
Cached at: 06/29/26, 05:25 AM
# Flexformer: Flexible Linear Transformer with Learnable Attention Kernel
Source: [https://arxiv.org/html/2606.27748](https://arxiv.org/html/2606.27748)
Haoran Zhang, Feng Zhou Center for Applied Statistics and School of Statistics, Renmin University of China Beijing, China zhrrhz2002@163\.com,feng\.zhou@ruc\.edu\.cn
###### Abstract
Transformer models rely on attention mechanism to capture long\-range dependencies but suffer from quadratic complexity, limiting their scalability to long sequences\. Kernel\-based linear attention reduces this complexity but typically relies on fixed or weakly learnable kernels, restricting expressiveness and performance\. In this work, we propose Flexformer, a flexible linear Transformer that learns attention kernels in a fully data\-driven manner\. Flexformer builds on random Fourier feature\-based linear attention and treats spectral frequencies as trainable parameters, enabling the model to learn a broad family of attention kernels\. We develop both stationary and nonstationary variants, with the latter offering strictly greater expressiveness\. Extensive experiments on language modeling and sequence classification demonstrate that Flexformer consistently outperforms baselines\. Moreover, Flexformer can be effectively distilled from pretrained Transformers to recover softmax attention and exhibits strong kernel transferability across domains, achieving both high efficiency and competitive performance on long\-sequence tasks\.
## 1Introduction
The TransformerVaswaniet al\.\([2017](https://arxiv.org/html/2606.27748#bib.bib4)\)architecture has achieved remarkable success in machine translation tasks since its original proposal\. Subsequently, numerous variants of Transformer have been extensively applied across prominent domains within artificial intelligence, including natural language processingDevlinet al\.\([2019](https://arxiv.org/html/2606.27748#bib.bib5)\); Liuet al\.\([2019](https://arxiv.org/html/2606.27748#bib.bib6)\); Brownet al\.\([2020](https://arxiv.org/html/2606.27748#bib.bib17)\), computer visionDosovitskiyet al\.\([2021](https://arxiv.org/html/2606.27748#bib.bib16)\); Liuet al\.\([2021](https://arxiv.org/html/2606.27748#bib.bib18)\), time series analysisZhouet al\.\([2021](https://arxiv.org/html/2606.27748#bib.bib11)\); Nieet al\.\([2023](https://arxiv.org/html/2606.27748#bib.bib12)\); Liuet al\.\([2024](https://arxiv.org/html/2606.27748#bib.bib13)\), and audio processingSchneideret al\.\([2019](https://arxiv.org/html/2606.27748#bib.bib19)\); Eilers and Jiang \([2023](https://arxiv.org/html/2606.27748#bib.bib20)\)\. The most critical component in these Transformer\-based models is the dot\-product attention mechanism\. It operates by computing pairwise token affinities across the entire input sequence, thereby enabling each token to dynamically aggregate the most relevant information pertaining to itself, while effectively modeling long\-range dependencies within the sequence\. Unlike sequential RNNsSherstinsky \([2020](https://arxiv.org/html/2606.27748#bib.bib22)\)processing tokens step\-by\-step, Transformer attention computes representations for all positions in parallel, maximizing computational hardware utilization\.
This effective attention mechanism, however, suffers from efficiency limitations: it computes attention weights for all pairs of positions in the sequence, leading to both time and space complexity that grow quadratically with the sequence length\. To address this issue, numerous studies have attempted to optimize the computation of the attention mechanism\. These redesigned or optimized attention computation methods should effectively reduce time and space complexity while achieving performance that comparable or even surpasses that of softmax attention\.
Some works are based on sparse attentionKitaevet al\.\([2020](https://arxiv.org/html/2606.27748#bib.bib45)\); Royet al\.\([2020](https://arxiv.org/html/2606.27748#bib.bib24)\); Zaheeret al\.\([2020](https://arxiv.org/html/2606.27748#bib.bib46)\), which modifies the attention patterns by restricting the attention range of each token to focus only on a subset of critical tokens\. However, these attention patterns are not rigorously validated and are often based on empirically derived patterns, making it difficult to guarantee their consistently superior performance across diverse data domains\. The other category is kernel\-based linear attentionKatharopouloset al\.\([2020](https://arxiv.org/html/2606.27748#bib.bib26)\); Choromanskiet al\.\([2021](https://arxiv.org/html/2606.27748#bib.bib25)\); Qinet al\.\([2022](https://arxiv.org/html/2606.27748#bib.bib10)\), which represents the process of computing dot products and applying softmax using kernels\. The kernel can be written as an inner product of feature maps\. By rearranging the order of matrix multiplications, the quadratic complexity of attention can be reduced to linear complexity\.[Figure˜1](https://arxiv.org/html/2606.27748#S1.F1)illustrates the difference of the computational processes between kernel\-based linear attention and softmax attention\.
\(a\)Softmax attention\.
\(b\)Kernel\-based linear attention\.
Figure 1:Comparison between softmax attention and kernel\-based linear attention\. Softmax attention requires𝒪\(N2d\)\\mathcal\{O\}\(N^\{2\}d\)time complexity and𝒪\(N2\+Nd\)\\mathcal\{O\}\(N^\{2\}\+Nd\)space complexity, whereas kernel\-based linear attention achieves𝒪\(Ndd′\)\\mathcal\{O\}\(Ndd^\{\\prime\}\)time complexity and𝒪\(Nd\+Nd′\+dd′\)\\mathcal\{O\}\(Nd\+Nd^\{\\prime\}\+dd^\{\\prime\}\)space complexity, whereN,d,d′N,d,d^\{\\prime\}denote input sequence length, hidden dimension, and feature map dimension respectively\. WhenN≫dN\\gg dandN≫d′N\\gg d^\{\\prime\}, linear attention achieves significant savings in both time and space\.Here, we focus on the kernel\-based linear attention whose core lies in constructing an effective feature map\. While numerous methods aim to recover softmax attention, there is still no proof that softmax is invariably optimal under all circumstances\. Therefore, a more effective approach is to apply learnable kernels that are directly learned from data\. However, how to construct sufficiently flexible learnable kernels that guarantee strong performance remains an open problem\. HedgehogZhanget al\.\([2024](https://arxiv.org/html/2606.27748#bib.bib9)\)and PolaformerMenget al\.\([2025](https://arxiv.org/html/2606.27748#bib.bib7)\)empirically exploit the low\-entropy property of softmax attention by combining learnable mappings with spiky functions such as exponentials and power functions\. However, these methods do not guarantee that the learned kernel is sufficiently expressive to encompass a wide variety of kernels, which can compromise their performance\.
In this work, we propose aFlexible Linear Transformerwith learnable kernel \(Flexformer\)\. Flexformer extends kernel\-based linear attention with random Fourier features by making the feature map fully learnable\. In contrast to existing learnable kernel methods, Flexformer is able to model a substantially broader and more expressive family of kernels\. Specifically, spectral representation theory states that a kernel admits an inverse Fourier transform representation in terms of its spectral measure, which can be approximated through Monte Carlo sampling\. In Flexformer, instead of sampling frequencies from a fixed spectral density, we treat the frequencies as learnable parameters and optimize them directly from data, thereby enabling the attention kernel to be learned in a data\-driven manner\. We propose both stationary and nonstationary versions of Flexformer, with the nonstationary formulation being strictly more expressive\. Theoretically, we can show that the kernel family of Flexformer contains the softmax kernel\. This guarantees that Flexformer can, at a minimum, learn a softmax attention, and potentially learn attention patterns that outperform softmax\. Flexformer can be trained either from scratch or by distillingZhanget al\.\([2024](https://arxiv.org/html/2606.27748#bib.bib9)\)from a pretrained Transformer\. We conduct extensive experiments on multiple benchmarks, including language modeling and classification tasks, demonstrating Flexformer’s superior performance over existing baselines, as well as its strong capability to recover the softmax kernel and effectively transfer the learned kernel\. As a variant of linear attention, it significantly improves efficiency and reduces memory consumption when processing extremely long sequences\. Specifically, our contributions are summarized as follows:
- •We propose Flexformer, which leverages learnable attention kernels constructed via random Fourier features, achieving high expressiveness while preserving linear time and space complexity w\.r\.t\. sequence length\.
- •We evaluate Flexformer through from\-scratch training on language modeling and sequence classification tasks\. The results show that Flexformer consistently outperforms other linear\-attention baselines, highlighting its strong effectiveness and efficiency\.
- •Flexformer can faithfully approximate softmax attention through attention weight distillation, enabling efficient linear\-attention replacements with preserved performance, and demonstrates strong kernel transferability in cross\-domain settings\.
## 2Preliminaries
In this section, we introduce the background knowledge of self\-attention mechanism and kernel\-based linear attention\.
### 2\.1Self\-attention Mechanism
Given an input sequence𝐗=\[X1,⋯,XN\]⊤∈ℝN×d\\mathbf\{X\}=\[\\text\{X\}\_\{1\},\\cdots,\\text\{X\}\_\{N\}\]^\{\\top\}\\in\\mathbb\{R\}^\{N\\times d\}, whereNNanddddenote the input length and embedding dimension, respectively\. Each tokenXi\\text\{X\}\_\{i\}is projected through linear maps to generate the query, key, and value tensors, formally expressed asQi=WQXi,Ki=WKXi,Vi=WVXi,i=1,⋯,N\\text\{Q\}\_\{i\}=\\text\{W\}\_\{Q\}\\text\{X\}\_\{i\},\\text\{K\}\_\{i\}=\\text\{W\}\_\{K\}\\text\{X\}\_\{i\},\\text\{V\}\_\{i\}=\\text\{W\}\_\{V\}\\text\{X\}\_\{i\},i=1,\\cdots,N\. For each query tensor, a similarity measure is computed with all key tensors\. These similarity scores are then normalized to form attention weights, which are used to perform a weighted summation over the corresponding value tensors, thereby generating context\-aware representations\. Formally, the output of the self\-attention module for theii\-th token can be expressed as:
Oi=∑j=1Nsim\(Qi,Kj\)Vj∑j=1Nsim\(Qi,Kj\),\\text\{O\}\_\{i\}=\\frac\{\\sum\\limits\_\{j=1\}^\{N\}\\text\{sim\}\(\\text\{Q\}\_\{i\},\\text\{K\}\_\{j\}\)\\text\{V\}\_\{j\}\}\{\\sum\\limits\_\{j=1\}^\{N\}\\text\{sim\}\(\\text\{Q\}\_\{i\},\\text\{K\}\_\{j\}\)\},\(1\)wheresim\(⋅,⋅\):ℝd×ℝd→ℝ\+\\text\{sim\}\(\\cdot,\\cdot\):\\mathbb\{R\}^\{d\}\\times\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}\_\{\+\}denotes the similarity measure function\. In vanilla TransformerVaswaniet al\.\([2017](https://arxiv.org/html/2606.27748#bib.bib4)\)with softmax attention,sim\(Qi,Kj\)\\text\{sim\}\(\\text\{Q\}\_\{i\},\\text\{K\}\_\{j\}\)is defined assim\(Qi,Kj\)=exp\(Qi⊤Kjd\)\\text\{sim\}\(\\text\{Q\}\_\{i\},\\text\{K\}\_\{j\}\)=\\exp\\left\(\\frac\{\\text\{Q\}\_\{i\}^\{\\top\}\\text\{K\}\_\{j\}\}\{\\sqrt\{d\}\}\\right\)\.
### 2\.2Kernel\-based Linear Attention
If we directly follow the computation order in[Equation˜1](https://arxiv.org/html/2606.27748#S2.E1)by first calculating the similarity between queries and keys at all pairwise positions to obtain aN×NN\\times Nattention weight matrix, and then multiplying with values, it would result in a time complexity of𝒪\(N2d\)\\mathcal\{O\}\(N^\{2\}d\)and a space complexity of𝒪\(N2\+Nd\)\\mathcal\{O\}\(N^\{2\}\+Nd\), leading to substantial computational and memory overhead asNNincreases\.
To address the aforementioned issues, we assume an attention kernelk\(𝐱,𝐲\)k\(\\mathbf\{x\},\\mathbf\{y\}\)with a range ofℝ\+\\mathbb\{R\}\_\{\+\}as the similarity function, and letϕ:ℝd→ℝd′\\phi:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d^\{\\prime\}\}, whered′d^\{\\prime\}is the dimension of the kernel\-induced feature space, denote the corresponding feature map, such that:
k\(𝐱,𝐲\)=ϕ\(𝐱\)⊤ϕ\(𝐲\)\.k\(\\mathbf\{x\},\\mathbf\{y\}\)=\\phi\(\\mathbf\{x\}\)^\{\\top\}\\phi\(\\mathbf\{y\}\)\.\(2\)Then the computation of the self\-attention in[Equation˜1](https://arxiv.org/html/2606.27748#S2.E1)can be expressed as:
Oi⊤=∑j=1Nϕ\(Qi\)⊤ϕ\(Kj\)Vj⊤∑j=1Nϕ\(Qi\)⊤ϕ\(Kj\)=ϕ\(Qi\)⊤∑j=1Nϕ\(Kj\)Vj⊤ϕ\(Qi\)⊤∑j=1Nϕ\(Kj\)\.\\text\{O\}\_\{i\}^\{\\top\}=\\frac\{\\sum\\limits\_\{j=1\}^\{N\}\\phi\(\\text\{Q\}\_\{i\}\)^\{\\top\}\\phi\(\\text\{K\}\_\{j\}\)\\text\{V\}\_\{j\}^\{\\top\}\}\{\\sum\\limits\_\{j=1\}^\{N\}\\phi\(\\text\{Q\}\_\{i\}\)^\{\\top\}\\phi\(\\text\{K\}\_\{j\}\)\}=\\frac\{\\phi\(\\text\{Q\}\_\{i\}\)^\{\\top\}\\sum\\limits\_\{j=1\}^\{N\}\\phi\(\\text\{K\}\_\{j\}\)\\text\{V\}\_\{j\}^\{\\top\}\}\{\\phi\(\\text\{Q\}\_\{i\}\)^\{\\top\}\\sum\\limits\_\{j=1\}^\{N\}\\phi\(\\text\{K\}\_\{j\}\)\}\.\(3\)This approach avoids explicitly computing the pairwise similarity between any query and key\. Instead, we first precompute∑j=1Nϕ\(Ki\)Vj⊤\\sum\_\{j=1\}^\{N\}\\phi\(\\text\{K\}\_\{i\}\)\\text\{V\}\_\{j\}^\{\\top\}and∑j=1Nϕ\(Kj\)\\sum\_\{j=1\}^\{N\}\\phi\(\\text\{K\}\_\{j\}\), which are shared across all positional queries, and then multiply them with each positional queryQi\\text\{Q\}\_\{i\}\. Thus, the time complexity of this approach is𝒪\(Ndd′\)\\mathcal\{O\}\(Ndd^\{\\prime\}\)and the space complexity is𝒪\(Nd\+Nd′\+dd′\)\\mathcal\{O\}\(Nd\+Nd^\{\\prime\}\+dd^\{\\prime\}\), achieving linear scaling w\.r\.t\. the sequence length\. WhenN≫d,d′N\\gg d,d^\{\\prime\}, it demonstrates significant efficiency improvements and memory savings compared to standard softmax attention\.
## 3Methodology
In this section, we first review how softmax kernel can be approximated using random Fourier features\. We then extend this formulation to a fully learnable kernel, resulting in the proposed Flexformer\.
### 3\.1Softmax Kernel with Random Fourier Features
In softmax attention, the similarity measure can be expressed as the following kernel\. We hereafter refer to this as the softmax kernel:
kSM\(𝐱,𝐲\)=exp\(𝐱⊤𝐲d\)\.k\_\{\\text\{SM\}\}\(\\mathbf\{x\},\\mathbf\{y\}\)=\\exp\\left\(\\frac\{\\mathbf\{x\}^\{\\top\}\\mathbf\{y\}\}\{\\sqrt\{d\}\}\\right\)\.Through Taylor expansion, we can derive feature mapϕ\\phisatisfying[Equation˜2](https://arxiv.org/html/2606.27748#S2.E2)\. Unfortunately, the dimensionality ofϕ\\phiis infiniteSieberet al\.\([2024](https://arxiv.org/html/2606.27748#bib.bib27)\)\. Therefore, we need to seek a finite\-dimensionalϕ\\phisuch that the softmax kernel can be effectively approximated\. The softmax kernel can be decomposed in the following manner:
kSM\(𝐱,𝐲\)\\displaystyle k\_\{\\text\{SM\}\}\(\\mathbf\{x\},\\mathbf\{y\}\)=exp\(‖𝐱‖2\+‖𝐲‖2−‖𝐱−𝐲‖22d\)=exp\(‖𝐱‖22d\)exp\(−‖𝐱−𝐲‖22d\)exp\(‖𝐲‖22d\)\.\\displaystyle=\\exp\\left\(\\frac\{\|\|\\mathbf\{x\}\|\|^\{2\}\+\|\|\\mathbf\{y\}\|\|^\{2\}\-\|\|\\mathbf\{x\}\-\\mathbf\{y\}\|\|^\{2\}\}\{2\\sqrt\{d\}\}\\right\)=\{\\color\[rgb\]\{0,0,1\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,1\}\\exp\\left\(\\frac\{\|\|\\mathbf\{x\}\|\|^\{2\}\}\{2\\sqrt\{d\}\}\\right\)\}\{\\color\[rgb\]\{1,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,0,0\}\\exp\\left\(\-\\frac\{\|\|\\mathbf\{x\}\-\\mathbf\{y\}\|\|^\{2\}\}\{2\\sqrt\{d\}\}\\right\)\}\{\\color\[rgb\]\{0,0,1\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,1\}\\exp\\left\(\\frac\{\|\|\\mathbf\{y\}\|\|^\{2\}\}\{2\\sqrt\{d\}\}\\right\)\}\.\(4\)The blue terms depend exclusively on vector norms, enabling all positions to potentially attend strongly to certain keys with large norm magnitudes at critical locations, and they are crucial for preserving the spikiness of the softmax\. The red term, which focuses exclusively on a specific similarity between𝐱\\mathbf\{x\}and𝐲\\mathbf\{y\}, is essentially a*Gaussian kernel*with bandwidthσ2=d\\sigma^\{2\}=\\sqrt\{d\}\. As a bounded, continuous, and positive definite stationary \(a\.k\.a\. translation\-invariant\) kernel, it can be approximated by the random Fourier features\(Rahimi and Recht,[2007](https://arxiv.org/html/2606.27748#bib.bib32)\)\.
###### Theorem 3\.1\.
Bochner\([2005](https://arxiv.org/html/2606.27748#bib.bib31)\)A stationary kernelk\(𝐱,𝐲\)=k\(𝐱−𝐲\):ℝd×ℝd→ℝk\(\\mathbf\{x\},\\mathbf\{y\}\)=k\(\\mathbf\{x\}\-\\mathbf\{y\}\):\\mathbb\{R\}^\{d\}\\times\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}is a bounded, continuous, and positive definite if and only if it can be represented as:
k\(𝐱−𝐲\)=∫ℝdexp\(i\(ω⊤\(𝐱−𝐲\)\)dμ\(ω\),k\(\\mathbf\{x\}\-\\mathbf\{y\}\)=\\int\_\{\\mathbb\{R\}^\{d\}\}\\exp\(i\(\\mathbf\{\\omega\}^\{\\top\}\(\\mathbf\{x\}\-\\mathbf\{y\}\)\)d\\mu\(\\mathbf\{\\omega\}\),whereμ\(ω\)\\mu\(\\mathbf\{\\omega\}\)is a bounded non\-negative measure associated to the spectral densityp\(ω\)=μ\(ω\)μ\(ℝd\)p\(\\omega\)=\\frac\{\\mu\(\\omega\)\}\{\\mu\(\\mathbb\{R\}^\{d\}\)\}\.
Letσ2=μ\(ℝd\)\\sigma^\{2\}=\\mu\(\\mathbb\{R\}^\{d\}\)denote the total measure of the space\. Then[Theorem˜3\.1](https://arxiv.org/html/2606.27748#S3.Thmtheorem1)can be expressed in the following expectation form, which can be approximated by Monte Carlo:
k\(𝐱−𝐲\)=\\displaystyle k\(\\mathbf\{x\}\-\\mathbf\{y\}\)=σ2𝔼p\(ω\)\(cos\(ω⊤\(𝐱−𝐲\)\)\+isin\(ω⊤\(𝐱−𝐲\)\)\)\\displaystyle\\sigma^\{2\}\\mathbb\{E\}\_\{p\(\\omega\)\}\(\\cos\(\\omega^\{\\top\}\(\\mathbf\{x\}\-\\mathbf\{y\}\)\)\+i\\sin\(\\omega^\{\\top\}\(\\mathbf\{x\}\-\\mathbf\{y\}\)\)\)\(5\)=\\displaystyle=σ2𝔼p\(ω\)\(cosω⊤𝐱cosω⊤𝐲\+sinω⊤𝐱sinω⊤𝐲\)\\displaystyle\\sigma^\{2\}\\mathbb\{E\}\_\{p\(\\omega\)\}\(\\cos\\omega^\{\\top\}\\mathbf\{x\}\\cos\\omega^\{\\top\}\\mathbf\{y\}\+\\sin\\omega^\{\\top\}\\mathbf\{x\}\\sin\\omega^\{\\top\}\\mathbf\{y\}\)≈\\displaystyle\\approxσ2n∑i=1n\(cosωi⊤𝐱cosωi⊤𝐲\+sinωi⊤𝐱sinωi⊤𝐲\)\\displaystyle\\frac\{\\sigma^\{2\}\}\{n\}\\sum\_\{i=1\}^\{n\}\(\\cos\\omega\_\{i\}^\{\\top\}\\mathbf\{x\}\\cos\\omega\_\{i\}^\{\\top\}\\mathbf\{y\}\+\\sin\\omega\_\{i\}^\{\\top\}\\mathbf\{x\}\\sin\\omega\_\{i\}^\{\\top\}\\mathbf\{y\}\)=\\displaystyle=ϕn\(𝐱\)⊤ϕn\(𝐲\),\\displaystyle\\phi\_\{n\}\(\\mathbf\{x\}\)^\{\\top\}\\phi\_\{n\}\(\\mathbf\{y\}\),whereϕn\(𝐱\)=σn\[cosω1⊤𝐱,⋯,cosωn⊤𝐱,sinω1⊤𝐱,⋯,sinωn⊤𝐱\]⊤\\phi\_\{n\}\(\\mathbf\{x\}\)=\\frac\{\\sigma\}\{\\sqrt\{n\}\}\[\\cos\\omega\_\{1\}^\{\\top\}\\mathbf\{x\},\\cdots,\\cos\\omega\_\{n\}^\{\\top\}\\mathbf\{x\},\\sin\\omega\_\{1\}^\{\\top\}\\mathbf\{x\},\\cdots,\\sin\\omega\_\{n\}^\{\\top\}\\mathbf\{x\}\]^\{\\top\}\.ϕn\(𝐱\)\\phi\_\{n\}\(\\mathbf\{x\}\)is called Fourier feature,\{ωi\}i=1n\\\{\\omega\_\{i\}\\\}^\{n\}\_\{i=1\}are i\.i\.d\. frequencies fromp\(ω\)p\(\\omega\)andnndenotes the number of frequency samples\. The second equality in the derivation holds because we constrain the range ofk\(𝐱−𝐲\)k\(\\mathbf\{x\}\-\\mathbf\{y\}\)toℝ\\mathbb\{R\}, which enforces symmetry in the measureppsuch thatp\(ω\)=p\(−ω\)p\(\\omega\)=p\(\-\\omega\), consequently giving𝔼p\(ω\)\(isin\(ω⊤\(𝐱−𝐲\)\)\)=0\\mathbb\{E\}\_\{p\(\\omega\)\}\(i\\sin\(\\omega^\{\\top\}\(\\mathbf\{x\}\-\\mathbf\{y\}\)\)\)=0\.
Substituting[Equation˜5](https://arxiv.org/html/2606.27748#S3.E5)into the red term of[Equation˜4](https://arxiv.org/html/2606.27748#S3.E4)yields a random Fourier feature representation of the softmax kernel:
kSM\(𝐱,𝐲\)≈ϕ~n\(𝐱\)⊤ϕ~n\(𝐲\),k\_\{\\text\{SM\}\}\(\\mathbf\{x\},\\mathbf\{y\}\)\\approx\\tilde\{\\phi\}\_\{n\}\(\\mathbf\{x\}\)^\{\\top\}\\tilde\{\\phi\}\_\{n\}\(\\mathbf\{y\}\),whereϕ~n\(𝐱\)=exp\(‖𝐱‖22d\)ϕn\(𝐱\)\\tilde\{\\phi\}\_\{n\}\(\\mathbf\{x\}\)=\\exp\\left\(\\frac\{\|\|\\mathbf\{x\}\|\|^\{2\}\}\{2\\sqrt\{d\}\}\\right\)\\phi\_\{n\}\(\\mathbf\{x\}\),\{ωi\}i=1n\\\{\\omega\_\{i\}\\\}^\{n\}\_\{i=1\}are i\.i\.d\. frequencies from a*Gaussian spectral density*\. Based on the above representation, the kernel\-based linear attention framework can be employed to reduce the complexity of softmax attention to linear\(Penget al\.,[2021](https://arxiv.org/html/2606.27748#bib.bib21)\)\.
The above approach relies on a crucial assumption: it implicitly treats softmax attention as optimal, and therefore fixes the similarity function to the softmax kernel, which is non\-learnable\. Once the hidden dimensionddis specified, the softmax kernel—and consequently its corresponding Gaussian spectral densityp\(ω\)p\(\\omega\)—is fully determined, rendering the resulting attention patterns non\-learnable\. However, a growing body of workZhanget al\.\([2024](https://arxiv.org/html/2606.27748#bib.bib9)\); Menget al\.\([2025](https://arxiv.org/html/2606.27748#bib.bib7)\)has shown that softmax attention is not optimal in many scenarios, and that the attention kernel should instead be learned in a data\-driven manner\. This motivates extending the above framework to the learnable attention kernel\.
### 3\.2Flexformer
In this section, we first extend the fixed, non\-learnable attention kernel to a learnable one, allowing attention patterns to adapt to data and yielding Flexformer\. Then we further extend this formulation from stationary kernels to nonstationary kernels, resulting in a nonstationary variant of Flexformer\.
##### Flexformer with learnable stationary kernel\.
As stated in[Theorem˜3\.1](https://arxiv.org/html/2606.27748#S3.Thmtheorem1), a stationary kernelk\(𝐱−𝐲\)k\(\\mathbf\{x\}\-\\mathbf\{y\}\)is in one\-to\-one correspondence with its spectral densityp\(ω\)p\(\\omega\)\. Consequently, learning a kernel from data is equivalent to learning its spectral density from data\. Therefore, in this work, we directly treat the frequencies\{ωi\}i=1n\\\{\\omega\_\{i\}\\\}^\{n\}\_\{i=1\}as learnable parameters and optimize them end\-to\-end\. This is equivalent to learning an optimal spectral densityp\(ω\)p\(\\omega\), and hence learning an optimal attention kernel\. Similar techniques have been extensively explored in the Gaussian process literature\(Lázaro\-Gredillaet al\.,[2010](https://arxiv.org/html/2606.27748#bib.bib35); Sunet al\.,[2024](https://arxiv.org/html/2606.27748#bib.bib2)\)\. As[Theorem˜3\.1](https://arxiv.org/html/2606.27748#S3.Thmtheorem1)holds for all stationary kernels satisfying certain conditions, the kernels learned in this manner are sufficiently flexible\. We refer to this method asFlexformers\\textbf\{Flexformer\}\_\{\\textbf\{s\}\}, where the subscriptsdenotes the stationary kernel\.
It is worth noting that when the learned frequencies\{ωi\}i=1n\\\{\\omega\_\{i\}\\\}^\{n\}\_\{i=1\}follow a Gaussian distribution, this yields an approximation of the softmax kernel in[Equation˜4](https://arxiv.org/html/2606.27748#S3.E4), analogous to Random Feature AttentionPenget al\.\([2021](https://arxiv.org/html/2606.27748#bib.bib21)\)\. This observation implies that the kernel family induced by Flexformer strictly contains the unbiased estimate of the softmax kernel\. As a result, Flexformer is guaranteed to recover softmax attention closely, while also being capable of learning potentially better attention patterns directly from data\.
##### Flexformer with learnable nonstationary kernel\.
The red term in the softmax kernel in[Equation˜4](https://arxiv.org/html/2606.27748#S3.E4)is stationary, which motivates the use of[Theorem˜3\.1](https://arxiv.org/html/2606.27748#S3.Thmtheorem1)to deriveFlexformers\\text\{Flexformer\}\_\{\\text\{s\}\}\. However, a stationary kernel implies that its value depends only on the difference𝐱−𝐲\\mathbf\{x\}\-\\mathbf\{y\}, which limits flexibility\. An immediate and natural extension is to replace this stationary kernel with a nonstationary one, thereby enabling the model to learn more diverse attention patterns and adapt more flexibly to the data\. Since[Theorem˜3\.1](https://arxiv.org/html/2606.27748#S3.Thmtheorem1)only characterizes stationary kernels, it cannot be directly applied here\. Instead, we resort to the Yaglom theorem, which generalizes Bochner’s result to nonstationary kernels\.
###### Theorem 3\.2\.
Yaglom\([1987](https://arxiv.org/html/2606.27748#bib.bib3)\)A nonstationary kernelk\(𝐱,𝐲\):ℝd×ℝd→ℝk\(\\mathbf\{x\},\\mathbf\{y\}\):\\mathbb\{R\}^\{d\}\\times\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}is a bounded, continuous, and positive definite if and only if it can be represented as:
k\(𝐱,𝐲\)=∫ℝd×ℝdexp\(i\(ω1⊤𝐱−ω2⊤𝐲\)\)𝑑μ\(ω1,ω2\),k\(\\mathbf\{x\},\\mathbf\{y\}\)=\\int\_\{\\mathbb\{R\}^\{d\}\\times\\mathbb\{R\}^\{d\}\}\\exp\(i\(\\mathbf\{\\omega\}^\{\\top\}\_\{1\}\\mathbf\{x\}\-\\mathbf\{\\omega\}^\{\\top\}\_\{2\}\\mathbf\{y\}\)\)d\\mu\(\\mathbf\{\\omega\}\_\{1\},\\mathbf\{\\omega\}\_\{2\}\),whereμ\(ω1,ω2\)\\mu\(\\mathbf\{\\omega\}\_\{1\},\\mathbf\{\\omega\}\_\{2\}\)is a Lebesgue\-Stieltjes measure associated to some bounded positive semi\-definite spectral densityp\(ω1,ω2\)=μ\(ω1,ω2\)μ\(ℝd,ℝd\)p\(\\mathbf\{\\omega\}\_\{1\},\\mathbf\{\\omega\}\_\{2\}\)=\\frac\{\\mu\(\\mathbf\{\\omega\}\_\{1\},\\mathbf\{\\omega\}\_\{2\}\)\}\{\\mu\(\\mathbb\{R\}^\{d\},\\mathbb\{R\}^\{d\}\)\}\.
In fact, the red term in the softmax kernel in[Equation˜4](https://arxiv.org/html/2606.27748#S3.E4)is indeed a positive definite kernel\. Therefore, after extending it to the nonstationary setting, we continue to assume that it remains positive definite\. When approximating the kernel in[Theorem˜3\.2](https://arxiv.org/html/2606.27748#S3.Thmtheorem2)via Monte Carlo sampling methods, to ensure the positive semi\-definiteness of the spectral densityp\(𝝎1,𝝎2\)p\(\\boldsymbol\{\\omega\}\_\{1\},\\boldsymbol\{\\omega\}\_\{2\}\), we first assume that it is symmetrized by an arbitrary joint probability density:
p\(ω1,ω2\)=14\(g\(ω1,ω1\)\+g\(ω2,ω2\)\+g\(ω1,ω2\)\+g\(ω2,ω1\)\)\.\\displaystyle p\(\\omega\_\{1\},\\omega\_\{2\}\)=\\frac\{1\}\{4\}\(g\(\\omega\_\{1\},\\omega\_\{1\}\)\+g\(\\omega\_\{2\},\\omega\_\{2\}\)\+g\(\\omega\_\{1\},\\omega\_\{2\}\)\+g\(\\omega\_\{2\},\\omega\_\{1\}\)\)\.Letσ2=μ\(ℝd,ℝd\)\\sigma^\{2\}=\\mu\(\\mathbb\{R\}^\{d\},\\mathbb\{R\}^\{d\}\)denote the total measure of the space\. Then[Theorem˜3\.2](https://arxiv.org/html/2606.27748#S3.Thmtheorem2)can be expressed in the following expectation form, which can be approximated by Monte Carlo:
k\(𝐱,𝐲\)=\\displaystyle k\(\\mathbf\{x\},\\mathbf\{y\}\)=σ24𝔼g\(ω1,ω2\)\(\(cosω1⊤𝐱\+cosω2⊤𝐱\)\(cosω1⊤𝐲\+cosω2⊤𝐲\)\\displaystyle\\frac\{\\sigma^\{2\}\}\{4\}\\mathbb\{E\}\_\{g\(\\omega\_\{1\},\\omega\_\{2\}\)\}\(\(\\cos\\omega^\{\\top\}\_\{1\}\\mathbf\{x\}\+\\cos\\omega^\{\\top\}\_\{2\}\\mathbf\{x\}\)\(\\cos\\omega^\{\\top\}\_\{1\}\\mathbf\{y\}\+\\cos\\omega^\{\\top\}\_\{2\}\\mathbf\{y\}\)\+\(sinω1⊤𝐱\+sinω2⊤𝐱\)\(sinω1⊤𝐲\+sinω2⊤𝐲\)\)\\displaystyle\+\(\\sin\\omega^\{\\top\}\_\{1\}\\mathbf\{x\}\+\\sin\\omega^\{\\top\}\_\{2\}\\mathbf\{x\}\)\(\\sin\\omega^\{\\top\}\_\{1\}\\mathbf\{y\}\+\\sin\\omega^\{\\top\}\_\{2\}\\mathbf\{y\}\)\)≈\\displaystyle\\approxσ24n∑i=1n\(\(cosω1i⊤𝐱\+cosω2i⊤𝐱\)\(cosω1i⊤𝐲\+cosω2i⊤𝐲\)\\displaystyle\\frac\{\\sigma^\{2\}\}\{4n\}\\sum\_\{i=1\}^\{n\}\(\(\\cos\\omega^\{\\top\}\_\{1i\}\\mathbf\{x\}\+\\cos\\omega^\{\\top\}\_\{2i\}\\mathbf\{x\}\)\(\\cos\\omega^\{\\top\}\_\{1i\}\\mathbf\{y\}\+\\cos\\omega^\{\\top\}\_\{2i\}\\mathbf\{y\}\)\+\(sinω1i⊤𝐱\+sinω2i⊤𝐱\)\(sinω1i⊤𝐲\+sinω2i⊤𝐲\)\)\\displaystyle\+\(\\sin\\omega^\{\\top\}\_\{1i\}\\mathbf\{x\}\+\\sin\\omega^\{\\top\}\_\{2i\}\\mathbf\{x\}\)\(\\sin\\omega^\{\\top\}\_\{1i\}\\mathbf\{y\}\+\\sin\\omega^\{\\top\}\_\{2i\}\\mathbf\{y\}\)\)=\\displaystyle=σ24n∑i=1n\(cosω1i′⊤𝐱cosω2i′⊤𝐱cosω1i′⊤𝐲cosω2i′⊤𝐲\+sinω1i′⊤𝐱cosω2i′⊤𝐱sinω1i′⊤𝐲cosω2i′⊤𝐲\)\\displaystyle\\frac\{\\sigma^\{2\}\}\{4n\}\\sum\_\{i=1\}^\{n\}\(\\cos\\omega^\{\\prime\\top\}\_\{1i\}\\mathbf\{x\}\\cos\\omega^\{\\prime\\top\}\_\{2i\}\\mathbf\{x\}\\cos\\omega^\{\\prime\\top\}\_\{1i\}\\mathbf\{y\}\\cos\\omega^\{\\prime\\top\}\_\{2i\}\\mathbf\{y\}\+\\sin\\omega^\{\\prime\\top\}\_\{1i\}\\mathbf\{x\}\\cos\\omega^\{\\prime\\top\}\_\{2i\}\\mathbf\{x\}\\sin\\omega^\{\\prime\\top\}\_\{1i\}\\mathbf\{y\}\\cos\\omega^\{\\prime\\top\}\_\{2i\}\\mathbf\{y\}\)=\\displaystyle=ϕn\(𝐱\)⊤ϕn\(𝐲\),\\displaystyle\\phi\_\{n\}\(\\mathbf\{x\}\)^\{\\top\}\\phi\_\{n\}\(\\mathbf\{y\}\),\(6\)whereϕn\(𝐱\)=σ2n\[cosω11′⊤𝐱cosω21′⊤𝐱,⋯,cosω1n′⊤𝐱cosω2n′⊤𝐱,sinω11′⊤𝐱cosω21′⊤𝐱,⋯,sinω1n′⊤𝐱cosω2n′⊤𝐱\]⊤\\phi\_\{n\}\(\\mathbf\{x\}\)=\\frac\{\\sigma\}\{2\\sqrt\{n\}\}\[\\cos\\omega^\{\\prime\\top\}\_\{11\}\\mathbf\{x\}\\cos\\omega^\{\\prime\\top\}\_\{21\}\\mathbf\{x\},\\cdots,\\cos\\omega^\{\\prime\\top\}\_\{1n\}\\mathbf\{x\}\\cos\\omega^\{\\prime\\top\}\_\{2n\}\\mathbf\{x\},\\sin\\omega^\{\\prime\\top\}\_\{11\}\\mathbf\{x\}\\cos\\omega^\{\\prime\\top\}\_\{21\}\\mathbf\{x\},\\cdots,\\sin\\omega^\{\\prime\\top\}\_\{1n\}\\mathbf\{x\}\\\\ \\cos\\omega^\{\\prime\\top\}\_\{2n\}\\mathbf\{x\}\]^\{\\top\},\{\(ω1i,ω2i\)\}i=1n\\\{\(\\omega\_\{1i\},\\omega\_\{2i\}\)\\\}\_\{i=1\}^\{n\}are i\.i\.d\. frequencies fromg\(ω1,ω2\)g\(\\omega\_\{1\},\\omega\_\{2\}\),nndenotes the number of frequencies, andω1i′=ω1i\+ω2i2,ω2i′=ω1i−ω2i2\\omega\_\{1i\}^\{\\prime\}=\\frac\{\\omega\_\{1i\}\+\\omega\_\{2i\}\}\{2\},\\omega\_\{2i\}^\{\\prime\}=\\frac\{\\omega\_\{1i\}\-\\omega\_\{2i\}\}\{2\}\. Substituting[Section˜3\.2](https://arxiv.org/html/2606.27748#S3.Ex7)into the red term of[Equation˜4](https://arxiv.org/html/2606.27748#S3.E4)yields a random Fourier feature representation of the attention kernel\. For the blue termexp\(∥𝐱∥22d\)\\exp\\\!\\left\(\\frac\{\\lVert\\mathbf\{x\}\\rVert^\{2\}\}\{2\\sqrt\{d\}\}\\right\), we preserve its functional form while replacing the fixed scaling factor2d2\\sqrt\{d\}with a learnable parameterexp\(τ\)\\exp\(\\tau\)\. This leads to the following attention kernel approximation:
k\(𝐱,𝐲\)≈ϕ~n\(𝐱\)⊤ϕ~n\(𝐲\),k\(\\mathbf\{x\},\\mathbf\{y\}\)\\approx\\tilde\{\\phi\}\_\{n\}\(\\mathbf\{x\}\)^\{\\top\}\\tilde\{\\phi\}\_\{n\}\(\\mathbf\{y\}\),whereϕ~n\(𝐱\)=exp\(∥𝐱∥2exp\(τ\)\)ϕn\(𝐱\)\\tilde\{\\phi\}\_\{n\}\(\\mathbf\{x\}\)=\\exp\\\!\\left\(\\frac\{\\lVert\\mathbf\{x\}\\rVert^\{2\}\}\{\\exp\(\\tau\)\}\\right\)\\phi\_\{n\}\(\\mathbf\{x\}\)\. Both scaling parameterτ\\tauand spectral pairs\{\(ω1i,ω2i\)\}i=1n\\\{\(\\omega\_\{1i\},\\omega\_\{2i\}\)\\\}\_\{i=1\}^\{n\}are treated as learnable parameters\. We refer to this nonstationary variant asFlexformern\\textbf\{Flexformer\}\_\{\\textbf\{n\}\}, where the subscriptndenotes the nonstationary kernel\. We provide an analysis of the time and space complexity, as well as the additional parameter overhead of Flexformer, in Appendix[B](https://arxiv.org/html/2606.27748#A2)\.
## 4Experiments
To evaluate the effectiveness and efficiency of Flexformer as a linear\-attention variant, we conduct extensive experiments on long\-context classification, autoregressive language modeling, softmax attention recovery, and cross\-domain kernel transfer\. All results report the mean over three runs, with full results including standard deviations provided in Appendix[C](https://arxiv.org/html/2606.27748#A3)\. Additional hyperparameter sensitivity analyses, and experimental results are deferred to Appendix[D](https://arxiv.org/html/2606.27748#A4)\.
### 4\.1Long Sequence Classification
Table 1:Performance comparison on LRA benchmarks\. All reported metrics represent test accuracy \(%\), where higher values are better\. The best results are inbold, and the second best inunderlined\.ModelListOpsTextRetrievalImagePathfinderAverageTransformerVaswaniet al\.\([2017](https://arxiv.org/html/2606.27748#bib.bib4)\)37\.2365\.6664\.4941\.8973\.4656\.55ReformerKitaevet al\.\([2020](https://arxiv.org/html/2606.27748#bib.bib45)\)36\.6263\.5257\.6040\.0673\.6154\.28LongformerBeltagyet al\.\([2020](https://arxiv.org/html/2606.27748#bib.bib44)\)36\.4663\.0959\.2441\.2372\.4454\.49Linear Trans\.Katharopouloset al\.\([2020](https://arxiv.org/html/2606.27748#bib.bib26)\)17\.4262\.6458\.9243\.4075\.1151\.49BigBirdZaheeret al\.\([2020](https://arxiv.org/html/2606.27748#bib.bib46)\)36\.8865\.3464\.2341\.9373\.5856\.39PerformerChoromanskiet al\.\([2021](https://arxiv.org/html/2606.27748#bib.bib25)\)16\.9763\.6755\.1741\.2375\.3850\.48RFAPenget al\.\([2021](https://arxiv.org/html/2606.27748#bib.bib21)\)17\.2064\.8958\.7641\.0074\.5151\.27SkyformerChenet al\.\([2021](https://arxiv.org/html/2606.27748#bib.bib43)\)36\.7165\.5168\.0541\.3674\.1857\.16CosformerQinet al\.\([2022](https://arxiv.org/html/2606.27748#bib.bib10)\)37\.4465\.1664\.0443\.5672\.3256\.50HedgehogZhanget al\.\([2024](https://arxiv.org/html/2606.27748#bib.bib9)\)37\.0264\.1068\.6741\.6475\.8957\.47PolaformerMenget al\.\([2025](https://arxiv.org/html/2606.27748#bib.bib7)\)36\.9369\.6565\.3940\.9573\.2657\.23Flexformers\\text\{Flexformer\}\_\{\\text\{s\}\}\(Ours\)37\.0967\.3871\.5543\.3976\.1559\.11Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}\(Ours\)37\.1865\.7273\.9445\.4277\.6659\.99
To evaluate Flexformer’s ability to model long sequences and reduce cost, we conduct long\-sequence classification experiments on the Long Range Arena \(LRA\) benchmarkTayet al\.\([2021](https://arxiv.org/html/2606.27748#bib.bib1)\)\. LRA is designed to assess both the effectiveness and efficiency of sequence models under long\-context settings\. It includes tasks such as Long ListOpsNangia and Bowman \([2018](https://arxiv.org/html/2606.27748#bib.bib37)\), byte\-level text classificationMaaset al\.\([2011](https://arxiv.org/html/2606.27748#bib.bib38)\), byte\-level document retrievalRadevet al\.\([2009](https://arxiv.org/html/2606.27748#bib.bib39)\), image classification on sequences of pixelsKrizhevsky and Hinton \([2009](https://arxiv.org/html/2606.27748#bib.bib40)\), and PathfinderHoutkamp and Roelfsema \([2010](https://arxiv.org/html/2606.27748#bib.bib41)\)\. The input sequence lengths in these tasks range from1K1\\text\{K\}to4K4\\text\{K\}\. We adopt the same hyperparameter configurations as those used in the official open\-source implementation\.
We present in[Table˜1](https://arxiv.org/html/2606.27748#S4.T1)a comprehensive performance comparison between Flexformer, the vanilla Transformer, several classical Transformer variants, and a range of existing linear attention methods\. Early approaches that rely on random Fourier features to obtain unbiased estimators of the softmax kernel, such as RFA and Performer, substantially underperform the vanilla Transformer, particularly on challenging tasks such as ListOps and document retrieval\. Methods based on learnable kernels, including Hedgehog and Polaformer, exhibit strong performance on certain specific tasks, but their effectiveness does not consistently generalize across all benchmarks\. In contrast, the proposed Flexformer, benefiting from its flexible and expressive feature map representation, achieves consistently strong performance across all tasks\. In particular,Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}attains the best results on document retrieval, image classification, and Pathfinder, while remaining highly competitive on the remaining tasks\. Overall, it achieves the highest average accuracy, corresponding to a4\.4%4\.4\\%relative improvement over the best existing linear attention method\. The stationary variant,Flexformers\\text\{Flexformer\}\_\{\\text\{s\}\}, ranks second overall and also outperforms all prior linear attention approaches\.
\(a\)Forward Pass Time & GPU Memory Usage vs\. Sequence Length\.
\(b\)Test accuracy, training speed and peak GPU memory usage comparison on document retrieval\.
Figure 2:Comparison of runtime and memory usage across models\. As shown in \(a\), Flexformer scales linearly with sequence length in both time and memory\. Together with \(b\), the results show that Flexformer achieves higher efficiency and lower memory usage than other baselines\.We further compare the computational time and memory usage of different Transformer variants on the LRA benchmark in[Figure˜2](https://arxiv.org/html/2606.27748#S4.F2)\. In[Figure˜2\(a\)](https://arxiv.org/html/2606.27748#S4.F2.sf1), we evaluate the forward\-pass computation time and memory consumption of the vanilla Transformer, Linear Transformer, and our proposedFlexformern\\text\{Flexformer\}\_\{\\text\{n\}\}under same hyperparameter configurations on the document retrieval task using input sequences of varying lengths\. As expected, both the computation time and memory usage of the vanilla Transformer scale quadratically with the sequence length, resulting in a rapid increase as the input length grows\. In contrast,Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}exhibits linear complexity with respect to sequence length, closely matching the scaling behavior of the Linear Transformer\.
In[Figure˜2\(b\)](https://arxiv.org/html/2606.27748#S4.F2.sf2), we compare different Transformer variants under identical training hyperparameter settings in terms of test accuracy, training throughput, and peak memory usage on the document retrieval task\. This configuration uses a maximum sequence length ofN=4000N=4000and a per\-head dimension ofd=64d=64in the encoder, satisfyingN≫dN\\gg d\. Under this setting, all linear attention models demonstrate substantially faster training speed and significantly lower memory consumption compared to the vanilla Transformer\. Compared with other linear attention methods,Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}introduces additional parameters and employs a slightly more complex feature map, leading to marginally reduced efficiency and memory savings\. However, this modest overhead is offset by significantly improved model performance while preserving linear time and space complexity\. On the document retrieval task, Flexformer achieves a2\.6×2\.6\\timestraining speedup and an84%84\\%reduction in memory usage relative to the vanilla Transformer\.
### 4\.2Autoregressive Language Modeling
To evaluate Flexformer’s effectiveness on text generation tasks, we conduct autoregressive language modeling experiments on WikiText\-103Merityet al\.\([2017](https://arxiv.org/html/2606.27748#bib.bib14)\)\. We use the vanilla Transformer with adaptive inputsBaevski and Auli \([2019](https://arxiv.org/html/2606.27748#bib.bib15)\)as the base architecture and replace its softmax attention with various linear attentions\. Following the protocol ofPenget al\.\([2021](https://arxiv.org/html/2606.27748#bib.bib21)\), we evaluate small and large models with roughly 41M and 247M parameters, respectively\. The detailed hyperparameter settings are provided in Appendix[A\.2\.1](https://arxiv.org/html/2606.27748#A1.SS2.SSS1)\. For all experiments, we set the sentence block size to 512 and train the models for 100K steps using a batch size of 128\.
The final evaluation results of all baselines on both the development and test sets are reported in[Table˜2](https://arxiv.org/html/2606.27748#S4.T2)\. Under both model configurations, the vanilla Transformer with softmax attention consistently achieves strong performance\. Hedgehog and Polaformer enable linear attention to preserve the low\-entropy property of softmax attention, resulting in substantial gains over earlier linear\-attention methods\. Nevertheless, they still fall short of matching the vanilla Transformer, although the performance gap narrows under the large model configuration\.
Table 2:Language model perplexity \(PPL\) on the WikiText\-103 development and test sets\. Lower PPL is better\. The best results for each benchmark are inbold, and the second best inunderlined\.Small\(41M41\\text\{M\}Params\)Big\(247M247\\text\{M\}Params\)ModelDev\.TestDev\.TestVanilla Trans\.30\.031\.322\.623\.8Linear Trans\.34\.135\.627\.929\.2RFA33\.835\.326\.728\.0Cosformer32\.834\.224\.926\.2Hedgehog31\.733\.223\.024\.2Polaformer31\.532\.923\.124\.3Flexformers\\text\{Flexformer\}\_\{\\text\{s\}\}31\.132\.722\.223\.4Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}30\.331\.621\.222\.3
Notably, both Flexformer variants outperform all other linear attention methods under the small model configuration and achieve performance very close to that of the vanilla Transformer\. Under the big model configuration, Flexformer further surpasses the vanilla Transformer, withFlexformern\\text\{Flexformer\}\_\{\\text\{n\}\}demonstrating a particularly pronounced improvement\. This observation suggests that learnable kernels benefit more from increased model capacity, as the additional parameters introduced by kernel learning allow better utilization of larger model scales\. Importantly, the relative increase in parameter count remains nearly constant across configurations \(see Appendix[A\.2\.1](https://arxiv.org/html/2606.27748#A1.SS2.SSS1)\)\. Moreover,Flexformers\\text\{Flexformer\}\_\{\\text\{s\}\}introduces a comparable number of additional parameters to Hedgehog and fewer than Polaformer, yet still achieves superior performance\.
### 4\.3Recovering Softmax Attention
Besides training from scratch, another scenario that demands efficient linear attention is when we already have a traditional Transformer pretrained or finetuned, and we need to replace its softmax attention with linear attention while recovering the original performance\. We adopt the attention weight distillation lossZhanget al\.\([2024](https://arxiv.org/html/2606.27748#bib.bib9)\)to supervise the attention weights computed by Flexformer with the attention weights derived from softmax\. We first download a pretrained 125M RoBERTaLiuet al\.\([2019](https://arxiv.org/html/2606.27748#bib.bib6)\)model using the fairseqOttet al\.\([2019](https://arxiv.org/html/2606.27748#bib.bib47)\)library and then finetune it on each dataset of GLUEWanget al\.\([2018](https://arxiv.org/html/2606.27748#bib.bib8)\)benchmark\. We compare the following categories of methods: \(1\) Directly finetuning the pretrained RoBERTa on the corresponding dataset\. \(2\) Directly replace the attention computation in RoBERTa with Performer and RFA \(they are not learnable\), and then finetune the models\. \(3\) Starting from a finetuned RoBERTa model, we perform attention weights distillation on the corresponding dataset to convert it intoFlexformern\\text\{Flexformer\}\_\{\\text\{n\}\}and Hedgehog, respectively, followed by finetuning of the resulting models\. More detailed experimental settings are provided in Appendix[A\.3](https://arxiv.org/html/2606.27748#A1.SS3)\. For these three methods, we use the same finetuning hyperparameter configuration and present the experimental results in[Table˜3](https://arxiv.org/html/2606.27748#S4.T3)\.
Table 3:Performance comparison on GLUE\. On CoLA, the Matthews correlation coefficient is reported; on STS\-B, the average of Pearson and Spearman correlation coefficients is reported; for all other tasks, accuracy is reported\. Higher values indicate better performance for all metrics\. The best results of each task \(excluding RoBERTa\) are inbold, and the second best inunderlined\.ModelCoLASST\-2MRPCSTS\-BQQPMNLIQNLIRTEAverage\\rowcoloryellow\!30 RoBERTa \(reference\)61\.994\.888\.590\.291\.787\.692\.776\.985\.5Performer \(unlearnable\)22\.888\.966\.338\.884\.477\.576\.351\.663\.3RFA \(unlearnable\)27\.690\.164\.241\.286\.380\.875\.853\.064\.9Hedgehog \(distilled\)62\.593\.386\.087\.791\.283\.791\.271\.083\.3Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}\(distilled\)63\.393\.685\.888\.391\.185\.191\.071\.583\.7
Compared to directly finetuning pretrained RoBERTa, replacing its softmax attention with linear attention methods that provide unbiased estimates of the softmax kernel severely degrades performance, especially on several datasets\. In contrast, distilled Hedgehog and Flexformer both nearly fully recover the softmax attention behavior\. Furthermore, the learnable kernel demonstrates performance that even surpasses standard softmax attention on CoLA\.
### 4\.4Transferability of Learned Kernel
In this section, we examine whether the attention kernels learned by Flexformer on one dataset can be transferred to datasets from other domains, thus maintaining strong generalization performance\. To this end, we first perform attention weight distillation on a pretrained RoBERTa model using a single source dataset, converting it into a linear\-attention model\. Specifically, we use four GLUE datasets—STS\-B, MNLI, QQP, and QNLI—to obtain four distilled models, each learned from a different source domain\. We then fine\-tune these four distilled models on the remaining GLUE datasets, enabling us to evaluate their cross\-domain transferability\. The complete results are in[Table˜11](https://arxiv.org/html/2606.27748#A4.T11)\. In[Table˜4](https://arxiv.org/html/2606.27748#S4.T4), we report the averaged performance, where each column shows the average performance on the target task over kernels learned from different source datasets\.
Table 4:Comparison of the transfer performance of kernels learned byFlexformern\\text\{Flexformer\}\_\{\\text\{n\}\}and Hedgehog\. For each target task, we report the average performance over kernels learned from different source domains\. The best results of each task are inbold\.ModelCoLASST\-2MRPCSTS\-BQQPMNLIQNLIRTEHedgehog55\.692\.676\.385\.389\.682\.787\.553\.1Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}60\.393\.379\.586\.090\.784\.889\.556\.1
We observe that, when transferring the learned attention kernel from a source domain to a target domain,Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}consistently achieves superior performance across all settings\. Moreover, compared to the in\-domain results reported in[Table˜3](https://arxiv.org/html/2606.27748#S4.T3), Hedgehog suffers a particularly severe performance degradation on the CoLA dataset under kernel transfer, whereasFlexformern\\text\{Flexformer\}\_\{\\text\{n\}\}exhibits a substantially smaller performance drop\. These results demonstrate that the attention kernels learned by Flexformer possess stronger cross\-domain transferability than those of existing baselines\.
## 5Related Works
Kernel\-based linear attention methods reduce the quadratic complexity of softmax attention by expressing the attention kernel as an inner product of feature maps\. Existing approaches differ in how these feature maps are constructed\. Early methods approximate the exponential function via Taylor expansionAroraet al\.\([2024](https://arxiv.org/html/2606.27748#bib.bib29)\), but even low\-order approximations lead to prohibitively high feature dimensionsde Brébisson and Vincent \([2016](https://arxiv.org/html/2606.27748#bib.bib28)\)\. Other works adopt simple element\-wise mappings, such as Linear TransformerKatharopouloset al\.\([2020](https://arxiv.org/html/2606.27748#bib.bib26)\)and CosformerQinet al\.\([2022](https://arxiv.org/html/2606.27748#bib.bib10)\), which are efficient but limited in expressiveness\. Random\-feature\-based methods, including PerformerChoromanskiet al\.\([2021](https://arxiv.org/html/2606.27748#bib.bib25)\)and RFAPenget al\.\([2021](https://arxiv.org/html/2606.27748#bib.bib21)\), use random Fourier features to obtain unbiased estimators of the softmax kernel, assuming softmax attention is optimal and fixing the kernel a priori\. Recent approaches introduce learnable feature maps: HedgehogZhanget al\.\([2024](https://arxiv.org/html/2606.27748#bib.bib9)\)exploits the low\-entropy property of softmax using exponential\-based mappings, while PolaFormerMenget al\.\([2025](https://arxiv.org/html/2606.27748#bib.bib7)\)incorporates polarity\-aware mixing and power functions\. These methods empirically design kernels with desirable properties, but the expressiveness of their kernel families is not explicitly characterized\. In contrast, Flexformer is grounded in spectral representation theory, which establishes a one\-to\-one correspondence between a kernel and its spectral measure\. By treating spectral frequencies as learnable parameters and optimizing them directly from data, Flexformer enables a principled, data\-driven construction of attention kernels and can model a broader and more expressive kernel family\.
## 6Conclusions
We proposed Flexformer, a flexible kernel\-based linear attention framework that learns attention kernels directly from data using Fourier features\. By treating spectral frequencies as trainable parameters, Flexformer expands the expressiveness of linear attention while preserving linear time and space complexity with respect to sequence length\. We further introduced stationary and nonstationary variants, depending on the desired level of flexibility in the kernel\. Empirically, extensive experiments on language modeling and sequence classification tasks demonstrate that Flexformer consistently outperforms existing linear attention baselines\. Moreover, when distilled from pretrained Transformers, Flexformer can approximate softmax attention while enabling efficient linear attention inference\. We also observe strong cross\-domain kernel transferability\. Overall, Flexformer provides an effective and scalable alternative to softmax attention for long\-sequence modeling\.
Limitations\.Flexformer assumes the learned attention kernel is positive definite, but whether positive definite kernels are truly the best choice for attention remains theoretically unexplored\. Establishing a rigorous justification for this assumption is an important direction for future work\.
## References
- \[1\]S\. Arora, S\. Eyuboglu, M\. Zhang, A\. Timalsina, S\. Alberti, J\. Zou, A\. Rudra, and C\. Re\(2024\)Simple linear attention language models balance the recall\-throughput tradeoff\.InInternational Conference on Machine Learning,Cited by:[§5](https://arxiv.org/html/2606.27748#S5.p1.1)\.
- \[2\]A\. Baevski and M\. Auli\(2019\)Adaptive input representations for neural language modeling\.InInternational Conference on Learning Representations,Cited by:[§4\.2](https://arxiv.org/html/2606.27748#S4.SS2.p1.1)\.
- \[3\]I\. Beltagy, M\. E\. Peters, and A\. Cohan\(2020\)Longformer: the long\-document transformer\.arXiv\.Cited by:[Table 8](https://arxiv.org/html/2606.27748#A3.T8.18.18.18.7),[Table 1](https://arxiv.org/html/2606.27748#S4.T1.2.2.6.1)\.
- \[4\]S\. Bochner\(2005\)Harmonic analysis and the theory of probability\.Courier Corporation\.Cited by:[Theorem 3\.1](https://arxiv.org/html/2606.27748#S3.Thmtheorem1.p1.1)\.
- \[5\]T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei\(2020\)Language models are few\-shot learners\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.27748#S1.p1.1)\.
- \[6\]Y\. Chen, Q\. Zeng, H\. Ji, and Y\. Yang\(2021\)Skyformer: remodel self\-attention with gaussian kernel and nyström method\.InAdvances in Neural Information Processing Systems,Cited by:[Table 8](https://arxiv.org/html/2606.27748#A3.T8.48.48.48.7),[Table 1](https://arxiv.org/html/2606.27748#S4.T1.2.2.11.1)\.
- \[7\]K\. M\. Choromanski, V\. Likhosherstov, D\. Dohan, X\. Song, A\. Gane, T\. Sarlos, P\. Hawkins, J\. Q\. Davis, A\. Mohiuddin, L\. Kaiser, D\. B\. Belanger, L\. J\. Colwell, and A\. Weller\(2021\)Rethinking attention with performers\.InInternational Conference on Learning Representations,Cited by:[§B\.2](https://arxiv.org/html/2606.27748#A2.SS2.p1.8),[Table 8](https://arxiv.org/html/2606.27748#A3.T8.36.36.36.7),[§1](https://arxiv.org/html/2606.27748#S1.p3.1),[Table 1](https://arxiv.org/html/2606.27748#S4.T1.2.2.9.1),[§5](https://arxiv.org/html/2606.27748#S5.p1.1)\.
- \[8\]A\. de Brébisson and P\. Vincent\(2016\)An exploration of softmax alternatives belonging to the spherical loss family\.InInternational Conference on Learning Representations,Cited by:[§5](https://arxiv.org/html/2606.27748#S5.p1.1)\.
- \[9\]J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova\(2019\)BERT: pre\-training of deep bidirectional transformers for language understanding\.InNorth American Chapter of the Association for Computational Linguistics,Cited by:[§1](https://arxiv.org/html/2606.27748#S1.p1.1)\.
- \[10\]A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, J\. Uszkoreit, and N\. Houlsby\(2021\)An image is worth 16x16 words: transformers for image recognition at scale\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.27748#S1.p1.1)\.
- \[11\]F\. Eilers and X\. Jiang\(2023\)Building blocks for a complex\-valued transformer architecture\.InIEEE International Conference on Acoustics, Speech and Signal Processing,Cited by:[§1](https://arxiv.org/html/2606.27748#S1.p1.1)\.
- \[12\]R\. Houtkamp and P\. Roelfsema\(2010\)Parallel and serial grouping of image elements in visual perception\.Journal of Experimental Psychology: Human Perception and Performance\.Cited by:[§A\.1\.1](https://arxiv.org/html/2606.27748#A1.SS1.SSS1.Px5.p1.1),[§4\.1](https://arxiv.org/html/2606.27748#S4.SS1.p1.2)\.
- \[13\]A\. Katharopoulos, A\. Vyas, N\. Pappas, and F\. Fleuret\(2020\)Transformers are rnns: fast autoregressive transformers with linear attention\.InInternational Conference on Machine Learning,Cited by:[§B\.2](https://arxiv.org/html/2606.27748#A2.SS2.p1.8),[Table 8](https://arxiv.org/html/2606.27748#A3.T8.24.24.24.7),[§1](https://arxiv.org/html/2606.27748#S1.p3.1),[Table 1](https://arxiv.org/html/2606.27748#S4.T1.2.2.7.1),[§5](https://arxiv.org/html/2606.27748#S5.p1.1)\.
- \[14\]D\. P\. Kingma and J\. Ba\(2015\)Adam: a method for stochastic optimization\.InInternational Conference on Learning Representations,Cited by:[§A\.2](https://arxiv.org/html/2606.27748#A1.SS2.p1.1)\.
- \[15\]N\. Kitaev, L\. Kaiser, and A\. Levskaya\(2020\)Reformer: the efficient transformer\.InInternational Conference on Learning Representations,Cited by:[Table 8](https://arxiv.org/html/2606.27748#A3.T8.12.12.12.7),[§1](https://arxiv.org/html/2606.27748#S1.p3.1),[Table 1](https://arxiv.org/html/2606.27748#S4.T1.2.2.5.1)\.
- \[16\]A\. Krizhevsky and G\. Hinton\(2009\)Learning multiple layers of features from tiny images\.Technical reportUniversity of Toronto\.Cited by:[§A\.1\.1](https://arxiv.org/html/2606.27748#A1.SS1.SSS1.Px4.p1.1),[§4\.1](https://arxiv.org/html/2606.27748#S4.SS1.p1.2)\.
- \[17\]M\. Lázaro\-Gredilla, J\. Quiñonero\-Candela, C\. E\. Rasmussen, and A\. R\. Figueiras\-Vidal\(2010\)Sparse spectrum gaussian process regression\.Journal of Machine Learning Research\.Cited by:[§3\.2](https://arxiv.org/html/2606.27748#S3.SS2.SSS0.Px1.p1.6)\.
- \[18\]Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov\(2019\)RoBERTa: a robustly optimized bert pretraining approach\.arXiv\.Cited by:[§1](https://arxiv.org/html/2606.27748#S1.p1.1),[§4\.3](https://arxiv.org/html/2606.27748#S4.SS3.p1.1)\.
- \[19\]Y\. Liu, T\. Hu, H\. Zhang, H\. Wu, S\. Wang, L\. Ma, and M\. Long\(2024\)ITransformer: inverted transformers are effective for time series forecasting\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.27748#S1.p1.1)\.
- \[20\]Z\. Liu, Y\. Lin, Y\. Cao, H\. Hu, Y\. Wei, Z\. Zhang, S\. Lin, and B\. Guo\(2021\)Swin transformer: hierarchical vision transformer using shifted windows\.InProceedings of the IEEE/CVF International Conference on Computer Vision,Cited by:[§1](https://arxiv.org/html/2606.27748#S1.p1.1)\.
- \[21\]A\. L\. Maas, R\. E\. Daly, P\. T\. Pham, D\. Huang, A\. Y\. Ng, and C\. Potts\(2011\)Learning word vectors for sentiment analysis\.InAnnual Meeting of the Association for Computational Linguistics,Cited by:[§A\.1\.1](https://arxiv.org/html/2606.27748#A1.SS1.SSS1.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.27748#S4.SS1.p1.2)\.
- \[22\]W\. Meng, Y\. Luo, X\. Li, D\. Jiang, and Z\. Zhang\(2025\)PolaFormer: polarity\-aware linear attention for vision transformers\.InInternational Conference on Learning Representations,Cited by:[§B\.2](https://arxiv.org/html/2606.27748#A2.SS2.p1.8),[Table 8](https://arxiv.org/html/2606.27748#A3.T8.66.66.66.7),[§1](https://arxiv.org/html/2606.27748#S1.p4.1),[§3\.1](https://arxiv.org/html/2606.27748#S3.SS1.p4.2),[Table 1](https://arxiv.org/html/2606.27748#S4.T1.2.2.14.1),[§5](https://arxiv.org/html/2606.27748#S5.p1.1)\.
- \[23\]S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher\(2017\)Pointer sentinel mixture models\.InInternational Conference on Learning Representations,Cited by:[§A\.1\.2](https://arxiv.org/html/2606.27748#A1.SS1.SSS2.p1.1),[§4\.2](https://arxiv.org/html/2606.27748#S4.SS2.p1.1)\.
- \[24\]N\. Nangia and S\. R\. Bowman\(2018\)ListOps: a diagnostic dataset for latent tree learning\.arXiv\.Cited by:[§A\.1\.1](https://arxiv.org/html/2606.27748#A1.SS1.SSS1.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.27748#S4.SS1.p1.2)\.
- \[25\]Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam\(2023\)A time series is worth 64 words: long\-term forecasting with transformers\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.27748#S1.p1.1)\.
- \[26\]M\. Ott, S\. Edunov, A\. Baevski, A\. Fan, S\. Gross, N\. Ng, D\. Grangier, and M\. Auli\(2019\)Fairseq: a fast, extensible toolkit for sequence modeling\.arXiv preprint arXiv:1904\.01038\.Cited by:[§4\.3](https://arxiv.org/html/2606.27748#S4.SS3.p1.1)\.
- \[27\]A\. Paszke, S\. Gross, F\. Massa, A\. Lerer, J\. Bradbury, G\. Chanan, T\. Killeen, Z\. Lin, N\. Gimelshein, L\. Antiga, A\. Desmaison, A\. Kopf, E\. Yang, Z\. DeVito, M\. Raison, A\. Tejani, S\. Chilamkurthy, B\. Steiner, L\. Fang, J\. Bai, and S\. Chintala\(2019\)PyTorch: an imperative style, high\-performance deep learning library\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.2](https://arxiv.org/html/2606.27748#A1.SS2.p1.1)\.
- \[28\]H\. Peng, N\. Pappas, D\. Yogatama, R\. Schwartz, N\. Smith, and L\. Kong\(2021\)Random feature attention\.InInternational Conference on Learning Representations,Cited by:[§B\.1](https://arxiv.org/html/2606.27748#A2.SS1.p1.7),[Table 8](https://arxiv.org/html/2606.27748#A3.T8.42.42.42.7),[§3\.1](https://arxiv.org/html/2606.27748#S3.SS1.p3.2),[§3\.2](https://arxiv.org/html/2606.27748#S3.SS2.SSS0.Px1.p2.1),[§4\.2](https://arxiv.org/html/2606.27748#S4.SS2.p1.1),[Table 1](https://arxiv.org/html/2606.27748#S4.T1.2.2.10.1),[§5](https://arxiv.org/html/2606.27748#S5.p1.1)\.
- \[29\]Z\. Qin, W\. Sun, H\. Deng, D\. Li, Y\. Wei, B\. Lv, J\. Yan, L\. Kong, and Y\. Zhong\(2022\)CosFormer: rethinking softmax in attention\.InInternational Conference on Learning Representations,Cited by:[Table 8](https://arxiv.org/html/2606.27748#A3.T8.54.54.54.7),[§1](https://arxiv.org/html/2606.27748#S1.p3.1),[Table 1](https://arxiv.org/html/2606.27748#S4.T1.2.2.12.1),[§5](https://arxiv.org/html/2606.27748#S5.p1.1)\.
- \[30\]D\. R\. Radev, P\. Muthukrishnan, and V\. Qazvinian\(2009\)The acl anthology network corpus\.InWorkshop on Text and Citation Analysis for Scholarly Digital Libraries,Cited by:[§A\.1\.1](https://arxiv.org/html/2606.27748#A1.SS1.SSS1.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.27748#S4.SS1.p1.2)\.
- \[31\]A\. Rahimi and B\. Recht\(2007\)Random features for large\-scale kernel machines\.InAdvances in Neural Information Processing Systems,Cited by:[§3\.1](https://arxiv.org/html/2606.27748#S3.SS1.p1.6)\.
- \[32\]A\. Roy, M\. T\. Saffar, D\. Grangier, and A\. Vaswani\(2020\)Efficient content\-based sparse attention with routing transformers\.InarXiv,Cited by:[§1](https://arxiv.org/html/2606.27748#S1.p3.1)\.
- \[33\]S\. Schneider, A\. Baevski, R\. Collobert, and M\. Auli\(2019\)Wav2vec: unsupervised pre\-training for speech recognition\.arXiv preprint arXiv:1904\.05862\.Cited by:[§1](https://arxiv.org/html/2606.27748#S1.p1.1)\.
- \[34\]A\. Sherstinsky\(2020\)Fundamentals of recurrent neural network \(rnn\) and long short\-term memory \(lstm\) network\.Physica D: Nonlinear Phenomena\.Cited by:[§1](https://arxiv.org/html/2606.27748#S1.p1.1)\.
- \[35\]J\. Sieber, C\. Amo Alonso, A\. Didier, M\. Zeilinger, and A\. Orvieto\(2024\)Understanding the differences in foundation models: attention, state space models, and recurrent neural networks\.InAdvances in Neural Information Processing Systems,Cited by:[§3\.1](https://arxiv.org/html/2606.27748#S3.SS1.p1.3)\.
- \[36\]Z\. Sun, Y\. Zhang, Z\. Ling, X\. Fan, and F\. Zhou\(2024\)Nonstationary sparse spectral permanental process\.InProceedings of the 38th International Conference on Neural Information Processing Systems,Cited by:[§3\.2](https://arxiv.org/html/2606.27748#S3.SS2.SSS0.Px1.p1.6)\.
- \[37\]Y\. Tay, M\. Dehghani, S\. Abnar, Y\. Shen, D\. Bahri, P\. Pham, J\. Rao, L\. Yang, S\. Ruder, and D\. Metzler\(2021\)Long range arena: a benchmark for efficient transformers\.InInternational Conference on Learning Representations,Cited by:[§4\.1](https://arxiv.org/html/2606.27748#S4.SS1.p1.2)\.
- \[38\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin\(2017\)Attention is all you need\.InProceedings of the 31st International Conference on Neural Information Processing Systems,Cited by:[Table 8](https://arxiv.org/html/2606.27748#A3.T8.6.6.6.7),[§1](https://arxiv.org/html/2606.27748#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.27748#S2.SS1.p1.9),[Table 1](https://arxiv.org/html/2606.27748#S4.T1.2.2.4.1)\.
- \[39\]A\. Wang, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. R\. Bowman\(2018\)GLUE: a multi\-task benchmark and analysis platform for natural language understanding\.InBlackboxNLP at EMNLP,Cited by:[§4\.3](https://arxiv.org/html/2606.27748#S4.SS3.p1.1)\.
- \[40\]A\. M\. Yaglom\(1987\)Correlation theory of stationary and related random functions, volume i: basic results\.Springer\.Cited by:[Theorem 3\.2](https://arxiv.org/html/2606.27748#S3.Thmtheorem2.p1.1)\.
- \[41\]M\. Zaheer, G\. Guruganesh, A\. Dubey, J\. Ainslie, C\. Alberti, S\. Ontanon, P\. Pham, A\. Ravula, Q\. Wang, L\. Yang, and A\. Ahmed\(2020\)Big bird: transformers for longer sequences\.InAdvances in Neural Information Processing Systems,Cited by:[Table 8](https://arxiv.org/html/2606.27748#A3.T8.30.30.30.7),[§1](https://arxiv.org/html/2606.27748#S1.p3.1),[Table 1](https://arxiv.org/html/2606.27748#S4.T1.2.2.8.1)\.
- \[42\]M\. Zhang, K\. Bhatia, H\. Kumbong, and C\. Re\(2024\)The hedgehog & the porcupine: expressive linear attentions with softmax mimicry\.InInternational Conference on Learning Representations,Cited by:[§A\.3](https://arxiv.org/html/2606.27748#A1.SS3.SSS0.Px2.p1.2),[§B\.1](https://arxiv.org/html/2606.27748#A2.SS1.p1.7),[§B\.2](https://arxiv.org/html/2606.27748#A2.SS2.p1.8),[Table 8](https://arxiv.org/html/2606.27748#A3.T8.60.60.60.7),[§1](https://arxiv.org/html/2606.27748#S1.p4.1),[§1](https://arxiv.org/html/2606.27748#S1.p5.1),[§3\.1](https://arxiv.org/html/2606.27748#S3.SS1.p4.2),[§4\.3](https://arxiv.org/html/2606.27748#S4.SS3.p1.1),[Table 1](https://arxiv.org/html/2606.27748#S4.T1.2.2.13.1),[§5](https://arxiv.org/html/2606.27748#S5.p1.1)\.
- \[43\]H\. Zhou, S\. Zhang, J\. Peng, S\. Zhang, J\. Li, H\. Xiong, and W\. Zhang\(2021\)Informer: beyond efficient transformer for long sequence time\-series forecasting\.InAAAI Conference on Artificial Intelligence,Cited by:[§1](https://arxiv.org/html/2606.27748#S1.p1.1)\.
## Appendix AExperimental Details
### A\.1Introduction of Datasets
Statistics on the sizes of all datasets used in our experiments are provided in[Table˜5](https://arxiv.org/html/2606.27748#A1.T5)\. Next, we provide a brief overview of these datasets\.
Table 5:Statistics of the datasets used in our experiments\.DatasetTrainValidTestListOps96K2K2KIMDB25K–25KAAN147K18K17KCIFAR\-1045K5K10KPathfinder160K20K20KWikiText\-103103M218K246KCoLA8\.5K–1KSST\-267K–1\.8KMRPC3\.7K–1\.7KSTS\-B7K–1\.4KQQP364K–391KMNLI393K–20KQNLI105K–5\.4KRTE2\.5K–3KWNLI634K–146#### A\.1\.1Long Range Arena
Next, we provide a detailed introduction to the datasets of the five tasks in the LRA\.
##### Long ListOps
A 10\-class classification task based on a longer variant of the synthetic ListOps dataset\[[24](https://arxiv.org/html/2606.27748#bib.bib37)\]\. The sequences with a max length of 2K consist of nested expressions using operators like MAX, MIN, and MEDIAN, and models must predict the final result \(an integer from 0 to 9\), testing their ability to model hierarchical structure over long contexts\.
##### Byte\-level Text Classification
A binary classification task using IMDb movie reviews\[[21](https://arxiv.org/html/2606.27748#bib.bib38)\]processed as raw byte sequences with fixed length of 4K\. The goal is to classify each review as positive or negative sentiment without word segmentation or pretraining\.
##### Byte\-level Document Retrieval
A binary classification task based on the ACL Anthology Network \(AAN\)\[[30](https://arxiv.org/html/2606.27748#bib.bib39)\]dataset, where models determine whether two 4K\-byte academic documents share a citation link, using a two\-tower encoding architecture\.
##### Image Classification on Sequences of Pixels
A 10\-class classification task using grayscale CIFAR\-10 images\[[16](https://arxiv.org/html/2606.27748#bib.bib40)\], flattened into 1,024\-pixel sequences\. Models classify each image into one of ten object categories using only sequential pixel values\.
##### Pathfinder
A binary classification task derived from the Pathfinder dataset\[[12](https://arxiv.org/html/2606.27748#bib.bib41)\]introduced in cognitive psychology and adapted for machine learning\. It uses 32×32 synthetic images \(1,024 pixels\) containing two marked points and dashed paths; the model must determine whether the points are connected by a continuous path, evaluating its capacity for long\-range spatial reasoning\.
#### A\.1\.2Wikitext\-103
WikiText\-103\[[23](https://arxiv.org/html/2606.27748#bib.bib14)\]is a large\-scale language modeling benchmark introduced in this paper to address the limitations of smaller datasets like Penn Treebank\. It is constructed from 28,475 high\-quality English Wikipedia articles \(including 23,805 "Good" and 4,790 "Featured" articles\), resulting in a training set of over 103 million words\. The dataset preserves original casing, punctuation, and named entities, making it more realistic for real\-world applications\. After preprocessing using the Moses tokenizer and replacing words occurring fewer than 3 times with <unk\>, it has a vocabulary size of 267,735 and a very low out\-of\-vocabulary rate of0\.4%0\.4\\%\. Unlike shuffled corpora, WikiText\-103 maintains full article structure, enabling the evaluation of models on long\-range dependencies and rare word prediction\.
#### A\.1\.3GLUE
We introduce below the eight datasets in GLUE corresponding to their respective tasks\.
##### CoLA \(Corpus of Linguistic Acceptability\)
The task is to judge whether a sentence is grammatically acceptable in English\. The data is drawn from books and journal articles on linguistic theory\. Labels are binary: acceptable or unacceptable\.
##### SST\-2 \(Stanford Sentiment Treebank\)
The task is binary sentiment classification of sentences from movie reviews\. The data comes from movie reviews\. Labels are positive or negative sentiment\.
##### MRPC \(Microsoft Research Paraphrase Corpus\)
The task is to determine if two sentences are semantically equivalent \(paraphrases\)\. Sentence pairs are automatically extracted from online news sources\. Labels are binary: paraphrase or not paraphrase\.
##### STS\-B \(Semantic Textual Similarity Benchmark\)
The task is to predict the degree of semantic similarity between two sentences\. The data is drawn from news headlines, video/image captions, and natural language inference datasets\. Labels are continuous similarity scores ranging from 1 to 5 \(a regression task\)\.
##### QQP \(Quora Question Pairs\)
The task is to determine whether two questions from the Quora community QA platform are semantically equivalent \(duplicates\)\. Labels are binary: duplicate or not duplicate\.
##### MNLI \(Multi\-Genre Natural Language Inference\)
The task is to classify the relationship between a premise and a hypothesis sentence as entailment, contradiction, or neutral\. Premise sentences are gathered from ten different sources, including transcribed speech, fiction, and government reports\. Labels are three\-way: entailment, contradiction, or neutral\.
##### QNLI \(Question Natural Language Inference\)
This task is derived from the SQuAD question\-answering dataset\. It is recast as an NLI problem to determine if a sentence from a Wikipedia paragraph contains the answer to a given question\. Labels are binary: entailment \(contains answer\) or not entailment\.
##### RTE \(Recognizing Textual Entailment\)
The task is to recognize textual entailment in a binary setting\. The data combines examples from the RTE1, RTE2, RTE3, and RTE5 challenges, based on news and Wikipedia text\. Labels are binary: entailment or not entailment\.
### A\.2Setup
All of our experiments are implemented using PyTorch 2\.9\.1\[[27](https://arxiv.org/html/2606.27748#bib.bib42)\]and conducted on a single Nvidia RTX5090 GPU with 32GB memory\. We use the Adam\[[14](https://arxiv.org/html/2606.27748#bib.bib48)\]optimizer for all experiments\.
#### A\.2\.1Autoregressive Language Modeling
The detailed hyperparameter configurations for the autoregressive language modeling experiments are listed in[Table˜6](https://arxiv.org/html/2606.27748#A1.T6)\. Total number of parameters refers to the Transformer architecture without any additional parameters introduced\. Methods with learnable feature maps have a total number of parameters slightly larger than this value\. Their specific amounts of additional parameters are shown in[Table˜7](https://arxiv.org/html/2606.27748#A1.T7)\. As discussed in[Appendix˜B](https://arxiv.org/html/2606.27748#A2), the additional parameters introduced by Flexformer are minimal, all less than1%1\\%\. The additional parameters ofFlexformern\\text\{Flexformer\}\_\{\\text\{n\}\}are almost exactly twice those of Hedgehog, whereasFlexformers\\text\{Flexformer\}\_\{\\text\{s\}\}introduces nearly the same amount of extra parameters as Hedgehog\. Polaformer, in contrast, incurs a larger number of additional parameters because it applies a linear projection after concatenating all heads, rather than using separate projections for each head\. In large language models, a greater number of parameters often leads to better performance\. Our results show that Flexformer achieves a higher performance gain per additional parameter compared to other models\.
Table 6:Detailed hyperparameter settings for autoregressive language modeling experiments\.HyperparameterSmallBigEmbedding Size5121024Layers616Heads816Head Size6464FFN Size20484096Batch Size128128Learning Rate5×10−45\\times 10^\{\-4\}5×10−45\\times 10^\{\-4\}Total Steps100,000100,000Warmup Steps4,0004,000Dropout0\.10\.2Gradient Clipping Norm0\.10\.1Total Number of Parameters41M41\\,\\text\{M\}247M247\\,\\text\{M\}Table 7:Comparison of parameter counts between methods with learnable feature maps and the Vanilla Transformer\. “Relative Parameter Increase” denotes the percentage increase in model parameters compared to a Vanilla Transformer under the same configuration\.SmallBigModelTotal ParamsRel\. Increase \(%\)Total ParamsRel\. Increase \(%\)Transformer41M0\.00247M0\.00Hedgehog41M0\.49248M0\.43Polaformer43M3\.83262M6\.74Flexformers\\text\{Flexformer\}\_\{\\text\{s\}\}41M0\.49248M0\.43Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}42M0\.97249M0\.86
### A\.3Recovering Softmax Attention and Transferability of Learned Kernel
For each dataset of GLUE, we use the preprocessing script from the Fairseq repository111[https://github\.com/facebookresearch/fairseq/blob/main/examples/roberta/preprocess\_GLUE\_tasks\.sh](https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/preprocess_GLUE_tasks.sh)to further split the training set into training and validation subsets\. The downloaded pre\-trained 125M RoBERTa model uses a GPT\-style BPE tokenizer with a vocabulary size of approximately 50K, an embedding dimension of 512, 12 encoder layers, and 8 attention heads\. All methods use identical settings during finetuning, and both Hedgehog and Flexformer employ the same configuration for distillation\.
##### Finetuning Setting\.
The fine\-tuning configurations are also taken from the Fairseq repository222[https://github\.com/facebookresearch/fairseq/tree/main/examples/roberta/config/finetuning](https://github.com/facebookresearch/fairseq/tree/main/examples/roberta/config/finetuning): each task uses a batch size of 16 or 32, a learning rate of1×10−51\\times 10^\{\-5\}or2×10−52\\times 10^\{\-5\}, and is trained for 10 epochs, with the checkpoint achieving the best validation performance selected for testing\.
##### Attention Distillation Setting\.
We use the attention weight distillation loss proposed in\[[42](https://arxiv.org/html/2606.27748#bib.bib9)\]:
ℒawd=−1N∑i=1N∑j=1Nexp\(Qi⊤Kjd\)∑l=1Nexp\(Qi⊤Kld\)logϕ~n\(Qi\)⊤ϕ~n\(Kj\)∑l=1Nϕ~n\(Qi\)⊤ϕ~n\(Kl\)\.\\mathcal\{L\}\_\{awd\}=\-\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\sum\_\{j=1\}^\{N\}\\frac\{\\exp\(\\frac\{\\text\{Q\}\_\{i\}^\{\\top\}\\text\{K\}\_\{j\}\}\{\\sqrt\{d\}\}\)\}\{\\sum\_\{l=1\}^\{N\}\\exp\(\\frac\{\\text\{Q\}\_\{i\}^\{\\top\}\\text\{K\}\_\{l\}\}\{\\sqrt\{d\}\}\)\}\\log\\frac\{\\tilde\{\\phi\}\_\{n\}\(\\text\{Q\}\_\{i\}\)^\{\\top\}\\tilde\{\\phi\}\_\{n\}\(\\text\{K\}\_\{j\}\)\}\{\\sum\_\{l=1\}^\{N\}\\tilde\{\\phi\}\_\{n\}\(\\text\{Q\}\_\{i\}\)^\{\\top\}\\tilde\{\\phi\}\_\{n\}\(\\text\{K\}\_\{l\}\)\}\.It computes the cross\-entropy between the distribution obtained by normalizing each query with our kernel and the target distribution from softmax attention, then averages this loss over all query positions\. During attention weight distillation, we freeze all parameters of RoBERTa, replace its attention computation with our learnable kernel, and train only the kernel parameters for 20,000 steps on a dataset with a batch size of 32 and a learning rate of1×10−31\\times 10^\{\-3\}\. For the experiments in[Section˜4\.3](https://arxiv.org/html/2606.27748#S4.SS3), we perform attention weight distillation followed by fine\-tuning directly on the target dataset\. In contrast, for[Section˜4\.4](https://arxiv.org/html/2606.27748#S4.SS4), we first conduct attention weight distillation on a different dataset and then finetune the model on the target dataset\. We select STS\-B, QQP, MNLI, and QNLI as distillation datasets because they span multiple domains and are representative of diverse tasks\.
## Appendix BComplexity Analysis
### B\.1Time and Space Complexity
As discussed in[Section˜2\.2](https://arxiv.org/html/2606.27748#S2.SS2), the overall time and space complexity of linear attention is𝒪\(Ndd′\)\\mathcal\{O\}\(Ndd^\{\\prime\}\)and𝒪\(Nd\+Nd′\+dd′\)\\mathcal\{O\}\(Nd\+Nd^\{\\prime\}\+dd^\{\\prime\}\)respectively\. For Flexformer,d′=2nd^\{\\prime\}=2n, wherenndenotes the number of frequencies\. Generally, largerd′d^\{\\prime\}enhances the expressive power of the learned kernel but increases both parameter count and complexity\. A trade\-off between expressiveness and efficiency is therefore necessary to determine the optimal hyperparameterd′d^\{\\prime\}\. In subsequent comparative experiments, to ensure maximal fairness, we setn=d,d′=2dn=d,d^\{\\prime\}=2dfollowing prior work\[[28](https://arxiv.org/html/2606.27748#bib.bib21),[42](https://arxiv.org/html/2606.27748#bib.bib9)\]\.
### B\.2Additional Parameters
Compared to early linear attention methods\[[13](https://arxiv.org/html/2606.27748#bib.bib26),[7](https://arxiv.org/html/2606.27748#bib.bib25)\], more recent studies\[[42](https://arxiv.org/html/2606.27748#bib.bib9),[22](https://arxiv.org/html/2606.27748#bib.bib7)\]incorporate additional learnable parameters to enhance the expressive power of kernel\-based linear attention\. Flexformer follows the same line\. Lethhdenote the number of heads\. Assumingn=dn=d, Flexformer requires approximatelyd2d^\{2\}extra parameters per head for stationary variant \(2d22d^\{2\}for nonstationary\)\. To maximize expressiveness, we assume these parameters independent across all heads\. In this configuration, Flexformer requires approximately112h\\frac\{1\}\{12h\}extra parameters per encoder/decoder layer for stationary variant \(16h\\frac\{1\}\{6h\}for nonstationary\), which accounts for only about1%1\\%of the total parameters whenh=8h=8\.
## Appendix CError Bar
All our main experiments were repeated under three different random seeds, and we report the mean results in the main text\. In this section, we additionally report the standard deviations in[Table˜8](https://arxiv.org/html/2606.27748#A3.T8),[Table˜9](https://arxiv.org/html/2606.27748#A3.T9)and[Table˜10](https://arxiv.org/html/2606.27748#A3.T10)\.
Table 8:Mean and standard deviation of performance metrics on the LRA benchmarks\. All reported metrics represent test accuracy \(%\), where higher values indicate better performance\. The best results for each benchmark are inbold, and the second best inunderlined\.ModelListOpsTextRetrievalImagePathfinderAverageTransformer\[[38](https://arxiv.org/html/2606.27748#bib.bib4)\]37\.23±\\pm0\.1765\.66±\\pm0\.5864\.49±\\pm0\.6141\.89±\\pm0\.2973\.46±\\pm0\.4356\.55±\\pm0\.21Reformer\[[15](https://arxiv.org/html/2606.27748#bib.bib45)\]36\.62±\\pm0\.1863\.52±\\pm0\.5657\.60±\\pm0\.5840\.06±\\pm0\.3073\.61±\\pm0\.4654\.28±\\pm0\.22Longformer\[[3](https://arxiv.org/html/2606.27748#bib.bib44)\]36\.46±\\pm0\.2363\.09±\\pm0\.5659\.24±\\pm0\.6041\.23±\\pm0\.3072\.44±\\pm0\.4254\.49±\\pm0\.21Linear Trans\.\[[13](https://arxiv.org/html/2606.27748#bib.bib26)\]17\.42±\\pm0\.1262\.64±\\pm0\.5458\.92±\\pm0\.5843\.40±\\pm0\.3275\.11±\\pm0\.4451\.49±\\pm0\.20BigBird\[[41](https://arxiv.org/html/2606.27748#bib.bib46)\]36\.88±\\pm0\.1665\.34±\\pm0\.6064\.23±\\pm0\.6041\.93±\\pm0\.3173\.58±\\pm0\.4556\.39±\\pm0\.23Performer\[[7](https://arxiv.org/html/2606.27748#bib.bib25)\]16\.97±\\pm0\.1463\.67±\\pm0\.5755\.17±\\pm0\.5641\.23±\\pm0\.2975\.38±\\pm0\.4450\.48±\\pm0\.20RFA\[[28](https://arxiv.org/html/2606.27748#bib.bib21)\]17\.20±\\pm0\.1364\.89±\\pm0\.5858\.76±\\pm0\.6141\.00±\\pm0\.3074\.51±\\pm0\.4851\.27±\\pm0\.21Skyformer\[[6](https://arxiv.org/html/2606.27748#bib.bib43)\]36\.71±\\pm0\.1665\.51±\\pm0\.5968\.05±\\pm0\.6341\.36±\\pm0\.3174\.18±\\pm0\.4557\.16±\\pm0\.23Cosformer\[[29](https://arxiv.org/html/2606.27748#bib.bib10)\]37\.44±\\pm0\.2165\.16±\\pm0\.5764\.04±\\pm0\.6143\.56±\\pm0\.3072\.32±\\pm0\.4256\.50±\\pm0\.22Hedgehog\[[42](https://arxiv.org/html/2606.27748#bib.bib9)\]37\.02±\\pm0\.1864\.10±\\pm0\.5868\.67±\\pm0\.6341\.64±\\pm0\.3275\.89±\\pm0\.4857\.47±\\pm0\.23Polaformer\[[22](https://arxiv.org/html/2606.27748#bib.bib7)\]36\.93±\\pm0\.1769\.65±\\pm0\.6165\.39±\\pm0\.6240\.95±\\pm0\.2973\.26±\\pm0\.4457\.23±\\pm0\.22Flexformers\\text\{Flexformer\}\_\{\\text\{s\}\}\(Ours\)37\.09±\\pm0\.1867\.38±\\pm0\.6071\.55±\\pm0\.6243\.39±\\pm0\.2976\.15±\\pm0\.4859\.11±\\pm0\.24Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}\(Ours\)37\.18±\\pm0\.1865\.72±\\pm0\.5973\.94±\\pm0\.6645\.42±\\pm0\.3277\.66±\\pm0\.4859\.99±\\pm0\.25
Table 9:Mean and standard deviation of PPL on the WikiText\-103 development and test sets\. Lower PPL indicates better performance\. The best results for each benchmark are inbold, and the second best inunderlined\.Small\(41M41\\text\{M\}Params\)Big\(247M247\\text\{M\}Params\)ModelDev\.TestDev\.TestVanilla Trans\.30\.0±\\pm0\.0631\.3±\\pm0\.0722\.6±\\pm0\.0323\.8±\\pm0\.06Linear Trans\.34\.1±\\pm0\.1235\.6±\\pm0\.1027\.9±\\pm0\.0829\.2±\\pm0\.08RFA33\.8±\\pm0\.1135\.3±\\pm0\.1126\.7±\\pm0\.0928\.0±\\pm0\.10Cosformer32\.8±\\pm0\.0934\.2±\\pm0\.1124\.9±\\pm0\.0526\.2±\\pm0\.06Hedgehog31\.7±\\pm0\.0833\.2±\\pm0\.0523\.0±\\pm0\.0624\.2±\\pm0\.07Polaformer31\.5±\\pm0\.0832\.9±\\pm0\.0823\.1±\\pm0\.0324\.3±\\pm0\.02Flexformers\\text\{Flexformer\}\_\{\\text\{s\}\}31\.1±\\pm0\.0632\.7±\\pm0\.0822\.2±\\pm0\.0223\.4±\\pm0\.04Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}30\.3±\\pm0\.0631\.6±\\pm0\.0721\.2±\\pm0\.0322\.3±\\pm0\.04
Table 10:Mean and standard deviation of performance metrics on GLUE\. On CoLA, the Matthews correlation coefficient is reported; on STS\-B, the average of Pearson and Spearman correlation coefficients is reported; for all other tasks, accuracy is reported\. Higher values indicate better performance for all metrics\. The best results of each task are inbold, and the second best inunderlined\.ModelCoLASST\-2MRPCSTS\-BQQPMNLIQNLIRTEAveragePerformer22\.8±\\pm1\.2188\.9±\\pm0\.3266\.3±\\pm0\.6938\.8±\\pm0\.4584\.4±\\pm0\.4077\.5±\\pm0\.3076\.3±\\pm0\.5351\.6±\\pm0\.2663\.3±\\pm0\.31RFA27\.6±\\pm0\.9290\.1±\\pm0\.2964\.2±\\pm0\.7141\.2±\\pm0\.5286\.3±\\pm0\.3780\.8±\\pm0\.2975\.8±\\pm0\.4553\.0±\\pm0\.3364\.9±\\pm0\.29Hedgehog62\.5±\\pm0\.3293\.3±\\pm0\.1186\.0±\\pm0\.1287\.7±\\pm0\.1891\.2±\\pm0\.1083\.7±\\pm0\.1691\.2±\\pm0\.1471\.0±\\pm0\.2283\.3±\\pm0\.11Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}63\.3±\\pm0\.2893\.6±\\pm0\.1285\.8±\\pm0\.1288\.3±\\pm0\.1691\.1±\\pm0\.1385\.1±\\pm0\.1691\.0±\\pm0\.1571\.5±\\pm0\.2083\.7±\\pm0\.10
## Appendix DAdditional Results
In this section, we provide additional experimental results\.
### D\.1Hyperparameter Sensitivity
In all main experiments of this paper, to ensure a fair comparison, we setn=d,d′=2dn=d,d^\{\\prime\}=2d\. However, as shown by our analysis in Section 3, a largernncan fit more complex kernel functions, but it also incurs higher computational cost\. To guide the practical choice of an appropriatenn, we investigated how model performance and computational cost vary withnnon WikiText\-103 and on the Image and ListOps tasks in LRA\.
The results in[Figure˜3](https://arxiv.org/html/2606.27748#A4.F3),[Figure˜4](https://arxiv.org/html/2606.27748#A4.F4)and[Figure˜5](https://arxiv.org/html/2606.27748#A4.F5)show that asnnincreases, the computational cost grows accordingly\. On larger\-scale models and datasets, increasingnnconsistently improves generalization, while on smaller\-scale benchmarks such as LRA, performance saturates beyond a certain point\. In the main experiments, we setn=dn=dto keep the feature dimension consistent with prior work, under which Flexformer already achieves notable performance gains\. Moreover, the additional results show that even with a moderately smallernn, Flexformer still outperforms the baselines while maintaining better efficiency\.
Figure 3:Performance, memory consumption, and training speed ofFlexformern\\text\{Flexformer\}\_\{\\text\{n\}\}in the autoregressive language modeling experiments under different values of the sampling frequency hyperparameternn\. On large\-scale models and datasets, increasingnnimproves both the fitting and generalization abilities of the model, while also incurring higher computational cost\. In the main experiments, we setn=64n=64\.Figure 4:Performance, memory consumption, and training speed ofFlexformern\\text\{Flexformer\}\_\{\\text\{n\}\}on the Image dataset for Long Sequence Classification under different values of the sampling frequency hyperparameternn\. Both memory usage and training time increase with largernn\. On the Image dataset, increasingnnimproves the model’s generalization whennnis small; however, asnnbecomes larger, it increases computational cost and may even lead to overfitting\. In the main experiments, we setn=16n=16\.Figure 5:Performance, memory consumption, and training speed ofFlexformern\\text\{Flexformer\}\_\{\\text\{n\}\}on the ListOps dataset for Long Sequence Classification under different values of the sampling frequency hyperparameternn\. Asnnincreases, both memory usage and training time also increase\. On ListOps, competitive performance can be achieved even with a very smallnn\. In the main experiments, we usen=64n=64\.
### D\.2Recovering Softmax Attention
In[Figure˜6](https://arxiv.org/html/2606.27748#A4.F6), we selectively compare the loss curves of attention weight distillation using Hedgehog and Flexformer on the MNLI and QNLI datasets\. It can be observed that Flexformer consistently achieves lower cross\-entropy loss on the validation set compared to Hedgehog\. Moreover, Flexformer converges slightly faster than Hedgehog: its validation loss stabilizes around 8,000 training steps, whereas Hedgehog does not stabilize until nearly 12,000 steps\. This further demonstrates that the learnable kernel in Flexformer is more flexible\.
\(a\)Distillation on MNLI\.
\(b\)Distillation on QNLI\.
Figure 6:Comparison of loss curves during attention weight distillation training\.
### D\.3Transfer Performance
In[Table˜11](https://arxiv.org/html/2606.27748#A4.T11), we fully present the results of transferring the model—after performing attention weight distillation on a specific dataset—to other datasets\. Flexformer achieves better performance than Hedgehog in almost all cases\. Among them, the relative performance improvement reaches up to13%13\\%\. Moreover, it can also be observed that Hedgehog’s performance is relatively unstable when distillation is performed on different datasets\.
Table 11:Comparison of performance when learning the kernel on one dataset and transferring it to other datasets\. The best results of each situation are inbold\.Source/TargetModelCoLASST\-2MRPCSTS\-BQQPMNLIQNLIRTESTS\-BHedgehog58\.092\.377\.2–89\.682\.087\.855\.2Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}63\.593\.280\.2–90\.784\.789\.856\.7QQPHedgehog55\.293\.375\.285\.1–83\.387\.750\.7Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}58\.893\.379\.084\.8–84\.889\.153\.5MNLIHedgehog57\.792\.876\.585\.590\.1–86\.952\.6Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}60\.693\.180\.486\.790\.4–89\.554\.9QNLIHedgehog51\.691\.876\.285\.189\.082\.8–53\.8Flexformern\\text\{Flexformer\}\_\{\\text\{n\}\}58\.393\.278\.486\.490\.684\.8–59\.4
## Appendix ELimitations and Future Work
Theoretical analysis in this paper assumes the learned kernel is positive definite, as are the softmax and Gaussian kernels\. Empirically, positive definite kernels appear to induce desirable attention behaviors and avoid pathological patterns \(e\.g\., weak self\-attention but strong incoming attention\)\. However, whether positive definite kernels are fundamentally optimal for attention mechanisms remains unexplored\. We hope future work will provide a deeper understanding of this issue\.
Our experiments are primarily conducted on tasks and datasets related to natural language processing\. In fact, there exist numerous other scenarios involving long sequences that require attention mechanisms—for instance, processing high\-resolution images or extremely long protein sequences—each of which may demand substantially different attention patterns for effective modeling\. Although the LRA benchmark does provide a valuable testbed for evaluating the ability to capture long\-range dependencies, thereby complementing the limitation of natural language data \(where dependencies are often predominantly local\), it remains insufficient to fully reflect the diverse range of attention patterns required across different modalities and domains\. We plan to validate the effectiveness of our approach across a broader range of domains in future work\.Similar Articles
Exact Linear Attention
This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention without approximation error by leveraging kernel decomposition, and addresses gradient explosion and token dilution through constrained kernel functions. It also presents engineering innovations including Hyper Link, Memory Lobe, and a routing bias for Mixture of Experts.
Dynamic Linear Attention
This paper proposes DLA, a dynamic memory modeling framework for multi-state linear attention that adaptively merges states based on token information variation and maintains a fixed-size state cache, enabling better long-context representation without the quadratic complexity of standard attention.
@Phoenixyin13: I think this is a top-notch work in ICML 2026. The attention mechanism of traditional Transformers is essentially point-to-point matching: it cuts input into a bunch of tokens (discrete points), computes similarity between Query and Key, and then weights the Value. In NLP...
Introduces the ICML 2026 paper Functional Attention, which treats functions as first-class citizens and replaces softmax point-to-point similarity with structured linear operators. It addresses issues of discretization, resolution sensitivity, and high computational complexity in traditional Transformers when handling continuous functions. Achieves or surpasses SOTA in tasks like PDE solving and 3D segmentation, and exhibits strong OOD generalization.
Learning the Koopman Operator using Attention Free Transformers
This paper introduces attention-free latent memory and dynamic re-encoding to improve long-horizon predictions in Koopman autoencoders, reducing error accumulation on benchmark dynamical systems.
Functional Attention: From Pairwise Affinities to Functional Correspondences
Functional Attention is a novel attention mechanism that reinterprets attention as a functional correspondence between adaptive bases, replacing softmax affinities with structured linear operators inspired by geometric functional maps. The method achieves state-of-the-art performance on operator learning tasks including PDE solving and 3D segmentation while remaining resolution-invariant.