Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

arXiv cs.LG 05/27/26, 04:00 AM Papers
transformer attention positional-encoding wavelet energy-gated inductive-bias spectral-methods
Summary
This paper proposes Energy-Gated Attention (EGA) and Morlet Positional Encoding (MoPE) to address missing inductive biases in transformer attention: token salience and scale-adaptive locality. Experiments on TinyShakespeare show superadditive gains when combined, highlighting complementarity.
arXiv:2605.26355v1 Announce Type: new Abstract: Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: energy salience (which tokens concentrate informational energy, learned end-to-end without explicit frequency decomposition) and scale-selective locality (how far positional influence extends at each frequency, implemented via Morlet wavelet encoding). We address both with two simple components. Energy-Gated Attention (EGA) gates value aggregation by a learned energy estimate of key token embeddings, computed via a single linear projection; it selects what to attend to. Morlet Positional Encoding (MoPE) replaces fixed sinusoidal encodings with learned Gaussian-windowed wavelets that adapt the joint position-frequency localization to the corpus; it specifies where attention operates at each scale. On TinyShakespeare, EGA alone achieves +0.092 validation loss improvement over standard attention (+0.103 over Phase 1-3 baseline); MoPE alone is -0.032 (below baseline as a standalone encoding); but their combination achieves +0.119 -- more than the sum of parts. This superadditivity, observed across two independent training runs, is the central empirical finding: salience and locality are complementary inductive biases, each addressing a gap the other cannot fill alone. Ablations confirm that structured spectral priors (Morlet wavelet gates, scale-initialized heads, fixed sinusoidal PE) consistently underperform their unconstrained learned counterparts, while complementary learned components interact superadditively. All experiments are at small scale (<=6M parameters, character-level benchmarks, single seed); larger-scale multi-seed validation is the most important direction for future work.
Cached at: 05/27/26, 09:09 AM
# Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention
Source: [https://arxiv.org/html/2605.26355](https://arxiv.org/html/2605.26355)
Athanasios Zeris ([ORCID 0009-0002-6907-2400](https://orcid.org/0009-0002-6907-2400)), Independent Researcher, Athens, Greece. Correspondence: athzeris@gmail.com. Part of a five-paper series on spectral methods in transformer attention.

###### Abstract

Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: *energy salience* (which tokens concentrate informational energy, learned end-to-end without explicit frequency decomposition) and *scale-selective locality* (how far positional influence extends at each frequency, implemented via Morlet wavelet encoding). We address both with two simple components. Energy-Gated Attention (EGA) gates value aggregation by a learned energy estimate of key token embeddings, computed via a single linear projection; it selects *what* to attend to. Morlet Positional Encoding (MoPE) replaces fixed sinusoidal encodings with learned Gaussian-windowed wavelets that adapt the joint position-frequency localization to the corpus; it specifies *where* attention operates at each scale. On TinyShakespeare, EGA alone achieves a $+0.092$ validation loss improvement over standard attention ($+0.103$ over the Phase 1–3 baseline); MoPE alone is $-0.032$ (below baseline as a standalone encoding); but their combination achieves $+0.119$, more than the sum of parts. This superadditivity, observed consistently across two independent training runs, is the central empirical finding: salience and locality are *complementary* inductive biases, each addressing a gap the other cannot fill alone. Ablations confirm that structured spectral priors (Morlet wavelet gates, scale-initialized heads, fixed sinusoidal PE) consistently underperform their unconstrained learned counterparts, while complementary learned components interact superadditively. All experiments are at small scale ($\leq 6$M parameters, character-level benchmarks, single seed); larger-scale multi-seed validation is the most important direction for future work.

## 1 Introduction

The transformer (Vaswani et al., [2017](https://arxiv.org/html/2605.26355#bib.bib20)) computes attention weights from query-key similarity alone. This is powerful but structurally incomplete in two ways: it does not model which tokens are intrinsically informative (spectral salience), and it does not adapt how far positional influence extends at each scale (locality). We propose two components that address each gap, and show empirically that they are complementary.

#### Attention lacks salience\.

Dot-product attention weights tokens by content similarity to the current query, but not by their intrinsic informational density. A token at a morphological boundary, syntactic head, or discourse marker carries disproportionate information regardless of what the query is asking. Standard attention has no mechanism to detect or exploit this property. EGA (Zeris, [2026](https://arxiv.org/html/2605.26355#bib.bib1)) addresses this by gating value aggregation with a learned energy estimate: a scalar that is high for informationally dense tokens and low for background tokens such as function words, repeated patterns, and filler. The gate is a single learned linear projection costing $<0.3\%$ parameter overhead.

#### Positional encoding lacks adaptive locality\.

Standard sinusoidal PE (Vaswani et al., [2017](https://arxiv.org/html/2605.26355#bib.bib20)) assigns each embedding dimension a fixed frequency with no spatial envelope: every position contributes equally at every scale, regardless of context length or the natural scale of the linguistic phenomenon being encoded. RoPE (Su et al., [2021](https://arxiv.org/html/2605.26355#bib.bib18)) encodes relative rather than absolute position, but still uses fixed frequencies without Gaussian locality. MoPE addresses this by replacing the fixed sinusoidal basis with learned Gaussian-windowed wavelets. Each embedding dimension learns its own center frequency $\omega_i$ and locality bandwidth $\sigma_i$, providing adaptive time-frequency localization.

#### The complementarity hypothesis\.

EGA controls *what* to attend to (salience); MoPE controls *where* attention is sensitive at each scale (locality). We hypothesize that these are orthogonal properties of attention, in the sense that neither can substitute for the other, and that their combination provides a more complete attention mechanism than either alone.

#### Main result\.

The combination EGA-MORLET achieves val = 1.3550 on TinyShakespeare, a $+0.119$ improvement over standard attention. This exceeds the sum of components ($+0.092 - 0.032 = +0.060$) by $+0.059$, consistent with complementarity rather than simple additivity. The result is observed in two independent training runs with different seeds, providing preliminary evidence for robustness.

#### Supporting findings\.

Five further experiments test predictions of the spectral filtering interpretation of attention: convolution attention (nonzero lags improve over zero-lag dot product, $+0.007$); scale-initialized heads (no benefit, $-0.007$, a negative result showing gradient descent discovers scale structure without guidance); spectral flux gating ($+0.012$, suggesting boundary detection as a useful attention signal); phase coherence gating ($-0.007$, suggesting phase is not informative at character scale); and a spectral cascade analysis showing qualitative coarsening of spectral content across layers. Code available at: https://github.com/AthanasiosZeris/energy-gated-attention.

#### Contributions\.

1. A cross-correlation interpretation of dot-product attention: $q_i \cdot k_j = C_{ij}(0)$, establishing that standard attention is the zero-lag value of a richer spectral relationship.
2. MoPE: a localized wavelet positional encoding that strictly generalizes sin/cos PE ($\sigma_i \to \infty$ recovers sin/cos) and provides a theoretical connection to RoPE (phase structure in the $\sigma \to \infty$ limit) and ALiBi (zero-frequency locality limit).
3. Empirical demonstration that EGA and MoPE are complementary inductive biases whose combination is superadditive in a controlled experiment.
4. A structured ablation showing which spectral priors help vs fail, with interpretable explanations for each outcome.

## 2 Method

### 2.1 Interpreting Attention as Cross-Correlation

Standard scaled dot\-product attention computes:

$$e_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} = \frac{C_{ij}(0)}{\sqrt{d_k}} \tag{1}$$

where $C_{ij}(\tau) = \sum_{d} q_i[d] \cdot k_j[d+\tau]$ is the cross-correlation at lag $\tau$. Up to the $1/\sqrt{d_k}$ scaling, standard attention is the *zero-lag* value of this cross-correlation, discarding the full lag profile $\{C_{ij}(\tau) : \tau \neq 0\}$.
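As a minimal sketch of the definition above, the following pure-Python snippet evaluates $C_{ij}(\tau)$ for a single query/key pair and checks that the zero-lag value equals the plain dot product. The function names and the zero-padding of out-of-range indices are assumptions made for illustration; the paper does not specify boundary handling.

```python
def cross_correlation(q, k, lag):
    """C(lag) = sum_d q[d] * k[d + lag]; out-of-range terms are treated as zero (assumption)."""
    total = 0.0
    for d in range(len(q)):
        j = d + lag
        if 0 <= j < len(k):
            total += q[d] * k[j]
    return total


def lag_profile(q, k, max_lag):
    """Return {lag: C(lag)} for every lag in [-max_lag, max_lag]."""
    return {lag: cross_correlation(q, k, lag) for lag in range(-max_lag, max_lag + 1)}


# Toy query/key vectors (made-up values).
q = [1.0, 2.0, 0.0, -1.0]
k = [0.5, 1.0, 2.0, 0.0]

dot = sum(a * b for a, b in zip(q, k))
profile = lag_profile(q, k, max_lag=2)

print(dot)         # 2.5
print(profile[0])  # 2.5, the zero-lag value is the dot product
print(profile[1])  # 5.0, a nonzero lag that the dot product never sees
```

In this toy example the largest correlation sits at lag $+1$, which is exactly the kind of structure the zero-lag score discards.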

We adopt the operational spectral interpretation of Verma & Pilanci ([2024](https://arxiv.org/html/2605.26355#bib.bib21)): each coordinate of the embedding dimension across token positions defines a 1-D causal signal of length $T$. All spectral quantities are finite-window operational estimates applied to nonstationary learned embeddings; they should be read as approximations rather than exact spectral theorems.

#### What zero\-lag discards\.

Three quantities are lost by collapsing to zero lag:

*Scale selectivity.* The full cross-spectral density $S_{ij}(\omega) = Q_i^{*}(\omega) K_j(\omega)$ shows which frequencies contribute to the similarity. The dot product integrates over all frequencies equally.

*Lag structure.* $C_{ij}(\tau)$ for $\tau \neq 0$ measures how query and key signals relate with temporal offsets. Positive lags ($\tau > 0$): key leads query (anticipatory structure). Negative lags ($\tau < 0$): query leads key (retrospective reference, anaphora).

*Spectral salience.* The marginal energy $\int |K_j(\omega)|^2 \, d\omega$ measures the total spectral content of position $j$ independently of the query. EGA estimates this quantity directly.

### 2.2 Energy-Gated Attention (EGA)

EGA (Zeris, [2026](https://arxiv.org/html/2605.26355#bib.bib1)) augments standard attention with a learned energy gate:

$$
\begin{aligned}
e_j &= w_{\mathrm{proj}}^{\top} x_j && \text{(energy projection)} && (2) \\
\tilde{e}_j &= (e_j - \mu_e) / (\sigma_e + \epsilon) && \text{(z-normalize)} && (3) \\
g_j &= \sigma\!\left(\alpha(\tilde{e}_j - \tau)\right) && \text{(gate)} && (4) \\
\hat{A}_{ij} &= \frac{A_{ij} \cdot g_j}{\sum_{k} A_{ik} \cdot g_k + \epsilon} && \text{(renormalize)} && (5)
\end{aligned}
$$

Here $\mu_e$ and $\sigma_e$ are the mean and standard deviation used for z-normalization, and $\sigma(\cdot)$ in Eq. (4) is the sigmoid. The gate $g_j \in (0,1)$ is high for tokens whose embeddings project strongly onto the learned direction $w_{\mathrm{proj}}$, that is, tokens carrying high energy at the dominant projection direction. The threshold $\tau$ converges to $\approx 0.35$ regardless of initialization, corresponding to the fraction of tokens carrying above-average energy ($\approx 36\%$), consistent with the content word fraction in English running text (Zeris, [2026](https://arxiv.org/html/2605.26355#bib.bib1)).

EGA adds $d+2$ parameters per head ($<0.3\%$ overhead) and no measurable computational cost. It is causally implemented: the projection $w_{\mathrm{proj}}^{\top} x_j$ operates on position $j$ only, satisfying the causality requirement of Verma & Pilanci ([2024](https://arxiv.org/html/2605.26355#bib.bib21)).
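The gate in Eqs. (2)–(5) can be sketched in a few lines of pure Python for a single attention row. This is an illustrative sketch, not the released implementation: the function names are hypothetical, `alpha=4.0` is a made-up value, and computing $\mu_e$ and $\sigma_e$ over the positions of one sequence is an assumption, since the text does not say which axis the statistics are taken over.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def ega_gate(x, w_proj, alpha, tau, eps=1e-6):
    """Eqs. (2)-(4): project each token, z-normalize, squash with a sigmoid."""
    e = [sum(w * xi for w, xi in zip(w_proj, xj)) for xj in x]            # Eq. (2)
    mu = sum(e) / len(e)                                                  # assumption: stats over positions
    sd = math.sqrt(sum((v - mu) ** 2 for v in e) / len(e))
    e_norm = [(v - mu) / (sd + eps) for v in e]                           # Eq. (3)
    return [1.0 / (1.0 + math.exp(-alpha * (v - tau))) for v in e_norm]   # Eq. (4)


def gated_attention_row(attn_row, gate, eps=1e-6):
    """Eq. (5): reweight one row of attention weights by the key gates, then renormalize."""
    weighted = [a * g for a, g in zip(attn_row, gate)]
    z = sum(weighted) + eps
    return [w / z for w in weighted]


# Three toy token embeddings (made-up values); the third projects most strongly onto w_proj.
x = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
gate = ega_gate(x, w_proj=[1.0, 0.5], alpha=4.0, tau=0.35)
row = gated_attention_row(softmax([0.2, 0.1, 0.3]), gate)
print(gate)
print(row)
```

With these inputs the third token receives a gate close to 1 and the other two are suppressed, so the renormalized row concentrates on the high-energy token even though the raw attention scores were nearly uniform.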

#### On the term “energy salience\.”

We use the term *energy salience* for the EGA gate with the following precise meaning: by Parseval's identity, a linear projection over embedding dimensions estimates a spectrally-weighted energy of the embedding vector, so $w_{\mathrm{proj}}^{\top} x_j$ is theoretically motivated as an energy estimate. We acknowledge that what $w_{\mathrm{proj}}$ actually learns end-to-end may be better described as a general informational salience signal: it could learn to detect syntactic headedness, token rarity, boundary position, or frequency-selective energy, all of which would produce the observed improvement. Whether the gate specifically learns spectral energy as opposed to other salience properties is testable (by correlating gate outputs with DFT-computed spectral energy of the embeddings) and we identify this as an important direction for future work. EGA is most precisely a *learned energy gate*; the spectral framing provides theoretical motivation, not a claim about the specific computational mechanism.

### 2.3 Morlet Positional Encoding (MoPE)

MoPE replaces fixed sinusoidal PE with learned Gaussian-windowed wavelet encodings:

$$
\begin{aligned}
\mathrm{MoPE}(b, 2i) &= \cos(\omega_i b) \cdot e^{-b^2 / 2\sigma_i^2} && (6) \\
\mathrm{MoPE}(b, 2i+1) &= \sin(\omega_i b) \cdot e^{-b^2 / 2\sigma_i^2} && (7)
\end{aligned}
$$

where $\omega_i$ and $\sigma_i$ are learned per embedding dimension, initialized at dyadic spacing with $\omega_i \sigma_i = 5$ (admissibility minimum).

#### Theoretical properties\.

MoPE provides localized joint position-frequency representations analogous to Gaussian-windowed wavelets. Standard sin/cos PE is the degenerate case $\sigma_i \to \infty$:

$$\lim_{\sigma_i \to \infty} \mathrm{MoPE}(b, 2i) = \cos(\omega_i b) \tag{8}$$

MoPE therefore strictly generalizes sin/cos PE.
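A minimal sketch of Eqs. (6)–(8) follows, with the parameters held as plain Python lists rather than trainable tensors. The helper `dyadic_init` is hypothetical: the paper states only that initialization is dyadic with $\omega_i \sigma_i = 5$, so the choice of doubling $\sigma_i$ from a base of 1 token is an assumption.

```python
import math


def mope(position, omegas, sigmas):
    """Eqs. (6)-(7): interleaved cos/sin channels under a per-pair Gaussian envelope."""
    out = []
    for omega, sigma in zip(omegas, sigmas):
        envelope = math.exp(-position ** 2 / (2.0 * sigma ** 2))
        out.append(math.cos(omega * position) * envelope)  # channel 2i
        out.append(math.sin(omega * position) * envelope)  # channel 2i + 1
    return out


def dyadic_init(num_pairs, base_sigma=1.0, product=5.0):
    """Hypothetical init: sigma doubles per pair; omega is set so omega * sigma = product."""
    sigmas = [base_sigma * 2.0 ** i for i in range(num_pairs)]
    omegas = [product / s for s in sigmas]
    return omegas, sigmas


omegas, sigmas = dyadic_init(4)
pe = mope(3, omegas, sigmas)            # 8 channels for position b = 3
wide = mope(3, omegas, [1e9] * 4)       # very large sigma: envelope is ~1, Eq. (8)

print(pe)
print(wide[0], math.cos(omegas[0] * 3))  # the two values agree in the wide-envelope limit
```

The second call illustrates the limit in Eq. (8): once $\sigma_i$ is much larger than the position index, the envelope is numerically 1 and the encoding reduces to plain sin/cos channels.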

#### Connection to prior PE methods\.

At the level of phase structure, RoPE recovers sinusoidal phase behavior in the $\sigma_i \to \infty$ limit of MoPE applied to relative position; the full RoPE mechanism additionally uses rotational composition in complex query-key space, which is not equivalent to setting $\sigma_i \to \infty$ in the additive MoPE encoding. ALiBi corresponds to MoPE at zero frequency (locality only, no oscillation). MoPE is the unique generalization that provides both adaptive frequency and adaptive locality.

#### Cross\-correlation structure\.

Substituting MoPE into the cross-correlation $C_i(\tau) = \sum_{b} \mathrm{MoPE}(b, 2i) \cdot \mathrm{MoPE}(b+\tau, 2i)$, assuming same-scale correlation and neglecting boundary effects, gives up to normalization constants:

$$C_i(\tau) \propto e^{-\tau^2 / 4\sigma_i^2} \cdot \cos(\omega_i \tau) \tag{9}$$

This has the form of a Morlet kernel in lag space. Three properties are notable.

*Persistence.* The Gaussian $e^{-\tau^2 / 4\sigma_i^2}$ measures how strongly scale-$i$ linguistic patterns persist over $\tau$ token steps. Fine-scale dimensions (small $\sigma_i$) have rapidly decaying cross-correlations, capturing character-level local structure. Coarse-scale dimensions (large $\sigma_i$) have slowly decaying cross-correlations, capturing clause or sentence-level dependencies.

*Periodicity.* The cosine $\cos(\omega_i \tau)$ encodes relative position at frequency $\omega_i$, the same quantity as RoPE's rotation angle at the same frequency.

*Heisenberg tradeoff.* Within the class of Gaussian-windowed representations, the product $\Delta\tau \cdot \Delta\omega = 1/2$ achieves the minimum uncertainty product permitted by the Heisenberg bound. MoPE provides the optimal tradeoff within this class; sin/cos PE achieves $\Delta\omega = 0$ (zero bandwidth) at the cost of $\Delta\tau = \infty$ (no locality).
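The step from Eqs. (6) to (9) can be sketched as follows. This derivation is added for illustration and rests on two assumptions beyond the text: the sum over $b$ is approximated by an integral over the whole real line, and $\omega_i \sigma_i$ stays near its initial value of 5.

```latex
\begin{aligned}
C_i(\tau)
  &\approx \int_{-\infty}^{\infty}
     \cos(\omega_i b)\,\cos\bigl(\omega_i (b+\tau)\bigr)\,
     e^{-b^2/2\sigma_i^2}\, e^{-(b+\tau)^2/2\sigma_i^2}\, db \\
  &= e^{-\tau^2/4\sigma_i^2} \int_{-\infty}^{\infty}
     \tfrac{1}{2}\Bigl[\cos(\omega_i \tau) + \cos\bigl(\omega_i (2b+\tau)\bigr)\Bigr]\,
     e^{-(b+\tau/2)^2/\sigma_i^2}\, db \\
  &= \frac{\sqrt{\pi}\,\sigma_i}{2}\; e^{-\tau^2/4\sigma_i^2}
     \Bigl[\cos(\omega_i \tau) + e^{-\omega_i^2 \sigma_i^2}\Bigr]
\end{aligned}
```

The second line uses $b^2 + (b+\tau)^2 = 2(b+\tau/2)^2 + \tau^2/2$ and the product-to-sum identity for cosines. With $\omega_i \sigma_i = 5$ the residual term $e^{-\omega_i^2 \sigma_i^2} = e^{-25}$ is negligible, which leaves the Morlet form of Eq. (9).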

### 2.4 Combined Model: Salience and Locality

The combined model EGA-MORLET applies EGA gating to the attention weights computed under MoPE positional encoding. No architectural changes beyond these two components are required.

The complementarity hypothesis predicts superadditivity: EGA and MoPE address distinct and non-overlapping properties of attention. EGA improves attention by identifying *which* tokens are informative (salience-aware). MoPE improves attention by specifying *where* positional influence extends at each scale (locality-aware). Neither component encodes the information the other provides. Their combination should therefore achieve more than either alone, a prediction we test empirically in Section [4](https://arxiv.org/html/2605.26355#S4).

## 3 Theoretical Analysis

### 3.1 Why the Combination is Superadditive

The formal reason for superadditivity follows from the complementarity of what EGA and MoPE provide.

EGA modifies the *value* aggregation step by reweighting which tokens contribute. MoPE modifies the *score* computation by changing what position information is available. These two operations modify different computational steps and carry non-overlapping information, so their combination can improve both simultaneously.

More precisely, the EGA\-MORLET attention score is:

$$e_{ij}^{\text{EGA-MORLET}} = \frac{q_i^{\text{MoPE}} \cdot k_j^{\text{MoPE}}}{\sqrt{d_k}} \cdot g_j^{\text{EGA}} \tag{10}$$

where $q_i^{\text{MoPE}}, k_j^{\text{MoPE}}$ are query/key vectors incorporating MoPE positional information, and $g_j^{\text{EGA}}$ is the spectral energy gate. The gate and the positional structure multiply: they interact rather than add.

### 3.2 Why Structured Priors Fail

A consistent pattern across all four phases of this experimental series is that unconstrained learned projections outperform structured spectral priors such as Morlet wavelet gates, scale-initialized heads, and fixed sinusoidal PE.

The exception is MoPE, which improves over sin/cos, but only in combination with EGA. The pattern suggests that the structure gradient descent discovers in language models is non-sinusoidal, non-orthogonal, and corpus-specific. Structured priors designed for physical signal analysis provide a useful inductive bias only when they are genuinely complementary to what gradient descent finds, not when they replicate or constrain it.

### 3.3 Spectral Cascade: Qualitative Layer Analysis

We define the spectral cascade profile:

$$\text{Cascade}(l, a) = \mathbb{E}_{b,d}\!\left[\left| W_{\psi}\bigl[e^{(l)}_{d}\bigr](a, b) \right|\right] \tag{11}$$

as the mean Morlet wavelet coefficient magnitude at layer $l$ and scale $a$. Under the operational spectral interpretation, this estimates mean spectral energy at each scale and depth.

The cascade (Figure [1](https://arxiv.org/html/2605.26355#S4.F1)) shows qualitative coarsening: higher spectral energy at fine scales in early layers shifts toward coarser scales in later layers. This is qualitatively consistent with a multiscale filter bank interpretation of transformer computation, in which early layers process character statistics and later layers process longer-range structure. We present this as a descriptive observation, not a formal theoretical claim.
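To make Eq. (11) concrete, the sketch below computes the cascade profile for one layer's activations, given as a `T x D` nested list. It is an assumption-laden illustration: the wavelet normalization ($1/\sqrt{a}$), the center frequency `omega0 = 5.0`, and the direct $O(T^2)$ summation are choices made here, and extracting per-layer embeddings from a trained model is left out.

```python
import cmath
import math


def morlet(t, scale, omega0=5.0):
    """Complex Morlet wavelet dilated by `scale` (1/sqrt(scale) normalization is an assumption)."""
    u = t / scale
    return cmath.exp(1j * omega0 * u) * math.exp(-u * u / 2.0) / math.sqrt(scale)


def wavelet_coeff(signal, scale, b, omega0=5.0):
    """W_psi[signal](scale, b): inner product with the conjugated wavelet centered at b."""
    return sum(x * morlet(t - b, scale, omega0).conjugate() for t, x in enumerate(signal))


def cascade(layer_embeddings, scales, omega0=5.0):
    """Eq. (11) for one layer: mean |W_psi| over positions b and coordinates d, per scale."""
    T = len(layer_embeddings)
    D = len(layer_embeddings[0])
    profile = {}
    for a in scales:
        total = 0.0
        for d in range(D):
            signal = [layer_embeddings[t][d] for t in range(T)]
            total += sum(abs(wavelet_coeff(signal, a, b, omega0)) for b in range(T))
        profile[a] = total / (T * D)
    return profile


# Synthetic "layer": one coordinate oscillating with period 4 tokens (fine-scale content only).
T = 64
fast = [[math.cos(math.pi * t / 2.0)] for t in range(T)]
profile = cascade(fast, scales=[3.0, 24.0])
print(profile)
```

For this synthetic signal the energy concentrates at the fine scale, so `profile[3.0]` is much larger than `profile[24.0]`. Repeating the computation per layer would give the layer-by-scale map that the cascade describes.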

## 4 Experiments

### 4.1 Setup

#### Architecture and data.

GPT-style decoder, $L=6$, $H=8$, $d=256$, context $T=256$, character-level TinyShakespeare. All models trained for 5,000 steps with cosine LR decay from $3\times10^{-4}$, AdamW, identical mini-batches throughout.
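As a rough sanity check on this configuration, the following sketch estimates the parameter count and the relative size of the EGA gate. Several inputs are assumptions not stated in the setup: a 65-character vocabulary, a 4x MLP expansion, a learned positional table, biases and LayerNorm weights ignored, and $d$ in the "$d+2$ parameters per head" figure read as the model width.

```python
def gpt_param_estimate(n_layer, d_model, context, vocab_size, mlp_ratio=4):
    """Rough decoder parameter count, ignoring biases and LayerNorm weights."""
    embeddings = vocab_size * d_model + context * d_model  # token + learned positional tables
    attention = 4 * d_model * d_model                      # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model                # up- and down-projection
    return embeddings + n_layer * (attention + mlp)


def ega_overhead(n_layer, n_head, d_model):
    """Extra parameters if every head carries a d-vector plus two scalars (d + 2)."""
    return n_layer * n_head * (d_model + 2)


base = gpt_param_estimate(n_layer=6, d_model=256, context=256, vocab_size=65)
extra = ega_overhead(n_layer=6, n_head=8, d_model=256)
print(base, extra, f"{100 * extra / base:.2f}%")  # 4800768 12384 0.26%
```

Under these assumptions the model has roughly 4.8M parameters, consistent with the $\leq 6$M scale reported in the abstract, and the gate adds about 0.26%, consistent with the stated $<0.3\%$ overhead.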

#### Statistical note\.

All reported results are single-run, single-seed. The primary result (EGA-MORLET $+0.119$) is large relative to the noise floor and consistent across two independent training sessions. Effect sizes below $\pm 0.02$ (convolution $+0.007$, flux $+0.012$, phase $-0.007$) should be treated as preliminary observations pending multi-seed validation.

### 4.2 Main Result: Salience and Locality

Table 1: Main results on TinyShakespeare. $\Delta$ = improvement over BASE-DOT (positive = better). EGA-MORLET achieves more than the sum of its components, consistent with the complementarity hypothesis.

| Model | Val | $\Delta$ | Mechanism |
|---|---|---|---|
| BASE-DOT | 1.4742 | – | dot product + learned PE |
| PE-SINCOS | 1.5863 | $-0.112$ | fixed sin/cos |
| PE-ROPE | 1.4637 | $+0.011$ | rotary (relative) |
| PE-MORLET | 1.5060 | $-0.032$ | MoPE alone |
| EGA-1 | 1.3821 | $+0.092$ | energy gate alone |
| EGA-MORLET | 1.3550 | **+0.119** | EGA + MoPE |

Sum of components: $+0.092 + (-0.032) = +0.060$. Combination: $+0.119$. Excess: $+0.059$. Consistent with complementarity hypothesis. Note: EGA-1 val = 1.3712 in original Phase 1–3 session (1.3821 here); difference is within expected single-seed variance.

#### Result interpretation.

MoPE alone underperforms standard attention ($-0.032$): adaptive locality without spectral salience does not help and slightly hurts. EGA alone substantially outperforms ($+0.092$): salience alone is a useful inductive bias. Together they achieve $+0.119$, which supports the complementarity hypothesis.

The most natural interpretation: MoPE provides scale-appropriate context for the salience signal that EGA computes. Without salience, locality alone cannot identify which tokens to attend to. Without locality, salience cannot distinguish *at what scale* a token is important. Together they implement a more complete attention mechanism.
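The superadditivity arithmetic can be reproduced directly from the validation losses in Table 1. The snippet below is only a bookkeeping check of the reported numbers; the helper name `delta` is introduced here for illustration.

```python
def delta(baseline_loss, model_loss):
    """Improvement over baseline; positive means lower validation loss."""
    return baseline_loss - model_loss


# Validation losses copied from Table 1.
val = {"BASE-DOT": 1.4742, "PE-MORLET": 1.5060, "EGA-1": 1.3821, "EGA-MORLET": 1.3550}

d_mope = delta(val["BASE-DOT"], val["PE-MORLET"])
d_ega = delta(val["BASE-DOT"], val["EGA-1"])
d_both = delta(val["BASE-DOT"], val["EGA-MORLET"])
excess = d_both - (d_ega + d_mope)

print(f"EGA {d_ega:+.3f}, MoPE {d_mope:+.3f}, combined {d_both:+.3f}, excess {excess:+.3f}")
# EGA +0.092, MoPE -0.032, combined +0.119, excess +0.059
```

A positive `excess` is what the text calls superadditivity: the combined gain is larger than the sum of the two standalone deltas.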

### 4.3 Ablation: What Helps and What Fails

Table 2: Complete ablation results. Positive $\Delta$ = better than BASE-DOT.

| Model | Val | $\Delta$ | Mechanism | Interpretation |
|---|---|---|---|---|
| BASE-DOT | 1.4742 | – | standard | baseline |
| CONV-L4 | 1.4668 | $+0.007$ | $\pm 4$ lags | lags carry info |
| CONV-L8 | 1.4691 | $+0.005$ | $\pm 8$ lags | shorter better |
| PE-SINCOS | 1.5863 | $-0.112$ | fixed PE | fixed fails |
| PE-ROPE | 1.4637 | $+0.011$ | relative PE | relative helps |
| PE-MORLET | 1.5060 | $-0.032$ | MoPE | locality alone fails |
| EGA-1 | 1.3821 | $+0.092$ | energy gate | salience helps |
| EGA-MORLET | 1.3550 | $+0.119$ | combined | best |
| SCALE-INIT | 1.4812 | $-0.007$ | scale init | GD finds scales |
| MQ-E | 1.4688 | $+0.005$ | E only | marginal |
| MQ-EP | 1.4810 | $-0.007$ | E + phase | phase hurts |
| MQ-EF | 1.4625 | $+0.012$ | E + flux | flux helps |

#### Convolution attention ($+0.007$).

Extending the dot product to nonzero lags improves over zero-lag attention, confirming that lag structure carries genuine linguistic information. Shorter lag windows ($\pm 4$) outperform longer ones ($\pm 8$), consistent with character-level structure being predominantly local. Effect size is small; multi-seed confirmation needed.

#### Scale initialization ($-0.007$).

Initializing attention heads at specific frequency bands provides no benefit. Gradient descent discovers the optimal scale structure from random initialization, so the inductive bias is redundant. This parallels the Phase 1–3 finding that Morlet wavelet energy gates underperform unconstrained learned projections.

#### Spectral flux ($+0.012$).

Spectral flux $|\partial E_j / \partial b|$, measuring the rate of change of wavelet energy, provides a small positive signal. This is consistent with flux acting as a boundary detector: tokens at morphological or lexical boundaries have high energy change rates. Effect size is small; single-seed, treat as preliminary observation.
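One simple way to realize the flux signal is a finite difference over a per-position energy sequence. The sketch below uses a backward difference so that position $b$ depends only on positions $b$ and $b-1$; both this discretization and the example energies are assumptions, since the text gives only the continuous form.

```python
def spectral_flux(energy):
    """|dE/db| approximated by a backward difference; the first position gets zero flux."""
    flux = [0.0]
    for b in range(1, len(energy)):
        flux.append(abs(energy[b] - energy[b - 1]))
    return flux


# Made-up per-position wavelet energies with a sharp rise at index 2 and a drop at index 4.
energy = [0.2, 0.2, 0.9, 0.8, 0.1]
flux = spectral_flux(energy)
print(flux)
```

Flux is near zero where the energy is flat and large at the two abrupt changes, which is the boundary-detector behavior described above.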

#### Phase coherence ($-0.007$).

Phase information $\cos(\phi_j)$ shows a small negative association. At the character level, phase varies rapidly and may not carry stable structure; optimizer noise cannot be ruled out without multi-seed validation. Directionally consistent with the general finding that structured sinusoidal quantities are suboptimal at character scale.

### 4.4 Learned MoPE Parameters

![Refer to caption](https://arxiv.org/html/2605.26355v1/phase4_exp12.png)

Figure 1: *Left*: Validation loss curves for convolution attention ablation. Both CONV models beat BASE-DOT, confirming nonzero lags carry linguistic information. *Center*: Validation loss for PE ablation. EGA-MORLET (orange, dashed) converges fastest. *Right*: Final validation loss for all models. EGA-MORLET and EGA-1 are the only models that substantially improve on the baseline.

The learned MoPE parameters cluster at four distinct regions in the $(\sigma_i, \omega_i)$ plane: character scale ($\sigma \approx 2$–$3$ tokens), word scale ($\sigma \approx 8$–$12$), clause scale ($\sigma \approx 25$–$40$), and sentence scale ($\sigma \approx 60$–$100$). Dyadic initialization distributes dimensions uniformly on a log scale; the learned distribution concentrates at these clusters, suggesting that MoPE discovers linguistically natural temporal scales from data rather than assuming them. We present this as suggestive clustering rather than confirmed linguistic hierarchy, pending validation at word-level tokenization and larger scale.

## 5 Discussion

#### The complementarity of salience and locality.

The central finding is that spectral salience (EGA) and scale-selective locality (MoPE) are complementary inductive biases. Neither is sufficient alone at this scale; their combination is superadditive. We cannot distinguish *architectural complementarity* (genuinely orthogonal information) from *optimization interaction* (a better loss basin) at this scale. Larger-scale multi-seed experiments would provide stronger evidence.

#### Why unconstrained learning wins\.

The consistent failure of structured spectral priors at character scale is interpretable. The optimal basis for character-level language model embeddings is non-sinusoidal and corpus-specific, and is not well described by Morlet wavelets, Daubechies filters, or Fourier bases designed for physical signal analysis. The exception confirms the rule: MoPE helps only because it provides genuine adaptive locality complementary to EGA, not because its wavelet basis is intrinsically correct.

#### Long\-context opportunity\.

The locality parameter $\sigma_i$ in MoPE controls how far positional influence extends. For long-context models ($T \geq 4096$), this adaptivity may be particularly valuable: different embedding dimensions can specialize to different context ranges rather than all contributing globally. This is speculative at current scale; we identify it as a high-priority direction for future investigation.

#### Multiscale structure across layers\.

The spectral cascade (Eq. [11](https://arxiv.org/html/2605.26355#S3.E11)) shows qualitative coarsening loosely reminiscent of multiscale cascade structures in other domains: finer scales dominate early layers, coarser scales later layers. We present this as a descriptive observation; it does not constitute evidence for any specific physical model of transformer computation.

#### Limitations\.

All experiments are character-level, $\leq 6$M parameters, single seed. The primary result ($+0.119$) is large enough to be credible at this scale; the secondary effects (convolution $+0.007$, flux $+0.012$, phase $-0.007$) require multi-seed confirmation. Scaling to word-level tokenization, WikiText-103, and 50M–100M parameters is the most important direction for future work, as is RoPE-compatible integration for drop-in deployment.

## 6 Related Work

#### Positional encoding.

Vaswani et al. ([2017](https://arxiv.org/html/2605.26355#bib.bib20)) introduced fixed sin/cos PE. Su et al. ([2021](https://arxiv.org/html/2605.26355#bib.bib18)) proposed RoPE, encoding relative position as rotation in complex space. Press et al. ([2022](https://arxiv.org/html/2605.26355#bib.bib16)) introduced ALiBi, adding linear distance biases. MoPE unifies these: sin/cos is MoPE at $\sigma \to \infty$; ALiBi is MoPE at $\omega = 0$. Recent length-generalization methods, namely YaRN (Peng et al., [2023](https://arxiv.org/html/2605.26355#bib.bib14)) and NTK-aware scaling (Bloc97, [2023](https://arxiv.org/html/2605.26355#bib.bib5)), extend RoPE by modifying frequency scaling; MoPE addresses the orthogonal question of spatial locality. Hyena (Poli et al., [2023](https://arxiv.org/html/2605.26355#bib.bib15)) and state-space models (Gu & Dao, [2023](https://arxiv.org/html/2605.26355#bib.bib8)) encode position implicitly through convolutions and recurrence.

#### Signal processing in transformers\.

Verma & Pilanci ([2024](https://arxiv.org/html/2605.26355#bib.bib21)) applied causal filter banks between transformer layers, establishing the signal interpretation we adopt. Lee-Thorp et al. ([2022](https://arxiv.org/html/2605.26355#bib.bib12)) replaced attention with Fourier mixing, showing that structured spectral operations can substitute for attention. Tamkin et al. ([2020](https://arxiv.org/html/2605.26355#bib.bib19)) used DCT filters to disentangle multiscale representations in BERT, explicitly identifying wavelets as future work; MoPE is one realization of that direction.

#### Spectral and energy\-based attention\.

Zeris ([2026](https://arxiv.org/html/2605.26355#bib.bib1)) introduced EGA (Paper 1 of this series), establishing the energy gating mechanism. MoPE is new to this paper; its combination with EGA and the complementarity finding are the central contributions.

#### Efficient and sparse attention\.

Beltagy et al. ([2020](https://arxiv.org/html/2605.26355#bib.bib4)) and Zaheer et al. ([2020](https://arxiv.org/html/2605.26355#bib.bib23)) reduce complexity through fixed sparsity patterns. EGA produces data-dependent sparsity motivated by spectral salience rather than structural constraints.

#### Mechanistic interpretability\.

Elhage et al. ([2021](https://arxiv.org/html/2605.26355#bib.bib7)) and Olsson et al. ([2022](https://arxiv.org/html/2605.26355#bib.bib13)) analyze transformer circuits. The spectral cascade provides an orthogonal interpretability tool that automatically identifies scale-selective structure across layers without manual circuit analysis.

## 7 Conclusion

Similarity selects what matches the query; salience selects what matters\.

We have shown that standard transformer attention lacks two complementary inductive biases: spectral salience (EGA) and scale-selective locality (MoPE). EGA is useful on its own, while MoPE alone slightly underperforms the baseline ($-0.032$); together they achieve $+0.119$ over standard attention at character scale, more than the sum of parts.

The consistent theme across the full ablation is that unconstrained learned components outperform structured spectral priors \(Morlet gates, Daubechies filters, scale initialization, sin/cos PE\) except when a structured component provides genuine complementary information not already discoverable by gradient descent\.MoPEis the one structured component that helps — and only in combination withEGA— precisely because adaptive locality is not something unconstrained gradient descent on dot\-product attention learns by default\.

Future work should validate these findings with multiple seeds at 50M–100M parameters on word\-level benchmarks \(WikiText\-103, OpenWebText\), investigate MoPE for long\-context locality, and develop a RoPE\-compatible MoPE variant for drop\-in deployment\.

## References

- Zeris \[2026\] Zeris, A\. Energy\-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention\. *arXiv preprint arXiv:2605\.21842v1*, 2026\.
- Bahdanau et al\. \[2015\] Bahdanau, D\., Cho, K\., and Bengio, Y\. Neural machine translation by jointly learning to align and translate\. In *ICLR*, 2015\.
- Bello et al\. \[2005\] Bello, J\. P\., et al\. A tutorial on onset detection in music signals\. *IEEE Trans\. Speech Audio Process\.*, 13\(5\):1035–1047, 2005\.
- Beltagy et al\. \[2020\] Beltagy, I\., Peters, M\. E\., and Cohan, A\. Longformer: The Long\-Document Transformer\. *arXiv:2004\.05150*, 2020\.
- Bloc97 \[2023\] Bloc97\. NTK\-aware scaled RoPE allows LLaMA models to have extended \(8k\+\) context size without any fine\-tuning\. *Reddit / GitHub*, 2023\.
- Dai et al\. \[2021\] Dai, Z\., Liu, H\., Le, Q\. V\., and Tan, M\. CoAtNet: Marrying convolution and attention for all data sizes\. In *NeurIPS*, volume 34, 2021\.
- Elhage et al\. \[2021\] Elhage, N\., et al\. A mathematical framework for transformer circuits\. *Transformer Circuits Thread*, 2021\.
- Gu & Dao \[2023\] Gu, A\. and Dao, T\. Mamba: Linear\-time sequence modeling with selective state spaces\. *arXiv:2312\.00752*, 2023\.
- Holmes et al\. \[1996\] Holmes, P\., Lumley, J\. L\., and Berkooz, G\. *Turbulence, Coherent Structures, Dynamical Systems and Symmetry*\. Cambridge University Press, 1996\.
- Joshi et al\. \[2020\] Joshi, C\., Laurent, T\., and Bresson, X\. On the equivalence of deep neural networks and graph neural networks\. *arXiv:2001\.12232*, 2020\.
- Karpathy \[2015\] Karpathy, A\. The unreasonable effectiveness of recurrent neural networks, 2015\.
- Lee\-Thorp et al\. \[2022\] Lee\-Thorp, J\., Ainslie, J\., Eckstein, I\., and Ontanon, S\. FNet: Mixing tokens with Fourier transforms\. In *NAACL*, 2022\.
- Olsson et al\. \[2022\] Olsson, C\., et al\. In\-context learning and induction heads\. *Transformer Circuits Thread*, 2022\.
- Peng et al\. \[2023\] Peng, B\., et al\. YaRN: Efficient context window extension of large language models\. *arXiv:2309\.00071*, 2023\.
- Poli et al\. \[2023\] Poli, M\., et al\. Hyena hierarchy: Towards larger convolutional language models\. In *ICML*, 2023\.
- Press et al\. \[2022\] Press, O\., Smith, N\. A\., and Lewis, M\. Train short, test long: Attention with linear biases enables input length extrapolation\. In *ICLR*, 2022\.
- So et al\. \[2021\] So, D\. R\., Manke, W\., Liu, H\., Dai, Z\., Shazeer, N\., and Le, Q\. V\. Searching for efficient transformers for language modeling\. In *NeurIPS*, volume 34, 2021\.
- Su et al\. \[2021\] Su, J\., Lu, Y\., Pan, S\., Murtadha, A\., Wen, B\., and Liu, Y\. RoFormer: Enhanced transformer with rotary position embedding\. *arXiv:2104\.09864*, 2021\.
- Tamkin et al\. \[2020\] Tamkin, A\., Jurafsky, D\., and Goodman, N\. Language through a prism: A spectral approach for multiscale language representations\. In *NeurIPS*, volume 33, 2020\.
- Vaswani et al\. \[2017\] Vaswani, A\., et al\. Attention is all you need\. In *NeurIPS*, volume 30, 2017\.
- Verma & Pilanci \[2024\] Verma, P\. and Pilanci, M\. Towards signal processing in large language models\. *arXiv:2406\.10254*, 2024\.
- Wu et al\. \[2021\] Wu, H\., et al\. CvT: Introducing convolutions to vision transformers\. In *ICCV*, 2021\.
- Zaheer et al\. \[2020\] Zaheer, M\., et al\. Big Bird: Transformers for longer sequences\. In *NeurIPS*, volume 33, 2020\.

## Appendix A Additional Ablations

Table 3: Complete Phase 4 results including all models\. Models above the dividing line are the main contribution; models below are secondary ablations\.

## Appendix B Memory\-Efficient Convolution Attention

Peak memory for convolution attention: one $[B, T, T]$ accumulator plus one small $[B, T-|\tau|, T-|\tau|]$ intermediate, independent of $L_{\mathrm{max}}$\. This solved the out\-of\-memory failure at $L=16$ on a 15\.6 GB T4 GPU\.

Algorithm 1: Memory\-Efficient Convolution Attention

Require: $Q, K, V \in \mathbb{R}^{B \times T \times d_k}$, lag weights $\lambda \in \mathbb{R}^{2L+1}$, causal mask $M$

1. $w_\tau \leftarrow \mathrm{softmax}(\lambda)$
2. $S \leftarrow 0_{B \times T \times T}$
3. for $\tau \in \{-L, \ldots, +L\}$ do
4. if $|w_\tau| < 10^{-4}$ then
5. continue
6. end if
7. Accumulate shifted $QK^\top / \sqrt{d_k}$ at lag $\tau$
8. Delete intermediate tensors immediately
9. end for
10. Apply causal mask; return $\mathrm{softmax}(S/\sqrt{K}) \cdot V$
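The loop in Algorithm 1 can be sketched in NumPy as follows\. This is an illustrative reading, not the authors' implementation: the function name `conv_attention`, the `threshold` and `final_scale` parameters, and the interpretation of "shifted" as $S_{ij} \mathrel{+}= w_\tau\, q_{i+\tau} \cdot k_{j+\tau} / \sqrt{d_k}$ over the valid index range are all assumptions\. That interpretation was chosen because it yields the $[B, T-|\tau|, T-|\tau|]$ intermediate mentioned above\. The normalizer in the final step is not defined in this section, so the sketch exposes it as `final_scale` with a default of 1\.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def conv_attention(Q, K, V, lam, threshold=1e-4, final_scale=1.0):
    """Lag-weighted attention with a single [B, T, T] accumulator.

    Q, K, V: arrays of shape [B, T, d_k]; lam: lag logits of shape [2L + 1].
    `threshold` and `final_scale` are hypothetical knobs for this sketch.
    """
    B, T, d_k = Q.shape
    L = (lam.shape[0] - 1) // 2
    w = softmax(lam)                          # step 1: lag weights
    S = np.zeros((B, T, T))                   # step 2: the only full-size buffer
    for idx, tau in enumerate(range(-L, L + 1)):      # step 3
        if abs(w[idx]) < threshold or abs(tau) >= T:  # steps 4-6: skip tiny lags
            continue
        n = T - abs(tau)
        # ASSUMPTION: "shifted" pairs q_{i+tau} with k_{j+tau} and writes the
        # result into the sub-block of S where both indices are in range.
        if tau >= 0:
            q, k, target = Q[:, tau:], K[:, tau:], S[:, :n, :n]
        else:
            q, k, target = Q[:, :n], K[:, :n], S[:, -tau:, -tau:]
        block = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # step 7: [B, n, n]
        target += w[idx] * block              # in-place add on a view of S
        del block                             # step 8: drop the intermediate
    causal = np.tril(np.ones((T, T), dtype=bool))
    S = np.where(causal, S / final_scale, -np.inf)       # step 10: causal mask
    return softmax(S, axis=-1) @ V


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(2, 5, 4)) for _ in range(3))
out = conv_attention(Q, K, V, lam=np.zeros(5))  # L = 2, uniform lag weights
```

With a single lag \($L = 0$\) the sketch reduces to ordinary causal scaled dot\-product attention, which gives a convenient sanity check\.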

## Appendix C MoPE Initialization and Admissibility

MoPE initializes frequencies at dyadic spacing:

$$
\omega_i^{(0)} = \exp\!\left(\frac{i}{d/2-1}\ln(\pi \cdot 0.99)\right), \quad i = 0, \ldots, d/2-1 \tag{12}
$$

and bandwidths at the admissibility minimum:

$$
\sigma_i^{(0)} = 5/\omega_i^{(0)} \tag{13}
$$

ensuring $\omega_i^{(0)} \sigma_i^{(0)} = 5$ at initialization\. During training the constraint $\omega_i \sigma_i \geq 5$ is enforced as a floor\. Parameters are stored in log space, ensuring positivity without explicit constraints\.
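A minimal sketch of Eqs\. \(12\) and \(13\) in plain Python, assuming an even model dimension $d \geq 4$\. The function names `mope_init` and `apply_floor` are hypothetical, and the `max`\-based clamp is only one assumed way to enforce the floor; the text above does not specify the mechanism\.

```python
import math


def mope_init(d, product_floor=5.0):
    """Return (log_omega, log_sigma) initialized per Eqs. (12) and (13).

    Both parameter lists are kept in log space, so omega = exp(log_omega)
    and sigma = exp(log_sigma) are positive by construction.
    """
    n = d // 2
    if n < 2:
        raise ValueError("need d >= 4 so that d/2 - 1 is nonzero")
    top = math.log(math.pi * 0.99)
    log_omega = [i / (n - 1) * top for i in range(n)]              # Eq. (12)
    log_sigma = [math.log(product_floor) - lw for lw in log_omega]  # Eq. (13)
    return log_omega, log_sigma


def apply_floor(log_omega, log_sigma, product_floor=5.0):
    """ASSUMED floor: clamp log_sigma so that omega * sigma >= product_floor."""
    log_floor = math.log(product_floor)
    return [max(ls, log_floor - lw) for lw, ls in zip(log_omega, log_sigma)]


log_omega, log_sigma = mope_init(d=8)
omega = [math.exp(v) for v in log_omega]
sigma = [math.exp(v) for v in apply_floor(log_omega, log_sigma)]
```

Evaluating Eq\. \(12\) at the endpoints gives $\omega_0^{(0)} = 1$ and $\omega_{d/2-1}^{(0)} = 0.99\pi$, with the intermediate frequencies spaced geometrically between them\.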