Rethinking the Role of Efficient Attention in Hybrid Architectures

Summary

This paper systematically analyzes the role of efficient attention modules in hybrid language model architectures, finding that different designs converge in long-context performance under sufficient training, and that long-range retrieval is primarily carried by full attention while efficient attention shapes the optimization trajectory, revealing a 'Large-Window Laziness' phenomenon.

arXiv:2606.15378v1, announce type: new.

# Rethinking the Role of Efficient Attention in Hybrid Architectures
Source: [https://arxiv.org/html/2606.15378](https://arxiv.org/html/2606.15378)
Ziqing Qiao¹, Yinuo Xu¹, Chaojun Xiao¹, Zhou Su², Zihan Zhou², Yingfa Chen¹, Xiaoyue Xu², Xu Han¹, Zhiyuan Liu¹

¹Tsinghua University, ²OpenBMB

qzq24@mails.tsinghua.edu.cn, {xcj,han-xu,liuzy}@tsinghua.edu.cn

###### Abstract

Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis across hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, whereas efficient attention shapes its optimization trajectory. This explains a counter-intuitive phenomenon we call *Large-Window Laziness*: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance. We release our code at [rethinking-hybrid-attention](https://github.com/thunlp/rethinking-hybrid-attention).

Equal contribution: Ziqing Qiao, Yinuo Xu. Corresponding authors: Chaojun Xiao, Xu Han, Zhiyuan Liu.

## 1 Introduction

As large language models are increasingly used for long-document understanding and agentic workflows, handling extended contexts has become a core requirement in recent model releases (DeepSeek-AI, [2026](https://arxiv.org/html/2606.15378#bib.bib1); Singh et al., [2025](https://arxiv.org/html/2606.15378#bib.bib2)). However, standard softmax attention, which we refer to as *full attention*, is costly at long sequence lengths (Vaswani et al., [2017](https://arxiv.org/html/2606.15378#bib.bib3)). This has motivated a family of hybrid attention architectures that combine full attention with *efficient attention* such as sliding-window attention (SWA) (Beltagy et al., [2020](https://arxiv.org/html/2606.15378#bib.bib20)) and recurrent sequence mixers (Gu and Dao, [2023](https://arxiv.org/html/2606.15378#bib.bib37); Yang et al., [2024a](https://arxiv.org/html/2606.15378#bib.bib38)), a design now widely adopted in recent language models (Agarwal et al., [2025](https://arxiv.org/html/2606.15378#bib.bib5); Gemma Team, [2025](https://arxiv.org/html/2606.15378#bib.bib6); Cao et al., [2026](https://arxiv.org/html/2606.15378#bib.bib8)).

Despite their prevalence, the role of efficient attention in hybrid architectures remains unclear. Existing studies lack a unified mechanistic analysis of how different efficient-attention designs shape the capabilities and training dynamics of hybrid architectures, particularly their long-context performance (Xiao et al., [2026](https://arxiv.org/html/2606.15378#bib.bib9); Li et al., [2025](https://arxiv.org/html/2606.15378#bib.bib7); Wang et al., [2025](https://arxiv.org/html/2606.15378#bib.bib11); Bae et al., [2025](https://arxiv.org/html/2606.15378#bib.bib12)). To address this gap, we investigate three research questions:

*RQ1 \- Scaling Behavior: How do hybrid architectures scale in short\- and long\-context performance?*

*RQ2 \- Mechanism Analysis: How does efficient\-attention design influence long\-context performance?*

*RQ3 \- Architecture Design: What design principles lead to more effective hybrid architectures?*

#### Scaling laws for short\- and long\-context capabilities\.

We study how hybrid architectures scale in short- and long-context performance through the lens of *scaling laws* across multiple model scales and training budgets (Kaplan et al., [2020](https://arxiv.org/html/2606.15378#bib.bib17); Hoffmann et al., [2022](https://arxiv.org/html/2606.15378#bib.bib18)). Because downstream benchmark scores are discrete and unstable (Liang et al., [2026](https://arxiv.org/html/2606.15378#bib.bib13)), we use validation $\mathrm{Loss}$ and $\log(\mathrm{LongPPL})$ (Fang et al., [2025](https://arxiv.org/html/2606.15378#bib.bib10)) as two continuous fitting targets. The former captures general short-context modeling quality, while the latter provides a smooth proxy for long-context capability. The fitted scaling curves show that efficient-attention design has little effect on validation $\mathrm{Loss}$ but leads to more pronounced differences in $\log(\mathrm{LongPPL})$. Specifically, different hybrid architectures exhibit substantial gaps under limited training budgets, with large-window SWA hybrids performing notably worse. However, as the training budget grows, these gaps shrink significantly and the architectures eventually approach a similar level.

#### Efficient attention as an optimization prior\.

The scaling pattern above leaves us with two seemingly contradictory puzzles. First, why do hybrid architectures with different efficient attention ultimately converge to a similar long-context level? Second, why do their convergence rates differ so much, especially across SWA variants with different window sizes? Our mechanistic analysis shows that both puzzles share a common explanation: efficient attention does not directly determine long-context capability; instead, it acts as an *optimization prior* that shapes how full attention is trained.

Why do hybrids converge? Through receptive-field constraint and layer-wise probing experiments, we find that long-range information is carried primarily by full attention rather than by efficient-attention modules, even recurrent sequence mixers with in-principle unbounded receptive fields. Sharing this same full-attention component, different hybrids converge to a similar long-context level regardless of their efficient-attention design.

Why do convergence rates differ? While full attention sets the final converged level, efficient attention influences long-context capability by shaping how quickly full attention develops its long-range retrieval behavior during training. As a concrete example, by tracing retrieval heads (Wu et al., [2025](https://arxiv.org/html/2606.15378#bib.bib34)), we find that retrieval heads form noticeably later in hybrid models equipped with larger SWA windows: once the local window already supplies sufficient context for next-token prediction, the gradient signal pushing full attention to learn long-range retrieval weakens. We term this phenomenon *Large-Window Laziness*.

#### Hybrid architecture designs beyond efficient attention\.

These findings suggest that hybrid architecture design should focus less on increasing the intrinsic capability of efficient attention and more on helping full attention learn long-range retrieval more effectively. From this perspective, we revisit several design choices beyond the efficient-attention module. As a simple but effective instance, we apply NoPE (Kazemnejad et al., [2023](https://arxiv.org/html/2606.15378#bib.bib36)) to the full-attention layers of a small-window SWA hybrid. This modification yields a clear long-context capability gain with negligible impact on short-context performance, which is reflected consistently in downstream benchmark evaluations.

Figure [1](https://arxiv.org/html/2606.15378#S1.F1) summarizes our main findings and their design implications. Taken together, our results reframe the role of efficient attention in hybrid architectures. The practical bottleneck for long-context capability is not simply how powerful the efficient-attention module is, but how it affects the emergence of long-range retrieval in full attention. This view explains the scaling patterns across hybrids and points to full attention as a key target for improving long-context hybrid models.

![Refer to caption](https://arxiv.org/html/2606.15378v1/x1.png)

Figure 1: Overview. Scaling: different efficient-attention designs yield distinct $\log(\mathrm{LongPPL})$ curves that converge to a similar level after sufficient training. Mechanism: long-range retrieval is primarily carried by full attention, while efficient attention acts as an *optimization prior*, where large-window SWA lags the most. Design: strengthening full attention itself (e.g., RoPE $\rightarrow$ NoPE in full attention) further improves long-context performance.

## 2 Related Work

#### Hybrid Attention Architectures.

Existing hybrid architectures mainly follow two lines. One uses SWA (Beltagy et al., [2020](https://arxiv.org/html/2606.15378#bib.bib20)) as efficient attention, where recent designs have moved toward smaller windows and sparser full-attention ratios with limited overall performance degradation (Agarwal et al., [2025](https://arxiv.org/html/2606.15378#bib.bib5); Huang et al., [2026](https://arxiv.org/html/2606.15378#bib.bib30)). The other employs recurrent sequence mixers that compress past history into a compact recurrent state, such as Lightning Attention (Qin et al., [2024](https://arxiv.org/html/2606.15378#bib.bib19)), Mamba-2 (Dao and Gu, [2024](https://arxiv.org/html/2606.15378#bib.bib15)), and Gated DeltaNet (Yang et al., [2025b](https://arxiv.org/html/2606.15378#bib.bib16)), which are increasingly adopted in recent models (Li et al., [2025](https://arxiv.org/html/2606.15378#bib.bib7); Blakeman et al., [2025](https://arxiv.org/html/2606.15378#bib.bib14); Cao et al., [2026](https://arxiv.org/html/2606.15378#bib.bib8); Team et al., [2026](https://arxiv.org/html/2606.15378#bib.bib69)). Beyond the choice of efficient-attention module, recent work also explores head-wise mixing (Dong et al., [2025](https://arxiv.org/html/2606.15378#bib.bib21); Xiao et al., [2025b](https://arxiv.org/html/2606.15378#bib.bib31)) and positional encoding for the full-attention layers (Yang et al., [2025a](https://arxiv.org/html/2606.15378#bib.bib32); Puvvada et al., [2025](https://arxiv.org/html/2606.15378#bib.bib33); Chen et al., [2026](https://arxiv.org/html/2606.15378#bib.bib70)). However, most of these studies present only final results or limited ablations within specific systems (Gemma Team, [2025](https://arxiv.org/html/2606.15378#bib.bib6); Xiao et al., [2026](https://arxiv.org/html/2606.15378#bib.bib9)), leaving a lack of controlled comparisons across efficient-attention architectures.

Several studies have begun to examine structural choices in hybrid architectures more systematically. Wang et al. ([2025](https://arxiv.org/html/2606.15378#bib.bib11)) compare multiple linear attention variants and mixing ratios, while Waleffe et al. ([2024](https://arxiv.org/html/2606.15378#bib.bib68)) and Bae et al. ([2025](https://arxiv.org/html/2606.15378#bib.bib12)) analyze layer composition and placement in Mamba-Transformer hybrids. Yet these studies remain within recurrent-mixer-based hybrids and lack a mechanistic explanation. We bridge this gap by comparing different efficient-attention designs under a controlled scaling-law setup and analyzing how they shape the long-context capability of hybrid architectures.

#### Scaling Laws and Long\-Context Evaluation\.

Scaling laws characterize how pretraining performance depends on model and data scale (Kaplan et al., [2020](https://arxiv.org/html/2606.15378#bib.bib17); Hoffmann et al., [2022](https://arxiv.org/html/2606.15378#bib.bib18)), with subsequent extensions to transfer learning (Hernandez et al., [2021](https://arxiv.org/html/2606.15378#bib.bib65)) and downstream capability prediction (Chen et al., [2024](https://arxiv.org/html/2606.15378#bib.bib67)). However, scaling laws for long-context capability remain underexplored. Existing long-context evaluations typically rely on discrete benchmarks such as RULER and LongBench (Hsieh et al., [2024](https://arxiv.org/html/2606.15378#bib.bib25); Bai et al., [2024](https://arxiv.org/html/2606.15378#bib.bib41)), which measure final performance but are less suitable for tracking pretraining dynamics. A complementary line of mechanistic studies shows that retrieval heads underlie long-context factual recall (Wu et al., [2025](https://arxiv.org/html/2606.15378#bib.bib34); Xiao et al., [2025a](https://arxiv.org/html/2606.15378#bib.bib73)) and tracks the formation of retrieval heads to observe how long-context capability develops during pretraining (Liang et al., [2026](https://arxiv.org/html/2606.15378#bib.bib13)), but such signals describe the mechanism rather than quantify capability. In contrast, LongPPL (Fang et al., [2025](https://arxiv.org/html/2606.15378#bib.bib10)) provides a continuous perplexity-style metric that correlates strongly with long-context benchmarks, and has since been adopted in recent long-context studies (Song et al., [2026](https://arxiv.org/html/2606.15378#bib.bib27); Willette et al., [2025](https://arxiv.org/html/2606.15378#bib.bib26)). We further leverage this metric to fit scaling laws for long-context performance, enabling a more comprehensive comparison of how long-context capability emerges across hybrid architectures.

## 3 Preliminaries

### 3.1 Hybrid Architecture

We cover two common forms of efficient attention: Sliding-Window Attention (SWA), where each token attends only to a finite local window, and recurrent sequence mixers, including Lightning Attention, Mamba-2, and Gated DeltaNet (GDN), which compress past tokens into a recurrent state through different decay strategies and update rules.

We use $q_t, k_t, v_t \in \mathbb{R}^{d_h}$ for the per-head query, key, and value vectors at position $t$ (with $d_k = d_v = d_h$ assumed for notational simplicity), and let $\mathrm{softmax}_s$ denote the softmax normalized over the index $s$. The formulas below present canonical forms of the mechanisms; implementation-level parameter choices used for matching sizes of different hybrid models are given in Appendix [B](https://arxiv.org/html/2606.15378#A2).

#### Full Attention\.

For each position $t$, the output $O_t$ is computed over all preceding positions $s \leq t$:

$$O_t = \sum_{s \leq t} \mathrm{softmax}_s\bigl(q_t^{\top} k_s / \sqrt{d_h}\bigr)\, v_s \tag{1}$$

#### Sliding Window Attention\.

SWA restricts the summation range to a window of size $w$:

$$O_t = \sum_{s \in [t-w+1,\, t]} \mathrm{softmax}_s\bigl(q_t^{\top} k_s / \sqrt{d_h}\bigr)\, v_s \tag{2}$$
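
Equations (1) and (2) can be sketched in plain Python as follows. This is a minimal single-head illustration over nested lists, not the paper's implementation; the function name `causal_attention` and the `window` argument are hypothetical names chosen for this example.

```python
import math


def causal_attention(q, k, v, window=None):
    """Single-head causal softmax attention over lists of vectors.

    q, k, v: lists of T vectors (lists of floats); q and k share dimension d_h.
    window=None reproduces full attention (Eq. 1); window=w restricts
    position t to keys s in [t - w + 1, t] (Eq. 2).
    Illustrative sketch only; names are assumptions, not the paper's code.
    """
    if window is not None and window < 1:
        raise ValueError("window must be at least 1")
    T, d_h = len(q), len(q[0])
    out = []
    for t in range(T):
        # First visible key position for query t.
        start = 0 if window is None else max(0, t - window + 1)
        scores = [
            sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d_h)
            for s in range(start, t + 1)
        ]
        # Numerically stable softmax over the visible keys.
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the visible value vectors.
        out.append([
            sum(w_s * v[start + i][j] for i, w_s in enumerate(weights))
            for j in range(len(v[0]))
        ])
    return out
```

With `window=1` each position attends only to itself, so the output equals `v`; any window at least as long as the sequence reproduces full attention.
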
The three recurrent mixers below all share the form $O_t = S_t q_t$ with a recurrent state $S_t \in \mathbb{R}^{d_h \times d_h}$; they differ mainly in how $S_t$ is updated.

#### Lightning Attention\.

Lightning Attention is a linear attention with a fixed per-head decay $\gamma \in (0,1)$:

$$S_t = \gamma S_{t-1} + v_t k_t^{\top}. \tag{3}$$

#### Mamba\-2\.

Following the structured state\-space duality \(SSD\) form, Mamba\-2 can be written as:

$$S_t = \gamma_t S_{t-1} + v_t k_t^{\top}. \tag{4}$$

The data-dependent $\gamma_t$ allows per-token control over how much of the past state is preserved.

#### Gated DeltaNet\.

GDN further adds controlled forgetting through a data-dependent decay $\alpha_t \in (0,1)$ and a data-dependent update strength $\beta_t \in (0,1)$:

$$S_t = \alpha_t S_{t-1}\left(I - \beta_t k_t k_t^{\top}\right) + \beta_t v_t k_t^{\top}. \tag{5}$$

Here, the delta-rule term removes the existing content associated with $k_t$ before writing the new association $v_t k_t^{\top}$.
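The three state updates in Eqs. (3) to (5) differ only in how the previous state is decayed and overwritten, which a small sketch can make concrete. The helper names below (`decay_step`, `gdn_step`, `read`) are hypothetical, and the code works on plain nested lists rather than the chunked kernels a real implementation would use.

```python
def outer(a, b):
    """Outer product a b^T as a nested list."""
    return [[ai * bj for bj in b] for ai in a]


def matvec(S, x):
    """Matrix-vector product S x."""
    return [sum(sij * xj for sij, xj in zip(row, x)) for row in S]


def decay_step(S, k, v, gamma):
    """Eq. 3 / Eq. 4: S_t = gamma * S_{t-1} + v_t k_t^T.

    Pass the same gamma every step for Lightning Attention, or a
    per-token gamma_t for the Mamba-2 (SSD) form.
    """
    vk = outer(v, k)
    return [[gamma * s + u for s, u in zip(rs, ru)] for rs, ru in zip(S, vk)]


def gdn_step(S, k, v, alpha, beta):
    """Eq. 5: S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T.

    Uses S (I - beta k k^T) = S - beta (S k) k^T to avoid forming I.
    """
    Sk = matvec(S, k)
    return [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


def read(S, q):
    """Output O_t = S_t q_t shared by all three mixers."""
    return matvec(S, q)
```

As a usage example, writing `[1.0, 2.0]` and then `[3.0, 4.0]` under the same unit-norm key `[1.0, 0.0]` with `alpha = beta = 1.0` leaves `read(S, k) == [3.0, 4.0]` for `gdn_step`, since the delta rule erases the old association, whereas `decay_step` with `gamma = 1.0` accumulates both and returns `[4.0, 6.0]`.
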

![Refer to caption](https://arxiv.org/html/2606.15378v1/x2.png)

Figure 2: Predicted $\mathrm{Loss}$ and $\log(\mathrm{LongPPL})$ at S5 scale ($N = 0.48\mathrm{B}$) across training tokens $D$. $\mathrm{Loss}$ curves of all hybrids closely overlap, whereas $\log(\mathrm{LongPPL})$ curves show large gaps in the low-data regime that shrink with more training. The insets verify extrapolation accuracy against the measured S5 checkpoints of *Full* and *SWA-128*.

### 3.2 Scaling Law

To compare hybrid architectures across model scales and training budgets, we use two fitting targets: validation $\mathrm{Loss}$ for short-context modeling and $\log(\mathrm{LongPPL})$ for long-context capability.

#### Loss\.

Validation loss is the standard target in language-model scaling laws (Kaplan et al., [2020](https://arxiv.org/html/2606.15378#bib.bib17); Hoffmann et al., [2022](https://arxiv.org/html/2606.15378#bib.bib18)). We select 40K held-out samples from C4 (Raffel et al., [2020](https://arxiv.org/html/2606.15378#bib.bib42)) that are disjoint from our training corpus, and report the average negative log-likelihood as $\mathrm{Loss}$.

#### LongPPL.

We adopt $\log(\mathrm{LongPPL})$ (Fang et al., [2025](https://arxiv.org/html/2606.15378#bib.bib10)) as the fitting target for long-context capability. Following its original implementation, we use GovReport (Huang et al., [2021](https://arxiv.org/html/2606.15378#bib.bib40)) as the evaluation corpus and Llama-3.1-8B (Grattafiori et al., [2024](https://arxiv.org/html/2606.15378#bib.bib28)) as the reference model. More details are provided in Appendix [A](https://arxiv.org/html/2606.15378#A1).

#### Scaling Law Formula\.

For both $\log(\mathrm{LongPPL})$ and $\mathrm{Loss}$, we model performance as a function of model parameters $N$ (w/o embeddings) and training tokens $D$ (Hoffmann et al., [2022](https://arxiv.org/html/2606.15378#bib.bib18)), using the separable power-law form as the fitting template:

$$L(N, D) = a N^{-\alpha} + b D^{-\beta} \tag{6}$$

where $a, b, \alpha, \beta$ are fitted separately for each architecture and fitting target.
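One simple way to fit Eq. (6) is sketched below. The fitting procedure is not specified in this section, so the approach here is an assumption: grid-search the exponents $\alpha, \beta$ and, for each pair, solve for $a, b$ in closed form by ordinary least squares. The function names `fit_scaling_law` and `predict` are hypothetical.

```python
def predict(n, d, a, b, alpha, beta):
    """Evaluate Eq. 6: L(N, D) = a * N^-alpha + b * D^-beta."""
    return a * n ** -alpha + b * d ** -beta


def fit_scaling_law(points, alphas, betas):
    """Fit Eq. 6 to (N, D, L) triples.

    Assumed procedure (not taken from the paper): grid-search the exponents
    and solve the 2x2 normal equations for a and b at each grid point.
    Returns (a, b, alpha, beta) with the smallest sum of squared errors.
    """
    targets = [l for _, _, l in points]
    best = None
    for alpha in alphas:
        for beta in betas:
            x = [n ** -alpha for n, _, _ in points]
            y = [d ** -beta for _, d, _ in points]
            sxx = sum(u * u for u in x)
            syy = sum(w * w for w in y)
            sxy = sum(u * w for u, w in zip(x, y))
            sxl = sum(u * l for u, l in zip(x, targets))
            syl = sum(w * l for w, l in zip(y, targets))
            det = sxx * syy - sxy * sxy
            if det == 0.0:
                continue  # degenerate design; skip this exponent pair
            a = (sxl * syy - syl * sxy) / det
            b = (syl * sxx - sxl * sxy) / det
            sse = sum((a * u + b * w - l) ** 2
                      for u, w, l in zip(x, y, targets))
            if best is None or sse < best[0]:
                best = (sse, a, b, alpha, beta)
    if best is None:
        raise ValueError("no valid exponent pair on the grid")
    return best[1:]
```

Once fitted on smaller scales, `predict` can be evaluated at a larger $N$ to extrapolate, which mirrors how held-out scales are used to check a fit. A gradient-based or Huber-loss fit would be an equally reasonable choice.
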

## 4 Scaling Behavior of Short- and Long-Context Capabilities

To answer RQ1, we fit scaling laws for validation $\mathrm{Loss}$ and $\log(\mathrm{LongPPL})$ to compare short-context and long-context capabilities across hybrid architectures with different efficient-attention designs.

### 4.1 Settings

#### Model architecture\.

We compare a full-attention Transformer baseline, denoted as *Full*, with six layer-wise hybrid architectures that differ in efficient-attention components. Three hybrids use SWA with window sizes of 128, 512, and 2048, denoted as *SWA-128*, *SWA-512*, and *SWA-2048*. The other three use recurrent sequence mixers, denoted as *Lightning*, *Mamba-2*, and *GDN*. All hybrid models alternate full-attention and efficient-attention layers with a $1{:}1$ ratio.

Table 1: Key hyperparameters of the *Full* model for S1–S5.

| Configuration | S1 | S2 | S3 | S4 | S5 |
| --- | --- | --- | --- | --- | --- |
| Params (w/o embed.) | 15M | 31M | 65M | 104M | 477M |
| Total Params | 71M | 107M | 159M | 217M | 665M |
| Layers | 10 | 12 | 16 | 18 | 30 |
| Hidden dim | 384 | 512 | 640 | 768 | 1280 |
| FFN dim | 960 | 1280 | 1600 | 1920 | 3200 |
| Heads (Q) | 6 | 8 | 10 | 12 | 20 |
| Heads (KV) | 2 | 2 | 2 | 2 | 2 |
| Head dim | 64 | 64 | 64 | 64 | 64 |

#### Scaling setup\.

The scaling study covers five model sizes, S1–S5, with the hyperparameters of the *Full* configuration summarized in Table [1](https://arxiv.org/html/2606.15378#S4.T1). For the main scaling analysis, we evaluate S1–S4 checkpoints trained with six token budgets, $D \in \{100N, 200N, 300N, 400N, 500N, 1000N\}$, across all architectures, where $N$ is the number of parameters without embeddings. For the S5 scale ($N = 0.48\mathrm{B}$ excluding embeddings; total parameters $0.66\mathrm{B}$), we train *Full* and *SWA-128* at $D = 100N$ and $D = 200N$ for larger-scale extrapolation checks.

All models are pretrained with a 16K context length on a $1{:}1$ mixture of long and short datasets, which allows us to simultaneously measure short- and long-context capabilities. More training settings are given in Appendix [C](https://arxiv.org/html/2606.15378#A3).
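To make the token budgets in the scaling setup concrete, the short sketch below turns the non-embedding parameter counts from Table 1 into absolute token counts. The dictionary and function names are hypothetical, and the parameter counts are the rounded values reported in the table, so the resulting budgets are approximate.

```python
# Non-embedding parameter counts from Table 1 (rounded as reported).
PARAMS_NO_EMBED = {
    "S1": 15_000_000,
    "S2": 31_000_000,
    "S3": 65_000_000,
    "S4": 104_000_000,
    "S5": 477_000_000,
}

# Token-budget multipliers used for S1-S4: D = m * N.
MULTIPLIERS = (100, 200, 300, 400, 500, 1000)


def token_budgets(n_params, multipliers=MULTIPLIERS):
    """Return the training-token budgets D = m * N for each multiplier m."""
    return [m * n_params for m in multipliers]


def count_checkpoints(sizes, multipliers=MULTIPLIERS):
    """Number of (size, budget) checkpoints per architecture."""
    return sum(len(token_budgets(PARAMS_NO_EMBED[s], multipliers)) for s in sizes)
```

For example, `token_budgets(PARAMS_NO_EMBED["S4"])[-1]` gives roughly 104 billion tokens for S4 at $D = 1000N$, and `count_checkpoints(("S1", "S2", "S3"))` gives 18 checkpoints per architecture across the three smallest sizes.
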

### 4.2 Scaling Law of Validation $\mathbf{Loss}$

We fit the scaling law for validation $\mathrm{Loss}$ using 18 data points from S1–S3, and hold out the 6 data points from S4 as a verification set. As shown in Figure [9](https://arxiv.org/html/2606.15378#A0.F9), all seven architectures are well captured by the scaling law, achieving high $R^2$ on both the fitting and verification sets.

To compare architectures under matched scaling conditions, we examine the predicted $\mathrm{Loss}$ at the S5 scale ($N = 0.48\mathrm{B}$) across training tokens $D$, and include the measured S5 $\mathrm{Loss}$ to assess the extrapolation accuracy of the fitted curves.

As shown in the left panel of Figure [2](https://arxiv.org/html/2606.15378#S3.F2), the validation $\mathrm{Loss}$ curves of all hybrid models closely overlap with *Full* across the full range of $D$. This indicates that efficient-attention design has limited impact on short-context capability.

### 4.3 Scaling Law of $\log(\mathbf{LongPPL})$

We fit the scaling law for $\log(\mathrm{LongPPL})$ following the same protocol as for validation $\mathrm{Loss}$, except that we exclude the S1 checkpoint at $D = 100N$ because its training budget is too small for stable long-context behavior. As shown in Figure [10](https://arxiv.org/html/2606.15378#A0.F10), although $\log(\mathrm{LongPPL})$ is noisier at early checkpoints, it is still smoothly captured by Eq. ([6](https://arxiv.org/html/2606.15378#S3.E6)).

In contrast to the strong overlap observed for $\mathrm{Loss}$, the predicted $\log(\mathrm{LongPPL})$ reveals much larger architectural differences. We compare architectures under the same setting as above and include the measured S5 $\log(\mathrm{LongPPL})$ values to assess extrapolation accuracy.

As shown in the right panel of Figure [2](https://arxiv.org/html/2606.15378#S3.F2), a clear pattern emerges: architectural differences are most pronounced in early training, corresponding to the low-data regime, where large-window SWA, especially *SWA-2048*, exhibits substantially higher $\log(\mathrm{LongPPL})$. As the training budget grows, this gap rapidly shrinks, and the hybrid models with different efficient-attention designs eventually converge to levels similar to *Full*.

Taken together, the $\mathrm{Loss}$ and $\log(\mathrm{LongPPL})$ scaling results reveal a clear separation between final capability and training dynamics: efficient-attention design has limited effect on the eventual short- and long-context capabilities of hybrid models, but strongly shapes how quickly long-context capability emerges.

## 5 Mechanism: How Efficient Attention Shapes Long-Context Capability

The key observation in Section [4](https://arxiv.org/html/2606.15378#S4) naturally motivates *RQ2: How does efficient-attention design influence long-context performance?* In this section, we conduct a series of mechanistic experiments that dissect the role of efficient attention in long-context modeling; full implementation details and extended analyses can be found in Appendix [D](https://arxiv.org/html/2606.15378#A4).

### 5.1 The Dominant Role of Full Attention

A natural hypothesis is that efficient attention with a larger receptive field, especially recurrent sequence mixers whose receptive field is in principle unbounded, should help improve the long-context capability of hybrid models. However, the scaling pattern does not support this: different hybrid models converge to similar $\log(\mathrm{LongPPL})$. To examine where long-context capability actually arises, we conduct a receptive-field constraint experiment and a layer-wise probing experiment.

#### Receptive\-field constraint\.

For the S4 models trained with $D = 1000N$ in the scaling experiments, we separately restrict the accessible receptive field of efficient attention and of full attention to $\approx 2048$ tokens at inference time, and measure the resulting change in $\log(\mathrm{LongPPL})$. As shown in Figure [3](https://arxiv.org/html/2606.15378#S5.F3), when the receptive field of full attention is restricted, $\log(\mathrm{LongPPL})$ increases sharply across all hybrid models. In contrast, restricting the receptive field of efficient attention causes only minor changes. This indicates that, in our setting, even recurrent sequence mixers with in-principle unbounded receptive fields and carefully designed update rules, such as GDN, store little long-range information in their recurrent states during inference.

![Refer to caption](https://arxiv.org/html/2606.15378v1/x3.png)

Figure 3: Inference-time receptive-field restriction for S4/$1000N$ hybrids. Restricting efficient attention to $\approx 2048$ tokens leaves $\log(\mathrm{LongPPL})$ nearly unchanged, while restricting full attention raises it sharply.

#### Probing Experiment.

To examine how long-range information emerges across layers, we conduct a layer-wise probing experiment (Belinkov, [2022](https://arxiv.org/html/2606.15378#bib.bib39)) on a Needle-in-a-Haystack (NIAH) (Hsieh et al., [2024](https://arxiv.org/html/2606.15378#bib.bib25)) classification task. For each layer, we extract the hidden state of the final query token and train a logistic-regression classifier to predict the inserted needle. By taking the difference in probing accuracy between each layer and its predecessor, we estimate how much long-range information is introduced by each layer. Details are provided in Appendix [D.2](https://arxiv.org/html/2606.15378#A4.SS2).

![Refer to caption](https://arxiv.org/html/2606.15378v1/x4.png)

Figure 4: Layer-wise probing accuracy gain on NIAH for the S4/$1000N$ models. Cells show incremental accuracy over the previous layer. In all hybrids, gains concentrate at middle full-attention layers (odd-numbered).

![Refer to caption](https://arxiv.org/html/2606.15378v1/x5.png)
(a) Gradient influence over distance.

![Refer to caption](https://arxiv.org/html/2606.15378v1/x6.png)
(b) Retrieval-head training trajectories.

Figure 5: Evidence for Large-Window Laziness. (a) Beyond $2048$ tokens, $G(d)$ decays to a flat baseline, while the 512–2048 range still carries substantial signal. (b) *SWA-2048* is the outlier: its retrieval-head attention entropy $H(t)$ stays high and its Q/K weight distance $d^{\mathrm{QK}}(t)$ shrinks more slowly, indicating under-trained retrieval.

Figure [4](https://arxiv.org/html/2606.15378#S5.F4) shows that, in layer-wise hybrids, probing accuracy increases almost exclusively at middle full-attention layers *(odd-numbered)*, while middle efficient-attention layers *(even-numbered)* contribute little gain or even reduce accuracy. In contrast, *Full* shows continuous growth across middle layers. This supports the view that long-range information in hybrids is mainly introduced and processed by full attention.

The receptive-field constraint and probing experiments suggest that long-context capability in hybrid architectures primarily relies on full attention rather than efficient-attention modules. This also helps explain the scaling behavior observed in Section [4.3](https://arxiv.org/html/2606.15378#S4.SS3): architectural gaps in $\log(\mathrm{LongPPL})$ shrink after sufficient training because all hybrid models share the same full-attention design as *Full*.

### 5.2 Efficient Attention as an Optimization Prior of Long-Context Capability

The scaling experiments show that different efficient-attention designs substantially affect the convergence speed of $\log(\mathrm{LongPPL})$. Since long-context capability is primarily carried by full attention, we argue that these differences arise because efficient attention affects how fast full attention learns long-range retrieval. This effect is especially clear in large-window SWA hybrids, which we refer to as *Large-Window Laziness*.

Concretely, a large local window can already cover many useful dependencies during training\. As a result, the model can often predict the next token using information within the sliding window, without relying on full attention to retrieve from farther positions\. This weakens the optimization pressure for full attention to develop long\-range retrieval ability, causing this ability to emerge more slowly\. In contrast, SWA with smaller windows leaves more dependencies outside the window range, forcing the model to access them through full attention and thereby providing a denser signal for long\-range retrieval\. We provide two pieces of evidence consistent with this mechanism\.

#### Gradient Influence Profiling\.

To estimate how next\-token\-prediction signal decays with distancedd, we use Llama\-3\.1\-8B to measure the gradient influenceG​\(d\)G\(d\)\(Liet al\.,[2016](https://arxiv.org/html/2606.15378#bib.bib43)\)on long documents sampled from the pretraining corpus\. This proxy assumes that the natural long\-range dependency distribution is largely model\-agnostic, so the profile approximates the dependency signal seen during hybrid\-model pretraining\. For an input sequencex1:Tx\_\{1:T\}, we defineG​\(d\)G\(d\)as

$$G(d)=\mathbb{E}_{x}\left[\left\|\frac{\partial s(x)}{\partial e_{T-d}}\right\|_{2}\right],$$

where $e_{T-d}$ is the embedding of the token at distance $d$, and $s(x)$ denotes the logit used for prediction. This quantity measures how sensitive the model's prediction is to each historical token, and thus serves as a proxy for distance-dependent signal strength. As shown in Figure [5(a)](https://arxiv.org/html/2606.15378#S5.F5.sf1), the signal beyond 2048 tokens decays to a flat baseline, while the 512–2048 range still contains substantial gradient signal. This suggests that a 2048-token window already captures most useful training signal, whereas sub-512 windows leave substantial signal outside the window, thereby imposing stronger pressure on full attention to learn retrieval. This is consistent with *Large-Window Laziness*.
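As a rough illustration of how $G(d)$ can be estimated, the sketch below averages the L2 norm of the gradient of a scalar score with respect to the embedding at distance $d$. Everything model-specific here is an assumption: `toy_logit` is a hypothetical stand-in for $s(x)$ with a built-in exponential decay (it is not Llama-3.1-8B), and the gradient is taken by central finite differences, whereas a real measurement would use automatic differentiation on the actual model.

```python
import math
import random


def toy_logit(embeddings, decay=0.5):
    """Hypothetical stand-in for s(x): older tokens contribute less."""
    last = len(embeddings) - 1
    total = 0.0
    for pos, e in enumerate(embeddings):
        dist = last - pos  # distance from the final position
        total += (decay ** dist) * math.tanh(sum(e))
    return total


def grad_norm_at_distance(score_fn, embeddings, d, eps=1e-5):
    """L2 norm of ds/de_{T-d}, estimated by central finite differences."""
    idx = len(embeddings) - 1 - d  # 0-indexed position of the token at distance d
    sq = 0.0
    for k in range(len(embeddings[idx])):
        plus = [list(e) for e in embeddings]
        minus = [list(e) for e in embeddings]
        plus[idx][k] += eps
        minus[idx][k] -= eps
        g = (score_fn(plus) - score_fn(minus)) / (2 * eps)
        sq += g * g
    return math.sqrt(sq)


def gradient_influence(score_fn, sequences, d):
    """Monte Carlo estimate of G(d) = E_x[ ||ds(x)/de_{T-d}||_2 ]."""
    norms = [grad_norm_at_distance(score_fn, x, d) for x in sequences]
    return sum(norms) / len(norms)


# Random toy "documents": 16 sequences of 8 tokens with 4-dim embeddings.
random.seed(0)
sequences = [
    [[random.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(8)]
    for _ in range(16)
]
profile = [gradient_influence(toy_logit, sequences, d) for d in range(8)]
print(profile)
```

Because the toy score decays geometrically with distance, the resulting profile decreases monotonically in `d`; with a real model the shape of the profile is the empirical quantity of interest.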

#### Retrieval\-Head Tracing\.

We use retrieval heads (Wu et al., [2025](https://arxiv.org/html/2606.15378#bib.bib34)) as the unit of analysis: we densely save intermediate checkpoints before the S4 models reach $D=200N$ tokens, identify retrieval heads in the final checkpoint, and track two diagnostics at each intermediate checkpoint $t$.

(i) $H(t)$, the normalized attention entropy when retrieving the needle token in the NIAH task:

$$H(t)=-\frac{1}{\log|\mathcal{V}_{q}|}\sum_{j\in\mathcal{V}_{q}}a^{(t)}_{qj}\log a^{(t)}_{qj},$$

where $a^{(t)}_{qj}$ is the attention weight from query $q$ to visible key $j$ at checkpoint $t$, and $\mathcal{V}_{q}$ is the visible-key set. Lower $H(t)$ indicates sharper retrieval.

(ii) $d^{\mathrm{QK}}(t)$, the relative parameter distance from checkpoint $t$ to the final checkpoint:

$$d^{\mathrm{QK}}(t)=\sum_{W\in\{W_{Q},W_{K}\}}\frac{\|W^{(t)}-W^{(t_{\mathrm{end}})}\|_{F}}{\|W^{(t_{\mathrm{end}})}\|_{F}},$$

where $W_{Q}$ and $W_{K}$ are the query and key projection matrices of the identified retrieval head, $\|\cdot\|_{F}$ denotes the Frobenius norm, and $t_{\mathrm{end}}$ indexes the $D=200N$ checkpoint. We report the mean of both diagnostics over the Top-2 retrieval heads.
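Both diagnostics are simple to compute once the attention weights and projection matrices of a head are available. The following is a minimal sketch of the two formulas using plain Python lists; the function names, the dictionary keys `"W_Q"` / `"W_K"`, and the toy inputs are illustrative assumptions rather than the authors' code.

```python
import math


def normalized_attention_entropy(weights):
    """H = -(1 / log|V|) * sum_j a_j log a_j over the visible keys."""
    n = len(weights)
    if n < 2:
        raise ValueError("need at least two visible keys")
    # Terms with a_j == 0 contribute 0 (the limit of a log a as a -> 0).
    entropy = -sum(a * math.log(a) for a in weights if a > 0.0)
    return entropy / math.log(n)


def frobenius_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def relative_qk_distance(ckpt, final):
    """Sum over W in {W_Q, W_K} of ||W_t - W_end||_F / ||W_end||_F."""
    total = 0.0
    for name in ("W_Q", "W_K"):
        diff = [
            [a - b for a, b in zip(row_t, row_end)]
            for row_t, row_end in zip(ckpt[name], final[name])
        ]
        total += frobenius_norm(diff) / frobenius_norm(final[name])
    return total


# Toy inputs: uniform attention is maximally diffuse, one-hot is maximally sharp.
h_uniform = normalized_attention_entropy([0.25, 0.25, 0.25, 0.25])
h_sharp = normalized_attention_entropy([1.0, 0.0, 0.0, 0.0])

final = {"W_Q": [[1.0, 0.0], [0.0, 1.0]], "W_K": [[2.0, 0.0], [0.0, 2.0]]}
early = {"W_Q": [[0.0, 0.0], [0.0, 0.0]], "W_K": [[2.0, 0.0], [0.0, 2.0]]}
d_early = relative_qk_distance(early, final)
d_final = relative_qk_distance(final, final)
print(h_uniform, h_sharp, d_early, d_final)
```

Dividing by $\log|\mathcal{V}_{q}|$ keeps $H(t)$ in $[0, 1]$ regardless of how many keys are visible, so heads evaluated at different context lengths remain comparable. The distance reaches zero by construction at the final checkpoint.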

Figure [5(b)](https://arxiv.org/html/2606.15378#S5.F5.sf2) shows that *SWA-2048* follows a clearly different pattern from the other models: its normalized attention entropy remains high, and its retrieval-head weights converge more slowly, indicating that its retrieval heads remain under-trained. By contrast, retrieval heads train faster in smaller-window SWA and recurrent efficient-attention hybrids, consistent with the need for full attention to access information beyond what the efficient-attention module can reliably provide. We provide additional analyses of retrieval-head formation from complementary perspectives in Appendix [D.4](https://arxiv.org/html/2606.15378#A4.SS4), all of which lead to consistent conclusions.

Together, these analyses yield a unified mechanistic answer to *RQ2*: efficient attention primarily shapes how efficiently full attention learns long-range retrieval, rather than carrying long-range information directly.

## 6 Hybrid Architecture Design Beyond Efficient Attention

Table 2: Downstream evaluation of *Full*, *SWA-128*, and *SWA-128-NoPE* at S4 ($0.22\mathrm{B}$) and S5 ($0.66\mathrm{B}$). RULER-NIAH is the average over the 8 NIAH-style sub-tasks in RULER; ShortAvg is the average over 19 short-context benchmarks, evaluated with the 16K models. **Bold** marks the best within each model scale. Full per-task results are reported in Appendix [E](https://arxiv.org/html/2606.15378#A5).

| Setting | Model | ShortAvg | RULER (16K) | RULER-NIAH (16K) | LongBench (16K) | RULER (32K) | RULER-NIAH (32K) | LongBench (32K) |
|---|---|---|---|---|---|---|---|---|
| S4 ($0.22\mathrm{B}$), $D\approx 100\mathrm{B}$ | Full | **38.13** | 25.09 | 35.95 | 15.09 | – | – | – |
| | SWA-128 | 38.03 | 35.33 | 49.58 | 15.88 | – | – | – |
| | SWA-128-NoPE | 37.88 | **44.80** | **67.81** | **16.43** | – | – | – |
| S5 ($0.66\mathrm{B}$), $D\approx 100\mathrm{B}$ | Full | 40.46 | 47.17 | 67.14 | 18.44 | 43.90 | 62.61 | 18.93 |
| | SWA-128 | 41.31 | 46.13 | 65.91 | 17.52 | 41.86 | 60.17 | 18.30 |
| | SWA-128-NoPE | **41.32** | **52.88** | **82.31** | **19.02** | **46.98** | **70.42** | **19.46** |

The mechanism above motivates us to revisit hybrid architecture design, raising *RQ3: What design principles lead to more effective hybrid architectures?* We move beyond efficient attention and examine several other design factors through scaling-law fits and downstream benchmark evaluation.

### 6.1 Full-to-Efficient Layer Ratio

![Refer to caption](https://arxiv.org/html/2606.15378v1/x7.png) Figure 6: SWA-128 (1:1) vs. SWA-128 (1:3).

![Refer to caption](https://arxiv.org/html/2606.15378v1/x8.png) Figure 7: SWA-128 vs. SWA-128-Headwise.

![Refer to caption](https://arxiv.org/html/2606.15378v1/x9.png) Figure 8: SWA-128 vs. SWA-128-NoPE.

We compare the 1:1 SWA-128 setting used in our main experiments with a sparser 1:3 variant, and fit their validation $\mathrm{Loss}$ and $\log(\mathrm{LongPPL})$ scaling curves. As shown in Figure [6](https://arxiv.org/html/2606.15378#S6.F6), the 1:3 ratio gives almost the same validation $\mathrm{Loss}$ as the 1:1 ratio. For $\log(\mathrm{LongPPL})$, however, the sparser model performs worse at small scales, likely because the number of full-attention layers is too limited. As model size increases, this gap closes, suggesting that full-attention density can be safely reduced once enough full-attention layers are available.
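To make the two ratios concrete, the sketch below lays out a layer stack for a full-to-efficient ratio of 1:k. The placement rule (each group of k efficient layers followed by one full-attention layer), the 12-layer depth, and the name `hybrid_layer_pattern` are assumptions for illustration; this section does not state where the full-attention layers sit.

```python
def hybrid_layer_pattern(num_layers, efficient_per_full):
    """Label each layer 'full' or 'efficient' for a full:efficient ratio of 1:k.

    Assumption: every group of k efficient layers is followed by one full layer.
    """
    if num_layers <= 0 or efficient_per_full < 1:
        raise ValueError("num_layers must be > 0 and efficient_per_full >= 1")
    period = efficient_per_full + 1
    return [
        "full" if (i + 1) % period == 0 else "efficient"
        for i in range(num_layers)
    ]


# Hypothetical 12-layer model under the two ratios compared in this section.
pattern_1_1 = hybrid_layer_pattern(12, efficient_per_full=1)
pattern_1_3 = hybrid_layer_pattern(12, efficient_per_full=3)
print(pattern_1_1.count("full"), pattern_1_3.count("full"))
```

With 12 layers, the 1:1 layout contains six full-attention layers and the 1:3 layout only three, which illustrates why a small model at the sparser ratio may have too few full-attention layers to carry retrieval.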

### 6.2 Layer-wise vs. Head-wise

Another design choice is whether to place full attention in dedicated layers or distribute it across heads within each layer, as in recent head-wise or intra-layer hybrid designs (Bae et al., [2025](https://arxiv.org/html/2606.15378#bib.bib12)). To examine this factor, we compare the layer-wise SWA-128 model with a head-wise variant, SWA-128-Headwise. As shown in Figure [7](https://arxiv.org/html/2606.15378#S6.F7), under our setting, head-wise mixing does not provide an advantage over layer-wise mixing. Specifically, the two methods reach similar validation $\mathrm{Loss}$ and $\log(\mathrm{LongPPL})$ after sufficient training, yet the head-wise variant shows slower $\log(\mathrm{LongPPL})$ convergence.

### 6.3 Positional Encoding of Full Attention

Recent studies show that applying NoPE to full-attention layers can effectively enhance their long-range retrieval capability (Yang et al., [2025a](https://arxiv.org/html/2606.15378#bib.bib32); Puvvada et al., [2025](https://arxiv.org/html/2606.15378#bib.bib33)). We take *SWA-128* as the base since it activates full-attention retrieval well, and apply NoPE to its full-attention layers, denoted as *SWA-128-NoPE*. As shown in Figure [8](https://arxiv.org/html/2606.15378#S6.F8), this change substantially decreases $\log(\mathrm{LongPPL})$ while leaving validation $\mathrm{Loss}$ nearly unchanged.
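The difference between the two settings can be seen on a single query-key pair. The sketch below assumes that the default positional encoding is rotary (RoPE), which this section does not state explicitly. The functions `rope` and `attention_logit` are illustrative, not the authors' implementation. With NoPE the rotation is skipped, so the attention logit depends only on content, whereas with RoPE it also varies with the relative distance between query and key.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE).

    Assumes len(vec) is even.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def attention_logit(q, k, q_pos, k_pos, use_rope):
    """Unscaled q.k logit; NoPE simply skips the rotation."""
    if use_rope:
        q, k = rope(q, q_pos), rope(k, k_pos)
    return sum(a * b for a, b in zip(q, k))


q = [1.0, 0.0, 1.0, 0.0]
k = [1.0, 0.0, 1.0, 0.0]

nope_near = attention_logit(q, k, q_pos=5, k_pos=5, use_rope=False)
nope_far = attention_logit(q, k, q_pos=105, k_pos=5, use_rope=False)
rope_near = attention_logit(q, k, q_pos=5, k_pos=5, use_rope=True)
rope_far = attention_logit(q, k, q_pos=105, k_pos=5, use_rope=True)
rope_shifted = attention_logit(q, k, q_pos=205, k_pos=105, use_rope=True)
print(nope_near, nope_far, rope_near, rope_far, rope_shifted)
```

In this toy case the NoPE logit is identical for the near and far key, while the RoPE logit changes with the query-key offset and is the same for any two pairs that share that offset.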

Following the training protocol of the scaling experiments, we train *Full*, *SWA-128*, and *SWA-128-NoPE* at S4 ($0.22\mathrm{B}$) and S5 ($0.66\mathrm{B}$) for $\approx 100\mathrm{B}$ tokens. To further evaluate in longer contexts, we continue training the S5 checkpoints for an additional $5\mathrm{B}$ tokens at a 32K sequence length. For long-context evaluation, we use RULER (Hsieh et al., [2024](https://arxiv.org/html/2606.15378#bib.bib25)) and LongBench (Bai et al., [2024](https://arxiv.org/html/2606.15378#bib.bib41)); for short-context evaluation, we report the average over 19 benchmarks. As shown in Table [2](https://arxiv.org/html/2606.15378#S6.T2), *SWA-128-NoPE* consistently leads on long-context benchmarks at both scales while remaining comparable on short-context tasks.

The design studies suggest that hybrid architecture design should move beyond simply choosing a stronger efficient-attention component and instead prioritize choices that better activate or directly strengthen full attention, allowing its long-range retrieval capability to emerge more efficiently.

## 7 Conclusion

Through scaling-law fits and mechanistic analysis, we find that the long-context performance of hybrid models is primarily determined by full attention, while efficient attention, acting as an *optimization prior*, indirectly shapes it by modulating how quickly full attention learns long-range retrieval. This suggests that, under limited training budgets, hybrid design should favor choices that more effectively activate and strengthen the long-context capability of full attention, such as small-window SWA and NoPE, both validated in our experiments.

## Limitations

Although our experiments cover multiple model scales and verify the fitted scaling laws via extrapolation, the largest model we train is still at the sub-billion-parameter level with at most $\approx 100\mathrm{B}$ pretraining tokens, which is smaller than the scale of frontier industrial systems. We also pretrain directly at a 16K context length and extend to at most 32K, in contrast to the prevailing recipe that pretrains on short context first and subsequently extends to long context. These choices may limit the applicability of our conclusions to larger-scale or differently trained settings.

For efficient-attention designs, we cover representative operators widely adopted in recent hybrid architectures, while leaving out some other popular variants such as RWKV-7 (Peng et al., [2025](https://arxiv.org/html/2606.15378#bib.bib44)) and Kimi-Linear (Team et al., [2025](https://arxiv.org/html/2606.15378#bib.bib71)). In addition, the design choices discussed in Section [6](https://arxiv.org/html/2606.15378#S6) are intended to validate our mechanistic conclusions rather than to serve as a full design study, and a more comprehensive verification at larger scales is left to future work.

## References

- S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao,et al\.\(2025\)Gpt\-oss\-120b & gpt\-oss\-20b model card\.arXiv preprint arXiv:2508\.10925\.Cited by:[§B\.1](https://arxiv.org/html/2606.15378#A2.SS1.p1.1),[§1](https://arxiv.org/html/2606.15378#S1.p1.1),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Bae, B\. Acun, C\. Lin, H\. Habeeb, S\. Kim, L\. Luo, J\. Wang, and C\. Wu \(2025\)Hybrid architectures for language models: systematic analysis and design insights\.arXiv preprint arXiv:2510\.04800\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.p2.1),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p2.1),[§6\.2](https://arxiv.org/html/2606.15378#S6.SS2.p1.3)\.
- Y\. Bai, X\. Lv, J\. Zhang, H\. Lyu, J\. Tang, Z\. Huang, Z\. Du, X\. Liu, A\. Zeng, L\. Hou,et al\.\(2024\)Longbench: a bilingual, multitask benchmark for long context understanding\.InProceedings of the 62nd annual meeting of the association for computational linguistics \(volume 1: Long papers\),pp\. 3119–3137\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px2.p1.1),[§6\.3](https://arxiv.org/html/2606.15378#S6.SS3.p2.4)\.
- Y\. Belinkov \(2022\)Probing classifiers: promises, shortcomings, and advances\.Computational Linguistics48\(1\),pp\. 207–219\.Cited by:[§5\.1](https://arxiv.org/html/2606.15378#S5.SS1.SSS0.Px2.p1.1)\.
- I\. Beltagy, M\. E\. Peters, and A\. Cohan \(2020\)Longformer: the long\-document transformer\.arXiv preprint arXiv:2004\.05150\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.p1.1),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Bisk, R\. Zellers, R\. Le Bras, J\. Gao, and Y\. Choi \(2020\)PIQA: reasoning about physical commonsense in natural language\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.34,pp\. 7432–7439\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- A\. Blakeman, A\. Grattafiori, A\. Basant, A\. Gupta, A\. Khattar, A\. Renduchintala, A\. Vavre, A\. Shukla, A\. Bercovich, A\. Ficek,et al\.\(2025\)NVIDIA nemotron 3: efficient and open intelligence\.arXiv preprint arXiv:2512\.20856\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Cao, M\. Chen, J\. Chen, Z\. Cui, Y\. Feng, B\. Hui, Y\. Jing, K\. Li, M\. Li, J\. Lin,et al\.\(2026\)Qwen3\-coder\-next technical report\.arXiv preprint arXiv:2603\.00729\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.p1.1),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Chen, B\. Huang, Y\. Gao, Z\. Wang, J\. Yang, and H\. Ji \(2024\)Scaling laws for predicting downstream performance in llms\.Transactions on Machine Learning Research\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Chen, Z\. L\. Thai, Z\. Zhou, Z\. Zhang, X\. Shen, S\. Wang, C\. Xiao, X\. Han, and Z\. Liu \(2026\)Hybrid linear attention done right: efficient distillation and effective architectures for extremely long contexts\.arXiv preprint arXiv:2601\.22156\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\)Think you have solved question answering? Try ARC, the AI2 reasoning challenge\.arXiv preprint arXiv:1803\.05457\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- T\. Dao and A\. Gu \(2024\)Transformers are ssms: generalized models and efficient algorithms through structured state space duality\.InInternational Conference on Machine Learning,pp\. 10041–10071\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- M\. De Marneffe, M\. Simons, and J\. Tonhauser \(2019\)The CommitmentBank: investigating projection in naturally occurring discourse\.InProceedings of Sinn und Bedeutung,Vol\.23,pp\. 107–124\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- DeepSeek\-AI \(2026\)DeepSeek\-v4: towards highly efficient million\-token context intelligence\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.p1.1)\.
- X\. Dong, Y\. Fu, S\. Diao, W\. Byeon, Z\. Chen, A\. Mahabaleshwarkar, S\. Liu, M\. Chen, Y\. Suhara, Y\. C\. Lin,et al\.\(2025\)Hymba: a hybrid\-head architecture for small language models\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Fang, Y\. Wang, Z\. Liu, C\. Zhang, S\. Jegelka, J\. Gao, B\. Ding, and Y\. Wang \(2025\)What is wrong with perplexity for long\-context language modeling?\.InInternational Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.15378#A1.p1.1),[§1](https://arxiv.org/html/2606.15378#S1.SS0.SSS0.Px1.p1.4),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.15378#S3.SS2.SSS0.Px2.p1.1)\.
- Gemma Team \(2025\)Gemma 3 technical report\.External Links:2503\.19786,[Link](https://arxiv.org/abs/2503.19786)Cited by:[§1](https://arxiv.org/html/2606.15378#S1.p1.1),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[Appendix A](https://arxiv.org/html/2606.15378#A1.p1.1),[§3\.2](https://arxiv.org/html/2606.15378#S3.SS2.SSS0.Px2.p1.1)\.
- A\. Gu and T\. Dao \(2023\)Mamba: linear\-time sequence modeling with selective state spaces\.arXiv preprint arXiv:2312\.00752\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.InInternational Conference on Learning Representations,Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- D\. Hernandez, J\. Kaplan, T\. Henighan, and S\. McCandlish \(2021\)Scaling laws for transfer\.arXiv preprint arXiv:2102\.01293\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. de Las Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark,et al\.\(2022\)Training compute\-optimal large language models\.InProceedings of the 36th International Conference on Neural Information Processing Systems,pp\. 30016–30030\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.SS0.SSS0.Px1.p1.4),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.15378#S3.SS2.SSS0.Px1.p1.2),[§3\.2](https://arxiv.org/html/2606.15378#S3.SS2.SSS0.Px3.p1.4)\.
- C\. Hsieh, S\. Sun, S\. Kriman, S\. Acharya, D\. Rekesh, F\. Jia, Y\. Zhang, and B\. Ginsburg \(2024\)RULER: what’s the real context size of your long\-context language models?\.arXiv preprint arXiv:2404\.06654\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2606.15378#S5.SS1.SSS0.Px2.p1.1),[§6\.3](https://arxiv.org/html/2606.15378#S6.SS3.p2.4)\.
- S\. Hu, Y\. Tu, X\. Han, G\. Cui, C\. He, W\. Zhao, X\. Long, Z\. Zheng, Y\. Fang, Y\. Huang, X\. Zhang, Z\. L\. Thai, C\. Wang, Y\. Yao, C\. Zhao, J\. Zhou, J\. Cai, Z\. Zhai, N\. Ding, C\. Jia, G\. Zeng, dahai li, Z\. Liu, and M\. Sun \(2024\)MiniCPM: unveiling the potential of small language models with scalable training strategies\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=3X2L2TFr0f)Cited by:[Appendix C](https://arxiv.org/html/2606.15378#A3.p1.9)\.
- A\. Huang, A\. Li, A\. Kong, B\. Wang, B\. Jiao, B\. Dong, B\. Wang, B\. Chen, B\. Li, B\. Ma, C\. Su, C\. Miao, C\. Wan, C\. Lou, C\. Hu, C\. Xu, C\. Yu, C\. Feng, C\. Yao, C\. Han, D\. Ma, D\. Shi, D\. Jiang, D\. Ma, D\. Sun, D\. Qi, E\. Liu, F\. Zhang, F\. Wan, G\. Huang, G\. Yan, G\. Cao, G\. Li, H\. Cheng, H\. Guo, H\. Zhang, H\. Nie, H\. Jia, H\. Lv, H\. Zhou, H\. Lv, H\. Wang, H\. Shum, H\. Huang, H\. Peng, H\. Zhou, H\. Wang, H\. Chen, H\. Zhu, H\. Wu, H\. Guo, J\. Wang, J\. Zhou, J\. Sun, J\. Wu, J\. Zhang, J\. Lv, J\. Liu, J\. Fu, J\. Liu, J\. Cheng, J\. Luo, J\. Yang, J\. Zhou, J\. Hou, J\. Bai, J\. Hu, J\. Xie, J\. Wu, J\. Zhang, J\. Zhou, J\. Liu, J\. Lin, K\. M\. Lo, K\. Liang, K\. Liu, K\. Tan, K\. Yan, K\. Li, K\. An, K\. Lin, L\. Yang, L\. Lv, L\. Zhao, L\. Chen, L\. Shi, L\. Tan, L\. Lin, L\. Chen, L\. Ma, M\. Ren, M\. Li, M\. Li, M\. Li, M\. Zhang, M\. Chen, M\. Huang, N\. Wang, P\. Liu, Q\. Han, Q\. Zhao, Q\. He, Q\. Du, Q\. Wu, Q\. Sun, R\. Yang, R\. Miao, R\. Han, R\. Wan, R\. Guo, S\. Wang, S\. Pang, S\. Yang, S\. Fan, S\. Shang, S\. Yang, S\. Li, S\. Tian, S\. Liu, S\. Wu, S\. Chen, S\. Yuan, T\. Cao, T\. Yue, T\. Cheng, T\. Li, T\. Luo, W\. You, W\. Ji, W\. Yuan, W\. Zhang, W\. Wu, W\. Xie, W\. Sun, W\. Deng, W\. Zheng, W\. Xie, X\. Wang, X\. Kong, X\. Liu, X\. Zhang, X\. Yang, X\. Liu, X\. Yuan, X\. Jiao, X\. Ren, X\. Zhang, X\. Li, X\. Liu, X\. Wu, X\. Chen, X\. Yang, X\. Wang, X\. Zhao, X\. He, X\. Feng, X\. Cai, X\. Zhou, Y\. Yu, Y\. Li, Y\. Xu, Y\. Lai, Y\. Xu, Y\. Wang, Y\. Shen, Y\. Zhu, Y\. Lv, Y\. Cao, Y\. Gong, Y\. Yang, Y\. Yang, Y\. Zhao, Y\. Zhao, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Chen, Y\. Zhao, Y\. Long, Y\. Wang, Y\. Guan, Y\. Zhou, Y\. Peng, Y\. Ding, Y\. Fan, Y\. Lu, Y\. Yang, Y\. Luo, Y\. Zhao, Y\. Peng, Y\. Lin, Y\. Lu, Y\. Zhao, Y\. Ju, Y\. Zhang, Y\. Li, Y\. Yang, Y\. Chen, Y\. Cai, Z\. Weng, Z\. Hong, Z\. Li, Z\. Xie, Z\. Ge, Z\. Gong, Z\. Zeng, Z\. Lu, Z\. Huang, Z\. Chang, Z\. Huang, Z\. Hu, Z\. Yang, Z\. 
Wang, Z\. Ren, Z\. Zhang, and Z\. Wang \(2026\)Step 3\.5 flash: open frontier\-level intelligence with 11b active parameters\.External Links:2602\.10604,[Link](https://arxiv.org/abs/2602.10604)Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Huang, S\. Cao, N\. Parulian, H\. Ji, and L\. Wang \(2021\)Efficient attentions for long document summarization\.InProceedings of the 2021 conference of the north American chapter of the association for computational linguistics: Human language technologies,pp\. 1419–1436\.Cited by:[Appendix A](https://arxiv.org/html/2606.15378#A1.p1.1),[§3\.2](https://arxiv.org/html/2606.15378#S3.SS2.SSS0.Px2.p1.1)\.
- Y\. Huang, Y\. Bai, Z\. Zhu, J\. Zhang, J\. Zhang, T\. Su, J\. Liu, C\. Lv, Y\. Zhang, J\. Lei, Y\. Fu, M\. Sun, and J\. He \(2023\)C\-Eval: a multi\-level multi\-discipline Chinese evaluation suite for foundation models\.InAdvances in Neural Information Processing Systems,Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- G\. Jawahar, B\. Sagot, and D\. Seddah \(2019\)What does BERT learn about the structure of language?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,A\. Korhonen, D\. Traum, and L\. Màrquez \(Eds\.\),Florence, Italy,pp\. 3651–3657\.External Links:[Link](https://aclanthology.org/P19-1356/),[Document](https://dx.doi.org/10.18653/v1/P19-1356)Cited by:[§D\.2](https://arxiv.org/html/2606.15378#A4.SS2.p4.1)\.
- K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cesista, L. Newhouse, and J. Bernstein (2024) Muon: an optimizer for hidden layers in neural networks. URL https://kellerjordan.github.io/posts/muon. Cited by: [Table 8](https://arxiv.org/html/2606.15378#A3.T8).
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.arXiv preprint arXiv:2001\.08361\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.SS0.SSS0.Px1.p1.4),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.15378#S3.SS2.SSS0.Px1.p1.2)\.
- A\. Kazemnejad, I\. Padhi, K\. Natesan Ramamurthy, P\. Das, and S\. Reddy \(2023\)The impact of positional encoding on length generalization in transformers\.Advances in Neural Information Processing Systems36,pp\. 24892–24928\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.SS0.SSS0.Px3.p1.1)\.
- D\. Khashabi, S\. Chaturvedi, M\. Roth, S\. Upadhyay, and D\. Roth \(2018\)Looking beyond the surface: a challenge set for reading comprehension over multiple sentences\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\),pp\. 252–262\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- G\. Lai, Q\. Xie, H\. Liu, Y\. Yang, and E\. Hovy \(2017\)RACE: large\-scale ReAding comprehension dataset from examinations\.InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,pp\. 785–794\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- A\. Li, B\. Gong, B\. Yang, B\. Shan, C\. Liu, C\. Zhu, C\. Zhang, C\. Guo, D\. Chen, D\. Li,et al\.\(2025\)Minimax\-01: scaling foundation models with lightning attention\.arXiv preprint arXiv:2501\.08313\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.p2.1),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Li, Y\. Zhang, F\. Koto, Y\. Yang, H\. Zhao, Y\. Gong, N\. Duan, and T\. Baldwin \(2024\)CMMLU: measuring massive multitask language understanding in Chinese\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 11260–11285\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- J\. Li, X\. Chen, E\. Hovy, and D\. Jurafsky \(2016\)Visualizing and understanding neural models in nlp\.InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 681–691\.Cited by:[§D\.3](https://arxiv.org/html/2606.15378#A4.SS3.p2.6),[§5\.2](https://arxiv.org/html/2606.15378#S5.SS2.SSS0.Px1.p1.4)\.
- Y\. Liang, S\. Chen, G\. Zhang, S\. Wang, and S\. Zheng \(2026\)Revealing the learning dynamics of long\-context continual pre\-training\.arXiv preprint arXiv:2604\.02650\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.SS0.SSS0.Px1.p1.4),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal \(2018\)Can a suit of armor conduct electricity? a new dataset for open book question answering\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 2381–2391\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- N\. Mostafazadeh, N\. Chambers, X\. He, D\. Parikh, D\. Batra, L\. Vanderwende, P\. Kohli, and J\. Allen \(2016\)A corpus and cloze evaluation for deeper understanding of commonsense stories\.InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 839–849\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- B\. Peng, R\. Zhang, D\. Goldstein, E\. Alcaide, X\. Du, H\. Hou, J\. Lin, J\. Liu, J\. Lu, W\. Merrill, G\. Song, K\. Tan, S\. Utpala, N\. Wilce, J\. S\. Wind, T\. Wu, D\. Wuttke, and C\. Zhou\-Zheng \(2025\)RWKV\-7 "goose" with expressive dynamic state evolution\.External Links:2503\.14456,[Link](https://arxiv.org/abs/2503.14456)Cited by:[Limitations](https://arxiv.org/html/2606.15378#Sx1.p2.1)\.
- M\. T\. Pilehvar and J\. Camacho\-Collados \(2019\)WiC: the word\-in\-context dataset for evaluating context\-sensitive meaning representations\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),pp\. 1267–1273\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- K\. C\. Puvvada, F\. Ladhak, S\. A\. Serrano, C\. Hsieh, S\. Acharya, S\. Majumdar, F\. Jia, S\. Kriman, S\. Sun, D\. Rekesh,et al\.\(2025\)Swan\-gpt: an efficient and scalable approach for long\-context language modeling\.arXiv preprint arXiv:2504\.08719\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1),[§6\.3](https://arxiv.org/html/2606.15378#S6.SS3.p1.2)\.
- Z\. Qin, W\. Sun, D\. Li, X\. Shen, W\. Sun, and Y\. Zhong \(2024\)Various lengths, constant speed: efficient language modeling with lightning attention\.InInternational Conference on Machine Learning,pp\. 41517–41535\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of machine learning research21\(140\),pp\. 1–67\.Cited by:[§3\.2](https://arxiv.org/html/2606.15378#S3.SS2.SSS0.Px1.p1.2)\.
- M\. Roemmele, C\. A\. Bejan, and A\. S\. Gordon \(2011\)Choice of plausible alternatives: an evaluation of commonsense causal reasoning\.In2011 AAAI Spring Symposium Series,Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- K\. Sakaguchi, R\. Le Bras, C\. Bhagavatula, and Y\. Choi \(2021\)WinoGrande: an adversarial Winograd schema challenge at scale\.Communications of the ACM64\(9\),pp\. 99–106\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- M\. Sap, H\. Rashkin, D\. Chen, R\. Le Bras, and Y\. Choi \(2019\)Social IQa: commonsense reasoning about social interactions\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),pp\. 4463–4473\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- I\. Schlag, K\. Irie, and J\. Schmidhuber \(2021\)Linear transformers are secretly fast weight programmers\.InInternational Conference on Machine Learning,pp\. 9355–9366\.Cited by:[§B\.3](https://arxiv.org/html/2606.15378#A2.SS3.p1.1)\.
- A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram,et al\.\(2025\)Openai gpt\-5 system card\.arXiv preprint arXiv:2601\.03267\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.p1.1)\.
- Y\. Song, J\. Kai, L\. Lu, K\. Qiu, and Z\. Lin \(2026\)Towards compressive and scalable recurrent memory\.arXiv preprint arXiv:2602\.11212\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\)CommonsenseQA: a question answering challenge targeting commonsense knowledge\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),pp\. 4149–4158\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- K\. Team, Y\. Zhang, Z\. Lin, X\. Yao, J\. Hu, F\. Meng, C\. Liu, X\. Men, S\. Yang, Z\. Li,et al\.\(2025\)Kimi linear: an expressive, efficient attention architecture\.arXiv preprint arXiv:2510\.26692\.Cited by:[Limitations](https://arxiv.org/html/2606.15378#Sx1.p2.1)\.
- M\. Team, W\. An, Y\. Chen, Y\. Fang, J\. Li, X\. Li, Y\. Li, Y\. Li, Y\. Li, B\. Lin,et al\.\(2026\)Minicpm\-sala: hybridizing sparse and linear attention for efficient long\-context modeling\.arXiv preprint arXiv:2602\.11761\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.InAdvances in Neural Information Processing Systems,Vol\.30\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.p1.1)\.
- R\. Waleffe, W\. Byeon, D\. Riach, B\. Norick, V\. Korthikanti, T\. Dao, A\. Gu, A\. Hatamizadeh, S\. Singh, D\. Narayanan,et al\.\(2024\)An empirical study of mamba\-based language models\.arXiv preprint arXiv:2406\.07887\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p2.1)\.
- A\. Wang, Y\. Pruksachatkun, N\. Nangia, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. R\. Bowman \(2019\)SuperGLUE: a stickier benchmark for general\-purpose language understanding systems\.InAdvances in Neural Information Processing Systems,Vol\.32\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.
- D\. Wang, R\. Zhu, S\. Abreu, Y\. Shan, T\. Kergan, Y\. Pan, Y\. Chou, Z\. Li, G\. Zhang, W\. Huang,et al\.\(2025\)A systematic analysis of hybrid linear attention\.arXiv preprint arXiv:2507\.06457\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.p2.1),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p2.1)\.
- J\. Willette, H\. Lee, and S\. J\. Hwang \(2025\)Delta attention: fast and accurate sparse attention inference by delta correction\.Advances in Neural Information Processing Systems38,pp\. 12052–12080\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Wu, Y\. Wang, G\. Xiao, H\. Peng, and Y\. Fu \(2025\)Retrieval head mechanistically explains long\-context factuality\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 62143–62156\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.SS0.SSS0.Px2.p3.1),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px2.p1.1),[§5\.2](https://arxiv.org/html/2606.15378#S5.SS2.SSS0.Px2.p1.2)\.
- B\. Xiao, B\. Xia, B\. Yang, B\. Gao, B\. Shen, C\. Zhang, C\. He, C\. Lou, F\. Luo, G\. Wang,et al\.\(2026\)Mimo\-v2\-flash technical report\.arXiv preprint arXiv:2601\.02780\.Cited by:[§1](https://arxiv.org/html/2606.15378#S1.p2.1),[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- G\. Xiao, J\. Tang, J\. Zuo, J\. Guo, S\. Yang, H\. Tang, Y\. Fu, and S\. Han \(2025a\)Duoattention: efficient long\-context llm inference with retrieval and streaming heads\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 37228–37253\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px2.p1.1)\.
- G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2024\)Efficient streaming language models with attention sinks\.InInternational Conference on Learning Representations,Cited by:[§B\.1](https://arxiv.org/html/2606.15378#A2.SS1.p1.1)\.
- L\. Xiao, L\. Zhiyuan, and L\. Yueyu \(2025b\)WuNeng: hybrid state with attention\.External Links:2504\.19191,[Link](https://arxiv.org/abs/2504.19191)Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Yang, B\. Venkitesh, D\. G\. Talupuru, H\. Lin, D\. Cairuz, P\. Blunsom, and A\. Locatelli \(2025a\)Rope to nope and back again: a new hybrid attention strategy\.Advances in Neural Information Processing Systems38,pp\. 64133–64157\.Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1),[§6\.3](https://arxiv.org/html/2606.15378#S6.SS3.p1.2)\.
- S\. Yang, J\. Kautz, and A\. Hatamizadeh \(2025b\)Gated delta networks: improving mamba2 with delta rule\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.15378#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Yang, B\. Wang, Y\. Shen, R\. Panda, and Y\. Kim \(2024a\)Gated linear attention transformers with hardware\-efficient training\.InInternational Conference on Machine Learning,pp\. 56501–56523\.Cited by:[§B\.3](https://arxiv.org/html/2606.15378#A2.SS3.p1.1),[§1](https://arxiv.org/html/2606.15378#S1.p1.1)\.
- S\. Yang, B\. Wang, Y\. Zhang, Y\. Shen, and Y\. Kim \(2024b\)Parallelizing linear transformers with the delta rule over sequence length\.InAdvances in Neural Information Processing Systems,Cited by:[§B\.3](https://arxiv.org/html/2606.15378#A2.SS3.p1.1)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)HellaSwag: can a machine really finish your sentence?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,pp\. 4791–4800\.Cited by:[Appendix E](https://arxiv.org/html/2606.15378#A5.SS0.SSS0.Px1.p1.1)\.

![Refer to caption](https://arxiv.org/html/2606.15378v1/x10.png)

Figure 9: Validation $\mathrm{Loss}$ scaling-law fits across all ten architectures. Each panel plots validation $\mathrm{Loss}$ against training tokens $D$ for one architecture at four model scales S1–S4, with the 18 S1–S3 points used for fitting (solid markers) and the 6 S4 points held out for verification (orange triangles). Each colored curve is the fit of $L(N,D)=aN^{-\alpha}+bD^{-\beta}$ (Eq. ([6](https://arxiv.org/html/2606.15378#S3.E6))) at the corresponding $N$, with the fitted coefficients and the train/verification $R^{2}$ printed inside each panel. The first seven panels cover the architectures studied in the main scaling experiments (Section [4](https://arxiv.org/html/2606.15378#S4)): *Full* together with three SWA hybrids and three recurrent-mixer hybrids. The last three panels (*SWA-128(1:3)*, *SWA-128-Headwise*, and *SWA-128-NoPE*) correspond to the design variants from Section [6](https://arxiv.org/html/2606.15378#S6).

![Refer to caption](https://arxiv.org/html/2606.15378v1/x11.png)

Figure 10: $\log(\mathrm{LongPPL})$ scaling-law fits across the same ten architectures. The panel layout, marker convention, and per-panel annotations follow Figure [9](https://arxiv.org/html/2606.15378#A0.F9). Compared with validation $\mathrm{Loss}$, $\log(\mathrm{LongPPL})$ is noticeably noisier at early checkpoints; in particular, the S1/$D=100N$ checkpoint is excluded from fitting due to unstable long-context behavior at this small training budget, leaving 17 S1–S3 fitting points and 6 S4 held-out points per architecture. Despite the higher noise level, the same power-law form $L(N,D)=aN^{-\alpha}+bD^{-\beta}$ still fits well.

## Appendix A LongPPL Evaluation Details

LongPPL evaluates a model only on tokens whose prediction benefits from long context. Following Fang et al. ([2025](https://arxiv.org/html/2606.15378#bib.bib10)), we identify these tokens by comparing the token-level negative log-likelihoods assigned by a reference model under full context and under a local chunk. In our experiments, we use GovReport (Huang et al., [2021](https://arxiv.org/html/2606.15378#bib.bib40)) as the evaluation corpus and Llama-3.1-8B (Grattafiori et al., [2024](https://arxiv.org/html/2606.15378#bib.bib28)) as the reference model.

Let $\ell^{\mathrm{full}}_{\mathrm{ref}}(x_{t})$ and $\ell^{\mathrm{chunk}}_{\mathrm{ref}}(x_{t})$ denote the token-level negative log-likelihoods assigned by the reference model to token $x_{t}$ under the full context $x_{<t}$ and under a local chunk, respectively. The set of key tokens is defined as

$$
\mathcal{K}=\bigl\{\,t\,:\ \ell^{\mathrm{chunk}}_{\mathrm{ref}}(x_{t})-\ell^{\mathrm{full}}_{\mathrm{ref}}(x_{t})>\tau_{\mathrm{gain}},\ \ \ell^{\mathrm{full}}_{\mathrm{ref}}(x_{t})<\tau_{\mathrm{nll}}\,\bigr\},\quad \tau_{\mathrm{gain}}=\tau_{\mathrm{nll}}=2. \tag{7}
$$

The first condition selects tokens that receive a clear gain from long context, while the second filters out tokens that remain hard to predict even with full context. For a model $M$ under evaluation, LongPPL is then computed only over $\mathcal{K}$:

$$
\mathrm{LongPPL}(M)=\exp\!\Bigl(\frac{1}{|\mathcal{K}|}\sum_{t\in\mathcal{K}}\ell^{\mathrm{full}}_{M}(x_{t})\Bigr). \tag{8}
$$
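The key-token filter of Eq. (7) and the LongPPL of Eq. (8) can be sketched for a single sequence as follows. This is a minimal illustration, not the authors' evaluation code: the function names and the list-of-floats inputs are hypothetical, and how scores are aggregated across documents is not specified here, so the sketch handles one sequence only. The `min_key_tokens` default mirrors the fewer-than-10-key-tokens filter described in this appendix.

```python
import math

TAU_GAIN = 2.0  # threshold on the long-context gain, Eq. (7)
TAU_NLL = 2.0   # threshold on the reference full-context NLL, Eq. (7)


def key_token_indices(ref_full_nll, ref_chunk_nll):
    """Positions t with chunk NLL - full NLL > TAU_GAIN and full NLL < TAU_NLL."""
    return [
        t
        for t, (full, chunk) in enumerate(zip(ref_full_nll, ref_chunk_nll))
        if chunk - full > TAU_GAIN and full < TAU_NLL
    ]


def long_ppl(model_full_nll, key_idx, min_key_tokens=10):
    """exp(mean model NLL over key tokens); None when too few key tokens remain."""
    if len(key_idx) < min_key_tokens:
        return None  # hypothetical convention for a skipped example
    mean_nll = sum(model_full_nll[t] for t in key_idx) / len(key_idx)
    return math.exp(mean_nll)
```

Both NLL lists are assumed to be aligned token by token, with the reference model supplying the two lists used for selection and the evaluated model supplying `model_full_nll`.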
#### Evaluation dataset statistics\.

Table [3](https://arxiv.org/html/2606.15378#A1.T3) summarizes the datasets used for the two scaling-law targets. The C4 validation split contains many short documents, making it suitable for measuring short-context modeling quality, whereas the GovReport subset used by LongPPL contains substantially longer sequences and a sufficient number of reference-selected key tokens per example. For GovReport, token lengths are computed after re-tokenization with the evaluated-model tokenizer and truncation at 16K, which is the pretraining sequence length of our models.

In preliminary experiments, we found that examples with fewer than 10 key tokens often produce unstable LongPPL estimates that occasionally spike to extremely large values\. We therefore skip these examples to obtain more stable LongPPL estimates\.

| Dataset / metric | Samples | After filter | Avg tokens | Median tokens | Avg key tokens |
|---|---|---|---|---|---|
| C4 validation / $\mathrm{Loss}$ | 40,000 | 40,000 | 497.6 | 269 | – |
| GovReport / LongPPL | 10,000 | 8,898 | 13,317.3 | 13,276 | 78 |

Table 3: Evaluation dataset statistics. Token lengths use the evaluated-model tokenizer; GovReport lengths are after re-tokenization and truncation at 16K. Key tokens are identified by the Llama-3.1-8B reference; “After filter” is the count remaining after dropping examples with fewer than 10 key tokens (no filter is applied to C4).

## Appendix B Model Details

To compare hybrid architectures fairly, we keep the backbone configuration of *Full*, *SWA*, *Lightning*, *Mamba-2*, and *GDN* matched as closely as possible, including the number of layers, hidden size, GQA grouping, and per-head dimension. For efficient attention variants that introduce additional parameters, we make only minimal architectural adjustments so that the total parameter count stays close to the *Full* backbone. This avoids mixing the benefit of extra modules from the original implementations into our comparison of efficient-attention designs.

### B.1 Softmax Attention

For softmax attention, prior work has observed the attention-sink phenomenon, where attention probability can concentrate on a small number of non-semantic positions at the beginning of the sequence (Xiao et al., [2024](https://arxiv.org/html/2606.15378#bib.bib62)). To mitigate this, we adopt a learnable per-head softmax sink, as used in recent open models (Agarwal et al., [2025](https://arxiv.org/html/2606.15378#bib.bib5)). Concretely, for head $h$ the attention distribution is

$$
a_{ij}^{(h)}=\frac{\exp\!\left(q_{i}^{(h)\top}k_{j}^{(h)}/\sqrt{d_{h}}\right)}{\exp(s_{h})\;+\;\sum_{\ell\leq i}\exp\!\left(q_{i}^{(h)\top}k_{\ell}^{(h)}/\sqrt{d_{h}}\right)},
$$

where $s_{h}$ is a learnable per-head scalar initialized to zero. This is equivalent to introducing a virtual “sink” key with logit $s_{h}$ that absorbs excess attention mass but contributes nothing to the value aggregation. We enable this sink in all softmax-attention layers.
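The effect of the sink term on one row of attention weights can be sketched as below. This is an illustrative pure-Python version under stated assumptions: `sink_softmax` is a hypothetical name, `scores` is assumed to already hold the scaled causal logits $q_{i}^{\top}k_{\ell}/\sqrt{d_{h}}$ for $\ell\leq i$, and the max-subtraction is a standard numerical-stability step that is not mentioned in the text.

```python
import math


def sink_softmax(scores, sink_logit=0.0):
    """Softmax over one query's causal scores with an extra sink logit.

    The returned weights sum to less than 1; the missing mass is what the
    sink absorbs, and it has no value vector attached to it.
    """
    m = max(max(scores), sink_logit)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = math.exp(sink_logit - m) + sum(exps)
    return [e / denom for e in exps]
```

With two equal scores and the sink at its zero initialization, each key receives one third of the mass and the sink holds the remaining third. A very negative `sink_logit` recovers the ordinary softmax.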

### B.2 Lightning Attention

Lightning attention is a representative linear-attention variant within the recurrent sequence mixer family introduced in Section [3.1](https://arxiv.org/html/2606.15378#S3.SS1). Compared with a full-attention layer, a Lightning layer introduces a small number of additional parameters. To keep the Lightning hybrid comparable with *Full* and the SWA hybrids in total parameter count, we preserve the GQA configuration, layer count, and backbone hidden size, and only slightly reduce the FFN hidden size inside the Lightning layers. The resulting configuration is summarized in Table [4](https://arxiv.org/html/2606.15378#A2.T4).

| Scale | FFN hidden (Full) | FFN hidden (Lightning) | Per-layer params (Full) | Per-layer params (Lightning) |
|---|---|---|---|---|
| S1 | 960 | 920 | 1,499,136 | 1,502,208 |
| S2 | 1,280 | 1,240 | 2,621,440 | 2,625,536 |
| S3 | 1,600 | 1,560 | 4,055,040 | 4,060,160 |
| S4 | 1,920 | 1,880 | 5,799,936 | 5,806,080 |

Table 4: Lightning configuration and per-layer parameter counts. Per-layer params report the parameter counts of a single layer; shared LayerNorms and embeddings are omitted.

| Scale | Head Dim | Per-layer params (Full) | Per-layer params (GDN) |
|---|---|---|---|
| S1 | 46 | 1,499,136 | 1,496,114 |
| S2 | 46 | 2,621,440 | 2,627,634 |
| S3 | 46 | 4,055,040 | 4,075,570 |
| S4 | 46 | 5,799,936 | 5,839,922 |

Table 5: Gated DeltaNet configuration and per-layer parameter counts. “Head Dim” is the per-group key/value channel dimension $d_{k}=d_{v}$ (since $\texttt{expand\_v}=1$); the Full backbone uses $d_{k}=d_{v}=64$.

| Scale | State Dim | Per-layer params (Full) | Per-layer params (Mamba-2) |
|---|---|---|---|
| S1 | 16 | 1,499,136 | 1,576,098 |
| S2 | 16 | 2,621,440 | 2,789,912 |
| S3 | 16 | 4,055,040 | 4,349,470 |
| S4 | 16 | 5,799,936 | 6,253,092 |

Table 6: Mamba-2 configuration and per-layer parameter counts. State Dim is the SSM state dimension $d_{\text{state}}$. Mamba-2 ends up slightly larger (5%–8%) than Full in order to retain sufficient state capacity.

| Metric | Variant | $D=100N$ | $D=200N$ | $D=300N$ | $D=400N$ | $D=500N$ | $D=1000N$ |
|---|---|---|---|---|---|---|---|
| C4 validation loss ($\downarrow$) | GDN w/ conv1d | 4.336 | 4.163 | 4.088 | 4.053 | 4.014 | 3.929 |
| C4 validation loss ($\downarrow$) | GDN w/o conv1d | 4.368 | 4.179 | 4.106 | 4.072 | 4.028 | 3.942 |
| LongPPL ($\downarrow$) | GDN w/ conv1d | 80.79 | 19.23 | 15.91 | 13.31 | 12.85 | 11.36 |
| LongPPL ($\downarrow$) | GDN w/o conv1d | 91.35 | 24.96 | 16.44 | 13.56 | 12.30 | 11.01 |

Table 7: Ablation of the short 1D convolution in Gated DeltaNet at the S1 scale, with columns indexed by training tokens $D$. The convolution consistently lowers C4 validation loss by a small margin throughout training. Its LongPPL advantage, however, exists only at small training budgets: the gap narrows to about $0.5$ by $D\geq 300N$ and reverses at $D\geq 500N$.

### B.3 Gated DeltaNet

Gated DeltaNet (GDN) is a more elaborate recurrent sequence mixer that combines the gated update of GLA (Yang et al., [2024a](https://arxiv.org/html/2606.15378#bib.bib38)) with the delta-rule mechanism of DeltaNet (Schlag et al., [2021](https://arxiv.org/html/2606.15378#bib.bib63); Yang et al., [2024b](https://arxiv.org/html/2606.15378#bib.bib64)) (Section [3.1](https://arxiv.org/html/2606.15378#S3.SS1)). The standard GDN implementation additionally includes a short 1D convolution on Q/K/V and a value-expansion factor (`expand_v`) that widens the value dimension relative to the key dimension. To make the GDN hybrid comparable to *Full* and the other hybrid variants under a matched parameter budget, we make two adjustments to this configuration.

#### Removing the short convolution\.

The short 1D convolution on Q/K/V is an auxiliary mixing operator not common in Transformer-based models, and keeping it would conflate the effect of the recurrent sequence mixer itself with this auxiliary mechanism. We disable it in our main study, and verify with a small ablation at the S1 scale (Table [7](https://arxiv.org/html/2606.15378#A2.T7)). The convolution consistently improves C4 validation loss by a small margin throughout training, but its $\mathrm{LongPPL}$ advantage exists only at small training budgets and vanishes once the budget is sufficient. Therefore, disabling the short convolution simplifies the architectural comparison without altering the long-context findings in the paper.

#### Keeping FFN width and tuning the state dimension\.

At default settings, a GDN layer is heavier than a Full attention layer due to its data-dependent gating and recurrent-state projections. Shrinking the FFN to compensate would require an awkwardly narrow width, so we instead keep the FFN identical to the Full backbone and adjust only the state-related dimensions. We find that $\texttt{expand\_v}=1$ (i.e., $d_{v}=d_{k}$) consistently outperforms $d_{v}>d_{k}$ on validation loss, and therefore fix $\texttt{expand\_v}=1$ and pick the GDN head dimension so the per-layer parameter count matches *Full*, giving $d_{k}=d_{v}=46$ (Table [5](https://arxiv.org/html/2606.15378#A2.T5)).

### B.4 Mamba-2

As with GDN, the standard Mamba-2 implementation includes a short 1D causal convolution before the SSM and an expansion factor `expand` that widens the SSM hidden dimension. We apply the same strategy as for GDN: disable the convolution to isolate the recurrence, and adjust only the SSM-related dimensions while keeping the FFN unchanged.

A default Mamba-2 layer is also heavier than a Full attention layer, mainly due to the SSM projections ($\Delta_{t}$, $B$, $C$, $A$) together with the input/output projections. Mamba-2 provides a dedicated `state_dim` parameter that controls the SSM state size independently of the per-head channel width `head_dim`, which we use to match the per-layer parameter count to *Full*. Specifically, we set $\texttt{expand}=1$, keep $\texttt{head\_dim}=64$ to match Full’s attention head dimension, and shrink `state_dim` to $16$, which still preserves sufficient state capacity (Table [6](https://arxiv.org/html/2606.15378#A2.T6)).

## Appendix C Training Details

For the S1–S4 models, we share the same training hyperparameters (data, sequence length, learning-rate schedule, and batch size) across architectures, so that the scaling comparison is not confounded by optimization or data differences. All models are pretrained with a 16K sequence length, a $1{:}1$ mixture of long and short documents, and a Warmup-Stable-Decay (WSD) learning-rate schedule (Hu et al., [2024](https://arxiv.org/html/2606.15378#bib.bib29)). The stable and decay phases account for $90\%$ and $10\%$ of the total training tokens, respectively. During the decay phase, the learning rate is linearly annealed from the stable value to $1/10$ of it. For scaling-law fitting, we use checkpoints at $D/N\in\{100,200,300,400,500,1000\}$ for S1–S4 (and $D/N\in\{100,200\}$ for S5); each checkpoint corresponds to a complete WSD schedule ($90\%$ stable plus $10\%$ decay scaled to that $D/N$), not a mid-stable snapshot.
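The stable-then-linear-decay part of this schedule can be sketched as a function of the iteration index. The sketch rests on several assumptions: `wsd_lr` is a hypothetical helper, the warmup phase is omitted because its length is not stated here, and the convention that the end value is reached exactly at the last decay iteration is a guess rather than a documented detail.

```python
def wsd_lr(step, stable_steps, decay_steps, stable_lr, end_ratio=0.1):
    """Learning rate at a 0-indexed step: constant, then linear decay.

    The rate stays at stable_lr for stable_steps iterations and is then
    annealed linearly so that it equals end_ratio * stable_lr at the final
    decay iteration (assumed convention).
    """
    if not 0 <= step < stable_steps + decay_steps:
        raise ValueError("step outside the schedule")
    if step < stable_steps:
        return stable_lr
    progress = (step - stable_steps + 1) / decay_steps  # in (0, 1]
    return stable_lr * (1.0 - (1.0 - end_ratio) * progress)
```

Using the S1 row of Table 8 (25,764 stable and 2,862 decay iterations, stable LR $1.953\times 10^{-3}$), the last iteration evaluates to $1.953\times 10^{-4}$, which agrees with the reported end LR. Setting `stable_steps=0` and `end_ratio=0.0` gives a pure linear decay to zero, the shape described for the 32K extension.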

Table [8](https://arxiv.org/html/2606.15378#A3.T8) summarizes the concrete schedule: S1–S4 are trained to $D/N=1000$ while S5 is trained to $D/N=200$. The global batch size and stable learning rate at each scale were obtained from a hyperparameter sweep, and we report a configuration that consistently performs well on the Full baseline. For fairness, the same configuration is then shared by all hybrid variants at that scale.

The final row of Table [8](https://arxiv.org/html/2606.15378#A3.T8) records the long-context extension of the S5/$200N$ checkpoint used in Section [6](https://arxiv.org/html/2606.15378#S6): we continue training for $\approx 5$B tokens ($4{,}769$ iters) at a 32K sequence length, with the LR linearly decayed from the S5 end LR to $0$ (no stable phase) and the RoPE base raised from $10^{5}$ to $5\times 10^{5}$.

| Scale | $D/N$ | Global batch | Stable LR | End LR | Stable iters | Decay iters | Total iters | Seq. len. | RoPE base |
|---|---|---|---|---|---|---|---|---|---|
| S1 | 1000 | 32 | $1.953\times 10^{-3}$ | $1.953\times 10^{-4}$ | 25,764 | 2,862 | 28,626 | 16K | $10^{5}$ |
| S2 | 1000 | 16 | $9.766\times 10^{-4}$ | $9.766\times 10^{-5}$ | 111,938 | 12,438 | 124,376 | 16K | $10^{5}$ |
| S3 | 1000 | 28 | $9.542\times 10^{-4}$ | $9.542\times 10^{-5}$ | 119,740 | 13,304 | 133,044 | 16K | $10^{5}$ |
| S4 | 1000 | 64 | $9.766\times 10^{-4}$ | $9.766\times 10^{-5}$ | 94,318 | 10,480 | 104,798 | 16K | $10^{5}$ |
| S5 | 200 | 64 | $9.766\times 10^{-4}$ | $9.766\times 10^{-5}$ | 81,900 | 9,100 | 91,000 | 16K | $10^{5}$ |
| S5 + 32K ext. | – | 32 | $9.766\times 10^{-5}$ | 0 | 0 | 4,769 | 4,769 | 32K | $5\times 10^{5}$ |

Table 8: Training schedule. “$D/N$” is the training budget; iters columns report actual training iters. The final row is the long-context extension used in Section [6](https://arxiv.org/html/2606.15378#S6). All runs use the Muon optimizer with weight decay $0.1$ (Jordan et al., [2024](https://arxiv.org/html/2606.15378#bib.bib72)) and gradient clipping $1.0$.
## Appendix D Mechanism Analysis Details

We conduct several experiments to analyze the mechanism of long\-range retrieval in hybrid models, including probing, receptive\-field constraints, gradient profiling, and retrieval\-head tracing\. Here, we provide implementation details for these experiments and explain why they support our conclusions that full attention dominates long\-range retrieval and that efficient attention shapes long\-context training dynamics by modulating the optimization pressure on full attention\.

### D.1 Receptive-field Constraint Details

This section gives implementation details for the inference-time receptive-field restriction experiment in Section [5.1](https://arxiv.org/html/2606.15378#S5.SS1) (Figure [3](https://arxiv.org/html/2606.15378#S5.F3)), where we limit either full attention or efficient attention to a receptive field of $H\approx 2048$ tokens and measure the change in $\log(\mathrm{LongPPL})$.

#### Softmax attention (Full and SWA).

We apply an exact 4D attention mask: for a query at position $i$, attention is allowed only to keys at positions in $[\,i-H,\,i\,]$ with $H=2048$. This gives a strict per-token receptive field.
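The per-query key range above corresponds to a banded causal mask. The following is a small pure-Python sketch with a hypothetical function name; it builds a single 2D boolean mask, and broadcasting it over batch and head dimensions to obtain the 4D mask is assumed to be left to the attention implementation.

```python
def banded_causal_mask(seq_len, horizon):
    """mask[i][j] is True when query i may attend to key j.

    Allowed keys satisfy i - horizon <= j <= i, so each query sees at most
    horizon + 1 positions including itself.
    """
    return [
        [(i - horizon) <= j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

For a toy case with `seq_len=5` and `horizon=2`, the last query attends to keys 2, 3, and 4 only, while the first query attends only to itself.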

#### Recurrent kernels \(Lightning, Mamba\-2, GDN\)\.

The same masking cannot be applied to the recurrent/SSM kernels. We instead use an overlapping-window approximation: the sequence is split into windows of $3072$ tokens with a $1024$-token stride; within each window, the recurrent state is reset to zero and rolled forward, and only the last $1024$ positions of the window are written to the output buffer. Concretely, for a retained block starting at position $s\geq 2048$, the computation window is $[s-2048,\,s+1024)$ and the copied-back interval is $[s,\,s+1024)$, so each token’s recurrent state is built from $\approx 2049$ to $3072$ preceding tokens. The effective receptive field is therefore slightly looser than the strict $H=2048$ used for softmax attention, but is well within the same order of magnitude; we report this as the same “$H\approx 2048$” condition in Section [5.1](https://arxiv.org/html/2606.15378#S5.SS1).
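The window bookkeeping for this approximation can be sketched as a list of (compute, keep) intervals. The helper below is hypothetical and only plans the intervals; it does not run any recurrent kernel. Clamping the compute window at position 0 for blocks that start before position 2048 is an assumption, since only the case $s\geq 2048$ is described.

```python
def window_plan(seq_len, context=2048, block=1024):
    """Return (compute_start, compute_end, keep_start, keep_end) per block.

    Each retained block [s, s + block) is recomputed from a zero state over
    [s - context, s + block); only the retained block is copied to the output.
    """
    plan = []
    for s in range(0, seq_len, block):
        keep_end = min(s + block, seq_len)
        compute_start = max(0, s - context)  # assumption: clamp at the sequence start
        plan.append((compute_start, keep_end, s, keep_end))
    return plan
```

For the block starting at position 3072 in a 4096-token sequence, the first retained token has consumed 2049 tokens (itself included) and the last has consumed 3072, matching the range quoted above.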

### D.2 Layer-wise Probing Analysis

This section gives implementation details for the layer-wise probing experiment in Section [5.1](https://arxiv.org/html/2606.15378#S5.SS1) (Figure [4](https://arxiv.org/html/2606.15378#S5.F4)). We probe the S4/$1000N$ checkpoints of seven models: *Full*, *SWA-128*, *SWA-512*, *SWA-2048*, *Lightning*, *Mamba-2*, and *GDN*. The synthetic NIAH classification dataset contains 10,000 samples with a sequence length of 16K and eight candidate classes; its prompt format is illustrated in Figure [11](https://arxiv.org/html/2606.15378#A4.F11).

NIAH probing sample format

Candidate values: 31415920, 31415921, …, 31415927

Sample fields:

- Key: golden-crystal (sampled adjective–noun identifier)
- Value: 31415923 (one value from the eight candidates)
- Label: 3 (index of the selected value)

Data: A special magic number is hidden within the following text. Make sure to memorize it. I will quiz you about the number afterwards. [repeated distractor passage] One of the special magic numbers for golden-crystal is: 31415923. [repeated distractor passage] What is the special magic number for golden-crystal mentioned in the provided text? The special magic number for golden-crystal mentioned in the provided text is

Figure 11: Data format for the NIAH classification dataset used in layer-wise probing. The probe predicts the label of the inserted magic number from the final query-token hidden state.

For each model and each sample, we run a forward pass with hidden-state output enabled and extract the hidden state of the final query token after every transformer layer. We train an independent logistic-regression probe for each layer, using an 80/20 train/test split with stratified labels and standardizing the hidden states before fitting; the multi-class implementation uses a one-vs-rest scheme. Table [9](https://arxiv.org/html/2606.15378#A4.T9) shows that logistic regression gives the strongest layer-wise accuracy among the lightweight classifiers we test, so we use it as the primary probe.

Figure [4](https://arxiv.org/html/2606.15378#S5.F4) visualizes the incremental layer contribution, i.e., the heatmap entries are $A_{\ell}-A_{\ell-1}$ where $A_{\ell}$ is the raw probing accuracy at layer $\ell$. Table [10](https://arxiv.org/html/2606.15378#A4.T10) reports the underlying raw layer-wise accuracies for all 18 layers.
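As a worked example of the incremental contribution $A_{\ell}-A_{\ell-1}$, the snippet below applies it to the *Full* row of Table 10. The function name is hypothetical, and leaving out layer 0 (which has no predecessor) is an assumption about how the heatmap treats the first layer.

```python
# Raw probing accuracies (%) for the Full model, layers L0..L17, from Table 10.
full_acc = [19.4, 16.1, 13.7, 14.1, 17.6, 16.1, 14.1, 14.5, 16.5,
            47.7, 95.3, 92.8, 92.3, 89.6, 86.8, 82.1, 78.8, 74.9]


def incremental_contribution(acc):
    """Return A_l - A_{l-1} for l = 1..len(acc)-1, rounded to one decimal."""
    return [round(b - a, 1) for a, b in zip(acc, acc[1:])]


deltas = incremental_contribution(full_acc)
# deltas[k] belongs to layer k + 1, so shift the argmax index by one.
peak_layer = max(range(len(deltas)), key=deltas.__getitem__) + 1
```

On this row the largest single-layer gains fall at layers 9 and 10 (31.2 and 47.6 points), which is where raw accuracy jumps from 16.5 to 95.3.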

Interestingly, probing accuracy typically peaks at intermediate layers and declines in deeper layers. This suggests that retrieval-related information becomes most linearly accessible in the middle layers, while later layers progressively mix and integrate these signals into higher-level semantic representations, making them less separable by lightweight classifiers. This observation is broadly consistent with prior findings that transformer representations evolve from surface and syntactic features in lower and middle layers toward more abstract semantic representations in deeper layers (Jawahar et al., [2019](https://arxiv.org/html/2606.15378#bib.bib74)).

| Classifier | L0 | L1 | L2 | L3 | L4 | L5 | L6 | L7 | L8 | L9 | L10 | L11 | L12 | L13 | L14 | L15 | L16 | L17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Logistic regression | 19.4 | 16.1 | 13.7 | 14.1 | 17.6 | 16.1 | 14.1 | 14.5 | 16.5 | 47.7 | 95.3 | 92.8 | 92.3 | 89.6 | 86.8 | 82.1 | 78.8 | 74.9 |
| MLP | 22.1 | 13.4 | 11.8 | 12.6 | 12.0 | 11.8 | 12.3 | 11.2 | 11.7 | 28.9 | 91.2 | 87.1 | 84.8 | 79.7 | 68.8 | 64.8 | 63.0 | 54.9 |
| Random forest | 30.6 | 15.6 | 13.9 | 13.1 | 14.0 | 12.1 | 12.3 | 11.8 | 12.3 | 11.8 | 18.4 | 18.8 | 16.4 | 15.8 | 14.8 | 14.4 | 14.4 | 13.1 |
| kNN | 20.7 | 14.3 | 13.2 | 12.6 | 12.3 | 12.7 | 11.5 | 12.3 | 12.1 | 12.2 | 13.8 | 13.1 | 13.3 | 13.5 | 12.7 | 12.7 | 12.4 | 12.6 |
| PCA+Naive Bayes | 15.6 | 12.6 | 12.7 | 12.9 | 13.6 | 14.1 | 12.2 | 11.3 | 11.1 | 16.6 | 65.5 | 57.9 | 55.1 | 51.0 | 39.0 | 30.8 | 28.7 | 28.2 |

Table 9: Comparison of lightweight classifiers on the S4/$1000N$ Full model under the same NIAH probing task as Table [10](https://arxiv.org/html/2606.15378#A4.T10); logistic regression gives the strongest layer-wise accuracy.

| Model | L0 | L1 | L2 | L3 | L4 | L5 | L6 | L7 | L8 | L9 | L10 | L11 | L12 | L13 | L14 | L15 | L16 | L17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full | 19.4 | 16.1 | 13.7 | 14.1 | 17.6 | 16.1 | 14.1 | 14.5 | 16.5 | 47.7 | 95.3 | 92.8 | 92.3 | 89.6 | 86.8 | 82.1 | 78.8 | 74.9 |
| GDN | 19.9 | 25.1 | 21.1 | 25.4 | 28.1 | 26.8 | 28.2 | 32.7 | 27.1 | 74.4 | 63.8 | 60.0 | 58.1 | 77.5 | 77.0 | 67.5 | 64.2 | 55.6 |
| Lightning | 12.1 | 12.2 | 12.4 | 11.9 | 12.7 | 11.6 | 12.8 | 23.5 | 23.0 | 67.0 | 64.5 | 89.1 | 80.2 | 82.2 | 78.4 | 72.0 | 68.8 | 63.4 |
| Mamba-2 | 12.1 | 13.7 | 12.1 | 14.2 | 14.9 | 14.0 | 12.8 | 12.5 | 13.8 | 16.7 | 15.1 | 61.7 | 53.6 | 78.2 | 69.7 | 57.5 | 51.8 | 35.8 |
| SWA-128 | 12.3 | 11.5 | 12.5 | 12.0 | 12.4 | 14.3 | 12.6 | 39.2 | 33.1 | 76.6 | 61.5 | 75.8 | 69.1 | 85.0 | 80.2 | 77.5 | 73.7 | 67.7 |
| SWA-512 | 11.6 | 12.7 | 13.5 | 11.7 | 12.6 | 11.8 | 12.2 | 22.6 | 28.3 | 86.2 | 78.6 | 87.3 | 81.4 | 75.6 | 69.9 | 65.7 | 63.5 | 60.0 |
| SWA-2048 | 12.4 | 12.5 | 12.7 | 13.2 | 15.8 | 28.5 | 29.0 | 34.2 | 32.6 | 69.0 | 66.2 | 72.8 | 64.6 | 61.4 | 57.0 | 53.2 | 50.0 | 45.5 |

Table 10: Layer-wise logistic-regression probing accuracy (%) on the S4/$1000N$ NIAH classification task.
### D.3 Gradient Profiling

Gradient profiling uses the input-gradient norm of a logit-based scalar output as a proxy for the long-range training signal that a historical token provides for next-token prediction. We give a short derivation linking this proxy to (i) local sensitivity of the model’s prediction, (ii) gradients on retrieval-head Q/K parameters, and (iii) conditional dependency in the data, and we use it to read Figure [5(a)](https://arxiv.org/html/2606.15378#S5.F5.sf1).

Let $x_{1:T}\sim\mathcal{D}$ be a token sequence sampled from the pretraining distribution, $e_{i}\in\mathbb{R}^{d_{\mathrm{model}}}$ the input embedding of $x_{i}$, and $z^{(t)}(x)\in\mathbb{R}^{|\mathcal{V}|}$ the logit vector produced by $p_{\theta}$ at position $t$. Following Li et al. ([2016](https://arxiv.org/html/2606.15378#bib.bib43)), we summarize the model’s prediction near the end of the context by the scalar

$$
s(x)\;=\;\sum_{v\in\mathcal{V}}\frac{1}{N_{\tau}}\sum_{t\in\tau}z_{v}^{(t)}(x),
$$

where $\tau$ is the last $N_{\tau}=20$ positions, and report the average input-gradient norm at distance $d=T-i$,

$$
G(d)\;=\;\mathbb{E}_{x\sim\mathcal{D}}\!\left[\left\|\partial s(x)/\partial e_{T-d}\right\|_{2}\right].
$$
#### \(1\) Local sensitivity\.

A first-order Taylor expansion of $s$ in $e_{i}$, followed by Cauchy–Schwarz, gives for any perturbation $\Delta e_{i}$, up to second-order terms in $\|\Delta e_{i}\|$,

$$
\big|s(e_{i}+\Delta e_{i})-s(e_{i})\big|\;\leq\;\|\partial s/\partial e_{i}\|_{2}\cdot\|\Delta e_{i}\|_{2}.
$$

So $\|\partial s/\partial e_{i}\|_{2}$ tightly bounds the first-order change of $s$ under infinitesimal perturbations of $e_{i}$.

#### \(2\) Connection to retrieval\-head gradients\.

By the chain rule, $\partial s/\partial e_{i}$ decomposes into contributions from all computational paths that route information from position $i$ into the last $N_{\tau}$ positions. For a single retrieval head with attention weights $a_{t,j}$ and per-position output $o_{t}=\sum_{j}a_{t,j}v_{j}$ (with $v_{j}=Ve_{j}$), a direct softmax computation gives

$$
\frac{\partial s}{\partial\mathrm{score}_{t,i}}\;=\;a_{t,i}\,(v_{i}-o_{t})^{\!\top}\frac{\partial s}{\partial o_{t}},
$$

so the head’s Q/K gradient at the entry $(t,i)$ shares the multiplicative factor $a_{t,i}\,\partial s/\partial o_{t}$. The same factor also appears in the value-path contribution to $\partial s/\partial e_{i}$, via $\partial s/\partial v_{i}=\sum_{t\in\tau}a_{t,i}\,\partial s/\partial o_{t}$. Hence, absent fine-tuned path cancellation, a small $\|\partial s/\partial e_{i}\|_{2}$ implies that the Q/K update strengthening retrieval at distance $d$ is correspondingly weak, and we read $G(d)$ as a per-sample upper-bound proxy on this training signal.
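The closed form for $\partial s/\partial\mathrm{score}_{t,i}$ can be checked numerically on a toy head. The sketch below is an illustration under simplifying assumptions: all function names are hypothetical, a single query position is used, and $s$ is taken to be linear in $o_{t}$ so that $\partial s/\partial o_{t}$ is a fixed vector `ds_do`. It compares the analytic expression against a central finite difference.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]


def head_output(scores, values):
    """o = sum_j a_j v_j for one query position."""
    a = softmax(scores)
    dim = len(values[0])
    return [sum(a[j] * values[j][k] for j in range(len(a))) for k in range(dim)]


def analytic_grad(scores, values, ds_do, i):
    """a_i * (v_i - o)^T ds/do, the closed form stated in the text."""
    a = softmax(scores)
    o = head_output(scores, values)
    return a[i] * sum((values[i][k] - o[k]) * ds_do[k] for k in range(len(o)))


def numeric_grad(scores, values, ds_do, i, eps=1e-6):
    """Central finite difference of s = ds_do . o with respect to score i."""
    def s(sc):
        return sum(g * x for g, x in zip(ds_do, head_output(sc, values)))
    up = list(scores)
    up[i] += eps
    down = list(scores)
    down[i] -= eps
    return (s(up) - s(down)) / (2 * eps)
```

The agreement follows from $\partial a_{t,j}/\partial\mathrm{score}_{t,i}=a_{t,j}(\delta_{ij}-a_{t,i})$, which collapses the sum over $j$ into $a_{t,i}(v_{i}-o_{t})$.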

#### \(3\) Connection to data dependency\.

If the data satisfies the conditional independence $y_{t}\perp x_{i}\mid x_{i+1:t}$ for every $t\in\tau$, then a sufficiently trained $p_{\theta}$ inherits the same independence in its predictive distribution, and the gradient vanishes:

$$
y_{t}\perp x_{i}\mid x_{i+1:t}\;\Longrightarrow\;p_{\theta}(\cdot\mid x_{1:t})\approx p_{\theta}(\cdot\mid x_{i+1:t})\;\Longrightarrow\;\partial s/\partial e_{i}\approx 0.
$$

Conversely, a genuine conditional dependency at distance $d$ forces $\partial s/\partial e_{i}$ to be nonzero on average. Crucially, $y_{t}\perp x_{i}\mid x_{i+1:t}$ is a property of the *data distribution*, so the dependency profile reflected by $G(d)$ transfers across models trained on similar corpora; this justifies using Llama-3.1-8B as a proxy for the dependency signal seen by our hybrid models.²

² Strictly, conditional independence constrains logits only up to a global additive constant (softmax is invariant under such shifts); in the standard parameterization $z_{v}^{(t)}=w_{v}^{\!\top}h^{(t)}$, this common mode carries no independent training signal.

Combining (1)–(3), for a sufficiently trained $p_{\theta}$, a small $G(d)$ jointly indicates local insensitivity of $s$ to $e_{T-d}$, weak Q/K updates that would strengthen retrieval at distance $d$, and weak conditional dependency at distance $d$ in the data.

#### The flat baseline\.

Even when $x_{i}$ is conditionally uninformative, $G(d)$ does not reach zero in practice; instead, it decays to a flat baseline. Three sources contribute to this irreducible level: (a) finite-precision backward arithmetic, (b) finite-capacity $p_{\theta}$ that is not exactly Bayes-optimal, and (c) coarse topic/style/domain signals that distant tokens still carry. Formally, even with a mean-zero per-sample gradient, Jensen’s inequality gives

$$
G(d)\;=\;\mathbb{E}\!\left[\|\partial s/\partial e_{i}\|_{2}\right]\;\geq\;\|\mathbb{E}[\partial s/\partial e_{i}]\|_{2},
$$

so $G(d)$ remains strictly positive whenever the per-sample gradient is non-degenerate. We therefore estimate the baseline at a distance where Figure [5(a)](https://arxiv.org/html/2606.15378#S5.F5.sf1) has visibly flattened,

$$
G_{\mathrm{base}}\;:=\;G_{\mathrm{PG19}}(d=4096),
$$

shown as the dashed reference line in Figure [5(a)](https://arxiv.org/html/2606.15378#S5.F5.sf1), and treat $G(d)\lesssim G_{\mathrm{base}}$ as effectively no usable retrieval signal at distance $d$. Figure [5(a)](https://arxiv.org/html/2606.15378#S5.F5.sf1) then becomes a quantitative map of the distance ranges that contribute training signal during pretraining, which directly supports the Large-Window Laziness argument in Section [5.2](https://arxiv.org/html/2606.15378#S5.SS2): a SWA window already covering the range where $G(d)\gg G_{\mathrm{base}}$ absorbs most of the dependency-driven training signal before it can propagate to full-attention retrieval heads.

### D.4 Retrieval-Head Tracing

This section gives implementation details for the retrieval-head tracing experiment in Section [5.2](https://arxiv.org/html/2606.15378#S5.SS2) (Figure [5(b)](https://arxiv.org/html/2606.15378#S5.F5.sf2)) and more analysis around the formation of the retrieval head in hybrid architectures.

#### NIAH probe and head score.

We construct an NIAH probe where a unique “needle” string is hidden in a long context and the prompt ends with a question whose answer is the needle. Running the S4/$200N$ checkpoint of each hybrid on this prompt, we read the per-head attention from the last input position $q$ (the query) to all keys, and score each head $(\ell,h)$ by the attention mass it places on the needle tokens, averaged over NIAH samples:

$$
\overline{\mathrm{score}}_{\ell,h}\;=\;\frac{1}{|\mathcal{S}|}\sum_{x\in\mathcal{S}}\sum_{j\in\mathcal{N}(x)}a^{(\ell,h)}_{q,\,j}(x),
$$

where $a^{(\ell,h)}_{q,j}(x)$ is the attention weight from $q$ to key $j$ in head $(\ell,h)$ for sample $x$, $\mathcal{N}(x)$ is the set of needle token positions, and $\mathcal{S}$ is the NIAH evaluation set. A high $\overline{\mathrm{score}}_{\ell,h}$ means the head consistently routes the query’s attention back to the needle, which is the canonical retrieval-head signature.
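The head score and a top-$k$ selection over it can be sketched in a few lines. The nested-list layout `attn[n][l][h][j]` and both function names are hypothetical; the attention rows are assumed to have been extracted already for the last query position, and the caller is assumed to pass only the layers of interest (the text restricts the search to full-attention layers).

```python
def head_scores(attn, needle_positions):
    """Average needle attention mass per (layer, head).

    attn[n][l][h] is the list of attention weights from the last query
    position to every key for sample n; needle_positions[n] lists the
    needle token indices of that sample.
    """
    num_samples = len(attn)
    num_layers = len(attn[0])
    num_heads = len(attn[0][0])
    scores = [[0.0] * num_heads for _ in range(num_layers)]
    for sample, needles in zip(attn, needle_positions):
        for l in range(num_layers):
            for h in range(num_heads):
                scores[l][h] += sum(sample[l][h][j] for j in needles) / num_samples
    return scores


def top_k_heads(scores, k=2):
    """Return the k (layer, head) pairs with the highest score."""
    flat = [(scores[l][h], l, h)
            for l in range(len(scores)) for h in range(len(scores[l]))]
    return [(l, h) for _, l, h in sorted(flat, reverse=True)[:k]]
```

With `k=2` this mirrors a Top-2 selection per model; ties are broken by layer and head index here, which is an arbitrary choice of the sketch.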

#### Head selection\.

Each cell of Figure [12](https://arxiv.org/html/2606.15378#A4.F12) reports $\overline{\mathrm{score}}_{\ell,h}$ for one (layer, head) pair, for the six traced hybrid models: *SWA-128*, *SWA-512*, *SWA-2048*, *Lightning*, *Mamba-2*, and *GDN*. We restrict the search to full-attention layers, since our analysis targets long-range retrieval formed there, and select the Top-2 heads per model (red circles) as the retrieval-head set used by the tracing diagnostics in Section [5.2](https://arxiv.org/html/2606.15378#S5.SS2); lower-ranked heads have noisier retrieval signatures and would dilute these diagnostics.
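The selection rule described above (rank heads by score, but only within full-attention layers) could look like the following sketch. The layer indices, the scores, and the function name are hypothetical; which layers are full-attention depends on the hybrid layout and is passed in explicitly here.

```python
from typing import Dict, List, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def select_retrieval_heads(scores: Dict[Head, float],
                           full_attention_layers: Set[int],
                           k: int = 2) -> List[Head]:
    """Top-k heads by score, restricted to the given full-attention layers."""
    candidates = [(score, head) for head, score in scores.items()
                  if head[0] in full_attention_layers]
    # Sort by descending score; break ties by (layer, head) for determinism.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [head for _, head in candidates[:k]]


# Made-up scores; layer 0 is assumed to be an efficient-attention layer.
toy_scores = {(0, 0): 0.90, (3, 1): 0.42, (3, 5): 0.07, (7, 2): 0.61, (7, 4): 0.12}
top2 = select_retrieval_heads(toy_scores, full_attention_layers={3, 7}, k=2)
print(top2)
```

Note that head `(0, 0)` is skipped despite its high score, because its layer is not in the full-attention set.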

![Refer to caption](https://arxiv.org/html/2606.15378v1/x12.png)

Figure 12: Per-head NIAH attention-mass scores $\overline{\mathrm{score}}_{\ell,h}$ for the six S4/$200N$ hybrid models. Red circles mark the selected top-2 retrieval heads in each model.

Figure [12](https://arxiv.org/html/2606.15378#A4.F12) also reveals that *SWA-2048* has noticeably fewer high-response heads in its full-attention layers than the other hybrids, consistent with the *Large-Window Laziness* hypothesis.

![Refer to caption](https://arxiv.org/html/2606.15378v1/x13.png)

Figure 13: Smaller sliding-window attention activates retrieval-head training earlier. The figure shows the evolution of the Frobenius norm of the gradient on the $Q$ projections of retrieval heads during training for *SWA-128*, *SWA-512*, and *SWA-2048* under both S1 and S4 model scales.

#### Training Gradient.

To trace the training dynamics of retrieval heads, we train SWA hybrid models with different window sizes and track their gradient norms throughout training. Following the setup of our scaling experiments in Appendix [C](https://arxiv.org/html/2606.15378#A3), we train S1- and S4-scale models from scratch using a constant learning rate for 4000 steps, with an initial 100-step warmup phase. During training, we record the gradients of the $Q$ projection slices for all heads, and use the final checkpoint to identify the Top-1 retrieval head. We then compare the evolution of the gradient norm of this retrieval head throughout training.

Figure [13](https://arxiv.org/html/2606.15378#A4.F13) shows the evolution of the Frobenius norm of the loss gradient with respect to the $Q$ projection weights of the retrieval head during training:

$$\|\nabla W\|_{F} = \left(\sum_{i,j}\left(\frac{\partial\mathcal{L}}{\partial W_{ij}}\right)^{2}\right)^{1/2}.$$

We can clearly observe that smaller sliding windows allocate gradient mass to retrieval heads much earlier, whereas larger sliding windows substantially delay the training of retrieval heads. For example, the retrieval head in *SWA-2048* does not begin to receive effective training until roughly 1500 steps into training. The light gray curves in the figure represent the evolution of $\|\nabla W_{Q}\|_{F}$ for the other heads in the same model. Notably, for *SWA-2048*, these other heads do not exhibit the same delayed-activation behavior.
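The per-head quantity plotted in Figure 13 can be illustrated with a small sketch. It assumes the $Q$-projection gradient is stored as a `(num_heads * head_dim, d_model)` matrix with each head's rows contiguous; that layout, the helper names, and the toy gradient values are assumptions for the example, not details given in the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def frobenius_norm(grad: Matrix) -> float:
    """Square root of the sum of squared entries of a gradient matrix."""
    return math.sqrt(sum(g * g for row in grad for g in row))


def head_slice(grad_wq: Matrix, head: int, head_dim: int) -> Matrix:
    """Rows of the Q-projection gradient that belong to one head.

    Assumes heads occupy contiguous blocks of `head_dim` rows.
    """
    start = head * head_dim
    return grad_wq[start:start + head_dim]


# Toy gradient for 2 heads with head_dim = 2 and d_model = 3 (values made up).
grad_wq = [
    [3.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [1.0, 2.0, 2.0],
    [0.0, 0.0, 0.0],
]
norms = [frobenius_norm(head_slice(grad_wq, h, head_dim=2)) for h in range(2)]
print(norms)
```

Logging this value for every head at each training step, then highlighting the head chosen from the final checkpoint, yields curves of the kind the figure describes.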

## Appendix E Benchmark Evaluation

Table [2](https://arxiv.org/html/2606.15378#S6.T2) in the main paper reports only the aggregated scores. Here we provide the per-task results for the same configurations: *Full*, *SWA-128*, and *SWA-128-NoPE*, each at S4 (0.22B) and S5 (0.66B) trained on ≈100B tokens. The 16K-context results use these ≈100B-token checkpoints directly; the 32K-context results use the S5 checkpoint after an additional 5B-token long-context extension at a 32K sequence length.

#### Benchmarks\.

For long-context evaluation, we use RULER (Hsieh et al., [2024](https://arxiv.org/html/2606.15378#bib.bib25)) and LongBench (Bai et al., [2024](https://arxiv.org/html/2606.15378#bib.bib41)); for each RULER sub-task, we generate 200 test instances and report task accuracy averaged over them. For short-context evaluation, we use 19 standard benchmarks covering knowledge, commonsense reasoning, reading comprehension, and natural language inference: MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2606.15378#bib.bib45)), C-Eval (Huang et al., [2023](https://arxiv.org/html/2606.15378#bib.bib46)), CMMLU (Li et al., [2024](https://arxiv.org/html/2606.15378#bib.bib47)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2606.15378#bib.bib48)), PIQA (Bisk et al., [2020](https://arxiv.org/html/2606.15378#bib.bib49)), ARC-Easy and ARC-Challenge (Clark et al., [2018](https://arxiv.org/html/2606.15378#bib.bib50)), WinoGrande (Sakaguchi et al., [2021](https://arxiv.org/html/2606.15378#bib.bib51)), OpenBookQA (Mihaylov et al., [2018](https://arxiv.org/html/2606.15378#bib.bib52)), CommonsenseQA (Talmor et al., [2019](https://arxiv.org/html/2606.15378#bib.bib53)), SIQA (Sap et al., [2019](https://arxiv.org/html/2606.15378#bib.bib54)), StoryCloze (Mostafazadeh et al., [2016](https://arxiv.org/html/2606.15378#bib.bib55)), RACE-middle and RACE-high (Lai et al., [2017](https://arxiv.org/html/2606.15378#bib.bib56)), COPA (Roemmele et al., [2011](https://arxiv.org/html/2606.15378#bib.bib57)), RTE (Wang et al., [2019](https://arxiv.org/html/2606.15378#bib.bib61)), CB (De Marneffe et al., [2019](https://arxiv.org/html/2606.15378#bib.bib58)), WiC (Pilehvar and Camacho-Collados, [2019](https://arxiv.org/html/2606.15378#bib.bib59)), and MultiRC (Khashabi et al., [2018](https://arxiv.org/html/2606.15378#bib.bib60)).

#### Evaluation protocol\.

All evaluations use deterministic (greedy) decoding to eliminate sampling variance. For long-context tasks, we follow the task-specific reference-based metrics of RULER and LongBench. For short-context multiple-choice tasks, we score each candidate option by its length-normalized log-likelihood under the model and select the option with the highest score; this likelihood-based protocol better reflects the underlying capability of base models, which do not yet have the instruction-following ability needed for direct answer generation.
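The multiple-choice protocol can be sketched as follows. This is an illustrative sketch only: it assumes per-token log-probabilities for each candidate option are already available and that normalization divides by the number of option tokens, a detail the text does not spell out. The function names and toy numbers are hypothetical.

```python
from typing import List, Sequence


def length_normalized_loglik(token_logprobs: Sequence[float]) -> float:
    """Mean per-token log-probability of one candidate option."""
    return sum(token_logprobs) / len(token_logprobs)


def pick_option(option_token_logprobs: List[Sequence[float]]) -> int:
    """Index of the option with the highest length-normalized log-likelihood."""
    scores = [length_normalized_loglik(lp) for lp in option_token_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up per-token log-probabilities for three candidate options.
options = [
    [-0.9, -0.9],              # short option: sum -1.8, mean -0.9
    [-0.5, -0.5, -0.5, -0.5],  # long option:  sum -2.0, mean -0.5
    [-0.2, -2.0],              # sum -2.2, mean -1.1
]
choice = pick_option(options)
raw_choice = max(range(len(options)), key=lambda i: sum(options[i]))
print(choice, raw_choice)
```

The toy example shows why normalization matters: the unnormalized sum favors the shorter first option, while the length-normalized score selects the longer second option whose tokens are individually more likely.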

#### Per\-task results\.

The detailed RULER-16K and LongBench scores are shown in Tables [11](https://arxiv.org/html/2606.15378#A6.T11) and [12](https://arxiv.org/html/2606.15378#A6.T12); per-task short-context scores are reported in Table [13](https://arxiv.org/html/2606.15378#A6.T13).

## Appendix F Statement on the AI Usage

During the writing and revision of this paper, the authors used large language models only as auxiliary tools for improving grammar, wording, sentence structure, clarity, and readability\. These tools were not involved in the core academic work of this study, including formulating research questions, designing experiments, processing data, analyzing results, or drawing conclusions\.

All LLM\-assisted edits were carefully reviewed, judged, and revised by the authors\. The authors take full responsibility for the authenticity, originality, accuracy, and completeness of the final manuscript\.

| Task | S4 / ≈100B / 16K | | | S5 / ≈100B / 16K | | | S5 / ≈100B / 32K | | |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| | **Full** | **SWA-128** | **SWA-128-NoPE** | **Full** | **SWA-128** | **SWA-128-NoPE** | **Full** | **SWA-128** | **SWA-128-NoPE** |
| *NIAH: single key* | | | | | | | | | |
| niah_s1 | 81.00 | 97.00 | 100.00 | 99.50 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| niah_s2 | 93.50 | 95.50 | 100.00 | 97.00 | 97.50 | 100.00 | 100.00 | 99.50 | 97.00 |
| niah_s3 | 27.50 | 41.00 | 66.50 | 65.00 | 64.00 | 66.50 | 54.00 | 50.50 | 63.50 |
| *NIAH: multi key* | | | | | | | | | |
| niah_mk1 | 30.00 | 75.50 | 88.00 | 95.50 | 83.50 | 83.50 | 96.50 | 76.00 | 77.50 |
| niah_mk2 | 18.00 | 27.00 | 59.50 | 87.50 | 79.00 | 92.00 | 85.00 | 72.00 | 72.00 |
| niah_mk3 | 1.50 | 3.00 | 27.00 | 19.50 | 17.00 | 48.50 | 5.50 | 17.00 | 30.50 |
| *NIAH: multi value / multi query* | | | | | | | | | |
| niah_mv | 15.25 | 28.38 | 46.50 | 36.25 | 44.12 | 85.50 | 27.88 | 27.12 | 54.62 |
| niah_mq | 20.88 | 29.25 | 55.00 | 36.88 | 42.12 | 82.50 | 32.00 | 39.25 | 68.25 |
| *Variable tracking / aggregation* | | | | | | | | | |
| vt | 1.50 | 0.50 | 0.70 | 3.40 | 4.70 | 0.90 | 8.20 | 6.70 | 3.00 |
| cwe | 4.90 | 1.30 | 2.00 | 0.50 | 7.20 | 4.10 | 0.85 | 3.45 | 2.05 |
| fwe | 12.67 | 36.83 | 14.17 | 40.17 | 35.00 | 0.50 | 44.33 | 37.17 | 14.33 |
| *QA* | | | | | | | | | |
| qa_1 | 14.00 | 16.00 | 15.50 | 20.50 | 18.00 | 15.50 | 8.00 | 8.50 | 17.00 |
| qa_2 | 5.50 | 8.00 | 7.50 | 11.50 | 7.50 | 8.00 | 8.50 | 7.00 | 11.00 |
| **NIAH average (8)** | 35.95 | 49.58 | 67.81 | 67.14 | 65.91 | 82.31 | 62.61 | 60.17 | 70.42 |
| **Total average (13)** | 25.09 | 35.33 | 44.80 | 47.17 | 46.13 | 52.88 | 43.90 | 41.86 | 46.98 |

Table 11: Per-task results on RULER. “NIAH average (8)” is the average over the eight NIAH-style tasks; “Total average (13)” is the average over all 13 RULER tasks.

| Task | S4 / ≈100B / 16K | | | S5 / ≈100B / 16K | | | S5 / ≈100B / 32K | | |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| | **Full** | **SWA-128** | **SWA-128-NoPE** | **Full** | **SWA-128** | **SWA-128-NoPE** | **Full** | **SWA-128** | **SWA-128-NoPE** |
| *Single-document QA* | | | | | | | | | |
| narrativeqa | 2.82 | 2.52 | 2.53 | 2.72 | 2.93 | 3.10 | 2.74 | 2.87 | 2.88 |
| qasper | 14.07 | 16.46 | 14.39 | 18.72 | 17.09 | 19.07 | 19.32 | 18.30 | 19.78 |
| multifieldqa_en | 16.06 | 17.69 | 17.72 | 19.09 | 19.70 | 21.01 | 20.26 | 20.39 | 21.39 |
| multifieldqa_zh | 13.19 | 13.06 | 13.35 | 17.17 | 16.08 | 17.60 | 17.59 | 15.26 | 15.74 |
| *Multi-document QA* | | | | | | | | | |
| hotpotqa | 6.70 | 6.28 | 6.28 | 7.34 | 8.03 | 7.76 | 7.99 | 8.56 | 8.94 |
| 2wikimqa | 8.36 | 8.12 | 8.17 | 8.69 | 8.18 | 9.27 | 8.58 | 8.44 | 9.77 |
| musique | 3.64 | 3.13 | 3.78 | 3.78 | 4.05 | 5.37 | 4.52 | 4.37 | 5.46 |
| dureader | 18.09 | 19.95 | 18.68 | 25.15 | 22.49 | 26.43 | 23.26 | 23.21 | 25.37 |
| *Summarization* | | | | | | | | | |
| gov_report | 15.33 | 20.60 | 26.76 | 24.46 | 24.48 | 23.40 | 26.15 | 29.54 | 25.68 |
| qmsum | 14.99 | 17.61 | 15.90 | 19.15 | 15.36 | 18.89 | 19.09 | 17.66 | 17.87 |
| multi_news | 17.74 | 19.72 | 25.69 | 17.41 | 22.42 | 23.86 | 21.96 | 25.83 | 26.21 |
| vcsum | 0.90 | 5.65 | 7.51 | 4.34 | 5.18 | 4.39 | 2.14 | 6.54 | 5.04 |
| *Few-shot learning* | | | | | | | | | |
| trec | 71.50 | 62.50 | 67.00 | 69.00 | 66.00 | 71.00 | 65.50 | 66.00 | 71.50 |
| triviaqa | 4.15 | 13.08 | 0.50 | 0.00 | 3.03 | 0.50 | 0.00 | 0.50 | 0.50 |
| samsum | 12.74 | 18.06 | 9.39 | 27.83 | 18.34 | 30.35 | 28.34 | 18.12 | 29.76 |
| lsht | 6.00 | 6.50 | 12.00 | 15.50 | 16.25 | 21.00 | 21.00 | 23.50 | 24.25 |
| *Synthetic* | | | | | | | | | |
| passage_count | 1.98 | 0.33 | 0.23 | 0.97 | 1.17 | 3.13 | 2.35 | 0.62 | 0.96 |
| passage_retrieval_en | 3.83 | 3.71 | 4.01 | 4.17 | 3.67 | 3.83 | 3.54 | 4.79 | 3.88 |
| passage_retrieval_zh | 3.59 | 4.22 | 4.53 | 3.85 | 5.08 | 3.97 | 3.89 | 4.67 | 3.98 |
| *Code completion* | | | | | | | | | |
| lcc | 43.88 | 38.11 | 45.79 | 48.70 | 44.39 | 41.27 | 49.49 | 41.22 | 43.52 |
| repobench-p | 37.33 | 36.16 | 40.78 | 49.24 | 44.06 | 44.14 | 49.88 | 43.84 | 46.09 |
| **Average (21)** | 15.09 | 15.88 | 16.43 | 18.44 | 17.52 | 19.02 | 18.93 | 18.30 | 19.46 |

Table 12: Per-task results on LongBench, using the task-specific reference-based metrics from the official LongBench scripts. The bottom row averages all 21 tasks and matches the LongBench column in Table [2](https://arxiv.org/html/2606.15378#S6.T2).

| Benchmark | S4 / ≈100B | | | S5 / ≈100B | | |
|---|---:|---:|---:|---:|---:|---:|
| | **Full** | **SWA-128** | **SWA-128-NoPE** | **Full** | **SWA-128** | **SWA-128-NoPE** |
| *Comprehensive knowledge* | | | | | | |
| MMLU | 26.35 | 24.10 | 25.45 | 29.71 | 30.06 | 30.84 |
| C-Eval | 27.23 | 24.92 | 24.20 | 26.82 | 27.74 | 28.47 |
| CMMLU | 25.63 | 25.18 | 25.12 | 26.20 | 29.46 | 27.48 |
| *Commonsense and completion* | | | | | | |
| HellaSwag | 30.63 | 30.97 | 30.20 | 38.21 | 38.68 | 38.17 |
| PIQA | 64.09 | 61.92 | 61.92 | 65.83 | 65.89 | 66.21 |
| ARC-Easy | 39.51 | 39.15 | 40.04 | 43.03 | 44.80 | 42.15 |
| ARC-Challenge | 23.73 | 25.76 | 28.47 | 31.53 | 30.17 | 28.81 |
| WinoGrande | 52.17 | 53.91 | 52.80 | 54.38 | 53.99 | 53.43 |
| OpenBookQA | 27.60 | 26.60 | 27.40 | 25.00 | 24.80 | 27.60 |
| CommonsenseQA | 19.25 | 19.25 | 19.82 | 21.62 | 24.90 | 22.77 |
| SIQA | 38.08 | 39.41 | 38.08 | 40.63 | 41.50 | 40.58 |
| StoryCloze | 56.97 | 56.87 | 56.60 | 61.73 | 61.57 | 62.27 |
| *Reading and entailment* | | | | | | |
| RACE-middle | 25.07 | 21.87 | 23.12 | 27.51 | 30.15 | 34.75 |
| RACE-high | 25.64 | 21.13 | 21.70 | 28.16 | 29.96 | 30.62 |
| COPA | 51.00 | 54.00 | 49.00 | 58.00 | 56.00 | 57.00 |
| RTE | 48.74 | 53.43 | 53.07 | 52.71 | 51.62 | 50.54 |
| CB | 50.00 | 50.00 | 50.00 | 44.64 | 50.00 | 50.00 |
| WiC | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 |
| MultiRC | 42.82 | 44.08 | 42.80 | 43.09 | 43.67 | 43.34 |
| **Average (19)** | 38.13 | 38.03 | 37.88 | 40.46 | 41.31 | 41.32 |

Table 13: Per-task results on the 19 short-context benchmarks; bottom row averages all 19 tasks and matches the ShortAvg column in Table [2](https://arxiv.org/html/2606.15378#S6.T2). MMLU, C-Eval, and CMMLU report macro averages over their sub-tasks; the remaining rows report individual benchmark accuracies. All scores are obtained with deterministic decoding and option selection by length-normalized log-likelihood (higher is better).
