ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation

arXiv cs.CL Papers

Summary

ConSA is a framework that learns an optimal assignment between full attention and sliding-window attention under a user-specified sparsity target, using L0 regularization and an augmented Lagrangian constraint. It reports consistent gains over rule-based baselines on LLMs at the 0.6B and 1.7B scales.

arXiv:2606.18056v1 Announce Type: new Abstract: Hybrid architectures combining full attention (FA) and sliding-window attention (SWA) are a promising paradigm for efficient LLM inference. However, existing methods typically rely on hand-crafted rules or simple post-hoc heuristics for FA/SWA allocation and offer limited analysis of the attention behaviors underlying these designs. We propose Controllable Sparsity in Hybrid Attention (ConSA), a framework that learns optimal FA/SWA assignment under a user-specified sparsity target. ConSA employs L0 regularization to learn binary masks selecting between FA and SWA for each attention unit, while an augmented Lagrangian constraint enforces the target sparsity at either layer or KV-head granularity. We evaluate ConSA on two LLMs at the 0.6B and 1.7B scales. Learned allocations consistently outperform rule-based baselines, with KV-head-wise allocation yielding clear gains over layer-wise allocation. The learned patterns place SWA in the bottom layers and concentrate FA into contiguous middle-layer blocks, diverging from evenly interleaved patterns in rule-based methods. This structure persists across model scales, sparsity levels, and allocation granularities, revealing a fine-grained spectrum of intrinsic attention behaviors that underlies the learned allocation.

Cached at: 06/17/26, 05:42 AM

# Controllable Sparsity in Hybrid Attention via Learnable Allocation
Source: [https://arxiv.org/html/2606.18056](https://arxiv.org/html/2606.18056)
Yao Chen1,2, Yinqi Yang3∗, Junyuan Shang3, Xiangzhao Hao3, Simeng Zhang1,2, Yilong Chen1,2,Tingwen Liu1,2†,Shuohuan Wang3,Dianhai Yu3 1Institute of Information Engineering, Chinese Academy of Sciences 2School of Cyber Security, University of Chinese Academy of Sciences 3Baidu Inc\. \{chenyao2023, liutingwen\}@iie\.ac\.cn \{yangyinqi, shangjunyuan, wangshuohuan\}@baidu\.com

###### Abstract

Hybrid architectures combining full attention \(FA\) and sliding\-window attention \(SWA\) are a promising paradigm for efficient LLM inference\. However, existing methods typically rely on hand\-crafted rules or simple post\-hoc heuristics for FA/SWA allocation and offer limited analysis of the attention behaviors underlying these designs\. We propose Controllable Sparsity in Hybrid Attention \(ConSA\), a framework that learns optimal FA/SWA assignment under a user\-specified sparsity target\. ConSA employs L0 regularization to learn binary masks selecting between FA and SWA for each attention unit, while an augmented Lagrangian constraint enforces the target sparsity at either layer or KV\-head granularity\. We evaluate ConSA on two LLMs at the 0\.6B and 1\.7B scales\. Learned allocations consistently outperform rule\-based baselines, with KV\-head\-wise allocation yielding clear gains over layer\-wise allocation\. The learned patterns place SWA in the bottom layers and concentrate FA into contiguous middle\-layer blocks, diverging from evenly interleaved patterns in rule\-based methods\. This structure persists across model scales, sparsity levels, and allocation granularities, revealing a fine\-grained spectrum of intrinsic attention behaviors that underlies the learned allocation\.

∗ denotes equal contribution. † denotes the corresponding author.

## 1 Introduction

Large language models have made attention cost a deployment bottleneck: full attention (FA) scales quadratically with sequence length in compute and linearly in KV cache Kwon et al. ([2023](https://arxiv.org/html/2606.18056#bib.bib38)). Among efficient alternatives, sliding-window attention (SWA) Beltagy et al. ([2020](https://arxiv.org/html/2606.18056#bib.bib7)) restricts each token to a fixed local window, reducing both attention compute and per-head KV cache during inference. However, the fixed window discards long-range dependencies, which can hurt tasks that require global context Xiao et al. ([2024](https://arxiv.org/html/2606.18056#bib.bib39)). A natural strategy to balance cost and capability is to combine FA and SWA within a single architecture.

Production models such as Mistral Jiang et al. ([2023](https://arxiv.org/html/2606.18056#bib.bib1)), Gemma 2 Team ([2024](https://arxiv.org/html/2606.18056#bib.bib2)), and MiMo-V2-Flash Xiaomi ([2026](https://arxiv.org/html/2606.18056#bib.bib17)) have adopted such hybrid designs through hand-crafted interleaved patterns. However, these manually specified allocations do not account for the heterogeneous attention behaviors across layers and heads in the original model Xiao et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib15)). LoZA Zhang et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib3)) replaces manual design with a lightweight calibration stage that scores layers via learnable scalar weights and converts low-scoring layers to local attention. Yet when calibrated on a small amount of pre-training data, such scalar scores may have limited discriminative power across layers, making layer selection less reliable under different target sparsity levels. Moreover, prior work Xiao et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib15)); Zhang et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib3)); Zhao et al. ([2026a](https://arxiv.org/html/2606.18056#bib.bib40)) on hybrid attention offers little analysis of what FA/SWA patterns emerge across layers and heads, and how these patterns relate to the intrinsic attention behaviors of the original model.

These limitations motivate a learnable allocation method that optimizes FA/SWA assignments under an explicit sparsity objective, while also calling for a finer-grained analysis of the intrinsic attention behaviors underlying the learned allocation. We propose ConSA (Controllable Sparsity in Hybrid Attention), a framework for learning hybrid FA/SWA allocation under controllable sparsity. Given a pre-trained Transformer and a user-specified target sparsity $\rho$, ConSA formulates hybrid attention as a sparsity-constrained optimization problem: each attention unit receives a binary mask, parameterized by the hard concrete distribution under L0 regularization Louizos et al. ([2018](https://arxiv.org/html/2606.18056#bib.bib28)); Xia et al. ([2024](https://arxiv.org/html/2606.18056#bib.bib6)), that selects between FA and SWA. An augmented Lagrangian constraint enforces the target $\rho$, enabling the model to discover optimal allocations at either layer or KV-head granularity. The mask parameters and model weights are first jointly optimized during a mask-learning stage, after which the learned masks are binarized and fixed for continued pre-training.

We further analyze the learned allocation patterns and the intrinsic attention behaviors of the models across model scales and sparsity levels. The learned masks consistently place SWA in the bottom layers and concentrate FA into contiguous middle-layer blocks, diverging from the evenly interleaved patterns used in rule-based methods. Examination of the attention behavior of representative layers and heads under the learned allocation reveals diverse attention spike ranges that extend beyond the retrieval-versus-streaming dichotomy described in prior work Xiao et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib15)), and align well with ConSA's learned allocation.

Our contributions are threefold: (1) We propose ConSA, a framework that learns hybrid FA/SWA allocation via L0 regularization and augmented Lagrangian optimization at both layer-wise and KV-head-wise granularity, enabling users to specify an arbitrary target $\rho$ that is reliably satisfied during optimization. (2) Experiments across two model scales (0.6B and 1.7B) and multiple sparsity levels show that learned allocations consistently outperform rule-based baselines, with KV-head-wise allocation yielding clear gains over layer-wise allocation. Ablation studies further confirm that the L0-Lagrangian formulation outperforms calibration-based approaches relying on unconstrained scalar gates with post-hoc ranking. (3) Analysis of the learned patterns reveals a consistent SWA-bottom / FA-middle structure across model scales, sparsity levels, and allocation granularities. Examination of intrinsic attention behavior shows that this structure aligns with diverse attention spike ranges extending beyond the retrieval-versus-streaming dichotomy in prior work.

## 2 Related Work

#### Efficient Attention Mechanisms\.

The quadratic scaling of full attention has led to a variety of efficient alternatives. Sliding-window attention (SWA) Beltagy et al. ([2020](https://arxiv.org/html/2606.18056#bib.bib7)); Zaheer et al. ([2020](https://arxiv.org/html/2606.18056#bib.bib8)); Child et al. ([2019](https://arxiv.org/html/2606.18056#bib.bib9)) is a common choice because it limits computational overhead and the KV cache footprint during inference. Other approaches include linear attention Katharopoulos et al. ([2020](https://arxiv.org/html/2606.18056#bib.bib10)), sparse attention with learned patterns Kitaev et al. ([2020](https://arxiv.org/html/2606.18056#bib.bib11)); Roy et al. ([2021](https://arxiv.org/html/2606.18056#bib.bib12)), and state-space models Gu and Dao ([2023](https://arxiv.org/html/2606.18056#bib.bib13)). Our work does not introduce new attention mechanisms but focuses on how to distribute FA and SWA within a model.

#### Hybrid Attention Architectures\.

Recent LLMs often combine FA and SWA via hand-crafted patterns: Mistral Jiang et al. ([2023](https://arxiv.org/html/2606.18056#bib.bib1)) alternates SWA and FA layers, Gemma 2 Team ([2024](https://arxiv.org/html/2606.18056#bib.bib2)) uses interleaving tied to scale, and Command-R and Jamba Lenz et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib14)) adopt mixed types. Two recent works move toward learned allocation: SwiAttn Zhao et al. ([2026b](https://arxiv.org/html/2606.18056#bib.bib29)) routes tokens to FA or SWA via per-layer routers but must retain a unified KV cache; LoZA Zhang et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib3)) calibrates a per-layer scalar weight and converts the bottom-ranked layers to streaming sparse attention at a fixed 50% ratio. ConSA differs by formulating FA/SWA allocation as a sparsity-constrained optimization problem, where an augmented Lagrangian constraint enforces a user-specified target; a detailed comparison is provided in Appendix [A](https://arxiv.org/html/2606.18056#A1).

#### Attention Head Analysis\.

Attention heads are known to perform distinct roles, such as tracking position, syntax, or rare tokens Voita et al. ([2019](https://arxiv.org/html/2606.18056#bib.bib4)); Clark et al. ([2019](https://arxiv.org/html/2606.18056#bib.bib5)). More recent work identifies retrieval heads, which assign high attention mass to a few critical tokens across the full context, and streaming heads, which attend primarily to recent tokens and attention sinks; this classification is derived from output deviation on synthetic long-range retrieval tasks and has guided KV cache compression Xiao et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib15)). ConSA's learned allocation reveals a consistent SWA-bottom / FA-middle structure across model scales, sparsity levels, and allocation granularities. Analysis of representative layers and heads shows that their intrinsic attention spike ranges form a finer-grained spectrum beyond this binary classification, aligning well with the learned FA/SWA assignment.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x1.png)

Figure 1: Overview of ConSA. Left: the two-stage training pipeline. Stage 1 jointly optimizes the model parameters $\theta$, mask parameters $\alpha$, and Lagrange multipliers $\{\lambda, \phi\}$ on 1B tokens, with the constraint $\hat{\rho}(z) = \rho$ enforcing the user-specified target sparsity. Stage 2 binarizes the masks and continues pre-training for 100B tokens with a fixed FA/SWA assignment. Right: the per-head allocation mechanism. For each KV head $(l, i)$, a hard concrete mask $z_{l,i}$, parameterized by a learnable $\alpha_{l,i}$, selects between full attention (FA) and sliding-window attention (SWA).

## 3 Preliminaries

We consider a Transformer with $L$ layers, each containing multiple key-value (KV) heads. FA and SWA can be applied at different granularities; we formalize both at the KV-head level, which is the finest granularity considered in our method.

#### Full Attention \(FA\)\.

The output of the $i$-th KV-head group in layer $l$ is:

$$\mathbf{O}_{l,i}^{\mathrm{FA}} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_{l,i}\mathbf{K}_{l,i}^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_{l,i}, \tag{1}$$

where $\mathbf{Q}_{l,i}$ denotes the concatenated queries from all $g$ query heads in the group, and $\mathbf{K}_{l,i}, \mathbf{V}_{l,i} \in \mathbb{R}^{n \times d_k}$ are the shared key and value matrices for a sequence of length $n$ with head dimension $d_k$. Under causal masking, FA allows each token to attend to all preceding tokens, incurring $O(n^2)$ compute and $O(n)$ KV cache per head group.

#### Sliding\-Window Attention \(SWA\)\.

SWA restricts each token to attend only to the $w$ most recent preceding tokens, where $w$ is a fixed window size:

$$\mathbf{O}_{l,i}^{\mathrm{SWA}} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_{l,i}(\mathbf{K}_{l,i}^{w})^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_{l,i}^{w}, \tag{2}$$

where $\mathbf{K}_{l,i}^{w}, \mathbf{V}_{l,i}^{w} \in \mathbb{R}^{w \times d_k}$ are the shared key and value matrices containing only the entries within the window. With $w \ll n$, the KV cache is reduced from $O(n)$ to $O(w)$ and the compute cost from $O(n^2)$ to $O(nw)$ per head group, yielding substantial efficiency gains over FA.
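To make the two attention patterns concrete, the following minimal Python sketch builds the boolean masks implied by Eqs. 1 and 2 and counts the scored query-key pairs, which is the quantity behind the $O(n^2)$ versus $O(nw)$ comparison. The function names are illustrative, and the window is assumed to include the current token; the exact boundary convention is not stated in this section.

```python
def causal_mask(n):
    """mask[t][j] is True when query position t may attend to key position j."""
    return [[j <= t for j in range(n)] for t in range(n)]


def sliding_window_mask(n, w):
    """Causal mask limited to the w most recent positions.

    Assumption: the window counts the current token itself.
    """
    return [[t - w < j <= t for j in range(n)] for t in range(n)]


def attended_pairs(mask):
    """Number of (query, key) pairs that are scored, a proxy for attention compute."""
    return sum(sum(row) for row in mask)


def kv_cache_entries(n, w=None):
    """Keys/values one head group must keep after n tokens (w=None means FA)."""
    return n if w is None else min(n, w)


if __name__ == "__main__":
    n, w = 8, 3
    print(attended_pairs(causal_mask(n)))             # n * (n + 1) / 2 pairs
    print(attended_pairs(sliding_window_mask(n, w)))  # at most n * w pairs
    print(kv_cache_entries(n), kv_cache_entries(n, w))
```

When $w \ge n$ the two masks coincide, which is why short inputs see no difference between an SWA head and an FA head.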

## 4 Method

### 4.1 Problem Formulation

ConSA formulates the design of hybrid attention as a sparsity\-constrained allocation problem: given a pre\-trained Transformer, the objective is to determine, for each KV head, whether it should perform full attention \(FA\) or sliding\-window attention \(SWA\), such that the resulting hybrid model satisfies a user\-specified target sparsity while preserving language modeling performance\.

Let $\rho \in [0, 1]$ denote the target sparsity ratio, defined as the fraction of KV heads assigned to SWA. For the $i$-th KV head in layer $l$, we introduce a binary allocation variable $z_{l,i} \in \{0, 1\}$ that selects between the two attention types, with $z_{l,i} = 1$ selecting FA and $z_{l,i} = 0$ selecting SWA. The output of each KV-head group is then a hard selection between the two:

$$\hat{\mathbf{O}}_{l,i} = z_{l,i} \cdot \mathbf{O}_{l,i}^{\mathrm{FA}} + (1 - z_{l,i}) \cdot \mathbf{O}_{l,i}^{\mathrm{SWA}}. \tag{3}$$

ConSA applies this formulation at two levels of granularity. The *head-wise* variant treats each $z_{l,i}$ as an independent variable, allowing different KV heads within the same layer to adopt different attention types. The *layer-wise* variant constrains all KV heads in a layer to share a single allocation variable, $z_{l,i} = z_l$ for all $i$, which reduces the size of the search space. The induced sparsity ratio $\hat{\rho}(z)$ under head-wise allocation is

$$\hat{\rho}(z) = \hat{\rho}_{\mathrm{head}}(z) = 1 - \frac{1}{L \cdot H_{\mathrm{KV}}} \sum_{l=1}^{L} \sum_{i=1}^{H_{\mathrm{KV}}} z_{l,i}, \tag{4}$$

and under layer-wise allocation, it simplifies to

$$\hat{\rho}(z) = \hat{\rho}_{\mathrm{layer}}(z) = 1 - \frac{1}{L} \sum_{l=1}^{L} z_l, \tag{5}$$

where $L$ is the number of layers and $H_{\mathrm{KV}}$ is the number of KV heads per layer. The overall optimization problem is

$$\min_{\theta, z} \; \mathcal{L}_{\mathrm{LM}}(\theta, z) \quad \text{s.t.} \quad \hat{\rho}(z) = \rho, \tag{6}$$

where $\theta$ denotes the model parameters and $\mathcal{L}_{\mathrm{LM}}$ the autoregressive language modeling loss.
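The induced sparsity in Eqs. 4 and 5 is a simple count, which the sketch below spells out. It assumes masks are stored as nested Python lists with 1 for FA and 0 for SWA; the helper names are hypothetical and not taken from the paper.

```python
def head_sparsity(z):
    """Eq. 4: z[l][i] is 1 for FA and 0 for SWA; returns the fraction of SWA heads."""
    num_heads = sum(len(layer) for layer in z)
    return 1.0 - sum(sum(layer) for layer in z) / num_heads


def layer_sparsity(z_layers):
    """Eq. 5: one shared allocation variable z_l per layer."""
    return 1.0 - sum(z_layers) / len(z_layers)


def broadcast_layers(z_layers, h_kv):
    """Expand a layer-wise mask so every KV head in layer l shares z_l."""
    return [[z_l] * h_kv for z_l in z_layers]


if __name__ == "__main__":
    z_head = [[1, 0], [0, 0], [1, 1], [0, 1]]   # L = 4 layers, H_KV = 2 heads
    print(head_sparsity(z_head))                # half of the heads use SWA
    z_layer = [0, 1, 1, 1]
    # The layer-wise ratio equals the head-wise ratio of the broadcast mask.
    print(layer_sparsity(z_layer), head_sparsity(broadcast_layers(z_layer, 8)))
```

The last line illustrates why Eq. 5 is a special case of Eq. 4: sharing $z_l$ across heads leaves the head-level average unchanged.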

### 4.2 Differentiable Mask Learning with Hard Concrete

Since $z_{l,i} \in \{0, 1\}$ is binary, Eq. [6](https://arxiv.org/html/2606.18056#S4.E6) is non-differentiable and cannot be optimized directly with gradients. To jointly train $\theta$ and $z$, we parameterize each $z_{l,i}$ with the hard concrete distribution (Louizos et al., [2018](https://arxiv.org/html/2606.18056#bib.bib28)), which assigns non-zero probability mass to $0$ and $1$ while remaining continuous and differentiable in between. We refer to the resulting $z_{l,i}$ as a learnable binary mask. Each mask is controlled by a learnable parameter $\alpha_{l,i} \in \mathbb{R}$ and is sampled as

$$\begin{aligned}
u &\sim \mathcal{U}(0, 1), \\
s &= \sigma\!\left(\tfrac{1}{\beta}\bigl(\log u - \log(1 - u) + \alpha_{l,i}\bigr)\right), \\
\bar{s} &= s \cdot (\zeta - \gamma) + \gamma, \\
z_{l,i} &= \min\bigl(1, \, \max(0, \, \bar{s})\bigr),
\end{aligned} \tag{7}$$

where $\sigma$ is the sigmoid function, $\beta$ the temperature, and $\zeta > 1$ and $\gamma < 0$ the stretch parameters.

During training, $z_{l,i}$ is sampled stochastically and gradients with respect to $\alpha_{l,i}$ are obtained by reparameterizing the noise variable $u$. We initialize all $\alpha_{l,i}$ to $5.0$, which makes the expected mask value $\bar{z}_{l,i}$ close to $1$, so every head starts as FA at the beginning of training. This warm-start anchors mask learning at the original full-attention configuration of the pre-trained model and lets the optimizer turn a head into SWA only when doing so does not hurt the loss.
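The sampling procedure in Eq. 7 can be sketched in a few lines of plain Python. The values $\beta = 2/3$, $\zeta = 1.1$, and $\gamma = -0.1$ are the defaults suggested by Louizos et al. (2018) and are an assumption here, since this section does not state the settings used for ConSA; the clamping of $u$ is likewise an implementation detail added to keep the logarithms finite.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_hard_concrete(alpha, rng, beta=2.0 / 3.0, zeta=1.1, gamma=-0.1):
    """Draw one mask value z in [0, 1] following Eq. 7.

    Assumption: beta, zeta, gamma follow the Louizos et al. (2018) defaults.
    """
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep log() finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + alpha) / beta)
    s_bar = s * (zeta - gamma) + gamma            # stretch to (gamma, zeta)
    return min(1.0, max(0.0, s_bar))              # clip back to [0, 1]


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_hard_concrete(5.0, rng) for _ in range(10_000)]
    print(sum(draws) / len(draws))                     # mean close to 1
    print(sum(d == 1.0 for d in draws) / len(draws))   # mass exactly at z = 1
```

With $\alpha = 5.0$ most draws land exactly on $1$, which matches the warm-start described above: every head initially behaves as FA.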

### 4.3 Lagrangian Optimization under Sparsity Constraints

The problem in Eq. [6](https://arxiv.org/html/2606.18056#S4.E6) is a constrained optimization that requires the realized sparsity to match the target $\rho$. We address it through augmented Lagrangian relaxation, which converts the equality constraint into an additive penalty and yields the unified training objective

$$\min_{\theta, \alpha} \; \max_{\lambda, \phi} \; \mathcal{L}_{\mathrm{LM}}(\theta, z) + \lambda \cdot \bigl(\hat{\rho}(z) - \rho\bigr) + \phi \cdot \bigl(\hat{\rho}(z) - \rho\bigr)^{2}, \tag{8}$$

where $\lambda$ is a Lagrange multiplier and $\phi$ is an adaptive quadratic coefficient. The linear term pushes the realized sparsity toward $\rho$, while the quadratic term penalizes larger violations and stabilizes the optimization dynamics around the constraint. Under this min-max formulation, $\theta$ and $\alpha$ are updated by gradient descent, while $\lambda$ and $\phi$ are updated by gradient ascent.
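As an illustration of the ascent side of Eq. 8, the sketch below evaluates the constraint terms and applies one multiplier update. It assumes plain gradient ascent with a single shared step size `lr`; the actual optimizer and learning rates used for $\lambda$ and $\phi$ are not specified in this section, and the function names are hypothetical.

```python
def lagrangian_penalty(rho_hat, rho, lam, phi):
    """Constraint terms of Eq. 8: lam * d + phi * d**2 with d = rho_hat - rho."""
    d = rho_hat - rho
    return lam * d + phi * d * d


def ascent_step(rho_hat, rho, lam, phi, lr):
    """One gradient-ascent update of the multipliers.

    d(penalty)/d(lam) = d and d(penalty)/d(phi) = d**2.
    Assumption: vanilla ascent with one step size for both multipliers.
    """
    d = rho_hat - rho
    return lam + lr * d, phi + lr * d * d


if __name__ == "__main__":
    rho_hat, rho = 0.30, 0.50          # too few SWA heads so far
    lam, phi = 1.0, 2.0
    before = lagrangian_penalty(rho_hat, rho, lam, phi)
    lam, phi = ascent_step(rho_hat, rho, lam, phi, lr=0.1)
    after = lagrangian_penalty(rho_hat, rho, lam, phi)
    print(before, after)               # ascent raises the penalty for the same violation
```

When the sparsity is below target, $\lambda$ decreases, so the descent step on $\alpha$ is rewarded for increasing $\hat{\rho}$; once $\hat{\rho} = \rho$ both gradients vanish and the multipliers stop moving.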

As a stochastic sample from the hard concrete distribution, $z$ fluctuates across forward passes, which produces high-variance gradients with respect to $\alpha$. We therefore constrain the expected sparsity $\mathbb{E}[\hat{\rho}(z)]$ instead, which is a smooth, deterministic function of $\alpha$. By the linearity of expectation,

$$\begin{aligned}
\mathbb{E}[\hat{\rho}(z)] &= 1 - \frac{1}{L \cdot H_{\mathrm{KV}}} \sum_{l=1}^{L} \sum_{i=1}^{H_{\mathrm{KV}}} \mathbb{E}[z_{l,i}], \\
\mathbb{E}[z_{l,i}] &= 1 - F_{\bar{s}_{l,i}}(0 \mid \alpha_{l,i}) = \sigma\!\left(\alpha_{l,i} - \beta \log \tfrac{-\gamma}{\zeta}\right),
\end{aligned} \tag{9}$$

where $F_{\bar{s}_{l,i}}(\cdot \mid \alpha_{l,i})$ is the cumulative distribution function of the stretched concrete variable $\bar{s}_{l,i}$, and $\sigma$ is the sigmoid function. Replacing $\hat{\rho}(z)$ in Eq. [8](https://arxiv.org/html/2606.18056#S4.E8) with $\mathbb{E}[\hat{\rho}(z)]$ gives the objective actually optimized during training:

$$\min_{\theta, \alpha} \; \max_{\lambda, \phi} \; \mathcal{L}_{\mathrm{LM}}(\theta, z) + \lambda \cdot \bigl(\mathbb{E}[\hat{\rho}(z)] - \rho\bigr) + \phi \cdot \bigl(\mathbb{E}[\hat{\rho}(z)] - \rho\bigr)^{2}. \tag{10}$$

This substitution is exact in the limit: by a property of the hard concrete distribution, each $\mathbb{E}[z_{l,i}]$ concentrates on $\{0, 1\}$ as training proceeds, so the expected sparsity converges to the realized sparsity $\hat{\rho}(z)$ at convergence.
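The closed form in Eq. 9 makes the constraint term cheap to evaluate, as the sketch below shows. The hard concrete hyperparameters again default to the Louizos et al. (2018) values, which is an assumption, and the function names are illustrative.

```python
import math


def expected_mask(alpha, beta=2.0 / 3.0, zeta=1.1, gamma=-0.1):
    """Closed form of Eq. 9: sigmoid(alpha - beta * log(-gamma / zeta)).

    Assumption: beta, zeta, gamma follow the Louizos et al. (2018) defaults.
    """
    x = alpha - beta * math.log(-gamma / zeta)
    if x >= 0:                       # stable sigmoid for either sign of x
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_sparsity(alphas, **hard_concrete):
    """alphas[l][i] holds the mask parameter of KV head i in layer l."""
    flat = [a for layer in alphas for a in layer]
    return 1.0 - sum(expected_mask(a, **hard_concrete) for a in flat) / len(flat)


if __name__ == "__main__":
    # 28 layers with 8 KV heads each, all initialized to alpha = 5.0.
    init = [[5.0] * 8 for _ in range(28)]
    print(expected_mask(5.0))        # close to 1: heads start as FA
    print(expected_sparsity(init))   # close to 0 before mask learning
```

Because `expected_sparsity` depends only on $\alpha$ and not on the sampled noise, its gradient with respect to $\alpha$ is deterministic, which is the motivation given above for constraining the expectation.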

After the augmented Lagrangian penalty converges to zero, we drop the stochastic sampling and binarize each learned mask as

$$z_{l,i} = \mathbb{1}[\alpha_{l,i} > 0]. \tag{11}$$

The binarized masks give a fixed FA/SWA assignment that is used in all subsequent forward passes. The full training pipeline is described in Section [5.1](https://arxiv.org/html/2606.18056#S5.SS1).
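Binarization per Eq. 11 is a sign test on each mask parameter. The short sketch below applies it to a toy set of learned values and reports the realized sparsity; the $\alpha$ values are made up for illustration and the helper names are hypothetical.

```python
def binarize(alphas):
    """Eq. 11: a head keeps FA (1) when its alpha is strictly positive, else SWA (0)."""
    return [[1 if a > 0 else 0 for a in layer] for layer in alphas]


def realized_sparsity(z):
    """Fraction of heads assigned to SWA after binarization (Eq. 4)."""
    flat = [v for layer in z for v in layer]
    return 1.0 - sum(flat) / len(flat)


if __name__ == "__main__":
    learned = [[-3.1, 4.2], [2.0, -0.5]]   # illustrative values, not from the paper
    z = binarize(learned)
    print(z, realized_sparsity(z))
```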

## 5 Experiments

### 5.1 Experimental Setup

| Method | MMLU | LogiQA-EN | LogiQA-CN | CSQA | PIQA | SIQA |
| --- | --- | --- | --- | --- | --- | --- |
| Dense FA | 45.51 | 34.31 | 33.23 | 50.04 | 56.91 | 54.86 |
| ConSA (*head-wise, single-layer*) | 45.76 | 36.92 | 34.92 | 52.09 | 61.32 | 53.94 |
| ConSA (*head-wise, all-layers*) | 45.55 | 35.08 | 33.08 | 51.27 | 57.78 | 54.40 |
| Rule (*head-wise*) | 45.15 | 34.62 | 32.46 | 49.22 | 58.76 | 53.28 |
| ConSA (*layer-wise*) | 45.45 | 32.92 | 31.54 | 52.99 | 56.47 | 53.89 |
| Rule (*layer-wise*) | 44.03 | 31.71 | 31.02 | 51.43 | 56.12 | 52.30 |

| Method | ARC-C | Hella | ARC-E | WebQA-CN | CN-GEN | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Dense FA | 51.02 | 36.35 | 69.91 | 54.58 | 39.38 | 47.83 |
| ConSA (*head-wise, single-layer*) | 51.71 | 37.93 | 71.00 | 57.15 | 36.89 | 49.06 |
| ConSA (*head-wise, all-layers*) | 52.05 | 34.27 | 71.21 | 56.86 | 37.88 | 48.13 |
| Rule (*head-wise*) | 51.19 | 34.61 | 69.11 | 56.16 | 35.87 | 47.31 |
| ConSA (*layer-wise*) | 51.79 | 36.98 | 71.30 | 55.29 | 36.51 | 47.74 |
| Rule (*layer-wise*) | 50.43 | 34.31 | 67.80 | 55.41 | 36.39 | 46.45 |

Table 1: Comparison of head-wise and layer-wise FA/SWA allocation on 1.7B at target sparsity $\rho = 0.50$. ConSA and the rule-based baselines are matched at the same $\rho$, while FA ($\rho = 0$) serves as the dense reference. The best results are in bold, and the second-best results are underlined.

#### Training Pipeline.

We train the model in two stages. *Stage 1 (Mask Learning, 1B tokens)* jointly optimizes the model parameters $\theta$, mask parameters $\alpha$, and Lagrange multipliers $\lambda$ and $\phi$, starting from a pre-trained checkpoint. In this stage, masks are sampled from the hard concrete distribution with a constraint that drives the expected sparsity $\mathbb{E}[\hat{\rho}(z)]$ toward the target $\rho$. In *Stage 2 (Continued Pre-training, 100B tokens)*, we binarize the masks via $z_{l,i} = \mathbb{1}[\alpha_{l,i} > 0]$ and continue training on the resulting fixed FA/SWA assignments to let the weights adapt to the new configuration (see Appendix [B](https://arxiv.org/html/2606.18056#A2) for further details).

#### Models\.

We pre\-train two dense Transformer LLMs from scratch to evaluate ConSA: a 0\.6B\-parameter model and a 1\.7B\-parameter model\. Both adopt a standard GQA architecture with 28 layers, 16 query heads, and 8 KV heads per layer, differing only in the hidden dimension \(Table[2](https://arxiv.org/html/2606.18056#A2.T2)\)\. We run the main downstream evaluation on 1\.7B \(Table[1](https://arxiv.org/html/2606.18056#S5.T1)\) and use 0\.6B for ablation studies and pattern visualization, since its smaller size lets us sweep over more sparsity levels and granularities at a lower computational cost\.

#### Baselines\.

We compare the six configurations listed in Table [1](https://arxiv.org/html/2606.18056#S5.T1), all evaluated under matched continued pre-training: 1) Dense FA, the full-attention reference with $\rho = 0$ in which every KV head performs full attention; 2) ConSA (*head-wise, single-layer*), the head-wise variant of ConSA trained under a per-layer sparsity constraint that requires each layer to independently satisfy $1 - \frac{1}{H_{\mathrm{KV}}} \sum_{i} z_{l,i} = \rho$; 3) ConSA (*head-wise, all-layers*), the head-wise variant trained under the global constraint in Eq. [4](https://arxiv.org/html/2606.18056#S4.E4), where $\rho$ is imposed only on the full pool of $L \cdot H_{\mathrm{KV}}$ KV heads, so that the optimizer can distribute the SWA budget unevenly across layers; 4) Rule (*head-wise*), a static head-wise pattern with a hand-crafted SWA/FA assignment within each layer at the target sparsity $\rho$; 5) ConSA (*layer-wise*), the layer-wise variant of ConSA trained under the constraint in Eq. [5](https://arxiv.org/html/2606.18056#S4.E5), in which all KV heads within a layer share a single allocation $z_l$; 6) Rule (*layer-wise*), a static layer-wise interleaving in the style of Mistral Jiang et al. ([2023](https://arxiv.org/html/2606.18056#bib.bib1)) and Gemma 2 Team ([2024](https://arxiv.org/html/2606.18056#bib.bib2)), where SWA and FA layers alternate at the same $\rho$.
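The difference between the per-layer constraint of configuration 2 and the global constraint of configuration 3 can be checked mechanically, as in the sketch below. The alternating pattern is a hypothetical stand-in for a rule-based layer-wise baseline: which layer type comes first is an assumption, since the exact hand-crafted pattern is not given in this section.

```python
def satisfies_global(z, rho, tol=1e-9):
    """Global constraint: the SWA fraction over all L * H_KV heads equals rho."""
    flat = [v for layer in z for v in layer]
    return abs((1.0 - sum(flat) / len(flat)) - rho) <= tol


def satisfies_per_layer(z, rho, tol=1e-9):
    """Per-layer constraint: every layer has SWA fraction rho on its own."""
    return all(abs((1.0 - sum(layer) / len(layer)) - rho) <= tol for layer in z)


def alternating_layers(num_layers, first_is_swa=True):
    """Hypothetical rule-based layer mask (1 = FA, 0 = SWA) that alternates types."""
    start = 0 if first_is_swa else 1
    return [(start + l) % 2 for l in range(num_layers)]


if __name__ == "__main__":
    uneven = [[0, 0], [1, 1]]      # all SWA in layer 0, all FA in layer 1
    even = [[1, 0], [0, 1]]        # one SWA head in every layer
    print(satisfies_global(uneven, 0.5), satisfies_per_layer(uneven, 0.5))
    print(satisfies_global(even, 0.5), satisfies_per_layer(even, 0.5))
    print(alternating_layers(6))
```

Any mask that satisfies the per-layer constraint also satisfies the global one, but not the reverse, which is why the all-layers variant searches a strictly larger set of allocations.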

#### Evaluation\.

We evaluate our models on a range of English and Chinese benchmarks covering knowledge and reasoning. General knowledge is measured via MMLU Hendrycks et al. ([2021](https://arxiv.org/html/2606.18056#bib.bib19)), while logical reasoning is assessed using LogiQA-EN and LogiQA-CN Liu et al. ([2020](https://arxiv.org/html/2606.18056#bib.bib20)). For commonsense reasoning, we include CommonsenseQA (CSQA) Talmor et al. ([2019](https://arxiv.org/html/2606.18056#bib.bib21)), PIQA Bisk et al. ([2020](https://arxiv.org/html/2606.18056#bib.bib18)), and SocialIQA (SIQA) Sap et al. ([2019](https://arxiv.org/html/2606.18056#bib.bib22)). Scientific and contextual reasoning are tested using ARC-Challenge (ARC-C), ARC-Easy (ARC-E) Clark et al. ([2018](https://arxiv.org/html/2606.18056#bib.bib23)), and HellaSwag (Hella) Zellers et al. ([2019](https://arxiv.org/html/2606.18056#bib.bib24)), alongside open-domain question answering with WebQA-CN Li et al. ([2016](https://arxiv.org/html/2606.18056#bib.bib25)). The evaluation also covers two Chinese generation tasks (CN-GEN): scientific summarization from CSL Li et al. ([2022](https://arxiv.org/html/2606.18056#bib.bib26)) and story generation from LOT Guan et al. ([2022](https://arxiv.org/html/2606.18056#bib.bib27)).

### 5.2 Main Results

#### Learned Allocation Outperforms Rule Allocation\.

Table [1](https://arxiv.org/html/2606.18056#S5.T1) compares ConSA against rule-based baselines on 1.7B at $\rho = 0.50$. Under head-wise allocation, the single-layer variant of ConSA yields a 3.7% relative improvement in average accuracy over Rule (*head-wise*), with consistent gains across all eleven benchmarks. Under layer-wise allocation, ConSA likewise outperforms Rule (*layer-wise*) by a 2.8% relative margin on average. The consistent advantage observed under both granularities suggests that the allocation learned by ConSA captures FA/SWA configurations that hand-crafted interleaving does not reach.

#### Head\-wise ConSA Matches or Exceeds Dense FA\.

Although ConSA operates at $\rho = 0.50$ while Dense FA uses no sparsity ($\rho = 0$), the head-wise (single-layer) variant surpasses Dense FA by a 2.6% relative margin on average. The main exception is CN-GEN, where performance drops, plausibly because of long-range dependencies that fall outside the SWA window. On the remaining benchmarks, sequence lengths generally fall within the SWA window, so the selected heads attend to the same context as FA heads at test time. This suggests that the performance gap originates from training, where local attention on these heads may act as an implicit regularizer that encourages more focused attention patterns in the learned weights.
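The relative margins quoted in this subsection follow directly from the Average column of Table 1. The snippet below reproduces the arithmetic; the dictionary keys simply mirror the table's row labels.

```python
# Average column of Table 1 (1.7B, rho = 0.50), copied from the table.
AVERAGE = {
    "Dense FA": 47.83,
    "ConSA (head-wise, single-layer)": 49.06,
    "ConSA (head-wise, all-layers)": 48.13,
    "Rule (head-wise)": 47.31,
    "ConSA (layer-wise)": 47.74,
    "Rule (layer-wise)": 46.45,
}


def relative_gain(method, baseline):
    """Relative improvement of `method` over `baseline`, in percent."""
    new, base = AVERAGE[method], AVERAGE[baseline]
    return 100.0 * (new - base) / base


if __name__ == "__main__":
    pairs = [
        ("ConSA (head-wise, single-layer)", "Rule (head-wise)"),
        ("ConSA (layer-wise)", "Rule (layer-wise)"),
        ("ConSA (head-wise, single-layer)", "Dense FA"),
    ]
    for method, baseline in pairs:
        print(f"{method} vs {baseline}: {relative_gain(method, baseline):.1f}%")
```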

#### Head\-wise Granularity Drives Most of the Improvement\.

Both head-wise variants of ConSA outperform the layer-wise variant despite sharing the same training framework and target $\rho$, indicating that the granularity of allocation is a key factor. Among the two head-wise variants, the single-layer variant outperforms the all-layers variant. We attribute this to the difference in constraint structure: the all-layers variant imposes $\rho$ as a single global constraint over all $L \times H_{\mathrm{KV}}$ heads, resulting in a substantially larger search space that makes the Lagrangian constraint harder to satisfy during mask learning. As shown in Figures [2](https://arxiv.org/html/2606.18056#S5.F2) and [8(b)](https://arxiv.org/html/2606.18056#A2.F8.sf2), the sparsity loss of the all-layers variant converges more slowly than that of the single-layer variant, which enforces $\rho$ independently at each layer. The per-layer constraint, therefore, acts as a structural prior that regularizes the optimization and yields a more effective allocation.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x2.png)

Figure 2: Convergence of the Lagrangian constraint loss during Stage 1 mask learning on 1.7B at $\rho = 0.50$ under three allocation granularities.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x3.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x4.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x5.png)

Figure 3: Training loss trajectories on 0.6B under layer-wise allocation at $\rho \in \{0.25, 0.50, 0.75\}$. Each panel compares ConSA, the ablation variant (w/o L0-Lagrangian), and Dense FA over 20B tokens of continued pre-training. Dashed circles highlight the final convergence region where the relative ordering of ConSA and Dense FA shifts across sparsity levels.

### 5.3 Convergence of the Lagrangian Constraint

To verify that the learned masks meet the target sparsity, we monitor the Lagrangian constraint loss, $\mathcal{L}_{\mathrm{Lagrange}} = \lambda \cdot (\mathbb{E}[\hat{\rho}(z)] - \rho) + \phi \cdot (\mathbb{E}[\hat{\rho}(z)] - \rho)^{2}$, during the 1B-token Stage 1 mask-learning phase. Figure [2](https://arxiv.org/html/2606.18056#S5.F2) reports the trajectory of $\mathcal{L}_{\mathrm{Lagrange}}$ for the 1.7B model at $\rho = 0.50$ across the three allocation granularities.

#### All Granularities Converge within the 1B\-Token Budget\.

Under the min-max formulation of the augmented Lagrangian, the Lagrange multipliers and mask parameters compete before reaching equilibrium. As a result, the constraint loss does not decrease monotonically but instead oscillates, converging to zero only when the constraint $\hat{\rho}(z) = \rho$ is satisfied. Despite these transient oscillations, all three configurations drive the constraint loss to near zero within 1,000 training steps, confirming that the target sparsity can be reliably achieved within the 1B-token mask-learning budget. The effect of granularity on convergence speed and its connection to downstream performance are analyzed in Appendix [D.1](https://arxiv.org/html/2606.18056#A4.SS1).
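The oscillatory behavior described above can be reproduced in a toy setting. The sketch below is an illustration under strong assumptions, not the training procedure: the expected sparsity is reduced to a single scalar `rho_hat` that is updated by gradient descent, there is no language-modeling loss, and the step sizes and the starting point are hypothetical. Only the loss form and the use of gradient ascent on the multipliers $\lambda$ and $\phi$, both initialized to zero, follow the text.

```python
def lagrange_loss(rho_hat, rho, lam, phi):
    """Augmented Lagrangian penalty: lam * d + phi * d**2 with d = rho_hat - rho."""
    d = rho_hat - rho
    return lam * d + phi * d * d

def simulate(rho=0.5, rho_hat=0.0, steps=5000,
             lr_mask=0.1, lr_lam=0.1, lr_phi=0.1):
    """Toy min-max dynamics on a scalar stand-in for the expected sparsity."""
    lam, phi = 0.0, 0.0  # multipliers start at zero
    gaps, losses = [], []
    for _ in range(steps):
        d = rho_hat - rho
        gaps.append(d)
        losses.append(lagrange_loss(rho_hat, rho, lam, phi))
        # Descent on the mask side: dL/d(rho_hat) = lam + 2 * phi * d.
        grad = lam + 2.0 * phi * d
        rho_hat -= lr_mask * grad
        # Ascent on the multipliers: dL/dlam = d and dL/dphi = d**2.
        lam += lr_lam * d
        phi += lr_phi * d * d
    return gaps, losses

gaps, losses = simulate()
# Count how often the sparsity gap changes sign, i.e. overshoots the target.
sign_flips = sum(1 for a, b in zip(gaps, gaps[1:]) if a * b < 0)
print(f"final gap = {gaps[-1]:.2e}, sign flips = {sign_flips}")
```

In this toy system the linear term alone would keep the gap circling around the target; the quadratic term, whose weight $\phi$ only grows while the constraint is violated, supplies the damping that lets the gap settle.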

### 5\.4Ablation Study

#### Setup\.

To isolate the contribution of the L0-Lagrangian formulation in ConSA, we construct an ablation variant that removes it and instead adopts a calibration-based strategy analogous to those used in LoZA Zhang et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib3)) and DuoAttention Xiao et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib15)). Each attention unit is equipped with an unconstrained scalar gate $\alpha_{i}$, initialized to 1.0 and clamped to $[0, 1]$ during training. The attention output at layer $l$, head $i$ is computed as:

$$\hat{\mathbf{O}}_{l,i} = \alpha_{l,i} \cdot \mathbf{O}_{l,i}^{\mathrm{FA}} + (1 - \alpha_{l,i}) \cdot \mathbf{O}_{l,i}^{\mathrm{SWA}}. \tag{12}$$

The training objective is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \lambda \cdot \mathcal{L}_{\mathrm{L1}}, \qquad \mathcal{L}_{\mathrm{L1}} = \frac{1}{N} \sum_{i} |\alpha_{i}|, \tag{13}$$

where $\lambda = 0.05$ following DuoAttention. Since a higher $\alpha_{l,i}$ indicates a stronger preference for full attention, the final FA/SWA assignment is obtained by sorting all gates in ascending order and assigning the bottom-$\rho$ fraction to SWA, following the ranking-based selection of LoZA. We compare ConSA with this ablation and the Dense FA baseline on 0.6B under layer-wise allocation at $\rho \in \{0.25, 0.50, 0.75\}$. Both ConSA and the ablation train masks on 1B tokens, followed by 20B tokens of continued pre-training under the resulting fixed configuration. We report the loss trajectory over the 20B-token stage in Figure [3](https://arxiv.org/html/2606.18056#S5.F3). The detailed training setup is provided in Appendix [C](https://arxiv.org/html/2606.18056#A3).
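The ablation pipeline (gated mixing, L1 penalty, then ranking-based selection) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, attention outputs are represented as flat lists of floats, and rounding $\rho N$ to the nearest integer is an assumption about how the bottom-$\rho$ fraction is discretized.

```python
def mix_outputs(alpha, o_fa, o_swa):
    """Eq. (12): convex combination of the FA and SWA outputs of one unit."""
    return [alpha * f + (1.0 - alpha) * s for f, s in zip(o_fa, o_swa)]

def l1_penalty(alphas):
    """Eq. (13): mean absolute gate value, added to the LM loss with weight 0.05."""
    return sum(abs(a) for a in alphas) / len(alphas)

def assign_by_ranking(alphas, rho):
    """Post-hoc selection: the bottom-rho fraction of gates becomes SWA."""
    n_swa = round(rho * len(alphas))  # assumed rounding rule
    order = sorted(range(len(alphas)), key=lambda i: alphas[i])  # ascending
    swa = set(order[:n_swa])
    return ["SWA" if i in swa else "FA" for i in range(len(alphas))]

# Hypothetical learned gates for four attention units.
gates = [0.9, 0.1, 0.6, 0.3]
assignment = assign_by_ranking(gates, rho=0.5)
print(assignment)
```

Note that the target $\rho$ enters only in the final ranking step; nothing during gate training ties the gate values to it, which is the difference from the constrained formulation that the ablation is designed to isolate.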

#### ConSA vs\. Ablation Variant\.

As shown in Figure [3](https://arxiv.org/html/2606.18056#S5.F3), ConSA achieves a lower final loss than the ablation variant at every sparsity level, with the gap emerging early in training and persisting throughout. By binding the mask distribution directly to the target $\rho$ during optimization, the L0-Lagrangian formulation steers the mask toward configurations that post-hoc selection based on unconstrained scalar weights cannot reach. The gap between ConSA and the ablation widens as $\rho$ increases from 0.25 to 0.75, suggesting that the Lagrangian constraint becomes increasingly important at higher sparsity levels where the optimization landscape grows more challenging.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x6.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x7.png)

Figure 4: Learned layer-wise FA/SWA allocation at $\rho = 0.50$. Each cell indicates whether a layer uses FA (red) or SWA (blue).

![Refer to caption](https://arxiv.org/html/2606.18056v1/x8.png)

Figure 5: Per-layer SWA head ratios under head-wise allocation across four configurations. The bold curves show moving averages, computed from the raw per-layer ratios shown by the faded curves.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x9.png) (a) Uniform
![Refer to caption](https://arxiv.org/html/2606.18056v1/x10.png) (b) Weakly local
![Refer to caption](https://arxiv.org/html/2606.18056v1/x11.png) (c) Strongly local
![Refer to caption](https://arxiv.org/html/2606.18056v1/x12.png) (d) Sparse broad
![Refer to caption](https://arxiv.org/html/2606.18056v1/x13.png) (e) Dense broad
![Refer to caption](https://arxiv.org/html/2606.18056v1/x14.png) (f) Dense broad

Figure 6: Last-token attention distribution across representative layers of the 0.6B model, spanning the spectrum from uniform to dense broad attention.

### 5\.5Analysis of Learned Allocation Patterns

We analyze the learned FA/SWA allocation patterns across model scales, sparsity levels, and granularities to understand what configurations emerge from the L0\-Lagrangian optimization\.

#### Layer\-wise Allocation Patterns\.

In contrast to rule-based approaches that interleave FA and SWA at a fixed ratio, the learned masks concentrate FA into contiguous middle-layer blocks, suggesting that adjacent FA layers working in concert are more beneficial than FA layers evenly spread across the network (Figure [4](https://arxiv.org/html/2606.18056#S5.F4)). Notably, the first layer is consistently assigned to SWA across all configurations, contradicting the design choice in several rule-based methods that designate the first layer as FA to capture global context early. The concentration of FA in the middle layers is consistent with prior analyses showing that intermediate layers play a central role in semantic integration and reasoning Clark et al. ([2019](https://arxiv.org/html/2606.18056#bib.bib5)); Voita et al. ([2019](https://arxiv.org/html/2606.18056#bib.bib4)); Chen et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib30)), which may require a broader attention scope. This structure is shared across both model scales (Figure [4](https://arxiv.org/html/2606.18056#S5.F4)) and sparsity levels (Figure [9](https://arxiv.org/html/2606.18056#A2.F9)), though the FA block shifts to slightly earlier layers in 1.7B.

#### Head\-wise Allocation Patterns\.

Under head-wise allocation, the per-layer SWA head ratio follows an approximate W-shaped trend, with peaks at the bottom and top layers and dips in the middle (Figure [5](https://arxiv.org/html/2606.18056#S5.F5)). This trend is shared across model scales and the higher sparsity levels ($\rho \in \{0.50, 0.75\}$), while at $\rho = 0.25$ the overall sparsity budget is too low for the top-layer recovery to manifest clearly. The consistent dip in the middle layers aligns with the layer-wise observation that this region serves as the primary site for full attention. Full per-head allocation heatmaps are provided in Appendix [D.2](https://arxiv.org/html/2606.18056#A4.SS2).
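The per-layer statistic plotted in Figure 5 can be computed from a binarized head-wise mask as sketched below. The mask values are made up for illustration, the convention that 1 marks an SWA head is an assumption, and the centered moving-average window of 3 is a hypothetical choice, since the window used for the bold curves is not stated here.

```python
def swa_ratio_per_layer(mask):
    """Fraction of KV heads assigned to SWA in each layer (mask[l][h] == 1 is SWA)."""
    return [sum(row) / len(row) for row in mask]

def moving_average(values, k):
    """Centered moving average with window k, truncated at the boundaries."""
    half = k // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

# Made-up 4-layer, 4-head mask: heavy SWA at the bottom, a dip in the middle.
mask = [
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
ratios = swa_ratio_per_layer(mask)
smoothed = moving_average(ratios, k=3)
```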

### 5\.6Analysis of Attention Behavior

Across sparsity levels, ConSA’s learned allocation reveals that different layers exhibit distinct preferences for FA or SWA \(Figure[9](https://arxiv.org/html/2606.18056#A2.F9)\)\. To examine their intrinsic attention behaviors, we select six representative layers and visualize their last\-token attention distribution, using the pre\-trained checkpoint before Stage 1 mask learning\. For each layer, we average the attention scores of all heads into a layer\-wise distribution\.
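The head-averaging step described above is simple enough to sketch directly. In the snippet below, each head contributes its last-token attention row, and the rows are averaged position by position. The `tail_mass` summary is a hypothetical locality measure added for illustration (the share of attention falling on the most recent positions); it is not a metric reported in the paper, and the attention values are made up.

```python
def layer_distribution(head_rows):
    """Average the last-token attention rows of all heads, position by position."""
    n_heads = len(head_rows)
    length = len(head_rows[0])
    return [sum(row[j] for row in head_rows) / n_heads for j in range(length)]

def tail_mass(dist, window):
    """Hypothetical locality score: share of attention on the last `window` positions."""
    return sum(dist[-window:]) / sum(dist)

# Two made-up heads over a 4-token context: one uniform, one local.
heads = [
    [0.25, 0.25, 0.25, 0.25],
    [0.0, 0.0, 0.5, 0.5],
]
dist = layer_distribution(heads)
locality = tail_mass(dist, window=2)
```

Under this measure, a layer whose score stays close to 1 for a window matching the SWA span would lose little by switching to SWA, whereas a layer with substantial mass outside the window would not.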

#### Diverse Attention Spike Ranges\.

Figure [6](https://arxiv.org/html/2606.18056#S5.F6) reveals that these layers exhibit a fine-grained spectrum of attention patterns beyond the retrieval-versus-streaming dichotomy described in prior work Xiao et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib15)). Among the layers assigned to SWA across all sparsity levels (L1, L4, L27), L1 displays a *uniform* pattern (Figure [6(a)](https://arxiv.org/html/2606.18056#S5.F6.sf1)) with attention spread nearly evenly and at low magnitude across the full context; L4 is *weakly local* (Figure [6(b)](https://arxiv.org/html/2606.18056#S5.F6.sf2)), mostly uniform but with a mild rise in the last few hundred positions; and L27 is *strongly local* (Figure [6(c)](https://arxiv.org/html/2606.18056#S5.F6.sf3)), with attention sharply concentrated near the tail. The layers assigned to FA across all sparsity levels, L16 and L22, show a *dense broad* attention pattern (Figures [6(e)](https://arxiv.org/html/2606.18056#S5.F6.sf5), [6(f)](https://arxiv.org/html/2606.18056#S5.F6.sf6)), with high-magnitude spikes spanning a wide range of the sequence. The switching layer L25, which transitions from FA at $\rho = 0.25$ to SWA at $\rho \in \{0.50, 0.75\}$, shows a *sparse broad* attention pattern (Figure [6(d)](https://arxiv.org/html/2606.18056#S5.F6.sf4)): spikes appear at distant positions but are confined to a small number of localized regions, placing it between the local and dense broad extremes. This spectrum generally aligns with ConSA's allocation: layers with narrower attention spike ranges tend to be assigned SWA first as the sparsity budget grows. Further analysis is provided in Appendix [D.4](https://arxiv.org/html/2606.18056#A4.SS4).

## 6Conclusion

In this paper, we introduced ConSA, a principled framework for learning hybrid FA/SWA attention configurations with controllable sparsity\. By formulating FA/SWA allocation as a Lagrangian\-constrained L0 optimization problem, ConSA learns binary masks at both layer\-wise and KV\-head\-wise granularities that meet user\-specified sparsity targets\. Analysis of the learned patterns reveals a counterintuitive yet architecture\-consistent principle: bottom layers are predominantly assigned SWA, whereas FA concentrates in middle layers\. This challenges prevailing intuitions about the necessity of global attention and offers practical utility and interpretive insights for the community\.

## Limitations

First, the SWA window size is fixed throughout all experiments; jointly optimizing the window size and the FA/SWA allocation may yield further gains\. Second, although the cross\-scale consistency of the identified patterns is promising, a more comprehensive evaluation across a broader spectrum of model scales and architectural designs is required to verify the generalizability of the proposed allocation principle\.

## Ethics Statement

Our work examines architectural modifications to large language models, using publicly available pre\-training corpora and evaluation benchmarks, all accessed under their respective open licenses for academic research use\. No personal or sensitive data is used during training or evaluation\. The proposed method does not introduce new deployment risks beyond those commonly associated with language models\.

## References

- Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li (2024). LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L. Ku, A. Martins, and V. Srikumar (Eds.), pp. 3119–3137. [Link](https://doi.org/10.18653/v1/2024.acl-long.172).
- I. Beltagy, M. E. Peters, and A. Cohan (2020). Longformer: the long-document transformer. CoRR abs/2004.05150. [Link](https://arxiv.org/abs/2004.05150).
- Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi (2020). PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 7432–7439. [Link](https://doi.org/10.1609/aaai.v34i05.6239).
- Y. Chen, J. Sheng, W. Zhang, and T. Liu (2025). Improving reasoning capabilities in small models through mixture-of-layers distillation with stepwise attention on key information. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 4952–4971. ISBN 979-8-89176-332-6. [Link](https://aclanthology.org/2025.emnlp-main.250/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.250).
- R. Child, S. Gray, A. Radford, and I. Sutskever (2019). Generating long sequences with sparse transformers. CoRR abs/1904.10509. [Link](http://arxiv.org/abs/1904.10509).
- K. Clark, U. Khandelwal, O. Levy, and C. D. Manning (2019). What does BERT look at? an analysis of bert's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@ACL 2019, Florence, Italy, August 1, 2019, T. Linzen, G. Chrupala, Y. Belinkov, and D. Hupkes (Eds.), pp. 276–286. [Link](https://doi.org/10.18653/v1/W19-4828).
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018). Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR abs/1803.05457. [Link](http://arxiv.org/abs/1803.05457).
- A. Gu and T. Dao (2023). Mamba: linear-time sequence modeling with selective state spaces. CoRR abs/2312.00752. [Link](https://doi.org/10.48550/arXiv.2312.00752).
- J. Guan, Z. Feng, Y. Chen, R. He, X. Mao, C. Fan, and M. Huang (2022). LOT: A story-centric benchmark for evaluating chinese long text understanding and generation. Trans. Assoc. Comput. Linguistics 10, pp. 434–451. [Link](https://doi.org/10.1162/tacl_a_00469).
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021). Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. [Link](https://openreview.net/forum?id=d7KBjmI3GmQ).
- L. Huang, S. Cao, N. N. Parulian, H. Ji, and L. Wang (2021). Efficient attentions for long document summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou (Eds.), pp. 1419–1436. [Link](https://doi.org/10.18653/v1/2021.naacl-main.112).
- A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed (2023). Mistral 7b. CoRR abs/2310.06825. [Link](https://doi.org/10.48550/arXiv.2310.06825).
- A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret (2020). Transformers are rnns: fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, pp. 5156–5165. [Link](http://proceedings.mlr.press/v119/katharopoulos20a.html).
- N. Kitaev, L. Kaiser, and A. Levskaya (2020). Reformer: the efficient transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. [Link](https://openreview.net/forum?id=rkgNKkHtvB).
- T. Kociský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette (2018). The narrativeqa reading comprehension challenge. Trans. Assoc. Comput. Linguistics 6, pp. 317–328. [Link](https://doi.org/10.1162/tacl_a_00023).
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica (2023). Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023, J. Flinn, M. I. Seltzer, P. Druschel, A. Kaufmann, and J. Mace (Eds.), pp. 611–626. [Link](https://doi.org/10.1145/3600006.3613165).
- B. Lenz, O. Lieber, A. Arazi, A. Bergman, A. Manevich, B. Peleg, B. Aviram, C. Almagor, C. Fridman, D. Padnos, D. Gissin, D. Jannai, D. Muhlgay, D. Zimberg, E. M. Gerber, E. Dolev, E. Krakovsky, E. Safahi, E. Schwartz, G. Cohen, et al. (2025). Jamba: hybrid transformer-mamba language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. [Link](https://openreview.net/forum?id=JFPaD7lpBD).
- P. Li, W. Li, Z. He, X. Wang, Y. Cao, J. Zhou, and W. Xu (2016). Dataset and neural recurrent sequence labeling model for open-domain factoid question answering. arXiv:1607.06275. [Link](https://arxiv.org/abs/1607.06275).
- Y. Li, Y. Zhang, Z. Zhao, L. Shen, W. Liu, W. Mao, and H. Zhang (2022). CSL: A large-scale chinese scientific literature dataset. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, N. Calzolari, C. Huang, H. Kim, J. Pustejovsky, L. Wanner, K. Choi, P. Ryu, H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, and S. Na (Eds.), pp. 3917–3923. [Link](https://aclanthology.org/2022.coling-1.344).
- J. Liu, L. Cui, H. Liu, D. Huang, Y. Wang, and Y. Zhang (2020). LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, C. Bessiere (Ed.), pp. 3622–3628. [Link](https://doi.org/10.24963/ijcai.2020/501).
- N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024). Lost in the middle: how language models use long contexts. Trans. Assoc. Comput. Linguistics 12, pp. 157–173. [Link](https://doi.org/10.1162/tacl_a_00638).
- C. Louizos, M. Welling, and D. P. Kingma (2018). Learning sparse neural networks through $L_0$ regularization. In International Conference on Learning Representations.
- A. Roy, M. Saffar, A. Vaswani, and D. Grangier (2021). Efficient content-based sparse attention with routing transformers. Trans. Assoc. Comput. Linguistics 9, pp. 53–68. [Link](https://doi.org/10.1162/tacl_a_00353).
- M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi (2019). Social iqa: commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 4462–4472. [Link](https://doi.org/10.18653/v1/D19-1454).
- A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019). CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4149–4158. [Link](https://doi.org/10.18653/v1/n19-1421).
- G. Team (2024). Gemma 2: improving open language models at a practical size. CoRR abs/2408.00118. [Link](https://doi.org/10.48550/arXiv.2408.00118).
- E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov (2019). Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 5797–5808. [Link](https://doi.org/10.18653/v1/p19-1580).
- M. Weber, D. Y. Fu, Q. Anthony, Y. Oren, S. Adams, A. Alexandrov, X. Lyu, H. Nguyen, X. Yao, V. Adams, B. Athiwaratkun, R. Chalamala, K. Chen, M. Ryabinin, T. Dao, P. Liang, C. Ré, I. Rish, and C. Zhang (2024). RedPajama: an open dataset for training large language models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang (Eds.). [Link](http://papers.nips.cc/paper_files/paper/2024/hash/d34497330b1fd6530f7afd86d0df9f76-Abstract-Datasets_and_Benchmarks_Track.html).
- M. Xia, T. Gao, Z. Zeng, and D. Chen (2024). Sheared llama: accelerating language model pre-training via structured pruning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. [Link](https://openreview.net/forum?id=09iOdaeOzp).
- C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang, Z. Liu, and M. Sun (2024). InfLLM: training-free long-context extrapolation for llms with an efficient context memory. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang (Eds.). [Link](http://papers.nips.cc/paper_files/paper/2024/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html).
- G. Xiao, J. Tang, J. Zuo, J. Guo, S. Yang, H. Tang, Y. Fu, and S. Han (2025). DuoAttention: efficient long-context LLM inference with retrieval and streaming heads. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. [Link](https://openreview.net/forum?id=cFu7ze7xUm).
- L. Xiaomi (2026). MiMo-v2-flash technical report. CoRR abs/2601.02780. [Link](https://doi.org/10.48550/arXiv.2601.02780).
- Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), pp. 2369–2380. [Link](https://doi.org/10.18653/v1/d18-1259).
- Y. Yu, Z. Dai, Z. Wang, W. Wang, R. Chen, and J. Pei (2025). OpenCSG chinese corpus: A series of high-quality chinese datasets for LLM training. CoRR abs/2501.08197. [Link](https://doi.org/10.48550/arXiv.2501.08197).
- M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed (2020). Big bird: transformers for longer sequences. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.). [Link](https://proceedings.neurips.cc/paper/2020/hash/c8512d142a2d849725f31a9a7a361ab9-Abstract.html).
- R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019). HellaSwag: can a machine really finish your sentence? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 4791–4800. [Link](https://doi.org/10.18653/v1/p19-1472).
- C. Zhang, Y. Bai, J. Li, A. Gui, K. Wang, F. Liu, G. Wu, Y. Jiang, D. Bu, L. Wei, H. Jing, H. Tang, X. Chen, X. Huang, F. Li, R. Weng, Y. Qian, Y. Lu, Y. Sun, J. Wang, Y. Xie, and X. Cai (2025). Efficient context scaling with longcat zigzag attention. CoRR abs/2512.23966. [Link](https://doi.org/10.48550/arXiv.2512.23966).
- Y. Zhao, H. Li, B. Wu, J. Yuan, M. Zhang, Y. Yin, L. Shang, and M. Zhang (2026a). Switch attention: towards dynamic and fine-grained hybrid transformers. CoRR abs/2603.26380. [Link](https://doi.org/10.48550/arXiv.2603.26380).
- Y. Zhao, H. Li, B. Wu, J. Yuan, M. Zhang, Y. Yin, L. Shang, and M. Zhang (2026b). Switch attention: towards dynamic and fine-grained hybrid transformers. CoRR abs/2603.26380. [Link](https://doi.org/10.48550/arXiv.2603.26380).

## Appendix AComparison with Existing Approaches

Table [3](https://arxiv.org/html/2606.18056#A2.T3) compares ConSA with rule-based interleaving Jiang et al. ([2023](https://arxiv.org/html/2606.18056#bib.bib1)); Team ([2024](https://arxiv.org/html/2606.18056#bib.bib2)) and LoZA Zhang et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib3)) along several design dimensions. Rule-based methods and LoZA both allocate at the layer level, so all KV heads within a layer share the same attention type; ConSA supports both layer-wise and KV-head-wise granularity under the same framework, and the head-wise variant delivers most of the empirical gain. On sparsity control, LoZA selects layers by ranking scalar weights post-hoc and reports only a single $50\%$ ratio; ConSA instead enforces $\hat{\rho}(z) = \rho$ as a Lagrangian equality constraint, allowing the user to specify any $\rho$ in advance, with convergence verified in Appendix [D.1](https://arxiv.org/html/2606.18056#A4.SS1). On training integration, LoZA freezes model weights during calibration and then mid-trains under the frozen pattern; ConSA jointly optimizes $\theta$ and $\alpha$ during Stage 1 and transitions into continued pre-training under the binarized masks.

Two additional methods are worth noting. SwiAttn Zhao et al. ([2026b](https://arxiv.org/html/2606.18056#bib.bib29)) dynamically routes each token to FA or SWA via per-layer routers, but must retain a unified KV cache because any token may require full attention, which reduces compute but not memory; ConSA's fixed binarized masks ensure that SWA-allocated heads retain only window-sized KV, yielding genuine KV cache reduction. DuoAttention Xiao et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib15)) operates at head granularity and achieves similar memory savings, but relies on synthetic long-range retrieval data and output-deviation minimization with frozen model weights; the realized sparsity is controlled indirectly via L1 regularization and thresholding rather than an explicit target. ConSA aims to find effective hybrid configurations during the pre-training stage using only standard pre-training data, jointly adapting both the allocation masks and model weights in a single optimization framework.
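The memory argument can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: it assumes the 0.6B shape from Table 2 (28 layers, 8 KV heads, head dimension 64), BF16 storage, an 8,192-token context, and a 512-token window as in the Stage-2 setting of Table 4. The helper name `kv_cache_bytes` and the even FA/SWA split are hypothetical and not taken from the paper.

```python
def kv_cache_bytes(num_fa_heads, num_swa_heads, head_dim, seq_len, window, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence.

    FA heads keep every past token; SWA heads keep at most `window` tokens.
    """
    per_token = 2 * head_dim * bytes_per_elem  # one K vector and one V vector per head
    fa_tokens = seq_len
    swa_tokens = min(seq_len, window)
    return per_token * (num_fa_heads * fa_tokens + num_swa_heads * swa_tokens)


# Assumed shape: 28 layers x 8 KV heads, head_dim 64, BF16 (2 bytes per element).
total_heads = 28 * 8
dense = kv_cache_bytes(total_heads, 0, 64, 8192, 512)
hybrid = kv_cache_bytes(total_heads // 2, total_heads // 2, 64, 8192, 512)  # rho = 0.50

print(dense / 2**20)   # dense FA cache in MiB
print(hybrid / 2**20)  # cache in MiB with half of the KV heads on SWA
print(hybrid / dense)  # fraction of the dense cache that remains
```

Under these assumptions the hybrid cache is roughly 53% of the dense one, whereas a token-level router that must keep the full cache for every head stays at 100%.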

## Appendix B Training Details

#### Hyperparameters.

We follow the standard configuration of Louizos et al. ([2018](https://arxiv.org/html/2606.18056#bib.bib28)) and set $\beta=2/3$, $\zeta=1.1$, $\gamma=-0.1$ across all experiments without further tuning. The Lagrange multipliers $\lambda$ and $\phi$ are both initialized to zero and updated by gradient ascent jointly with the model and mask parameters. For Stage 1 mask learning, we use the Adam optimizer ($\beta_1=0.9$, $\beta_2=0.95$, $\epsilon=10^{-8}$) with a cosine learning rate schedule, a peak learning rate of $3\times 10^{-4}$, a minimum learning rate of $3\times 10^{-5}$, and 300 warmup steps. The global batch size is 128 with a maximum sequence length of 8,192. Training runs for 1,000 steps, corresponding to approximately 1B tokens. For Stage 2 continued pre-training, we use the same optimizer with a WSD (Warmup-Stable-Decay) learning rate schedule, a peak learning rate of $3\times 10^{-4}$, and 2,000 warmup steps. The global batch size is increased to 512 with the same sequence length of 8,192. All experiments use mixed-precision training in BF16 with a gradient clipping threshold of 1.0.
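For readers unfamiliar with the hard-concrete gate of Louizos et al. (2018), the following minimal sketch shows how $\beta$, $\zeta$, and $\gamma$ enter the sampling step. It follows the formulas of that paper rather than ConSA's own code; the function names are hypothetical, and which attention type $z=1$ selects is deliberately left unspecified.

```python
import math
import random

BETA, ZETA, GAMMA = 2.0 / 3.0, 1.1, -0.1  # values quoted in the hyperparameter paragraph


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_gate(log_alpha, rng):
    """Draw one stochastic gate z in [0, 1] (training-time behaviour)."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep the logs finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    s_bar = s * (ZETA - GAMMA) + GAMMA            # stretch to (gamma, zeta)
    return min(1.0, max(0.0, s_bar))              # hard clip to [0, 1]


def prob_nonzero(log_alpha):
    """P(z != 0), the differentiable surrogate for the L0 norm of one gate."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def deterministic_gate(log_alpha):
    """Noise-free gate, as commonly used once training is over."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))


rng = random.Random(0)
samples = [sample_gate(0.0, rng) for _ in range(1000)]
print(min(samples), max(samples))   # both endpoints 0.0 and 1.0 carry probability mass
print(prob_nonzero(0.0))
print(deterministic_gate(-10.0), deterministic_gate(0.0), deterministic_gate(10.0))
```

Because the stretched interval extends beyond $[0,1]$ on both sides, the clipped gate can land exactly on 0 or 1, which is what allows a binary mask to be read off after training.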

| Size | L | Hid. | FFN | Q | KV | Dim. | Vocab |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.6B | 28 | 1024 | 3072 | 16 | 8 | 64 | 102400 |
| 1.7B | 28 | 2048 | 6144 | 16 | 8 | 128 | 102400 |

Table 2: Model architecture configurations. L denotes the number of layers; Hid. denotes the hidden size; FFN denotes the intermediate feed-forward dimension; Q/KV denote query and KV head counts; Dim. denotes the per-head dimension.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x15.png)

Figure 7: Learned scalar gates $\alpha_i$ of the ablation variant across 28 layers. Nearly all layers converge to the same value, showing minimal differentiation.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x16.png)
(a) $\rho=0.25$

![Refer to caption](https://arxiv.org/html/2606.18056v1/x17.png)
(b) $\rho=0.50$

![Refer to caption](https://arxiv.org/html/2606.18056v1/x18.png)
(c) $\rho=0.75$

Figure 8: Lagrangian constraint loss on 0.6B at $\rho\in\{0.25,0.50,0.75\}$. The $\rho=0.25$ and $\rho=0.75$ settings use layer-wise and head-wise (all-layers) allocation; $\rho=0.50$ additionally includes head-wise (single-layer). Dashed vertical lines mark the approximate convergence step for the slowest configuration in each panel.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x19.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x20.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x21.png)

Figure 9: Learned layer-wise FA/SWA allocation on 0.6B at $\rho\in\{0.25,0.50,0.75\}$. Each cell indicates whether a layer uses FA (red) or SWA (blue).

| | Rule-based Jiang et al. ([2023](https://arxiv.org/html/2606.18056#bib.bib1)); Team ([2024](https://arxiv.org/html/2606.18056#bib.bib2)) | LoZA Zhang et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib3)) | ConSA |
| --- | --- | --- | --- |
| Allocation method | Hand-crafted | Scalar calibration | L0 + Lagrangian |
| Granularity | Layer | Layer | Layer & KV-head |
| Sparsity control | Fixed by design | Tunable via top-$k$, only 50% reported | Arbitrary target $\rho$ |
| Optimization | N/A | Post-hoc scoring | End-to-end differentiable |
| Pattern analysis | N/A | None | Cross-scale visualization |
| Training integration | Pre-defined | Post-hoc | Joint with pre-training |

Table 3:Comparison of ConSA with existing methods for hybrid attention allocation\.
#### Fair Comparison\.

ConSA uses 1B tokens for mask learning in Stage 1, followed by 100B tokens for continued pre-training in Stage 2, for a total of 101B tokens beyond the initial pre-trained checkpoint. To ensure that the performance gains of ConSA cannot be attributed to this additional training budget, all baselines are trained from the same pre-trained checkpoint for the same 101B tokens across two stages, using the same data mixture, optimizer, learning rate schedule, and batch size as ConSA in each corresponding stage. The training data is drawn from two open-source corpora, RedPajama Weber et al. ([2024](https://arxiv.org/html/2606.18056#bib.bib31)) and Chinese FineWeb Yu et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib32)). All reported results are averaged over multiple evaluation runs, so that every method is compared under identical training and evaluation conditions.

## Appendix C Ablation Study: Setup and Analysis

#### Ablation Variant Design.

The ablation variant (w/o L0-Lagrangian) follows the calibration-based paradigm used by LoZA Zhang et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib3)) and DuoAttention Xiao et al. ([2025](https://arxiv.org/html/2606.18056#bib.bib15)). Since LoZA does not specify its calibration objective in detail and the full DuoAttention setup involves a distillation loss against a dense teacher together with synthetic retrieval data, we adopt a simplified variant for a controlled comparison. Specifically, our ablation uses the standard language modeling loss and does not rely on synthetic data, thereby isolating the effect of removing the L0-Lagrangian formulation while keeping all other factors aligned with ConSA.

#### Training Configuration\.

For all three sparsity configurations, the 0.6B model starts from the same checkpoint pre-trained from scratch on 40B tokens. Both ConSA and the ablation variant train masks on 1B tokens, followed by 20B tokens of continued pre-training under the resulting fixed attention configuration. The 20B-token continued pre-training stage consists of 10,500 training steps. The loss trajectories in Figure [3](https://arxiv.org/html/2606.18056#S5.F3) are displayed starting from step 2,500 to focus on the post-warmup training dynamics.

#### ConSA vs\. Dense FA\.

Comparing ConSA with Dense FA across the three sparsity levels, the relative position of the two loss curves shifts progressively as $\rho$ increases: ConSA trains consistently below Dense FA at $\rho=0.25$, the two trajectories nearly overlap throughout training at $\rho=0.50$, and ConSA settles slightly above Dense FA at $\rho=0.75$, where three quarters of the attention units operate with local attention. This progression shows that the L0-Lagrangian formulation effectively preserves model quality at moderate sparsity levels.

#### Scalar Gates Fail to Differentiate Attention Units\.

Figure [7](https://arxiv.org/html/2606.18056#A2.F7) visualizes the learned scalar gates $\alpha_i$ of the ablation variant across all 28 layers. Unlike the binary masks produced by ConSA, the calibration-based variant yields gate values that are nearly uniform across layers, with most values clustered in a narrow range. This lack of differentiation suggests that learning a single scalar per attention unit does not provide sufficient signal to distinguish layers that benefit from full attention from those that can operate with local attention. We note that DuoAttention achieves more differentiated gate values but relies on a considerably more complex setup involving synthetic retrieval data and a multi-component distillation loss relative to a dense teacher, which introduces additional design choices and computational overhead beyond the allocation mechanism itself.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x22.png)
(a) Layer-wise

![Refer to caption](https://arxiv.org/html/2606.18056v1/x23.png)
(b) Head-wise (all-layers)

Figure 10: Trajectory of expected sparsity $\mathbb{E}[\hat{\rho}(z)]$ during Stage 1 mask learning on 0.6B at $\rho\in\{0.25,0.50,0.75\}$. Dashed horizontal lines indicate the target $\rho$. All configurations initially overshoot to a similar level before settling to their respective targets, with higher $\rho$ requiring less correction and thus converging earlier.

## Appendix D Additional Experimental Results and Analysis

### D.1 Lagrangian Constraint Convergence

We provide the full set of Lagrangian constraint loss trajectories that supplement the convergence analysis in Section [5.3](https://arxiv.org/html/2606.18056#S5.SS3). Figures [8(a)](https://arxiv.org/html/2606.18056#A2.F8.sf1)–[8(c)](https://arxiv.org/html/2606.18056#A2.F8.sf3) present the constraint loss during Stage 1 mask learning on 0.6B at $\rho\in\{0.25,0.50,0.75\}$.

#### Granularity Affects Convergence Speed\.

The head-wise (single-layer) variant achieves the fastest convergence, with the constraint loss remaining close to zero from nearly the beginning of training. This is because the per-layer constraint decomposes the global sparsity target into independent sub-problems, each involving only $H_{\mathrm{KV}}$ variables. In contrast, the layer-wise variant and the head-wise (all-layers) variant both exhibit larger oscillations before stabilizing, with the latter showing the widest amplitude and the slowest convergence. This phenomenon can be attributed to the enlarged search space of the global constraint, in which a single $\rho$ target is distributed across all $L\times H_{\mathrm{KV}}$ heads simultaneously. The slower convergence of the head-wise (all-layers) variant is consistent with the lower downstream performance reported in Table [1](https://arxiv.org/html/2606.18056#S5.T1).

#### Granularity Effects on 0\.6B\.

At $\rho=0.50$, the head-wise (single-layer) variant again converges the fastest among all three granularities, reproducing the pattern observed on 1.7B (Figure [2](https://arxiv.org/html/2606.18056#S5.F2)). The 0.6B model exhibits a convergence profile highly similar to that of 1.7B, indicating that the optimization dynamics of the augmented Lagrangian are robust to model scale.

#### Convergence across Sparsity Levels on 0\.6B\.

At all three sparsity levels, the constraint loss converges to near zero within 1,000 steps, confirming that ConSA can precisely target arbitrary sparsity ratios. The oscillation amplitude during the transient phase increases with $\rho$: the peak amplitude at $\rho=0.75$ is roughly twice that at $\rho=0.25$, reflecting the more aggressive redistribution required at higher sparsity. Interestingly, convergence occurs earlier at higher $\rho$ despite the larger oscillations. This behavior is apparent in the trajectory of the expected sparsity $\mathbb{E}[\hat{\rho}(z)]$ (Figure [10](https://arxiv.org/html/2606.18056#A3.F10)). During the early stages of training, the multipliers force $\mathbb{E}[\hat{\rho}(z)]$ to overshoot to a uniformly high level, irrespective of the target $\rho$. Consequently, a higher target necessitates less correction from this initial overshoot. In contrast, a lower target, such as $\rho=0.25$, requires closing a larger gap to return to the designated value, thereby leading to slower convergence.
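The role of the two multipliers can be illustrated with a small sketch. It assumes the linear-plus-quadratic penalty $\lambda(\hat{\rho}-\rho)+\phi(\hat{\rho}-\rho)^2$ that is common in L0-based pruning work; this appendix does not restate ConSA's exact constraint term, so the form, the function names, the step size, and the overshoot level of 0.9 are all assumptions made for illustration.

```python
def constraint_loss(rho_hat, rho, lam, phi):
    """Assumed augmented-Lagrangian penalty on the sparsity gap."""
    gap = rho_hat - rho
    return lam * gap + phi * gap * gap


def ascend_multipliers(rho_hat, rho, lam, phi, lr):
    """One gradient-ascent step on (lam, phi).

    d(loss)/d(lam) = gap and d(loss)/d(phi) = gap**2, so ascent adds them.
    """
    gap = rho_hat - rho
    return lam + lr * gap, phi + lr * gap * gap


# Hypothetical early-training state: expected sparsity has overshot to 0.9.
overshoot = 0.9
for target in (0.25, 0.50, 0.75):
    lam, phi = ascend_multipliers(overshoot, target, 0.0, 0.0, lr=0.1)
    print(target, overshoot - target, lam, phi)
```

With both multipliers starting at zero, the size of the first update is set entirely by the gap between the overshoot level and the target, which is smallest for the highest target. This is in line with the earlier convergence observed at higher $\rho$.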

### D.2 Learned Allocation Patterns

This section supplements the analysis in Section [5.5](https://arxiv.org/html/2606.18056#S5.SS5).

#### FA Retreats Hierarchically as the SWA Budget Grows\.

A comparison of the three sparsity levels on 0.6B in Figure [9](https://arxiv.org/html/2606.18056#A2.F9) shows how the model prioritizes the allocation of the FA budget. At $\rho=0.75$, where only 7 layers use FA, FA is restricted to two small clusters located near the early-middle and upper regions. At $\rho=0.50$, where 14 layers use FA, these clusters expand into larger blocks. At $\rho=0.25$, where 21 layers use FA, FA covers most of the model, but the bottom layers (0–2) and the final layer are still assigned to SWA. This hierarchical retreat reveals a clear priority order: the middle-layer FA core is the last region to be replaced by SWA, indicating that it is the part of the network in which FA is most critical.

#### Intra\-layer Heterogeneity under Head\-wise Allocation\.

The head-wise heatmaps in Figure [11](https://arxiv.org/html/2606.18056#A4.F11) show that KV heads within the same layer often use different attention types. At $\rho=0.50$ on 0.6B, most layers contain a mixture of FA and SWA heads rather than being uniformly one type, confirming that layer-level decisions are suboptimal because they force a uniform attention type on heads that may serve functionally distinct roles. Despite the finer granularity, the head-wise patterns preserve the same macro-level trend seen in layer-wise allocation: bottom layers are SWA-dominated, and middle layers are FA-dominated, as quantified by the per-layer SWA head ratio in Figure [5](https://arxiv.org/html/2606.18056#S5.F5).
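As a small illustration of the quantities involved, the sketch below computes the per-layer SWA head ratio and the overall sparsity from a binary head-wise mask, and counts how many heads a layer-level majority decision would override. The 4×4 mask is a made-up toy example, and the convention that 1 denotes SWA is an assumption chosen so that the mean of the mask equals $\rho$.

```python
def swa_ratio_per_layer(mask):
    """mask[l][h] is 1 if KV head h of layer l uses SWA, 0 if it uses FA."""
    return [sum(row) / len(row) for row in mask]


def overall_sparsity(mask):
    """Fraction of all KV heads assigned to SWA."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def heads_overridden_by_layer_vote(mask):
    """Heads whose type would flip if each layer took its majority type."""
    flipped = 0
    for row in mask:
        layer_is_swa = 2 * sum(row) > len(row)
        flipped += sum(1 for h in row if bool(h) != layer_is_swa)
    return flipped


# Toy mask (hypothetical): bottom layers SWA-heavy, upper layers FA-heavy.
toy_mask = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(swa_ratio_per_layer(toy_mask))
print(overall_sparsity(toy_mask))
print(heads_overridden_by_layer_vote(toy_mask))
```

In this toy case the overall sparsity is 0.5, yet two of the sixteen heads would be forced into the wrong type by a layer-level decision.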

### D.3 Training FLOPs

| Model | Dense FA | ConSA ($\rho=0.50$) |
| --- | --- | --- |
| 0.6B | $17.42\times 10^{15}$ | $14.65\times 10^{15}$ (↓15.9%) |
| 1.7B | $52.57\times 10^{15}$ | $47.02\times 10^{15}$ (↓10.6%) |

Table 4: Comparison of training FLOPs per step between Dense FA and ConSA at $\rho=0.50$ under the Stage-2 setting. The sequence length is $s=8{,}192$, the global batch size is 512, and the SWA window size is $w=512$.

Table [4](https://arxiv.org/html/2606.18056#A4.T4) reports the per-step training FLOPs of ConSA and Dense FA on the 0.6B and 1.7B models under the Stage-2 configuration. At $\rho=0.50$, ConSA reduces per-step training FLOPs by 15.9% on 0.6B and 10.6% on 1.7B relative to Dense FA. Since attention FLOPs scale quadratically with sequence length, while the non-attention term scales linearly, the savings from ConSA grow with context length, making the relative benefit more pronounced in long-context training regimes.
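The FLOPs formula behind these numbers is not spelled out here, so the following is only one plausible accounting rather than the authors' method. It assumes 6 FLOPs per matmul weight per token (forward plus backward), a gated FFN with three weight matrices, an output projection of size vocabulary × hidden, and an attention term of $6\,L\,d\,k$ per token, where $k$ is the sequence length for FA units and the window size for SWA units. All function names are hypothetical; the shapes come from Table 2.

```python
def matmul_params(layers, hidden, ffn, q_heads, kv_heads, head_dim, vocab):
    """Weights that take part in a matmul for every token (assumed layout)."""
    attn = 2 * hidden * q_heads * head_dim + 2 * hidden * kv_heads * head_dim  # Q, O and K, V
    mlp = 3 * hidden * ffn  # assumption: gated FFN with three matrices
    return layers * (attn + mlp) + vocab * hidden  # assumption: output projection counted


def flops_per_step(cfg, rho, seq_len=8192, window=512, batch=512):
    layers, hidden, ffn, q_heads, kv_heads, head_dim, vocab = cfg
    weight_flops = 6 * matmul_params(*cfg)          # per token, forward + backward
    avg_keys = (1 - rho) * seq_len + rho * window   # FA units see seq_len, SWA units see window
    # Assumption: the same coefficient of 6 is applied to FA and SWA key counts.
    attn_flops = 6 * layers * q_heads * head_dim * avg_keys
    return (weight_flops + attn_flops) * seq_len * batch


CFG_06B = (28, 1024, 3072, 16, 8, 64, 102400)   # shapes from Table 2
CFG_17B = (28, 2048, 6144, 16, 8, 128, 102400)

for name, cfg in (("0.6B", CFG_06B), ("1.7B", CFG_17B)):
    dense, hybrid = flops_per_step(cfg, 0.0), flops_per_step(cfg, 0.5)
    print(name, round(dense / 1e15, 2), round(hybrid / 1e15, 2), f"{1 - hybrid / dense:.1%}")
```

Under these assumptions the four per-step totals round to the values in Table 4. Re-running `flops_per_step` with a larger `seq_len` gives a larger relative saving, consistent with the quadratic scaling of the attention term.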

### D.4 Analysis of Attention Behavior

#### Task Selection and Evidence Distribution.

We select samples from four subsets of LongBench Bai et al. ([2024](https://arxiv.org/html/2606.18056#bib.bib33)), which together cover a range of evidence distributions. NarrativeQA Kociský et al. ([2018](https://arxiv.org/html/2606.18056#bib.bib34)) poses a question over a single long literary work, where the answer is typically supported by one or two localized passages. HotpotQA Yang et al. ([2018](https://arxiv.org/html/2606.18056#bib.bib36)) requires reasoning across multiple documents, so that the supporting evidence is spread over several non-contiguous regions of the context. Passage Retrieval requires the model to identify which paragraph among many contains a given query. In contrast to Needle-in-a-Haystack Liu et al. ([2024](https://arxiv.org/html/2606.18056#bib.bib37)), in which the model must retrieve a short, semantically unrelated string from otherwise irrelevant context, the candidate paragraphs in Passage Retrieval are highly similar in topic. This makes the task substantially more challenging and causes attention to be spread across multiple plausible paragraphs. GovReport Huang et al. ([2021](https://arxiv.org/html/2606.18056#bib.bib35)) requires generating a summary of a government report, a task that depends on content distributed throughout the entire document. For all visualizations, the input is truncated to a maximum length of 1,500 tokens. The sample used in Figures [6](https://arxiv.org/html/2606.18056#S5.F6), [16](https://arxiv.org/html/2606.18056#A4.F16), and [17](https://arxiv.org/html/2606.18056#A4.F17) is drawn from NarrativeQA. All attention analyses presented here are extracted from the pre-trained checkpoint, before mask learning and continued pre-training, so that the observed patterns reflect the model's natural attention behavior rather than adaptation to a particular FA/SWA configuration.

#### Cross\-task Attention Behavior\.

We extend the analysis by examining how the same layers attend to inputs from different tasks (Figures [12](https://arxiv.org/html/2606.18056#A4.F12)–[15](https://arxiv.org/html/2606.18056#A4.F15)). The layers preferred for SWA across sparsity levels (L1, L4) exhibit only minor variation across tasks: L1 remains nearly uniformly distributed, and L4 retains its mild tail-side rise, with limited task-specific structure. In contrast, the layers preferred for FA (L16, L22) show more pronounced cross-task differences: their spike distributions become denser and span a broader range as the evidence of the task becomes more dispersed, shifting from relatively concentrated spikes on question-answering inputs to broadly elevated attention across the full sequence on summarization inputs. This suggests that the sparsity preferences of ConSA effectively distinguish layers whose attention distribution is largely input-independent from those whose distribution adapts to the evidence structure of the task. Preserving full attention is most valuable for the latter.

#### Intra\-layer Head Heterogeneity\.

The layer-wise visualization averages across all KV heads within a layer, potentially masking divergent behaviors among individual heads. Figures [16](https://arxiv.org/html/2606.18056#A4.F16) and [17](https://arxiv.org/html/2606.18056#A4.F17) decompose two representative layers into their eight KV heads. In L9, which is assigned to FA at all three sparsity levels in the layer-wise setting, the majority of heads exhibit *dense broad* attention, but others remain nearly uniform. The layer-wise decision is dominated by the broad-attending majority, but the uniform heads gain little from full attention. L27 presents the mirror case: assigned to SWA at all three sparsity levels in the layer-wise setting, most of its heads are uniform or local, but a minority display distant spikes that the SWA window would truncate. Here, the layer-wise decision is dominated by the local majority, at the cost of discarding the long-range information captured by the few broad heads. Accordingly, under head-wise allocation (Figure [11](https://arxiv.org/html/2606.18056#A4.F11)), the heads within L9 and L27 are no longer forced into a single type; instead, as $\rho$ increases, individual heads progressively transition to SWA based on their own attention range, in contrast to the layer-wise setting (Figure [9](https://arxiv.org/html/2606.18056#A2.F9)) where the entire layer switches at once. This heterogeneity explains why head-wise allocation outperforms layer-wise allocation in Table [1](https://arxiv.org/html/2606.18056#S5.T1): head-wise granularity allows ConSA to retain FA selectively for the broad heads within an otherwise local layer and, conversely, to release uniform heads within an otherwise broad layer.
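The notion of a distant spike that the SWA window would truncate can be made concrete with a simple heuristic. The sketch below is not the analysis procedure used in the paper: the function name, the threshold of five times the uniform level, and the three synthetic heads are assumptions chosen only to contrast uniform, local, and spiky last-token attention rows.

```python
def distant_spikes(last_token_attn, window, factor=5.0):
    """Indices outside the sliding window whose weight exceeds `factor` times uniform.

    `last_token_attn` holds the last query token's non-negative scores over all keys.
    """
    n = len(last_token_attn)
    total = sum(last_token_attn)
    threshold = factor / n                 # `factor` times the uniform weight 1/n
    cutoff = max(0, n - window)            # keys with index < cutoff fall outside the window
    return [i for i in range(cutoff) if last_token_attn[i] / total > threshold]


n_tokens, window = 1500, 512               # truncation length and window from the text

uniform_head = [1.0] * n_tokens            # flat attention, no distinguished key
local_head = [0.0] * (n_tokens - 50) + [1.0] * 50   # mass on the most recent 50 keys
spiky_head = [1.0] * n_tokens
spiky_head[100] = 200.0                    # one strong long-range key

for name, head in (("uniform", uniform_head), ("local", local_head), ("spiky", spiky_head)):
    print(name, distant_spikes(head, window))
```

Only the spiky head reports a key outside the window, matching the intuition that uniform and local heads lose little under SWA while heads with distant spikes lose specific long-range evidence.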

![Refer to caption](https://arxiv.org/html/2606.18056v1/x24.png)
(a) 0.6B, $\rho=0.25$

![Refer to caption](https://arxiv.org/html/2606.18056v1/x25.png)
(b) 0.6B, $\rho=0.50$

![Refer to caption](https://arxiv.org/html/2606.18056v1/x26.png)
(c) 0.6B, $\rho=0.75$

![Refer to caption](https://arxiv.org/html/2606.18056v1/x27.png)
(d) 1.7B, $\rho=0.50$

Figure 11: Learned head-wise FA/SWA allocation across model scales and sparsity levels. Each cell indicates whether a KV head uses FA (red) or SWA (blue).

![Refer to caption](https://arxiv.org/html/2606.18056v1/x28.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x29.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x30.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x31.png)

Figure 12: Cross-task last-token attention distribution for L1 of the 0.6B model.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x32.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x33.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x34.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x35.png)

Figure 13: Cross-task last-token attention distribution for L4 of the 0.6B model.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x36.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x37.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x38.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x39.png)

Figure 14: Cross-task last-token attention distribution for L16 of the 0.6B model.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x40.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x41.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x42.png)
![Refer to caption](https://arxiv.org/html/2606.18056v1/x43.png)

Figure 15: Cross-task last-token attention distribution for L22 of the 0.6B model.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x44.png)

Figure 16: Head-wise last-token attention distribution for L9 of the 0.6B model (assigned to FA at all three $\rho$ values in the layer-wise setting of Figure [9](https://arxiv.org/html/2606.18056#A2.F9)). Most heads show a *dense broad* attention pattern.

![Refer to caption](https://arxiv.org/html/2606.18056v1/x45.png)

Figure 17: Head-wise last-token attention distribution for L27 of the 0.6B model (assigned to SWA at all three $\rho$ values in the layer-wise setting of Figure [9](https://arxiv.org/html/2606.18056#A2.F9)). Most heads show a *uniform* or *local* attention pattern.
