NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation

arXiv cs.CL Papers

Summary

Proposes a training-free, NLL-guided method for selecting which layers retain full attention in hybrid attention models. With full attention in only 1/4 of the layers, it reaches accuracy comparable to a 1/2 periodic baseline on LongMemEval.

arXiv:2606.27791v1 Announce Type: new

Cached at: 06/29/26, 05:24 AM

# NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation
Source: [https://arxiv.org/html/2606.27791](https://arxiv.org/html/2606.27791)
FARS, Qiong Tang\*, Xiangkun Hu\*, Xiangyang Liu\*, Yiran Chen\*, Yunfan Shao\*

\*Equal contribution; human authors listed in alphabetical order\.

Analemma, fars@analemma\.ai

###### Abstract

Hybrid attention models that mix full and sliding\-window attention across layers offer a promising approach to efficient long\-context inference, but the critical question of *which layers* should retain full attention remains unsolved\. Existing methods use either fixed periodic patterns or attention\-based heuristics that may not capture what matters for downstream accuracy\. We propose NLL\-guided layer selection, a training\-free method that directly measures each layer’s importance by computing the negative log\-likelihood degradation on answer tokens when that layer uses sliding\-window instead of full attention\. On LongMemEval with Qwen3\-4B, our method achieves 64\.6% accuracy using only 1/4 full\-attention layers, matching the 1/2\-FA periodic baseline \(65\.0%\) while halving the full\-attention layer budget\. NLL\-guided selection outperforms the SWAA\-reported periodic 1/4\-FA baseline by 10\.4 percentage points and a matched LightTransfer\-style baseline by 26\.4 percentage points\. De\-confounding analysis shows the signal is consistent with long\-range attention needs rather than generic layer sensitivity\. The method requires only $\sim$15 minutes of one\-time calibration, advancing the efficiency\-accuracy Pareto frontier for long\-context LLM deployment\.

> Disclosure: This paper was produced by FARS \(Fully Automated Research System; [https://analemma\.ai/fars/](https://analemma.ai/fars/)\), which autonomously performed the ideation, literature review, experiment design and execution, result analysis, and manuscript composition\. The accompanying code is publicly available at [https://gitlab\.com/fars\-a/nll\-guided\-swaa\-layer\-selection](https://gitlab.com/fars-a/nll-guided-swaa-layer-selection)\. The human authors contributed review and minor editorial revisions\. They have verified the authenticity of all cited references and confirmed that all reported experimental results originate from actual code execution\. Readers should be aware that the prose and presentation of this manuscript are primarily machine\-generated and may not meet the standards of fully human\-authored work\.

## 1 Introduction

Large language models \(LLMs\) are increasingly deployed on long\-context tasks such as retrieval\-augmented generation, multi\-document question answering, and conversational agents with extended memory \(Wang et al\., [2024](https://arxiv.org/html/2606.27791#bib.bib15)\)\. However, the quadratic complexity of standard Transformer self\-attention \(Vaswani et al\., [2017](https://arxiv.org/html/2606.27791#bib.bib1)\) makes processing long prompts computationally expensive, creating a fundamental tension between model capability and deployment efficiency\.

Several approaches address this challenge\. Efficient attention mechanisms such as sparse patterns \(Beltagy et al\., [2020](https://arxiv.org/html/2606.27791#bib.bib2); Zaheer et al\., [2020](https://arxiv.org/html/2606.27791#bib.bib3)\) and linear approximations reduce complexity but often degrade quality when applied to models pretrained with full attention\. KV cache compression methods \(Zhang et al\., [2023](https://arxiv.org/html/2606.27791#bib.bib6); Xiao et al\., [2024](https://arxiv.org/html/2606.27791#bib.bib10)\) reduce memory requirements but have limited impact on prefill computation\. Hybrid attention approaches offer a promising middle ground: SWAA \(Yu et al\., [2025](https://arxiv.org/html/2606.27791#bib.bib12)\) demonstrates that pretrained full\-attention models can be adapted to use sliding\-window attention \(SWA\) during prefill with minimal quality loss when combined with full\-attention decode and strategic layer selection\.

A critical question remains: *which layers should retain full attention?* Existing methods use either fixed periodic patterns, which ignore layer\-specific roles, or attention\-based heuristics like LightTransfer \(Zhang et al\., [2025](https://arxiv.org/html/2606.27791#bib.bib13)\), which rely on indirect signals that may not capture what matters for downstream accuracy\. The choice of layers dramatically affects performance: on Qwen3\-4B, the gap between good and poor 1/4\-FA layer selections exceeds 26 percentage points\.

We propose NLL\-guided layer selection, a principled approach that directly measures what we care about: how much does answer prediction quality degrade when a given layer’s attention is restricted? By computing the negative log\-likelihood \(NLL\) on answer tokens under different attention configurations, we identify layers that genuinely benefit from full attention for long\-range information flow\. Our contributions are:

- We introduce NLL\-guided layer selection, a training\-free method for identifying which layers should retain full attention in hybrid sliding\-window models\.
- We demonstrate that NLL\-Guided 1/4\-FA achieves 64\.6% accuracy on LongMemEval, matching the 1/2\-FA periodic baseline \(65\.0%\) while halving the full\-attention budget, and outperforming the SWAA\-reported periodic 1/4\-FA baseline by 10\.4 percentage points\.
- We provide de\-confounding evidence through long\- versus short\-prompt calibration, showing that the NLL signal is specific to long\-range attention needs \(Spearman $\rho = 0.306$ between long\- and short\-prompt rankings\) rather than generic layer sensitivity\.
- We show the method is practical for deployment: calibration requires only $\sim$15 minutes on 4 GPUs and amortizes after $\sim$1,354 inference requests at 24k prompt length\.

## 2 Related Work

### 2\.1 Efficient Attention Mechanisms

The quadratic complexity of self\-attention \(Vaswani et al\., [2017](https://arxiv.org/html/2606.27791#bib.bib1)\) has motivated extensive research into efficient alternatives\. Sparse attention patterns, such as those in Longformer \(Beltagy et al\., [2020](https://arxiv.org/html/2606.27791#bib.bib2)\) and BigBird \(Zaheer et al\., [2020](https://arxiv.org/html/2606.27791#bib.bib3)\), reduce complexity by restricting attention to local windows combined with global tokens\. Linear attention variants approximate the softmax attention with kernel functions, achieving linear complexity but often at the cost of quality degradation\. FlashAttention \(Dao, [2024](https://arxiv.org/html/2606.27791#bib.bib4)\) and PagedAttention \(Kwon et al\., [2023](https://arxiv.org/html/2606.27791#bib.bib5)\) improve implementation efficiency through memory\-aware computation without changing the attention mechanism itself\. TCA\-Attention \(You et al\., [2025](https://arxiv.org/html/2606.27791#bib.bib18)\) calibrates head\-specific token sparsity budgets and selects informative tokens online\. These approaches modify the attention computation uniformly across all layers or at the token/head level, whereas our work selectively applies different attention patterns to different layers based on their measured importance\.

### 2\.2 KV Cache Compression

For autoregressive generation, KV cache memory becomes a bottleneck at long context lengths\. H2O \(Zhang et al\., [2023](https://arxiv.org/html/2606.27791#bib.bib6)\) identifies “heavy\-hitter” tokens that receive disproportionate attention and retains only these in the cache\. SnapKV \(Li et al\., [2024](https://arxiv.org/html/2606.27791#bib.bib7)\) compresses the KV cache by clustering similar key\-value pairs\. Quest \(Tang et al\., [2024](https://arxiv.org/html/2606.27791#bib.bib8)\) introduces query\-aware sparsity that dynamically selects relevant KV entries per query\. MInference \(Jiang et al\., [2024](https://arxiv.org/html/2606.27791#bib.bib9)\) accelerates prefilling through dynamic sparse attention patterns\. StreamingLLM \(Xiao et al\., [2024](https://arxiv.org/html/2606.27791#bib.bib10)\) enables infinite\-length generation by maintaining attention sinks alongside a sliding window\. These methods are orthogonal to our approach and can be combined with hybrid attention for additional efficiency gains\.

### 2\.3 Hybrid Attention Models

Recent work has explored mixing full and local attention within the same model\. Gemma 2 \(Team et al\., [2024](https://arxiv.org/html/2606.27791#bib.bib11)\) alternates between local sliding\-window and global attention layers in a fixed pattern determined during pretraining\. SWAA \(Yu et al\., [2025](https://arxiv.org/html/2606.27791#bib.bib12)\) demonstrates that pretrained full\-attention models can be adapted to use sliding\-window attention at inference time without retraining, using periodic layer selection patterns\. LightTransfer \(Zhang et al\., [2025](https://arxiv.org/html/2606.27791#bib.bib13)\) proposes attention\-based heuristics \(“lazy ratio”\) to select which layers should retain full attention\. However, these methods either use fixed patterns that ignore layer\-specific roles or rely on indirect signals that may not capture what matters for downstream accuracy\. Our NLL\-guided approach directly measures each layer’s sensitivity to attention restriction, providing a principled selection criterion\.

### 2\.4 Knowledge Distillation for Hybrid Models

Li et al\. \([2025](https://arxiv.org/html/2606.27791#bib.bib14)\) propose KL\-guided layer selection for distilling full\-attention models into hybrid architectures, using KL divergence between teacher and student outputs to identify critical layers\. While conceptually related, their method requires training a student model, whereas our approach is entirely training\-free and can be applied to any pretrained Transformer with a one\-time calibration procedure\.

## 3 Method

We propose NLL\-guided layer selection for training\-free sliding\-window attention adaptation\. Our approach identifies which layers benefit most from full attention during prefill by directly measuring the impact on answer prediction quality\. Figure [1](https://arxiv.org/html/2606.27791#S3.F1) illustrates the overall framework\.

![Refer to caption](https://arxiv.org/html/2606.27791v1/figures/framework_overview.png)

Figure 1: Overview of NLL\-guided full\-attention layer selection for SWAA\. The method uses teacher\-forced NLL on answer tokens to score each layer’s sensitivity to sliding\-window attention, then selects the top\-$k$ layers with highest degradation for full attention during inference\.

### 3\.1 Problem Formulation

Consider a Transformer with $L$ layers deployed with sliding\-window attention adaptation \(SWAA\) \(Yu et al\., [2025](https://arxiv.org/html/2606.27791#bib.bib12)\)\. During prefill, each layer can use either full attention \(FA\) or sliding\-window attention \(SWA\)\. Given a budget of $k$ layers that may use full attention during prefill, we seek to select the set $S \subseteq \{0, \ldots, L-1\}$ with $|S| = k$ that maximizes downstream task accuracy\. Following SWAA, we assume full\-attention decode is enabled, meaning all layers use full attention during generation regardless of their prefill configuration\.

### 3\.2 NLL\-Guided Layer Scoring

Our key insight is that the importance of full attention at each layer can be measured by how much it improves the model’s ability to predict answer tokens\. For an input consisting of a prompt $x_{1:m}$ and answer $y_{1:n}$, we define the per\-layer score as the reduction in negative log\-likelihood \(NLL\) on answer tokens when that layer uses full attention instead of SWA during prefill\.

Formally, let $\mathcal{L}_{\text{ans}}(\cdot)$ denote the mean NLL on answer tokens under a given attention configuration\. For each layer $\ell$, we compute:

$$\Delta_{\ell} = \mathcal{L}_{\text{ans}}(\text{SWA at layer } \ell) - \mathcal{L}_{\text{ans}}(\text{FA at layer } \ell), \tag{1}$$

where all other layers use SWA during prefill\. A larger $\Delta_{\ell}$ indicates that layer $\ell$ benefits more from full attention for long\-range information flow from prompt to answer\.
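As a minimal sketch of Eq\. \(1\), the snippet below computes the mean answer\-token NLL from per\-token log\-probabilities and takes the difference between the two configurations\. The function names and the toy log\-probabilities are illustrative assumptions, not taken from the released code; in practice the log\-probabilities would come from teacher\-forced forward passes of the model\.

```python
import math
from typing import Sequence


def mean_answer_nll(answer_token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood over answer tokens (natural-log probabilities)."""
    if not answer_token_logprobs:
        raise ValueError("need at least one answer token")
    return -sum(answer_token_logprobs) / len(answer_token_logprobs)


def delta_nll(logprobs_layer_swa: Sequence[float], logprobs_layer_fa: Sequence[float]) -> float:
    """Eq. (1): answer NLL with SWA at the layer minus answer NLL with FA at the layer."""
    return mean_answer_nll(logprobs_layer_swa) - mean_answer_nll(logprobs_layer_fa)


# Toy values (hypothetical): the second answer token becomes more likely
# once the layer under test is switched to full attention.
toy_swa = [math.log(0.5), math.log(0.25)]
toy_fa = [math.log(0.5), math.log(0.5)]
toy_delta = delta_nll(toy_swa, toy_fa)
print(f"delta NLL = {toy_delta:.4f}")
```

A positive value means the layer predicts the answer better with full attention, which is the quantity used for ranking\.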

This scoring uses teacher forcing, requiring only forward passes without generation\. The attention configuration during scoring matches inference: SWA is applied only to prompt tokens, while answer tokens always attend to the full context \(emulating full\-attention decode\)\.
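The attention pattern described above can be sketched as a boolean mask\. This is an illustration under stated assumptions rather than the SWAA implementation: the function name is hypothetical, the window is assumed to cover the query itself plus the preceding `window - 1` tokens, and `keep_first` models the attention\-sink tokens used in the experiments\.

```python
from typing import List


def prefill_attention_mask(
    prompt_len: int,
    answer_len: int,
    window: int,
    keep_first: int,
    layer_uses_full_attention: bool,
) -> List[List[bool]]:
    """Return mask[q][k] = True when query position q may attend to key position k."""
    total = prompt_len + answer_len
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: keys never lie after the query
            if layer_uses_full_attention or q >= prompt_len:
                # FA layers, and answer tokens in every layer, see the full prefix.
                mask[q][k] = True
            else:
                # Assumed window convention: the query plus the previous window-1 tokens.
                in_window = (q - k) < window
                is_sink = k < keep_first
                mask[q][k] = in_window or is_sink
    return mask


# Tiny example: 6 prompt tokens, 2 answer tokens, window 2, one sink token.
toy_swa_mask = prefill_attention_mask(6, 2, window=2, keep_first=1, layer_uses_full_attention=False)
toy_fa_mask = prefill_attention_mask(6, 2, window=2, keep_first=1, layer_uses_full_attention=True)
```

In the SWA layer, the last prompt token sees only the sink and its local window, while both answer tokens see every earlier position\.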

### 3\.3 Layer Selection

Given the per\-layer scores $\{\Delta_{\ell}\}_{\ell=0}^{L-1}$ averaged over a calibration set, we select the top\-$k$ layers by $\Delta_{\ell}$ to form the full\-attention set $S$\. This greedy selection is simple and effective; we found that the selected layers naturally span early, middle, and late depths without requiring explicit stratification constraints\.
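The selection step amounts to averaging the scores over the calibration examples and taking the $k$ largest\. The sketch below assumes the scores are stored as one list per example; the function name, the tie\-breaking rule \(lower layer index first\), and the toy numbers are illustrative choices that the paper does not specify\.

```python
from typing import List, Sequence


def select_full_attention_layers(per_example_deltas: Sequence[Sequence[float]], k: int) -> List[int]:
    """Average delta-NLL over examples and return the k highest-scoring layer indices."""
    num_examples = len(per_example_deltas)
    num_layers = len(per_example_deltas[0])
    mean_delta = [
        sum(example[layer] for example in per_example_deltas) / num_examples
        for layer in range(num_layers)
    ]
    # Sort by descending mean score; break ties by layer index (an assumption).
    ranked = sorted(range(num_layers), key=lambda layer: (-mean_delta[layer], layer))
    return sorted(ranked[:k])


# Two hypothetical calibration examples for a 4-layer model.
toy_scores = [
    [0.1, 0.5, 0.0, 0.3],
    [0.3, 0.7, 0.2, 0.3],
]
toy_selection = select_full_attention_layers(toy_scores, k=2)
print(toy_selection)
```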

### 3\.4 Calibration and Inference

The calibration procedure requires a small set of long\-context examples \(we use 64 examples with 16k–32k token prompts\)\. For each example, we perform $L+1$ forward passes: one baseline pass with all layers using SWA, plus one pass per layer with that layer toggled to FA\. The entire calibration takes approximately 15 minutes on 4 GPUs for a 36\-layer model, with no gradient computation required\.
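The calibration loop can be sketched as follows\. The `answer_nll` callable is a placeholder for a teacher\-forced forward pass that returns the mean answer\-token NLL for a given set of full\-attention layers; its signature and the fake scorer used here are assumptions for illustration only\.

```python
from typing import Callable, FrozenSet, List, Sequence


def calibrate(
    examples: Sequence[object],
    num_layers: int,
    answer_nll: Callable[[object, FrozenSet[int]], float],
) -> List[float]:
    """Return the mean per-layer delta-NLL using L+1 forward passes per example."""
    totals = [0.0] * num_layers
    for example in examples:
        baseline = answer_nll(example, frozenset())  # all layers use SWA
        for layer in range(num_layers):
            toggled = answer_nll(example, frozenset({layer}))  # only this layer uses FA
            totals[layer] += baseline - toggled
    return [total / len(examples) for total in totals]


# Fake scorer (hypothetical): each FA layer lowers the NLL by a fixed gain.
fake_gains = [0.0, 0.25, 0.5]
call_count = 0


def fake_answer_nll(example: object, fa_layers: FrozenSet[int]) -> float:
    global call_count
    call_count += 1
    return 2.0 - sum(fake_gains[layer] for layer in fa_layers)


toy_deltas = calibrate(["example_a", "example_b"], num_layers=3, answer_nll=fake_answer_nll)
print(toy_deltas, call_count)
```

With the settings reported above \(64 examples, 36 layers\), the same loop performs 64 × 37 = 2,368 forward passes\.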

Once calibration is complete, the selected layer set $S$ is fixed for all subsequent inference\. During deployment, layers in $S$ use full attention during prefill while other layers use SWA\. All layers use full attention during decode, following the SWAA protocol\. This one\-time calibration cost amortizes quickly: at 24k prompt length, the break\-even point is approximately 1,354 inference requests\. See Appendix [A](https://arxiv.org/html/2606.27791#A1) for implementation details\.

## 4 Experiments

### 4\.1 Experimental Setup

We evaluate NLL\-guided layer selection on Qwen3\-4B\-Thinking\-2507 \([https://huggingface\.co/Qwen/Qwen3\-4B\-Thinking\-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507); Yang et al\., [2025](https://arxiv.org/html/2606.27791#bib.bib17)\), a 36\-layer model, using the LongMemEval benchmark \(Wu et al\., [2025](https://arxiv.org/html/2606.27791#bib.bib16)\)\. LongMemEval tests long\-term conversational memory through 500 samples across six task types: knowledge\-update, multi\-session, single\-session\-assistant, single\-session\-preference, single\-session\-user, and temporal\-reasoning\. Prompts are approximately 24k tokens on average\.

We use the SWAA \(Yu et al\., [2025](https://arxiv.org/html/2606.27791#bib.bib12)\) configuration with sliding window size 2048, keep\-first\-10 attention sinks, and full\-attention decode enabled\. For calibration, we use 64 long\-context examples \(16k–32k tokens\) from LongAlign\-10k and fusang\-v1\-filtered datasets\. Generation uses vLLM with 8 GPUs, batch size 64, and temperature 0\. Evaluation follows the LongMemEval protocol using GPT\-5\-mini as the judge\.

We compare against five baselines: \(1\) Full Attention \(all 36 layers use FA\), \(2\) 1/2\-FA Periodic \(18 layers, every other layer\), \(3\) 1/4\-FA Periodic \(SWAA\-reported 9\-layer periodic pattern\), \(4\) LightTransfer 1/4\-FA \(Zhang et al\., [2025](https://arxiv.org/html/2606.27791#bib.bib13)\) \(9 layers selected by lazy\-ratio heuristic, evaluated under our SWAA protocol with keep\_first=10\), and \(5\) Naive SWA \(all layers use SWA, no keep\-first tokens, no FA decode\)\. Baselines \(1\)–\(3\) and \(5\) use SWAA\-reported values\. The LightTransfer comparison uses keep\_first=10 rather than LightTransfer’s default keep\_first=100, so it tests the layer\-ranking heuristic under our SWAA protocol rather than reproducing LightTransfer’s preferred setting\.

### 4\.2 Main Results

Table [1](https://arxiv.org/html/2606.27791#S4.T1) presents the main comparison\. NLL\-Guided 1/4\-FA achieves 64\.6% accuracy, within 0\.4 percentage points of the 1/2\-FA Periodic baseline \(65\.0%\) while using only half the full\-attention budget \(9 vs 18 layers\)\. This demonstrates that intelligent layer selection can substantially reduce computational cost with minimal accuracy loss\.

Table 1: Accuracy comparison on LongMemEval\_24k \(500 samples\)\. All methods use Qwen3\-4B\-Thinking\-2507 with SWA window=2048, keep\_first=10, and FA decode, except Naive SWA \(no keep\-first, no FA decode\)\. Binomial 95% half\-widths are $\sim$4pp\. Best in bold, second\-best underlined\.

Compared to other 1/4\-FA methods, NLL\-Guided outperforms the SWAA\-reported periodic baseline by 10\.4 percentage points \(64\.6% vs 54\.2%\), demonstrating that data\-driven selection substantially outperforms fixed patterns under the same FA budget\. The improvement over the matched LightTransfer baseline is even more pronounced at 26\.4 percentage points \(64\.6% vs 38\.2%\), indicating that NLL\-based scoring provides a stronger signal than attention\-pattern heuristics for this task\. Appendix [A](https://arxiv.org/html/2606.27791#A1) reports simple sampling\-uncertainty estimates and clarifies the calibration\-domain scope\.

### 4\.3 Per\-Task Analysis

Table [2](https://arxiv.org/html/2606.27791#S4.T2) shows the per\-task breakdown comparing NLL\-Guided and the matched LightTransfer baseline\. NLL\-Guided outperforms this baseline on all six task types, with improvements ranging from 13\.3 to 37\.1 percentage points\. The largest gains appear on single\-session\-user \(\+37\.1pp\) and temporal\-reasoning \(\+33\.9pp\), suggesting that NLL\-guided selection particularly benefits tasks requiring precise long\-range information retrieval\.

Table 2: Per\-task\-type accuracy breakdown on LongMemEval\_24k\. NLL\-Guided consistently outperforms the matched LightTransfer baseline across all 6 task types\.

### 4\.4 De\-confounding Analysis

A potential concern is that the NLL signal might reflect generic layer sensitivity rather than long\-range attention needs\. To address this, we compare layer rankings obtained from long\-prompt calibration \(16k–32k tokens\) versus short\-prompt calibration \(1\.5k tokens, within the SWA window where SWA and FA are equivalent\)\.

Figure [2](https://arxiv.org/html/2606.27791#S4.F2) shows the comparison\. The Spearman correlation between long\-prompt and short\-prompt rankings is low \($\rho = 0.306$, $p = 0.069$\), and only 3 of 9 selected layers overlap \(Jaccard similarity = 0\.2\)\. Furthermore, long\-prompt $\Delta$\-NLL values are 85\.6$\times$ larger in magnitude than short\-prompt values\. These results are consistent with the NLL signal being specific to long\-range attention needs rather than generic layer sensitivity\.
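The two agreement statistics used here are easy to reproduce\. The sketch below implements Jaccard similarity and a Spearman correlation \(Pearson correlation of average ranks\) with the standard library only\. The long\-prompt layer set is the one reported in Section 4\.5; the short\-prompt set is a hypothetical stand\-in chosen to share exactly 3 layers, since the paper does not list it\.

```python
import math
from typing import Iterable, List, Sequence


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Size of the intersection divided by size of the union."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


long_prompt_layers = [1, 3, 9, 10, 12, 13, 15, 21, 34]  # reported in Section 4.5
short_prompt_layers = [0, 2, 4, 9, 13, 15, 20, 30, 35]  # hypothetical, 3 shared layers
overlap_jaccard = jaccard(long_prompt_layers, short_prompt_layers)
toy_rho = spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])  # toy scores, not paper data
```

Two 9\-layer sets sharing 3 layers have a union of 15 layers, giving 3/15 = 0\.2\. The same formula gives 7/11 ≈ 0\.64 for the 7\-of\-9 overlap reported in Section 4\.6\.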

![Refer to caption](https://arxiv.org/html/2606.27791v1/figures/deconfounding_scatter.png)

Figure 2: Layer ranking comparison between long\-prompt \(16k–32k tokens\) and short\-prompt \(1\.5k tokens\) calibration\. Low correlation \(Spearman $\rho = 0.306$\) and minimal overlap \(Jaccard=0\.2\) are consistent with the NLL signal being specific to long\-range attention needs\.

### 4\.5 Layer Selection Patterns

Figure [3](https://arxiv.org/html/2606.27791#S4.F3) visualizes the per\-layer $\Delta$\-NLL scores\. The selected layers \[1, 3, 9, 10, 12, 13, 15, 21, 34\] naturally span early \(1, 3\), middle \(9–15, 21\), and late \(34\) depths without requiring explicit stratification\. Layer 15 shows the highest $\Delta$\-NLL \(0\.011\), followed by layers 9, 13, and 21\. This non\-periodic, data\-driven pattern differs fundamentally from fixed periodic selections and suggests that different layers serve distinct roles in long\-range information flow\.

![Refer to caption](https://arxiv.org/html/2606.27791v1/figures/delta_nll_per_layer.png)

Figure 3: Per\-layer NLL degradation \($\Delta$\-NLL\) when using SWA instead of FA\. Blue bars indicate the 9 layers selected for full attention\. The selected layers span early, middle, and late depths with a non\-periodic pattern\.

### 4\.6 Calibration Stability

We analyze the stability of layer selection with respect to calibration set size\. Using 16 examples instead of 64 yields a Jaccard similarity of 0\.64 with the 64\-example selection, with 7 of 9 layers overlapping\. Core layers \(9, 13, 15, 21\) are consistently selected across different calibration sizes\. Downstream accuracy with 16\-example calibration is 62\.2%, a modest 2\.4 percentage point drop from the 64\-example result \(64\.6%\), but still substantially above the periodic baseline \(54\.2%\)\. We recommend 64\+ calibration examples for production deployment to maximize stability\.

## 5 Conclusion

We presented NLL\-guided layer selection, a principled, training\-free method for identifying which layers should retain full attention in hybrid sliding\-window attention models\. By directly measuring each layer’s impact on answer prediction quality, our approach achieves 64\.6% accuracy with only 1/4 full\-attention layers, matching the 1/2\-FA periodic baseline \(65\.0%\) while halving the full\-attention layer budget\. The method outperforms the SWAA\-reported periodic 1/4\-FA baseline by 10\.4 percentage points and a matched LightTransfer\-style baseline by 26\.4 percentage points\.

Our work has limitations: we evaluate on a single model \(Qwen3\-4B\) and benchmark \(LongMemEval\)\. Future work should validate across model families and tasks, and explore dynamic per\-input layer selection\. Nevertheless, NLL\-guided selection advances the efficiency\-accuracy Pareto frontier for long\-context LLM deployment\.

## References

- I\. Beltagy, M\. E\. Peters, and A\. Cohan \(2020\) Longformer: the long\-document transformer\. ArXiv abs/2004\.05150\. Cited by: [§1](https://arxiv.org/html/2606.27791#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.27791#S2.SS1.p1.1)\.
- T\. Dao \(2024\) FlashAttention\-2: faster attention with better parallelism and work partitioning\. In The Twelfth International Conference on Learning Representations, ICLR 2024\. Cited by: [§2\.1](https://arxiv.org/html/2606.27791#S2.SS1.p1.1)\.
- H\. Jiang, Y\. Li, C\. Zhang, Q\. Wu, X\. Luo, S\. Ahn, Z\. Han, A\. H\. Abdi, D\. Li, C\. Lin, Y\. Yang, and L\. Qiu \(2024\) MInference 1\.0: accelerating pre\-filling for long\-context llms via dynamic sparse attention\. ArXiv abs/2407\.02490\. Cited by: [§2\.2](https://arxiv.org/html/2606.27791#S2.SS2.p1.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\) Efficient memory management for large language model serving with pagedattention\. Cited by: [§2\.1](https://arxiv.org/html/2606.27791#S2.SS1.p1.1)\.
- Y\. Li, S\. Yang, S\. Tan, M\. Mishra, R\. Panda, J\. Zhou, and Y\. Kim \(2025\) Distilling to hybrid attention models via kl\-guided layer selection\. ArXiv abs/2512\.20569\. Cited by: [§2\.4](https://arxiv.org/html/2606.27791#S2.SS4.p1.1)\.
- Y\. Li, Y\. Huang, B\. Yang, B\. Venkitesh, A\. F\. Locatelli, H\. Ye, T\. Cai, P\. Lewis, and D\. Chen \(2024\) SnapKV: llm knows what you are looking for before generation\. ArXiv abs/2404\.14469\. Cited by: [§2\.2](https://arxiv.org/html/2606.27791#S2.SS2.p1.1)\.
- J\. Tang, Y\. Zhao, K\. Zhu, G\. Xiao, B\. Kasikci, and S\. Han \(2024\) Quest: query\-aware sparsity for efficient long\-context llm inference\. ArXiv abs/2406\.10774\. Cited by: [§2\.2](https://arxiv.org/html/2606.27791#S2.SS2.p1.1)\.
- G\. Team, M\. Riviere, et al\. \(2024\) Gemma 2: improving open language models at a practical size\. ArXiv abs/2408\.00118\. Cited by: [§2\.3](https://arxiv.org/html/2606.27791#S2.SS3.p1.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, L\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. pp\. 5998–6008\. Cited by: [§1](https://arxiv.org/html/2606.27791#S1.p1.1), [§2\.1](https://arxiv.org/html/2606.27791#S2.SS1.p1.1)\.
- X\. Wang, M\. Salmani, P\. Omidi, X\. Ren, M\. Rezagholizadeh, and A\. Eshaghi \(2024\) Beyond the limits: a survey of techniques to extend the context length in large language models\. pp\. 8299–8307\. Cited by: [§1](https://arxiv.org/html/2606.27791#S1.p1.1)\.
- D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2025\) LongMemEval: benchmarking chat assistants on long\-term interactive memory\. In The Thirteenth International Conference on Learning Representations, ICLR 2025\. Cited by: [§4\.1](https://arxiv.org/html/2606.27791#S4.SS1.p1.1)\.
- G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2024\) Efficient streaming language models with attention sinks\. In The Twelfth International Conference on Learning Representations, ICLR 2024\. Cited by: [§1](https://arxiv.org/html/2606.27791#S1.p2.1), [§2\.2](https://arxiv.org/html/2606.27791#S2.SS2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, et al\. \(2025\) Qwen3 technical report\. ArXiv abs/2505\.09388\. Cited by: [§4\.1](https://arxiv.org/html/2606.27791#S4.SS1.p1.1)\.
- Z\. You, Y\. Chen, S\. Zhang, Z\. Qiu, T\. Wu, Y\. Li, Y\. Wang, and M\. Tan \(2025\) Training\-free context\-adaptive attention for efficient long context modeling\. CoRR abs/2512\.09238\. Cited by: [§2\.1](https://arxiv.org/html/2606.27791#S2.SS1.p1.1)\.
- Y\. Yu, J\. Liu, Q\. Wu, H\. Wang, and J\. Pei \(2025\) SWAA: sliding window attention adaptation for efficient and quality preserving long context processing\. ArXiv abs/2512\.10411\. Cited by: [§1](https://arxiv.org/html/2606.27791#S1.p2.1), [§2\.3](https://arxiv.org/html/2606.27791#S2.SS3.p1.1), [§3\.1](https://arxiv.org/html/2606.27791#S3.SS1.p1.4), [§4\.1](https://arxiv.org/html/2606.27791#S4.SS1.p2.1)\.
- M\. Zaheer, G\. Guruganesh, K\. A\. Dubey, J\. Ainslie, C\. Alberti, S\. Ontañón, P\. Pham, A\. Ravula, Q\. Wang, L\. Yang, and A\. Ahmed \(2020\) Big bird: transformers for longer sequences\. ArXiv abs/2007\.14062\. Cited by: [§1](https://arxiv.org/html/2606.27791#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.27791#S2.SS1.p1.1)\.
- X\. Zhang, F\. Zhang, C\. Du, C\. Du, T\. Pang, W\. Gao, and M\. Lin \(2025\) LightTransfer: your long\-context LLM is secretly a hybrid model with effortless adaptation\. Trans\. Mach\. Learn\. Res\. 2025\. Cited by: [§1](https://arxiv.org/html/2606.27791#S1.p3.1), [§2\.3](https://arxiv.org/html/2606.27791#S2.SS3.p1.1), [§4\.1](https://arxiv.org/html/2606.27791#S4.SS1.p3.1)\.
- Z\. Zhang, Y\. Sheng, T\. Zhou, T\. Chen, L\. Zheng, R\. Cai, Z\. Song, Y\. Tian, C\. Ré, C\. W\. Barrett, Z\. Wang, and B\. Chen \(2023\) H2O: heavy\-hitter oracle for efficient generative inference of large language models\. ArXiv abs/2306\.14048\. Cited by: [§1](https://arxiv.org/html/2606.27791#S1.p2.1), [§2\.2](https://arxiv.org/html/2606.27791#S2.SS2.p1.1)\.

## Appendix A Implementation Details

### A\.1 Calibration Data

We use 64 long\-context examples sampled from LongAlign\-10k and fusang\-v1\-filtered datasets, with prompt lengths between 16k and 32k tokens\. Examples are selected to have answer lengths of at least 20 tokens to ensure meaningful NLL computation\.

### A\.2 Scoring Procedure

For each of the 36 layers, we compute the mean NLL on answer tokens under two conditions: \(1\) all layers use SWA during prefill, and \(2\) the target layer uses FA while others use SWA\. The difference gives the per\-layer $\Delta$\-NLL score\. We use teacher forcing with no gradient computation, enabling efficient calibration\.

### A\.3 SWAA Configuration

Following the SWAA protocol, we use: sliding window size = 2048, keep\-first = 10 \(attention sinks\), and full\-attention decode enabled\. Generation uses vLLM with enforce\_eager=True, temperature=0, and max\_completion\_len=10000\.

### A\.4 Uncertainty and Scope

Because LongMemEval\_24k contains 500 examples, a simple binomial approximation gives 95% half\-widths of about $\pm$4\.2pp for NLL\-Guided \(64\.6%\), $\pm$4\.4pp for the 1/4\-FA periodic baseline \(54\.2%\), and $\pm$4\.3pp for the matched LightTransfer\-style baseline \(38\.2%\)\. These intervals do not model judge variability or paired\-example covariance, but they show that the main same\-budget margins are larger than basic sampling noise\. We view paired bootstrap intervals, additional judge replications, and broader model/benchmark sweeps as the next steps for a full statistical treatment\.
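The half\-widths quoted above follow from the normal approximation to the binomial, $z \sqrt{p(1-p)/n}$ with $z = 1.96$ and $n = 500$\. The short check below reproduces them; the function name is an illustrative choice, and the accuracies are the ones reported in this section\.

```python
import math


def binomial_half_width_pp(accuracy: float, n: int, z: float = 1.96) -> float:
    """95% normal-approximation half-width, in percentage points."""
    return 100.0 * z * math.sqrt(accuracy * (1.0 - accuracy) / n)


reported = {
    "NLL-Guided 1/4-FA": 0.646,
    "1/4-FA Periodic": 0.542,
    "LightTransfer-style 1/4-FA": 0.382,
}
half_widths = {name: binomial_half_width_pp(acc, n=500) for name, acc in reported.items()}
for name, width in half_widths.items():
    print(f"{name}: +/-{width:.1f}pp")
```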

The calibration examples come from LongAlign\-10k and fusang\-v1\-filtered rather than LongMemEval itself\. This avoids calibrating directly on the evaluation benchmark, but it leaves open how strongly the selected layer set depends on calibration\-domain coverage\. The present study therefore establishes that a small general long\-context calibration set can produce a strong Qwen3\-4B layer set for LongMemEval; testing cross\-domain calibration and additional FA budgets is important future work\.
