Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling

arXiv cs.CL 06/29/26, 04:00 AM Papers
positional-bias layer-specific rope long-context genetic-algorithm transformer attention
Summary
Introduces LPES, a layer-specific positional embedding scaling method that mitigates the 'lost-in-the-middle' problem in LLMs by assigning distinct scaling factors per layer using a genetic algorithm with Bézier curves, achieving up to 11.2% accuracy gain without fine-tuning or latency increase.
arXiv:2606.27705v1 Announce Type: new
# Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling
Source: [https://arxiv.org/html/2606.27705](https://arxiv.org/html/2606.27705)
Changze Lv¹,³,\*, Zhenghua Wang¹,³,\*, Yiran Ding²,\*, Yixin Wu¹,³, Tianlong Li¹,³, Zhibo Xu¹,³, Muling Wu¹,³, Tianyuan Shi¹,³, Shizheng Li¹,³, Qi Qian¹,³, Xuanjing Huang¹,³, Xiaoqing Zheng¹,³,†

¹Fudan University, ²Westlake University, ³Shanghai Key Laboratory of Intelligent Information Processing

\*Equal contribution. †Corresponding author.

{zhenghuawang23, czlv24}@m.fudan.edu.cn, {yiran.ding}@hdu.edu.cn, {xjhuang, zhengxq}@fudan.edu.cn

###### Abstract

Large Language Models (LLMs) still struggle with the "lost-in-the-middle" problem, where critical information located in the middle of long-context inputs is often underrepresented or lost. While existing methods attempt to address this by combining multi-scale rotary position embeddings (RoPE), they typically suffer from high latency or rely on suboptimal hand-crafted scaling strategies. To overcome these limitations, we introduce a layer-specific positional embedding scaling (LPES) method that assigns distinct scaling factors to each layer. LPES achieves a more balanced attention distribution without fine-tuning model parameters or increasing inference delay. A specially designed genetic algorithm is employed to efficiently select the optimal scaling factors for each layer by incorporating Bézier curves to significantly reduce the search space. Extensive experiments demonstrate that LPES effectively mitigates positional attention bias and delivers consistent improvements across multiple long-context benchmarks, yielding up to an 11.2% accuracy gain on the key-value retrieval dataset.

## 1 Introduction

Enabling Large Language Models (LLMs) to process long inputs is essential for supporting complex tasks such as long-text summarization (Feng et al., [2021](https://arxiv.org/html/2606.27705#bib.bib11); Zhang et al., [2021](https://arxiv.org/html/2606.27705#bib.bib14)), code generation (Zheng et al., [2023](https://arxiv.org/html/2606.27705#bib.bib12); Liu et al., [2024a](https://arxiv.org/html/2606.27705#bib.bib13)), and long-context question answering (Li et al., [2024](https://arxiv.org/html/2606.27705#bib.bib10)). Rotary position embeddings (RoPE) (Su et al., [2021](https://arxiv.org/html/2606.27705#bib.bib42)), widely adopted in transformer-based LLMs, were designed to encode relative distances between input tokens, facilitating more effective processing of long-context inputs. However, as the context length increases, RoPE-based LLMs continue to suffer from positional bias. A representative manifestation of this issue is the well-known lost-in-the-middle phenomenon (Liu et al., [2024c](https://arxiv.org/html/2606.27705#bib.bib15)), where models tend to over-attend to tokens near the beginning and the end of the input, while relatively neglecting information located in the middle.

Several approaches have been proposed to address the position bias problem by combining multiple RoPEs with different bases or scaling factors (Chen et al., [2023b](https://arxiv.org/html/2606.27705#bib.bib19); Zhang et al., [2024](https://arxiv.org/html/2606.27705#bib.bib52); Lin et al., [2024](https://arxiv.org/html/2606.27705#bib.bib20)). Chen et al. ([2023b](https://arxiv.org/html/2606.27705#bib.bib19)) observed that RoPE with different bases induces attention troughs at specific positions, which impairs the model's ability to capture the corresponding content. To mitigate this, they introduced a method, named Attention Buckets, that combines multiple RoPEs with different bases to achieve a more balanced attention distribution. Similarly, Lin et al. ([2024](https://arxiv.org/html/2606.27705#bib.bib20)) proposed MoICE, which assigns multiple RoPE bases to each attention head and aggregates the outputs through a weighted sum. However, these methods rely heavily on manually designed rules to determine scaling factors or base values, and they require multiple forward passes during inference (one for each base or scaling factor), followed by ensembling the results. Although some operations can be parallelized, this procedure inevitably increases inference time and computational cost.

![Refer to caption](https://arxiv.org/html/2606.27705v1/x1.png)

Figure 1: Comparison of the proposed LPES with two representative existing methods. (a) Attention Buckets combines multiple RoPEs with different bases through model parallelism. (b) MoICE assigns multiple bases to each attention head. Unlike these existing methods, which require multiple forward passes during inference, our LPES (c) achieves superior performance with a single forward pass, significantly reducing inference time.

Varying RoPE bases across the entire model can be seen as model-level ensembling, while applying multiple bases to individual attention heads corresponds to module-level ensembling (Figure [1](https://arxiv.org/html/2606.27705#S1.F1)). Model-level ensembling requires multiple model inferences, incurring substantial computational overhead, whereas module-level scaling suffers from a large search space due to its fine granularity, limiting the applicability of automatic search algorithms. To balance efficiency and flexibility, we apply multiple scaled RoPEs at the layer level, achieving competitive or superior performance with a single forward pass and thus avoiding the associated inference overhead.

Choosing an appropriate scaling factor for each layer is still a non-trivial problem. Let $L$ denote the number of layers in a transformer-based network, and $M$ the number of possible values for the scaling factors; the total number of combinations is $M^{L}$, which makes an exhaustive search computationally intractable. Determining optimal scaling factors is inherently a combinatorial optimization problem, and thus cannot be easily solved by gradient-based methods. To overcome this, we leverage the Bézier curve, which defines a smooth, continuous mapping between layer depth and scaling factors using a small set of discrete control points. Letting $C$ denote the number of control points, the search space is reduced to $(M \times L)^{C}$. In addition to reducing the search space, we find that the smoothness of curve-based scaling preserves layer-wise representational structure and serves as a beneficial inductive bias. We further develop a curve-constrained genetic algorithm to solve this combinatorial optimization problem. By restricting the search space to Bézier curves, we can efficiently optimize layer-specific scaling factors, typically within 3 to 4 hours using only a few hundred examples (e.g., 200 instances) on four H100 GPUs. In long-text tasks, our method introduces no additional inference latency while delivering superior performance over existing approaches.
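To give a feel for the size of the reduction from $M^{L}$ to $(M \times L)^{C}$, the following sketch evaluates both expressions. The values of `L`, `M`, and `C` are hypothetical and chosen only for illustration; they are not settings reported in this paper.

```python
# Hypothetical sizes, chosen only for illustration (not taken from the paper).
L = 32   # number of transformer layers
M = 11   # candidate scaling-factor values per layer, e.g. 1.0, 1.1, ..., 2.0
C = 4    # number of control points (a cubic Bezier curve has 4)

per_layer_space = M ** L       # one independent choice per layer: M^L
curve_space = (M * L) ** C     # one (layer position, value) choice per control point

print(f"per-layer search space: {per_layer_space:.3e}")
print(f"Bezier search space:    {curve_space:.3e}")
print(f"reduction factor:       {per_layer_space / curve_space:.3e}")
```

Under these assumed values the curve-constrained space is smaller by many orders of magnitude, which is what makes a population-based search practical.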

This study makes the following contributions:

- We propose a layer-specific positional embedding scaling method, termed LPES, which effectively mitigates the position bias without incurring additional inference latency. LPES achieves significant speedups, running 2.42× faster than MoICE (Lin et al., [2024](https://arxiv.org/html/2606.27705#bib.bib20)) and 1.45× faster than Ms-PoE (Zhang et al., [2024](https://arxiv.org/html/2606.27705#bib.bib52)), while also improving the model's ability to handle long-context tasks.
- We introduce an efficient genetic search algorithm in which the search space is constrained by Bézier curves, enabling rapid optimization of layer-specific scaling factors using only a small set of examples.
- Extensive experiments on multiple benchmark datasets demonstrate that our method preserves the model's general capabilities while producing a more balanced attention distribution without costly fine-tuning, making it broadly applicable across different models and tasks.

## 2 Related Work

![Refer to caption](https://arxiv.org/html/2606.27705v1/x2.png)

Figure 2: Illustration of the proposed layer-specific positional embedding scaling (LPES) method. Left: Bézier curves can represent a wide variety of shapes. Middle: An optimized Bézier curve found by our search algorithm, which defines a smooth, continuous curve using a limited set of discrete control points. Right: The relationship between the scaling factors and the optimized Bézier curve, and their application within the attention mechanism of a transformer-based network.

Chen et al. ([2023b](https://arxiv.org/html/2606.27705#bib.bib19)) observed that RoPE with different bases can produce attention troughs at specific positions, a phenomenon called "Attention Waves", thereby impairing the model's ability to capture the relevant content. To address this, their "Attention Buckets" method integrates multiple RoPE bases through model-parallel inference to achieve a more uniform attention distribution. Zhang et al. ([2024](https://arxiv.org/html/2606.27705#bib.bib52)) suggested that the long-term decay in attention may contribute to the position bias, and proposed Ms-PoE, which assigns distinct scaling factors to attention heads based on their relative sensitivity to positional information. MoICE (Lin et al., [2024](https://arxiv.org/html/2606.27705#bib.bib20)), building on the work of Chen et al. ([2023b](https://arxiv.org/html/2606.27705#bib.bib19)), employs gradient descent to learn the weights for combining results from different bases at the level of individual attention heads. However, a major limitation of these approaches is their high computational cost and inference latency. Specifically, Attention Buckets requires multiple forward passes, while both Ms-PoE and MoICE require repeated attention computations to integrate multi-scale RoPE information. They also rely on heuristic or hand-crafted rules to select bases or scaling factors. By contrast, our method achieves superior performance with a single forward pass and proposes an automatic search algorithm, which effectively determines optimal scaling factors using only a few hundred examples.

## 3 Method

### 3.1 Problem Definition

In this study, we focus on RoPE, which is defined as follows:

$$\langle f(\bm{q}, i), f(\bm{k}, j) \rangle = \bm{q}^{\mathrm{T}} R(i-j)\, \bm{k} \tag{1}$$

where $f(\bm{x}, i)$ denotes a position-dependent rotation applied to a vector $\bm{x}$ at position $i$, so that $f(\bm{q}, i)$ is the RoPE-rotated query at position $i$ and $f(\bm{k}, j)$ is the RoPE-rotated key at position $j$. The notation $\langle \cdot, \cdot \rangle$ denotes the inner product between the two position-aware vectors, and $R(\Delta)$ is the rotation corresponding to the relative offset $\Delta = i - j$. This equation shows that the inner product depends only on the vectors $\bm{q}$, $\bm{k}$, and the relative distance between them. Chen et al. ([2023a](https://arxiv.org/html/2606.27705#bib.bib40)) showed that the context window can be extended by applying a scaling factor $s$ to the position index as follows:

$$f'(\bm{x}, i) = f(\bm{x}, i/s) \tag{2}$$

We further show that the scaling factors can mitigate long-term decay and induce diverse attention patterns (Appendix [A](https://arxiv.org/html/2606.27705#A1)). Accordingly, our goal is to search for a unique scaling factor $s$ for each layer to combine information from multiple scaled RoPEs, alleviating long-term decay and attention wave effects, and thereby reducing positional bias.
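Equations (1) and (2) can be checked numerically. The sketch below is a minimal NumPy illustration, not the implementation used in the experiments: the helper name `rope_rotate`, the interleaved pairing of dimensions, the base of 10000, and the vector size are all assumptions made for the example.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0, scale=1.0):
    """Rotate x as RoPE would at position pos / scale (interleaved-pair convention)."""
    d = x.shape[-1]
    # One rotation frequency per pair of dimensions.
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = (pos / scale) * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Equation (1): the score depends only on the offset i - j (here 6 in both cases).
score_a = rope_rotate(q, 10) @ rope_rotate(k, 4)
score_b = rope_rotate(q, 106) @ rope_rotate(k, 100)

# Equation (2): with s = 2, an offset of 12 behaves like an unscaled offset of 6.
score_scaled = rope_rotate(q, 112, scale=2.0) @ rope_rotate(k, 100, scale=2.0)

print(np.isclose(score_a, score_b), np.isclose(score_a, score_scaled))
```

Dividing positions by $s$ compresses relative distances by the same factor, which is why a layer with a larger scaling factor sees distant tokens as if they were closer.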

We model the relationship between layer depth and scaling factors using Bézier curves, which drastically reduce the search space by determining all layer scales from a few control points. The details are analyzed in Appendix [B](https://arxiv.org/html/2606.27705#A2). Furthermore, in Section [4.2](https://arxiv.org/html/2606.27705#S4.SS2), we show that the smooth and continuous nature of curve-based modeling preserves layer-wise representational structure. Brute-force search demonstrates that smooth scaling naturally emerges as a high-performing configuration, highlighting continuity across layers as a beneficial inductive bias.

As illustrated in Figure [2](https://arxiv.org/html/2606.27705#S2.F2), a Bézier curve can be viewed as a smooth curve that connects all the scaling factors in a two-dimensional plane. The problem of selecting scaling factors for all layers can then be transformed into searching for an appropriate Bézier curve. Fortunately, Bézier curves can model a wide variety of shapes using only a small set of discrete control points, which significantly reduces the search space. A Bézier curve of degree $d$, with $d+1$ control points, is defined as follows (Mortenson, [1999](https://arxiv.org/html/2606.27705#bib.bib26)):

$$B(t) = \sum_{k=0}^{d} b_{k}^{d}(t)\, P_{k}, \quad 0 \leq t \leq 1 \tag{3}$$

where $t$ is the parametric coordinate controlling a point's position along the curve, $P_{k}$ are the control points of the curve, and $b_{k}^{d}$ are the Bernstein basis polynomials, which are defined as:

$$b_{k}^{d}(t) = \frac{d!}{k!\,(d-k)!}\, t^{k} (1-t)^{d-k}, \quad k = 0, \dots, d \tag{4}$$

Once a Bézier curve is determined, the scaling factor $s_{h}$ for layer $h$ can be computed as follows:

$$s_{h} = \mathrm{proj}_{y}\left[ B(t(x_{h})) \right] \tag{5}$$

where the notation $\mathrm{proj}_{y}[\cdot]$ denotes the operation of extracting the $y$-coordinate of a two-dimensional point. The function $t(\cdot)$ maps $x_{h}$ to the corresponding parameter $t$ (see Appendix [D](https://arxiv.org/html/2606.27705#A4)), where $x_{h}$ represents the position of layer $h$ within the evenly spaced $x$-coordinates defined by the minimum and maximum values of the control points. The value of $x_{h}$ can be computed by:

$$x_{h} = P_{0}^{x} + \frac{P_{d}^{x} - P_{0}^{x}}{L-1} \cdot h, \quad h = 0, \dots, L-1 \tag{6}$$

where $L$ denotes the number of layers in a network, and $P_{k}^{x}$ is the $x$-coordinate of the $k$-th control point of the Bézier curve.
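The mapping from control points to per-layer scaling factors in Equations (3) to (6) can be sketched as follows. The inversion $t(x_h)$ is described in Appendix D, which is not reproduced here, so the bisection in `t_of_x` is an assumption; it is valid whenever the control-point $x$-coordinates increase, because $x(t)$ is then monotonic in $t$. The control-point values are hypothetical.

```python
from math import comb

def bezier(t, points):
    """Evaluate a Bezier curve (Equations 3-4) at parameter t; points are (x, y) pairs."""
    d = len(points) - 1
    x = y = 0.0
    for k, (px, py) in enumerate(points):
        b = comb(d, k) * t**k * (1 - t) ** (d - k)  # Bernstein basis b_k^d(t)
        x += b * px
        y += b * py
    return x, y

def t_of_x(x_target, points, iters=60):
    """Invert x(t) by bisection; assumes control-point x-coordinates increase with k."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bezier(mid, points)[0] < x_target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def layer_scales(points, num_layers):
    """Equations (5)-(6): read one scaling factor per layer off the curve's y-coordinate."""
    x0, xd = points[0][0], points[-1][0]
    scales = []
    for h in range(num_layers):
        x_h = x0 + (xd - x0) / (num_layers - 1) * h   # evenly spaced layer positions
        scales.append(bezier(t_of_x(x_h, points), points)[1])
    return scales

# Hypothetical control points for a 32-layer model (illustrative values only).
control_points = [(0.0, 1.2), (8.0, 1.9), (20.0, 1.1), (31.0, 1.6)]
scales = layer_scales(control_points, num_layers=32)
print([round(s, 3) for s in scales[:4]], "...", round(scales[-1], 3))
```

Because a Bézier curve stays inside the convex hull of its control points, every scaling factor produced this way lies between the smallest and largest control-point $y$-values.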

Given a training dataset $\mathcal{D} = \{(x_{i}, y_{i})\}_{i=1}^{N}$ consisting of $N$ examples, where $x_{i}$ is an input to the large language model and $y_{i}$ is the corresponding ground-truth output, our goal is to maximize the following function:

$$\mathcal{L}_{\mathcal{D}}(\bm{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\{\text{LLM}(x_{i}, \bm{\theta}) \simeq y_{i}\} \tag{7}$$

where $\bm{\theta} = (P_{0}, \dots, P_{d})$ denotes the set of control points defining a Bézier curve of degree $d$ (each control point $P_{k}$ is a two-dimensional point), $\text{LLM}(x_{i}, \bm{\theta})$ denotes the output of a language model given input $x_{i}$, with all scaling factors determined according to Equation ([5](https://arxiv.org/html/2606.27705#S3.E5)) based on the Bézier curve specified by $\bm{\theta}$, and $\mathbb{I}\{\cdot\}$ is an indicator function with binary output $0$ or $1$. We constructed the training dataset such that the content containing information useful for generating correct answers appears at varying positions within the input, thereby encouraging the model to distribute its attention more evenly across the input.

### 3.2 Optimization Algorithm

We can regard $\bm{\theta} = (P_{0}, \dots, P_{d})$ as a set of newly introduced hyper-parameters that influence the behavior of an LLM. Each $P_{k}$ is a two-dimensional vector whose $x$- and $y$-coordinates can take multiple values. Even though Bézier curves of degree $d = 3$, which have $d + 1 = 4$ control points, are capable of representing a wide variety of curves, selecting suitable control points is a combinatorial optimization problem that is difficult to solve using gradient-based methods (Appendix [C](https://arxiv.org/html/2606.27705#A3)). Due to the high complexity of the search space, a brute-force approach for determining the scaling factors across layers is intractable; instead, we employ a specialized genetic algorithm to optimize the control points of the Bézier curves.

In our genetic algorithm, each individual is represented as $(P_{0}^{x}, P_{0}^{y}, \dots, P_{d}^{x}, P_{d}^{y})$, where $P_{k}^{x}$ and $P_{k}^{y}$ denote the $x$- and $y$-coordinates of the $k$-th control point, and corresponds to a specific Bézier curve. The initial population is constructed as follows. First, we initialize an individual in which the $k$-th control point is generated by:

$$\left( k(L-1)/d,\; 1.5 \right), \quad k \in \{0, \dots, d\} \tag{8}$$

where $L$ is the number of layers in a network. Based on the empirical results reported by Zhang et al. ([2024](https://arxiv.org/html/2606.27705#bib.bib52)), we set the $y$-coordinate values of all control points to $1.5$. Subsequently, the remaining individuals are generated by applying a mutation operator (described below) to this initial individual until the population reaches the predefined size.

The fitness of an individual $\bm{\theta} = (P_{0}^{x}, P_{0}^{y}, \dots, P_{d}^{x}, P_{d}^{y})$ is evaluated by configuring the layer-wise scaling factors of an LLM according to $\bm{\theta}$, running the LLM on a dataset $\mathcal{D}$, and calculating the resulting score $\mathcal{L}_{\mathcal{D}}(\bm{\theta})$ as defined in Equation ([7](https://arxiv.org/html/2606.27705#S3.E7)). When constructing the training dataset, we deliberately vary the position of relevant context within the input, which can generally be categorized into three types: the query-relevant content appears at the beginning, middle, or end of the input sequence. We denote these three corresponding sub-datasets as $\mathcal{D}_{\text{B}}$, $\mathcal{D}_{\text{M}}$, and $\mathcal{D}_{\text{E}}$, respectively. Considering that original LLMs tend to allocate attention unevenly across different positions, we introduce three weights to reflect the relative importance of these sub-datasets when optimizing the model's scaling factors. The final fitness of an individual is then computed as $\lambda_{\text{B}} \mathcal{L}_{\mathcal{D}_{\text{B}}}(\bm{\theta}) + \lambda_{\text{M}} \mathcal{L}_{\mathcal{D}_{\text{M}}}(\bm{\theta}) + \lambda_{\text{E}} \mathcal{L}_{\mathcal{D}_{\text{E}}}(\bm{\theta})$, where $\lambda_{\text{B}} \geq 0$, $\lambda_{\text{M}} \geq 0$, $\lambda_{\text{E}} \geq 0$, and $\lambda_{\text{B}} + \lambda_{\text{M}} + \lambda_{\text{E}} = 1$.
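The fitness computation can be sketched in a few lines. The matching relation $\simeq$ in Equation (7) is not spelled out in this section, so the substring test used by `accuracy` is an assumption; the default weights are the values of $\lambda_{\text{B}}$, $\lambda_{\text{M}}$, and $\lambda_{\text{E}}$ reported in the experimental setup, and the example predictions are invented for illustration.

```python
def accuracy(predictions, references, match=lambda pred, ref: ref in pred):
    """Equation (7): fraction of outputs that match their reference.

    The default `match` (substring containment) is an assumed stand-in for the
    paper's matching relation.
    """
    hits = sum(1 for pred, ref in zip(predictions, references) if match(pred, ref))
    return hits / len(references)

def weighted_fitness(acc_begin, acc_middle, acc_end, weights=(0.2, 0.3, 0.5)):
    """Combine the accuracies on D_B, D_M, and D_E with non-negative weights summing to 1."""
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    lam_b, lam_m, lam_e = weights
    return lam_b * acc_begin + lam_m * acc_middle + lam_e * acc_end

# Invented toy outputs, used only to exercise the functions.
preds = ["The answer is Paris.", "I think it is 42", "unknown"]
refs = ["Paris", "42", "Berlin"]
print(accuracy(preds, refs))
print(weighted_fitness(acc_begin=1.0, acc_middle=0.5, acc_end=0.0))
```

In the actual search each accuracy term requires running the LLM on the corresponding sub-dataset, so fitness values are expensive and worth caching per individual.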

The crossover operator is performed by randomly selecting a pair of individuals with relatively high fitness scores as parents, choosing a single crossover point at random, and exchanging this point between the parents\. This process produces two offspring, from which we retain only the one with the higher fitness\.
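One reading of this crossover step is sketched below: a single control-point index is drawn at random and the control points at that index are exchanged between the two parents. Whether the exchange involves one point or a whole tail of the individual is not specified here, so this interpretation, the list-of-`(x, y)` representation, and the toy fitness function are assumptions. Offspring validity checks are left out of this sketch.

```python
import random

def crossover(parent_a, parent_b, rng=random):
    """Exchange the control point at one random index between two parents."""
    idx = rng.randrange(len(parent_a))
    child_a, child_b = list(parent_a), list(parent_b)
    child_a[idx], child_b[idx] = parent_b[idx], parent_a[idx]
    return child_a, child_b

def best_offspring(parent_a, parent_b, fitness_fn, rng=random):
    """Produce two offspring and keep only the fitter one."""
    return max(crossover(parent_a, parent_b, rng), key=fitness_fn)

# Toy parents and a toy fitness (sum of y-values), for illustration only.
parent_a = [(0.0, 1.0), (10.0, 1.0), (20.0, 1.0), (31.0, 1.0)]
parent_b = [(0.0, 2.0), (10.0, 2.0), (20.0, 2.0), (31.0, 2.0)]
toy_fitness = lambda ind: sum(y for _, y in ind)
child = best_offspring(parent_a, parent_b, toy_fitness, random.Random(0))
print(child)
```

Keeping only the fitter of the two offspring doubles the number of fitness evaluations per crossover, which matters when each evaluation is a full LLM run.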

The mutation operator modifies $P^{x}$ and $P^{y}$ within a specified range to prevent excessive variations in the resulting curve, as shown in Equations ([9](https://arxiv.org/html/2606.27705#S3.E9)) and (10). Let $M_{x}$ and $M_{y}$ denote the maximum allowable change for the $x$- and $y$-coordinate, respectively. After mutation, the $k$-th control point $(\hat{P}_{k}^{x}, \hat{P}_{k}^{y})$ of an individual must remain within the following range:

$$\hat{P}_{k}^{x} \in \begin{cases} \bigl[\max(0, P_{k}^{x} - M_{x}),\; \min(P_{k+1}^{x}, P_{k}^{x} + M_{x})\bigr], & k = 0, \\ \bigl[\max(P_{k-1}^{x}, P_{k}^{x} - M_{x}),\; \min(P_{k+1}^{x}, P_{k}^{x} + M_{x})\bigr], & 0 < k < d, \\ \bigl[\max(P_{k-1}^{x}, P_{k}^{x} - M_{x}),\; \min(P_{k}^{x} + M_{x}, L-1)\bigr], & k = d, \end{cases} \tag{9}$$

$$\hat{P}_{k}^{y} \in \bigl[\max(1, P_{k}^{y} - M_{y}),\; \min(P_{k}^{y} + M_{y}, 2)\bigr], \quad 0 \leq k \leq d. \tag{10}$$

To ensure the smoothness of the curve and prevent undesirable abrupt changes in the scaling factor (Ding et al., [2024](https://arxiv.org/html/2606.27705#bib.bib44)), the $x$-coordinate values of all control points must increase monotonically. The following condition should therefore be satisfied when performing either crossover or mutation operations. Let $P_{i}^{x}$ and $P_{j}^{x}$ denote the $x$-coordinates of the $i$-th and $j$-th control points, respectively. Their relationship is required to satisfy:

$$0 \leq P_{i}^{x} < P_{j}^{x} \leq L-1 \quad \text{if} \quad i < j \tag{11}$$

Offspring that fail to meet the above condition are discarded, and the crossover or mutation process is repeated until the condition is satisfied.
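The constraints in Equations (8) to (11) can be sketched as follows. Several details are assumptions made for the example: one randomly chosen control point is mutated per call, new coordinates are drawn uniformly from the allowed interval, and the values of `max_dx`, `max_dy`, and the population size are hypothetical.

```python
import random

def is_valid(points, num_layers):
    """Equation (11): x-coordinates strictly increase and stay inside [0, L - 1]."""
    xs = [x for x, _ in points]
    in_range = xs[0] >= 0 and xs[-1] <= num_layers - 1
    return in_range and all(a < b for a, b in zip(xs, xs[1:]))

def mutate(points, num_layers, max_dx, max_dy, rng=random):
    """Resample one control point inside the ranges of Equations (9)-(10)."""
    d = len(points) - 1
    while True:  # retry until the offspring satisfies Equation (11)
        k = rng.randrange(d + 1)
        x, y = points[k]
        x_lo = max(0.0 if k == 0 else points[k - 1][0], x - max_dx)
        x_hi = min(num_layers - 1 if k == d else points[k + 1][0], x + max_dx)
        y_lo, y_hi = max(1.0, y - max_dy), min(y + max_dy, 2.0)
        child = list(points)
        child[k] = (rng.uniform(x_lo, x_hi), rng.uniform(y_lo, y_hi))
        if is_valid(child, num_layers):
            return child

def initial_population(num_layers, degree, size, max_dx, max_dy, rng=random):
    """Equation (8): a flat seed curve at y = 1.5, plus mutated copies of it."""
    seed = [(k * (num_layers - 1) / degree, 1.5) for k in range(degree + 1)]
    population = [seed]
    while len(population) < size:
        population.append(mutate(seed, num_layers, max_dx, max_dy, rng))
    return population

# Hypothetical settings: 32 layers, cubic curve, 8 individuals.
population = initial_population(32, 3, 8, max_dx=2.0, max_dy=0.2, rng=random.Random(0))
print(population[0])
print(population[1])
```

Bounding each mutated $x$-coordinate by its neighbours makes violations of Equation (11) rare, so the retry loop almost always exits on the first attempt.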

Starting with the initial population, individuals are selected based on their fitness, followed by the application of the crossover and mutation operators. This process is repeated iteratively until the maximum number of generations is reached. The complete process is summarized in Algorithm [1](https://arxiv.org/html/2606.27705#alg1).
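Algorithm 1 is not reproduced in this section, so the loop below is only a generic sketch of the select, recombine, and mutate cycle described above. The truncation selection with `keep_frac`, the toy individuals (plain lists of $y$-values), and the toy fitness with its `TARGET` optimum are all assumptions made so that the example runs without an LLM.

```python
import random

def evolve(population, fitness_fn, crossover_fn, mutate_fn, generations, rng, keep_frac=0.5):
    """Generic loop: keep the fittest individuals, refill by crossover and mutation."""
    size = len(population)
    for _ in range(generations):
        ranked = sorted(population, key=fitness_fn, reverse=True)
        parents = ranked[: max(2, int(size * keep_frac))]
        next_gen = list(parents)  # the fittest individuals survive unchanged
        while len(next_gen) < size:
            a, b = rng.sample(parents, 2)
            next_gen.append(mutate_fn(crossover_fn(a, b, rng), rng))
        population = next_gen
    return max(population, key=fitness_fn)

TARGET = 1.7  # hypothetical optimum, used only by the toy fitness below

def toy_fitness(ind):
    return -sum((y - TARGET) ** 2 for y in ind)

def toy_crossover(a, b, rng):
    idx = rng.randrange(len(a))
    child = list(a)
    child[idx] = b[idx]
    return child

def toy_mutate(ind, rng, step=0.1):
    idx = rng.randrange(len(ind))
    child = list(ind)
    child[idx] = min(2.0, max(1.0, child[idx] + rng.uniform(-step, step)))
    return child

start = [[1.5] * 4 for _ in range(8)]
best = evolve(start, toy_fitness, toy_crossover, toy_mutate, generations=30, rng=random.Random(0))
print(best, toy_fitness(best))
```

Because the selected parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next in this sketch.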

## 4 Experiments

The experiments are divided into three parts\. First, we evaluate the impact of LPES on context utilization, inference latency, and general capabilities\. The results show that LPES improves context utilization while preserving general capabilities and introducing no additional inference latency\. Second, we analyze the effectiveness of curve\-based modeling for layer\-wise scaling factors from the perspectives of inter\-layer representational structure and inductive bias\. Third, we conduct ablation studies to examine the effects of curve types and the number of control points\.

### 4.1 Boosting Context Utilization

Accuracy (%) by position of the relevant information. MDQA columns report multi-document question answering; KV columns report key-value retrieval.

| Model | Method | MDQA 0% | MDQA 25% | MDQA 50% | MDQA 75% | MDQA 100% | MDQA Avg. | KV 0% | KV 20% | KV 40% | KV 60% | KV 80% | KV 100% | KV Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vicuna-7B-v1.5 | Baseline | 70.4 | 58.0 | 55.4 | 55.4 | 60.4 | 59.9 | 95.2 | 71.6 | 81.0 | 79.0 | 77.4 | 73.4 | 80.9 |
| Vicuna-7B-v1.5 | Positional Interpolation | 71.2 | 59.6 | 58.8 | 56.4 | 56.2 | 60.4 | 98.6 | 92.8 | 83.8 | 90.0 | 85.8 | 83.0 | 89.0 |
| Vicuna-7B-v1.5 | Attention Buckets | 72.6 | 61.4 | 60.6 | 60.8 | 59.6 | 63.0 | 100 | 94.6 | 88.6 | 91.6 | 87.6 | 65.8 | 88.0 |
| Vicuna-7B-v1.5 | Ms-PoE | 72.6 | 61.4 | 61.8 | 62.0 | 59.0 | 63.5 | 95.2 | 63.2 | 84.8 | 91.6 | 87.4 | 77.8 | 83.3 |
| Vicuna-7B-v1.5 | MoICE | 71.6 | 61.2 | 60.6 | 60.8 | 62.4 | 63.3 | 100 | 93.2 | 90.2 | 87.4 | 89.4 | 70.0 | 88.4 |
| Vicuna-7B-v1.5 | LPES (Ours) | 71.4 | 62.2 | 62.0 | 61.0 | 61.6 | 63.6 | 99.4 | 92.8 | 87.8 | 93.6 | 90.4 | 88.8 | 92.1 |
| StableBeluga-7B | Baseline | 67.8 | 59.2 | 59.6 | 59.4 | 68.2 | 62.8 | 90.2 | 34.2 | 44.0 | 16.6 | 59.8 | 79.4 | 54.0 |
| StableBeluga-7B | Positional Interpolation | 69.6 | 58.6 | 58.2 | 60.0 | 65.4 | 62.4 | 95.2 | 53.6 | 31.8 | 28.6 | 61.6 | 83.6 | 59.1 |
| StableBeluga-7B | Attention Buckets | 69.2 | 59.0 | 59.8 | 59.2 | 67.4 | 63.0 | 100 | 79.8 | 54.4 | 58.2 | 68.4 | 89.2 | 75.6 |
| StableBeluga-7B | Ms-PoE | 68.4 | 57.0 | 60.2 | 61.0 | 68.4 | 63.0 | 90.2 | 27.2 | 27.6 | 59.4 | 70.4 | 89.0 | 60.6 |
| StableBeluga-7B | MoICE | 67.4 | 60.0 | 60.2 | 60.0 | 68.6 | 63.2 | 99.8 | 71.2 | 52.2 | 54.8 | 74.4 | 91.4 | 74.0 |
| StableBeluga-7B | LPES (Ours) | 68.8 | 60.0 | 60.8 | 61.0 | 68.2 | 64.5 | 99.2 | 82.4 | 57.2 | 56.2 | 70.4 | 89.6 | 75.8 |
| Qwen2.5-7B | Baseline | 69.4 | 61.0 | 62.6 | 58.6 | 63.6 | 63.0 | 99.8 | 88.6 | 92.6 | 90.6 | 99.0 | 99.2 | 95.0 |
| Qwen2.5-7B | Positional Interpolation | 68.6 | 62.0 | 62.2 | 58.4 | 64.0 | 63.0 | 100 | 93.2 | 91.2 | 88.6 | 98.6 | 99.0 | 95.1 |
| Qwen2.5-7B | Attention Buckets | 69.6 | 62.2 | 63.0 | 60.2 | 62.0 | 63.4 | 100 | 89.2 | 91.4 | 91.6 | 98.2 | 99.2 | 94.9 |
| Qwen2.5-7B | Ms-PoE | 69.4 | 61.8 | 63.4 | 60.2 | 61.4 | 63.2 | 100 | 94.2 | 91.2 | 93.6 | 98.0 | 99.2 | 96.0 |
| Qwen2.5-7B | MoICE | 68.4 | 61.2 | 63.0 | 61.0 | 63.8 | 63.5 | 99.8 | 88.0 | 92.6 | 91.6 | 99.0 | 99.4 | 95.1 |
| Qwen2.5-7B | LPES (Ours) | 69.6 | 64.8 | 69.2 | 63.0 | 65.4 | 66.4 | 99.8 | 97.4 | 93.2 | 94.0 | 99.2 | 99.2 | 97.1 |

Table 1: Comparison of accuracy across varying positions of relevant information (e.g., 50% denotes the middle) against established baselines. LPES consistently exceeds all baseline performance, validating its efficacy in neutralizing positional bias.

#### Base Models

We selected three RoPE-based LLMs for our experiments: Vicuna-7B-v1.5 (Chiang et al., [2023](https://arxiv.org/html/2606.27705#bib.bib38)) and StableBeluga-7B (Mahan et al., [2023](https://arxiv.org/html/2606.27705#bib.bib34)), each with a 4k-token context window, as well as Qwen2.5-7B (Yang et al., [2024](https://arxiv.org/html/2606.27705#bib.bib60)), which supports a 130k-token context window.

#### Benchmarks

MDQA (Liu et al., [2024b](https://arxiv.org/html/2606.27705#bib.bib33)) is a popular multi-document question answering dataset. The key-value retrieval dataset (Liu et al., [2024b](https://arxiv.org/html/2606.27705#bib.bib33)) features unique UUID key-value pairs, making it well suited to evaluating the extraction of relevant information. ZeroSCROLLS (Shaham et al., [2023](https://arxiv.org/html/2606.27705#bib.bib36)) includes multiple open-ended long-text tasks, with sub-datasets and metrics summarized in Table [14](https://arxiv.org/html/2606.27705#A7.T14). For closed-ended tasks, L-Eval (An et al., [2023](https://arxiv.org/html/2606.27705#bib.bib21)) is used, as outlined in Table [13](https://arxiv.org/html/2606.27705#A7.T13) (Appendix [G](https://arxiv.org/html/2606.27705#A7)). Finally, MMLU (Hendrycks et al., [2020](https://arxiv.org/html/2606.27705#bib.bib61)) and C-Eval (Huang et al., [2023](https://arxiv.org/html/2606.27705#bib.bib62)) assess generalization ability across various tasks.

#### Baselines

Positional Interpolation (PI) uses a single layer-agnostic scaling factor, set to the mean of the searched layer-wise scaling factors (Chen et al., [2023a](https://arxiv.org/html/2606.27705#bib.bib40)). Attention Buckets performs multiple forward passes, each using a different RoPE base, and then aggregates the information from these passes (Chen et al., [2023b](https://arxiv.org/html/2606.27705#bib.bib19)). Ms-PoE assigns scaling factors ranging from 1.2 to 1.8 to attention heads based on their sensitivity to relevant information (Zhang et al., [2024](https://arxiv.org/html/2606.27705#bib.bib52)). Building on the work of Chen et al. ([2023b](https://arxiv.org/html/2606.27705#bib.bib19)), MoICE computes attention scores using seven different RoPE bases and then performs a weighted sum of these scores using learned weights (Lin et al., [2024](https://arxiv.org/html/2606.27705#bib.bib20)).

#### Experimental Setup

For LLMs with a 4k-token context window, we use 10 MDQA documents or 50 key-value pairs as context. To evaluate positional bias under longer contexts, Qwen2.5-7B is provided with 20 MDQA documents or 150 key-value pairs, and its accuracy is measured as the ground-truth information appears at different positions within the context. For ZeroSCROLLS and L-Eval, the context window is set to 3,584 tokens, with a maximum of 512 decoded tokens. We additionally report the performance of LPES under a 16K context window in Appendix [H](https://arxiv.org/html/2606.27705#A8). To assess transferability, scaling factors learned on MDQA are applied unchanged to ZeroSCROLLS and L-Eval; the MMLU and C-Eval benchmarks are additionally used to measure generalization ability.

During optimization, $\lambda_{\text{B}}$, $\lambda_{\text{M}}$, and $\lambda_{\text{E}}$ are set to 0.2, 0.3, and 0.5, respectively, with detailed analysis provided in Appendix [E](https://arxiv.org/html/2606.27705#A5). Layer-wise scaling factors are learned by searching the control points of cubic Bézier curves using 200 samples from the MDQA or key-value retrieval datasets, and are evaluated on 500 held-out samples per dataset.

#### Result Analysis

Layer-specific positional embedding scaling greatly mitigates position bias. Table [1](https://arxiv.org/html/2606.27705#S4.T1) shows that LPES consistently outperforms the baselines in average performance across different positions, notably boosting Vicuna's average accuracy in key-value retrieval by 11.2 points (from 80.9% to 92.1%). LPES demonstrates strong transferability when applying MDQA-optimized scaling factors to ZeroSCROLLS and L-Eval (Table [2](https://arxiv.org/html/2606.27705#S4.T2)). The results confirm that these factors generalize robustly across diverse models and long-text tasks. Furthermore, the results on longer context windows and larger model scales (detailed in Appendix [H](https://arxiv.org/html/2606.27705#A8)) further validate the applicability of LPES. Additionally, LPES preserves the model's general capabilities, as shown in Table [4](https://arxiv.org/html/2606.27705#S4.T4). Notably, the scaling factors are treated as hyperparameters rather than trainable model parameters. Their adjustment is therefore considered optimization rather than training. In machine learning, training typically refers to updating model weights and biases using gradient-based methods. In contrast, our approach does not modify model parameters and can thus be regarded as a training-free method. This property avoids the catastrophic forgetting (De Lange et al., [2021](https://arxiv.org/html/2606.27705#bib.bib74)) caused by large-scale parameter updates and makes the method particularly suitable for already deployed models.

LPES yields a more balanced attention distribution without additional inference cost. Because Ms-PoE and MoICE are sample-dependent, their scaling factors cannot be precomputed and must be determined for each input. Specifically, Ms-PoE entails an additional attention pass to assess head sensitivity, whereas MoICE requires parallel computations across seven RoPE bases alongside serial routing weight calculations.

To demonstrate the advantage in inference efficiency, we sample $500$ examples from the MDQA dataset and report the average inference time of Vicuna on a single H100 GPU. For a fair comparison, FlashAttention-2 (Dao, [2023](https://arxiv.org/html/2606.27705#bib.bib67)) was used as the attention backend for all methods. As shown in Table [4](https://arxiv.org/html/2606.27705#S4.T4), LPES is roughly $1.45\times$ faster than Ms-PoE and $2.42\times$ faster than MoICE.

| Model | Method | GovRpt | Qasper | SumScrFd | Qmsum | NarrQA | Squality | SpcDgst | Avg. (open-ended) | Coursera | QuALITY | TOEFL | SFiction | Avg. (closed-ended) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vicuna-7B-v1.5 | Baseline | 18.44 | 22.82 | 18.42 | 14.50 | 10.98 | 16.56 | 21.39 | 16.91 | 37.21 | 38.12 | 38.00 | 57.90 | 42.81 |
| Vicuna-7B-v1.5 | MoICE | 22.29 | 32.34 | 13.31 | 14.79 | 13.61 | 16.22 | 22.60 | 19.30 | 42.35 | 43.71 | 39.33 | 57.20 | 45.65 |
| Vicuna-7B-v1.5 | LPES (Ours) | 21.47 | 33.37 | 14.39 | 15.53 | 11.52 | 16.91 | 22.24 | 19.35 | 40.41 | 42.57 | 40.67 | 58.20 | 45.46 |
| Qwen2.5-7B | Baseline | 24.76 | 22.92 | 14.69 | 16.25 | 9.78 | 14.85 | 53.66 | 22.42 | 45.47 | 62.43 | 66.00 | 60.87 | 58.69 |
| Qwen2.5-7B | MoICE | 25.56 | 23.51 | 15.12 | 23.19 | 10.64 | 16.92 | 53.81 | 24.11 | 48.13 | 64.28 | 67.33 | 66.00 | 61.44 |
| Qwen2.5-7B | LPES (Ours) | 27.56 | 23.91 | 16.18 | 23.19 | 11.97 | 14.92 | 53.81 | 25.51 | 48.51 | 66.43 | 69.28 | 66.42 | 62.66 |

Table 2: Performance comparison on open-ended and closed-ended long-text benchmarks. Open-ended tasks are reported on the left, while closed-ended tasks are shown on the right.

Table 3: General capability of models equipped with LPES on MMLU and C-Eval datasets.

Table 4: Comparison of inference efficiency between LPES and baseline methods.

### 4\.2Motivation for Curve\-Based Modeling

#### Preserved Representational Structure

The smooth and continuous nature of the Bézier curve enforces gradual variations in scaling across layers, which helps preserve the coherence of the model's layer-wise representational structure. We compare against several intuitive baselines: *uniform scaling*, where all layers share a single scaling factor, namely the mean of the searched layer-wise scaling factors; *noisy Bézier scaling*, which adds independent uniform noise sampled from $\mathcal{U}(-0.1, 0.1)$ to each layer's Bézier-derived scale; *shuffled scaling*, which randomly permutes the layer-wise scaling factors while preserving their overall distribution; and *fully random scaling*, where each layer independently samples its scale from $\mathcal{U}(1, 2)$.
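The four baselines can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the `searched` list is a made-up smooth profile standing in for the Bézier-derived scales found by the search.

```python
import random

def uniform_scaling(scales):
    """All layers share the mean of the searched layer-wise factors."""
    mean = sum(scales) / len(scales)
    return [mean] * len(scales)

def noisy_scaling(scales, rng, eps=0.1):
    """Add independent U(-eps, eps) noise to each layer's scale."""
    return [s + rng.uniform(-eps, eps) for s in scales]

def shuffled_scaling(scales, rng):
    """Randomly permute the factors, preserving their distribution."""
    out = list(scales)
    rng.shuffle(out)
    return out

def random_scaling(num_layers, rng, low=1.0, high=2.0):
    """Each layer independently samples its scale from U(low, high)."""
    return [rng.uniform(low, high) for _ in range(num_layers)]

rng = random.Random(0)
# Hypothetical stand-in for searched scales on a 32-layer model.
searched = [1.0 + 0.5 * layer / 31 for layer in range(32)]
baselines = {
    "uniform": uniform_scaling(searched),
    "noisy": noisy_scaling(searched, rng),
    "shuffled": shuffled_scaling(searched, rng),
    "random": random_scaling(len(searched), rng),
}
```

Only the shuffled and uniform variants reuse the searched values; the noisy variant perturbs them and the random variant ignores them entirely.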

![Refer to caption](https://arxiv.org/html/2606.27705v1/x3.png)

Figure 3: Comparison of representational structure deviation under different scaling strategies. The vanilla Bézier curve achieves smaller representational deviation while effectively integrating information from multiple base RoPE configurations.

Inspired by RSA (Kriegeskorte et al., [2008](https://arxiv.org/html/2606.27705#bib.bib73)), which measures representational structure across input samples at a fixed layer using representational similarity matrices (RSM), we focus instead on the *layer dimension*. Specifically, we construct a layer-wise RSM by computing pairwise dot-product similarities between the last-token hidden representations of different layers. Given the set of hidden states $\{\mathbf{H}_l\}_{l=1}^{L}$, where $\mathbf{H}_l \in \mathbb{R}^{d}$ denotes the representation at layer $l$, the entries of the RSM are defined as:

$$\mathbf{RSM}_{ij} = \mathbf{H}_i^{\top}\mathbf{H}_j, \quad 1 \leq i, j \leq L, \tag{12}$$

where $\mathbf{RSM}_{ij}$ quantifies the similarity between the $i$-th and $j$-th layers. This matrix captures the global structural organization of representations across the model's depth. We quantify representational stability via the *representational structure deviation* $\mathcal{D}$, which measures the average absolute difference between the RSM of a perturbed model ($\mathbf{RSM}^{\text{p}}$) and that of a vanilla configuration without scaling ($\mathbf{RSM}^{\text{v}}$) as follows:

$$\mathcal{D} = \frac{1}{L^{2}} \sum_{i=1}^{L} \sum_{j=1}^{L} \left| \mathbf{RSM}_{ij}^{\text{p}} - \mathbf{RSM}_{ij}^{\text{v}} \right|, \tag{13}$$

This metric captures global changes in inter-layer relationships across layers, rather than at individual layers. Experiments are conducted using Vicuna-v1.5-7B on 500 randomly sampled MDQA examples, with results averaged over 16 random seeds. As shown in Figure [3](https://arxiv.org/html/2606.27705#S4.F3), the smooth Bézier-based scaling consistently yields smaller structural deviations, indicating minimal disruption to the model's internal representations.
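As a small worked example of Equations 12 and 13, the sketch below computes a layer-wise RSM and the deviation $\mathcal{D}$ for toy two-layer, two-dimensional hidden states. The function names and the toy vectors are illustrative assumptions; in the paper the inputs are last-token hidden states taken from each layer of the model.

```python
def layer_rsm(hidden):
    """Pairwise dot products between per-layer vectors (Eq. 12).

    hidden: list of L vectors, one last-token hidden state per layer.
    """
    return [[sum(a * b for a, b in zip(hi, hj)) for hj in hidden]
            for hi in hidden]

def structure_deviation(rsm_p, rsm_v):
    """Mean absolute entry-wise difference between two RSMs (Eq. 13)."""
    L = len(rsm_v)
    total = sum(abs(rsm_p[i][j] - rsm_v[i][j])
                for i in range(L) for j in range(L))
    return total / (L * L)

# Toy hidden states (hypothetical): two layers, d = 2.
vanilla = [[1.0, 0.0], [0.0, 1.0]]     # RSM = [[1, 0], [0, 1]]
perturbed = [[1.0, 0.0], [1.0, 1.0]]   # RSM = [[1, 1], [1, 2]]
d = structure_deviation(layer_rsm(perturbed), layer_rsm(vanilla))
print(d)  # (0 + 1 + 1 + 1) / 4 = 0.75
```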

![Refer to caption](https://arxiv.org/html/2606.27705v1/x4.png)

Figure 4: Trend of fitness and smoothness over brute-force search epochs. The population progressively evolves toward smoother scaling factors across layers, indicating that the smoothness of curves serves as an effective inductive bias.

#### Empirical Convergence Behavior

We conduct a brute-force genetic algorithm search using the fitness function defined above over $200$ randomly sampled MDQA inputs, where each individual represents a set of layer-wise scaling factors, initialized uniformly from $\mathcal{U}(1, 2)$ to avoid introducing any prior smoothness bias. During the search, we track both performance, measured by the average fitness of the best $1$, $2$, $4$, and $8$ individuals, and smoothness, quantified by a second-order metric:

$$\mathcal{S} = \frac{1}{L-2} \sum_{l=2}^{L-1} \left\lVert s_{l+1} - 2 s_{l} + s_{l-1} \right\rVert_{2}, \tag{14}$$

where $L$ is the total number of layers, and $s_l$ denotes the scaling factor at layer $l$. Here, $\mathcal{S}$ quantifies the local curvature, with smaller values indicating smoother transitions. As shown in Figure [4](https://arxiv.org/html/2606.27705#S4.F4), higher-performing configurations consistently exhibit lower $\mathcal{S}$, indicating that smooth variation emerges during the search. This finding suggests that smoothness constitutes a beneficial inductive bias, motivating the use of smooth curve modeling, such as Bézier curves, to efficiently parameterize high-performing scaling configurations.
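The smoothness metric of Equation 14 is a mean second-order difference. A minimal sketch, assuming scalar per-layer scales (so the 2-norm reduces to an absolute value) and using a hypothetical function name:

```python
def smoothness(scales):
    """Mean absolute second-order difference of layer-wise scales (Eq. 14)."""
    L = len(scales)
    if L < 3:
        raise ValueError("need at least three layers")
    # Paper indices l = 2..L-1 (1-based) map to 1..L-2 (0-based).
    total = sum(abs(scales[l + 1] - 2 * scales[l] + scales[l - 1])
                for l in range(1, L - 1))
    return total / (L - 2)

print(smoothness([1.0, 1.5, 2.0]))  # linear profile: 0.0
print(smoothness([1.0, 2.0, 1.0]))  # sharp peak: 2.0
```

A perfectly linear profile scores zero, while abrupt layer-to-layer changes raise the value.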

### 4\.3Ablation Studies

In this section, we present three ablation studies using Vicuna-v1.5-7B on MDQA. First, we demonstrate that Bézier curves outperform alternative curves in determining layer-specific scaling factors. Next, we examine the impact of control point counts on convergence quality and speed. We further provide an ablation study on the hyperparameter $\lambda$ in Appendix [E](https://arxiv.org/html/2606.27705#A5).

Table 5: Performance comparison of different curve types for determining layer-wise scaling factors. Bézier curves achieve superior performance.

#### Curve Type

Bézier curves provide a compact, low-dimensional parameterization capable of approximating a wide variety of curve shapes (Nuntawisuttiwong and Dejdumrong, [2021](https://arxiv.org/html/2606.27705#bib.bib22)). To demonstrate the advantages of Bézier curve modeling, we consider two alternative approaches with the same number of control points: linear interpolation between control points and step-function modeling based on these control points. Although these alternatives differ in their curve formulations, they also serve as layer-specific scaling strategies within our framework. Linear interpolation is slightly cheaper to compute, but the difference is negligible in the search procedure (Appendix [E](https://arxiv.org/html/2606.27705#A5)), so we ultimately adopt Bézier curves due to their superior performance. As shown in Table [5](https://arxiv.org/html/2606.27705#S4.T5), Bézier curves outperform other curve-fitting methods, and the minor additional cost required to determine the scaling factors is fully offset by the inference-time performance gains.
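For intuition, the two alternative curve types can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function names and control points are hypothetical, and the step function is assumed to hold the value of the nearest control point to the left, which the text does not specify.

```python
from bisect import bisect_right

def linear_scales(points, num_layers):
    """Piecewise-linear interpolation between control points (x_i, y_i)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    out = []
    for layer in range(num_layers):
        if layer <= xs[0]:
            out.append(ys[0])
        elif layer >= xs[-1]:
            out.append(ys[-1])
        else:
            k = bisect_right(xs, layer) - 1          # segment index
            w = (layer - xs[k]) / (xs[k + 1] - xs[k])
            out.append(ys[k] + w * (ys[k + 1] - ys[k]))
    return out

def step_scales(points, num_layers):
    """Step function: hold the value of the control point to the left (assumed)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [ys[max(bisect_right(xs, layer) - 1, 0)]
            for layer in range(num_layers)]

# Hypothetical control points for a 32-layer model.
points = [(0, 1.0), (10, 2.0), (20, 1.5), (31, 1.2)]
linear = linear_scales(points, 32)
steps = step_scales(points, 32)
```

Both variants use the same four control points as the Bézier parameterization, so the search space is identical and only the shape of the resulting layer profile differs.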

#### Number of Control Points

Setting the maximum iterations to $20$, we vary the number of control points to evaluate performance, measured by the mean and variance of accuracy across positions, and convergence speed. While more control points improve Bézier curve fitting precision and the likelihood of finding optimal scaling factors, they also expand the search space, slowing convergence. As shown in Table [6](https://arxiv.org/html/2606.27705#S4.T6), using four control points provides a favorable trade-off between performance and convergence speed. In contrast, brute-force search shows little tendency to converge within the limited number of iterations, further demonstrating the efficiency of our curve-constrained genetic algorithm.

Table 6: Performance comparison of different numbers of control points. Additional points improve accuracy and reduce positional bias, but slow convergence due to the increased computational cost of optimization. Overall, performance is largely insensitive to the number of control points used.

## 5Conclusion

We present layer-specific positional embedding scaling (LPES), a method that mitigates position bias in transformer-based LLMs by assigning distinct scaling factors to each layer, achieving balanced attention over the input without fine-tuning or extra latency. Optimal scaling factors are efficiently identified via a Bézier-constrained genetic algorithm, reducing the search space and converging with only a few hundred examples. Experiments show that LPES consistently improves long-context performance, preserves general capabilities, and requires only a single forward pass, achieving up to a $2.42\times$ speedup over MoICE and $1.45\times$ over Ms-PoE, which makes LPES a broadly applicable and efficient solution.

## Limitations

In this work, we adopt a training\-free strategy that assigns different scaling factors across layers to encourage a more balanced attention distribution\. While this design enables straightforward and efficient deployment, we do not explore the behavior of our method in training\-based settings\. Extending the proposed approach to training or fine\-tuning pipelines could potentially yield further gains, which we leave for future work\. Nevertheless, this limitation does not diminish the practical effectiveness or applicability of our method in real\-world scenarios\.

## References

- C\. An, S\. Gong, M\. Zhong, X\. Zhao, M\. Li, J\. Zhang, L\. Kong, and X\. Qiu \(2023\)L\-eval: instituting standardized evaluation for long context language models\.arXiv preprint arXiv:2307\.11088\.Cited by:[§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px2.p1.1)\.
- S\. Chen, S\. Wong, L\. Chen, and Y\. Tian \(2023a\)Extending context window of large language models via positional interpolation\.arXiv preprint arXiv:2306\.15595\.Cited by:[§3\.1](https://arxiv.org/html/2606.27705#S3.SS1.p1.11),[§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px3.p1.2)\.
- Y\. Chen, A\. Lv, T\. Lin, C\. Chen, Y\. Wu, F\. Huang, Y\. Li, and R\. Yan \(2023b\)Fortify the shortest stave in attention: enhancing context awareness of large language models for effective tool use\.arXiv preprint arXiv:2312\.04455\.Cited by:[Appendix A](https://arxiv.org/html/2606.27705#A1.p3.1),[§1](https://arxiv.org/html/2606.27705#S1.p2.1),[§2](https://arxiv.org/html/2606.27705#S2.p1.1),[§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px3.p1.2)\.
- W\. Chiang, Z\. Li, Z\. Lin, Y\. Sheng, Z\. Wu, H\. Zhang, L\. Zheng, S\. Zhuang, Y\. Zhuang, J\. E\. Gonzalez, et al\. \(2023\) Vicuna: an open\-source chatbot impressing gpt\-4 with 90%\* chatgpt quality\. See https://vicuna.lmsys.org \(accessed 14 April 2023\) 2\(3\), pp\. 6\. Cited by: [Appendix A](https://arxiv.org/html/2606.27705#A1.p2.1), [§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px1.p1.7)\.
- T\. Dao \(2023\)Flashattention\-2: faster attention with better parallelism and work partitioning\.arXiv preprint arXiv:2307\.08691\.Cited by:[§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px5.p3.3)\.
- M\. De Lange, R\. Aljundi, M\. Masana, S\. Parisot, X\. Jia, A\. Leonardis, G\. Slabaugh, and T\. Tuytelaars \(2021\)A continual learning survey: defying forgetting in classification tasks\.IEEE transactions on pattern analysis and machine intelligence44\(7\),pp\. 3366–3385\.Cited by:[§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px5.p1.1)\.
- Y\. Ding, L\. L\. Zhang, C\. Zhang, Y\. Xu, N\. Shang, J\. Xu, F\. Yang, and M\. Yang \(2024\)Longrope: extending llm context window beyond 2 million tokens\.arXiv preprint arXiv:2402\.13753\.Cited by:[Appendix B](https://arxiv.org/html/2606.27705#A2.p1.17),[Appendix C](https://arxiv.org/html/2606.27705#A3.p1.1),[§3\.2](https://arxiv.org/html/2606.27705#S3.SS2.p8.6)\.
- X\. Feng, X\. Feng, and B\. Qin \(2021\)A survey on dialogue summarization: recent advances and new frontiers\.arXiv preprint arXiv:2107\.03175\.Cited by:[§1](https://arxiv.org/html/2606.27705#S1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2020\)Measuring massive multitask language understanding\.arXiv preprint arXiv:2009\.03300\.Cited by:[§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px2.p1.1)\.
- Y\. Huang, Y\. Bai, Z\. Zhu, J\. Zhang, J\. Zhang, T\. Su, J\. Liu, C\. Lv, Y\. Zhang, Y\. Fu,et al\.\(2023\)C\-eval: a multi\-level multi\-discipline chinese evaluation suite for foundation models\.Advances in Neural Information Processing Systems36,pp\. 62991–63010\.Cited by:[§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px2.p1.1)\.
- N\. Kriegeskorte, M\. Mur, and P\. A\. Bandettini \(2008\)Representational similarity analysis\-connecting the branches of systems neuroscience\.Frontiers in systems neuroscience2,pp\. 249\.Cited by:[§4\.2](https://arxiv.org/html/2606.27705#S4.SS2.SSS0.Px1.p2.3)\.
- T\. Li, G\. Zhang, Q\. D\. Do, X\. Yue, and W\. Chen \(2024\)Long\-context llms struggle with long in\-context learning\.arXiv preprint arXiv:2404\.02060\.Cited by:[§1](https://arxiv.org/html/2606.27705#S1.p1.1)\.
- H\. Lin, A\. Lv, Y\. Song, H\. Zhu, R\. Yan,et al\.\(2024\)Mixture of in\-context experts enhance llms’ long context awareness\.Advances in Neural Information Processing Systems37,pp\. 79573–79596\.Cited by:[1st item](https://arxiv.org/html/2606.27705#S1.I1.i1.p1.2),[§1](https://arxiv.org/html/2606.27705#S1.p2.1),[§2](https://arxiv.org/html/2606.27705#S2.p1.1),[§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px3.p1.2)\.
- J\. Liu, C\. S\. Xia, Y\. Wang, and L\. Zhang \(2024a\)Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation\.Advances in Neural Information Processing Systems36\.Cited by:[§1](https://arxiv.org/html/2606.27705#S1.p1.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024b\)Lost in the middle: how language models use long contexts\.Transactions of the Association for Computational Linguistics12,pp\. 157–173\.Cited by:[§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px2.p1.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024c\)Lost in the middle: how language models use long contexts\.Transactions of the Association for Computational Linguistics12,pp\. 157–173\.Cited by:[§1](https://arxiv.org/html/2606.27705#S1.p1.1)\.
- D\. Mahan, R\. Carlow, L\. Castricato, N\. Cooper, and C\. Laforte \(2023\) Stable beluga models\. External Links: [Link](https://huggingface.co/stabilityai/StableBeluga2) Cited by: [§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px1.p1.7)\.
- M\.E\. Mortenson \(1999\)Mathematics for computer graphics applications\.G \- Reference,Information and Interdisciplinary Subjects Series,Industrial Press\.External Links:ISBN 9780831131111,LCCN 99010096,[Link](https://books.google.co.jp/books?id=YmQy799flPkC)Cited by:[§3\.1](https://arxiv.org/html/2606.27705#S3.SS1.p3.2)\.
- T\. Nuntawisuttiwong and N\. Dejdumrong \(2021\)An approximation of bézier curves by a sequence of circular arcs\.Information Technology and Control50\(2\),pp\. 213–223\.Cited by:[§4\.3](https://arxiv.org/html/2606.27705#S4.SS3.SSS0.Px1.p1.1)\.
- U\. Shaham, M\. Ivgi, A\. Efrat, J\. Berant, and O\. Levy \(2023\)ZeroSCROLLS: a zero\-shot benchmark for long text understanding\.arXiv preprint arXiv:2305\.14196\.Cited by:[Appendix A](https://arxiv.org/html/2606.27705#A1.p2.1),[§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px2.p1.1)\.
- N\. Shang, L\. L\. Zhang, S\. Wang, G\. Zhang, G\. Lopez, F\. Yang, W\. Chen, and M\. Yang \(2025\)LongRoPE2: near\-lossless llm context window scaling\.arXiv preprint arXiv:2502\.20082\.Cited by:[Appendix C](https://arxiv.org/html/2606.27705#A3.p1.1)\.
- J\. Su, Y\. Lu, S\. Pan, B\. Wen, and Y\. Liu \(2021\) RoFormer: enhanced transformer with rotary position embedding\. arXiv preprint arXiv:2104\.09864\. Cited by: [§1](https://arxiv.org/html/2606.27705#S1.p1.1)\.
- Q\. Sun, E\. Cetin, and Y\. Tang \(2025\)Transformer2: self\-adaptive llms\.arXiv preprint arXiv:2501\.06252\.Cited by:[Appendix C](https://arxiv.org/html/2606.27705#A3.p3.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale,et al\.\(2023\)Llama 2: open foundation and fine\-tuned chat models\.arXiv preprint arXiv:2307\.09288\.Cited by:[Appendix A](https://arxiv.org/html/2606.27705#A1.p2.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2024\)Qwen2\. 5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px1.p1.7)\.
- Y\. Zhang, A\. Ni, Z\. Mao, C\. H\. Wu, C\. Zhu, B\. Deb, A\. H\. Awadallah, D\. Radev, and R\. Zhang \(2021\) Summ^N: a multi\-stage summarization framework for long input dialogues and documents\. arXiv preprint arXiv:2110\.10150\. Cited by: [§1](https://arxiv.org/html/2606.27705#S1.p1.1)\.
- Z\. Zhang, R\. Chen, S\. Liu, Z\. Yao, O\. Ruwase, B\. Chen, X\. Wu, and Z\. Wang \(2024\)Found in the middle: how language models use long contexts better via plug\-and\-play positional encoding\.arXiv preprint arXiv:2403\.04797\.Cited by:[Appendix A](https://arxiv.org/html/2606.27705#A1.p1.2),[1st item](https://arxiv.org/html/2606.27705#S1.I1.i1.p1.2),[§1](https://arxiv.org/html/2606.27705#S1.p2.1),[§2](https://arxiv.org/html/2606.27705#S2.p1.1),[§3\.2](https://arxiv.org/html/2606.27705#S3.SS2.p2.10),[§4\.1](https://arxiv.org/html/2606.27705#S4.SS1.SSS0.Px3.p1.2)\.
- Q\. Zheng, X\. Xia, X\. Zou, Y\. Dong, S\. Wang, Y\. Xue, Z\. Wang, L\. Shen, A\. Wang, Y\. Li, et al\. \(2023\) CodeGeeX: a pre\-trained model for code generation with multilingual evaluations on humaneval\-x\. arXiv preprint arXiv:2303\.17568\. Cited by: [§1](https://arxiv.org/html/2606.27705#S1.p1.1)\.

## Appendix ALong\-Term Decay and Attention Wave in RoPE

Zhang et al. ([2024](https://arxiv.org/html/2606.27705#bib.bib52)) observed that the long-term decay of RoPE causes the model to focus more on the end of a sequence. As the relative distance grows, attention scores drop rapidly, leading the model to overemphasize nearby tokens during autoregressive decoding while neglecting distant ones. To mitigate this issue, they scale RoPE by a factor $s \geq 1$ (Figure [5](https://arxiv.org/html/2606.27705#A1.F5)), which effectively reduces the relative distance to $1/s$ of its original value (Figure [6](https://arxiv.org/html/2606.27705#A1.F6)). This adjustment slows the decay rate, enabling the model to attend not only to nearby tokens but also to more distant ones, particularly those in the middle of the sequence.
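The effect of the scaling factor can be checked numerically. The sketch below assumes the standard RoPE formulation with base 10000 and implements scaling by dividing the position index by $s$; the function names and toy vectors are illustrative, not taken from the paper's code.

```python
import math

def rope_rotate(vec, pos, scale=1.0, base=10000.0):
    """Apply RoPE to an even-length vector at position pos / scale."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = (pos / scale) * base ** (-i / d)   # frequency of this 2-D pair
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def score(q, k, m, n, scale=1.0):
    """Dot product between query at position m and key at position n."""
    qr = rope_rotate(q, m, scale)
    kr = rope_rotate(k, n, scale)
    return sum(a * b for a, b in zip(qr, kr))

q = [0.3, -1.2, 0.7, 0.5]   # toy query (hypothetical values)
k = [1.0, 0.4, -0.6, 0.9]   # toy key (hypothetical values)

# With s = 2, a relative distance of 40 behaves like a distance of 20.
scaled = score(q, k, 100, 60, scale=2.0)
unscaled = score(q, k, 20, 0, scale=1.0)
```

Because the rotary score depends only on the relative offset, `scaled` and `unscaled` agree up to floating-point error, which is the sense in which scaling shrinks the relative distance to $1/s$.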

To demonstrate that scaling RoPE can indeed enhance the model's attention to middle positions, we use Vicuna-7B-v1.5 (Chiang et al., [2023](https://arxiv.org/html/2606.27705#bib.bib38)) and LLaMA-2-7B-hf (Touvron et al., [2023](https://arxiv.org/html/2606.27705#bib.bib35)), which both consist of 32 transformer layers, to conduct experiments on the validation set of QMSum (Shaham et al., [2023](https://arxiv.org/html/2606.27705#bib.bib36)). We split the context into three parts and calculate the attention scores to the middle-part tokens at different scales. In Figure [7](https://arxiv.org/html/2606.27705#A1.F7), an increase in the scale factor leads to higher attention scores, demonstrating that scaling RoPE allows the model to focus more on middle-part content during autoregressive decoding.

Chen et al. ([2023b](https://arxiv.org/html/2606.27705#bib.bib19)) analyze the phenomenon of oscillatory "attention waves" in Transformer models, where attention fluctuates across tokens instead of being smoothly distributed. These oscillations, mainly induced by the mechanisms of RoPE, can cause the model to under-attend to important information located at attention troughs, limiting long-context utilization and potentially introducing instability. To address this issue, the authors propose the Attention Buckets approach, which runs multiple copies of the model in parallel with different RoPE bases and combines the decoded logits across these bases, producing complementary attention wave patterns. The method enhances the model's sensitivity to context across all positions.

![Refer to caption](https://arxiv.org/html/2606.27705v1/x5.png)

Figure 5: We obtain multi-scale RoPE by scaling the positional indices.

![Refer to caption](https://arxiv.org/html/2606.27705v1/x6.png)

Figure 6: The rapid decay of RoPE prioritizes local focus, and the attention waves may cause the model to overlook crucial information at attention troughs, whereas the scaling operation can slow this decay and generate diverse wave patterns.

![Refer to caption](https://arxiv.org/html/2606.27705v1/x7.png)

Figure 7: The attention score to the middle part across some layers. The scaling operation can enhance the model's attention to middle positions.

## Appendix BSearch Space and Time Complexity Analysis

We follow Ding et al. ([2024](https://arxiv.org/html/2606.27705#bib.bib44)), discretizing the continuous search space to enable more efficient searching. Assume the control points of the Bézier curve are $(P^{x}, P^{y})$, where $P^{x} \in [0, L-1]$ ($L$ is the number of scaled layers) and $P^{y} \in [1, 2]$. The values of $P^{x}$ are discretized with a step size of $1$, and the values of $P^{y}$ are discretized with a step size of $0.1$. Given that the model consists of $32$ layers, there are $32$ possible selections for $P^{x}$, while the scaling factor chosen from the $P^{y}$ set offers $11$ options, as shown in Table [7](https://arxiv.org/html/2606.27705#A2.T7). The total number of choices for the brute-force search is $11^{32}$. If a cubic Bézier curve is used, each control point has $32 \times 11 = 352$ possible combinations. With four control points, the total search space is $352^{4}$, which narrows the search space by a factor of more than $10^{20}$ compared to the brute-force search.
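The counts in this paragraph can be reproduced with exact integer arithmetic. The variable names below are illustrative; the numbers (32 layers, 11 scale options, four control points) come from the text.

```python
num_layers = 32        # discretized x positions, step size 1
y_options = 11         # 1.0, 1.1, ..., 2.0 with step size 0.1
control_points = 4     # cubic Bezier curve

brute_force = y_options ** num_layers     # one free scale per layer
per_point = num_layers * y_options        # choices per control point
bezier = per_point ** control_points      # Bezier-constrained space
reduction = brute_force // bezier         # integer lower bound on the ratio

print(per_point)               # 352
print(len(str(brute_force)))   # number of decimal digits in 11**32
print(reduction > 10 ** 20)    # True
```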

Table 7: Search space for the control point of Bézier curves.

In our method, the dominant cost of the genetic algorithm arises from evaluating the fitness function, which requires running model inference to assess the effectiveness of different scaling factors. In contrast, the computational overhead of other GA operations (such as assignment, mutation, and crossover) is negligible. Using 4$\times$H100 GPUs, we measured the per-epoch time cost of each operation as follows:

Table 8: Measured runtime per epoch of each operation in the genetic algorithm when using 4$\times$H100 GPUs. Model inference dominates the total cost.

Assume the algorithm runs for at most $M$ epochs and generates $N$ new individuals per epoch, and the search uses $S$ samples. Each individual requires three inference runs (placing the correct document at different positions). Thus, the total number of inference calls is $3NMS$. In practice, we perform data-parallel inference using $N_{\text{card}}$ GPUs with batch size $B$, which reduces the effective runtime to $O\!\left((3MNS)/(N_{\text{card}} \cdot B)\right)$.
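The cost estimate above is simple arithmetic, sketched here with hypothetical helper names. Only $S = 200$ and the factor of three come from the text; the values chosen for $M$, $N$, $N_{\text{card}}$, and $B$ in the example are placeholders for illustration.

```python
import math

def inference_calls(epochs, new_per_epoch, samples, positions=3):
    """Total inference calls: positions * M * N * S."""
    return positions * epochs * new_per_epoch * samples

def batched_forward_steps(calls, n_card, batch_size):
    """Sequential batched steps under data-parallel inference."""
    return math.ceil(calls / (n_card * batch_size))

# Illustrative values only (M, N, n_card, batch_size are assumptions).
calls = inference_calls(epochs=20, new_per_epoch=16, samples=200)
steps = batched_forward_steps(calls, n_card=4, batch_size=8)
print(calls, steps)  # 192000 6000
```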

Table 9:Hyperparameter settings of the constrained genetic algorithm
## Appendix CLimitations of Gradient\-Based Methods

We also attempted to determine the layer-specific scaling factors using gradient descent, but observed poor convergence behavior. This may also shed light on why LongRoPE (Ding et al., [2024](https://arxiv.org/html/2606.27705#bib.bib44)) and LongRoPE2 (Shang et al., [2025](https://arxiv.org/html/2606.27705#bib.bib71)) employ genetic algorithms rather than backpropagation to determine the scaling factors across RoPE dimensions. Although the genetic algorithm incurs higher computational overhead compared to directly optimizing hyperparameters via backpropagation, it consistently converges to a more favorable set of scaling parameters. Furthermore, incorporating Bézier curves significantly accelerates the convergence process.

In the gradient-based setting, we construct three datasets from MDQA, each containing $2{,}000$ samples in which the correct document is placed at a different position (i.e., first, middle, or last). In each epoch, a total of $2{,}000$ samples are drawn from these datasets based on the value of $\lambda$ as specified in Section §[4.1](https://arxiv.org/html/2606.27705#S4.SS1), where a larger $\lambda$ indicates a higher probability of sampling from the corresponding dataset. For stable training, we use a batch size of $32$, a learning rate of $1\mathrm{e}{-5}$, and train the model for a total of $30$ epochs.

For the gradient-based method, we observed that even with a large batch size and a small learning rate, the optimization of scaling factors via backpropagation failed to converge. A possible reason is the limited number of trainable parameters (Sun et al., [2025](https://arxiv.org/html/2606.27705#bib.bib64)). We evaluated the model at the 30th epoch and found a significant degradation in performance, as shown in Table [10](https://arxiv.org/html/2606.27705#A3.T10).

Table 10:Gradient\-based methods lead to accuracy degradation in the MDQA dataset\.
## Appendix DCubic Bézier Curve Parameterization for Layer Assignment

Consider a cubic Bézier curve with four control points:

$$P_0 = (x_0, y_0), \quad P_1 = (x_1, y_1), \quad P_2 = (x_2, y_2), \quad P_3 = (x_3, y_3), \tag{15}$$

where the $x$-coordinates are strictly increasing, following Equation [11](https://arxiv.org/html/2606.27705#S3.E11):

$$x_0 < x_1 < x_2 < x_3. \tag{16}$$

The parametric form of the cubic Bézier curve is

$$\begin{aligned} x(t) &= (1-t)^{3} x_0 + 3(1-t)^{2} t\, x_1 + 3(1-t)\, t^{2} x_2 + t^{3} x_3, \\ y(t) &= (1-t)^{3} y_0 + 3(1-t)^{2} t\, y_1 + 3(1-t)\, t^{2} y_2 + t^{3} y_3, \end{aligned} \tag{17}$$

where $t \in [0, 1]$.

Since the $x_i$ are strictly increasing, the derivative $x'(t) = 3\left[(1-t)^{2}(x_1 - x_0) + 2(1-t)\,t\,(x_2 - x_1) + t^{2}(x_3 - x_2)\right]$ is positive on $[0, 1]$, so the function $x(t)$ is monotonically increasing. This property allows the use of a binary search over the interval $[0, 1]$ to efficiently find the parameter $t$ corresponding to any given target value $x$, which defines the function $t(x)$.
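The layer assignment described in this appendix can be sketched as follows: evaluate the cubic Bézier polynomial of Equation 17, invert $x(t)$ by bisection, and read off $y(t)$ for each layer index. The function names, tolerance, and control points are illustrative assumptions rather than the authors' settings.

```python
def bezier(t, c0, c1, c2, c3):
    """Cubic Bezier polynomial in one coordinate (Eq. 17)."""
    u = 1.0 - t
    return u**3 * c0 + 3 * u * u * t * c1 + 3 * u * t * t * c2 + t**3 * c3

def solve_t(x, xs, tol=1e-9):
    """Binary search for t in [0, 1] with x(t) = x; relies on monotone x(t)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bezier(mid, *xs) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def layer_scales(points, num_layers):
    """Scaling factor y(t(l)) for each layer index l."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    scales = []
    for layer in range(num_layers):
        x = min(max(layer, xs[0]), xs[3])   # clamp to the curve's x-range
        scales.append(bezier(solve_t(x, xs), *ys))
    return scales

# Hypothetical control points with strictly increasing x-coordinates.
points = [(0, 1.0), (8, 1.6), (20, 1.8), (31, 1.2)]
scales = layer_scales(points, 32)
```

Since a Bézier curve stays inside the convex hull of its control points, every resulting scale lies between the smallest and largest control-point $y$ value.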

## Appendix EHyperparameters of the constrained genetic algorithm

In our experiments, we observed that when scaling RoPE, the model tends to improve performance at early positions while neglecting performance at later positions. Consequently, when setting $\bm{\lambda}$, we favor assigning larger weights to later positions. Here, we define $\langle \lambda_{\text{B}}, \lambda_{\text{M}}, \lambda_{\text{E}} \rangle$ as the weights assigned to the accuracy of the beginning, middle, and end positions, respectively, in the genetic algorithm's fitness function. In this study, we compare three weighting schemes: $\langle 0.333, 0.333, 0.333 \rangle$, $\langle 0.1, 0.3, 0.6 \rangle$, and $\langle 0.2, 0.3, 0.5 \rangle$.
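To make the role of the weights concrete, the sketch below assumes the fitness is a plain weighted sum of the three positional accuracies. That functional form is an assumption made for illustration; the exact fitness function is defined in the method section, and the function name and accuracy values here are hypothetical.

```python
def fitness(acc_begin, acc_middle, acc_end, weights=(0.2, 0.3, 0.5)):
    """Weighted sum of positional accuracies (assumed form)."""
    lam_b, lam_m, lam_e = weights
    return lam_b * acc_begin + lam_m * acc_middle + lam_e * acc_end

# Made-up accuracies: strong at the beginning, weaker at the end.
accs = (0.70, 0.60, 0.50)
for w in [(0.333, 0.333, 0.333), (0.1, 0.3, 0.6), (0.2, 0.3, 0.5)]:
    print(w, round(fitness(*accs, weights=w), 4))
```

Under this reading, schemes with a larger $\lambda_{\text{E}}$ penalize configurations that sacrifice end-position accuracy.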

Table 11: The impact of hyper-parameters $\bm{\lambda}$ on the optimized layer-wise scaling factors, showing that performance is largely insensitive to their choice.
## Appendix FSearch Algorithm Robustness

In this section, we evaluate the robustness of the scaling factors under variations in the search dataset. On the MDQA dataset, we use Vicuna-1.5-7B and randomly sample $200$ training instances to form the search set for each run. Across five independent runs with different search sets, the method achieves an average performance of $63.68$ with a sample variance of only $0.027$, demonstrating that our approach is highly stable across different search sets. Overall, our method consistently outperforms prior approaches, highlighting the robustness of the proposed search algorithm.

Table 12:Performance of LPES across five runs compared with baseline methods\. Percentages indicate the relative position of relevant documents in the context\.
## Appendix GDataset Details

![Refer to caption](https://arxiv.org/html/2606.27705v1/x8.png) (a) MDQA prompt

![Refer to caption](https://arxiv.org/html/2606.27705v1/x9.png) (b) KV prompt

Figure 8: Prompt templates used in MDQA and Key-Value Retrieval datasets.

Table 13: Overview and evaluation metrics of the sub-datasets in L-Eval.

Table 14: Overview and evaluation metrics of the sub-datasets in ZeroSCROLLS.

Table 15: Results under longer-context settings (16k tokens) on the L-Eval benchmark. LPES consistently improves performance over the baseline and MoICE on both Vicuna-13B-v1.5-16k and Qwen2.5-7B, demonstrating strong scalability to larger models and longer context windows.

Table 16: Performance comparison on ZeroSCROLLS benchmarks with a 16k context length using Qwen2.5-7B. LPES consistently improves average performance over both the baseline and MoICE across diverse tasks.

## Appendix HEffectiveness of LPES on Longer Contexts

We conduct experiments on Vicuna-1.5-13B and Qwen-2.5-7B under a 16k-token context setting on L-Eval to verify the effectiveness of LPES in long-context scenarios. The decoding length is set to $512$ tokens, so the maximum usable context window is limited to $15{,}872$ tokens. As shown in Tables [15](https://arxiv.org/html/2606.27705#A7.T15) and [16](https://arxiv.org/html/2606.27705#A7.T16), the results demonstrate that our method remains effective on larger models and extended context lengths, highlighting its strong scalability and robustness.

Algorithm 1: Scaling factor search algorithm

Input: an LLM $\mathcal{M}$, a dataset $\mathcal{D}$, population size $N_{\text{ps}}$, the number of offspring generated by crossover $N_{\text{cr}}$, the number of mutated individuals $N_{\text{mu}}$, and maximum number of generations $T$\.

1: $\mathcal{S}_{0}$ = Initial\-Population\-Generation\($\mathcal{D}$, $N_{\text{ps}}$\); // Randomly generate the initial population\.

2: for $i = 1$ to $T$ do

3: Evaluate\-Fitness\($\mathcal{S}_{i-1}$, $\mathcal{M}$, $\mathcal{D}$\); // Evaluate the fitness of all individuals in the population\.

4: $\mathcal{S}_{\text{pa}}$ = Select\-Parents\($\mathcal{S}_{i-1}$\); // Select the parent pool according to fitness values\.

5: $\mathcal{S}_{\text{cr}}$ = Crossover\-Operator\($\mathcal{S}_{\text{pa}}$, $N_{\text{cr}}$\); // Produce offspring using the crossover operator\.

6: $\mathcal{S}_{\text{mu}}$ = Mutation\-Operator\($\mathcal{S}_{\text{pa}}$, $N_{\text{mu}}$\); // Generate offspring using the mutation operator\.

7: $\mathcal{S}_{i}$ = $\mathcal{S}_{\text{pa}} \cup \mathcal{S}_{\text{cr}} \cup \mathcal{S}_{\text{mu}}$; // Merge the individuals to form the next generation's population\.

8: end for

9: Return the individual with the highest fitness in $\mathcal{S}_{T}$\.
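
The loop in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Each individual is taken to be a short list of Bézier control values, and the curve is sampled once per layer to obtain that layer's scaling factor. The number of control values, the range `[low, high]`, one-point crossover, Gaussian mutation, truncation selection, and a parent pool of size `n_ps - n_cr - n_mu` are all assumptions made for the example. The `fitness` callable is a hypothetical stand-in for evaluating the LLM on the search set with the given per-layer factors.

```python
import random
from math import comb


def bezier_factors(control, num_layers):
    """Sample a 1-D Bezier curve at evenly spaced t, one value per layer."""
    n = len(control) - 1
    factors = []
    for layer in range(num_layers):
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        # Bernstein-polynomial form of the Bezier curve.
        factors.append(sum(comb(n, k) * (1 - t) ** (n - k) * t ** k * c
                           for k, c in enumerate(control)))
    return factors


def search(fitness, num_layers, n_ps=12, n_cr=4, n_mu=4, generations=20,
           n_ctrl=4, low=1.0, high=2.0, seed=0):
    """Genetic search over Bezier control values (all defaults are assumed)."""
    rng = random.Random(seed)
    n_pa = n_ps - n_cr - n_mu  # assumed: keeps the population size at n_ps
    assert n_pa >= 2 and n_ctrl >= 2

    def score(individual):
        return fitness(bezier_factors(individual, num_layers))

    # Step 1: random initial population.
    population = [[rng.uniform(low, high) for _ in range(n_ctrl)]
                  for _ in range(n_ps)]
    for _ in range(generations):
        # Steps 3-4: evaluate fitness and keep the fittest as parents.
        parents = sorted(population, key=score, reverse=True)[:n_pa]
        # Step 5: one-point crossover between two distinct parents.
        crossed = []
        for _ in range(n_cr):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_ctrl)
            crossed.append(a[:cut] + b[cut:])
        # Step 6: perturb one control value of a copied parent.
        mutated = []
        for _ in range(n_mu):
            child = list(rng.choice(parents))
            i = rng.randrange(n_ctrl)
            child[i] = min(high, max(low, child[i] + rng.gauss(0, 0.1)))
            mutated.append(child)
        # Step 7: merge parents and offspring into the next generation.
        population = parents + crossed + mutated
    # Step 9: return the fittest individual and its per-layer factors.
    best = max(population, key=score)
    return best, bezier_factors(best, num_layers)


# Toy fitness for illustration only: prefers factors close to 1.5.
best, factors = search(lambda f: -sum((x - 1.5) ** 2 for x in f), num_layers=8)
```

Searching over a handful of control values rather than one free factor per layer is what shrinks the search space: with the assumed four control values, the dimensionality stays at four regardless of model depth.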