# Training–Inference Consistent Segmented Execution for Long-Context LLMs
Source: [https://arxiv.org/html/2605.11744](https://arxiv.org/html/2605.11744)
###### Abstract
Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods adopt bounded-context or segment-level execution only during inference while continuing to train models under full-context attention, resulting in a mismatch between training and inference in both execution and state-transition semantics. Based on this insight, we propose a training–inference consistent segment-level generation framework in which training and inference follow the same segment-level forward execution semantics. During training, consistency with inference is enforced by restricting gradient propagation to KV states carried over from the immediately preceding segment, while permitting head-specific access to past KV states during the forward pass without involving them in gradient propagation. Across long-context benchmarks, our approach achieves performance comparable to full-context attention, offers competitive latency–memory trade-offs against strong inference-efficient baselines, and substantially improves scalability at very long context lengths (e.g., approximately 6× lower peak prefill memory at 128K than full-context attention with FlashAttention).
Long-context language models, efficient attention, training-inference alignment, segmented execution
## 1 Introduction
Long-context modeling is increasingly vital for large language models (Achiam et al., [2023](https://arxiv.org/html/2605.11744#bib.bib1); Team et al., [2024](https://arxiv.org/html/2605.11744#bib.bib24); Anthropic, [2024](https://arxiv.org/html/2605.11744#bib.bib3)), underpinning practical applications such as document understanding, sustained dialogue, and complex reasoning (Bai et al., [2024](https://arxiv.org/html/2605.11744#bib.bib4)). However, the quadratic computational complexity of full-context self-attention fundamentally limits the scalability of Transformer models in long-context scenarios (Vaswani et al., [2017](https://arxiv.org/html/2605.11744#bib.bib26); Keles et al., [2023](https://arxiv.org/html/2605.11744#bib.bib18)). Accordingly, long-context inference is often performed under restricted execution regimes, such as bounded-context or chunked attention, to reduce computational costs (Xiao et al., [2024](https://arxiv.org/html/2605.11744#bib.bib29); Liu et al., [2025](https://arxiv.org/html/2605.11744#bib.bib20)). Recent work has substantially improved the efficiency of long-context inference through execution-level optimizations that preserve attention semantics, reducing memory consumption and practical inference costs without altering model outputs (Dao, [2024](https://arxiv.org/html/2605.11744#bib.bib10); Agrawal et al., [2023](https://arxiv.org/html/2605.11744#bib.bib2)). Yet, as context length continues to scale, the resource savings offered by such semantics-preserving execution-level optimizations alone are often insufficient in practice (Figure [1](https://arxiv.org/html/2605.11744#S1.F1)), and more restrictive execution strategies, such as windowed or sparse attention mechanisms, are commonly adopted.
Figure 1: Peak GPU memory consumption during long-context prefill.
Despite their effectiveness at inference time, most existing approaches impose these restricted execution regimes only during inference, while still relying on full-context attention during training. This leads to a mismatch between training and inference in both execution and cross-segment state evolution. As a result, the model can depend on information available during training that is absent under the constrained inference regime, which can undermine stability and generalization in long-context settings.
To address this limitation, we propose a training–inference consistent segment-level generation framework, which treats segmented execution as a shared modeling assumption rather than an inference-time optimization. We partition the sequence into segments and carry only a fixed-size KV tail as the *sole* differentiable cross-segment interface state, which is used *verbatim* in both training and inference. Training restricts cross-segment credit assignment to the most recent $K$ state transitions via TBPTT; under this strictly bounded recursion, TBPTT computes the exact gradient of an inference-consistent objective, preventing reliance on information unavailable at inference time. To access evidence beyond the carried-KV horizon, the model additionally consumes a retrieved KV prefix in a forward-only manner (no gradient), which does not participate in state recursion. Architecturally, we realize this design with head- and layer-sparse long-range heads, while the majority of heads support local, state-carrying computation, yielding execution semantics that are strictly aligned between training and inference.
Our contributions are as follows:
- We propose a segment-level modeling framework for long-context modeling that enforces training–inference consistency by design, decomposing cross-segment information flow into a local continuity channel and a separate forward-only mechanism that provides long-range conditioning.
- We show that training–inference alignment can be theoretically guaranteed without introducing persistent memory variables, by restricting cross-segment learning to a controlled interface state. Under this constrained formulation, truncated backpropagation computes the exact gradient of the inference-consistent objective, rather than an approximation.
- We empirically validate the proposed training–inference consistent framework across multiple long-context benchmarks and context lengths, demonstrating strong performance under constrained execution, ablations indicating that TBPTT with $K{=}1$ is sufficient and optimal, and substantially improved scalability (e.g., approximately 6× lower memory at 128K-context prefill compared to full attention).
Figure 2: Training–inference consistent segmented execution. A sequence is processed segment by segment with two cross-segment inputs: a *carried KV tail* $C_{i-1}$ (the only *differentiable* state that propagates across segments) and an optional retrieved prefix $R_{i-1}$ read from a *past-only* KV pool. During training, TBPTT with depth $K$ truncates credit assignment along the state chain (red cross), so gradients flow through $C$ for at most $K$ segment transitions (blue brace), while the retrieval path and earlier history are forward-only (no gradient).
## 2 Related Work
#### Execution-Level Optimizations
Recent work has improved the efficiency of exact attention-based long-context inference in Transformer-based models through execution- and system-level optimizations, without modifying model parameters, training objectives, or attention semantics. Representative examples include kernel-level optimizations such as FlashAttention and FlashAttention-2 (Dao et al., [2022](https://arxiv.org/html/2605.11744#bib.bib11); Dao, [2024](https://arxiv.org/html/2605.11744#bib.bib10)), runtime systems for efficient KV cache management and scheduling such as vLLM and SARATHI (Kwon et al., [2023](https://arxiv.org/html/2605.11744#bib.bib19); Agrawal et al., [2023](https://arxiv.org/html/2605.11744#bib.bib2)), as well as system-level approaches that explore heterogeneous memory offloading for large-scale inference, e.g., FlexGen (Sheng et al., [2023](https://arxiv.org/html/2605.11744#bib.bib23)). While these methods reduce constant factors and improve throughput under moderate context lengths, they preserve full-context attention semantics and therefore do not address the computational and memory challenges encountered at much longer context lengths, where restricted execution is required.
#### Restricted Execution at Inference
Another line of work enables long-context inference by restricting attention connectivity or retained states only at inference time, while leaving training-time execution unchanged. Streaming-based methods such as StreamingLLM (Xiao et al., [2024](https://arxiv.org/html/2605.11744#bib.bib29)) and LM-Infinite (Han et al., [2024](https://arxiv.org/html/2605.11744#bib.bib14)) achieve zero-shot length generalization by adapting the inference-time attention pattern of models trained on short contexts. In a different direction, MInference (Jiang et al., [2024](https://arxiv.org/html/2605.11744#bib.bib17)) accelerates long-context prefill through inference-time sparse attention, while preserving dense attention during training. Restricted execution can also arise from state-level constraints rather than attention sparsification, as in ChunkKV (Liu et al., [2025](https://arxiv.org/html/2605.11744#bib.bib20)), which selectively compresses and retains KV states during inference without retraining. Although restrictions are imposed in different ways, spanning attention scope, sparsity, and state retention, these approaches rely on training-time execution assumptions that differ from inference-time behavior, leading to mismatches in attention connectivity or retained-state usage at inference time.
#### Training–Inference Alignment
Training–inference alignment has been explored for settings where inference adopts restricted execution, by explicitly incorporating comparable constraints during training. Longformer (Beltagy et al., [2020](https://arxiv.org/html/2605.11744#bib.bib6)) replaces full self-attention with a fixed sparse pattern so that training and inference operate under identical attention connectivity. Core Context Aware (CCA) attention (Chen et al., [2025](https://arxiv.org/html/2605.11744#bib.bib8)) enforces alignment under a reduced-context computation graph by applying the mechanism consistently during adaptation. Alignment has also been studied for streaming or segmented execution: Shiftable Context (Raffel et al., [2023](https://arxiv.org/html/2605.11744#bib.bib22)) enforces consistent segment structures between training and inference, while Sliding Window Attention Training (Fu et al., [2025](https://arxiv.org/html/2605.11744#bib.bib13)) trains models directly with windowed attention to avoid degradation when such constraints are introduced only at inference time. Taken together, these approaches mitigate training–inference mismatch by enforcing consistent attention connectivity or context structure across training and inference stages.
#### Memory-Based and Recurrent Models
An alternative modeling paradigm extends effective context length by introducing persistent memory mechanisms or recurrent state propagation across segments. Transformer-XL (Dai et al., [2019](https://arxiv.org/html/2605.11744#bib.bib9)) is the earliest work to introduce segment-level recurrence with gradient truncation in Transformers, whereas our approach formalizes training–inference consistent execution semantics and separates short-term differentiable state propagation from forward-only long-range retrieval. Compressive Transformer (Rae et al., [2020](https://arxiv.org/html/2605.11744#bib.bib21)) builds on Transformer-XL by retaining older states via learned compression. Recurrent Memory Transformer (Bulatov et al., [2022](https://arxiv.org/html/2605.11744#bib.bib7)) further introduces explicit memory tokens that are recurrently updated and trained to store global information across segments. Memorizing Transformers (Wu et al., [2022](https://arxiv.org/html/2605.11744#bib.bib28)) instead rely on an external key–value memory accessed via retrieval to support long-range recall. These methods rely on explicit, persistent cross-segment states to carry long-range information, whose update dynamics during training are not necessarily aligned with the execution semantics encountered at inference time. By contrast, our approach avoids persistent memory states and enforces training–inference consistency under segment-level execution.
Figure 3: Head- and layer-sparse long-range retrieval. (a) In a non-long-range layer $\ell \notin \mathcal{L}_{\text{long}}$, local heads attend to within-segment tokens and the carried KV state from the previous segment (green), while long-range heads use within-segment causal attention only (orange). (b) In a long-range-enabled layer $\ell \in \mathcal{L}_{\text{long}}$, local heads remain unchanged, while long-range heads additionally attend to a retrieved prefix from a past-only KV pool (blue). In all cases, attention within the current segment remains causal.
## 3 Method
In long-context language modeling, the computational and memory costs of full-context attention make unrestricted execution impractical at scale, leading to the use of segmented or otherwise constrained attention mechanisms. Many existing methods impose such constraints only at inference time, while retaining full-context attention during training, resulting in a mismatch between training and inference execution semantics. In this section, we describe a training–inference consistent segmented execution framework, in which both training and inference follow an identical segment-level execution scheme, explicitly constraining cross-segment state propagation and long-range access under the same execution semantics.
### 3.1 Training–Inference Consistent Segmented Execution
#### Setup
As illustrated in Figure [2](https://arxiv.org/html/2605.11744#S1.F2), we partition a token sequence into $N$ non-overlapping segments $\{x^{(i)}\}_{i=1}^{N}$, where $m_i = |x^{(i)}|$ denotes the length of segment $i$. To enable segmented inference under bounded attention, we expose a restricted cross-segment interface state $C_i \in \mathcal{C}$ (a fixed-size carried KV interface in our implementation), together with a forward-only retrieval prefix $R_i \in \mathcal{R}$ provided by a long-range module.
###### Definition 3.1 (Segment-level execution semantics).
For each segment $i$, the model runs the same forward operator at both training and inference:

$$(C_i,\, o^{(i)}) = F_\theta\big(x^{(i)},\, C_{i-1},\, R_{i-1}\big), \tag{1}$$

where $o^{(i)}$ denotes the outputs used to compute the LM loss, $C_{i-1}$ is the *only* differentiable cross-segment interface state available to the model, and $R_{i-1}$ is a forward-only retrieval prefix. The segment loss is

$$\ell_i(\theta; C_{i-1}, R_{i-1}) = -\sum_{t=1}^{m_i} \log p_\theta\big(x_t^{(i)} \mid x_{<t}^{(i)}, C_{i-1}, R_{i-1}\big). \tag{2}$$
#### Operational interpretation
For a sequence split into three segments $x^{(1)}, x^{(2)}, x^{(3)}$, with $C_0 = R_0 = \emptyset$, the segment-level recurrence in Eq. ([1](https://arxiv.org/html/2605.11744#S3.E1)) unrolls as

$$\begin{aligned}
(C_1,\, o^{(1)}) &= F_\theta(x^{(1)}, \emptyset, \emptyset),\\
(C_2,\, o^{(2)}) &= F_\theta(x^{(2)}, C_1, R_1),\\
(C_3,\, o^{(3)}) &= F_\theta(x^{(3)}, C_2, R_2).
\end{aligned}$$

Here $o^{(i)}$ is used to predict tokens in the current segment $x^{(i)}$, while $C_i$ is the carried state produced for the next segment. Thus, $C_{i-1}$ and $R_{i-1}$ are conditioning inputs in Eq. ([2](https://arxiv.org/html/2605.11744#S3.E2)), not prediction targets. Past tokens are not replaced by $C$ or $R$; rather, the model accesses them through two prefix inputs: $C$ carries recent local KV states from the previous segment, while $R$ provides forward-only retrieved KV states from earlier segments. Section [3.4](https://arxiv.org/html/2605.11744#S3.SS4) describes how these two inputs are implemented.
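To make the recurrence concrete, the sketch below drives Eq. ([1](https://arxiv.org/html/2605.11744#S3.E1)) over a pre-segmented sequence. The `segment_forward` and `retrieve_prefix` wrappers and the loss accumulation are illustrative stand-ins for $F_\theta$ and Eq. ([2](https://arxiv.org/html/2605.11744#S3.E2)), not the authors' implementation; the same loop structure is shared by training and inference.

```python
def run_segments(model, segments):
    """Drive the shared forward operator F_theta over a pre-segmented sequence."""
    carried, retrieved = None, None   # C_0 and R_0 are both empty
    total_loss = 0.0
    for x_i in segments:
        # One application of Eq. (1): consumes the carried KV tail C_{i-1} and
        # the forward-only retrieval prefix R_{i-1}; produces the new carried
        # state C_i and per-token log-probabilities for the segment loss (Eq. 2).
        carried, log_probs = model.segment_forward(
            x_i, carried_kv=carried, retrieved_kv=retrieved
        )
        total_loss = total_loss + (-log_probs.sum())
        # Build R_i for the next segment from the past-only KV pool, which the
        # model populates with detached long-range KV states as it goes.
        retrieved = model.retrieve_prefix(x_i)
    return total_loss
```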
#### Training–Inference Forward Identity
Eq. ([1](https://arxiv.org/html/2605.11744#S3.E1)) is used *verbatim* at training and inference. The only difference lies in how gradients are allowed to propagate across the state chain, formalized next.
### 3.2 Inference-Consistent Truncated Training Objective
Under the segment-level execution semantics defined in Section [3.1](https://arxiv.org/html/2605.11744#S3.SS1), we next formalize the training objective induced by truncated state propagation.
###### Definition 3.2 ($K$-truncated consistent objective).
Let $\mathrm{sg}(\cdot)$ denote the stop-gradient operator (identity in the forward pass, zero gradient in the backward pass). Let $b_i = \max(0,\, i-K-1)$ denote the truncation boundary, with $C_0 = \emptyset$. For each $i$, define a truncated state chain $\tilde{C}_{i-1}^{(K)}$ by unrolling the state update for at most $K$ segments, with the boundary state $C_{b_i}$ treated as a constant:

$$\tilde{C}_{b_i}^{(K)} = \mathrm{sg}(C_{b_i}), \tag{3}$$

$$\tilde{C}_{j}^{(K)} = \Phi_\theta\big(x^{(j)},\, \tilde{C}_{j-1}^{(K)},\, R_{j-1}\big), \quad j = b_i+1, \dots, i-1, \tag{4}$$

where $\Phi_\theta$ is the state-update component induced by $F_\theta$. We define the consistent objective

$$L_K(\theta) = \sum_{i=1}^{N} \ell_i\big(\theta;\, \tilde{C}_{i-1}^{(K)},\, R_{i-1}\big). \tag{5}$$
#### Operational meaning of the truncation depth
Continuing the three-segment unrolling above, consider the representative loss on the third segment, $\ell_3(\theta; \tilde{C}_2^{(K)}, R_2)$. With $K=1$, gradients propagate through the state update that produces $C_2$ from segment 2 and stop at $C_1$. With $K=2$, the state chain is further unrolled through the update that produces $C_1$ from segment 1, and gradients stop at $C_0$. In both cases, the forward computation remains unchanged and follows Eq. ([1](https://arxiv.org/html/2605.11744#S3.E1)); Eqs. ([3](https://arxiv.org/html/2605.11744#S3.E3))–([4](https://arxiv.org/html/2605.11744#S3.E4)) only determine how far gradients propagate along the carried-state chain.
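A reference construction of $L_K(\theta)$ in PyTorch-style pseudocode is sketched below, following Eqs. ([3](https://arxiv.org/html/2605.11744#S3.E3))–([5](https://arxiv.org/html/2605.11744#S3.E5)) literally: the untruncated forward chain supplies the boundary constants (stop-gradient is the identity in the forward pass), and each loss term re-unrolls at most $K$ differentiable state updates. Here `model.state_update` and `model.segment_loss` are hypothetical wrappers for $\Phi_\theta$ and $\ell_i$, retrieval prefixes are omitted for brevity, and an efficient TBPTT implementation would share computation across terms rather than recompute it.

```python
import torch

def truncated_objective(model, segments, K=1):
    """Naive reference construction of the K-truncated objective L_K (Eq. 5)."""
    # Untruncated forward chain of carried states C_1..C_N (values only; no
    # gradients are needed because sg(.) is the identity in the forward pass).
    with torch.no_grad():
        C = [None]                                    # C_0 is empty
        for x in segments:
            C.append(model.state_update(x, C[-1]))    # C_i = Phi(x^(i), C_{i-1})

    total = 0.0
    for i in range(1, len(segments) + 1):
        b = max(0, i - K - 1)                         # truncation boundary b_i
        C_tilde = C[b]                                # sg(C_{b_i}): a constant
        # Re-unroll at most K differentiable state updates (Eq. 4).
        for j in range(b + 1, i):
            C_tilde = model.state_update(segments[j - 1], C_tilde)
        # Segment loss conditioned on the truncated carried state (Eqs. 2 and 5).
        total = total + model.segment_loss(segments[i - 1], C_tilde)
    return total    # backpropagating `total` is TBPTT for L_K (Theorem 3.3)
```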
### 3.3 Exactness of TBPTT for the Stated Objective
###### Theorem 3.3 (Exactness of TBPTT for $L_K$).
Truncated backpropagation through time (TBPTT) with truncation depth $K$ computes the exact gradient $\nabla_\theta L_K(\theta)$.
###### Proof sketch.
By Definition [3.2](https://arxiv.org/html/2605.11744#S3.Thmtheorem2), the stop-gradient boundary $\mathrm{sg}(C_{b_i})$ removes all gradient paths across more than $K$ state transitions in the computational graph of each loss term. TBPTT with truncation depth $K$ performs backpropagation on exactly this truncated computational graph, and therefore yields $\nabla_\theta L_K(\theta)$. A detailed chain-rule derivation (Jacobian product / adjoint recursion) is provided in Appendix [A](https://arxiv.org/html/2605.11744#A1) (see Appendix [A.2](https://arxiv.org/html/2605.11744#A1.SS2)). ∎
###### Corollary 3.4 (Training–inference alignment).
Since training and inference share identical forward execution in Eq. ([1](https://arxiv.org/html/2605.11744#S3.E1)), and TBPTT computes the exact gradient of the inference-consistent objective in Eq. ([5](https://arxiv.org/html/2605.11744#S3.E5)), the resulting training procedure is fully aligned with inference under the same execution semantics.
#### Forward-Only Long-Range Retrieval
In our system, the long-range module constructs $R_{i-1}$ from detached past KV states and uses it only in the forward pass, so it does not introduce additional cross-segment credit-assignment paths. We formalize this as Lemma [B.1](https://arxiv.org/html/2605.11744#A2.Thmtheorem1) in Appendix [B](https://arxiv.org/html/2605.11744#A2).
### 3.4 Model Architecture: Head- and Layer-Sparse Long-Range Retrieval
In Sections [3.1](https://arxiv.org/html/2605.11744#S3.SS1)–[3.3](https://arxiv.org/html/2605.11744#S3.SS3), we defined a training–inference consistent segmented execution semantics and proved that TBPTT computes the exact gradient of the corresponding consistent objective. We now instantiate the abstract operator $F_\theta(\cdot)$ in Eq. ([1](https://arxiv.org/html/2605.11744#S3.E1)) with a Transformer decoder architecture that (i) implements the differentiable cross-segment interface state $C_i$ via a fixed-size carried KV tail, and (ii) optionally augments context using a forward-only retrieval prefix $R_i$ in a sparse manner across layers and heads. Figure [3](https://arxiv.org/html/2605.11744#S2.F3) illustrates the resulting context-access patterns.
#### Head and Layer Partition
Consider an $L$-layer Transformer decoder with $H$ attention heads per layer. For each layer $\ell$, we partition heads into two disjoint groups, local heads $\mathcal{H}_{\text{local}}$ and long-range heads $\mathcal{H}_{\text{long}}$, and denote $\alpha := |\mathcal{H}_{\text{local}}| / H$. This head-level sparsity follows observations from recent mechanistic and systems analyses showing that only a small subset of heads exhibits retrieval-like long-context behavior, whereas the majority behaves more like streaming/local heads (Wu et al., [2025](https://arxiv.org/html/2605.11744#bib.bib27); Xiao et al., [2025](https://arxiv.org/html/2605.11744#bib.bib30)). We further select a small subset of layers $\mathcal{L}_{\text{long}} \subseteq \{1, \dots, L\}$ as long-range-enabled layers, where long-range heads additionally consume a retrieved prefix. At the layer level, prior work reports structured redundancy in Transformer attention layers, indicating that many attention layers can be removed with only limited performance degradation (He et al., [2024](https://arxiv.org/html/2605.11744#bib.bib15)). In our default configuration, we enable long-range retrieval on layers $\mathcal{L}_{\text{long}} = \{6, 8, 11, 18\}$ and use a prior-based long-range head group $\mathcal{H}_{\text{long}} = \{0, 1, 2, 4, 9, 12, 14, 15, 16, 18, 19, 22, 23, 26, 29, 30\}$, with the remaining heads assigned to $\mathcal{H}_{\text{local}}$. We ablate alternative head groupings in Appendix [G.6](https://arxiv.org/html/2605.11744#A7.SS6). All layer and head indices follow the zero-indexed implementation convention.
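The snippet below spells out this default partition and the per-head context each group sees (anticipating the cost accounting of Section 3.5). The totals `H = 32` heads and `L = 32` layers are assumptions matching a LLaMA2-7B-style backbone, not values stated in this section.

```python
# Default partition (zero-indexed) from the configuration described above.
LONG_RANGE_LAYERS = {6, 8, 11, 18}
LONG_RANGE_HEADS = {0, 1, 2, 4, 9, 12, 14, 15, 16, 18, 19, 22, 23, 26, 29, 30}

H, L = 32, 32  # assumed head/layer counts for a LLaMA2-7B-style backbone
LOCAL_HEADS = set(range(H)) - LONG_RANGE_HEADS

alpha = len(LOCAL_HEADS) / H           # fraction of local (state-carrying) heads
beta = len(LONG_RANGE_LAYERS) / L      # fraction of long-range-enabled layers

def visible_context(layer: int, head: int, S: int, M: int, R: int) -> int:
    """Attention-visible context length for one head when processing a segment."""
    if head in LOCAL_HEADS:
        return S + M                   # within-segment tokens + carried KV tail
    if layer in LONG_RANGE_LAYERS:
        return S + R                   # within-segment tokens + retrieved prefix
    return S                           # long-range head without retrieval
```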
#### Local Continuity Channel: Carried KV Interface $\{C_i\}$
For each segment $x^{(i)}$, after computing per-layer key/value states, we form the cross-segment interface state $C_i$ as a fixed-size carried KV tail of length $M$ produced by local heads. Concretely, for each layer $\ell$, we cache the last $M$ key/value vectors from $\mathcal{H}_{\text{local}}$ and expose them to the next segment as the only differentiable cross-segment state. When processing segment $i$, local heads attend to the concatenation of this carried KV tail (from segment $i{-}1$) and the within-segment KV of segment $i$ (Figure [3](https://arxiv.org/html/2605.11744#S2.F3), green + orange). This realizes the interface state $C_{i-1}$ used in Eq. ([1](https://arxiv.org/html/2605.11744#S3.E1)), and is exactly the state channel whose gradient propagation is aligned with inference (Corollary [3.4](https://arxiv.org/html/2605.11744#S3.Thmtheorem4)).
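A minimal sketch of forming this carried tail for one layer follows; the tensor layout `[num_heads, seg_len, head_dim]` and the `local_heads` index list are illustrative assumptions.

```python
import torch

def make_carried_tail(keys, values, local_heads, M):
    """Build the per-layer carried KV interface C_i from a processed segment.

    keys / values: [num_heads, seg_len, head_dim] KV states of the current segment.
    local_heads:   indices of the local (state-carrying) heads in this layer.
    M:             fixed length of the carried KV tail.
    The returned tensors stay attached to the autograd graph: C_i is the only
    differentiable cross-segment state, and gradient truncation is applied
    separately according to the TBPTT depth K (Definition 3.2).
    """
    idx = torch.tensor(local_heads)
    k_tail = keys[idx, -M:, :]      # last M key vectors of the local heads
    v_tail = values[idx, -M:, :]    # last M value vectors of the local heads
    return k_tail, v_tail
```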
#### Long-Range Channel: Forward-Only Retrieval Prefix $\{R_i\}$
To capture dependencies beyond the carried-KV horizon, we introduce an external past-only KV pool used by long-range-enabled layers. For each $\ell \in \mathcal{L}_{\text{long}}$, we maintain a past-only pool that stores historical key/value vectors computed from long-range heads and supports retrieving a fixed-size prefix of length $R$ via top-$k$ similarity search. In our implementation, we do not apply eviction; the pool retains historical KV states for the selected long-range heads and long-range-enabled layers. Before processing segment $i$, we construct a compact query summary from the tail of segment $i{-}1$ (using long-range heads) and retrieve $R$ key/value vectors from the pool as the retrieval prefix $R_{i-1}$. In layers $\ell \in \mathcal{L}_{\text{long}}$, long-range heads attend to the concatenation of this retrieved prefix and the within-segment KV of segment $i$ (Figure [3](https://arxiv.org/html/2605.11744#S2.F3), blue + orange). In layers $\ell \notin \mathcal{L}_{\text{long}}$, long-range heads reduce to standard within-segment causal attention (Figure [3](https://arxiv.org/html/2605.11744#S2.F3), orange). KV states inserted into the pool are detached for future retrieval, and retrieval itself is forward-only, ensuring consistency with Sections [3.1](https://arxiv.org/html/2605.11744#S3.SS1)–[3.3](https://arxiv.org/html/2605.11744#S3.SS3). Formal analysis and implementation details are provided in Lemma [B.1](https://arxiv.org/html/2605.11744#A2.Thmtheorem1) and Appendix [D](https://arxiv.org/html/2605.11744#A4).
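One way such a pool could look for a single long-range head is sketched below; the dot-product scoring, the flat (unindexed) storage, and the class interface are assumptions for illustration rather than the implementation detailed in Appendix D.

```python
import torch

class PastOnlyKVPool:
    """Per-head pool of historical KV states for one long-range-enabled layer."""

    def __init__(self):
        self.keys, self.values = [], []

    def insert(self, k, v):
        # Detach before storing: pooled states are read in a forward-only manner
        # and never carry gradients back to earlier segments.
        self.keys.append(k.detach())
        self.values.append(v.detach())

    def retrieve(self, query_summary, R):
        """Return a retrieval prefix of up to R KV pairs by similarity search."""
        if not self.keys:
            return None
        K = torch.cat(self.keys, dim=0)       # [pool_size, head_dim]
        V = torch.cat(self.values, dim=0)
        scores = K @ query_summary            # query_summary: [head_dim]
        top = scores.topk(min(R, K.shape[0])).indices
        return K[top], V[top]                 # forward-only prefix R_{i-1}
```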
#### RoPE Re-indexing for Prefix Concatenation
Both the carried KV tail and the retrieval prefix are treated as a prefix preceding the current segment. To preserve correct positional encoding under this concatenation, we re-index RoPE positions by assigning prefix positions $\{0, \dots, P{-}1\}$ (with $P = M$ for the carried KV tail and $P = R$ for the retrieval prefix) and shifting the current segment positions by an offset of $P$. This allows reuse of standard RoPE attention while integrating prefix KV as preceding context. We provide exact position assignments and the corresponding RoPE application details in Appendix [E](https://arxiv.org/html/2605.11744#A5).
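The helper below illustrates this position assignment; `P` is the prefix length ($M$ or $R$), and applying RoPE with these ids is assumed to happen in the surrounding attention code (the exact assignment used in the paper is given in Appendix E).

```python
import torch

def reindexed_positions(P: int, S: int):
    """RoPE position ids for the concatenation [prefix ; current segment].

    The prefix (carried KV tail of length M, or retrieval prefix of length R)
    takes positions 0 .. P-1, and the S tokens of the current segment are
    shifted by P, so standard RoPE attention can be reused unchanged.
    """
    prefix_pos = torch.arange(P)            # 0 .. P-1 for the prefix KV
    segment_pos = torch.arange(S) + P       # P .. P+S-1 for the current segment
    return prefix_pos, segment_pos
```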
Finally, by enabling retrieval only for a subset of layers and heads, the architecture preserves the local execution semantics used in Sections [3.1](https://arxiv.org/html/2605.11744#S3.SS1)–[3.3](https://arxiv.org/html/2605.11744#S3.SS3), while limiting the attention-visible compute and activation-memory overhead of long-range context access; we quantify the time/memory benefits in Section [3.5](https://arxiv.org/html/2605.11744#S3.SS5).
Figure 4: Perplexity on the PG19 test set under varying evaluation context lengths. Results are reported for LLaMA2-32K and LLaMA2-80K.
### 3.5 Computational Cost under Segment-Level Restricted Execution
We analyze the computational and memory cost of the proposed architecture (Section [3.4](https://arxiv.org/html/2605.11744#S3.SS4)) under the segment-level execution semantics in Definition [3.1](https://arxiv.org/html/2605.11744#S3.Thmtheorem1).
We consider a sequence of total length $T = \sum_{i=1}^{N} |x^{(i)}|$, where $\{x^{(i)}\}_{i=1}^{N}$ are the non-overlapping segments used in Eq. ([1](https://arxiv.org/html/2605.11744#S3.E1)). For clarity, we assume a constant segment length $|x^{(i)}| = S$, hence $N = T/S$. The differentiable cross-segment interface state $C_i$ is a carried KV tail of fixed length $M$, and the forward-only retrieval prefix $R_i$ has a fixed length $R$. Each layer has $H$ attention heads, partitioned into local heads $\mathcal{H}_{\text{local}}$ and long-range heads $\mathcal{H}_{\text{long}}$ (Section [3.4](https://arxiv.org/html/2605.11744#S3.SS4)), with $\alpha := |\mathcal{H}_{\text{local}}|/H$. Only layers in $\mathcal{L}_{\text{long}}$ consume the retrieval prefix; denote $\beta := |\mathcal{L}_{\text{long}}|/L$.
#### Time Complexity
For a segment of length $S$, local heads attend to a context of size $S + M$ (within-segment plus carried tail). Long-range heads attend to $S$ in layers $\ell \notin \mathcal{L}_{\text{long}}$, and to $S + R$ in layers $\ell \in \mathcal{L}_{\text{long}}$. Abstracting the dominant attention cost in each layer as $\mathcal{O}(S \cdot L_{\mathrm{ctx}})$ per head, the effective context length per token can be summarized as

$$S + \alpha M + \beta(1-\alpha)R.$$

Therefore, the total attention time over the full sequence is

$$\mathcal{O}\big(T \cdot (S + \alpha M + \beta(1-\alpha)R)\big),$$

up to a constant factor depending on the number of layers $L$ and heads $H$.
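As an illustrative evaluation of this bound (the numbers are assumptions, not configuration values reported in this section): with $\alpha = 0.5$, $\beta = 0.125$, segment length $S = 1024$, carried tail $M = 128$, and retrieval prefix $R = 256$, the effective context per token is $1024 + 0.5 \cdot 128 + 0.125 \cdot 0.5 \cdot 256 = 1104$ tokens, independent of the total length $T$; full causal attention over the same sequence instead attends to roughly $T/2$ tokens per token on average.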
#### Retrieval Overhead
In addition to attention, retrieval incurs an extra search cost over the past-only KV pool. In our implementation, the pool is not capacity-bounded: it retains historical KV states only for the selected long-range heads and long-range-enabled layers. Consequently, the searchable pool grows with the processed history, but this growth is reduced by the sparsity factor $\beta(1-\alpha)$ over layers and heads. This retrieval cost is separate from the attention cost above: after retrieval, each long-range head attends only to a fixed-size prefix of length $R$, so the attention-visible context remains bounded by $S + \alpha M + \beta(1-\alpha)R$.
#### Memory Complexity
Under segment-level execution, the peak KV cache for attention is dominated by the current segment KV and the prefix KV. Across all layers, local heads attend to an attention-visible KV length of $S + M$. For long-range heads, the attention-visible KV length is $S + R$ in layers $\ell \in \mathcal{L}_{\text{long}}$ and $S$ in layers $\ell \notin \mathcal{L}_{\text{long}}$. Thus, the peak attention-visible KV length can be summarized as

$$\bar{L}_{\mathrm{KV}} = S + \alpha M + \beta(1-\alpha)R,$$

implying a bounded per-segment attention-memory footprint that does not scale with $T$. The persistent retrieval pool is separate from this attention-visible KV footprint. Since we retain historical KV states for the selected long-range heads and long-range-enabled layers without eviction, the pool memory grows linearly with the processed sequence length, with an effective sparsity factor $\beta(1-\alpha)$. Importantly, these stored states are forward-only and do not create additional gradient-carrying activation memory.
Table 1: Comparison of different methods on LongBench-E (Bai et al., [2024](https://arxiv.org/html/2605.11744#bib.bib4)). TTFT (time to first token) measures the latency of prompt processing before token generation, while Mem. reports the peak GPU memory allocated during the prefill stage; lower is better for both. TTFT and Mem. are measured with a 16K-token prefill for LLaMA2-7B-32K and a 32K-token prefill for LLaMA2-7B-80K. All experiments are conducted on A100 GPUs.

| Methods | S. QA | M. QA | Sum. | FS. Learning | Synthetic | Code | Avg. | TTFT (s) | Mem. (GB) |
|---|---|---|---|---|---|---|---|---|---|
| **LLaMA2-7B-32K** | | | | | | | | | |
| (Vanilla Self-Attention) | 2.82 | 6.09 | 3.09 | 65.36 | 0.50 | 60.91 | 23.13 | 1.62 | 23.61 |
| CCA-attention (Chen et al., 2025) | 3.36 | 5.64 | 4.60 | 55.41 | 2.12 | 55.56 | 21.12 | 1.79 | 28.08 |
| StreamingLLM (Xiao et al., 2024) | 1.93 | 5.22 | 3.95 | 61.93 | 0.25 | 58.12 | 21.90 | 1.59 | 22.19 |
| DuoAttention (Xiao et al., 2025) | 2.85 | 6.30 | 3.45 | 64.82 | 0.67 | 59.91 | 23.00 | 1.53 | 18.15 |
| MInference (Jiang et al., 2024) | 2.87 | 6.39 | 3.56 | 65.44 | 0.00 | 60.22 | 23.08 | 2.84 | 22.19 |
| (Ours) | 1.90 | 4.07 | 12.10 | 62.30 | 0.28 | 58.81 | 23.24 | 1.70 | 18.56 |
| **LLaMA2-7B-80K** | | | | | | | | | |
| (Vanilla Self-Attention) | 6.89 | 5.93 | 11.35 | 66.51 | 0.77 | 48.83 | 23.38 | 4.13 | 34.67 |
| StreamingLLM (Xiao et al., 2024) | 3.48 | 3.76 | 6.60 | 62.35 | 0.17 | 53.02 | 21.56 | 3.07 | 31.77 |
| CCA-attention (Chen et al., 2025) | 6.22 | 7.86 | 6.11 | 58.2 | 1.73 | 51.75 | 21.98 | 3.88 | 43.64 |
| DuoAttention (Xiao et al., 2025) | 6.57 | 6.12 | 11.86 | 63.68 | 0.89 | 48.50 | 22.94 | 3.79 | 23.66 |
| MInference (Jiang et al., 2024) | 6.93 | 5.96 | 11.19 | 66.42 | 1.01 | 48.60 | 23.35 | 4.13 | 31.77 |
| (Ours) | 7.58 | 8.56 | 8.70 | 63.76 | 0.42 | 56.02 | 24.17 | 3.49 | 19.06 |
## 4 Experiments
### 4.1 Experimental Setup
#### Benchmarks and Tasks
We evaluate long-context modeling through a set of complementary evaluation views: (1) language modeling perplexity (PPL) across varying context lengths, which characterizes stability and degradation patterns as the effective context length increases; (2) downstream performance on the LongBench benchmark (Bai et al., [2024](https://arxiv.org/html/2605.11744#bib.bib4)), which comprises a diverse set of long-context tasks requiring cross-segment information integration and long-range dependency modeling, providing a task-level evaluation of long-context generalization; and (3) controlled length-scaling analyses using the CWE and FWE tasks from the RULER benchmark (Hsieh et al., [2024](https://arxiv.org/html/2605.11744#bib.bib16)). RULER offers a setting in which context length can be systematically extended, enabling focused stress testing of robustness and dependency-modeling behavior.
#### Models and Backbones
Our primary experiments are conducted on LLaMA 2 models (Touvron et al., [2023](https://arxiv.org/html/2605.11744#bib.bib25)) with extended context lengths of 32K and 80K (Fu et al., [2024](https://arxiv.org/html/2605.11744#bib.bib12)), which are widely used settings in long-context evaluation. We also report results on LLaMA 3.1 for LongBench v2 (Bai et al., [2025](https://arxiv.org/html/2605.11744#bib.bib5)) to verify whether similar trends hold on a more recent backbone, with detailed results provided in Appendix [G](https://arxiv.org/html/2605.11744#A7) for completeness.
#### Baselines
We compare our method against representative long-context modeling approaches with different execution constraints. These include full self-attention as an unconstrained reference, Core Context Aware (CCA) attention (Chen et al., [2025](https://arxiv.org/html/2605.11744#bib.bib8)) (training–inference aligned via context compression), MInference (Jiang et al., [2024](https://arxiv.org/html/2605.11744#bib.bib17)) (inference-only sparse attention), StreamingLLM (Xiao et al., [2024](https://arxiv.org/html/2605.11744#bib.bib29)) (sliding-window attention with sinks), and DuoAttention (Xiao et al., [2025](https://arxiv.org/html/2605.11744#bib.bib30)) (head-level separation for selective long-range access).
#### Implementation Details
For methods whose execution requires training–inference alignment (our method and CCA), we fine-tune the pretrained models using a standard language modeling objective to match their inference-time execution semantics. All alignment-based methods are fine-tuned under identical settings. All other baseline methods are evaluated under their standard pretrained configurations without additional training. For StreamingLLM, we adopt the implementation provided in MInference (Jiang et al., [2024](https://arxiv.org/html/2605.11744#bib.bib17)). Additional implementation details are provided in Appendix [F](https://arxiv.org/html/2605.11744#A6). Our code is available at: [link](https://github.com/sxp11/Training-Inference-Consistent-Segmented-Execution).
### 4.2 Main Results
Table 2: Performance as context length increases from 4K to 64K on RULER tasks (Hsieh et al., [2024](https://arxiv.org/html/2605.11744#bib.bib16)). CWE and FWE denote the Common Words Extraction and Frequent Words Extraction tasks, respectively. Scores report recall-based accuracy at each context length. "–" indicates zero performance, i.e., failure to generalize to the corresponding length. Avg* reports the average over 4K–32K, excluding the 64K setting. All methods use LLaMA2-7B-32K as the backbone.

| Methods | 4K CWE | 4K FWE | 8K CWE | 8K FWE | 16K CWE | 16K FWE | 32K CWE | 32K FWE | 64K CWE | 64K FWE | Avg* CWE | Avg* FWE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (Vanilla Self-Attention) | 60.65 | 48.00 | 39.25 | 35.83 | 20.05 | 47.50 | 11.80 | 34.00 | – | – | 32.94 | 41.33 |
| StreamingLLM (Xiao et al., 2024) | 61.80 | 47.83 | 20.45 | 38.83 | 19.35 | 43.00 | 9.50 | 35.83 | – | – | 27.78 | 41.37 |
| MInference (Jiang et al., 2024) | 60.35 | 48.67 | 39.40 | 36.17 | 20.15 | 46.17 | 11.85 | 36.00 | – | – | 32.94 | 41.75 |
| CCA-attention (Chen et al., 2025) | 41.10 | 45.67 | 36.25 | 39.50 | 15.90 | 29.83 | 6.35 | 12.83 | – | – | 24.90 | 31.96 |
| DuoAttention (Xiao et al., 2025) | 60.15 | 48.83 | 38.80 | 42.67 | 18.15 | 48.33 | 12.20 | 33.83 | – | – | 32.33 | 43.42 |
| Ours | 70.60 | 53.83 | 54.15 | 38.67 | 41.45 | 44.83 | 19.35 | 38.17 | 2.00 | 34.17 | 46.39 | 43.88 |
Figure 5: Prefill latency and peak GPU memory under increasing evaluation context lengths for LLaMA2-7B-32K. (a) Prefill time measures the forward-pass latency (seconds) to process the entire input prompt. (b) Peak memory (GB) reports the maximum GPU memory allocated during the prefill stage.
Figure 6: Prefill latency versus peak GPU memory at 64K context length for LLaMA2-7B-32K.
#### Perplexity under Increasing Context Lengths
We evaluate perplexity on the PG19 test set under increasing evaluation context lengths. Figure [4](https://arxiv.org/html/2605.11744#S3.F4) reports results for both LLaMA2-32K and LLaMA2-80K. Evaluation at 64K exceeds the training context length for LLaMA2-32K, and represents a challenging long-context regime in terms of modeling stability across methods.
Across commonly used medium\-length contexts \(4K–32K\), different methods exhibit distinct perplexity trends as context length increases\. Methods based on sparse, windowed, or compressed\-context mechanisms \(e\.g\., MInference, StreamingLLM, and CCA\) often show non\-monotonic behavior with noticeable fluctuations\. Full attention remains relatively smooth at moderate lengths but degrades as context length grows\.
In contrast, our method shows a smoother perplexity trend across the evaluated context lengths, with perplexity increasing gradually as context length grows, and does not exhibit abrupt spikes when the evaluation length exceeds the training context\.
#### Downstream Performance on LongBench-E
Table [1](https://arxiv.org/html/2605.11744#S3.T1) reports results on LongBench-E across two long-context backbones. Overall, our method attains the highest average score under both settings, with improvements observed at both 32K and 80K across multiple task categories.
The largest improvement is observed on summarization, where our method outperforms all baselines at 32K \(e\.g\., 12\.1 vs\. 3\.09–4\.60\), while maintaining comparable performance on question answering tasks\. At longer contexts, these gains extend to multi\-document QA, where our method attains an average score of 8\.56 at 80K, contributing to the highest overall average\.
These results reflect the effect of enforcing training–inference consistency at the segment level, and correspond to improved robustness in long-context modeling across both local generation and cross-segment reasoning tasks. All improvements are obtained while preserving the efficiency advantages of segment-level execution, with lower memory usage and a favorable latency–memory trade-off compared to full attention. A detailed efficiency analysis is provided in Section [4.3](https://arxiv.org/html/2605.11744#S4.SS3). Results on a more recent backbone (LLaMA 3.1) and LongBench v2 are reported in Appendix [G](https://arxiv.org/html/2605.11744#A7) for completeness.
#### Length Generalization on RULER
We evaluate length generalization on the RULER benchmark by increasing the context length from 4K to 64K in Table [2](https://arxiv.org/html/2605.11744#S4.T2). Within the effective context range of LLaMA2-7B-32K (4K–32K), performance degrades for all methods as context length grows; however, our method consistently achieves higher and more stable accuracy on both CWE and FWE, resulting in the best average performance (Avg*: 46.39 / 43.88) among all baselines. When the context length is further extended beyond the training range to 64K, most existing methods collapse to zero performance, whereas our method retains non-zero accuracy, exhibiting smoother degradation rather than abrupt failure. These results indicate that under strictly constrained execution, our method demonstrates stronger robustness to context-length scaling, including beyond the nominal training context range.
### 4.3 Prefill Latency and Memory Efficiency
We analyze efficiency in long-context settings by examining prefill latency and peak GPU memory consumption as the evaluation context length increases. As shown in Figure [5](https://arxiv.org/html/2605.11744#S4.F5), while both latency and memory grow with context length for all methods, our approach exhibits substantially more gradual scaling behavior than existing baselines, particularly at longer contexts.
Figure [6](https://arxiv.org/html/2605.11744#S4.F6) further illustrates this effect at a representative context length of 64K. Our method maintains a low peak memory footprint and moderate prefill latency, whereas other methods either incur significantly higher memory usage or suffer from increased latency. Overall, these results indicate that our approach achieves a more balanced latency–memory trade-off during long-context prefill, rather than optimizing a single efficiency dimension in isolation.
### 4.4 Ablation: Validating Training–Inference Consistency
Table [3](https://arxiv.org/html/2605.11744#S4.T3) studies the effect of training–inference alignment and the TBPTT truncation depth $K$ under our segment-level execution semantics. Here, *Misaligned* trains the model with standard full-context attention but evaluates it under our segmented execution, resulting in mismatched forward semantics. Removing alignment (*Misaligned*) leads to a substantial performance drop across all LongBench(-E) categories, indicating the effect of optimizing a training objective consistent with inference execution. Notably, increasing the TBPTT depth from $K{=}1$ to $K{=}2$ does not improve performance and can slightly degrade it. This contrasts with classical recurrent models, and reflects a key difference in our framework, where the carried KV interface constitutes the only differentiable cross-segment state. These results empirically support $K{=}1$ as the most aligned and effective choice under our training–inference consistent segmented execution. Additional ablation results and analyses are provided in Appendix [G](https://arxiv.org/html/2605.11744#A7).
Table 3: Effect of training–inference alignment and TBPTT depth on LongBench-E category scores. The ablation evaluates the necessity of alignment and the impact of truncation depth.

| Method | S. QA | M. QA | Sum. | FS. | Syn. | Code | Avg. |
|---|---|---|---|---|---|---|---|
| Aligned (TBPTT=1) | 7.58 | 8.56 | 8.70 | 63.76 | 0.42 | 56.02 | 24.17 |
| Misaligned | 3.52 | 2.79 | 6.80 | 36.97 | 0.00 | 21.37 | 11.91 |
| Aligned (TBPTT=2) | 8.88 | 8.48 | 9.36 | 64.42 | 0.58 | 52.71 | 24.07 |
## 5 Conclusion
We propose a training–inference consistent segmented execution framework for long\-context language modeling under constrained execution\. The framework treats segmented execution as a shared modeling assumption between training and inference, and aligns state evolution and credit assignment through a controlled cross\-segment interface\. Under this formulation, we show that when cross\-segment recursion is strictly bounded, truncated backpropagation computes the exact gradient of a consistent training objective, preventing reliance on information unavailable at inference time\. Empirically, the proposed approach performs competitively across diverse long\-context benchmarks and achieves favorable latency–memory trade\-offs, while substantially improving scalability in extreme long\-context regimes\.
## Acknowledgements
This work was funded by National Natural Science Foundation of China \(Grant No\. 62366036\), Outstanding Youth Fund Project of Inner Mongolia Autonomous Region \(Grant No\. 2025JQ010\), Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region \(Grant No\. NJYT24033\), Major Science and Technology Projects of Inner Mongolia Autonomous Region \(Grant No\. 2025ZDSF0029\), Key R&D and Achievement Transformation Program of Inner Mongolia Autonomous Region \(Grant No\. 2025YFDZ0011, 2025YFDZ0026, 2025YFSH0021, 2025YFHH0073\), Hohhot Science and Technology Project \(Grant No\. 2023\-Zhan\-Zhong\-1\)\.
## Impact Statement
This work improves the scalability of long\-context language modeling by proposing a training–inference consistent segmented execution framework that reduces memory usage and inference latency while maintaining strong performance\. The primary positive impact is enabling more accessible and energy\-efficient deployment of long\-context LLMs for applications such as document understanding and long\-form summarization\. As a general\-purpose efficiency improvement, the method may also lower the cost of deploying long\-context LLMs, which could amplify existing risks associated with misuse of such models, including misinformation generation or handling of sensitive content\. We encourage responsible deployment through standard safeguards, including privacy\-preserving data practices, access control for sensitive use cases, and content safety measures\.
## References
- Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Agrawal, A., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B. S., and Ramjee, R. SARATHI: Efficient LLM inference by piggybacking decodes with chunked prefills. CoRR, abs/2308.16369, 2023. https://doi.org/10.48550/arXiv.2308.16369.
- Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Technical report, 2024.
- Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3119–3137, 2024.
- Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y., et al. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3639–3664, 2025.
- Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. CoRR, abs/2004.05150, 2020. https://arxiv.org/abs/2004.05150.
- Bulatov, A., Kuratov, Y., and Burtsev, M. Recurrent memory transformer. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
- Chen, Y., You, Z., Zhang, S., Li, H., Li, Y., Wang, Y., and Tan, M. Core context aware transformers for long context language modeling. In Forty-second International Conference on Machine Learning (ICML 2025), 2025. https://openreview.net/forum?id=MAHPZNduS4.
- Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q. V., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL 2019, Volume 1: Long Papers), pp. 2978–2988, 2019. https://doi.org/10.18653/v1/p19-1285.
- Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations (ICLR 2024), 2024. https://openreview.net/forum?id=mZn2Xyh9Ec.
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
- Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y., and Peng, H. Data engineering for scaling language models to 128K context. In Forty-first International Conference on Machine Learning (ICML 2024), 2024. https://openreview.net/forum?id=TaAqeo7lUh.
- Fu, Z., Song, W., Wang, Y., Wu, X., Zheng, Y., Zhang, Y., Xu, D., Wei, X., Xu, T., and Zhao, X. Sliding window attention training for efficient large language models. CoRR, abs/2502.18845, 2025. https://doi.org/10.48550/arXiv.2502.18845.
- Han, C., Wang, Q., Peng, H., Xiong, W., Chen, Y., Ji, H., and Wang, S. LM-Infinite: Zero-shot extreme length generalization for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2024, Volume 1: Long Papers), pp. 3991–4008, 2024. https://doi.org/10.18653/v1/2024.naacl-long.222.
- He, S., Sun, G., Shen, Z., and Li, A. What matters in transformers? Not all attention is needed. CoRR, abs/2406.15786, 2024. https://doi.org/10.48550/arXiv.2406.15786.
- Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., and Ginsburg, B. RULER: What's the real context size of your long-context language models? In First Conference on Language Modeling, 2024. https://openreview.net/forum?id=kIoBbc76Sy.
- Jiang, H., Li, Y., Zhang, C., Wu, Q., Luo, X., Ahn, S., Han, Z., Abdi, A. H., Li, D., Lin, C., Yang, Y., and Qiu, L. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. In Advances in Neural Information Processing Systems 38 (NeurIPS 2024), 2024.
- Keles, F. D., Wijewardena, P. M., and Hegde, C. On the computational complexity of self-attention. In International Conference on Algorithmic Learning Theory, pp. 597–619. PMLR, 2023.
- Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP 2023), pp. 611–626. ACM, 2023. https://doi.org/10.1145/3600006.3613165.
- Liu, X., Tang, Z., Dong, P., Li, Z., Li, B., Hu, X., and Chu, X. ChunkKV: Semantic-preserving KV cache compression for efficient long-context LLM inference. CoRR, abs/2502.00299, 2025. https://doi.org/10.48550/arXiv.2502.00299.
- Rae, J. W., Potapenko, A., Jayakumar, S. M., Hillier, C., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. In 8th International Conference on Learning Representations (ICLR 2020), 2020. https://openreview.net/forum?id=SylKikSYDH.
- Raffel, M., Penney, D., and Chen, L. Shiftable context: Addressing training-inference context mismatch in simultaneous speech translation. In International Conference on Machine Learning (ICML 2023), Proceedings of Machine Learning Research vol. 202, pp. 28519–28530. PMLR, 2023. https://proceedings.mlr.press/v202/raffel23a.html.
- Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., and Zhang, C. FlexGen: High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning (ICML 2023), Proceedings of Machine Learning Research vol. 202, pp. 31094–31116. PMLR, 2023. https://proceedings.mlr.press/v202/sheng23a.html.
- Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Wu, W., Wang, Y., Xiao, G., Peng, H., and Fu, Y. Retrieval head mechanistically explains long-context factuality. In The Thirteenth International Conference on Learning Representations (ICLR 2025), 2025. https://openreview.net/forum?id=EytBpUGB1Z.
- Wu, Y., Rabe, M. N., Hutchins, D., and Szegedy, C. Memorizing transformers. In The Tenth International Conference on Learning Representations (ICLR 2022), 2022. https://openreview.net/forum?id=TrjbxzRcnf-.
- Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations (ICLR 2024), 2024. https://openreview.net/forum?id=NG7sS51zVF.
- Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., Fu, Y., and Han, S. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. In The Thirteenth International Conference on Learning Representations (ICLR 2025), 2025. https://openreview.net/forum?id=cFu7ze7xUm.
- Yuan, J., Gao, H., Dai, D., Luo, J., Zhao, L., Zhang, Z., Xie, Z., Wei, Y., Wang, L., Xiao, Z., Wang, Y., Ruan, C., Zhang, M., Liang, W., and Zeng, W. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025, Volume 1: Long Papers), pp. 23078–23097, 2025. https://aclanthology.org/2025.acl-long.1126/.
- Zhao, W., Zhou, Z., Su, Z., Xiao, C., Li, Y., Li, Y., Zhang, Y., Zhao, W., Li, Z., Huang, Y., Sun, A., Han, X., and Liu, Z. InfLLM-V2: Dense-sparse switchable attention for seamless short-to-long adaptation. CoRR, abs/2509.24663, 2025. https://doi.org/10.48550/arXiv.2509.24663.
## Appendix A Additional Theory: Exactness of TBPTT Under Consistent Segmented Objectives
### A.1 Computational-Graph Equivalence Viewpoint
This subsection elaborates the proof sketch in Theorem [3.3](https://arxiv.org/html/2605.11744#S3.Thmtheorem3). Definition [3.2](https://arxiv.org/html/2605.11744#S3.Thmtheorem2) inserts the stop-gradient boundary $\mathrm{sg}(C_{b_i})$ into the state chain, where $b_i = \max(0,\, i - K - 1)$. This eliminates any gradient path that traverses more than $K$ consecutive state transitions. Thus, for each loss term $\ell_i(\theta;\, \tilde{C}_{i-1}^{(K)}, R_{i-1})$, the backpropagation graph contains only the unrolled state updates from segments $b_i + 1$ to $i - 1$. TBPTT with truncation depth $K$ computes gradients on exactly this truncated graph and hence equals $\nabla_\theta L_K(\theta)$.
### A.2 Chain-Rule Derivation via Jacobian Products and Adjoint Recursion
We provide an explicit chain-rule expansion of the gradient computed by TBPTT and show that it equals $\nabla_\theta L_K(\theta)$.
#### Notation\.
Let $C_i \in \mathbb{R}^{d_C}$ be the vectorized interface state. Define the Jacobian of the state transition
$$J_i := \frac{\partial C_i}{\partial C_{i-1}} \in \mathbb{R}^{d_C \times d_C}, \tag{6}$$
and the sensitivity of the transition to the parameters
$$U_i := \frac{\partial C_i}{\partial \theta}. \tag{7}$$
For a fixed loss term $\ell_i$, define its gradient with respect to the pre-segment state
$$g_{i-1}^{(i)} := \frac{\partial \ell_i}{\partial C_{i-1}}. \tag{8}$$
#### Truncated dependency induced by $\mathrm{sg}(\cdot)$.
Under $L_K(\theta)$, the computational graph for $\ell_i$ includes only the unrolled state transitions from $b_i + 1$ to $i - 1$, because $\tilde{C}_{b_i}^{(K)} = \mathrm{sg}(C_{b_i})$ blocks gradients into earlier segments. Therefore, for $\ell_i$ we only need to account for parameter effects on $C_j$ for $j \in \{b_i + 1, \dots, i - 1\}$.
#### Adjoint recursion \(backward pass\)\.
Define adjoints for $\ell_i$:
$$a_{i-1}^{(i)} := \frac{\partial \ell_i}{\partial C_{i-1}} = g_{i-1}^{(i)}, \tag{9}$$
$$a_{j-1}^{(i)} := a_j^{(i)} J_j, \qquad j = i-1,\, i-2,\, \dots,\, b_i + 1. \tag{10}$$
This recursion is precisely the backward propagation across the truncated state chain, stopping at the boundary state $C_{b_i}$.
#### Gradient expansion for one loss term\.
By the chain rule, the gradient of $\ell_i$ with respect to $\theta$ under the truncated graph is
$$\frac{\partial \ell_i}{\partial \theta} = \left.\frac{\partial \ell_i}{\partial \theta}\right|_{\mathrm{direct}} + \sum_{j = b_i + 1}^{i-1} a_j^{(i)} U_j, \tag{11}$$
where the "direct" term captures the explicit dependence of $\ell_i$ on $\theta$ not mediated by $C$, and the summation captures all contributions flowing through the unrolled (and truncated) state updates.
#### From one term to the full objective\.
Summing Eq. ([11](https://arxiv.org/html/2605.11744#A1.E11)) over $i$ yields
$$\nabla_\theta L_K(\theta) = \sum_{i=1}^{N} \left( \left.\frac{\partial \ell_i}{\partial \theta}\right|_{\mathrm{direct}} + \sum_{j = b_i + 1}^{i-1} a_j^{(i)} U_j \right), \tag{12}$$
which exactly matches the computation performed by TBPTT with truncation depth $K$: backpropagate each loss through at most $K$ transitions (Eq. ([10](https://arxiv.org/html/2605.11744#A1.E10))) and accumulate the contributions.
This completes a constructive proof that TBPTT computes the exact gradient $\nabla_\theta L_K(\theta)$.
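To make the construction concrete, the following is a minimal and deliberately unoptimized PyTorch-style sketch of training under the truncated objective $L_K(\theta)$. The `model.step(state, segment)` interface, which returns a loss term and the next carried state (taken to be a single tensor here), is an assumption of the sketch rather than part of any released implementation; the point is only that detaching the boundary state $C_{b_i}$ and re-unrolling at most $K$ transitions per loss term reproduces exactly the gradient in Eq. (12).

```python
import torch

def truncated_segment_gradient(model, segments, K=1):
    """Unoptimized sketch of the truncated objective L_K (Appendix A).

    `model.step(state, segment)` is an assumed interface returning
    (loss_i, new_state); `state` plays the role of the carried interface
    state C_{i-1} and is a single tensor in this sketch.
    """
    # First pass: record interface states without building a long graph.
    states = [None]
    with torch.no_grad():
        state = None
        for seg in segments:
            _, state = model.step(state, seg)
            states.append(state)

    # Second pass: one truncated backward per loss term l_i.
    for i in range(1, len(segments) + 1):
        b_i = max(0, i - K - 1)
        state = states[b_i].detach() if states[b_i] is not None else None
        for j in range(b_i + 1, i):
            # Re-unroll the state updates from segment b_i+1 to i-1,
            # starting from the stop-gradient boundary sg(C_{b_i}).
            _, state = model.step(state, segments[j - 1])
        loss_i, _ = model.step(state, segments[i - 1])
        loss_i.backward()  # gradients accumulate to exactly grad L_K(theta)
```

In practice the re-unrolled forward passes can be avoided by caching activations across adjacent loss terms; the accumulated gradient is unchanged.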
## Appendix B Forward-Only Long-Range Retrieval Does Not Add Gradient Paths
###### Lemma B.1 (No-gradient retrieval prefix).
Suppose the long-range prefix is constructed as
$$R_{i-1} = \mathrm{sg}\big(\rho(M_{i-1}, q_{i-1})\big), \tag{13}$$
where $\rho$ is an arbitrary retrieval operator (e.g., top-$k$ search), $M_{i-1}$ is an external memory, and $q_{i-1}$ is a query. Then $R_{i-1}$ introduces no gradient path with respect to the model parameters:
$$\frac{\partial R_{i-1}}{\partial \theta} = 0. \tag{14}$$
###### Proof\.
The stop-gradient operator $\mathrm{sg}(\cdot)$ makes $R_{i-1}$ a constant in backward propagation, hence its derivative with respect to $\theta$ is zero. ∎
#### Implication\.
Lemma [B.1](https://arxiv.org/html/2605.11744#A2.Thmtheorem1) implies that the optional long-range module only changes forward conditioning; it does not create additional cross-segment credit-assignment paths and therefore does not affect Theorem [3.3](https://arxiv.org/html/2605.11744#S3.Thmtheorem3).
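As a concrete illustration of Lemma B.1, the sketch below runs retrieval under `torch.no_grad()` so that the gathered prefix is a constant in the backward pass. The function name and the max-pooled dot-product scoring are assumptions made for illustration only, not a description of the paper's retrieval operator.

```python
import torch

def retrieve_prefix_no_grad(memory_keys, memory_values, queries, k=64):
    """Sketch of a forward-only retrieval prefix (Lemma B.1).

    memory_keys, memory_values: [P, d] external KV memory M_{i-1}.
    queries: [L_q, d] retrieval queries q_{i-1}.
    Scoring runs under torch.no_grad(), so the returned prefix R_{i-1}
    carries no gradient path to the model parameters (dR/dtheta = 0).
    """
    with torch.no_grad():
        scores = (queries @ memory_keys.T).max(dim=0).values   # [P]
        top_idx = scores.topk(min(k, scores.numel())).indices
        k_ret = memory_keys[top_idx].clone()
        v_ret = memory_values[top_idx].clone()
    return k_ret, v_ret  # constants in backprop; safe to concatenate as a prefix
```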
## Appendix C Attention Contexts for Local and Long-Range Heads
We formalize the per-head attention context used by the architecture in Section [3.4](https://arxiv.org/html/2605.11744#S3.SS4). Let segment $x^{(i)}$ have length $m_i$. For layer $\ell$ and head $h$, denote the within-segment query/key/value matrices as $Q_i^{(\ell,h)} \in \mathbb{R}^{m_i \times d}$, $K_i^{(\ell,h)} \in \mathbb{R}^{m_i \times d}$, and $V_i^{(\ell,h)} \in \mathbb{R}^{m_i \times d}$.
We use standard scaled dot\-product attention:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left( \frac{QK^\top}{\sqrt{d}} + \mathcal{M} \right) V, \tag{15}$$
where $\mathcal{M}$ is a causal mask over the concatenated prefix+segment sequence: each segment query position may attend to all prefix positions and to within-segment positions up to and including itself.
#### Carried KV tail and retrieval prefix\.
For local heads $h \in \mathcal{H}_{\text{local}}$, let the carried KV tail from the previous segment be $C_{i-1}^{(\ell,h)} = \big(K_{\text{carry}}^{(\ell,h)} \in \mathbb{R}^{M \times d},\, V_{\text{carry}}^{(\ell,h)} \in \mathbb{R}^{M \times d}\big)$. For long-range heads $h \in \mathcal{H}_{\text{long}}$ and layers $\ell \in \mathcal{L}_{\text{long}}$, let the retrieval prefix be $R_{i-1}^{(\ell,h)} = \big(K_{\text{ret}}^{(\ell,h)} \in \mathbb{R}^{R \times d},\, V_{\text{ret}}^{(\ell,h)} \in \mathbb{R}^{R \times d}\big)$.
#### Context construction\.
Define the context keys/values $(K_{\text{ctx}}, V_{\text{ctx}})$ as:
$$(K_{\text{ctx}}, V_{\text{ctx}}) = \begin{cases} \big([K_{\text{carry}}^{(\ell,h)}; K_i^{(\ell,h)}],\, [V_{\text{carry}}^{(\ell,h)}; V_i^{(\ell,h)}]\big) & \text{if } h \in \mathcal{H}_{\text{local}}, \\[4pt] \big([K_{\text{ret}}^{(\ell,h)}; K_i^{(\ell,h)}],\, [V_{\text{ret}}^{(\ell,h)}; V_i^{(\ell,h)}]\big) & \text{if } h \in \mathcal{H}_{\text{long}} \text{ and } \ell \in \mathcal{L}_{\text{long}}, \\[4pt] \big(K_i^{(\ell,h)},\, V_i^{(\ell,h)}\big) & \text{if } h \in \mathcal{H}_{\text{long}} \text{ and } \ell \notin \mathcal{L}_{\text{long}}. \end{cases} \tag{16}$$
The head output is then $O_i^{(\ell,h)} = \mathrm{Attn}\big(Q_i^{(\ell,h)}, K_{\text{ctx}}, V_{\text{ctx}}\big)$.
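The case analysis in Eq. (16) amounts to a small amount of per-head bookkeeping. The sketch below builds the context for a single head and applies the masked attention of Eq. (15); argument names such as `carry_kv`, `ret_kv`, and the head/layer index sets are assumptions standing in for $\mathcal{H}_{\text{local}}$, $\mathcal{H}_{\text{long}}$, and $\mathcal{L}_{\text{long}}$, and the code is illustrative rather than an optimized kernel.

```python
import torch

def head_attention_output(q_i, k_i, v_i, head_idx, layer_idx,
                          carry_kv=None, ret_kv=None,
                          local_heads=(), long_heads=(), long_layers=()):
    """Sketch of the per-head context selection in Eq. (16).

    q_i, k_i, v_i: [m_i, d] within-segment projections for one head.
    carry_kv / ret_kv: optional (K, V) pairs of shape [M, d] / [R, d].
    """
    if head_idx in local_heads and carry_kv is not None:
        k_ctx = torch.cat([carry_kv[0], k_i], dim=0)
        v_ctx = torch.cat([carry_kv[1], v_i], dim=0)
    elif head_idx in long_heads and layer_idx in long_layers and ret_kv is not None:
        k_ctx = torch.cat([ret_kv[0], k_i], dim=0)
        v_ctx = torch.cat([ret_kv[1], v_i], dim=0)
    else:
        k_ctx, v_ctx = k_i, v_i

    m_i = q_i.shape[0]
    prefix_len = k_ctx.shape[0] - m_i
    # Causal mask of Eq. (15): queries see the full prefix and the
    # within-segment positions up to and including themselves.
    disallow = torch.zeros(m_i, prefix_len + m_i, dtype=torch.bool)
    disallow[:, prefix_len:] = torch.triu(torch.ones(m_i, m_i), diagonal=1).bool()

    scores = (q_i @ k_ctx.T) / (q_i.shape[-1] ** 0.5)
    scores = scores.masked_fill(disallow, float("-inf"))
    return scores.softmax(dim=-1) @ v_ctx
```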
## Appendix D Top-$k$ Retrieval Operator
We detail how the past-only KV pool produces a fixed-size retrieval prefix $R_{i-1}$ of length $R$.
#### Memory layout\.
For each $\ell \in \mathcal{L}_{\text{long}}$, we maintain a past-only KV pool storing key/value vectors for heads $h \in \mathcal{H}_{\text{long}}$, accumulated from earlier segments. Pool updates are append-only and performed without gradient flow (Lemma [B.1](https://arxiv.org/html/2605.11744#A2.Thmtheorem1)).
#### Query construction\.
From segment $i-1$, we take the last $L_q$ query vectors from the long-range heads and form a compact set of query summaries using (i) sliding-window mean pooling and (ii) a short tail average that emphasizes recent tokens.
#### Scoring and top-$k$.
For each head independently, we compute dot-product scores between the query summaries and all keys in the pool, select the top-$k$ indices, and deduplicate them. We then choose a small set of anchor indices with the highest scores.
#### Window expansion and padding to length $R$.
We expand the anchors with a small fixed offset window to obtain a contiguous set of indices, deduplicate and sort them, and, if fewer than $R$ indices remain, pad with a deterministic fallback (e.g., the earliest positions) to reach exactly $R$ keys/values.
#### Output\.
We gather the selected keys/values as $\big(K_{\text{ret}}^{(\ell,h)}, V_{\text{ret}}^{(\ell,h)}\big) \in \mathbb{R}^{R \times d}$ and return them as the retrieval prefix $R_{i-1}^{(\ell,h)}$ used in Eq. ([16](https://arxiv.org/html/2605.11744#A3.E16)).
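The steps above can be summarized in a short sketch for one long-range head. The hyperparameter names and defaults (`k`, `window`, `pool_window`) are assumptions chosen for illustration; only the overall pipeline of query summarization, per-head top-$k$ scoring, window expansion, and deterministic padding to length $R$ follows the description above.

```python
import torch

def topk_retrieval_prefix(pool_k, pool_v, last_queries, R=512, k=16,
                          window=8, pool_window=4):
    """Illustrative sketch of the top-k retrieval operator (Appendix D).

    pool_k, pool_v: [P, d] past-only KV pool for one long-range head (P >= R assumed).
    last_queries:   [L_q, d] trailing query vectors from the previous segment.
    """
    P = pool_k.shape[0]

    # (i) Query summaries: mean pooling over windows plus a short tail average.
    pooled = last_queries.unfold(0, pool_window, pool_window).mean(dim=-1)
    tail = last_queries[-pool_window:].mean(dim=0, keepdim=True)
    summaries = torch.cat([pooled, tail], dim=0)

    # (ii) Dot-product scoring against all pooled keys; pick top-k anchors.
    key_scores = (summaries @ pool_k.T).max(dim=0).values      # [P]
    anchors = key_scores.topk(min(k, P)).indices

    # (iii) Expand anchors by a fixed offset window; deduplicate and sort.
    offsets = torch.arange(-window, window + 1)
    idx = torch.unique((anchors[:, None] + offsets).clamp(0, P - 1))

    # (iv) Pad with the earliest unselected positions (or trim) to exactly R.
    if idx.numel() < R:
        unselected = torch.arange(P)[~torch.isin(torch.arange(P), idx)]
        idx = torch.sort(torch.cat([idx, unselected[: R - idx.numel()]])).values
    idx = idx[:R]

    return pool_k[idx], pool_v[idx]                             # (K_ret, V_ret)
```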
## Appendix E RoPE Re-indexing for Prefix Concatenation
We specify the RoPE positions used when concatenating a prefix of length $P$ before the current segment, where $P = M$ for the carried KV tail and $P = R$ for the retrieved prefix. For segment $i$ of length $m_i$, we assign prefix positions $p_{\text{pref}} = \{0, \dots, P-1\}$ and segment positions $p_{\text{seg}} = \{P, \dots, P + m_i - 1\}$.
#### RoPE application\.
We apply RoPE to the segment queries/keys using positions $p_{\text{seg}}$. For the prefix KV (carried or retrieved), only the keys require RoPE rotation; we apply RoPE to the prefix keys using positions $p_{\text{pref}}$. Values are unchanged. This is equivalent to treating the prefix as preceding context under standard RoPE while leaving the segment attention structure unchanged.
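A minimal sketch of this re-indexing is given below. It assumes a standard interleaved RoPE layout (the exact rotation layout of the backbone is not restated here); only the position assignment mirrors the scheme above, and all function names are illustrative.

```python
import torch

def rope_angles(positions, d, base=10000.0):
    """Standard RoPE rotation angles for integer positions (sketch)."""
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    return positions[:, None].float() * inv_freq[None, :]        # [len, d/2]

def apply_rope(x, positions):
    """Rotate vectors x: [len, d] (d even) at the given positions."""
    ang = rope_angles(positions, x.shape[-1])
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def reindex_prefix_and_segment(prefix_k, seg_q, seg_k):
    """Appendix E position assignment: prefix keys get 0..P-1, segment
    queries/keys get P..P+m_i-1; values are left unrotated."""
    P, m_i = prefix_k.shape[0], seg_q.shape[0]
    k_pref = apply_rope(prefix_k, torch.arange(P))
    q_seg = apply_rope(seg_q, torch.arange(P, P + m_i))
    k_seg = apply_rope(seg_k, torch.arange(P, P + m_i))
    return k_pref, q_seg, k_seg
```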
## Appendix F Additional Implementation Details
#### Implementation details of our method\.
Unless otherwise specified, our method is trained and evaluated under identical segment-level execution settings. Both training and inference use a fixed segment length of $S = 4096$, and training samples are partitioned into non-overlapping segments at the individual-sample level. The differentiable carried interface is instantiated as a KV tail of fixed length $M = 512$ passed from segment $i-1$ to segment $i$. The long-range module is forward-only and uses the last $L_q \in [24, 64]$ query vectors from the previous segment to form retrieval queries, returning a fixed-size retrieved prefix of length $R = 512$ when enabled.
For the main experiments, we use truncated backpropagation through time with truncation depth $K = 1$, which corresponds to a maximum training context length of 8K tokens. Experiments with $K = 2$ (maximum training context length 12K) are used only for ablation studies. For alignment fine-tuning, we optimize a standard language modeling objective on the SlimPajama dataset for 1,000 training steps, using a micro-batch size of 1 with gradient accumulation of 8. All alignment-based models are trained under identical optimization settings. Our primary experiments are conducted on LLaMA2-7B models with extended context lengths of 32K and 80K. Additional experiments on LLaMA-3.1 are reported only for the LongBench-v2 evaluation.
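For readability, the settings above can be collected into a single configuration object. The field names below are illustrative only and do not correspond to any released API; the values restate the hyperparameters given in this appendix.

```python
from dataclasses import dataclass

@dataclass
class SegmentedExecutionConfig:
    """Hyperparameters from Appendix F, gathered for reference (names are illustrative)."""
    segment_length: int = 4096      # S: fixed segment length for training and inference
    carried_kv_tail: int = 512      # M: differentiable KV tail carried across segments
    retrieved_prefix: int = 512     # R: fixed-size forward-only retrieval prefix
    retrieval_queries: int = 64     # L_q in [24, 64]; 64 is an assumed default
    tbptt_depth: int = 1            # K: truncation depth (K = 2 used only in ablations)
    train_steps: int = 1000         # alignment fine-tuning steps on SlimPajama
    micro_batch_size: int = 1
    grad_accumulation: int = 8
```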
#### Implementation details of baseline methods\.
We compare our approach against representative long\-context baselines, including CCA\-attention, MInference, NSA, StreamingLLM, and DuoAttention\.
For CCA\-attention, we adopt the official implementation and follow the default configuration, using a window size of 1024 and a group size of 16\.
For MInference, we use the official implementation with the recommended sparse attention configuration provided by the authors\.
For NSA (Yuan et al., [2025](https://arxiv.org/html/2605.11744#bib.bib31)), since the official implementation was not publicly available at the time of our experiments, we use the reproduced implementation from Zhao et al. ([2025](https://arxiv.org/html/2605.11744#bib.bib32)). NSA is evaluated on the same LLaMA2-7B-80K backbone and under the same standard LongBench protocol used in Appendix [G.1](https://arxiv.org/html/2605.11744#A7.SS1).
For StreamingLLM, the official implementation does not employ FlashAttention, which would lead to unfair efficiency comparisons in long-context settings. We therefore evaluate StreamingLLM through the implementation provided in the MInference framework, which supports FlashAttention-based execution. Specifically, we enable the StreamingLLM execution path by setting `attn_type=streaming` with `kv_type=streamingllm`, corresponding to a sliding-window attention mechanism with attention sinks. We use a local window size of 2044 tokens and retain 4 initial sink tokens (i.e., `n_local=2044` and `n_init=4`), following the standard configuration used in prior work.
For DuoAttention, we use the official implementation and the attention patterns released by the authors. For LLaMA2-7B-32K and LLaMA2-7B-80K, we apply the provided Llama-2-7B-32K-Instruct patterns. For LLaMA-3.1-8B-Instruct, we use the official Meta-Llama-3.1-8B-Instruct patterns. Unless otherwise specified, all baseline methods are evaluated under their standard pretrained configurations without additional fine-tuning.
## Appendix G Additional Experiment Results
### G.1 Standard LongBench Evaluation with NSA Baseline
Table 4: Standard LongBench results on LLaMA2-7B-80K. LongBench-E is a length-stratified subset of LongBench; we additionally report results on the standard LongBench benchmark and include NSA as an additional sparse-attention baseline.

| Method | NQA | Qasper | MF-en | MF-zh | HQA | 2Wiki | MuSiQue | Dureader | GovReport | QMSum | MultiNews | TREC | TriviaQA | SAMSum | LCC | RepoBench-P | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA2-7B-80K | 2.07 | 4.99 | 6.64 | 3.59 | 6.00 | 4.15 | 2.32 | 17.97 | 10.55 | 20.34 | 8.56 | 69.00 | 86.75 | 43.29 | 55.66 | 45.33 | 24.20 |
| StreamingLLM | 1.03 | 3.06 | 4.93 | 1.88 | 6.83 | 2.31 | 2.81 | 8.10 | 5.03 | 15.30 | 12.04 | 60.50 | 87.07 | 41.36 | 57.32 | 45.86 | 22.21 |
| DuoAttention | 1.56 | 5.28 | 5.81 | 4.61 | 5.79 | 4.25 | 2.74 | 16.95 | 11.26 | 19.37 | 9.73 | 69.00 | 87.12 | 31.36 | 56.07 | 45.58 | 23.53 |
| MInference | 1.92 | 5.04 | 6.44 | 3.58 | 5.60 | 3.87 | 2.66 | 18.38 | 10.32 | 20.85 | 8.81 | 69.50 | 86.67 | 43.74 | 55.82 | 45.30 | 24.28 |
| NSA | 2.24 | 5.23 | 7.10 | 8.34 | 5.44 | 6.99 | 3.67 | 11.68 | 10.10 | 15.14 | 4.52 | 60.00 | 73.27 | 26.84 | 53.94 | 44.11 | 21.16 |
| Ours | 5.89 | 8.22 | 8.11 | 9.68 | 6.85 | 9.87 | 3.81 | 12.77 | 9.38 | 16.81 | 10.30 | 63.50 | 88.22 | 41.99 | 58.47 | 51.93 | 25.36 |
As shown in Table [4](https://arxiv.org/html/2605.11744#A7.T4), our method achieves the best overall average on the standard LongBench benchmark. Compared with NSA, our method obtains stronger average performance and improves substantially on QA-oriented tasks such as NarrativeQA, Qasper, HotpotQA, 2WikiMQA, and MuSiQue, suggesting that the gains observed on LongBench-E are not specific to the length-stratified subset.
### G.2 LLaMA-3.1 Results on LongBench-v2
Since the main experiments are conducted on LLaMA2\-based backbones, we further examine whether the proposed method exhibits similar trends on a more recent architecture\. We apply our approach to LLaMA\-3\.1\-8B\-Instruct and report additional results on LongBench\-v2, without modifying the method or tuning hyperparameters for this backbone\.
As shown in Table [5](https://arxiv.org/html/2605.11744#A7.T5), our method shows performance trends consistent with those observed on LLaMA2, achieving competitive accuracy across difficulty levels and context-length categories. While DuoAttention attains the highest overall score in this setting, our approach remains comparable without relying on backbone-specific attention patterns or additional fine-tuning, suggesting that the proposed training–inference consistent segmented execution generalizes beyond a specific backbone architecture.
Table 5: Comparisons on LongBench-v2.

| Method | Easy | Hard | Short | Medium | Long | Overall |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 30.7 | 29.6 | 35.0 | 26.5 | 28.7 | 30.0 |
| StreamingLLM | 22.9 | 23.8 | 30.0 | 18.6 | 22.2 | 23.5 |
| MInference | 27.1 | 28.0 | 33.3 | 24.2 | 25.0 | 27.6 |
| DuoAttention | 31.2 | 31.2 | 36.1 | 27.0 | 31.5 | 31.2 |
| Ours | 27.6 | 31.2 | 33.3 | 28.4 | 26.9 | 29.8 |
### G.3 Effect of Training–Inference Alignment on Language Modeling Perplexity
To clarify why training–inference misalignment leads to the downstream performance degradation observed in the main experiments, we report language modeling perplexity under different TBPTT settings. Under aligned training–inference execution (TBPTT = 1 or 2), perplexity remains stable across evaluation context lengths, indicating that truncated backpropagation preserves language modeling stability in the proposed framework. In contrast, misaligned training results in rapidly increasing and unstable perplexity as context length grows, particularly beyond the training range. These results indicate that the downstream performance collapse reported in the main text is preceded by severe instability at the language modeling level, highlighting the necessity of training–inference consistency under constrained execution. This observation empirically corroborates the theoretical result that, under the proposed constrained cross-segment recursion, TBPTT with $K = 1$ yields an exact gradient for the inference-consistent objective.
Table 6: Perplexity under increasing context lengths for different training–inference alignment and TBPTT settings.

| Method | 4K | 8K | 16K | 32K | 64K | Avg |
|---|---|---|---|---|---|---|
| Aligned (TBPTT=1) | 7.18 | 7.10 | 7.07 | 7.06 | 7.07 | 7.10 |
| Misaligned | 7.16 | 33.48 | 68.35 | 94.60 | 111.38 | 62.99 |
| Aligned (TBPTT=2) | 7.18 | 7.09 | 7.05 | 7.04 | 7.04 | 7.08 |
### G.4 Effect of Local State Capacity
To study the effect of local cross-segment state capacity, we evaluate language modeling perplexity and downstream long-context performance under different local KV state sizes, while keeping all other settings fixed. As shown in Table [7](https://arxiv.org/html/2605.11744#A7.T7), increasing the local KV state size consistently reduces perplexity across context lengths, indicating improved stability of cross-segment modeling, with diminishing gains at larger capacities.
This trend is mirrored in the downstream results on LongBench-E (Table [8](https://arxiv.org/html/2605.11744#A7.T8)). Compared to removing the local state, introducing a finite-capacity local KV state improves overall performance, while further increasing the state size yields only marginal and non-uniform gains across task categories. Overall, these results suggest that a moderate local state capacity is sufficient to balance language modeling stability and downstream performance under constrained execution. We hypothesize that an overly large local state may encourage over-reliance on short-range carried states, which does not uniformly benefit tasks requiring global abstraction, leading to non-monotonic gains across categories.
Table 7: Perplexity under increasing context lengths with different local state capacities.

| Local KV state size | 4K | 8K | 16K | 32K | 64K | Avg |
|---|---|---|---|---|---|---|
| 0 | 7.18 | 7.15 | 7.15 | 7.15 | 7.17 | 7.16 |
| 512 | 7.18 | 7.10 | 7.07 | 7.06 | 7.07 | 7.10 |
| 1024 | 7.18 | 7.07 | 7.03 | 7.03 | 7.03 | 7.07 |

Table 8: Effect of local state capacity on LongBench-E category performance.

| Local KV state size | S. QA | M. QA | Sum. | FS. Learning | Synthetic | Code | Avg |
|---|---|---|---|---|---|---|---|
| 0 | 7.10 | 7.35 | 7.58 | 62.37 | 0.35 | 54.88 | 23.27 |
| 512 | 7.58 | 8.56 | 8.70 | 63.76 | 0.42 | 56.02 | 24.17 |
| 1024 | 8.08 | 7.67 | 8.57 | 63.80 | 0.21 | 56.82 | 24.19 |
### G.5 Effect of Long-Range Module Placement
To examine the effect of long\-range module placement, we compare language modeling perplexity and downstream long\-context performance under different numbers of long\-range layers\. This ablation isolates the impact of long\-range module depth while keeping all other settings unchanged\.
From a language modeling perspective, perplexity remains highly similar across all configurations and evaluation context lengths (Table [9](https://arxiv.org/html/2605.11744#A7.T9)), indicating that inserting long-range modules does not materially affect basic language modeling stability. In contrast, downstream results on LongBench-E exhibit clearer differences (Table [10](https://arxiv.org/html/2605.11744#A7.T10)). Configurations without long-range modules or with only a small number of such layers achieve lower overall performance, whereas introducing long-range modules at more layers leads to consistent improvements, particularly on tasks requiring cross-segment information integration.
Taken together, these results indicate that long\-range modules primarily improve downstream cross\-segment reasoning without materially affecting language modeling perplexity, consistent with their architectural role as forward\-only evidence access mechanisms that do not participate in state recursion\.
Table 9: Perplexity under increasing context lengths with different numbers of long-range layers.

| Long-range layers | 4K | 8K | 16K | 32K | 64K | Avg |
|---|---|---|---|---|---|---|
| 0 | 7.17 | 7.10 | 7.06 | 7.04 | 7.04 | 7.08 |
| 2 | 7.18 | 7.10 | 7.06 | 7.05 | 7.05 | 7.09 |
| 4 | 7.18 | 7.10 | 7.07 | 7.06 | 7.07 | 7.10 |

Table 10: LongBench-E category performance with different numbers of long-range layers.

| Long-range layers | S. QA | M. QA | Sum. | FS. Learning | Synthetic | Code | Avg |
|---|---|---|---|---|---|---|---|
| 0 | 6.10 | 5.52 | 8.77 | 62.10 | 0.00 | 53.29 | 22.63 |
| 2 | 5.47 | 5.61 | 7.51 | 62.51 | 0.08 | 53.47 | 22.44 |
| 4 | 7.58 | 8.56 | 8.70 | 63.76 | 0.42 | 56.02 | 24.17 |
### G.6 Effect of Long-Range Head Grouping
We further ablate the choice of long-range heads while keeping the number of long-range heads fixed. All variants use the same long-range-enabled layers $\mathcal{L}_{\mathrm{long}} = \{6, 8, 11, 18\}$ and differ only in how attention heads are assigned to $\mathcal{H}_{\mathrm{long}}$ and $\mathcal{H}_{\mathrm{local}}$. We compare three head grouping strategies: (i) contiguous grouping, where $\mathcal{H}_{\mathrm{long}} = \{0, \dots, 15\}$ and $\mathcal{H}_{\mathrm{local}} = \{16, \dots, 31\}$; (ii) interleaved grouping, where $\mathcal{H}_{\mathrm{long}} = \{0, 2, 4, \dots, 30\}$ and $\mathcal{H}_{\mathrm{local}} = \{1, 3, 5, \dots, 31\}$; and (iii) prior-based grouping, which uses $\mathcal{H}_{\mathrm{long}} = \{0, 1, 2, 4, 9, 12, 14, 15, 16, 18, 19, 22, 23, 26, 29, 30\}$. The remaining heads are assigned to $\mathcal{H}_{\mathrm{local}}$. All layer and head indices follow the zero-indexed implementation convention.
Table 11: Effect of long-range head grouping on the standard LongBench benchmark using LLaMA2-7B-80K. All variants use the same long-range-enabled layers $\mathcal{L}_{\mathrm{long}} = \{6, 8, 11, 18\}$ and the same number of long-range heads; only the head grouping strategy is changed.

| Head Grouping | NQA | Qasper | MF-en | MF-zh | HQA | 2Wiki | MuSiQue | Dureader | GovReport | QMSum | MultiNews | TREC | TriviaQA | SAMSum | LCC | RepoBench-P | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Contiguous | 4.18 | 4.75 | 6.88 | 5.99 | 6.27 | 5.07 | 3.11 | 9.62 | 11.51 | 16.18 | 10.92 | 62.00 | 86.32 | 42.18 | 56.27 | 51.89 | 23.95 |
| Interleaved | 5.32 | 5.04 | 6.74 | 6.09 | 5.61 | 7.15 | 2.77 | 10.31 | 6.59 | 14.89 | 10.37 | 62.50 | 87.68 | 41.67 | 58.14 | 52.17 | 23.94 |
| Prior-based | 5.89 | 8.22 | 8.11 | 9.68 | 6.85 | 9.87 | 3.81 | 12.77 | 9.38 | 16.81 | 10.30 | 63.50 | 88.22 | 41.99 | 58.47 | 51.93 | 25.36 |
As shown in Table [11](https://arxiv.org/html/2605.11744#A7.T11), contiguous and interleaved groupings obtain comparable average performance, while the prior-based grouping achieves the best overall average. This suggests that the benefit of the long-range channel depends not only on the number of long-range heads, but also on assigning long-range capacity to heads with stronger retrieval-oriented behavior.