Dual Dimensionality for Local and Global Attention


Summary

Proposes Distance-Adaptive Representation (DAR) which reduces key-value dimensionality for distant tokens while preserving full dimensionality for nearby tokens, improving KV cache efficiency without performance loss.

arXiv:2606.18587v1 Announce Type: new Abstract: Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of its distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g. 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.

Cached at: 06/18/26, 05:45 AM

# Dual Dimensionality for Local and Global Attention
Source: [https://arxiv.org/html/2606.18587](https://arxiv.org/html/2606.18587)
Zhiyuan Wang (UC Santa Barbara, zwang796@ucsb.edu), Xuan Luo (UC Santa Barbara, xuan\_luo@cs.ucsb.edu), Sirui Zeng (UC Santa Barbara, sirui\_zeng@ucsb.edu), Xifeng Yan (UC Santa Barbara, xyan@cs.ucsb.edu)

###### Abstract

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and values) are typically represented with the same dimensionality for every token, regardless of the token's distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g., 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.

## 1 Introduction

The success of Transformer-based language models is largely attributed to the self-attention mechanism Vaswani et al. ([2017](https://arxiv.org/html/2606.18587#bib.bib18)), which allows each token to attend to all preceding context. In standard implementations, every previous token contributes key and value states of the same dimensionality, regardless of its distance from the current prediction target. This reflects an implicit architectural assumption that the representational capacity required of past tokens does not depend on how far they are from the position being predicted.

We revisit this assumption, motivated by a simple observation about natural language. When producing a sequence of words, the most recent context has direct effects on the next word, such as avoiding immediate repetition, following local grammatical rules, and keeping sentiment consistent, while more distant context provides long-range memory. This asymmetry suggests that local and distant tokens may contribute different kinds of information to next-token prediction. Formally, we hypothesize that local tokens near the prediction target carry rich, fine-grained information that is sensitive to subtle distinctions and benefits from high-dimensional representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. If this hypothesis holds, can we reduce the dimensionality of attention representations as token distance increases without substantially harming model performance?

While prior studies have extensively explored the KV cache reduction problem, none of them has addressed this question directly. We group the relevant literature into two categories. The first maintains a local context window while sparsifying attention over distant tokens. Specifically, KV cache eviction methods such as sliding-window attention Beltagy et al. ([2020](https://arxiv.org/html/2606.18587#bib.bib22)), StreamingLLM Xiao et al. ([2023](https://arxiv.org/html/2606.18587#bib.bib12)), and H2O Zhang et al. ([2023](https://arxiv.org/html/2606.18587#bib.bib11)) systematically discard past tokens based on varying importance criteria. All of them, however, retain a span of recent tokens that are guaranteed not to be evicted, suggesting that information carried by local tokens is relatively more important for prediction. The second category modifies the model architecture itself to reduce representational dimensionality. Multi-head Latent Attention (MLA) Liu et al. ([2024](https://arxiv.org/html/2606.18587#bib.bib17)), proposed by DeepSeek, applies uniform low-rank compression across all past tokens, allowing the model to adapt to this low-rank regime through pretraining. Although MLA reduces memory overhead, its uniform latent dimensionality treats local and distant tokens identically. Compressed Sparse Attention (CSA) in DeepSeek-V4 DeepSeek-AI ([2026](https://arxiv.org/html/2606.18587#bib.bib4)) reduces KV cache further by compressing multiple tokens horizontally into one token. Taken together, prior work has yet to characterize how token distance influences the dimensionality required for attention. This motivates our investigation of the hypothesis that representational capacity should be allocated based on token distance rather than applied uniformly. We refer to this principle as Distance-Adaptive Representation (DAR).

![Refer to caption](https://arxiv.org/html/2606.18587v1/x1.png)

Figure 1: Tokens within a local window of size $w$ (including the current token $x_n$) are represented at dimensionality $d$, while tokens beyond the window are represented at a lower dimensionality $d_{\text{down}}$. The current token attends to all preceding tokens.

To verify this hypothesis, we adopt a simple implementation of DAR that maintains full-dimensional attention representations for local tokens and lower-dimensional representations for distant tokens, illustrated in Figure [1](https://arxiv.org/html/2606.18587#S1.F1). Our main findings are as follows:

- At a fixed model scale, the dimensionality assigned to distant tokens can be substantially reduced with minimal loss of perplexity, and performance degrades only below a critical threshold. The same reduction applied uniformly across all token distances degrades more sharply, indicating that local tokens require a higher minimum dimensionality than distant tokens.
- The hypothesized dimensional asymmetry holds across multiple pretraining scales (70M, 160M, and 410M parameters), where distance-adaptive dimensionality achieves perplexity comparable to the full-dimensional baseline at every scale.
- The hypothesis extends beyond pretraining perplexity: when applied as continued supervised fine-tuning on a 1B-scale model, distance-adaptive dimensionality preserves downstream task performance.

## 2 Distance-Adaptive Representation

In this work, we use a two\-regime partition scheme to evaluate Distance\-Adaptive Representation \(DAR\), a principle in which the representational capacity allocated to a token in attention varies with its distance from the prediction target\. Under this scheme, full dimensionality is assigned to neighboring tokens within a local window, while a fixed lower dimensionality is used for all tokens outside the window\.

### 2.1 Bottleneck Representation for Distant Tokens

For each token at position $j$, let $\mathbf{h}_j \in \mathbb{R}^{d}$ denote its hidden state. To test the two-regime partition, we keep the original hidden state $\mathbf{h}_j$ for tokens within a window of $w$ recent positions, and produce a lower-dimensional alternative for tokens beyond the window through a lightweight projection:

$$\mathbf{h}_j^{D} = \mathbf{h}_j \mathbf{W}_{\text{down}}, \tag{1}$$

where $\mathbf{W}_{\text{down}} \in \mathbb{R}^{d \times d_{\text{down}}}$. The bottleneck dimensionality $d_{\text{down}} < d$ controls the representational capacity available to distant tokens and is the central hyperparameter of our design. We use $\mathbf{h}_j^{D}$ as the underlying representation for distant tokens whenever they are accessed in attention. This treatment is consistent with MLA Liu et al. ([2024](https://arxiv.org/html/2606.18587#bib.bib17)); $\mathbf{h}_j^{D}$ can be interpreted as a compressed latent vector. The key difference is that tokens within the sliding window retain full dimensionality (though, in principle, they could also use a compressed representation). We additionally evaluated a variant that applies a sigmoid nonlinearity after the down-projection in Eq. ([1](https://arxiv.org/html/2606.18587#S2.E1)). Empirically, we observed comparable performance to the linear formulation. We therefore adopt the simpler linear projection throughout the paper.

### 2.2 Hybrid Attention over Two Representations

Given a query $\mathbf{q}_i$ at position $i$, the model attends to the keys and values of all preceding tokens. Because tokens within and beyond the local window are represented at different dimensionalities ($d$ and $d_{\text{down}}$, respectively), the attention computation proceeds along two paths: a *local* path for tokens within the window and a *global* path for tokens beyond it. To allow both paths to share the same key and value projections $\mathbf{W}_K$ and $\mathbf{W}_V$, we lift the bottlenecked representation $\mathbf{h}_j^{D}$ back to the model dimension $d$ before computing keys and values along the global path. For clarity, we present the formulation with a single attention head and omit standard operations such as layer normalization; multi-head attention follows directly by replicating the construction across heads.

For tokens beyond the window, the bottlenecked representation $\mathbf{h}_j^{D}$ is first projected back to dimension $d$:

$$\mathbf{h}_j^{\prime} = \mathbf{h}_j^{D}\,\mathbf{W}_{\text{up}}, \tag{2}$$

where $\mathbf{W}_{\text{up}} \in \mathbb{R}^{d_{\text{down}} \times d}$. This up-projection does not restore information lost in the bottleneck: the resulting representation has dimensionality $d$ but its information content is bounded by the bottleneck dimension $d_{\text{down}}$. Its purpose is solely to align the global path's representation with the projection space expected by $\mathbf{W}_K$ and $\mathbf{W}_V$.
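The bound on information content can be made explicit with a standard rank argument. This is a short worked derivation added for illustration, not part of the original text; $\mathbf{M}$ is a symbol introduced here for the composed map.

$$
\begin{aligned}
\mathbf{h}_j^{\prime} &= \mathbf{h}_j^{D}\,\mathbf{W}_{\text{up}} = \mathbf{h}_j\,\mathbf{W}_{\text{down}}\mathbf{W}_{\text{up}} = \mathbf{h}_j\,\mathbf{M}, \qquad \mathbf{M} := \mathbf{W}_{\text{down}}\mathbf{W}_{\text{up}} \in \mathbb{R}^{d \times d}, \\
\operatorname{rank}(\mathbf{M}) &\le \min\{\operatorname{rank}(\mathbf{W}_{\text{down}}),\ \operatorname{rank}(\mathbf{W}_{\text{up}})\} \le d_{\text{down}} < d .
\end{aligned}
$$

Every lifted vector $\mathbf{h}_j^{\prime}$ therefore lies in a subspace of $\mathbb{R}^{d}$ of dimension at most $d_{\text{down}}$, and the same holds for the global-path values $\mathbf{h}_j^{\prime}\mathbf{W}_V$ and the pre-RoPE keys $\mathbf{h}_j^{\prime}\mathbf{W}_K$.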

For each preceding position $j$, the keys and values used in attention are then computed based on its distance from the query position $i$:

$$\mathbf{k}_j = \begin{cases} \operatorname{RoPE}(\mathbf{h}_j \mathbf{W}_K), & \text{if } i - j < w, \\ \operatorname{RoPE}(\mathbf{h}_j^{\prime} \mathbf{W}_K), & \text{otherwise,} \end{cases} \quad \mathbf{v}_j = \begin{cases} \mathbf{h}_j \mathbf{W}_V, & \text{if } i - j < w, \\ \mathbf{h}_j^{\prime} \mathbf{W}_V, & \text{otherwise,} \end{cases} \tag{3}$$

where $\operatorname{RoPE}(\cdot)$ applies rotary position embeddings and $w$ is the size of the local window. The attention output for query $\mathbf{q}_i$ is then computed in the standard way:

$$\mathbf{o}_i = \operatorname{Softmax}\!\left(\frac{\mathbf{q}_i \mathbf{K}_i^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_i, \tag{4}$$

where $\mathbf{K}_i = [\mathbf{k}_1; \dots; \mathbf{k}_i]$, $\mathbf{V}_i = [\mathbf{v}_1; \dots; \mathbf{v}_i]$, and $d_k$ is the per-head key dimensionality of the underlying multi-head attention. As in standard attention, the query is computed as $\mathbf{q}_i = \mathbf{h}_i \mathbf{W}_Q$, and the attention output $\mathbf{o}_i$ is further projected by an output projection $\mathbf{W}_O$ before being passed to the next layer. The window size $w$ thus serves as the boundary between the two paths, determining whether a preceding token is attended to via the original representation or via the bottlenecked representation.
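The two-path computation in Eqs. (1) to (4) can be sketched in a few lines of plain Python. This is a minimal single-head illustration under stated simplifications, not the authors' implementation: RoPE, layer normalization, and the output projection $\mathbf{W}_O$ are omitted, matrices are lists of rows, and the names `matmul`, `softmax`, and `dar_attention` are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dar_attention(h, w_q, w_k, w_v, w_down, w_up, window):
    """Single-head causal attention with distance-dependent keys and values.

    h holds T hidden-state rows of width d. RoPE, layer norm and W_O are
    omitted (an illustrative simplification, not the paper's full setup).
    """
    d_k = len(w_k[0])
    q = matmul(h, w_q)
    # Local path: keys/values from the original hidden states.
    k_local, v_local = matmul(h, w_k), matmul(h, w_v)
    # Global path: bottleneck (Eq. 1), lift back to d (Eq. 2), shared W_K / W_V.
    h_lifted = matmul(matmul(h, w_down), w_up)
    k_global, v_global = matmul(h_lifted, w_k), matmul(h_lifted, w_v)

    outputs = []
    for i in range(len(h)):
        keys, values = [], []
        for j in range(i + 1):
            is_local = (i - j) < window  # selection rule of Eq. 3
            keys.append(k_local[j] if is_local else k_global[j])
            values.append(v_local[j] if is_local else v_global[j])
        scores = [sum(a * b for a, b in zip(q[i], k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the selected value rows (Eq. 4).
        outputs.append([sum(wt * v[c] for wt, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs
```

Two sanity checks follow from the equations: with `window` at least the sequence length every token takes the local path, and with identity `w_down` and `w_up` (so $d_{\text{down}} = d$) the global path reproduces the local one. In both cases the sketch reduces to ordinary causal attention.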

### 2.3 Training and Inference

During training, each past token maintains two representations\. Each query attends to all past tokens, with the appropriate key/value representations selected based on distance \(Eq\. \([3](https://arxiv.org/html/2606.18587#S2.E3)\)\)\. The standard next\-token prediction objective is used:

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t};\, \theta), \tag{5}$$

where $x_t$ is the $t$-th token, $x_{<t}$ denotes all preceding tokens, $T$ is the sequence length, and $\theta$ denotes all model parameters. No auxiliary losses or additional supervision signals are introduced; backpropagation updates the bottleneck projections $\mathbf{W}_{\text{down}}$ and $\mathbf{W}_{\text{up}}$ from positions where the query attends to the global path.
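As a small numeric illustration of Eq. (5), the sketch below computes the summed negative log-likelihood from the probabilities a model assigns to the correct next tokens, together with the perplexity, assuming the usual definition (the exponential of the mean per-token loss). The function names and the probability values are hypothetical.

```python
import math

def next_token_loss(target_probs):
    """Summed negative log-likelihood, Eq. (5): -sum_t log P(x_t | x_<t)."""
    return -sum(math.log(p) for p in target_probs)

def perplexity(target_probs):
    """exp of the mean per-token negative log-likelihood (assumed definition)."""
    return math.exp(next_token_loss(target_probs) / len(target_probs))

# Hypothetical probabilities assigned to the correct next tokens.
probs = [0.5, 0.25, 0.125]
print(next_token_loss(probs))  # 6 * ln 2, about 4.1589
print(perplexity(probs))       # exp(2 * ln 2), about 4
```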

During inference, our current experiments maintain both sets of key and value states for each token, mirroring the training setup\. This is not necessary for inference, since for each query, every past token contributes through exactly one path based on distance\. However, this does not affect the validation of our hypothesis\.
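To give a rough sense of what a single-path cache could save, the following back-of-envelope sketch counts storage under one simplifying assumption that is not taken from the paper: the per-token cache cost is proportional to the dimensionality of that token's representation ($d$ inside the window, $d_{\text{down}}$ beyond it). The function name `dar_cache_fraction` is hypothetical.

```python
def dar_cache_fraction(n_tokens, window, d, d_down):
    """Fraction of full-dimensional storage, assuming cost proportional to width."""
    local = min(n_tokens, window)   # tokens kept at dimensionality d
    distant = n_tokens - local      # tokens kept at dimensionality d_down
    return (local * d + distant * d_down) / (n_tokens * d)

# Pythia-70M-like setting from the experiments: d = 512, d_down = 128, w = 128.
print(dar_cache_fraction(2048, 128, 512, 128))  # 0.296875
```

Under this assumption the fraction approaches $d_{\text{down}}/d$ as the context grows, since the fixed-size window becomes a vanishing share of the cache.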

Section [5](https://arxiv.org/html/2606.18587#S5) discusses more efficient implementations and further optimizations, including the use of Decoupled Rotary Position Embedding from MLA Liu et al. ([2024](https://arxiv.org/html/2606.18587#bib.bib17)).

## 3 Experiments

We conduct pretraining and supervised fine\-tuning experiments to validate the two\-regime partition scheme described above\. If DAR is effective, the two\-regime partition scheme should perform close to full\-dimensional attention, and substantially better than uniform lower\-dimensional attention applied to all tokens\.

![Refer to caption](https://arxiv.org/html/2606.18587v1/x2.png)

Figure 2: Document-length distribution (CDF) of three perplexity evaluation corpora, tokenized with the Pythia tokenizer. The vertical dashed lines mark the local window size ($w = 128$) and the training sequence length (2,048). The distribution shows that the majority of evaluation tokens lie well beyond the $w = 128$ window, rigorously stressing our model's reliance on the global path.

#### Pretraining experiments.

For both the hypothesis validation experiments and the scaling analysis, we pretrain models from scratch following the Pythia training recipe Andonian et al. ([2023](https://arxiv.org/html/2606.18587#bib.bib9)); Biderman et al. ([2023](https://arxiv.org/html/2606.18587#bib.bib5)), with a maximum sequence length of 2,048 tokens. The hypothesis validation experiments use the Pythia-70M architecture, while the scaling analysis additionally includes Pythia-160M and Pythia-410M. All models are trained on a 10B-token subset of the Pile Biderman et al. ([2022](https://arxiv.org/html/2606.18587#bib.bib7)); Gao et al. ([2020](https://arxiv.org/html/2606.18587#bib.bib6)), well above the compute-optimal token count for models at these scales Hoffmann et al. ([2022](https://arxiv.org/html/2606.18587#bib.bib8)). Batch sizes vary across experiments due to GPU availability and are reported in each section. Performance is evaluated by perplexity on a subset of FineWeb-Edu Lozhkov et al. ([2024](https://arxiv.org/html/2606.18587#bib.bib24)), WikiText-103 Merity et al. ([2016](https://arxiv.org/html/2606.18587#bib.bib25)), and C4 Raffel et al. ([2020](https://arxiv.org/html/2606.18587#bib.bib26)); as shown in Figure [2](https://arxiv.org/html/2606.18587#S3.F2), most evaluation sequences are substantially longer than the local window size $w$, ensuring that the global path is activated throughout evaluation. Note that for documents exceeding the maximum training sequence length, we employ a rolling evaluation strategy to ensure full sequence coverage, meaning no token is discarded.
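One way a rolling evaluation can cover every token of a long document is sketched below. The paper does not specify how its windows overlap, so the `stride` parameter, its default value, and the function name `rolling_windows` are assumptions made purely for illustration.

```python
def rolling_windows(n_tokens, max_len=2048, stride=1024):
    """Yield (start, end, score_from) so that every token is scored exactly once.

    Tokens in [start, score_from) serve as context only; tokens in
    [score_from, end) contribute to the perplexity.
    """
    if not 0 < stride <= max_len:
        raise ValueError("stride must be in (0, max_len]")
    scored_until = 0
    start = 0
    while scored_until < n_tokens:
        end = min(start + max_len, n_tokens)
        yield start, end, scored_until
        scored_until = end
        start += stride

for window in rolling_windows(5000):
    print(window)
# (0, 2048, 0), (1024, 3072, 2048), (2048, 4096, 3072), (3072, 5000, 4096)
```

Setting `stride` equal to `max_len` gives non-overlapping chunks; a smaller stride gives each scored token more preceding context at the cost of extra forward passes.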

#### Supervised fine-tuning experiments.

To assess whether our findings generalize to task-level evaluation, we adopt the instruction-tuned OLMo-2-1B-SFT OLMo et al. ([2024](https://arxiv.org/html/2606.18587#bib.bib23)) as a starting point and perform additional supervised fine-tuning with our architectural modification. Training proceeds in two stages, each consisting of one epoch over the OLMo-specific variant of the Tülu 3 dataset used for OLMo-2-1B-SFT Lambert et al. ([2024](https://arxiv.org/html/2606.18587#bib.bib37)). In the first stage, only the bottleneck parameters $\{\mathbf{W}_{\text{down}}, \mathbf{W}_{\text{up}}\}$ are trained while the rest of the model is frozen, allowing the bottleneck to learn an effective lower-dimensional representation of distant tokens before the rest of the model adapts to it. In the second stage, all parameters are trained jointly so that the model as a whole adjusts to the two-path attention computation. We use the AdamW optimizer with a linear learning rate schedule (warmup ratio 0.03), a batch size of 512, and a maximum sequence length of 2,048. The first stage uses a learning rate of $3 \times 10^{-4}$, and the second uses $3 \times 10^{-5}$. We start from a model that has already been instruction-tuned because this allows us to evaluate downstream task capability without additional pretraining, which would have exceeded our compute budget. Performance is evaluated using lm-evaluation-harness Gao et al. ([2021](https://arxiv.org/html/2606.18587#bib.bib10)) on six downstream benchmarks, covering knowledge-intensive reasoning, commonsense, mathematical reasoning, code generation, and long-context summarization (detailed in Section [3.4](https://arxiv.org/html/2606.18587#S3.SS4)). Our experiments were conducted on NVIDIA 8×A100 and 4×GH200 GPUs.
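The two-stage recipe can be summarized as a small configuration plus a schedule function. The sketch assumes one common reading of "linear schedule with warmup": the learning rate rises linearly from zero to its peak over the first 3% of steps and then decays linearly to zero. The paper does not spell out the schedule shape or the number of steps, so `linear_lr`, `STAGES`, and the step count of 1,000 are hypothetical.

```python
def linear_lr(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - warmup_steps, 1)
    return peak_lr * max(total_steps - step, 0) / remaining

# Stage 1 trains only W_down / W_up; stage 2 trains all parameters jointly.
STAGES = [
    {"trainable": "bottleneck_only", "peak_lr": 3e-4},
    {"trainable": "all_parameters", "peak_lr": 3e-5},
]

for stage in STAGES:
    # 1,000 steps is a placeholder; the real count depends on the dataset size.
    peak = max(linear_lr(s, 1000, stage["peak_lr"]) for s in range(1001))
    print(stage["trainable"], peak)
```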

### 3.1 Core Hypothesis Validation

We test the hypothesis at the Pythia-70M scale using a batch size of 256, for a total of 19,073 training steps over our 10B-token budget. We vary the bottleneck dimension $d_{\text{down}}$ under a fixed window size $w = 128$ and compare against two reference points: (i) a full-dimensional baseline ($d = 512$, "Vanilla"), and (ii) a uniform reduction baseline that applies the same lower dimensionality $d_{\text{down}}$ to all tokens regardless of distance. This second baseline isolates the effect of the distance-aware design from the effect of lower-dimensional representations alone.

Table [1](https://arxiv.org/html/2606.18587#S3.T1) reports perplexity across the three evaluation corpora. Two observations support the hypothesis. First, DAR with $d_{\text{down}} = 256$ and $d_{\text{down}} = 128$ outperforms the full-dimensional baseline (Rel. 98.57% and 99.61%, respectively); only at $d_{\text{down}} = 64$ does noticeable degradation appear (Rel. 101.99%). This suggests that distant tokens do not require the full dimensionality, and that representational capacity beyond a certain threshold may not be necessary for attention over distant context. The improvement at $d_{\text{down}} = 256$ and $d_{\text{down}} = 128$ is consistent with this interpretation: removing redundant capacity in distant representations does not hurt prediction. Second, when the lower-dimensional representations are applied uniformly across all token positions, performance degrades more sharply: at $d_{\text{down}} = 128$, uniform reduction reaches Rel. 105.49% while DAR remains at 99.61%; at the more aggressive $d_{\text{down}} = 64$, uniform reduction degrades to Rel. 111.30%, while DAR only reaches 101.99%. The difference between DAR and uniform reduction isolates the value of preserving full dimensionality for local tokens, providing direct evidence that local tokens require higher representational capacity than distant ones.

Figure [3](https://arxiv.org/html/2606.18587#S3.F3) shows the relative perplexity trajectory across pretraining. In the early stages, all variants exhibit elevated perplexity relative to Vanilla, but the gap closes at different rates. DAR with $d_{\text{down}} \in \{128, 256\}$ converges to Vanilla by the end of training, while DAR with $d_{\text{down}} = 64$ remains slightly above. Uniform reduction remains above Vanilla throughout training across all $d_{\text{down}}$ values, with the gap widening as $d_{\text{down}}$ decreases. At each $d_{\text{down}}$, DAR outperforms uniform reduction throughout pretraining, demonstrating that the dimensional asymmetry holds across the entire training trajectory.
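The Rel. figures quoted in this section are averages of per-dataset perplexity ratios relative to Vanilla. A minimal sketch of that computation follows; the function name is hypothetical and the perplexity values are made up for illustration, not taken from the paper's tables.

```python
def relative_perplexity(model_ppl, vanilla_ppl):
    """Average per-dataset perplexity ratio to Vanilla, in percent (lower is better)."""
    ratios = [model_ppl[name] / vanilla_ppl[name] for name in vanilla_ppl]
    return 100.0 * sum(ratios) / len(ratios)

# Made-up perplexities for illustration only; these are NOT the paper's numbers.
vanilla = {"fineweb_edu": 40.0, "c4": 50.0, "wikitext103": 80.0}
variant = {"fineweb_edu": 42.0, "c4": 50.0, "wikitext103": 76.0}
print(relative_perplexity(variant, vanilla))  # ratios 1.05, 1.00, 0.95 average to about 100
```

Averaging ratios rather than raw perplexities keeps a corpus with large absolute perplexity from dominating the score.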

Table 1: DAR validation at the Pythia-70M scale. DAR is run with window size $w = 128$ across all bottleneck dimensions. Perplexity is reported on three evaluation corpora: a subset of FineWeb-Edu, C4, and WikiText-103. Rel. is the average per-dataset perplexity ratio relative to Vanilla, reported as a percentage (smaller is better).

![Refer to caption](https://arxiv.org/html/2606.18587v1/x3.png)

Figure 3: Average perplexity ratio relative to Vanilla (= 100%, shown as a horizontal line) across training steps at the Pythia-70M scale.

### 3.2 Generalization Across Pretraining Scales

To examine whether the same observation holds at larger pretraining scales, we extend the experiment to Pythia-160M and Pythia-410M using a batch size of 896, for a total of 5,450 training steps over our 10B-token budget. We compare DAR against the corresponding full-dimensional baselines at each scale. The bottleneck dimension is fixed at $d_{\text{down}} = d/4$ across all scales, matching the moderate compression setting at which DAR closely matched Vanilla at the 70M scale.
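The step counts quoted for the two batch sizes follow from the token budget. The sketch below reproduces them under three assumptions that the text does not state explicitly: "10B" means exactly $10^{10}$ tokens, every sequence is packed to the full 2,048 tokens, and the result is rounded to the nearest integer. The function name is hypothetical.

```python
def training_steps(token_budget, batch_size, seq_len=2048):
    """Optimizer steps needed to consume token_budget, rounded to the nearest step."""
    return round(token_budget / (batch_size * seq_len))

print(training_steps(10_000_000_000, 256))  # 19073 (Pythia-70M runs)
print(training_steps(10_000_000_000, 896))  # 5450 (Pythia-160M and 410M runs)
```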

As shown in Table [2](https://arxiv.org/html/2606.18587#S3.T2), DAR stays within 1% of the full-dimensional baseline, or outperforms it, at all three scales we evaluate. DAR slightly outperforms Vanilla at 70M (Rel. 99.61%), is marginally worse at 160M (Rel. 100.88%), and outperforms Vanilla more clearly at 410M (Rel. 97.98%). This indicates that the dimensional asymmetry between local and distant tokens is not limited to the 70M setting and continues to hold as the model and its capacity grow. The results suggest that, within the evaluated scale range, DAR can preserve competitive performance using the same relative ratio, $d_{\text{down}} = d/4$. This provides preliminary evidence that distant-token representations may not require full dimensionality, although larger-scale experiments are needed to determine how this trend holds more generally.

Table 2: Generalization of DAR across pretraining scales. DAR uses $w = 128$ and $d_{\text{down}} = d/4$ at each scale. Perplexity is reported on a subset of FineWeb-Edu, C4, and WikiText-103. Rel. (%) is the average per-dataset perplexity ratio relative to the Vanilla model at the same scale (smaller is better).

### 3.3 Window Size Ablation

Table 3: Effect of window size $w$ on DAR at the Pythia-70M scale with $d_{\text{down}} = 128$. Perplexity is reported on three evaluation corpora: a subset of FineWeb-Edu, C4, and WikiText-103. Rel. is the average per-dataset perplexity ratio relative to Vanilla, reported as a percentage (smaller is better).

To verify that DAR is robust to the choice of window size $w$, we sweep $w \in \{0, 1, 4, 16, 64, 128, 256\}$ at the Pythia-70M scale with $d_{\text{down}} = 128$. Table [3](https://arxiv.org/html/2606.18587#S3.T3) shows that DAR remains close to Vanilla across a wide range of window sizes. Performance is largely unchanged for $w \geq 4$, degrades only slightly at $w = 1$, and drops noticeably when $w = 0$. These results suggest that only a small number of nearby tokens require full-dimensional representations, consistent with our hypothesis that high-dimensional representations are primarily needed for nearby tokens. Since performance is stable across a broad range of window sizes, we use $w = 128$ in all other experiments as a conservative default within the plateau region, while remaining much smaller than the sequence length.
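To make the smallest window sizes concrete, the sketch below lists which key positions take the full-dimensional path for a given query, reading the condition $i - j < w$ from Eq. (3) literally. Positions are 0-indexed here, unlike the 1-indexed notation of the equations, and the function name is hypothetical.

```python
def local_positions(i, window):
    """Key positions j <= i that use the full-dimensional path (i - j < window)."""
    return [j for j in range(i + 1) if i - j < window]

print(local_positions(5, 0))  # []
print(local_positions(5, 1))  # [5]
print(local_positions(5, 4))  # [2, 3, 4, 5]
```

Under this reading, $w = 0$ sends even the current token's own key and value through the bottleneck, while $w = 1$ keeps only the current token at full dimensionality.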

### 3.4 Effect on Downstream Tasks

To further examine whether DAR preserves task-level performance, we evaluate it on a suite of downstream benchmarks under different bottleneck dimensions $d_{\text{down}}$, while keeping the window size fixed at $w = 128$. To isolate the effect of $d_{\text{down}}$ from the effect of introducing the bottleneck module itself, we use the same DAR architecture across all configurations and treat the setting $d_{\text{down}} = d = 2048$ as the no-bottleneck baseline; this configuration includes the same down-projection and up-projection modules as the other configurations, but applies no actual dimensionality reduction.

We evaluate on MMLU Hendrycks et al. ([2021](https://arxiv.org/html/2606.18587#bib.bib29)) for massive multitask understanding, HellaSwag Zellers et al. ([2019](https://arxiv.org/html/2606.18587#bib.bib31)) for commonsense inference, CommonsenseQA Talmor et al. ([2019](https://arxiv.org/html/2606.18587#bib.bib32)) for commonsense question answering, GSM8K Cobbe et al. ([2021](https://arxiv.org/html/2606.18587#bib.bib33)) for mathematical reasoning, MBPP Austin et al. ([2021](https://arxiv.org/html/2606.18587#bib.bib34)) for code generation, and Multi-News from LongBench Bai et al. ([2024](https://arxiv.org/html/2606.18587#bib.bib28)) for multi-document summarization. We employ a 5-shot setting for MMLU, HellaSwag, CommonsenseQA, and GSM8K, a 3-shot setting for MBPP, and a zero-shot setting for Multi-News. The reported metrics are accuracy (Acc) on MMLU and CommonsenseQA, normalized accuracy (Acc-norm) on HellaSwag, flexible-extract match on GSM8K, Pass@1 on MBPP, and ROUGE scores on Multi-News. To ensure the global path is engaged during evaluation, we exclude samples whose input context is shorter than the window size. The average input context lengths for the six tasks are 742, 532, 332, 939, 673, and 1,394 tokens, respectively.

Table 4: Downstream task evaluation. All configurations use the DAR architecture with $w = 128$. The first row, with $d_{\text{down}} = d = 2048$, applies no actual dimensionality reduction and serves as the no-bottleneck baseline; subsequent rows progressively reduce $d_{\text{down}}$. Avg. is the average across the six benchmarks. Rel. (%) is the average of task-specific relative scores compared to the no-bottleneck baseline (larger is better).

Table [4](https://arxiv.org/html/2606.18587#S3.T4) reports the per-task scores. DAR stays close to, or slightly exceeds, the no-bottleneck baseline at moderate reductions: at $d_{\text{down}} = 1024$ ($d/2$), $d_{\text{down}} = 512$ ($d/4$), and $d_{\text{down}} = 256$ ($d/8$), Rel. reaches 101.09%, 102.19%, and 98.68%, respectively. Performance degrades sharply at more aggressive reductions: Rel. drops to 88.29% at $d_{\text{down}} = 128$ and 82.00% at $d_{\text{down}} = 64$. This indicates that for the evaluated tasks, distant-token representations can tolerate dimensionality reduction up to roughly $d/8$, but below this threshold, distant-token information becomes insufficient.

## 4 Related Work

Prior studies have extensively explored KV cache reduction, with many approaches focusing on uniform compression strategies such as low\-rank projection, quantization, key\-value sharing, latent attention, and compressed sparse attention\. These methods primarily aim to reduce memory footprint and inference latency under fixed architectural assumptions\. Beyond uniform compression, other approaches explore more dynamic mechanisms such as sparse attention and dynamic KV cache eviction\. While these methods improve efficiency by selectively reducing stored or accessed information, they typically rely on heuristic sparsity structures\.

### 4.1 Sliding Window Attention

Local window mechanisms have been adopted in many different forms. Sliding window attention Beltagy et al. ([2020](https://arxiv.org/html/2606.18587#bib.bib22)) restricts attention to a window of local tokens, and StreamingLLM Xiao et al. ([2023](https://arxiv.org/html/2606.18587#bib.bib12)) extends this design with a small set of attention sinks to maintain generation quality over long contexts. H2O Zhang et al. ([2023](https://arxiv.org/html/2606.18587#bib.bib11)), SnapKV Li et al. ([2024](https://arxiv.org/html/2606.18587#bib.bib3)), and related methods observe that a small subset of tokens, termed heavy hitters, contribute disproportionately to attention scores, and propose dynamic policies that retain both local tokens and these heavy hitters. SKVQ Duanmu et al. ([2024](https://arxiv.org/html/2606.18587#bib.bib21)) preserves local tokens at full numerical precision while applying low-bit quantization to tokens outside the window, motivated by the observation that local tokens tend to receive higher attention weights. Frameworks such as XAttention Xu et al. ([2025](https://arxiv.org/html/2606.18587#bib.bib1)) and MInference Jiang et al. ([2024](https://arxiv.org/html/2606.18587#bib.bib2)) can dramatically accelerate long-context inference using sparse attention.

These methods are primarily motivated by reducing the cost of attention or its inference\-time footprint\. They are training\-free and applied at inference time to already\-pretrained models\. They do not develop the dual dimensionality proposed in this work\.

### 4.2 Multi-head Latent Attention

Recent architectures reduce the size of key and value representations directly during pretraining. Multi-Query Attention (MQA) Shazeer ([2019](https://arxiv.org/html/2606.18587#bib.bib19)) and Grouped-Query Attention (GQA) Ainslie et al. ([2023](https://arxiv.org/html/2606.18587#bib.bib16)) reduce the number of independent key and value heads, sharing them across queries to lower memory and computation costs. In contrast, Multi-head Latent Attention (MLA) Liu et al. ([2024](https://arxiv.org/html/2606.18587#bib.bib17)) compresses per-token representations into a low-rank latent space, yielding key and value states with substantially lower dimensionality than standard multi-head attention. Compressed Sparse Attention (CSA) DeepSeek-AI ([2026](https://arxiv.org/html/2606.18587#bib.bib4)) reduces KV cache further by compressing hidden states of multiple tokens into one. Since these designs are introduced during pretraining, the model can adapt its representations to operate effectively under the imposed constraints.

Among these methods, MLA is most closely related to our work, as it directly modifies the dimensionality of attention representations\. Our work explores a different aspect of this design space: rather than applying a uniform reduction, we ask whether the required dimensionality should vary with a token’s distance from the prediction target\. This perspective suggests an adaptive allocation of representational capacity\.

### 4\.3 Multi\-Granularity Representation

Recent advances in representation learning have explored embedding information at multiple levels of granularity within a single vector\. Matryoshka Representation Learning \(MRL\) Kusupati et al\. \([2022](https://arxiv.org/html/2606.18587#bib.bib35)\) introduces a nested structure that allows a single embedding to be truncated to various sizes while maintaining high accuracy\. This concept was then extended to the KV cache in MatryoshkaKV Lin et al\. \([2024](https://arxiv.org/html/2606.18587#bib.bib36)\), which enables dynamic capacity adjustment during inference through trainable orthogonal projections\. These methods typically aim for resource\-agnostic flexibility, where the dimensionality is adjusted based on external computational constraints\. Our work shifts the focus from such external flexibility to an intrinsic structural principle: how the dimensionality required of a token representation varies with its distance from the prediction target\.

## 5 Limitations

Direct lower\-dimensional global attention\. We currently project $\mathbf{h}_{j}^{D}$ back to dimension $d$ via $\mathbf{W}_{\text{up}}$ before computing keys and values for the global path, even though the global path conceptually operates on lower\-dimensional information\. An alternative design would perform the global\-path attention entirely in $d_{\text{down}}$\-dimensional space, with separate query, key, and value projections operating on $\mathbf{h}_{j}^{D}$\. We do not pursue this here, as our primary goal is to validate the hypothesis under a setup that closely mirrors standard attention\.

Compute and memory efficiency\. In our current implementation, two sets of key and value states are stored for each token, doubling the KV cache memory compared to vanilla attention\. For inference, a more efficient cache scheme is possible: by absorbing $\mathbf{W}_{\text{up}}$ into $\mathbf{W}_{K}$ and $\mathbf{W}_{V}$, the global\-path key and value can be computed on demand directly from $\mathbf{h}_{j}^{D}$\. Under this scheme, tokens beyond the window only need to cache $\mathbf{h}_{j}^{D} \in \mathbb{R}^{d_{\text{down}}}$, while tokens within the window cache the full\-dimensional key and value alongside $\mathbf{h}_{j}^{D}$\. This reduces the memory complexity from $O(Td)$ to $O(Td_{\text{down}} + wd)$, which scales as $O(Td_{\text{down}})$ since $w$ is independent of sequence length\. Further absorption of $\mathbf{W}_{K}$ and $\mathbf{W}_{V}$ into $\mathbf{W}_{Q}$ and $\mathbf{W}_{O}$ is possible through the decoupled RoPE formulation introduced in MLA Liu et al\. \([2024](https://arxiv.org/html/2606.18587#bib.bib17)\)\. Practical deployment would also require integration with hardware\-aware attention implementations such as FlashAttention Dao et al\. \([2022](https://arxiv.org/html/2606.18587#bib.bib38)\) and serving frameworks like vLLM Kwon et al\. \([2023](https://arxiv.org/html/2606.18587#bib.bib39)\)\. We view this as a promising direction enabled by our findings, but not a contribution of the present work\.
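The cache scheme described above can be sketched in two parts. This is a minimal illustration under stated assumptions, not our implementation: it treats vectors as columns, assumes the up\-projection and key projection are purely linear with nothing in between, omits RoPE and biases, counts cached elements for a single layer, and uses hypothetical sizes and helper names throughout.

```python
import random


def matvec(a, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Part 1: absorbing W_up into W_K (assumed linear, RoPE and bias omitted).
rng = random.Random(0)
d, d_down = 8, 2                      # hypothetical sizes with d_down = d / 4
W_up = rand_matrix(d, d_down, rng)    # maps R^{d_down} -> R^{d}
W_K = rand_matrix(d, d, rng)          # key projection
h_D = [rng.uniform(-1.0, 1.0) for _ in range(d_down)]

k_two_step = matvec(W_K, matvec(W_up, h_D))  # up-project, then project to a key
W_K_absorbed = matmul(W_K, W_up)             # precomputed once, shape d x d_down
k_on_demand = matvec(W_K_absorbed, h_D)      # key computed directly from h_D
max_abs_diff = max(abs(a - b) for a, b in zip(k_two_step, k_on_demand))


# Part 2: cached elements per layer for a sequence of T tokens.
def vanilla_cache_elements(T, d):
    """Full-dimensional key and value for every token."""
    return 2 * T * d


def dar_cache_elements(T, d, d_down, w):
    """h_D for every token, plus full key and value for tokens inside the window."""
    in_window = min(w, T)
    return T * d_down + in_window * 2 * d


vanilla = vanilla_cache_elements(T=4096, d=1024)
reduced = dar_cache_elements(T=4096, d=1024, d_down=256, w=256)
print(max_abs_diff, vanilla, reduced, reduced / vanilla)
```

Because the window term does not grow with $T$, the ratio between the two counts approaches $d_{\text{down}} / (2d)$ for long sequences under these assumptions.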

Model scale and architecture coverage\. Due to limited resources, our experiments cover decoder\-only Transformer models from 70M to 410M parameters in pretraining and 1B parameters in supervised fine\-tuning, trained on the Pile and the Tülu 3 SFT mixture, respectively\. We encourage future studies to extend the DAR framework to substantially larger scales and validate its efficacy across varied model architectures and training data\.

## 6 Conclusion

We hypothesized that the representational dimensionality required for a token in attention varies with its distance from the prediction target, and introduced Distance\-Adaptive Representation \(DAR\), a principle that allocates representational capacity according to this distance\. Through controlled pretraining and supervised fine\-tuning experiments, we show that distant tokens can be represented with substantially lower dimensionality without significantly degrading perplexity or downstream task performance, whereas applying the same reduction uniformly across all tokens leads to noticeable performance loss\. These results provide direct evidence for an asymmetric demand on representational capacity and challenge the common assumption that attention representations should be uniform across token positions\. We hope this work motivates further investigation into more sophisticated allocations of representational capacity in attention\.

## Acknowledgements

Xuan Luo was partially supported by the BioPACIFIC MIP of the National Science Foundation under Award No\. DMR\-1933487\. We would like to thank Meta for donating the A100\-40G GPUs used in our experiments\. We also gratefully acknowledge the generous support of the NVIDIA Academic Grant Program and NCSA DeltaAI through allocation CIS260864 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support \(ACCESS\) program, which is supported by the U\.S\. National Science Foundation\.

## References

- \[1\] J\. Ainslie et al\. \(2023\) GQA: training generalized multi\-query transformer models from multi\-head checkpoints\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 4895–4901\. Cited by: [§4\.2](https://arxiv.org/html/2606.18587#S4.SS2.p1.1)\.
- \[2\] GPT\-NeoX: large scale autoregressive language modeling in pytorch\. External Links: [Link](https://www.github.com/eleutherai/gpt-neox), [Document](https://dx.doi.org/10.5281/zenodo.5879544)\. Cited by: [§3](https://arxiv.org/html/2606.18587#S3.SS0.SSS0.Px1.p1.1)\.
- \[3\] J\. Austin, A\. Odena, M\. Nye, M\. Bosma, H\. Michalewski, D\. Dohan, E\. Jiang, C\. Cai, M\. Terry, Q\. Le, et al\. \(2021\) Program synthesis with large language models\. arXiv preprint arXiv:2108\.07732\. Cited by: [§3\.4](https://arxiv.org/html/2606.18587#S3.SS4.p2.6)\.
- \[4\] Y\. Bai, X\. Lv, J\. Zhang, H\. Lyu, J\. Tang, Z\. Huang, Z\. Du, X\. Liu, A\. Zeng, L\. Hou, Y\. Dong, J\. Tang, and J\. Li \(2024\-08\) LongBench: a bilingual, multitask benchmark for long context understanding\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Bangkok, Thailand, pp\. 3119–3137\. External Links: [Link](https://aclanthology.org/2024.acl-long.172), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.172)\. Cited by: [§3\.4](https://arxiv.org/html/2606.18587#S3.SS4.p2.6)\.
- \[5\] I\. Beltagy, M\. E\. Peters, and A\. Cohan \(2020\) Longformer: the long\-document transformer\. arXiv preprint arXiv:2004\.05150\. Cited by: [§1](https://arxiv.org/html/2606.18587#S1.p3.1), [§4\.1](https://arxiv.org/html/2606.18587#S4.SS1.p1.1)\.
- \[6\] S\. Biderman, K\. Bicheno, and L\. Gao \(2022\) Datasheet for the pile\. arXiv preprint arXiv:2201\.07311\. Cited by: [§3](https://arxiv.org/html/2606.18587#S3.SS0.SSS0.Px1.p1.1)\.
- \[7\] S\. Biderman, H\. Schoelkopf, Q\. G\. Anthony, H\. Bradley, K\. O’Brien, E\. Hallahan, M\. A\. Khan, S\. Purohit, U\. S\. Prashanth, E\. Raff, et al\. \(2023\) Pythia: a suite for analyzing large language models across training and scaling\. In International Conference on Machine Learning, pp\. 2397–2430\. Cited by: [§3](https://arxiv.org/html/2606.18587#S3.SS0.SSS0.Px1.p1.1)\.
- \[8\] K\. Cobbe, V\. Kosaraju, M\. Bavarian, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\) Training verifiers to solve math word problems\. External Links: 2110\.14168\. Cited by: [§3\.4](https://arxiv.org/html/2606.18587#S3.SS4.p2.6)\.
- \[9\] T\. Dao, D\. Fu, S\. Ermon, A\. Rudra, and C\. Ré \(2022\) Flashattention: fast and memory\-efficient exact attention with io\-awareness\. Advances in Neural Information Processing Systems 35, pp\. 16344–16359\. Cited by: [§5](https://arxiv.org/html/2606.18587#S5.p2.14)\.
- \[10\] DeepSeek\-AI \(2026\-04\) DeepSeek\-V4: towards highly efficient million\-token context intelligence\. Note: Technical Report\. External Links: [Link](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro)\. Cited by: [§1](https://arxiv.org/html/2606.18587#S1.p3.1), [§4\.2](https://arxiv.org/html/2606.18587#S4.SS2.p1.1)\.
- \[11\] H\. Duanmu, Z\. Yuan, X\. Li, J\. Duan, X\. Zhang, and D\. Lin \(2024\) SKVQ: sliding\-window key and value cache quantization for large language models\. arXiv preprint arXiv:2405\.06219\. Cited by: [§4\.1](https://arxiv.org/html/2606.18587#S4.SS1.p1.1)\.
- \[12\] L\. Gao, S\. Biderman, S\. Black, L\. Golding, T\. Hoppe, C\. Foster, J\. Phang, H\. He, A\. Thite, N\. Nabeshima, et al\. \(2020\) The Pile: an 800gb dataset of diverse text for language modeling\. arXiv preprint arXiv:2101\.00027\. Cited by: [§3](https://arxiv.org/html/2606.18587#S3.SS0.SSS0.Px1.p1.1)\.
- \[13\] L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2021\-09\) A framework for few\-shot language model evaluation\. Zenodo\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.5371628), [Link](https://doi.org/10.5281/zenodo.5371628)\. Cited by: [§3](https://arxiv.org/html/2606.18587#S3.SS0.SSS0.Px2.p1.6)\.
- \[14\] D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\) Measuring massive multitask language understanding\. Proceedings of the International Conference on Learning Representations \(ICLR\)\. Cited by: [§3\.4](https://arxiv.org/html/2606.18587#S3.SS4.p2.6)\.
- \[15\] J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark, et al\. \(2022\) Training compute\-optimal large language models\. arXiv preprint arXiv:2203\.15556\. Cited by: [§3](https://arxiv.org/html/2606.18587#S3.SS0.SSS0.Px1.p1.1)\.
- \[16\] H\. Jiang, Y\. Li, C\. Zhang, Q\. Wu, X\. Luo, S\. Ahn, Z\. Han, A\. H\. Abdi, D\. Li, C\. Lin, Y\. Yang, and L\. Qiu \(2024\) MInference 1\.0: accelerating pre\-filling for long\-context LLMs via dynamic sparse attention\. In The Thirty\-eighth Annual Conference on Neural Information Processing Systems\. External Links: [Link](https://openreview.net/forum?id=fPBACAbqSN)\. Cited by: [§4\.1](https://arxiv.org/html/2606.18587#S4.SS1.p1.1)\.
- \[17\] A\. Kusupati, G\. Bhatt, A\. Rege, M\. Wallingford, A\. Sinha, V\. Ramanujan, W\. Howard\-Snyder, K\. Chen, S\. Kakade, P\. Jain, et al\. \(2022\) Matryoshka representation learning\. Advances in Neural Information Processing Systems 35, pp\. 30233–30249\. Cited by: [§4\.3](https://arxiv.org/html/2606.18587#S4.SS3.p1.1)\.
- \[18\] W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\) Efficient memory management for large language model serving with pagedattention\. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles\. Cited by: [§5](https://arxiv.org/html/2606.18587#S5.p2.14)\.
- \[19\] N\. Lambert, J\. Morrison, V\. Pyatkin, S\. Huang, H\. Ivison, F\. Brahman, L\. J\. V\. Miranda, A\. Liu, N\. Dziri, S\. Lyu, et al\. \(2024\) Tulu 3: pushing frontiers in open language model post\-training\. arXiv preprint arXiv:2411\.15124\. Cited by: [§3](https://arxiv.org/html/2606.18587#S3.SS0.SSS0.Px2.p1.6)\.
- \[20\] Y\. Li, Y\. Huang, B\. Yang, B\. Venkitesh, A\. Locatelli, H\. Ye, T\. Cai, P\. Lewis, and D\. Chen \(2024\) SnapKV: LLM knows what you are looking for before generation\. In The Thirty\-eighth Annual Conference on Neural Information Processing Systems\. Cited by: [§4\.1](https://arxiv.org/html/2606.18587#S4.SS1.p1.1)\.
- \[21\] B\. Lin, Z\. Zeng, Z\. Xiao, S\. Kou, T\. Hou, X\. Gao, H\. Zhang, and Z\. Deng \(2024\) MatryoshkaKV: adaptive kv compression via trainable orthogonal projection\. arXiv preprint arXiv:2410\.14731\. Cited by: [§4\.3](https://arxiv.org/html/2606.18587#S4.SS3.p1.1)\.
- \[22\] A\. Liu, B\. Feng, B\. Wang, B\. Wang, B\. Liu, C\. Zhao, C\. Dengr, C\. Ruan, D\. Dai, D\. Guo, et al\. \(2024\) Deepseek\-V2: a strong, economical, and efficient mixture\-of\-experts language model\. arXiv preprint arXiv:2405\.04434\. Cited by: [§1](https://arxiv.org/html/2606.18587#S1.p3.1), [§2\.1](https://arxiv.org/html/2606.18587#S2.SS1.p1.8), [§2\.3](https://arxiv.org/html/2606.18587#S2.SS3.p3.1), [§4\.2](https://arxiv.org/html/2606.18587#S4.SS2.p1.1), [§5](https://arxiv.org/html/2606.18587#S5.p2.14)\.
- \[23\] A\. Lozhkov, L\. Ben Allal, L\. von Werra, and T\. Wolf \(2024\) FineWeb\-Edu: the finest collection of educational content\. Hugging Face\. External Links: [Link](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), [Document](https://dx.doi.org/10.57967/hf/2497)\. Cited by: [§3](https://arxiv.org/html/2606.18587#S3.SS0.SSS0.Px1.p1.1)\.
- \[24\] S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher \(2016\) Pointer sentinel mixture models\. External Links: 1609\.07843\. Cited by: [§3](https://arxiv.org/html/2606.18587#S3.SS0.SSS0.Px1.p1.1)\.
- \[25\] T\. OLMo, P\. Walsh, L\. Soldaini, D\. Groeneveld, K\. Lo, S\. Arora, A\. Bhagia, Y\. Gu, S\. Huang, M\. Jordan, N\. Lambert, D\. Schwenk, O\. Tafjord, T\. Anderson, D\. Atkinson, F\. Brahman, C\. Clark, P\. Dasigi, N\. Dziri, M\. Guerquin, H\. Ivison, P\. W\. Koh, J\. Liu, S\. Malik, W\. Merrill, L\. J\. V\. Miranda, J\. Morrison, T\. Murray, C\. Nam, V\. Pyatkin, A\. Rangapur, M\. Schmitz, S\. Skjonsberg, D\. Wadden, C\. Wilhelm, M\. Wilson, L\. Zettlemoyer, A\. Farhadi, N\. A\. Smith, and H\. Hajishirzi \(2024\) 2 OLMo 2 Furious\. External Links: 2501\.00656, [Link](https://arxiv.org/abs/2501.00656)\. Cited by: [§3](https://arxiv.org/html/2606.18587#S3.SS0.SSS0.Px2.p1.6)\.
- \[26\] C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\) Exploring the limits of transfer learning with a unified text\-to\-text transformer\. Journal of Machine Learning Research 21 \(140\), pp\. 1–67\. Cited by: [§3](https://arxiv.org/html/2606.18587#S3.SS0.SSS0.Px1.p1.1)\.
- \[27\] N\. Shazeer \(2019\) Fast transformer decoding: one write\-head is all you need\. arXiv preprint arXiv:1911\.02150\. Cited by: [§4\.2](https://arxiv.org/html/2606.18587#S4.SS2.p1.1)\.
- \[28\] A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\-06\) CommonsenseQA: a question answering challenge targeting commonsense knowledge\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), Minneapolis, Minnesota, pp\. 4149–4158\. External Links: [Link](https://aclanthology.org/N19-1421), [Document](https://dx.doi.org/10.18653/v1/N19-1421), 1811\.00937\. Cited by: [§3\.4](https://arxiv.org/html/2606.18587#S3.SS4.p2.6)\.
- \[29\] A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. Advances in Neural Information Processing Systems 30\. Cited by: [§1](https://arxiv.org/html/2606.18587#S1.p1.1)\.
- \[30\] G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2023\) Efficient streaming language models with attention sinks\. arXiv preprint arXiv:2309\.17453\. Cited by: [§1](https://arxiv.org/html/2606.18587#S1.p3.1), [§4\.1](https://arxiv.org/html/2606.18587#S4.SS1.p1.1)\.
- \[31\] R\. Xu, G\. Xiao, H\. Huang, J\. Guo, and S\. Han \(2025\) XAttention: block sparse attention with antidiagonal scoring\. In Forty\-second International Conference on Machine Learning\. External Links: [Link](https://openreview.net/forum?id=KG6aBfGi6e)\. Cited by: [§4\.1](https://arxiv.org/html/2606.18587#S4.SS1.p1.1)\.
- \[32\] R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\) HellaSwag: can a machine really finish your sentence?\. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\. Cited by: [§3\.4](https://arxiv.org/html/2606.18587#S3.SS4.p2.6)\.
- \[33\] Z\. Zhang, Y\. Sheng, T\. Zhou, T\. Chen, L\. Zheng, R\. Cai, Z\. Song, Y\. Tian, C\. Ré, C\. Barrett, et al\. \(2023\) H2O: heavy\-hitter oracle for efficient generative inference of large language models\. Advances in Neural Information Processing Systems 36, pp\. 34661–34710\. Cited by: [§1](https://arxiv.org/html/2606.18587#S1.p3.1), [§4\.1](https://arxiv.org/html/2606.18587#S4.SS1.p1.1)\.
