PARTREP: Learning What to Repeat for Decoder-only LLMs
Summary
PartRep proposes a selective prompt repetition method for decoder-only LLMs that appends only the most informative tokens (selected via NLL) instead of the full prompt, reducing KV cache and prefill FLOPs while retaining most of the accuracy gains across multiple benchmarks.
# PartRep: Learning What to Repeat for Decoder-only LLMs
Source: [https://arxiv.org/html/2607.01792](https://arxiv.org/html/2607.01792)
Andikawati P Widjaja♡, Yongjun Kim♠, Hyounghun Kim♠, Jaeho Lee♠
♡Bandung Institute of Technology, ♠Pohang University of Science and Technology
###### Abstract
While decoder-only LLMs excel at a vast array of natural language tasks, they suffer from an asymmetric information flow induced by causal attention: later tokens are richer in contextual grounding than earlier ones. A simple and effective remedy is prompt repetition: just appending a second copy of the prompt before generation can redistribute grounding across positions and improve reasoning performance. However, full repetition of the original prompt doubles the KV cache footprint and quadruples attention cost during prefill, making it impractical for long-context settings. We propose PartRep, a selective augmentation method that appends only the most informative tokens rather than the entire prompt. We use token-wise negative log-likelihood (NLL) as a selection signal, motivated by the hypothesis that less predictable tokens are less recoverable from the surrounding context and therefore benefit more from late-position repetition. To avoid the heavy cost of a full forward pass for scoring, we train a lightweight gate that predicts high-NLL tokens from early-layer hidden states, enabling token selection during mid-prefill via early exit. Across eight benchmarks (including MMLU, GSM8K, and RULER) and three model families (Qwen2.5, Llama3.2, Gemma4), PartRep retains most of the gains of full repetition while using only 59.4% of its KV cache and 79.0% of its prefill FLOPs.
## 1 Introduction
Decoder-only large language models (LLMs) have become the dominant architecture in natural language processing, driving significant advancements across a wide range of generative and analytical tasks (huang2023advancing; achiam2023gpt). A defining feature of these models is causal attention, which allows each token to attend to all preceding tokens, while masking out the future ones. Although this design enables effective autoregressive generation, it also induces an asymmetric information flow: later tokens have richer contextual grounding than earlier ones (springer2024repetition; behnamghader2024llm2veclargelanguagemodels). This asymmetry has been linked to several documented limitations, including premise order sensitivity in reasoning (chen2024premise), option ordering bias (pezeshkpour2024large; wei2024unveiling; ok2026lost), and the “lost in the middle” phenomenon (liu2024lost). This raises a natural question: How can we redistribute contextual grounding more evenly?
Figure 1: Overview of token repetition strategies. Vanilla prompting uses the original prompt once. Full repetition appends the entire prompt, improving accuracy at the cost of doubling the input length to $2L$. PartRep, our method, appends only selected key tokens, producing a shorter sequence of length $(L + \tau L)$ that preserves the accuracy gains of repetition while reducing computational cost.
A surprisingly effective remedy for this asymmetry is prompt repetition: present the same prompt twice before generation. By placing a second copy of the prompt at the end of the input, every original token gains a later “echo” that can attend back over the full context, effectively redistributing contextual grounding across the sequence. Despite requiring no parameter updates or architectural changes, this simple intervention has been shown to improve reasoning performance across diverse tasks (leviathan2025prompt; xu2024re; springer2024repetition).
However, it also introduces several limitations: Appending a full copy of the prompt doubles the KV cache footprint and quadruples self\-attention FLOPs\. These overheads become increasingly prohibitive as the prompt length grows, rendering this approach unscalable for long contexts\.
To this end, we present PartRep, a method to selectively repeat only the most critical tokens, rather than duplicating the entire prompt (Figure [1](https://arxiv.org/html/2607.01792#S1.F1)). PartRep scores each input token by its informational importance, then appends only the highest-scoring tokens before generation. As the scoring criterion, we adopt the negative log-likelihood (NLL) of each token as a proxy for information density: tokens with high NLL (i.e., high surprisal) are those the model finds least predictable from preceding context, indicating that their content is not redundantly encoded by surrounding tokens. As we show in [Section B.1](https://arxiv.org/html/2607.01792#A2.SS1), repeating such tokens yields the greatest marginal benefit, as their echoes inject information that would otherwise remain underrepresented in later positions’ attention.
Since exact NLL computation requires a full forward pass and would undermine our efficiency goal, we instead train a lightweight gate to predict the top-$\tau$ high-NLL tokens directly from early-layer hidden states, enabling selection during mid-prefill. The selected tokens are then appended to the original prompt via a short natural-language bridge (e.g., “Pay attention to these key tokens…”), after which a single forward pass produces the final output.
We demonstrate the robustness of our approach across 8 benchmarks, including GSM8K and RULER, and a diverse set of decoder-only LLMs encompassing the Qwen2.5, Llama3.2, and Gemma4 families. PartRep preserves the accuracy benefits of full repetition, while requiring only 59.4% of the KV cache budget and 79.0% of the prefill FLOPs.
We summarize our contributions as follows:
- We propose PartRep, a selective prompt augmentation method that approximates the benefits of full repetition at a fraction of its prefill compute and KV cache memory cost.
- We show that token-wise NLL provides a principled importance signal for token selection and design an efficient gating mechanism that approximates it without a full forward pass.
- We empirically validate PartRep across eight benchmarks and three model families, showing that it consistently matches or surpasses full repetition across the setups, while substantially reducing prefill computation and memory overhead.
## 2 Related work
#### Prompt repetition.
Several studies have shown that prompt repetition can improve LLM performance. xu2024re found that “re-reading” a question before chain-of-thought reasoning improves LLM performance, arguing that repetition enables causally masked models to partially emulate bidirectional attention and mitigate limitations of decoder-only architectures. arora2024just extended similar ideas to recurrent language models, while springer2024repetition formalized the phenomenon through “echo embeddings,” showing that representations from repeated tokens substantially outperform single-pass embeddings on retrieval and similarity tasks. At a larger scale, leviathan2025prompt observed consistent gains from prompt repetition across both open- and closed-weight models in non-reasoning mode. Although prior work has primarily focused on full-prompt repetition, related findings on partial (option-level) repetition by ok2026lost further suggest that the benefits arise from restoring information pathways that are blocked by causal attention. However, option-level repetition is limited to multiple-choice settings and does not generalize to the open-ended generation tasks prevalent in real-world LLM usage. In contrast, we introduce a task-agnostic framework that selectively repeats only the most informative tokens within an arbitrary prompt, avoiding both the doubled KV cache cost of full repetition and the multiple-choice constraint of option-level repetition.
#### KV cache eviction\.
A number of prior works propose to evict the KV cache of less important tokens to mitigate the high memory I/O cost of long-context LLM inference. For instance, H2O (zhang2023h2o) retains the tokens that frequently receive high attention during the next-token prediction, while KVzip (kim2025kvzip) asks the model to reconstruct the prompt and utilizes the attention during this process. More recent works focus on avoiding computing full attention matrices for each token during the prefill stage to reduce the latency overhead. For example, FastKVzip (kim2026fast) and KVzap (jegou2026kvzap) introduce lightweight predictors trained to predict the token importance based on the hidden states of early transformer layers, where the token importance is estimated by prior methods (e.g., KVzip).
We adopt a similar predictor-based strategy, but for a fundamentally different objective. Eviction estimates which tokens can be safely discarded, while partial repetition estimates which tokens yield additional benefit when re-attended to, and these two notions of importance are not duals of each other. We elaborate in [Section 3.1](https://arxiv.org/html/2607.01792#S3.SS1), where we show, for example, that the final prompt token is typically essential under eviction yet least useful to repeat.
#### Summarization\.
Another line of work reduces memory cost by compressing the input prompt itself. LLMLingua (jiang2023llmlingua) uses a compact language model to identify and remove less important tokens before feeding a summarized prompt to the main model. In this sense, summarization can be viewed as a text-level analogue of KV cache eviction: instead of discarding key-values after prefill, it removes redundant tokens before they enter the model. Our work similarly estimates token importance and constructs a shortened textual representation in the form of a summary. However, the objectives differ fundamentally. LLMLingua assumes that the original prompt contains inherent redundancy and compresses it to reduce memory consumption. In contrast, our goal is not merely to shorten the prompt, but to approximate the accuracy gains of redundant full repetition under constrained memory budgets.
## 3 Problem formulation
Suppose that we are given an initial user prompt as input, formulated as a sequence of tokens
$$\mathbf{t} = (t_1, t_2, \dots, t_L) \tag{1}$$
with length $L$. For a decoder-only LLM, the computational complexity of prefill self-attention scales quadratically as $O(L^2)$, and the stored key-value (KV) cache scales linearly as $O(L)$.
Full repetition (leviathan2025prompt) is the idea of repeating the full prompt. In other words, the prompt is duplicated to form
$$\mathbf{t}_{\text{FR}} = \mathbf{t} \oplus \mathbf{t}, \tag{2}$$
where $\oplus$ denotes sequence concatenation. It is known that prefilling $\mathbf{t}_{\text{FR}}$ enables critical pseudo-bidirectional attention for the sequence, thereby increasing the accuracy of the model. On the other hand, the prefill complexity gets quadrupled and the KV cache footprint doubles.
Extending this paradigm, we formulate the problem of partial repetition, which aims to achieve the benefit of repetition without inheriting its prohibitive memory and latency costs. Our goal is to isolate an important, highly informative subset of the original prompt such that appending only this subset, rather than the full prompt, suffices to improve model accuracy. Precisely, we consider the prompt structured as
$$\mathbf{t}_{\text{PR}} = \mathbf{t} \oplus \mathbf{t}_{\text{part}} \tag{3}$$
where $\mathbf{t}_{\text{part}}$ denotes the keyword subsequence of $\mathbf{t}$. Given this prompt structure, our goal is to find a token selector $f(\mathbf{t}) = \mathbf{t}_{\text{part}}$ which maximizes the model accuracy. More concretely, our goal is to solve
$$\max_{f} \; \mathrm{acc}(\mathbf{t} \oplus f(\mathbf{t})) \tag{4}$$
where $\mathrm{acc}(Q)$ denotes the accuracy of the base LLM prompted by $Q$, subject to the constraints
$$f(\mathbf{t}) \subset \mathbf{t}, \qquad |f(\mathbf{t})| \leq \tau \cdot |\mathbf{t}|, \tag{5}$$
where $\subset$ denotes the subsequence relation (instead of a subset), and $\tau \in (0,1)$ is a repetition threshold enforced to meet the given memory budget.
The primary advantage of formulating the problem as token selection (i.e., imposing $f(\mathbf{t}) \subset \mathbf{t}$) is computational. The predictor $f$ can be viewed as making a binary decision on each input token, acting as a “gate” that accepts or rejects each token. As the output dimension is simple, we can expect $f$ to be implementable with a lightweight model. Indeed, we implement this with a two-layer MLP with attention, on top of the target model’s hidden state. The same may not be possible if we let the output of $f(\mathbf{t})$ lie in the general vocabulary space, e.g., a textual summary of $\mathbf{t}$, or a continuous token space.
#### Efficiency\.
By compressing the repeated prompt, we can expect the prefill self-attention computation to be proportional to $(1+\tau)^2 L^2$, which is lower than the $4L^2$ of full repetition. Likewise, the KV cache will be proportional to $(1+\tau)L$, which is less than the $2L$ of full repetition. However, note that partial repetition introduces an additional inference cost for the token selection procedure $f(\cdot)$. Fortunately, as we will see in [Section 6.2](https://arxiv.org/html/2607.01792#S6.SS2), the cost is small.
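As a rough illustration of these proportionalities, the sketch below computes the attention and KV cache costs of partial repetition relative to vanilla prompting and to full repetition. The function name `repetition_costs` and its dictionary keys are hypothetical. The sketch counts only the quadratic attention term and the cached prompt tokens; it ignores the bridge tokens, the gate, and every cost that is linear in sequence length, so it is not expected to reproduce measured FLOPs.

```python
def repetition_costs(L: int, tau: float) -> dict:
    """Relative costs of partial repetition for a prompt of L tokens.

    Assumes attention cost ~ (sequence length)^2 and KV cache ~ sequence length,
    as in the proportionalities stated in the text.
    """
    if L <= 0:
        raise ValueError("L must be positive")
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie in (0, 1)")

    partial_len = (1.0 + tau) * L   # original prompt + repeated subset
    full_len = 2.0 * L              # original prompt + full copy

    return {
        "attn_vs_vanilla": (partial_len / L) ** 2,
        "attn_vs_full": (partial_len / full_len) ** 2,
        "kv_vs_vanilla": partial_len / L,
        "kv_vs_full": partial_len / full_len,
    }


costs = repetition_costs(L=1000, tau=0.15)
for name, value in costs.items():
    print(f"{name}: {value:.4f}")
```

Under these simplifying assumptions, $\tau = 0.15$ gives an attention term about 1.32 times that of vanilla prompting and roughly a third of that of full repetition, with a KV cache that is 57.5% of the full-repetition cache.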
Figure 2: Inference procedure of the proposed PartRep. We first prefill the LLM with the original prompt, then pass its early-layer hidden states through the gating module. Next, we select the top-$\tau$ fraction of tokens, and repeat the selected tokens (i.e., append them after the original prompt), then continue with the prefill.
### 3.1 Comparison with KV cache eviction
Recall that the task of KV cache eviction is about deciding “what to discard” (zhang2023h2o). The task of partial repetition, i.e., deciding “what to repeat,” is similar in the sense that we need to estimate the importance of each token to reduce the KV cache cost. However, the tasks fundamentally differ from each other in two aspects.
First, tackling partial repetition via KV cache eviction is computationally suboptimal. To see this, consider applying KV cache eviction methods on a fully repeated prompt as a means of partial repetition. In this case, we must still perform prefill with $2L$ tokens, thus losing any computational advantage in the prefill computation.
Second, the notion of importance is different in the two tasks. To see this, consider the last token of the prompt. In partial repetition, this is the least important token to be repeated, as the last token in the original prompt has already attended to all other tokens. In KV cache eviction, however, the last token (of the repeated prompt) is typically treated as essential (zhang2023h2o).
## 4 Algorithm
We develop PartRep, a learning-based method that captures the benefits of repetition without inheriting its memory and latency cost by repeating only highly important tokens.
To select these tokens, we first motivate the negative log-likelihood (NLL) as a principled proxy for information density ([Section 4.1](https://arxiv.org/html/2607.01792#S4.SS1)). Since calculating the exact token-level NLL at runtime is prohibitively expensive, we introduce a lightweight gate trained to predict these scores from early-layer hidden states ([Section 4.2](https://arxiv.org/html/2607.01792#S4.SS2)). [Section 4.3](https://arxiv.org/html/2607.01792#S4.SS3) describes the offline procedure used to train the gate across diverse domains. At inference, selected tokens are appended to the original prompt before the model’s second forward pass. [Figure 2](https://arxiv.org/html/2607.01792#S3.F2) illustrates the overall PartRep pipeline, with details in [Section 4.4](https://arxiv.org/html/2607.01792#S4.SS4).
### 4.1 Token importance scoring
To identify and retain only critical tokens, we use the negative log-likelihood (NLL) of the next-token prediction as a direct proxy of the predicted tokens’ information density (jaeger2006speakers). In a standard autoregressive language model, the probability of each token $t_i$ conditioned on all preceding tokens $t_{<i}$ is predicted. The NLL for a token is:
$$\text{NLL}(t_i) = -\log P(t_i \mid t_{<i}) \tag{6}$$
where $P(\cdot \mid \cdot)$ denotes the next-token probability predicted by the model. Tokens with high prediction probabilities yield low NLL scores, indicating high redundancy with previous context and low information added. Conversely, “surprising” tokens with low probabilities yield high NLL scores. Our method leverages this intrinsic property by defining the predictor $f(\cdot)$ to isolate the token subset carrying the highest information density within the repetition threshold $\tau$:
$$f^{\star}(\mathbf{t}) = \{t_i \in \mathbf{t} \mid \text{NLL}(t_i) \geq \eta_{\tau}\}. \tag{7}$$
Here, $\eta_{\tau}$ is the threshold determined to meet the required budget for each sample, i.e., the upper $\tau$-quantile of that sample’s NLL scores, so that the top-$\tau$ fraction of tokens is selected.
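The selection rule in Equation 7 can be sketched as follows, assuming the per-token log-probabilities $\log P(t_i \mid t_{<i})$ are already available from a forward pass. The function name `select_top_nll`, the floor-based budget, the tie-breaking rule, and the example log-probabilities are all illustrative assumptions rather than details taken from the paper. The sketch picks exactly the top $\lfloor \tau L \rfloor$ tokens instead of thresholding at $\eta_\tau$, which coincides with Equation 7 when there are no ties at the threshold.

```python
import math


def select_top_nll(tokens, log_probs, tau):
    """Return the highest-NLL tokens as a subsequence (original order preserved).

    tokens:    list of token strings (or ids)
    log_probs: log P(t_i | t_<i) for each token, same length as `tokens`
    tau:       repetition threshold in (0, 1); at most floor(tau * L) tokens are kept
    """
    if len(tokens) != len(log_probs):
        raise ValueError("tokens and log_probs must have the same length")

    nll = [-lp for lp in log_probs]              # Equation 6
    budget = math.floor(tau * len(tokens))       # enforces |f(t)| <= tau * |t|
    if budget == 0:
        return []

    # Rank positions by NLL (descending); earlier positions win ties.
    ranked = sorted(range(len(tokens)), key=lambda i: (-nll[i], i))
    keep = sorted(ranked[:budget])               # restore prompt order
    return [tokens[i] for i in keep]


# Hypothetical example: made-up log-probabilities for a short prompt.
tokens = ["The", "capital", "of", "France", "is", "Paris", "."]
log_probs = [-2.0, -5.0, -0.1, -4.0, -0.2, -0.5, -0.3]
print(select_top_nll(tokens, log_probs, tau=0.3))  # -> ['capital', 'France']
```

Sorting the kept positions before returning is what makes the output a subsequence of the prompt, matching the constraint $f(\mathbf{t}) \subset \mathbf{t}$ in Equation 5.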
#### Rationale\.
The NLL-based selector ([7](https://arxiv.org/html/2607.01792#S4.E7)) can be viewed as an approximate solution for a more general optimization problem. In particular, consider the following information maximization principle:
$$\max_{f} H(f(\mathbf{t}) \mid \mathbf{t}), \tag{8}$$
where $H(\mathbf{t}_2 \mid \mathbf{t}_1)$ denotes the conditional entropy of the model generating a sequence $\mathbf{t}_2$ conditioned on the context $\mathbf{t}_1$. (Note that this is different from the usual conditional entropy, where we expect $H(f(X) \mid X) = 0$ for any deterministic $f(\cdot)$. This discrepancy is due to the fact that we consider the conditional entropy of a token sequence given its context, modeled by an LLM, rather than modeling $X$ and $f(X)$ as random variables themselves.) In other words, we are selecting a subsequence $f(\mathbf{t})$ that can maximize the information added by the repetition.
[Equation 7](https://arxiv.org/html/2607.01792#S4.E7) approximately solves this problem, where the approximation is that we also condition on the tokens that may be discarded later. Through this approximation, we can save computation; otherwise, we would need to add each context word iteratively, through multiple rounds.
### 4.2 Gating mechanism
While $f^{\star}$ can select critical tokens in a principled way, computing it requires much prefill computation to pass the prompt through the LLM. To alleviate this overhead, we employ a learning-based approach: We train a small, lightweight gating module that can approximate $f^{\star}$ on-the-fly.
Precisely, our gating mechanism works as a composition of two different functions:
$$f(\mathbf{t}) = g \circ \phi_{\text{LLM}}(\mathbf{t}). \tag{9}$$
Here, $\phi_{\text{LLM}}(\mathbf{t})$ denotes the hidden states of the original prompt $\mathbf{t}$ extracted from the base LLM at layer $l^{\star}$, while $g(\cdot)$ is a lightweight token-wise gating function. By leveraging $\phi_{\text{LLM}}(\mathbf{t})$, we can keep $g(\cdot)$ lightweight while minimizing computational overhead. In particular, these features reuse computations from the original prefill stage and therefore incur no additional computational cost.
#### Architecture\.
The token-wise gating function $g(\cdot)$ takes a structure similar to that of kim2026fast, motivated by its success in KV cache eviction. In a nutshell, we first apply a two-layer MLP to map raw hidden states (a highly entangled mixture of syntactic, semantic, and positional signals) onto a compact query space. Then, we use an attention module to compare the query to the keys that represent latent signatures of high-NLL tokens, to compute the final repetition probability.
Concretely, given some hidden state $\mathbf{h} \in \mathbb{R}^{d}$, we first compute the query
$$\mathbf{q} = W_2 \, \mathrm{SiLU}(W_1 \mathbf{h}), \tag{10}$$
where $W_1 \in \mathbb{R}^{2d_h \times d}$ and $W_2 \in \mathbb{R}^{d_h \times 2d_h}$ are weight matrices. We fix $d_h = 64$ across models to maintain efficiency. The query $\mathbf{q}$ is then evaluated against $s$ learnable keys $K \in \mathbb{R}^{s \times d_h}$ to output the corresponding weighted sum of values $\mathbf{v} \in \mathbb{R}^{s}$:
$$g(\mathbf{h}) = \sigma\left(\mathbf{v}^{\top} \mathrm{Softmax}\left(\mathbf{q} K^{\top} / \sqrt{d_h}\right)\right). \tag{11}$$
Here, $\sigma(\cdot)$ is the sigmoid activation function to generate the final probability score.
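To make Equations 10 and 11 concrete, the following is a minimal pure-Python sketch of the gate's forward pass for a single hidden state. It is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, bias terms are omitted because the equations show none, and the tiny dimensions and weight values in the example are made up (the paper fixes $d_h = 64$).

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def silu(x):
    return x * sigmoid(x)


def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def gate(h, W1, W2, K, v):
    """Repetition probability g(h) for one hidden state h.

    Shapes: W1 is (2*d_h, d), W2 is (d_h, 2*d_h), K is (s, d_h), v has length s.
    """
    d_h = len(K[0])
    q = matvec(W2, [silu(a) for a in matvec(W1, h)])             # Equation 10
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d_h) for key in K]
    attn = softmax(scores)                                        # weights over s keys
    return sigmoid(sum(vi * ai for vi, ai in zip(v, attn)))      # Equation 11


# Toy example with d = 3, d_h = 2, s = 3 (values are arbitrary).
h = [1.0, -2.0, 0.5]
W1 = [[0.1, 0.0, -0.2], [0.3, 0.1, 0.0], [-0.1, 0.2, 0.4], [0.0, -0.3, 0.1]]
W2 = [[0.5, -0.1, 0.2, 0.0], [0.0, 0.3, -0.2, 0.1]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [2.0, -1.0, 0.5]
p = gate(h, W1, W2, K, v)
print(f"repetition probability: {p:.4f}")
```

Because the softmax weights sum to one, the pre-sigmoid score is a convex combination of the entries of $\mathbf{v}$, so the output probability always lies between $\sigma(\min_k v_k)$ and $\sigma(\max_k v_k)$.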
### 4.3 Training the gating module
Now we describe how we train the gating function $f = g \circ \phi_{\text{LLM}}$ (Equation [9](https://arxiv.org/html/2607.01792#S4.E9)) using the NLL supervision (Equation [6](https://arxiv.org/html/2607.01792#S4.E6)). To avoid prohibitive training cost, we train only the gating module $g$ and keep the base feature extractor $\phi_{\text{LLM}}$ frozen.
#### Dataset construction\.
We first construct a training dataset by sampling the prompts $\{\mathbf{t}_i\}_{i=1}^{n}$ and generating the corresponding feature-NLL pairs
$$\mathcal{D} = \{(\phi_{\text{LLM}}(\mathbf{t}_i), \text{NLL}(\mathbf{t}_i))\}_{i=1}^{n}, \tag{12}$$
where $\text{NLL}(\mathbf{t}_i)$ denotes the sequence of token-wise NLL scores. At this stage, we use the NLL itself as the label, instead of $f^{\star}(\mathbf{t}_i)$. This allows us to reuse the constructed dataset for different repetition thresholds $\eta_{\tau}$. As this labeling can utilize batched inference, it can be done efficiently.
For the generality of the learned gating function, we draw sample prompts $\{\mathbf{t}_i\}_{i=1}^{n}$ from general educational corpora, rather than domain-specific prompt sets. In particular, we stream and process sequences ranging from 10 to 1024 tokens in length, compiling a robust dataset of 3 million training tokens sampled from FineWeb-Edu (2 million) and Ultra-FineWeb-Edu (1 million).
| Method | Avg KV Cache | Avg Prefill FLOPs | ARC | OBQA | MMLU | MedQA | SciQ | MMLU-Pro | GSM8K | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| No Repetition | 235.5 | 1.549 T | 79.8 | 76.6 | 62.5 | 47.5 | 90.3 | 27.8 | 84.2 | 66.9 |
| Appending summary | | | | | | | | | | |
| + Naïve Summary | 277.4 | 3.676 T | 81.7 ↑ | 76.5 ↓ | 60.8 ↓ | 50.9 ↑ | 91.6 ↑ | 28.6 ↑ | 85.5 ↑ | 67.9 ↑ |
| + LLMLingua | 277.4 | 2.105 T | 81.5 ↑ | 77.8 ↑ | 61.0 ↓ | 49.3 ↑ | 90.7 ↑ | 27.7 ↓ | 84.0 ↓ | 67.4 ↑ |
| Full Repetition | 471.1 | 3.140 T | 82.8 ↑ | 78.0 ↑ | 60.3 ↓ | 51.0 ↑ | 92.9 ↑ | 27.3 ↓ | 87.3 ↑ | 68.5 ↑ |
| Compressing full rep. | | | | | | | | | | |
| + Echo Eviction | 235.5 | 3.152 T | 81.3 ↑ | 69.6 ↓ | 60.3 ↓ | 50.7 ↑ | 90.1 ↓ | 22.0 ↓ | 76.6 ↓ | 64.4 ↓ |
| + H2O Eviction | 270.8 | 3.162 T | 75.0 ↓ | 76.2 ↓ | 60.5 ↓ | 50.6 ↑ | 77.7 ↓ | 27.4 ↓ | 72.9 ↓ | 62.9 ↓ |
| + LLMLingua Comp. | 78.6 | 1.130 T | 15.2 ↓ | 18.8 ↓ | 14.8 ↓ | 12.2 ↓ | 20.6 ↓ | 10.5 ↓ | 12.2 ↓ | 14.9 ↓ |
| PartRep (ours, $\tau = 0.15$) | 280.4 | 2.481 T | 81.4 ↑ | 77.8 ↑ | 63.2 ↑ | 50.3 ↑ | 92.2 ↑ | 27.8 ↑ | 85.5 ↑ | 68.3 ↑ |

Table 1: Accuracy comparison of various prompt repetition methods across seven benchmarks, on Qwen 2.5-3B. Arrows ↓ and ↑ indicate performance drops and gains relative to “No Repetition,” respectively.
#### Training procedure\.
Given a sample prompt $\mathbf{t}$ of length $L$, training $g$ is formulated as $L$ token-wise binary classification tasks, where the ground-truth labels are provided by the top-$\tau$ NLL selector $f^{\star}$ (for a designated $\tau$). Specifically, we generate binary labels as
$$\mathbf{y} = \big(\mathbf{1}\{t_1 \in f^{\star}(\mathbf{t})\}, \ldots, \mathbf{1}\{t_L \in f^{\star}(\mathbf{t})\}\big), \tag{13}$$
where $\mathbf{1}$ denotes the indicator function. Then, we train with the sample-wise loss
$$\ell(\mathbf{t}) = \frac{1}{L} \sum_{j=1}^{L} \lambda_j \, \ell_{\text{CE}}([f(\mathbf{t})]_j, y_j), \tag{14}$$
where $[\cdot]_j$ denotes the $j$th entry of a vector. Here, $\lambda_j$ is the weight determined to mitigate the effect of class imbalance. In particular, the positive-labeled tokens (i.e., $y_j = 1$) are upweighted by
$$\lambda_j = (1-\tau)/\tau, \tag{15}$$
representing the ratio between negative and positive tokens in the training set. For the negative-labeled tokens (i.e., $y_j = 0$), we simply use $\lambda_j = 1$.
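The class-weighted loss of Equations 14 and 15 can be sketched as below. This assumes that $\ell_{\text{CE}}$ is the binary cross-entropy between the gate's output probability and the 0/1 label, which the text implies but does not spell out; the function name `gate_loss` and the clamping constant `eps` are hypothetical.

```python
import math


def gate_loss(probs, labels, tau, eps=1e-12):
    """Sample-wise weighted cross-entropy over the L tokens of one prompt.

    probs:  gate outputs [f(t)]_j in (0, 1), one per token
    labels: binary labels y_j (1 if token j is in the top-tau NLL set)
    tau:    repetition threshold used to build the labels
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")

    pos_weight = (1.0 - tau) / tau       # Equation 15: ratio of negatives to positives
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        ce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight = pos_weight if y == 1 else 1.0
        total += weight * ce
    return total / len(labels)           # the 1/L factor in Equation 14


# Four tokens, one positive, an uninformative gate (p = 0.5 everywhere).
loss = gate_loss([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 0], tau=0.25)
print(f"{loss:.4f}")
```

With $\tau = 0.25$, the single positive token receives weight 3, so it contributes as much to the loss as the three negative tokens combined; this is the balancing effect the weighting is meant to achieve.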
### 4.4 Inference
#### Connecting prompt.
Given an input prompt $\mathbf{t}$, the model performs an initial forward pass and pauses at the target layer $l^{\star}$. Hidden states at this layer are used to compute $f(\mathbf{t})$, and the selected tokens are then appended to the original prompt, connected through a trigger: `"\nPay attention to these key tokens:\n"`. The model then performs the full forward pass over the extended prompt $\mathbf{t} \oplus f(\mathbf{t})$, reusing the hidden layer features of the original prompt $\mathbf{t}$.
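The assembly step described above might look like the following sketch, which works on token ids and takes precomputed gate scores as input. The function name `build_partrep_prompt`, the floor-based budget, and the example ids (including the bridge ids) are hypothetical; a real pipeline would tokenize the trigger string with the model's tokenizer and would reuse the already computed hidden states of the original prompt, which this sketch does not model.

```python
import math

BRIDGE_TEXT = "\nPay attention to these key tokens:\n"  # trigger string from the text


def build_partrep_prompt(prompt_ids, gate_scores, tau, bridge_ids):
    """Return prompt + bridge + selected tokens (selected tokens keep prompt order)."""
    if len(prompt_ids) != len(gate_scores):
        raise ValueError("one gate score per prompt token is required")

    budget = math.floor(tau * len(prompt_ids))
    # Highest gate scores first; earlier positions win ties.
    ranked = sorted(range(len(prompt_ids)), key=lambda i: (-gate_scores[i], i))
    keep = sorted(ranked[:budget])
    selected = [prompt_ids[i] for i in keep]
    return list(prompt_ids) + list(bridge_ids) + selected


# Hypothetical ids: ten prompt tokens, a two-token bridge, tau = 0.2.
prompt_ids = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
gate_scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1]
bridge_ids = [900, 901]  # stand-in for the tokenized BRIDGE_TEXT
extended = build_partrep_prompt(prompt_ids, gate_scores, tau=0.2, bridge_ids=bridge_ids)
print(extended)  # -> [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 900, 901, 11, 13]
```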
#### Token windowing\.
PartRep directly appends tokens selected by the gate, which is effective when those tokens alone provide sufficient cues for downstream prediction. However, token-level repetition can be insufficient in two cases. First, subword tokenization may split an informative word into multiple fragments, so repeating only one selected fragment yields an incomplete lexical cue. Second, when the answer depends on local compositional structure, such as relations expressed within a sentence, repeating isolated salient tokens may discard the context needed to interpret them correctly.
To handle these cases, we introduce token windowing as an optional post-selection expansion strategy. When enabled, each selected token can be expanded to include either the full word containing it or a broader local context window. This option allows PartRep to preserve lexical integrity or short-range relational structure when isolated token repetition is insufficient.
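The two expansion modes can be illustrated with index arithmetic over the selected positions. Both helpers below are hypothetical sketches: the text does not specify how words are mapped to tokens or how wide the local window is, so `word_ids` (a per-token word index, as many tokenizers can provide) and `radius` are assumed inputs.

```python
def expand_to_words(selected, word_ids):
    """Expand selected token positions to every token of the words they belong to.

    word_ids[i] is the index of the word that token i is a fragment of.
    """
    words = {word_ids[i] for i in selected}
    return [i for i, w in enumerate(word_ids) if w in words]


def expand_to_window(selected, length, radius):
    """Expand each selected position to a local window of +/- `radius` tokens."""
    keep = set()
    for i in selected:
        keep.update(range(max(0, i - radius), min(length, i + radius + 1)))
    return sorted(keep)


# Token 2 is the middle fragment of a three-token word (word index 1).
print(expand_to_words([2], word_ids=[0, 1, 1, 1, 2]))       # -> [1, 2, 3]
# Windows around positions 0 and 5 in a 7-token prompt, clipped at the edges.
print(expand_to_window([0, 5], length=7, radius=1))          # -> [0, 1, 4, 5, 6]
```

Either expansion can push the number of repeated tokens above $\tau L$, so a budget-aware variant would need to shrink the initial selection or cap the expanded set; how this is handled is not stated in the text.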
## 5 Experiment
### 5.1 Experimental setup
#### Benchmarks.
We evaluate on a total of eight tasks. Five benchmark datasets assess general knowledge and scientific retrieval capabilities: ARC-Challenge (clark2018think); OpenBookQA (mihaylov2018can); MMLU (hendrycks2020measuring); MedQA (jin2021disease); SciQ (welbl2017crowdsourcing). Two benchmark datasets assess more complex reasoning abilities: MMLU-Pro (wang2024mmlu); GSM8K (cobbe2021training). One benchmark dataset assesses long-context capabilities: RULER (hsieh2024ruler).
At the inference stage, we use complete test sets for all benchmarks, except for RULER, where we use a subset of 100 questions per task, 1300 questions in total.
#### Models\.
We conduct experiments on three different instruction-finetuned LLMs: Qwen 2.5-3B (qwen2024qwen2); Llama 3.2-3B (grattafiori2024llama); Gemma 4-E4B (googledeepmind2026gemma4modelcard).
Notably, Gemma 4\-E4B employs a hybrid mechanism alternating between local sliding window and global attention layers\. For this model, we strictly constrain our extraction point to coincide with a global attention layer to prevent an information bottleneck within the extracted hidden states\.
#### Baselines\.
We compare our method against seven standard baselines with repetition\. Three of them are methods that do not involve full repetition\.
- No repetition. This is the vanilla setup that uses the user prompt as is, without any repetition.
- + Naïve summary. After the original prompt, we append the summary of the prompt generated by the model itself.
- + LLMLingua. Same, but using LLMLingua instead of summarization (jiang2023llmlingua).
The other four are the methods based on full repetition.
- Full repetition. We repeat the whole prompt twice (leviathan2025prompt).
- + Echo eviction. After processing the fully repeated prompt, we remove the hidden states of the first half (springer2024repetition).
- + H2O eviction. Same, but we use H2O to select tokens to evict (zhang2023h2o).
- + LLMLingua. We apply LLMLingua on the fully repeated prompt (jiang2023llmlingua).
#### Evaluation metrics\.
Our evaluation focuses on assessing the tradeoffs across three dimensions:
- Accuracy. Accuracy on the target task.
- KV cache. The average number of prompt tokens whose layerwise KV cache is stored.
- Prefill compute. The computation required for prefilling the prompt, measured in FLOPs.
Note that we account for algorithm\-specific computational overhead, including prompt processing and, when applicable, the additional cost of token selection, summarization, or cache eviction\.
#### Implementation details\.
See Appendix[A\.1](https://arxiv.org/html/2607.01792#A1.SS1)for training and Appendix[A\.2](https://arxiv.org/html/2607.01792#A1.SS2)for inference details\.
### 5.2 Main results
Table [1](https://arxiv.org/html/2607.01792#S4.T1) reports the accuracy of Qwen 2.5-3B under different repetition strategies. We first observe that full repetition provides the strongest gain, with over 1.6%p increase in average accuracy over vanilla, confirming that repeating the prompt benefits decoder-only LLMs.
PartRep closely matches this gain with an average accuracy of 68.3%, while repeating only a small subset of the prompt. Critically, our method provides consistent gains over benchmarks, even on tasks where full repetition drops the performance, e.g., MMLU and MMLU-Pro. This may be because our method helps avoid the accuracy degradation caused by overly long contexts (e.g., the lost-in-the-middle phenomenon).
Summary-based alternatives also perform strongly, but remain below our method. This suggests that compressing the prompt retains useful information, but replacing the context with an abstract summary is less effective than preserving the original prompt and appending selected tokens.
Table 2: Accuracy comparison across various LLM architectures, measured on the MMLU benchmark\.
Table 3: Accuracy comparison on the RULER long\-context benchmark at varying context lengths\. For PartRep, we use $\tau=0.3$ with local\-context token window expansion\.
### 5\.3Other LLM architectures
In [Table 2](https://arxiv.org/html/2607.01792#S5.T2), we provide additional evaluations on LLM architectures with varying baseline capabilities: Llama 3\.2\-3B and Gemma 4\-E4B, both evaluated on the MMLU benchmark\. The results confirm that our method achieves consistent gains across architectures\. Notably, our method stays reliable under the strict constraints of Gemma’s hybrid sliding\-window and global attention mechanism\.
### 5\.4Long\-context tasks
We further evaluate our method in long\-context settings using the RULER benchmark \([Table 3](https://arxiv.org/html/2607.01792#S5.T3)\) with Qwen2\.5\-3B\. The results confirm that repetition remains beneficial even when the input context becomes substantially longer\.
PartRep achieves the strongest performance across all evaluated context lengths, although the gap narrows as the number of tokens grows\. This implies that selectively repeating “only” informative tokens remains effective in long\-context scenarios\. In contrast, full repetition may simply repeat redundant information and suffer from the drawbacks of long context, and compressing the prompt \(i\.e\., using LLMLingua\) may discard critical structural cues that can only be preserved by keeping the original context intact\.
## 6Analysis
### 6\.1Ablation studies
#### Gating module\.
In [Table 4](https://arxiv.org/html/2607.01792#S6.T4), we compare the two\-layer MLP \(Eq\. [10](https://arxiv.org/html/2607.01792#S4.E10)\) against alternative gate architectures: a lighter linear layer and a heavier transformer\-based architecture\. We use the ARC\-Challenge benchmark for this comparison\. Our architecture offers a favorable trade\-off between accuracy and computation: simpler models achieve lower accuracy, while the benefit of more complex architectures saturates at this scale\.
Table 4: Ablations on gating module architecture\. Our choice \(2\-layer MLP\) achieves favorable accuracy with minimal computational overhead\.
Table 5: Effect of token\-window expansion for handling fragmented subword selections\. Adding local word context improves accuracy from 61\.5 to 66\.0 while introducing only a moderate KV\-cache increase\.
#### Token windowing\.
We also ablate token windowing on Llama 3\.2\-3B, which uses subword tokenization\. [Table 5](https://arxiv.org/html/2607.01792#S6.T5) reports the effect of expanding the selected token spans on ARC\-Challenge\. Whole\-word expansion improves accuracy from 61\.5 to 64\.6 with a small increase in KV cache, while adding one neighboring word on each side further improves accuracy to 66\.0 with a moderate increase in KV cache\.
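As a minimal sketch of the two expansion variants described above, the following assumes that each token carries a word index \(the `word_ids` list is a hypothetical input, similar to the token\-to\-word maps that many tokenizers expose\)\. The function name and the `neighbor_words` parameter are illustrative and are not taken from our implementation\.

```python
from typing import List, Optional


def expand_to_words(
    selected: List[int],
    word_ids: List[Optional[int]],
    neighbor_words: int = 0,
) -> List[int]:
    """Expand selected token indices to whole words, plus optional neighbors.

    word_ids[i] is the index of the word that token i belongs to, or None
    for tokens without a word (e.g. special tokens). Hypothetical input.
    """
    selected_set = set(selected)

    # Words touched by at least one selected token.
    hit_words = {word_ids[i] for i in selected if word_ids[i] is not None}

    # Optionally widen each hit by `neighbor_words` words on both sides.
    keep_words = set()
    for w in hit_words:
        for offset in range(-neighbor_words, neighbor_words + 1):
            keep_words.add(w + offset)

    # Keep every token of a kept word (and any selected token without a
    # word index), in original prompt order.
    return [
        i
        for i, w in enumerate(word_ids)
        if i in selected_set or (w is not None and w in keep_words)
    ]
```

With `neighbor_words=0` this corresponds to whole\-word expansion, and `neighbor_words=1` adds one neighboring word on each side\. Returning indices in prompt order keeps the appended subsequence readable\.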
#### Other ablations\.
We provide more ablations on:
- •The scoring criterion for $f^{\star}$ \(Appendix [B\.1](https://arxiv.org/html/2607.01792#A2.SS1)\)
- •The early exit layer index $l^{\star}$ \(Appendix [B\.2](https://arxiv.org/html/2607.01792#A2.SS2)\)
- •The repetition budget $\tau$ \(Appendix [B\.3](https://arxiv.org/html/2607.01792#A2.SS3)\)
- •The connecting prompt \(Appendix [B\.4](https://arxiv.org/html/2607.01792#A2.SS4)\)
- •The number of learnable keys $s$ \(Appendix B\.5, Table [7](https://arxiv.org/html/2607.01792#A2.T7)\)
Figure 3: Memory and compute scaling with prompt length\. Panel \(a\) reports the number of stored KV cache tokens, and panel \(b\) reports the estimated prefill FLOPs\. As prompt length increases, Full Repetition incurs substantially larger overhead, whereas Partial Repetition maintains lower KV cache usage and prefill compute by appending only selected informative tokens\.
### 6\.2Efficiency comparison
[Figure 3](https://arxiv.org/html/2607.01792#S6.F3) shows the KV cache and FLOPs required by each method as prompt length scales\. While full repetition maximizes reasoning accuracy, it scales prohibitively, doubling the KV cache footprint and inflating prefill complexity to $O(4L^{2})$ FLOPs\. Meanwhile, as the prompt gets longer, PartRep’s KV cache growth becomes increasingly negligible\. As for the complexity, although the gate’s token selection procedure introduces additional computational overhead, our empirical results demonstrate that the total required FLOPs still remain substantially lower than what is required in full repetition\.
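To illustrate where the factor of four comes from, the following back\-of\-the\-envelope sketch counts only the query\-key pairs scored by causal attention\. It is not the FLOP accounting used in our experiments: it ignores the connector prompt, the gate, the early\-exit pass, and all terms that grow linearly in sequence length\. The function names and the rounding of $\tau L$ are assumptions\.

```python
def kv_tokens(prompt_len: int, tau: float) -> int:
    """Prompt tokens held in the KV cache when a fraction tau is repeated.

    tau = 1.0 corresponds to full repetition; rounding is an assumption.
    """
    return prompt_len + round(tau * prompt_len)


def causal_attention_pairs(seq_len: int) -> int:
    """Number of (query, key) pairs scored by causal attention."""
    return seq_len * (seq_len + 1) // 2


def relative_attention_cost(prompt_len: int, tau: float) -> float:
    """Attention pairs with repetition, relative to the vanilla prompt."""
    repeated = causal_attention_pairs(kv_tokens(prompt_len, tau))
    vanilla = causal_attention_pairs(prompt_len)
    return repeated / vanilla
```

For large $L$ the ratio tends to $(1+\tau)^{2}$, which equals 4 for full repetition and stays much closer to 1 for small budgets\.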
## 7Conclusion
We propose PartRep, a selective prompt augmentation method that approximates the benefits of full repetition at a fraction of its compute and memory cost\. Across eight benchmarks and three model families, PartRep consistently improves over vanilla prompting and remains competitive with full repetition across reasoning, knowledge, and long\-context tasks, requiring only 59\.4% of the KV cache footprint and 79\.0% of the prefill FLOPs of full repetition\. Our ablation studies further demonstrate that high\-NLL token supervision, lightweight gating, and local token\-window expansion are effective design choices for balancing accuracy and inference efficiency\. We believe that PartRep provides a practical approach for improving decoder\-only LLM inference, particularly in long\-context scenarios where efficient use of KV cache and prefill computation is essential\.
## Limitations
One limitation of PartRep is that it requires offline training for each target LLM, rather than being applicable zero\-shot or transferable across architectures\. This limits the applicability of our framework in scenarios with a small training budget\.
Another key limitation concerns its applicability to other data modalities, e\.g\., images\. Our approach relies on an implicit assumption that each token can convey a meaningful signal by itself\. However, since repeating only a few tokens from an image and concatenating them as a sequence may produce a highly distorted image, our method may not generalize well to vision\-language models\.
## References
## Appendix AImplementation Details
### A\.1Training Details
We train our gate for 25 epochs using the Adam optimizer with a learning rate of $1\mathrm{e}{-3}$ and a batch size of 4096\. We also apply a CosineAnnealingLR scheduler and use early stopping based on validation loss with a patience of 5 epochs\.
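The schedule and stopping rule above can be sketched in plain Python as follows\. This is an illustration under stated assumptions rather than our training code: we assume the cosine period `t_max` equals the 25 training epochs and that the minimum learning rate `eta_min` is 0, neither of which is specified above\. The closed form is the standard cosine annealing formula\.

```python
import math
from typing import Iterable


def cosine_lr(
    epoch: int,
    base_lr: float = 1e-3,
    t_max: int = 25,       # assumed: one cosine period over all epochs
    eta_min: float = 0.0,  # assumed: decay all the way to zero
) -> float:
    """Closed-form cosine annealing from base_lr down to eta_min."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


def early_stop_epoch(val_losses: Iterable[float], patience: int = 5) -> int:
    """Epoch index at which training stops.

    Stops once the validation loss has failed to improve for `patience`
    consecutive epochs; returns the last epoch index if that never happens
    (or -1 for an empty sequence).
    """
    best = math.inf
    bad_epochs = 0
    last = -1
    for epoch, loss in enumerate(val_losses):
        last = epoch
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return last
```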
### A\.2Inference Details
All experiments use greedy decoding in a single run with bfloat16 precision\. The maximum number of output tokens is set to 1000 for most benchmarks, except GSM8K, which allows up to 3000\. We run inference with vLLM for 6 of the 8 prompt repetition methods; the exceptions are Echo Eviction and H2O Eviction, which use manual KV decoding\.
## Appendix BAdditional Ablations
This appendix provides additional analyses of the design choices behind PartRep\. We study five components of the method: the target scoring criterion for token selection, the early\-exit layer used for token scoring, the repetition budget, the connector prompt used to append selected tokens, and the effect of the number of learnable keys\. Unless otherwise specified, each ablation varies only one component while keeping all remaining settings fixed to the default configuration used in the main experiments\. We conduct all ablation experiments on ARC\-Challenge, with Qwen2\.5\-3B\.
### B\.1Token Scoring Criterion
Figure 4: Comparison of token scoring criteria under different repetition budgets\. Selecting tokens with the highest NLL consistently yields strong performance, supporting NLL as an effective supervision signal for the gating module\.
As our method uses token\-wise negative log\-likelihood \(NLL\) as the target importance signal for training the gating module, we first study whether this is a valid choice compared to several plausible alternatives\. We compare high\-NLL selection with low\-NLL selection, random selection, attention\-based selection, and TF–IDF\-based selection\. For TF–IDF\-based selection, we use the training dataset to measure the IDF and the target prompt to measure the TF\.
Figure [4](https://arxiv.org/html/2607.01792#A2.F4) shows that selecting high\-NLL tokens provides the most reliable accuracy improvements across repetition budgets and achieves the strongest overall performance\. In contrast, repeating low\-NLL tokens or randomly selected tokens is substantially less effective, indicating that the benefit does not arise merely from adding extra tokens\. Attention and TF–IDF scores show promising results, but we choose high\-NLL selection because it is a more principled solution, as shown in [Section 4\.1](https://arxiv.org/html/2607.01792#S4.SS1), and it consistently matches or surpasses those alternatives\. These results support our hypothesis: tokens that are difficult to predict from preceding context contain information that is particularly valuable to reintroduce at later positions\.
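For concreteness, oracle high\-NLL selection under a budget can be sketched as below\. This illustrates the supervision signal only; at inference time the gate predicts these tokens from early\-layer hidden states instead\. The input `probs` \(the probability the model assigned to each observed prompt token\), the function names, and the `ceil` rounding of the budget are assumptions made for this example\.

```python
import math
from typing import List


def token_nll(probs: List[float]) -> List[float]:
    """Negative log-likelihood of each prompt token.

    probs[i] is the probability the model assigned to the observed token i
    given its preceding context (hypothetical input).
    """
    return [-math.log(p) for p in probs]


def select_high_nll(nll: List[float], tau: float) -> List[int]:
    """Indices of the highest-NLL tokens within a budget of tau * L tokens.

    The budget is rounded up (an assumption); indices are returned in
    original prompt order so the appended subsequence stays readable.
    """
    budget = min(len(nll), math.ceil(tau * len(nll)))
    ranked = sorted(range(len(nll)), key=lambda i: nll[i], reverse=True)
    return sorted(ranked[:budget])
```

Replacing `reverse=True` with `reverse=False` gives the low\-NLL baseline used in the comparison\.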
Table 6: Connector prompt templates evaluated for appending the selected token subsequence to the original prompt\.
### B\.2Early\-exit Layer Choice
Figure 5: Effect of the early\-exit layer used by the gating module\. Accuracy improves when the gate receives sufficiently contextualized hidden states, whereas latency increases monotonically with extraction depth\. We select layer 18 as the default operating point\.
We study which intermediate layer should be used to extract the hidden states consumed by the gating module\. Extracting features from very shallow layers reduces selection latency, but these representations may not yet contain sufficient semantic information for identifying high\-information tokens\. Conversely, extracting from deeper layers increases latency and reduces the computational advantage of early exit\.
Figure [5](https://arxiv.org/html/2607.01792#A2.F5) shows that accuracy improves substantially once moderately deep representations are used, while later extraction points provide limited or no additional benefit despite higher latency\. In particular, layer 18 achieves the best accuracy–latency trade\-off\. We therefore use this layer as the default extraction point in the main experiments\.
### B\.3Repetition Budget
Figure 6: Effect of the repetition ratio $\tau$\. Larger budgets increase the KV\-cache footprint, while accuracy is non\-monotonic, suggesting that selectively repeating a compact set of informative tokens is preferable to indiscriminately increasing the repeated context\.
We next examine the effect of the repetition budget $\tau$, which controls the fraction of original prompt tokens appended by Partial Repetition\. Increasing $\tau$ provides the model with more repeated information, but also increases the KV\-cache footprint and prefill cost\. Thus, the optimal operating point should preserve the benefit of repetition without approaching the overhead of full repetition\.
As shown in Figure [6](https://arxiv.org/html/2607.01792#A2.F6), accuracy does not improve monotonically with a larger repetition budget\. We conjecture that this non\-monotonic trend is partly associated with the verbal prompt trigger used to introduce the repeated tokens\. The trigger frames the appended subsequence as a compact set of salient cues; however, as $\tau$ increases, the repeated subsequence may contain increasingly redundant or weakly informative tokens, making this framing less precise and diluting the salience of genuinely useful tokens\.
The selected operating point, $\tau=0.15$, achieves strong accuracy while retaining a substantially smaller KV\-cache footprint than full repetition\.
### B\.4Connecting Prompt
Figure 7: Comparison of connector prompts used to append the selected tokens in ARC\. A simple natural\-language instruction, “Pay attention to these key tokens:”, performs best and is used as the default connector\.
After selecting informative tokens, Partial Repetition appends them to the original prompt through a short connector string\. We evaluate whether the form of this connector affects the usefulness of the repeated tokens\. Specifically, we compare a natural\-language verbal instruction, structured wrappers such as ChatML and XML, a raw token concatenation baseline, and a more technical verbal instruction\. See the example queries in Table [6](https://arxiv.org/html/2607.01792#A2.T6)\.
Figure [7](https://arxiv.org/html/2607.01792#A2.F7) shows that the simple verbal connector, “Pay attention to these key tokens:”, achieves the strongest accuracy\. This suggests that explicitly framing the appended subsequence as task\-relevant information helps the model interpret the repeated tokens, whereas raw concatenation or unnecessarily structured wrappers are less effective\. We therefore adopt the verbal connector throughout the main experiments\.
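A minimal sketch of how the final prompt could be assembled with the verbal connector is given below\. The connector string is the one reported above; the function name, the newline and space separators, and the optional re\-appended answer\-format instruction \(mirroring the example in Appendix C\.3\) are assumptions made for illustration\.

```python
from typing import List

DEFAULT_CONNECTOR = "Pay attention to these key tokens:"


def build_partrep_prompt(
    original_prompt: str,
    selected_tokens: List[str],
    answer_format: str = "",
    connector: str = DEFAULT_CONNECTOR,
) -> str:
    """Original prompt, then the connector with selected tokens.

    Separators (newlines between parts, spaces between tokens) are
    assumptions; the original prompt itself is left untouched.
    """
    parts = [original_prompt, connector + " " + " ".join(selected_tokens)]
    if answer_format:
        # Optionally restate the answer-format instruction at the very end.
        parts.append(answer_format)
    return "\n".join(parts)
```

Keeping the original prompt intact and placing the selected tokens after it is what lets them attend to the full context at late positions\.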
### B\.5Number of Learnable Keys
Table 7: Ablations on the number of learnable keys used for the gating module\. Our choice \(64\) achieves better accuracy than using fewer \(32\) or more \(128\) keys\.
In [Table 7](https://arxiv.org/html/2607.01792#A2.T7), we test various choices for the number of learnable keys in the gating module \(Eq\. [11](https://arxiv.org/html/2607.01792#S4.E11)\) on the ARC\-Challenge dataset\. We observe that our choice of 64 is the sweet spot: both increasing and decreasing it degrade accuracy\.
## Appendix CQualitative results
This appendix provides qualitative examples of how various prompt repetition methods construct the final input prompt given to the model\. We use examples from the ARC\-Challenge dataset and run them on Qwen2\.5\-3B\. For KV cache eviction methods, evicted tokens are marked in red\. The original prompt is as follows\.
Farmers in Wyoming were concerned because some of their chickens were being preyed upon by hawks that lived in areas around their ranches\. The farmers grouped together and hunted the hawks until they were no longer in their area\. Which would most likely happen next?A\. The chicken population would go down\.B\. Populations of mice and rats would increase\.C\. Another bird of prey would replace the hawk\.D\. The chickens would have a lower rate of disease\.Reply with one letter in the format:The answer is:
### C\.1No Repetition
This method utilizes the vanilla setup, where the original ARC prompt is given to the model without any additional compressed context or repetition\.
### C\.2Full Repetition
This method appends a second copy of the entire original prompt\. It gives the model a complete second look over the prompt, but at the cost of doubling the prompt length\.
Farmers in Wyoming were concerned because some of their chickens were being preyed upon by hawks that lived in areas around their ranches\. The farmers grouped together and hunted the hawks until they were no longer in their area\. Which would most likely happen next?A\. The chicken population would go down\.B\. Populations of mice and rats would increase\.C\. Another bird of prey would replace the hawk\.D\. The chickens would have a lower rate of disease\.Reply with one letter in the format:The answer is:
### C\.3PartRep \(Ours\)
This method keeps the original prompt in Section [C\.1](https://arxiv.org/html/2607.01792#A3.SS1) intact and appends only the important tokens chosen by our method at the end\.
Pay attention to these key tokens:Wyoming were concerned some chickens prey grouped hunted until would Pop mice Another replaceReply with one letter in the format:The answer is:
### C\.4Naïve Summary
This method appends a short summary generated by the model to the original prompt in Section [C\.1](https://arxiv.org/html/2607.01792#A3.SS1)\. Summary length is restricted to $\tau$ tokens, keeping it on par with PartRep in terms of KV\-cache\.
Here is the summary of the prompt:Farmers in Wyoming are concerned about hawks preying on their chickens,
### C\.5LLMLingua
This method also appends a short summary to the original prompt in Section [C\.1](https://arxiv.org/html/2607.01792#A3.SS1); however, the prompt is compressed by Microsoft’s LLMLingua instead of the model itself\. Summary length is also restricted to $\tau$ for a fair comparison\.
Here is the summary of the prompt:Wyoming chickens preyed\.\.?chicken\.mice\.
### C\.6LLMLingua Comp\.
This method compresses the fully repeated prompt in Section [C\.2](https://arxiv.org/html/2607.01792#A3.SS2) using LLMLingua\. We feed only the compressed prompt to the main model, hence KV\-cache memory remains low\.
Wyoming chickens preyed\.\.?chicken\.mice rats\.\.\.\.\.preyed\.\.?chicken population\.mice rats\.\.\.\.\.Reply with one letter in the format:The answer is:
### C\.7Echo Eviction
This method first applies Full Repetition as in Section [C\.2](https://arxiv.org/html/2607.01792#A3.SS2), then evicts the KV\-cache entries that correspond to the original prompt \(Section [C\.1](https://arxiv.org/html/2607.01792#A3.SS1)\)\. Therefore, the textual prompt given to the model is identical to Full Repetition, yet only the repeated part is retained for answer generation\.
Farmers in Wyoming were concerned because some of their chickens were being preyed upon by hawks that lived in areas around their ranches\. The farmers grouped together and hunted the hawks until they were no longer in their area\. Which would most likely happen next?A\. The chicken population would go down\.B\. Populations of mice and rats would increase\.C\. Another bird of prey would replace the hawk\.D\. The chickens would have a lower rate of disease\.Reply with one letter in the format:The answer is:Farmers in Wyoming were concerned because some of their chickens were being preyed upon by hawks that lived in areas around their ranches\. The farmers grouped together and hunted the hawks until they were no longer in their area\. Which would most likely happen next?A\. The chicken population would go down\.B\. Populations of mice and rats would increase\.C\. Another bird of prey would replace the hawk\.D\. The chickens would have a lower rate of disease\.Reply with one letter in the format:The answer is:
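The eviction step of this baseline can be sketched as follows, assuming the cache is stored as one list of per\-position entries per layer\. This layout and the function name are hypothetical and chosen only to keep the example self\-contained; real inference engines store the cache as tensors\.

```python
from typing import List, Sequence, TypeVar

Entry = TypeVar("Entry")


def echo_evict(
    cache: Sequence[Sequence[Entry]],
    original_len: int,
) -> List[List[Entry]]:
    """Drop the cache entries of the first prompt copy in every layer.

    cache[layer][pos] holds the (key, value) entry for one position after
    prefilling the fully repeated prompt (hypothetical layout). Only the
    entries of the second copy, at positions >= original_len, are kept.
    """
    for layer in cache:
        if not 0 <= original_len <= len(layer):
            raise ValueError("original_len must lie within the cached sequence")
    return [list(layer[original_len:]) for layer in cache]
```

After eviction, the retained cache has the same length as the vanilla prompt, while its entries were computed with access to the first copy\.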
### C\.8H2O Eviction
This method also starts from the fully repeated prompt as in Section [C\.2](https://arxiv.org/html/2607.01792#A3.SS2), but further applies H2O to retain only selected heavy\-hitter tokens in the KV\-cache\. The number of retained tokens is set to $L+\tau$ to match the length of PartRep\.
Farmers in Wyoming were concerned because some of their chickens were being preyed upon by hawks that lived in areas around their ranches\. The farmers grouped together and hunted the hawks untilthey were nolonger intheir area\. Which would most likely happen next?A\. The chicken population wouldgo down\.B\. Populations of mice and rats would increase\.C\. Another birdof prey would replace the hawk\.D\. The chickens would have a lowerrate of disease\.Reply with one letter in the format:The answer is:Farmers in Wyoming were concerned because some of their chickenswere being preyed upon byhawks \. The farmers grouped together and hunted the hawks untilthey were no longer in their area\. Whichwould most likely happen next?A\.The chicken population would go down\.B\.Populations of mice and rats would increase\.C\.Another bird of prey would replace the hawk\.D\.The chickens would have a lower rate of disease\.Reply withone letter in the format:The answer is:
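The idea behind heavy\-hitter retention can be illustrated with the simplified, one\-shot sketch below\. It is not the H2O algorithm itself, which evicts greedily during generation and operates per attention head; here we assume a single causal attention matrix is available after prefill, and the function name and the `recent` parameter are hypothetical\.

```python
from typing import List


def heavy_hitter_keep(
    attn: List[List[float]],
    budget: int,
    recent: int,
) -> List[int]:
    """Choose `budget` key positions to keep in the cache.

    attn[i][j] is the attention weight of query i on key j (j <= i).
    The `recent` latest positions are always kept; the remaining slots go
    to the positions with the largest accumulated attention.
    """
    n = len(attn)
    budget = min(budget, n)
    recent = min(recent, budget)

    # Accumulated attention received by each key over all queries.
    scores = [sum(row[j] for row in attn if j < len(row)) for j in range(n)]

    recent_ids = set(range(n - recent, n))
    older = sorted(
        (j for j in range(n) if j not in recent_ids),
        key=lambda j: scores[j],
        reverse=True,
    )
    keep = recent_ids | set(older[: budget - len(recent_ids)])
    return sorted(keep)
```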
## Appendix DResources
All experiments are performed on 8×NVIDIA GeForce RTX 4090 GPUs\.
## Appendix EArtifact Licenses and Intended Use
In this section, we list the licenses of the open\-weight models, datasets, and benchmarks used in our study\. Our use of these artifacts follows their original licenses and is consistent with their intended use for academic research\.
### E\.1Models
- •Qwen2\.5\-3B Instruct: Qwen RESEARCH LICENSE AGREEMENT
- •Llama3\.2\-3B Instruct: Llama 3\.2 Community License Agreement
- •Gemma4\-E4B it: Apache License 2\.0
- •llmlingua\-2\-xlm\-roberta\-large\-meetingbank: MIT License
### E\.2Datasets and Benchmarks
- •ARC: Creative Commons Attribution Share Alike 4\.0
- •OpenbookQA: Apache License 2\.0
- •SciQ: Creative Commons Attribution Non Commercial 3\.0
- •MedQA: Creative Commons Attribution 4\.0
- •MMLU: MIT License
- •MMLU\-Pro: MIT License
- •GSM8K: MIT License
- •Nvidia RULER: Apache License 2\.0
- •FineWeb\-Edu: Open Data Commons License Attribution family
- •Ultra\-FineWeb\-EDU: Apache License 2\.0