Context Distillation as Latent Memory Management

arXiv cs.LG Papers

Summary

This paper formulates context distillation as a latent memory management problem, proposing a framework that stores distilled contexts as independent LoRA adapters with retrieval, routing, and self-gating to improve robustness and efficiency.

arXiv:2605.28889v1 Announce Type: new Abstract: Context distillation compresses contextual information into model parameters, yet existing methods often ignore how multiple distilled latent memories should be stored, retrieved, and safely activated in non-oracle settings. We formulate context distillation as a latent memory management problem. We distill each context into an independent LoRA adapter, forming a modular memory bank that enables explicit memory selection. Given a query, our framework retrieves candidate memories, routes the query to the most suitable adapter, and uses a Self-Gating mechanism to decide whether latent memory should be activated. To improve efficiency, we further introduce cache sharing to reduce management overhead during inference. Experiments show that our method substantially outperforms baselines with retrieval, while Self-Gating improves robustness by deactivating unnecessary latent memories.

Cached at: 05/29/26, 09:12 AM

# Context Distillation as Latent Memory Management
Source: [https://arxiv.org/html/2605.28889](https://arxiv.org/html/2605.28889)
Ziyang Zheng1,2, Zeju Li1, Xiangyu Wen1, Jianyuan Zhong1, Junhua Huang2, Lei Chen2, Mingxuan Yuan2, Qiang Xu1

1 The Chinese University of Hong Kong, 2 Huawei Noah’s Ark Lab. Correspondence: [qxu@cse.cuhk.edu.hk](mailto:qxu@cse.cuhk.edu.hk)

###### Abstract

Context distillation compresses contextual information into model parameters, yet existing methods often ignore how multiple distilled latent memories should be stored, retrieved, and safely activated in non-oracle settings. We formulate context distillation as a latent memory management problem. We distill each context into an independent LoRA adapter, forming a modular memory bank that enables explicit memory selection. Given a query, our framework retrieves candidate memories, routes the query to the most suitable adapter, and uses a Self-Gating mechanism to decide whether latent memory should be activated. To improve efficiency, we further introduce cache sharing to reduce management overhead during inference. Experiments show that our method substantially outperforms baselines with retrieval, while Self-Gating improves robustness by deactivating unnecessary latent memories.


## 1 Introduction

LLMs adapt to documents, tasks, and users primarily by placing relevant information in the context window (Brown et al., [2020](https://arxiv.org/html/2605.28889#bib.bib23); Lewis et al., [2020](https://arxiv.org/html/2605.28889#bib.bib24)). This in-context adaptation is flexible but inherently temporary: the same information must be re-read for every query, and long prompts increase latency (Tay et al., [2022](https://arxiv.org/html/2605.28889#bib.bib25)), memory usage, and generation instability (Zhao et al., [2021](https://arxiv.org/html/2605.28889#bib.bib26); Liu et al., [2024](https://arxiv.org/html/2605.28889#bib.bib27)). These limitations raise a central question: *Can contextual information be converted from temporary text into persistent model memory, and if so, how should many such memories be stored, retrieved, and safely activated?*

Context distillation (CD) (Cao et al., [2025](https://arxiv.org/html/2605.28889#bib.bib1); Wang et al., [2024](https://arxiv.org/html/2605.28889#bib.bib2); Ye et al., [2026](https://arxiv.org/html/2605.28889#bib.bib9); Zhang et al., [2026](https://arxiv.org/html/2605.28889#bib.bib16); Charakorn et al., [2026](https://arxiv.org/html/2605.28889#bib.bib14), [2025](https://arxiv.org/html/2605.28889#bib.bib15)) offers a natural answer to the first part of this question. Instead of conditioning on a long context at every inference step, CD trains a model to reproduce its context-conditioned behavior without explicitly seeing the context, thereby compressing textual context into latent memory stored in model parameters for future use.

![Refer to caption](https://arxiv.org/html/2605.28889v1/x1.png) Figure 1: Comparison of cumulative context distillation and our method. Cumulative distillation compresses all contexts into one adapter, whereas our framework stores contexts as modular memories and retrieves, routes, and gates the latent memories for each query.

However, the second part of the question remains largely unaddressed. Practical deployment requires not only writing a context into parameters, but also managing a collection of parameterized memories. Existing CD methods typically bypass this issue by assuming oracle access to the relevant distilled memory (Ye et al., [2026](https://arxiv.org/html/2605.28889#bib.bib9); Zhang et al., [2026](https://arxiv.org/html/2605.28889#bib.bib16); Snell et al., [2022](https://arxiv.org/html/2605.28889#bib.bib20); Caccia et al., [2025](https://arxiv.org/html/2605.28889#bib.bib30)), or by continually writing new contexts into a single adapter with a cumulative paradigm (Wang et al., [2024](https://arxiv.org/html/2605.28889#bib.bib2); Cao et al., [2025](https://arxiv.org/html/2605.28889#bib.bib1)).

The cumulative paradigm, as illustrated in Figure [1](https://arxiv.org/html/2605.28889#S1.F1), is conceptually simple: each new context is absorbed into the current parameter state, so the resulting model is expected to implicitly contain all previously observed contexts. As a result, newly distilled contexts can overwrite earlier ones (Shenfeld et al., [2026](https://arxiv.org/html/2605.28889#bib.bib10); Li and Hoiem, [2017](https://arxiv.org/html/2605.28889#bib.bib29); Kirkpatrick et al., [2017](https://arxiv.org/html/2605.28889#bib.bib28)), leading to severe forgetting and unreliable activation of previously written latent memories. Moreover, since all contexts are compressed into a single state, the model lacks an explicit mechanism to select among memories or to deactivate them when a query does not require contextual knowledge. Consequently, cumulative distillation is fragile in realistic non-oracle settings, where the system must determine both which latent memory to use and whether any memory should be activated.

In this paper, we formulate context distillation as a *latent memory management* problem, where distilled contexts are maintained as persistent modular memories. Specifically, we distill each context into a dedicated LoRA adapter to form a memory bank. As shown in Figure [1](https://arxiv.org/html/2605.28889#S1.F1), our framework retrieves candidate memories for a given query, routes it to the most suitable adapter, and employs a Self-Gating mechanism to decide whether latent memory should be activated, avoiding harmful use on context-agnostic queries. To improve efficiency, we further introduce cache-sharing to amortize context encoding during distillation and reduce retrieval and gating overhead at inference time. Experiments under realistic settings show that our method substantially outperforms cumulative-memory baselines, while Self-Gating and cache-sharing improve robustness and efficiency.

Our contributions are threefold:

- We formulate context distillation as latent memory management, shifting focus from single-context internalization to storage, retrieval, and activation of latent memories.
- We propose a modular adapter memory bank with two-stage latent retrieval and Self-Gating.
- We introduce cache-sharing distillation, enabling efficient adapter switching while preserving downstream performance.

## 2 Background

### 2.1 KV-cache Compression

KV-cache compression aims to compress long contexts into shorter latent tokens, typically achieving a $15\times$-$30\times$ compression ratio. Previous works (Eyuboglu et al., [2025](https://arxiv.org/html/2605.28889#bib.bib13); Liu and Qiu, [2025](https://arxiv.org/html/2605.28889#bib.bib18); Li et al., [2026](https://arxiv.org/html/2605.28889#bib.bib19)) first compress the context $c$ with a compressor $\phi$ and apply the context distillation objective:

$$D_{\mathrm{KL}}\!\left(\pi(\cdot\mid q,c)\,\middle\|\,\pi(\cdot\mid q,\phi(c))\right), \tag{1}$$

where $q$ denotes a query. The choice of $\phi$ varies across works. Cartridges (Eyuboglu et al., [2025](https://arxiv.org/html/2605.28889#bib.bib13)) and Latent Context Compilation (Li et al., [2026](https://arxiv.org/html/2605.28889#bib.bib19)) compress the context via the base model with LoRA, while C3 (Liu and Qiu, [2025](https://arxiv.org/html/2605.28889#bib.bib18)) uses a small base model as the context compression encoder.

### 2.2 Context Distillation

Context distillation aims to internalize the context into the model parameters by minimizing:

$$D_{\mathrm{KL}}\!\left(\pi_{\theta}(\cdot\mid q,c)\,\middle\|\,\pi_{\theta+\Delta\theta}(\cdot\mid q)\right). \tag{2}$$

Snell et al. ([2022](https://arxiv.org/html/2605.28889#bib.bib20)) first proposed internalizing contextual information into model parameters through distillation. Similarly, Prompt baking (Bhargava et al., [2024](https://arxiv.org/html/2605.28889#bib.bib17)) proposes baking prompt information into model parameters. Charakorn et al. ([2025](https://arxiv.org/html/2605.28889#bib.bib15), [2026](https://arxiv.org/html/2605.28889#bib.bib14)) adopt the idea of meta-learning and use a hypernetwork to facilitate the distillation. Recently, OPCD (Ye et al., [2026](https://arxiv.org/html/2605.28889#bib.bib9)) and Opsdl (Zhang et al., [2026](https://arxiv.org/html/2605.28889#bib.bib16)) adopt on-policy distillation to further enhance the distillation capability. Caccia et al. ([2025](https://arxiv.org/html/2605.28889#bib.bib30)) first discuss the synergies between CD and RAG. However, they only augment the oracle latent memory with the retrieved passage, instead of managing the latent memory itself. Beyond internalizing a single context, how to manage latent memories remains largely unaddressed.

![Refer to caption](https://arxiv.org/html/2605.28889v1/x2.png) Figure 2: Overview of our training stage and inference stage. Training: The teacher model precomputes the KV-cache of the target document for reuse during training, while the student computes the prefix KV-cache with the base model to enable the *Hot-swap* property for management. Inference: Given a query, we first retrieve the most relevant latent memory $\Delta\theta_i$, and then use Self-Gating to decide whether to continue with $\theta+\Delta\theta_i$ or fall back to $\theta$. Owing to the cache-sharing distillation introduced during training, memory management remains efficient.

### 2.3 Cumulative Distillation

Cumulative context distillation (Wang et al., [2024](https://arxiv.org/html/2605.28889#bib.bib2); Cao et al., [2025](https://arxiv.org/html/2605.28889#bib.bib1)) updates the model sequentially by distilling each new context into its parameters:

$$\theta_i=\arg\min_{\theta} D_{\mathrm{KL}}\big(\pi_{\theta_{i-1}}(\cdot\mid q,c_i)\parallel\pi_{\theta}(\cdot\mid q)\big), \tag{3}$$

with the goal of progressively internalizing all preceding contexts into the latest model $\theta_n$, thereby enabling inference with the latest adapter instead of relying on an oracle selection of the appropriate adapter. However, such sequential writing can lead to severe catastrophic forgetting (Shenfeld et al., [2026](https://arxiv.org/html/2605.28889#bib.bib10); Li and Hoiem, [2017](https://arxiv.org/html/2605.28889#bib.bib29); Kirkpatrick et al., [2017](https://arxiv.org/html/2605.28889#bib.bib28)). Beyond the forgetting issue, we further argue that under an idealized recursive distillation assumption, cumulative distillation behaves analogously to distilling the base model conditioned on the accumulated contexts $\{c_1,\ldots,c_i\}$:

$$\pi_{\theta_i^{*}}(y\mid q)=\pi_{\theta}(y\mid q,[c_1,\ldots,c_i]). \tag{4}$$

We provide further details in Appendix [A](https://arxiv.org/html/2605.28889#A1). This perspective suggests that cumulative CD inherits the teacher model’s sensitivity to irrelevant accumulated contexts, as well as its degradation under long-context inputs. Consequently, performance may deteriorate even under the oracle test setting.

## 3 Methodology

### 3.1 Cache-Sharing Context Distillation

Given a context stream $\mathcal{C}=\{c_1,c_2,\dots,c_t\}$, we freeze the base model $f_\theta$ and train a separate PEFT module, e.g., LoRA, $\Delta\theta_i$ for each context $c_i$. The student model $\theta+\Delta\theta_i$ is trained to mimic the teacher model, i.e., the base model explicitly conditioned on $c_i$. For a set of queries $\{q_1,\dots,q_n\}$, we minimize the KL divergence:

$$\Delta\theta_i^{*}=\mathop{\arg\min}_{\Delta\theta_i}\frac{1}{n}\sum_{j=1}^{n}D_{\mathrm{KL}}\big(\pi_{\theta}(y_j\mid q_j,c_i)\parallel\pi_{\theta+\Delta\theta_i}(y_j\mid \text{KV}_{q_j})\big), \tag{5}$$

where $\text{KV}_{q_j}=f_\theta(q_j)$ and $y_j$ denotes the target response distribution produced by the teacher. To reduce both training and inference cost, we introduce a dual cache-sharing mechanism for the teacher and student models.
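As a rough illustration of the per-query term in Eq. 5, the sketch below computes the KL divergence between a teacher and a student next-token distribution. The logits, the four-token vocabulary, and the helper names `softmax` and `kl_divergence` are hypothetical and are not taken from the actual training code; in practice the divergence would be evaluated over the full vocabulary at every response position.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits over a toy 4-token vocabulary.
teacher_logits = [2.0, 0.5, -1.0, 0.0]  # base model conditioned on (q, c)
student_logits = [1.5, 0.7, -0.8, 0.1]  # adapted model conditioned on q only

teacher = softmax(teacher_logits)
student = softmax(student_logits)

# The distillation loss pushes the student toward the teacher distribution.
loss = kl_divergence(teacher, student)
print(f"KL(teacher || student) = {loss:.4f}")
```

The direction of the divergence follows Eq. 5: the teacher distribution is the reference and the student is the approximation being trained.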

#### Teacher Cache

For the teacher $\pi_{\theta}(\cdot\mid q_j,c_i)$, prefilling the long context $c_i$ dominates computation. During distillation of a fixed adapter $\Delta\theta_i$, however, $c_i$ remains unchanged, while only $q_j$ and $y_j$ vary. We therefore precompute the KV-cache of $c_i$ once with the base model and reuse it for all training steps, substantially accelerating teacher-side generation.

#### Student Cache

We further introduce a decoupled cache-sharing pipeline for the student, yielding nearly zero-overhead adapter switching, which we refer to as the *Hot-swap* property. Specifically, student-side cache sharing deliberately trains a cache-compatible student distribution. Instead of applying the adapter while encoding the query prefix, we compute the prefix KV-cache with the frozen base model and activate the adapter only for response generation. This constrains the adapter to be compatible with base-model prefix caches, enabling adapter hot-swapping without KV recomputation. Consequently, the system can seamlessly switch among adapters $\Delta\theta_i$ while reusing the same prefix cache, which is essential for the Gating and Retrieval systems described in Sections [3.2](https://arxiv.org/html/2605.28889#S3.SS2) and [3.3](https://arxiv.org/html/2605.28889#S3.SS3).
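The control flow behind the Hot-swap property can be sketched with a toy stand-in for the model. Everything here is hypothetical: `ToyModel`, its `prefill` and `first_token_score` methods, and the scalar "adapters" only mimic the order of operations (one base-model prefill, then several adapter activations on the same cache) and do not correspond to a real transformer or LoRA API.

```python
class ToyModel:
    """Toy stand-in for a frozen base model with swappable adapters."""

    def __init__(self):
        self.prefill_calls = 0  # counts how often the prefix is encoded

    def prefill(self, query_tokens):
        # Base-model-only pass over the query prefix; returns a fake "KV-cache".
        self.prefill_calls += 1
        return tuple(query_tokens)

    def first_token_score(self, kv_cache, adapter=None):
        # The adapter, if any, is applied only at generation time,
        # on top of the cache that was produced without it.
        base = float(sum(kv_cache))
        shift = 0.0 if adapter is None else adapter["shift"]
        return base + shift


model = ToyModel()
adapters = {"doc_a": {"shift": 1.0}, "doc_b": {"shift": -2.0}}

kv = model.prefill([3, 1, 4])  # computed once with the base model
outputs = {name: model.first_token_score(kv, a) for name, a in adapters.items()}
outputs["base"] = model.first_token_score(kv)  # fallback reuses the same cache

print(outputs, "prefill calls:", model.prefill_calls)
```

The point of the sketch is the call count: switching among adapters, or falling back to the base model, never triggers a second prefill.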

### 3.2 Self-Gating System

As illustrated in Figure [4](https://arxiv.org/html/2605.28889#S3.F4), we observe a distinct phenomenon in the predictive entropy of the first generated token. In standard In-Context Learning (ICL), there exists a clear distributional margin between context-specific queries and context-agnostic queries. In contrast, the base model without any appended context exhibits no such distributional gap. Since our distillation objective in Eq. [5](https://arxiv.org/html/2605.28889#S3.E5) explicitly aligns the output distribution of the LoRA student with the ICL teacher, the trained context-specific LoRA models inherit this entropy gap.

Computing this first-token entropy with standard ICL is prohibitively expensive because the long context must be processed for every query. However, our cache-sharing mechanism makes it computationally trivial. Motivated by this observation, we propose a *Self-Gating* mechanism that leverages the first-token entropy for dynamic model gating. As shown in Figure [3](https://arxiv.org/html/2605.28889#S3.F3), given a context adapter $\Delta\theta_i$, we first process the input prefix (i.e., prompt and query, denoted simply as $q$) exclusively through the frozen base model $f_\theta$ to obtain the shared prefix KV-cache:

$$\text{KV}_q=f_\theta(q). \tag{6}$$

We then temporarily activate the LoRA module $\Delta\theta_i$ to generate the probability distribution $p_1$ of the first token:

$$p_1=f_{\theta+\Delta\theta_i}(\text{KV}_q). \tag{7}$$

![Refer to caption](https://arxiv.org/html/2605.28889v1/x3.png) Figure 3: Overview of the proposed Self-Gating mechanism. The system routes the generation between the specific LoRA and the base model based on the first-token entropy, with zero KV-cache recomputation.

![Refer to caption](https://arxiv.org/html/2605.28889v1/fig/entropy_distribution.png) Figure 4: Distribution of the first-token entropy. A clear gap exists between context-specific and context-agnostic queries when context is provided, whereas the base model exhibits no such distinction.

Let $\mathcal{H}(p_1)$ denote the Shannon entropy of this distribution. We use a threshold $\lambda$ for gating:

- Continue ($\mathcal{H}(p_1)<\lambda$): Low entropy indicates that the LoRA model is highly confident in answering the query based on its injected knowledge. We therefore retain the LoRA module and continue decoding.
- Fallback ($\mathcal{H}(p_1)\geq\lambda$): High entropy indicates significant uncertainty, suggesting that the query is likely context-agnostic, e.g., a general knowledge question. We then instantly deactivate the LoRA module and fall back to the base model.

When falling back, the base model seamlessly reuses the already computed $\text{KV}_q$ for subsequent generation. Therefore, the *only* additional computational overhead introduced by Self-Gating is the forward pass of a single token using the LoRA model in Eq. [7](https://arxiv.org/html/2605.28889#S3.E7), making the entire gating process exceptionally efficient.
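The gating rule above can be sketched in a few lines. The two first-token distributions and the threshold value `LAMBDA = 0.7` are made up for illustration; the threshold used in our experiments is not implied by this example, and `self_gate` is a hypothetical helper name.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def self_gate(first_token_probs, threshold):
    """Return ('continue' | 'fallback', entropy) for the adapter's first-token distribution."""
    h = shannon_entropy(first_token_probs)
    decision = "continue" if h < threshold else "fallback"
    return decision, h

LAMBDA = 0.7  # hypothetical threshold, chosen only for this toy example

confident = [0.97, 0.01, 0.01, 0.01]  # adapter is sure about the first token
uncertain = [0.25, 0.25, 0.25, 0.25]  # adapter has no preference

decision_a, h_a = self_gate(confident, LAMBDA)
decision_b, h_b = self_gate(uncertain, LAMBDA)
print(decision_a, round(h_a, 3))
print(decision_b, round(h_b, 3))
```

Because the decision depends only on $p_1$, it costs a single adapter forward step on top of the shared prefix cache.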

### 3.3 Latent Memory Retrieval System

By distilling each context into an independent LoRA module $\{\Delta\theta_1,\dots,\Delta\theta_t\}$, we recast context utilization as a module selection and routing problem. To manage these distributed latent memories, we introduce a two-stage retrieval system: External Retrieval and Internal Routing, as shown in Figure [5](https://arxiv.org/html/2605.28889#S3.F5).

![Refer to caption](https://arxiv.org/html/2605.28889v1/x4.png) Figure 5: Latent memory retrieval with independently distilled LoRA adapters. An external retriever first selects candidate adapters by cosine similarity, and an internal router further ranks them using first-token entropy and hidden states under a shared prefix KV-cache.

#### External Retrieval (Coarse-Grained)

Similar to dense retrieval in RAG, we use an off-the-shelf embedding model $f_{\text{emb}}$ to encode each context and the incoming query:

$$\mathbf{e}_{c_i}=f_{\text{emb}}(c_i), \tag{8}$$

$$\mathbf{e}_{q}=f_{\text{emb}}(q). \tag{9}$$

We then select the top-$K$ LoRA modules with the highest cosine similarity to the query:

$$\mathcal{S}_{\text{topK}}=\mathop{\text{TopK}}_{i\in\{1,\dots,t\}}\big(\cos(\mathbf{e}_{c_i},\mathbf{e}_{q})\big). \tag{10}$$
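A minimal sketch of the coarse-grained step in Eq. 10 follows. The two-dimensional embeddings are toy values standing in for the output of the embedding model, and `top_k_adapters` is a hypothetical helper name; a real system would use the embedding model's vectors and likely a vectorized or indexed search.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_adapters(query_emb, context_embs, k):
    """Indices of the k contexts (and hence adapters) most similar to the query."""
    scored = [(cosine(emb, query_emb), i) for i, emb in enumerate(context_embs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy embeddings: one vector per distilled context, in adapter order.
context_embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query_emb = [0.9, 0.1]

candidates = top_k_adapters(query_emb, context_embs, k=2)
print(candidates)
```

Since each context maps to exactly one adapter, retrieving context indices is equivalent to retrieving candidate adapters.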

#### Internal Routing \(Fine\-Grained\)

Although external embeddings capture semantic relevance, they may not align perfectly with the LLM's generative confidence. We therefore refine the candidate set $\mathcal{S}_{\text{topK}}$ with an internal router. Using the cache-sharing mechanism in Section [3.2](https://arxiv.org/html/2605.28889#S3.SS2), we first compute the shared prefix KV-cache with the base model. For each candidate adapter $\Delta\theta_i\in\mathcal{S}_{\text{topK}}$, we briefly activate it to obtain the first-token hidden state $h_i$ and predictive entropy $\mathcal{H}_i$. The final adapter is selected by a lightweight feed-forward network (FFN) router conditioned on the external similarity $\mathrm{sim}_i=\cos(\mathbf{e}_{q},\mathbf{e}_{c_i})$, entropy $\mathcal{H}_i$, and hidden state $h_i$ (details in Appendix [C.5](https://arxiv.org/html/2605.28889#A3.SS5)):

$$\Delta\theta^{*}=\arg\max_{\Delta\theta_i\in\mathcal{S}_{\text{topK}}}\mathrm{router}(\mathrm{sim}_i,\mathcal{H}_i,h_i). \tag{11}$$

Since all candidates share the same query-prefix KV-cache, evaluating multiple LoRAs during internal routing introduces negligible overhead.
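The selection in Eq. 11 can be illustrated with a deliberately tiny router. The one-hidden-unit FFN, its hand-set weights, and the candidate features below are all assumptions made for this sketch; the actual router architecture and training are described in the appendix and are not reproduced here.

```python
def relu(x):
    return x if x > 0 else 0.0

def router_score(sim, entropy, hidden, w1, b1, w2, b2):
    """Score one candidate with a one-hidden-layer FFN over [sim, entropy, *hidden]."""
    features = [sim, entropy] + list(hidden)
    hidden_act = [relu(sum(w * f for w, f in zip(row, features)) + b)
                  for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden_act)) + b2

def route(candidates, params):
    """Return the adapter id with the highest router score (the argmax in Eq. 11)."""
    best = max(candidates,
               key=lambda c: router_score(c["sim"], c["entropy"], c["hidden"], *params))
    return best["adapter_id"]

# Hand-set toy weights: reward similarity, penalize entropy, ignore hidden state.
params = ([[1.0, -1.0, 0.0, 0.0]], [1.0], [1.0], 0.0)

candidates = [
    {"adapter_id": "doc_a", "sim": 0.8, "entropy": 1.2, "hidden": [0.1, 0.2]},
    {"adapter_id": "doc_b", "sim": 0.7, "entropy": 0.2, "hidden": [0.3, 0.1]},
]

best = route(candidates, params)
print(best)
```

In this toy case the candidate with slightly lower external similarity wins because its first-token entropy is much lower, which is the kind of disagreement between embedding similarity and generative confidence that internal routing is meant to resolve.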

## 4 Experiments

In this section, we empirically study the central question: *How should latent memories be stored, retrieved, and safely activated?*

### 4.1 Implementation Details

#### Base Model

We conduct experiments with Qwen2.5-0.5B (Yang et al., [2024](https://arxiv.org/html/2605.28889#bib.bib7)) and Qwen2.5-7B (Yang et al., [2024](https://arxiv.org/html/2605.28889#bib.bib7)) as base models. We use Qwen3-Embedding-0.6B (Zhang et al., [2025](https://arxiv.org/html/2605.28889#bib.bib22)) for retrieval.

#### Benchmarks

For context-specific queries, we evaluate on NarrativeQA (Kočiskỳ et al., [2018](https://arxiv.org/html/2605.28889#bib.bib4)) and SQuAD (Rajpurkar et al., [2016](https://arxiv.org/html/2605.28889#bib.bib3)). For context-agnostic queries, we use CommonsenseQA (Talmor et al., [2019](https://arxiv.org/html/2605.28889#bib.bib5)). Since SQuAD contains a large number of documents, we retain the 300 most frequently queried documents, ranked by the number of associated queries. Dataset statistics are given in Appendix [B](https://arxiv.org/html/2605.28889#A2).

#### Baselines

We compare with two cumulative-memory baselines, TempLora (Wang et al., [2024](https://arxiv.org/html/2605.28889#bib.bib2)) and InfiniteICL (Cao et al., [2025](https://arxiv.org/html/2605.28889#bib.bib1)). Since TempLora does not use QA distillation, we also include TempLoraCD, which replaces its original objective with a distillation loss for a fairer comparison.

More details can be found in Appendix [C](https://arxiv.org/html/2605.28889#A3).

![Refer to caption](https://arxiv.org/html/2605.28889v1/fig/cumulative_eval_settings_narrativeqa.png) (a) Results on NarrativeQA
![Refer to caption](https://arxiv.org/html/2605.28889v1/fig/cumulative_eval_settings_squad.png) (b) Results on SQuAD

Figure 6: Results of cumulative methods on Qwen2.5-0.5B under various evaluation settings.

Table 1: Main results on NarrativeQA and SQuAD. $\dagger$ denotes Oracle settings with access to the ground-truth context or adapter, serving as upper bounds for the corresponding methods.

### 4.2 How to Store

In this section, we evaluate the cumulative paradigm under different settings. The *Latest* setting always uses the latest adapter $\Delta\theta_n$, the *Shift* setting evaluates a query $q^i$ from document $c_i$ with the next adapter $\Delta\theta_{i+1}$, and the *Oracle* settings assume access to the correct context or adapter. More details can be found in Appendix [C.4](https://arxiv.org/html/2605.28889#A3.SS4).
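The three settings differ only in which adapter index is paired with a query, which the following sketch makes explicit. The function name `adapter_for_query` and the 1-based indexing are assumptions for illustration, the Oracle case is reduced here to "use adapter $i$", and the handling of the last document under *Shift* is left undefined because the text above does not specify it.

```python
def adapter_for_query(doc_index, num_docs, setting):
    """Return the 1-based adapter index used for a query from document doc_index."""
    if not 1 <= doc_index <= num_docs:
        raise ValueError("doc_index out of range")
    if setting == "latest":
        return num_docs          # always the most recently written adapter
    if setting == "shift":
        if doc_index == num_docs:
            # No next adapter exists; this case is not covered by the sketch.
            raise ValueError("shift is undefined for the last document")
        return doc_index + 1     # the adapter written right after document i
    if setting == "oracle":
        return doc_index         # the adapter matching the query's document
    raise ValueError(f"unknown setting: {setting}")

# Example with 5 cumulatively distilled documents and a query from document 2.
pairing = {s: adapter_for_query(2, 5, s) for s in ("latest", "shift", "oracle")}
print(pairing)
```

Under this view, *Shift* measures how much of document $i$ survives a single additional write, while *Latest* measures what survives all later writes.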

We discuss three key limitations of the cumulative paradigm. First, the cumulative paradigm suffers from severe forgetting in realistic settings. Cumulative methods degrade substantially in both the *Latest* and *Shift* settings. For example, TempLoraCD drops to 6.56 ROUGE-L on NarrativeQA and 5.62 EM on SQuAD under the *Shift* setting, despite much stronger performance in the *Oracle* setting. This indicates that, despite cumulative distillation, the latent memory is still highly sensitive to adapter selection and prone to retrieval errors.

Second, the cumulative paradigm limits the upper bound of context distillation. Even under the *Oracle* setting, cumulative training is affected by increasing cumulative context length and accumulated noise, as discussed in Section [2.3](https://arxiv.org/html/2605.28889#S2.SS3). As a result, TempLoraCD underperforms our method in Table [1](https://arxiv.org/html/2605.28889#S4.SS1.SSS0.Px3): on NarrativeQA, it achieves 26.52 ROUGE-L compared to 30.22 for ours; on SQuAD, it achieves 41.45 F1-Score compared to 50.05 for ours. This suggests that accumulating contexts weakens an adapter’s ability to preserve context-specific knowledge.

Last, the cumulative paradigm fails to manage memory usage. InfiniteICL shows unstable behavior: it is unclear whether predictions come from latent adapter memory or explicit context. For a query $q^i$ from document $c_i$, InfiniteICL uses $\Delta\theta_{i-1}$ as long-term memory and $c_i$ as short-term memory. On NarrativeQA, it relies more on short-term context, performing well in the *Oracle* setting but dropping from 39.73 to 8.38 ROUGE-L under *Shift*. On SQuAD, the trend is reversed: F1-Score increases from 2.73 to 19.46 under *Shift*, indicating greater reliance on the latent memory in the adapter. These results show that InfiniteICL lacks a stable mechanism to control when to use latent memory versus explicit context, and they motivate explicit adapter management rather than reliance on a single cumulatively updated memory.

Table 2: Results of hybrid queries from NarrativeQA and CommonsenseQA. Best overall results are shown in bold, and second-best results are underlined.

### 4.3 Which to Use

The results in Section [4.2](https://arxiv.org/html/2605.28889#S4.SS2) show that cumulative memories are difficult to use when the correct adapter is unknown. We therefore evaluate whether our proposed retrieval system can make latent memories practical at inference time.

As shown in Table [1](https://arxiv.org/html/2605.28889#S4.SS1.SSS0.Px3), retrieved adapters consistently outperform cumulative baselines under realistic settings. On NarrativeQA, our method achieves 24.54 ROUGE-1 with Qwen2.5-0.5B and 28.64 ROUGE-1 with Qwen2.5-7B, substantially higher than the *latest* variants of TempLora, InfiniteICL, and TempLoraCD. On SQuAD, our Retrieval@3 setting reaches 28.30 EM / 43.11 F1 with Qwen2.5-0.5B, and 36.34 EM / 46.57 F1 with Qwen2.5-7B. These results indicate that independently stored adapters can function as useful latent memories when paired with a retrieval system.

We further compare our method with the oracle setting\. Although the retrieved adapters do not always match oracle performance, they provide a deployable alternative when the query–adapter correspondence is unknown\. Moreover, compared with the cumulative method, whose performance drops significantly when moving from the oracle to the realistic setting, our retrieval system preserves a substantial fraction of oracle performance while removing the need for access to the oracle adapter\.

#### Efficiency

Compared with standard textual RAG, our goal is not to universally outperform explicit retrieval in task performance, but to build a retrieval system that manages latent memories and makes them effective in realistic settings at low cost. Figure [7](https://arxiv.org/html/2605.28889#S4.F7) compares inference efficiency under different retrieval settings. *ICL* denotes generation with concatenated retrieved contexts, while *naive* denotes LoRA-based generation without management. *In-Context Routing* routes using the first token generated with context, and *Ours w/o cache* routes with LoRA but without the shared prefix cache. More details are provided in Appendix [E](https://arxiv.org/html/2605.28889#A5). Explicit in-context methods process retrieved contexts as part of the input, causing FLOPs to grow rapidly with context length. In contrast, our method switches LoRA adapters with the shared prefix cache, avoiding repeated computation over retrieved contexts and keeping the cost close to naive generation. At 4000 tokens, our method reduces FLOPs by $414.7\times$ under Top-3 and $1081.4\times$ under Top-5 compared with explicit in-context inference. Cache reuse further brings a $2.8\times$ and $4.0\times$ reduction over the w/o-cache variant under Top-3 and Top-5, respectively.

![Refer to caption](https://arxiv.org/html/2605.28889v1/fig/inference/paper_nqa_inference_flops_line_rag3.png) (c) Top-3
![Refer to caption](https://arxiv.org/html/2605.28889v1/fig/inference/paper_nqa_inference_flops_line_rag5.png) (d) Top-5

Figure 7: Inference efficiency comparison for retrieval.

![Refer to caption](https://arxiv.org/html/2605.28889v1/fig/training/paper_flops_total_absolute.png) (a) FLOPs Comparison
![Refer to caption](https://arxiv.org/html/2605.28889v1/fig/training/paper_runtime_absolute.png) (b) Runtime Comparison
![Refer to caption](https://arxiv.org/html/2605.28889v1/fig/training/paper_memory_absolute.png) (c) Memory Comparison

Figure 8: Training efficiency comparison under varying context lengths.

### 4.4 Whether to Activate

The results in Section [4.3](https://arxiv.org/html/2605.28889#S4.SS3) show that retrieved adapters can serve as effective latent memories for context-specific queries. However, practical systems must also decide *whether* to activate a retrieved memory, since irrelevant adapters may interfere with the base model’s general knowledge on context-agnostic queries. We evaluate this issue in a hybrid setting that combines context-specific queries from NarrativeQA and context-agnostic queries from CommonsenseQA, where Self-Gating decides whether to activate the adapter. To isolate activation control, Table [2](https://arxiv.org/html/2605.28889#S4.T2) evaluates Self-Gating on the top-1 retrieved adapter. This tests whether the system can suppress an irrelevant latent memory, independent of the internal routing.

As shown in Table [2](https://arxiv.org/html/2605.28889#S4.T2), always activating an adapter substantially degrades CommonsenseQA performance. Cumulative methods using the *latest* adapter suffer large drops, while our retrieval-only variant reduces CommonsenseQA accuracy from 47.80% to 37.30% on Qwen2.5-0.5B, and from 85.18% to 72.40% on Qwen2.5-7B. These results indicate that retrieval alone is insufficient: retrieved latent memories must still be selectively activated. With gating, our method largely mitigates this interference. CommonsenseQA accuracy improves to 45.62% on Qwen2.5-0.5B and 79.44% on Qwen2.5-7B, while maintaining competitive NarrativeQA performance. These results show that our proposed Self-Gating mechanism enables the model to use latent memory only when necessary and fall back to the base model otherwise.

![Refer to caption](https://arxiv.org/html/2605.28889v1/fig/inference/paper_nqa_inference_flops_line_gating.png) Figure 9: Inference efficiency comparison for gating.

#### Efficiency

Figure [9](https://arxiv.org/html/2605.28889#S4.F9) compares the inference cost of different activation strategies. Following the previous setting, *In-Context Gating* denotes gating based on the first token generated with the retrieved context. Unlike in-context methods, whose cost grows with the context length, our method activates memories by selectively enabling LoRA adapters and reusing the shared prefix cache. At 4000 tokens, it reduces FLOPs by $210.0\times$ over in-context gating and $116.6\times$ over ICL, while introducing only about 1% overhead over naive generation.

### 4\.5 Training Efficiency

While CD offers an effective way to reduce inference cost, its training cost increases with the context length \(Charakorn et al\., [2026](https://arxiv.org/html/2605.28889#bib.bib14); Wang et al\., [2024](https://arxiv.org/html/2605.28889#bib.bib2); Cao et al\., [2025](https://arxiv.org/html/2605.28889#bib.bib1)\)\. We evaluate training efficiency by varying the context length from 500 to 4000 tokens\. As shown in Figure [8](https://arxiv.org/html/2605.28889#S4.F8), compared with the w/o cache variant that repeatedly recomputes the full context, teacher\-side cache reuse and student\-side prefix caching substantially reduce computation\. At the longest context length, the full caching pipeline achieves an $8.4\times$ speedup, a $4.1\times$ reduction in peak memory usage, and a $21.9\times$ reduction in FLOPs\. Moreover, whereas the baseline cost grows rapidly with context length, our cached pipeline keeps training time nearly constant, indicating that the cost of long\-context processing is effectively amortized\. Note that while off\-policy methods \(Eyuboglu et al\., [2025](https://arxiv.org/html/2605.28889#bib.bib13); Wang et al\., [2024](https://arxiv.org/html/2605.28889#bib.bib2); Cao et al\., [2025](https://arxiv.org/html/2605.28889#bib.bib1); Snell et al\., [2022](https://arxiv.org/html/2605.28889#bib.bib20)\) only need to generate labels for a single turn, on\-policy methods \(Ye et al\., [2026](https://arxiv.org/html/2605.28889#bib.bib9); Zhang et al\., [2026](https://arxiv.org/html/2605.28889#bib.bib16)\) roll out responses online and compute logits conditioned on the context, making teacher\-side cache reuse even more critical\.

### 4\.6 Generalization and Ablation

We further investigate the generality of the proposed method\. Table [3](https://arxiv.org/html/2605.28889#S4.T3) reports the performance of different distillation variants with and without caching, with detailed distillation settings provided in Section [D](https://arxiv.org/html/2605.28889#A4)\. Overall, the results suggest that the proposed cache sharing is not specific to any single distillation objective\. Across NarrativeQA and SQuAD, cache\-sharing distillation largely preserves the average performance of the corresponding non\-caching variants, and in some cases even yields improvements under certain objectives\. These findings suggest that cache\-sharing distillation can be combined with diverse distillation algorithms while maintaining effectiveness relative to standard non\-cached distillation\. We further conduct experiments on Llama3\.1\-8B \(Grattafiori et al\., [2024](https://arxiv.org/html/2605.28889#bib.bib8)\) in Section [G](https://arxiv.org/html/2605.28889#A7)\. We also discuss cache\-sharing distillation, internal routing, and the Self\-Gating threshold with an empirical ablation study in Section [F](https://arxiv.org/html/2605.28889#A6)\.

Table 3: Comparison of performance on NarrativeQA and SQuAD with Qwen2\.5\-0\.5B\.

## 5 Conclusion

We formulate context distillation as latent memory management\. Instead of accumulating all contexts into a single parameter state, we distill each context into an independent LoRA adapter and manage these adapters through retrieval and Self\-Gating\. This design enables modular storage, query\-aware memory selection, and safe fallback to the base model when latent memory is unnecessary\. Experiments show that our method outperforms cumulative distillation baselines in realistic non\-oracle settings and preserves general model capabilities on context\-agnostic queries, with a lightweight management system\.

## Limitations

Although context distillation with LoRA as latent memory can reduce inference cost, it still exhibits a performance gap compared with ICL under the oracle setting\. Future work may close this gap by developing more effective distillation strategies, such as on\-policy distillation methods \(Ye et al\., [2026](https://arxiv.org/html/2605.28889#bib.bib9); Zhang et al\., [2026](https://arxiv.org/html/2605.28889#bib.bib16)\)\.

Furthermore, storage overhead remains a limitation, as saving a separate adapter for each context can still be costly\. Future work may investigate more parameter\-efficient modules for context distillation, such as BitFit \(Zaken et al\., [2022](https://arxiv.org/html/2605.28889#bib.bib31)\), to further reduce the memory footprint\.

## References

- Prompt baking\. arXiv preprint arXiv:2409\.13697\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, et al\. \(2020\) Language models are few\-shot learners\. Advances in neural information processing systems 33, pp\. 1877–1901\.
- L\. Caccia, A\. Ansell, E\. Ponti, I\. Vulić, and A\. Sordoni \(2025\) Training plug\-n\-play knowledge modules with deep context distillation\. arXiv preprint arXiv:2503\.08727\.
- B\. Cao, D\. Cai, and W\. Lam \(2025\) Infiniteicl: breaking the limit of context window size via long short\-term memory transformation\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 11402–11415\.
- R\. Charakorn, E\. Cetin, Y\. Tang, and R\. T\. Lange \(2025\) Text\-to\-lora: instant transformer adaption\. arXiv preprint arXiv:2506\.06105\.
- R\. Charakorn, E\. Cetin, S\. Uesaka, and R\. T\. Lange \(2026\) Doc\-to\-lora: learning to instantly internalize contexts\. arXiv preprint arXiv:2602\.15902\.
- S\. Eyuboglu, R\. Ehrlich, S\. Arora, N\. Guha, D\. Zinsley, E\. Liu, W\. Tennien, A\. Rudra, J\. Zou, A\. Mirhoseini, et al\. \(2025\) Cartridges: lightweight and general\-purpose long context representations via self\-study\. arXiv preprint arXiv:2506\.06266\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\.
- J\. Hübotter, F\. Lübeck, L\. Behric, A\. Baumann, M\. Bagatella, D\. Marta, I\. Hakimi, I\. Shenfeld, T\. K\. Buening, C\. Guestrin, et al\. \(2026\) Reinforcement learning via self\-distillation\. arXiv preprint arXiv:2601\.20802\.
- J\. Kirkpatrick, R\. Pascanu, N\. Rabinowitz, J\. Veness, G\. Desjardins, A\. A\. Rusu, K\. Milan, J\. Quan, T\. Ramalho, A\. Grabska\-Barwinska, et al\. \(2017\) Overcoming catastrophic forgetting in neural networks\. Proceedings of the national academy of sciences 114 \(13\), pp\. 3521–3526\.
- T\. Kočiskỳ, J\. Schwarz, P\. Blunsom, C\. Dyer, K\. M\. Hermann, G\. Melis, and E\. Grefenstette \(2018\) The narrativeqa reading comprehension challenge\. Transactions of the Association for Computational Linguistics 6, pp\. 317–328\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, et al\. \(2020\) Retrieval\-augmented generation for knowledge\-intensive nlp tasks\. Advances in neural information processing systems 33, pp\. 9459–9474\.
- Z\. Li, Y\. Zhou, and Q\. Xu \(2026\) Latent context compilation: distilling long context into compact portable memory\. arXiv preprint arXiv:2602\.21221\.
- Z\. Li and D\. Hoiem \(2017\) Learning without forgetting\. IEEE transactions on pattern analysis and machine intelligence 40 \(12\), pp\. 2935–2947\.
- F\. Liu and H\. Qiu \(2025\) Context cascade compression: exploring the upper limits of text compression\. arXiv preprint arXiv:2511\.15244\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\) Lost in the middle: how language models use long contexts\. Transactions of the association for computational linguistics 12, pp\. 157–173\.
- P\. Rajpurkar, J\. Zhang, K\. Lopyrev, and P\. Liang \(2016\) Squad: 100,000\+ questions for machine comprehension of text\. In Proceedings of the 2016 conference on empirical methods in natural language processing, pp\. 2383–2392\.
- M\. Raman, P\. Mani, D\. Liang, and Z\. Lipton \(2023\) For distillation, tokens are not all you need\. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following\.
- I\. Shenfeld, M\. Damani, J\. Hübotter, and P\. Agrawal \(2026\) Self\-distillation enables continual learning\. arXiv preprint arXiv:2601\.19897\.
- C\. Snell, D\. Klein, and R\. Zhong \(2022\) Learning by distilling context\. arXiv preprint arXiv:2209\.15189\.
- A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\) Commonsenseqa: a question answering challenge targeting commonsense knowledge\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), pp\. 4149–4158\.
- Y\. Tay, M\. Dehghani, D\. Bahri, and D\. Metzler \(2022\) Efficient transformers: a survey\. ACM Computing Surveys 55 \(6\), pp\. 1–28\.
- Y\. Wang, D\. Ma, and D\. Cai \(2024\) With greater text comes greater necessity: inference\-time training helps long text generation\. arXiv preprint arXiv:2401\.11504\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, et al\. \(2024\) Qwen2\.5 technical report\. arXiv e\-prints, pp\. arXiv–2412\.
- T\. Ye, L\. Dong, X\. Wu, S\. Huang, and F\. Wei \(2026\) On\-policy context distillation for language models\. arXiv preprint arXiv:2602\.12275\.
- E\. B\. Zaken, Y\. Goldberg, and S\. Ravfogel \(2022\) Bitfit: simple parameter\-efficient fine\-tuning for transformer\-based masked language\-models\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\), pp\. 1–9\.
- X\. Zhang, Z\. Ding, T\. Pan, R\. Yang, C\. Kang, X\. Xiong, and J\. Gu \(2026\) Opsdl: on\-policy self\-distillation for long\-context language models\. arXiv preprint arXiv:2604\.17535\.
- Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, et al\. \(2025\) Qwen3 embedding: advancing text embedding and reranking through foundation models\. arXiv preprint arXiv:2506\.05176\.
- Z\. Zhao, E\. Wallace, S\. Feng, D\. Klein, and S\. Singh \(2021\) Calibrate before use: improving few\-shot performance of language models\. In International conference on machine learning, pp\. 12697–12706\.

## Appendix A Cumulative Context Distillation

Given a stream of contextual data $\mathcal{C}=\{c_1,c_2,\dots,c_t\}$, existing methods such as InfiniteICL \(Cao et al\., [2025](https://arxiv.org/html/2605.28889#bib.bib1)\) and TempLora \(Wang et al\., [2024](https://arxiv.org/html/2605.28889#bib.bib2)\) update the model’s latent knowledge following a cumulative distillation paradigm\. Specifically, at step $i$, the model parameters $\theta_{i-1}$ are updated to $\theta_i$ by minimizing the Kullback\-Leibler \(KL\) divergence between the updated model and the previous model conditioned on the new context:

$$\theta_i=\mathop{\arg\min}_{\theta} D_{KL}\big(\pi_{\theta_{i-1}}(\cdot|q,c_i)\parallel\pi_{\theta}(\cdot|q)\big) \tag{12}$$

where $q$ is the query\.

To understand the theoretical limit of this paradigm, let us consider an ideal scenario\. Assume that for all steps $i$, there exists an optimal set of parameters $\theta_i^{*}$ such that the unconditioned output distribution perfectly matches the context\-conditioned distribution from the previous step:

$$\pi_{\theta_i^{*}}(y|q)=\pi_{\theta_{i-1}^{*}}(y|q,c_i),\quad\forall q \tag{13}$$

with the initial state defined as:

$$\pi_{\theta_1^{*}}(y|q)=\pi_{\theta}(y|q,c_1),\quad\forall q \tag{14}$$

where $\theta$ denotes the parameters of the base model\. Note that prepending context $c_i$ to query $q$ can be conceptually viewed as forming an augmented query $q_i=[c_i,q]$\. As illustrated in Figure [10](https://arxiv.org/html/2605.28889#A1.F10), by recursively applying Eq\. [13](https://arxiv.org/html/2605.28889#A1.E13) and Eq\. [14](https://arxiv.org/html/2605.28889#A1.E14), we can derive the equivalent distribution at step $i$:

$$\pi_{\theta_i^{*}}(y|q)=\pi_{\theta}(y|q,[c_1,c_2,\dots,c_i]),\quad\forall q \tag{15}$$

Equation [15](https://arxiv.org/html/2605.28889#A1.E15) reveals a critical property: under this idealized recursive teacher assumption, the target distribution coincides with the base model conditioned on the full concatenated context\.
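For readers who want the intermediate steps, the recursion can be unrolled explicitly\. The chain below is a sketch that uses only Eq\. \(13\), Eq\. \(14\), and the augmented\-query view $q_i=[c_i,q]$; the key point is that Eq\. \(13\) is assumed to hold for every query, so it can be re\-applied to the augmented query at each step\.

```latex
\begin{aligned}
\pi_{\theta_i^{*}}(y \mid q)
  &= \pi_{\theta_{i-1}^{*}}(y \mid q, c_i)
     && \text{(Eq. 13 at step } i\text{)} \\
  &= \pi_{\theta_{i-1}^{*}}(y \mid q_i), \qquad q_i = [c_i, q]
     && \text{(augmented query)} \\
  &= \pi_{\theta_{i-2}^{*}}(y \mid q_i, c_{i-1})
     && \text{(Eq. 13 at step } i-1\text{, applied to } q_i\text{)} \\
  &= \pi_{\theta_{i-2}^{*}}(y \mid q, [c_{i-1}, c_i]) \\
  &\;\;\vdots \\
  &= \pi_{\theta_1^{*}}(y \mid q, [c_2, \dots, c_i]) \\
  &= \pi_{\theta}(y \mid q, [c_1, c_2, \dots, c_i])
     && \text{(Eq. 14)}
\end{aligned}
```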

![Refer to caption](https://arxiv.org/html/2605.28889v1/x5.png)

Figure 10: Theoretical upper bound of cumulative distillation paradigm\.

Consequently, this cumulative paradigm suffers from several inherent limitations:

- Inherited Long\-Context Degradation: Since the theoretical limit relies on the base model processing $[c_1,\dots,c_i]$, the distilled model may inherit the base model’s weaknesses regarding long context\.
- Noise Assimilation: In real\-world streaming data, not all contexts are beneficial for answering subsequent queries\. The standard cumulative approach indiscriminately distills all provided tokens, forcing the model to absorb noise and irrelevant information into its latent knowledge\.

## Appendix B Dataset

Table 4: Dataset statistics for NarrativeQA \(Kočiskỳ et al\., [2018](https://arxiv.org/html/2605.28889#bib.bib4)\) and SQuAD \(Rajpurkar et al\., [2016](https://arxiv.org/html/2605.28889#bib.bib3)\)\. Document length, queries per document, and query length are reported as mean $\pm$ standard deviation; all lengths are measured in tokens\.

### B\.1 Synthetic Query Generation

Prompt 1: Query Generation

Read the following story context and generate \{batch\_size\} distinct questions based on it\.

Requirements:

1\. The questions must be highly related to the context\.

2\. Output the questions as a numbered list, e\.g\., “1\. Question … 2\. Question …”\.

3\. Do not provide answers, options, or any other text\. Only the questions\.

Context: \{context\}

Questions:

Following previous works \(Cao et al\., [2025](https://arxiv.org/html/2605.28889#bib.bib1); Eyuboglu et al\., [2025](https://arxiv.org/html/2605.28889#bib.bib13)\), we construct synthetic queries from the NarrativeQA and SQuAD corpora\. For each document, we use the document summary as the source context and prompt an instruction\-tuned causal language model to generate natural\-language questions\. The prompt requires the model to produce a numbered list of distinct questions that are highly related to the given context, while explicitly excluding answers, answer options, or additional explanatory text\. The prompt template is shown in Prompt 1 \(Section [B\.1](https://arxiv.org/html/2605.28889#A2.SS1)\)\.

For each document, we repeatedly sample from the generator until collecting up to $k=1000$ valid questions\. We use stochastic decoding with temperature sampling to encourage diversity\. Since the raw model outputs may contain formatting artifacts, we parse the generations with a regular expression that extracts interrogative sentences from numbered or bulleted lists\. We then normalize each candidate by removing assistant\-style prefixes and truncating the text at the first question mark\.
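The parsing and normalization steps described above can be sketched as follows\. This is a minimal illustration, not the regular expression used in the paper: the two patterns, the function name `parse_questions`, and the case\-insensitive deduplication are all assumptions\.

```python
import re

# Hypothetical patterns; the paper does not give its exact regular expression.
LIST_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+(.*\S)\s*$")
ASSISTANT_PREFIX = re.compile(r"^(?:assistant|question|q)\s*:\s*", re.IGNORECASE)


def parse_questions(raw: str) -> list[str]:
    """Extract questions from a numbered or bulleted model generation."""
    questions, seen = [], set()
    for line in raw.splitlines():
        match = LIST_ITEM.match(line)
        if not match:
            continue  # skip headers, blank lines, and free text
        text = ASSISTANT_PREFIX.sub("", match.group(1))
        if "?" not in text:
            continue  # keep interrogative sentences only
        text = text[: text.index("?") + 1].strip()  # truncate at the first "?"
        key = text.lower()
        if key not in seen:  # assumed: drop exact duplicates, ignoring case
            seen.add(key)
            questions.append(text)
    return questions


raw_output = (
    "Sure, here are the questions:\n"
    "1. Who is the narrator? Extra text.\n"
    "2) Question: Where does the story take place?\n"
    "- Not a question.\n"
    "3. who is the narrator?"
)
print(parse_questions(raw_output))
```

In a sampling loop, the parsed questions would be accumulated across generations until the target of 1000 valid questions per document is reached\.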

### B\.2 Dataset Statistics

We report the statistics of the datasets used in our paper in Table [4](https://arxiv.org/html/2605.28889#A2.T4)\. We use the documents from the test set of NarrativeQA \(Kočiskỳ et al\., [2018](https://arxiv.org/html/2605.28889#bib.bib4)\) and the validation set of SQuAD \(Rajpurkar et al\., [2016](https://arxiv.org/html/2605.28889#bib.bib3)\)\. For the train split, we generate queries with the process in Section [B\.1](https://arxiv.org/html/2605.28889#A2.SS1) and generate the corresponding responses with the base model during training\. For the test split, we use the original queries and answers from the NarrativeQA test set and the SQuAD validation set\.

## Appendix C Experiment Details

We evaluate in a document\-level adaptation setting\. At distillation time, the system has access to the target document but not to the benchmark evaluation questions or gold answers\. We generate synthetic training queries from the document and teacher responses from the base model, then evaluate on the original benchmark questions\. Thus, our setting measures whether a document can be compiled into a reusable latent memory for future queries about that document\.

Table 5: Training hyperparameters for LoRA distillation\.

### C\.1 Hyperparameter

For all context distillation methods in our paper, we use the same training hyperparameters, as shown in Table 5 \(Appendix [C](https://arxiv.org/html/2605.28889#A3)\)\.

### C\.2 Environment

All experiments were conducted on a single GPU\. For Qwen2\.5\-0\.5B, we used an NVIDIA L40 GPU, while experiments with Qwen2\.5\-7B and Llama\-3\.1\-8B were run on an NVIDIA H100 GPU\. Each training run was completed within approximately 1–2 days\.

### C\.3 Metrics

We evaluate SQuAD with Exact Match \(EM\) and token\-level F1, following the standard SQuAD evaluation protocol\. Both predictions and reference answers are normalized prior to scoring by lowercasing, removing punctuation and English articles, and fixing whitespace\. For instances with multiple reference answers, we compute the score against each reference and report the maximum score\.
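The scoring procedure described above can be sketched in a few lines of Python\. This follows the commonly used SQuAD normalization order \(lowercase, strip punctuation, strip articles, collapse whitespace\); the function names are illustrative and the snippet is not taken from the paper's evaluation code\.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and English articles, fix whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    pred = normalize_answer(prediction).split()
    ref = normalize_answer(reference).split()
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def max_over_references(metric, prediction: str, references: list[str]) -> float:
    """Score against every reference answer and keep the best score."""
    return max(metric(prediction, ref) for ref in references)


print(max_over_references(token_f1, "in Paris France", ["Paris", "Paris France"]))
```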

For NarrativeQA, we report ROUGE\-1 and ROUGE\-L, computed using the Hugging Face `evaluate` implementation of ROUGE with its default configuration\. Model predictions are compared against the provided gold reference answers\.

### C\.4 Baseline Evaluation

#### LoRA\-only\.

In the LoRA\-only setting, all document\-specific information is stored in the LoRA adapter\. Given a query $q_i$ associated with document $c_i$, whose corresponding latent memory is encoded by the adapter $\Delta\theta_i$, we evaluate the following settings:

- Latest: uses the most recent adapter $\Delta\theta_n$ to answer the query, i\.e\., generates the response as $f_{\theta+\Delta\theta_n}(q_i)$\.
- Oracle: assumes access to the correct adapter $\Delta\theta_i$ associated with document $c_i$, i\.e\., generates the response as $f_{\theta+\Delta\theta_i}(q_i)$\.
- Shift: evaluates the query $q_i$ using the next adapter $\Delta\theta_{i+1}$, i\.e\., generates the response as $f_{\theta+\Delta\theta_{i+1}}(q_i)$\. This setting is designed to probe whether information from $c_i$ is forgotten or overwritten after updating to the subsequent adapter\.

#### LoRA\+Context\.

InfiniteICL \(Cao et al\., [2025](https://arxiv.org/html/2605.28889#bib.bib1)\) treats the LoRA adapter as long\-term memory and the input context as short\-term memory\. Given a query $q_i$ associated with document $c_i$, with the corresponding latent memory encoded by $\Delta\theta_i$, we evaluate:

- Latest: uses the latest available context $c_n$ together with the preceding adapter $\Delta\theta_{n-1}$, i\.e\., generates the response as $f_{\theta+\Delta\theta_{n-1}}(q_i,c_n)$\.
- Oracle: assumes access to the correct context $c_i$ and the adapter immediately preceding it, $\Delta\theta_{i-1}$, i\.e\., generates the response as $f_{\theta+\Delta\theta_{i-1}}(q_i,c_i)$\.
- Shift: evaluates the query $q_i$ with the subsequent context $c_{i+1}$ and the adapter $\Delta\theta_i$, i\.e\., generates the response as $f_{\theta+\Delta\theta_i}(q_i,c_{i+1})$\.
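The index bookkeeping for the LoRA\-only and LoRA\+Context settings can be summarized in a small lookup\. This is an illustrative sketch only: the function name `select_indices`, the mode labels, and the convention that adapter index 0 means "no adapter" \(the base model\) are assumptions and are not specified in the paper\.

```python
def select_indices(setting: str, mode: str, i: int, n: int):
    """Return (adapter_index, context_index) for query q_i over n documents.

    adapter_index 0 stands for "no adapter" and context_index None for
    "no context in the prompt"; both conventions are assumptions of this sketch.
    """
    if mode == "lora_only":
        table = {"latest": (n, None), "oracle": (i, None), "shift": (i + 1, None)}
    elif mode == "lora_context":
        table = {"latest": (n - 1, n), "oracle": (i - 1, i), "shift": (i, i + 1)}
    else:
        raise ValueError(f"unknown mode: {mode}")
    adapter, context = table[setting]
    if adapter > n or (context is not None and context > n):
        raise ValueError("shift is undefined for the last document (i = n)")
    return adapter, context


print(select_indices("shift", "lora_only", i=2, n=5))
print(select_indices("oracle", "lora_context", i=2, n=5))
```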

### C\.5 Router Training Details

We train a lightweight router to select one adapter from the retrieved top\-$k$ candidates for each query\. Given a query $q$, retrieval returns $\mathcal{C}_k(q)=\{c_1,\ldots,c_k\}$ with similarity scores $s(q,c_i)$\. A separate router is trained for each dataset, routing mode, and value of $k$\.

For each candidate $c_i$, we use the feature vector

$$\mathbf{x}_{q,i}=\left[s(q,c_i),\;s(q,c_1)-s(q,c_i),\;H_{q,i}^{(1)},\;\mathbf{h}_{q,i}^{(1)}\right], \tag{16}$$

where $s(q,c_i)$ is the retrieval similarity, $s(q,c_1)-s(q,c_i)$ is the gap to the top\-ranked candidate, $H_{q,i}^{(1)}$ is the entropy of the first\-token predictive distribution, and $\mathbf{h}_{q,i}^{(1)}$ denotes the first 128 dimensions of the first\-token hidden state\.

We use candidate\-level binary supervision:

$$y_{q,i}=\mathbbm{1}[c_i=c^{\star}(q)], \tag{17}$$

where $c^{\star}(q)$ is the gold document/adapter id\. If $c^{\star}(q)\notin\mathcal{C}_k(q)$, all candidates for $q$ are labeled as negative\. The router is therefore trained as a binary scorer\. All candidate features from the training split are flattened and standardized using training\-set statistics\. The router is a two\-layer MLP that outputs one scalar logit per candidate:

$$r_{\theta}(\mathbf{x})=\mathrm{MLP}_{\theta}(\tilde{\mathbf{x}}), \tag{18}$$

where $\tilde{\mathbf{x}}$ denotes the standardized feature vector\.
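To make the per\-candidate inputs and labels concrete, the following is a minimal Python sketch of the feature vector and the binary supervision\. It is not the authors' implementation: the function names are hypothetical, plain lists stand in for tensors, and the reading that the entropy and hidden state are computed separately under each candidate adapter is an assumption based on the $q,i$ subscripts\.

```python
import math


def first_token_entropy(probs):
    """Shannon entropy (in nats) of a first-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def candidate_features(scores, first_token_probs, hidden_states, dim=128):
    """Build one feature vector per retrieved candidate of a single query.

    scores[i]            retrieval similarity s(q, c_i), sorted best-first
    first_token_probs[i] first-token distribution under candidate adapter i (assumed)
    hidden_states[i]     first-token hidden state under candidate adapter i (assumed)
    """
    top = scores[0]
    return [
        [s, top - s, first_token_entropy(p), *h[:dim]]
        for s, p, h in zip(scores, first_token_probs, hidden_states)
    ]


def candidate_labels(candidate_ids, gold_id):
    """1 for the gold adapter, 0 otherwise; all zeros if it was not retrieved."""
    return [int(c == gold_id) for c in candidate_ids]


feats = candidate_features(
    scores=[0.9, 0.7],
    first_token_probs=[[0.5, 0.5], [1.0, 0.0]],
    hidden_states=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    dim=2,  # toy value; the text uses the first 128 dimensions
)
print(feats)
print(candidate_labels(["doc3", "doc7"], gold_id="doc7"))
```

Before scoring, each feature column would additionally be standardized with training\-set statistics, as described in the text\.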
We train the router with weighted binary cross\-entropy loss, using

$$w_{+}=\frac{N_{\mathrm{neg}}}{\max(N_{\mathrm{pos}},1)} \tag{19}$$

to compensate for class imbalance, and optimize with AdamW\. Unless otherwise specified, we use 100 epochs, learning rate $5\times 10^{-2}$, weight decay $10^{-3}$, and full\-batch training\.
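As an illustration of the class\-imbalance weighting, the sketch below computes $w_{+}$ and a weighted binary cross\-entropy over candidate logits in plain Python\. The function names are hypothetical, and applying $w_{+}$ only to the positive term with a mean over candidates is one common convention that is assumed here rather than stated in the paper\.

```python
import math


def positive_weight(labels):
    """w_+ = N_neg / max(N_pos, 1)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / max(n_pos, 1)


def weighted_bce(logits, labels, w_pos):
    """Mean binary cross-entropy with the positive term scaled by w_pos."""
    total = 0.0
    for z, y in zip(logits, labels):
        # log(sigmoid(z)) computed in a numerically stable form
        log_p = -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))
        log_not_p = log_p - z  # log(1 - sigmoid(z))
        total += -(w_pos * y * log_p + (1 - y) * log_not_p)
    return total / len(labels)


labels = [1, 0, 0, 0]  # one gold adapter among four flattened candidates
w = positive_weight(labels)
print(w, weighted_bce([2.0, -1.0, 0.5, -3.0], labels, w))
```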

At inference time, the router scores candidates independently and selects

$$\hat{c}(q)=\arg\max_{c_i\in\mathcal{C}_k(q)} r_{\theta}(\mathbf{x}_{q,i}), \tag{20}$$

which is then used as the adapter id for downstream LoRA generation\.

The router is trained only on the synthetic QA examples from Section [B\.1](https://arxiv.org/html/2605.28889#A2.SS1)\. For each dataset, we first construct a training split from generated questions associated with the adapter pool, and use the corresponding document id as the positive adapter label\. In contrast, evaluation is performed on the original real QA split: SQuAD uses real questions from the SQuAD validation split, and NarrativeQA uses real questions from the NarrativeQA test split\.

### C\.6 Gating Threshold Selection

We use the entropy of the first generated token as the gating signal\. Given a query–adapter pair, the LoRA\-adapted model produces the next\-token logits at the first decoding step\. We convert the logits into a probability distribution

$$p_0(v)=\mathrm{softmax}(z_0)_v,$$

and compute the first\-token entropy

$$\mathcal{H}_0=-\sum_{v\in\mathcal{V}}p_0(v)\log p_0(v).$$

A lower value of $\mathcal{H}_0$ indicates that the LoRA adapter is more confident for the current query, while a higher value suggests that the query is out of distribution for the selected adapter\. Therefore, we accept the LoRA adapter only when

$$\mathcal{H}_0<\lambda,$$

and otherwise fall back to the base model\.
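The gating rule above can be sketched in a few lines of Python\. This is a minimal illustration on toy logits, not the authors' implementation: the function names are hypothetical and the threshold value used in the example is arbitrary\.

```python
import math


def first_token_entropy(logits):
    """Entropy (in nats) of softmax(logits) at the first decoding step."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift by the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def use_adapter(lora_first_token_logits, threshold):
    """Activate the adapter only if the first-token entropy is below the threshold."""
    return first_token_entropy(lora_first_token_logits) < threshold


confident = [10.0, 0.0, 0.0, 0.0]  # peaked distribution: low entropy
uncertain = [0.0, 0.0, 0.0, 0.0]   # uniform distribution: entropy log(4)
print(use_adapter(confident, threshold=1.0))  # adapter is kept
print(use_adapter(uncertain, threshold=1.0))  # falls back to the base model
```

Because only the first decoding step of the LoRA\-adapted model is needed, the decision adds a single forward step before generation continues with either the adapter or the base model\.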

The threshold $\lambda$ is selected on a held\-out threshold\-selection split, not on the final test split\. Specifically, we use the synthetic queries from Section [B\.1](https://arxiv.org/html/2605.28889#A2.SS1) for NarrativeQA and the queries in the CommonsenseQA training split\. In contrast, evaluation is performed on real questions from the NarrativeQA test split and the CommonsenseQA validation split\. We further analyze the sensitivity to $\lambda$ in Section [F\.2](https://arxiv.org/html/2605.28889#A6.SS2)\.

## Appendix D Integration with Different Distillation Objectives

While the preceding experiments primarily adopt standard context distillation with forward KL divergence, our training pipeline is compatible with a broader class of distillation objectives\. We therefore integrate the proposed prefix caching mechanism with several representative distillation algorithms and examine whether caching affects downstream task performance\. Specifically, we consider four commonly used distillation objectives\.

#### Forward KL\.

Forward KL is widely adopted in prior context distillation methods \(Cao et al\., [2025](https://arxiv.org/html/2605.28889#bib.bib1); Eyuboglu et al\., [2025](https://arxiv.org/html/2605.28889#bib.bib13); Li et al\., [2026](https://arxiv.org/html/2605.28889#bib.bib19)\)\. It encourages the student distribution without the context to match the teacher distribution conditioned on the full context:

$$D_{\mathrm{KL}}\!\left(\pi_{\theta}(\cdot\mid q,c)\,\middle\|\,\pi_{\theta+\Delta\theta}(\cdot\mid q)\right). \tag{21}$$

#### Reverse KL\.

Reverse KL has recently been used in on\-policy distillation methods \(Shenfeld et al\., [2026](https://arxiv.org/html/2605.28889#bib.bib10); Hübotter et al\., [2026](https://arxiv.org/html/2605.28889#bib.bib11); Zhang et al\., [2026](https://arxiv.org/html/2605.28889#bib.bib16)\), where distillation is formulated from a reinforcement\-learning perspective\. The objective is given by

$$D_{\mathrm{KL}}\!\left(\pi_{\theta+\Delta\theta}(\cdot\mid q)\,\middle\|\,\pi_{\theta}(\cdot\mid q,c)\right). \tag{22}$$
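The difference between the forward and reverse objectives is only the direction of the divergence, which a toy example over a three\-token vocabulary makes visible\. The distributions below are made\-up numbers for illustration, and `kl` is a hypothetical helper rather than part of any training code described in the paper\.

```python
import math


def kl(p, q):
    """D_KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


teacher = [0.70, 0.20, 0.10]  # toy teacher distribution, conditioned on (q, c)
student = [0.40, 0.40, 0.20]  # toy student distribution, conditioned on q only

forward_kl = kl(teacher, student)  # forward KL: expectation under the teacher
reverse_kl = kl(student, teacher)  # reverse KL: expectation under the student

print(f"forward KL = {forward_kl:.4f}, reverse KL = {reverse_kl:.4f}")
```

Both quantities are zero only when the two distributions coincide, but they generally differ otherwise because the expectation is taken under a different distribution in each case\.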

#### Top\-$k$ Logits\.

This objective applies KL divergence only over the top\-$k$ tokens with the highest probabilities \(Raman et al\., [2023](https://arxiv.org/html/2605.28889#bib.bib21); Ye et al\., [2026](https://arxiv.org/html/2605.28889#bib.bib9)\)\. Following OPCD \(Ye et al\., [2026](https://arxiv.org/html/2605.28889#bib.bib9)\), we use

$$D_{\mathrm{KL}}^{\mathrm{top}\text{-}k}\!\left(\pi_{\theta+\Delta\theta}(\cdot\mid q)\,\middle\|\,\pi_{\theta}(\cdot\mid q,c)\right). \tag{23}$$

#### EMA Teacher\.

In this setting, the teacher is not fixed during training\. Instead, its parameters are updated as an exponential moving average of the student parameters:

$$\bar{\theta}_t=\alpha\bar{\theta}_{t-1}+(1-\alpha)\theta_t, \tag{24}$$

where $\alpha\in[0,1)$ is the EMA decay rate\. The distillation objective is then computed using the EMA teacher distribution:

$$D_{\mathrm{KL}}\!\left(\pi_{\bar{\theta}_t}(\cdot\mid q,c)\,\middle\|\,\pi_{\theta_t}(\cdot\mid q)\right). \tag{25}$$
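The EMA teacher update can be written as a short parameter\-wise loop\. The sketch below uses a dictionary of scalar parameters as a stand\-in for model tensors; the function name `ema_update` and the parameter names are illustrative assumptions\.

```python
def ema_update(teacher_params, student_params, alpha):
    """Return alpha * teacher + (1 - alpha) * student for every parameter."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    return {
        name: alpha * teacher_params[name] + (1.0 - alpha) * student_params[name]
        for name in teacher_params
    }


teacher = {"lora_A": 1.0, "lora_B": -2.0}  # toy scalar stand-ins for tensors
student = {"lora_A": 0.0, "lora_B": 2.0}

for step in range(3):
    teacher = ema_update(teacher, student, alpha=0.9)
    print(step, teacher)
```

A decay rate close to 1 makes the teacher change slowly, while `alpha=0` reduces the teacher to a copy of the current student at every step\.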

## Appendix E Efficiency

### E\.1 Training Efficiency

We evaluate training efficiency on a fixed NarrativeQA\-sampled setting with $N=1000$ question\-answer pairs\. The benchmark uses Qwen2\.5\-0\.5B with LoRA fine\-tuning, bfloat16 precision, batch size $16$, and $5$ warm\-up steps\. We evaluate context lengths $C\in\{500,1000,2000,4000\}$ and normalize each answer to $32$ tokens\.

Each measured step performs a full optimizer update using a temperature\-scaled KL loss between teacher and student logits on the answer tokens\. Runtime includes both one\-time KV precomputation and the measured training loop:

$$t_{\mathrm{total}}=t_{\mathrm{precompute}}+t_{\mathrm{train}}.$$

Warm\-up steps are excluded\. Peak memory is measured with CUDA peak memory statistics\. FLOPs are estimated with the DeepSpeed FlopsProfiler by profiling one batch and scaling the result to 1000 queries\.

### E.2 Inference Efficiency

We evaluate inference computation on NarrativeQA with Qwen2.5-0.5B under fixed-length generation. Each run uses $100$ sampled queries and generates $16$ tokens per query. We report total FLOPs, measured with DeepSpeed FlopsProfiler. The reported total is

$$\mathrm{FLOPs}_{\mathrm{total}}=\mathrm{FLOPs}_{\mathrm{routing}}+\mathrm{FLOPs}_{\mathrm{generation}},$$

where cached variants decompose routing/generation into prefix-prefill and cache-reuse terms.

## Appendix F Ablation Study

Table 6: Performance comparison under different retrieval settings on NarrativeQA. Best results within each model are highlighted in bold.

Table 7: Performance comparison under different retrieval settings on SQuAD. Best results within each model are highlighted in bold.

### F.1 Ablation on Internal Routing

We ablate the effect of internal routing by varying the number of retrieved adapters available to the model. In this setting, w/o internal routing serves as the baseline, where the model directly uses the single top-ranked retrieved adapter without further selection. In contrast, Retrieval@3 and Retrieval@5 enable internal routing, allowing the model to select an adapter from multiple retrieved candidates. We report both downstream task performance and routing accuracy, where the latter measures whether the routing module selects the appropriate adapter from the retrieved candidate set.

As shown in Tables [6](https://arxiv.org/html/2605.28889#A6.T6) and [7](https://arxiv.org/html/2605.28889#A6.T7), enabling internal routing consistently improves performance across both model scales and datasets. For Qwen2.5-0.5B on NarrativeQA, Retrieval@3 improves routing accuracy by 2.20, ROUGE-1 by 1.07, and ROUGE-L by 1.03 over the baseline without internal routing. Retrieval@5 further improves routing accuracy by 2.40 and achieves the best accuracy for this model, although its ROUGE scores are slightly lower than those of Retrieval@3. On SQuAD, the same model also benefits from internal routing: Retrieval@3 improves routing accuracy by 3.20, EM by 1.57, and F1 by 1.85. Increasing the retrieval size to 5 yields the largest F1 gain of 1.99, but leads to slightly smaller improvements in routing accuracy and EM compared with Retrieval@3.

Similar trends are observed for Qwen2.5-7B. On NarrativeQA, Retrieval@3 improves routing accuracy by 2.62, ROUGE-1 by 1.28, and ROUGE-L by 1.27. Retrieval@5 achieves the best overall performance, with gains of 2.97 in routing accuracy, 1.46 in ROUGE-1, and 1.45 in ROUGE-L. On SQuAD, Retrieval@3 obtains the best results, improving routing accuracy by 3.10, EM by 1.49, and F1 by 1.63. In contrast, Retrieval@5 brings smaller gains of 2.19 in routing accuracy, 1.45 in EM, and 1.51 in F1, suggesting that retrieving more adapters does not always lead to better routing decisions.

Overall, these results indicate that internal routing provides a clear benefit over relying on a single retrieved adapter. By routing among multiple retrieved adapters, the model can exploit complementary task-specific capabilities and produce more accurate predictions. The routing accuracy results further support this conclusion: enabling internal routing consistently improves adapter selection accuracy over the baseline, and improvements in routing accuracy generally align with gains in downstream performance. However, the comparison between Retrieval@3 and Retrieval@5 also shows that a larger candidate set may introduce additional ambiguity, especially on SQuAD, where Retrieval@3 already provides sufficient candidates for effective routing. This suggests that internal routing benefits from a balanced retrieval size that provides enough diversity while avoiding excessive noisy candidates.

### F.2 Ablation on Self-Gating Threshold

We analyze how the first-token entropy threshold affects the activation behavior of Self-Gating in the hybrid setting. For each evaluation sample, we first retrieve the top-1 LoRA adapter using the same dense retriever as in the main hybrid evaluation. Given the retrieved adapter, we compute the first-token predictive entropy $H_{i}$ under the LoRA-adapted model, using the same shared-prefix KV-cache computation as in the inference pipeline. Following the Self-Gating mechanism, the system activates the retrieved LoRA adapter when its first-token entropy is below a threshold $\lambda$, and otherwise falls back to the base model:

$$g_{i}(\lambda)=\mathbf{1}\{H_{i}<\lambda\}, \tag{26}$$

where $g_{i}(\lambda)=1$ denotes continuing generation with the retrieved LoRA adapter.
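The gating rule in Eq. (26) can be sketched as below. This is a minimal illustration rather than the paper's implementation: it assumes the entropy is the Shannon entropy in nats of the softmax over first-token logits (the log base is not stated in the text), and the function names `first_token_entropy` and `self_gate` are hypothetical.

```python
import math


def first_token_entropy(logits):
    """Shannon entropy (in nats, an assumption) of softmax(logits)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def self_gate(entropy, lam):
    """Eq. (26): return 1 to keep the retrieved LoRA adapter, 0 to fall back."""
    return 1 if entropy < lam else 0
```

Under this sketch, a sharply peaked first-token distribution yields low entropy and keeps the adapter active, whereas a near-uniform distribution yields high entropy and triggers the fallback to the base model.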

![Refer to caption](https://arxiv.org/html/2605.28889v1/fig/hybrid_first_token_entropy_threshold_acc.png)

Figure 11: Effect of the first-token entropy threshold on Self-Gating. While the activation and fallback accuracies vary monotonically with the threshold, the balanced accuracy remains stable across a broad operating region, indicating that Self-Gating is insensitive to precise threshold tuning.

In the hybrid evaluation, NarrativeQA queries are context-specific and are expected to benefit from the retrieved latent memory. Therefore, activating the retrieved LoRA adapter is treated as the correct gating decision for NarrativeQA. In contrast, CommonSenseQA queries are context-agnostic and should be answered using the base model without activating an irrelevant latent memory. Thus, falling back to the base model is treated as the correct gating decision for CommonSenseQA. The NarrativeQA gating accuracy is defined as

$$\mathrm{Acc}_{\mathrm{NQA}}(\lambda)=\frac{1}{|\mathcal{D}_{\mathrm{NQA}}|}\sum_{i\in\mathcal{D}_{\mathrm{NQA}}}\mathbf{1}\{H_{i}<\lambda\}. \tag{27}$$
The CommonSenseQA gating accuracy is defined as

$$\mathrm{Acc}_{\mathrm{CSQA}}(\lambda)=\frac{1}{|\mathcal{D}_{\mathrm{CSQA}}|}\sum_{i\in\mathcal{D}_{\mathrm{CSQA}}}\mathbf{1}\{H_{i}\geq\lambda\}. \tag{28}$$
We further report the balanced gating accuracy:

$$\mathrm{Acc}_{\mathrm{Balanced}}(\lambda)=\frac{1}{2}\left(\mathrm{Acc}_{\mathrm{NQA}}(\lambda)+\mathrm{Acc}_{\mathrm{CSQA}}(\lambda)\right). \tag{29}$$
We sweep the entropy threshold $\lambda$ from $3.5$ to $4.5$ with a step size of $0.05$, covering the operating region around $\lambda=4.0$. Evaluation is conducted on the NarrativeQA test split and the CommonSenseQA validation split. Figure [11](https://arxiv.org/html/2605.28889#A6.F11) plots the NarrativeQA activation accuracy, CommonSenseQA fallback accuracy, and balanced accuracy as functions of the entropy threshold.
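The metrics in Eqs. (27)-(29) and the threshold sweep described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the inputs are assumed to be plain lists of per-sample first-token entropies for the two datasets.

```python
def gating_accuracies(nqa_entropies, csqa_entropies, lam):
    """Return (Acc_NQA, Acc_CSQA, Acc_Balanced) for one threshold, per Eqs. (27)-(29)."""
    # NarrativeQA: activating the adapter (H < lambda) counts as correct.
    acc_nqa = sum(h < lam for h in nqa_entropies) / len(nqa_entropies)
    # CommonSenseQA: falling back to the base model (H >= lambda) counts as correct.
    acc_csqa = sum(h >= lam for h in csqa_entropies) / len(csqa_entropies)
    return acc_nqa, acc_csqa, 0.5 * (acc_nqa + acc_csqa)


def sweep_thresholds(nqa_entropies, csqa_entropies, start=3.5, stop=4.5, step=0.05):
    """Evaluate the gating accuracies on a grid of thresholds from start to stop."""
    num_steps = round((stop - start) / step)
    rows = []
    for i in range(num_steps + 1):
        lam = round(start + i * step, 2)  # rounding avoids floating-point drift
        rows.append((lam,) + gating_accuracies(nqa_entropies, csqa_entropies, lam))
    return rows
```

Each returned row has the form `(lambda, acc_nqa, acc_csqa, acc_balanced)`, so the best threshold in a sweep can be read off with `max(rows, key=lambda r: r[3])`.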

The main observation is that Self-Gating is robust to the precise choice of the entropy threshold. Although changing $\lambda$ shifts the relative preference between activating the retrieved LoRA adapter and falling back to the base model, the balanced accuracy remains nearly flat around the operating region. In particular, for thresholds from $\lambda=3.75$ to $\lambda=4.25$, the balanced accuracy stays within a narrow range from $77.65\%$ to $79.00\%$. The best value in this sweep is $79.00\%$ at $\lambda=4.05$, while neighboring thresholds achieve comparable performance.

This insensitivity indicates that Self-Gating does not rely on finely tuned thresholds to distinguish when latent memory should be activated. Instead, the first-token entropy of the LoRA-adapted model provides a stable confidence signal: across a broad interval of $\lambda$, the system maintains a similar balance between using context-specific latent memory for NarrativeQA and abstaining from irrelevant memory activation for CommonSenseQA. These results support the practical robustness of the proposed gating mechanism in non-oracle hybrid settings.

### F.3 Ablation on Cache-Sharing Distillation

Table 8: Ablation study on Qwen2.5-0.5B evaluated on NarrativeQA and SQuAD.

Previous work (Wang et al., [2024](https://arxiv.org/html/2605.28889#bib.bib2)) discusses a similar cache-sharing mechanism for long-text generation, which directly reuses the KV-cache after LoRA updates instead of recomputing it every time. We therefore ablate whether cache-sharing distillation is important for the inference pipeline in our paper, in which we first compute the KV-cache with the base model and then reuse it with the LoRA model for generation.

As shown in Table [8](https://arxiv.org/html/2605.28889#A6.T8), removing cache-sharing distillation consistently degrades performance on both datasets. On NarrativeQA, ROUGE-1 and ROUGE-L drop by 4.24 and 4.22, respectively. On SQuAD, EM and F1 decrease by 6.36 and 6.58. These results indicate that directly reusing KV-caches computed by the base model for the LoRA-adapted model introduces a notable mismatch. This degradation is likely caused by the discrepancy between the base-model cached states and the updated attention computation after applying LoRA. Cache-sharing distillation explicitly exposes the model to this inference setting during training, encouraging the LoRA model to remain compatible with base-model KV-caches.

## Appendix G Experiments Across Models

![Refer to caption](https://arxiv.org/html/2605.28889v1/fig/ours_narrativeqa_scores.png)\(a\)NarrativeQA
![Refer to caption](https://arxiv.org/html/2605.28889v1/fig/ours_squad_scores.png)\(b\)SQuAD

Figure 12: Results across different models.

To further examine the robustness and scalability of our method, we evaluate it with different backbone models, including Qwen2.5-0.5B, Qwen2.5-7B, and Llama3.1-8B. The results are shown in Figure [12](https://arxiv.org/html/2605.28889#A7.F12). Overall, our method consistently achieves strong performance across all evaluated backbones on both NarrativeQA and SQuAD, indicating that the proposed approach is not tied to a specific model family or parameter scale.

These results suggest that our method generalizes well across model scales and architectures. In particular, the consistent gains from smaller to larger backbones indicate that the proposed design can leverage improved model capacity while remaining effective when applied to relatively compact models.
