EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering
Summary
Proposes EASE-TTT, a test-time training framework that aligns adaptation with retrieved evidence to improve long-context QA performance in smaller language models.
# EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering
Source: [https://arxiv.org/html/2606.06906](https://arxiv.org/html/2606.06906)
Xiaopeng Yuan¹, Zebin Wang², Suwen Wang³, Zongxin Yang², Haohan Wang¹, Yushun Dong⁴

¹University of Illinois Urbana-Champaign, ²Harvard University, ³Brion, ASML US LP, ⁴Florida State University
###### Abstract
Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate evidence chunks for the question, but they stop at input-level evidence exposure rather than adapting the query-side attention parameters that control how the model allocates attention over full-context positions. In contrast, lightweight test-time adaptation methods, such as query-only test-time training (qTTT), leave evidence localization unresolved because their generic span-level self-supervised objectives do not identify which context positions support the current answer. In this paper, we propose Evidence-Aligned SElective Test-Time Training (EASE-TTT), a within-context retrieval-augmented test-time training framework that converts selected evidence chunks into a soft attention supervision target over their token positions. Instead of replacing the full context with retrieved chunks, EASE-TTT uses the resulting attention target to guide query-side adaptation, with the adapted model generating the final answer from the original full context. Experiments on six LongBench QA tasks and three small decoder-only language models show that EASE-TTT achieves the strongest macro-average performance among full-context inference, retrieval-only baselines, and qTTT, supporting evidence-aligned test-time adaptation in long-context QA.
## 1 Introduction
Large language models have made rapid progress in extending their context windows, enabling them to process inputs that contain tens or even hundreds of thousands of tokens (Ding et al., [2024](https://arxiv.org/html/2606.06906#bib.bib23); Team et al., [2024](https://arxiv.org/html/2606.06906#bib.bib24); Chen et al., [2024](https://arxiv.org/html/2606.06906#bib.bib25)). However, a longer context window does not necessarily translate into better long-context question-answering performance. In many long-context question answering tasks, the answer-bearing evidence is already present in the input, yet the model still fails to access it correctly (Liu et al., [2024](https://arxiv.org/html/2606.06906#bib.bib2); Hsieh et al., [2024](https://arxiv.org/html/2606.06906#bib.bib26); Modarressi et al., [2025](https://arxiv.org/html/2606.06906#bib.bib27)). This issue is particularly important for smaller language models, which often have more limited capacity to maintain reliable evidence use in long, distractor-heavy contexts (Gao et al., [2026](https://arxiv.org/html/2606.06906#bib.bib33)). In such cases, the bottleneck is not simply whether the model can fit the context, but whether it can reliably access and prioritize the evidence needed for the current question.
Figure 1: Motivation of EASE-TTT. Retrieval-only and prompt-editing methods expose candidate evidence at the input level, but do not adapt the model's context-access behavior. Test-time training methods can adapt model parameters at inference time, but their objectives are often not explicitly aligned with question-relevant evidence. EASE-TTT bridges this gap by using retrieved evidence to guide test-time adaptation.

A natural way to address this issue is to perform retrieval within the input context. Within-context retrieval methods segment the long input into chunks, localize candidate evidence chunks from the same context, and use the selected chunks to construct a shorter or more focused input (Jiang et al., [2024](https://arxiv.org/html/2606.06906#bib.bib28); Li et al., [2023](https://arxiv.org/html/2606.06906#bib.bib29); Nair et al., [2023](https://arxiv.org/html/2606.06906#bib.bib30)). These methods do not rely on an external corpus; instead, they treat the given long context itself as the retrieval source. They are effective when the selected chunks contain sufficient answer-bearing evidence for generation. However, they typically use retrieval only as an input-level operation: selected chunks are used to replace, shorten, or prepend to the original context (Sheng et al., [2025](https://arxiv.org/html/2606.06906#bib.bib63); Liskavets et al., [2025](https://arxiv.org/html/2606.06906#bib.bib64); Wang et al., [2023](https://arxiv.org/html/2606.06906#bib.bib65); Chirkova et al., [2025](https://arxiv.org/html/2606.06906#bib.bib66)). As a result, the model's parameters and context-access behavior remain unchanged. Moreover, hard chunk selection may discard useful surrounding information, which is risky in long-context QA where evidence may be distributed across multiple parts of the input (Sarthi et al., [2024](https://arxiv.org/html/2606.06906#bib.bib31); Tian et al., [2025](https://arxiv.org/html/2606.06906#bib.bib32); Saad-Falcon et al., [2024](https://arxiv.org/html/2606.06906#bib.bib50); Luo et al., [2025](https://arxiv.org/html/2606.06906#bib.bib51); Wang et al., [2024](https://arxiv.org/html/2606.06906#bib.bib52)).
Figure 2: Overview of EASE-TTT. Given a long context and a question, EASE-TTT selects question-relevant evidence chunks, converts them into a soft attention target over full-context positions, and updates query-side LoRA adapters at test time. The adapted model then generates the answer from the original full context.

This limitation suggests that evidence access should not be treated only as an inference-time input selection problem. For smaller models in particular, failures under long contexts may reflect a mismatch between the model's current context-access behavior and the evidence required by the question (Zhu et al., [2025](https://arxiv.org/html/2606.06906#bib.bib53); Lee et al., [2025](https://arxiv.org/html/2606.06906#bib.bib60); An et al., [2024](https://arxiv.org/html/2606.06906#bib.bib61); Li et al., [2024b](https://arxiv.org/html/2606.06906#bib.bib62)). Test-time adaptation provides a natural way to address this mismatch because it allows a model to change its behavior for each test instance at inference time. In this work, we focus on test-time training (TTT), a gradient-based form of test-time adaptation that performs instance-specific parameter updates (Sun et al., [2020](https://arxiv.org/html/2606.06906#bib.bib34); Wang et al., [2020](https://arxiv.org/html/2606.06906#bib.bib35); Hardt and Sun, [2024](https://arxiv.org/html/2606.06906#bib.bib36); Akyürek et al., [2024](https://arxiv.org/html/2606.06906#bib.bib37)). Recent query-only test-time training further shows that inference-time compute need not be spent only on additional generated tokens; it can also be used for query-side adaptation, allowing the model to change how it allocates attention over a given long context (Bansal et al., [2025](https://arxiv.org/html/2606.06906#bib.bib12)). This perspective is especially relevant to long-context QA, where the evidence may already be present in the input but insufficiently prioritized by the model. However, existing test-time adaptation objectives are typically driven by generic self-supervised, task-level, or retrieval-oriented signals, rather than evidence-localized supervision that identifies which full-context positions support the current answer (Zhang et al., [2024](https://arxiv.org/html/2606.06906#bib.bib8); Feng et al., [2026](https://arxiv.org/html/2606.06906#bib.bib38); Jeong et al., [2023](https://arxiv.org/html/2606.06906#bib.bib48); Sun et al., [2026](https://arxiv.org/html/2606.06906#bib.bib49)). Such objectives may adapt the model to the current input, but they give no signal about where the supporting evidence lies. Therefore, there remains a gap between within-context evidence localization and test-time adaptation: within-context retrieval can localize potentially relevant chunks, while query-side test-time training can adapt model behavior, but existing methods do not directly use question-relevant evidence as supervision for instance-specific adaptation.
We propose Evidence-Aligned Selective Test-Time Training (EASE-TTT), a within-context retrieval-augmented test-time training framework that turns question-relevant evidence into direct supervision for long-context adaptation. Given a long-context question answering instance, EASE-TTT first selects chunks in the input context that are most relevant to the question. Instead of replacing the original context with these chunks, it constructs a soft attention target that assigns greater probability mass to selected evidence positions while still preserving nonzero mass over the remaining context. At test time, EASE-TTT updates lightweight query-side adapters with the base model frozen. After adaptation, the model generates the answer from the original full context. This design turns retrieval from an input-filtering mechanism into an evidence-aligned supervision signal for instance-specific adaptation.
#### Our contributions.
- We identify evidence-use failure as a key bottleneck in long-context reasoning for smaller language models: relevant evidence may be present in the input, but the model still fails to use it under distractor-heavy contexts.
- We propose EASE-TTT, a within-context retrieval-augmented test-time training framework that converts question-relevant chunks into soft supervision for query-side adaptation. Unlike retrieval-only methods, EASE-TTT does not replace the context with chunks; instead, it uses them to guide adaptation while preserving full-context generation.
- We conduct an evaluation on long-context QA benchmarks across multiple small language models. Our results show that EASE-TTT improves answer quality over full-context inference, retrieval-only baselines, and qTTT, with further analyses demonstrating the effects of evidence selection, soft attention supervision, and test-time training.
## 2 Related Work
Within-Context Retrieval and Evidence Selection. A common approach to long-context question answering is to localize question-relevant evidence within the input context before generation (Li et al., [2024a](https://arxiv.org/html/2606.06906#bib.bib54); Qiu et al., [2025](https://arxiv.org/html/2606.06906#bib.bib55); Lee et al., [2024](https://arxiv.org/html/2606.06906#bib.bib56)). Unlike standard retrieval-augmented generation (Lewis et al., [2020](https://arxiv.org/html/2606.06906#bib.bib57)), which retrieves passages from an external corpus, within-context retrieval treats the given long input itself as the retrieval source (Qian et al., [2024](https://arxiv.org/html/2606.06906#bib.bib58); Taguchi et al., [2025](https://arxiv.org/html/2606.06906#bib.bib59)). Prior work has explored related strategies such as prompt compression, context pruning, discourse-based document selection, and hierarchical retrieval to reduce distractors and expose useful evidence to the model (Jiang et al., [2023](https://arxiv.org/html/2606.06906#bib.bib39); Zhao et al., [2024](https://arxiv.org/html/2606.06906#bib.bib43); Yoon et al., [2024](https://arxiv.org/html/2606.06906#bib.bib44)). Efficiency-oriented variants also rely on selecting, compressing, or reorganizing input passages before generation (Xu et al., [2023](https://arxiv.org/html/2606.06906#bib.bib40); Pan et al., [2024](https://arxiv.org/html/2606.06906#bib.bib41)). However, these methods treat evidence access mainly as an input-level operation: retrieved chunks are used to replace, shorten, reorder, or prepend to the original context. As a result, the model's parameters and context-access behavior remain unchanged. This is limiting when answer-bearing evidence is already present in the context window but is still not reliably accessed by the model. Moreover, hard selection can introduce a new bottleneck: selected chunks may omit useful surrounding context, separate evidence distributed across distant regions, or remove information needed to interpret the retrieved span (Günther et al., [2024](https://arxiv.org/html/2606.06906#bib.bib42); Tian et al., [2025](https://arxiv.org/html/2606.06906#bib.bib32)). Thus, retrieval and prompt editing can change what the model sees, but they do not change how the model attends to and uses evidence in the full context.
Test-Time Training. Test-time training (TTT) improves model behavior at inference time by updating parameters using self-supervised signals derived from the test input itself (Hu et al., [2025](https://arxiv.org/html/2606.06906#bib.bib15); Zhang et al., [2025](https://arxiv.org/html/2606.06906#bib.bib6)). These approaches have been explored in settings such as distribution shift, domain adaptation, and reasoning-time adaptation, where fixed pretrained parameters may be insufficient for the input at hand (Hübotter et al., [2025](https://arxiv.org/html/2606.06906#bib.bib9); Agarwal et al., [2025](https://arxiv.org/html/2606.06906#bib.bib10); Li et al., [2025](https://arxiv.org/html/2606.06906#bib.bib11)). In the long-context setting, TTT is especially relevant because each test instance may exhibit different local structures, evidence layouts, and distraction patterns (Muhtar et al., [2024](https://arxiv.org/html/2606.06906#bib.bib14)). However, parameter-level adaptation alone does not solve evidence access unless the training signal is aligned with the evidence required by the current question. Applying TTT to long contexts is therefore nontrivial: the adaptation signal is often local, partial, and potentially noisy, while broad parameter updates may introduce instability or unnecessary computational overhead (Su et al., [2023](https://arxiv.org/html/2606.06906#bib.bib7); Zhang et al., [2024](https://arxiv.org/html/2606.06906#bib.bib8)). These challenges make targeted and evidence-aligned test-time training important for long-context inference. Query-only test-time training (qTTT) narrows the update to the query projections in self-attention rather than adapting the full model (Bansal et al., [2025](https://arxiv.org/html/2606.06906#bib.bib12)). However, qTTT still relies on generic self-supervised objectives rather than explicit supervision from question-relevant evidence. As a result, it can update query-side attention parameters, but it does not specify which full-context positions should guide the update. This creates a mismatch for long-context QA: the model is adapted, but the adaptation is not anchored to the evidence needed to answer the question.
Gap and Motivation. These two lines of work address different sides of the long-context evidence-access problem, but neither resolves it alone. Within-context retrieval and prompt editing operate at the input level: they can localize or expose candidate evidence, but they leave the model's context-access behavior unchanged. This is insufficient when the relevant content is already inside the context window but the model fails to attend to it. Query-only test-time training operates at the parameter level: it can adapt query-side attention behavior, but its objectives are not tied to the evidence positions required by the current question. Consequently, existing methods either select evidence without adapting the model, or adapt the model without explicit evidence guidance. Our method bridges this gap by using retrieved evidence chunks not as a replacement for the full context, but as supervision for query-side test-time training. The final answer is still generated from the original full context, while the retrieved evidence guides how the model updates its attention behavior.
## 3 Preliminary
### 3.1 Long-Context Question Answering and Evidence Use
We study test-time training for long-context question answering. Let a test instance be $z=(c,q)$, where $c=(c_1,c_2,\dots,c_T)$ denotes a long input context and $q$ denotes the question or instruction. Given a pretrained language model $f_\theta$, the goal is to generate an answer $y$ conditioned on both the full context $c$ and the question $q$, i.e., $y\sim p_\theta(\cdot\mid c,q)$.
In long-context QA, the relevant evidence needed to answer $q$ may already be contained in $c$, but the model may still fail to identify or use it correctly. This failure is especially problematic when the context contains many distractors or when the useful evidence is distributed across distant regions of the input. Therefore, the key challenge is not only whether the model can fit the full context, but whether it can reliably access the evidence needed for the current question.
A common way to improve evidence access is to perform retrieval within the given context. Let $\mathcal{S}=\{s_1,s_2,\dots,s_M\}$ denote a set of candidate chunks segmented from $c$, where each chunk $s_j=(c_{b_j},\dots,c_{e_j})$ covers a contiguous span of context tokens. A within-context retrieval module ranks these chunks according to their relevance to $q$ and selects a subset $E=\{s_{j_1},s_{j_2},\dots,s_{j_K}\}$. Retrieval-only methods typically use $E$ to construct a shorter input for generation. In contrast, our goal is not to replace the original context with the selected chunks. Instead, we use the selected evidence chunks as a supervision signal for test-time adaptation, while final answer generation remains conditioned on the original full context $c$.
### 3.2 Query-Only Test-Time Adaptation
Test-time training adapts a model independently for each test instance at inference time, using signals derived from the test input itself. In the long-context setting, full-parameter adaptation is expensive because each gradient update may change the key and value representations of the entire context, requiring repeated computation over the full input.
Query-only test-time training provides a lightweight alternative. Instead of updating all model parameters, it updates only query-side parameters in self-attention while keeping the rest of the model frozen. Let $\Theta_Q=\{W_Q^{(1)},W_Q^{(2)},\dots,W_Q^{(L)}\}$ denote the query projection parameters across the $L$ transformer layers. Given the long context $c$, the model constructs key-value representations $\{K^{(\ell)},V^{(\ell)}\}_{\ell=1}^{L}$, which remain fixed during adaptation. Updating only $\Theta_Q$ changes how the model forms queries over these fixed key-value representations, thereby modifying how it accesses information in the context without recomputing the full context after every gradient step. Standard query-only test-time training usually relies on generic self-supervised objectives. For example, it may sample a span $s=(c_t,c_{t+1},\dots,c_{t+m})$ from the context and optimize a next-token prediction loss:

$$
\mathcal{L}_{\mathrm{span}}(\Theta_Q;s)=-\sum_{i=t}^{t+m-1}\log p_{\theta,\Theta_Q}(c_{i+1}\mid c_{\leq i}).
$$

This objective can adapt the model to the current input, but it does not explicitly indicate which parts of the context are useful for answering the current question. As a result, query-only adaptation can modify context-access behavior, but the adaptation signal remains largely question-agnostic.
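To make the indexing of the span objective concrete, here is a minimal sketch. It assumes the per-position log-probabilities have already been produced by the adapted model; the function name `span_ntp_loss` and the list `logprob_next` are illustrative and do not come from the paper or from any qTTT implementation.

```python
import math

def span_ntp_loss(logprob_next, t, m):
    """Span loss for s = (c_t, ..., c_{t+m}).

    Assumption: logprob_next[i] holds log p(c_{i+1} | c_{<=i}), so the sum
    runs over i = t, ..., t + m - 1, i.e. m terms for a span of m + 1 tokens.
    """
    return -sum(logprob_next[i] for i in range(t, t + m))

# Toy values: every next token is predicted with probability 0.5.
toy_logprobs = [math.log(0.5)] * 10
loss = span_ntp_loss(toy_logprobs, t=2, m=4)
```

Note that nothing in this loss depends on the question, which is the sense in which the signal is question-agnostic.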
### 3.3 From Evidence Selection to Adaptation Supervision
The above discussion suggests a gap between within-context retrieval and query-only test-time adaptation. Within-context retrieval can identify candidate evidence chunks for the current question, but retrieval-only methods usually use these chunks to modify the input rather than the model. Query-only test-time training can adapt how the model accesses the context, but its generic span-based objectives do not directly specify which context positions are question-relevant.
Our method connects these two components by using retrieved evidence chunks as supervision for query-side test-time adaptation. Let $E$ denote the selected evidence chunks, and let $\Omega(E)\subseteq\{1,2,\dots,T\}$ denote the indices of context tokens covered by these chunks. Instead of replacing the original context with $E$, we use $\Omega(E)$ to guide adaptation toward evidence-bearing positions. The detailed construction of the soft attention target and the corresponding adaptation objective are introduced in the next section.
## 4 EASE-TTT
### 4.1 Method Overview
We propose EASE-TTT, an evidence-selective variant of query-only test-time training for long-context question answering. Unlike prior qTTT methods, which adapt query-side parameters using generic self-supervised losses over randomly sampled spans, our method identifies question-relevant evidence and uses it to guide test-time attention adaptation. Given a context $c$ and a question $q$, EASE-TTT segments the context into candidate spans and ranks them by their question-conditioned utility. The top-$K$ spans are selected as evidence chunks and used to define a soft target attention distribution over context positions. During test-time adaptation, EASE-TTT updates only query-side adaptation parameters according to this evidence-aligned attention target. Final prediction is still performed on the original full context, so the selected chunks guide attention without truncating the input.
Algorithm 1: EASE-TTT with Evidence Selection and Soft Attention Supervision

1. Input: base model $f_\theta$, context $c$, question $q$, update steps $N$, top-$K$, attention layer $\ell$, mass $\alpha$, learning rate $\eta$
2. Insert trainable LoRA adapters into query projections; freeze all other parameters
3. Segment the context into candidate spans $\mathcal{S}$
4. for each span $s\in\mathcal{S}$ do
5. Compute question-conditioned utility score $r(s)$
6. end for
7. $E\leftarrow\mathrm{TopK}(\mathcal{S},r,K)$ $\triangleright$ selected evidence chunks
8. $\Omega(E)\leftarrow\{\text{context token positions covered by }E\}$
9. Construct soft target distribution $\pi$ over context positions using $\Omega(E)$ and $\alpha$
10. for $t=1$ to $N$ do
11. Obtain attention distribution $a$ over context positions at layer $\ell$
12. $\mathcal{L}\leftarrow D_{\mathrm{KL}}(\pi\,\|\,a)$
13. Update query-side LoRA parameters with learning rate $\eta$
14. end for
15. Generate the final answer using the full context
### 4.2 Within-Context Evidence Selection
A central challenge in long-context reasoning is that useful evidence is often buried among large amounts of irrelevant content. To obtain a more targeted adaptation signal, we first identify candidate evidence chunks from the full context.
Given the context token sequence $c=(c_1,\dots,c_T)$, we segment it into spans using token-level negative log-likelihood (NLL) spikes. Specifically, we run a forward pass over $c$ and compute the NLL of each context token. After smoothing the resulting NLL curve, we detect boundary candidates using a threshold of the form $\mu+\kappa\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the smoothed curve, and $\kappa$ is a spike factor. Together with a minimum chunk-length constraint $m_{\min}$, this yields a set of candidate spans $\mathcal{S}=\{s_1,\dots,s_M\}$.
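The segmentation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not specify the smoothing filter or how boundaries interact with the length constraint, so the moving-average window, the rule that a spike only opens a new chunk once the current one has reached `min_len`, and the merging of a short tail are all assumptions, as are the function names.

```python
from statistics import mean, pstdev

def smooth(values, window=3):
    """Centered moving average; the window is truncated at the sequence edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def segment_by_nll_spikes(nll, kappa=1.0, min_len=4, window=3):
    """Return (start, end) spans, end exclusive, cut where smoothed NLL spikes."""
    if not nll:
        return []
    curve = smooth(nll, window)
    threshold = mean(curve) + kappa * pstdev(curve)  # mu + kappa * sigma
    spans, start = [], 0
    for i in range(1, len(curve)):
        # A spike opens a new chunk only if the chunk being closed
        # already has at least min_len tokens.
        if curve[i] > threshold and i - start >= min_len:
            spans.append((start, i))
            start = i
    if spans and len(curve) - start < min_len:
        spans[-1] = (spans[-1][0], len(curve))  # merge a too-short tail
    else:
        spans.append((start, len(curve)))
    return spans
```

In practice `nll` would be the per-token loss from one forward pass over the context; here it is just a list of floats, so the sketch can be run on synthetic curves.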
We then score each span by how much it helps the model condition on the question. For a candidate span $s$, we define its question-conditioned utility as

$$
r(s)=\mathcal{L}_{\text{NTP}}([\mathrm{BOS},q])-\mathcal{L}_{\text{NTP}}([s,\mathrm{BOS},q]),
$$

where $\mathcal{L}_{\text{NTP}}(\cdot)$ denotes the next-token prediction loss on the question tokens. Intuitively, if prepending $s$ reduces the question modeling loss, then $s$ likely contains evidence relevant to answering $q$.
We rank all spans by $r(s)$ and retain the top-$K$ spans:

$$
E=\operatorname{TopK}(\mathcal{S},r,K),
$$

where $E$ denotes the selected evidence chunks. These chunks are not used to replace the full context at inference time; instead, they provide a focused supervision signal for the subsequent adaptation stage.
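A minimal sketch of the scoring and selection step is given below. The callable `ntp_loss(prefix, question)` is a hypothetical stand-in for a language model's next-token loss on the question tokens given a prefix; `toy_ntp_loss` is a word-overlap heuristic used only so the example runs, and is not how the paper computes the loss.

```python
def select_top_k(ntp_loss, spans, question, k, bos="<BOS>"):
    """Rank spans by r(s) = L([BOS, q]) - L([s, BOS, q]) and keep the top k.

    Returns the selected span indices (in document order) and all scores.
    """
    base = ntp_loss([bos], question)
    scores = [base - ntp_loss(list(s) + [bos], question) for s in spans]
    # sorted() is stable, so ties keep their original document order.
    ranked = sorted(range(len(spans)), key=lambda j: -scores[j])
    return sorted(ranked[:k]), scores

def toy_ntp_loss(prefix, question):
    """Illustrative stand-in: the loss falls as prefix/question overlap grows."""
    overlap = len(set(prefix) & set(question))
    return 5.0 / (1.0 + overlap)

spans = [
    ["the", "cat", "sat"],
    ["paris", "is", "the", "capital", "of", "france"],
    ["dogs", "bark"],
]
question = ["what", "is", "the", "capital", "of", "france"]
top, scores = select_top_k(toy_ntp_loss, spans, question, k=2)
```

A span that does not reduce the question loss receives a utility of zero (or below), so it is ranked last regardless of its length.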
### 4.3 Soft-Target Attention Alignment
Existing qTTT methods typically optimize generic self-supervised objectives such as next-token prediction over sampled spans. While lightweight, such objectives only indirectly encourage the model to allocate attention toward question-relevant evidence. To make the adaptation target more explicit, we supervise attention directly using the selected evidence chunks.
Let $q=(q_1,\dots,q_R)$ denote the tokenized question. At each test-time adaptation step, we prefill the model on the sequence $[c;q_{1:R-1}]$ and decode the final question token $q_R$. From a chosen attention layer $\ell$, we extract the attention distribution over context positions, average across heads, and normalize it into a probability distribution $a\in\mathbb{R}^{T}$.
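The head averaging and renormalization can be sketched as below. The sketch assumes the per-head attention rows for the final query position are already available as plain lists; how they are read out of a particular model is not shown, and `context_attention` is an illustrative name.

```python
def context_attention(head_rows, num_context):
    """Average attention over heads, keep context positions, renormalize.

    head_rows[h][j] is assumed to be the attention weight that head h assigns
    from the final query position to sequence position j, with the first
    num_context positions belonging to the context c.
    """
    num_heads = len(head_rows)
    avg = [sum(row[j] for row in head_rows) / num_heads for j in range(num_context)]
    total = sum(avg)
    return [x / total for x in avg]

# Two heads over a 4-position sequence whose first 3 positions are context.
a = context_attention([[0.1, 0.3, 0.2, 0.4], [0.3, 0.1, 0.2, 0.4]], num_context=3)
```

Renormalizing is needed because part of each head's attention mass falls on the question tokens, so the context slice alone does not sum to one.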
Let $\Omega(E)$ be the set of context token positions covered by the selected evidence chunks $E$. We define a soft target attention distribution $\pi$ over context positions by spreading a total mass of $\alpha$ uniformly over $\Omega(E)$ and the remaining $1-\alpha$ uniformly over all other positions:

$$
\pi_i=\begin{cases}\alpha/|\Omega(E)|, & i\in\Omega(E),\\[4pt] (1-\alpha)/(T-|\Omega(E)|), & i\notin\Omega(E),\end{cases}
$$

where $\alpha\in(0,1)$ controls how strongly attention is biased toward the selected evidence.
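The target construction is straightforward to write down. The sketch below uses 0-based positions and half-open `(start, end)` chunks, which are conventions chosen here rather than taken from the paper; overlapping chunks are handled by taking the union of their positions.

```python
def evidence_positions(chunks):
    """Union of token positions covered by (start, end) chunks, end exclusive."""
    covered = set()
    for start, end in chunks:
        covered.update(range(start, end))
    return covered

def soft_target(num_tokens, covered, alpha=0.6):
    """Mass alpha spread uniformly over evidence, 1 - alpha over the rest."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n_in = len(covered)
    if not 0 < n_in < num_tokens:
        raise ValueError("evidence must cover some but not all positions")
    inside = alpha / n_in
    outside = (1.0 - alpha) / (num_tokens - n_in)
    return [inside if i in covered else outside for i in range(num_tokens)]

covered = evidence_positions([(2, 4), (3, 5), (7, 8)])
pi = soft_target(10, covered, alpha=0.6)
```

With four evidence tokens out of ten and $\alpha=0.6$, each evidence position receives $0.15$ and each remaining position receives $0.4/6\approx 0.067$, so no position is assigned zero mass.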
We then optimize the Kullback–Leibler divergence

$$
\mathcal{L}_{\text{attn}}=D_{\mathrm{KL}}(\pi\,\|\,a),
$$

which explicitly encourages the model to reallocate attention toward evidence-bearing context positions while still preserving a small amount of mass on the rest of the context. Compared with hard masking, this soft target is more stable and avoids forcing the model to ignore potentially useful non-selected tokens entirely.
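The following toy sketch illustrates the shape of the adaptation loop under strong simplifying assumptions: a single attention head, one trainable query vector standing in for the query-side LoRA parameters, plain gradient descent standing in for AdamW, and keys that stay fixed throughout. It relies on the standard identity that, when $a=\mathrm{softmax}(z)$, the gradient of $D_{\mathrm{KL}}(\pi\,\|\,a)$ with respect to the logit $z_i$ is $a_i-\pi_i$. All function names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, attn):
    """D_KL(target || attn); positions with zero target mass contribute nothing."""
    return sum(p * math.log(p / a) for p, a in zip(target, attn) if p > 0)

def adapt_query(query, keys, target, steps=15, lr=0.5):
    """Gradient descent on one query vector; the keys are never modified."""
    scale = 1.0 / math.sqrt(len(query))
    query = list(query)
    history = []  # loss measured before each update
    for _ in range(steps):
        logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
        attn = softmax(logits)
        history.append(kl_divergence(target, attn))
        # Chain rule: dL/dquery[d] = scale * sum_i (attn_i - target_i) * key_i[d]
        for d in range(len(query)):
            grad_d = scale * sum(
                (a - p) * key[d] for a, p, key in zip(attn, target, keys)
            )
            query[d] -= lr * grad_d
    return query, history

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
target = [0.6, 0.4 / 3, 0.4 / 3, 0.4 / 3]  # position 0 plays the evidence role
query, history = adapt_query([0.0, 0.0], keys, target)
```

Starting from uniform attention, the query moves toward the key at the evidence position and the loss in `history` shrinks step by step, while every non-evidence position keeps nonzero attention.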
## 5 Experiments
Table 1: Main results on six LongBench QA tasks: MuSiQue, HotpotQA, 2WikiMultihopQA, QASPER, NarrativeQA, and MultiFieldQA-en, across Qwen3-0.6B, Qwen3-1.7B, and Llama-3.2-1B. RAG denotes Within-Context Retrieval-Augmented Generation.

### 5.1 Setup
Evaluation Datasets. We evaluate our method on six English long-context question answering tasks from LongBench (Bai et al., [2024](https://arxiv.org/html/2606.06906#bib.bib21)): MuSiQue, HotpotQA, 2WikiMultihopQA, QASPER, NarrativeQA, and MultiFieldQA-en. These tasks cover multi-hop question answering, single-document question answering, narrative understanding, and long-context information extraction. They require models to locate, aggregate, and reason over relevant evidence in extended input contexts. We report the official task-level evaluation scores and compute the macro-average across the six datasets.
LLMs and Baselines. We conduct experiments on three small decoder-only language models: Qwen3-0.6B, Qwen3-1.7B (Yang et al., [2025](https://arxiv.org/html/2606.06906#bib.bib47)), and Llama-3.2-1B (Grattafiori et al., [2024](https://arxiv.org/html/2606.06906#bib.bib46)). We compare EASE-TTT with four baselines. Full-context directly generates the answer from the full input context, without retrieval or test-time parameter updates. Within-Context Retrieval-Augmented Generation (Within-Context RAG) retrieves the top-ranked chunks from the same input context using the question as the retrieval query, concatenates the retrieved chunks as a shortened context, and generates the answer without accessing any external corpus or updating model parameters. In-Context Retrieval (ICR) retrieves relevant segments from the given long input and uses the retrieved segments, together with the corresponding prompting strategy, to answer the question (Agrawal et al., [2024](https://arxiv.org/html/2606.06906#bib.bib45)). Query-Only Test-Time Training (qTTT) performs query-only test-time training by updating query-side parameters using a generic self-supervised next-token prediction objective on sampled context spans (Bansal et al., [2025](https://arxiv.org/html/2606.06906#bib.bib12)). EASE-TTT updates only query-side adaptation parameters, but replaces generic span-based supervision with evidence-guided soft attention supervision constructed from question-relevant chunks selected within the input context. Unlike retrieval-only baselines, EASE-TTT uses selected chunks only to guide test-time adaptation, while final answer generation is performed over the original full context.
Implementation Details. For all methods, we truncate the input context to at most 32,768 tokens and the question to at most 1,024 tokens. The maximum answer length is set to 128 tokens, and we use deterministic decoding. For EASE-TTT, we insert LoRA adapters into the query projection modules while keeping the base model frozen. Unless otherwise specified, we use LoRA rank 8, a scaling factor of 16, and a dropout rate of 0.05. Test-time adaptation is performed for 15 update steps with AdamW, using a learning rate of $1\times 10^{-4}$ and weight decay of 0.01. We use 512 tokens as the target chunk size, with a minimum chunk size of 128 tokens, a maximum chunk size of 1,024 tokens, and an overlap of 64 tokens. We then rank candidate chunks by the utility score in Section [4](https://arxiv.org/html/2606.06906#S4) and select the top 4 chunks for evidence-guided adaptation. By default, we use layer $\ell=14$ for attention alignment. The soft attention target uses a mass parameter of $\alpha=0.6$.
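For reference, the reported defaults can be collected in one place. The class and field names below are hypothetical and chosen only for readability; the values are the ones stated in the paragraph above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EaseTTTConfig:
    # Field names are illustrative; values are the defaults reported above.
    max_context_tokens: int = 32_768
    max_question_tokens: int = 1_024
    max_answer_tokens: int = 128
    lora_rank: int = 8
    lora_scaling: int = 16
    lora_dropout: float = 0.05
    update_steps: int = 15
    learning_rate: float = 1e-4
    weight_decay: float = 0.01
    target_chunk_tokens: int = 512
    min_chunk_tokens: int = 128
    max_chunk_tokens: int = 1_024
    chunk_overlap_tokens: int = 64
    top_k_chunks: int = 4
    attention_layer: int = 14
    alpha: float = 0.6

config = EaseTTTConfig()
```

Freezing the dataclass keeps a run's settings immutable, which makes it harder to change a hyperparameter by accident between the selection and adaptation stages.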
### 5\.2Main Results
Table[1](https://arxiv.org/html/2606.06906#S5.T1)reports the main results on six LongBench QA tasks\. Overall, EASE\-TTT achieves the best average performance on the Qwen3 models, improving over full\-context inference, retrieval\-only baselines, and qTTT\. On Qwen3\-0\.6B, EASE\-TTT obtains an average score of 23\.6, outperforming full\-context inference by 4\.1 points and qTTT by 1\.2 points\. On Qwen3\-1\.7B, EASE\-TTT achieves an average score of 30\.6, improving over Full\-context by 5\.6 points, Within\-Context RAG by 5\.3 points, ICR by 3\.0 points, and qTTT by 1\.9 points\.
These results support our hypothesis that long\-context QA depends not only on context availability, but also on reliable evidence access\. Full\-context inference is consistently weaker than adaptation\-based methods, indicating that simply providing the full input is insufficient\. Retrieval\-only methods improve some tasks, but their gains are inconsistent\. For example, ICR improves MuSiQue and 2WikiMultihopQA on Qwen3\-1\.7B, but underperforms full\-context inference on QASPER and NarrativeQA, suggesting that shortened retrieved contexts may also discard useful surrounding information\.
Compared with qTTT, EASE-TTT improves the macro-average scores on all three models, although the size of the gain varies across model families. This suggests that evidence-localized supervision provides a more targeted adaptation signal than generic span-based self-supervision, while preserving full-context generation. The gains are more visible on several tasks that require locating or integrating evidence across long inputs, such as 2WikiMultihopQA, QASPER, and MultiFieldQA-en. These gains show the benefit of anchoring test-time updates to question-relevant evidence.
### 5.3 Efficiency Analysis
Table [2](https://arxiv.org/html/2606.06906#A2.T2) compares qTTT and EASE-TTT on three profiled LongBench tasks using Qwen3-1.7B. We focus on qTTT because it is the closest adaptation-based baseline: both methods perform query-side test-time adaptation, but use different supervision signals. EASE-TTT improves the average score from 38.0 to 40.1, while increasing the average per-example runtime from 6.7s to 9.1s. This corresponds to a 2.1-point score improvement with an additional 2.4s per example.
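The trade-off can also be expressed in relative terms. The sketch below uses only the four averages quoted in this paragraph; the derived ratios (relative slowdown and points gained per extra second) are our own arithmetic and are not reported in the paper.

```python
# Averages over the three profiled tasks, as quoted in the text.
qttt = {"score": 38.0, "seconds": 6.7}
ease = {"score": 40.1, "seconds": 9.1}

score_gain = round(ease["score"] - qttt["score"], 1)          # points
extra_seconds = round(ease["seconds"] - qttt["seconds"], 1)   # seconds per example

# Derived quantities (not stated in the paper).
relative_slowdown = ease["seconds"] / qttt["seconds"] - 1.0
points_per_extra_second = score_gain / extra_seconds

print(f"gain: {score_gain} points, extra time: {extra_seconds}s")
print(f"relative slowdown: {relative_slowdown:.1%}")
print(f"points per extra second: {points_per_extra_second:.3f}")
```

On these numbers the runtime increase is roughly 36% per example, which is the cost side of the 2.1-point improvement.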
The additional cost mainly comes from evidence selection and attention\-map supervision\. Unlike qTTT, which optimizes a standard next\-token prediction loss on sampled spans, EASE\-TTT first identifies question\-relevant evidence chunks and constructs a soft target over full\-context positions\. During adaptation, it also extracts and aligns the selected\-layer attention distribution with this target, which introduces extra computation beyond the generic span\-based objective\. Peak GPU memory remains in a comparable range across the profiled runs\. Overall, EASE\-TTT trades moderate additional latency for better accuracy over qTTT\.
### 5.4 Ablation Study
Loss Objective. Figure [3](https://arxiv.org/html/2606.06906#S5.F3) evaluates the effect of the adaptation objective. Chunk NTP adapts the model on selected evidence chunks using a standard next-token prediction loss, while Attn. KL directly aligns the model’s attention distribution with the selected evidence positions. Attn. KL consistently outperforms Chunk NTP on all three tasks, improving HotpotQA from 30.5 to 36.6, QASPER from 37.0 to 39.2, and MultiFieldQA from 43.7 to 44.6. This comparison shows that the benefit of EASE-TTT does not come simply from exposing the model to selected evidence during test-time training. If the selected chunks are used only as ordinary next-token prediction data, the adaptation objective remains weakly connected to the final evidence-access problem. In contrast, Attn. KL converts the selected chunks into an explicit supervision signal over full-context positions. This better matches the goal of EASE-TTT: improving how the model attends to evidence while still generating from the original full context.
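To illustrate what an attention-alignment objective of this kind can look like, here is a minimal pure-Python sketch. It is not the paper's definition: the exact target construction and loss are given in the method section, which is not reproduced here. We assume, for illustration only, that a mass of $\alpha$ is spread uniformly over evidence tokens and the remaining $1-\alpha$ uniformly over all other positions, and that the loss is the KL divergence from the target to the model's attention distribution. The function names are hypothetical.

```python
import math


def soft_attention_target(n_positions, evidence_spans, alpha=0.6):
    """Assumed target: `alpha` mass on evidence tokens, rest on background."""
    evidence = set()
    for start, end in evidence_spans:  # spans are (start, end), end exclusive
        evidence.update(range(max(start, 0), min(end, n_positions)))
    n_evidence = len(evidence)
    n_background = n_positions - n_evidence
    if n_evidence == 0 or n_background == 0:
        # Degenerate case: fall back to a uniform distribution.
        return [1.0 / n_positions] * n_positions
    return [
        alpha / n_evidence if i in evidence else (1.0 - alpha) / n_background
        for i in range(n_positions)
    ]


def kl_divergence(target, attention, eps=1e-12):
    """KL(target || attention); the direction is an assumption of this sketch."""
    return sum(
        t * math.log((t + eps) / (a + eps))
        for t, a in zip(target, attention)
        if t > 0
    )


# Toy example: 10 context positions, evidence at positions 2 and 3.
target = soft_attention_target(10, [(2, 4)], alpha=0.6)
uniform_attention = [0.1] * 10
loss = kl_divergence(target, uniform_attention)
```

In this toy case the two evidence positions each receive 0.3 of the target mass, and the loss is positive for uniform attention and zero when the attention matches the target. A next-token prediction loss on the same chunks, by contrast, places no direct constraint on where attention mass goes.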
Figure 3: Objective ablation on Qwen3-1.7B. Attn. KL outperforms Chunk NTP, showing the benefit of using selected evidence as explicit attention supervision.
Effect of Attention Layer. Figure [4](https://arxiv.org/html/2606.06906#S5.F4) studies how the choice of attention supervision layer affects performance. This choice is not merely an implementation detail, because recent layer-wise analyses suggest that different LLM layers play different functional roles. Lower layers are more involved in gathering information from previous tokens, while upper layers increasingly consolidate the gathered information internally (Artzy and Schwartz, [2024](https://arxiv.org/html/2606.06906#bib.bib67)). In addition, intermediate layers can encode stronger task-relevant representations than final layers for downstream tasks (Skean et al., [2025](https://arxiv.org/html/2606.06906#bib.bib69)).
Figure 4: Effect of attention layer on EASE-TTT using Qwen3-1.7B. The results compare different attention layers while keeping all other hyperparameters fixed.
Our results are consistent with this view\. Very early layers are less effective, likely because their attention patterns are still dominated by low\-level context gathering rather than question\-specific evidence use\. The final layer is also not necessarily optimal, since it may be more closely tied to consolidated representations and final prediction\. Intermediate layers provide a better trade\-off: they are sufficiently contextualized to reflect question\-relevant evidence, while still leaving room for the alignment signal to influence subsequent computation\. This explains why EASE\-TTT benefits more from supervising intermediate attention layers than from supervising the earliest or final layers\.
## 6 Conclusion
We studied long\-context question answering for smaller language models, where answer\-bearing evidence may already be present in the input but not reliably accessed by the model\. We proposed EASE\-TTT, a within\-context retrieval\-augmented test\-time training framework that localizes evidence chunks and converts them into soft attention supervision for query\-side adaptation\. Rather than replacing the full context with retrieved chunks, EASE\-TTT uses localized evidence to guide lightweight test\-time updates while generating the final answer from the original full context\. Experiments on six LongBench QA tasks show that EASE\-TTT improves over full\-context inference, retrieval\-only baselines, and qTTT\. Ablation results further show that explicit attention alignment is more effective than next\-token prediction on selected chunks, suggesting that localized evidence is most useful when it guides how the model attends to the full context rather than only exposing relevant content\. These findings highlight evidence\-aware test\-time adaptation as a promising direction for smaller long\-context models\.
## Limitations
This work has several limitations\. First, our experiments focus on long\-context question answering tasks, where answer\-relevant information is usually expected to appear in the input context\. Although this setting directly matches our research question, further evaluation is needed to understand how EASE\-TTT generalizes to other types of tasks, such as mathematical reasoning, symbolic reasoning, and open\-ended generation\.
Second, our study mainly evaluates relatively small language models. Since larger models may already have stronger long-context utilization ability, the effect of evidence-guided test-time adaptation may vary across model scales. Future work can examine how the proposed approach behaves on larger models and different model families.
## References
- A. Agarwal, A. Sengupta, and T. Chakraborty (2025) First finish search: efficient test-time scaling in large language models. arXiv preprint arXiv:2505.18149. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p2.1).
- D. Agrawal, S. Gao, and M. Gajek (2024) Can’t remember details in long documents? you need some r&r. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 12692–12704. Cited by: [§5.1](https://arxiv.org/html/2606.06906#S5.SS1.p2.1).
- E. Akyürek, M. Damani, A. Zweiger, L. Qiu, H. Guo, J. Pari, Y. Kim, and J. Andreas (2024) The surprising effectiveness of test-time training for few-shot learning. arXiv preprint arXiv:2411.07279. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1).
- S. An, Z. Ma, Z. Lin, N. Zheng, J. Lou, and W. Chen (2024) Make your llm fully utilize the context. Advances in Neural Information Processing Systems 37, pp. 62160–62188. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1).
- A. B. Artzy and R. Schwartz (2024) Attend first, consolidate later: on the importance of attention in different llm layers. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 177–184. Cited by: [§5.4](https://arxiv.org/html/2606.06906#S5.SS4.p3.1).
- Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, et al. (2024) Longbench: a bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3119–3137. Cited by: [§5.1](https://arxiv.org/html/2606.06906#S5.SS1.p1.1).
- R. Bansal, A. Zhang, R. Tiwari, L. Madaan, S. S. Duvvuri, D. Khatri, D. Brandfonbrener, D. Alvarez-Melis, P. Bhargava, M. S. Kale, et al. (2025) Let’s (not) just put things in context: test-time training for long-context llms. arXiv preprint arXiv:2512.13898. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1), [§2](https://arxiv.org/html/2606.06906#S2.p2.1), [§5.1](https://arxiv.org/html/2606.06906#S5.SS1.p2.1).
- Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia (2024) Longlora: efficient fine-tuning of long-context large language models. In International Conference on Learning Representations, Vol. 2024, pp. 8220–8238. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p1.1).
- N. Chirkova, T. Formal, V. Nikoulina, and S. Clinchant (2025) Provence: efficient and robust context pruning for retrieval-augmented generation. arXiv preprint arXiv:2501.16214. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p2.1).
- Y. Ding, L. L. Zhang, C. Zhang, Y. Xu, N. Shang, J. Xu, F. Yang, and M. Yang (2024) Longrope: extending llm context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p1.1).
- G. Feng, S. Luo, K. Hua, G. Zhang, D. He, W. Huang, and T. Cai (2026) In-place test-time training. arXiv preprint arXiv:2604.06169. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1).
- Y. Gao, Y. Xiong, W. Wu, B. Li, Y. Zhong, and H. Wang (2026) U-niah: unified rag and llm evaluation for long context needle-in-a-haystack. ACM Transactions on Information Systems 44(3), pp. 1–30. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p1.1).
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: [§5.1](https://arxiv.org/html/2606.06906#S5.SS1.p2.1).
- M. Günther, I. Mohr, D. J. Williams, B. Wang, and H. Xiao (2024) Late chunking: contextual chunk embeddings using long-context embedding models. arXiv preprint arXiv:2409.04701. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- M. Hardt and Y. Sun (2024) Test-time training on nearest neighbors for large language models. In International Conference on Learning Representations, Vol. 2024, pp. 54625–54640. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1).
- C. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg (2024) RULER: what’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p1.1).
- J. Hu, Z. Zhang, G. Chen, X. Wen, C. Shuai, W. Luo, B. Xiao, Y. Li, and M. Tan (2025) Test-time learning for large language models. arXiv preprint arXiv:2505.20633. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p2.1).
- J. Hübotter, S. Bongni, I. Hakimi, and A. Krause (2025) Efficiently learning at test-time: active fine-tuning of llms. In International Conference on Learning Representations, Vol. 2025, pp. 74978–75035. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p2.1).
- S. Jeong, J. Baek, S. Cho, S. Hwang, and J. C. Park (2023) Test-time self-adaptive small language models for question answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 15459–15469. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1).
- H. Jiang, Q. Wu, C. Lin, Y. Yang, and L. Qiu (2023) Llmlingua: compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 13358–13376. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- H. Jiang, Q. Wu, X. Luo, D. Li, C. Lin, Y. Yang, and L. Qiu (2024) Longllmlingua: accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1658–1677. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p2.1).
- K. Lee, X. Chen, H. Furuta, J. Canny, and I. Fischer (2024) A human-inspired reading agent with gist memory of very long contexts. arXiv preprint arXiv:2402.09727. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- T. Lee, C. Yoon, K. Jang, D. Lee, M. Song, H. Kim, and J. Kang (2025) Ethic: evaluating large language models on long-context tasks with high information coverage. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5497–5512. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1).
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33, pp. 9459–9474. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- H. Li, P. Verga, P. Sen, B. Yang, V. Viswanathan, P. Lewis, T. Watanabe, and Y. Su (2024a) ALR2: a retrieve-then-reason framework for long-context question answering. arXiv preprint arXiv:2410.03227. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- T. Li, G. Zhang, Q. D. Do, X. Yue, and W. Chen (2024b) Long-context llms struggle with long in-context learning. arXiv preprint arXiv:2404.02060. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1).
- Y. Li, M. R. Lyu, and L. Wang (2025) Learning to reason from feedback at test-time. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5241–5253. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p2.1).
- Y. Li, B. Dong, F. Guerin, and C. Lin (2023) Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6342–6353. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p2.1).
- B. Liskavets, M. Ushakov, S. Roy, M. Klibanov, A. Etemad, and S. K. Luke (2025) Prompt compression with context-aware sentence encoding for fast and improved llm inference. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 24595–24604. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p2.1).
- N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024) Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics 12, pp. 157–173. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p1.1).
- K. Luo, Z. Liu, P. Zhang, H. Qian, J. Zhao, and K. Liu (2025) Does rag really perform bad for long-context processing? arXiv preprint arXiv:2502.11444. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p2.1).
- A. Modarressi, H. Deilamsalehy, F. Dernoncourt, T. Bui, R. A. Rossi, S. Yoon, and H. Schütze (2025) Nolima: long-context evaluation beyond literal matching. arXiv preprint arXiv:2502.05167. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p1.1).
- D. Muhtar, Y. Shen, Y. Yang, X. Liu, Y. Lu, J. Liu, Y. Zhan, H. Sun, W. Deng, F. Sun, et al. (2024) Streamadapter: efficient test time adaptation from contextual streams. arXiv preprint arXiv:2411.09289. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p2.1).
- I. Nair, S. Somasundaram, A. Saxena, and K. Goswami (2023) Drilling down into the discourse structure with llms for long document question answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 14593–14606. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p2.1).
- Z. Pan, Q. Wu, H. Jiang, M. Xia, X. Luo, J. Zhang, Q. Lin, V. Rühle, Y. Yang, C. Lin, et al. (2024) Llmlingua-2: data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 963–981. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- H. Qian, Z. Liu, K. Mao, Y. Zhou, and Z. Dou (2024) Grounding language model with chunking-free in-context retrieval. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1298–1311. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- Y. Qiu, V. R. Embar, Y. Zhang, N. Jaitly, S. B. Cohen, and B. Han (2025) Eliciting in-context retrieval and reasoning for long-context large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 3176–3192. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- J. Saad-Falcon, D. Y. Fu, S. Arora, N. Guha, and C. Ré (2024) Benchmarking and building long-context retrieval models with loco and m2-bert. arXiv preprint arXiv:2402.07440. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p2.1).
- P. Sarthi, S. Abdullah, A. Tuli, S. Khanna, A. Goldie, and C. Manning (2024) Raptor: recursive abstractive processing for tree-organized retrieval. In International Conference on Learning Representations, Vol. 2024, pp. 32628–32649. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p2.1).
- B. Sheng, J. Yao, M. Zhang, and G. He (2025) Dynamic chunking and selection for reading comprehension of ultra-long context in large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 31857–31876. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p2.1).
- O. Skean, M. R. Arefin, D. Zhao, N. Patel, J. Naghiyev, Y. LeCun, and R. Shwartz-Ziv (2025) Layer by layer: uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013. Cited by: [§5.4](https://arxiv.org/html/2606.06906#S5.SS4.p3.1).
- Y. Su, Y. Ji, J. Li, H. Ye, and M. Zhang (2023) Beware of model collapse! fast and stable test-time adaptation for robust question answering. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12998–13011. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p2.1).
- X. Sun, Z. Chen, Q. Liu, S. Wu, B. Song, W. Wang, Z. Wang, and L. Wang (2026) Predict the retrieval! test time adaptation for retrieval augmented generation. arXiv preprint arXiv:2601.11443. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1).
- Y. Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt (2020) Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pp. 9229–9248. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1).
- C. Taguchi, S. Maekawa, and N. Bhutani (2025) Efficient context selection for long-context qa: no tuning, no iteration, just adaptive-k. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20116–20141. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. (2024) Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p1.1).
- R. Tian, Y. Li, Y. Fu, S. Deng, Q. Luo, C. Qian, S. Wang, X. Cong, Z. Zhang, Y. Wu, et al. (2025) Distance between relevant information pieces causes bias in long-context llms. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 521–533. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p2.1), [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell (2020) Tent: fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1).
- M. Wang, L. Chen, F. Cheng, S. Liao, X. Zhang, B. Wu, H. Yu, N. Xu, L. Zhang, R. Luo, et al. (2024) Leave no document behind: benchmarking long-context llms with extended multi-doc qa. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 5627–5646. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p2.1).
- Z. Wang, J. Araki, Z. Jiang, M. R. Parvez, and G. Neubig (2023) Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p2.1).
- F. Xu, W. Shi, and E. Choi (2023) Recomp: improving retrieval-augmented lms with compression and selective augmentation. arXiv preprint arXiv:2310.04408. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [§5.1](https://arxiv.org/html/2606.06906#S5.SS1.p2.1).
- C. Yoon, T. Lee, H. Hwang, M. Jeong, and J. Kang (2024) Compact: compressing retrieved documents actively for question answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 21424–21439. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- Q. Zhang, Y. Bian, X. Kong, P. Zhao, and C. Zhang (2024) COME: test-time adaption by conservatively minimizing entropy. arXiv preprint arXiv:2410.10894. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1), [§2](https://arxiv.org/html/2606.06906#S2.p2.1).
- Q. Zhang, F. Lyu, Z. Sun, L. Wang, W. Zhang, W. Hua, H. Wu, Z. Guo, Y. Wang, N. Muennighoff, et al. (2025) A survey on test-time scaling in large language models: what, how, where, and how well? arXiv preprint arXiv:2503.24235. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p2.1).
- Q. Zhao, R. Wang, Y. Cen, D. Zha, S. Tan, Y. Dong, and J. Tang (2024) Longrag: a dual-perspective retrieval-augmented generation paradigm for long-context question answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 22600–22632. Cited by: [§2](https://arxiv.org/html/2606.06906#S2.p1.1).
- Y. Zhu, R. Li, D. Wang, D. Haehn, and X. Liang (2025) Focus directions make your language models pay more attention to relevant contexts. arXiv preprint arXiv:2503.23306. Cited by: [§1](https://arxiv.org/html/2606.06906#S1.p3.1).
## Appendix A Prompt Templates
We use the following prompt template for all context-based question answering experiments. The placeholder {context} denotes the full input context provided by the benchmark, and {question} denotes the corresponding question. To ensure consistent evaluation, we instruct the model to output only the final answer without additional explanations or intermediate reasoning.
Instruction: Context-Based Question Answering
Answer the question based on the given context.
Context: {context}
Question: {question}
Please provide the final answer only.
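As a small illustration, the template can be filled in programmatically. The sketch below is a hypothetical helper, not the authors' code: the constant name `PROMPT_TEMPLATE`, the function `build_prompt`, and the exact line breaks between template fields are assumptions, since the original whitespace is not recoverable from the text.

```python
# Template wording follows the appendix; the whitespace layout is assumed.
PROMPT_TEMPLATE = (
    "Answer the question based on the given context.\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n\n"
    "Please provide the final answer only."
)


def build_prompt(context: str, question: str) -> str:
    """Substitute the benchmark context and question into the template."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
)
```

Because `str.format` only interprets braces in the template itself, contexts that happen to contain `{` or `}` are inserted verbatim.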
## Appendix B Baseline Implementation Details
### B.1 Full-Context Inference
We use full\-context inference as the base\-model baseline\. This baseline directly feeds the benchmark\-provided context and question into the pretrained model and generates the answer without retrieval, prompt compression, or test\-time parameter updates\. For a fair comparison, we use the same model checkpoints, tokenizer, prompt template, context truncation, question truncation, maximum answer length, and decoding strategy as EASE\-TTT\. Specifically, the input context is truncated to at most 32,768 tokens, the question is truncated to at most 1,024 tokens, and the maximum answer length is set to 128 tokens\. We use deterministic decoding for all evaluated models\. No LoRA adapters are inserted, and all model parameters remain unchanged during inference\.
### B.2 Within-Context RAG
We implement Within\-Context RAG as a retrieval\-only baseline that uses the same input context as the original long\-context QA instance and does not access any external corpus\. For a fair comparison with EASE\-TTT, we use the same context preprocessing, tokenizer, truncation limits, prompt template, and decoding settings as our method\. The input context is first truncated to at most 32,768 tokens, and the question is truncated to at most 1,024 tokens\. The maximum answer length is set to 128 tokens, and deterministic decoding is used\.
For retrieval, we segment the truncated context into fixed\-length chunks of 512 tokens\. We then use BM25 to rank these chunks with the question as the retrieval query and select the top 4 chunks as the retrieved context\. The selected chunks are concatenated in their original document order and passed to the base model for answer generation\. Within\-Context RAG does not access any external documents, does not insert LoRA adapters, and does not perform test\-time parameter updates\.
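The retrieval pipeline described above can be sketched in a few lines of pure Python. This is a minimal illustration rather than the implementation used in the experiments: the BM25 variant (an Okapi-style formula with a smoothed IDF), the parameters `k1=1.5` and `b=0.75`, the scoring over unique query terms, and all function names are assumptions, and the toy example uses words where the paper uses tokenizer tokens.

```python
import math
from collections import Counter


def fixed_chunks(tokens, size=512):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def bm25_scores(query_tokens, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with an Okapi-style BM25 formula."""
    n = len(chunks)
    avg_len = sum(len(c) for c in chunks) / n
    doc_freq = Counter()
    for chunk in chunks:
        doc_freq.update(set(chunk))  # number of chunks containing each term
    scores = []
    for chunk in chunks:
        tf = Counter(chunk)
        score = 0.0
        for term in set(query_tokens):  # simplification: unique query terms
            if tf[term] == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            norm = tf[term] + k1 * (1.0 - b + b * len(chunk) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def retrieve_context(context_tokens, question_tokens, size=512, top_k=4):
    """Keep the top-k chunks by BM25, concatenated in original document order."""
    chunks = fixed_chunks(context_tokens, size)
    if not chunks:
        return []
    scores = bm25_scores(question_tokens, chunks)
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [tok for i in sorted(top) for tok in chunks[i]]


# Toy example with 2-token chunks and top-2 retrieval.
context = "the cat sat down the dog ran off a cat slept here".split()
retrieved = retrieve_context(context, ["dog", "cat"], size=2, top_k=2)
```

In the toy example the chunk containing "dog" scores highest, yet the output starts with the earlier "the cat" chunk, which illustrates the re-sorting into document order before generation.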
### B.3 ICR
We implement R&R following the original paper and use the official open-source implementation released by the authors. R&R combines reprompting and in-context retrieval (ICR) to improve long-context question answering performance. Following the original setup, documents are divided into page-level segments, and the model first performs retrieval by identifying the top-$k$ most relevant pages for the given question before conducting a second QA step on the abbreviated context. Following the default configuration in the original work, we retrieve the top $k=5$ pages during the ICR stage. During reprompting, reminder instruction blocks are periodically inserted throughout the long context to mitigate the lost-in-the-middle effect by reducing the distance between relevant evidence and task instructions. Specifically, reminder prompts are inserted approximately every $r=10\text{k}$ tokens following the implementation described in the original paper. The retrieval stage uses the same two-stage retrieval-and-answering pipeline as the original implementation, where the first LLM call retrieves relevant page IDs and the second LLM call performs QA on the abbreviated context constructed from the retrieved pages. Following the original implementation, we use the official prompt templates, retrieval formatting, and hyperparameter settings provided by the authors for all experiments.
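The reprompting step can be pictured with a short sketch. This is not the official R&R code: the function name `insert_reminders` is hypothetical, and inserting a reminder at exact multiples of `every` tokens is an assumption, whereas the original implementation inserts reminders only approximately every 10k tokens and may align them with page boundaries.

```python
def insert_reminders(context_tokens, reminder_tokens, every=10_000):
    """Interleave a reminder block between consecutive `every`-token segments.

    No reminder is placed before the first segment or after the last one.
    """
    out = []
    for start in range(0, len(context_tokens), every):
        if start > 0:
            out.extend(reminder_tokens)
        out.extend(context_tokens[start:start + every])
    return out


# Toy example: a reminder after every 2 tokens of a 5-token context.
reprompted = insert_reminders(list(range(5)), ["<REMINDER>"], every=2)
```

With the default of 10,000 tokens, a context truncated to 32,768 tokens would receive three reminder blocks under this sketch.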
### B.4 qTTT
We implement qTTT following the original paper and use the official open-source implementation released by the authors. Following the original setup, qTTT performs lightweight test-time adaptation on the query projection modules using LoRA adapters rather than updating the full model parameters. During inference, the key and value projections remain frozen, allowing the model to reuse the precomputed KV cache without recomputing full-context representations. Following the default configuration in the original work, qTTT performs $N_{\text{qTTT}}=32$ gradient update steps during inference using randomly sampled spans of length $k=128$ tokens, with a learning rate of $1\times 10^{-5}$. Test-time optimization is applied only to the query-side attention parameters while all remaining model weights stay frozen. The adaptation objective follows the standard next-token prediction loss computed over sampled context spans, using the optimization procedure and default hyperparameter settings provided in the original implementation. Following the motivation of qTTT, this adaptation strategy is designed to mitigate attention score dilution in long-context reasoning by improving retrieval of relevant context tokens during inference while preserving efficient KV-cache reuse.
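The span-sampling part of this procedure can be sketched as follows. The helper `sample_spans` is hypothetical and not taken from the qTTT release; drawing one span per update step uniformly at random, and the use of a fixed seed, are assumptions made for illustration.

```python
import random


def sample_spans(n_tokens, span_len=128, n_steps=32, seed=0):
    """Draw one (start, end) span per update step, end exclusive.

    Assumption: spans are sampled uniformly over all valid start positions.
    """
    rng = random.Random(seed)
    if n_tokens <= span_len:
        # Context shorter than one span: reuse the whole context each step.
        return [(0, n_tokens)] * n_steps
    max_start = n_tokens - span_len
    starts = [rng.randrange(max_start + 1) for _ in range(n_steps)]
    return [(s, s + span_len) for s in starts]


spans = sample_spans(n_tokens=32_768)
```

Each sampled span would then supply the tokens for one next-token prediction update of the query-side LoRA parameters, while keys and values stay cached.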
Table 2: Efficiency comparison on three profiled LongBench tasks using Qwen3-1.7B. Time is measured in seconds and memory is measured in GB.
### B.5 Evidence Selection
Table 3: Effect of evidence source on Qwen3-1.7B. Scores are reported on three LongBench QA tasks.
Table [3](https://arxiv.org/html/2606.06906#A2.T3) examines the source of evidence used to construct the attention target. Utility-based selection slightly but consistently improves over BM25 across all three tasks. This suggests that BM25 can retrieve useful lexical matches, while the proposed utility score provides a more task-aligned signal for selecting evidence chunks. Since the utility score measures how much a chunk improves question modeling, it is better aligned with the downstream adaptation objective than purely lexical retrieval. The consistent gains support our use of utility-based evidence selection for evidence-guided test-time training.