Can LLMs Take Retrieved Information with a Grain of Salt?

arXiv cs.CL Papers

Summary

This paper investigates how large language models adapt to the certainty of retrieved information, identifying systematic limitations in handling uncertainty. It proposes an interaction strategy that reduces obedience errors by 25% without modifying model weights.

arXiv:2605.06919v1 Announce Type: new Abstract: Large language models have demonstrated impressive retrieval-augmented capabilities. However, a crucial area remains underexplored: their ability to appropriately adapt responses to the certainty of the retrieved information. It is a limitation with real consequences in high-stakes domains like medicine and finance. We evaluate eight LLMs on their context-certainty obedience, measuring how well they adjust responses to match expressed context certainty. Our analysis reveals systematic limitations: LLMs struggle to recall prior knowledge after observing an uncertain context, misinterpret expressed certainties, and overtrust complex contexts. To address these, we propose an interaction strategy combining prior reminders, certainty recalibration, and context simplification. This approach reduces obedience errors by 25% on average, without modifying model weights, demonstrating the efficacy of interaction design in enhancing LLM reliability. Our contributions include a principled evaluation metric, empirical insights into LLMs' uncertainty handling, and a portable strategy to improve context-certainty obedience across diverse LLMs.

# Can LLMs Take Retrieved Information with a Grain of Salt?
Source: [https://arxiv.org/html/2605.06919](https://arxiv.org/html/2605.06919)
###### Abstract

Large language models have demonstrated impressive retrieval\-augmented capabilities\. However, a crucial area remains underexplored: their ability to appropriately adapt responses to the certainty of the retrieved information\. It is a limitation with real consequences in high\-stakes domains like medicine and finance\. We evaluate eight LLMs on their context\-certainty obedience, measuring how well they adjust responses to match expressed context certainty\. Our analysis reveals systematic limitations: LLMs struggle to recall prior knowledge after observing an uncertain context, misinterpret expressed certainties, and overtrust complex contexts\. To address these, we propose an interaction strategy combining prior reminders, certainty recalibration, and context simplification\. This approach reduces obedience errors by 25% on average, without modifying model weights, demonstrating the efficacy of interaction design in enhancing LLM reliability\. Our contributions include a principled evaluation metric, empirical insights into LLMs’ uncertainty handling, and a portable strategy to improve context\-certainty obedience across diverse LLMs\.

Correspondence to: Behzad Shayegh \(behzad\.shayegh@rbc\.com\)

## 1 Introduction

Large language models \(LLMs\) have become indispensable tools across various domains, revolutionizing natural language processing tasks\. One of their critical applications is retrieval\-augmented question answering \(RAQA; Salemi and Zamani, [2025](https://arxiv.org/html/2605.06919#bib.bib17)\), where LLMs generate responses based on retrieved documents\.

While significant progress has been made in this area, a key aspect remains underexplored: how LLMs react to uncertainty in the retrieved information\. This gap is particularly consequential in real\-world applications where information often comes with inherent ambiguity or conflicting sources\. For example, in medical diagnosis, if an LLM retrieves conflicting research studies about a treatment’s efficacy, its ability to acknowledge uncertainty could prevent overconfident recommendations that risk patient safety\. Similarly, in financial reporting, when retrieved market data includes speculative projections or unverified rumors, an LLM’s capacity to consider uncertainty could mitigate misleading investment advice\. Since assessing the certainty of the retrieved information is well studied \(Culotta and McCallum, [2004](https://arxiv.org/html/2605.06919#bib.bib26); Ngartera et al\., [2026](https://arxiv.org/html/2605.06919#bib.bib25)\), enabling LLMs to react to such uncertainties could be the next step towards making LLM\-based systems more robust\.

In this work, we investigate the capability of LLMs to adapt to expressed context certainty, a concept we term context\-certainty obedience\. We consider certainty scores as undistorted probabilities reflecting the validity of retrieved contexts \(see Footnote 1\)\. Interpreting undistorted probabilities is critical when integrating LLMs into larger systems, where such probabilities can be supplied\. This ensures reliable decision\-making in scenarios requiring precise uncertainty consideration\.

Footnote 1: Undistorted probabilities are statistically rigorous estimates, uninfluenced by psychological tendencies, miscommunications, or exaggerations\. Outside the scope of this work, expressed probabilities may be distorted, particularly when provided by humans, due to cognitive biases such as underestimating high probabilities or the certainty effect\.

In practice, certainty scores can be derived from various sources: calibrated retriever confidence scores \(Cohen et al\., [2021](https://arxiv.org/html/2605.06919#bib.bib27)\), multi\-document agreement \(Biswas and Uzuner, [2026](https://arxiv.org/html/2605.06919#bib.bib28)\), source authority signals \(Lee et al\., [2026](https://arxiv.org/html/2605.06919#bib.bib29)\), or uncertainty quantification modules \(Perez\-Beltrachini and Lapata, [2025](https://arxiv.org/html/2605.06919#bib.bib30)\)\. Our work assumes these signals are available and addresses the complementary question: can LLMs interpret and act on them appropriately?

We begin by introducing a formal definition of context\-certainty obedience, grounded in the marginalization of conditional probability distributions\. Under this definition, an LLM is evaluated by comparing its output distribution against a linear combination of its prior distribution \(context\-independent\) and the context\-based distribution \(strict context adherence\), weighted by the certainty factor\. Models are expected to disregard 0%\-certain contexts and strictly adhere to 100%\-certain contexts\. This capability is critical for maintaining accuracy and reliability, as it implies ignoring noisy contextual information and overriding outdated or incorrect knowledge with verified information\.

The above formulation brings us close to studies on autonomous conflict resolution \(Jin et al\., [2024](https://arxiv.org/html/2605.06919#bib.bib24); Huang et al\., [2025](https://arxiv.org/html/2605.06919#bib.bib20); Zhang et al\., [2025](https://arxiv.org/html/2605.06919#bib.bib21); Dai et al\., [2026](https://arxiv.org/html/2605.06919#bib.bib22); Bi et al\., [2026](https://arxiv.org/html/2605.06919#bib.bib23)\), which develop models that internally resolve conflicts between parametric and retrieved knowledge\. While related, our work investigates a fundamentally different problem: Given an external certainty score for retrieved knowledge, would LLMs adhere to that score? Specifically, we want the models to follow the external signal rather than internally estimate the context’s reliability, which may contradict the expressed certainty\. While conflict resolution studies are evaluated by measuring accuracy or other QA success metrics, our study is independent of the ground truth answer and measures “context\-certainty obedience” without aiming to improve accuracy\.

Subsequently, we apply this evaluation framework to assess how various LLMs, including small and large variants of Llama, Qwen, and Gemma, exhibit context\-certainty obedience\. Our analyses highlight systematic limitations across LLMs: 1\. Models almost always fail to recall their prior responses after encountering contexts, particularly harming performance when contexts are highly uncertain\. In such cases, the ideal behavior, i\.e\., ignoring the context and reverting to prior knowledge, remains unattainable\. 2\. LLMs misinterpret expressed probabilities, distorting their responses to uncertainty\. This mismatch highlights a critical gap in their probabilistic reasoning, undermining reliability in certainty\-sensitive scenarios\. 3\. Models overtrust long, complex contextual information even when certainty is low\. This tendency is especially critical given the common use of LLMs in applications requiring reference to lengthy, intricate information sources\.

To address these limitations, we propose an interaction strategy comprising three steps: 1\. We remind the model of its pre\-context response to reinforce prior knowledge recall; 2\. We recalibrate certainty expressions to align with the model’s probabilistic understanding; and 3\. We simplify the context to reduce complexity\-driven confusion\. We evaluate our approach using eight LLMs with different sizes and families, on the ClashEval benchmark \(Wu et al\., [2024](https://arxiv.org/html/2605.06919#bib.bib3)\)\. Our interaction strategy reduces context\-certainty obedience errors from an average of 0\.52 to 0\.39 \(a 25% reduction\), demonstrating its effectiveness without requiring model retraining\.

In summary, our key contributions are as follows: 1\. we formalize context\-certainty obedience with a grounded evaluation metric; 2\. we highlight current LLMs’ limitations in handling context uncertainty; and 3\. we propose a portable enhancement strategy to improve context\-certainty obedience for any LLM via interaction design\.

## 2 Evaluation Framework

### 2\.1 Notation and Definitions

We consider a retrieval\-augmented question answering task where the retrieved context is assigned an undistorted certainty score and conveys a definite answer to the question\. In this work, we denote the answer to the question conveyed by the retrieved context as $a$ and the certainty of the retrieved context by $c$\. In practice, $c$ would be estimated and supplied by a dedicated certainty estimator \(Cohen et al\., [2021](https://arxiv.org/html/2605.06919#bib.bib27); Lee et al\., [2026](https://arxiv.org/html/2605.06919#bib.bib29); Perez\-Beltrachini and Lapata, [2025](https://arxiv.org/html/2605.06919#bib.bib30)\)\. Additionally, we define a random variable $K \sim \text{Bernoulli}(c)$, where:

- $K=1$ indicates the retrieved context is valid \(i\.e\., aligned with ground truth\),
- $K=0$ indicates the retrieved context is pure noise\.

Critically, “pure noise” here denotes contexts that are not systematically incorrect but rather statistically independent of the truth, meaning they may coincidentally align with it\. Note that $K$ is a diagnostic mathematical construct in our work, not an ontological claim about context validity\.

We denote by $\pi(X) = P(X \mid K=0)$ the LLM’s output distribution over possible responses $X$ without having any contextual information \(prior distribution\), and by $\delta_a(X) = P(X \mid K=1)$ the LLM’s output distribution over possible responses $X$ given certain knowledge of the context conveying answer $a$\. Here, $\delta_a$ is a degenerate distribution with mass concentrated at point $a$, reflecting the certainty of the context and the definiteness of the answer $a$\.

### 2\.2 Ideal Context\-Certainty\-Obedient Behavior

We derive the ideal behavior of a context\-certainty obedient model from the principle of marginalizing conditional probability distributions\. The output distribution of an ideal model over possible responses $X$, given a $c$\-certain context conveying answer $a$, is

$$P_{\text{idl}}(X;c) = P(K{=}0;c)\,P(X \mid K{=}0) + P(K{=}1;c)\,P(X \mid K{=}1) = (1-c)\cdot\pi(X) + c\cdot\delta_a(X) \tag{1}$$

Take\-away: The model’s prior behavior should be \(partially\) overwritten by the context, and the extent of this overwriting should align with the expressed certainty\.
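As an illustration of Eqn\. \(1\), the following minimal sketch mixes a prior with a point mass on the context answer\. The dictionary representation, the function name `ideal_distribution`, and the toy prior are assumptions made for this example only; they are not taken from the paper's implementation\.

```python
def ideal_distribution(prior, answer, c):
    """Mix the prior with a point mass on `answer`, weighted by certainty c (Eqn. 1)."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("certainty c must lie in [0, 1]")
    support = set(prior) | {answer}
    return {
        # (1 - c) * pi(x) + c * delta_a(x)
        x: (1.0 - c) * prior.get(x, 0.0) + (c if x == answer else 0.0)
        for x in support
    }


# Toy prior pi(X); the responses and probabilities are made up for illustration.
prior = {"Paris": 0.7, "Lyon": 0.2, "Nice": 0.1}
ideal = ideal_distribution(prior, "Lyon", 0.5)
```

With `c = 0` the function returns the prior unchanged, and with `c = 1` it returns the point mass on the context answer, matching the two boundary cases described in the text\.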

### 2\.3 Context\-Certainty Obedience Error

We are interested in studying the capability of a given LLM to obey expressed context uncertainty, which we regard as the closeness of the LLM’s output distribution $P$ to the ideal output distribution $P_{\text{idl}}$ described in §[2\.2](https://arxiv.org/html/2605.06919#S2.SS2)\. Formally, we define the context\-certainty obedience error as follows:

$$\epsilon_{\text{obey}}(P) = \int_{0}^{1} \operatorname{d_{TV}}\Big(P(\cdot;c),\, P_{\text{idl}}(\cdot;c)\Big)\, dc \tag{2}$$

$\operatorname{d_{TV}}$, known as total variation distance, is half of the $L_1$ distance between the probability functions:

$$\operatorname{d_{TV}}\Big(P(\cdot;c),\, P_{\text{idl}}(\cdot;c)\Big) = \frac{1}{2}\big\|P(\cdot;c) - P_{\text{idl}}(\cdot;c)\big\|_{1} = \frac{1}{2}\sum_{x}\big|P(x;c) - P_{\text{idl}}(x;c)\big| \tag{3}$$

Eqn\. \([2](https://arxiv.org/html/2605.06919#S2.E2)\) is essentially the area under the $\operatorname{d_{TV}}$ curve\. Note that $\epsilon_{\text{obey}}$ is independent of the context certainty $c$, indicating that it accounts for the model’s ability to adjust its response to any given certainty score, regardless of whether the context is contradictory or consistent with the model’s prior knowledge, or whether it is close or distant from the ground truth\. That is, in this study we assess the model’s obedience to expressed certainty scores, not its accuracy\.
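Eqns\. \(2\) and \(3\) can be sketched numerically as follows\. This is a hedged example, not the authors' code: the trapezoidal rule is an assumption \(the text does not state which quadrature is used\), and the toy prior and the "always answer with the context" model are invented for illustration\.

```python
def tv_distance(p, q):
    """Total variation distance between two dict-based distributions (Eqn. 3)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def obedience_error(certainties, tv_values):
    """Trapezoidal estimate of the area under the d_TV curve (Eqn. 2).

    The trapezoidal rule is an assumption of this sketch.
    """
    points = sorted(zip(certainties, tv_values))
    area = 0.0
    for (c_lo, d_lo), (c_hi, d_hi) in zip(points, points[1:]):
        area += 0.5 * (d_lo + d_hi) * (c_hi - c_lo)
    return area


# Toy setup (all numbers made up): the context answer is "Lyon".
prior = {"Paris": 0.8, "Lyon": 0.2}
point_mass = {"Lyon": 1.0}
sweep = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]


def ideal(c):
    # Eqn. 1: (1 - c) * pi(x) + c * delta_a(x)
    return {x: (1 - c) * prior[x] + c * point_mass.get(x, 0.0) for x in prior}


# A hypothetical model that always outputs the context answer, whatever c is.
always_context = [tv_distance(point_mass, ideal(c)) for c in sweep]
error = obedience_error(sweep, always_context)
```

For this toy model the per\-certainty distance is $0.8\,(1-c)$, which is linear in $c$, so the trapezoidal estimate coincides with the exact integral of 0\.4\.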

#### Approximation\.

Direct computation of $\pi(x)$ and $P(x;c)$, required in Eqn\. \([3](https://arxiv.org/html/2605.06919#S2.E3)\), over all possible responses $x$ is intractable\. Instead, we leverage autoregressive generation to approximate distributions\. Specifically, we generate the context answer $a$ token\-by\-token, recording the model’s predicted probability for every vocabulary token at each generation step\. We then construct a proxy distribution by taking the product of the stepwise probabilities according to the chain rule\. This distribution is over a set of output prefixes, providing a lower bound on $\operatorname{d_{TV}}$, as it evaluates alignments based on partial prefixes\. However, it critically captures the model’s commitment to generating the context answer $a$, which is of paramount interest for our context\-certainty obedience analysis\.
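One plausible reading of this approximation is sketched below\. It assumes that the proxy distribution assigns mass to the full answer and to every prefix that diverges from the answer at some step; the function name `prefix_distribution`, the token strings, and the probabilities are hypothetical, and the paper's actual construction may differ in detail\.

```python
def prefix_distribution(answer_tokens, step_dists):
    """Build a distribution over output prefixes via the chain rule.

    step_dists[t] maps each vocabulary token to P(token | answer_tokens[:t]),
    i.e. the model's next-token distribution while the answer is being forced.
    """
    dist = {}
    reach = 1.0  # probability of having generated answer_tokens[:t] so far
    for t, (gold, step) in enumerate(zip(answer_tokens, step_dists)):
        for token, p in step.items():
            if token != gold:
                # Prefix that follows the answer for t tokens, then diverges.
                dist[tuple(answer_tokens[:t]) + (token,)] = reach * p
        reach *= step.get(gold, 0.0)
    dist[tuple(answer_tokens)] = reach  # probability of the full context answer
    return dist


# Toy two-token answer with made-up next-token probabilities.
answer = ["Ly", "on"]
steps = [{"Ly": 0.5, "Pa": 0.5}, {"on": 0.8, "ra": 0.2}]
proxy = prefix_distribution(answer, steps)
```

Because each step distribution sums to one, the resulting prefix masses also sum to one, so the proxy can be compared to another proxy with the total variation distance directly\.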

Figure 1: Illustration of our interaction strategy\. The original prompt \(question, context, context certainty\) passes through three steps \(get the prior response, simplify the context, recalibrate\), yielding a new RAQA prompt that contains the question, a prior reminder, a simple context, and a recalibrated certainty\.

## 3 Our Interaction Strategy

To enhance context\-certainty obedience in LLMs, we propose an interaction strategy\. This methodology addresses three key limitations observed in §[5](https://arxiv.org/html/2605.06919#S5): struggles with recalling prior knowledge, misinterpretation of probabilities, and over\-reliance on complex contexts\. Our method addresses these challenges through three independent, parallelizable steps \(see Figure [1](https://arxiv.org/html/2605.06919#S2.F1)\):

![Refer to caption](https://arxiv.org/html/2605.06919v1/x1.png)

Figure 2: Recalibration map for Gemma \(v3\.0, 27B\) on the Locations dataset with a prior reminder in the prompt\. The target certainty represents the true certainty of the context, while the expressed certainty is the value inserted in the prompt\.

#### Reminding LLMs of their prior response\.

We first elicit the model’s response to the prompt without contextual information, preserving its prior response\. This prior response will be fed back to the model as a reminder\.

#### Recalibrating expressed certainty\.

For a given LLM, we compute its recalibration mapping as

$$\operatorname{Cal}(c) = \underset{c_0}{\operatorname{argmin}}\;\operatorname{d_{TV}}\Big(P(\cdot;c_0),\, P_{\text{idl}}(\cdot;c)\Big) \tag{4}$$

An example of such a recalibration mapping is illustrated in Figure [2](https://arxiv.org/html/2605.06919#S3.F2)\. During inference, we apply the precomputed recalibration mapping to adjust expressed certainty levels to align with the LLM’s probabilistic interpretation\.

In practice, this recalibration step is computationally efficient at inference as it requires only a simple mapping\. It demands a one\-time empirical study of the specific LLM’s certainty\-response behavior during setup\. Since the same mapping applies to all queries, the initial cost is quickly amortized\.
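The argmin in Eqn\. \(4\) can be sketched as a grid search over candidate expressed certainties\. The example below is a simplification under stated assumptions: it fits the mapping on a single toy sample rather than aggregating over a dataset, and the "measured" model behavior \(an overtrusting model\) is invented for illustration\.

```python
def tv_distance(p, q):
    """Total variation distance between two dict-based distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def ideal_distribution(prior, answer, c):
    """(1 - c) * prior + c * point mass on the context answer."""
    support = set(prior) | {answer}
    return {x: (1.0 - c) * prior.get(x, 0.0) + (c if x == answer else 0.0)
            for x in support}


def recalibration_map(measured, prior, answer, targets):
    """For each target certainty c, pick the expressed certainty c0 whose
    measured output distribution is closest to the ideal one at c."""
    mapping = {}
    for c in targets:
        ideal = ideal_distribution(prior, answer, c)
        mapping[c] = min(measured, key=lambda c0: tv_distance(measured[c0], ideal))
    return mapping


# Made-up behavior of an overtrusting model: probability assigned to the
# context answer "Lyon" when the prompt expresses certainty c0.
grid = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
observed_answer_prob = [0.2, 0.6, 0.8, 0.9, 1.0, 1.0]
prior = {"Paris": 1.0}
measured = {c0: {"Lyon": p, "Paris": 1.0 - p}
            for c0, p in zip(grid, observed_answer_prob)}
cal = recalibration_map(measured, prior, "Lyon", grid)
```

In this toy case a true certainty of 60% is best communicated by expressing 20%, since that is the expressed value at which the hypothetical model places 0\.6 probability on the context answer\.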

We empirically demonstrate the generalizability of the one\-time recalibration mapping through held\-out evaluation: for each category, we compute the mapping using data from all other categories, excluding the target category from the fitting process \(§[5\.3](https://arxiv.org/html/2605.06919#S5.SS3)\)\. This approach ensures the method’s gains properly transfer to unseen QA categories\. Appendix [D\.2](https://arxiv.org/html/2605.06919#A4.SS2) further shows that the performance drop from this domain shift is negligible, confirming that the mappings capture systematic model behavior rather than category\-specific noise\.
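The held\-out protocol amounts to a leave\-one\-category\-out split\. A minimal sketch is given below; the helper name `held_out_splits`, the sample identifiers, and the category names other than Locations are placeholders introduced for this example\.

```python
def held_out_splits(samples_by_category):
    """Yield (target_category, fitting_samples), where fitting_samples pools
    every category except the target one."""
    for target in samples_by_category:
        fitting = [
            sample
            for category, samples in samples_by_category.items()
            if category != target
            for sample in samples
        ]
        yield target, fitting


# Placeholder data: only "Locations" is a category named in the text.
data = {
    "Locations": ["loc-1", "loc-2"],
    "category_B": ["b-1"],
    "category_C": ["c-1", "c-2"],
}
splits = dict(held_out_splits(data))
```

The recalibration mapping used for a category would then be fitted on `splits[category]` only, so no sample from the evaluated category influences its own mapping\.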

#### Simplifying the context\.

To mitigate challenges posed by complex contexts, we first prompt the model to extract a raw answer from the provided context, irrespective of its correctness or associated certainty\. This extracted answer is subsequently reformatted into a standardized, simplified template\. This step directly addresses confusion stemming from verbose or multi\-layered contexts\.

#### Synthesis stage\.

After collecting the outcomes of the above three steps, we form a new prompt for the model that includes a prior\-response reminder, a simplified context, and a recalibrated certainty score for the context \(Figure [1](https://arxiv.org/html/2605.06919#S2.F1)\)\.
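The synthesis stage can be pictured as simple prompt assembly\. In the sketch below, only the field labels \(Question, Prior reminder, Simple context, Recalibrated certainty\) follow Figure 1; the function name `build_prompt`, the sentence template for the simplified context, and the calibration table are hypothetical, and the actual templates are those given in the paper's appendix\.

```python
def build_prompt(question, prior_response, extracted_answer, certainty_pct, calibration):
    """Assemble the synthesis-stage prompt from the three step outputs.

    certainty_pct is the true (target) context certainty in percent;
    calibration maps it to the value that is actually shown to the model.
    """
    recalibrated = calibration[certainty_pct]
    return "\n".join([
        f"Question: {question}",
        f"Prior reminder: {prior_response}",
        # Hypothetical wording for the simplified-context template.
        f"Simple context: The answer is {extracted_answer}.",
        f"Recalibrated certainty: {recalibrated}%",
    ])


# Made-up calibration table on a 20% grid (percent in, percent out).
calibration = {0: 0, 20: 0, 40: 0, 60: 20, 80: 40, 100: 100}
prompt = build_prompt(
    question="What is the capital of France?",
    prior_response="Paris",
    extracted_answer="Lyon",
    certainty_pct=60,
    calibration=calibration,
)
```

Integer percentages are used as dictionary keys here to avoid floating\-point lookups; this is a design choice of the sketch rather than something specified in the text\.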

In practice, with a careful implementation, the entire pipeline can run at a cost similar to that of a single forward pass\. Due to space limitations, we refer the reader to Appendix [B](https://arxiv.org/html/2605.06919#A2) for more details\.

## 4 Experimental Setup

#### LLMs in Our Analysis\.

We evaluate eight open\-weight LLMs spanning three architectural families and a wide range of parameter sizes \(1B to 72B\)\. These models are selected to ensure diversity in scale, training objectives, and quantization strategies\. The specific models included are:

- LLaMA: v3\.3\-70B Instruct \(INT4 quantized via AWQ \(Lin et al\., [2024](https://arxiv.org/html/2605.06919#bib.bib35)\)\) and v3\.2\-3B Instruct \(Llama Team, [2024](https://arxiv.org/html/2605.06919#bib.bib34)\)\.
- Qwen: v2\.5\-72B Instruct \(quantized via AWQ\) and v3\.0\-4B base \(Qwen Team, [2025a](https://arxiv.org/html/2605.06919#bib.bib32), [b](https://arxiv.org/html/2605.06919#bib.bib33)\)\.
- Gemma v3: Instruct variants in 27B, 12B, 3B, and 1B \(Gemma Team, [2025](https://arxiv.org/html/2605.06919#bib.bib31)\)\.

To isolate the inherent token\-probability behavior, we explicitly disable reasoning modes during inference\. This choice avoids conflating our analysis with emergent reasoning capabilities, which introduce additional variables beyond the scope of this study\.

#### Dataset\.

Since our formulation of context\-certainty obedience \(§[2\.3](https://arxiv.org/html/2605.06919#S2.SS3)\) and its associated error metric are independent of ground\-truth certainty scores or responses, we can utilize any \(unlabeled\) RAQA dataset for experimentation\. We adopt the ClashEval dataset \(Wu et al\., [2024](https://arxiv.org/html/2605.06919#bib.bib3)\), which is specifically designed to assess the ability of models to handle conflicting or uncertain information in retrieved contexts by providing both correct and incorrect contexts for different QA categories\. To isolate context\-certainty obedience errors from retrieval failures, we exclude samples where any model fails to retrieve the answer from the context\. This filtering is necessary to obtain clean measurements of the obedience phenomenon\. Further details about the dataset are provided in Appendix [C](https://arxiv.org/html/2605.06919#A3)\.

![Refer to caption](https://arxiv.org/html/2605.06919v1/x2.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x3.png)

Figure 3: Baseline results\. Left: output similarity to the answer in context; Middle: output similarity to the model’s prior distribution; Right: per\-certainty total variation distance \(lower is better\); Table: context\-certainty obedience error \(lower is better\)\. Dashed lines indicate ideal behavior\. Since different models assign different prior probabilities to the context’s answer, i\.e\., $\pi(a)$, their ideal behaviors are unique, shown by different colors\. Because these ideal behaviors are similar, we report their average in subsequent experiments\. Models are split into two plots \(top and bottom\) for clarity\.
#### Metrics\.

Our evaluation metric is $\epsilon_{\text{obey}}$ \(see §[2\.3](https://arxiv.org/html/2605.06919#S2.SS3)\)\. To analyze model behavior, we report three complementary diagnostic curves, each shown as a function of the input certainty score $c$, computed as means across all samples:

- Similarity to context $= 1 - \operatorname{d_{TV}}\Big(P(\cdot;c),\, \delta_a(\cdot)\Big)$
- Similarity to prior $= 1 - \operatorname{d_{TV}}\Big(P(\cdot;c),\, \pi(\cdot)\Big)$
- Deviation from ideal $= \operatorname{d_{TV}}\Big(P(\cdot;c),\, P_{\text{idl}}(\cdot;c)\Big)$ \(see Footnote 2\)

Footnote 2: $\epsilon_{\text{obey}}$ is equivalent to the area under the deviation\-from\-ideal curve \(AUC\)\.

We evaluate every sample across a fixed sweep of certainty scores \(0%, 20%, 40%, 60%, 80%, 100%\), running six inferences per sample to generate the obedience curves and compute $\epsilon_{\text{obey}}$\. We emphasize that our sole evaluation metric is $\epsilon_{\text{obey}}$ and the curves are included only for diagnostic purposes\.
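A short derivation may help in reading the first diagnostic curve\. It uses only the total variation distance as half the $L_1$ distance and the fact that $\delta_a$ is a point mass at $a$; under these definitions the similarity to context reduces to the probability the model assigns to the context answer\.

```latex
\begin{aligned}
\operatorname{d_{TV}}\big(P(\cdot;c),\, \delta_a(\cdot)\big)
  &= \tfrac{1}{2}\Big( \big|P(a;c) - 1\big| + \sum_{x \neq a} P(x;c) \Big) \\
  &= \tfrac{1}{2}\Big( \big(1 - P(a;c)\big) + \big(1 - P(a;c)\big) \Big)
   = 1 - P(a;c), \\
1 - \operatorname{d_{TV}}\big(P(\cdot;c),\, \delta_a(\cdot)\big) &= P(a;c).
\end{aligned}
```

Applying the same identity to the ideal distribution gives an ideal similarity\-to\-context of $(1-c)\,\pi(a) + c$, a straight line from $\pi(a)$ at 0% certainty to 1 at 100% certainty, which is consistent with the ideal behavior depending on each model's $\pi(a)$\.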

#### Prompt Templates\.

All prompt templates used in our experiments are provided in Appendix [H](https://arxiv.org/html/2605.06919#A8)\.

## 5 Experiments and Analysis

This section evaluates the performance of LLMs in terms of context\-certainty obedience and incrementally validates our interaction strategy\. Each step builds on the previous one, demonstrating cumulative performance gains\. Appendix [D\.6](https://arxiv.org/html/2605.06919#A4.SS6) further dissects the contribution of individual components of our interaction strategy\.

### 5\.1 Baseline Analysis

Figure [3](https://arxiv.org/html/2605.06919#S4.F3) compares LLMs’ performance when provided with context\-certainty scores and guidelines\. While larger models exhibit slightly better context\-certainty obedience, all models show weaknesses, motivating deeper analysis\.

### 5\.2 Reminding LLMs of Their Prior Response

![Refer to caption](https://arxiv.org/html/2605.06919v1/x4.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x5.png)

Figure 4: Enhancement by adding a prior reminder\. Gray lines show baseline performance without the reminder\. Layout follows Figure [3](https://arxiv.org/html/2605.06919#S4.F3)\.

The diagnostic curve \(Figure [3](https://arxiv.org/html/2605.06919#S4.F3) \(middle\)\) reveals that LLMs struggle to recover their prior distributions after observing a context \(see Footnote 3\), even when instructed to ignore that context \(i\.e\., 0% certainty\)\.

Footnote 3: This is aligned with Geng et al\. \([2025](https://arxiv.org/html/2605.06919#bib.bib4)\), who show that providing contexts changes the model’s beliefs and state\.

To address this, we propose reminding models of their prior responses \(collected without contextual input\) during inference, hypothesizing that explicit self\-reference mitigates post\-context deviations \(see §[3](https://arxiv.org/html/2605.06919#S3)\)\. Figure [4](https://arxiv.org/html/2605.06919#S5.F4) confirms this significantly improves context\-certainty obedience, especially for low\-certainty contexts\. As exemplified in Figure [4](https://arxiv.org/html/2605.06919#S5.F4) \(table\), the error of Gemma \(v3\.0, 27B\) drops from 0\.48 to 0\.38 and that of Qwen \(v3\.0, 4B\) from 0\.51 to 0\.48\.

Take\-away: Reminding LLMs of their prior responses enables them to ignore unreliable contexts\.

#### Follow\-up:

Does the improvement stem from reminding the model of its own prior response, or from considering any alternative? We test this in Appendix [D\.1](https://arxiv.org/html/2605.06919#A4.SS1) by replacing the prior response with a third\-party alternative in the reminder prompt\. Results show that self\-prior reminders outperform third\-party alternatives: they effectively anchor the model to its original output distribution, making it better at ignoring uncertain contexts\.

### 5\.3 Recalibrating Expressed Certainty

![Refer to caption](https://arxiv.org/html/2605.06919v1/x6.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x7.png)

Figure 5: Enhancement by recalibration\. Gray lines show baseline performance with prior reminder, without the recalibration\. Layout follows Figure [3](https://arxiv.org/html/2605.06919#S4.F3)\.

While prior reminders effectively enable models to ignore uncertain contexts, our diagnostic curve \(Figure [4](https://arxiv.org/html/2605.06919#S5.F4) \(left\)\) reveals persistent context overreliance \(being above the ideal line\)\.

To address this issue, we recalibrate expressed certainty scores \(see §[3](https://arxiv.org/html/2605.06919#S3)\)\. We employ a recalibration mapping with a 20%\-certainty resolution\. To ensure generalizability, we compute a distinct recalibration mapping for each QA category in our dataset using data from all other categories\. In Appendix [D\.2](https://arxiv.org/html/2605.06919#A4.SS2), we show there is minimal performance loss under this domain shift, confirming the mappings reflect generalizable patterns rather than category\-specific noise\.

Results in Figure [5](https://arxiv.org/html/2605.06919#S5.F5) show significant gains by our approach across LLMs, even with the low\-resolution mapping\. For instance, Figure [5](https://arxiv.org/html/2605.06919#S5.F5) \(table\) shows that recalibration reduces the error for Gemma \(v3\.0, 27B\) from 0\.38 to 0\.30 and for Qwen \(v3\.0, 4B\) from 0\.48 to 0\.46\.

Take\-away: The LLMs’ understanding of expressed context certainty is distorted\. This distortion can be robustly countered by recalibration of the expressed certainty\.

### 5\.4 Simplifying the Context

While recalibration largely addresses the over\-trusting issue, the diagnostic curve \(Figure [5](https://arxiv.org/html/2605.06919#S5.F5) \(left\)\) reveals two persistent challenges among smaller LLMs: First, they continue to overtrust uncertain contexts\. Second, they underperform in 100%\-certain scenarios, failing to adhere to context\-derived answers\. These issues suggest smaller models struggle with processing complex contexts \(see Footnote 4\)\.

Footnote 4: In pilot studies, we investigated whether the order of prompt components \(e\.g\., prior reminder, certainty score, context\) influences the context\-complexity challenge\. Results indicated that the observed challenges persisted across tested prompt designs, suggesting the difficulty is inherent rather than order\-dependent\.

To address the above challenges, we propose to add a context\-simplification step: First, we prompt each model to extract the answer from the context, agnostic to its accuracy or certainty \(see Footnote 5\)\. The extracted answer is then embedded into a streamlined template to create a simplified context\.

Footnote 5: This step is error\-free, as all LLMs successfully retrieve answers in the set; see Appendix [C](https://arxiv.org/html/2605.06919#A3) for retrieval rates\.

Context\-simplification results are reported in Figure [6](https://arxiv.org/html/2605.06919#S5.F6)\. The left panel shows that context\-simplification enhances context\-certainty obedience across most LLMs through two mechanisms: First, it enables smaller models to better adhere to context\-derived answers in high\-certainty scenarios, as shown by curves approaching the top at 100% certainty\. Second, it further mitigates context overreliance, lowering curves across certainty levels\. Figure [6](https://arxiv.org/html/2605.06919#S5.F6) \(table\) demonstrates the overall error reduction; e\.g\., from 0\.30 to 0\.28 for Gemma \(v3\.0, 27B\) and from 0\.46 to 0\.35 for Qwen \(v3\.0, 4B\)\.

Take\-away: Reducing context complexity improves context\-certainty obedience\.

![Refer to caption](https://arxiv.org/html/2605.06919v1/x8.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x9.png)

Figure 6: Enhancement by simplifying the context\. Gray lines show baseline performance with prior reminder and recalibration, without the simplification\. Layout follows Figure [3](https://arxiv.org/html/2605.06919#S4.F3)\.

#### Follow\-up:

Do LLMs inherently favor longer, more detailed contexts? In Appendix [D\.3](https://arxiv.org/html/2605.06919#A4.SS3), we show that LLMs tend to rely on full\-length contexts more than on summaries, and on summaries more than on extremely short, simplified versions\. This observation supports the role of context elaboration in context reliance\.

#### Follow\-up:

Given that elaborating contexts biases models toward them, does elaborating prior responses similarly bias models toward priors? Can this counteract context overreliance? In Appendix [D\.4](https://arxiv.org/html/2605.06919#A4.SS4), we show that elaborating on the prior reminder increases LLM reliance on priors on average, but creates sample\-level determinism that strongly favors either the prior or the context, ultimately harming obedience\.

#### Follow\-up:

Can the observations in §[5\.4](https://arxiv.org/html/2605.06919#S5.SS4) be artifacts of self\-conditioning? There is evidence that the improvements indeed stem from better arbitration rather than self\-conditioning: Many models show reduced context\-reliance when expressed certainty is low, even when presented with a self\-simplified context\. This observation contradicts self\-conditioning effects, according to which we expect consistently high reliance on the simplified context regardless of the certainty score\. To isolate this effect, we conducted additional experiments using verified contextual answers instead of model\-extracted ones in Appendix [D\.5](https://arxiv.org/html/2605.06919#A4.SS5)\.

Table 1: Enhancement by our full interaction strategy\. Gray lines show baseline performance without enhancements\. The plot shows per\-certainty total variation distance and the table presents the context\-certainty obedience error\.

### 5\.5 Synthesis: Full Interaction Strategy

Integrating prior reminders, certainty recalibration, and context simplification into a unified pipeline yields significant gains \(Table [1](https://arxiv.org/html/2605.06919#S5.T1)\), especially for the larger models\. On average, our method reduces the context\-certainty obedience errors by 0\.13 \(25%\), showing its practical utility\.
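For reference, the relative reduction follows directly from the average errors quoted in the introduction \(0\.52 before and 0\.39 after applying the strategy\):

```latex
\frac{0.52 - 0.39}{0.52} = \frac{0.13}{0.52} = 0.25 = 25\%
```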

### 5\.6 Additional Analysis

We provide supplementary analyses in the appendix: an ablation study investigating the contribution of individual and combined interaction components \(Appendix [D\.6](https://arxiv.org/html/2605.06919#A4.SS6)\); the impact of domain shift on our recalibration method \(Appendix [D\.2](https://arxiv.org/html/2605.06919#A4.SS2)\); the impact of model self\-confidence on context\-certainty obedience \(Appendix [D\.7](https://arxiv.org/html/2605.06919#A4.SS7)\); and a performance breakdown by the correctness of contexts \(Appendix [D\.8](https://arxiv.org/html/2605.06919#A4.SS8)\)\.

## 6Related Work

Our work addresses a critical gap in retrieval\-augmented question answering \(RAQA\): how well LLMs adapt their responses to expressed certainty of retrieved contexts\. This intersects with or is complementary to several established research areas\.

#### Conflict Resolution in RAG\.

Prior work investigates how LLMs resolve conflicts between parametric knowledge and retrieved information\. Wu et al\. \([2024](https://arxiv.org/html/2605.06919#bib.bib3)\) benchmark this problem by varying context distance from ground truth, reporting context preference and prior preference rates\. We adopt their dataset to examine how LLMs respond to expressed context certainty across varying context–ground\-truth distances\. Additionally, we extend their analysis of self\-confidence and context\-correctness effects in Appendices [D\.7](https://arxiv.org/html/2605.06919#A4.SS7) and [D\.8](https://arxiv.org/html/2605.06919#A4.SS8)\.

Several approaches estimate source reliability to guide conflict resolution\. Huang et al\. \([2025](https://arxiv.org/html/2605.06919#bib.bib20)\) determine which source to prioritize by estimating certainty in both prior and context\. Zhang et al\. \([2025](https://arxiv.org/html/2605.06919#bib.bib21)\) model fact\-level discrepancies between parametric and retrieved knowledge, then apply a self\-thinking process\. Dai et al\. \([2026](https://arxiv.org/html/2605.06919#bib.bib22)\) provide a dataset with four trustworthiness categories \(context\-only, prior\-only, both, or neither\) and train a soft bias allocator to weight sources accordingly\. Other work builds on contrastive decoding \(Li et al\., [2023](https://arxiv.org/html/2605.06919#bib.bib19)\), used to isolate context\-oriented from prior output distributions \(Shi et al\., [2024](https://arxiv.org/html/2605.06919#bib.bib18)\)\. For example, Bi et al\. \([2026](https://arxiv.org/html/2605.06919#bib.bib23)\) leverage entropy scores of the context\-oriented and prior distributions to weight each source\. Jin et al\. \([2024](https://arxiv.org/html/2605.06919#bib.bib24)\) combine contrastive decoding with fact\-aware instruction tuning, and find that stronger retrieval\-augmented LMs rely more on faulty internal memory even when correct evidence is provided\.

Our work differs fundamentally from these approaches: while prior methods focus on *estimating* or *inferring* context certainty under the assumption that such signals are unavailable, we examine scenarios where context certainty is *explicitly provided*\. Specifically, we investigate whether LLMs appropriately respect expressed certainty signals when they conflict with the model’s own internal assessment of context reliability\. This difference in problem formulation means that existing conflict\-resolution methods cannot serve as direct baselines for our evaluation, as they address the orthogonal problem of uncertainty estimation rather than uncertainty adherence\.

#### Certainty calibration in LLMs\.

A complementary line of research examines the calibration of LLMs’ confidence in their own outputs, investigating whether models’ expressed confidence aligns with actual correctness \(Wen et al\., [2024](https://arxiv.org/html/2605.06919#bib.bib16); Liu et al\., [2025](https://arxiv.org/html/2605.06919#bib.bib14); Geng et al\., [2024](https://arxiv.org/html/2605.06919#bib.bib15)\)\. Our work complements this research direction by investigating the inverse problem: rather than calibrating LLMs’ *output* confidence to match correctness, we examine whether LLMs properly interpret and respond to *input* confidence signals\.

#### Probabilistic interpretation by LLMs\.

Prior work has identified that LLMs frequently struggle with probabilistic reasoning and violate basic probability rules \(Freedman and Toni, [2025](https://arxiv.org/html/2605.06919#bib.bib5); Gu et al\., [2025](https://arxiv.org/html/2605.06919#bib.bib6)\)\. These studies highlight that while LLMs can handle probability concepts abstractly, they often fail to generate outputs consistent with specified distributions\. Concurrent to our work, Pournemat et al\. \([2025](https://arxiv.org/html/2605.06919#bib.bib7)\) provide the first comprehensive evaluation of LLMs’ probabilistic reasoning over discrete probability distributions, revealing sensitivity to notation and performance degradation with increasing context length\. Our work extends these findings by examining how LLMs respond to expressed context certainty rather than abstract probabilistic tasks, revealing systematic misinterpretations of uncertainty signals\.

#### In\-context learning and interaction design\.

Recent advances in in\-context learning \(Xie et al\., [2021](https://arxiv.org/html/2605.06919#bib.bib10); Min et al\., [2022](https://arxiv.org/html/2605.06919#bib.bib8); Dai et al\., [2023](https://arxiv.org/html/2605.06919#bib.bib9)\) and prompting strategies like Chain\-of\-Thought \(Wei et al\., [2022](https://arxiv.org/html/2605.06919#bib.bib12)\) and Tree\-of\-Thoughts \(Yao et al\., [2023](https://arxiv.org/html/2605.06919#bib.bib13)\) demonstrate that interaction design can significantly enhance LLM capabilities without retraining\. Our proposed interaction strategy builds on this insight, showing that carefully designed interactions can reduce context\-certainty obedience errors by 25% on average across diverse models\.

## 7Conclusion

This work addresses a critical gap in retrieval\-augmented question answering systems: the ability of LLMs to adapt their responses to the certainty of retrieved information\. We formalize context\-certainty obedience as a principled evaluation framework, grounded in probabilistic marginalization, and demonstrate systematic limitations in how LLMs handle uncertainty across scales and architectures\. Our findings reveal three key challenges: \(1\) LLMs struggle to recall their prior knowledge after encountering contexts, \(2\) they misinterpret expressed probabilities, and \(3\) they overtrust elaborated contexts even when marked as uncertain\.

To mitigate these issues, we propose an interaction strategy that combines prior reminders, certainty recalibration, and context simplification\. Evaluated on the ClashEval dataset, this approach reduces context\-certainty obedience errors from 0\.46–0\.70 \(0\.52 avg\.\) to 0\.28–0\.57 \(0\.39 avg\.\) across eight diverse LLMs, without modifying model weights\. These results underscore the potential of interaction design to enhance LLM reliability in real\-world RAQA systems\.

We discuss limitations and future work in Appendix [E](https://arxiv.org/html/2605.06919#A5)\.

## References

- B\. Bi, S\. Liu, Y\. Wang, Y\. Xu, J\. Fang, L\. Mei, and X\. Cheng \(2026\) Parameters vs\. context: fine\-grained control of knowledge reliance in language models\. In The Fourteenth International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p6.1), [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px1.p2.1)\.
- M\. B\. Biswas and O\. Uzuner \(2026\) Contradiction to consensus: dual perspective, multi source retrieval based claim verification with source level disagreement using LLM\. arXiv preprint arXiv:2602\.18693\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p4.1)\.
- H\. Cao, L\. Wei, M\. Chen, W\. Zhou, and S\. Hu \(2023\) Are large language models good fact checkers: a preliminary study\. arXiv preprint arXiv:2311\.17355\. Cited by: [§D\.8](https://arxiv.org/html/2605.06919#A4.SS8.p2.1)\.
- D\. Cohen, B\. Mitra, O\. Lesota, N\. Rekabsaz, and C\. Eickhoff \(2021\) Not all relevance scores are equal: efficient uncertainty and calibration modeling for deep retrieval models\. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, pp\. 654–664\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p4.1), [§2\.1](https://arxiv.org/html/2605.06919#S2.SS1.p1.4)\.
- A\. Culotta and A\. McCallum \(2004\) Confidence estimation for information extraction\. In Proceedings of HLT\-NAACL 2004: Short Papers, pp\. 109–112\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p2.1)\.
- D\. Dai, Y\. Sun, L\. Dong, Y\. Hao, S\. Ma, Z\. Sui, and F\. Wei \(2023\) Why can GPT learn in\-context? language models secretly perform gradient descent as meta\-optimizers\. In Findings of the Association for Computational Linguistics: ACL 2023, pp\. 4005–4019\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px3.p1.1)\.
- X\. Dai, H\. Hu, Y\. Hua, J\. Li, Y\. Chen, R\. Jin, N\. Hu, and G\. Qi \(2026\) After retrieval, before generation: enhancing the trustworthiness of large language models in retrieval\-augmented generation\. arXiv preprint arXiv:2505\.17118\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p6.1), [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px1.p2.1)\.
- G\. Freedman and F\. Toni \(2025\) Exploring the potential for large language models to demonstrate rational probabilistic beliefs\. arXiv preprint arXiv:2504\.13644\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px2.p1.1)\.
- Gemma Team \(2025\) Gemma 3 technical report\. arXiv preprint arXiv:2503\.19786\. Cited by: [3rd item](https://arxiv.org/html/2605.06919#S4.I1.i3.p1.1)\.
- J\. Geng, F\. Cai, Y\. Wang, H\. Koeppl, P\. Nakov, and I\. Gurevych \(2024\) A survey of confidence estimation and calibration in large language models\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 6577–6595\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px1.p4.1)\.
- J\. Geng, H\. Chen, R\. Liu, M\. H\. Ribeiro, R\. Willer, G\. Neubig, and T\. L\. Griffiths \(2025\) Accumulating context changes the beliefs of language models\. arXiv preprint arXiv:2511\.01805\. Cited by: [footnote 3](https://arxiv.org/html/2605.06919#footnote3)\.
- J\. Gu, L\. Pang, H\. Shen, and X\. Cheng \(2025\) Do LLMs play dice? exploring probability distribution sampling in large language models for behavioral simulation\. In Proceedings of the 31st International Conference on Computational Linguistics, pp\. 5375–5390\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px2.p1.1)\.
- Y\. Huang, S\. Chen, H\. Cai, and B\. Dhingra \(2025\) To trust or not to trust? enhancing large language models’ situated faithfulness to external contexts\. In The Thirteenth International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p6.1), [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px1.p2.1)\.
- Z\. Jin, P\. Cao, Y\. Chen, K\. Liu, X\. Jiang, J\. Xu, Q\. Li, and J\. Zhao \(2024\) Tug\-of\-war between knowledge: exploring and resolving knowledge conflicts in retrieval\-augmented language models\. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING, pp\. 16867–16878\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p6.1), [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px1.p2.1)\.
- N\. Lee, B\. Z\. Li, S\. Wang, W\. Yih, H\. Ma, and M\. Khabsa \(2020\) Language models as fact checkers?\. In Proceedings of the Third Workshop on Fact Extraction and VERification \(FEVER\), pp\. 36–41\. Cited by: [§D\.8](https://arxiv.org/html/2605.06919#A4.SS8.p2.1)\.
- S\. Lee, J\. Back, D\. Jeon, S\. Kwon, M\. Kim, I\. Kang, and J\. Lee \(2026\) From relevance to authority: authority\-aware generative retrieval in web search engines\. In The 64th Annual Meeting of the Association for Computational Linguistics–Industry Track\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p4.1), [§2\.1](https://arxiv.org/html/2605.06919#S2.SS1.p1.4)\.
- X\. L\. Li, A\. Holtzman, D\. Fried, P\. Liang, J\. Eisner, T\. Hashimoto, L\. Zettlemoyer, and M\. Lewis \(2023\) Contrastive decoding: open\-ended text generation as optimization\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 12286–12312\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px1.p2.1)\.
- J\. Lin, J\. Tang, H\. Tang, S\. Yang, W\. Chen, W\. Wang, G\. Xiao, X\. Dang, C\. Gan, and S\. Han \(2024\) AWQ: activation\-aware weight quantization for on\-device LLM compression and acceleration\. Proceedings of machine learning and systems 6, pp\. 87–100\. Cited by: [1st item](https://arxiv.org/html/2605.06919#S4.I1.i1.p1.1)\.
- X\. Liu, T\. Chen, L\. Da, C\. Chen, Z\. Lin, and H\. Wei \(2025\) Uncertainty quantification and confidence calibration in large language models: a survey\. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 2, pp\. 6107–6117\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px1.p4.1)\.
- Llama Team \(2024\) The Llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. Cited by: [1st item](https://arxiv.org/html/2605.06919#S4.I1.i1.p1.1)\.
- S\. Min, X\. Lyu, A\. Holtzman, M\. Artetxe, M\. Lewis, H\. Hajishirzi, and L\. Zettlemoyer \(2022\) Rethinking the role of demonstrations: what makes in\-context learning work?\. arXiv preprint arXiv:2202\.12837\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px3.p1.1)\.
- L\. Ngartera, S\. Nadarajah, and R\. Koina \(2026\) Bayesian RAG: uncertainty\-aware retrieval for reliable financial question answering\. Frontiers in Artificial Intelligence, Volume 8 \- 2025\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p2.1)\.
- L\. Perez\-Beltrachini and M\. Lapata \(2025\) Uncertainty quantification in retrieval augmented question answering\. Transactions on Machine Learning Research\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p4.1), [§2\.1](https://arxiv.org/html/2605.06919#S2.SS1.p1.4)\.
- M\. Pournemat, K\. Rezaei, G\. Sriramanan, A\. Zarei, J\. Fu, Y\. Wang, H\. Eghbalzadeh, and S\. Feizi \(2025\) Reasoning under uncertainty: exploring probabilistic reasoning capabilities of LLMs\. arXiv preprint arXiv:2509\.10739\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px2.p1.1)\.
- Qwen Team \(2025a\) Qwen2\.5 technical report\. arXiv preprint arXiv:2412\.15115\. Cited by: [2nd item](https://arxiv.org/html/2605.06919#S4.I1.i2.p1.1)\.
- Qwen Team \(2025b\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. Cited by: [2nd item](https://arxiv.org/html/2605.06919#S4.I1.i2.p1.1)\.
- A\. Salemi and H\. Zamani \(2025\) LaMP\-QA: a benchmark for personalized long\-form question answering\. arXiv preprint arXiv:2506\.00137\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p1.1)\.
- W\. Shi, X\. Han, M\. Lewis, Y\. Tsvetkov, L\. Zettlemoyer, and W\. Yih \(2024\) Trusting your evidence: hallucinate less with context\-aware decoding\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 2: Short Papers\), pp\. 783–791\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px1.p2.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou, et al\. \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. Advances in neural information processing systems 35, pp\. 24824–24837\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px3.p1.1)\.
- B\. Wen, C\. Xu, R\. Wolfe, L\. L\. Wang, B\. Howe, et al\. \(2024\) Mitigating overconfidence in large language models: a behavioral lens on confidence estimation and calibration\. In NeurIPS 2024 Workshop on Behavioral Machine Learning\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px1.p4.1)\.
- K\. Wu, E\. Wu, and J\. Zou \(2024\) ClashEval: quantifying the tug\-of\-war between an LLM’s internal prior and external evidence\. Advances in neural information processing systems 37, pp\. 33402–33422\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p8.2), [§4](https://arxiv.org/html/2605.06919#S4.SS0.SSS0.Px2.p1.1), [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px1.p1.1)\.
- S\. M\. Xie, A\. Raghunathan, P\. Liang, and T\. Ma \(2021\) An explanation of in\-context learning as implicit Bayesian inference\. arXiv preprint arXiv:2111\.02080\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px3.p1.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\) Tree of thoughts: deliberate problem solving with large language models\. Advances in neural information processing systems 36, pp\. 11809–11822\. Cited by: [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px3.p1.1)\.
- Q\. Zhang, Z\. Xiang, Y\. Xiao, L\. Wang, J\. Li, X\. Wang, and J\. Su \(2025\) FaithfulRAG: fact\-level conflict modeling for context\-faithful retrieval\-augmented generation\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 21863–21882\. Cited by: [§1](https://arxiv.org/html/2605.06919#S1.p6.1), [§6](https://arxiv.org/html/2605.06919#S6.SS0.SSS0.Px1.p2.1)\.

Table 2: Success rate \(%\) of retrieving the answer in the context, regardless of its correctness\.

![Refer to caption](https://arxiv.org/html/2605.06919v1/x10.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x11.png)

Figure 7: Enhancement by our full interaction strategy on unfiltered data\. Gray lines show baseline performance without enhancements\. See Figure [3](https://arxiv.org/html/2605.06919#S4.F3) for the layout description\.

## Appendix ALicenses

For our experiments, we used the ClashEval dataset, which is released under the MIT License \([https://choosealicense\.com/licenses/mit/](https://choosealicense.com/licenses/mit/)\)\. The following is the list of LLMs used in our experiments, along with their licenses:

- •
- •
- •

## Appendix BEfficient Implementation of Interaction Strategy

Naively implementing our interaction strategy requires roughly three forward passes per query:

1. LLM\(question\) → prior answer
2. LLM\(question, long context\) → simplified context
3. LLM\(question, prior answer, simplified context, certainty score\) → output

However, with a careful implementation that reuses KV caches, the entire pipeline can execute in essentially a single forward pass\. First, all three steps can share the same KV cache for the question\. Second, since the model is autoregressive, Step 3 is a continuation of the generation started in Step 1\. Moreover, the original context is processed only in Step 2, and the certainty score only in Step 3\. Therefore, compared with the baseline LLM\(question, long context, certainty score\), the overhead is minimal: it consists of generating the prior answer and extracting the answer from the context to construct the simplified context\.
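The three conceptual steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable interface, the function name `answer_with_strategy`, and all prompt wording are hypothetical, and neither certainty recalibration nor the KV\-cache reuse discussed above is shown.

```python
from typing import Callable

# Hypothetical interface: a function that maps a prompt to a completion.
LLM = Callable[[str], str]


def answer_with_strategy(llm: LLM, question: str, long_context: str, certainty: float) -> str:
    """Run the three conceptual steps; all prompt wording here is illustrative."""
    # Step 1: elicit the prior answer from the question alone.
    prior_answer = llm(f"Question: {question}\nAnswer:")

    # Step 2: simplify the context by extracting the answer it states.
    simplified_context = llm(
        f"Context: {long_context}\nQuestion: {question}\n"
        "State only the answer given in the context:"
    )

    # Step 3: arbitrate between prior and context, given the certainty score.
    final_prompt = (
        f"Question: {question}\n"
        f"Your earlier answer: {prior_answer}\n"
        f"Retrieved answer: {simplified_context} (certainty: {certainty:.0%})\n"
        "Final answer:"
    )
    return llm(final_prompt)
```

Written this way, the naive version makes three separate calls. The single\-pass variant described above would instead share the cached question prefix across the calls, which depends on the serving stack and is left out of this sketch.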

## Appendix CDataset Details

In this work, we analyze QA entries from five distinct categories within the ClashEval dataset: Drug Dosage, News, Wikipedia Dates, Names, and Locations\. \(We exclude entries from the Sports Records category, as their time\-based answers exhibit inconsistent presentation formatting, confounding our analysis\.\) We conduct our evaluation using the concatenated dataset of these categories, comprising 9,377 samples in total\.

![Refer to caption](https://arxiv.org/html/2605.06919v1/x12.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x13.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x14.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x15.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x16.png)

Figure 8:Top to bottom: Gemma \(v3, 12B\), Llama \(v3\.3, 70B\), Llama \(v3\.2, 3B\), Qwen \(v2\.5, 72B\), and Qwen \(v3, 4B\); queried with no reminder, with prior reminder, and with ground\-truth reminder\. Layout follows Figure[3](https://arxiv.org/html/2605.06919#S4.F3)\.
![Refer to caption](https://arxiv.org/html/2605.06919v1/x17.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x18.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x19.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x20.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x21.png)

Figure 9:Top to bottom: Gemma \(v3, 12B\), Llama \(v3\.3, 70B\), Llama \(v3\.2, 3B\), Qwen \(v2\.5, 72B\), and Qwen \(v3, 4B\); queried with original, with simplified \(answer only\), and with summarized \(100\-words\) contexts\. Prior reminders are included and certainties are not recalibrated\. Layout follows Figure[3](https://arxiv.org/html/2605.06919#S4.F3)\.

In §[5](https://arxiv.org/html/2605.06919#S5), we isolate context\-certainty obedience errors from retrieval failures by filtering the ClashEval dataset to retain only samples where all models successfully retrieved the answer from the context, reducing the number of samples to 1,280\. This setup is influenced by the retrieval rates of the evaluated LLMs, shown in Table [2](https://arxiv.org/html/2605.06919#A0.T2)\. Gemma \(v3\.0, 1B\) exhibits a significantly lower retrieval rate than the other models \(35% vs\. 69–82%\), substantially constraining the filtered evaluation set\. Excluding this model would allow retention of 3,975 samples\. We verified through controlled experiments that our findings remain robust to this filtering\. Therefore, we include Gemma \(v3\.0, 1B\) in our main experiments as a representative of extremely small LLMs, broadening our analysis scope\.
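The filtering rule amounts to a set intersection over per\-model retrieval successes, which is why a single weak retriever can shrink the evaluation set so sharply. The sketch below uses made\-up sample ids and hypothetical model names purely for illustration; `retrieved_by_all` is not a function from the paper.

```python
def retrieved_by_all(retrieved: dict[str, set[int]]) -> set[int]:
    """Return ids of samples for which every model retrieved the context answer."""
    if not retrieved:
        return set()
    return set.intersection(*retrieved.values())


# Hypothetical toy data: sample ids each model answered from the context.
retrieved = {
    "large_model": {0, 1, 2, 3, 4, 5},
    "mid_model": {0, 1, 2, 4, 5},
    "tiny_model": {1, 4},
}

kept_with_tiny = retrieved_by_all(retrieved)
kept_without_tiny = retrieved_by_all(
    {name: ids for name, ids in retrieved.items() if name != "tiny_model"}
)
print(sorted(kept_with_tiny), sorted(kept_without_tiny))
```

In this toy case, dropping the weakest model more than doubles the retained set, mirroring the trade\-off described above between evaluation\-set size and model coverage.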

#### Results on Unfiltered Data\.

For completeness, we also evaluate our interaction strategy on unfiltered data \(Figure[7](https://arxiv.org/html/2605.06919#A0.F7)\), wherethe results and conclusions remain consistent\. However, retrieval errors compound the overall performance degradation, particularly for smaller models\. This underscores that while our method effectively addresses context\-certainty obedience, retrieval quality remains a critical bottleneck in real\-world RAQA systems\. The robustness of our findings across both filtered and unfiltered settings validates the generalizability of the proposed interaction strategy\.

## Appendix DAdditional Analysis

In the following analysis, we use a subset of LLMs for each experiment due to limited time and resources\. We make sure that each analysis is backed by adequate experimental evidence to support our conclusions\.

### D\.1Prior Reminder vs\. Alternative Answer

In this part, we answer one specific question: Does the improvement in Figure[4](https://arxiv.org/html/2605.06919#S5.F4)stem from reminding the model of the prior, or is it a general effect of considering any alternative response?

To test this, we replace prior responses with ground\-truth answers to the questions—an extreme case where the alternative response is maximally plausible, as it aligns perfectly with both the question logic and the context\. We report results for five LLMs in Figure[8](https://arxiv.org/html/2605.06919#A3.F8), demonstrating that while this improves performance, it underperforms prior\-reminding, confirming the uniqueness of self\-consistency cues\. The self\-prior’s superiority over the ground\-truth answer is expected as the mechanism relies on the model’s internal belief state rather than objective plausibility\. Supplying the prior response anchors the model to its initial distribution when context certainty is low, a state that Figure[3](https://arxiv.org/html/2605.06919#S4.F3)shows is otherwise forgotten\. This directly reduces obedience error, which our framework measures by comparing the model against an interpolation between “the model’s own prior distribution” and the context\-dedicated output\.
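To make the comparison against an interpolated target concrete, here is a minimal sketch. It assumes, for illustration only, that the context\-dedicated output is a point mass on the answer stated in the context, that distributions are plain dictionaries over answer strings, and that the distance is total variation; the function names and the toy numbers are hypothetical, and the paper's exact metric is defined in its main text.

```python
def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two discrete answer distributions."""
    answers = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in answers)


def ideal_mixture(prior: dict[str, float], context_answer: str, certainty: float) -> dict[str, float]:
    """Weight the context answer by `certainty` and the prior by the remainder."""
    mixed = {a: (1.0 - certainty) * prob for a, prob in prior.items()}
    mixed[context_answer] = mixed.get(context_answer, 0.0) + certainty
    return mixed


# Hypothetical example: the model's prior favors "Paris"; the context says "Lyon".
prior = {"Paris": 0.75, "Lyon": 0.25}
target = ideal_mixture(prior, "Lyon", certainty=0.5)

# A model that forgets its prior and always follows the context.
observed = {"Lyon": 1.0}
error = total_variation(observed, target)
print(target, error)
```

Under these assumptions, a model that ignores its prior at 50% certainty is penalized in proportion to the prior mass it discarded, which is the failure that the prior reminder is meant to counteract.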

![Refer to caption](https://arxiv.org/html/2605.06919v1/x22.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x23.png)

Figure 10: Recalibration mapping trained on in\-category data \(colored\) vs\. trained on out\-of\-category data \(gray\)\. Layout follows Figure [3](https://arxiv.org/html/2605.06919#S4.F3)\.

Take\-away: When context certainty is low, self\-prior reminders anchor the model to its initial distribution, enhancing context\-certainty obedience and uncertain\-context ignorance, and outperforming third\-party alternative responses\.

### D\.2Recalibration Mapping under Domain Shift

Figure [10](https://arxiv.org/html/2605.06919#A4.F10) presents results from the experiment in §[5\.3](https://arxiv.org/html/2605.06919#S5.SS3) under two conditions: \(1\) recalibration mappings trained on in\-category data \(Held out: ×\), and \(2\) mappings trained on out\-of\-category data \(Held out: ✓\)\. The latter represents a domain\-shift scenario\. Performance differences between these conditions are negligible, suggesting the recalibration mappings generalize well across categories\. This finding indicates that computing the mapping once would yield consistent obedience improvements across different domains\.

### D\.3Do LLMs Favor Complex Contexts?

In this section, we investigate whether LLMs inherently favor longer, more detailed contexts\. We test this by prompting the LLM to summarize contexts before certainty\-oriented RAQA\. To isolate the effect of context elaboration, we disable recalibration in this experiment\. If elaboration amplifies believability, models given summarized contexts are expected to show reduced context reliance compared to those with original contexts, yet increased reliance compared to those with simplified contexts\. We show the results for five LLMs in Figure[9](https://arxiv.org/html/2605.06919#A3.F9), confirming this intermediate behavior and supporting the role of context elaboration in context reliance\.

Take\-away: Longer, more\-elaborated contexts increase believability, driving context overreliance\.

![Refer to caption](https://arxiv.org/html/2605.06919v1/x24.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x25.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x26.png)

Figure 11: Top to bottom: Gemma \(v3, 12B\), Llama \(v3\.3, 70B\), and Qwen \(v3, 4B\); queried with no prior reminder, with short prior reminder \(answer only\), and with 100\-word self\-explained prior reminder\. Analysis restricted to samples where the model conveys identical responses regardless of explanation provision, yielding 385, 515, and 74 samples, respectively\. Certainty scores are not recalibrated\. Layout follows Figure [3](https://arxiv.org/html/2605.06919#S4.F3)\.

### D\.4Does Elaborating Prior Reminders Reduce Context Overreliance?

Following our observation in the previous section, we investigate whether elaborating prior reminders similarly biases models toward priors and whether this can counteract context overreliance\. To test this, we augment prior reminders with self\-explained justifications\. We disable recalibration to isolate how explanations affect model determinism, avoiding confounds from certainty adjustment\. For a controlled comparison, we restrict this analysis to cases where models produce the same response with and without explanation\. Results for three LLMs in Figure [11](https://arxiv.org/html/2605.06919#A4.F11) reveal two key findings: \(1\) explained priors reduce context reliance \(lowering the line in the left panel\) and increase prior reliance \(raising the line in the middle panel\), yet \(2\) they may harm context\-certainty obedience\. \(Llama \(v3\.2, 3B\) is excluded from this experiment because there were only 18 cases where the model produced identical responses with and without explanation, making controlled results unreliable\. This is unsurprising given the model’s limited capacity\.\)

These results present an apparent contradiction: in the left and middle plots of Figure [11](https://arxiv.org/html/2605.06919#A4.F11), models with explained\-prior reminders align more closely with the ideal behavior across certainty levels\. However, the right plot and $\epsilon_{\text{obey}}$ indicate worse performance, particularly at 60% and 80% certainty for Gemma and Llama, and at 20% and 40% for Qwen\. This discrepancy arises from differing measurements\. The left two plots use signed averaging, which cancels out opposing errors, creating an illusion of balance\. In contrast, the right plot and $\epsilon_{\text{obey}}$ rely on absolute averaging, exposing the model’s true determinism\. In other words, explained\-prior reminders make the model more deterministic per sample, strongly favoring either the prior or the context, which harms obedience despite improving aggregate balance\. Note that our sole evaluation metric is $\epsilon_{\text{obey}}$, which uses absolute averaging\. The signed\-averaged measures serve only as diagnostic visualizations, revealing directional biases, and should not be treated as evaluation metrics\.
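The difference between signed and absolute averaging can be seen in a small numerical sketch. The per\-sample context\-reliance values below are invented for illustration, and the helper names are hypothetical; they are not the paper's exact quantities.

```python
import numpy as np

# Hypothetical target: at 50% expressed certainty, rely on the context half the time.
ideal = 0.5

# A graded model: every sample leans on the context about half the time.
graded = np.array([0.5, 0.5, 0.5, 0.5])
# A deterministic model: each sample fully commits to the context or to the prior.
deterministic = np.array([1.0, 0.0, 1.0, 0.0])


def signed_error(reliance: np.ndarray, target: float) -> float:
    """Mean signed deviation; opposing per-sample errors cancel out."""
    return float(np.mean(reliance - target))


def absolute_error(reliance: np.ndarray, target: float) -> float:
    """Mean absolute deviation; per-sample errors accumulate."""
    return float(np.mean(np.abs(reliance - target)))


for name, reliance in [("graded", graded), ("deterministic", deterministic)]:
    print(name, signed_error(reliance, ideal), absolute_error(reliance, ideal))
```

Both toy models look perfectly balanced under signed averaging, but only the graded one has zero absolute error, which is the pattern the right\-hand plot exposes.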

Take\-aways: Explaining a response makes it more believable to the LLM on average; self\-explanation may make the LLM more deterministic\.

![Refer to caption](https://arxiv.org/html/2605.06919v1/x27.png) \(a\) Llama \(v3\.2, 3B\)
![Refer to caption](https://arxiv.org/html/2605.06919v1/x28.png) \(b\) Qwen \(v3\.0, 4B\)

Figure 12: Queried with original context, LLM\-simplified context, and gold\-standard simplified context\. Layout follows Figure [3](https://arxiv.org/html/2605.06919#S4.F3)\.

### D\.5Self\-Conditioning Effect

To isolate the effect of context simplification from that of conditioning LLMs on their own generated text, we conduct an ablation experiment\. We replace LLM\-simplified contexts with gold\-standard simplified contexts from our datasets, bypassing the LLM’s text generation step entirely\. Results in Figure [12](https://arxiv.org/html/2605.06919#A4.F12) confirm that simplifying contexts reduces context reliance regardless of the simplification method, especially for uncertain contexts\. At the same time, a self\-conditioning effect is also observable and is more pronounced in high\-certainty cases: context reliance is higher with LLM\-simplified contexts than with gold\-standard simplified contexts\.

Take\-away: The context\-simplification effect is independent of the self\-conditioning effect\. Both effects occur simultaneously, and their combined impact improves context\-certainty obedience\.

### D\.6Ablation Study

Table 3: Ablation study showing the impact of prior reminders, context simplification, and certainty recalibration on model performance, in terms of context\-certainty obedience error \($\epsilon_{\text{obey}}$\), across different LLMs\.

This section evaluates the contribution of each component in our interaction strategy\. Table [3](https://arxiv.org/html/2605.06919#A4.T3) presents context\-certainty obedience errors \($\epsilon_{\text{obey}}$\) across eight LLMs under different combinations of enhancements\.

Prior reminders alone reduce the average error from 0\.52 to 0\.46 \(a 12% improvement\), demonstrating their critical role in recovering prior knowledge\. We exclude configurations applying recalibration or context simplification without prior reminders, as preliminary experiments revealed that prior reminders are critical for enabling the other components to function effectively\. Since including these configurations would require expensive complementary experiments while their failure is predictable, we skip them\.

Notably, recalibration and simplification provide largely orthogonal gains \(the combined gain is close to the sum of the individual gains\), indicating that they address distinct failure modes rather than being overlapping solutions\.

Integrating all three components yields the strongest performance, representing on average a 0\.13 \(25%\) error reduction from baselines across eight LLMs\. The full strategy is particularly effective for larger models: Gemma \(v3\.0, 27B\) improves from 0\.48 to 0\.28 \(42%\); Qwen \(v2\.5, 72B\) from 0\.46 to 0\.31 \(33%\); and Llama \(v3\.3, 70B\) from 0\.48 to 0\.29 \(40%\)\.
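The percentages quoted in this section are relative reductions, i\.e\., the absolute drop divided by the baseline error\. The short check below recomputes them from the reported before/after values; the average pair uses the 0\.52 and 0\.39 averages reported in the conclusion, and the helper name is ours\.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the error reduction as a fraction of the baseline error."""
    return (before - after) / before


# (baseline, enhanced) obedience errors quoted in the paper.
reported = {
    "average over eight LLMs": (0.52, 0.39),
    "prior reminders only (average)": (0.52, 0.46),
    "Gemma (v3.0, 27B)": (0.48, 0.28),
    "Qwen (v2.5, 72B)": (0.46, 0.31),
    "Llama (v3.3, 70B)": (0.48, 0.29),
}

for name, (before, after) in reported.items():
    drop = before - after
    print(f"{name}: {drop:.2f} absolute, {relative_reduction(before, after):.0%} relative")
```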

### D\.7Self\-Confidence Effect

![Refer to caption](https://arxiv.org/html/2605.06919v1/x29.png)

Figure 13: Self\-confidence/context\-certainty heatmap for Gemma \(v3\.0, 12B\)’s total variation distance, without \(left\) and with \(right\) our enhancements\.

We investigate how models’ self\-confidence, defined as the maximum probability in the prior distribution assigned to an answer, i\.e\., $\max_{\hat{a}} \pi(\hat{a})$, interacts with context certainty to influence context\-certainty obedience error\. Results are provided in Figure [13](https://arxiv.org/html/2605.06919#A4.F13)\.
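As a rough illustration of this definition, self\-confidence can be computed as the largest probability in an empirical prior distribution\. The sketch below assumes the prior is estimated from repeated sampled answers to the question alone; that estimation procedure and the function name are assumptions for illustration, not a description of the paper's setup\.

```python
from collections import Counter


def self_confidence(sampled_answers: list[str]) -> float:
    """Largest empirical probability among answers sampled without any context."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    return max(counts.values()) / len(sampled_answers)


# Hypothetical samples from the question-only prompt.
samples = ["500 mg", "500 mg", "500 mg", "250 mg"]
print(self_confidence(samples))
```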

When prompted without our interaction strategy, there is no clear relationship between the model errors and self\-confidence\. However, a linear diagonal pattern emerges when our interaction strategy is employed\. Errors concentrate along the diagonal where context certainty and self\-confidence are comparable\. This reveals a critical insight: the model struggles most when it must arbitrate between conflicting signals of similar strength\.

This self\-confidence effect demonstrates that our enhancement strategy successfully enables models to leverage their prior knowledge calibration\. By reminding models of their prior responses and recalibrating certainty expressions, we help them recognize when their inherent confidence should override uncertain contexts—and vice versa\. The emergent linear relationship validates that the model now uses confidence as a meaningful signal for uncertainty quantification, rather than treating all cases uniformly\.

Take\-away: Our enhancement strategy teaches models to use self\-confidence as a reliable signal for arbitrating between prior knowledge and contextual information, transforming random errors into predictable failures concentrated where signals conflict\.

### D\.8 Context\-Correctness Effect

We decompose our results by context correctness to verify that performance improvements hold across both correct and incorrect contexts\. The latter case is particularly interesting, as it represents scenarios where we want to overwrite the LLM’s training knowledge; for example, due to outdated information or domain\-specific corrections\.

In Figure [14](https://arxiv.org/html/2605.06919#A4.F14), we see that all eight models trust correct contexts more than incorrect ones, a reflection of their fact\-checking behavior \(Lee et al\., [2020](https://arxiv.org/html/2605.06919#bib.bib1); Cao et al\., [2023](https://arxiv.org/html/2605.06919#bib.bib2)\)\. Importantly, our interaction strategy improves performance across both scenarios similarly\. This robustness demonstrates that our method enhances context\-certainty calibration independent of ground\-truth alignment, enabling models to appropriately weight information based on expressed certainty rather than implicit correctness judgments\.

Take\-away: Our interaction strategy improves context\-certainty obedience regardless of whether contexts are correct or incorrect\.
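The decomposition described in this subsection amounts to grouping per-sample errors by whether the supplied context agrees with the ground truth and by whether the enhancements were applied. A minimal sketch, assuming a hypothetical record layout (`context_correct`, `enhanced`, `error`) and made-up toy values rather than results from the paper:

```python
from collections import defaultdict
from statistics import mean


def mean_error_by_group(records):
    """Average error for each (context_correct, enhanced) combination."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["context_correct"], r["enhanced"])].append(r["error"])
    return {key: mean(values) for key, values in groups.items()}


# Toy records (hypothetical values, for illustration only).
records = [
    {"context_correct": True, "enhanced": False, "error": 0.5},
    {"context_correct": True, "enhanced": True, "error": 0.25},
    {"context_correct": False, "enhanced": False, "error": 0.75},
    {"context_correct": False, "enhanced": True, "error": 0.5},
]

summary = mean_error_by_group(records)
for (correct, enhanced), err in sorted(summary.items()):
    print(f"context_correct={correct}, enhanced={enhanced}: {err}")
```

Using the same question set in both the correct and incorrect subsets, as the Figure 14 caption notes, keeps such a comparison fair.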

![Refer to caption](https://arxiv.org/html/2605.06919v1/x30.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x31.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x32.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x33.png)

\(a\) Top to bottom: Gemma \(v3, 27B\), Gemma \(v3, 12B\), Gemma \(v3, 4B\), Gemma \(v3, 1B\)\.

![Refer to caption](https://arxiv.org/html/2605.06919v1/x34.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x35.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x36.png)

![Refer to caption](https://arxiv.org/html/2605.06919v1/x37.png)

\(b\) Top to bottom: Llama \(v3\.3, 70B\), Llama \(v3\.2, 3B\), Qwen \(v2\.5, 72B\), and Qwen \(v3, 4B\)\.

Figure 14: Models queried without \(gray\) and with \(colored\) our enhancements, on the samples with correct \(solid squares\) and wrong \(empty squares\) contextual information\. Here, we look at all samples \(unfiltered\) to keep the correct/wrong split fair: both subsets include the same question set\. Layout follows Figure [3](https://arxiv.org/html/2605.06919#S4.F3)\.

## Appendix E Limitations and Future Work

This work represents a foundational step in formalizing and evaluating context\-certainty obedience in retrieval\-augmented generation\. To enable rigorous theoretical derivations and reveal systematic limitations in current LLMs, we necessarily scoped our framework to short\-answer retrieval settings where context conveys a definite answer\. This deliberate narrowness provides the mathematical clarity required to establish baseline obedience behavior and identify where models fail to respect expressed context certainty\. Extensions to more realistic scenarios, including partially correct contexts, long\-form generation, and distorted human\-expressed confidence signals, represent important directions for future work\. However, incorporating these extensions was not feasible within the current scope, as the paper has already significantly expanded into the appendix\.

A more immediate limitation concerns our methodology for estimating $\epsilon_{\text{obey}}$: it relies on explicit access to output probability distributions and does not extend naturally to reasoning\-heavy models, where non\-deterministic reasoning chains complicate probabilistic modeling\. It also prevents us from extending our analysis to API\-based LLMs where output probability distributions are not available\. Exploring approaches to estimate obedience without direct logit access, and extending evaluation to state\-of\-the\-art scale models, remain important directions for future work\.

Moreover, computational constraints limited the model sizes in our experiments\. We evaluated large models using INT4 quantization via AWQ \(on a single A100 GPU with 80GB memory\)\. However, our core findings generalize to unquantized models across a range of scales \(Gemma 1B–27B\), demonstrating that observed obedience failures are fundamental characteristics rather than quantization artifacts\.

Finally, we note the absence of direct baselines from prior methods\. While prior work on knowledge conflict resolution exists, these methods autonomously generate their own confidence assessments and were not designed to leverage externally provided context\-certainty signals\. Adapting such approaches to accept external confidence scores would require fundamental architectural changes, a contribution substantial enough to warrant independent study\. Given that our paper has already expanded significantly into the appendix, we prioritize establishing the problem formulation, framework, and baseline obedience behavior over extending existing methods\. Developing practical adaptations of prior work to our setting remains valuable future work\.

## Appendix F Compute Resources Required

All experiments were run on a single Nvidia A100 \(80GB\) GPU\. For each of the 8 LLMs tested, we evaluated performance across 7 different certainty scores with short\-length generation\. On average, an LLM required approximately 90 minutes of compute time per certainty score on ClashEval, totaling roughly 630 minutes \(10\.5 hours\) per model\. That said, actual compute time depends on factors including batch size, model size, and CPU availability \(as CPUs were shared across concurrent jobs\), which may vary across experimental configurations\.
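The per-model figure follows directly from the stated averages. The short sketch below reproduces that arithmetic and extrapolates to all eight models; the aggregate GPU-hour total is derived here for illustration and is not a number reported in the paper, and it covers only one pass over the certainty scores per model.

```python
NUM_MODELS = 8
CERTAINTY_SCORES = 7
MINUTES_PER_SCORE = 90  # approximate average stated in the text

minutes_per_model = CERTAINTY_SCORES * MINUTES_PER_SCORE
hours_per_model = minutes_per_model / 60

# Derived extrapolation (not reported in the paper): one pass, all models.
total_gpu_hours = NUM_MODELS * hours_per_model

print(minutes_per_model, hours_per_model, total_gpu_hours)
```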

## Appendix G Broader Impact

This work addresses context\-certainty obedience in large language models, which carries societal implications\. On the positive side, improving LLMs’ ability to appropriately respond to uncertainty could substantially enhance safety in high\-stakes domains such as healthcare, finance, and legal services\. By enabling models to calibrate responses to context certainty, our work could reduce risks of misleading medical advice based on speculative research, overconfident financial recommendations from uncertain data, or definitive legal claims from ambiguous precedents\. However, negative impacts exist\. For example, the mechanisms could be exploited via certainty score manipulation to propagate misinformation\. To mitigate this risk, we recommend developing monitoring mechanisms to detect certainty score manipulation and maintaining human oversight in high\-stakes applications\.

## Appendix H Prompts

Prompt templates for all experimental conditions are detailed in Tables 4 and 5, encompassing prior collection, context extraction, summarization, and main RAQA tasks with and without enhancements\.

Table 4: System prompts organized by purpose and question category\.

| Purpose | Category | Content |
| --- | --- | --- |
| Collect model’s prior response without context | Drug Dosage | You are a QA bot\. Given a question, answer it to the best of your ability\. You will be given a question about drug dosing\. Only respond with a number\. Your output should JUST be numerical and nothing else\. Your answer should be in units of mg\. DO NOT include units in your answer\. If the answer is not floating point, respond only with an integer number and do not put \.0 at the end\. However, if the answer is not integer, keep the necessary digits and only drop the unnecessary 0s from the end\. DO NOT respond with a full sentence\. You MUST respond with a numerical answer and do not refuse to respond\. If you are unsure of the answer, just provide your most reasonable guess\. If the answer has a range of correct values, select the single number that is most likely\. Example Input Format: Question: What is the correct dosage of acetaminophen for infants in mg/kg/dose? Example Response: 10 |
| | News | You are a QA bot\. Given a question, answer it to the best of your ability\. You will be given a question about recent news that has a numerical answer\. Only respond with a number\. Your output should JUST be numerical and nothing else\. If the answer is not floating point, respond only with an integer number and do not put \.0 at the end\. However, if the answer is not integer, keep the necessary digits and only drop the unnecessary 0s from the end\. DO NOT respond with a full sentence\. If you are unsure of the answer, just provide your most reasonable guess\. Example Input Format: Question: How many points did the Cleveland Cavaliers score on March 12, 2024? Example Response: 104 |
| | Wiki Dates | You are a QA bot\. Given a question, answer it to the best of your ability\. You will be given a question on the year in which an event occurred\. Your output should JUST be the year of the event in the format YYYY \(eg\. 1975, 1512\) and nothing else\. DO NOT respond with a full sentence\. If you are unsure of the answer, just provide your most reasonable guess\. Example Input Format: Question: Which year did John Brown’s raid on Harpers Ferry occur? Example Response: 1859 |
| | Names | You are a QA bot\. Given a question, answer it to the best of your ability\. You will be given a question that requires to answer with the name of a person\. Respond with only the name, and no other words\. Use only lowercase letters\. Even if the name contains uppercase \(capital\) letters, replace them with lowercase letters\. DO NOT respond with a full sentence\. If you are unsure of the answer, just provide your most reasonable guess\. Example Input Format: Question: Who was the commanding general of the Union Army during the American Civil War? Example Response: ulysses s\. grant |
| | Locations | You are a QA bot\. Given a question, answer it to the best of your ability\. You will be given a question that requires to answer with the name of a city\. Respond with only the name of the city, and no other words\. Use only lowercase letters\. Even if the name contains uppercase \(capital\) letters, replace them with lowercase letters\. DO NOT respond with a full sentence\. If you are unsure of the answer, just provide your most reasonable guess\. Example Input Format: Question: Which city is the most populous city in the U\.S\. state of California? Example Response: los angeles |
| Collect model’s explained prior response without context | Drug Dosage | You are a QA bot\. Given a question, answer it to the best of your ability\. You will be given a question about drug dosing\. Respond with a number in units of mg, along with your justification or supporting facts\. Write up to 100 words\. If you are unsure of the answer, just provide your most reasonable guess\. If the answer has a range of correct values, select the single number that is most likely\. |
| | News | You are a QA bot\. Given a question, answer it to the best of your ability\. You will be given a question about recent news that has a numerical answer\. Respond with a number, along with your justification or supporting facts\. Write up to 100 words\. If you are unsure of the answer, just provide your most reasonable guess\. |
| | Wiki Dates | You are a QA bot\. Given a question, answer it to the best of your ability\. You will be given a question on the year in which an event occurred\. Your output should be the year of the event in the format YYYY, along with your justification or supporting facts\. Write up to 100 words\. If you are unsure of the answer, just provide your most reasonable guess\. |
| | Names | You are a QA bot\. Given a question, answer it to the best of your ability\. You will be given a question that requires to answer with the name of a person\. Respond with the name, along with your justification or supporting facts\. Write up to 100 words\. If you are unsure of the answer, just provide your most reasonable guess\. |
| | Locations | You are a QA bot\. Given a question, answer it to the best of your ability\. You will be given a question that requires to answer with the name of a city\. Respond with the name of the city, along with your justification or supporting facts\. Write up to 100 words\. If you are unsure of the answer, just provide your most reasonable guess\. |
| Extract answer from context | Drug Dosage | Your job is to retrieve the answer to a question from a provided context\. The answer to a question necessarily exists in the provided context\. You must find the answer in the context, even if it is wrong\. Do not answer the question yourself\. Your answer must be found in or realized by the context\. You must just find and echo the answer that exists in the context\. The question is about drug dosages\. Only respond with a number\. Your output should JUST be numerical and nothing else\. Your answer should be in units of mg\. DO NOT include units in your answer\. If the answer is not floating point, respond only with an integer number and do not put \.0 at the end, even if it is how the answer is presented in the context\. However, if the answer is not integer, keep the necessary digits and only drop the unnecessary 0s from the end\. DO NOT respond with a full sentence\. You MUST respond with a numerical answer and do not refuse to respond\. If the answer has a range of correct values, select the single number that is most likely\. Example Input Format: Question: What is the correct dosage of acetaminophen for infants in mg/kg/dose? Context: For infants, use acetaminophen with 0\.01g dosage\. Example Response: 10 |
| | News | Your job is to retrieve the answer to a question from a provided context\. The answer to a question necessarily exists in the provided context\. You must find the answer in the context, even if it is wrong\. Do not answer the question yourself\. Your answer must be found in or realized by the context\. You must just find and echo the answer that exists in the context\. The question is related to recent news\. Only respond with a number\. Your output should JUST be numerical and nothing else\. If the answer is not floating point, respond only with an integer number and do not put \.0 at the end, even if it is how the answer is presented in the context\. However, if the answer is not integer, keep the necessary digits and only drop the unnecessary 0s from the end\. DO NOT respond with a full sentence\. Example Input Format: Question: How many points did the Cleveland Cavaliers score on March 12, 2024? Context: Cleveland Cavaliers scored 104 points in the last game\. That game happend in 12/03/2024\. Example Response: 104 |
| | Wiki Dates | Your job is to retrieve the answer to a question from a provided context\. The answer to a question necessarily exists in the provided context\. You must find the answer in the context, even if it is wrong\. Do not answer the question yourself\. Your answer must be found in or realized by the context\. You must just find and echo the answer that exists in the context\. The question is related to the year of occurrence for events\. Your output should JUST be the year of the event in the format YYYY \(eg\. 1975, 1512\) and nothing else\. Example Input Format: Question: Which year did John Brown’s raid on Harpers Ferry occur? Context: John Brown’s raid on Harpers Ferry, eighteen fifty\-nine, was happening in … Example Response: 1859 |
| | Names | Your job is to retrieve the answer to a question from a provided context\. The answer to a question necessarily exists in the provided context\. You must find the answer in the context, even if it is wrong\. Do not answer the question yourself\. Your answer must be found in or realized by the context\. You must just find and echo the answer that exists in the context\. The qeustion is related to the names of individuals\. Your output should JUST be a name and nothing else\. Use only lowercase letters\. Even if the name contains uppercase \(capital\) letters or it is how the answer is presented in the context, replace them with lowercase letters\. Example Input Format: Question: Who was the commanding general of the Union Army during the American Civil War? Context: Ulysses S\. Grant, the commanding general of the Union Army, was serving during the American Civil War\. Example Response: ulysses s\. grant |
| | Locations | Your job is to retrieve the answer to a question from a provided context\. The answer to a question necessarily exists in the provided context\. You must find the answer in the context, even if it is wrong\. Do not answer the question yourself\. Your answer must be found in or realized by the context\. You must just find and echo the answer that exists in the context\. The question is related to the names of cities\. Your output should JUST be a city name and nothing else\. Do not add the state or the country\. Use only lowercase letters\. Even if the name contains uppercase \(capital\) letters or it is how the answer is presented in the context, replace them with lowercase letters\. \(for example, los angeles, not Los Angeles, CA\)\. Example Input Format: Question: Which city is the most populous city in the U\.S\. state of California? Context: California has a lot of crowded cities\. The most crowded one is Los Angeles\. Example Response: los angeles |
| Summarize contexts | Drug Dosage | Your job is to summarize the provided context while keeping it informative about the answer to a given question\. The answer to the question necessarily exists in the provided context\. The question is about drug dosages\. The answer to the question is a numerical\. Your summary should keep that intact\. Write up to 100 words\. |
| | News | Your job is to summarize the provided context while keeping it informative about the answer to a given question\. The answer to the question necessarily exists in the provided context\. The question is related to recent news\. The answer to the question is a numerical\. Your summary should keep that intact\. Write up to 100 words\. |
| | Wiki Dates | Your job is to summarize the provided context while keeping it informative about the answer to a given question\. The answer to the question necessarily exists in the provided context\. The question is related to the year of occurrence for events\. The answer to the question is a year\. Your summary should keep that intact\. Write up to 100 words\. |
| | Names | Your job is to summarize the provided context while keeping it informative about the answer to a given question\. The answer to the question necessarily exists in the provided context\. The qeustion is related to the names of individuals\. The answer to the question is a name\. Your summary should keep that name intact\. Write up to 100 words\. |
| | Locations | Your job is to summarize the provided context while keeping it informative about the answer to a given question\. The answer to the question necessarily exists in the provided context\. The question is related to the names of cities\. The answer to the question is a city name\. Your summary should keep that name intact\. Write up to 100 words\. |
| Main task: RAQA with context\-certainty score | Drug Dosage | Your job is to answer questions about drug dosages\. Rely on either your own knowledge or the pieces of retrieved context provided\. Consider the context with the given certainty, which is a percentage, when deciding relying on your own knowledge or the provided context\. For example, if the certainty is 0%, you should ignore the context and NOT rely on it\. If the certainty is 100%, you MUST use the context and your final answer should be according to the context\. If the certainty is between 0 and 100, you should decide about whether to rely on the context or ignore it, based on your judgement and the provided certainty\. Only respond with a number\. Your output should JUST be numerical and nothing else\. Your answer should be in units of mg\. DO NOT include units in your answer\. If the answer is not floating point, respond only with an integer number and do not put \.0 at the end, even if it is how the answer is presented in the context\. However, if the answer is not integer, keep the necessary digits and only drop the unnecessary 0s from the end\. DO NOT respond with a full sentence\. You MUST respond with a numerical answer and do not refuse to respond\. If you are unsure of the answer, just provide your most reasonable guess\. If the answer has a range of correct values, select the single number that is most likely\. Example Input Format: Question: What is the correct dosage of acetaminophen for infants in mg/kg/dose? Example Response: 10 |
| | News | Your job is to answer questions related to recent news\. Rely on either your own knowledge or the pieces of retrieved context provided\. Consider the context with the given certainty, which is a percentage, when deciding relying on your own knowledge or the provided context\. For example, if the certainty is 0%, you should ignore the context and NOT rely on it\. If the certainty is 100%, you MUST use the context and your final answer should be according to the context\. If the certainty is between 0 and 100, you should decide about whether to rely on the context or ignore it, based on your judgement and the provided certainty\. Only respond with a number\. Your output should JUST be numerical and nothing else\. If the answer is not floating point, respond only with an integer number and do not put \.0 at the end, even if it is how the answer is presented in the context\. However, if the answer is not integer, keep the necessary digits and only drop the unnecessary 0s from the end\. DO NOT respond with a full sentence\. If you are unsure of the answer, just provide your most reasonable guess\. Example Input Format: Question: How many points did the Cleveland Cavaliers score on March 12, 2024? Example Response: 104 |
| | Wiki Dates | Your job is to answer questions related to the year of occurrence for events\. Rely on either your own knowledge or the pieces of retrieved context provided\. Consider the context with the given certainty, which is a percentage, when deciding relying on your own knowledge or the provided context\. For example, if the certainty is 0%, you should ignore the context and NOT rely on it\. If the certainty is 100%, you MUST use the context and your final answer should be according to the context\. If the certainty is between 0 and 100, you should decide about whether to rely on the context or ignore it, based on your judgement and the provided certainty\. Your output should JUST be the year of the event in the format YYYY \(eg\. 1975, 1512\) and nothing else\. Example Input Format: Question: Which year did John Brown’s raid on Harpers Ferry occur? Example Response: 1859 |
| | Names | Your job is to answer questions related to the names of individuals\. Rely on either your own knowledge or the pieces of retrieved context provided\. Consider the context with the given certainty, which is a percentage, when deciding relying on your own knowledge or the provided context\. For example, if the certainty is 0%, you should ignore the context and NOT rely on it\. If the certainty is 100%, you MUST use the context and your final answer should be according to the context\. If the certainty is between 0 and 100, you should decide about whether to rely on the context or ignore it, based on your judgement and the provided certainty\. Your output should JUST be a name and nothing else\. Use only lowercase letters\. Even if the name contains uppercase \(capital\) letters or it is how the answer is presented in the context, replace them with lowercase letters\. Example Input Format: Question: Who was the commanding general of the Union Army during the American Civil War? Example Response: ulysses s\. grant |
| | Locations | Your job is to answer questions related to the names of cities\. Rely on either your own knowledge or the pieces of retrieved context provided\. Consider the context with the given certainty, which is a percentage, when deciding relying on your own knowledge or the provided context\. For example, if the certainty is 0%, you should ignore the context and NOT rely on it\. If the certainty is 100%, you MUST use the context and your final answer should be according to the context\. If the certainty is between 0 and 100, you should decide about whether to rely on the context or ignore it, based on your judgement and the provided certainty\. Your output should JUST be a city name and nothing else\. Do not add the state or the country\. Use only lowercase letters\. Even if the name contains uppercase \(capital\) letters or it is how the answer is presented in the context, replace them with lowercase letters\. \(for example, los angeles, not Los Angeles, CA\)\. Example Input Format: Question: Which city is the most populous city in the U\.S\. state of California? Example Response: los angeles |

Table 5: User prompts organized by purpose\. Placeholders in curly braces \(\{\}\) are replaced at runtime\.

| Purpose | Content |
| --- | --- |
| Collect model’s prior response without context \(either explained or not\) | Question: \{\} Answer: |
| Extract answer from context | Question: \{\} Context: \{\} Answer retrieved from the context: |
| Summarize contexts | Question: \{\} Context: \{\} The summarized context: |
| Main task: RAQA with context\-certainty score \| without our enhancements | Question: \{\} Context: \{\} Certainty of the context: \{\}% Given that the certainty of the context is \{\}%, the answer is: |
| Main task: RAQA with context\-certainty score \| with prior reminder \| without context simplification | Question: \{\} Note: Your answer to this question before seeing the context was: \{\} Context: \{\} Certainty of the context: \{\}% Given that the certainty of the context is \{\}%, the answer is: |
| Main task: RAQA with context\-certainty score \| without prior reminder \| with context simplification | Question: \{\} Context: The answer is \{\} Certainty of the context: \{\} Given that the certainty of the context is \{\}%, the answer is: |
| Main task: RAQA with context\-certainty score \| with prior reminder \| with context simplification | Question: \{\} Note: Your answer to this question before seeing the context was: \{\} Context: The answer is \{\} Certainty of the context: \{\}% Given that the certainty of the context is \{\}%, the answer is: |
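To illustrate how the main-task user prompts in Table 5 could be assembled at runtime, here is a minimal sketch. The function name `build_user_prompt` and its parameters are hypothetical, and joining the fields with newlines is an assumption, since the flattened table does not preserve the original line breaks. The sketch always appends `%` to the certainty value. Certainty recalibration is not covered, because its mapping is not specified in this appendix section.

```python
from typing import Optional


def build_user_prompt(
    question: str,
    context: str,
    certainty: int,
    prior_answer: Optional[str] = None,
    context_answer: Optional[str] = None,
) -> str:
    """Fill the Table 5 main-task template.

    prior_answer: if given, adds the prior-reminder note.
    context_answer: if given, replaces the full context with the
        simplified form "The answer is ...".
    """
    lines = [f"Question: {question}"]
    if prior_answer is not None:
        lines.append(
            "Note: Your answer to this question before seeing the context "
            f"was: {prior_answer}"
        )
    if context_answer is not None:
        lines.append(f"Context: The answer is {context_answer}")
    else:
        lines.append(f"Context: {context}")
    lines.append(f"Certainty of the context: {certainty}%")
    lines.append(
        f"Given that the certainty of the context is {certainty}%, the answer is:"
    )
    # Assumption: fields are separated by newlines.
    return "\n".join(lines)


# Example: prior reminder and context simplification both enabled.
prompt = build_user_prompt(
    question="Which year did John Brown's raid on Harpers Ferry occur?",
    context="(full retrieved passage would go here)",
    certainty=70,
    prior_answer="1859",
    context_answer="1859",
)
print(prompt)
```

Leaving `prior_answer` and `context_answer` unset yields the baseline prompt without enhancements.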
