Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation
Summary
This paper proposes a three-regime framework to resolve empirical contradictions in how LLMs handle conflict between training knowledge and new documents, validated across five major models. It distinguishes between parametric strength and uniqueness and demonstrates how task framing and evidence coherence significantly impact model behavior.
View Cached Full Text
Cached at: 05/13/26, 06:14 AM
# A Predictive Framework and Empirical Validation
Source: [https://arxiv.org/html/2605.11574](https://arxiv.org/html/2605.11574)
## Three Regimes of Context\-Parametric Conflict: A Predictive Framework and Empirical Validation
###### Abstract
The literature on how large language models handle conflict between their training knowledge and a contradicting document presents a persistent empirical contradiction: some studies find models stubbornly retain their trained answers\(Longpre et al\.,[2021](https://arxiv.org/html/2605.11574#bib.bib12)\), while others find models readily defer to the document\(Xie et al\.,[2024](https://arxiv.org/html/2605.11574#bib.bib20)\)\. We argue these contradictions dissolve once one recognises that prior experiments have studied three qualitatively distinct processing situations without distinguishing them\. We propose athree\-regime framework: Regime 1 \(single\-source updating, dominant predictor: evidence coherence\), Regime 2 \(competitive integration, dominant predictor: parametric certainty\), and Regime 3 \(task\-appropriate selection, dominant predictor: task knowledge requirement\)\. We formalise a distinction between parametric*strength*\(exposure frequency\) and parametric*uniqueness*\(encoding consistency\), showing empirically that these are orthogonal dimensions \(r=−0\.002r=\-0\.002,p=\.97p=\.97\) with strength as the operative predictor in stable factual domains\. We validate the framework acrossClaude Sonnet 4\.6, GPT\-5\.5, Gemini 2\.5 Flash, Llama 4 Maverick, and DeepSeek V3using 9,970 API calls in three experimental phases: \(1\) a Regime 1 vs\. Regime 2 certainty gradient study on 450 PopQA items \(8,970 calls\), where GEE logistic regression confirms the predicted gradient for all five models \(β=−0\.38\\beta=\-0\.38to−0\.50\-0\.50, allp≤\.013p\\leq\.013, BH\-FDR corrected\); \(2\) a Wikidata edit\-frequency analysis showing strength and uniqueness are independent predictors; and \(3\) a Regime 3 ablation \(1,000 calls\) showing task framing alone flips context\-following from near\-100% \(CK condition\) to 6–71% \(PK condition\), with all five models significant \(p<\.001p<\.001\)\. The certainty gradient in Regime 2 is robust to multinomial outcome modeling, sensitivity analyses for hedging responses, and FDR correction\. We additionally identify a distractor plausibility confound in PopQA\-based studies and demonstrate that manipulation design is a first\-order variable in conflict research\.
## 1 Introduction
Consider two peer\-reviewed studies, both asking whether a language model follows a document that contradicts its training knowledge\.
Longpre et al\. \([2021](https://arxiv.org/html/2605.11574#bib.bib12)\)replace correct answer entities in passages with incorrect ones and find models over\-rely on their training knowledge, ignoring the document approximately 47% of the time across seven models\.Xie et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib20)\)use fluent, well\-written counter\-evidence passages and find models update their answers approximately 96% of the time in single\-source settings\.
The 49\-percentage\-point gap is not noise\. The difference is evidence coherence: entity substitution creates passages that remain internally consistent with the original entity, producing a plausibility signal that models can exploit; fluent generation does not\. When both studies are interpreted as the same processing situation but with different evidence quality, the contradiction dissolves\.
#### What this paper shows
We argue that the conflict literature has produced contradictions because researchers have unknowingly compared findings across three qualitatively distinct processing situations\. We call these three situations*regimes*, and we preview them in Table[1](https://arxiv.org/html/2605.11574#S1.T1)so the rest of the paper reads clearly\.
Table 1:The three regimes at a glance\. Each row describes a distinct processing situation\. The dominant predictor of context\-trust differs across regimes, which is why findings from different regimes cannot be directly compared\. Full development in §[3](https://arxiv.org/html/2605.11574#S3)\.The same pattern of false contradictions repeats across the literature\.Augenstein et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib1)\)find that when only a conflicting document is shown \(Regime 1\), models follow it; but when the same model is first shown supportive evidence and then a contradicting document \(Regime 2\), it stubbornly resists\.Chen et al\. \([2022](https://arxiv.org/html/2605.11574#bib.bib3)\)show that when many supporting passages are retrieved, models stop relying on training knowledge almost entirely: high evidence volume collapses Regime 2 back into Regime 1\.Sun et al\. \([2025a](https://arxiv.org/html/2605.11574#bib.bib16)\)demonstrate that the same conflict produces opposite correct behaviours depending on the task: for a context\-only task the model should follow the document; for a parametric\-only task it should resist\. This is Regime 3, a moderator that can reverse both Regime 1 and Regime 2 predictions\. Each finding is internally valid but not comparable to the others, because they study different regimes\.
Prior unification attempts are incomplete\.Xu et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib21)\)provide a descriptive taxonomy but no predictive theory of which variables dominate per setting\.Sun et al\. \([2025a](https://arxiv.org/html/2605.11574#bib.bib16)\)show task type explains substantial variance but do not account for evidence quality or parametric activation mode\. No prior work simultaneously resolves all four major contradictions\.
#### Contributions
\(1\) Athree\-regime framework\(Table[1](https://arxiv.org/html/2605.11574#S1.T1), §[3](https://arxiv.org/html/2605.11574#S3)\) resolving four contradictions\. \(2\) A formal distinction betweenparametric strengthandparametric uniqueness, with empirical evidence that they are orthogonal \(r=−0\.002r=\-0\.002\) \(§[3\.5](https://arxiv.org/html/2605.11574#S3.SS5)\)\. \(3\) Empirical validation of Regime 2 via GEE logistic regression across five models \(§[4](https://arxiv.org/html/2605.11574#S4)\)\. \(4\) ARegime 3 ablationdirectly validating task\-appropriate selection with 1,000 new API calls \(§[4\.3](https://arxiv.org/html/2605.11574#S4.SS3)\)\. \(5\) Identification of adistractor plausibility confound\(§[4\.4](https://arxiv.org/html/2605.11574#S4.SS4)\)\.
## 2 Background and Related Work
### 2\.1 Conflict taxonomy
Xu et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib21)\)identify three categories: context\-memory conflict \(retrieved context contradicts training knowledge\), inter\-context conflict \(retrieved passages contradict each other\), and intra\-memory conflict \(training knowledge is internally inconsistent\)\. This paper focuses on context\-memory conflict\. Inter\-context conflict reduces to Regime 2 \(§[3\.3](https://arxiv.org/html/2605.11574#S3.SS3)\); intra\-memory conflict is a driver of parametric uniqueness \(§[3\.5](https://arxiv.org/html/2605.11574#S3.SS5)\)\.
### 2\.2 Four contradictions
#### Contradiction 1 \(Longpre vs\. Xie\)
Longpre et al\. \([2021](https://arxiv.org/html/2605.11574#bib.bib12)\): 47% parametric hold with entity substitution\.Xie et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib20)\): 96% context\-following with fluent counter\-memory\. Both are Regime 1; evidence quality explains the gap\.
#### Contradiction 2 \(single\- vs\. multi\-source\)
Augenstein et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib1)\): high context\-following in single\-source \(Regime 1\) but confirmation bias under parametric co\-activation \(Regime 2\)\. This is the regime boundary\.
#### Contradiction 3 \(volume of evidence\)
Chen et al\. \([2022](https://arxiv.org/html/2605.11574#bib.bib3)\): below 3\.6% memorisation with 50–100 passages vs\. Longpre’s single passage\. High volume collapses Regime 2 into Regime 1\.
#### Contradiction 4 \(task type reversal\)
Sun et al\. \([2025a](https://arxiv.org/html/2605.11574#bib.bib16)\): GPT\-5\.2 \(the model in their study, distinct from GPT\-5\.5 in ours\) drops from 89\.2% accuracy on PK tasks to 33\.7% under high\-plausibility conflict, a 55\.5pp loss driven by task\-context mismatch\.
### 2\.3 Prior framework attempts
Xu et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib21)\)provide a comprehensive taxonomy but no quantitative predictions per setting\.Zhang et al\. \([2025](https://arxiv.org/html/2605.11574#bib.bib22)\)propose an evaluation\-methodology taxonomy \(parametric\-only, offline RAG, online RAG\) complementary to ours\.Marjanović et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib14)\)identify that fact dynamicity predicts context resistance better than subject popularity; we test this directly and find a more nuanced picture \(§[3\.5](https://arxiv.org/html/2605.11574#S3.SS5)\)\.
### 2\.4 Additional related work
Several recent works address related aspects of parametric\-contextual arbitration\.Wang et al\. \([2025](https://arxiv.org/html/2605.11574#bib.bib18)\)propose a stage\-wise diagnosis framework \(memory, instruction, reasoning\) that complements our regime taxonomy by specifying where within the generation pipeline conflict resolution fails\.Li et al\. \([2025](https://arxiv.org/html/2605.11574#bib.bib10)\)survey memory operations in LLMs, providing a complementary taxonomy of parametric vs\. contextual representations at the architectural level\.Chen et al\. \([2025](https://arxiv.org/html/2605.11574#bib.bib4)\)study how training\-data properties \(repetition, inconsistency, corpus skew\) shape arbitration preferences, showing that models prefer parametric knowledge for high\-confidence facts and in\-context knowledge for novel items, consistent with our Regime 2 certainty gradient\.Wu et al\. \([2025](https://arxiv.org/html/2605.11574#bib.bib19)\)train explicit multi\-policy arbitration via reinforcement learning, demonstrating that the regime distinction we describe can be operationalised at training time, not just diagnosed post hoc\.Shi et al\. \([2025](https://arxiv.org/html/2605.11574#bib.bib15)\)show that decoding\-level constraints can mask or reveal parametric knowledge independently of prompting, cautioning that evaluation procedures themselves can shift observed PK/CK balance, a concern that reinforces our finding that manipulation design is a first\-order variable\.
## 3 A Three\-Regime Framework
### 3\.1 The suppression baseline
Before any conflict\-specific mechanism engages, the mere presence of context suppresses parametric knowledge, even when context is irrelevant\. Consider a model that correctly answers “Who directedThe Dark Knight?” in a closed\-book setting\. Prepend an unrelated passage about medieval poetry and ask again\.Cheng et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib5)\)document that the trained answer is disregarded in over 60% of such cases across six LLMs and three knowledge domains, regardless of relevance\.
The mechanistic basis is twofold\(Sun et al\.,[2025b](https://arxiv.org/html/2605.11574#bib.bib17)\): \(1\) Copying Head failure, where attention heads fail to retain attended context during generation; and \(2\) Knowledge FFN over\-injection, where feed\-forward modules overwrite context with parametric knowledge\. Both are causally validated\(Sun et al\.,[2025b](https://arxiv.org/html/2605.11574#bib.bib17)\)\.
No regime produces zero suppression\. Our three regimes predict which additional variables modulate the degree of suppression above this floor\.
### 3\.2 Regime 1: Single\-source updating
In Regime 1, a single context source is presented without explicit activation of the model’s competing trained knowledge\.
#### Dominant predictor: evidence quality
Incoherent entity substitution signals implausibility\(Longpre et al\.,[2021](https://arxiv.org/html/2605.11574#bib.bib12)\); fluent counter\-memory does not, yielding 96% context\-following\(Xie et al\.,[2024](https://arxiv.org/html/2605.11574#bib.bib20)\)\. Assertive framing increases context\-following in controlled settings\(Du et al\.,[2024](https://arxiv.org/html/2605.11574#bib.bib7)\), though with near\-zero correlation to utilisation in real\-world documents\(Hagström et al\.,[2025](https://arxiv.org/html/2605.11574#bib.bib8)\)\.
Four of five models show flat context\-follow profiles across certainty tiers in Regime 1 \(0\.967–1\.00 at low, 0\.940–0\.987 at high; Table[6](https://arxiv.org/html/2605.11574#S4.T6)\), confirming that parametric certainty is not the dominant predictor\. Llama 4 Maverick shows a gradient \(0\.947 low, 0\.720 high\) discussed in §[4\.4](https://arxiv.org/html/2605.11574#S4.SS4)\.
### 3\.3 Regime 2: Competitive integration
In Regime 2, both context and the model’s trained knowledge are explicitly present and competing\. This occurs under multi\-source designs\(Xie et al\.,[2024](https://arxiv.org/html/2605.11574#bib.bib20)\), prior\-commitment prompting, or conversational designs where the model commits to its trained answer before facing the conflict\.
#### Dominant predictor: parametric certainty
When parametric memory is co\-activated, the key variable is how strongly the model’s training data encoded the fact\.Marjanović et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib14)\)show that object temporality correlates with context resistance atr=−0\.27r=\-0\.27\(p<10−40p<10^\{\-40\}\), substantially stronger than popularity \(r=−0\.10r=\-0\.10\) or semantic entropy \(r=0\.003r=0\.003, n\.s\.\)\.
Our GEE logistic regression confirms this: each unit increase in log\-popularity reduces context\-following odds by 31–40% \(per\-modelβ=−0\.38\\beta=\-0\.38to−0\.50\-0\.50; Table[4](https://arxiv.org/html/2605.11574#S4.T4)\)\. The gradient is strongest for Llama \(context\-follow drops from 0\.846 at low certainty to 0\.538 at high, a 30\.8pp gradient\) and present in all five models including Gemini \(β=−0\.50\\beta=\-0\.50,p=\.013p=\.013\)\.
#### Mechanistic substrate
Jin et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib9)\)identify attention heads that function preferentially as memory heads \(routing parametric recall\) versus context heads \(routing external evidence\), providing a mechanistic basis for the competitive dynamics\. This is a dominant\-function, not exclusive\-function, distinction\.
#### Inter\-context conflict as Regime 2
When multiple passages conflict, models prefer the passage aligning with parametric memory\(Xie et al\.,[2024](https://arxiv.org/html/2605.11574#bib.bib20); Chen et al\.,[2022](https://arxiv.org/html/2605.11574#bib.bib3)\)\. This is structurally Regime 2: parametric memory is implicitly co\-activated\.
### 3\.4 Regime 3: Task\-appropriate selection
In Regime 3, the task itself determines which source should win\. This regime is a first\-order moderator that can reverse Regime 1 and Regime 2 predictions\.
Sun et al\. \([2025a](https://arxiv.org/html/2605.11574#bib.bib16)\)identify four task types: Knowledge\-Free \(KF\), Contextual Knowledge \(CK\), Parametric Knowledge \(PK\), and Parametric\-Contextual \(PCK\)\. Consider the conflict “The director of Supercock is David Nutter” against the model’s trained belief \(Gus Trikonis\): under CK framing \(“Based on this document…”\), context should win; under PK framing \(“Based on your own knowledge…”\), the trained answer should win\. Same conflict, opposite correct behaviour\.
We validate Regime 3 directly in §[4\.3](https://arxiv.org/html/2605.11574#S4.SS3)\.
### 3\.5 Parametric strength vs\. uniqueness
Prior work has conflated two distinct properties of parametric memory\.
*Parametric strength*is exposure frequency: how often the model encountered the fact during training, operationalised as Wikipedia monthly page views \(s\_pop;Mallen et al\.,[2023](https://arxiv.org/html/2605.11574#bib.bib13)\)\.
*Parametric uniqueness*is encoding consistency: whether the fact has a single canonical representation or multiple competing ones, operationalised as inverse Wikidata edit frequency\(Marjanović et al\.,[2024](https://arxiv.org/html/2605.11574#bib.bib14)\)\.
Marjanović et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib14)\)report that uniqueness outpredicts strength \(β=−0\.08\\beta=\-0\.08vs\.−0\.05\-0\.05\) on their DynamicQA dataset, which includes temporal and disputable facts\. We tested this directly on our PopQA sample by querying Wikidata revision histories for all 450 subject entities and found a strikingly different pattern\.
First, strength and uniqueness are*completely uncorrelated*in our sample \(r=−0\.002r=\-0\.002,p=\.97p=\.97,n=450n=450\)\. Mean edit counts are nearly identical across certainty tiers \(low: 68\.2, medium: 76\.6, high: 69\.9\)\. The two constructs are genuinely orthogonal dimensions, not proxies for the same underlying variable\.
Second, when both are entered as predictors in a joint GEE model \(Table[5](https://arxiv.org/html/2605.11574#S4.T5)\), strength is the sole significant predictor \(pooledβ=−0\.92\\beta=\-0\.92,p<\.001p<\.001\) while uniqueness has no independent predictive power \(β=\+0\.02\\beta=\+0\.02,p=\.893p=\.893\)\.
The divergence fromMarjanović et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib14)\)likely reflects domain differences\. DynamicQA includes temporal facts \(“Who is the current Prime Minister?”\) where edit frequency tracks genuine knowledge instability\. PopQA consists primarily of stable creative\-work properties \(director, genre, author\) where edit frequency reflects Wikipedia editorial activity rather than factual change\. In this domain, how often a model saw the fact matters more than how often the fact was revised\.
We therefore revise the framework: Regime 2 resistance is driven by*parametric certainty*, which decomposes into strength and uniqueness as independent dimensions\. Which dimension dominates depends on the knowledge domain\. For stable factual domains, strength is primary; for dynamic domains, uniqueness may be primary\.
Table 2:Four contradictions in the context\-parametric conflict literature and how the three\-regime framework resolves each\. Each row identifies a pair of studies that appear to contradict each other, describes the apparent conflict, and explains why the contradiction dissolves once regime membership is recognised\.
### 4\.1 Experimental design
#### Dataset
450 items from PopQA\(Mallen et al\.,[2023](https://arxiv.org/html/2605.11574#bib.bib13)\), stratified into three non\-overlapping certainty tiers by Wikipedia monthly page views \(s\_pop\): low \(n=150n=150, s\_pop=27=27–373373\), medium \(n=150n=150, s\_pop=379=379–2,8642\{,\}864\), high \(n=150n=150, s\_pop=2,893=2\{,\}893–492,490492\{,\}490\)\.
#### Distractor construction
A distractor entity is sampled from the same Wikidata property type with fixed seed \(42\) and converted to an assertive context sentence\. Six items witho\_pop/s\_pop<0\.01o\\\_pop/s\\\_pop<0\.01are excluded from v2 analyses\.
#### Models and API access
Claude Sonnet 4\.6 \(Anthropic API\), GPT\-5\.5 \(OpenAI API\), Gemini 2\.5 Flash \(Google AI Studio API\), Llama 4 Maverick \(Meta; 17B active parameters, 128\-expert MoE; DeepInfra FP8 endpoint\), DeepSeek V3 \(DeepSeek AI; Together AI API\)\. All five are instruction\-tuned; we distinguish*closed\-weight commercial*\(Claude, GPT\-5\.5, Gemini\) from*open\-weight*\(Llama, DeepSeek\) where the contrast matters\. All API costs were borne by the author \(approximately USD 95 for 8,970 calls plus USD 15 for the 1,000\-call Regime 3 ablation\)\.
#### Experimental passes \(8,970 calls\)
*Baseline*\(2,250\): bare question\.*Regime 1*\(2,250\): assertive context prepended\.*Regime 2 v1*\(2,250\): single\-turn prior\-commitment assertion\.*Regime 2 v2*\(2,220\): multi\-turn conversation where the model commits in Turn 1 and faces conflict in Turn 2 with its own words in the assistant role\. All calls use temperature 0, no system prompt, and stateless sessions: conversation history is reset between items except in v2, where Turn 1 is explicitly constructed from the baseline response\. Exact prompt templates for all conditions are provided in Table[3](https://arxiv.org/html/2605.11574#S4.T3)\.
#### Statistical methods
Primary analysis: generalised estimating equations \(GEE;Liang and Zeger,[1986](https://arxiv.org/html/2605.11574#bib.bib11)\) with binomial family, exchangeable correlation, item\-level clustering, and log\(s\_pop \+ 1\) as the continuous predictor\. Robustness: multinomialχ2\\chi^\{2\}with three outcome categories \(context/parametric/neither\), sensitivity analyses collapsing “neither” into parametric\-hold \(conservative\) and context\-follow \(liberal\), and Benjamini\-Hochberg FDR correction\(Benjamini and Hochberg,[1995](https://arxiv.org/html/2605.11574#bib.bib2)\)across all per\-model tests\. Cramér’sVV\(Cramér,[1946](https://arxiv.org/html/2605.11574#bib.bib6)\)reported for discretised comparisons \(conventional thresholds:<0\.10<0\.10weak,0\.100\.10–0\.300\.30moderate,\>0\.30\>0\.30strong\)\.
#### Scoring
\+1\+1\(followed context\),0\(held parametric\),−1\-1\(neither\)\. Case\-insensitive substring matching, minimum 6\-character tokens\.
Table 3:Prompt templates used in all experimental conditions\.\[Q\]= the PopQA factual question \(e\.g\., “Who directed Supercock?”\);\[CTX\]= the assertive distractor context sentence \(e\.g\., “The director of Supercock is David Nutter\.”\);\[A1\]= the model’s own closed\-book baseline response from Pass 1\. All API calls use temperature 0, no system prompt, and stateless sessions \(conversation reset between items\)\. R2\-v2 is a two\-turn conversation: the model’s Pass 1 answer is placed in the assistant role before Turn 2 introduces the conflict\.
### 4\.2 Results
Table 4:How strongly does fact popularity predict whether a model follows a contradicting document? This table reports a GEE logistic regression for each model in Regime 2 multi\-turn \(v2\)\. The predictor is log\(Wikipedia page views \+ 1\); the outcome is whether the model followed the distractor \(1\) or held its trained answer \(0\)\. “Neither” responses excluded\. A negativeβ\\betameans more popular facts resist context more\. All five models show significant negative effects; all survive Benjamini\-Hochberg FDR correction atα=\.05\\alpha=\.05\. Exchangeable correlation structure, clustered by item\.Table 5:Are fact popularity \(strength\) and Wikidata edit frequency \(uniqueness\) independent predictors of context resistance? This table enters both into a single GEE model pooled across all five models \(with model fixed effects\)\. Strength \(β=−0\.918\\beta=\-0\.918,p<\.001p<\.001\) is the sole significant predictor\. Uniqueness \(β=\+0\.020\\beta=\+0\.020,p=\.893p=\.893\) adds no predictive power\. The two are uncorrelated in our 450\-item sample \(r=−0\.002r=\-0\.002,p=\.97p=\.97\), confirming they measure genuinely different properties of parametric memory\.Table 6:How often does each model follow a contradicting document when context is presented alone \(Regime 1, no prior commitment\)? Each cell shows the proportion of items where the model adopted the distractor answer\.n=150n=150per cell\. Nearly all cells are blue \(≥\.95\\geq\.95\): models follow context at ceiling regardless of how well they know the fact\. Llama is the exception, resisting context for well\-known facts \(high tier: \.720\)\. Shading:≥\\geq\.95,\.85–\.94,\.70–\.84,\.55–\.69,<<\.55\.Table 7:How often does each model follow a contradicting document when it has already committed to its own answer in a previous turn \(Regime 2, multi\-turn\)? Each cell shows the proportion of items where the model switched to the distractor\. The rightward colour shift from blue to pink shows the certainty gradient: well\-known facts \(high tier\) resist context more\. Gradient \(pp\) = low minus high tier rate\.VV= Cramér’s V effect size\.n=144n=144–150150per cell\.Table 8:Context follow rate when the model is told “you previously stated \[answer\]” in a single turn, then shown a contradicting document \(Regime 2, single\-turn\)\. Open\-weight models \(Llama, DeepSeek\) already show a certainty gradient here\. Closed\-weight models \(Claude, GPT\-5\.5, Gemini\) remain near ceiling, showing that a single\-turn assertion does not create genuine parametric competition for these models\.n=150n=150per cell\.#### Finding 1: Regime matters more than model identity
Show a model a contradicting document by itself \(Regime 1\) and most models accept it almost unconditionally: Claude follows the distractor 99\.1% of the time, GPT\-5\.5 follows 98\.4%, Gemini 97\.6%\. Now get the model to first commit to its own answer and then show the same document \(Regime 2 multi\-turn\), and behaviour changes substantially: Claude drops to 85\.1%, GPT\-5\.5 to 84\.0%\. Same items, same models, same distractors\. Only the regime changed\.
#### Finding 2: Famous facts resist more
Think of a fact every model should know: “Who directedThe Dark Knight?” \(Christopher Nolan, extremely high s\_pop\)\. Now think of an obscure fact: “Who directedSupercock?” \(Gus Trikonis, low s\_pop\)\. In Regime 2 multi\-turn, when the model has already committed to its trained answer and then sees a contradicting document, the obscure fact gets overridden 84\.6% of the time for Llama\. The famous fact? Only 53\.8%\. That is a 31\-point swing driven purely by how well the model knew the answer\. The GEE regression \(Table[4](https://arxiv.org/html/2605.11574#S4.T4)\) confirms this as a continuous effect: each unit increase in log\-popularity reduces context\-following odds by 31–40%\.
#### Finding 3: How you create the conflict matters
Telling a closed\-weight model “you previously saidXX” in a single turn is not enough to create genuine parametric competition\. Having the model literally sayXXin a previous conversation turn creates dramatically different behaviour\. In single\-turn Regime 2, GPT\-5\.5 follows context 98\.2% of the time with no certainty effect\. In multi\-turn Regime 2, it drops to 84\.0% with a clear gradient \(94\.6% low, 71\.0% high\)\. Identical items, identical model; only the conversation structure changed \(Table[9](https://arxiv.org/html/2605.11574#S4.T9)\)\.
#### Finding 4: Gemini is an outlier
Gemini follows context at near\-ceiling under all Regime 2 conditions \(V=\.102V=\.102, 1\.5pp gradient\)\. This might look like Gemini*cannot*resist context\. But our Regime 3 ablation \(§[4\.3](https://arxiv.org/html/2605.11574#S4.SS3)\) shows otherwise: under PK framing, Gemini drops to 11\.8% context\-following\. Gemini*can*resist; it simply defaults to deference unless explicitly instructed otherwise\. \(A methodological note: Gemini’s tiered gradient is minimal,V=\.102V=\.102, but its continuous GEE coefficient is significant,β=−0\.50\\beta=\-0\.50,p=\.013p=\.013\. The discrepancy arises because the continuous predictor retains more statistical power than the three\-level discretisation; under multinomial treatment with three outcome categories, Gemini’s effect becomes marginal atp=\.094p=\.094\. We therefore treat Gemini’s Regime 2 gradient as suggestive but not robust\.\)
Table 9:Baseline parametric accuracy \(how often each model answers correctly with no document at all\) and overall Regime 2 context follow rate under single\-turn \(v1\) vs\. multi\-turn \(v2\) manipulation\.Δ\\Delta= v2 minus v1: negative values mean the multi\-turn design reduced context\-following \(revealing parametric resistance that was hidden in the single\-turn design\)\.n=450n=450per model\.
### 4\.3 Regime 3 ablation
To directly validate Regime 3, we ran a new experiment: 100 PopQA items \(stratified: 34 low, 33 medium, 33 high certainty\), each tested under two task framings across all five models \(1,000 API calls, seed 123\)\.
*CK condition*\(context should win\): “Based ONLY on the following document, answer the question\. Do not use any prior knowledge\. Only use what the document says\. Document: \[distractor\]\. Question: \[Q\]\.”
*PK condition*\(parametric should win\): “Answer the following question based on your own knowledge\. A document is shown below, but it may contain errors\. Use your own knowledge, not the document\. Document: \[distractor\]\. Question: \[Q\]\.”
Both conditions receive the same distractor sentence\. The only change is the instruction framing\.
Table 10:What happens when you tell the model which source to trust? Same 100 items, same five models, same distractor document\. Only the instruction changes\. CK \(“Based ONLY on this document”\) = context should win\. PK \(“Based on your own knowledge”\) = trained answer should win\. CK rate = proportion following the distractor under CK framing\. PK rate = proportion following the distractor under PK framing\.Δ\\Delta= CK minus PK\. All five models show a massive swing \(p<\.001p<\.001\), confirming task framing as the single most powerful variable in the study\.#### Finding 5: Task framing is the single most powerful variable
Under CK framing \(“use the document”\), every model follows the distractor essentially 100% of the time, for every fact, at every certainty level\. Under PK framing \(“use your own knowledge”\), the same models on the same items drop to 6–71% context\-following \(Table[10](https://arxiv.org/html/2605.11574#S4.T10)\)\. Nothing else in this study produces a swing this large\. Evidence quality, parametric certainty, manipulation strength: all produce effects in the 1–31 percentage point range\. Task framing produces a 29–94 percentage point swing\. Regime 3 is not just a moderator; it is the dominant variable in the entire framework\.
To put this concretely: ask Claude “Based on this document, who directed Supercock?” and it answers “David Nutter” \(the distractor\) 100% of the time, even for high\-certainty items\. Ask the same model “Based on your own knowledge, who directed Supercock?” with the same document visible, and it answers “Gus Trikonis” \(its trained answer\) 92\.4% of the time\. Same model\. Same item\. Same distractor in full view\. Only the instruction changed\.
#### Finding 6: The certainty gradient composes with task framing
Under PK framing, the certainty gradient from Regime 2 reappears: models resist context more for famous facts than obscure ones\. Claude’s PK context\-follow rate drops from 11\.8% \(low certainty\) to 3\.0% \(medium\) to 0\.0% \(high\)\. For Llama: 55\.9% \(low\) to 39\.4% \(medium\) to 27\.3% \(high\)\. The regime predictions compose: Regime 3 determines the direction \(follow or resist\), while parametric certainty modulates the magnitude within PK\-type tasks\. Under CK framing, no certainty gradient exists because the ceiling effect eliminates all variance\.
Table 11:Does the certainty gradient persist under PK task framing? This table shows the proportion of items where each model followed the distractor under PK instructions \(“use your own knowledge”\), broken down by certainty tier\. CK rates omitted because all are≥\.91\\geq\.91\(ceiling\)\. Under PK framing, every model resists context more for well\-known facts \(high tier\) than obscure ones \(low tier\)\.Δ\\Delta= low minus high: the size of the certainty gradient within PK framing\.
#### Finding 7: The closed\-weight vs\. open\-weight split reappears
Under PK framing, closed\-weight models \(Claude, GPT\-5\.5, Gemini\) drop to 6–12% context\-following: they are very good at following the “use your own knowledge” instruction\. Open\-weight models \(Llama, DeepSeek\) drop to only 51–71%: they partially follow the instruction but are much less compliant\. The same architectural split that appeared in the v1\-vs\-v2 manipulation contrast reappears in the CK\-vs\-PK task contrast\. This suggests that whatever instruction tuning makes closed\-weight models defer to context by default also makes them better at following instructions to*resist*context when told\.
### 4\.4 Secondary analyses
#### Inverted Dunning\-Kruger
We predicted that items the model answered incorrectly in the baseline would resist context most strongly \(overconfident wrong answers\)\. The reverse is true: items where Llama was*correct*in the baseline are the ones it most stubbornly defends\. In Regime 2 v1, wrong\-baseline items follow context at 0\.830 \(n=229n=229\) while correct\-baseline items follow at only 0\.566 \(n=221n=221\), a 26\.4pp difference \(p<\.001p<\.001\)\. This makes sense under the framework: correctly\-answered items are correct precisely because the fact is well\-encoded, and well\-encoded facts resist context \(Table[13](https://arxiv.org/html/2605.11574#S4.T13)\)\.
#### Distractor plausibility
Llama is the only model showing a certainty gradient in Regime 1, where the framework predicts a flat profile\. Manual inspection of Llama’s 60 Regime 1 resistance items reveals two overlapping causes\. The majority are items where Llama’s baseline was correct, reflecting genuine parametric resistance strong enough to compete with context even without explicit co\-activation\. A minority involve temporally implausible distractors \(e\.g\., Shamshi\-Adad I, an Assyrian king from c\. 1800 BCE, as the father of Rand Paul; John Farrow, who died in 1954, as director ofBreaking Bad, which aired 2008–2013\)\. Conservatively excluding all 60 items, the Regime 2 v2 gradient attenuates but remains significant \(V=\.156V=\.156,χ2=11\.80\\chi^\{2\}=11\.80,p=\.003p=\.003\)\.
#### DeepSeek hedging
DeepSeek produces “neither” responses 25\.5% of the time in R2\-v2 \(other models: 0\.7–5\.9%\)\. This is uniformly distributed across certainty tiers \(χ2=2\.03\\chi^\{2\}=2\.03,p=\.363p=\.363\) and does not confound the gradient\. The certainty gradient survives multinomial analysis with three outcome categories \(χ2\(4\)=35\.24\\chi^\{2\}\(4\)=35\.24,p<\.001p<\.001,V=\.199V=\.199\) and both sensitivity treatments \(neither→\\toparametric:V=\.172V=\.172,p=\.001p=\.001; neither→\\tocontext:V=\.281V=\.281,p<\.001p<\.001\)\.
Table 12:Does the Regime 2 certainty gradient survive when “neither” responses \(hedging, refusal\) are handled differently? Four columns show Cramér’s V under four treatments\.Original: neither excluded\.Multinomial: neither treated as a third outcome category\.Neither as parametric: hedging counted as resisting context\.Neither as context: hedging counted as following context\. Four of five models are robust across all four treatments\. Gemini’s gradient becomes non\-significant under the multinomial and conservative treatments\.Table 13:Do models defend wrong answers more stubbornly than correct ones? This table splits Regime 2 v1 items by whether the model’s closed\-book baseline was wrong or correct, then reports the context\-follow rate for each group\. If wrong answers were overconfident, they should resist context more \(positiveΔ\\Delta\)\. The reverse is true: correct\-baseline items resist more, because they are correct precisely because the fact is well\-encoded\. Only Llama shows a statistically significant gap\.
## 5 Discussion
### 5\.1 Manipulation design as a first\-order variable
Under single\-turn Regime 2, Claude and GPT\-5\.5 follow context at 0\.942 and 0\.982 overall with no visible certainty gradient\. Under multi\-turn Regime 2, these drop to 0\.851 and 0\.840, revealing significant gradients \(bothV\>\.24V\>\.24,p<\.001p<\.001\)\. Items, models, and distractors are identical; only conversation structure changed\.
Closed\-weight commercial models appear to treat a single\-turn assertion as user\-provided context rather than as genuine prior commitment\. In the multi\-turn design, the model’s own words in the assistant role create genuine conflict between its prior statement and the user’s new document\. Researchers using single\-turn “you previously stated” prompts may be studying Regime 1 rather than Regime 2, with direct implications for reproducibility\.
### 5\.2 The Gemini exception, resolved
Gemini showed a flat profile under all Regime 2 conditions \(V=\.102V=\.102\), which we initially interpreted as either context saturation or manipulation insufficiency\. The Regime 3 ablation resolves this: under PK framing, Gemini drops to 11\.8% context\-following\. Gemini can resist context when told to\. Its flat Regime 2 profile reflects a default instruction\-following strategy \(defer to context\) rather than an inability to engage parametric memory\. This is an important distinction for RAG system design: Gemini’s context saturation is a policy choice, not an architectural constraint\.
### 5\.3 Safety implications
Near\-uniform context\-following makes models correctable but vulnerable to fluent misinformation\. GPT\-5\.5 follows context at 94\.6% even for high\-certainty items in R2\-v2: a fluent assertive document claiming “The director ofInterstellaris Brian Singer” would almost certainly override its knowledge of Christopher Nolan’s filmography\.
The Regime 3 results suggest a practical mitigation: task framing can activate parametric resistance\. System prompts that instruct models to “verify claims against your own knowledge when the source is unverified” may partially protect against context\-injection attacks, though the effectiveness of such prompts in adversarial settings remains untested\.
### 5\.4 Limitations
#### PopQA property skew
The sample over\-represents creative\-work properties \(director, genre, screenwriter\)\. Findings may not generalise to geographic, biographical, or scientific domains\.
#### Synthetic\-to\-real gap
Hagström et al\. \([2025](https://arxiv.org/html/2605.11574#bib.bib8)\)show insights from synthetic conflict datasets do not reliably transfer to real\-world RAG\. Our results characterise controlled conditions, not production systems\.
#### Scoring sensitivity
The 6\-character minimum token match may produce false negatives for short surnames\. Future work should use canonicalisation against knowledge\-base aliases\.
#### Multi\-turn design scope
Our multi\-turn manipulation uses only the model’s own prior answer as commitment\. Alternative commitment structures \(user\-voiced prior, hedged commitment, repeated reinforcement\) remain untested\.
#### Domain contingency of strength vs\. uniqueness
Our finding that strength dominates over uniqueness is specific to PopQA’s stable factual domain\. The relationship may reverse for temporal or disputable facts, asMarjanović et al\. \([2024](https://arxiv.org/html/2605.11574#bib.bib14)\)suggest\. Extension to at least one dynamic\-domain dataset \(e\.g\., FreshQA or DynamicQA\) would provide a direct test of this domain\-contingency hypothesis and is a priority for future work\.
#### Entity\-level vs\. fact\-level uniqueness
Our Wikidata edit\-frequency measure counts total entity revisions, not revisions to the specific property\-value pair tested \(e\.g\., total edits to the “Rand Paul” entity rather than edits specifically to the “father” claim\)\. This entity\-level operationalisation may dilute signal for properties that are stable even when the broader entity page is frequently edited\. Fact\-level revision histories, available via the Wikidata statement API, would provide a cleaner measure\.
#### Covariates not controlled
Our primary GEE regressions use only log\-popularity as the predictor\. Baseline correctness, property type, and distractor popularity \(o\_popo\\\_pop\) are potential covariates that could confound or mediate the certainty gradient\. Future analyses should include these as fixed effects to rule out alternative explanations\.
#### Regime 1 evidence quality not directly tested
The evidence\-quality claim for Regime 1 is argued through synthesis of prior studies\(Longpre et al\.,[2021](https://arxiv.org/html/2605.11574#bib.bib12); Xie et al\.,[2024](https://arxiv.org/html/2605.11574#bib.bib20)\)\. Our own Regime 1 experiment uses only assertive synthetic sentences and does not directly manipulate coherence or plausibility\. A direct internal test \(e\.g\., fluent vs\. entity\-substituted contexts on the same items\) would strengthen this component of the framework\.
#### Data availability
Code, item lists, distractor mappings, prompt templates, and raw response logs for all 9,970 API calls will be released upon publication to support exact replication\.
## 6 Conclusion
The context\-parametric conflict literature has produced contradictory findings because researchers have compared results across three qualitatively distinct experimental structures without recognising the distinction\.
We proposed a three\-regime framework with different dominant predictors per regime \(evidence coherence in Regime 1, parametric certainty in Regime 2, task knowledge requirement in Regime 3\) and validated all three empirically\. The Regime 2 certainty gradient is confirmed by GEE logistic regression across five frontier models \(β=−0\.38\\beta=\-0\.38to−0\.50\-0\.50, allp≤\.013p\\leq\.013, BH\-FDR corrected\), survives multinomial outcome modeling and sensitivity analyses, and is robust to the distractor plausibility confound\. The Regime 3 ablation produces the study’s single strongest result: task framing alone flips context\-following from near\-100% \(CK\) to 6–71% \(PK\), with certainty modulating resistance within PK framing\.
We showed that parametric strength and uniqueness are orthogonal dimensions \(r=−0\.002r=\-0\.002\) with domain\-contingent predictive power: strength dominates for stable factual domains, while uniqueness may dominate for dynamic domains\. We identified manipulation design as a first\-order variable that can reveal or obscure conflict behaviour, and demonstrated that the closed\-weight vs\. open\-weight split in conflict behaviour extends from Regime 2 manipulation sensitivity to Regime 3 task compliance\.
The practical implication: regime classification should precede comparison of conflict findings\. Predictors do not generalise across regime boundaries\.
## References
- Augenstein et al\. \(2024\)Isabelle Augenstein et al\.Scaling instruction\-finetuned language models via knowledge conflict benchmarking\.*arXiv preprint arXiv:2305\.13300*, 2024\.
- Benjamini and Hochberg \(1995\)Yoav Benjamini and Yosef Hochberg\.Controlling the false discovery rate: A practical and powerful approach to multiple testing\.*Journal of the Royal Statistical Society: Series B*, 57\(1\):289–300, 1995\.
- Chen et al\. \(2022\)Hung\-Ting Chen, Michael J\.Q\. Zhang, and Eunsol Choi\.Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence\.In*Proceedings of EMNLP*, 2022\.
- Chen et al\. \(2025\)Yiran Chen et al\.How training data shapes parametric vs\. in\-context knowledge arbitration\.*arXiv preprint arXiv:2510\.02370*, 2025\.
- Cheng et al\. \(2024\)Shailesh Cheng et al\.Interplay of parametric and contextual knowledge: A study of parametric knowledge utilisation in LLMs\.*arXiv preprint arXiv:2410\.08414*, 2024\.
- Cramér \(1946\)Harald Cramér\.*Mathematical Methods of Statistics*\.Princeton University Press, 1946\.
- Du et al\. \(2024\)Yanda Du, Zhijing Zhao, Bernhard Schölkopf, et al\.Context versus prior knowledge in language models\.In*Proceedings of ACL*, 2024\.
- Hagström et al\. \(2025\)Lovisa Hagström et al\.Reality check on RAG: Do we need to worry about context utilisation?In*Proceedings of NAACL*, 2025\.
- Jin et al\. \(2024\)Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Xiaojian Jiang, Jiexin Xu, Qiuxia Li, and Jun Zhao\.Cutting off the head ends the conflict: A mechanism for interpreting and mitigating knowledge conflicts in language models\.In*Proceedings of ACL Findings*, 2024\.
- Li et al\. \(2025\)Zhen Li et al\.Memory operations in large language models: A survey\.*arXiv preprint arXiv:2505\.00675*, 2025\.
- Liang and Zeger \(1986\)Kung\-Yee Liang and Scott L\. Zeger\.Longitudinal data analysis using generalized linear models\.*Biometrika*, 73\(1\):13–22, 1986\.
- Longpre et al\. \(2021\)Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, Sameer Singh, Hannaneh Hajishirzi, Eunsol Choi, and Ramakanth Pasunuru\.Entity\-based knowledge conflicts in question answering\.In*Proceedings of EMNLP*, 2021\.
- Mallen et al\. \(2023\)Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi\.When not to trust language models: Investigating effectiveness of parametric and non\-parametric memories\.In*Proceedings of ACL*, 2023\.
- Marjanović et al\. \(2024\)Sara Vera Marjanović, Haeun Yu, Pepa Atanasova, Maria Maistro, Christina Lioma, and Isabelle Augenstein\.DynamicQA: Tracing internal knowledge conflicts in language models\.In*Proceedings of EMNLP*, 2024\.
- Shi et al\. \(2025\)Weijia Shi et al\.Retrieval\-constrained decoding for faithful generation\.*arXiv preprint arXiv:2509\.23417*, 2025\.
- Sun et al\. \(2025a\)Kaiser Sun, Fan Bai, and Mark Dredze\.Task matters: Knowledge requirements shape LLM responses to context–memory conflict\.In*Proceedings of ACL*, 2025a\.
- Sun et al\. \(2025b\)Zhongxiang Sun, Xiaoxue Zang, Kai Zheng, Jun Xu, Xiao Zhang, Weijie Yu, Yang Song, and Han Li\.ReDeEP: Detecting hallucination in retrieval\-augmented generation via mechanistic interpretability\.In*Proceedings of ICLR*, 2025b\.
- Wang et al\. \(2025\)Yifei Wang et al\.PRISM: Stage\-wise diagnosis of hallucination in retrieval\-augmented generation\.*arXiv preprint arXiv:2604\.16909*, 2025\.
- Wu et al\. \(2025\)Hao Wu et al\.Knowledgeable\-R1: Multi\-policy reinforcement learning for parametric\-contextual knowledge balance\.*arXiv preprint arXiv:2506\.05154*, 2025\.
- Xie et al\. \(2024\)Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su\.Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts\.In*Proceedings of ICLR*, 2024\.
- Xu et al\. \(2024\)Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu\.Knowledge conflicts for LLMs: A survey\.*arXiv preprint arXiv:2403\.08319*, 2024\.
- Zhang et al\. \(2025\)Haokun Zhang et al\.Taming knowledge conflicts in language models\.*arXiv preprint arXiv:2503\.10996*, 2025\.Similar Articles
RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity
RoleConflictBench is a novel benchmark containing over 13,000 scenarios across 65 roles designed to evaluate how well LLMs handle contextual sensitivity in role conflict situations where multiple social expectations clash. Analysis of 10 LLMs reveals that models predominantly rely on learned role preferences rather than dynamic contextual cues when making decisions.
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
This paper introduces a paired-prompt protocol to measure 'evaluation-context divergence' in open-weight LLMs, finding that models behave differently depending on whether prompts are framed as evaluations or live deployments. The study highlights heterogeneity across models, with some being 'eval-cautious' and others 'deployment-cautious', raising concerns about the validity of safety benchmarks.
Context Is Not Control, a source-boundary eval for LLMs
A paper introducing 'Context Is Not Control', an evaluation benchmark for assessing source-boundary failures in LLMs' use of controlled text-mediated evidence. Includes replication packages for open-weight and frontier API models.
Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict
This paper introduces Context-Driven Decomposition (CDD), a probe to diagnose when RAG systems comply with retrieved context despite conflicting parametric knowledge, and releases the Epi-Scale benchmark for systematic study across model families.
Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
A controlled study of compound LLM agent design in an adversarial POMDP (CybORG CAGE-2), systematically varying context, reasoning, and hierarchy across five model families. Key findings: programmatic state abstraction yields large returns per token, hierarchy without deliberation tools achieves best absolute performance, and context engineering is more cost-effective than deeper reasoning.