Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models
Summary
This paper investigates how discourse-role labels (e.g., 'Reference:', 'Instruction:', 'Example:') used to wrap context in RAG systems significantly affect how much language models adopt misleading information, with shifts of 56–84 percentage points observed across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct. The authors argue that wrapper labels should be treated as presentation-time variables and reported/controlled in context-utilization benchmarks.
# Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models
Source: [https://arxiv.org/html/2606.04109](https://arxiv.org/html/2606.04109)
###### Abstract
Context\-augmented language model systems often wrap supplied content with labels such as `Reference:`, `Evidence:`, `Instruction:`, `Note:`, or `Example:`, but the effect of these labels on reader\-model behavior remains underexplored\. We introduce a paired fixed\-content probe over 500 MMLU\-Pro items: each item receives the same misleading answer\-bearing assertion under different discourse\-role labels, and adoption is measured by whether the model outputs the injected wrong option\. Across GPT\-5\.5, DeepSeek V4 Pro, Llama\-3\-8B\-Instruct, and Qwen2\.5\-7B\-Instruct, Misleading Adoption Rate shifts by 56–84 percentage points\. Binding or source\-like labels such as `Instruction:` and `Reference:` produce high adoption, whereas `Example:` consistently suppresses it\. Paired tests, bootstrap intervals, final\-instruction ablations, and Qwen final\-step log\-probability probes support a label\-conditioned candidate preference\. Boundary probes show where the effect weakens or persists: arithmetic tasks reduce adoption, passage\-shaped external context preserves smaller label gaps, short\-answer evaluation rules out option\-letter copying, and nested\-label conflicts suggest that illustrative framing can delimit adoption scope\. A 200\-case single\-author manual audit confirms that the short\-answer contrasts are stable under conservative adjudication\. The resulting claim is bounded but practical: context\-utilization and reader\-side RAG benchmarks should report and control wrapper labels, because presentation choices can change measured reliance on supplied context\.
###### keywords:
Large language models, Context utilization, Retrieval\-augmented generation, Presentation\-time formatting, Discourse\-role labels, Evaluation methodology
Journal: Information Processing & Management
Affiliation: Chengdu University of Information Technology, No\. 366, Section 5, Yinhe Road, Shuangliu District, Chengdu 610225, Sichuan, China
###### Highlights
1. Discourse\-role labels shift misleading adoption by 56–84pp on MMLU\-Pro\.
2. An aligned no\-label/instruction/example subset supports cross\-model replication\.
3. Passage\-wrapper probes show the effect persists in passage\-shaped context\.
4. Log\-probability and nested\-label probes indicate preference and scope effects\.
5. Wrapper labels should be reported in context\-utilization benchmarks\.
## 1 Introduction
A retrieval\-reader pipeline does not only decide *what* information to pass to a language model\. It also decides *how* that information is presented\. Retrieved passages, tool outputs, memory snippets, demonstrations, and prompt\-template blocks are commonly wrapped with short labels such as `Reference:`, `Evidence:`, `Instruction:`, `Note:`, or `Example:`\. In many systems these labels are treated as cosmetic organization for human readers\. This paper asks whether they are also functional presentation\-time variables for machine readers\.
The question matters for information processing and management because context\-augmented systems are increasingly evaluated by how faithfully and selectively a model uses supplied information\. If identical answer\-bearing content is adopted when labeled as a reference but suppressed when labeled as an example, then a benchmark may be measuring not only context content but also the role assigned to that content\. Figure [1](https://arxiv.org/html/2606.04109#S1.F1) locates the studied variable in a retrieval\-reader pipeline: after information is retrieved, generated, or recalled, but before the reader model converts it into an answer\.
\[Figure 1 diagram: Retrieval / tools / memory \(supplied external information\) → Wrapper layer \(`Reference:` / `Evidence:` / `Instruction:` / `Example:`; the presentation\-time variable studied here\) → Reader model \(answer decision\)\]
Figure 1: Location of discourse\-role labels in a context\-augmented retrieval\-reader pipeline\. The paper studies the wrapper layer while holding the supplied answer\-bearing content fixed\.
Previous studies have shown that large\-language\-model behavior can be sensitive to prompt wording, presentation\-time formatting, punctuation, underspecification, demonstrations, and retrieved context \(Sclar et al\., [2024](https://arxiv.org/html/2606.04109#bib.bib1); He and others, [2024](https://arxiv.org/html/2606.04109#bib.bib5); Seleznyov et al\., [2025](https://arxiv.org/html/2606.04109#bib.bib7); Liu et al\., [2024](https://arxiv.org/html/2606.04109#bib.bib12), [2025](https://arxiv.org/html/2606.04109#bib.bib17); Zhang et al\., [2025](https://arxiv.org/html/2606.04109#bib.bib19)\)\. We study a narrower and more diagnostic problem: when the answer\-bearing content is fixed, does the discourse\-role label attached to that content determine whether the model adopts it? To connect this controlled question to context\-augmented settings, the paper treats passage\-shaped wrapper probes as a main reader\-side evidence layer rather than as a cosmetic appendix check\.
We call this phenomenon*role\-conditioned reader adoption of supplied content*\. The experimental design holds the question, answer choices, injected wrong option, wrong\-option text, prompt position, and final answer instruction fixed; only the local discourse\-role label changes\. The outcome is not aggregate accuracy drift across broad prompt variants, but paired within\-item adoption of the same controlled misleading assertion\. This design turns a wrong answer into a measurement device: if the model outputs the injected wrong option, it has adopted the supplied claim under that label\.
Our central contribution is an evaluation design that treats wrapper labels as controlled variables\. First, we introduce a paired protocol that isolates discourse\-role labels while holding the supplied content fixed\. This makes wrapper choice observable rather than implicit in context\-use evaluation\. Second, we report cross\-system replication across four reader models, including a fully aligned no\-label/instruction/example subset, with misleading adoption shifting by 56–84 percentage points in the audited MMLU\-Pro setting\. Third, we provide RAG\-relevant reader\-side evidence through passage\-shaped external context, while explicitly separating this probe from end\-to\-end retrieval evaluation\. Fourth, we provide decoding\-level evidence from final\-instruction ablations and final\-step log probabilities\. Fifth, we characterize task\-affordance and output\-format boundaries using GSM8K, mixed\-language prompts, label taxonomy probes, template variants, nested\-label conflicts, short\-answer output, and a single\-author manual audit of short\-answer judgments\. Together, these results support a bounded methodological claim: context\-utilization benchmarks should report and control the wrapper labels surrounding supplied or retrieved content, because those labels can change measured reliance on external information\.
## 2 Related work
### 2\.1 Prompt sensitivity and presentation effects
A large body of prompt\-sensitivity work has already made the broad point that form matters: wording, formatting, punctuation, underspecification, scoring artifacts, and prompt variants can all change model behavior \(Sclar et al\., [2024](https://arxiv.org/html/2606.04109#bib.bib1); Chatterjee and others, [2024](https://arxiv.org/html/2606.04109#bib.bib2); Zhuo and others, [2024](https://arxiv.org/html/2606.04109#bib.bib3); Lu et al\., [2024](https://arxiv.org/html/2606.04109#bib.bib4); He and others, [2024](https://arxiv.org/html/2606.04109#bib.bib5); Razavi et al\., [2025](https://arxiv.org/html/2606.04109#bib.bib6); Seleznyov et al\., [2025](https://arxiv.org/html/2606.04109#bib.bib7); Hua et al\., [2025](https://arxiv.org/html/2606.04109#bib.bib8); Pecher et al\., [2026](https://arxiv.org/html/2606.04109#bib.bib9); Liu and Chu, [2026](https://arxiv.org/html/2606.04109#bib.bib10)\)\. We use that literature as motivation rather than as the comparison target\. The narrower question here is whether a local role label changes adoption when the assertion text, answer options, wrong option, prompt position, and final answer instruction are held fixed\.
### 2\.2 In\-context demonstrations
Examples are a natural source of ambiguity for this study\. In\-context\-learning work has shown that demonstrations, their order, and their presentation affect model behavior \(Wang et al\., [2024a](https://arxiv.org/html/2606.04109#bib.bib22); Peng and others, [2024](https://arxiv.org/html/2606.04109#bib.bib23); Zhang and others, [2024](https://arxiv.org/html/2606.04109#bib.bib24); Su and others, [2024](https://arxiv.org/html/2606.04109#bib.bib25); Qin and others, [2024](https://arxiv.org/html/2606.04109#bib.bib26); Agarwal and others, [2024](https://arxiv.org/html/2606.04109#bib.bib27); Bertsch and others, [2025](https://arxiv.org/html/2606.04109#bib.bib28)\)\. Our use of `Example:` is different: the supplied sentence is not a worked demonstration selected for imitation\. It is the same counterfactual answer\-bearing assertion used in the other conditions, with only the wrapper label changed\. This lets us test the role assigned to the content rather than the quality of an example set\.
### 2\.3 RAG faithfulness, context conflict, and source attribution
The closest application setting is retrieval\-augmented reading\. Prior work has documented that models may underuse sufficient context, ignore retrieved evidence, show position effects, or mix parametric and retrieved knowledge under conflict \(Liu et al\., [2024](https://arxiv.org/html/2606.04109#bib.bib12); Wu and others, [2024](https://arxiv.org/html/2606.04109#bib.bib13); Qi et al\., [2024](https://arxiv.org/html/2606.04109#bib.bib14); Es et al\., [2024](https://arxiv.org/html/2606.04109#bib.bib15); Shen and others, [2024](https://arxiv.org/html/2606.04109#bib.bib16); Liu et al\., [2025](https://arxiv.org/html/2606.04109#bib.bib17); Hagström and others, [2025](https://arxiv.org/html/2606.04109#bib.bib18); Zhang et al\., [2025](https://arxiv.org/html/2606.04109#bib.bib19); Ming and others, [2025](https://arxiv.org/html/2606.04109#bib.bib20); Lin et al\., [2026](https://arxiv.org/html/2606.04109#bib.bib21)\)\. Source\-attribution and evidence\-use studies usually ask whether the answer is supported by the right material\. We ask a smaller reader\-side question: if the material is fixed, can the label around it change whether it is adopted? The passage\-wrapper probe is included for that reason, and is not reported as an end\-to\-end retriever–reader benchmark\.
### 2\.4 External\-context security
There is also a security\-adjacent reading of the result\. Prompt injection, indirect prompt injection, instruction/data separation, and external\-context security work study how untrusted text can affect model\-integrated systems \(OWASP Foundation, [2025](https://arxiv.org/html/2606.04109#bib.bib31); Russinovich, [2024](https://arxiv.org/html/2606.04109#bib.bib32); Microsoft Agent Framework Team, [2026](https://arxiv.org/html/2606.04109#bib.bib33); Hines et al\., [2024](https://arxiv.org/html/2606.04109#bib.bib36); Chen et al\., [2025](https://arxiv.org/html/2606.04109#bib.bib37); Zverev et al\., [2025](https://arxiv.org/html/2606.04109#bib.bib38); Zhan et al\., [2024](https://arxiv.org/html/2606.04109#bib.bib39); Debenedetti et al\., [2024](https://arxiv.org/html/2606.04109#bib.bib40); Yi et al\., [2025](https://arxiv.org/html/2606.04109#bib.bib41); Zou et al\., [2025](https://arxiv.org/html/2606.04109#bib.bib42); Al Masoud et al\., [2026](https://arxiv.org/html/2606.04109#bib.bib34); Khodayari et al\., [2026](https://arxiv.org/html/2606.04109#bib.bib35)\)\. We do not evaluate attacks or propose a defense\. The relevance is more limited: ordinary context labels can change adoption of a fixed external assertion, so label choice should be treated as part of context\-presentation design\.

| Work area | Varies content? | Fixed assertion? | Wrapper role? | Paired adoption? | Reporting guidance? |
| --- | --- | --- | --- | --- | --- |
| Prompt sensitivity | Often | Usually no | Broad formatting | Usually no | Limited |
| In\-context demonstrations | Y | N | Example presentation | Usually no | Demo\-focused |
| RAG faithfulness/conflict | Y | Sometimes | Rarely isolated | Task dependent | Evidence\-focused |
| Source attribution | Y | Usually no | Source markers | Usually no | Citation\-focused |
| This work | N | Y | Y | Y | Wrapper labels |

Table 1: Positioning relative to adjacent literatures\. The contribution is not that prompts matter in general, but that a local discourse\-role wrapper changes paired adoption of the same fixed misleading assertion and yields a concrete reporting recommendation for context\-utilization benchmarks\.
## 3 Framework and methodology
A *contextual assertion* is a statement prepended to a task input that contains an answer or claim relevant to the current question\. In the main experiments, the assertion contains a multiple\-choice option and its option text, for example `Reference: The answer is (B). <option text>` or `Example: The answer is (B). <option text>`\. A *discourse\-role label* is the short prefix assigning a role to that assertion\. Some labels are binding or evidential, some are suggestive or illustrative, and the nonce label `Zorple:` preserves label syntax without carrying an interpretable discourse role\.
For each item $i$, a wrong option $w_i$ is selected and reused across label conditions\. If the model output $\hat{y}_{i,\ell}$ under label $\ell$ equals $w_i$, the supplied assertion is counted as adopted\. The primary metric is Misleading Adoption Rate \(MAR\):

$$\mathrm{MAR}(\ell) = \frac{1}{n} \sum_{i} \mathbf{1}[\hat{y}_{i,\ell} = w_i]. \tag{1}$$

MAR should not be read as ordinary task error\. It is a targeted adoption measure: under a fixed counterfactual conflict, the model either follows or resists the supplied misleading claim\. Intuitively, labels such as `Instruction:` and `Reference:` make the assertion feel closer to the current answer decision, while `Example:` frames the same sentence as illustrative rather than operational\.
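The MAR definition can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation script: the function name `misleading_adoption_rate` and the toy outputs are hypothetical, and outputs are assumed to be already parsed into option letters.

```python
from typing import Sequence


def misleading_adoption_rate(outputs: Sequence[str], wrong_options: Sequence[str]) -> float:
    """Fraction of items whose model output equals the injected wrong option w_i."""
    if len(outputs) != len(wrong_options):
        raise ValueError("outputs and wrong_options must be aligned per item")
    if not outputs:
        raise ValueError("need at least one item")
    adopted = sum(1 for y, w in zip(outputs, wrong_options) if y == w)
    return adopted / len(outputs)


# Toy data (hypothetical, not from the paper): the same wrong option per item,
# evaluated under two different labels.
wrong = ["B", "D", "A", "C"]
outputs_reference = ["B", "D", "A", "F"]  # adopts the wrong option on 3 of 4 items
outputs_example = ["E", "D", "G", "F"]    # adopts the wrong option on 1 of 4 items

mar_reference = misleading_adoption_rate(outputs_reference, wrong)
mar_example = misleading_adoption_rate(outputs_example, wrong)
print(mar_reference, mar_example)
```

Because `wrong` is shared across both calls, the difference between the two rates reflects only the label condition, which is the pairing the metric relies on.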
The main task is MMLU\-Pro\-style multiple\-choice question answering \(Wang et al\., [2024b](https://arxiv.org/html/2606.04109#bib.bib11)\)\. Its ten\-option format makes adoption of a specific wrong answer directly observable and reduces ambiguity in the primary adoption metric\. For each sampled item, one incorrect option is selected using a fixed random seed and reused across all label conditions, so label comparisons are paired within item rather than confounded by wrong\-answer plausibility\. GSM8K is used as a boundary task because arithmetic word problems require independent derivation and make direct answer reuse less appropriate\. Two GPT\-5\.5 reader\-setting probes use the same 500 paired MMLU\-Pro items: a passage\-wrapper probe embeds the misleading answer text in paragraph\-like external context, and a short\-answer probe requests a textual answer instead of an option letter\. The aligned cross\-model subset uses the shared no\-label, `Instruction:`, and `Example:` conditions; broader label inventories are treated as model\-specific extensions rather than directly comparable label sets\.
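A sketch of how such paired prompts might be built is shown below. Only the assertion shape (`<label> The answer is (X). <option text>`) comes from the paper; the seeding scheme, the question layout, the final instruction wording, and the helper names `pick_wrong_option` and `build_prompt` are assumptions made for illustration.

```python
import random

LETTERS = "ABCDEFGHIJ"  # up to ten options per item


def pick_wrong_option(num_options, correct_idx, seed, item_id):
    """Choose one incorrect option index, deterministically per (seed, item)."""
    rng = random.Random(f"{seed}:{item_id}")  # string seeds are reproducible
    candidates = [i for i in range(num_options) if i != correct_idx]
    return rng.choice(candidates)


def build_prompt(question, options, wrong_idx, label):
    """Prepend the same misleading assertion; only the label prefix varies."""
    assertion = f"The answer is ({LETTERS[wrong_idx]}). {options[wrong_idx]}"
    context = f"{label} {assertion}" if label else assertion
    listed = "\n".join(f"({LETTERS[i]}) {text}" for i, text in enumerate(options))
    # Hypothetical question layout and final instruction, held fixed across labels.
    return f"{context}\n\nQuestion: {question}\n{listed}\nAnswer with one option letter."


options = [f"option text {i}" for i in range(10)]
wrong_idx = pick_wrong_option(len(options), correct_idx=3, seed=0, item_id="item-001")
prompts = {
    label: build_prompt("Which option is correct?", options, wrong_idx, label)
    for label in (None, "Instruction:", "Reference:", "Example:")
}
```

Selecting `wrong_idx` once per item and passing it to every label condition is what keeps the comparison paired: any two prompts for the same item differ only in the label prefix.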
The model set is chosen to separate a detailed primary run from replication and diagnostic probes\. GPT\-5\.5 is used for the cleanest six\-label run, final\-instruction ablations, the mixed\-language rerun, and the reader\-setting probes\. Qwen2\.5\-7B\-Instruct supports the open\-weight final\-step log\-probability analysis\. DeepSeek V4 Pro and Llama\-3\-8B\-Instruct add structural replication with different model families and prompt implementations\. API experiments use deterministic decoding where available; local open\-weight experiments use greedy or temperature\-zero generation\. Since sample indices and wrong options are paired across conditions, primary comparisons use exact McNemar tests and paired bootstrap confidence intervals\. Accuracy, none rate, and other\-output rate are reported where available so that adoption is not conflated with generic answer failure\.
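The paired bootstrap can be sketched as follows. The paper does not specify the interval type or resample count, so the percentile method, `n_boot=2000`, and the toy adoption indicators below are assumptions; the key point illustrated is that items, not individual trials, are resampled so each pair stays intact.

```python
import random


def paired_bootstrap_ci(adopt_a, adopt_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for MAR(a) - MAR(b) using item-level resampling."""
    if len(adopt_a) != len(adopt_b) or not adopt_a:
        raise ValueError("need non-empty, item-aligned adoption indicators")
    n = len(adopt_a)
    # Per-item difference of 0/1 adoption indicators keeps the pairing.
    diffs = [a - b for a, b in zip(adopt_a, adopt_b)]
    rng = random.Random(seed)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lo, hi


# Toy indicators (hypothetical): 100 items, label a adopts on 90, label b on 10.
adopt_a = [1] * 90 + [0] * 10
adopt_b = [1] * 10 + [0] * 90
point, lo, hi = paired_bootstrap_ci(adopt_a, adopt_b)
print(f"difference = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```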
## 4 Results
### 4\.1 Discourse\-role labels create a large adoption gradient
Figure [2](https://arxiv.org/html/2606.04109#S4.F2) visualizes the main GPT\-5\.5 result\. With 500 MMLU\-Pro examples per condition, `Instruction:` reaches 95\.6% MAR and `Reference:` reaches 80\.2%, while `Example:` falls to 11\.4%\. The largest within\-model contrast, `Instruction:` vs `Example:`, is 84\.2 percentage points\. Paired tests confirm the within\-item nature of the effect: `Instruction:` vs `Example:` has 421 vs 0 discordant pairs \($p = 3.69 \times 10^{-127}$\), and `Reference:` vs `Example:` has 345 vs 1 \($p = 4.84 \times 10^{-102}$\)\.
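The reported p-values can be checked with a two-sided exact McNemar test, which is a binomial test on the discordant pairs. The sketch below is an independent re-computation under that standard definition, not the authors' code; `exact_mcnemar_p` is a hypothetical helper name.

```python
from math import comb


def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail at or below the smaller count, in exact integers.
    tail = sum(comb(n, i) for i in range(k + 1))
    # Integer true division avoids underflow before the final float result.
    return min(1.0, 2 * tail / 2 ** n)


p_instruction_vs_example = exact_mcnemar_p(421, 0)
p_reference_vs_example = exact_mcnemar_p(345, 1)
print(f"{p_instruction_vs_example:.2e}")
print(f"{p_reference_vs_example:.2e}")
```

With 421 vs 0 discordant pairs the p-value reduces to $2 \cdot 0.5^{421} = 2^{-420}$, which agrees with the reported $3.69 \times 10^{-127}$ to the printed precision; the 345 vs 1 case gives $347 \cdot 2^{-345}$, matching $4.84 \times 10^{-102}$.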
Figure 2: GPT\-5\.5 fixed\-content label probe over 500 paired MMLU\-Pro items per condition\. Panel \(a\) groups items by identical six\-label adoption patterns; the row label reports the pattern index and the annotation reports the number of items in that pattern\. Panel \(b\) shows MAR with Wilson 95% confidence intervals and annotates the largest paired contrast between `Instruction:` and `Example:`\. No\-label and `Evidence:` have nearly identical marginal MAR \(72\.4% vs\. 71\.6%\) but adopt non\-identical item subsets, which is visible in the grouped paired patterns\.
Table [2](https://arxiv.org/html/2606.04109#S4.T2) summarizes the same GPT\-5\.5 finding together with cross\-model structural replication\. Because the full label inventories differ slightly across model scripts, Table [3](https://arxiv.org/html/2606.04109#S4.T3) reports the completely aligned subset shared by the four main reader models: no label, `Instruction:`, and `Example:`\. This aligned view is more conservative than comparing each model’s strongest label against its weakest label\.

| Model / setting | High\-adoption role | Bare or neutral | Example | Spread |
| --- | --- | --- | --- | --- |
| GPT\-5\.5 | `Instruction:` 95\.6%; `Reference:` 80\.2% | no label 72\.4%; `Evidence:` 71\.6% | 11\.4% | 84\.2pp |
| DeepSeek V4 Pro | `Instruction:` 74\.8%; `Note:` 73\.6% | no\-format 44\.6% | 5\.4% | 69\.4pp |
| Qwen2\.5\-7B\-Instruct | `Hint:` 89\.0%; `Instruction:` 81\.2% | no\-format 61\.0% | 7\.8% | 81\.2pp |
| Llama\-3\-8B\-Instruct | `Zorple:` 88\.4%; `Note:` 88\.0% | no\-format 73\.4% | 32\.0% | 56\.4pp |

Table 2: Main adoption gradient and four\-model structural replication\. Values are MAR percentages\. Exact labels differ across scripts, so cross\-model comparison emphasizes role families and spread\.

| Model | $n$/cond\. | No label | `Instruction:` | `Example:` | Aligned contrast |
| --- | --- | --- | --- | --- | --- |
| GPT\-5\.5 | 500 | 72\.4% | 95\.6% | 11\.4% | Inst\.–Ex \+84\.2pp; No\-label–Ex \+61\.0pp |
| DeepSeek V4 Pro | 500 | 44\.6% | 74\.8% | 5\.4% | Inst\.–Ex \+69\.4pp; No\-label–Ex \+39\.2pp |
| Qwen2\.5\-7B\-Instruct | 500 | 61\.0% | 81\.2% | 7\.8% | Inst\.–Ex \+73\.4pp; No\-label–Ex \+53\.2pp |
| Llama\-3\-8B\-Instruct | 500 | 73\.4% | 85\.4% | 32\.0% | Inst\.–Ex \+53\.4pp; No\-label–Ex \+41\.4pp |

Table 3: Completely aligned cross\-model core\-label subset\. Values are all\-trial MAR percentages on the pure\-English fixed\-wrong\-assertion setting\. The table uses only labels shared by all four main models, so it avoids comparing model\-specific label inventories\.

The cross\-model picture is deliberately read at the role\-family level\. `Example:` is the lowest\-adoption condition in every evaluated model, and the top\-bottom spread ranges from 56\.4pp to 84\.2pp\. The aligned subset gives the same boundary without relying on model\-specific high labels: `Instruction:` exceeds `Example:` by 53\.4–84\.2pp, and even no\-label prompts exceed `Example:` by 39\.2–61\.0pp\. At the same time, the full rankings are not identical; the mean pairwise Spearman correlation over full rankings is moderate \($\rho = 0.59$\)\. We therefore do not treat labels as having one universal scalar authority\. What replicates is coarser: binding or source\-like roles tend to admit adoption, while illustrative roles suppress it\.
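The aligned contrasts follow directly from the three shared conditions, and the arithmetic can be re-derived as a quick consistency check. The MAR values below are copied from Table 3; the dictionary layout and the helper name `aligned_contrasts` are illustrative choices.

```python
# MAR percentages for the aligned subset, as reported in Table 3.
ALIGNED_MAR = {
    "GPT-5.5": {"none": 72.4, "instruction": 95.6, "example": 11.4},
    "DeepSeek V4 Pro": {"none": 44.6, "instruction": 74.8, "example": 5.4},
    "Qwen2.5-7B-Instruct": {"none": 61.0, "instruction": 81.2, "example": 7.8},
    "Llama-3-8B-Instruct": {"none": 73.4, "instruction": 85.4, "example": 32.0},
}


def aligned_contrasts(table):
    """Return (Instruction - Example, No-label - Example) in percentage points."""
    return {
        model: (
            round(mar["instruction"] - mar["example"], 1),
            round(mar["none"] - mar["example"], 1),
        )
        for model, mar in table.items()
    }


contrasts = aligned_contrasts(ALIGNED_MAR)
inst_gaps = [inst for inst, _ in contrasts.values()]
none_gaps = [none for _, none in contrasts.values()]
print(min(inst_gaps), max(inst_gaps))  # range of Instruction - Example gaps
print(min(none_gaps), max(none_gaps))  # range of No-label - Example gaps
```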
### 4\.2 Labels interact with global instructions and pre\-generation preferences
The decoding\-level evidence is summarized in Table [4](https://arxiv.org/html/2606.04109#S4.T4)\. First, GPT\-5\.5 final\-instruction ablations show that global context\-use instructions amplify adoption but do not erase the local label boundary\. Under a reference\-based final instruction, `Reference:` reaches 99\.6% MAR, yet `Example:` remains at 26\.8%\. A fixed\-effect logistic analysis over 9,000 records finds significant label, final\-instruction, and label\-by\-instruction terms \(label $p < 10^{-300}$; final instruction $p = 5.28 \times 10^{-243}$; interaction $p = 4.90 \times 10^{-19}$\)\.
Second, Qwen2\.5\-7B\-Instruct final\-step log\-probability probes show label\-conditioned differences in candidate\-answer preference before generation\. The wrong\-correct gap, $\log p(w) - \log p(c)$, is $9.149$ under `Reference:`, $9.196$ under `Instruction:`, and $-5.697$ under `Example:`\. The paired `Reference:`–`Example:` gap is 14\.846 log\-probability points \(95% CI \[14\.257, 15\.444\]\); the `Instruction:`–`Example:` gap is 14\.893 points \(95% CI \[14\.301, 15\.485\]\)\. Thus, before generation, the same wrong option receives higher relative probability under binding labels but not under the illustrative label\.
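The wrong-correct gap can be sketched from final-step candidate logits as below. How the logits are extracted from Qwen2.5-7B-Instruct is not described in this section, so the input list and the function names are hypothetical; the toy logits are invented for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def wrong_correct_gap(option_logits, wrong_idx, correct_idx):
    """log p(w) - log p(c) at the final step before generation."""
    logp = log_softmax(option_logits)
    return logp[wrong_idx] - logp[correct_idx]


# Toy final-step logits for three candidate options (hypothetical values).
logits_binding = [2.0, -1.0, 0.5]       # wrong option (index 0) is preferred
logits_illustrative = [-1.5, 1.0, 0.5]  # correct option (index 1) is preferred

gap_binding = wrong_correct_gap(logits_binding, wrong_idx=0, correct_idx=1)
gap_illustrative = wrong_correct_gap(logits_illustrative, wrong_idx=0, correct_idx=1)
print(gap_binding, gap_illustrative, gap_binding - gap_illustrative)
```

Because the normalizer cancels in the difference, the gap equals the raw logit difference and does not depend on whether probabilities are normalized over the candidate options or a larger token set. A positive gap means the misleading candidate is preferred; a negative gap means the correct one is.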

| Probe | Measure | Binding/source\-like | `Example:` | Contrast |
| --- | --- | --- | --- | --- |
| GPT\-5\.5, neutral final instruction | MAR | `Instruction:` 96\.2% | 11\.6% | \+84\.6pp |
| GPT\-5\.5, reference\-based final instruction | MAR | `Reference:` 99\.6% | 26\.8% | \+72\.8pp |
| Qwen2\.5 final\-step probability | Wrong\-pref\. rate | `Instruction:` 89\.6% | 28\.4% | \+61\.2pp |
| Qwen2\.5 final\-step probability | $\log p(w) - \log p(c)$ | `Instruction:` 9\.196 | \-5\.697 | \+14\.893 |

Table 4: Pre\-generation preference evidence\. Binding/source\-like labels remain above `Example:` under stronger final instructions, and Qwen2\.5\-7B\-Instruct assigns higher relative probability to the misleading candidate before generation\.

This goes beyond a final\-output formatting effect\. Before generation, the same supplied assertion already has a different standing relative to the candidate answer depending on the local discourse role\.
### 4\.3 Nested\-label conflicts reveal scope\-sensitive role assignment
The preceding experiments vary a single wrapper at a time\. A natural follow\-up is whether nested, conflicting wrappers combine as simple authority cues\. We therefore run a nested\-label conflict probe on the same 500\-item fixed\-wrong\-assertion setting for GPT\-5\.5 and DeepSeek V4 Pro\. The probe includes two single\-label baselines, `Reference:` and `Example:`, and four nested wrappers: Reference ⊃ Example, Example ⊃ Reference, Instruction ⊃ Example, and Example ⊃ Instruction\. We analyze this probe internally within its own six\-condition run, because its single\-label baseline differs numerically from the main six\-label run\. The result should therefore be read as evidence about nested\-label conflict structure, not as a replacement for the main MAR estimates\.
Table [5](https://arxiv.org/html/2606.04109#S4.T5) shows that nested labels do not behave as simple authority stacking\. In both models, pure `Reference:` produces high MAR, while pure `Example:` produces low MAR\. However, placing Reference or Instruction outside an inner Example frame does not restore high adoption\. On GPT\-5\.5, Reference ⊃ Example reaches 26\.2% MAR, far below `Reference:` alone at 82\.6%; Instruction ⊃ Example falls to 3\.0%\. On DeepSeek V4 Pro, the same nested conditions reach 9\.8% and 3\.8%, respectively, again far below `Reference:` alone at 64\.4%\. Conversely, Example ⊃ Reference remains far below `Reference:` alone in both models\. This pattern suggests that illustrative framing can act as a scope delimiter: when answer\-bearing content is explicitly embedded as an example, it is less likely to be treated as operational evidence even when an authority\-like label appears elsewhere in the wrapper\.

| Condition | GPT\-5\.5 MAR | GPT\-5\.5 Acc\. | GPT\-5\.5 None | GPT\-5\.5 Other | DeepSeek V4 Pro MAR | DeepSeek V4 Pro Acc\. | DeepSeek V4 Pro None | DeepSeek V4 Pro Other |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `Reference:` | 82\.6 | 17\.2 | 0\.0 | 0\.2 | 64\.4 | 12\.4 | 22\.8 | 0\.4 |
| `Example:` | 11\.0 | 80\.6 | 0\.0 | 8\.4 | 7\.2 | 64\.6 | 23\.2 | 5\.0 |
| Reference ⊃ Example | 26\.2 | 67\.4 | 0\.0 | 6\.4 | 9\.8 | 61\.8 | 22\.6 | 5\.8 |
| Example ⊃ Reference | 28\.2 | 65\.6 | 0\.0 | 6\.2 | 14\.4 | 56\.0 | 24\.6 | 5\.0 |
| Instruction ⊃ Example | 3\.0 | 86\.0 | 0\.0 | 11\.0 | 3\.8 | 68\.2 | 21\.6 | 6\.4 |
| Example ⊃ Instruction | 5\.8 | 84\.2 | 0\.0 | 10\.0 | 5\.0 | 63\.8 | 25\.0 | 6\.2 |

Table 5: Nested\-label conflict probe over 500 paired items per condition\. Values are percentages\. Nested illustrative framing keeps MAR far below the pure `Reference:` baseline in both models, but the response distributions differ: GPT\-5\.5 converts most non\-adoption cases into correct answers or other explicit answers, whereas DeepSeek V4 Pro maintains a stable non\-answer component\.

The two models also differ in how low adoption is realized\. GPT\-5\.5 has no non\-answer outputs in this probe, so lower MAR mostly corresponds to correct\-answer recovery or other explicit non\-adoption outputs\. DeepSeek V4 Pro maintains a stable 21–25% non\-answer rate across conditions, with a smaller but persistent other\-output component\. Thus, similar aggregate reductions in adoption can hide different response\-distribution patterns\. We do not use this probe to claim a general decision mechanism\. Instead, it shows that MAR should be reported together with accuracy, none rate, and other\-output rate, because adoption behavior and final response type need not vary in the same way across models\.
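Reporting MAR alongside accuracy, none rate, and other-output rate amounts to assigning every response to exactly one of four mutually exclusive outcomes. The sketch below assumes outputs are already parsed into an option letter, with `None` standing for a non-answer; the parsing step and the function names are hypothetical.

```python
from collections import Counter

OUTCOMES = ("adopt", "correct", "none", "other")


def classify(output, wrong, correct):
    """Assign a parsed model output to exactly one outcome category."""
    if output is None:
        return "none"      # no parseable answer
    if output == wrong:
        return "adopt"     # counts toward MAR
    if output == correct:
        return "correct"   # counts toward accuracy
    return "other"         # some other explicit answer


def outcome_rates(outputs, wrongs, corrects):
    """Percentage of responses in each category; the four rates sum to 100."""
    if not outputs or not (len(outputs) == len(wrongs) == len(corrects)):
        raise ValueError("need non-empty, item-aligned inputs")
    counts = Counter(classify(y, w, c) for y, w, c in zip(outputs, wrongs, corrects))
    return {k: 100.0 * counts[k] / len(outputs) for k in OUTCOMES}


# Toy responses (hypothetical): one of each outcome type.
rates = outcome_rates(
    outputs=["B", "C", None, "F"],
    wrongs=["B", "B", "A", "A"],
    corrects=["C", "C", "D", "D"],
)
print(rates)
```

Since the categories partition the responses, each model's row in Table 5 sums to 100, which makes it visible whether a drop in MAR was absorbed by correct answers, non-answers, or other outputs.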
### 4\.4 The effect is bounded by task affordance
Table [6](https://arxiv.org/html/2606.04109#S4.T6) summarizes boundary and robustness checks\. GSM8K sharply attenuates adoption: when solving requires arithmetic derivation, adoption falls near zero even with misleading previous\-solution prompts\. This supports the task\-affordance interpretation\. Role\-conditioned adoption is strongest when the supplied assertion can be directly reused as an answer\.
The mixed\-language rerun narrows the claim without overturning it\. Using English labels with Chinese assertion content on the same 500 sample indices, the coarse boundary remains: `Instruction:` reaches 93\.8%, `Reference:` 70\.8%, and `Example:` 13\.0%\. However, pure\-English prompts remain significantly higher for some labels, so the result supports robustness of the coarse boundary rather than full multilingual invariance\.

| Probe | Setting | High role | Low role | Interpretation |
| --- | --- | --- | --- | --- |
| Task boundary | MMLU\-Pro direct assertion, GPT\-5\.5 | 95\.6% | 11\.4% | Direct answer reuse is available |
| Task boundary | GSM8K previous\-solution, GPT\-5\.5 | 0\.0% | – | Independent derivation resists adoption |
| Task boundary | GSM8K Chinese answer hint, DeepSeek Chat | 7\.5% | – | Stronger cue yields weak residual adoption |
| Mixed language | English labels, Chinese assertion, GPT\-5\.5 | 93\.8% | 13\.0% | Coarse boundary persists |
| Template variants | Qwen2\.5 six assertion templates | 23\.5% | 10\.2% | Pattern survives weaker cues |

Table 6: Boundary and robustness checks\. High role reports the strongest binding/source\-like condition; low role reports `Example:` when applicable\. Values are MAR percentages\.

Label\-taxonomy experiments further show that the effect is not driven by a single word pair such as `Reference:` versus `Example:`\. Instruction\-like, source/evidence, neutral\-context, and suggestive labels cluster above illustrative labels\. Template\-robustness experiments show that the pattern survives alternatives to `The answer is (X)`, although direct answer cues remain strongest\. Candidate\-only Qwen logit\-lens probes provide a consistency check: final\-layer wrong\-correct gaps follow the same structure and correlate highly with final\-step log\-probability gaps \($r \approx 0.97$–$0.99$\), although we do not infer neural circuits from this probe\.
### 4\.5 Reader\-setting probes: passage wrappers and short answers
The original probe uses a direct assertion such as `The answer is (B)` and a multiple\-choice output format\. Table [7](https://arxiv.org/html/2606.04109#S4.T7) tests whether the effect survives two reader\-setting changes\. The passage\-wrapper probe is the main bridge to context\-augmented reading: it embeds the misleading answer text inside short paragraph\-like external context and varies only the wrapper\. It does not implement retrieval, indexing, reranking, corpus construction, or a deployed retriever–generator pipeline; it is a synthetic external\-context probe\. Even under this weaker, less direct cue, `Reference:` and `Instruction:` remain above `Example:` by about 16pp\.
The short\-answer probe removes option\-letter output as the target\. Candidate answer texts are shown without A/B/C labels, and ambiguous textual answers are judged with a two\-stage procedure\. The label effect not only persists but sharpens: `Reference:` reaches 76\.0%, `Instruction:` 52\.6%, and `Example:` 5\.2%\. Explicit option\-letter outputs occur in only 0\.4% of responses\. To audit the judgment step, we manually labeled 200 model\-judged short\-answer cases, stratified as 50 per condition, using a pre\-specified four\-label rubric \(ADOPT\_WRONG, CORRECT, OTHER, AMBIGUOUS\)\. The automatic and manual labels agree on 87\.5% of audited cases, with Cohen’s $\kappa = 0.765$\. Agreement is balanced across conditions \(86–90%\)\. The largest disagreement source is automatic OTHER versus manual CORRECT \(11 cases\), while direct ADOPT\_WRONG/CORRECT cross\-confusions account for only five cases\. A conservative adjudication changes every condition\-level MAR by at most 0\.6pp: the `Reference:`–`Example:` contrast remains 70\.8pp, and the `Instruction:`–`Example:` contrast changes from 47\.4pp to 46\.8pp\. Thus, the short\-answer result is not an option\-letter\-copying artifact and is stable under manual adjudication\.
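The agreement statistic used in the audit can be sketched as follows. The label names come from the paper's rubric, but the toy label sequences are hypothetical and do not reproduce the audited 200 cases, whose full confusion matrix is not given here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need non-empty, case-aligned label sequences")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both raters labeled independently at their own rates.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy audit (hypothetical): automatic vs manual labels on four cases.
automatic = ["ADOPT_WRONG", "ADOPT_WRONG", "CORRECT", "OTHER"]
manual = ["ADOPT_WRONG", "ADOPT_WRONG", "CORRECT", "CORRECT"]
kappa = cohens_kappa(manual, automatic)
print(round(kappa, 3))
```

Raw agreement in this toy case is 75%, while kappa is lower because some agreement is expected by chance; the same relationship holds between the paper's 87.5% agreement and its reported kappa.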

| Probe | No label | `Reference:` | `Instruction:` | `Example:` | Primary contrast |
| --- | --- | --- | --- | --- | --- |
| Direct assertion | 72\.4 | 80\.2 | 95\.6 | 11\.4 | Inst\.–Ex \+84\.2pp |
| Passage\-shaped context | 29\.8 | 39\.4 | 39\.2 | 23\.2 | Ref\.–Ex \+16\.2pp; Inst\.–Ex \+16\.0pp |
| Short\-answer output | 15\.2 | 76\.0 | 52\.6 | 5\.2 | Ref\.–Ex \+70\.8pp; Inst\.–Ex \+47\.4pp |

Table 7: Reader\-setting probes\. Values are MAR percentages over 500 paired GPT\-5\.5 items per condition\. Passage\-shaped context is a synthetic external\-context reader probe, not a deployed RAG pipeline\. The short\-answer probe has a 0\.4% explicit option\-letter output rate\.

| Audit item | Result |
| --- | --- |
| Manual audit sample | 200 short\-answer cases; 50 per condition |
| Automatic–manual agreement | 87\.5% \(175/200\); Cohen’s $\kappa = 0.765$ |
| Per\-condition agreement | 86–90%, balanced across conditions |
| Main disagreement source | automatic OTHER vs\. manual CORRECT: 11 cases |
| Direct label conflict | ADOPT\_WRONG vs\. CORRECT cross\-confusions: 5 cases |
| Conservative MAR shift | all condition\-level shifts $\leq$ 0\.6pp |
| Key contrasts after adjudication | Ref\.–Ex 70\.8pp; Inst\.–Ex 46\.8pp |

Table 8: Manual audit of short\-answer judgments\. The audit was stratified by condition and conducted using a pre\-specified four\-label rubric over ADOPT\_WRONG, CORRECT, OTHER, and AMBIGUOUS\. Conservative adjudication leaves the main short\-answer contrasts essentially unchanged\.
#### 4\.5\.1 Wrapper labels and output format are independent adoption channels
A useful implication emerges from the short\-answer probe: wrapper labels and output format are partially independent adoption channels\. Removing option\-letter output sharply reduces no\-label adoption from 72\.4% to 15\.2%, but `Reference:` remains high at 76\.0%\. Thus, the binding label can preserve adoption even when the direct letter\-copying channel is largely removed\.
## 5 Methodological recommendation for context\-utilization benchmarks
The practical recommendation is modest: wrapper text should be treated as part of the benchmark specification\. A benchmark that labels a passage as `Reference:` may measure a different reader behavior from one that labels the same passage as `Example:` or `Note:`, or that uses no wrapper at all\. This matters when comparing model reliance on provided context, robustness to conflicting information, or faithfulness to retrieved evidence\.
Three low\-cost changes would make this variable visible\.
1. Report wrapper labels and delimiters\. Benchmark descriptions should include the exact text surrounding supplied context, not only the retrieved passage or answer format\.
2. Add content\-fixed paired variants\. When evaluating whether a reader model uses or ignores external information, studies should include conditions where the supplied content is identical and only the wrapper changes\.
3. Separate passage\-shaped probes from end\-to\-end retrieval benchmarks\. A prompt containing passage\-like context can diagnose reader behavior, but it should not be reported as a full retriever–reader evaluation unless retrieval, indexing, reranking, and corpus construction are actually implemented\.
These changes are intentionally lightweight\. They do not require every evaluation to become a full end\-to\-end RAG benchmark; they make the reader\-side presentation choice visible enough that wrapper effects are not mistaken for model\-level context\-utilization differences\.
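The first two recommendations can be sketched together: generate content-fixed variants of one item and record the exact wrapper text with each prompt. This is a minimal illustration, not the paper's prompt template. The `WRAPPERS` mapping, the `PromptVariant` record, and the `Question:`/`Answer:` layout are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical wrapper set. A real benchmark should substitute the exact strings it uses.
WRAPPERS = {
    "none": "",
    "reference": "Reference:",
    "instruction": "Instruction:",
    "example": "Example:",
}


@dataclass(frozen=True)
class PromptVariant:
    item_id: str
    condition: str
    wrapper_text: str  # logged verbatim so the wrapper can be reported with results
    prompt: str


def build_paired_variants(item_id: str, question: str, context: str) -> list[PromptVariant]:
    """Return one prompt per wrapper condition; the supplied context is identical in all."""
    variants = []
    for condition, label in WRAPPERS.items():
        block = f"{label} {context}" if label else context
        # Assumed layout: context block, blank line, question, answer cue.
        prompt = f"{block}\n\nQuestion: {question}\nAnswer:"
        variants.append(PromptVariant(item_id, condition, label, prompt))
    return variants


if __name__ == "__main__":
    for v in build_paired_variants("item-1", "Which option is correct?", "The correct answer is B."):
        print(repr(v.condition), repr(v.prompt))
```

Because every variant shares the same `item_id` and context string, any difference in adoption between conditions can be attributed to the wrapper rather than to the content.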
## 6 Discussion
### 6\.1 Design implications for context\-augmented systems
For prompt and system designers, the immediate lesson is not to treat Reference:, Instruction:, Evidence:, and Example: as interchangeable decoration\. A non\-binding context block may become more adoptable when framed as a reference or instruction; an illustrative block should be marked as such\. The numeric pattern will not necessarily transfer unchanged to every retrieval\-augmented system, but the design variable itself should be logged and evaluated rather than left implicit\.
### 6\.2 Safety implications
Although this is not a security paper, the results have safety implications\. Authority\-like labels can increase adoption of misleading content, while illustrative or non\-binding labels can reduce adoption in controlled settings\. This suggests that discourse\-role framing may be useful as one component in broader context\-risk mitigation, but it is not a standalone defense against prompt injection, RAG poisoning, or instruction/data confusion\.
## 7 Limitations and reproducibility
The claim is intentionally bounded by design choices that isolate the reader\-side label effect\. First, the evidence is behavioral and decoding\-level; we do not identify internal circuits, perform activation interventions, or establish a causal neural mechanism\. Second, the stable pattern is block\-level rather than a universal fine\-grained ranking: absolute MAR values and within\-block rankings vary across model families, language settings, and final instructions\. For this reason, we include both a structural replication table and a completely aligned no\-label/Instruction:/Example: cross\-model subset\.
Third, the main paired probe uses a 500\-item MMLU\-Pro subset because it allows precise wrong\-answer adoption measurement\. We deliberately trade benchmark breadth for within\-item statistical control\. The broader evidence spans several distributional axes already available in the study: GSM8K, Chinese assertions with English wrappers, nested\-label conflicts, passage\-shaped external context, short\-answer output, and four reader\-model families\. We do not claim generalization to all QA datasets, option formats, or item construction protocols\. Fourth, the passage\-wrapper probe is not a deployed RAG pipeline: it intentionally omits retrieval, indexing, reranking, source diversity, and corpus construction\. Its smaller contrasts should therefore be read as controlled reader\-side evidence that wrapper effects can persist in passage\-shaped contexts, not as a claim about effect sizes in deployed retrieval systems\.
Fifth, the nested\-label conflict probe should be interpreted as a within\-run structural probe rather than as a replacement for the main six\-label MAR estimates, because its single\-label baselines differ numerically from the main run\. Finally, the short\-answer probe requires judging ambiguous textual answers\. We report judge\-method counts, letter\-output rates, a 200\-case single\-author manual audit, and a conservative adjudication rerun\. The audit supports the reliability of the short\-answer result, but it remains a single\-annotator manual audit rather than a multi\-annotator annotation study\. We therefore use the short\-answer probe to validate the output\-format boundary and the non\-letter\-copying explanation, while keeping the primary claim anchored in the paired multiple\-choice evidence\.
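For readers who want to recompute the agreement statistic reported for the audit, Cohen's κ compares observed agreement with the agreement expected from each rater's label frequencies. The sketch below is a generic implementation run on toy labels. The function name and the example data are assumptions, and it is not the authors' audit script.

```python
from collections import Counter


def cohens_kappa(auto_labels: list[str], manual_labels: list[str]) -> float:
    """Cohen's kappa for two label sequences over the same cases."""
    if len(auto_labels) != len(manual_labels) or not auto_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(auto_labels)
    # Observed agreement: share of cases where both judgments match.
    observed = sum(a == m for a, m in zip(auto_labels, manual_labels)) / n
    # Chance agreement: product of the two marginal label frequencies, summed over labels.
    auto_counts = Counter(auto_labels)
    manual_counts = Counter(manual_labels)
    expected = sum(auto_counts[k] * manual_counts[k] for k in auto_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy data only; these are not the audited cases.
    auto = ["ADOPT_WRONG", "CORRECT", "OTHER", "CORRECT"]
    manual = ["ADOPT_WRONG", "CORRECT", "CORRECT", "CORRECT"]
    print(round(cohens_kappa(auto, manual), 4))
```

Applying the same function to the released automatic and manual labels would let a reader check the reported κ independently, provided the per-case audit records are available.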
The experiments use fixed random seeds and paired sample indices wherever possible\. Formal raw records include sample index, original index, condition, correct option, injected wrong option, prompt, raw response, parsed prediction, and adoption indicators\. The reader\-setting probes also record passage text, raw short answer, judge method, matched option, and letter\-output flags\. A complete replication package should include prompt templates, wrapper text, sampling scripts, model names, decoding parameters, parser code, judge prompts for short\-answer ambiguity, aggregate result tables, and scripts for McNemar tests and paired bootstrap confidence intervals\. For the short\-answer audit, the structured artifact should additionally include the anonymized answer text, automatic judge label, manual audit label, adjudicated label, and a disagreement flag for every audited case\. If raw model responses cannot be released in full, a useful supplement should still provide de\-identified examples, annotation rules, and enough structured records to reproduce the reported tables\.
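As an illustration of the paired analysis scripts mentioned above, the following sketch runs a McNemar test and a paired bootstrap interval on per-item adoption indicators for two label conditions. It assumes the exact binomial form of McNemar's test and a percentile bootstrap; the text does not say which variants were used, and the function names are hypothetical.

```python
import math
import random


def mcnemar_exact(adopt_a: list[int], adopt_b: list[int]) -> float:
    """Two-sided exact McNemar p-value for paired binary adoption indicators."""
    if len(adopt_a) != len(adopt_b):
        raise ValueError("conditions must be paired item by item")
    # Only discordant pairs carry information about a difference between conditions.
    b = sum(1 for x, y in zip(adopt_a, adopt_b) if x and not y)
    c = sum(1 for x, y in zip(adopt_a, adopt_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(adopt_a, adopt_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the MAR difference (A minus B), in percentage points."""
    if len(adopt_a) != len(adopt_b) or not adopt_a:
        raise ValueError("conditions must be non-empty and paired item by item")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [int(x) - int(y) for x, y in zip(adopt_a, adopt_b)]
    n = len(diffs)
    stats = []
    for _ in range(n_boot):
        # Resample items (not conditions) so the pairing is preserved.
        sample = rng.choices(diffs, k=n)
        stats.append(100 * sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Toy indicators only: 1 means the model output the injected wrong option.
    instruction = [1] * 8 + [0] * 2
    example = [0] * 10
    print(mcnemar_exact(instruction, example))
    print(paired_bootstrap_ci(instruction, example))
```

Resampling whole items keeps each item's two outcomes together, which is what makes the interval a paired one.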
## 8 Ethical considerations
This work studies how models adopt misleading context\. Such findings could be misused to craft more persuasive misleading prompts\. We therefore frame the experiments as controlled behavioral analysis rather than attack optimization, avoid releasing a toolkit for maximizing attack success, and do not claim that specific labels constitute universal attacks\. The constructive implication is that system designers should report and control how retrieved, generated, or user\-supplied context is framed\.
## 9 Conclusion
The experiments point to a simple conclusion: supplied context is not defined only by its content, but also by the role assigned to it in the prompt\. With answer\-bearing content held fixed, binding and source\-like labels promote adoption, while illustrative labels such as Example: suppress it\. This pattern recurs across four model families in the audited setting, survives stronger final instructions, appears in Qwen final\-step candidate preferences, weakens on tasks that require independent reasoning, persists in passage\-shaped context, and remains visible under manually audited short\-answer evaluation\. Nested\-label conflicts add a further boundary case by showing scope\-sensitive behavior rather than simple authority stacking\. The claim is bounded, but the reporting implication is direct: context\-utilization research and reader\-side RAG evaluation should document and control the discourse\-role labels surrounding supplied or retrieved content\.