Measuring Epistemic Resilience of LLMs Under Misleading Medical Context
Summary
Introduces MedMisBench, a benchmark that measures whether LLMs maintain correct medical reasoning when misleading context is injected. Across 11 model configurations, mean accuracy drops from 71.1% on original questions to 38.0% under focused misleading context, and a 14-member clinical panel identified serious potential harm in 38.2% of reviewed cases.
Cached at: 06/15/26, 09:03 AM
Source: https://huggingface.co/papers/2606.12291
Abstract
Large language models lose medical reasoning accuracy when exposed to misleading context, highlighting a critical gap in current evaluation methods, which fail to assess epistemic resilience under adversarial conditions.
Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
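The abstract reports accuracy before and after injection alongside an attack success rate, but this page does not give the exact formulas. The sketch below shows one plausible way to compute such metrics from per-item results. It assumes attack success means "the model answered the original question correctly and then answered incorrectly under the misleading context"; the `ItemResult` schema, the function names, and the toy data are all hypothetical and are not taken from MedMisBench.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ItemResult:
    """Outcome for one question under one misleading context-option pair.

    Hypothetical schema for illustration; not the MedMisBench data format.
    """
    correct_original: bool  # model answered the clean question correctly
    correct_misled: bool    # model answered correctly after the injection


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; returns 0.0 for an empty input."""
    values = list(flags)
    return sum(values) / len(values) if values else 0.0


def attack_success_rate(results: List[ItemResult]) -> float:
    """Share of originally correct items that flip to incorrect.

    Assumed definition: only items the model got right on the clean
    question are eligible, since an injection cannot "break" an answer
    that was already wrong.
    """
    eligible = [r for r in results if r.correct_original]
    if not eligible:
        return 0.0
    flipped = sum(1 for r in eligible if not r.correct_misled)
    return flipped / len(eligible)


# Toy data (made up): three originally correct items, two of which flip.
results = [
    ItemResult(correct_original=True, correct_misled=True),
    ItemResult(correct_original=True, correct_misled=False),
    ItemResult(correct_original=True, correct_misled=False),
    ItemResult(correct_original=False, correct_misled=False),
]

clean_acc = accuracy(r.correct_original for r in results)
misled_acc = accuracy(r.correct_misled for r in results)
asr = attack_success_rate(results)

print(f"clean accuracy:  {clean_acc:.3f}")
print(f"misled accuracy: {misled_acc:.3f}")
print(f"attack success:  {asr:.3f}")
```

For scale, the reported fall from 71.1% to 38.0% is a drop of 33.1 percentage points. The paper's own attack success figure may be computed over a different denominator than the one assumed here, so numbers from this sketch should not be compared directly with the reported 51.5%.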
Get this paper in your agent:
hf papers read 2606.12291
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Datasets citing this paper: 1
HongjianZhou/MedMisBench
Similar Articles
When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure
This paper investigates how large language models maintain correct beliefs under adversarial pressure in clinical settings, proposing R-FT fine-tuning to improve epistemic resilience while balancing corrigibility, and demonstrating significant robustness gains on medical benchmarks.
Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.
Auditing LLM Benchmarks with Item Response Theory
This paper introduces an Item Response Theory-based method to detect mislabeled examples in LLM benchmarks at 95% precision, tracing errors to labeling heuristics and annotation issues.
Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement
This paper introduces 'second-order bias', the bias LLMs exhibit when judging biased content, and proposes a reasoning task grounded in epistemic entitlement to evaluate it. Experiments show that the task evades safety guardrails and reveals systematic demographic biases in LLM judges.
Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care
This study evaluates how interactive dialogue with an LLM (via the MedSyn system) improves diagnostic accuracy for physicians in emergency care settings, showing significant gains for residents on difficult cases.