Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs
Summary
This paper proposes a semantic verification framework that uses Natural Language Inference (NLI) to filter meaning-preserving prompt variations, and introduces three metrics (MVS, ΔC, and WCI) to measure how sensitive clinical LLMs are to those variations. Results show that domain specialization does not consistently improve robustness: domain-specific and general-purpose models both show mixed, model-dependent performance.
# Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

Source: [https://arxiv.org/abs/2605.30646](https://arxiv.org/abs/2605.30646) [View PDF](https://arxiv.org/pdf/2605.30646)

> Abstract: Large Language Models (LLMs) are increasingly used in clinical applications. However, their behavior remains highly sensitive to subtle linguistic variations, such as rephrasing or syntactic variation. This sensitivity poses risks in safety-critical healthcare settings, where semantically equivalent inputs should produce consistent predictions. However, a key challenge is to ensure that prompt variations truly preserve clinical meaning, as embedding-based similarity metrics often fail to capture distinctions involving negation, temporality, or severity. To address this limitation, we propose a semantic verification framework based on Natural Language Inference (NLI) to filter meaning-preserving prompt variations, which are further refined using an LLM-as-a-judge and audited by a clinical expert. In addition, we introduce three metrics to quantify model sensitivity: Meaning-Preserving Variation Sensitivity (MVS), confidence variation (ΔC), and Worst-Case Instability (WCI). We evaluate 16 open-source general-purpose (GP) and medical LLMs within the same model families and parameter scales, using reformulated prompts derived from the DiagnosisQA and MedQA datasets. Our results demonstrate that robustness differences between domain-specific (DS) models are mixed and highly model-dependent, i.e., domain specialization does not consistently improve or reduce robustness to meaning-preserving prompt reformulations. Several DS models rank among the most robust (when compared with GP counterparts), and strong GP baselines remain competitive as well.
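
The abstract describes two steps: filtering reformulated prompts with NLI, then measuring how much a model's output moves across the surviving variants. A minimal sketch of that shape follows. It is illustrative only and is not the paper's implementation. The bidirectional-entailment rule, the `threshold` value, the `entail_prob` callable, and the three summary statistics (`flip_rate`, `mean_conf_shift`, `worst_case_unstable`) are all assumptions made here. They are loose stand-ins for MVS, ΔC, and WCI, whose exact definitions are not given in this summary.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: (premise, hypothesis) -> probability of entailment.
# Any NLI model could sit behind it; none is assumed here.
EntailFn = Callable[[str, str], float]


def keep_meaning_preserving(
    original: str,
    variants: List[str],
    entail_prob: EntailFn,
    threshold: float = 0.9,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Keep variants that entail, and are entailed by, the original prompt."""
    return [
        v
        for v in variants
        if entail_prob(original, v) >= threshold
        and entail_prob(v, original) >= threshold
    ]


def sensitivity(
    original: Tuple[str, float],
    variants: List[Tuple[str, float]],
) -> Dict[str, object]:
    """Summarize how (label, confidence) outputs move across prompt variants.

    These statistics are illustrative stand-ins, not the paper's metrics.
    """
    if not variants:
        raise ValueError("need at least one variant prediction")
    base_label, base_conf = original
    flips = [label != base_label for label, _ in variants]
    shifts = [abs(conf - base_conf) for _, conf in variants]
    return {
        "flip_rate": sum(flips) / len(variants),          # share of changed answers
        "mean_conf_shift": sum(shifts) / len(variants),   # average confidence change
        "worst_case_unstable": any(flips),                # did any variant flip?
    }
```

In use, `keep_meaning_preserving` would be called once per source question with whatever NLI scorer is available, and `sensitivity` would then be called on the model's predictions for the original prompt and for each retained variant.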
## Submission history

From: Mahdi Alkaeed Khalaf

**[v1]** Thu, 28 May 2026 23:03:43 UTC (2,457 KB)
Similar Articles
Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
Researchers from PNNL and Washington University introduce a systematic framework to test how five LLMs detect subtle semantic changes in documents, revealing positional bias, context coherence effects, and model-specific scoring fingerprints.
Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)
This paper presents a neuro-symbolic verification architecture for LLM outputs in high-stakes domains, combining formal symbolic methods with neural semantic analysis. Evaluated on a medical device damage assessment system, it achieves over 83% hallucination detection for structured entities and 30% reduction in report creation time.
A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models
This paper presents a multi-domain red teaming framework for evaluating safety, robustness, and fairness of medical LLMs across 690 clinically grounded scenarios. Results show that high aggregate accuracy can mask critical failures, and hybrid evaluation with clinician oversight is necessary for credible safety assessment.
On the Persistent Effects of Lexicality in Large Language Mod
This paper investigates how lexical overlap, rather than semantic content, influences LLM representations across layers and architectures, and demonstrates that this lexical effect persists even in models trained for semantic similarity, leading to degraded performance on downstream tasks.
When Similar Means Different: Evaluating LLMs on Arabic-Hebrew Cognates
This paper introduces SemCog Bench, a curated benchmark of 1,858 Arabic-Hebrew word pairs with sentence-level annotations, to evaluate LLMs' ability to distinguish true cognates from false friends and loanwords. Results show high accuracy on true cognates but sharp drops on false friends, highlighting a key limitation in cross-lingual semantic reasoning.