Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

arXiv cs.CL Papers

Summary

This paper proposes a semantic verification framework using Natural Language Inference (NLI) to evaluate the sensitivity of clinical LLMs to meaning-preserving prompt variations, introducing metrics such as MVS, ΔC, and WCI. Results show that domain specialization does not consistently improve robustness, with both domain-specific and general-purpose models showing mixed performance.
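
One way to picture the NLI-based filtering step is as a bidirectional entailment check: a reformulation is kept only if it entails the original prompt and is entailed by it. The sketch below is an illustration under that assumption and is not the authors' pipeline. The `nli_fn` callable, the `0.9` threshold, and the `toy_nli` stand-in scorer are all hypothetical; in practice `nli_fn` would wrap a trained NLI model.

```python
from typing import Callable, Dict, List

# Hypothetical interface: nli_fn(premise, hypothesis) returns label probabilities.
NLIFn = Callable[[str, str], Dict[str, float]]


def filter_meaning_preserving(
    original: str,
    candidates: List[str],
    nli_fn: NLIFn,
    threshold: float = 0.9,  # assumed cut-off, not taken from the paper
) -> List[str]:
    """Keep candidates that entail the original and are entailed by it."""
    kept = []
    for cand in candidates:
        forward = nli_fn(original, cand)["entailment"]
        backward = nli_fn(cand, original)["entailment"]
        # Requiring both directions rejects rewrites that add or drop information.
        if min(forward, backward) >= threshold:
            kept.append(cand)
    return kept


def toy_nli(premise: str, hypothesis: str) -> Dict[str, float]:
    """Stand-in scorer for demonstration only: flags a mismatch in negation cues."""
    negations = {"no", "not", "denies", "without"}
    p_neg = bool(negations & set(premise.lower().split()))
    h_neg = bool(negations & set(hypothesis.lower().split()))
    if p_neg != h_neg:
        return {"entailment": 0.02, "neutral": 0.08, "contradiction": 0.90}
    return {"entailment": 0.95, "neutral": 0.04, "contradiction": 0.01}


original = "Patient reports chest pain for two days."
candidates = [
    "The patient has had chest pain for two days.",
    "Patient denies chest pain for two days.",
]
kept = filter_meaning_preserving(original, candidates, toy_nli)
print(kept)
```

With the toy scorer, the negated rewrite is dropped and the paraphrase is kept. This is the kind of negation distinction that the paper says embedding-based similarity metrics often miss.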


Cached at: 06/01/26, 09:26 AM

# Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs
Source: [https://arxiv.org/abs/2605.30646](https://arxiv.org/abs/2605.30646)
[View PDF](https://arxiv.org/pdf/2605.30646)

> Abstract: Large Language Models (LLMs) are increasingly used in clinical applications. However, their behavior remains highly sensitive to subtle linguistic variations, such as rephrasing or syntactic variation. This sensitivity poses risks in safety-critical healthcare settings, where semantically equivalent inputs should produce consistent predictions. However, a key challenge is to ensure that prompt variations truly preserve clinical meaning, as embedding-based similarity metrics often fail to capture distinctions involving negation, temporality, or severity. To address this limitation, we propose a semantic verification framework based on Natural Language Inference (NLI) to filter meaning-preserving prompt variations, which are further refined using an LLM-as-a-judge and audited by a clinical expert. In addition, we introduce three metrics to quantify model sensitivity: Meaning-Preserving Variation Sensitivity (MVS), confidence variation (ΔC), and Worst-Case Instability (WCI). We evaluate 16 open-source general-purpose (GP) and medical LLMs within the same model families and parameter scales, using reformulated prompts derived from the DiagnosisQA and MedQA datasets. Our results demonstrate that robustness differences between domain-specific (DS) models are mixed and highly model-dependent, i.e., domain specialization does not consistently improve or reduce robustness to meaning-preserving prompt reformulations. Several DS models rank among the most robust (when compared with GP counterparts), and strong GP baselines remain competitive as well.
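
The abstract names three sensitivity metrics but does not define them. The sketch below shows one plausible reading, and every definition in it is an assumption made for illustration: MVS as the mean fraction of verified variants whose answer differs from the answer to the original prompt, ΔC as the mean absolute change in confidence, and WCI as the fraction of questions where at least one variant flips the answer. The `Prediction` and `Item` types and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    answer: str        # e.g. the selected multiple-choice option
    confidence: float  # model confidence in [0, 1]


@dataclass
class Item:
    original: Prediction        # prediction on the original prompt
    variants: List[Prediction]  # predictions on verified reformulations


def flip_rate(item: Item) -> float:
    """Fraction of variants whose answer differs from the original answer."""
    flips = sum(v.answer != item.original.answer for v in item.variants)
    return flips / len(item.variants)


def mvs(items: List[Item]) -> float:
    """Assumed definition: mean per-question flip rate."""
    return sum(flip_rate(it) for it in items) / len(items)


def delta_c(items: List[Item]) -> float:
    """Assumed definition: mean absolute confidence change, averaged per question."""
    per_item = [
        sum(abs(v.confidence - it.original.confidence) for v in it.variants)
        / len(it.variants)
        for it in items
    ]
    return sum(per_item) / len(per_item)


def wci(items: List[Item]) -> float:
    """Assumed definition: share of questions with at least one answer flip."""
    return sum(flip_rate(it) > 0 for it in items) / len(items)


# Made-up predictions for two questions, two variants each.
items = [
    Item(Prediction("A", 0.9), [Prediction("A", 0.8), Prediction("B", 0.6)]),
    Item(Prediction("C", 0.7), [Prediction("C", 0.7), Prediction("C", 0.5)]),
]
print(round(mvs(items), 3), round(delta_c(items), 3), round(wci(items), 3))
```

Under these assumed definitions, averaging per question first keeps a question with many variants from dominating the score. A worst-case measure such as `wci` shows instability that an average flip rate can hide.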

## Submission history

From: Mahdi Alkaeed Khalaf **[v1]** Thu, 28 May 2026 23:03:43 UTC (2,457 KB)
