Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

arXiv cs.CL 06/01/26, 04:00 AM Papers

Summary

This paper proposes a semantic verification framework that uses Natural Language Inference (NLI) to filter prompt variations that preserve clinical meaning, then measures how sensitive clinical LLMs are to those variations. It introduces three metrics: Meaning-Preserving Variation Sensitivity (MVS), confidence variation (ΔC), and Worst-Case Instability (WCI). Across 16 open-source models, domain specialization does not consistently improve robustness: several domain-specific models rank among the most robust, while strong general-purpose baselines remain competitive.

arXiv:2605.30646v1. Abstract: Large Language Models (LLMs) are increasingly used in clinical applications. However, their behavior remains highly sensitive to subtle linguistic variations, such as rephrasing or syntactic variation. This sensitivity poses risks in safety-critical healthcare settings, where semantically equivalent inputs should produce consistent predictions. A key challenge is to ensure that prompt variations truly preserve clinical meaning, as embedding-based similarity metrics often fail to capture distinctions involving negation, temporality, or severity. To address this limitation, we propose a semantic verification framework based on Natural Language Inference (NLI) to filter meaning-preserving prompt variations, which are further refined using an LLM-as-a-judge and audited by a clinical expert. In addition, we introduce three metrics to quantify model sensitivity: Meaning-Preserving Variation Sensitivity (MVS), confidence variation (ΔC), and Worst-Case Instability (WCI). We evaluate 16 open-source general-purpose (GP) and medical LLMs within the same model families and parameter scales, using reformulated prompts derived from the DiagnosisQA and MedQA datasets. Our results demonstrate that robustness differences between domain-specific (DS) models and their GP counterparts are mixed and highly model-dependent, i.e., domain specialization does not consistently improve or reduce robustness to meaning-preserving prompt reformulations. Several DS models rank among the most robust (when compared with GP counterparts), and strong GP baselines remain competitive as well.
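
The NLI filtering step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say how entailment is scored or thresholded, so the bidirectional check, the `threshold` value, and the `toy_entail_prob` stand-in (a keyword heuristic, not a real NLI model) are hypothetical. The abstract also does not define MVS, ΔC, or WCI, so `flip_rate` is only a generic answer-flip measure and not one of the paper's metrics.

```python
from typing import Callable, List

# Hypothetical scorer type: maps (premise, hypothesis) to an entailment probability.
EntailFn = Callable[[str, str], float]


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model (assumption, not a real classifier).

    It only checks whether both texts agree on the presence of a negation cue.
    """
    negation_cues = {"no", "not", "denies", "without"}
    premise_negated = bool(negation_cues & set(premise.lower().split()))
    hypothesis_negated = bool(negation_cues & set(hypothesis.lower().split()))
    return 0.95 if premise_negated == hypothesis_negated else 0.05


def filter_meaning_preserving(
    original: str,
    candidates: List[str],
    entail_prob: EntailFn,
    threshold: float = 0.9,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Keep candidates that entail the original and are entailed by it."""
    kept = []
    for candidate in candidates:
        forward = entail_prob(original, candidate)
        backward = entail_prob(candidate, original)
        if min(forward, backward) >= threshold:
            kept.append(candidate)
    return kept


def flip_rate(original_answer: str, variant_answers: List[str]) -> float:
    """Fraction of variant answers that differ from the original answer."""
    if not variant_answers:
        return 0.0
    flips = sum(1 for answer in variant_answers if answer != original_answer)
    return flips / len(variant_answers)


original_prompt = "Patient reports chest pain for two days."
rewrites = [
    "The patient has had chest pain for two days.",
    "Patient denies chest pain for two days.",
]
kept_rewrites = filter_meaning_preserving(original_prompt, rewrites, toy_entail_prob)
```

In practice `toy_entail_prob` would be replaced by a call to an actual NLI classifier. Checking entailment in both directions is one common way to approximate semantic equivalence, since one-directional entailment also accepts rewrites that drop clinical detail.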

Cached at: 06/01/26, 09:26 AM

# Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs
Source: [https://arxiv.org/abs/2605.30646](https://arxiv.org/abs/2605.30646)
[View PDF](https://arxiv.org/pdf/2605.30646)

## Submission history

From: Mahdi Alkaeed Khalaf [[view email](https://arxiv.org/show-email/b6c2c15c/2605.30646)] **[v1]** Thu, 28 May 2026 23:03:43 UTC (2,457 KB)

Similar Articles

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

On the Persistent Effects of Lexicality in Large Language Mod

When Similar Means Different: Evaluating LLMs on Arabic--Hebrew Cognates
