Tag
This paper proposes a semantic verification framework that uses Natural Language Inference (NLI) to evaluate how sensitive clinical LLMs are to meaning-preserving prompt variations, and introduces metrics such as MVS, ΔC, and WCI. The results show that domain specialization does not consistently improve robustness: both domain-specific and general-purpose models show mixed performance.