This paper presents a deployment-oriented stress-testing framework for evaluating how well large language models (LLMs) identify side effects of breast cancer radiation treatment. The study highlights limitations in LLM reliability, such as sensitivity to minor documentation changes and under-recall of rare side effects, and suggests that grounding outputs in clinician-curated lists improves robustness.
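As a rough illustration of what grounding in a clinician-curated list and measuring recall could look like, the following is a minimal sketch, not the paper's implementation. The function names (`normalize`, `ground_to_curated`, `recall`), the exact-match rule after normalization, and the example terms are all assumptions made for illustration; the paper's actual matching procedure and side-effect lists are not given in this summary.

```python
from typing import Iterable, Set


def normalize(term: str) -> str:
    """Lowercase a term and collapse whitespace so trivial variants match."""
    return " ".join(term.lower().split())


def ground_to_curated(predicted: Iterable[str], curated: Iterable[str]) -> Set[str]:
    """Keep only predicted terms that appear in the curated list.

    Assumption: exact match after normalization. A real system would
    likely also need synonym or ontology mapping.
    """
    allowed = {normalize(t) for t in curated}
    return {normalize(t) for t in predicted if normalize(t) in allowed}


def recall(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Fraction of reference terms that were recovered by the prediction."""
    ref = {normalize(t) for t in reference}
    if not ref:
        return 1.0  # nothing to recall, so treat as vacuously complete
    pred = {normalize(t) for t in predicted}
    return len(pred & ref) / len(ref)


# Hypothetical example data, for illustration only.
curated_list = ["Radiation dermatitis", "Fatigue", "Lymphedema", "Rib fracture"]
model_output = ["fatigue", "  radiation   dermatitis", "hair regrowth"]

grounded = ground_to_curated(model_output, curated_list)
score = recall(grounded, curated_list)
```

In this sketch, grounding discards any term outside the curated list, and recall shows how many curated terms were still missed. Rare side effects that the model never mentions lower the recall score even after grounding.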