Tag
This study investigates whether instruction-tuned LLMs (Llama-3.1-8B, Qwen2.5-7B, Mistral-7B, Phi-3-mini) can reliably classify Correct Information Units in aphasic discourse transcripts. Few-shot prompting yields competitive F1 scores (0.776–0.817) for three models, but performance varies by aphasia severity, and agreement with human annotators remains insufficient for fully autonomous use.
This paper proposes ReportQA, a QA-based framework for evaluating radiology reports that uses LLMs to answer clinically relevant questions, demonstrating better alignment with radiologist judgments than existing metrics.
This paper presents a computational audit of representational bias in ClinicalBERT, finding that demographic associations are amplified by the model itself rather than inherited from training data.
This paper presents a fully local, two-stage LLM pipeline using MedGemma-27B for filling Case Report Forms from clinical notes, achieving a macro-F1 of 0.55 on the English test track and securing second place among local open-source submissions.
EDEN is a large-scale corpus of anonymized clinical notes from Italian emergency departments, with a subset manually annotated for structured information extraction. It aims to support LLM development for medical applications in Italian.
This paper demonstrates that supervised fine-tuning with synthetic rationale data consistently harms prediction performance for Alzheimer's disease detection compared to label-only fine-tuning, across many configurations and model families. The degradation persists despite high-quality rationales and is attributed to a conflict between narrative plausibility and discriminative optimization.
This paper presents an iterative imbalance-aware fine-tuning approach using Qwen3-8B with QLoRA for psychological defense mechanism classification, achieving a macro-F1 of 0.3917 and ranking 4th out of 21 teams in the PsyDefDetect 2026 shared task.
This paper introduces SafeRx-Agent, a knowledge-grounded multi-agent framework for safe and explainable medication recommendation. It generates fine-grained ATC code predictions while controlling for drug interactions and contraindications, and is evaluated on the MIMIC-III and MIMIC-IV datasets.
This paper investigates the risk of sensitive information inference from exported LLM representations in clinical summarization, showing that reducing leakage from one vector artifact does not guarantee privacy in others. It introduces SurfaceLoRA, a fine-tuning method that reduces race recovery from targeted vectors while preserving utility.
This paper introduces EPPC-OASIS, an ontology-aware adaptation method for extracting structured communication behaviors from secure patient-provider messages. The approach combines Wasserstein alignment during fine-tuning with inference refinement procedures, achieving modest improvements over baselines on a de-identified corpus.
MedicalBench is a new benchmark for evaluating large language models on medical concept extraction from electronic health records, focusing on implicit reasoning and evidence grounding. It includes 823 expert-annotated examples and shows that current models perform modestly, highlighting the difficulty of extracting implicitly stated medical concepts.
This paper explores using few-shot prompted LLMs for actionable triage of online patient inquiries into four categories: self-care, schedule-visit, urgent-clinician-review, or emergency-referral. The best model (Claude Haiku 4.5 with 12-shot prompting) achieves a macro-F1 of 0.475, surpassing supervised baselines, but the authors conclude that LLMs can support triage prioritization and selective human review, not autonomous deployment.
This paper introduces ClinicalBench and the EpiKG system, evaluating assertion-aware retrieval for clinical question answering on MIMIC-IV data across multiple LLMs. It demonstrates that handling negation and temporality in retrieval significantly improves performance over standard baselines.
This paper presents a deployment-oriented stress-testing framework to evaluate how well large language models identify side effects of breast cancer radiation treatments. The study highlights limitations in LLM reliability, such as sensitivity to minor documentation changes and under-recall of rare side effects, suggesting that grounding outputs in clinician-curated lists improves robustness.
RADS uses reinforcement learning to pick the most informative samples for few-shot fine-tuning, boosting transfer-learning accuracy on low-resource, highly imbalanced clinical datasets.
FD-NL2SQL is a feedback-driven natural language to SQL system for clinical oncology databases that improves with use through clinician edits and logic-based SQL augmentation. The system decomposes natural language questions into predicates, retrieves expert-verified exemplars, and synthesizes executable SQL with continuous learning capabilities.