Tag
MedCUA-Bench is a new benchmark for evaluating computer-use agents on clinical software tasks, covering 18 scenarios across 10 medical domains and including safety dimensions. Results show that current agents perform poorly, especially on real OpenEMR, which points to a significant reliability gap.
AMNESIA is the first large-scale open-source benchmark for medical unlearning, comprising 70,560 QA pairs from 8,820 patient notes across 11 diseases, designed to evaluate forgetting of both factual and reasoning knowledge in LLMs.
This paper investigates the role of inductive bias in time-series pretraining for clinical data, proposing PathoFM, an encoder-centric transformer pretrained on multivariate gait windows. The study compares different pretraining objectives and finds that dynamics-centric mixtures yield the most balanced transfer across classification and regression tasks.
This paper investigates how large language models maintain correct beliefs under adversarial pressure in clinical settings. It proposes R-FT fine-tuning to improve epistemic resilience while balancing it against corrigibility, and demonstrates significant robustness gains on medical benchmarks.
AnchorDiff is a topology-aware masked diffusion framework for radiology report generation that integrates RadGraph-derived clinical anchors and confidence-based rewriting, achieving state-of-the-art results on the MIMIC-CXR and MIMIC-RG4 benchmarks.