Tag
This paper explores transfer learning for mapping FHIR questionnaire items to LOINC codes using retrieval methods, comparing six approaches on a small evaluation set.
This paper introduces the task of extracting applicability conditions for therapeutic drug-disease relations from biomedical literature, creates a manually annotated dataset of triples, and proposes a LoRA-enhanced method that outperforms existing baselines.
AAbAAC is a manually annotated corpus of 115 PubMed abstracts for autoimmunity information extraction, focusing on entities like autoimmune diseases and autoantibodies. The study demonstrates improved NER performance after fine-tuning on this corpus.
Introduces BioDivergence, a benchmark and evaluation framework for detecting context-conditioned contradictions in biomedical abstracts, featuring a six-class conflict taxonomy and a silver dataset of 11,865 claim pairs.
A large-scale study across 5 models (7B–72B), 10 biomedical QA datasets, 4 retrieval methods, and 4 corpora finds that RAG yields only small and inconsistent gains (1–2 points) over no-retrieval baselines in biomedical question answering. The study concludes that the main bottleneck is not retrieval quality but models' limited ability to effectively use retrieved evidence.
This paper presents a robust evaluation framework and training strategies for biomedical publication type and study design classification, using knowledge-guided perturbations to mitigate reliance on spurious features.
This paper compares two strategies for injecting structured biomedical knowledge from the UMLS Metathesaurus into language models: continual pretraining (embedding knowledge into model parameters) and GraphRAG (querying a knowledge graph at inference time). Results show improvements on biomedical QA benchmarks, with GraphRAG on LLaMA 3-8B gaining more than 3 and 5 accuracy points on PubMedQA and BioASQ, respectively, without any retraining.
DeepER-Med introduces an agentic AI framework for evidence-based medical research with explicit evidence appraisal criteria, along with a new benchmark dataset (DeepER-MedQA) of 100 expert-curated medical questions. The paper reports superior performance over production platforms, with clinical validation on real-world cases.
MedConclusion introduces a large-scale benchmark of 5.7 million PubMed structured abstracts for evaluating LLMs on biomedical conclusion generation from structured scientific evidence. The study finds that conclusion writing is behaviorally distinct from summarization and that current automatic metrics cluster strong models closely together.