This paper compares two strategies for injecting structured biomedical knowledge from the UMLS Metathesaurus into language models: continual pretraining (embedding knowledge into model parameters) and GraphRAG (querying a knowledge graph at inference time). Results show improvements on biomedical QA benchmarks, with GraphRAG on LLaMA 3-8B yielding gains of over 3 and 5 accuracy points on PubMedQA and BioASQ, respectively, without any retraining.
DeepER-Med introduces an agentic AI framework for evidence-based medical research with explicit evidence-appraisal criteria and a new benchmark dataset (DeepER-MedQA) of 100 expert-curated medical questions. It outperforms production platforms and is clinically validated on real-world cases.
MedConclusion introduces a large-scale benchmark of 5.7 million PubMed structured abstracts for evaluating LLMs on generating biomedical conclusions from structured scientific evidence. The study finds that conclusion writing is behaviorally distinct from summarization and that current automatic metrics cluster strong models closely together.