DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI

arXiv cs.AI Papers

Summary

DeepER-Med is an agentic AI framework for evidence-based medical research built around explicit, inspectable evidence-appraisal criteria, accompanied by DeepER-MedQA, a benchmark of 100 expert-curated medical research questions. In expert evaluation it outperforms widely used production-grade platforms, and its conclusions aligned with clinical recommendations in seven of eight real-world clinical cases.



# DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI
Source: https://arxiv.org/abs/2604.15456
Authors: Zhizheng Wang (https://arxiv.org/search/cs?searchtype=author&query=Wang,+Z), Chih-Hsuan Wei (https://arxiv.org/search/cs?searchtype=author&query=Wei,+C), Joey Chan (https://arxiv.org/search/cs?searchtype=author&query=Chan,+J), Robert Leaman (https://arxiv.org/search/cs?searchtype=author&query=Leaman,+R), Chi-Ping Day (https://arxiv.org/search/cs?searchtype=author&query=Day,+C), Chuan Wu (https://arxiv.org/search/cs?searchtype=author&query=Wu,+C), Mark A Knepper (https://arxiv.org/search/cs?searchtype=author&query=Knepper,+M+A), Antolin Serrano Farias (https://arxiv.org/search/cs?searchtype=author&query=Farias,+A+S), Jordina Rincon-Torroella (https://arxiv.org/search/cs?searchtype=author&query=Rincon-Torroella,+J), Hasan Slika (https://arxiv.org/search/cs?searchtype=author&query=Slika,+H), Betty Tyler (https://arxiv.org/search/cs?searchtype=author&query=Tyler,+B), Ryan Huu-Tuan Nguyen (https://arxiv.org/search/cs?searchtype=author&query=Nguyen,+R+H), Asmita Indurkar (https://arxiv.org/search/cs?searchtype=author&query=Indurkar,+A), Mélanie Hébert (https://arxiv.org/search/cs?searchtype=author&query=H%C3%A9bert,+M), Shubo Tian (https://arxiv.org/search/cs?searchtype=author&query=Tian,+S), Lauren He (https://arxiv.org/search/cs?searchtype=author&query=He,+L), Noor Naffakh (https://arxiv.org/search/cs?searchtype=author&query=Naffakh,+N), Aseem Aseem (https://arxiv.org/search/cs?searchtype=author&query=Aseem,+A), Nicholas Wan (https://arxiv.org/search/cs?searchtype=author&query=Wan,+N), Emily Y Chew (https://arxiv.org/search/cs?searchtype=author&query=Chew,+E+Y), Tiarnan D L Keenan (https://arxiv.org/search/cs?searchtype=author&query=Keenan,+T+D+L), Zhiyong Lu (https://arxiv.org/search/cs?searchtype=author&query=Lu,+Z)

View PDF (https://arxiv.org/pdf/2604.15456)

> Abstract: Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems lack explicit and inspectable criteria for evidence appraisal, creating a risk of compounding errors and making it difficult for researchers and clinicians to assess the reliability of their outputs. In parallel, current benchmarking approaches rarely evaluate performance on complex, real-world medical questions. Here, we introduce DeepER-Med, a Deep Evidence-based Research framework for Medicine with an agentic AI system. DeepER-Med frames deep medical research as an explicit and inspectable workflow of evidence-based generation, consisting of three modules: research planning, agentic collaboration, and evidence synthesis. To support realistic evaluation, we also present DeepER-MedQA, an evidence-grounded dataset comprising 100 expert-level research questions derived from authentic medical research scenarios and curated by a multidisciplinary panel of 11 biomedical experts. Expert manual evaluation demonstrates that DeepER-Med consistently outperforms widely used production-grade platforms across multiple criteria, including the generation of novel scientific insights. We further demonstrate the practical utility of DeepER-Med through eight real-world clinical cases. Human clinician assessment indicates that DeepER-Med's conclusions align with clinical recommendations in seven cases, highlighting its potential for medical research and decision support.
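The abstract names three modules (research planning, agentic collaboration, evidence synthesis) and stresses that evidence appraisal should be explicit and inspectable. A minimal sketch of what such a pipeline could look like is below; all function names, the stubbed retrieval step, and the appraisal scale are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a three-module evidence-based research pipeline,
# loosely mirroring the modules named in the abstract. The appraisal scale,
# names, and stubbed logic are illustrative only.
from dataclasses import dataclass

# Illustrative evidence hierarchy: higher score = stronger study design.
APPRAISAL_SCALE = {
    "systematic_review": 5,
    "randomized_trial": 4,
    "cohort_study": 3,
    "case_report": 2,
    "expert_opinion": 1,
}

@dataclass
class Evidence:
    claim: str
    study_design: str

    @property
    def score(self) -> int:
        return APPRAISAL_SCALE.get(self.study_design, 0)

def plan_research(question: str) -> list[str]:
    """Module 1: decompose the question into sub-questions (stubbed)."""
    return [f"{question} -- mechanism", f"{question} -- outcomes"]

def collaborate(subquestions: list[str]) -> list[Evidence]:
    """Module 2: agents retrieve evidence per sub-question (stubbed with
    a fixed study design; a real system would search the literature)."""
    return [Evidence(claim=sq, study_design="randomized_trial")
            for sq in subquestions]

def synthesize(evidence: list[Evidence], min_score: int = 3) -> list[str]:
    """Module 3: keep only evidence meeting an explicit appraisal
    threshold, strongest first, so the filtering step is inspectable."""
    kept = sorted((e for e in evidence if e.score >= min_score),
                  key=lambda e: e.score, reverse=True)
    return [e.claim for e in kept]

question = "Does drug X reduce stroke risk?"
report = synthesize(collaborate(plan_research(question)))
```

The point of the sketch is the shape of the workflow: because the appraisal criteria live in a named table rather than inside a prompt, a reviewer can see exactly why a piece of evidence was kept or dropped.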

## Submission history

From: Zhizheng Wang [view email (https://arxiv.org/show-email/798f7fe7/2604.15456)] **[v1]** Thu, 16 Apr 2026 18:17:24 UTC (4,720 KB)

Similar Articles

Mind DeepResearch Technical Report

Hugging Face Daily Papers

MindDR is a multi-agent deep research framework using a three-agent architecture (Planning, DeepSearch, Report) and a four-stage training pipeline, achieving competitive performance with ~30B-parameter models on multiple benchmarks. Developed by Li Auto and deployed as an online product, it also introduces MindDR Bench, a 500-query Chinese benchmark for evaluating deep research capabilities.

DR³-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Hugging Face Daily Papers

DR³-Eval is a benchmark for evaluating deep research agents on multimodal, multi-file report generation with a realistic web environment simulation and comprehensive evaluation framework measuring information recall, factual accuracy, citation coverage, instruction following, and depth quality.

Enabling a new model for healthcare with AI co-clinician

Google DeepMind Blog

Google DeepMind announces an AI co-clinician research initiative aimed at improving healthcare delivery through 'triadic care,' where AI agents assist patients under physician supervision. The system demonstrated high accuracy and zero critical errors in a study of primary care queries, outperforming existing evidence synthesis tools.

Introducing HealthBench

OpenAI Blog

OpenAI introduces HealthBench, a new benchmark for evaluating AI systems in healthcare contexts, created with 262 physicians across 60 countries. The benchmark includes 5,000 realistic health conversations with physician-written rubrics to assess model performance on meaningful, trustworthy, and improvable metrics.