ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Hugging Face Daily Papers 05/19/26, 12:00 AM Papers

clinical-reasoning multimodal evidence-seeking agentic-framework llm medical-ai

Summary

ClinSeekAgent is an automated agentic framework that enables large language models to actively acquire and synthesize multimodal clinical evidence from raw data sources, improving decision-making accuracy in both text-only and multimodal tasks. The paper also introduces the ClinSeek-Bench benchmark and a distilled model, ClinSeek-35B-A3B, that achieves strong performance on agentic clinical reasoning.

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.
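The seek, refine, and integrate cycle described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `SeekState`, `seek_evidence`, `plan`, and `revise`, the tool keys, and the stub tool outputs are all hypothetical, and in a real system the planner and reviser would be LLM calls rather than the toy functions used here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool signature: takes the clinical query, returns evidence text.
Tool = Callable[[str], str]


@dataclass
class SeekState:
    """Running state of one evidence-seeking episode (illustrative only)."""
    query: str
    evidence: List[Tuple[str, str]] = field(default_factory=list)  # (tool, finding)
    hypothesis: str = "undetermined"


def seek_evidence(
    query: str,
    tools: Dict[str, Tool],
    plan: Callable[[SeekState], Optional[str]],
    revise: Callable[[SeekState], str],
    max_steps: int = 5,
) -> SeekState:
    """Pick a tool, collect its finding, revise the hypothesis, and repeat."""
    state = SeekState(query=query)
    for _ in range(max_steps):
        tool_name = plan(state)          # decide which source to consult next
        if tool_name is None:            # planner judges the evidence sufficient
            break
        finding = tools[tool_name](state.query)
        state.evidence.append((tool_name, finding))
        state.hypothesis = revise(state)  # refine as new information arrives
    return state


def demo() -> SeekState:
    # Stub tools standing in for a knowledge base, raw EHR access, and imaging.
    tools: Dict[str, Tool] = {
        "knowledge_base": lambda q: f"guideline text for: {q}",
        "ehr": lambda q: "lab values retrieved from raw records",
        "imaging": lambda q: "chest X-ray tool output",
    }
    order = list(tools)

    def plan(state: SeekState) -> Optional[str]:
        # Toy planner: consult each tool once, in a fixed order, then stop.
        used = len(state.evidence)
        return order[used] if used < len(order) else None

    def revise(state: SeekState) -> str:
        # Toy reviser: only records how much evidence backs the hypothesis.
        return f"hypothesis after {len(state.evidence)} evidence item(s)"

    return seek_evidence("suspected pneumonia?", tools, plan, revise)


if __name__ == "__main__":
    final = demo()
    for tool_name, finding in final.evidence:
        print(f"{tool_name}: {finding}")
    print(final.hypothesis)
```

Keeping the planner and reviser as injected callables means the same loop could be driven by different host models, which loosely mirrors the abstract's framing of ClinSeekAgent as an inference-time wrapper around frontier LLMs.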

Original Article

Cached at: 05/22/26, 06:23 AM

Paper page - ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Source: https://huggingface.co/papers/2605.20176

Abstract

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.
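The score changes quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only numbers stated above. The dictionary labels are descriptive and not taken from the paper, and the Qwen3.5-35B-A3B baseline value is inferred here from the reported score and gain rather than quoted from the source.

```python
# (before, after) pairs exactly as reported in the abstract.
reported = {
    "Claude Opus 4.6, text-only EHR": (60.0, 63.2),
    "MiniMax M2.5, text-only EHR": (43.1, 47.3),
    "Claude Opus 4.6, multimodal": (47.5, 62.6),
}


def gain(before: float, after: float) -> float:
    """Absolute improvement in points, rounded to one decimal place."""
    return round(after - before, 1)


gains = {setting: gain(*scores) for setting, scores in reported.items()}

# Inferred, not reported: 34.0 average F1 minus the stated +11.9 point gain.
implied_qwen_baseline = round(34.0 - 11.9, 1)

if __name__ == "__main__":
    for setting, delta in gains.items():
        print(f"{setting}: +{delta}")
    print(f"Implied Qwen3.5-35B-A3B baseline: {implied_qwen_baseline}")
```

On these figures the multimodal gain for Claude Opus 4.6 (+15.1) is several times larger than either text-only gain (+3.2 and +4.2).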

Get this paper in your agent:

hf papers read 2605.20176

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### UCSC-VLAA/ClinSeek-35B-A3B Text Generation • 35B • Updated 1 day ago • 44

Datasets citing this paper: 1

#### UCSC-VLAA/ClinSeek-Bench Viewer • Updated 1 day ago • 2.79k • 26
