ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning
Summary
ClinSeekAgent is an automated agentic framework that enables large language models to actively acquire and synthesize multimodal clinical evidence from raw data sources, improving decision-making accuracy in both text-only and multimodal tasks. It introduces the ClinSeek-Bench benchmark and a distilled model ClinSeek-35B-A3B that achieves strong performance on agentic clinical reasoning.
Source: https://huggingface.co/papers/2605.20176
Abstract
Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.
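The seek, refine, and integrate cycle described in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: every name here (`SeekState`, `seek_evidence`, `toy_plan`, `toy_refine`, the three tool keys) is hypothetical, the tools are placeholder stubs rather than real knowledge-base, EHR, or imaging interfaces, and the planner is a fixed rule standing in for an LLM's decision about what evidence to fetch next.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool type: takes the clinical query, returns evidence as text.
Tool = Callable[[str], str]


@dataclass
class SeekState:
    """Illustrative agent state: the query, gathered evidence, current hypothesis."""
    query: str
    evidence: List[Tuple[str, str]] = field(default_factory=list)  # (tool name, finding)
    hypothesis: str = "undetermined"


def seek_evidence(
    query: str,
    tools: Dict[str, Tool],
    plan: Callable[[SeekState], Optional[str]],
    refine: Callable[[SeekState], str],
    max_steps: int = 5,
) -> SeekState:
    """Repeat: pick a tool, gather evidence, refine the hypothesis."""
    state = SeekState(query=query)
    for _ in range(max_steps):
        tool_name = plan(state)           # decide which source to consult next
        if tool_name is None:             # planner signals it has enough evidence
            break
        finding = tools[tool_name](state.query)
        state.evidence.append((tool_name, finding))
        state.hypothesis = refine(state)  # update hypothesis with the new evidence
    return state


# Placeholder stubs (assumptions for illustration only).
def toy_plan(state: SeekState) -> Optional[str]:
    used = {name for name, _ in state.evidence}
    for name in ("knowledge_base", "ehr", "imaging"):
        if name not in used:
            return name
    return None


def toy_refine(state: SeekState) -> str:
    return f"hypothesis after {len(state.evidence)} evidence item(s)"


toy_tools: Dict[str, Tool] = {
    "knowledge_base": lambda q: "guideline text (placeholder)",
    "ehr": lambda q: "lab values (placeholder)",
    "imaging": lambda q: "CXR finding (placeholder)",
}

result = seek_evidence("toy clinical query", toy_tools, toy_plan, toy_refine)
print([name for name, _ in result.evidence])
print(result.hypothesis)
```

In this toy setup the loop consults each stub source once and then stops because the planner returns `None`. In a Curated Input setting, by contrast, the evidence list would be fixed in advance and no planning step would run.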
Get this paper in your agent:
hf papers read 2605.20176
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
UCSC-VLAA/ClinSeek-35B-A3B (Text Generation, 35B)
Datasets citing this paper: 1
UCSC-VLAA/ClinSeek-Bench