PathoSage: Towards Multi-Source Evidence Adjudication in Pathology via Experience-Aware Agentic Workflow

arXiv cs.AI Papers

Summary

PathoSage introduces a three-stage framework for pathology multimodal reasoning that separates knowledge retrieval, evidence collection, and evidence adjudication to reduce hallucinations and handle conflicting evidence, featuring a training-free Beta-Bernoulli experience system for modeling tool reliability.

arXiv:2606.07549v1 Announce Type: new Abstract: Recent advances in Multimodal Large Language Models (MLLMs) and agent workflows have shown strong promise for computational pathology, yet reliable patch-level reasoning remains challenging. End-to-end pathology MLLMs often hallucinate morphological features, while recent agentic systems usually merge tool outputs and retrieved knowledge into a shared context, making decisions vulnerable to conflicting evidence and context contamination. We propose PathoSage, a three-stage framework that explicitly separates knowledge retrieval, evidence collection, and evidence adjudication for patch-level pathology multimodal reasoning. Its core component, Structured Evidence Deliberation, independently evaluates heterogeneous evidence from tools, performs conflict analysis, and generates the final judgment in a fresh context to reduce anchoring bias. We further introduce a training-free Beta-Bernoulli experience system with continuous credit assignment to model long-term tool reliability and construct similarity-weighted priors for future tool use. Experiments show that PathoSage effectively mitigates VQA hallucinations and classifier disagreement, outperforming strong pathology MLLM and agentic baselines. Our results highlight explicit evidence adjudication and reliability-aware tool modeling as key ingredients for robust pathology agents.

Cached at: 06/09/26, 08:52 AM

# PathoSage: Towards Multi-Source Evidence Adjudication in Pathology via Experience-Aware Agentic Workflow
Source: [https://arxiv.org/html/2606.07549](https://arxiv.org/html/2606.07549)
Chengyang Zhang1,2, Wenchuan Zhang2∗, Bo Li3, Mengran Li4, Bob Zhang3, Yuhao Yi1,2, Hong Bu2,†, Jiancheng Lv1,†

1 College of Computer Science, Sichuan University; 2 Department of Pathology and Institute of Clinical Pathology, West China Hospital, Sichuan University; 3 Department of Computer and Information Science, University of Macau; 4 School of Intelligent Systems Engineering, Sun Yat\-sen University

yuhaoyi@scu\.edu\.cn

###### Abstract

Recent advances in Multimodal Large Language Models \(MLLMs\) and agent workflows have shown strong promise for computational pathology, yet reliable patch\-level reasoning remains challenging\. End\-to\-end pathology MLLMs often hallucinate morphological features, while recent agentic systems usually merge tool outputs and retrieved knowledge into a shared context, making decisions vulnerable to conflicting evidence and context contamination\. We propose PathoSage, a three\-stage framework that explicitly separates knowledge retrieval, evidence collection, and evidence adjudication for patch\-level pathology multimodal reasoning\. Its core component, Structured Evidence Deliberation, independently evaluates heterogeneous evidence from tools, performs conflict analysis, and generates the final judgment in a fresh context to reduce anchoring bias\. We further introduce a training\-free Beta\-Bernoulli experience system with continuous credit assignment to model long\-term tool reliability and construct similarity\-weighted priors for future tool use\. Experiments show that PathoSage effectively mitigates VQA hallucinations and classifier disagreement, outperforming strong pathology MLLM and agentic baselines\. Our results highlight explicit evidence adjudication and reliability\-aware tool modeling as key ingredients for robust pathology agents\.

## 1 Introduction

In recent years, Multimodal Large Language Models \(MLLMs\) in computational pathology have rapidly advanced from early image\-text representation learning to complex multi\-step reasoning[plip](https://arxiv.org/html/2606.07549#bib.bib52);[quilt1m](https://arxiv.org/html/2606.07549#bib.bib53);[slideseek](https://arxiv.org/html/2606.07549#bib.bib25);[vlsa](https://arxiv.org/html/2606.07549#bib.bib55)\. Consequently, pathology AI is evolving from a monolithic paradigm toward agentic systems that actively invoke external tools, retrieve domain knowledge, and organize analysis workflows\. Moving beyond direct answer generation, these pathology agents increasingly emulate expert behavior by utilizing structured mechanisms to acquire and organize evidence[pathchat](https://arxiv.org/html/2606.07549#bib.bib2);[wsi\-llava](https://arxiv.org/html/2606.07549#bib.bib12);[patho\-agenticrag](https://arxiv.org/html/2606.07549#bib.bib24);[cpathagent](https://arxiv.org/html/2606.07549#bib.bib29);[pathology\-cot](https://arxiv.org/html/2606.07549#bib.bib26)\. Recent studies have advanced multimodal reasoning across both fine\-grained morphological recognition and whole\-slide image \(WSI\) cross\-region analysis[titan](https://arxiv.org/html/2606.07549#bib.bib15);[chief](https://arxiv.org/html/2606.07549#bib.bib58)\. Tool augmentation has also emerged; for example, PathAsst integrated a specialized backbone with visual sub\-models and literature retrieval[pathasst](https://arxiv.org/html/2606.07549#bib.bib23)\. Building on this, recent studies highlight that reliable reasoning requires structured workflows for observation selection, tool invocation, and progressive evidence accumulation, rather than relying solely on stronger visual representations[patho\-agenticrag](https://arxiv.org/html/2606.07549#bib.bib24);[cpathagent](https://arxiv.org/html/2606.07549#bib.bib29);[pathology\-cot](https://arxiv.org/html/2606.07549#bib.bib26)\. 
Taken together, these developments suggest that pathology AI is gradually shifting from isolated multimodal understanding toward more structured systems that must organize, compare, and utilize evidence across multiple sources\.

Despite these advances, existing methods typically merge tool outputs, retrieved information, and model reasoning into a single shared context [Rajendran2025FoundationMI](https://arxiv.org/html/2606.07549#bib.bib56);[peng2025aligning](https://arxiv.org/html/2606.07549#bib.bib57)\. This design is fragile when sources provide *heterogeneous or conflicting evidence*, such as disagreeing classifiers, hallucinated VQA findings, or misaligned retrieved knowledge [chen2026landscape](https://arxiv.org/html/2606.07549#bib.bib60)\. Without explicit evidence adjudication, early biases and context contamination accumulate, reducing reliability and interpretability\. The core challenge, therefore, is not merely how to add more tools or more knowledge, but how to separate, assess, and reconcile heterogeneous evidence before producing a final answer\. This limitation mirrors broader challenges recognized in the literature on reasoning agents, tool use, and retrieval\-augmented generation [react](https://arxiv.org/html/2606.07549#bib.bib36);[toolformer](https://arxiv.org/html/2606.07549#bib.bib38)\. This issue is particularly critical for *patch\-level pathology reasoning*\.
As compact and interpretable units of morphological evidence [zhang2025attention](https://arxiv.org/html/2606.07549#bib.bib65);[shui2026nunext](https://arxiv.org/html/2606.07549#bib.bib68), local patches serve as a natural foundation for clinical judgments, educational assistance, and interactive analysis [conch](https://arxiv.org/html/2606.07549#bib.bib51);[musk](https://arxiv.org/html/2606.07549#bib.bib54);[homie](https://arxiv.org/html/2606.07549#bib.bib59);[patho\-agenticrag](https://arxiv.org/html/2606.07549#bib.bib24);[octomed](https://arxiv.org/html/2606.07549#bib.bib61);[pulsemind](https://arxiv.org/html/2606.07549#bib.bib62);[cx\-mind](https://arxiv.org/html/2606.07549#bib.bib63);[wu2025bridging](https://arxiv.org/html/2606.07549#bib.bib64);[jeddi2026does](https://arxiv.org/html/2606.07549#bib.bib67);[anatomy\-r1](https://arxiv.org/html/2606.07549#bib.bib69)\. The patch\-level setting is an ideal testbed for studying tool\-model interactions: it is relatively controlled, so the central difficulty is not large\-scale navigation itself, but how multi\-source evidence should be organized, compared, and adjudicated\. Yet there remains no unified framework for structurally collecting, reconciling, and modeling heterogeneous evidence, and the fundamental question of *how multi\-source evidence should be adjudicated* remains underexplored [rag](https://arxiv.org/html/2606.07549#bib.bib39);[reflexion](https://arxiv.org/html/2606.07549#bib.bib44);[zhang2026multimodal](https://arxiv.org/html/2606.07549#bib.bib37)\.

![Refer to caption](https://arxiv.org/html/2606.07549v1/x1.png)

Figure 1: Comparison of \(a\) the "black box" VLM approach and \(b\) our proposed PathoSage for evidence\-based pathology analysis\. \(c\) Performance of PathoSage on the PathMMU test set\.

To address this, we propose PathoSage, a three\-stage framework for patch\-level multimodal reasoning that explicitly decomposes the process\. First, in the knowledge retrieval stage, the system retrieves and assesses task\-relevant external knowledge based on the patch and query [patho\-agenticrag](https://arxiv.org/html/2606.07549#bib.bib24);[pathasst](https://arxiv.org/html/2606.07549#bib.bib23)\. Next, the evidence collection stage invokes pathology\-specific tools to gather local visual evidence, deferring the final answer generation [react](https://arxiv.org/html/2606.07549#bib.bib36)\. Finally, the Structured Evidence Deliberation stage independently evaluates tool outputs, performs conflict analysis, and generates a final judgment in a fresh context to minimize historical contamination\. Thus, PathoSage shifts the paradigm from merely *using* tools to explicitly *adjudicating* their evidence\. Furthermore, we introduce a Beta\-Bernoulli experience system to dynamically model tool reliability across similar patches [agrawal2012analysis](https://arxiv.org/html/2606.07549#bib.bib66)\. Rather than assuming static trustworthiness, PathoSage continuously updates posterior estimates based on tool performance and task relevance, bridging single\-instance reasoning with long\-term adaptation for more targeted future tool use [toolmem](https://arxiv.org/html/2606.07549#bib.bib40);[xskill](https://arxiv.org/html/2606.07549#bib.bib41)\. Ultimately, by formalizing evidence organization and adjudication, this work establishes a robust foundation for both practical patch\-level applications and larger\-scale pathology agent systems \(Fig\. [1](https://arxiv.org/html/2606.07549#S1.F1)\)\.

#### Our main contributions are as follows\.

1. We propose PathoSage, a three\-stage agent framework for patch\-level pathology multimodal reasoning that explicitly decouples knowledge retrieval, evidence collection, evidence adjudication, and final answer generation\.
2. We introduce Structured Evidence Deliberation \(SED\) and a Beta\-Bernoulli experience system for heterogeneous evidence assessment, inter\-tool conflict analysis, weighted reasoning, and long\-term reliability\-aware tool utilization\.
3. We build a tool\-augmented system for patch\-level reasoning, validating it across multiple benchmarks to demonstrate the value of explicit evidence adjudication and experience\-based reliability modeling\.

## 2 Related Works

### 2\.1 Pathology Multimodal Large Language Models

In recent years, pathology multimodal large language models have advanced rapidly, with the research focus expanding from early image\-text representation learning to pathology question answering, description generation, interpretability, and more complex multi\-step reasoning\. A natural trend in this evolution is that some studies primarily center on local pathology images, emphasizing fine\-grained morphological recognition, local semantic understanding, and patch\-level question answering, while others further extend to whole\-slide images, modeling cross\-region context, multi\-scale tissue organization, and slide\-level semantic generation\. Representative works along these directions include PathAsst[pathasst](https://arxiv.org/html/2606.07549#bib.bib23), Quilt\-LLaVA[quilt\-llava](https://arxiv.org/html/2606.07549#bib.bib1), PathChat[pathchat](https://arxiv.org/html/2606.07549#bib.bib2), PA\-LLaVA[pa\-llava](https://arxiv.org/html/2606.07549#bib.bib3), PathGen\-LLaVA[pathgen16m](https://arxiv.org/html/2606.07549#bib.bib4), Patho\-R1[patho\-r1](https://arxiv.org/html/2606.07549#bib.bib5), SmartPath\-R1[smartpath\-r1](https://arxiv.org/html/2606.07549#bib.bib6), and TeamPath[teampath](https://arxiv.org/html/2606.07549#bib.bib7), as well as WSICaption[wsicaption](https://arxiv.org/html/2606.07549#bib.bib8), WSI\-VQA[wsi\-vqa](https://arxiv.org/html/2606.07549#bib.bib9), HistGen[histgen](https://arxiv.org/html/2606.07549#bib.bib10), SlideChat[slidechat](https://arxiv.org/html/2606.07549#bib.bib11), WSI\-LLaVA[wsi\-llava](https://arxiv.org/html/2606.07549#bib.bib12), PathAlign[pathalign](https://arxiv.org/html/2606.07549#bib.bib13), ALPaCA[alpaca](https://arxiv.org/html/2606.07549#bib.bib14), TITAN[titan](https://arxiv.org/html/2606.07549#bib.bib15), PathReasoner\-R1[pathreasoner\-r1](https://arxiv.org/html/2606.07549#bib.bib16), CPath\-Omni[cpath\-omni](https://arxiv.org/html/2606.07549#bib.bib17), 
PolyPath[polypath](https://arxiv.org/html/2606.07549#bib.bib18), HistoGPT[histogpt](https://arxiv.org/html/2606.07549#bib.bib19), PRISM2[prism2](https://arxiv.org/html/2606.07549#bib.bib20), Hepato\-LLaVA[hepato\-llava](https://arxiv.org/html/2606.07549#bib.bib22)and PathFound[pathfound](https://arxiv.org/html/2606.07549#bib.bib21)\. Overall, existing pathology MLLMs have demonstrated that pathology understanding cannot rely on a single scale or modality alone, but instead requires connecting local morphological cues with higher\-level histopathological semantics\.

### 2\.2 Tool\-Augmented Reasoning and Pathology Agents

As pathology multimodal systems continue to evolve, an increasing number of studies have shifted the focus from simply enabling models to answer questions toward enabling systems to actively organize the reasoning process\. This trend is typically reflected in the introduction of agentic capabilities such as tool invocation, knowledge retrieval, region navigation, multi\-step observation, and decision trajectory modeling\. Unlike traditional pathology MLLMs, which mainly emphasize end\-to\-end generation, pathology agents place greater emphasis on whether the system can more closely mimic the workflow of pathologists by actively selecting regions of interest, invoking auxiliary modules, and progressively accumulating evidence before reaching a conclusion\. Recent pathology agent research has already expanded to multiple directions, including knowledge\-augmented reasoning, whole\-slide navigation, clinical decision support, prognostic analysis, and biomarker discovery\. Representative systems include Patho\-AgenticRAG[patho\-agenticrag](https://arxiv.org/html/2606.07549#bib.bib24), SlideSeek[slideseek](https://arxiv.org/html/2606.07549#bib.bib25), Pathology\-CoT[pathology\-cot](https://arxiv.org/html/2606.07549#bib.bib26), PathFinder[pathfinder](https://arxiv.org/html/2606.07549#bib.bib27), PathAgent[pathagent](https://arxiv.org/html/2606.07549#bib.bib28), CPathAgent[cpathagent](https://arxiv.org/html/2606.07549#bib.bib29), SurvAgent[survagent](https://arxiv.org/html/2606.07549#bib.bib30), WSI\-agent[wsi\-agent](https://arxiv.org/html/2606.07549#bib.bib31), TissueLab[tissuelab](https://arxiv.org/html/2606.07549#bib.bib32), MMNavAgent[mmnavagent](https://arxiv.org/html/2606.07549#bib.bib33), as well as related agent frameworks for oncology decision\-making and biomarker discovery[ferber2025development](https://arxiv.org/html/2606.07549#bib.bib34);[sage](https://arxiv.org/html/2606.07549#bib.bib35)\. 
Overall, these studies suggest that pathology AI is evolving toward active systems that integrate tool use, knowledge access, and evidence accumulation\. This shift also highlights a deeper challenge: how heterogeneous evidence should be organized and used for reliable reasoning\.

### 2\.3 Multi\-Source Evidence Integration, Conflict Handling, and Reliability Modeling

Although tool augmentation, retrieval\-augmented generation, and agentic reasoning have substantially expanded the capability boundary of multimodal systems, most existing methods still combine tool outputs, retrieved knowledge, and model reasoning within a shared interaction trajectory, leaving the final decision to be made over a single accumulated context [react](https://arxiv.org/html/2606.07549#bib.bib36);[toolformer](https://arxiv.org/html/2606.07549#bib.bib38);[rag](https://arxiv.org/html/2606.07549#bib.bib39);[pathasst](https://arxiv.org/html/2606.07549#bib.bib23);[cpathagent](https://arxiv.org/html/2606.07549#bib.bib29);[pathology\-cot](https://arxiv.org/html/2606.07549#bib.bib26)\. While this design is effective for improving overall capability, it raises an important and still underexplored challenge: when different tools provide heterogeneous, partially relevant, or even conflicting evidence, how should a system explicitly separate *evidence collection* from *evidence adjudication*? This issue is particularly critical in pathology, where classifiers may disagree, VQA modules may hallucinate morphological findings, and retrieved knowledge may only partially align with the image under analysis\. Recent studies have begun to move beyond one\-shot tool use toward memory\- and experience\-aware agents [memos](https://arxiv.org/html/2606.07549#bib.bib42);[memverse](https://arxiv.org/html/2606.07549#bib.bib43)\. ToolMem shows that agents can improve tool selection by summarizing the strengths and weaknesses of tools from prior interactions and retrieving such capability memory at inference time [toolmem](https://arxiv.org/html/2606.07549#bib.bib40)\. XSkill further highlights the importance of continual learning from both *experiences* and *skills*, using visually grounded summarization, cross\-rollout critique, and retrieval\-based adaptation to improve multimodal agents without parameter updates [xskill](https://arxiv.org/html/2606.07549#bib.bib41)\.
However, these methods mainly focus on improving tool selection and continual adaptation, rather than explicitly modeling how heterogeneous tool outputs should be independently assessed, reconciled under conflict, and translated into reliability\-aware final decisions in pathology reasoning\.

![Refer to caption](https://arxiv.org/html/2606.07549v1/x2.png)

Figure 2: Overview of the PathoSage framework, which performs pathology multimodal reasoning through a three\-stage agentic system with an optional experience system\.

## 3 Methods

### 3\.1 Overview

Given a pathology image $I$ and a clinical query $Q$, PathoSage uses a set of specialist tools $\mathcal{T}=\{t_1,\dots,t_m\}$ to produce an answer $\mathcal{A}$\. The central problem is the adjudication of multi\-source evidence: individual tools may return complementary or even contradictory conclusions, and naively presenting all outputs to a single VLM call induces anchoring bias [echterhoff2024cognitive](https://arxiv.org/html/2606.07549#bib.bib48), which degrades fusion quality\. To address this, PathoSage decouples the analysis into three phases with strict information isolation \(Fig\. [2](https://arxiv.org/html/2606.07549#S2.F2)\): a RAG\-based knowledge retrieval stage provides domain\-grounded tool planning; a ReAct loop collects tool evidence; and a Structured Evidence Deliberation \(SED\) procedure independently assesses, algorithmically weighs, and synthesizes the collected evidence\. A Bayesian experience system optionally tracks per\-tool reliability and distills cross\-rollout strategy knowledge to progressively refine the adjudication process without parameter updates\. We detail each component in the following sections\.

### 3\.2 Knowledge\-augmented Tool Planning

To ground tool selection in domain knowledge rather than VLM intuition, PathoSage incorporates a retrieval\-augmented knowledge module before tool execution begins, as illustrated in Fig\.[3](https://arxiv.org/html/2606.07549#S3.F3)\.

Pathology Knowledge Base\. We adopt the pathology textbook corpus from Patho\-AgenticRAG [patho\-agenticrag](https://arxiv.org/html/2606.07549#bib.bib24), comprising over 200,000 curated pages from approximately 600 authoritative textbooks\. Each page is embedded as an image–text pair using ColQwen2 [fayssecolpali](https://arxiv.org/html/2606.07549#bib.bib45) into a shared vector space and indexed via HNSW [malkov2018efficient](https://arxiv.org/html/2606.07549#bib.bib47) in Milvus [wang2021milvus](https://arxiv.org/html/2606.07549#bib.bib46), yielding a database $\mathcal{D}$ of over 150 million vectors that supports efficient joint text–image retrieval\.

Retrieval and Page Understanding\. Given the input query and image, we construct one or more keyword\-based retrieval requests and retrieve the top\-20 textbook pages per request from $\mathcal{D}$\. Each top\-ranked page is then processed by the VRAG agent $\mathcal{R}$ from Patho\-AgenticRAG for multi\-turn document understanding\. The agent iteratively issues sub\-queries, localizes relevant regions, and produces a final summary, autonomously terminating within at most three turns\. This yields a structured knowledge summary $\mathcal{K}=\mathcal{R}(\mathcal{D},Q,I)$ grounded in textbook evidence\.

Tool Guidance Generation\. The host VLM $\mathcal{V}$ receives $\mathcal{K}$ and generates a concise tool selection plan $\mathcal{G}=\mathcal{V}(\mathcal{K},\mathcal{T},Q,I)$, specifying which analysis dimensions \(e\.g\., cellular morphology, tissue classification\) are most relevant to the query\. $\mathcal{G}$ is injected into the Phase A system prompt, and $\mathcal{K}$ is later provided to the SED independent assessment in Phase B as a reference for cross\-checking tool conclusions\.

![Refer to caption](https://arxiv.org/html/2606.07549v1/x3.png)

Figure 3: The RAG retriever pipeline\. Per\-candidate queries retrieve textbook pages from a Milvus database, which are read by the RAG agent and summarized into tool guidance\.

### 3\.3 Phase A: ReAct Evidence Collection

While pathology VLMs have achieved impressive performance, they remain susceptible to hallucinating morphological features that are not present in the image\. Moreover, these models generate holistic textual descriptions but cannot perform quantitative analyses, such as cell counting and tissue\-type distribution\. In contrast, specialist models provide these capabilities, yet their outputs are heterogeneous in both format and scope\.

To harness these complementary strengths, PathoSage includes a series of specialist tools spanning three analysis categories\. Specifically, cell segmentation provides per\-cell type counts and spatial distribution statistics\. Patch classification maps image patches to tissue\-type labels\. Pathology\-specialized VLM provides free\-form diagnostic reasoning\. This multi\-dimensional analysis paradigm mirrors clinical pathology practice, where pathologists integrate morphological examination, quantitative biomarker scoring, and pattern\-based judgment to reach a diagnosis[pathchat](https://arxiv.org/html/2606.07549#bib.bib2);[shaktah2025application](https://arxiv.org/html/2606.07549#bib.bib49)\.

During Phase A, the host VLM $\mathcal{V}$ iteratively selects and invokes tools from $\mathcal{T}$ via function calling, guided by the tool plan $\mathcal{G}$\. Each tool $t\in\mathcal{T}$ is paired with a structured operation description specifying its input requirements and output format, so that $\mathcal{V}$ can match the analytical needs identified in $\mathcal{G}$ to the appropriate tool capabilities\. At each iteration $i$, $\mathcal{V}$ selects a tool $t_i$ and obtains its output, forming a reasoning step $\mathcal{P}_i$:

$$\mathcal{P}_i=\langle t_i,o_i\rangle=\langle t_i,t_i(I,Q)\rangle,\quad t_i\in\mathcal{T},\tag{1}$$

where $o_i$ represents the tool’s structured result\. Crucially, $\mathcal{V}$ is instructed only to collect evidence during this phase and not to produce a final answer\. This design reflects a deliberate separation of evidence gathering from evidence adjudication\. By deferring judgment to Phase B’s isolated contexts, we prevent the ReAct conversation history from anchoring the model’s reasoning on early evidence\. Phase A terminates when $\mathcal{V}$ invokes a dedicated termination function or reaches a maximum iteration count, yielding the collected evidence set $\mathcal{E}=\{\mathcal{P}_1,\mathcal{P}_2,\dots,\mathcal{P}_n\}$ that is forwarded to the structured deliberation procedure in Phase B\.
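
The collection loop of Phase A can be illustrated with a minimal Python sketch. It is not the authors' implementation: the host VLM's function-calling decision is replaced by a hypothetical `select_tool` callback, the toy tools and their outputs are invented, and the default `max_iters=5` is an assumption, since the paper does not state the iteration cap. The sketch only shows the control flow: pick a tool, record $\langle t_i, o_i\rangle$, and stop on a termination signal or at the cap without producing an answer.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Step:
    """One reasoning step P_i = <t_i, o_i>."""
    tool: str    # name of the selected tool t_i
    output: Any  # structured result o_i = t_i(I, Q)


Tool = Callable[[Any, str], Any]
# Hypothetical stand-in for the host VLM: given the plan G and the evidence
# collected so far, return the next tool name, or None to terminate.
Policy = Callable[[str, List[Step]], Optional[str]]


def collect_evidence(image: Any, query: str, tools: Dict[str, Tool],
                     plan: str, select_tool: Policy,
                     max_iters: int = 5) -> List[Step]:
    """Collect evidence only; no final answer is generated here."""
    evidence: List[Step] = []
    for _ in range(max_iters):          # assumed iteration cap
        name = select_tool(plan, evidence)
        if name is None:                # dedicated termination signal
            break
        evidence.append(Step(tool=name, output=tools[name](image, query)))
    return evidence


def follow_plan(plan: str, evidence: List[Step]) -> Optional[str]:
    """Toy policy: call each tool named in a comma-separated plan once."""
    used = {step.tool for step in evidence}
    for name in plan.split(","):
        if name not in used:
            return name
    return None


# Invented toy tools for illustration only.
toy_tools: Dict[str, Tool] = {
    "segmentation": lambda img, q: {"inflammatory": 42, "neoplastic": 3},
    "classifier": lambda img, q: {"label": "normal"},
    "path_vlm": lambda img, q: "free-text finding",
}

evidence = collect_evidence("patch.png", "Which tissue type is shown?",
                            toy_tools, "segmentation,classifier", follow_plan)
```

Returning a plain list of steps, rather than a running conversation, mirrors the point of the design: the evidence set $\mathcal{E}$ is the only thing handed to Phase B.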

### 3\.4 Phase B: Structured Evidence Deliberation

When two or more tools are invoked, their outputs may conflict\. For instance, a segmentation model may indicate predominantly inflammatory cells while a VQA model describes the tissue as neoplastic\. To prevent the anchoring bias that arises from single\-pass LLM fusion, we introduce Structured Evidence Deliberation \(SED\)\.

Step 1: VLM Assessment\. The collected evidence $\mathcal{E}$ is first grouped by tool category\. For each category, an independent VLM call containing only the original image $I$, query $Q$, the category’s tool outputs, and the RAG knowledge $\mathcal{K}$ produces two semantic judgments: $a_i\in\{\text{agree},\text{uncertain},\text{disagree}\}$ and $r_i\in\{\text{high},\text{medium},\text{low}\}$, where $a_i$ indicates whether the VLM concurs with the conclusion of tool $t_i$ given its own visual understanding and textbook reference, and $r_i$ captures how relevant that conclusion is to the query\. The assessor produces labels only, so that subsequent weighting is determined algorithmically rather than by VLM self\-estimation [xiongcan](https://arxiv.org/html/2606.07549#bib.bib50)\.
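
The isolation in Step 1 can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the paper's code: the tool names, the `category_of` mapping, the request dictionary layout, and the `agreement`/`relevance` field names are all hypothetical, and the actual VLM call is omitted. It shows that each request carries only one category's outputs and that the assessor's reply is restricted to the two label sets.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

AGREEMENT = {"agree", "uncertain", "disagree"}  # values of a_i
RELEVANCE = {"high", "medium", "low"}           # values of r_i

Evidence = List[Tuple[str, Any]]  # (tool name, tool output) pairs


def group_by_category(evidence: Evidence,
                      category_of: Dict[str, str]) -> Dict[str, Evidence]:
    """Group collected evidence by tool category."""
    groups: Dict[str, Evidence] = defaultdict(list)
    for tool, output in evidence:
        groups[category_of[tool]].append((tool, output))
    return dict(groups)


def build_requests(image: Any, query: str, knowledge: str,
                   groups: Dict[str, Evidence]) -> List[Dict[str, Any]]:
    """One independent request per category; no cross-category outputs."""
    return [
        {"image": image, "query": query, "knowledge": knowledge,
         "category": category, "tool_outputs": outputs}
        for category, outputs in groups.items()
    ]


def parse_assessment(raw: Dict[str, str]) -> Tuple[str, str]:
    """Accept labels only, so weights are computed outside the VLM."""
    a, r = raw.get("agreement"), raw.get("relevance")
    if a not in AGREEMENT or r not in RELEVANCE:
        raise ValueError(f"invalid assessment labels: {raw!r}")
    return a, r


# Invented example data.
category_of = {"cellseg": "segmentation",
               "clf_a": "classification", "clf_b": "classification"}
collected = [("cellseg", {"inflammatory": 42}),
             ("clf_a", "normal"), ("clf_b", "tumor")]
groups = group_by_category(collected, category_of)
requests = build_requests("patch.png", "Which tissue type is shown?",
                          "textbook summary", groups)
```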

Step 2: Tool Conflict Analysis\. Given the assessments from Step 1 and the tool reliability posteriors from the experience system, we compute a three\-factor effective weight for each evidence source:

$$w_i=\phi(r_i)\cdot\psi(a_i)\cdot\theta_i,\tag{2}$$

where $\phi(\cdot)$ and $\psi(\cdot)$ are predefined numerical mappings \(e\.g\., $\phi(\text{high})=1.0$, $\psi(\text{agree})=1.0$\), and $\theta_i\in[0,1]$ is the historical reliability prior:

$$\theta_i=\frac{\alpha_i}{\alpha_i+\beta_i},\tag{3}$$

which equals 0\.5 under the uninformative prior $\mathrm{Beta}(1,1)$ when no experience is available\. The analyzer detects inter\-category conflicts and produces a structured report $\mathcal{C}$ that ranks all evidence by $w_i$ and highlights disagreements\.
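
As a worked illustration of Eqs\. (2) and (3), the sketch below computes effective weights and a toy conflict report in Python. Only $\phi(\text{high})=1.0$ and $\psi(\text{agree})=1.0$ come from the text; every other entry of `PHI` and `PSI` is an assumed placeholder, and the report format and the rule "a conflict exists when conclusions differ" are likewise hypothetical simplifications.

```python
from typing import Any, Dict, List

# phi(high) = 1.0 and psi(agree) = 1.0 are given in the text;
# the remaining values are ASSUMED for illustration.
PHI = {"high": 1.0, "medium": 0.5, "low": 0.25}
PSI = {"agree": 1.0, "uncertain": 0.5, "disagree": 0.25}


def reliability(alpha: float = 1.0, beta: float = 1.0) -> float:
    """Eq. (3): posterior mean of Beta(alpha, beta); 0.5 under Beta(1, 1)."""
    return alpha / (alpha + beta)


def effective_weight(relevance: str, agreement: str,
                     alpha: float = 1.0, beta: float = 1.0) -> float:
    """Eq. (2): w_i = phi(r_i) * psi(a_i) * theta_i."""
    return PHI[relevance] * PSI[agreement] * reliability(alpha, beta)


def conflict_report(sources: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Rank sources by weight and flag disagreement among conclusions."""
    weighted = [
        (s["name"], effective_weight(s["relevance"], s["agreement"],
                                     s["alpha"], s["beta"]))
        for s in sources
    ]
    ranking = sorted(weighted, key=lambda pair: pair[1], reverse=True)
    conflict = len({s["conclusion"] for s in sources}) > 1
    return {"ranking": ranking, "conflict": conflict}


# Invented example: two sources say "normal", the VQA tool says "neoplastic".
sources = [
    {"name": "classifier", "conclusion": "normal", "relevance": "high",
     "agreement": "agree", "alpha": 3.0, "beta": 1.0},
    {"name": "path_vlm", "conclusion": "neoplastic", "relevance": "high",
     "agreement": "disagree", "alpha": 1.0, "beta": 1.0},
    {"name": "segmentation", "conclusion": "normal", "relevance": "medium",
     "agreement": "agree", "alpha": 1.0, "beta": 1.0},
]
report = conflict_report(sources)
```

Under these assumed mappings the classifier receives weight $1.0\cdot1.0\cdot0.75=0.75$, the segmentation tool $0.5\cdot1.0\cdot0.5=0.25$, and the dissenting VQA tool $1.0\cdot0.25\cdot0.5=0.125$, so the report ranks the dissenting source last while still flagging the conflict.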

Step 3: Final Reasoning\. A final VLM call synthesizes the answer in yet another fresh context, receiving the original image $I$, query $Q$, the per\-tool assessments, and the conflict report $\mathcal{C}$:

$$\mathcal{A}=\mathcal{V}\bigl(I,\;Q,\;\mathcal{E},\;\mathcal{C}\bigr).\tag{4}$$

By isolating each step in a fresh LLM context, SED ensures that the final reasoning is informed by structured, pre\-adjudicated evidence rather than raw, sequentially accumulated tool outputs\.

### 3\.5 Experience System

PathoSage supports a training\-free experience system that progressively refines evidence adjudication across tasks\. The system operates in two complementary layers: tracking per\-tool reliability via Bayesian posterior updates, and distilling task\-level strategy knowledge\.

Layer 1: Tool Reliability Tracking\. We model each tool’s reliability as a Beta\-Bernoulli conjugate pair\. For each tool $t_k\in\mathcal{T}$, we maintain parameters $(\alpha_k,\beta_k)$\. After each task with ground\-truth feedback, we perform a continuous credit assignment that leverages the semantic assessment from SED Step 1\. Concretely, let $R\in\{0,1\}$ denote whether the final answer is correct, and let $s_k=\psi(a_k)$ and $v_k=\phi(r_k)$ be the mapped assessment and relevance scores\. The update rule is:

$$(\alpha_k,\beta_k)\leftarrow\begin{cases}(\alpha_k+s_k\cdot v_k,\;\beta_k), & R=1,\\ (\alpha_k,\;\beta_k+(1-s_k)\cdot v_k), & R=0.\end{cases}\tag{5}$$

A tool judged as “agree” with “high” relevance receives full credit when the task succeeds, while a tool judged “disagree” contributes minimally to either $\alpha$ or $\beta$, reflecting appropriate uncertainty about its role in the outcome\. To generalize across visually similar inputs, we retrieve the top\-$K$ nearest Beta records by image embedding similarity and aggregate them via distance\-weighted averaging to form the prior for a new query\. The posterior mean $\theta_k=\alpha_k/(\alpha_k+\beta_k)$ then feeds into the effective weight computation in SED Step 2\.
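
The update in Eq\. (5) and the similarity\-weighted prior can be sketched in a few lines of Python. The update function follows Eq\. (5) directly. The prior construction is only one plausible reading of "distance\-weighted averaging": the paper does not give the weighting function, so the use of cosine similarity clipped at zero as the weight, the default `k=3`, and the fallback to $\mathrm{Beta}(1,1)$ when no usable record exists are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Record = Tuple[Sequence[float], float, float]  # (image embedding, alpha, beta)


def update(alpha: float, beta: float, s: float, v: float,
           correct: bool) -> Tuple[float, float]:
    """Eq. (5): continuous credit assignment with s = psi(a), v = phi(r)."""
    if correct:                           # R = 1
        return alpha + s * v, beta
    return alpha, beta + (1.0 - s) * v    # R = 0


def cosine(u: Sequence[float], w: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, w))
    nu = math.sqrt(sum(a * a for a in u))
    nw = math.sqrt(sum(b * b for b in w))
    return dot / (nu * nw) if nu and nw else 0.0


def similarity_weighted_prior(query_emb: Sequence[float],
                              records: List[Record],
                              k: int = 3) -> Tuple[float, float]:
    """ASSUMED scheme: average the top-k records' (alpha, beta),
    weighted by non-negative cosine similarity to the query embedding."""
    scored = sorted(
        ((max(cosine(query_emb, emb), 0.0), a, b) for emb, a, b in records),
        key=lambda item: item[0], reverse=True,
    )[:k]
    total = sum(weight for weight, _, _ in scored)
    if total == 0:                        # no usable experience
        return 1.0, 1.0                   # uninformative Beta(1, 1)
    alpha = sum(weight * a for weight, a, _ in scored) / total
    beta = sum(weight * b for weight, _, b in scored) / total
    return alpha, beta


# Invented example records for one tool.
records: List[Record] = [([1.0, 0.0], 3.0, 1.0), ([0.0, 1.0], 1.0, 3.0)]
prior = similarity_weighted_prior([1.0, 0.0], records, k=2)
theta = prior[0] / (prior[0] + prior[1])  # feeds into Eq. (2)
```

Because the parameters are incremented by fractional amounts, $(\alpha_k,\beta_k)$ act as soft pseudo\-counts rather than integer success and failure counts.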

Layer 2: Cross\-Rollout Strategy Distillation\. While Layer 1 tracks statistical reliability, Layer 2 extracts symbolic strategy knowledge\. During exploration, we execute $N$ rollouts per query\. Successful and failed rollouts are then compared to extract the semantic advantage, a natural\-language description of what strategies led to success versus failure\. Formally, given a group of rollouts $\{(y_j,R_j)\}_{j=1}^{N}$, where $y_j$ is the trajectory and $R_j$ is the binary outcome, we use the host VLM to introspect on the group and produce strategy updates via add, delete, or modify operations on a persistent strategy bank\. The distilled strategies are injected into the Phase A system prompt, serving as a learned token prior that guides the VLM’s behavior toward more effective tool calling\.
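
The bookkeeping side of Layer 2 can be sketched as a small Python data structure. This is a hypothetical illustration: the operation format (`op`, `id`, `text`), the `StrategyBank` class, and the bullet\-list rendering are invented here, and the host VLM's introspection step is not modeled, so the operations are supplied by hand.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


def split_rollouts(rollouts: List[Tuple[Any, int]]) -> Tuple[list, list]:
    """Split {(y_j, R_j)} into successful and failed trajectories."""
    successes = [y for y, r in rollouts if r == 1]
    failures = [y for y, r in rollouts if r == 0]
    return successes, failures


@dataclass
class StrategyBank:
    """Persistent store of natural-language strategies (hypothetical)."""
    strategies: Dict[str, str] = field(default_factory=dict)

    def apply(self, ops: List[Dict[str, str]]) -> None:
        """Apply add / modify / delete operations proposed by the VLM."""
        for op in ops:
            kind, key = op["op"], op["id"]
            if kind == "add":
                self.strategies.setdefault(key, op["text"])
            elif kind == "modify":
                if key in self.strategies:
                    self.strategies[key] = op["text"]
            elif kind == "delete":
                self.strategies.pop(key, None)
            else:
                raise ValueError(f"unknown operation: {kind!r}")

    def render(self) -> str:
        """Text block to inject into the Phase A system prompt."""
        return "\n".join(f"- {text}" for text in self.strategies.values())


# Invented example: two rollouts of one query, then hand-written updates.
good, bad = split_rollouts([("seg->clf", 1), ("vlm only", 0)])

bank = StrategyBank()
bank.apply([
    {"op": "add", "id": "s1", "text": "Run the classifier first."},
    {"op": "add", "id": "s2", "text": "Skip segmentation."},
])
bank.apply([
    {"op": "modify", "id": "s1",
     "text": "Run the classifier before the pathology VLM."},
    {"op": "delete", "id": "s2"},
])
prompt_block = bank.render()
```

Keeping the strategies as plain text means the adaptation happens entirely in the prompt, which is consistent with the training\-free framing of the experience system.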

## 4 Experiments

Table 1: Quantitative comparison of models on the PathMMU test set\. The best result in each subset for general and pathology MLLMs is in bold, and the second\-best result is underlined\. Subscript green numbers indicate absolute performance gains relative to the host model \(GPT\-5\.4\)\.

Table 2: Quantitative comparison of models on Quilt\-VQA, Path\-VQA, MedXpert, and OmniMed\. The best result in each subset for general and pathology MLLMs is in bold, and the second\-best result is underlined\.

We conduct comprehensive patch understanding evaluations across five diverse pathology Visual Question Answering \(VQA\) datasets: PathMMU [sun2024pathmmu](https://arxiv.org/html/2606.07549#bib.bib73), Path\-VQA [he2020pathvqa](https://arxiv.org/html/2606.07549#bib.bib70), Quilt\-VQA [quilt1m](https://arxiv.org/html/2606.07549#bib.bib53), MedXpertQA [zuo2025medxpertqa](https://arxiv.org/html/2606.07549#bib.bib71), and OmniMedVQA [hu2024omnimedvqa](https://arxiv.org/html/2606.07549#bib.bib72)\. Details of the datasets and implementation are provided in the Appendix\.

### 4\.1 Quantitative Comparison

PathoSage significantly enhances the reasoning capability of its host model\. As presented in Table [1](https://arxiv.org/html/2606.07549#S4.T1), PathoSage \(using GPT\-5\.4 as the host\) achieves 79\.6% on the PathMMU\-test\-tiny split and 76\.8% on the Test\-All split\. Notably, it substantially outperforms its own underlying reasoning engine, GPT\-5\.4, yielding an absolute improvement of 6\.7 percentage points on the Tiny split and 6\.0 percentage points on the All split\. Furthermore, PathoSage remains highly competitive against Gemini\-3\-Pro, the strongest general MLLM evaluated, surpassing it on the Test\-Tiny overall score \(79\.6% vs\. 78\.6%\) and on specific subsets such as EduContext \(82\.8% vs\. 80\.8%\) and PathCLS \(71\.2% vs\. 67\.8%\)\. This advantage extends to the diverse diagnostic benchmarks in Table [2](https://arxiv.org/html/2606.07549#S4.T2), where PathoSage achieves 81\.4% on Quilt\-VQA and 88\.3% on OmniMed\. We attribute the performance advantage to the fundamental difference in reasoning paradigms\. General MLLMs, despite their massive parameter counts, rely on single\-pass generation\. This "black\-box" approach makes them susceptible to morphological hallucinations and anchoring biases when faced with complex, high\-resolution pathology images\. In contrast, PathoSage explicitly decouples evidence collection from adjudication\. By integrating visual tools directly into the reasoning process and using SED to weigh evidence, PathoSage grounds its diagnosis in tool\-verified features, thereby mitigating the limitations of a single MLLM\.

PathoSage establishes a new state\-of\-the\-art paradigm for domain\-specific reasoning\. Compared to Patho\-AgenticRAG, which also employs retrieval and reasoning mechanisms, PathoSage achieves a significant lead \(\+6\.7% on PathMMU\-test\-tiny\)\. For Yes/No questions requiring definitive morphological judgments, it achieves 81\.4% on Quilt\-VQA and 83\.2% on Path\-VQA, markedly outperforming previous pathology MLLMs\. The results highlight the efficiency and robustness of our collaborative agentic design\. Most existing pathology MLLMs require extensive domain\-specific datasets and computationally expensive fine\-tuning pipelines to acquire medical reasoning capabilities\. In contrast, PathoSage is a training\-free framework that achieves superior performance simply by orchestrating existing specialized models\.

### 4\.2 Qualitative Analysis

Figure [4](https://arxiv.org/html/2606.07549#S4.F4) illustrates a representative VQA example from the PathMMU test set\. This case highlights the vulnerability of MLLMs to visual deception\. The baseline models, including the highly advanced Gemini\-3\-Pro and GPT\-5\.4, all fail by hallucinating "perinuclear halos" \(Option A\), a feature typically associated with viral infections \(e\.g\., HPV\) that is absent in this normal tissue section\. Qwen3\-VL\-32B also misinterprets the visual context\. In contrast, PathoSage correctly identifies the "high nucleus\-to\-cytoplasm ratio" \(Option D\)\. It achieves this not through a single\-pass guess, but by orchestrating specialized tools: the classification tools confirm a normal squamous epithelium context, and the segmentation tool verifies high nuclear density\. By grounding its reasoning in these verified tool outputs, PathoSage avoids the hallucination trap\. Additional qualitative examples, including detailed case studies on how SED explicitly resolves complex inter\-tool conflicts by downweighting erroneous VQA suggestions, are provided in the Appendix\.
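The weighing of conflicting evidence in such a case can be pictured with a small numeric sketch\. It is not the SED procedure itself, which relies on the host model deliberating in a fresh context; the `Evidence` record, the `adjudicate` function, and the reliability values below are hypothetical and chosen only to mirror a situation in which two tools support Option D while one suggestion points to Option A\.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    tool: str           # hypothetical tool name
    option: str         # answer option this evidence supports
    reliability: float  # assumed trust in the tool, in [0, 1]


def adjudicate(evidence: list[Evidence]) -> tuple[str, dict[str, float]]:
    """Return the option with the largest summed reliability, plus all scores."""
    scores: dict[str, float] = defaultdict(float)
    for item in evidence:
        scores[item.option] += item.reliability
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Illustrative values only; none of these numbers come from the paper.
evidence = [
    Evidence("vqa", "A", 0.50),
    Evidence("classifier", "D", 0.55),
    Evidence("segmentation", "D", 0.60),
]
choice, scores = adjudicate(evidence)
```

A summed\-reliability vote is the simplest possible stand\-in; it only shows why independent, reliability\-aware evidence can override a single confident but wrong suggestion\.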

![Refer to caption](https://arxiv.org/html/2606.07549v1/x4.png)

Figure 4: A representative example where PathoSage correctly identifies the answer, while three baseline MLLMs \(Gemini\-3\-Pro, GPT\-5\.4, and Qwen2\.5\-32B\) all fail on the same question\.

### 4\.3 Ablation Study

Table 3: Ablation on key components, including RAG, Tool\-box, SED, and Experience\. The study is conducted on PathMMU\-test\-tiny\. The best results are in bold\.

![Refer to caption](https://arxiv.org/html/2606.07549v1/x5.png)

Figure 5: Ablation on the host VLM, including Qwen3\-VL\-32B, GPT\-5\.4, and Gemini\-3\-Pro\. The study is conducted on PathMMU\-test\-tiny\.

Effectiveness of Key Components\. We ablate the core components of PathoSage on PathMMU\-test\-tiny\. As shown in Table [3](https://arxiv.org/html/2606.07549#S4.T3), the host model \(GPT\-5\.4\) achieves an overall accuracy of 72\.9%\. Introducing RAG alone improves performance to 74\.5% \(\+1\.6%\), while introducing the Tool\-box alone yields 75\.0% \(\+2\.1%\), indicating that both external knowledge and specialized visual tools provide valuable diagnostic signals\. However, simply combining them only reaches 75\.8%\. This marginal gain \(\+0\.8% over Tool\-box alone\) suggests that naively aggregating heterogeneous evidence into a single context limits the utilization of all available information\. When SED is applied to the Tool\-box \(without RAG\), performance rises to 76\.4% \(\+1\.4% over Tool\-box alone\)\. When SED is applied to both RAG and Tool\-box, the accuracy increases to 77\.6% \(\+1\.8% over the naive combination\)\. Finally, incorporating the experience system further pushes accuracy to 79\.6% \(\+2\.0%\), validating that modeling long\-term tool reliability and establishing distance\-weighted priors are crucial for resolving complex inter\-tool conflicts\.
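The increments quoted in this ablation follow directly from the reported overall accuracies\. The short check below recomputes them; the configuration labels are shorthand introduced here for illustration, not names taken from Table 3\.

```python
# Overall accuracies (%) on PathMMU-test-tiny, as quoted in the ablation text.
# The dictionary keys are illustrative labels, not the paper's own naming.
ablation = {
    "host": 72.9,
    "host+RAG": 74.5,
    "host+Tool-box": 75.0,
    "host+RAG+Tool-box": 75.8,
    "host+Tool-box+SED": 76.4,
    "host+RAG+Tool-box+SED": 77.6,
    "full": 79.6,  # adds the experience system
}


def gain(after: str, before: str) -> float:
    """Absolute accuracy difference in percentage points, rounded to 0.1."""
    return round(ablation[after] - ablation[before], 1)


steps = [
    ("host+RAG", "host"),
    ("host+Tool-box", "host"),
    ("host+RAG+Tool-box", "host+Tool-box"),
    ("host+Tool-box+SED", "host+Tool-box"),
    ("host+RAG+Tool-box+SED", "host+RAG+Tool-box"),
    ("full", "host+RAG+Tool-box+SED"),
]
gains = [gain(after, before) for after, before in steps]
```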

Generalizability Across Host VLMs\. To verify that the performance gains of PathoSage are not restricted to a specific model, we evaluate our framework using three host VLMs\. As illustrated in Figure [5](https://arxiv.org/html/2606.07549#S4.F5), PathoSage consistently enhances the accuracy across all tested hosts\. Specifically, PathoSage improves the overall accuracy of Qwen3\-VL\-32B from 65\.4% to 72\.0% \(\+6\.6%\), GPT\-5\.4 from 72\.9% to 79\.6% \(\+6\.7%\), and Gemini\-3\-Pro from 78\.6% to 80\.9% \(\+2\.3%\)\. Notably, when equipped with Gemini\-3\-Pro, PathoSage achieves the highest performance across almost all subsets\. Furthermore, when using the open\-source Qwen3\-VL\-32B, PathoSage \(72\.0%\) performs competitively with the standalone GPT\-5\.4 model \(72\.9%\)\. These consistent improvements confirm that our decoupled evidence adjudication and experience\-aware routing mechanisms provide a robust, model\-agnostic paradigm for advancing pathology AI\.

## 5 Conclusion

Broader Impact\. This paper introduces PathoSage, an agentic framework designed to deliver reliable pathology diagnoses by explicitly decoupling evidence collection from final adjudication\. Addressing the limitations of VLMs and naive agentic workflows that suffer from context contamination, PathoSage incorporates SED to algorithmically weigh tool outputs and an experience system to model long\-term tool reliability\. By bridging the gap between opaque AI predictions and the rigorous, evidence\-based diagnostic processes of human pathologists, our method represents a significant step toward trustworthy and clinically translatable AI in computational pathology\.

Limitations\. The proposed framework’s performance inherently depends on the availability and accuracy of specialized visual tools, which remain limited for rare disease subtypes\. In addition, the final diagnostic synthesis still relies on the host VLM, which is prone to inherent inconsistencies and hallucinations if all supporting tools provide erroneous evidence\. Addressing these limitations through end\-to\-end tool optimization and more granular error taxonomies will be essential to further improve the clinical impact of computer\-aided diagnosis\.

## Acknowledgments and Disclosure of Funding

This work was supported in part by the Natural Science Foundation of Sichuan Province under Grant 2026NSFSC1491, the National Natural Science Foundation of China \(Grant No\. 62303338, No\. 62427820\), the Fundamental Research Funds for the Central Universities under Grant Sichuan University YJ202285, the Sichuan Science and Technology Program under Grant 2025ZDZX0125, and the Science Fund for Creative Research Groups of Sichuan Province Natural Science Foundation under Grant 2024NSFTD0035\.

## References

- \[1\] Mehmet Saygin Seyfioglu, Wisdom O Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, and Linda Shapiro\. Quilt\-llava: Visual instruction tuning by extracting localized narratives from open\-source histopathology videos\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13183–13192, 2024\.
- \[2\] Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Melissa Zhao, Aaron K Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, et al\. A multimodal generative ai copilot for human pathology\. Nature, 634\(8033\):466–473, 2024\.
- \[3\] Dawei Dai, Yuanhui Zhang, Long Xu, Qianlan Yang, Xiaojing Shen, Shuyin Xia, and Guoyin Wang\. Pa\-llava: A large language\-vision assistant for human pathology image understanding\. In Proceedings of the International Conference on Bioinformatics and Biomedicine, pages 3138–3143, 2024\.
- \[4\] Yuxuan Sun, Yunlong Zhang, Yixuan Si, Chenglu Zhu, Kai Zhang, Zhongyi Shui, Jingxiong Li, Xuan Gong, Xinheng Lyu, Tao Lin, and Lin Yang\. Pathgen\-1\.6m: 1\.6 million pathology image\-text pairs generation through multi\-agent collaboration\. In Proceedings of the International Conference on Learning Representations, 2025\.
- \[5\] Wenchuan Zhang, Penghao Zhang, Jingru Guo, Tao Cheng, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, and Hong Bu\. Patho\-r1: A multimodal reinforcement learning\-based pathology expert reasoner\. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 28418–28426, 2026\.
- \[6\] Zhe Xu, Ziyi Liu, Junlin Hou, Jiabo Ma, Cheng Jin, Yihui Wang, Zhixuan Chen, Zhengyu Zhang, Fuxiang Huang, Zhengrui Guo, et al\. A versatile pathology co\-pilot via reasoning enhanced multimodal large language model\. arXiv preprint arXiv:2507\.17303, 2025\.
- \[7\] Tianyu Liu, Weihao Xuan, Hao Wu, Peter Humphrey, Marcello DiStasio, Heli Qi, Rui Yang, Simeng Han, Tinglin Huang, Fang Wu, et al\. Teampath: Building multimodal pathology experts with reasoning ai copilots\. arXiv preprint arXiv:2511\.17652, 2025\.
- \[8\] Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Zhongyi Shui, and Lin Yang\. Wsicaption: Multiple instance generation of pathology reports for gigapixel whole\-slide images\. In Proceedings of the International Conference on Medical Image Computing and Computer\-Assisted Intervention, pages 546–556\. Springer, 2024\.
- \[9\] Pingyi Chen, Chenglu Zhu, Sunyi Zheng, Honglin Li, and Lin Yang\. Wsi\-vqa: Interpreting whole slide images by generative visual question answering\. In Proceedings of the European Conference on Computer Vision, pages 401–417\. Springer, 2024\.
- \[10\] Zhengrui Guo, Jiabo Ma, Yingxue Xu, Yihui Wang, Liansheng Wang, and Hao Chen\. Histgen: Histopathology report generation via local\-global feature encoding and cross\-modal context interaction\. In Proceedings of the International Conference on Medical Image Computing and Computer\-Assisted Intervention, pages 189–199\. Springer, 2024\.
- \[11\] Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Ming Hu, Rongshan Yu, Yu Qiao, and Junjun He\. Slidechat: A large vision\-language assistant for whole\-slide pathology image understanding\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5134–5143, 2025\.
- \[12\] Yuci Liang, Xinheng Lyu, Wenting Chen, Meidan Ding, Jipeng Zhang, Xiangjian He, Song Wu, Xiaohan Xing, Sen Yang, Xiyue Wang, et al\. Wsi\-llava: A multimodal large language model for whole slide image\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22718–22727, 2025\.
- \[13\] Faruk Ahmed, Andrew Sellergren, Lin Yang, Shawn Xu, Boris Babenko, Abbi Ward, Niels Olson, Arash Mohtashamian, Yossi Matias, Greg S Corrado, et al\. Pathalign: A vision\-language model for whole slide images in histopathology\. arXiv preprint arXiv:2406\.19578, 2024\.
- \[14\] Zeyu Gao, Kai He, Weiheng Su, Ines P Machado, William McGough, Mercedes Jimenez\-Linan, Brian Rous, Chunbao Wang, Chengzu Li, Xiaobo Pang, et al\. Alpaca: Adapting llama for pathology context analysis to enable slide\-level question answering\. medRxiv, pages 2025–04, 2025\.
- \[15\] Tong Ding, Sophia J Wagner, Andrew H Song, Richard J Chen, Ming Y Lu, Andrew Zhang, Anurag J Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, et al\. A multimodal whole\-slide foundation model for pathology\. Nature Medicine, pages 1–13, 2025\.
- \[16\] Songhan Jiang, Fengchun Liu, Ziyue Wang, Linghan Cai, and Yongbing Zhang\. Pathreasoner\-r1: Instilling structured reasoning into pathology vision\-language model via knowledge\-guided policy optimization\. arXiv preprint arXiv:2601\.21617, 2026\.
- \[17\] Yuxuan Sun, Yixuan Si, Chenglu Zhu, Xuan Gong, Kai Zhang, Pingyi Chen, Ye Zhang, Zhongyi Shui, Tao Lin, and Lin Yang\. Cpath\-omni: A unified multimodal foundation model for patch and whole slide image analysis in computational pathology\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10360–10371, 2025\.
- \[18\] Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S Corrado, Dale R Webster, Shravya Shetty, Shruthi Prabhakara, et al\. Polypath: Adapting a large multimodal model for multi\-slide pathology report generation\. Modern Pathology, page 100886, 2025\.
- \[19\] Manuel Tran, Paul Schmidle, Ruifeng Ray Guo, Sophia J Wagner, Valentin Koch, Valerio Lupperger, Brenna Novotny, Dennis H Murphree, Heather D Hardway, Marina D’Amato, et al\. Generating dermatopathology reports from gigapixel whole slide images with histogpt\. Nature Communications, 16\(1\):4886, 2025\.
- \[20\] Eugene Vorontsov, George Shaikovski, Adam Casson, Julian Viret, Eric Zimmermann, Neil Tenenholtz, Yi Kan Wang, Jan H Bernhard, Ran A Godrich, Juan A Retamero, et al\. Prism2: Unlocking multi\-modal general pathology ai with clinical dialogue\. arXiv preprint arXiv:2506\.13063, 2025\.
- \[21\] Shengyi Hua, Jianfeng Wu, Tianle Shen, Kangzhe Hu, Zhongzhen Huang, Shujuan Ni, Zhihong Zhang, Yuan Li, Zhe Wang, and Xiaofan Zhang\. Pathfound: An agentic multimodal model activating evidence\-seeking pathological diagnosis\. arXiv preprint arXiv:2512\.23545, 2025\.
- \[22\] Yuxuan Yang, Zhonghao Yan, Yi Zhang, Bo Yun, Muxi Diao, Guowei Zhao, Kongming Liang, Wenbin Li, and Zhanyu Ma\. Hepato\-llava: An expert mllm with sparse topo\-pack attention for hepatocellular pathology analysis on whole slide images\. arXiv preprint arXiv:2602\.19424, 2026\.
- \[23\] Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Lin Sun, Zhongyi Shui, Yunlong Zhang, Honglin Li, and Lin Yang\. Pathasst: A generative foundation ai assistant towards artificial general intelligence of pathology\. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5034–5042, 2024\.
- \[24\] Wenchuan Zhang, Jingru Guo, Hengzhe Zhang, Penghao Zhang, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, and Hong Bu\. Patho\-agenticrag: towards multimodal agentic retrieval\-augmented generation for pathology vlms via reinforcement learning\. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29921–29929, 2026\.
- \[25\] Chengkuan Chen, Luca L Weishaupt, Drew FK Williamson, Richard J Chen, Tong Ding, Bowen Chen, Anurag Vaidya, Long Phi Le, Guillaume Jaume, Ming Y Lu, et al\. Evidence\-based diagnostic reasoning with multi\-agent copilot for human pathology\. arXiv preprint arXiv:2506\.20964, 2025\.
- \[26\] Sheng Wang, Ruiming Wu, Charles Herndon, Yihang Liu, Shunsuke Koga, Jeanne Shen, and Zhi Huang\. Pathology\-cot: Learning visual chain\-of\-thought agent from expert whole slide image diagnosis behavior\. arXiv preprint arXiv:2510\.04587, 2025\.
- \[27\] Fatemeh Ghezloo, Mehmet Saygin Seyfioglu, Rustin Soraki, Wisdom O Ikezogwo, Beibin Li, Tejoram Vivekanandan, Joann G Elmore, Ranjay Krishna, and Linda Shapiro\. Pathfinder: A multi\-modal multi\-agent system for medical diagnostic decision\-making applied to histopathology\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23431–23441, 2025\.
- \[28\] Jingyun Chen, Linghan Cai, Zhikang Wang, Yi Huang, Songhan Jiang, Shenjin Huang, Hongpeng Wang, and Yongbing Zhang\. Pathagent: Toward interpretable analysis of whole\-slide pathology images via large language model\-based agentic reasoning\. arXiv preprint arXiv:2511\.17052, 2025\.
- \[29\] Yuxuan Sun, Yixuan Si, Chenglu Zhu, Kai Zhang, Zhongyi Shui, Bowen Ding, Tao Lin, and Lin Yang\. Cpathagent: An agent\-based foundation model for interpretable high\-resolution pathology image analysis mimicking pathologists’ diagnostic logic\. arXiv preprint arXiv:2505\.20510, 2025\.
- \[30\] Guolin Huang, Wenting Chen, Jiaqi Yang, Xinheng Lyu, Xiaoling Luo, Sen Yang, Xiaohan Xing, and Linlin Shen\. Survagent: Hierarchical cot\-enhanced case banking and dichotomy\-based multi\-agent system for multimodal survival prediction\. arXiv preprint arXiv:2511\.16635, 2025\.
- \[31\] Xinheng Lyu, Yuci Liang, Wenting Chen, Meidan Ding, Jiaqi Yang, Guolin Huang, Daokun Zhang, Xiangjian He, and Linlin Shen\. Wsi\-agents: A collaborative multi\-agent system for multi\-modal whole slide image analysis\. arXiv preprint arXiv:2507\.14680, 2025\.
- \[32\] Songhao Li, Jonathan Xu, Tiancheng Bao, Yuxuan Liu, Yuchen Liu, Yihang Liu, Lilin Wang, Wenhui Lei, Sheng Wang, Yinuo Xu, et al\. A co\-evolving agentic ai system for medical imaging analysis\. arXiv preprint arXiv:2509\.20279, 2025\.
- \[33\] Zhengyang Xu, Han Li, Jingsong Liu, Linrui Xie, Xun Ma, Xin You, Shihui Zu, Ayako Ito, Xinyu Hao, Hongming Xu, et al\. Mmnavagent: Multi\-magnification wsi navigation agent for clinically consistent whole\-slide analysis\. arXiv preprint arXiv:2603\.02079, 2026\.
- \[34\] Dyke Ferber, Omar SM El Nahhas, Georg Wölflein, Isabella C Wiest, Jan Clusmann, Marie\-Elisabeth Leßmann, Sebastian Foersch, Jacqueline Lammert, Maximilian Tschochohei, Dirk Jäger, et al\. Development and validation of an autonomous artificial intelligence agent for clinical decision\-making in oncology\. Nature Cancer, 6\(8\):1337–1349, 2025\.
- \[35\] Sahar Almahfouz Nasser, Juan Francisco Pesantez Borja, Jincheng Liu, Tanvir Hasan, Zenghan Wang, Suman Ghosh, Sandeep Manandhar, Shikhar Shiromani, Twisha Shah, Naoto Tokuyama, et al\. Sage: Agentic framework for interpretable and clinically translatable computational pathology biomarker discovery\. arXiv preprint arXiv:2602\.00953, 2026\.
- \[36\] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao\. React: Synergizing reasoning and acting in language models\. In Proceedings of the International Conference on Learning Representations, 2022\.
- \[37\] Andrew Zhang, Tong Ding, Sophia J Wagner, Caiwei Tian, Ming Y Lu, Rowland Pettit, Joshua E Lewis, Alexandre Misrahi, Dandan Mo, Long Phi Le, et al\. A multimodal and temporal foundation model for virtual patient representations at healthcare system scale\. arXiv preprint arXiv:2604\.18570, 2026\.
- \[38\] Timo Schick, Jane Dwivedi\-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom\. Toolformer: Language models can teach themselves to use tools\. Advances in Neural Information Processing Systems, 36:68539–68551, 2023\.
- \[39\] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen\-tau Yih, Tim Rocktäschel, et al\. Retrieval\-augmented generation for knowledge\-intensive nlp tasks\. Advances in Neural Information Processing Systems, 33:9459–9474, 2020\.
- \[40\] Yunzhong Xiao, Yangmin Li, Hewei Wang, Yunlong Tang, and Zora Zhiruo Wang\. Toolmem: Enhancing multimodal agents with learnable tool capability memory\. arXiv preprint arXiv:2510\.06664, 2025\.
- \[41\] Guanyu Jiang, Zhaochen Su, Xiaoye Qu, et al\. Xskill: Continual learning from experience and skills in multimodal agents\. arXiv preprint arXiv:2603\.12056, 2026\.
- \[42\] Zhiyu Li, Chenyang Xi, Chunyu Li, Ding Chen, Boyu Chen, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Chen Tang, et al\. Memos: A memory os for ai system\. arXiv preprint arXiv:2507\.03724, 2025\.
- \[43\] Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, et al\. Memverse: Multimodal memory for lifelong learning agents\. arXiv preprint arXiv:2512\.03627, 2025\.
- \[44\] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao\. Reflexion: Language agents with verbal reinforcement learning\. Advances in Neural Information Processing Systems, 36:8634–8652, 2023\.
- \[45\] Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Celine Hudelot, and Pierre Colombo\. Colpali: Efficient document retrieval with vision language models\. In Proceedings of the International Conference on Learning Representations, 2025\.
- \[46\] Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li, Xiangyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, et al\. Milvus: A purpose\-built vector data management system\. In Proceedings of the International Conference on Management of Data, pages 2614–2627, 2021\.
- \[47\] Yu A Malkov and Dmitry A Yashunin\. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs\. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42\(4\):824–836, 2018\.
- \[48\] Jessica Maria Echterhoff, Yao Liu, Abeer Alessa, Julian McAuley, and Zexue He\. Cognitive bias in decision\-making with llms\. In Findings of the Association for Computational Linguistics, pages 12640–12653, 2024\.
- \[49\] Lawrence A Shaktah, Zunamys I Carrero, Katherine Jane Hewitt, Marco Gustav, Matthew Cecchini, Sebastian Foersch, Sabina Berezowska, and Jakob Nikolas Kather\. Application of artificial intelligence and digital tools in cancer pathology\. The Lancet Digital Health, 7\(10\), 2025\.
- \[50\] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi\. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms\. In Proceedings of the International Conference on Learning Representations, 2024\.
- \[51\] Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al\. A visual\-language foundation model for computational pathology\. Nature Medicine, 30\(3\):863–874, 2024\.
- \[52\] Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou\. A visual–language foundation model for pathology image analysis using medical twitter\. Nature Medicine, 29\(9\):2307–2316, 2023\.
- \[53\] Wisdom Ikezogwo, Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, and Linda Shapiro\. Quilt\-1m: One million image\-text pairs for histopathology\. Advances in Neural Information Processing Systems, 36:37995–38017, 2023\.
- \[54\] Jinxi Xiang, Xiyue Wang, Xiaoming Zhang, Yinghua Xi, Feyisope Eweje, Yijiang Chen, Yuchen Li, Colin Bergstrom, Matthew Gopaulchan, Ted Kim, et al\. A vision–language foundation model for precision oncology\. Nature, 638\(8051\):769–778, 2025\.
- \[55\] Pei Liu, Luping Ji, Jiaxiang Gou, Bo Fu, and Mao Ye\. Interpretable vision\-language survival analysis with ordinal inductive bias for computational pathology\. In Proceedings of the International Conference on Learning Representations, 2025\.
- \[56\] Praveenbalaji Rajendran, Mojtaba Safari, Wenfeng He, Mingzhe Hu, Shansong Wang, Jun Zhou, and Xiaofeng Yang\. Foundation models in medical image analysis: A systematic review and meta\-analysis\. ArXiv, abs/2510\.16973, 2025\.
- \[57\] Qi Peng, Jiatong Li, Sirui Huang, Yiyang Jiang, Kaisong Gong, Ronger Ding, Shijie Ye, Changmeng Zheng, Xiao\-Yong Wei, and Qing Li\. Aligning clinical needs and ai capabilities: a survey on llms for medical reasoning\. Authorea Preprints, 2025\.
- \[58\] Xiyue Wang, Junhan Zhao, Eliana Marostica, Wei Yuan, Jietian Jin, Jiayu Zhang, Ruijiang Li, Hongping Tang, Kanran Wang, Yu Li, et al\. A pathology foundation model for cancer diagnosis and prognosis prediction\. Nature, 634\(8035\):970–978, 2024\.
- \[59\] Qifeng Zhou, Wenliang Zhong, Thao M Dang, Hehuan Ma, Saiyang Na, Yuzhi Guo, and Junzhou Huang\. Homie: Histopathology omni\-modal embedding for pathology composed retrieval\. arXiv preprint arXiv:2502\.07221, 2025\.
- \[60\] Jingyun Chen, Fengchun Liu, Songhan Jiang, and Linghan Cai\. The landscape of computational pathology agents from static analysis to autonomous diagnostic workflows\. Authorea Preprints, 2026\.
- \[61\] Timothy Ossowski, Sheng Zhang, Qianchu Liu, Guanghui Qin, Reuben Tan, Tristan Naumann, Junjie Hu, and Hoifung Poon\. Octomed: Data recipes for state\-of\-the\-art multimodal medical reasoning\. arXiv preprint arXiv:2511\.23269, 2025\.
- \[62\] Jiao Xu, Junwei Liu, Jiangwei Lao, Qi Zhu, Yunpeng Zhao, Congyun Jin, Shinan Liu, Zhihong Lu, Lihe Zhang, Xin Chen, et al\. Pulsemind: A multi\-modal medical model for real\-world clinical diagnosis\. arXiv preprint arXiv:2601\.07344, 2026\.
- \[63\] Wenjie Li, Yujie Zhang, Haoran Sun, Yueqi Li, Fanrui Zhang, Mengzhe Xu, Victoria Borja Clausich, Sade Mellin, Renhao Yang, Chenrun Wang, et al\. Cx\-mind: a pioneering multimodal large language model for interleaved reasoning in chest x\-ray via curriculum\-guided reinforcement learning\. Information Fusion, page 104027, 2025\.
- \[64\] Ruiqi Wu, Yuang Yao, Tengfei Ma, Chenran Zhang, Na Su, Tao Zhou, Geng Chen, Wen Fan, and Yi Zhou\. Bridging the gap in ophthalmic ai: Mm\-retinal\-reason dataset and ophthareason model toward dynamic multimodal reasoning\. arXiv preprint arXiv:2508\.16129, 2025\.
- \[65\] Wenchuan Zhang, Shuwan Zhang, Jiadi You, Fengling Li, Xiaoyan Wu, Xunxi Lu, Qingjie Lv, Juan Huang, Yuhao Yi, and Hong Bu\. Attention\-based multimodal fusion transformer for predicting the efficacy of neoadjuvant therapy in breast cancer: a cross\-institutional retrospective study\. Breast Cancer Research, 2025\.
- \[66\] Shipra Agrawal and Navin Goyal\. Analysis of thompson sampling for the multi\-armed bandit problem\. In Proceedings of the Conference on Learning Theory, pages 39–1\. JMLR Workshop and Conference Proceedings, 2012\.
- \[67\] Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu, Elham Dolatabadi, and Babak Taati\. When does rl help medical vlms? disentangling vision, sft, and rl gains\. arXiv preprint arXiv:2603\.01301, 2026\.
- \[68\] Zhongyi Shui, Honglin Li, Xiaozhong Ji, Ye Zhang, Zijiang Yang, Chenglu Zhu, Yuxuan Sun, Kai Yao, Conghui He, and Cheng Tan\. Nunext: Reframing nucleus detection as next\-point detection\. arXiv preprint arXiv:2603\.07098, 2026\.
- \[69\] Ziyang Song, Zelin Zang, Zuyao Chen, Xusheng Liang, Dong Yi, Jinlin Wu, Hongbin Liu, Jiebo Luo, and Zhen Lei\. Anatomy\-r1: Enhancing anatomy reasoning in multimodal large language models via anatomical similarity curriculum and group diversity augmentation\. arXiv preprint arXiv:2512\.19512, 2025\.
- \[70\] Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie\. Pathvqa: 30000\+ questions for medical visual question answering\. arXiv preprint arXiv:2003\.10286, 2020\.
- \[71\] Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou\. Medxpertqa: Benchmarking expert\-level medical reasoning and understanding\. arXiv preprint arXiv:2501\.18362, 2025\.
- \[72\] Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo\. Omnimedvqa: A new large\-scale comprehensive evaluation benchmark for medical lvlm\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22170–22183, 2024\.
- \[73\] Yuxuan Sun, Hao Wu, Chenglu Zhu, Sunyi Zheng, Qizi Chen, Kai Zhang, Yunlong Zhang, Dan Wan, Xiaoxiao Lan, Mengyue Zheng, et al\. Pathmmu: A massive multimodal expert\-level benchmark for understanding and reasoning in pathology\. In Proceedings of the European Conference on Computer Vision, pages 56–73\. Springer, 2024\.
- \[74\] OpenAI\. Introducing\-gpt\-5\-4\. 2026\.
- \[75\] Google\. Gemini 3 pro \- model card\. 2025\.
- \[76\] Simon Graham, Quoc Dang Vu, Shan E Ahmed Raza, Ayesha Azam, Yee Wah Tsang, Jin Tae Kwak, and Nasir Rajpoot\. Hover\-net: Simultaneous segmentation and classification of nuclei in multi\-tissue histology images\. Medical Image Analysis, 58:101563, 2019\.
- \[77\] Fabian Hörst, Moritz Rempe, Helmut Becker, Lukas Heine, Julius Keyl, and Jens Kleesiek\. Cellvit\+\+: Energy\-efficient and adaptive cell segmentation and classification using foundation models\. Computer Methods and Programs in Biomedicine, page 109206, 2026\.

## Appendix A Additional Experiments and Discussion

### A\.1 Experience Accumulation on PathMMU\-val

Table A\.1: Quantitative comparison of models on the PathMMU val set\. The best result in each subset for general and pathology MLLMs is in bold, and the second\-best result is underlined\.

To ensure a rigorous evaluation and prevent data leakage during the testing phase, PathoSage’s Beta\-Bernoulli experience system is accumulated exclusively on the validation set of PathMMU \(PathMMU\-val\)\. This initial exploration phase serves as the foundation for modeling long\-term tool reliability and extracting task\-level strategies\. Table [A\.1](https://arxiv.org/html/2606.07549#A1.T1) presents the quantitative performance of various general and pathology\-specific MLLMs on the PathMMU val set, alongside the performance of PathoSage during this accumulation phase\.

During the experience accumulation process, PathoSage operates in exploration mode\. We evaluate its performance under two distinct settings:

- Pass@1: The agent performs a single reasoning trajectory without utilizing any historical experience priors \($\theta_{k}=0.5$\)\. This serves as the baseline performance of our SED mechanism acting alone\.
- Pass@4: The agent executes $N=4$ independent rollouts per query\. The system aggregates the results from these multiple trajectories to extract successful strategies and update the Beta\-Bernoulli reliability parameters\.
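The two settings can be made concrete with a small scoring sketch\. It assumes the conventional reading of Pass@k, in which a query counts as solved if any of its first k rollouts matches the reference answer; the text above does not spell out the aggregation rule, so this reading, the `pass_at_k` helper, and the toy rollouts are illustrative assumptions\.

```python
def pass_at_k(rollouts: list[list[str]], answers: list[str], k: int) -> float:
    """Fraction of queries for which any of the first k rollouts is correct."""
    if not answers or len(rollouts) != len(answers):
        raise ValueError("need one non-empty rollout list per reference answer")
    solved = sum(
        any(pred == gold for pred in preds[:k])
        for preds, gold in zip(rollouts, answers)
    )
    return solved / len(answers)


# Toy data: three queries, four rollouts each (invented for illustration).
rollouts = [
    ["A", "A", "D", "A"],  # correct only on the third rollout
    ["B", "B", "B", "B"],  # correct on every rollout
    ["C", "D", "C", "C"],  # never correct
]
answers = ["D", "B", "A"]

p1 = pass_at_k(rollouts, answers, k=1)
p4 = pass_at_k(rollouts, answers, k=4)
```

Under this reading Pass@4 can never be lower than Pass@1 on the same rollouts, which is why the two numbers are best compared as a single\-trajectory baseline versus an exploration upper bound\.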

As shown in Table [A\.1](https://arxiv.org/html/2606.07549#A1.T1), PathoSage \(Pass@1\), utilizing GPT\-5\.4 as the host model, achieves a strong overall accuracy of 74\.0% on the validation set\. This represents a substantial absolute improvement of \+8\.6% over the bare GPT\-5\.4 \(65\.4%\) and is competitive with the strongest general\-purpose baseline, Gemini\-3\-Pro \(74\.5%\)\. This significant margin confirms that explicitly decoupling evidence collection from adjudication via SED provides a highly robust reasoning foundation, even before any historical experience is accumulated\.

Crucially, when operating in exploration mode \(Pass@4\), PathoSage’s overall accuracy rises to 80\.1%, establishing a new state of the art across nearly all sub\-categories \(e\.g\., 86\.3% on PubMed and 78\.8% on EduContent\)\. The multiple rollouts allow the system to explore diverse tool combinations and reasoning paths, successfully navigating complex cases where a single\-pass generation might fail\.

The successful trajectories identified during this Pass@4 exploration are subsequently harvested to populate the experience database\. This process yields high\-quality pseudo\-labels for updating the Beta\-Bernoulli reliability parameters \($\alpha_{k},\beta_{k}$\) and extracts valuable strategies for tool utilization\. The accumulated experience from this validation phase is then frozen and used to establish priors during the final evaluation on the PathMMU test set\.
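A Beta\-Bernoulli reliability tracker with continuous credit can be sketched as follows\. This is a minimal illustration under stated assumptions: a uniform Beta\(1, 1\) starting point \(which yields the neutral mean of 0\.5\), and a fractional credit in \[0, 1\] added to the success count with its complement added to the failure count\. The `ToolReliability` class and this exact update rule are hypothetical; the credit\-assignment scheme itself is defined in the main text\.

```python
from dataclasses import dataclass


@dataclass
class ToolReliability:
    """Beta(alpha, beta) belief over one tool's reliability (illustrative)."""
    alpha: float = 1.0  # assumed uniform prior pseudo-count for successes
    beta: float = 1.0   # assumed uniform prior pseudo-count for failures

    def update(self, credit: float) -> None:
        """Add fractional credit in [0, 1]; the remainder counts as failure."""
        if not 0.0 <= credit <= 1.0:
            raise ValueError("credit must lie in [0, 1]")
        self.alpha += credit
        self.beta += 1.0 - credit

    @property
    def mean(self) -> float:
        """Posterior mean alpha / (alpha + beta)."""
        return self.alpha / (self.alpha + self.beta)


tool = ToolReliability()
prior_mean = tool.mean   # neutral trust before any evidence
tool.update(1.0)         # tool fully supported a correct answer
after_full = tool.mean
tool.update(0.5)         # tool was only partially helpful
after_partial = tool.mean
```

Because the accumulated counts are plain numbers, freezing the experience for test\-time use amounts to storing `alpha` and `beta` per tool and no longer calling `update`\.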

### A\.2 Analysis of Experience Database

![Refer to caption](https://arxiv.org/html/2606.07549v1/x6.png)

Figure A\.1: Beta posterior mean $\theta$ for each tool per category and number of labeled samples\.

To understand what PathoSage learns during the initial exploration phase, we conduct an in\-depth analysis of the accumulated experience database on the PathMMU val set\. We examine both the statistical tool reliability tracked by Layer 1 \(Beta\-Bernoulli updates\) and the symbolic strategy knowledge distilled by Layer 2 \(Cross\-Rollout Critique\)\.

Layer 1: Tool Reliability Profiling\. Figure [A\.1](https://arxiv.org/html/2606.07549#A1.F1) illustrates the posterior mean $\theta_{k}=\alpha_{k}/(\alpha_{k}+\beta_{k})$ for each specialized tool across different PathMMU sub\-categories\. This metric represents the system’s learned trust in a specific tool for a given domain\. Several key observations emerge:

- Domain\-Specific Competence: Tool reliability is highly heterogeneous\. For instance, the VQA tool \(Patho\-R1\) maintains consistently high reliability \($\theta\approx 0.65$ to $0.70$\) across all categories, reflecting its strong general visual reasoning capabilities\. Conversely, zero\-shot classifiers \(Patho\-CLIP and CONCH\) exhibit moderate reliability \($\theta\approx 0.47$ to $0.60$\), indicating that while they provide useful signals, their raw outputs should not unconditionally override other evidence without deliberation\.
- Sensitivity to Model Architectures: Within the segmentation category, CellViT consistently achieves higher reliability scores \($\theta\approx 0.60$ to $0.67$\) than HoVerNet \($\theta\approx 0.52$ to $0.60$\)\. This suggests that the experience system successfully captures the underlying performance differences between tool architectures, naturally learning to prioritize the more robust CellViT model when resolving conflicts in downstream tasks\.

These learnedθ\\thetavalues validate the necessity of our distance\-weighted priors: rather than treating all tools with static, uniform trust, PathoSage dynamically calibrates its reliance based on the specific tissue microenvironment and historical tool performance\.
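One way to picture a prior that is weighted by how close a new query is to past experiences is sketched below. Everything here is an assumption for illustration: the function names, the use of cosine similarity over hypothetical query embeddings, the clipping of negative similarities to zero, and the Beta(1, 1) base prior. The paper's actual weighting scheme may differ.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_prior(query, experiences, base=1.0):
    """Build (alpha, beta) for one tool from stored (embedding, credit) pairs.

    Each past experience contributes its credit in [0, 1], scaled by its
    non-negative similarity to the current query (an assumed weighting).
    """
    alpha, beta = base, base
    for embedding, credit in experiences:
        weight = max(0.0, cosine(query, embedding))
        alpha += weight * credit
        beta += weight * (1.0 - credit)
    return alpha, beta


# Hypothetical 2-D embeddings: one identical past case, one orthogonal case.
past = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]
alpha, beta = weighted_prior([1.0, 0.0], past)
print(alpha, beta, alpha / (alpha + beta))
```

In this toy case only the identical past experience carries weight, so the prior leans toward trusting the tool; the orthogonal failure case contributes nothing.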

![Refer to caption](https://arxiv.org/html/2606.07549v1/x7.png)

Figure A\.2: The percentage of tool categories marked as “critical” under each dataset category\.

Layer 2: Cross\-Rollout Critique and Tool Criticality\. Beyond statistical reliability, Layer 2 of the experience system distills task\-level strategies by comparing successful and failed rollouts\. During the exploration phase \(4 rollouts per query\), the system identifies critical tools, which are defined as tool categories that consistently appear in successful trajectories and play a pivotal role in the reasoning chain, but are absent or misused in failed ones\.
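The contrast between successful and failed rollouts can be approximated with a purely set-based stand-in, shown below. This is a simplification under stated assumptions: the function name `critical_tools` and the rollout representation are hypothetical, and the sketch only captures tool categories that are present in every success and absent from at least one failure. It does not model misuse of a tool or its role in the reasoning chain, which the critique described above also considers.

```python
from typing import Iterable, Set, Tuple

Rollout = Tuple[bool, Set[str]]  # (succeeded, tool categories that were called)


def critical_tools(rollouts: Iterable[Rollout]) -> Set[str]:
    """Tool categories used in every success but not in every failure."""
    wins = [set(tools) for ok, tools in rollouts if ok]
    fails = [set(tools) for ok, tools in rollouts if not ok]
    if not wins or not fails:
        return set()  # nothing to contrast against
    always_in_wins = set.intersection(*wins)
    always_in_fails = set.intersection(*fails)
    return always_in_wins - always_in_fails


# Four hypothetical rollouts for one query.
rollouts = [
    (True, {"vqa", "classification"}),
    (True, {"vqa", "classification", "segmentation"}),
    (False, {"vqa"}),
    (False, {"vqa", "segmentation"}),
]
print(critical_tools(rollouts))  # {'classification'}
```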

Figure [A\.2](https://arxiv.org/html/2606.07549#A1.F2) presents the proportion of samples within each sub\-category where a specific tool type was flagged as "critical"\. The distribution reveals a clear hierarchical strategy learned by the system:

- VQA as the Primary Anchor: The VQA module is identified as critical in nearly 100% of cases across all subsets\. This indicates that the experience database recognizes VQA as the "workhorse" for PathMMU tasks, essential for interpreting complex, open\-ended morphological queries\.
- Classification as a Secondary Validator: Patch classification is deemed critical in 61\.0% to 92\.0% of cases, depending on the subset\. It serves as a crucial auxiliary signal, particularly in subsets like PathCLS \(92\.0%\) and Atlas \(82\.0%\), where tissue\-level identification is often required to ground the VQA’s narrative\.
- Segmentation for Specialized Contexts: Segmentation tools are marked as critical less consistently \(30\.0% to 68\.0% of cases, depending on the subset\)\. Notably, their criticality peaks in the PathCLS subset \(68\.0%\), which predominantly consists of single H&E patches where fine\-grained cellular composition \(e\.g\., nuclear density, cell counting\) is decisive for the final diagnosis\.

This analysis demonstrates that PathoSage does not merely learn to call all available tools blindly\. Instead, it successfully distills a nuanced, pathology\-aware strategy: anchoring on VQA for general reasoning, cross\-validating with classifiers for tissue context, and selectively invoking segmentation for fine\-grained cellular tasks\.

### A\.3 Examples of How PathoSage Handles Tool Evidence

While introducing multiple tools provides richer context, it inevitably leads to conflicting evidence and invites uncritical agreement\. PathoSage’s SED mechanism is designed to handle both regimes explicitly\.

Figure [A\.3](https://arxiv.org/html/2606.07549#A1.F3) demonstrates how PathoSage’s SED mechanism explicitly adjudicates conflicts where a non\-VQA evidence stream is at odds with the higher\-relevance signal\. In the first scenario \(top, MedXpert\), the VQA tool incorrectly suggested "UBC" for a bone lesion\. A naive agentic system would likely suffer from anchoring bias and adopt this suggestion\. However, PathoSage’s SED independently evaluated the VQA output against the clinical vignette and histological features, explicitly judging the VQA interpretation as inconsistent\. By downweighting this erroneous evidence, the system correctly concluded the diagnosis was Chondroblastoma \(Option A\)\. In the second scenario \(bottom, PathMMU\-test\) regarding nuclear characteristics, the patch classification tool produced an inconsistent and low\-relevance output, while the segmentation tool was compatible but less specific\. SED correctly identified that the VQA assessment provided the highest\-relevance evidence for this specific morphological query\. By algorithmically prioritizing the VQA output and disregarding the irrelevant classifier, PathoSage accurately selected Option D\.

Figure [A\.4](https://arxiv.org/html/2606.07549#A1.F4) extends this analysis to the harder case in which the VQA tool itself supplies the wrong interpretation\. In the first example \(top, Quilt\-VQA\), the VQA tool described the field as resembling normal fat, which would have led to an incorrect "Yes" answer\. SED detected that the VQA reasoning was self\-contradictory, and instead trusted the higher\-weighted classifier evidence identifying the dominant component as myxoid stromal material, yielding the correct answer of "No\." In the second example \(bottom, PathMMU\-test\), the VQA tool selected Option B \(tumor\-associated stroma\) for an H&E lung patch, but the highest\-confidence classifier called the tissue normal, and SED noted that the VQA narrative lacked decisive markers of malignant epithelial nests or desmoplastic stroma\. SED therefore overrode the VQA suggestion and committed to Option C \(normal tissue\)\. These cases show that PathoSage’s deliberation process is not biased toward any single tool\.

Figure [A\.5](https://arxiv.org/html/2606.07549#A1.F5) illustrates the complementary regime in which the invoked tools converge on consistent evidence\. In the first example \(top, Quilt\-VQA\), the VQA tool described stratum\-spinosum\-like polygonal keratinocytes with intercellular bridges, and both classifiers independently labelled the tissue as squamous epithelium, jointly supporting the answer "Yes\." In the second example \(bottom, PathMMU\-test\), the VQA tool, the patch classifier, and the cell segmentation tool all reported elongated spindle cells arranged in sweeping fascicles, unambiguously matching Option B \(interweaving bundle/fascicular pattern\)\. Under such concordant evidence, SED routes the deliberation through a streamlined consolidation path, allowing PathoSage to commit to the correct answer with high confidence and minimal additional reasoning\. Together, Figures [A\.3](https://arxiv.org/html/2606.07549#A1.F3)–[A\.5](https://arxiv.org/html/2606.07549#A1.F5) demonstrate that PathoSage’s SED mechanism delivers a consistent treatment of evidence across the full spectrum from open conflict to full consensus\.
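The split between a streamlined path for concordant evidence and full adjudication for conflicting evidence can be pictured as a simple agreement check. The sketch below is illustrative only: the function name `route`, the labels `"consolidate"` and `"adjudicate"`, and the exact string matching of normalized answers are assumptions. SED is described as prompt-driven, so the real comparison presumably operates on free-text tool outputs rather than on clean option labels.

```python
from typing import Dict


def route(tool_answers: Dict[str, str]) -> str:
    """Pick a deliberation path from each tool's proposed answer.

    Returns "consolidate" when every tool proposes the same answer and
    "adjudicate" otherwise (including when no tool produced an answer).
    """
    distinct = {answer.strip().lower() for answer in tool_answers.values()}
    return "consolidate" if len(distinct) == 1 else "adjudicate"


# Hypothetical outputs mirroring the concordant and conflicting regimes.
print(route({"vqa": "B", "classifier": "b", "segmentation": "B "}))  # consolidate
print(route({"vqa": "B", "classifier": "C"}))  # adjudicate
```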

![Refer to caption](https://arxiv.org/html/2606.07549v1/x8.png)

Figure A\.3: Two examples illustrating how PathoSage resolves tool conflicts\. One indicates that an incorrect VQA suggestion is overridden; the other indicates that irrelevant classification outputs are downweighted in favor of high\-relevance VQA evidence\.

![Refer to caption](https://arxiv.org/html/2606.07549v1/x9.png)

Figure A\.4: Two additional examples illustrating how PathoSage resolves tool conflicts\. In both cases, SED detects inconsistencies in the VQA output and prioritizes high\-confidence classifier evidence to recover the correct answer\.

![Refer to caption](https://arxiv.org/html/2606.07549v1/x10.png)

Figure A\.5: Two examples in which the independently invoked tools converge on consistent evidence\. PathoSage routes such concordant signals through a streamlined deliberation path and commits to the correct answer with high confidence\.

### A\.4 Failure Case Analysis

Despite the robustness demonstrated above, PathoSage still exhibits two characteristic failure modes, illustrated in Figures [A\.6](https://arxiv.org/html/2606.07549#A1.F6) and [A\.7](https://arxiv.org/html/2606.07549#A1.F7)\.

Figure [A\.6](https://arxiv.org/html/2606.07549#A1.F6) shows cases of collective tool failure, in which all invoked tools converge on the same incorrect interpretation\. In the first example \(top, MedXpert\), all four tools concurred on CIN III, although the ground\-truth diagnosis was CIN II, reflecting a one\-grade overcall of dysplasia severity\. In the second example \(bottom, PathMMU\-test\), the patch classifier, the VQA tool, and the cell segmentation tool unanimously identified a "glandular" architectural pattern, whereas the ground\-truth label was papillary projections\. Because SED’s central premise is that disagreement between independent evidence streams flags potential errors, this premise breaks down when the streams share the same systematic bias; no amount of additional deliberation within the existing tool ensemble can recover the correct answer\.

Figure [A\.7](https://arxiv.org/html/2606.07549#A1.F7) illustrates a second failure mode in which a high\-relevance VQA narrative is itself misleading and dominates the deliberation\. In the first example \(top, MedXpert\), a clinical case of an S100\-positive spindle\-cell tumor with radicular symptoms, the VQA tool fixated on Antoni A\-type nuclear palisading and selected schwannoma \(option B\), whereas the ground truth was malignant peripheral nerve sheath tumor \(option D\)\. The disagreeing classifier outputs carried only medium relevance and were down\-weighted\. In the second example \(bottom, PathMMU\-test\), the VQA tool described "abundant eosinophilic collagenous stroma surrounding atypical cells" and selected option B \(dense, aligned collagen fibers\); however, the ground truth was option D \(loose fibrous background with scattered collagen fibers\)\. The patch classifiers were inconsistent and were assigned low relevance, so SED’s relevance\-weighted aggregation could not counterbalance the confident but incorrect VQA assertion\. These cases reveal that when the most diagnostic tool produces a self\-consistent but erroneous narrative, lower\-relevance corroborating tools may fail to generate sufficient counterweight to override it\.
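The second failure mode can be reproduced with a toy numerical version of relevance-weighted aggregation. The sketch below is an assumption-laden illustration: the numeric weights for the relevance labels, the factors for the credibility labels, and the additive scoring rule are all hypothetical, since SED applies its weighting through prompted reasoning rather than a published formula.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical numeric stand-ins for SED's qualitative labels.
RELEVANCE_WEIGHT = {"high": 3.0, "medium": 2.0, "low": 1.0}
CREDIBILITY_FACTOR = {"agree": 1.0, "uncertain": 0.5, "disagree": 0.0}

Evidence = Tuple[str, str, str]  # (proposed option, relevance, credibility)


def aggregate(evidence: Iterable[Evidence]) -> Tuple[str, Dict[str, float]]:
    """Sum relevance x credibility per option and return the top option."""
    scores: Dict[str, float] = defaultdict(float)
    for option, relevance, credibility in evidence:
        scores[option] += RELEVANCE_WEIGHT[relevance] * CREDIBILITY_FACTOR[credibility]
    if not scores:
        raise ValueError("no evidence to aggregate")
    winner = max(scores, key=scores.get)
    return winner, dict(scores)


# A confident VQA narrative versus two inconsistent low-relevance classifiers.
evidence = [
    ("B", "high", "agree"),      # VQA tool
    ("D", "low", "uncertain"),   # classifier 1
    ("A", "low", "uncertain"),   # classifier 2
]
print(aggregate(evidence))  # ('B', {'B': 3.0, 'D': 0.5, 'A': 0.5})
```

Under these assumed weights, the two low-relevance classifiers cannot outscore the single high-relevance VQA vote even when one of them points at the correct option, which mirrors the behavior described above.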

It is worth noting that these residual failures do not stem from a flaw in the SED mechanism itself, but rather expose the limits of the underlying tool ensemble\. SED is, by construction, an aggregator of independent evidence: when the available evidence streams share the same systematic bias \(Figure [A\.6](https://arxiv.org/html/2606.07549#A1.F6)\), or when the most diagnostic tool produces a confident but erroneous narrative that no higher\-relevance counter\-evidence is available to challenge \(Figure [A\.7](https://arxiv.org/html/2606.07549#A1.F7)\), no purely aggregation\-level rule can recover the correct answer\. Redesigning SED to overrule a unanimous high\-relevance signal on the basis of weaker disagreeing evidence would reintroduce the very anchoring and over\-correction biases that SED was designed to suppress, and would degrade performance on the much larger set of cases where the consensus is in fact correct\. The principled remedy is therefore not to retune the deliberation logic, but to enrich the evidence pool itself\. We regard this as a natural direction for future extensions of PathoSage rather than a limitation of the current framework\.

![Refer to caption](https://arxiv.org/html/2606.07549v1/x11.png)

Figure A\.6: Two failure cases in which all invoked tools converge on the same, but ultimately incorrect, interpretation\.

![Refer to caption](https://arxiv.org/html/2606.07549v1/x12.png)

Figure A\.7: Two failure cases in which a confident but flawed VQA narrative dominates the deliberation\.

## Appendix B Implementation Setup

### B\.1 Detailed Dataset Descriptions

We conduct comprehensive patch understanding evaluations across five diverse pathology Visual Question Answering \(VQA\) datasets, encompassing both binary \(Yes/No\) and multiple\-choice reasoning tasks\. These datasets are selected to assess different dimensions of the agent’s capabilities, ranging from definitive morphological judgments to complex differential diagnoses\. The detailed statistics and configurations for each dataset are as follows:

PathMMU\. PathMMU \[[73](https://arxiv.org/html/2606.07549#bib.bib73)\] serves as our primary evaluation suite due to its extensive coverage of diverse tissue types, clinical scenarios, and expert\-validated annotations\. Following its official protocol, we report results on both the full test set and a representative test\-tiny split, allowing for a rigorous assessment of PathoSage’s ability to integrate heterogeneous tool evidence in high\-resolution patch analysis\.

- Test\-Tiny Split: Comprises a total of 1,139 questions, distributed across five distinct sub\-categories: Atlas \(208\), EduContent \(255\), PathCLS \(177\), PubMed \(281\), and SocialPath \(218\)\.
- Test\-All Split: Scales up to 8,454 questions, distributed as: Atlas \(799\), EduContent \(1,683\), PathCLS \(1,632\), PubMed \(2,787\), and SocialPath \(1,553\)\.
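The per-category counts above can be checked against the stated totals with a few lines of arithmetic. The dictionary and function names below are illustrative only; the numbers are copied from the two splits as listed.

```python
from typing import Dict

TEST_TINY: Dict[str, int] = {
    "Atlas": 208, "EduContent": 255, "PathCLS": 177,
    "PubMed": 281, "SocialPath": 218,
}
TEST_ALL: Dict[str, int] = {
    "Atlas": 799, "EduContent": 1683, "PathCLS": 1632,
    "PubMed": 2787, "SocialPath": 1553,
}


def check_total(split: Dict[str, int], expected: int) -> int:
    """Return the summed count, raising if it differs from the stated total."""
    total = sum(split.values())
    if total != expected:
        raise ValueError(f"expected {expected}, got {total}")
    return total


print(check_total(TEST_TINY, 1139))  # 1139
print(check_total(TEST_ALL, 8454))   # 8454
```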

Multiple\-Choice Diagnostic Benchmarks\. For more complex diagnostic reasoning and differential analysis, we employ multiple\-choice questions sourced from MedXpertQA \[[71](https://arxiv.org/html/2606.07549#bib.bib71)\] and OmniMedVQA \[[72](https://arxiv.org/html/2606.07549#bib.bib72)\]\.

- MedXpertQA: We filter the original dataset to extract 90 highly relevant pathology examples that require expert\-level medical knowledge\.
- OmniMedVQA: We utilize the BRIGHT Challenge subset, which consists of 890 cases focusing on challenging diagnostic reasoning across medical specialties\.

Binary Morphological Benchmarks\. To assess the model’s capability in making definitive morphological judgments, we utilize closed\-ended Yes/No \(YorN\) questions selected from the test splits of Path\-VQA \[[70](https://arxiv.org/html/2606.07549#bib.bib70)\] and Quilt\-VQA \[[53](https://arxiv.org/html/2606.07549#bib.bib53)\]\. These tasks require precise identification of specific pathological features \(e\.g\., the presence of necrosis, specific cellular arrangements, or staining characteristics\)\. This selection yields 3,362 questions for Path\-VQA and 343 questions for Quilt\-VQA\.

### B\.2 Implementation Details and Compute Resources

Throughout the PathoSage workflow, we employ GPT\-5\.4 \[[74](https://arxiv.org/html/2606.07549#bib.bib74)\] as the host VLM for our main experiments\. Although Gemini\-3\-Pro \[[75](https://arxiv.org/html/2606.07549#bib.bib75)\] demonstrated marginally superior performance in preliminary tests, we selected GPT\-5\.4 to achieve an optimal trade\-off between diagnostic accuracy and inference efficiency during large\-scale evaluations\. For the RAG retriever, we adopt the exact RAG configuration and textbook database from Patho\-AgenticRAG \[[24](https://arxiv.org/html/2606.07549#bib.bib24)\]\. For the toolbox, we implement five tools in total: HoVerNet \[[76](https://arxiv.org/html/2606.07549#bib.bib76)\] and CellViT\+\+ \[[77](https://arxiv.org/html/2606.07549#bib.bib77)\] for cell segmentation, CONCH \[[51](https://arxiv.org/html/2606.07549#bib.bib51)\] and Patho\-CLIP \[[5](https://arxiv.org/html/2606.07549#bib.bib5)\] for patch classification, and Patho\-R1 \[[5](https://arxiv.org/html/2606.07549#bib.bib5)\] for VQA\. Crucially, to ensure strict separation between exploration and evaluation, the experience system is accumulated exclusively on the PathMMU val set during an initial exploration phase\.

The PathoSage framework operates in a hybrid deployment environment\. The specialized tool models \(HoVerNet, CellViT\+\+, CONCH, Patho\-CLIP, and Patho\-R1\) and the RAG retrieval system are deployed locally on a computing node equipped with 8× NVIDIA RTX 4090 GPUs\. When utilizing proprietary models such as GPT\-5\.4 or Gemini\-3\-Pro as the host VLM, we directly access their respective cloud APIs\. Conversely, for experiments evaluating the open\-weights Qwen3\-VL\-32B\-Instruct as the host VLM, we deploy the model locally using vLLM on an additional dedicated node equipped with 8× NVIDIA RTX 4090 GPUs, utilizing tensor parallelism \(TP=8\) to ensure efficient inference\.
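For readers who want to reproduce the local host-VLM setup, the sketch below assembles a `vllm serve` command with tensor parallelism across eight GPUs. The helper name `vllm_serve_command`, the Hugging Face model identifier, and the port are assumptions for illustration; the authors' actual launch configuration is not given in the text. The snippet only builds and prints the command and does not start a server.

```python
import shlex
from typing import List


def vllm_serve_command(
    model: str,
    tensor_parallel_size: int = 8,
    port: int = 8000,
) -> List[str]:
    """Build the argument list for vLLM's OpenAI-compatible server."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--port", str(port),
    ]


# Assumed model identifier; adjust to the checkpoint actually deployed.
command = vllm_serve_command("Qwen/Qwen3-VL-32B-Instruct")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` on a node where vLLM is installed and all eight GPUs are visible.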

### B\.3 Prompts

Figures [B\.8](https://arxiv.org/html/2606.07549#A2.F8)–[B\.11](https://arxiv.org/html/2606.07549#A2.F11) present the four prompts that govern the key decision points of PathoSage\. Figure [B\.8](https://arxiv.org/html/2606.07549#A2.F8) illustrates the system prompt, which functions as the system message during the ReAct\-based evidence collection phase\. It defines the agent’s role, enumerates the available tools, and specifies the tool\-selection policy\. Figure [B\.9](https://arxiv.org/html/2606.07549#A2.F9) depicts the RAG assessment prompt, which is employed during the knowledge retrieval phase prior to tool invocation\. This prompt instructs the model to evaluate the relevance of retrieved textbook passages, summarize key pathological knowledge, and translate this information into actionable guidance for tool selection in the subsequent ReAct phase\. Figure [B\.10](https://arxiv.org/html/2606.07549#A2.F10) shows the independent assessment prompt used in Step 1 of the SED phase, which elicits a credibility label \(agree, disagree, or uncertain\) for each executed tool and a relevance label \(high, medium, or low\) for each tool category\. Figure [B\.11](https://arxiv.org/html/2606.07549#A2.F11) presents the final reasoning prompt, applied in the last step of the SED phase, which integrates per\-tool assessments and cross\-tool conflict reports to generate the final diagnostic output under explicit evidence\-weighting rules\.
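The two label sets elicited in Step 1 of SED lend themselves to a small validated schema. The sketch below is one hypothetical way to parse such an assessment: the class and field names, the dictionary input format, and the choice to store the category-level relevance label on each tool record are assumptions, not the paper's output format. Only the label vocabularies come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Credibility(str, Enum):
    AGREE = "agree"
    DISAGREE = "disagree"
    UNCERTAIN = "uncertain"


class Relevance(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class ToolAssessment:
    tool: str                 # e.g. "Patho-R1"
    category: str             # e.g. "vqa" (hypothetical category key)
    credibility: Credibility  # judged per executed tool
    relevance: Relevance      # judged per tool category, copied onto the tool


def parse_assessment(raw: dict) -> ToolAssessment:
    """Validate one raw assessment; unknown labels raise ValueError."""
    return ToolAssessment(
        tool=raw["tool"],
        category=raw["category"],
        credibility=Credibility(raw["credibility"].strip().lower()),
        relevance=Relevance(raw["relevance"].strip().lower()),
    )


record = parse_assessment(
    {"tool": "Patho-R1", "category": "vqa", "credibility": "Agree", "relevance": "high"}
)
print(record.credibility.value, record.relevance.value)  # agree high
```

Rejecting out-of-vocabulary labels at parse time keeps malformed model output from silently entering the final reasoning step.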

![Refer to caption](https://arxiv.org/html/2606.07549v1/x13.png)

Figure B\.8: System prompt for PathoSage\.

![Refer to caption](https://arxiv.org/html/2606.07549v1/x14.png)

Figure B\.9: RAG assessment prompt for evaluating RAG retrieval results and generating recommendations for using tools\.

![Refer to caption](https://arxiv.org/html/2606.07549v1/x15.png)

Figure B\.10: VLM assessment prompt for assigning independent scores to each tool’s output\.

![Refer to caption](https://arxiv.org/html/2606.07549v1/x16.png)

Figure B\.11: Final reasoning prompt for generating the output\.
