Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications

arXiv cs.AI 06/17/26, 04:00 AM Papers
agentic-ai healthcare llm multi-agent hallucination clinical-diagnosis safety
Summary
This paper proposes a multi-agent framework using deterministic orchestration and neuro-symbolic state tracking to mitigate premature diagnostic handoff and silent hallucinations in healthcare LLM applications.
arXiv:2606.18068v1 Announce Type: new Abstract: Recent advances in Large Language Models (LLMs) and multi-agent systems have driven the rise of Agentic AI, showing promise for medical reasoning. However, open-ended conversational agents remain prone to two critical failure modes: premature diagnostic handoff and silent clinical hallucinations that may go undetected before reaching the patient. In this work, we propose a multi-agent framework that addresses both issues by replacing ``LLM-as-a-judge'' routing with deterministic orchestration constraints. The framework incorporates two safety mechanisms. First, a neuro-symbolic state-tracking gate enforces completeness of the OLDCARTS clinical protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until all required dimensions are collected. Second, an epistemic uncertainty quantification (UQ) gate computes semantic entropy (H) across K=5 independent diagnostic samples to identify and intercept divergent outputs before delivery. We evaluate the system using simulated patient agents powered by the llama-3.1-70b-instruct model on 150 test cases. The full architecture achieves 49.3% diagnostic precision, representing an absolute improvement of 11.3 percentage points over an unconstrained baseline. Additionally, we observe a statistically significant negative correlation (r = -0.181, p < 0.05) between OLDCARTS completeness (\sigma) and semantic entropy (H), suggesting that structured information gathering is associated with reduced diagnostic uncertainty.
Cached at: 06/17/26, 05:41 AM
# Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications
Source: [https://arxiv.org/html/2606.18068](https://arxiv.org/html/2606.18068)
###### Abstract

Recent advances in Large Language Models (LLMs) and multi-agent systems have driven the rise of Agentic AI, showing promise for medical reasoning. However, open-ended conversational agents remain prone to two critical failure modes: premature diagnostic handoff and silent clinical hallucinations that may go undetected before reaching the patient. In this work, we propose a multi-agent framework that addresses both issues by replacing “LLM-as-a-judge” routing with deterministic orchestration constraints. The framework incorporates two safety mechanisms. First, a neuro-symbolic state-tracking gate enforces completeness of the OLDCARTS clinical protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until all required dimensions are collected. Second, an epistemic uncertainty quantification (UQ) gate computes semantic entropy ($H$) across $K=5$ independent diagnostic samples to identify and intercept divergent outputs before delivery.

We evaluate the system using simulated patient agents powered by the llama-3.1-70b-instruct model on 150 test cases. The full architecture achieves 49.3% diagnostic precision, representing an absolute improvement of 11.3 percentage points over an unconstrained baseline. Additionally, we observe a statistically significant negative correlation ($r=-0.181$, $p<0.05$) between OLDCARTS completeness ($\sigma$) and semantic entropy ($H$), suggesting that structured information gathering is associated with reduced diagnostic uncertainty.

## I Introduction

Clinical triage and diagnosis are high\-stakes processes where errors in history taking or diagnostic reasoning can lead to delayed treatment, patient harm, and loss of trust in AI\-assisted systems\[[8](https://arxiv.org/html/2606.18068#bib.bib2)\]\. With increasing demand on healthcare systems and persistent workforce shortages, there is growing interest in deploying AI\-based clinical decision support tools\[[13](https://arxiv.org/html/2606.18068#bib.bib1)\]\. Large Language Models \(LLMs\) have emerged as promising candidates due to their ability to encode medical knowledge, perform reasoning, and interact with patients through natural language\.

Recent work demonstrates the potential of LLMs in clinical applications\. Generalist models such as MedFound support diagnosis across multiple specialties\[[7](https://arxiv.org/html/2606.18068#bib.bib6)\], while adapted LLMs have shown strong performance in structured clinical summarization tasks\[[14](https://arxiv.org/html/2606.18068#bib.bib7)\]\. Specialized systems such as RadGPT further highlight the capability of LLMs to generate patient\-specific explanations\[[4](https://arxiv.org/html/2606.18068#bib.bib3)\]\. These advances suggest a pathway toward conversational agents capable of assisting in end\-to\-end clinical workflows\.

However, reliability remains a critical challenge\. Medical LLMs are prone to hallucination, producing incorrect diagnoses, fabricated medications, or unsafe recommendations\[[11](https://arxiv.org/html/2606.18068#bib.bib4)\]\. Such failures are not rare and have been observed across both general\-purpose and domain\-specific models\. A key limitation is that existing approaches optimize for fluent generation rather than epistemic correctness\.

![Refer to caption](https://arxiv.org/html/2606.18068v1/x1.png)

Figure 1: Illustration of two critical failure modes of unconstrained LLM clinical agents: premature diagnostic handoff and silent hallucination.

Current mitigation strategies, including retrieval-augmented generation, chain-of-thought prompting, and LLM-as-a-judge frameworks, operate in a probabilistic manner. While they can reduce error rates, they do not provide guarantees of correctness. More importantly, they do not ensure intake completeness, i.e., whether all clinically required symptom dimensions are collected before diagnosis. In clinical practice, structured history taking using the OLDCARTS protocol is essential, as incomplete information is a major source of diagnostic error\[[2](https://arxiv.org/html/2606.18068#bib.bib5)\]. Existing multi-agent systems do not enforce this requirement through a deterministic mechanism.

Figure [1](https://arxiv.org/html/2606.18068#S1.F1) provides a concrete illustration of the two failure modes that motivate this work. The left panel depicts *premature diagnostic handoff*, where the agent proceeds to diagnosis after collecting only a partial symptom history, with just 2 of the 8 OLDCARTS dimensions observed. The resulting diagnosis is therefore based on incomplete clinical context. The right panel depicts *silent clinical hallucination*, where the agent produces a medication recommendation without any explicit uncertainty or safety screening, despite a potentially serious drug interaction. Taken together, these scenarios show that reliability in clinical conversational systems depends not only on the quality of the generated diagnosis, but also on whether the system enforces structured intake and flags uncertain or potentially unsafe outputs before delivery.

In this work, we address these limitations through a neuro-symbolic multi-agent framework that replaces purely prompt-based control with deterministic orchestration at the system level. Specifically, we introduce two complementary mechanisms. First, a Neuro-Symbolic OLDCARTS State Tracker (M1) enforces completeness of symptom collection by blocking the transition to diagnosis until all required fields are observed. Second, a Semantic Entropy-based Uncertainty Quantification (UQ) gate (M2) computes normalized entropy across multiple independent diagnostic samples to identify divergent outputs before they are presented to the user.

We evaluate the proposed framework on simulated clinical cases derived from MedQA\[[5](https://arxiv.org/html/2606.18068#bib.bib27)\], with LLM-based patient agents implemented in an agentic AI software system. Results show that the proposed system improves diagnostic accuracy by 11.3 percentage points over an unconstrained baseline. Additionally, we observe a statistically significant negative correlation between symptom completeness and diagnostic uncertainty, indicating that structured information collection contributes to more consistent model outputs.

The main contributions of this paper are summarized as follows:

- Neuro-symbolic multi-agent framework: We propose a structured multi-agent architecture, based on the Agentic AI paradigm, for clinical triage and diagnosis that separates history taking, diagnostic reasoning, and safety supervision into distinct role-specialized components. By organizing the workflow in this manner, the framework supports explicit control over the transition from symptom intake to diagnosis, rather than relying solely on unconstrained conversational generation.
- Deterministic OLDCARTS-based intake verification mechanism: We introduce a neuro-symbolic state-tracking gate that explicitly monitors the completeness of symptom collection under the OLDCARTS clinical protocol. This mechanism enforces a deterministic verification step before diagnosis is initiated, thereby reducing the risk of premature diagnostic handoff caused by incomplete history taking.
- Uncertainty-aware diagnostic screening mechanism: We incorporate a semantic entropy-based uncertainty quantification module that evaluates disagreement across multiple independently generated diagnostic samples. This mechanism provides an additional screening layer for identifying potentially unreliable or divergent diagnostic outputs before they are presented to the user.
- We evaluate the proposed framework through an ablation study across multiple test settings and show that the proposed architecture improves diagnostic accuracy by 11.3 percentage points over an unconstrained baseline. We further analyze the relationship between structured symptom completeness and diagnostic uncertainty, showing that more complete intake is associated with lower entropy in downstream diagnosis generation.

![Refer to caption](https://arxiv.org/html/2606.18068v1/x2.png)

Figure 2: System architecture of the proposed Neuro-Symbolic Multi-Agent Triage framework. The pipeline operates across three phases: (1) structured history taking enforced by the OLDCARTS state-tracking gate \[M1\], (2) parallel stochastic diagnostic sampling with semantic entropy uncertainty quantification \[M2\], and (3) a recursive safety supervision loop with up to three revision attempts.

TABLE I: Comparison of Related Multi-Agent and LLM-Based Clinical Frameworks

## II Related Work

LLMs in Clinical Decision Support. The use of LLMs in clinical decision support has expanded rapidly with the emergence of instruction-tuned foundation models. Early evidence of this potential came from studies showing that GPT-4 could perform strongly on the United States Medical Licensing Examination (USMLE) without task-specific fine-tuning, highlighting its broad medical knowledge and reasoning capability\[[10](https://arxiv.org/html/2606.18068#bib.bib25)\]. Subsequent domain-adapted models, such as Med-PaLM 2\[[12](https://arxiv.org/html/2606.18068#bib.bib30)\], further narrowed the gap with specialist-level performance on medical question answering benchmarks. More recently, generalist medical models including MedFound\[[7](https://arxiv.org/html/2606.18068#bib.bib6)\] have demonstrated the feasibility of supporting diagnosis across multiple clinical specialties within a unified model family. In parallel, specialized applications such as RadGPT for radiology report explanation\[[4](https://arxiv.org/html/2606.18068#bib.bib3)\] and adapted LLMs for clinical dialogue summarization\[[14](https://arxiv.org/html/2606.18068#bib.bib7)\] have shown that these models can achieve strong performance on targeted clinical language tasks, including settings where they match or surpass human experts in structured information extraction.

Hallucination and Safety in Medical AI. Despite the growing capabilities of LLMs in healthcare, hallucination remains a major unresolved challenge. Omar et al.\[[11](https://arxiv.org/html/2606.18068#bib.bib4)\] show that LLMs can absorb and reproduce clinical inaccuracies embedded in realistic patient narratives, often treating false medical context as valid evidence. To mitigate such failures, prior work has explored strategies such as retrieval-augmented generation (RAG), chain-of-thought prompting, and self-consistency decoding. While these methods can reduce hallucination rates, they remain inherently probabilistic and do not provide deterministic safeguards. Similarly, the “LLM-as-a-judge” paradigm, in which a secondary LLM evaluates the output of a primary model\[[15](https://arxiv.org/html/2606.18068#bib.bib32)\], can improve output quality on average, but it does not eliminate failure modes in a guaranteed manner, since the judging model is itself susceptible to the same underlying errors.

Multi-Agent Systems for Healthcare. Multi-agent frameworks offer a structural mechanism for decomposing complex clinical workflows into role-specialized components. Conversational Health Agents\[[1](https://arxiv.org/html/2606.18068#bib.bib31)\] propose a personalized LLM-powered agent framework that separates symptom elicitation from downstream planning. Microsoft’s AutoGen agentic AI framework\[[9](https://arxiv.org/html/2606.18068#bib.bib29)\] provides a general-purpose multi-agent conversation infrastructure that has been applied to medical question answering and summarization tasks. However, existing multi-agent healthcare systems typically rely on prompt-based constraints for inter-agent routing, so the completeness of information gathering depends on the model’s probabilistic instruction following rather than on verification by a formal symbolic gate.

Uncertainty Quantification in LLM Outputs. Uncertainty quantification (UQ) for LLMs has emerged as a distinct research area. Token-probability-based measures suffer from the “linguistic paraphrase problem”: semantically equivalent outputs (e.g., “STEMI” vs. “ST-elevation myocardial infarction”) produce artificially high entropy. Semantic Entropy (SE)\[[6](https://arxiv.org/html/2606.18068#bib.bib28)\] addresses this by clustering semantically equivalent outputs via bidirectional NLI entailment before computing normalized Shannon entropy. This estimator has been validated on open-domain QA tasks but has not previously been applied to the multi-label, multi-diagnosis setting of clinical differential diagnosis or integrated into a deterministic multi-agent orchestration gate.

Table [I](https://arxiv.org/html/2606.18068#S1.T1) compares the proposed framework with representative prior systems across six dimensions relevant to the failure modes discussed in Section [II](https://arxiv.org/html/2606.18068#S2). To the best of our knowledge, this is the first work to unify (1) deterministic symbolic verification of clinical intake completeness, (2) semantic entropy-based uncertainty quantification for multi-diagnosis outputs, and (3) recursive safety supervision within an open-source multi-agent architecture.

## III Background and Task Formulation

This section formalizes the clinical diagnosis task considered in this work\. We first define the problem setting, and then describe the structured intake constraint, the uncertainty quantification mechanism, and the experimental design used for evaluation\.

### III-A Problem Formulation

Let $\mathcal{C}$ denote the patient’s chief complaint, and let $\mathcal{H}=\{h_1,h_2,\ldots,h_n\}$ denote the set of symptom details collected during the triage conversation. A clinical triage session is represented as a tuple $(\mathcal{C},\mathcal{H},\mathcal{D},\mathcal{R})$, where $\mathcal{D}$ denotes the differential diagnosis generated by the diagnostic agent, and $\mathcal{R}\in\{0,1\}$ denotes the binary correctness label assigned by a ClinicalJudge agent.

The objective of the framework is to maximize the expected diagnostic correctness, $\mathbb{E}[\mathcal{R}]$, while ensuring that diagnosis is performed only after sufficiently complete symptom collection. To formalize this requirement, we define an OLDCARTS state vector

$$V=\{V_{\text{Onset}},V_{\text{Location}},V_{\text{Duration}},\dots,V_{\text{Severity}}\},$$

where each $V_i\in\{0,1\}$ indicates whether the corresponding symptom dimension has been collected. The completeness condition is enforced through the constraint

$$\sum_{i=1}^{8}V_i=8,$$

which requires that all eight OLDCARTS fields be observed before the system is allowed to transition from triage to diagnosis. In this setting, the problem is to design a triage-and-diagnosis workflow that jointly satisfies two requirements: (i) collect a complete and structured symptom history, and (ii) generate a diagnosis with high expected correctness.

### III-B OLDCARTS Clinical Protocol

The OLDCARTS mnemonic\[[2](https://arxiv.org/html/2606.18068#bib.bib5)\] specifies eight standard dimensions for structured symptom characterization in clinical history taking: Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity. To encode this protocol, we associate each dimension with a Boolean variable $V_i\in\{0,1\}$, initialized to $0$ and set to $1$ once the corresponding symptom attribute has been explicitly elicited and registered by the TriageNurse agent.

We define the intake completeness score as

$$\sigma=\sum_{i=1}^{8}V_i,\quad\sigma\in\{0,1,\ldots,8\},\tag{1}$$

where larger values of $\sigma$ indicate more complete symptom collection. A score of $\sigma=8$ corresponds to full coverage of all OLDCARTS dimensions. The M1 verification gate enforces the hard constraint $\sigma=8$ before the workflow is allowed to transition from triage to the diagnostic phase. If a patient reports abdominal pain and the system collects only onset, location, and severity, then the corresponding state vector satisfies $\sigma=3$. In this case, the M1 gate blocks diagnostic handoff and requires the remaining OLDCARTS dimensions to be gathered before diagnosis can proceed.
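The state vector, the score $\sigma$, and the hard gate can be illustrated with a small Python sketch. It is not the authors' implementation: the class name `OldcartsState`, its method names, and the use of `"Aggravating"` as the key for the combined Aggravating/Alleviating dimension are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Field names follow the OLDCARTS mnemonic; "Aggravating" stands for the
# combined Aggravating/Alleviating dimension (an assumed naming choice).
OLDCARTS_FIELDS = (
    "Onset", "Location", "Duration", "Character",
    "Aggravating", "Radiation", "Timing", "Severity",
)


@dataclass
class OldcartsState:
    """Hypothetical container for the Boolean state vector V."""

    observed: dict = field(
        default_factory=lambda: {name: False for name in OLDCARTS_FIELDS}
    )

    def register(self, name: str) -> None:
        """Mark one dimension as collected (V_i <- 1)."""
        if name not in self.observed:
            raise KeyError(f"unknown OLDCARTS field: {name}")
        self.observed[name] = True

    @property
    def sigma(self) -> int:
        """Completeness score: the number of collected dimensions."""
        return sum(self.observed.values())

    def missing(self) -> list:
        """Dimensions still to be elicited, in mnemonic order."""
        return [name for name in OLDCARTS_FIELDS if not self.observed[name]]

    def gate_open(self) -> bool:
        """Hard constraint of the M1 gate: sigma must equal 8."""
        return self.sigma == len(OLDCARTS_FIELDS)


# Abdominal-pain example from the text: only three dimensions collected.
state = OldcartsState()
for name in ("Onset", "Location", "Severity"):
    state.register(name)

print(state.sigma)        # 3
print(state.gate_open())  # False
print(state.missing())
```

Because the gate is a plain equality check on Boolean flags, its outcome does not depend on any model output once the flags are set, which is the sense in which the handoff decision is deterministic.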

### III-C Semantic Entropy Uncertainty Quantification

Following Kuhn et al.\[[6](https://arxiv.org/html/2606.18068#bib.bib28)\], we adapt the Semantic Entropy (SE) estimator to the differential diagnosis setting. Let $\{s_1,s_2,\ldots,s_K\}$ denote $K$ independent diagnostic samples generated for the same clinical case. Each sample produces a set of candidate diagnoses, denoted by $L_i=\{d_1^{(i)},d_2^{(i)},\ldots\}$.

Because diagnostically equivalent outputs may appear in different lexical forms, we group semantically equivalent labels using bidirectional natural language inference (NLI). Specifically, two labels $a$ and $b$ are assigned to the same cluster $c_j$ if and only if

$$P(\text{ENT}\mid a\to b)\geq\tau_{\text{NLI}}\;\wedge\;P(\text{ENT}\mid b\to a)\geq\tau_{\text{NLI}},\tag{2}$$

where $\tau_{\text{NLI}}=0.7$. This clustering step reduces sensitivity to surface-level variation in diagnosis wording.

After clustering, we compute the normalized Shannon entropy over the cluster distribution:

$$H=-\frac{1}{\log_2|C|}\sum_{j=1}^{|C|}p_j\log_2 p_j,\quad H\in[0,1],\tag{3}$$

where $p_j=|c_j|/n$, $|C|$ is the number of semantic clusters, and $n$ is the total number of diagnosis labels aggregated across all $K$ samples. A value of $H=0$ indicates complete agreement among the sampled outputs, whereas larger values of $H$ indicate greater disagreement and hence higher epistemic uncertainty.

We use a routing threshold $\tau=0.5$. When $H>\tau$, the case is flagged to the MedicalSafetySupervisor together with an explicit uncertainty annotation, allowing uncertain diagnostic outputs to receive additional review before being presented to the user.

Example. Suppose the sampled outputs include labels such as myocardial infarction, heart attack, and gastric reflux. The first two may be grouped into the same semantic cluster, while the third forms a different cluster. If the sampled diagnoses are distributed across multiple such clusters, the resulting entropy increases, indicating greater diagnostic disagreement.
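The clustering rule of Eq. 2 and the normalized entropy of Eq. 3 can be sketched in Python as follows. Several parts are assumptions rather than details from the paper: `toy_entail_prob` is a lookup-table stand-in for a real NLI model, each label is compared against the first member of a cluster as its representative, and the single-cluster case (where $\log_2|C|=0$) is mapped to $H=0$ to match the stated meaning of complete agreement.

```python
import math
from typing import Callable, List, Sequence

# Stand-in for an NLI model (assumption): labels mapped to the same concept
# are treated as mutually entailing with probability 1.0, otherwise 0.0.
_TOY_CONCEPTS = {
    "myocardial infarction": "mi",
    "heart attack": "mi",
    "gastric reflux": "gerd",
}


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    same = _TOY_CONCEPTS.get(premise, premise) == _TOY_CONCEPTS.get(hypothesis, hypothesis)
    return 1.0 if same else 0.0


def cluster_labels(
    labels: Sequence[str],
    entail_prob: Callable[[str, str], float],
    tau_nli: float = 0.7,
) -> List[List[str]]:
    """Greedy clustering with the bidirectional entailment test of Eq. 2."""
    clusters: List[List[str]] = []
    for label in labels:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster's first member
            if entail_prob(label, rep) >= tau_nli and entail_prob(rep, label) >= tau_nli:
                cluster.append(label)
                break
        else:
            clusters.append([label])  # no matching cluster: start a new one
    return clusters


def semantic_entropy(clusters: Sequence[Sequence[str]]) -> float:
    """Normalized Shannon entropy of Eq. 3 over cluster sizes."""
    if len(clusters) <= 1:
        return 0.0  # assumed convention: full agreement gives H = 0
    n = sum(len(c) for c in clusters)
    probs = [len(c) / n for c in clusters]
    raw = -sum(p * math.log2(p) for p in probs)
    return raw / math.log2(len(clusters))


# Five pooled labels: four refer to the same condition, one diverges.
labels = [
    "myocardial infarction", "heart attack", "gastric reflux",
    "myocardial infarction", "heart attack",
]
clusters = cluster_labels(labels, toy_entail_prob)
h = semantic_entropy(clusters)
print(len(clusters), round(h, 3), h > 0.5)  # 2 0.722 True
```

In this toy case the cluster proportions are $0.8$ and $0.2$, so $H\approx 0.722$ and the case would exceed the routing threshold $\tau=0.5$ even though only one of the five labels diverges.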

### III-D Ablation Study Design

We conduct an ablation study to isolate the effects of the two proposed components: the OLDCARTS verification gate (M1) and the semantic entropy-based uncertainty quantification module (M2). The experimental conditions are defined by the Boolean setting of `USE_OLDCARTS_GATE` and the number of diagnostic samples $K$.

- Baseline (B): Gate = F, $K=1$. No symbolic intake verification is applied, and diagnosis is generated from a single deterministic sample.
- Ablation A (A): Gate = T, $K=1$. The symbolic intake gate is enabled, but diagnosis is still based on a single deterministic sample; thus, semantic entropy-based uncertainty quantification is not used.
- Full Architecture (FA): Gate = T, $K=5$. Both the symbolic intake gate and the uncertainty quantification module are enabled, allowing diagnosis to benefit from structured intake verification as well as multi-sample uncertainty estimation.

All three conditions are evaluated on $N\in\{50,100,150\}$ test cases, with $N=150$ used as the primary reporting setting. Diagnostic accuracy is determined by a ClinicalJudge agent, which assigns a binary score of $1$ when the ground-truth diagnosis matches any label in the pooled prediction set, and $0$ otherwise. For the single-sample settings ($K=1$), semantic entropy is undefined; in these cases, $H$ is set to $0.0$ as a placeholder, and no uncertainty-based routing is triggered.
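The three conditions and the $K=1$ placeholder rule can be written down as a small configuration sketch. The names `AblationCondition`, `entropy_for`, and `accuracy` are hypothetical, and the judge scores in the usage lines are made-up inputs that only exercise the aggregation, not results from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class AblationCondition:
    """One experimental setting: gate flag plus number of samples K."""

    name: str
    use_oldcarts_gate: bool
    k: int


CONDITIONS = (
    AblationCondition("B", use_oldcarts_gate=False, k=1),   # Baseline
    AblationCondition("A", use_oldcarts_gate=True, k=1),    # Ablation A
    AblationCondition("FA", use_oldcarts_gate=True, k=5),   # Full Architecture
)


def entropy_for(condition: AblationCondition, compute_entropy: Callable[[], float]) -> float:
    """Return H for a case; single-sample settings use the 0.0 placeholder."""
    if condition.k == 1:
        return 0.0
    return compute_entropy()


def accuracy(judge_scores: Iterable[int]) -> float:
    """Mean of binary correctness labels (1 = correct, 0 = incorrect)."""
    scores = list(judge_scores)
    return sum(scores) / len(scores)


# Illustrative inputs only.
print(entropy_for(CONDITIONS[0], lambda: 0.9))  # 0.0
print(entropy_for(CONDITIONS[2], lambda: 0.9))  # 0.9
print(accuracy([1, 0, 1, 1]))                   # 0.75
```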

## IV Design and Implementation

### IV-A Framework Overview

Figure [2](https://arxiv.org/html/2606.18068#S1.F2) illustrates the proposed neuro-symbolic multi-agent framework for clinical triage and diagnosis. The framework is organized as a three-phase pipeline that combines structured symptom collection, uncertainty-aware diagnostic reasoning, and recursive safety supervision within a unified multi-agent workflow.

At the top of the pipeline, the User Interaction Layer provides the entry point for patient input through a Gradio web interface. The patient supplies symptoms in natural language, and the interaction proceeds through a multi-turn conversation. These exchanges are forwarded to the Triage Nurse agent, which is responsible for eliciting symptom details in a structured manner.

The first phase of the framework focuses on structured history taking. During this phase, the Triage Nurse interacts with the patient and updates the shared conversation thread, which serves as the central memory for downstream agents. A neuro-symbolic OLDCARTS gate continuously verifies whether the required symptom dimensions (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) have been collected. If the completeness condition is not satisfied, the workflow is routed back to the Triage Nurse for additional questioning. Only when the OLDCARTS gate is satisfied is the case allowed to proceed to the diagnostic phase. In this way, the framework prevents premature transition to diagnosis based on incomplete clinical intake.

The second phase performs uncertainty-aware diagnostic reasoning. Once the conversation history is deemed complete, the shared conversation thread is provided to a set of parallel Senior Diagnostician agents. As shown in the figure, $K=5$ diagnostician instances are sampled at temperature $T=0.7$ in order to generate diverse candidate diagnoses for the same clinical case. Their outputs are then passed to an NLI clustering and entropy calculation module, which groups semantically equivalent diagnosis labels and computes the semantic entropy score $H$. This score quantifies the level of disagreement across the sampled diagnostic outputs. Lower entropy indicates stronger agreement, whereas higher entropy reflects greater epistemic uncertainty.

The third phase performs deterministic diagnosis and safety review. In parallel with the uncertainty estimation stage, a separate Senior Diagnostician operating at temperature $T=0.0$ generates the final deterministic diagnostic report from the same shared conversation history. Before finalization, this diagnostician can invoke the `check_drug_interaction()` tool, which provides a static prototype mechanism for screening potentially unsafe medication combinations. The resulting diagnosis is then sent to the Medical Safety Supervisor, which serves as the final review component in the framework.

The decision logic of the safety supervisor depends on both the generated diagnosis and the uncertainty signal. If the entropy score satisfies $H\leq\tau$, the diagnosis can proceed to final review under normal conditions. If $H>\tau$, the case is treated as uncertain and is subjected to closer supervision. The Medical Safety Supervisor either approves the diagnosis or rejects it and returns structured feedback. In the rejection case, the framework enters a recursive review loop, shown in the figure by the feedback path from the supervisor back to the diagnostician, with a maximum of three revision attempts. This loop enables iterative refinement before an approved diagnosis is returned through the interface.
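The supervision logic described above can be sketched as a bounded loop. This is an illustration under stated assumptions, not the paper's code: `review_loop`, the callable signatures for the diagnostician and supervisor, the text of the uncertainty flag, and the two stub agents are all hypothetical, and real agents would be LLM calls.

```python
from typing import Callable, List, Optional, Tuple


def review_loop(
    history: List[str],
    entropy: float,
    diagnose: Callable[[List[str]], str],
    supervise: Callable[[str], Tuple[bool, str]],
    tau: float = 0.5,
    max_retries: int = 3,
) -> Optional[str]:
    """Return an approved report, or None if every attempt is rejected."""
    working = list(history)  # copy so the caller's thread is not mutated
    for _ in range(max_retries):
        report = diagnose(working)
        if entropy > tau:
            # The flag is context for the supervisor, not an automatic rejection.
            report = f"[UNCERTAIN: H={entropy:.2f} > tau={tau:.2f}]\n" + report
        approved, feedback = supervise(report)
        if approved:
            return report
        working.append(feedback)  # rejection feedback informs the next attempt
    return None


# Stub agents (assumptions): the diagnostician reports how many messages it
# saw, and the supervisor rejects the first report and approves the second.
def stub_diagnose(history: List[str]) -> str:
    return f"Differential diagnosis (revision {len(history)})"


calls: List[str] = []


def stub_supervise(report: str) -> Tuple[bool, str]:
    calls.append(report)
    if len(calls) == 1:
        return False, "Rejected: dosage guidance missing."
    return True, ""


result = review_loop(["Chief complaint: chest pain"], 0.72, stub_diagnose, stub_supervise)
print(result)
```

Returning `None` after the last failed attempt keeps an unapproved report from being delivered by default; what the interface shows in that case is left open here.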

Overall, the framework combines symbolic verification and probabilistic reasoning in a coordinated multi\-agent design\. The OLDCARTS gate enforces completeness during intake, the semantic entropy module quantifies diagnostic uncertainty across multiple samples, and the safety supervisor provides an explicit review mechanism before patient\-facing output is approved\. This design allows the system to address both premature diagnostic handoff and silent hallucinations within a single end\-to\-end architecture\.

Phase 1: Structured History Taking (M1)

Input: Chief complaint $\mathcal{C}$, OLDCARTS fields $\mathcal{F}=\{F_1,\ldots,F_8\}$, max triage turns $T_{\max}$

Output: Structured history $\mathcal{H}$

- Initialise $V\leftarrow\{F_i:\texttt{False}\}_{i=1}^{8}$, $\mathcal{H}\leftarrow[\mathcal{C}]$, $t\leftarrow 0$
- while $t<T_{\max}$ do
  - $q\leftarrow\text{TriageNurse}(\mathcal{H})$
  - if $q$ contains `READY_FOR_DIAGNOSIS` then
    - Append $q$ to $\mathcal{H}$
    - if $\sigma=\sum_i V[F_i]=8$ then
      - break
    - else
      - Append corrective message listing missing fields to $\mathcal{H}$
    - end if
  - else
    - Extract field tag $F_i$ from $q$
    - Set $V[F_i]\leftarrow\texttt{True}$
    - Append $q$ to $\mathcal{H}$
    - $r\leftarrow\text{Patient}(q)$
    - Append $r$ to $\mathcal{H}$
  - end if
  - $t\leftarrow t+1$
- end while

Phase 2: Semantic Entropy UQ (M2)

Input: Structured history $\mathcal{H}$, SE samples $K$, NLI threshold $\tau_{\text{NLI}}$

Output: Semantic entropy score $H$

- for $k=1$ to $K$ in parallel do
  - $L_k\leftarrow\text{Diagnose}(\mathcal{H},\text{temp}=0.7)$
- end for
- $\mathcal{C}_{\text{clusters}}\leftarrow\emptyset$
- for all label $\ell$ in $\bigcup_k L_k$ do
  - if $\exists\,c\in\mathcal{C}_{\text{clusters}}$ such that $P(\text{ENT}\mid\ell\to c)\geq\tau_{\text{NLI}}$ and $P(\text{ENT}\mid c\to\ell)\geq\tau_{\text{NLI}}$ then
    - Assign $\ell$ to cluster $c$
  - else
    - Create new cluster $\{\ell\}$
  - end if
- end for
- $H\leftarrow-\dfrac{1}{\log_2|\mathcal{C}_{\text{clusters}}|}\displaystyle\sum_j p_j\log_2 p_j$

Phase 3: Deterministic Diagnosis and Review

Input: $\mathcal{H}$, $H$, entropy threshold $\tau$, max retries $R_{\max}$

Output: Approved differential diagnosis $\mathcal{D}^{*}$

- Append SE consensus note $(H,\tau,\bigcup_k L_k)$ to $\mathcal{H}$
- for $r=1$ to $R_{\max}$ do
  - $\mathcal{D}\leftarrow\text{Diagnose}(\mathcal{H},\text{temp}=0.0,\text{drug\_tool})$
  - if $H>\tau$ then
    - Prepend uncertainty flag $(H,\tau,\bigcup_k L_k)$ to $\mathcal{D}$
  - end if
  - $\textit{decision}\leftarrow\text{Supervisor}(\mathcal{D})$
  - if $\textit{decision}=\texttt{APPROVED}$ then
    - return $\mathcal{D}^{*}\leftarrow\mathcal{D}$
  - else
    - Append rejection feedback to $\mathcal{H}$
  - end if
- end for
- return $\mathcal{D}^{*}\leftarrow\emptyset$

Figure 3: Three-phase algorithmic view of the proposed neuro-symbolic multi-agent clinical triage framework. The workflow consists of structured history taking with OLDCARTS verification, semantic entropy-based uncertainty quantification, and deterministic diagnosis with recursive safety review.

Figure [3](https://arxiv.org/html/2606.18068#S4.F3) formalises the complete three-phase execution of the proposed framework. Phase 1 captures the structured triage procedure with neuro-symbolic OLDCARTS verification. Phase 2 computes semantic entropy over multiple independently generated diagnostic samples to estimate uncertainty. Phase 3 performs deterministic diagnosis generation followed by safety supervision and, if necessary, recursive revision before final approval.
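As an illustration of the Phase 1 control flow, the following Python sketch runs the triage loop with scripted stand-ins for the two agents. The function `run_triage`, the default `max_turns=20` (the paper does not state $T_{\max}$), the wording of the corrective message, and the scripted nurse are assumptions; in the framework both agents are LLM-backed.

```python
import re
from typing import Callable, List, Tuple

FIELDS = ("Onset", "Location", "Duration", "Character",
          "Aggravating", "Radiation", "Timing", "Severity")
TAG_RE = re.compile(r"\[FIELD\s*:\s*(" + "|".join(FIELDS) + r")\]", re.IGNORECASE)


def run_triage(
    chief_complaint: str,
    triage_nurse: Callable[[List[str]], str],
    patient: Callable[[str], str],
    max_turns: int = 20,
) -> Tuple[List[str], bool]:
    """Return (history, complete); complete is True only if the gate passed."""
    observed = {name: False for name in FIELDS}
    history = [chief_complaint]
    for _ in range(max_turns):
        question = triage_nurse(history)
        history.append(question)
        if "READY_FOR_DIAGNOSIS" in question:
            missing = [name for name in FIELDS if not observed[name]]
            if not missing:
                return history, True  # sigma = 8: handoff allowed
            # Gate blocks the handoff and tells the nurse what is missing.
            history.append("Still missing: " + ", ".join(missing))
        else:
            tag = TAG_RE.search(question)
            if tag:
                observed[tag.group(1).capitalize()] = True
            history.append(patient(question))
    return history, False  # turn budget exhausted before completion


def make_scripted_nurse() -> Callable[[List[str]], str]:
    """Nurse stub that tries to hand off after one question, then recovers."""
    script = ["[FIELD:Onset] When did the pain start?", "READY_FOR_DIAGNOSIS"]
    script += [f"[FIELD:{name}] Please describe: {name.lower()}." for name in FIELDS[1:]]
    script.append("READY_FOR_DIAGNOSIS")
    turns = iter(script)
    return lambda history: next(turns)


history, complete = run_triage(
    "Abdominal pain", make_scripted_nurse(), lambda q: "Patient reply."
)
print(complete)     # True
print(history[4])   # corrective message issued after the premature handoff
```

The premature `READY_FOR_DIAGNOSIS` on the second turn is intercepted because only one field has been registered, which is the failure mode the M1 gate targets.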

### IV-B Agent Roles and Prompt Design

The framework employs three role\-specialized agents that correspond to the major stages of the clinical workflow: structured symptom elicitation, diagnostic reasoning, and safety supervision\. Rather than relying on a single conversational model to perform all functions, these roles are separated to enable clearer control over information flow and to support phase\-specific constraints at the orchestration layer\.

#### IV-B1 TriageNurse

The TriageNurse agent is responsible for structured symptom elicitation during the intake phase. Its prompt is designed to support focused, turn-by-turn questioning aligned with the OLDCARTS protocol. To make the elicited information machine-verifiable, each question is associated with a field tag indicating the symptom dimension being queried. These tags are consumed by the orchestration layer to update the OLDCARTS state vector and to determine whether the intake process is complete. The agent proceeds one question at a time and signals readiness for diagnostic handoff only after completing the symptom-gathering stage. This design ensures that the triage interaction remains both conversational and formally traceable.

#### IV-B2 SeniorDiagnostician

The SeniorDiagnostician agent is responsible for generating the diagnostic assessment from the completed conversation history. Its output is structured to support both clinical interpretability and downstream uncertainty analysis. In particular, the agent produces a diagnosis-oriented report that includes the differential diagnosis and associated clinical recommendations, together with a machine-readable set of diagnosis labels used in semantic entropy computation. The same role is instantiated under two settings within the framework: multiple stochastic instances are used for uncertainty estimation, while a separate deterministic instance is used to generate the final patient-facing report. The diagnostician is also connected to a lightweight drug-interaction checking tool that serves as a prototype safeguard for medication-related recommendations.

#### IV\-B3 MedicalSafetySupervisor

The MedicalSafetySupervisor agent acts as the final review component before a diagnosis is approved\. It evaluates the diagnostic report for clinical safety and either accepts the report or returns structured feedback for revision\. When the uncertainty score exceeds the predefined threshold, the supervisor additionally receives an uncertainty summary containing the entropy value and the set of candidate diagnoses observed across the sampled outputs\. Importantly, this uncertainty signal is treated as contextual information rather than as an automatic rejection criterion\. The supervisor is intended to reject outputs only on the basis of clinical safety concerns, thereby separating uncertainty awareness from the final safety decision\.

### IV\-C M1: Neuro\-Symbolic OLDCARTS State Tracker

The M1 component implements a neuro\-symbolic verification mechanism for monitoring intake completeness during triage\. It maintains a Boolean state vector over the eight OLDCARTS dimensions, where each entry indicates whether the corresponding symptom attribute has been explicitly elicited during the conversation\. From this state vector, the framework computes the completeness score $\sigma$ defined in Eq\. [1](https://arxiv.org/html/2606.18068#S3.E1), checks whether full coverage has been achieved, and identifies any symptom dimensions that remain uncollected\. To make the symbolic interface explicit, field registration is implemented through pattern\-based parsing of the TriageNurse output\. Specifically, the orchestration layer detects tags of the form \[FIELD:X\], where X denotes one of the OLDCARTS dimensions, using the following case\-insensitive regular expression:

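```
import re

# Case-insensitive matcher for TriageNurse field tags such as [FIELD:Onset].
# Adjacent raw strings are concatenated so the pattern contains no line breaks.
_FIELD_TAG_RE = re.compile(
    r"\[FIELD\s*:\s*(Onset|Location|Duration|"
    r"Character|Aggravating|Radiation|"
    r"Timing|Severity)\]\s*",
    re.IGNORECASE)
```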

This parser provides the symbolic grounding required for deterministic state updates: once a valid tag is detected, the corresponding OLDCARTS variable is marked as observed\. An important design choice is the use of re\.search rather than re\.match\. In practice, LLM\-generated questions do not always begin directly with the field marker; they may first include a short natural\-language preamble, e\.g\., “The pain is sharp\. \[FIELD:Character\] How would you describe it?” An anchored match would fail in such cases, preventing the state tracker from registering the intended field\. By instead searching for the tag anywhere in the generated response, the framework preserves deterministic symbolic registration while remaining robust to minor variations in model output formatting\.
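The following is a minimal sketch of how the tag parser could drive the Boolean state vector. The class `OldcartsTracker` and its method names are hypothetical, and $\sigma$ is assumed here to be the count of observed dimensions, which matches the $\sigma = 8$ completeness condition but is not a reproduction of Eq. 1.

```python
import re

# Same tag pattern as in the paper; only the variable name differs.
FIELD_TAG_RE = re.compile(
    r"\[FIELD\s*:\s*(Onset|Location|Duration|Character|"
    r"Aggravating|Radiation|Timing|Severity)\]\s*",
    re.IGNORECASE,
)

DIMENSIONS = ("onset", "location", "duration", "character",
              "aggravating", "radiation", "timing", "severity")


class OldcartsTracker:
    """Hypothetical Boolean state vector over the eight OLDCARTS dimensions."""

    def __init__(self):
        self.state = {name: False for name in DIMENSIONS}

    def register(self, nurse_output):
        """Mark the tagged dimension as observed; return its name or None."""
        found = FIELD_TAG_RE.search(nurse_output)  # search, not match
        if found is None:
            return None
        name = found.group(1).lower()
        self.state[name] = True
        return name

    @property
    def sigma(self):
        # Assumption: sigma is the number of observed dimensions (0..8).
        return sum(self.state.values())

    def missing(self):
        return [name for name, seen in self.state.items() if not seen]


text = "The pain is sharp. [FIELD:Character] How would you describe it?"
tracker = OldcartsTracker()
tracker.register(text)

anchored = FIELD_TAG_RE.match(text)   # None: the tag is not at position 0
searched = FIELD_TAG_RE.search(text)  # finds the tag after the preamble
```

With the example sentence from the text, the anchored call returns no match while the search call registers the Character dimension, which is the behaviour the design choice relies on.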

To support deterministic state updates, the TriageNurse associates each symptom\-oriented question with an explicit field tag indicating the OLDCARTS dimension being queried\. These tags are parsed by the orchestration layer and mapped to the corresponding entries in the state vector\. Importantly, the tag extraction procedure is designed to be robust to variations in model output formatting\. In practice, the field marker may appear either at the beginning of a question or after a short natural\-language lead\-in\. The state tracker therefore searches for the tag anywhere in the generated response, rather than assuming a fixed positional format\. This design choice makes the verification mechanism less sensitive to stylistic variation in LLM outputs while preserving deterministic field registration\.

The gate is applied whenever the triage agent signals readiness to proceed to diagnosis\. If the completeness condition $\sigma = 8$ is satisfied, the workflow is allowed to transition to the diagnostic stage\. Otherwise, the orchestration layer blocks the handoff and returns corrective feedback identifying the missing OLDCARTS dimensions\. This feedback is incorporated back into the conversation context so that the TriageNurse can continue collecting the required symptom details\. In the evaluation setting, a more directive form of corrective feedback is used to explicitly reinforce the required field\-tag format, which helps prevent repeated premature handoff attempts\.
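The gate decision can be sketched as a pure function over the state vector. The function name `handoff_gate` and the wording of the feedback message are assumptions made for illustration; the paper does not give the exact feedback text.

```python
DIMENSIONS = ("onset", "location", "duration", "character",
              "aggravating", "radiation", "timing", "severity")


def handoff_gate(state):
    """Return (allowed, feedback) for a handoff request.

    `state` maps each OLDCARTS dimension to True once it has been observed.
    The decision depends only on the state, never on the model's own claim
    that triage is finished.
    """
    missing = [d for d in DIMENSIONS if not state.get(d, False)]
    sigma = len(DIMENSIONS) - len(missing)
    if not missing:
        return True, ""
    # Hypothetical directive feedback that also restates the tag format.
    feedback = (
        "Handoff blocked: sigma = {}/8. Still missing: {}. "
        "Ask one question per turn and tag it, e.g. [FIELD:{}]."
    ).format(sigma, ", ".join(missing), missing[0].capitalize())
    return False, feedback


partial = {d: True for d in DIMENSIONS}
partial["radiation"] = False
allowed, feedback = handoff_gate(partial)
```

Because the check is deterministic, a blocked handoff always names the same missing dimensions for the same state, which makes repeated premature attempts easy to count.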

Overall, the M1 state tracker converts structured intake verification into an explicit system\-level constraint rather than a prompt\-level preference\. As a result, the decision to proceed to diagnosis depends on the observed completeness of the conversation history, not on whether the language model chooses to follow the intended questioning protocol\.

### IV\-D M2: Semantic Entropy UQ Gate

The M2 component implements uncertainty estimation through multi\-sample diagnostic generation followed by semantic clustering\. Given a completed conversation history, the framework generates $K = 5$ independent diagnostic samples, each instantiated with a fresh agent and an independent model client at temperature $0.7$\. The key requirement in this stage is sampling independence: no state or conversational context is shared across these diagnostician instances beyond the common input history\. This design ensures that the sampled outputs can be treated as independent draws conditioned on the same clinical context\. In the deployed system, these samples are generated concurrently to reduce end\-to\-end latency, whereas in the evaluation pipeline they are executed sequentially due to external rate constraints\.
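The two execution modes can be illustrated with a small asyncio sketch. Here `sample_diagnosis` is a hypothetical stub that stands in for a fresh diagnostician instance; a real version would create a new agent and model client and call the LLM at the stated temperature.

```python
import asyncio
import random

K = 5
TEMPERATURE = 0.7  # value from the paper; unused by the stub below


async def sample_diagnosis(history, seed):
    """Stand-in for one fresh diagnostician instance (hypothetical stub)."""
    rng = random.Random(seed)   # private RNG: no state shared across samples
    await asyncio.sleep(0)      # yield control, as a network call would
    return rng.choice(["migraine", "tension headache", "cluster headache"])


async def sample_concurrently(history, k=K):
    # Deployed-style: all K requests are in flight at once.
    # Each instance gets its own copy of the common input history.
    return await asyncio.gather(
        *(sample_diagnosis(list(history), i) for i in range(k))
    )


async def sample_sequentially(history, k=K):
    # Evaluation-style: one request at a time, e.g. under rate limits.
    return [await sample_diagnosis(list(history), i) for i in range(k)]


history = ["Patient: I have had a throbbing headache since yesterday."]
concurrent = asyncio.run(sample_concurrently(history))
sequential = asyncio.run(sample_sequentially(history))
```

Since the instances share nothing beyond the input history, the two modes differ only in latency, not in what each sample can see.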

For each sample, the diagnosis labels used for uncertainty estimation are extracted separately from the free\-form reasoning text\. Concretely, the model output is structured such that the final line contains a machine\-readable list of diagnosis names, which is parsed independently from the preceding explanation\. This separation is important because entropy should reflect disagreement in the diagnosis space rather than superficial linguistic variation in the generated rationale\.
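A minimal sketch of this separation follows. The exact format of the final line is not specified in the paper, so the `DIAGNOSES:` prefix followed by a JSON list is purely an assumed format, and `split_report` is a hypothetical helper name.

```python
import json

PREFIX = "DIAGNOSES:"  # assumed marker for the machine-readable final line


def split_report(output):
    """Split a diagnostician output into (rationale, labels).

    Only the last non-empty line is parsed for labels, so wording changes in
    the free-form rationale cannot affect the entropy computation.
    """
    lines = [ln for ln in output.strip().splitlines() if ln.strip()]
    if not lines:
        return "", []
    last = lines[-1].strip()
    if not last.upper().startswith(PREFIX):
        return output.strip(), []  # no machine-readable line found
    rationale = "\n".join(lines[:-1])
    try:
        labels = json.loads(last[len(PREFIX):])
    except json.JSONDecodeError:
        return rationale, []
    if not isinstance(labels, list):
        return rationale, []
    cleaned = [str(x).strip().lower() for x in labels if str(x).strip()]
    return rationale, cleaned


report = (
    "The unilateral throbbing pain with photophobia suggests migraine.\n"
    'DIAGNOSES: ["Migraine", "Tension headache"]'
)
rationale, labels = split_report(report)
```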

To account for synonymous or lexically varied diagnoses, the extracted labels are grouped using bidirectional natural language inference\. Specifically, the framework employs the cross\-encoder/nli\-deberta\-v3\-small model \[[3](https://arxiv.org/html/2606.18068#bib.bib33)\] to determine whether two diagnosis expressions are semantically equivalent\. Since the NLI model is designed for sentence\-pair inference, each diagnosis label is first converted into a simple sentential form \(e\.g\., “The diagnosis is \{x\}\.”\) before comparison\. The resulting similarity module is implemented in a modular manner, allowing the semantic matching function to be replaced without changing the surrounding entropy computation pipeline\. After clustering, the normalized Shannon entropy over the cluster distribution is computed as described in Eq\. [3](https://arxiv.org/html/2606.18068#S3.E3), and the resulting score is used as the uncertainty signal for downstream review\.
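The clustering and entropy steps can be sketched with a pluggable equivalence function. Two assumptions are made here: the NLI model is replaced by a toy synonym table (`toy_entails`) so the example runs without model weights, and the entropy is normalized by the logarithm of the number of labels, which may differ from the normalizer used in Eq. 3.

```python
import math


def cluster_labels(labels, entails):
    """Greedy clustering: a label joins the first cluster whose representative
    entails it and is entailed by it; otherwise it starts a new cluster."""
    clusters = []
    for label in labels:
        for cluster in clusters:
            rep = cluster[0]
            if entails(rep, label) and entails(label, rep):
                cluster.append(label)
                break
        else:
            clusters.append([label])
    return clusters


def normalized_entropy(clusters):
    """Shannon entropy of cluster sizes divided by log(n) (assumed normalizer)."""
    n = sum(len(c) for c in clusters)
    if n <= 1:
        return 0.0
    h = -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
    return h / math.log(n)


# Toy stand-in for the NLI check. A real version would score the sentence pair
# "The diagnosis is {a}." / "The diagnosis is {b}." with the NLI model.
SYNONYMS = {"mi": "myocardial infarction",
            "heart attack": "myocardial infarction"}


def toy_entails(a, b):
    def canon(s):
        return SYNONYMS.get(s.lower(), s.lower())
    return canon(a) == canon(b)


labels = ["MI", "myocardial infarction", "heart attack", "GERD", "gerd"]
clusters = cluster_labels(labels, toy_entails)
H = normalized_entropy(clusters)
```

In this toy input the five labels collapse into two clusters of sizes 3 and 2, so the score reflects disagreement between two diagnoses instead of five surface strings. Swapping `toy_entails` for a model-backed function leaves the rest of the pipeline unchanged, which is the modularity the text describes.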

### IV\-E Concurrency and Session Management

The framework is designed to support interactive clinical dialogue while preserving a clear separation between user interaction and backend orchestration\. The user interface is handled through Gradio, while the multi\-agent workflow executes asynchronously in a dedicated background thread with its own event loop\. This separation allows the system to maintain responsive user interaction during long\-running agent operations, including multi\-turn triage, multi\-sample uncertainty estimation, and recursive safety review\.
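A minimal sketch of the thread-plus-event-loop pattern is shown below, using only the standard library. The class `BackgroundLoop` and the coroutine `long_running_workflow` are hypothetical names; the Gradio side is omitted, and a UI callback would take the place of the blocking `result` call.

```python
import asyncio
import threading


class BackgroundLoop:
    """Runs an asyncio event loop in a dedicated daemon thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        asyncio.set_event_loop(self.loop)
        self.loop.run_forever()

    def submit(self, coro):
        """Schedule a coroutine from another thread; returns a concurrent Future."""
        return asyncio.run_coroutine_threadsafe(coro, self.loop)

    def stop(self):
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join()
        self.loop.close()


async def long_running_workflow(message):
    await asyncio.sleep(0.01)  # placeholder for triage, sampling, or review
    return "processed: " + message


worker = BackgroundLoop()
future = worker.submit(long_running_workflow("chest pain for two hours"))
result = future.result(timeout=5)  # a UI callback would poll or await instead
worker.stop()
```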

Communication between the interface layer and the orchestration layer is managed through asynchronous queues and event\-based synchronization primitives\. This design enables controlled exchange of user inputs and agent outputs without blocking the interface thread\. In addition, the framework maintains explicit bookkeeping over the shared conversation history to avoid repeated reprocessing of previously consumed messages\. Rather than passing the full dialogue context at each step, each agent receives only the portion of the shared history that has not yet been consumed in its own stage of the workflow\. This strategy reduces redundant context accumulation and improves efficiency during prolonged multi\-agent interactions\.
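The history bookkeeping can be illustrated with a per-agent cursor over a shared message list. `SharedHistory` and its methods are hypothetical names for a sketch of the idea, not the framework's actual data structure.

```python
class SharedHistory:
    """Shared message log with one read cursor per agent."""

    def __init__(self):
        self.messages = []
        self._cursor = {}  # agent name -> index of first unread message

    def append(self, message):
        self.messages.append(message)

    def unread_for(self, agent):
        """Return only the messages this agent has not consumed yet."""
        start = self._cursor.get(agent, 0)
        fresh = self.messages[start:]
        self._cursor[agent] = len(self.messages)
        return fresh


history = SharedHistory()
history.append("Patient: chest pain since this morning")
history.append("Nurse: [FIELD:Onset] When exactly did it start?")
first_batch = history.unread_for("diagnostician")

history.append("Patient: around 7 am")
second_batch = history.unread_for("diagnostician")
supervisor_batch = history.unread_for("supervisor")
```

Each agent advances its own cursor, so a second call returns only what arrived since the previous one, while an agent that has not read anything yet still sees the full log.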

## V Performance Evaluation

### V\-A Evaluation Methodology

We evaluate the proposed framework using a fully automated pipeline built on a clinical vignette dataset derived from the held\-out test split of the MedQA\-USMLE\-4\-options benchmark\[[5](https://arxiv.org/html/2606.18068#bib.bib27)\]\. To align the evaluation with the diagnostic objective of our system, we retain only cases corresponding to diagnostic reasoning tasks, specifically those whose question text contains phrases such as “most likely diagnosis” or “best next step\.” This filtering ensures that the ground\-truth output for each case is a named diagnosis compatible with label\-space evaluation\.

For each selected vignette, the chief complaint is extracted from the original MedQA case using Llama\-3\.1\-8B\-Instant through a zero\-shot JSON\-formatted prompt\. This preprocessing step produces structured chief\_complaint and hidden\_vignette fields that are stored in the evaluation set\. Importantly, this auxiliary model is used only for data preparation and does not participate in the triage or diagnosis stages of the proposed framework\.

During evaluation, each test case is instantiated as a simulated patient with access to the hidden vignette\. The interaction begins from the extracted chief complaint, after which the TriageNurse agent conducts the structured interview and the OLDCARTS state tracker records the completeness score $\sigma$ throughout the triage process\. Once triage is completed, or once the maximum number of triage turns is reached \(MAX\_TRIAGE\_TURNS = 15\), the diagnostic phase is initiated and $K$ diagnostic samples are collected for uncertainty estimation\.

In the deployed system, these $K = 5$ samples are generated concurrently\. Diagnostic correctness is then determined by a ClinicalJudge agent, which assigns a binary score $\mathcal{R} \in \{0, 1\}$ depending on whether the ground\-truth diagnosis appears in the pooled prediction set\. To reduce ambiguity in automated scoring, the judge is instructed to return only the token 1 or 0, although occasional verbose outputs containing a digit remain a minor limitation of this evaluation setup\.
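One way to handle verbose judge outputs is a conservative parser that refuses to guess. This is an illustrative sketch, not the paper's scoring code; `parse_judge_score` is a hypothetical helper that returns `None` when the verdict is ambiguous so such cases can be logged or re-queried.

```python
import re

# A standalone 0 or 1, i.e. not part of a longer number such as 10 or 2021.
STANDALONE_BIT = re.compile(r"(?<!\d)[01](?!\d)")


def parse_judge_score(raw):
    """Map a ClinicalJudge reply to 0, 1, or None if it is ambiguous."""
    text = raw.strip()
    if text in ("0", "1"):
        return int(text)  # the instructed format
    bits = set(STANDALONE_BIT.findall(text))
    if len(bits) == 1:
        return int(bits.pop())  # verbose reply with a single clear verdict
    return None  # no verdict, or both 0 and 1 mentioned


scores = [parse_judge_score(r) for r in ("1", " 0\n", "Score: 1, it matches.",
                                         "Either 0 or 1", "10")]
```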

All experiments are conducted using llama\-3\.1\-70b\-instruct served via NVIDIA NIM, which replaces the Groq\-based production endpoint to avoid token\-per\-minute constraints during large\-scale evaluation\. For each test case, we record case\-level outputs including the completeness score, the number of premature handoff attempts, the semantic entropy value, the ground\-truth diagnosis, the consensus diagnosis labels, and the final correctness indicator\. Statistical analysis is performed post hoc using Pearson correlation, Fisher z\-transformation, and 95% bootstrap confidence intervals computed from $B = 2000$ resamples with random seed 42\.
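A bootstrap interval of this kind can be sketched with the standard library alone. The percentile method is an assumption, since the paper does not name the bootstrap variant, and the case-level outcomes below are synthetic: 74 correct out of 150 is chosen only because it gives an accuracy of 49.3%.

```python
import random


def bootstrap_ci(values, n_resamples=2000, seed=42, level=0.95):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    alpha = (1.0 - level) / 2.0
    lo = means[int(alpha * n_resamples)]
    hi = means[int((1.0 - alpha) * n_resamples) - 1]
    return lo, hi


# Synthetic binary correctness indicators (1 = judged correct).
correct = [1] * 74 + [0] * 76
accuracy = sum(correct) / len(correct)
low, high = bootstrap_ci(correct)
```

Resampling is done over cases with replacement, so the width of the interval shrinks as the number of test cases grows.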

The recursive MedicalSafetySupervisor loop is part of the deployed framework but is excluded from the automated evaluation reported here\. This choice allows us to isolate the contribution of the M1 and M2 mechanisms without conflating their effects with the corrective behavior of the supervisor\. Accordingly, the reported results should be interpreted as a conservative estimate of the framework’s diagnostic performance in the absence of the final review stage\. Assessing the incremental contribution of the safety supervisor is left for future work involving clinician\-centered evaluation\.

### V\-B Ablation Study Results

![Refer to caption](https://arxiv.org/html/2606.18068v1/x3.png)

Figure 4: Ablation study across $N \in \{50, 100, 150\}$ test cases\. Panels \(a\)–\(c\): diagnostic accuracy \(solid bars\) and OLDCARTS $\sigma$\-score \(hatched bars\) with 95% bootstrap CIs for Baseline \(B\), Ablation A \(A\), and Full Architecture \(FA\)\. Panels \(d\)–\(e\): accuracy and $\sigma$\-score scaling with $N$ \(shaded = 95% bootstrap CI\)\.

![Refer to caption](https://arxiv.org/html/2606.18068v1/x4.png)

Figure 5: Linear regression analysis of the Full Architecture \($N = 150$\), demonstrating a statistically significant negative correlation \($r = -0.181$, $p = 0.0271$\) between OLDCARTS completeness \($\sigma$\-score\) and downstream epistemic uncertainty \($H$\)\. The OLS regression line with 95% confidence band confirms the predicted direction: more complete symptom histories produce lower semantic entropy across $K = 5$ independent diagnostic samples\.

Table [II](https://arxiv.org/html/2606.18068#S5.T2) presents the mean $\sigma$\-score and diagnostic accuracy for each ablation condition across $N \in \{50, 100, 150\}$ test cases\. The corresponding scaling curves are shown in Fig\. [4](https://arxiv.org/html/2606.18068#S5.F4)\.

TABLE II: Ablation Study: Mean $\sigma$\-score and Diagnostic Accuracy by Condition and Sample Size\. Green rows = Full Architecture\. $\Delta$ vs B = percentage\-point difference vs Baseline\. Mean $H$ reported for Full Architecture only\.

| $N$ | Condition | Gate | $K$ | Mean $\sigma$\-score | Accuracy \(%\) | $\Delta$ vs B \(pp\) |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | Baseline \(B\) | F | 1 | 1\.740 | 34\.0 | – |
| 50 | Ablation A \(A\) | T | 1 | 6\.540 | 30\.0 | $-4.0$ |
| 50 | Full Arch\. \(FA\) | T | 5 | 6\.680 | 56\.0 | $+22.0$ |
| 100 | Baseline \(B\) | F | 1 | 2\.390 | 36\.0 | – |
| 100 | Ablation A \(A\) | T | 1 | 6\.230 | 32\.0 | $-4.0$ |
| 100 | Full Arch\. \(FA\) | T | 5 | 6\.440 | 47\.0 | $+11.0$ |
| 150 | Baseline \(B\) | F | 1 | 2\.400 | 38\.0 | – |
| 150 | Ablation A \(A\) | T | 1 | 6\.407 | 32\.7 | $-5.3$ |
| 150 | Full Arch\. \(FA\) | T | 5 | 6\.667 | 49\.3 | $\mathbf{+11.3}$ |

| $N$ | Mean $H$ \[FA\] | Pearson $r$ \($\sigma \Leftrightarrow H$\) \[FA\] | $p$ | 95% CI |
| --- | --- | --- | --- | --- |
| 50 | 0\.9342 | $-0.0084$ | 0\.954 | $[-0.286, 0.271]$ |
| 100 | 0\.9273 | $-0.2703$ | 0\.0065 | $[-0.443, -0.078]$ |
| 150 | 0\.9278 | $-0.1805$ | 0\.0271 | $[-0.331, -0.021]$ |

### V\-C OLDCARTS Completeness Analysis

A direct consequence of disabling the M1 gate is a substantial reduction in the completeness of symptom collection\. At $N = 150$, the Baseline condition achieves a mean $\sigma$\-score of only $2.400/8$, corresponding to 30\.0% completeness\. This indicates that, without explicit intake verification, the TriageNurse collects on average fewer than three of the eight OLDCARTS dimensions before proceeding to diagnosis\. These results provide quantitative evidence of the premature\-handoff failure mode discussed in Section [III](https://arxiv.org/html/2606.18068#S3), where unconstrained conversational behavior does not reliably ensure structured clinical intake\.

In contrast, the Full Architecture \(Gate = T\) attains a mean $\sigma$\-score of $6.667/8$ at $N = 150$, representing an absolute gain of $4.267$ points over the Baseline\. This corresponds to 83\.3% average intake completeness, up from 30\.0% for the Baseline \(a gain of 53\.3 percentage points\)\. The remaining gap between $6.667$ and the maximum possible score of $8.0$ is primarily attributable to cases in which the triage process reaches the predefined limit of MAX\_TRIAGE\_TURNS = 15 before all OLDCARTS dimensions are collected\. This suggests that the residual incompleteness arises from turn\-budget constraints rather than failure of the verification mechanism itself\.

### V\-D Diagnostic Accuracy Analysis

At $N = 150$, the Baseline achieves a diagnostic accuracy of 38\.0%, whereas the Full Architecture reaches 49\.3%, yielding an absolute improvement of 11\.3 percentage points\. The largest gain is observed at $N = 50$, where the Full Architecture achieves 56\.0% accuracy compared with 34\.0% for the Baseline, corresponding to a 22\.0 percentage\-point improvement\. This pattern is consistent with the intuition that structured symptom collection is particularly beneficial when the available diagnostic context is otherwise limited\.

Ablation A \(Gate = T, $K = 1$\) underperforms the Baseline across all three evaluation settings: 30\.0% versus 34\.0% at $N = 50$, 32\.0% versus 36\.0% at $N = 100$, and 32\.7% versus 38\.0% at $N = 150$\. The reason is that activating the M1 gate increases the length of the triage history by preventing premature handoff, while the diagnosis stage in this setting still relies on a single deterministic sample without uncertainty\-aware filtering\. As a result, the additional context does not consistently translate into better predictions\. In contrast, the Full Architecture compensates for this effect through multi\-sample diagnostic generation and semantic clustering, which together recover and improve performance\. These results suggest that the M1 and M2 components are most effective when deployed jointly, rather than in isolation\.

### V\-E Statistical Analysis: $\sigma$\-Score vs\. Semantic Entropy $H$

Fig\. [5](https://arxiv.org/html/2606.18068#S5.F5) and the correlation statistics in Table [II](https://arxiv.org/html/2606.18068#S5.T2) address the central statistical hypothesis: that higher OLDCARTS completeness \($\sigma$\-score\) is associated with lower epistemic uncertainty \($H$\)\. A complete symptom profile more strongly constrains the differential diagnosis space, producing greater consensus across $K$ independent samples and thus lower normalized Shannon entropy\.

TABLE III: Pearson Correlation Between $\sigma$\-Score and Semantic Entropy $H$ \(Full Architecture Only\)\. Bold row = primary reported result\.

Table [III](https://arxiv.org/html/2606.18068#S5.T3) summarizes the correlation analysis\. At $N = 150$, the Pearson correlation between OLDCARTS completeness and semantic entropy is $r = -0.181$ \($p = 0.027$, 95% CI $[-0.331, -0.021]$\), which is statistically significant at $\alpha = 0.05$\. The negative coefficient is consistent with the expected trend: cases with more complete symptom collection tend to exhibit lower semantic entropy across the $K = 5$ diagnostic samples\.
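The confidence interval for a Pearson coefficient can be obtained with the Fisher z-transformation, as sketched below with the standard library. The function name `fisher_ci` is hypothetical, and the standard error $1/\sqrt{N-3}$ is the textbook approximation, assumed here to be the one used for the reported intervals.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a Pearson r via the Fisher z-transformation."""
    z = math.atanh(r)                  # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)        # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Coefficient and sample size reported for the Full Architecture.
lo, hi = fisher_ci(-0.1805, 150)
```

Under these assumptions, the computed bounds agree with the reported interval of $[-0.331, -0.021]$ after rounding to three decimals. The upper bound lies just below zero, which matches a p-value slightly under 0.05.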

The observed effect size \($|r| = 0.181$\) is small in magnitude, which is reasonable in this setting\. While the M1 gate directly affects the completeness of the collected clinical history, it does not directly constrain the semantic content of the downstream diagnosis\. The relationship between $\sigma$ and $H$ is therefore indirect and mediated by the quality of the diagnostic reasoning process, which naturally introduces additional variability\. At $N = 50$, the correlation is not statistically significant \($r = -0.0084$, $p = 0.954$\), likely reflecting limited statistical power at the smaller sample size\. A post\-hoc power analysis based on the effect observed at $N = 150$ suggests that approximately $N = 120$ cases are required to achieve 80% power at $\alpha = 0.05$ \(two\-tailed\), which is consistent with the emergence of statistical significance at $N = 100$ and $N = 150$\.

### V\-F Scaling Analysis

Fig\. [4](https://arxiv.org/html/2606.18068#S5.F4)\(e\) plots diagnostic accuracy and OLDCARTS $\sigma$\-score as a function of sample size\. The accuracy of the Full Architecture is not monotonic with increasing $N$: it reaches 56\.0% at $N = 50$ and then remains close to 49% for $N \geq 100$\. This pattern likely reflects variation in case difficulty at smaller sample sizes\. Moreover, the overlapping 95% bootstrap confidence intervals at $N = 100$ and $N = 150$ suggest that the difference between these two settings is not statistically significant\. In contrast, the $\sigma$\-scores for the gate\-enabled conditions remain relatively stable across sample sizes \(approximately $6.2$–$6.7$\), indicating that the M1 mechanism enforces symptom completeness consistently\. The Baseline $\sigma$\-score is similarly stable, though at a much lower level \(approximately $1.7$–$2.4$\), showing that without explicit verification the model consistently fails to collect complete symptom histories regardless of sample size\.

Taken together, these results support three main observations:

1. M1 alone does not improve diagnostic accuracy: The results of Ablation A indicate that enforcing structured symptom completeness, by itself, does not lead to improved diagnostic performance\. In fact, relative to the Baseline, accuracy decreases when the intake gate is enabled without accompanying uncertainty\-aware filtering, suggesting that longer and more complete histories alone are not sufficient to improve prediction quality\.
2. The benefits of M2 are realized most effectively in conjunction with M1: The accuracy gains of the Full Architecture arise from the combined use of structured intake verification and multi\-sample uncertainty estimation\. The M1 mechanism provides a more complete diagnostic context, while the M2 mechanism leverages this context to assess agreement across multiple samples and identify uncertain cases for additional scrutiny\.
3. Evidence for the $\sigma$–$H$ relationship becomes clearer at larger sample sizes: The correlation between symptom completeness and semantic entropy does not reach statistical significance at $N = 50$, but becomes significant for larger evaluation settings\. This pattern is consistent with the sample sizes typically required to detect small correlation effects with adequate statistical power\.

## VI Conclusions and Future Work

This paper proposed a neuro\-symbolic multi\-agent framework, based on the Agentic AI paradigm, for clinical triage and diagnosis\. It is aimed at two important failure modes in LLM\-based healthcare agents: premature diagnostic handoff and silent hallucination\. The framework combines a deterministic OLDCARTS\-based intake verification mechanism with a semantic entropy\-based uncertainty quantification module to enforce structured symptom collection and identify uncertain diagnostic outputs\.

Experimental results show that the full architecture improves diagnostic accuracy by 11\.3 percentage points over an unconstrained baseline at $N = 150$\. We also observe a statistically significant negative correlation between symptom completeness and diagnostic uncertainty, suggesting that more complete structured intake is associated with more consistent diagnostic predictions\. Ablation results further indicate that the two components are most effective when used together\.

While the current evaluation is based on automated clinical vignette testing, the proposed framework provides a foundation for more reliable multi\-agent clinical decision support\. Future work will focus on improving triage efficiency, tuning uncertainty thresholds more systematically, incorporating clinician\-centered evaluation, and extending the framework to additional clinical modalities\.

## References

- \[1\] M\. Abbasian et al\. \(2023\) Conversational health agents: a personalized LLM\-powered agent framework\. arXiv:2310\.02374\.
- \[2\] L\. Bickley and P\. G\. Szilagyi \(2012\) Bates’ guide to physical examination and history\-taking\. Lippincott Williams & Wilkins\.
- \[3\] P\. He et al\. \(2021\) DeBERTaV3: improving DeBERTa using ELECTRA\-style pre\-training with gradient\-disentangled embedding sharing\. arXiv:2111\.09543\.
- \[4\] S\. E\. Herwald et al\. \(2025\) RadGPT: a system based on a large language model that generates sets of patient\-centered materials to explain radiology report information\. Journal of the American College of Radiology 22\(9\), pp\. 1050–1059\.
- \[5\] D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits \(2021\) What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\. Applied Sciences 11\(14\), pp\. 6421\. [Document](https://dx.doi.org/10.3390/app11146421)
- \[6\] L\. Kuhn, Y\. Gal, and S\. Farquhar \(2023\) Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation\. In Proceedings of the International Conference on Learning Representations \(ICLR\)\. arXiv:2302\.09664\.
- \[7\] X\. Liu, H\. Liu, G\. Yang, Z\. Jiang, S\. Cui, Z\. Zhang, H\. Wang, L\. Tao, Y\. Sun, Z\. Song, T\. Hong, J\. Yang, T\. Gao, J\. Zhang, X\. Li, J\. Zhang, Y\. Sang, Z\. Yang, K\. Xue, and G\. Wang \(2025\) A generalist medical language model for disease diagnosis assistance\. Nature Medicine 31\(3\), pp\. 932–942\. [Document](https://dx.doi.org/10.1038/s41591-024-03416-6)
- \[8\] M\. Lu, B\. Ho, D\. Ren, and X\. Wang \(2024\) TriageAgent: towards better multi\-agents collaborations for large language model\-based clinical triage\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 5747–5764\.
- \[9\] Microsoft Research \(2024\) AutoGen: enabling next\-gen LLM applications via multi\-agent conversation framework\. https://autogen\-ai\.github\.io
- \[10\] H\. Nori, N\. King, S\. M\. McKinney, D\. Carignan, and E\. Horvitz \(2023\) Capabilities of GPT\-4 on medical challenge problems\. arXiv preprint arXiv:2303\.13375\.
- \[11\] M\. Omar et al\. \(2025\) Mapping the susceptibility of large language models to medical misinformation across clinical notes and social media: a cross\-sectional benchmarking analysis\. The Lancet Digital Health 8\(1\), pp\. 100949\.
- \[12\] K\. Singhal et al\. \(2023\) Towards expert\-level medical question answering with large language models\. arXiv:2305\.09617\.
- \[13\] P\. N\. Srinivasu, G\. L\. Aruna Kumari, S\. Ahmed, and A\. Alhumam \(2026\) Exploring agentic ai in healthcare: a study on its working mechanism\. Frontiers in Medicine 12, pp\. 1753443\. [Document](https://dx.doi.org/10.3389/fmed.2025.1753443)
- \[14\] D\. Van Veen, C\. Van Uden, L\. Blankemeier, J\.\-B\. Delbrouck, A\. Aali, C\. Bluethgen, A\. Pareek, M\. Polacin, E\. P\. Reis, A\. Seehofnerová, N\. Rohatgi, P\. Hosamani, W\. Collins, N\. Ahuja, C\. P\. Langlotz, J\. Hom, S\. Gatidis, J\. Pauly, and A\. S\. Chaudhari \(2024\) Adapted large language models can outperform medical experts in clinical text summarization\. Nature Medicine 30\(4\), pp\. 1134–1142\. [Document](https://dx.doi.org/10.1038/s41591-024-02855-5)
- \[15\] L\. Zheng et al\. \(2023\) Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.