Trust but Verify: Mitigating Medical Hallucinations via Post-Hoc Adversarial Auditing and Multi-Agent Feedback Loops
Summary
This paper proposes a multi-agent 'Trust but Verify' system to reduce medical hallucinations in LLMs. It tests three open-access models on clinical questions about banned drugs and achieves a 53% reduction in hallucination error rate.
View Cached Full Text
Cached at: 06/15/26, 09:11 AM
# Trust but Verify: Mitigating Medical Hallucinations via Post-Hoc Adversarial Auditing and Multi-Agent Feedback Loops
Source: [https://arxiv.org/html/2606.14149](https://arxiv.org/html/2606.14149)
Maheera AmjadZartasha MustansarArslan ShaukatMuhammad U\. S\. KhanThis research was conducted at the Data Science and Machine Learning Lab, SINES, NUST\.Muhammad Osama, M\.S\. researcher at SINES, NUST, Islamabad\. Research: multi\-agent AI, healthcare AI, software design\. B\.S\. Software Engineering from NUCES\. mosama\.msbi24sines@student\.nust\.edu\.pk ORCID:[0009\-0006\-8423\-9167](https://orcid.org/0009-0006-8423-9167)Maheera Amjad, Ph\.D\. student at SINES, NUST\. Research: genomics, translational bioinformatics, AI\. M\.S\. from SINES\. mamjad\.phdbi23sines@student\.nust\.edu\.pk ORCID:[0009\-0009\-5670\-5492](https://orcid.org/0009-0009-5670-5492)Zartasha Mustansar, Professor at SINES, NUST\. Expertise: gait analysis, posture mechanics, computational mechanics, image processing\. zmustansar@sines\.nust\.edu\.pk ORCID:[0000\-0002\-2327\-7577](https://orcid.org/0000-0002-2327-7577)Arslan Shaukat, Associate Professor at CEME, NUST\. Research: machine learning, pattern recognition, NLP\. arslan\.shaukat@ce\.ceme\.edu\.pk ORCID:[0000\-0002\-1612\-111X](https://orcid.org/0000-0002-1612-111X)Muhammad Usman Shahid Khan, Associate Professor at SINES, NUST\. Research: data mining, AI, network security\. IEEE Member\. usman\.shahid@sines\.nust\.edu\.pk ORCID:[0000\-0002\-7299\-621X](https://orcid.org/0000-0002-7299-621X)
###### Abstract
Large Language Models \(LLMs\) are increasingly deployed in healthcare settings, yet their tendency to hallucinate poses risks when clinical decisions are involved\. This study examine whether LLMs recommend recently banned or withdrawn pharmaceuticals when answering clinical questions and tests an agent\-based method for reducing such errors\. We developed a five\-agent“Trust but Verify”system using a single LLM backbone\. To measure regulatory knowledge obsolescence, we created an adversarial dataset of 103 clinical MCQs where historically correct answers now refer to banned substances\. This scale ensures statistical significance across various therapeutic classes\. We evaluated three open\-access model families \(GPT\-OSS, Llama\-3, Falcon\-3\) under vanilla and agentic conditions\. Performance was measured via pointwise score, label accuracy, Hallucination Error Rate \(HER\), and Component Fidelity \(CF\) score\. We also observed clinical safety regression in proprietary models\. In default configurations, all models showed high hallucination rates, consistently selecting banned drugs that matched training data patterns\. Our proposed agentic architecture reduced HER by approximately 53% across models\. Pointwise scores shifted from \-0\.25 \(unsafe recommendation\) toward 0\.0 \(appropriate refusal\)\. The safety audit intercepted dangerous outputs even when models’ parametric knowledge favored the banned substance\. The proposed multi\-agent framework offers a model\-agnostic method for enforcing regulatory compliance that prioritizes patient safety over fluent text generation\. Our work demonstrates a practical approach for deploying autonomous AI systems in safety\-critical healthcare settings\. It shows how real\-time regulatory data can be integrated into LLM pipelines to support clinical decision\-making\.
\{IEEEkeywords\}
Agentic AI, Clinical decision support systems, Drug safety, Hallucination mitigation, Multi\-agent systems
## 1Introduction
Large language models \(LLMs\) have entered the healthcare sector\[[28](https://arxiv.org/html/2606.14149#bib.bib18),[21](https://arxiv.org/html/2606.14149#bib.bib19)\], as have medical image generation models\[[14](https://arxiv.org/html/2606.14149#bib.bib24)\]and conversational intelligent agents\[[35](https://arxiv.org/html/2606.14149#bib.bib23)\]\. These advances have prompted nations to investigate agentic AI systems as front\-line medical assistants\[[30](https://arxiv.org/html/2606.14149#bib.bib17)\]\. Implementations include AI\-driven clinical triage systems in China, where autonomous agents interact with patients, generate assessments, and propose treatment pathways\[[33](https://arxiv.org/html/2606.14149#bib.bib2),[32](https://arxiv.org/html/2606.14149#bib.bib25)\]\.
These systems rely on LLM backbones from providers including OpenAI\[[18](https://arxiv.org/html/2606.14149#bib.bib26)\]and DeepSeek\[[5](https://arxiv.org/html/2606.14149#bib.bib27)\], among others\[[26](https://arxiv.org/html/2606.14149#bib.bib28)\], which remain prone to hallucinations\[[12](https://arxiv.org/html/2606.14149#bib.bib4)\]\. Even when a model attains near\-perfect accuracy on controlled tasks, the remaining error rate can produce factually contradicted clinical statements\. In biomedical contexts, such errors introduce risks, as clinical decisions influence human health and safety\[[3](https://arxiv.org/html/2606.14149#bib.bib10)\]\.
In early 2025, theAnnals of Internal Medicinedocumented a case of bromism \(a condition caused by eating too much bromine, a chemical element found in some sedatives\)\[[4](https://arxiv.org/html/2606.14149#bib.bib30)\]resulting from patient adherence to ChatGPT\-generated instructions\[[8](https://arxiv.org/html/2606.14149#bib.bib1)\]\. A separate report from Hyderabad, India, described a kidney\-transplant patient who discontinued antibiotics after receiving a misleading AI response, contributing to the loss of the transplanted organ\[[31](https://arxiv.org/html/2606.14149#bib.bib3)\]\.
Despite these safety concerns, reliance on AI for health information continues to expand\. According to OpenAI’s 2026 report,AI as a Healthcare Ally\[[24](https://arxiv.org/html/2606.14149#bib.bib5)\], more than 40 million users per day seek health\-related information through ChatGPT\. Among U\.S\. physicians, two\-thirds used AI tools for clinical tasks in 2024, up from 38% the prior year\. Use is most common in family medicine and primary care, while one\-quarter of allied health practitioners, including dietitians and paramedics, report weekly use\. AI tools have also been adopted by medical students and trainees to interpret symptoms, review laboratory results, and draft clinical notes\[[24](https://arxiv.org/html/2606.14149#bib.bib5),[16](https://arxiv.org/html/2606.14149#bib.bib6)\]\. Although these systems support learning, their probabilistic nature introduces risk for learners who may lack the clinical experience required to audit generated content\.
Anthropic’s4D Framework for AI Fluencyidentifies diligence as a core competency, defined as the responsible and ethical use of AI through thoughtful system selection, transparency, and personal accountability for AI\-assisted outputs\[[2](https://arxiv.org/html/2606.14149#bib.bib29)\]\. A gap exists between this ethical ideal and current clinical practice\. Although primary care and allied health practitioners increasingly rely on tools such as ChatGPT for weekly tasks, state\-of\-the\-art \(SOTA\) models continue to exhibit clinical safety regressions \(Section[7](https://arxiv.org/html/2606.14149#S7)\), including the recommendation of banned pharmaceuticals\. This disconnect raises questions about professional accountability: can a practitioner remain diligent if the underlying systems are prone to stochastic hallucinations?
As AI agents move toward autonomy in healthcare workflows, the consequences of hallucination errors become significant\. Existing frameworks, such as retrieval\-augmented generation \(RAG\)\[[1](https://arxiv.org/html/2606.14149#bib.bib7)\], pre\-hoc verification\[[13](https://arxiv.org/html/2606.14149#bib.bib8)\], and semantic guardrails\[[11](https://arxiv.org/html/2606.14149#bib.bib9)\], provide partial mitigation but do not resolve the constraints of stochastic text generation\. A critical gap remains in addressing regulatory knowledge obsolescence, where LLMs consistently fail to identify withdrawn or banned pharmaceuticals, as demonstrated in interactions with the leading proprietary models \(Section[7](https://arxiv.org/html/2606.14149#S7)\)\.
Specialized architectures are therefore needed that go beyond knowledge retrieval to actively audit candidate responses against real\-time regulatory databases, ensuring that accountability is supported by deterministic safety layers\.
## 2Background and context
Recent research has increasingly highlighted the problem of hallucination in text\-generation models, particularly within medical contexts where accuracy is critical\[[6](https://arxiv.org/html/2606.14149#bib.bib15)\]\. As generative AI systems become more common in healthcare, they raise serious challenges around the safe and reliable sharing of medical information\. Because these models are trained on finite and sometimes outdated datasets, they can produce responses that are incomplete, misleading, or simply incorrect\[[3](https://arxiv.org/html/2606.14149#bib.bib10)\]\. To address this issue, a self\-reflection technique aimed at improving factual correctness in medical question\-answering was introduced\[[13](https://arxiv.org/html/2606.14149#bib.bib8)\]\. However, their approach relies solely on the model’s internal parametric knowledge and does not incorporate any external retrieval sources, limiting its ability to correct for outdated or missing information\. To overcome these inherent limitations of static training data, RAG methods have gained attention\[[17](https://arxiv.org/html/2606.14149#bib.bib20)\]\. RAG enhances LLMs by pairing them with external, up\-to\-date information sources, thereby improving factual accuracy and reducing hallucination risks\[[20](https://arxiv.org/html/2606.14149#bib.bib12),[34](https://arxiv.org/html/2606.14149#bib.bib13)\]\. This makes the model’s output more grounded in current knowledge rather than relying exclusively on internal memories\.
Studies such as those by Zakka*et al\.*\[[36](https://arxiv.org/html/2606.14149#bib.bib11)\]demonstrate that LLMs supplemented with domain\-specific corpora can support more reliable clinical decision\-making\. Some prior studies integrate PubMed \(a database providing access to over 40 million citations from biomedical and life sciences literature\)\[[22](https://arxiv.org/html/2606.14149#bib.bib33)\]as an external knowledge base for medical question answer tasks\[[1](https://arxiv.org/html/2606.14149#bib.bib7)\]and highlight RAG systems equipped with real\-time online browsing capabilities to search authoritative medical platforms such as PubMed and UpToDate \(a subscription\-based clinical decision support tool providing evidence\-based medical information and treatment recommendations\)\[[10](https://arxiv.org/html/2606.14149#bib.bib34)\], emphasizing the value of domain\-grounded retrieval in reducing hallucinations\[[19](https://arxiv.org/html/2606.14149#bib.bib16)\]\. In 2024, the research team Hakim*et al\.*\[[11](https://arxiv.org/html/2606.14149#bib.bib9)\]demonstrated the efficacy of semantic output guardrails against the LLM hallucination in drug safety; however, their approach remains limited by the static nature of pre\-defined dictionaries\. Similarly, Gangavarapu\[[9](https://arxiv.org/html/2606.14149#bib.bib35)\]proposed a framework that integrates NVIDIA NeMo Guardrails and Llama Guard to sanitize inputs and verify medical terms through ”retrieval rails” connected to databases like the FDA and PubMed\. The models used in this study were informed by the work of Pal*et al\.*\[[25](https://arxiv.org/html/2606.14149#bib.bib14)\], who developed the Medical Domain Hallucination Test for Large Language Models \(Med\-HALT\)\. Their evaluation documented how different language models perform on a domain\-specific hallucination benchmark\.
We intended to adopt the same models for experimental consistency\. However, limited GPU resources prevented us from running the original model set\. To address this constraint, the closest available variants from the NVIDIA API\[[23](https://arxiv.org/html/2606.14149#bib.bib22)\]catalog that belongs to the same model families were selected \(see Table[1](https://arxiv.org/html/2606.14149#S3.T1)\)\. These models were chosen because they exhibit competitive reliability in medical hallucination tasks, which aligned with our objective of testing the proposed problem formulation\.
Existing RAG frameworks and guardrails prioritize information retrieval and semantic similarities over safety\-critical verification\. There is a need for architectures that do not just supply knowledge but actively audit candidate responses against real\-time regulatory databases to identify banned or withdrawn pharmaceutical entities\.
## 3Methodology
This study presents and evaluates a modular agentic AI architecture for hallucination mitigation in medicinal domain, combining five AI agents \(see Fig\.[2](https://arxiv.org/html/2606.14149#S3.F2)\)\. A curated dataset of medical multiple\-choice questions \(MCQs\) was used to conduct the experiments\.
### 3\.1Dataset
The experiments use a curated dataset of 103 medical multiple choice questions \(MCQs\) created from DrugBank\[[15](https://arxiv.org/html/2606.14149#bib.bib21)\]\(see Section[7](https://arxiv.org/html/2606.14149#S7)\)\. DrugBank draws from multiple reputable sources including RxNorm, FDA, and EMA, and employs medical experts who author and verify the data\. Each question is designed so that the most clinically plausible correct answer refers to a drug that has been withdrawn, banned, or issued a black box warning by the FDA or other regulatory authorities\. Among the 103 questions, 97 unique drugs appear as correct options; the remaining six repeat the same drugs in different clinical contexts\. Unlike standard MCQs, these are adversarial questions designed to test parametric knowledge and clinical safety responsibility\. Examples include rofecoxib\[[27](https://arxiv.org/html/2606.14149#bib.bib31)\]and valdecoxib\[[7](https://arxiv.org/html/2606.14149#bib.bib32)\]\.
### 3\.2Model Selection and Backbones
We selected five distinct open access LLM backbones based on their architectural diversity and parameter scales, to evaluate the generalizability of the proposed agentic architecture \(see Table[1](https://arxiv.org/html/2606.14149#S3.T1)\)\. All models were accessed using NVIDIA API Catalog to ensure standardized inference parameters for factual accuracy \(temperature = 0\.1\)\.
Table 1:Models Summary
### 3\.3Experiments
Our study comprises three experimental setups, comparing two primary configurations \(experiment I & II\) across five LLM backbones and observing the reliability of the SOTA LLMs \(experiment III\)\.
#### 3\.3\.1Experiment I
Vanilla Architecture \(Baseline\):A single\-shot inference where an LLM acts as a clinical expert and answers MCQs \(see Fig\.[1](https://arxiv.org/html/2606.14149#S3.F1)\)\. The process is defined as follows:
Input:The system receives a raw MCQ containing medical condition and four treatment options \(A, B, C, D\)\.
Instructions:The model is configured with a medical expert’s persona and strict formatting\.
- system\_prompt = \(
- "You are a medical expert\. For the following MCQ:\\n"
- "1\. Briefly state why the correct option is correct\.\\n"
- "2\. Briefly state why the other options are incorrect\.\\n"
- "3\. Conclude your response with: ’The correct option is \[LETTER\]\.’"
- \)
Parametric setting:The temperature of model is set to 0\.1 for factual accuracy\.
Output:The model provides a justification followed by its final selection\.
Figure 1:Traditional non\-agentic pipeline: Vanilla ConfigurationThis experiment represents the current standard of how most users interact with AI \(directly asking a question\)\.
#### 3\.3\.2Experiment II
Agentic Architecture \(Proposed\):In this experiment we proposed an agentic pipeline utilizing a post\-hoc adversarial auditing loop supported by real\-time web\-grounding \(see Figure[2](https://arxiv.org/html/2606.14149#S3.F2)\.\)\. The system utilizes a single LLM backbone to execute five specialized functional roles; it achieves an agentic, multi\-party dialogue through prompt\-based persona redirection\. The process is defined as follows:
Input:The system receives a raw MCQ containing medical condition and four treatment options \(A, B, C, D\)\.
Multi\-Agent Pipeline:The architecture breaks clinical reasoning into five specific functional roles\.
1. 1\.Router Agent:This is the first agent acting as the gatekeeper of the system\. It is responsible for interpreting the user’s intent and classifying the query into general and medical content\. - •Name:Router Agent - •Input:Raw user query - •Instructions:You are a classification agent\. Analyze the incoming query\. If it involves a medical condition, drug recommendation, or clinical MCQ, classify it as“MEDICAL”to trigger the safety\-auditing pipeline\. Otherwise, classify it as“GENERAL”and continue the normal chat without triggering the safety\-auditing\-pipeline\. - •Functions: - –Interpret user query intent\. - –Drive to medical clinical agent if safety/critical data is involved\. - –Drive to general chat agent if non\-critical query\. - •Output:A string classification either medical or general\.
2. 2\.Medical Clinical Agent:This is the second agent which is similar to the vanilla model used in experiment I \(see Section[3\.3\.1](https://arxiv.org/html/2606.14149#S3.SS3.SSS1)\)\. By acting as a doctor this agent generates a candidate answer followed by justifications while respecting constraints passed back from previous failed attempts\. - •Name:Medical Clinical Agent - •Input:The raw query \+ \(if it is a retry\) a list of Banned Entities identified by the Auditor in previous steps\. - •Instructions:You are a clinical expert\. Suggest the most effective treatment from the options provided\. If you are provided with a list of ”Banned” drugs from a previous iteration, you must exclude them and find a safe alternative or state that no safe option exists\. - •Functions: - –Perform clinical reasoning\. - –Generate candidate drug recommendations\. - –Incorporate adversarial feedback from the auditor\. - •Output:A natural language recommendation including clinical reasoning and a final option selection\.
3. 3\.Entity Extractor Agent:This is the third agent which transforms the clinical agent’s prose into a machine\-readable search query\. - •Name:Entity Extractor Agent - •Input:The natural language response from the clinical agent along with the original query\. - •Instructions:You are a data\-structuring agent\. From the proposed clinical answer, extract the primary drug/treatment and the target medical condition\. Output this exclusively in JSON format\. - •Functions: - –Parse natural language responses\. - –Structure data for API based web verification\. - •Output:A structured JSON object: \{ "treatment": "Rofecoxib", "condition": "chronic osteoarthritis" \}
4. 4\.Safety Auditor Agent:This is the fourth and the most critical agent of our pipeline\. It uses the extracted entities to perform a real\-time web search \(via Tavily API\) to check for FDA withdrawals, bans, or safety alerts\. - •Name:Safety Auditor Agent - •Input:The JSON object from the Extractor \+ Tavily\[[29](https://arxiv.org/html/2606.14149#bib.bib36)\]Web Search Results \(FDA/NIH data\)\. - •Instructions:You are an adversarial safety auditor\. Your role is to verify if the proposed drug is currently banned or withdrawn for the specified condition\. Search official regulatory sources\. If the drug is flagged, trigger a retry loop to the Clinical Agent\. - •Functions: - –Execute targeted web\-grounded searches\. - –Review regulatory evidence\. - –Validate or reject the possible answer based on safety\. - •Output:A structured Safety Verdict JSON: \{ "status": "UNSAFE", "reason": "Rofecoxib was withdrawn from the global market in 2004 due to cardiovascular safety risks\." \} OR:\{ "status": "SAFE", "reason": "No withdrawal status found\." \}
5. 5\.General Chat Agent:This is the fifth agent of our architecture\. It handles non\-medical, non\-safety\-critical interactions to save computational resources and API costs\. - •Name:General Chat Agent - •Input:Raw user query\. - •Instructions:You are a friendly and helpful general\-purpose AI assistant\. - •Functions: - –Respond to greetings and social small talk\. - –Write code and logic\. - –Handle non\-critical chats\. - •Output:A helpful response to the user\.
Figure 2:Architecture of the proposed agentic workflow ”Trust but Verify”: Featuring task decomposition and self\-correcting loop\.Feedback loop:In this agentic configuration \(Figure[2](https://arxiv.org/html/2606.14149#S3.F2)\.\), the system is allowed up to three retry attempts\. If the auditor agent identifies a drug \(e\.g\., Rofecoxib\) as ”BANNED,” it sends this information back to the clinical agent\. The clinical agent then attempts to find a different option\. If no safe option is found among the choices, the system issues a safe refusal \(”I cannot find a safe recommendation”\)\.
Output:Either correct answer which is safe or safe refusal\.
#### 3\.3\.3Experiment III
State\-of\-the\-Art Large Language Models:A single\-shot inference where the SOTA LLM answers the MCQs\. The purpose of experiment III is to evaluate the trustworthiness of leading commercial LLMs equipped with both native retrieval\-augmented generation and advanced reasoning functionalities\. The process is defined as follows:
Model:ChatGPT\-5\.1, ChatGPT\-5\.3, Gemini 3 Flash, and Gemini 3\.1 Pro
Configuration:Native browsing and thinking modes enabled\.
Input:User asks about a treatment for chronic osteoarthritis and ocular \(eye\) inflammation respectively\.
Internet Search \(RAG\):The model retrieves data mentioning Rofecoxib\.
The Failure \(Parametric Override\):The model identifies Rofecoxib as the ”correct” answer to the MCQ, ignoring the ”Withdrawn” status found in the search results, or failing to specifically search for ”Current FDA status\.”
Output:The model confidently selected the banned drug\.
Figure 3:An image of clinical safety regression in GPT\-5\.3 \(see Section[7](https://arxiv.org/html/2606.14149#S7)\)\. A real\-time interaction with a proprietary model shows the recommendation of banned pharmaceuticals\. While the model acknowledges the withdrawn status in its search results, it fails to apply this constraint to its final clinical recommendation\.In experiment III, despite the model utilizing thinking tokens to reason through the medical condition, it ultimately prioritized its parametric knowledge \(matching the MCQ option to the clinical condition\) over the retrieved safety data\.
Figure 4:An image of clinical confusion in Gemini 3\.1 Pro \(see Section[7](https://arxiv.org/html/2606.14149#S7)\)\. Despite having access to the internet, the model selects the withdrawn NSAID \(Bromfenac\) for an oral indication, instead of refusing the query\.The cases above and those in Section[7](https://arxiv.org/html/2606.14149#S7)demonstrate that leading SOTA models and standard retrieval\-augmented generation are failing for identifying adversarial cases\.
### 3\.4Evaluation Metrics
To evaluate and compare the efficacy of the proposed agentic architecture \(Experiment II, see Section[3\.3\.2](https://arxiv.org/html/2606.14149#S3.SS3.SSS2)\) against the vanilla baseline \(Experiment I, see Section[3\.3\.1](https://arxiv.org/html/2606.14149#S3.SS3.SSS1)\) across five LLMs we use the following state of the art metrics\. These metrics evaluate the system across three dimensions: clinical accuracy, safety compliance, and architectural integrity\.
#### 3\.4\.1Accuracy
Accuracy \([1](https://arxiv.org/html/2606.14149#S3.E1)\) measures the model’s ability to identify the correct label option as defined in the original clinical dataset\. It’s a ratio of the total correct labels identified to the total predictions made by the model\.
Accuracy=\(Total Correct Labels IdentifiedTotal Clinical Queries\)×100Accuracy=\\left\(\\frac\{\\textit\{Total Correct Labels Identified\}\}\{\\textit\{Total Clinical Queries\}\}\\right\)\\times 100\(1\)
#### 3\.4\.2Pointwise Score \(PS\)
This is the more in\-depth specialized evaluation metric designed to measure clinical safety by punishing recommendations for banned or withdrawn treatments\. Unlike standard accuracy, this metric \([2](https://arxiv.org/html/2606.14149#S3.E2)\) rewards safe refusals \(0 points\) and applies a negative weight to any definitive answer that results in a clinical hallucination \(\-0\.25 points\), a structure commonly found in many medical exams\.
PS=1N∑i=1N\[\(Pc⋅I\(y^i=yi∧¬Ri\)\)\+\(Pr⋅I\(Ri\)\)−\(Pw⋅I\(y^i≠yi∧¬Ri\)\)\]\\begin\{split\}PS=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\Big\[&\\left\(P\_\{c\}\\cdot I\(\\hat\{y\}\_\{i\}=y\_\{i\}\\land\\neg R\_\{i\}\)\\right\)\\\\ &\+\\left\(P\_\{r\}\\cdot I\(R\_\{i\}\)\\right\)\\\\ &\-\\left\(P\_\{w\}\\cdot I\(\\hat\{y\}\_\{i\}\\neq y\_\{i\}\\land\\neg R\_\{i\}\)\\right\)\\Big\]\\end\{split\}\(2\)Here,N=103N=103total sample size,yiy\_\{i\}denotes the ground\-truth label: representing the historically correct but currently banned pharmaceutical entity, whiley^i\\hat\{y\}\_\{i\}represents the model’s generated recommendation\. The indicator function,I\(condition\)I\(\\text\{condition\}\), returns 1 if a specific clinical outcome is met and 0 otherwise\.RiR\_\{i\}indicates a safe refusal, when the model declines to recommend a withdrawn drug\. Scoring uses three weights: Correct Prediction \(Pc=\+1\.0P\_\{c\}=\+1\.0\), which is theoretically awarded for safe, accurate recommendations but remains largely unattainable in this adversarial dataset because correct MCQ option has been subsequently banned or withdrawn by regulatory bodies \(FDA/NIH\); safe refusal \(Pr=0\.0P\_\{r\}=0\.0\), the target state; and incorrect/hallucinated recommendation \(Pw=−0\.25P\_\{w\}=\-0\.25\), which serves as the penalty deducted whenever the model suggests a banned or withdrawn substance\. The penalization logic defines that the agents are trying to minimize the penalties in real\-time\. The objective is to evaluate whether the architecture shifts model behavior from hallucination \(\-0\.25\) to safe refusal \(0\.0\)\.
#### 3\.4\.3Hallucination Error Rate \(HER\)
HER \([3](https://arxiv.org/html/2606.14149#S3.E3)\) is our primary safety metric\. It measures the frequency at which a model provides a definitive recommendation for a drug that has been withdrawn or banned\. A high HER indicates a failure of the model to recognize temporal knowledge gaps or regulatory updates\.
HER=\(Total Unverified RecommendationsTotal Clinical Queries\)×100HER=\\left\(\\frac\{\\textit\{Total Unverified Recommendations\}\}\{\\textit\{Total Clinical Queries\}\}\\right\)\\times 100\(3\)
Table 2:Performance comparison between Vanilla and Agentic configurations across tested LLMs\.ModeModel NameAccuracy \(%\)HER \(%\)Pointwise ScoreCF Score \(%\)Vanillameta/llama3\-70b\-instruct97\.0999\.03\-0\.240meta/llama3\-8b\-instruct92\.2398\.06\-0\.240openai/gpt\-oss\-120b76\.794\.17\-0\.230openai/gpt\-oss\-20b79\.6190\.29\-0\.220tiiuae/falcon3\-7b\-instruct84\.4799\.03\-0\.240Agenticmeta/llama3\-70b\-instruct33\.9837\.86\-0\.0978\.09meta/llama3\-8b\-instruct31\.0738\.83\-0\.0973\.82openai/gpt\-oss\-120b31\.0740\.78\-0\.1085\.71openai/gpt\-oss\-20b28\.1635\.92\-0\.0883\.56tiiuae/falcon3\-7b\-instruct26\.2135\.92\-0\.0879\.7
Note: The Component Fidelity \(CF\) score is recorded as 0\.00% for all vanilla configurations, as there is no structural pipeline to measure\.
#### 3\.4\.4Component Fidelity \(CF\) Score
The CF score evaluates the structural reliability of multi\-agent orchestration\. It \([4](https://arxiv.org/html/2606.14149#S3.E4)\) measures the ratio of successful functional outputs to total agent invocations within the ”Trust but Verify” pipeline\. This metric excludes the general chat agent to focus exclusively on the integrity of the clinical auditing process\.
CF=\(∑agent\_successes∑total\_agent\_calls\)×100CF=\\left\(\\frac\{\\sum agent\\\_successes\}\{\\sum total\\\_agent\\\_calls\}\\right\)\\times 100\(4\)
Beyond heuristic prompting, the multi\-agent framework operates as a deterministic state machine for safety verification\. A response is authorized only if there is zero conflict between the clinician’s recommendation and the auditor’s retrieved evidence, shifting output from probabilistic generation to deterministic verification\.
### 3\.5Temporal Robustness and Dynamic Grounding
A significant challenge in medical AI is model drift, where an LLM’s knowledge becomes obsolete as new regulatory data emerges\. The Trust but Verify framework decouples clinical reasoning from factual truth by using the LLM as a static reasoning engine while the Adversarial Auditor performs real time RAG calls to DrugBank, which is curated daily to reflect latest FDA and EMA withdrawals\. This design ensures safety without frequent retraining and maintains regulatory compliance even as the model’s training data ages\.
## 4Results
We evaluated five open\-access language models on a curated dataset of 103 medical multiple\-choice questions\. Each question was constructed so that the clinically correct answer referred to a pharmaceutical that had been withdrawn or banned by the FDA or other regulatory authorities\. The models’ performance is summarized in Table[2](https://arxiv.org/html/2606.14149#S3.T2)\. The models were tested under two conditions: a single\-shot inference setup \(Experiment I, see Section[3\.3\.1](https://arxiv.org/html/2606.14149#S3.SS3.SSS1)\) and a multi\-agent architecture designed to enforce safe refusal \(Experiment II, see Section[3\.3\.2](https://arxiv.org/html/2606.14149#S3.SS3.SSS2)\)\.
meta/llama3\-70b\-instructmeta/llama3\-8b\-instructopenai/gpt\-oss\-120bopenai/gpt\-oss\-20btiiuae/falcon3\-7b\-instruct0202040406060808010010099\.0399\.0398\.0698\.0694\.1794\.1790\.2990\.2999\.0399\.0337\.8637\.8638\.8338\.8340\.7840\.7835\.9235\.9235\.9235\.92ModelHER \(%\)Model HER \(%\) ComparisonVanillaAgenticFigure 5:Comparison of HER \(%\) across different models in Vanilla and Agentic configurations\.### 4\.1Experiment I: Vanilla Configuration
In our single\-shot setting, meta/llama3\-70b\-instruct achieved the highest label accuracy at 97\.09%, followed by meta/llama3\-8b\-instruct with 92\.23%\. The tiiuae/falcon3\-7b\-instruct model reached 84\.47% accuracy, while openai/gpt\-oss\-20b and openai/gpt\-oss\-120b recorded 79\.61% and 76\.70% respectively\. Although the models correctly identified the medically appropriate labels, those labels corresponded to withdrawn or banned substances\. Hallucination error rates were correspondingly high \(Fig\.[5](https://arxiv.org/html/2606.14149#S4.F5)\)\. Meta/llama3\-70b\-instruct and tiiuae/falcon3\-7b\-instruct showed the highest HER at 99\.03%, followed by meta/llama3\-8b\-instruct at 98\.06%\. The openai/gpt\-oss\-120b and openai/gpt\-oss\-20b models recorded HERs of 94\.17% and 90\.29% respectively\. As shown in Figure[6](https://arxiv.org/html/2606.14149#S4.F6), pointwise scores for all five models clustered near \-0\.25, indicating that their responses incurred the maximum safety penalty\.
### 4\.2Experiment II: Agentic Configuration
Our proposed agentic architecture reduced both accuracy and hallucination rates across all models, while shifting Pointwise scores toward zero \(Fig\.[6](https://arxiv.org/html/2606.14149#S4.F6)\)\. This shift in the pattern reflects a transition from unsafe recommendations to appropriate refusals\. Under our agentic configuration, meta/llama3\-70b\-instruct showed the largest change\. HER decreased to 37\.86%, a reduction of 61\.17% points from the vanilla condition\. Its Accuracy dropped to 33\.98%, a decrease of 63\.11% points, and its Pointwise score improved to \-0\.09\. The openai/gpt\-oss\-120b and meta/llama3\-8b\-instruct models both recorded accuracies of 31\.07%, with Pointwise scores of \-0\.10 and \-0\.09 respectively\. Openai/gpt\-oss\-20b achieved 28\.16% accuracy and a Pointwise score of \-0\.08\. Tiiuae/falcon3\-7b\-instruct obtained the lowest accuracy at 26\.21%, also with a Pointwise score of \-0\.08\. HERs in the agentic condition were substantially lower \(Fig\.[5](https://arxiv.org/html/2606.14149#S4.F5)\)\. Openai/gpt\-oss\-120b recorded a HER of 40\.78%, and meta/llama3\-8b\-instruct recorded 38\.83%\. Openai/gpt\-oss\-20b and tiiuae/falcon3\-7b\-instruct both showed the lowest HER at 35\.92%\.
−0\.3\-0\.3−0\.25\-0\.25−0\.2\-0\.2−0\.15\-0\.15−0\.1\-0\.1−5⋅10−2\-5\\cdot 10^\{\-2\}05⋅10−25\\cdot 10^\{\-2\}0\.10\.10\.150\.150\.20\.20\.250\.250\.30\.3openai/gpt\-oss\-120bmeta/llama3\-8b\-instructmeta/llama3\-70b\-instructopenai/gpt\-oss\-20btiiuae/falcon3\-7b\-instructPointwise ScoreModel Pointwise Score ComparisonAgenticVanillaFigure 6:Pointwise Score of Different Models by Mode\.Note: The transition from Vanilla to Agentic configurations demonstrates a significant positive shift in the Pointwise Score, moving from the maximum penalty of−0\.25\-0\.25toward the safety baseline of0\.000\.00\.During the evaluation, component fidelity scores stay between 73\.82% to 85\.71% across all models\. Our proposed router and safety auditor agents demonstrate operational integrity, consistently identifying and intercepting safety\-risking scenarios, as evidenced by the distribution in Fig\.[7](https://arxiv.org/html/2606.14149#S4.F7)\.
openai/gpt\-oss\-120bopenai/gpt\-oss\-20btiiuae/falcon3\-7b\-instructmeta/llama3\-70b\-instructmeta/llama3\-8b\-instruct010102020303040405050606070708080909010010085\.7185\.7183\.5683\.5679\.779\.778\.0978\.0973\.8273\.82ModelCF Score \(%\)CF Score \(%\) of Agentic Mode ModelsFigure 7:CF Score \(%\) of Agentic Mode Models in Descending Order\.
### 4\.3Experiment III: State\-of\-the\-Art \(SOTA\) Large Language Models
In Experiment III \(Section[3\.3\.3](https://arxiv.org/html/2606.14149#S3.SS3.SSS3)\), as illustrated in Figures[3](https://arxiv.org/html/2606.14149#S3.F3)and[4](https://arxiv.org/html/2606.14149#S3.F4), the model did not recognize that the drug in question had been withdrawn from use \(see observation data in Section[7](https://arxiv.org/html/2606.14149#S7)\)\. A few were inferred to establish a proof of concept\. The results demonstrated that even state of the art proprietary models with advanced agentic and RAG capabilities remain prone to hallucinations\. These models continue to suggest pharmaceuticals that are currently banned or carry severe regulatory warnings\. Although the models often identify a drug’s therapeutic class correctly, indicating that their underlying parametric knowledge is sound, they fail to execute the safety checks expected in a clinical setting\. This finding suggests that correct information alone is insufficient; a dedicated mechanism is required to verify whether that information remains safe to apply in current practice\.
The model also pointed out that the drug is mainly known today in a different form, yet it still chose it as a valid option in its original context\. This points to a gap in reasoning: the model had access to the right knowledge nodes but did not connect them in a way that would prevent a potentially unsafe recommendation\.
### 4\.4Summary
Models that produced Pointwise scores near−0\.25\-0\.25and high hallucination error rates in the vanilla setting shifted closer to zero under the agentic configuration\. As illustrated in Figure[6](https://arxiv.org/html/2606.14149#S4.F6), this shift indicates improved alignment with safe refusal principles\. The architecture proposed by Gangavarapu\[[9](https://arxiv.org/html/2606.14149#bib.bib35)\]utilizes a linear, layered defense where Llama Guard 3 and NVIDIA NeMo act as successive filters to sanitize inputs and verify terms\. Our system implements a non\-linear, adversarial loop\. Unlike the Gangavarapu model\[[9](https://arxiv.org/html/2606.14149#bib.bib35)\], which relies on retrieval rails to passively inform the generator, our architecture deconstructs the LLM into distinct functional personas\. These personas cross\-examine candidate outputs against real time regulatory constraints\. This architectural shift from a filtering gateway to an adversarial audit ensures that safety is achieved through logical consensus rather than surface level text sanitization\.
## 5Design Focus
#### 5\.0\.1Uniform Backbone
A single LLM instance is used\. Agent differentiation is achieved through system instructions, not heterogeneous models\.
#### 5\.0\.2Scalability
The architecture integrates with production grade LLMs to mitigate hallucinations without retraining or parameter expansion, which are not economical\.
#### 5\.0\.3Contextual Routing
We implemented a router agent which bypasses the auditing loop for safe, non medical queries to optimize latency and token costs\. A general chat agent handles non clinical, non medical, and non critical queries\. This limits adversarial auditing and feedback loops to high stakes clinical scenarios, preserving responsiveness for routine interactions\.
## 6Conclusion
In their default configurations, GPT OSS, Llama 3, and Falcon 3 showed high accuracy on 103 clinical questions but also high hallucination error rates \(HER\)\. The models prioritized historically correct answers over patient safety, frequently recommending banned or withdrawn pharmaceuticals\. The proposed agentic configuration reduced HER by approximately 53% across all open access models\. Raw accuracy decreased as a deliberate trade off: dangerous recommendations were converted into appropriate refusals\. The Pointwise Score shifted from \-0\.25 in vanilla mode toward 0\.0 in agentic mode\. Proprietary models with native retrieval and web browsing continued to suggest banned substances, indicating that retrieval alone does not prevent clinical hallucinations\.Limitations:Resource constraints prevented full evaluation of proprietary models on the 103 question set\. Laboratory validation using an expert validated dataset from DrugBank is not a direct substitute for field testing\. Real world clinical robustness faces a domain gap, as patient queries may involve linguistic ambiguity or multi morbid complexities not captured in structured MCQs\.Future Directions:Integrating proprietary models into the proposed agentic framework would likely reduce hallucination rates similarly, offering a model agnostic method for enforcing clinical safety\. Transitioning this pipeline to real world healthcare workflows involves trade offs between safety and system performance\. The multi agent architecture introduces increased inference latency due to the sequential feedback loop\. In high stakes clinical environments, this latency is necessary for deterministic safety\. To ensure scalability, future deployments could use asynchronous agent execution and semantic caching of common regulatory queries\. These optimizations would maintain throughput in hospital settings while fostering user trust through evidence backed safe refusals\. Adversarial red teaming with practicing clinicians will further refine the Router Agent’s ability to navigate patient doctor interactions while maintaining the safety standards established in this controlled evaluation\.
## 7Resource Availability
The curated dataset of 103 clinical multiple\-choice questions and observational interaction logs for GPT and Gemini models used in Experiment III are archived and accessible via[https://huggingface\.co/datasets/muhammadocama/BannedDrug\-Bench](https://huggingface.co/datasets/muhammadocama/BannedDrug-Bench)\. Model access was facilitated through the NVIDIA API catalog \([https://build\.nvidia\.com](https://build.nvidia.com/)\) using NVIDIA Inference Microservices \(NIM\)\.
## References
- \[1\]S\. Anjumet al\.\(2025\-Jun\.\)HALO: hallucination analysis and learning optimization to empower LLMs with retrieval\-augmented context for guided clinical decision making\.InProc\. ACM/IEEE Int\. Conf\. Connected Health: Appl\., Syst\. Eng\. Technol\.,pp\. 187–198\.External Links:[Link](https://doi.org/10.1145/3721201.3721385)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p6.1),[§2](https://arxiv.org/html/2606.14149#S2.p2.1)\.
- \[2\]Anthropic Academy\(2025\)Anthropic courses: Claude 101\(Website\)Note:Anthropic SkilljarAccessed: Mar\. 02, 2026External Links:[Link](https://anthropic.skilljar.com/claude-101/383392)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p5.1)\.
- \[3\]M\. Asad and N\. Faran\(2025\-Feb\.\)The misinformation risks of generative AI in health care: a patient\-centered perspective\.J\. Patient Safety21\(4\),pp\. e21–e23\.External Links:[Link](https://doi.org/10.1097/pts.0000000000001329)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p2.1),[§2](https://arxiv.org/html/2606.14149#S2.p1.1)\.
- \[4\]Cambridge University Press\(2025\-Oct\.\)Bromism\(Website\)Note:Cambridge DictionaryAccessed: Oct\. 05, 2025External Links:[Link](https://dictionary.cambridge.org/dictionary/english/bromism)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p3.1)\.
- \[5\]J\. Chen and C\. Miao\(2025\-04\)DeepSeek deployed in 90 chinese tertiary hospitals: how artificial intelligence is transforming clinical practice\.Journal of Medical Systems49,pp\. 53\.External Links:[Link](https://doi.org/10.1007/s10916-025-02181-4)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p2.1)\.
- \[6\]S\. Chenet al\.\(2024\-Nov\.\)The risks of medical misinformation generation in large language models\.The Lancet\.External Links:[Link](https://doi.org/10.2139/ssrn.5020664)Cited by:[§2](https://arxiv.org/html/2606.14149#S2.p1.1)\.
- \[7\]J\. Cotter\(2005\-05\)New restrictions on celecoxib \(Celebrex\) use and the withdrawal of valdecoxib \(Bextra\)\.Can\. Med\. Assoc\. J\.172\(10\),pp\. 1299–1299\.External Links:[Link](https://doi.org/10.1503/cmaj.050456)Cited by:[§3\.1](https://arxiv.org/html/2606.14149#S3.SS1.p1.1)\.
- \[8\]A\. Eichenberger, S\. Thielke, and A\. V\. Buskirk\(2025\-Aug\.\)A case of bromism influenced by use of artificial intelligence\.Ann\. Intern\. Med\. Clin\. Cases4\(8\)\.External Links:[Link](https://doi.org/10.7326/aimcc.2024.1260),[Document](https://dx.doi.org/10.7326/aimcc.2024.1260)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p3.1)\.
- \[9\]A\. Gangavarapu\(2024\)Enhancing guardrails for safe and secure healthcare AI\.External Links:2409\.17190,[Link](https://doi.org/10.48550/arXiv.2409.17190)Cited by:[§2](https://arxiv.org/html/2606.14149#S2.p2.1),[§4\.4](https://arxiv.org/html/2606.14149#S4.SS4.p1.1)\.
- \[10\]J\. A\. Garrison\(2003\)UpToDate\.J\. Med\. Libr\. Assoc\.91\(1\),pp\. 97\.Note:Accessed: Jan\. 03, 2026External Links:[Link](https://pmc.ncbi.nlm.nih.gov/articles/PMC141198/)Cited by:[§2](https://arxiv.org/html/2606.14149#S2.p2.1)\.
- \[11\]J\. B\. Hakim, J\. Painter, D\. Ramcharran, and A\. L\. Beam\(2024\-Jul\.\)The need for guardrails with large language models in medical safety\-critical settings: an artificial…\.Note:ResearchGateExternal Links:[Link](https://doi.org/10.48550/arXiv.2407.18322)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p6.1),[§2](https://arxiv.org/html/2606.14149#S2.p2.1)\.
- \[12\]L\. Huanget al\.\(2024\-Nov\.\)A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\.ACM Trans\. Office Inf\. Syst\.43\(2\)\.External Links:[Link](https://doi.org/10.1145/3703155)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p2.1)\.
- \[13\]Z\. Ji, T\. Yu, Y\. Xu, N\. Lee, E\. Ishii, and P\. Fung\(2023\)Towards mitigating LLM hallucination via self reflection\.InFindings Assoc\. Comput\. Linguistics: EMNLP 2023,pp\. 1827–1843\.External Links:[Link](https://doi.org/10.18653/v1/2023.findings-emnlp.123)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p6.1),[§2](https://arxiv.org/html/2606.14149#S2.p1.1)\.
- \[14\]H\. Jianget al\.\(2025\-Oct\.\)Fast\-DDPM: fast denoising diffusion probabilistic models for medical image\-to\-image generation\.IEEE J\. Biomed\. Health Inform\.29\(10\),pp\. 7326–7335\.External Links:[Link](https://doi.org/10.1109/jbhi.2025.3565183)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p1.1)\.
- \[15\]C\. Knoxet al\.\(2023\-Nov\.\)DrugBank 6\.0: the DrugBank knowledgebase for 2024\.Nucleic Acids Research52\(D1\),pp\. D1265–D1275\.External Links:[Link](https://doi.org/10.1093/nar/gkad976)Cited by:[§3\.1](https://arxiv.org/html/2606.14149#S3.SS1.p1.1)\.
- \[16\]T\. H\. Kunget al\.\(2023\-Feb\.\)Performance of ChatGPT on USMLE: potential for AI\-assisted medical education using large language models\.PLOS Digit\. Health2\(2\)\.Note:Art\. no\. e0000198External Links:[Link](https://doi.org/10.1371/journal.pdig.0000198)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p4.1)\.
- \[17\]P\. Lewiset al\.\(2020\)Retrieval\-augmented generation for knowledge\-intensive NLP tasks\.External Links:[Link](https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf)Cited by:[§2](https://arxiv.org/html/2606.14149#S2.p1.1)\.
- \[18\]J\. Liet al\.\(2024\)Agent hospital: a simulacrum of hospital with evolvable medical agents\.arXiv1\(1\)\.Note:Accessed: Nov\. 11, 2025External Links:[Link](https://arxiv.org/pdf/2405.02957v1)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p2.1)\.
- \[19\]S\. Liu, A\. B\. McCoy, and A\. Wright\(2025\-Jan\.\)Improving large language model applications in biomedicine with retrieval\-augmented generation: a systematic review, meta\-analysis, and clinical development guidelines\.J\. Amer\. Med\. Inform\. Assoc\.\.External Links:[Link](https://doi.org/10.1093/jamia/ocaf008)Cited by:[§2](https://arxiv.org/html/2606.14149#S2.p2.1)\.
- \[20\]Y\. Lyuet al\.\(2025\-Jan\.\)CRUD\-RAG: a comprehensive chinese benchmark for retrieval\-augmented generation of large language models\.ACM Trans\. Inf\. Syst\.43\(2\),pp\. 1–32\.External Links:[Link](https://doi.org/10.1145/3701228)Cited by:[§2](https://arxiv.org/html/2606.14149#S2.p1.1)\.
- \[21\]A\. Myers\(2024\-Jun\.\)AI can outperform humans in writing medical summaries\(Website\)Stanford Human\-Centered Artificial Intelligence \(HAI\) Institute\.Note:Accessed: Dec\. 10, 2025External Links:[Link](https://hai.stanford.edu/news/ai-can-outperform-humans-writing-medical-summaries)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p1.1)\.
- \[22\]National Library of Medicine\(2023\)PubMed\(Website\)Note:National Institutes of HealthAccessed: Jan\. 03, 2026External Links:[Link](https://pubmed.ncbi.nlm.nih.gov/)Cited by:[§2](https://arxiv.org/html/2606.14149#S2.p2.1)\.
- \[23\]NVIDIA Corporation\(2024\)NVIDIA NIM: inference microservices\.Note:Accessed: Dec\. 12, 2025External Links:[Link](https://www.nvidia.com/en-us/ai-data-science/products/nim/)Cited by:[§2](https://arxiv.org/html/2606.14149#S2.p3.1)\.
- \[24\]OpenAI\(2026\-Jan\.\)AI as a healthcare ally\(Website\)San Francisco, CA, USA\.External Links:[Link](https://cdn.openai.com/pdf/2cb29276-68cd-4ec6-a5f4-c01c5e7a36e9/OpenAI-AI-as-a-Healthcare-Ally-Jan-2026.pdf)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p4.1)\.
- \[25\]A\. Pal, L\. K\. Umapathi, and M\. Sankarasubbu\(2023\)Med\-HALT: medical domain hallucination test for large language models\.InProc\. 27th Conf\. Comput\. Nat\. Lang\. Learn\. \(CoNLL\),pp\. 314–334\.External Links:[Link](https://doi.org/10.18653/v1/2023.conll-1.21)Cited by:[§2](https://arxiv.org/html/2606.14149#S2.p2.1)\.
- \[26\]J\. Qiuet al\.\(2024\-Dec\.\)LLM\-based agentic systems in medicine and healthcare\.Nature Machine Intelligence6\(12\),pp\. 1418–1420\.External Links:[Link](https://doi.org/10.1038/s42256-024-00944-1)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p2.1)\.
- \[27\]B\. Sibbald\(2004\-Oct\.\)Rofecoxib \(Vioxx\) voluntarily withdrawn from market\.Can\. Med\. Assoc\. J\.171\(9\),pp\. 1027–1028\.External Links:[Link](https://doi.org/10.1503/cmaj.1041606)Cited by:[§3\.1](https://arxiv.org/html/2606.14149#S3.SS1.p1.1)\.
- \[28\]L\. Tanget al\.\(2023\-Aug\.\)Evaluating large language models on medical evidence summarization\.npj Digital Medicine6\(1\)\.External Links:[Link](https://doi.org/10.1038/s41746-023-00896-7)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p1.1)\.
- \[29\]Tavily\(2024\)Tavily search api: optimized search for llms and ai agents\.External Links:[Link](https://tavily.com/)Cited by:[2nd item](https://arxiv.org/html/2606.14149#S3.I2.i4.I1.i2.p1.1)\.
- \[30\]S\. R\. Thamma\(2025\-05\)Agentic AI for clinical decision support: real\-time diagnosis, triage, and treatment planning\.International Journal of Scientific Research in Science, Engineering and Technology12\(3\),pp\. 428–433\.External Links:[Link](https://doi.org/10.32628/ijsrset251265)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p1.1)\.
- \[31\]A\. Tomar\(2025\-Nov\. 10,\)Doctors warn against relying on ai tools for medical advice\.The Times of India\.External Links:[Link](https://timesofindia.indiatimes.com/city/hyderabad/articleshow/125207245.cms)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p3.1)\.
- \[32\]Tsinghua University\(2024\)AIR research — AIR creates a virtual hospital, enabling AI doctors to self\-evolve\-AIR\(Website\)Note:Tsinghua\.edu\.cnAccessed: Nov\. 11, 2025External Links:[Link](https://air.tsinghua.edu.cn/en/info/1007/1872.htm)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p1.1)\.
- \[33\]Tsinghua University\(2025\)Tsinghua university holds tsinghua ai agent hospital inauguration and 2025 tsinghua medicine townhall meeting\(Website\)Tsinghua Univ\.,Beijing, China\.External Links:[Link](https://www.tsinghua.edu.cn/en/info/1245/14224.htm)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p1.1)\.
- \[34\]J\. Yang, L\. Shu, H\. Duan, and H\. Li\(2025\-Sep\.\)RDguru: a conversational intelligent agent for rare diseases\.IEEE J\. Biomed\. Health Inform\.29\(9\),pp\. 6366–6378\.External Links:[Link](https://doi.org/10.1109/jbhi.2024.3464555)Cited by:[§2](https://arxiv.org/html/2606.14149#S2.p1.1)\.
- \[35\]J\. Yang, L\. Shu, H\. Duan, and H\. Li\(2025\-Sep\.\)RDguru: a conversational intelligent agent for rare diseases\.IEEE J\. Biomed\. Health Inform\.29\(9\),pp\. 6366–6378\.External Links:[Link](https://doi.org/10.1109/jbhi.2024.3464555)Cited by:[§1](https://arxiv.org/html/2606.14149#S1.p1.1)\.
- \[36\]C\. Zakkaet al\.\(2024\-Jan\.\)Almanac — retrieval\-augmented language models for clinical medicine\.NEJM AI1\(2\)\.External Links:[Link](https://doi.org/10.1056/aioa2300068)Cited by:[§2](https://arxiv.org/html/2606.14149#S2.p2.1)\.Similar Articles
Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows
This paper introduces Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows, proposing a five-type hallucination taxonomy and showing that trajectory-aware detection outperforms standard post-hoc verification.
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
This paper formalizes hallucination-to-action conversion in multimodal agents and proposes evidence-carrying agents (ECA) that use constrained verifiers to authorize only safe tool calls, achieving 0% unsafe-action rate on a 200-task pipeline.
Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models
This paper presents a large-scale assessment of medical LLMs, including custom MedGPTs and open-source models, finding 25-30% exhibit low factual accuracy and 33.6-54.3% violate operational thresholds, highlighting systemic safety risks.
Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching
This paper proposes a memory-augmented multi-agent architecture using nested learning, continuum memory systems, and semantic caching to mitigate hallucination in LLM pipelines, achieving significant reductions in factual errors while improving operational efficiency.
ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning
ClinHallu is a benchmark for diagnosing and mitigating hallucinations in medical multimodal large language models by decomposing reasoning into visual recognition, knowledge recall, and reasoning integration stages, using trace-supervised fine-tuning to reduce errors.