SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation

arXiv cs.CL Papers

Summary

Introduces SafeRx-Agent, a knowledge-grounded multi-agent framework for safe and explainable medication recommendation that generates fine-grained ATC code predictions while controlling drug interactions and contraindications, evaluated on MIMIC-III and MIMIC-IV datasets.

arXiv:2605.29146v1 Announce Type: new

# SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation
Source: [https://arxiv.org/html/2605.29146](https://arxiv.org/html/2605.29146)
Xinyu Wang¹, Hanwei Wu², Zhenghan Tai³, Sicheng Lyu², Qincheng Lu⁵, Ziyu Zhao¹, Jijun Chi³,⁴, Jingrui Tian¹, Xiao-Wen Chang¹, Ziyang Song⁶

¹McGill University, ²McMaster University, ³University of Toronto, ⁴ByteDance, ⁵LinkedIn, ⁶Ohio University

###### Abstract

Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup-level safety differences and can lead to risk overestimation. We introduce the first fine-grained medication recommendation setting based on fourth-level ATC code generation. We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC-III and MIMIC-IV datasets show that SafeRx-Agent improves fine-grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.

## 1 Introduction

Medication recommendation from electronic health records (EHRs) is a high-stakes clinical natural language processing (NLP) task Xu et al. ([2022](https://arxiv.org/html/2605.29146#bib.bib39)). Given a patient’s longitudinal clinical context, including prior visits, diagnoses, procedures, and medication history, the model predicts the medications for the current encounter. This task is challenging because ICU histories are sparse, visits involve multiple active conditions, and prescribing depends on both acute illness and treatment continuity Shang et al. ([2019](https://arxiv.org/html/2605.29146#bib.bib20)).

Deep learning methods have advanced medication prediction by modeling visit sequences and drug co-occurrence Ali et al. ([2023](https://arxiv.org/html/2605.29146#bib.bib19)), but they operate over structured EHR codes and cannot utilize patient-specific textual context or external medical evidence at inference time Shang et al. ([2019](https://arxiv.org/html/2605.29146#bib.bib20)); Yang et al. ([2021](https://arxiv.org/html/2605.29146#bib.bib21)). Large language models (LLMs) can process clinical text and generate medication sets, and multi-agent systems further decompose complex clinical reasoning into coordinated steps with tool use Liu et al. ([2025](https://arxiv.org/html/2605.29146#bib.bib8)); Fan et al. ([2026](https://arxiv.org/html/2605.29146#bib.bib33)); Li et al. ([2024](https://arxiv.org/html/2605.29146#bib.bib41)). However, unconstrained LLM agents remain unreliable for clinical decision support: prior work reports hallucinations, guideline misalignment, and unsafe recommendations Hager et al. ([2024](https://arxiv.org/html/2605.29146#bib.bib29)); Asgari et al. ([2025](https://arxiv.org/html/2605.29146#bib.bib30)); Farrag et al. ([2026](https://arxiv.org/html/2605.29146#bib.bib31)). Safe medication recommendation therefore requires a knowledge-grounded agentic framework with explicit evidence and safety verification.

Medications are standardized with the Anatomical Therapeutic Chemical (ATC) taxonomy WHOCC ([2026](https://arxiv.org/html/2605.29146#bib.bib40)), which organizes drugs into five hierarchical levels. Most benchmarks predict at the third level of the ATC taxonomy, denoted ATC-L3, which merges medication subgroups that can differ in clinical use and safety profile Ali et al. ([2023](https://arxiv.org/html/2605.29146#bib.bib19)). This distorts safety evaluation: a drug interaction may apply to one fine-grained subgroup but not another under the same ATC-L3 parent, causing ATC-L3 evaluation to overestimate risk. Accurate safety measurement therefore requires predicting medication codes at a finer granularity.

We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework for safe and explainable fine-grained medication recommendation. SafeRx-Agent routes each patient case to specialty-aware medication agents that generate fine-grained ATC-L4 medication candidates grounded in patient context, ICD and ATC taxonomies, and medication indication evidence. A safety-aware critic-revision loop then checks candidates against drug-drug interaction (DDI) and contraindication resources, revises unsafe predictions, and produces a traceable report. We evaluate SafeRx-Agent on MIMIC-III (Johnson et al., [2016](https://arxiv.org/html/2605.29146#bib.bib37)) and MIMIC-IV (Johnson et al., [2024](https://arxiv.org/html/2605.29146#bib.bib38)) using both medication prediction accuracy and safety metrics. SafeRx-Agent outperforms traditional deep learning, LLM, and agentic baselines in prediction accuracy while reducing DDI and contraindication rates through explicit safety verification. Our key contributions are:

- We introduce the first fine-grained medication recommendation setting for predicting ATC-L4 code sets from EHRs, moving beyond the coarse ATC-L3 setting used in prior benchmarks.
- We propose SafeRx-Agent, a medication recommendation multi-agent framework that combines specialty-aware generation, evidence grounding, safety-aware revision, and traceable reporting in a unified workflow.
- We introduce a knowledge-grounded safety verifier that detects the risks of drug interactions and contraindications, revises unsafe candidates, and produces traceable medication reports.
- We evaluate SafeRx-Agent on two real-world EHR datasets, showing improved fine-grained prediction accuracy, lower safety risks, and predicted set sizes closer to the ground truth.

## 2 Related Work

#### Supervised Medication Recommendation\.

Supervised methods cast medication recommendation as multi-label prediction over structured EHR codes, with GAMENet (Shang et al., [2019](https://arxiv.org/html/2605.29146#bib.bib20)) introducing graph-augmented memory and DDI-aware decoding, and follow-ups adding molecular structure (Yang et al., [2021](https://arxiv.org/html/2605.29146#bib.bib21), [2023](https://arxiv.org/html/2605.29146#bib.bib23)), copy-generate decoding (Wu et al., [2022](https://arxiv.org/html/2605.29146#bib.bib22)), and rare-drug or cold-start training (Zhao et al., [2024](https://arxiv.org/html/2605.29146#bib.bib24); Kuang and Xie, [2024](https://arxiv.org/html/2605.29146#bib.bib9)). These models operate on coarse ATC-L3 vocabularies and encode safety implicitly in the loss, but fail to perform fine-grained medication prediction.

#### LLMs for Medication Recommendation\.

Direct prompting of general or medical LLMs (Zhang et al., [2024](https://arxiv.org/html/2605.29146#bib.bib25); Christophe et al., [2024](https://arxiv.org/html/2605.29146#bib.bib10); Ankit Pal, [2024](https://arxiv.org/html/2605.29146#bib.bib26); Chen et al., [2024](https://arxiv.org/html/2605.29146#bib.bib27); Garcia-Gasulla et al., [2025](https://arxiv.org/html/2605.29146#bib.bib28)) leverages richer textual context than structured-code models, but lacks built-in safety verification (Hager et al., [2024](https://arxiv.org/html/2605.29146#bib.bib29); Asgari et al., [2025](https://arxiv.org/html/2605.29146#bib.bib30); Farrag et al., [2026](https://arxiv.org/html/2605.29146#bib.bib31)). Fine-tuning approaches such as LAMO (Zhao et al., [2025](https://arxiv.org/html/2605.29146#bib.bib32)), FLAME (Fan et al., [2026](https://arxiv.org/html/2605.29146#bib.bib33)), and LEADER (Liu et al., [2025](https://arxiv.org/html/2605.29146#bib.bib8)) address safety via losses, rewards, or distillation, but require task-specific training and remain tied to fixed backbones.

#### Multi\-Agent Frameworks for Clinical Decision Support\.

Multi-agent LLM systems decompose clinical reasoning across coordinated roles for medical QA (Tang et al., [2024](https://arxiv.org/html/2605.29146#bib.bib34); Kim et al., [2024](https://arxiv.org/html/2605.29146#bib.bib35)) and rare-disease diagnosis and treatment (Chen et al., [2026](https://arxiv.org/html/2605.29146#bib.bib13)). However, existing agent-based generation frameworks generally do not support fine-grained medication prediction with resource-grounded verification of multiple safety risks, including DDIs and diagnosis-conditioned contraindications.

## 3 Problem Formulation and Knowledge Resources

### 3.1 Problem Formulation

Let a patient record be a temporally ordered sequence of ICU visits $(v_1, \ldots, v_T)$, where each visit $v_t = (\mathcal{D}_t, \mathcal{P}_t, \mathcal{M}_t)$ contains diagnoses, procedures, and medications. Diagnoses are represented by ICD-CM codes, procedures by ICD-PCS codes, and medications by ATC-L4 codes. Given the past visits and the current diagnoses and procedures, the task is to predict the medications prescribed at the current visit. The input for visit $T$ is

$$X_T = \big(\{(\mathcal{D}_t, \mathcal{P}_t, \mathcal{M}_t)\}_{t<T},\ \mathcal{D}_T,\ \mathcal{P}_T\big).$$

The ground-truth medication set is $\mathcal{M}_T \subseteq \mathcal{V}_{\mathrm{med}}$, and the model outputs a predicted set $\hat{\mathcal{M}}_T \subseteq \mathcal{V}_{\mathrm{med}}$, where $\mathcal{V}_{\mathrm{med}}$ is the ATC-L4 medication vocabulary. We evaluate prediction quality by comparing $\hat{\mathcal{M}}_T$ with $\mathcal{M}_T$, and evaluate safety by measuring DDIs and contraindications in $\hat{\mathcal{M}}_T$.
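The input definition can be illustrated with a small data structure. This is a minimal sketch, not the authors' code: the names `Visit` and `build_input` and the dictionary layout are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Visit:
    """One ICU visit v_t = (D_t, P_t, M_t). Hypothetical container."""
    diagnoses: frozenset[str]    # ICD-CM codes
    procedures: frozenset[str]   # ICD-PCS codes
    medications: frozenset[str]  # ATC-L4 codes


def build_input(visits: list[Visit]):
    """Split a record (v_1, ..., v_T) into the input X_T and the target M_T.

    X_T keeps the full history for t < T but only the diagnoses and
    procedures of visit T, so the target medications are never exposed.
    """
    if not visits:
        raise ValueError("a patient record needs at least one visit")
    *history, current = visits
    x_t = {
        "history": history,
        "diagnoses": current.diagnoses,
        "procedures": current.procedures,
    }
    return x_t, current.medications
```

The split keeps the current visit's medication set out of the input, which matches the formulation: $\mathcal{M}_T$ is the prediction target and appears only on the label side.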

### 3.2 Clinical Knowledge Resources

SafeRx\-Agent uses external clinical knowledge for evidence\-grounded medication generation and safety\-aware revision\. We organize these resources around three functions:

- Standardization: diagnosis and medication taxonomies align EHR concepts with the ATC-L4 medication vocabulary.
- Grounding: indication evidence links patient conditions to clinically relevant medications.
- Safety checking: drug interaction and contraindication resources identify co-prescription and disease-conditioned risks.

Table [1](https://arxiv.org/html/2605.29146#S3.T1) summarizes the resources used. Details about preprocessing, identifier mapping, and matrix construction are provided in Appendix [A](https://arxiv.org/html/2605.29146#A1).

Table 1: Clinical knowledge resources used in this work. All medication-side resources are mapped to the ATC-L4 vocabulary. Contra. = Contraindication.

## 4 SafeRx-Agent

### 4.1 Overview

![Refer to caption](https://arxiv.org/html/2605.29146v1/x1.png)

Figure 1: Overview of SafeRx-Agent. A patient record is routed via weighted ICD-chapter and keyword scoring to a sparse subset of specialty experts plus an always-on supportive-care expert. Each activated expert summarizes the patient record from its specialty scope and generates ATC-L4 medication candidates grounded in MEDI and the ATC taxonomy. A global Critique then judges and reconciles the expert proposals under the full patient context to produce a candidate set. A two-phase Safety Verifier retrieves DDI and contraindication evidence ($\mathbf{M}_{\mathrm{DDI}}$, $\mathbf{M}_{\mathrm{Contra}}$) and adjudicates each flag in context, yielding a traceable ATC-L4 prescription with per-medication rationales.

As shown in Figure [1](https://arxiv.org/html/2605.29146#S4.F1), SafeRx-Agent predicts the medication set $\hat{\mathcal{M}}_T \subseteq \mathcal{V}_{\mathrm{med}}$ for the target visit from the longitudinal patient record $X_T$. SafeRx-Agent uses the clinical knowledge resources in Table [1](https://arxiv.org/html/2605.29146#S3.T1) to ground medication generation and verify safety risks. SafeRx-Agent uses a multi-agent design because ATC codes span heterogeneous therapeutic domains, and clinically plausible candidates may still be unsafe due to DDIs or contraindications. Algorithm [1](https://arxiv.org/html/2605.29146#algorithm1) summarizes the workflow, and the following subsections define each operator. The prompt templates used by the LLM-based operators are provided in Appendix [I](https://arxiv.org/html/2605.29146#A9), Figures [7](https://arxiv.org/html/2605.29146#A9.F7)–[10](https://arxiv.org/html/2605.29146#A9.F10).

Algorithm 1: SafeRx-Agent Inference

Input: Patient record $X_T$; expert panel $\mathcal{E}$; indication resource $R$; safety matrices $\mathbf{M}_{\mathrm{DDI}}, \mathbf{M}_{\mathrm{Contra}}$.

Output: Predicted medication set $\hat{\mathcal{M}}_T$.

- $\triangleright$ Multi-agent generation
- $A \leftarrow \textsc{Route}(X_T, \mathcal{E}),\ A \subseteq \mathcal{E}$
- foreach $E_i \in A$ do
  - $(s_i, \rho_i) \leftarrow \textsc{Summarize}(E_i, X_T)$
  - $\mathcal{M}_i \leftarrow \textsc{Generate}(E_i, s_i, \rho_i, X_T, R)$
- $\tilde{\mathcal{M}} \leftarrow \textsc{Critique}(\{\mathcal{M}_i, s_i, \rho_i\}_{i \in A}, X_T, R)$
- $\triangleright$ Safety verification
- $\mathcal{F} \leftarrow \textsc{FindFlags}(\tilde{\mathcal{M}}, X_T, \mathbf{M}_{\mathrm{DDI}}, \mathbf{M}_{\mathrm{Contra}})$
- $\hat{\mathcal{M}}_T \leftarrow \tilde{\mathcal{M}}$
- foreach $f \in \mathcal{F}$ do
  - if $\textsc{Verify}(f, \hat{\mathcal{M}}_T, X_T, R) = \textsc{Rem}$ then
    - $\hat{\mathcal{M}}_T \leftarrow \hat{\mathcal{M}}_T \setminus \{m_f\}$
- return $\hat{\mathcal{M}}_T$
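The control flow of Algorithm 1 can be sketched as a plain function. This is a hedged illustration rather than the authors' implementation: every operator is passed in as a caller-supplied callable, the function name `saferx_inference` is hypothetical, and each flag is assumed to be a dictionary with a `"medication"` key.

```python
def saferx_inference(x_t, experts, resource, m_ddi, m_contra, *,
                     route, summarize, generate, critique,
                     find_flags, verify):
    """Follow the two stages of Algorithm 1 with caller-supplied operators."""
    # Stage 1: multi-agent generation
    active = route(x_t, experts)
    proposals = []
    for expert in active:
        summary, rationales = summarize(expert, x_t)
        meds = generate(expert, summary, rationales, x_t, resource)
        proposals.append((meds, summary, rationales))
    candidates = set(critique(proposals, x_t, resource))

    # Stage 2: safety verification (removes medications, never adds any)
    final = set(candidates)
    for flag in find_flags(candidates, x_t, m_ddi, m_contra):
        if verify(flag, final, x_t, resource) == "REM":
            final.discard(flag["medication"])
    return final
```

Because the verification loop only discards items from the candidate set, the output is always a subset of the critique output, which matches the statement that safety verification does not generate new medications.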
### 4.2 Expert Panel

We construct a reusable expert panel $\mathcal{E} = \{E_1, \ldots, E_K\} \cup \{E_{\mathrm{univ}}\}$ before inference. The panel contains $K$ specialty experts and one always-on universal supportive-care expert $E_{\mathrm{univ}}$ for common inpatient medications. We derive the $K$ specialty experts empirically by clustering patient embeddings and mapping each cluster to an ICD chapter-level scope. Appendix [B](https://arxiv.org/html/2605.29146#A2) provides implementation details.

### 4.3 Multi-Agent Generation

Route: sparse specialty activation. Invoking the full expert panel for every patient wastes computation and dilutes specialty-specific signal with off-topic perspectives, so Route activates only relevant experts by matching the patient’s current and historical ICD codes in $X_T$ against each expert’s scope, with $E_{\mathrm{univ}}$ always included:

- $A \leftarrow \textsc{Route}(X_T, \mathcal{E})$, $A \subseteq \mathcal{E}$.

Routing is sparse, so the number of expert calls scales with case complexity rather than the full panel size $|\mathcal{E}|$.
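A sparse router of this kind might look as follows. This is only a sketch under stated assumptions: expert scopes are represented as ICD code prefixes, current-visit codes are weighted more heavily than historical ones, and the weights, threshold, and the expert name `CARD` are all hypothetical (the paper's actual scoring is described in its appendix).

```python
def route(current_codes, history_codes, expert_scopes, universal="SUP",
          current_weight=1.0, history_weight=0.5, threshold=1.0):
    """Return the names of activated experts; `universal` is always included.

    expert_scopes maps an expert name to a tuple of ICD code prefixes.
    All weights and the threshold are illustrative assumptions.
    """
    activated = {universal}
    for name, prefixes in expert_scopes.items():
        prefixes = tuple(prefixes)
        score = sum(current_weight for c in current_codes if c.startswith(prefixes))
        score += sum(history_weight for c in history_codes if c.startswith(prefixes))
        if score >= threshold:
            activated.add(name)
    return activated
```

For example, with scopes `{"CARD": ("I",), "RESP": ("J",)}`, a case whose only current code starts with `I` activates `CARD` and the universal expert, while `RESP` stays inactive, so the number of downstream expert calls follows the case rather than the panel size.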

Summarize: expert-specific patient summarization. The raw record $X_T$ remains long and heterogeneous, with each specialty’s relevant evidence scattered among entries that fall outside its concern. Summarize therefore extracts and structures the specialty-relevant portions of $X_T$ into compact artifacts that surface what each expert needs for prescribing. For each activated expert $E_i$, Summarize converts the multi-visit record $X_T$ into expert-specific artifacts (Fig. [7](https://arxiv.org/html/2605.29146#A9.F7)):

- Typed summary $s_i$: a structured summary of medication-related evidence, including active problems, acute organ dysfunction, procedures, care context, medication-relevant risks, and prior prescriptions.
- Visit rationales $\rho_i$: structured explanations for prior visits that link diagnoses, procedures, and prescribed drugs, helping distinguish continued medications from transient treatments.

Generate: per-expert medication proposal. With a focused view of the patient, each expert is now positioned to move from understanding to action and propose medications grounded in its own scope. For each activated expert $E_i$, Generate proposes medications from the typed summary $s_i$, visit rationales $\rho_i$, and diagnosis-linked candidates retrieved from the MEDI resource $R$ (Row 3, Table [1](https://arxiv.org/html/2605.29146#S3.T1)). It also consults the raw record $X_T$ for details missing from $s_i$ or $\rho_i$:

- $\mathcal{M}_i \leftarrow \textsc{Generate}(E_i, s_i, \rho_i, X_T, R)$.

Critique: global critique and reconciliation. Individual experts provide focused but partial prescribing views. Their proposals may overlap, omit cross-specialty context, or include medications that are locally plausible but weakly supported by the full patient record. Critique reviews all expert proposals together with the expert-specific summaries $s_i$, visit rationales $\rho_i$, indication evidence $R$, and the full record $X_T$, and then merges duplicate candidates and removes medications that lack sufficient patient-level support:

- $\tilde{\mathcal{M}} \leftarrow \textsc{Critique}(\{\mathcal{M}_i, s_i, \rho_i\}_{i \in A}, X_T, R)$.

### 4.4 Safety Verification

SafeRx-Agent verifies each generated candidate set because clinically plausible medications may still create co-prescription or disease-conditioned safety risks. Safety verification takes the candidate set $\tilde{\mathcal{M}}$ as input and does not generate new medications. It uses the DDI and contraindication matrices, $\mathbf{M}_{\mathrm{DDI}}$ and $\mathbf{M}_{\mathrm{Contra}}$, derived from the safety resources in Table [1](https://arxiv.org/html/2605.29146#S3.T1), rows 4 and 5. The verifier first retrieves safety evidence and then uses patient context to decide an action for each flagged item.

#### FindFlags: matrix\-based evidence collection\.

FindFlags scans $\tilde{\mathcal{M}}$ against $\mathbf{M}_{\mathrm{DDI}}$ and $\mathbf{M}_{\mathrm{Contra}}$ to collect a set of flagged items $\mathcal{F}$. Each flag is represented as $f = (m_f, \mathrm{rel}_f, \mathrm{evid}_f)$, where $m_f$ is the flagged medication, $\mathrm{rel}_f \in \{\textsc{DDI}, \textsc{Contra}\}$ is the triggering safety relation, and $\mathrm{evid}_f$ contains case-level evidence such as DDI degree and prior-prescription status.
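The flag-collection step can be sketched with sparse lookups in place of dense matrices. The representation is an assumption: `ddi_pairs` is a set of unordered medication pairs, `contra_pairs` is a set of `(diagnosis, medication)` tuples, both members of an interacting pair are flagged, and the evidence fields are illustrative. The codes in the usage note are placeholders, not real ATC or ICD codes.

```python
from itertools import combinations


def find_flags(candidates, diagnoses, prior_meds, ddi_pairs, contra_pairs):
    """Collect flags (m_f, rel_f, evid_f) for a candidate medication set."""
    flags = []
    # Co-prescription risks: check every unordered pair of candidates.
    for a, b in combinations(sorted(candidates), 2):
        if frozenset((a, b)) in ddi_pairs:
            for med, partner in ((a, b), (b, a)):
                flags.append((med, "DDI", {
                    "partner": partner,
                    "previously_prescribed": med in prior_meds,
                }))
    # Disease-conditioned risks: check candidates against active diagnoses.
    for med in sorted(candidates):
        for dx in sorted(diagnoses):
            if (dx, med) in contra_pairs:
                flags.append((med, "CONTRA", {
                    "diagnosis": dx,
                    "previously_prescribed": med in prior_meds,
                }))
    return flags
```

Calling `find_flags({"X01AA", "X01AB"}, set(), {"X01AA"}, {frozenset(("X01AA", "X01AB"))}, set())` yields two DDI flags, one per member of the pair, with the prior-prescription status recorded so a later adjudication step can use it.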

#### Verify: case\-specific adjudication\.

For each flagged item, Verify outputs an action $a_f \in \{\textsc{Ret}, \textsc{Rem}\}$ based on the flag $f$, the patient context $X_T$, and the indication resource $R$ (Fig. [10](https://arxiv.org/html/2605.29146#A9.F10)). Ret retains a flagged medication when the patient context supports its use, such as a previously tolerated DDI pair. Rem removes a medication when the risk lacks patient-specific justification, such as a contraindication against an active diagnosis without prior exposure. This design keeps evidence-supported medications and removes candidates with unsupported safety risks.
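The retain-or-remove decision can be illustrated with a rule-based stand-in. In the paper Verify is an LLM-based adjudicator; the function below is not that adjudicator and only encodes the two examples given in the text (retain on prior exposure, remove otherwise). The names `verify_stub` and `apply_verification` and the evidence key are hypothetical.

```python
def verify_stub(flag):
    """Toy adjudicator: retain only when the drug was previously prescribed."""
    _med, _relation, evidence = flag
    return "RET" if evidence.get("previously_prescribed") else "REM"


def apply_verification(candidates, flags, verify=verify_stub):
    """Remove every flagged medication whose adjudicated action is REM."""
    final = set(candidates)
    for flag in flags:
        if verify(flag) == "REM":
            final.discard(flag[0])
    return final
```

A real adjudicator would weigh the full patient context and indication evidence; the stub only shows where that decision plugs into the removal loop.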

## 5 Experiment Design

### 5.1 Datasets and Preprocessing

We evaluate SafeRx-Agent on MIMIC-III and MIMIC-IV. Diagnoses and procedures are represented by ICD codes with textual descriptions; medications are normalized to ATC-L4 codes with within-visit duplicates removed. The prediction vocabulary $\mathcal{V}_{\mathrm{med}}$ is the closed set of retained ATC-L4 codes; out-of-vocabulary predictions are discarded at evaluation time. We use patient-level splits with no leakage. For MIMIC-IV, we hold out 100 cases for prompt development and evaluate on the remaining 1,586; for MIMIC-III, we use the full 901-patient cohort with prompts transferred from MIMIC-IV without re-tuning. Supervised baselines use the same splits. Dataset statistics are summarized in Table [8](https://arxiv.org/html/2605.29146#A3.T8) (Appendix [C](https://arxiv.org/html/2605.29146#A3)).
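The closed-vocabulary rule can be made concrete with a small post-processing helper. This is a sketch under assumptions: the function name `normalize_prediction` is hypothetical, and the whitespace and case normalization is an added convenience that the paper does not describe.

```python
def normalize_prediction(raw_codes, vocabulary):
    """Deduplicate predicted codes and drop any code outside the closed vocabulary."""
    seen = set()
    kept = []
    for code in raw_codes:
        code = code.strip().upper()  # assumed normalization, not from the paper
        if code in vocabulary and code not in seen:
            seen.add(code)
            kept.append(code)
    return kept
```

Applying the same filter to every LLM-based method keeps the comparison fair, since each method is scored only on codes that exist in the shared ATC-L4 vocabulary.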

### 5.2 Baselines

We compare SafeRx-Agent against four groups of baselines, isolating the effects of task-specific training, medical-domain adaptation, general LLM capability, and agent workflow design. (1) Traditional deep learning: GAMENet (Shang et al., [2019](https://arxiv.org/html/2605.29146#bib.bib20)), SafeDrug (Yang et al., [2021](https://arxiv.org/html/2605.29146#bib.bib21)), DrugDoctor Kuang and Xie ([2024](https://arxiv.org/html/2605.29146#bib.bib9)), and KEHGCN (Zhang et al., [2026](https://arxiv.org/html/2605.29146#bib.bib36)). (2) Medical LLMs: UltraMedical-70B Zhang et al. ([2024](https://arxiv.org/html/2605.29146#bib.bib25)), Med42-v2-70B Christophe et al. ([2024](https://arxiv.org/html/2605.29146#bib.bib10)), OpenBioLLM-70B Ankit Pal ([2024](https://arxiv.org/html/2605.29146#bib.bib26)), Llama3.1-Aloe-Beta-70B Garcia-Gasulla et al. ([2025](https://arxiv.org/html/2605.29146#bib.bib28)), and HuatuoGPT-o1 Chen et al. ([2024](https://arxiv.org/html/2605.29146#bib.bib27)). (3) General LLMs: GPT-5.2, Claude Sonnet-4.6, Qwen3-32B Yang et al. ([2025](https://arxiv.org/html/2605.29146#bib.bib12)), and Gemma3-27B-IT Team et al. ([2025](https://arxiv.org/html/2605.29146#bib.bib11)). (4) Agent-based baselines: RareAgents Chen et al. ([2026](https://arxiv.org/html/2605.29146#bib.bib13)) adapted for ATC-L4 code generation, and General Agent with a single ICU medication expert.

### 5.3 Implementation Details

Supervised baselines are trained on the patient-level splits with an output head over the ATC-L4 vocabulary and hyperparameters tuned on the validation set. All LLM-based methods share the same serialized EHR input and closed ATC-L4 vocabulary. The general-agent baseline replaces SafeRx-Agent’s routed specialty experts with one generalist agent, and RareAgents is adapted to ATC-L4 generation. Full baseline implementations, prompts, and inference settings are in Appendix [D](https://arxiv.org/html/2605.29146#A4).

### 5.4 Evaluation Metrics

We evaluate each method along two axes: medication-set prediction quality and safety. For prediction quality, we report Jaccard, precision, recall, F1, and the average number of predicted ATC-L4 codes. For safety, we report GT-normalized DDI rates (DDI-B and DDI-W) and contraindication rates (Contra-B and Contra-W), where DDI-W and Contra-W weight each safety relation by its observed frequency. Lower values indicate safer predictions for all the generated medication sets. Full metric definitions are provided in Appendix [E](https://arxiv.org/html/2605.29146#A5).
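The set-based prediction metrics can be computed per case as below. This sketch uses the standard set definitions of Jaccard, precision, recall, and F1; the paper's exact aggregation across cases and its safety metrics are defined in its appendix and are not reproduced here. The function name `set_metrics` is hypothetical.

```python
def set_metrics(predicted, truth):
    """Per-case Jaccard, precision, recall, and F1 between two medication sets."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    union = len(predicted | truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    jaccard = tp / union if union else 0.0
    return {"jaccard": jaccard, "precision": precision,
            "recall": recall, "f1": f1, "n_pred": len(predicted)}
```

Reporting the predicted set size alongside these scores matters because a method can raise recall simply by predicting more codes, which also tends to raise the chance of flagged interactions.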

## 6 Results

Table 2: Prediction performance across MIMIC-IV and MIMIC-III. Within each LLM backbone group, bold marks the best and underline marks the second-best (excluding the Avg. #Pred column). Avg. #Pred reports the average number of predicted medications per case and is compared with the average ground-truth size.

### 6.1 Quantitative Results

Table [2](https://arxiv.org/html/2605.29146#S6.T2) reports medication recommendation performance on MIMIC-IV and MIMIC-III. SafeRx-Agent consistently outperforms direct prompting and generic agent baselines across all backbones, achieving the best pre-filter F1 on MIMIC-IV for every model tested. The safety filter trades a small drop in recall for higher precision and smaller predicted sets, as expected from a verifier that removes unsupported candidates. Without task-specific training, SafeRx-Agent also matches or exceeds supervised baselines on F1 and Jaccard. MIMIC-III shows the same pattern, confirming that the workflow generalizes beyond MIMIC-IV.

Table 3: Safety analysis of predicted medication sets across MIMIC-IV and MIMIC-III. All safety metrics are lower-is-better. Within each LLM backbone group, bold marks the best and underline marks the second-best in each column (excluding the Ground Truth row and the Avg. #Pred column). DDI-B and Contra-B report the fraction of cases with at least one DDI or contraindication, while DDI-W and Contra-W report their frequency-weighted rates. Avg. #Pred is the average number of predicted ATC codes per case.

### 6.2 Drug Safety Analysis

We evaluate medication recommendation safety by comparing each predicted ATC-L4 set with the DDI and contraindication matrices constructed from external safety resources (Table [6.1](https://arxiv.org/html/2605.29146#S6.SS1)). Deep learning baselines and unfiltered LLM agents often generate larger medication sets with higher DDI and contraindication rates. Therefore, these methods cannot reliably control medication safety risks without explicit safety verification. Our safety filter consistently reduces DDI and contraindication rates while keeping the predicted set size closer to the ground-truth set size. The largest reductions are observed for GPT-5.2 and Gemma3-27B-IT, where filtered predictions show safety profiles that are better aligned with the extracted safety knowledge. Overall, these findings suggest that reliable fine-grained medication recommendation requires an explicit safety verification step instead of relying on supervised learning loss or LLM generation alone.

### 6.3 Ablation Studies

To quantify the marginal contribution of each component in the generation pipeline, we ablate SafeRx-Agent by removing one component at a time while keeping all other settings fixed. All ablation experiments are conducted on MIMIC-IV using Gemma3-27B-IT (Table [4](https://arxiv.org/html/2605.29146#S6.T4)); subsequent analyses use the same setting. Replacing the specialty panel with a general agent causes the largest F1 drop (−0.057) and increases the predicted set size (10.7 vs. 9.65), confirming that specialty routing constrains each expert to its scope. Removing the critique hurts precision most (0.379 → 0.322) while recall stays flat, consistent with its role as a pruning stage. Removing the medical resource instead hurts recall most (0.398 → 0.325), indicating that retrieved evidence helps recover medications not surfaced by the LLM alone.

Table 4: Prediction ablations. Each variant removes one component while keeping other settings fixed. Red marks decreases. #Pred deltas are shown in black since smaller or larger sets are not directly better or worse.

### 6.4 Diagnostic Analysis

We analyze how the critique stage reshapes the proposed medication set by comparing pre- and post-critique predictions. Figure [2(a)](https://arxiv.org/html/2605.29146#S6.F1.sf1) shows that critique reduces false positives from 9.51 to 6.01 per case at a cost of only 0.46 true positives, acting as an effective precision-oriented pruning stage. Figure [2(b)](https://arxiv.org/html/2605.29146#S6.F1.sf2) further reveals why: of the 6,625 removed medications, 6,373 (96.2%) were proposed by only a single expert, while medications corroborated by multiple specialists were rarely pruned. This suggests that the critique operates conservatively, selectively targeting candidates that lack independent corroboration rather than indiscriminately reducing the set size. Supplementary diagnostic results are reported in Appendix [F](https://arxiv.org/html/2605.29146#A6), and qualitative case studies are presented in Appendix [J](https://arxiv.org/html/2605.29146#A10).
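The expert-support breakdown of pruned medications can be reproduced with a simple count. This is an illustrative sketch: the function name `removed_by_support` and the input layout (a mapping from expert name to proposed codes, plus the post-critique set) are assumptions, not the authors' analysis code.

```python
from collections import Counter


def removed_by_support(proposals, kept):
    """Count pruned medications by the number of experts that proposed them.

    proposals maps an expert name to its proposed codes for one case;
    kept is the candidate set that survived the critique for that case.
    """
    support = Counter(m for meds in proposals.values() for m in set(meds))
    return Counter(n for med, n in support.items() if med not in kept)
```

Summing these counters over all cases gives the distribution behind the reported figure: 6,373 of the 6,625 removals fall in the single-expert bucket, and 6373 / 6625 ≈ 96.2%.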

![Refer to caption](https://arxiv.org/html/2605.29146v1/x2.png) (a) Effect of critique on TP/FP/FN.
![Refer to caption](https://arxiv.org/html/2605.29146v1/x3.png) (b) Removed meds by expert support.
![Refer to caption](https://arxiv.org/html/2605.29146v1/x4.png) (c) LOO subgroup performance.

Figure 2: Diagnostic and expert analyses. (a) Critique substantially reduces average false positives. (b) Critique primarily removes medications proposed by only one expert, and effectively removes false positives. (c) Leave-one-expert-out subgroup performance; expert abbreviations are defined at the beginning of Appendix [G](https://arxiv.org/html/2605.29146#A7).

### 6.5 Expert Analysis

To test whether individual experts provide marginal value beyond the panel as a whole, we conduct a leave-one-expert-out (LOO) analysis on activation-defined subgroups. For each expert, we compare subgroup F1 before and after removing only that expert. Figure [2(c)](https://arxiv.org/html/2605.29146#S6.F1.sf3) shows that every activated expert contributes positively: removing an expert lowers subgroup F1 by between 0.101 (RESP) and 0.232 (SUP). The universal supportive-care expert contributes the most, consistent with its broad activation across cases. Even the least impactful expert (RESP) still yields a meaningful F1 gain, indicating that no activated expert is redundant. Appendix [G](https://arxiv.org/html/2605.29146#A7) reports activation sparsity and per-expert TP contributions.

### 6.6 Medication Granularity and Safety Evaluation

To quantify how medication granularity affects safety evaluation, we compare binary DDI and contraindication rates under ATC-L4 and ATC-L3 representations. We compute the safety metrics directly from ATC-L4 medication sets. We then collapse each ATC-L4 code to its ATC-L3 parent and recompute these metrics at the ATC-L3 level. Figure [3](https://arxiv.org/html/2605.29146#S6.F3) shows that ATC-L3 yields higher DDI and contraindication rates than ATC-L4. On MIMIC-III, DDI-B increases from 24.60% to 59.12%, and Contra-B increases from 0.23% to 0.57%. On MIMIC-IV, DDI-B increases from 30.13% to 63.98%, and Contra-B increases from 0.43% to 1.06%. This finding supports fine-grained medication recommendation at ATC-L4, which preserves medication subgroup distinctions and avoids overestimation of safety risk.
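A toy example shows why collapsing to ATC-L3 inflates the binary DDI rate. ATC level-4 codes have five characters and their level-3 parent is the first four, so the collapse is a prefix truncation. Everything else here is an assumption for illustration: the codes are placeholders rather than real ATC entries, and an L3 pair is treated as interacting whenever any of its L4 child pairs interacts, which may differ from the paper's matrix construction.

```python
from itertools import combinations


def to_l3(code_l4):
    """Map a five-character ATC-L4 code to its four-character ATC-L3 parent."""
    return code_l4[:4]


def collapse_pairs(ddi_pairs_l4):
    """Project L4 interaction pairs to L3 (any-child rule, an assumption)."""
    return {frozenset(to_l3(c) for c in pair) for pair in ddi_pairs_l4}


def has_ddi(meds, ddi_pairs):
    """Binary indicator: does the set contain at least one interacting pair?"""
    unique = sorted(set(meds))
    return any(frozenset(p) in ddi_pairs for p in combinations(unique, 2))


# Placeholder knowledge: only subgroup X01AA interacts with X09CA.
ddi_l4 = {frozenset(("X01AA", "X09CA"))}
patient = ["X01AB", "X09CA"]  # sibling subgroup X01AB, not X01AA

flag_l4 = has_ddi(patient, ddi_l4)
flag_l3 = has_ddi([to_l3(m) for m in patient], collapse_pairs(ddi_l4))
```

Here `flag_l4` is `False` while `flag_l3` is `True`: the sibling subgroup inherits its parent's interaction once the codes are truncated, which is the overestimation mechanism described above.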

![Refer to caption](https://arxiv.org/html/2605.29146v1/x5.png)

Figure 3: Binary DDI and contraindication rates under ATC\-L4 and ATC\-L3 medication vocabulary\.

### 6\.7 Efficiency Analysis

We evaluate the inference efficiency of SafeRx\-Agent on MIMIC\-IV in terms of average LLM calls, input/output tokens, and per\-case latency\. As shown in Table [5](https://arxiv.org/html/2605.29146#S6.T5), despite its evidence\-grounded multi\-agent workflow, SafeRx\-Agent uses fewer LLM calls than RareAgents \(9\.4 vs\. 10\.1\), suggesting that sparse routing avoids unnecessary expert calls\. The final safety filter is lightweight: relative to the no\-filter variant, it increases latency from 27\.7 to 29\.5 seconds per case and adds only one LLM call\. A per\-phase breakdown is provided in Appendix [H](https://arxiv.org/html/2605.29146#A8)\.

Table 5: Efficiency analysis\. We report average LLM calls, token usage, and wall\-clock latency per case\.

## 7 Conclusion

We presented SafeRx\-Agent, a knowledge\-grounded multi\-agent framework for safe and explainable medication recommendation\. We introduce a fine\-grained ATC\-L4 setting that preserves clinically important subgroup distinctions often hidden by coarser benchmarks\. SafeRx\-Agent routes each case to multiple specialty experts, grounds generation in patient context and indication evidence, and applies safety verification using DDI and contraindication resources\. Experiments on MIMIC\-III and MIMIC\-IV show that SafeRx\-Agent improves prediction accuracy over deep learning, LLM, and agentic baselines while reducing safety risks and controlling predicted set size\. Ablations further show that routing, summarization, evidence grounding, critique, and safety verification provide complementary gains\. We hope this work encourages more verifiable agentic systems for high\-stakes clinical NLP\.

## Limitations

SafeRx\-Agent evaluates medication safety using mapped external resources for drug interactions and disease\-conditioned contraindications\. These resource\-grounded metrics systematically assess safety risks in generated medication sets, while their coverage depends on the underlying knowledge sources and ATC\-L4 mapping\. Our safety and explainability analyses rely on structured evidence checks and traceable rationales, and future work may incorporate clinician review to assess their clinical validity\. Our evaluation is retrospective and offline; deploying the system in real clinical workflows would require prospective validation and physician oversight to ensure patient safety\.

## References

- Z\. Ali, Y\. Huang, I\. Ullah, J\. Feng, C\. Deng, N\. Thierry, A\. Khan, A\. U\. Jan, X\. Shen, W\. Rui, and G\. Qi \(2023\)Deep learning for medication recommendation: a systematic survey\.Data Intell\. 5\(2\),pp\. 303–354\(en\)\.Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p2.1),[§1](https://arxiv.org/html/2605.29146#S1.p3.1)\.
- M\. S\. Ankit Pal \(2024\)OpenBioLLMs: advancing open\-source large language models for healthcare and life sciences\.Hugging Face\.Note:[https://huggingface\.co/aaditya/OpenBioLLM\-Llama3\-70B](https://huggingface.co/aaditya/OpenBioLLM-Llama3-70B)Cited by:[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px2.p1.1),[§5\.2](https://arxiv.org/html/2605.29146#S5.SS2.p1.1)\.
- E\. Asgari, N\. Montaña\-Brown, M\. Dubois, S\. Khalil, J\. Balloch, J\. Au Yeung, and D\. Pimenta \(2025\)A framework to assess clinical safety and hallucination rates of llms for medical text summarisation\.npj Digital Medicine 8,pp\. 274\.External Links:[Document](https://dx.doi.org/10.1038/s41746-025-01670-7)Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p2.1),[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Chen, Z\. Cai, K\. Ji, X\. Wang, W\. Liu, R\. Wang, J\. Hou, and B\. Wang \(2024\)HuatuoGPT\-o1, towards medical complex reasoning with llms\.External Links:2412\.18925,[Link](https://arxiv.org/abs/2412.18925)Cited by:[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px2.p1.1),[§5\.2](https://arxiv.org/html/2605.29146#S5.SS2.p1.1)\.
- L\. Chen, W\. Zeng, Y\. Cai, K\. Feng, and K\. Chou \(2012\)Predicting anatomical therapeutic chemical \(atc\) classification of drugs by integrating chemical\-chemical interactions and similarities\.PLOS ONE 7,pp\. 1–7\.External Links:[Document](https://dx.doi.org/10.1371/journal.pone.0035254),[Link](https://doi.org/10.1371/journal.pone.0035254)Cited by:[Table 1](https://arxiv.org/html/2605.29146#S3.T1.1.1.2.1.1)\.
- X\. Chen, Y\. Jin, X\. Mao, L\. Wang, S\. Zhang, and T\. Chen \(2026\)RareAgents: autonomous multi\-disciplinary team for rare disease diagnosis and treatment\.InProceedings of the AAAI Conference on Artificial Intelligence,External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/36969/40931)Cited by:[Figure 13](https://arxiv.org/html/2605.29146#A9.F13),[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px3.p1.1),[§5\.2](https://arxiv.org/html/2605.29146#S5.SS2.p1.1)\.
- C\. Christophe, P\. K\. Kanithi, T\. Raha, S\. Khan, and M\. A\. Pimentel \(2024\)Med42\-v2: a suite of clinical llms\.External Links:2408\.06142,[Link](https://arxiv.org/abs/2408.06142)Cited by:[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px2.p1.1),[§5\.2](https://arxiv.org/html/2605.29146#S5.SS2.p1.1)\.
- C\. Fan, C\. Gao, W\. Shi, Y\. Gong, Z\. Zhao, and F\. Feng \(2026\)Fine\-grained list\-wise alignment for generative medication recommendation\.External Links:2505\.20218,[Link](https://arxiv.org/abs/2505.20218)Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p2.1),[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px2.p1.1)\.
- A\. N\. Farrag, A\. El\-Zeiny, and A\. M\. Ali \(2026\)Evaluating large language models for pharmacotherapy simulations: a mixed\-methods study\.npj Digital Medicine 9,pp\. 355\.External Links:[Document](https://dx.doi.org/10.1038/s41746-026-02626-1)Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p2.1),[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Garcia\-Gasulla, J\. Bayarri\-Planas, A\. K\. Gururajan, E\. Lopez\-Cuena, A\. Tormos, D\. Hinjos, P\. Bernabeu\-Perez, A\. Arias\-Duart, P\. A\. Martin\-Torres, M\. Gonzalez\-Mallo, S\. Alvarez\-Napagao, E\. Ayguadé\-Parra, and U\. Cortés \(2025\)The aloe family recipe for open and specialized healthcare llms\.External Links:2505\.04388,[Link](https://arxiv.org/abs/2505.04388)Cited by:[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px2.p1.1),[§5\.2](https://arxiv.org/html/2605.29146#S5.SS2.p1.1)\.
- P\. Hager, F\. Jungmann, R\. Holland, K\. Bhagat, I\. Hubrecht, M\. Knauer, J\. Vielhauer, M\. Makowski, R\. Braren, G\. Kaissis, and D\. Rueckert \(2024\)Evaluation and mitigation of the limitations of large language models in clinical decision\-making\.Nature Medicine 30,pp\. 2613–2622\.External Links:[Document](https://dx.doi.org/10.1038/s41591-024-03097-1)Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p2.1),[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Johnson, L\. Bulgarelli, T\. Pollard, B\. Gow, B\. Moody, S\. Horng, L\. A\. Celi, and R\. Mark \(2024\)MIMIC\-IV\.PhysioNet\.Note:Version 3\.1External Links:[Document](https://dx.doi.org/10.13026/kpb9-mt58),[Link](https://doi.org/10.13026/kpb9-mt58)Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p4.1)\.
- A\. Johnson, T\. Pollard, and R\. Mark \(2016\)MIMIC\-III Clinical Database\.PhysioNet\.Note:Version 1\.4External Links:[Document](https://dx.doi.org/10.13026/C2XW26),[Link](https://doi.org/10.13026/C2XW26)Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p4.1)\.
- T\. A\. Kass\-Hout, Z\. Xu, M\. Mohebbi, H\. Nelsen, A\. Baker, J\. Levine, E\. Johanson, and R\. A\. Bright \(2016\)OpenFDA: an innovative platform providing access to a wealth of FDA’s publicly available data\.J\. Am\. Med\. Inform\. Assoc\. 23\(3\),pp\. 596–600\(en\)\.Cited by:[§A\.4](https://arxiv.org/html/2605.29146#A1.SS4.p1.1),[Table 1](https://arxiv.org/html/2605.29146#S3.T1.4.4.2.1.1)\.
- Y\. Kim, C\. Park, H\. Jeong, Y\. S\. Chan, X\. Xu, D\. McDuff, H\. Lee, M\. Ghassemi, C\. Breazeal, and H\. W\. Park \(2024\)MDAgents: an adaptive collaboration of llms for medical decision\-making\.InAdvances in Neural Information Processing Systems,A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\),Vol\.37,pp\. 79410–79452\.External Links:[Document](https://dx.doi.org/10.52202/079017-2522),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/90d1fc07f46e31387978b88e7e057a31-Paper-Conference.pdf)Cited by:[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Kuang and M\. Xie \(2024\)DrugDoctor: enhancing drug recommendation in cold\-start scenario via visit\-level representation learning and training\.Briefings in Bioinformatics25\(6\),pp\. bbae464\.External Links:ISSN 1477\-4054,[Document](https://dx.doi.org/10.1093/bib/bbae464),[Link](https://doi.org/10.1093/bib/bbae464),https://academic\.oup\.com/bib/article\-pdf/25/6/bbae464/59242513/bbae464\.pdfCited by:[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px1.p1.1),[§5\.2](https://arxiv.org/html/2605.29146#S5.SS2.p1.1)\.
- B\. Li, T\. Yan, Y\. Pan, J\. Luo, R\. Ji, J\. Ding, Z\. Xu, S\. Liu, H\. Dong, Z\. Lin, and Y\. Wang \(2024\)MMedAgent: learning to use medical tools with multi\-modal agent\.External Links:2407\.02483,[Link](https://arxiv.org/abs/2407.02483)Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p2.1)\.
- Q\. Liu, X\. Wu, X\. Zhao, Y\. Zhu, Z\. Zhang, F\. Tian, and Y\. Zheng \(2025\)Large language model distilling medication recommendation model\.External Links:2402\.02803,[Link](https://arxiv.org/abs/2402.02803)Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p2.1),[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Shang, C\. Xiao, T\. Ma, H\. Li, and J\. Sun \(2019\)GAMENet: graph augmented memory networks for recommending medication combination\.Proceedings of the AAAI Conference on Artificial Intelligence 33\(01\),pp\. 1126–1133\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/3905),[Document](https://dx.doi.org/10.1609/aaai.v33i01.33011126)Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p1.1),[§1](https://arxiv.org/html/2605.29146#S1.p2.1),[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px1.p1.1),[§5\.2](https://arxiv.org/html/2605.29146#S5.SS2.p1.1)\.
- S\. J\. Steindel \(2010\)International classification of diseases, 10th edition, clinical modification and procedure coding system: descriptive overview of the next generation HIPAA code sets\.J\. Am\. Med\. Inform\. Assoc\.17\(3\),pp\. 274–282\(en\)\.Cited by:[Table 1](https://arxiv.org/html/2605.29146#S3.T1.4.6.1.1.1.1)\.
- X\. Tang, A\. Zou, Z\. Zhang, Z\. Li, Y\. Zhao, X\. Zhang, A\. Cohan, and M\. Gerstein \(2024\)MedAgents: large language models as collaborators for zero\-shot medical reasoning\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 599–621\.External Links:[Link](https://aclanthology.org/2024.findings-acl.33/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.33)Cited by:[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px3.p1.1)\.
- N\. P\. Tatonetti, P\. P\. Ye, R\. Daneshjou, and R\. B\. Altman \(2012\)Data\-driven prediction of drug effects and interactions\.Sci\. Transl\. Med\. 4\(125\),pp\. 125ra31\(en\)\.Cited by:[§A\.3](https://arxiv.org/html/2605.29146#A1.SS3.p1.1),[Table 1](https://arxiv.org/html/2605.29146#S3.T1.3.3.2.1.1)\.
- G\. Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière, L\. Rouillard, T\. Mesnard, G\. Cideron, J\. Grill, S\. Ramos, E\. Yvinec, M\. Casbon, E\. Pot, I\. Penchev, G\. Liu, F\. Visin, K\. Kenealy, L\. Beyer, X\. Zhai, A\. Tsitsulin, R\. Busa\-Fekete, A\. Feng, N\. Sachdeva, B\. Coleman, Y\. Gao, B\. Mustafa, I\. Barr, E\. Parisotto, D\. Tian, M\. Eyal, C\. Cherry, J\. Peter, D\. Sinopalnikov, S\. Bhupatiraju, R\. Agarwal, M\. Kazemi, D\. Malkin, R\. Kumar, D\. Vilar, I\. Brusilovsky, J\. Luo, A\. Steiner, A\. Friesen, A\. Sharma, A\. Sharma, A\. M\. Gilady, A\. Goedeckemeyer, A\. Saade, A\. Feng, A\. Kolesnikov, A\. Bendebury, A\. Abdagic, A\. Vadi, A\. György, A\. S\. Pinto, A\. Das, A\. Bapna, A\. Miech, A\. Yang, A\. Paterson, A\. Shenoy, A\. Chakrabarti, B\. Piot, B\. Wu, B\. Shahriari, B\. Petrini, C\. Chen, C\. L\. Lan, C\. A\. Choquette\-Choo, C\. Carey, C\. Brick, D\. Deutsch, D\. Eisenbud, D\. Cattle, D\. Cheng, D\. Paparas, D\. S\. Sreepathihalli, D\. Reid, D\. Tran, D\. Zelle, E\. Noland, E\. Huizenga, E\. Kharitonov, F\. Liu, G\. Amirkhanyan, G\. Cameron, H\. Hashemi, H\. Klimczak\-Plucińska, H\. Singh, H\. Mehta, H\. T\. Lehri, H\. Hazimeh, I\. Ballantyne, I\. Szpektor, I\. Nardini, J\. Pouget\-Abadie, J\. Chan, J\. Stanton, J\. Wieting, J\. Lai, J\. Orbay, J\. Fernandez, J\. Newlan, J\. Ji, J\. Singh, K\. Black, K\. Yu, K\. Hui, K\. Vodrahalli, K\. Greff, L\. Qiu, M\. Valentine, M\. Coelho, M\. Ritter, M\. Hoffman, M\. Watson, M\. Chaturvedi, M\. Moynihan, M\. Ma, N\. Babar, N\. Noy, N\. Byrd, N\. Roy, N\. Momchev, N\. Chauhan, N\. Sachdeva, O\. Bunyan, P\. Botarda, P\. Caron, P\. K\. Rubenstein, P\. Culliton, P\. Schmid, P\. G\. Sessa, P\. Xu, P\. Stanczyk, P\. Tafti, R\. Shivanna, R\. Wu, R\. Pan, R\. Rokni, R\. Willoughby, R\. Vallu, R\. Mullins, S\. Jerome, S\. Smoot, S\. Girgin, S\. Iqbal, S\. Reddy, S\. Sheth, S\. Põder, S\. Bhatnagar, S\. R\. Panyam, S\. Eiger, S\. 
Zhang, T\. Liu, T\. Yacovone, T\. Liechty, U\. Kalra, U\. Evci, V\. Misra, V\. Roseberry, V\. Feinberg, V\. Kolesnikov, W\. Han, W\. Kwon, X\. Chen, Y\. Chow, Y\. Zhu, Z\. Wei, Z\. Egyed, V\. Cotruta, M\. Giang, P\. Kirk, A\. Rao, K\. Black, N\. Babar, J\. Lo, E\. Moreira, L\. G\. Martins, O\. Sanseviero, L\. Gonzalez, Z\. Gleicher, T\. Warkentin, V\. Mirrokni, E\. Senter, E\. Collins, J\. Barral, Z\. Ghahramani, R\. Hadsell, Y\. Matias, D\. Sculley, S\. Petrov, N\. Fiedel, N\. Shazeer, O\. Vinyals, J\. Dean, D\. Hassabis, K\. Kavukcuoglu, C\. Farabet, E\. Buchatskaya, J\. Alayrac, R\. Anil, D\. Lepikhin, S\. Borgeaud, O\. Bachem, A\. Joulin, A\. Andreev, C\. Hardin, R\. Dadashi, and L\. Hussenot \(2025\)Gemma 3 technical report\.External Links:2503\.19786,[Link](https://arxiv.org/abs/2503.19786)Cited by:[§5\.2](https://arxiv.org/html/2605.29146#S5.SS2.p1.1)\.
- W\. Wei, R\. M\. Cronin, H\. Xu, T\. A\. Lasko, L\. Bastarache, and J\. C\. Denny \(2013\)Development and evaluation of an ensemble resource linking medications to their indications\.J\. Am\. Med\. Inform\. Assoc\. 20\(5\),pp\. 954–961\(en\)\.Cited by:[Table 1](https://arxiv.org/html/2605.29146#S3.T1.2.2.2.1.1)\.
- WHO Collaborating Centre for Drug Statistics Methodology \(WHOCC\) \(2026\)Guidelines for ATC classification and DDD assignment\.Oslo\.Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p3.1)\.
- R\. Wu, Z\. Qiu, J\. Jiang, G\. Qi, and X\. Wu \(2022\)Conditional generation net for medication recommendation\.InProceedings of the ACM Web Conference 2022,WWW ’22,New York, NY, USA,pp\. 935–945\.External Links:ISBN 9781450390965,[Link](https://doi.org/10.1145/3485447.3511936),[Document](https://dx.doi.org/10.1145/3485447.3511936)Cited by:[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Xu, X\. Xi, J\. Chen, V\. S\. Sheng, J\. Ma, and Z\. Cui \(2022\)A survey of deep learning for electronic health records\.Applied Sciences 12\(22\),pp\. 11709\.Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§5\.2](https://arxiv.org/html/2605.29146#S5.SS2.p1.1)\.
- C\. Yang, C\. Xiao, F\. Ma, L\. Glass, and J\. Sun \(2021\)SafeDrug: dual molecular graph encoders for recommending effective and safe drug combinations\.InProceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI\-21,Z\. Zhou \(Ed\.\),pp\. 3735–3741\.Note:Main TrackExternal Links:[Document](https://dx.doi.org/10.24963/ijcai.2021/514),[Link](https://doi.org/10.24963/ijcai.2021/514)Cited by:[§1](https://arxiv.org/html/2605.29146#S1.p2.1),[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px1.p1.1),[§5\.2](https://arxiv.org/html/2605.29146#S5.SS2.p1.1)\.
- N\. Yang, K\. Zeng, Q\. Wu, and J\. Yan \(2023\)MoleRec: combinatorial drug recommendation with substructure\-aware molecular representation learning\.InProceedings of the ACM Web Conference 2023,WWW ’23,New York, NY, USA,pp\. 4075–4085\.External Links:ISBN 9781450394161,[Link](https://doi.org/10.1145/3543507.3583872),[Document](https://dx.doi.org/10.1145/3543507.3583872)Cited by:[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Zhang, S\. Zeng, E\. Hua, N\. Ding, Z\. Chen, Z\. Ma, H\. Li, G\. Cui, B\. Qi, X\. Zhu, X\. Lv, J\. Hu, Z\. Liu, and B\. Zhou \(2024\)UltraMedical: building specialized generalists in biomedicine\.InAdvances in Neural Information Processing Systems,A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\),Vol\.37,pp\. 26045–26081\.External Links:[Document](https://dx.doi.org/10.52202/079017-0819),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/2dfc26ce9039f00eee4aba0c54931e46-Paper-Datasets_and_Benchmarks_Track.pdf)Cited by:[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px2.p1.1),[§5\.2](https://arxiv.org/html/2605.29146#S5.SS2.p1.1)\.
- Z\. Zhang, H\. Liu, X\. Guo, T\. Sun, and Z\. Wu \(2026\)Knowledge\-enhanced explainable HyperGraph convolution network for medication recommendation\.Proc\. Conf\. AAAI Artif\. Intell\. 40\(19\),pp\. 16424–16432\.Cited by:[§5\.2](https://arxiv.org/html/2605.29146#S5.SS2.p1.1)\.
- Z\. Zhao, C\. Fan, J\. Liu, Z\. Wang, X\. He, C\. Gao, J\. Li, and F\. Feng \(2025\)Fine\-grained alignment of large language models for general medication recommendation without overprescription\.External Links:2503\.03687,[Link](https://arxiv.org/abs/2503.03687)Cited by:[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Zhao, Y\. Jing, F\. Feng, J\. Wu, C\. Gao, and X\. He \(2024\)Leave no patient behind: enhancing medication recommendation for rare disease patients\.InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval,SIGIR ’24,New York, NY, USA,pp\. 533–542\.External Links:ISBN 9798400704314,[Link](https://doi.org/10.1145/3626772.3657785),[Document](https://dx.doi.org/10.1145/3626772.3657785)Cited by:[§2](https://arxiv.org/html/2605.29146#S2.SS0.SSS0.Px1.p1.1)\.

## Appendix A Details About Clinical Domain Knowledge

### A\.1 Construction of Medical Ontologies

We organize diagnosis and medication concepts into hierarchical ontologies to align EHR codes with external clinical knowledge\. Diagnosis codes follow the ICD taxonomy based on CCS hierarchy, and medication codes follow the ATC taxonomy\. Each code is assigned to a unique path from a broad category to a fine\-grained leaf code\. This path records the semantic lineage of the code and provides a consistent representation for evidence alignment, medication vocabulary mapping, and safety evaluation\.

#### Diagnosis\.

Diagnostic concepts are organized using the CCS framework developed by the Agency for Healthcare Research and Quality \(AHRQ\), which groups ICD\-9\-CM and ICD\-10\-CM diagnosis codes into clinically coherent categories at multiple levels of granularity\. We construct a five\-level hierarchy from CCS major categories down to decimal\-level ICD codes, ensuring that each ICD code maps to a unique hierarchical path\.

#### Medication\.

Medication concepts are organized using the ATC classification system, which defines a five\-level hierarchy from anatomical main groups down to chemical substances and encodes pharmacological relationships among drugs\.
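As a small illustration of the "unique path" idea for medications, the sketch below derives the ancestor path of an ATC code from its prefix structure. It assumes the standard ATC code lengths of 1, 3, 4, 5, and 7 characters for levels 1 to 5. The helper name `atc_path` is hypothetical and is not taken from the paper's code.

```python
# Character lengths of ATC codes at levels 1..5 (e.g. C, C09, C09A, C09AA, C09AA01).
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_path(code: str) -> list:
    """Return the path from the anatomical main group down to the given code."""
    if len(code) not in ATC_LEVEL_LENGTHS:
        raise ValueError(f"not a valid ATC code length: {code!r}")
    return [code[:n] for n in ATC_LEVEL_LENGTHS if n <= len(code)]


# A level-5 code inside the C09AA (ACE inhibitors, plain) level-4 group.
full_path = atc_path("C09AA01")
l4_path = atc_path("C09AA")
```

Because every prefix is itself a valid ancestor code, the path is unique by construction, and truncating it at level 4 gives the vocabulary used for prediction.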

Per\-level concept counts for both taxonomies are reported in Table [6](https://arxiv.org/html/2605.29146#A1.T6)\.

Table 6: Hierarchical taxonomy of diagnosis and medication concepts used in this study\. Each level contains two attributes: *Name*, which specifies the semantic grouping at that level, and *\# Concepts*, the number of unique codes at that level\. Diagnosis categories are derived from ICD\-CM codes mapped through the CCS diagnostic hierarchy, and medications follow the five\-level ATC ontology\. For ICD\-CM levels, *\# Concepts* is reported as the number of ICD\-9\-CM and ICD\-10\-CM codes shown in parentheses, respectively\.

### A\.2 Medication Indication

MEDI is an ensemble medication indication resource for primary and secondary uses of EHR data\. MEDI was created from multiple commonly used medication resources \(RxNorm, MedlinePlus, SIDER 4\.1, Mayo Clinic, WebMD, and Wikipedia\) by leveraging both ontology and NLP techniques\. The current release of MEDI contains 3,031 medications and 186,064 indications, including both ICD\-9\-CM and ICD\-10\-CM indication pairs\.

### A\.3 Drug\-Drug Interactions

DDIs describe medication pairs with known co\-prescription risks\. We construct DDI resources from the TWOSIDES knowledge base Tatonetti et al\. \([2012](https://arxiv.org/html/2605.29146#bib.bib17)\)\. TWOSIDES contains drug\-pair interaction records with source drug identifiers\. We map these identifiers to ATC\-L4 medication codes and aggregate the mapped records over the ATC\-L4 medication vocabulary used in this study\. This mapping produces two DDI matrices\.

The binary DDI matrix records whether an ATC\-L4 medication pair has at least one known interaction record in TWOSIDES\. If any source drug pair mapped to the same ATC\-L4 pair has a recorded interaction, the corresponding binary entry is set to one; otherwise, it is set to zero\. This matrix is used to compute binary DDI rates\.

The weighted DDI matrix records the observed interaction frequency for each ATC\-L4 medication pair\. If multiple source drug pairs map to the same ATC\-L4 pair, we sum their frequencies to obtain one pair\-level weight\. This matrix is used to compute frequency\-weighted DDI rates, assigning larger penalties to medication pairs with more frequently observed interaction records\.

We symmetrize both matrices because DDI risks are defined over unordered medication pairs\. We exclude diagonal entries when computing pairwise DDI rates\.
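The aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. The record layout `(drug_a, drug_b, frequency)`, the mapping dictionary, and the drug identifiers are hypothetical. The matrices are stored sparsely as dictionaries keyed by sorted code pairs, which makes them symmetric by construction. Same-code pairs are skipped at build time as a simplification of "excluding diagonal entries".

```python
from collections import defaultdict


def build_ddi_tables(records, drug_to_atc4):
    """Aggregate source-level interaction records into ATC-L4 pair tables.

    records: iterable of (drug_a, drug_b, frequency) tuples (hypothetical layout).
    drug_to_atc4: dict mapping a source drug id to a list of ATC-L4 codes.
    Returns (binary, weighted) dicts keyed by sorted (code_a, code_b) tuples.
    """
    weighted = defaultdict(float)
    for drug_a, drug_b, freq in records:
        for code_a in drug_to_atc4.get(drug_a, []):      # unmapped drugs are dropped
            for code_b in drug_to_atc4.get(drug_b, []):
                if code_a == code_b:
                    continue                             # skip diagonal entries
                pair = tuple(sorted((code_a, code_b)))   # unordered pair
                weighted[pair] += freq                   # sum frequencies per pair
    binary = {pair: 1 for pair in weighted}              # at least one record exists
    return binary, dict(weighted)


# Toy data: two source pairs collapse onto the same ATC-L4 pair; one drug is unmapped.
toy_records = [("drug1", "drug2", 3.0), ("drug3", "drug2", 2.0), ("drug1", "drug9", 7.0)]
toy_mapping = {"drug1": ["A01AA"], "drug3": ["A01AA"], "drug2": ["B01AB"]}

binary_ddi, weighted_ddi = build_ddi_tables(toy_records, toy_mapping)
```

Here `drug1`–`drug2` and `drug3`–`drug2` both land on the pair `("A01AA", "B01AB")`, so its binary entry is 1 and its weight is the summed frequency.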

### A\.4 Contraindications from openFDA

Contraindications specify clinical situations in which a medication may be unsafe because potential harms outweigh expected benefits\. The openFDA Drug Labeling dataset provides programmatic access to FDA Structured Product Labeling \(SPL\) submissions for prescription and over\-the\-counter drugs Kass\-Hout et al\. \([2016](https://arxiv.org/html/2605.29146#bib.bib18)\)\. SPL records are organized into clinically meaningful fields, such as indications, contraindications, adverse reactions, and warnings, although field coverage and text format vary across products\. The openFDA Drug Labeling data are updated over time as new safety and effectiveness information becomes available\.

We use the openFDA Drug Labeling snapshot last updated on 2026\-03\-04\. To construct the ATC\-L4 contraindication knowledge database, we first merge all 13 openFDA Drug Labeling files from [https://open\.fda\.gov/data/downloads/](https://open.fda.gov/data/downloads/)\. We retain three fields: substance\_name, RxCUI, and contraindications\. We remove records without RxCUI \(193,958 entries\) and records without contraindications \(27,769 entries\)\. For duplicate single\-RxCUI records, we group records by RxCUI and retain the record with the longest contraindication text, measured by $\mathrm{len}(\texttt{contraindications})$\. This step reduces single\-RxCUI records from 22,382 to 2,765 unique entries\. We then remove all 11,152 multi\-RxCUI records to avoid ambiguous drug\-to\-code mappings\. Finally, we map the 2,765 single\-RxCUI records to ATC\-L4 codes and retain only successfully mapped entries\. This process yields 4,434 RxCUI\-ATC\-L4 pairs, corresponding to 2,136 unique RxCUI drugs and 400 unique ATC\-L4 codes\.
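The filtering and deduplication steps can be sketched roughly as below. The record layout (a dict with a list-valued `rxcui` key and a string `contraindications` key) is an assumption made for illustration and does not reproduce the exact openFDA JSON schema. The RxCUI values are placeholders. The steps are applied in a single pass, which gives the same surviving records as the sequential description.

```python
def clean_label_records(records):
    """Keep one record per single-RxCUI drug: the one with the longest
    contraindication text. Records lacking an RxCUI or contraindication text,
    and records listing several RxCUIs, are dropped."""
    best = {}
    for rec in records:
        rxcuis = rec.get("rxcui") or []
        text = rec.get("contraindications") or ""
        if not rxcuis or not text:
            continue                      # missing RxCUI or contraindications
        if len(rxcuis) > 1:
            continue                      # ambiguous multi-RxCUI record
        key = rxcuis[0]
        if key not in best or len(text) > len(best[key]["contraindications"]):
            best[key] = rec               # retain the longest text per RxCUI
    return best


# Toy records with placeholder identifiers.
toy_labels = [
    {"substance_name": "drug A", "rxcui": ["111"], "contraindications": "short text"},
    {"substance_name": "drug A", "rxcui": ["111"], "contraindications": "a much longer text"},
    {"substance_name": "drug B", "rxcui": ["222", "333"], "contraindications": "combo"},
    {"substance_name": "drug C", "rxcui": [], "contraindications": "no code"},
    {"substance_name": "drug D", "rxcui": ["444"], "contraindications": ""},
]

cleaned = clean_label_records(toy_labels)
```

Only the longer of the two `111` records survives. The multi-RxCUI, code-less, and text-less records are removed before any mapping to ATC-L4 would take place.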

## Appendix B Expert Construction

This appendix describes how we construct the expert set used by SafeRx\-Agent\. The expert taxonomy is derived in two stages\. First, we serialize each patient case using the structured EHR\-to\-text template in Figure [4](https://arxiv.org/html/2605.29146#A2.F4)\. We encode the serialized cases into patient\-level text embeddings and sweep the number of clusters over these embeddings\. Second, after selecting the best cluster count, we inspect the diagnosis composition, ICD\-10 chapter distribution, and representative cases of each cluster to assign clinically interpretable expert names and chapter mappings\. In addition to these cluster\-derived experts, we introduce an always\-on universal expert to cover ICU\-common supportive\-care medications that are not specific to a single diagnosis chapter\.

#### EHR\-to\-text serialization\.

Each structured EHR case is serialized into natural language for LLM\-based methods \(Section [4](https://arxiv.org/html/2605.29146#S4) and the LLM/Agent baselines in Section [5](https://arxiv.org/html/2605.29146#S5)\)\. The serialized input contains the patient profile, historical visits \(diagnoses, procedures, prescribed medications\), and the target visit \(diagnoses and procedures only\)\. Each clinical concept is paired with its original code to preserve both textual semantics and code\-level grounding\. The full template is shown in Fig\. [4](https://arxiv.org/html/2605.29146#A2.F4)\.

EHR\-to\-text serialization template

Patient profile\. The patient’s age is \[AGE\] and gender is \[GENDER\]\. The patient’s insurance type is \[INSURANCE\], language is \[LANGUAGE\], admission type is \[ADMISSION TYPE\], marital status is \[MARITAL STATUS\], and race is \[RACE\]\. The patient has \[T\] ICU visits\.

Historical visit\. In visit \[t\], the patient had diagnoses: \[DIAGNOSIS DESCRIPTION\] \(\[ICD CODE\]\), …; procedures: \[PROCEDURE DESCRIPTION\] \(\[PROCEDURE CODE\]\), …\. The patient was prescribed drugs: \[DRUG NAME\] \(\[ATC\-L4 CODE\]\), …\.

Target visit\. In this visit, the patient has diagnoses: \[DIAGNOSIS DESCRIPTION\] \(\[ICD CODE\]\), …; procedures: \[PROCEDURE DESCRIPTION\] \(\[PROCEDURE CODE\]\), …\. Then, the patient should be prescribed:

Figure 4: Structured EHR\-to\-text serialization template\. Textual descriptions are paired with structured codes to preserve both clinical semantics and code\-level grounding\.
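A rough sketch of how such a template might be filled in is given below. The dictionary keys, the function names, and the example values (including the placeholder procedure code) are assumptions for illustration. Only the sentence wording follows the template above.

```python
def fmt_concepts(items):
    """Render (description, code) pairs as 'description (code)', comma-separated."""
    return ", ".join(f"{desc} ({code})" for desc, code in items)


def serialize_case(profile, history, target):
    """Fill the EHR-to-text template for one patient case."""
    parts = [
        f"The patient's age is {profile['age']} and gender is {profile['gender']}. "
        f"The patient's insurance type is {profile['insurance']}, "
        f"language is {profile['language']}, "
        f"admission type is {profile['admission_type']}, "
        f"marital status is {profile['marital_status']}, "
        f"and race is {profile['race']}. "
        f"The patient has {profile['n_icu_visits']} ICU visits."
    ]
    for t, visit in enumerate(history, start=1):
        parts.append(
            f"In visit {t}, the patient had diagnoses: {fmt_concepts(visit['diagnoses'])}; "
            f"procedures: {fmt_concepts(visit['procedures'])}. "
            f"The patient was prescribed drugs: {fmt_concepts(visit['drugs'])}."
        )
    parts.append(
        f"In this visit, the patient has diagnoses: {fmt_concepts(target['diagnoses'])}; "
        f"procedures: {fmt_concepts(target['procedures'])}. "
        "Then, the patient should be prescribed:"
    )
    return " ".join(parts)


# Toy case; "PROC-1" is a placeholder, not a real procedure code.
toy_profile = {
    "age": 67, "gender": "F", "insurance": "Medicare", "language": "English",
    "admission_type": "EMERGENCY", "marital_status": "MARRIED", "race": "UNKNOWN",
    "n_icu_visits": 2,
}
toy_visit = {
    "diagnoses": [("Essential (primary) hypertension", "I10")],
    "procedures": [("Example procedure", "PROC-1")],
    "drugs": [("ACE inhibitors, plain", "C09AA")],
}
toy_target = {"diagnoses": toy_visit["diagnoses"], "procedures": toy_visit["procedures"]}

prompt_text = serialize_case(toy_profile, [toy_visit], toy_target)
```

The target visit deliberately omits medications, so the text ends on the open prompt that the model is expected to complete.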
### B\.1 Cluster\-Derived Specialty Experts

We construct patient\-level routing features from diagnosis information and apply K\-means clustering with different values of $K$\. Specifically, we sweep $K\in\{2,3,\ldots,10\}$ and select the value with the best silhouette score\. The sweep identifies $K=7$ as the most separable solution, achieving a silhouette score of $0.45$\. This suggests that the patient population is best organized into seven coarse clinical groups under the constructed routing representation\.
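The selection procedure (run K-means for each candidate $K$, then keep the $K$ with the highest silhouette score) can be illustrated with a dependency-free sketch. Everything concrete here is an assumption for illustration: the toy 2-D points stand in for the routing features, the deterministic farthest-point seeding replaces whatever initialization the authors used, and the sweep covers only $K\in\{2,3,4\}$.

```python
from math import dist


def farthest_point_init(points, k):
    """Deterministic seeding: first point, then repeatedly the point farthest
    from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=50):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the old center if empty
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)   # mean intra-cluster distance
        b = min(sum(dist(p, q) for q in other) / len(other)  # nearest other cluster
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


# Three well-separated toy blobs of four points each.
toy_points = [(x + dx, y + dy)
              for (x, y) in [(0, 0), (10, 0), (0, 10)]
              for (dx, dy) in [(0, 0), (1, 0), (0, 1), (1, 1)]]

scores = {k: silhouette(toy_points, kmeans(toy_points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
```

On this toy data the sweep recovers the three blobs. Merging two blobs ($K=2$) or splitting one ($K=4$) both lower the average silhouette, which is the behavior the model-selection criterion relies on.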

Figure 5 visualizes the resulting clusters\. The centroid–domain heatmap \(Figure 5, top\) shows that each of the seven clusters concentrates its mass on a single clinical domain, with off\-diagonal mass remaining low; this one\-cluster\-per\-domain structure motivates a one\-to\-one cluster\-to\-expert assignment\. The PCA projection of single\-domain patients \(Figure [5\(b\)](https://arxiv.org/html/2605.29146#A2.F4.sf2)\) confirms that patients with one dominant clinical domain separate visibly along the first two principal components \(53\.9% \+ 18\.5% explained variance\), with obstetrics/perinatal forming a clearly distinct branch and the remaining six domains spread along the second axis\.

![Refer to caption](https://arxiv.org/html/2605.29146v1/figures/patient_cluster_pca.png)

\(b\) Single\-domain patients projected onto the first two principal components of the routing feature space, colored by dominant clinical domain\.

Figure 5: Cluster–domain structure of the routing feature space\. The centroid heatmap \(top\) and PCA projection \(bottom\) jointly justify the $K=7$ partition and the cluster\-to\-expert mapping in Table [7](https://arxiv.org/html/2605.29146#A2.T7)\.

We then examine each cluster’s dominant diagnosis chapters, diagnosis descriptions, and representative cases\. Based on this inspection, we assign each cluster a clinically interpretable expert name and map it to the corresponding ICD\-10 chapter group\. Table [7](https://arxiv.org/html/2605.29146#A2.T7) summarizes the resulting expert taxonomy\.

Table 7: Cluster\-derived specialty experts and their ICD\-10 chapter mappings\. The expert names and chapter mappings are curated after selecting $K=7$ from the clustering sweep\.

## Appendix C Dataset Statistics

Table 8: Dataset statistics after preprocessing\.

## Appendix D Implementation Details

This appendix provides additional implementation details for LLM inference, agent\-side ablation settings, and prompt templates\. All LLM\-based methods are evaluated with the same serialized EHR input format, closed ATC\-L4 medication vocabulary, and evaluation script unless otherwise specified\.

### D\.1 Inference Settings

Table [9](https://arxiv.org/html/2605.29146#A4.T9) summarizes the model\-level inference settings used in our experiments\. We generate one medication set per case\. Closed\-source models are accessed through their official APIs, while open\-source models are served locally via vLLM\. For backbone LLMs, we use temperature 0\.0 for deterministic stages \(summarization, critique, and safety verification\) and 0\.2 for the expert generation stage to encourage diverse medication proposals across specialists\.

Table 9: Inference settings for LLM\-based methods\. The Temp\. column shows temperatures for deterministic / generation stages; medical LLM baselines use only direct prompting \(single stage\)\. Max model length denotes the maximum context length used for vLLM serving\.

## Appendix E Evaluation Metrics

We evaluate each prediction along two axes: *accuracy* and *safety*\. Prediction accuracy is measured using Jaccard, precision, recall, and F1\. Safety is measured using GT\-normalized drug–drug interaction rates and drug–diagnosis contraindication rates\. Let $N$ denote the number of test visits; $\mathcal{M}^{(t)}$ and $\hat{\mathcal{M}}^{(t)}$ denote the ground\-truth and predicted ATC\-L4 medication sets for visit $t$; and $\mathcal{D}^{(t)}$ denote the diagnosis set used for contraindication screening at that visit\.

### E\.1 Prediction Accuracy Metrics

#### Jaccard\.

We report the sample\-averaged Jaccard similarity:

$$\mathrm{Jaccard}=\frac{1}{N}\sum_{t=1}^{N}\frac{\left|\mathcal{M}^{(t)}\cap\hat{\mathcal{M}}^{(t)}\right|}{\left|\mathcal{M}^{(t)}\cup\hat{\mathcal{M}}^{(t)}\right|}. \tag{1}$$

#### Precision, Recall, and F1\.

We report micro\-averaged precision, recall, and F1 by pooling true positives, false positives, and false negatives across all visits:

$$\mathrm{Precision}=\frac{\sum_{t=1}^{N}\left|\mathcal{M}^{(t)}\cap\hat{\mathcal{M}}^{(t)}\right|}{\sum_{t=1}^{N}\left|\hat{\mathcal{M}}^{(t)}\right|}, \tag{2}$$

$$\mathrm{Recall}=\frac{\sum_{t=1}^{N}\left|\mathcal{M}^{(t)}\cap\hat{\mathcal{M}}^{(t)}\right|}{\sum_{t=1}^{N}\left|\mathcal{M}^{(t)}\right|}, \tag{3}$$

$$\mathrm{F1}=\frac{2\cdot\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}. \tag{4}$$

Micro averaging weights each drug\-level decision equally and avoids giving the same weight to visits with very small and very large medication sets\.
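A minimal sketch of these accuracy metrics follows, using Python sets for the per-visit medication sets. The function names and toy codes are illustrative. Returning 0 when a denominator is empty is a convention assumed here, since the text does not specify one.

```python
def jaccard_mean(gold_sets, pred_sets):
    """Sample-averaged Jaccard similarity over visits (Eq. 1)."""
    scores = []
    for gold, pred in zip(gold_sets, pred_sets):
        union = gold | pred
        scores.append(len(gold & pred) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


def micro_prf(gold_sets, pred_sets):
    """Micro-averaged precision, recall, and F1 (Eqs. 2-4): counts are pooled
    across all visits before the ratios are taken."""
    tp = sum(len(gold & pred) for gold, pred in zip(gold_sets, pred_sets))
    n_pred = sum(len(pred) for pred in pred_sets)
    n_gold = sum(len(gold) for gold in gold_sets)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two toy visits with placeholder medication codes.
toy_gold = [{"A", "B", "C", "D"}, {"E"}]
toy_pred = [{"A", "B"}, {"E", "F"}]

toy_jaccard = jaccard_mean(toy_gold, toy_pred)           # (2/4 + 1/2) / 2
toy_precision, toy_recall, toy_f1 = micro_prf(toy_gold, toy_pred)
```

In the toy case both visits have Jaccard 0.5, while the pooled counts give precision 3/4 and recall 3/5. The larger visit therefore carries more weight in the micro-averaged scores than the single-drug visit.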

### E.2 Prediction Safety Metrics

#### GT-normalized DDI rate.

Drug–drug interactions are represented by a symmetric matrix $\mathbf{M}_{\mathrm{DDI}}\in[0,1]^{|\mathcal{V}_{\mathrm{med}}|\times|\mathcal{V}_{\mathrm{med}}|}$ indexed over the ATC-L4 medication vocabulary. We report two variants: DDI-B uses a binary matrix indicating whether any documented interaction exists, while DDI-W uses a weighted matrix with normalized interaction weights.

For each visit $t$, we first compute the predicted interaction burden:

$$I_{\mathrm{DDI}}^{(t)}=\sum_{\begin{subarray}{c}i<j\\ i,j\in\hat{\mathcal{M}}^{(t)}\end{subarray}}\mathbf{M}_{\mathrm{DDI}}[i,j]. \tag{5}$$

We normalize this burden by the number of possible ground-truth drug pairs:

$$Z_{\mathrm{DDI}}^{(t)}=\max\!\left(\binom{|\mathcal{M}^{(t)}|}{2},1\right). \tag{6}$$

The per-visit GT-normalized DDI score is:

$$s_{\mathrm{DDI}}^{(t)}=\min\!\left(\frac{I_{\mathrm{DDI}}^{(t)}}{Z_{\mathrm{DDI}}^{(t)}},1\right). \tag{7}$$

We report the dataset-level DDI score as:

$$\mathrm{DDI}_{\mathrm{GTnorm}}=\frac{1}{N}\sum_{t=1}^{N}s_{\mathrm{DDI}}^{(t)}. \tag{8}$$

Unlike the conventional DDI rate, which normalizes by the number of predicted drug pairs, this metric normalizes the predicted interaction burden by the expected prescription complexity of the corresponding visit. This prevents large prediction sets from artificially lowering the DDI score by adding many non-interacting pairs. The per-visit score is capped at $1$ so that visits with excessive interaction burden are treated as maximally unsafe without dominating the dataset average. Lower values indicate safer predictions.
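A minimal Python sketch of Equations (5)–(8) follows. It assumes, for illustration only, that the symmetric DDI matrix is stored as a dictionary keyed by unordered code pairs (`frozenset`), with absent pairs contributing zero; the function name `ddi_gt_norm` and this storage format are hypothetical and not taken from the paper's code.

```python
from itertools import combinations
from math import comb
from typing import Dict, FrozenSet, List, Set


def ddi_gt_norm(
    gt: List[Set[str]],
    pred: List[Set[str]],
    ddi: Dict[FrozenSet[str], float],
) -> float:
    """Dataset-level GT-normalized DDI score (Eq. 8)."""
    scores = []
    for g, p in zip(gt, pred):
        # Eq. 5: sum interaction weights over unordered pairs of predicted codes.
        burden = sum(ddi.get(frozenset(pair), 0.0) for pair in combinations(sorted(p), 2))
        # Eq. 6: number of ground-truth pairs, floored at 1.
        z = max(comb(len(g), 2), 1)
        # Eq. 7: cap the per-visit score at 1.
        scores.append(min(burden / z, 1.0))
    return sum(scores) / len(scores)


# Toy binary matrix with a single interacting pair (placeholder codes).
toy_ddi = {frozenset({"A", "B"}): 1.0}
toy_gt = [{"A", "B", "C"}, {"A"}]
toy_pred = [{"A", "B"}, {"A", "B"}]
ddi_score = ddi_gt_norm(toy_gt, toy_pred, toy_ddi)
```

In the first toy visit the single interaction is divided by the three ground-truth pairs (score 1/3); in the second, the ground truth has no pairs, so the normalizer falls back to 1 and the score is 1.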

#### GT-normalized contraindication rate.

Drug–diagnosis contraindications are represented by a bipartite matrix $\mathbf{M}_{\mathrm{Contra}}\in[0,1]^{|\mathcal{V}_{\mathrm{med}}|\times|\mathcal{V}_{\mathrm{diag}}|}$, where $\mathbf{M}_{\mathrm{Contra}}[m,d]$ indicates whether medication $m$ is contraindicated for diagnosis $d$ under the binary variant, or gives a normalized contraindication weight under the weighted variant.

For each visit $t$, we first compute the predicted contraindication burden:

$$I_{\mathrm{contra}}^{(t)}=\sum_{m\in\hat{\mathcal{M}}^{(t)}}\sum_{d\in\mathcal{D}^{(t)}}\mathbf{M}_{\mathrm{Contra}}[m,d]. \tag{9}$$

We normalize this burden by the ground-truth medication-set size and the number of active diagnoses:

$$Z_{\mathrm{contra}}^{(t)}=\max\!\left(|\mathcal{M}^{(t)}|\,|\mathcal{D}^{(t)}|,1\right). \tag{10}$$

The per-visit GT-normalized contraindication score is:

$$s_{\mathrm{contra}}^{(t)}=\min\!\left(\frac{I_{\mathrm{contra}}^{(t)}}{Z_{\mathrm{contra}}^{(t)}},1\right). \tag{11}$$

We report the dataset-level contraindication score as:

$$\mathrm{Contra}_{\mathrm{GTnorm}}=\frac{1}{N}\sum_{t=1}^{N}s_{\mathrm{contra}}^{(t)}. \tag{12}$$

Indices that fall outside the matrix support contribute zero to the numerator under both metrics: drug pairs $(i,j)$ outside $\mathbf{M}_{\mathrm{DDI}}$ for DDI-B / DDI-W and drug–diagnosis pairs $(m,d)$ outside $\mathbf{M}_{\mathrm{Contra}}$ for Contra-B / Contra-W. This convention treats safety-knowledge–unknown pairs as safe-by-default, so the reported DDI / contraindication rates are lower bounds on the true safety burden over $\mathcal{V}_{\mathrm{med}}$. Lower values indicate safer predictions.
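Equations (9)–(12), together with the zero-contribution convention for pairs outside the matrix support, can be sketched in Python as below. The dictionary keyed by `(medication, diagnosis)` tuples and the function name `contra_gt_norm` are assumptions for illustration. The pair `A02BB`/`O80` reuses the example pair shown in the Verify prompt (Figure 10), while `I10` is a placeholder diagnosis code.

```python
from typing import Dict, List, Set, Tuple


def contra_gt_norm(
    gt: List[Set[str]],
    pred: List[Set[str]],
    diag: List[Set[str]],
    contra: Dict[Tuple[str, str], float],
) -> float:
    """Dataset-level GT-normalized contraindication score (Eq. 12)."""
    scores = []
    for g, p, d in zip(gt, pred, diag):
        # Eq. 9: pairs missing from the lookup are outside the matrix support
        # and contribute zero (safe-by-default convention).
        burden = sum(contra.get((m, dx), 0.0) for m in p for dx in d)
        # Eq. 10: ground-truth set size times number of diagnoses, floored at 1.
        z = max(len(g) * len(d), 1)
        # Eq. 11: cap the per-visit score at 1.
        scores.append(min(burden / z, 1.0))
    return sum(scores) / len(scores)


toy_contra = {("A02BB", "O80"): 1.0}
contra_score = contra_gt_norm(
    gt=[{"A02BB", "B01AB"}],
    pred=[{"A02BB", "X99ZZ"}],  # X99ZZ is a made-up code outside the matrix support
    diag=[{"O80", "I10"}],
    contra=toy_contra,
)
```

Here one flagged pair is divided by 2 ground-truth medications times 2 diagnoses, and the out-of-support code adds nothing to the burden.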

## Appendix F Additional Diagnostic Analysis

This appendix complements Section [6.4](https://arxiv.org/html/2605.29146#S6.SS4) with two additional diagnostics for the critique stage: the overall action distribution (Table [10](https://arxiv.org/html/2605.29146#A6.T10)) and per-expert removal behavior (Table [11](https://arxiv.org/html/2605.29146#A6.T11)).

Table 10: Critique action distribution. Share% is the fraction of all critique decisions assigned to each action. TP% and FP% report the composition of retained and removed medications with respect to the ground-truth medication set.

Table 11: Per-expert removal behavior. Rm% denotes the fraction of proposed medications removed by critique. RmFP% denotes the false-positive rate among removed medications.

Removal behavior differs across experts. The universal supportive-care expert has the lowest removal rate, consistent with its role in proposing common ICU medications. Specialty experts have higher removal rates and consistently high RmFP%, indicating that critique mostly filters overly broad or weakly supported specialty-specific proposals rather than indiscriminately pruning correct ones.

## Appendix G Additional Expert Analysis

This appendix complements the leave-one-expert-out analysis in Section [6.5](https://arxiv.org/html/2605.29146#S6.SS5) with two further diagnostics: router activation behavior (Figure [6.1](https://arxiv.org/html/2605.29146#S6.SS1)) and expert-level proposal/retention statistics (Table [12](https://arxiv.org/html/2605.29146#A7.T12)). Expert abbreviations used throughout this appendix and in Figure [1(c)](https://arxiv.org/html/2605.29146#S6.F1.sf3): SUP = universal supportive, CVD = cardiovascular, ENDO = endocrine/metabolic, MSK = musculoskeletal/trauma, ONC = oncology/hematology, OB = obstetrics/perinatal, GI = gastroenterology/hepatology, RESP = respiratory.

The cardiovascular expert has the highest TP/Ret\. ratio \(51\.5%\), suggesting that specialty experts contribute targeted domain\-specific candidates beyond what broad supportive\-care coverage provides; specialty experts that are activated less frequently \(e\.g\., RESP, OB\) still maintain TP/Ret\. above 35%, indicating that the router does not over\-activate marginal experts\.

![Refer to caption](https://arxiv.org/html/2605.29146v1/x6.png) (b) Number of active experts per case.

Figure 6: Expert activation statistics. The router activates a sparse, case-dependent subset of experts. Expert abbreviations are defined at the beginning of Appendix [G](https://arxiv.org/html/2605.29146#A7).

Table 12: Expert contribution statistics. Activated counts selected cases. Proposed and retained count expert-level medication candidates before and after critic filtering. TP/Ret. is the true-positive density among retained predictions.

## Appendix H Efficiency Breakdown

Table [13](https://arxiv.org/html/2605.29146#A8.T13) reports the average LLM calls, input tokens, output tokens, and wall-clock latency per case for each operator in Algorithm [1](https://arxiv.org/html/2605.29146#algorithm1). Summarize, Generate, and Verify are invoked per activated expert or per flagged item, so their per-case totals scale with case complexity. Route (Section [4.3](https://arxiv.org/html/2605.29146#S4.SS3)) and FindFlags (Section [4.4](https://arxiv.org/html/2605.29146#S4.SS4)) are deterministic and contribute negligible cost.

Table 13: Per-phase efficiency breakdown of SafeRx-Agent on MIMIC-IV. All numbers are averaged over evaluation cases. Route and FindFlags are deterministic (no LLM calls).

## Appendix I Prompt Templates

This section provides all prompt templates used in our experiments. Appendix [I.1](https://arxiv.org/html/2605.29146#A9.SS1) lists the templates for the LLM-based operators in SafeRx-Agent, and Appendix [I.2](https://arxiv.org/html/2605.29146#A9.SS2) lists the baseline templates for direct prompting, the general-agent baseline, and the RareAgents-style baseline.

### I.1 SafeRx-Agent Operator Prompts

Figures [7](https://arxiv.org/html/2605.29146#A9.F7), [8](https://arxiv.org/html/2605.29146#A9.F8), [9](https://arxiv.org/html/2605.29146#A9.F9), and [10](https://arxiv.org/html/2605.29146#A9.F10) correspond to Summarize, Generate, Critique, and Verify, respectively. The non-LLM operators Route and FindFlags are deterministic and do not require prompt templates.

Figure 15: Prediction outcome by expert support (case study, Appendix [J](https://arxiv.org/html/2605.29146#A10)). [Bar chart: number of medications by outcome (TP, FP, Removed, FN), grouped by the number of supporting experts (3, 2, 1, 0).] Medications proposed by more experts are more likely to be true positives. All 5 codes with $\geq$2 expert support that were retained by Critique are correct (TP). The only false positive (A06AB) and two of the three correctly removed codes (C09AA, A04AA) were single-expert proposals; the third removed code (C10AA) was proposed by two experts. The 4 FN medications were not proposed by any expert.

SafeRx-Agent Summarize Prompt Constructor

System prompt: You are the activated specialty module inside a routed inpatient medication prediction system. The upstream router selected this skill; use the routing selection as an interpretive lens and not as permission to invent facts not present in the patient record.

Input:

- Skill name and profile.
- Available tools.
- Summarization playbook (expert-specific).
- Patient record.

Task rules:

- Produce a specialty-aware medication summary by reading the patient record through the lens of the activated expert's playbook.
- Keep the summary grounded in the record only; do not invent diagnoses, procedures, or medications that are not documented.
- Highlight domain-specific risks, procedures, care context, and prior medication evidence that could affect inpatient medication classes.
- Do not predict medications at this stage; medication proposals are produced downstream by the Generate operator.

Output format: Return strict JSON with five fields: *expertise* (the activated expert identity), *current_admission* (specialty-relevant active problems, organ dysfunction, and care context), *medication_relevant_history* (prior indications, procedures/devices, and explicit medication evidence), *expertise_focus* (record-grounded facts explaining why this expert is relevant), and *risks_to_watch* (specialty-relevant medication risks).

Example playbook (universal supportive expert): Focus the summary on factors that drive universal supportive medication needs:

- ICU / critical-care context: ICU admission, intubation, vasopressor use, post-surgical, or otherwise critically ill – drives prophylactic needs.
- Immobility / VTE risk: bed-bound, post-operative, trauma, or prolonged hospitalization → VTE prophylaxis.
- GI stress / aspiration risk: NPO status, intubation, critical illness, steroid use → stress-ulcer prophylaxis, bowel management, antiemetics.
- Pain / sedation needs: surgical procedures, trauma, burns, invasive lines → analgesics, sedatives.
- Electrolyte context: renal function, diuretic use, critical illness → electrolyte monitoring and repletion.
- Nutritional status: NPO, TPN, tube feeding, prolonged admission → vitamins, IV fluids, dextrose.
- Infection / vaccine status: any hospitalization → influenza-vaccine consideration; critical illness → infection prophylaxis.
- Prior medications: note supportive medications from prior visits that suggest continuation.

Do not summarize specialty-specific clinical details (e.g., cardiac rhythm, tumor staging); those belong to routed specialty experts.

Figure 7: Prompt template for the SafeRx-Agent Summarize operator described in Section [4.3](https://arxiv.org/html/2605.29146#S4.SS3). The prompt is dynamically constructed per activated expert by injecting the expert-specific playbook into the shared template; the universal supportive expert's playbook is shown as an example.

SafeRx-Agent Generate Prompt Template

System prompt: You are the medication generator for the activated expertise skill. This is a medication-list prediction task for the current admission; work only from the expert-specific summary and the patient record. Do not output guideline-only wish lists – predict what is plausibly on the medication list.

Input: Expert-specific clinical summary (the JSON produced by Summarize); expert-specific drug-prediction checklist; prior-visit medication evidence; retrieved indication candidates from MEDI; and optional revision feedback from a prior generation round.

Task rules:

- Predict ATC-L4 medication classes likely present during the current admission. Output only 5-character ATC-L4 codes from the closed vocabulary; convert any ingredient- or product-level ATC codes to their ATC-L4 parent.
- Map from clinical scene to plausible medication class, then choose the fitting ATC-L4 code; do not return an empty list when the summary clearly supports inpatient medication classes.
- Follow this evidence priority (strongest first): (i) explicit medication evidence, (ii) prior-visit medications, (iii) procedure, device, or care-context evidence, (iv) current diagnoses and acute organ dysfunction, (v) retrieved indication candidates, (vi) supportive inpatient priors when supported by the documented scene.
- When a revision block is present, address the prior critique by adjusting only the affected codes; do not regenerate the entire list from scratch.

Output format: Return strict JSON containing the predicted ATC-L4 codes, per-code confidence scores, input-grounded reasons that cite specific summary fields, any assumptions that limit certainty, and plausible alternative codes when the evidence is ambiguous.

Expert-specific checklist (example: universal supportive expert). The checklist gives per-class predict/withhold rules grounded in documented clinical conditions. For the universal supportive expert it covers four groups:

- Thromboprophylaxis (heparins, antiplatelets): predict when VTE prophylaxis is documented, the patient is critically ill or post-operative, or antiplatelet therapy is indicated by coronary/cerebrovascular disease; withhold when active bleeding, HIT, or an anticoagulation contraindication is present.
- Analgesia (opioids, acetaminophen, salicylates): predict for major surgery, severe trauma, documented pain or fever, or an active analgesia-sedation protocol; do not predict from mild pain, minor procedures, or ICU admission alone.
- GI prophylaxis (acid suppression, antiemetics, laxatives): predict when GI bleeding history, stress-ulcer risk factors (mechanical ventilation *with* coagulopathy or shock), opioid-induced constipation, or postoperative nausea is documented; do not predict from ICU stay alone or in the presence of diarrhea or bowel obstruction.
- Electrolyte and nutrition support (calcium, potassium, dextrose, parenteral nutrition): predict when specific electrolyte depletion, diuretic use, renal impairment, insulin infusion, or enteral/parenteral feeding is documented; do not predict from IV access or NPO status alone.

Figure 8: Prompt template for the SafeRx-Agent Generate operator described in Section [4.3](https://arxiv.org/html/2605.29146#S4.SS3). An expert-specific drug-prediction checklist is injected at runtime to constrain drug-class prediction to the activated expert's scope.

SafeRx-Agent Critique Prompt

System prompt: You are the attending physician critique. Multiple specialist experts have proposed ATC-L4 medication classes for this inpatient admission. Your job is to produce the final prescription list by conservatively filtering the union of proposed codes.

Input: Clinical summary; raw patient record; expert proposals with rationales; prior-visit medication evidence; and the union of proposed ATC-L4 codes.

Audit rules:

- Review each proposed code using the clinical summary, raw record, expert rationales, prior-medication evidence, and inpatient care context.
- Keep a code if it has concrete patient-specific support from any expert or record evidence; if it is a prior-visit medication whose underlying condition appears ongoing; or if it is a plausible inpatient supportive or prophylactic medication.
- Remove a code only if it is unsupported by any evidence in the summary, raw record, expert proposals, or prior-medication context, and has no plausible inpatient reason given the documented conditions, procedures, or care context.
- Do not remove a code simply because it is outside a specialist's scope. Do not remove evidence-grounded prophylactic or supportive medications unless there is specific evidence that they are contraindicated or absent.
- Remove redundant or overlapping codes only when they clearly duplicate the same therapeutic intent without added benefit.
- When in doubt, keep the code. Do not add codes outside the union.

Output format: Return strict JSON containing the final retained ATC-L4 codes, removed codes with brief removal reasons, an overall rationale, and any missing information needed for clarification.

Figure 9: Abbreviated prompt template for the SafeRx-Agent Critique operator.

SafeRx-Agent Verify Prompt

System prompt: You are a clinical pharmacist safety reviewer. Given a list of predicted prescription drugs and safety signals (drug–drug interactions and contraindications), decide which drugs to remove or retain.

Adjudication rules:

- If a drug is in `prior_meds` (the patient was on it before this admission), prefer to keep it unless the interaction is severe.
- When a DDI pair is flagged, prefer removing the drug with the higher global DDI degree (more interactions overall) and lower clinical necessity.
- When a drug is contraindicated with a patient diagnosis, remove it unless it was in `prior_meds` and the benefit clearly outweighs the risk.
- Be conservative: remove drugs only when there is a clear safety signal. Do not remove a drug merely because it has a low DDI degree.
- Return strict JSON only, with no markdown fences and no commentary.

Input (dynamically assembled user prompt): The user prompt contains four sections, assembled at runtime from the candidate set and the two safety matrices:

1. Predicted drugs. Each candidate listed as `<ATC-L4>: <class name>`, with `[PRIOR-MED]` when the drug was active in the patient's previous admission. Example: `B01AB: Heparin group [PRIOR-MED]`.
2. Prior medications. All drugs active in the previous visit, used for continuation decisions.
3. DDI pairs detected. Each flagged pair from the binary DDI matrix, annotated with both drugs' global DDI degree and a `[PRIOR]` tag where applicable. Example: `B01AB (degree=42) [PRIOR] ↔ N02BA (degree=38)`.
4. Contraindication pairs detected. Each flagged drug–diagnosis pair from the binary contraindication matrix. Example: `A02BB ↔ diagnosis O80`.

If a case raises no DDI or contraindication flags, the candidate set is retained without invoking the verifier.

Output format: Return strict JSON with two fields: `kept_drugs` (list of retained ATC-L4 codes) and `removed_drugs` (list of objects, each with `code`, a short `reason`, and an optional ATC-L4 `replacement` from the same therapeutic subgroup).

Figure 10: Prompt template for the SafeRx-Agent Verify operator described in Section [4.4](https://arxiv.org/html/2605.29146#S4.SS4). The system prompt encodes the retain/remove adjudication policy; the user prompt is dynamically assembled from matrix-retrieved DDI and contraindication flags, drug-name lookups, and prior-medication status. A `[PRIOR]` tag marks medications carried over from a prior visit so the model can bias toward continuation.
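To make the "DDI pairs detected" section of the Verify user prompt concrete, the sketch below formats flagged pairs in the style of the example line in Figure 10. It is a hypothetical reconstruction: the function names, the storage of the binary DDI matrix as a set of unordered pairs, and the definition of global DDI degree as the number of matrix pairs containing a code are all assumptions, not the paper's FindFlags implementation.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def ddi_degree(code: str, ddi_pairs: Set[FrozenSet[str]]) -> int:
    """Assumed definition: number of interacting pairs in the matrix that contain `code`."""
    return sum(1 for pair in ddi_pairs if code in pair)


def format_ddi_flags(
    candidates: Set[str],
    prior_meds: Set[str],
    ddi_pairs: Set[FrozenSet[str]],
) -> List[str]:
    """Return one prompt line per flagged candidate pair; empty if nothing is flagged."""
    lines = []
    for a, b in combinations(sorted(candidates), 2):
        if frozenset((a, b)) not in ddi_pairs:
            continue
        parts = []
        for code in (a, b):
            tag = " [PRIOR]" if code in prior_meds else ""
            parts.append(f"{code} (degree={ddi_degree(code, ddi_pairs)}){tag}")
        lines.append(" ↔ ".join(parts))
    return lines


# Toy binary DDI matrix; the pairs are illustrative, not real interaction data.
toy_pairs = {frozenset(("B01AB", "N02BA")), frozenset(("B01AB", "C01BD"))}
flag_lines = format_ddi_flags({"B01AB", "N02BA", "A02BC"}, {"B01AB"}, toy_pairs)
```

An empty result corresponds to the case described in the prompt where no flags are raised and the verifier is not invoked (assuming no contraindication flags either).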
### I.2 Baseline Prompt Templates

We provide abbreviated prompt templates used in our experiments. Figures [11](https://arxiv.org/html/2605.29146#A9.F11)–[13](https://arxiv.org/html/2605.29146#A9.F13) show the baseline templates for direct prompting, the general-agent baseline, and the adapted RareAgents baseline.

Direct LLM Prompt

System prompt: You are an inpatient physician. For an ICU inpatient admission, predict the ATC Level-4 medication classes that will appear on this patient's medication list during the admission.

User prompt: `{patient_text}`

Task instruction: Think step by step, then emit a single JSON object. ATC L4 codes are exactly 5 characters: letter + 2 digits + 2 letters (e.g., C10AA, B01AB, N02BE). Do not invent codes, if unsure, omit.

Output format: Output STRICT JSON only (no markdown, no commentary):

```
{
  "reasoning": "<step-by-step reasoning, plain text>",
  "predicted_drugs": [
    {"code": "<ATC L4>"}
  ]
}
```

Figure 11: Prompt template for the direct prompting baseline.

General-Agent Prompt

Note. The general-agent baseline reuses SafeRx-Agent's Summarize and Generate prompt structure (Figures [7](https://arxiv.org/html/2605.29146#A9.F7) and [8](https://arxiv.org/html/2605.29146#A9.F8)) and the shared Critique prompt (Figure [9](https://arxiv.org/html/2605.29146#A9.F9)), but replaces the routed specialty experts with a single *general_agent* skill that covers every therapeutic domain. Only the skill's injected *summarization playbook* and *drug-prediction checklist* differ from the SafeRx-Agent versions; both are reproduced below.

Summarize stage, general-agent playbook: Summarize the patient record to support broad inpatient medication prediction. The summary should capture active conditions, relevant chronic history, procedures, acuity of care, medication-relevant risks, and prior medication evidence. It should be comprehensive across therapeutic domains rather than specialized to one organ system.

In particular, the summary should preserve information that may affect medication choice, including acute diagnoses, comorbidities, procedures, critical-care status, infection evidence, pain or neurologic needs, renal and metabolic abnormalities, gastrointestinal or nutritional status, VTE or bleeding risk, and medications that may require continuation from prior visits.

Generate stage, general-agent drug-prediction instruction: Predict ATC-L4 medication classes for the current inpatient admission as a single generalist agent. Use the patient summary, prior-visit medication evidence, diagnosis-linked medication evidence, and the closed ATC-L4 vocabulary. The prediction should cover both disease-directed therapy and common inpatient supportive care, while requiring patient-specific evidence for each medication class.

The agent should reason over the case by identifying active problems and chronic conditions, mapping each to plausible medication needs, checking whether prior medications should be continued, and considering medications associated with procedures, critical illness, infection management, pain control, thrombosis prevention, gastrointestinal protection, electrolyte correction, nutrition, and bowel care. Medication classes should be included only when supported by the current visit context, longitudinal history, or explicit evidence in the record.

The agent should prefer adequate coverage of clinically supported medication needs, but should avoid adding classes solely because they are common in inpatient care. Confidence should reflect the strength of evidence: higher confidence is reserved for explicitly documented medications or indications, while lower confidence is used for plausible but indirect clinical support.

Figure 12: Prompt template for the general-agent baseline. The baseline reuses SafeRx-Agent's Summarize, Generate, and Critique structures, but replaces routed specialty experts with one all-domain *general_agent*.

RareAgents-Style Prompts Adapted to ATC-L4

Stage 1: Attending Physician. You are the Attending Physician coordinating a multi-disciplinary team (MDT). Given the patient profile and a pool of 41 clinical specialists, each defined by an id, name, and clinical scope, select the specialists whose expertise is relevant to the patient's active diagnoses, procedures, and medication needs. Prefer disease- and organ-system specialists when they directly match the case, and include cross-cutting specialists such as clinical pharmacy or internal medicine when their scope can improve treatment review. Return the selected specialist IDs and a brief rationale as JSON.

Stage 2: Specialist Discussion. You are the `{specialist_name}` specialist in the MDT. Based on the patient profile, prior prescription history, and your specialty scope, discuss which medication classes are clinically relevant for the current admission. Use the provided DrugBank and DDI-graph feedback, when available, to note treatment evidence and potential safety concerns. Propose only medications within your specialty scope. Output 5-character ATC-L4 codes adapted to our closed prediction vocabulary, and convert ingredient-level ATC codes to their L4 parent. Return an empty list if your specialty has no relevant role. Return the proposed codes and reasoning as JSON.

Stage 3: Attending Synthesis. You are the Attending Physician finalizing the medication recommendation after the MDT discussion. Integrate the specialist proposals, prior prescription evidence, and DrugBank/DDI-graph feedback to form the final prescription set. Use the specialist proposals as the candidate pool, keep medications supported by the patient context or prior-visit precedent, and remove unsupported or tool-flagged candidates when the evidence does not justify keeping them. Do not introduce medications that were not proposed by any specialist. Output 1–30 ATC-L4 codes with reasoning as JSON.

Figure 13: RareAgents-style baseline prompts adapted to ATC-L4 medication recommendation. We adapt the RareAgents MDT workflow Chen et al. ([2026](https://arxiv.org/html/2605.29146#bib.bib13)) to our closed-vocabulary ATC-L4 setting. The three stages correspond to attending-led specialist selection, specialist discussion with DrugBank and DDI-graph feedback, and attending-led synthesis. Unlike the original RareAgents medication prompt, which selects medication names from a fixed candidate list, our adaptation requires 5-character ATC-L4 codes from the prediction vocabulary.
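Several of the prompts above constrain outputs to 5-character ATC-L4 codes (letter + 2 digits + 2 letters) and ask for ingredient-level codes to be converted to their L4 parent. A minimal post-processing sketch in Python is shown below. It relies on the standard ATC layout, in which a 7-character level-5 code extends its level-4 parent by two digits; the function name `to_atc_l4` and the optional closed-vocabulary check are illustrative assumptions, not the authors' parser.

```python
import re
from typing import Optional, Set

# Level-4 shape from the prompt (letter + 2 digits + 2 letters),
# optionally followed by the 2 digits of a level-5 (ingredient) code.
_ATC = re.compile(r"([A-Z][0-9]{2}[A-Z]{2})([0-9]{2})?")


def to_atc_l4(code: str, vocab: Optional[Set[str]] = None) -> Optional[str]:
    """Normalize a model-emitted code to ATC-L4, or return None if it is invalid.

    `vocab` is a hypothetical closed prediction vocabulary; when given,
    codes outside it are rejected.
    """
    match = _ATC.fullmatch(code.strip().upper())
    if match is None:
        return None
    l4 = match.group(1)
    if vocab is not None and l4 not in vocab:
        return None
    return l4


raw_codes = ["C10AA05", "b01ab", "C10A", "N02BE"]
normalized = [to_atc_l4(c, vocab={"C10AA", "B01AB"}) for c in raw_codes]
```

Dropping invalid or out-of-vocabulary codes, rather than attempting to repair them, mirrors the direct prompt's instruction to omit codes when unsure.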
## Appendix J Case Study

We present an end-to-end walkthrough of SafeRx-Agent on a representative MIMIC-IV case (Subject 10785159) using Gemma3-27B-IT. This case illustrates how multi-expert routing, grounded generation, and global critique produce a traceable medication set with per-medication rationales. Figure [14](https://arxiv.org/html/2605.29146#A10.F14) provides an overview of the pipeline flow on this case; Table [14](https://arxiv.org/html/2605.29146#A10.T14) traces every candidate code through the system.

[Pipeline diagram, contents summarized as text:]

- Patient record: 87-year-old female with NSTEMI, acute systolic HF, AF, COVID-19, T2DM, aortic stenosis (s/p bioprosthetic AVR), pulmonary HTN, and prior VTE; 22 diagnoses · 2 procedures · 12 prior-visit medications.
- Route: activates 3 of 8 experts.
- Cardiovascular (score = 0.60). Summarize focus: NSTEMI, acute systolic HF, AF, valve disease, antithrombotic, hemodynamics, cardiorenal risk. Generate: 7 codes (C01BD, C03CA, C07AB, B01AB, C09AA, C10AA, A12BA).
- Endocrine/Metabolic (0.17). Summarize focus: T2DM, long-term insulin, electrolyte management, metabolic instability risk. Generate: 5 codes (C07AB, A06AB, A12BA, C10AA, A02BC).
- Universal Supportive (always-on). Summarize focus: pain management, bowel care, GI prophylaxis, sleep support, electrolyte repletion. Generate: 11 codes (B01AB, C01BD, C07AB, A02BC, N02AX, A12BA, A06AD, A12CC, V06DC, A04AA, N05CH).
- Critique: 15 unique candidates → 12 retained, 3 removed. Removed: C09AA (ACE inhibitor: weak acute support), C10AA (statin: low confidence), A04AA (antiemetic: speculative).
- Final prediction: 12 ATC-L4 codes; F1 = 0.815; 11 TP · 1 FP · 4 FN (not proposed by any expert).

Figure 14: Case study pipeline flow. The patient record is routed to three specialty experts. Each expert summarizes the record from its domain perspective and generates ATC-L4 candidates. Critique merges the 15 unique candidates across all experts and removes 3 weakly supported codes, producing a final set of 12 medications.

### J.1 Patient Presentation

Patient Vignette

An 87-year-old female with a history of nonrheumatic aortic stenosis (post-bioprosthetic valve replacement), chronic heart failure, persistent atrial fibrillation, cardiomyopathy, hypertensive heart disease, type 2 diabetes mellitus on insulin, hyperlipidemia, obesity, GERD, irritable bowel syndrome, and prior venous thromboembolism. She presents with NSTEMI, acute-on-chronic systolic heart failure, COVID-19, pulmonary hypertension, phlebitis of the lower extremities, and hypotension. Current visit procedures include coronary angiography and cardiac catheterization. Prior visit medications (12 codes) include amiodarone (C01BD), furosemide (C03CA), metoprolol (C07AB), heparin (B01AB), PPI (A02BC), potassium (A12BA), magnesium (A12CC), tramadol (N02AX), contact laxatives (A06AB), osmotic laxatives (A06AD), melatonin (N05CH), and carbohydrate supplement (V06DC).

Ground-truth medications for the current visit: 15 ATC-L4 codes.

### J\.2Pipeline Walkthrough

#### Step 1: Expert Routing \(Route\)\.

Route scores the patient's ICD codes against each expert's scope. The cardiovascular domain dominates (score = 0.60), followed by endocrine/metabolic (0.17). Three experts are activated: Cardiovascular, Endocrine/Metabolic, and the always-on Universal Supportive Care (Figure [14](https://arxiv.org/html/2605.29146#A10.F14), rows 2–3).

#### Step 2: Expert Summarization \(Summarize\)\.

Each activated expert extracts specialty\-relevant evidence from the patient record:

- Cardiovascular identifies NSTEMI, acute systolic HF, AF, aortic stenosis, pulmonary HTN, and hypotension as active problems. It notes the prior aortic valve replacement and long-term anticoagulant use, and flags risks of bleeding (anticoagulation plus recent procedure), arrhythmia (AF, post-MI), and cardiorenal syndrome.
- Endocrine/Metabolic focuses on type 2 diabetes with long-term insulin use, and flags electrolyte management (prior potassium and magnesium supplementation) and metabolic instability risks given acute heart failure.
- Universal Supportive Care provides a broad view covering pain management (NSTEMI), bowel care (IBS history plus opioid use), sleep support, GI prophylaxis (GERD plus anticoagulation), and electrolyte repletion in the context of heart failure and anticipated diuretic use.

#### Step 3: Per\-Expert Generation \(Generate\)\.

Each expert proposes ATC-L4 codes grounded in its summary and the MEDI indication resource. The Cardiovascular expert proposes 7 codes, the Endocrine/Metabolic expert proposes 5 codes, and the Universal Supportive Care expert proposes 11 codes, yielding 15 unique candidates across all experts (Table [14](https://arxiv.org/html/2605.29146#A10.T14)).
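The candidate counts can be checked directly from the per-expert code lists shown in Figure 14. The short Python sketch below merges the three proposal lists and counts how many experts support each code; the variable names are illustrative, and only the code lists themselves come from the case study.

```python
from collections import Counter

# Per-expert proposals as listed in Figure 14.
proposals = {
    "Cardiovascular": ["C01BD", "C03CA", "C07AB", "B01AB", "C09AA", "C10AA", "A12BA"],
    "Endocrine/Metabolic": ["C07AB", "A06AB", "A12BA", "C10AA", "A02BC"],
    "Universal Supportive": [
        "B01AB", "C01BD", "C07AB", "A02BC", "N02AX", "A12BA",
        "A06AD", "A12CC", "V06DC", "A04AA", "N05CH",
    ],
}

# Number of experts proposing each code (each expert counted at most once per code).
support = Counter(code for codes in proposals.values() for code in set(codes))

unique_candidates = sorted(support)
by_support = {
    k: sorted(code for code, n in support.items() if n == k) for k in (3, 2, 1)
}
```

The union has 15 codes: 2 proposed by all three experts, 4 by two experts, and 9 by a single expert, which agrees with the #Exp column of Table 14.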

#### Step 4: Global Critique \(Critique\)\.

Critique reviews all expert proposals under the full patient context and removes three candidates:

- C09AA (ACE inhibitors): while plausible given diabetes and cardiovascular disease, the acute presentation (hypotension, acute HF) does not strongly support initiation, and no prior ACE inhibitor use is documented.
- C10AA (statins): both proposing experts assigned low confidence, and the acute presentation does not prioritize statin initiation.
- A04AA (antiemetics): a speculative proposal, as COVID-19 alone does not necessitate antiemetics without documented nausea or vomiting.

The final candidate set contains 12 ATC\-L4 codes\.

Table 14: Case study pipeline trace. Checkmarks indicate which expert(s) proposed each ATC-L4 code via Generate. Critique action shows whether the code was retained (Ret) or removed (Rem). Outcome classifies each code against the ground-truth medication set: true positive (TP), false positive (FP), or not applicable (—) for correctly removed candidates. False-negative medications (bottom) were not proposed by any expert.

| ATC-L4 | Drug Class | CV | Endo | Univ | #Exp | Critique | Outcome |
|---|---|---|---|---|---|---|---|
| C07AB | Beta-blockers, selective | ✓ | ✓ | ✓ | 3 | Ret | TP |
| A12BA | Potassium | ✓ | ✓ | ✓ | 3 | Ret | TP |
| B01AB | Heparin group | ✓ | | ✓ | 2 | Ret | TP |
| C01BD | Antiarrhythmics, class III | ✓ | | ✓ | 2 | Ret | TP |
| A02BC | Proton pump inhibitors | | ✓ | ✓ | 2 | Ret | TP |
| C10AA | HMG CoA reductase inhibitors | ✓ | ✓ | | 2 | Rem | — |
| C03CA | Sulfonamides, plain (loop diuretics) | ✓ | | | 1 | Ret | TP |
| C09AA | ACE inhibitors, plain | ✓ | | | 1 | Rem | — |
| A06AB | Contact laxatives | | ✓ | | 1 | Ret | FP |
| N02AX | Other opioids | | | ✓ | 1 | Ret | TP |
| A06AD | Osmotically acting laxatives | | | ✓ | 1 | Ret | TP |
| A12CC | Magnesium | | | ✓ | 1 | Ret | TP |
| V06DC | Carbohydrates | | | ✓ | 1 | Ret | TP |
| N05CH | Melatonin receptor agonists | | | ✓ | 1 | Ret | TP |
| A04AA | Serotonin (5-HT3) antagonists | | | ✓ | 1 | Rem | — |
| | *Ground-truth medications not proposed by any expert (false negatives):* | | | | | | |
| A06AX | Other drugs for constipation | | | | 0 | — | FN |
| D01AC | Imidazole/triazole antifungals | | | | 0 | — | FN |
| J07BX | Other viral vaccines | | | | 0 | — | FN |
| N06AX | Other antidepressants | | | | 0 | — | FN |

### J\.3 Analysis

The final predicted set contains 12 medications, of which 11 are true positives and 1 is a false positive; with 4 ground\-truth medications missed, this gives a precision of 0\.917, recall of 0\.733, and F1 of 0\.815\. Figure [15](https://arxiv.org/html/2605.29146#A9.F15) summarizes the relationship between expert support and prediction outcome\.
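As a worked check of these figures, the following sketch recomputes set\-based precision, recall, and F1 from the codes in Table 14\. The function name `set_metrics` is an illustrative assumption; the paper's own evaluation code is not shown in this section\.

```python
def set_metrics(predicted, ground_truth):
    """Set-based precision, recall, and F1 for a single visit."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    tp = len(predicted & ground_truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Codes retained by Critique (Table 14): 11 true positives plus A06AB.
predicted = ["C07AB", "A12BA", "B01AB", "C01BD", "A02BC", "C03CA",
             "A06AB", "N02AX", "A06AD", "A12CC", "V06DC", "N05CH"]

# Ground truth = the 11 true positives plus the 4 false negatives.
ground_truth = [c for c in predicted if c != "A06AB"] + [
    "A06AX", "D01AC", "J07BX", "N06AX"]

p, r, f1 = set_metrics(predicted, ground_truth)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.917 0.733 0.815
```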

#### Multi\-expert corroboration\.

All five codes proposed by $\geq 2$ experts that were retained by Critique are true positives \(C07AB, A12BA, B01AB, C01BD, A02BC\), consistent with the aggregate finding that multi\-expert corroboration is a strong signal for correctness \(Section [6\.4](https://arxiv.org/html/2605.29146#S6.SS4)\)\.
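The breakdown by expert support can be tabulated with a short sketch over the rows of Table 14\. The tuple layout and the helper `tp_count` are illustrative assumptions, and the counts describe this single visit only, not the aggregate result in Section 6\.4\.

```python
# (code, number of proposing experts, Critique action, outcome) from Table 14.
# Outcome is None for candidates that Critique removed.
rows = [
    ("C07AB", 3, "Ret", "TP"), ("A12BA", 3, "Ret", "TP"),
    ("B01AB", 2, "Ret", "TP"), ("C01BD", 2, "Ret", "TP"),
    ("A02BC", 2, "Ret", "TP"), ("C10AA", 2, "Rem", None),
    ("C03CA", 1, "Ret", "TP"), ("C09AA", 1, "Rem", None),
    ("A06AB", 1, "Ret", "FP"), ("N02AX", 1, "Ret", "TP"),
    ("A06AD", 1, "Ret", "TP"), ("A12CC", 1, "Ret", "TP"),
    ("V06DC", 1, "Ret", "TP"), ("N05CH", 1, "Ret", "TP"),
    ("A04AA", 1, "Rem", None),
]


def tp_count(min_experts, max_experts):
    """Return (true positives, retained codes) within a support range."""
    kept = [r for r in rows
            if r[2] == "Ret" and min_experts <= r[1] <= max_experts]
    tps = sum(1 for r in kept if r[3] == "TP")
    return tps, len(kept)


multi = tp_count(2, 3)   # retained codes with two or three proposing experts
single = tp_count(1, 1)  # retained codes with exactly one proposing expert
print(multi, single)  # (5, 5) (6, 7)
```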

#### Selective critique\.

Critique removes 3 of the 15 candidate codes\. None of the three removed codes appears in the ground truth, indicating that Critique correctly identified false positives in this case\. Among single\-expert proposals, Critique retains those with clear patient\-specific support\. For example, C03CA \(loop diuretics\) was proposed only by the Cardiovascular expert but was kept because the patient has documented heart failure, volume overload indicators, and prior furosemide use\. In contrast, C09AA \(ACE inhibitors\) was removed despite cardiovascular relevance because the patient’s hypotension and lack of prior ACE inhibitor use argued against initiation\.

#### False positive\.

The single false positive, A06AB \(contact laxatives\), was prescribed in the prior visit and proposed by the Endocrine/Metabolic expert based on documented constipation history\. This is a clinically reasonable continuation that falls outside the current visit’s ground\-truth set, illustrating that not all false positives represent clinical errors\.

#### False negatives\.

The four missed medications were not proposed by any expert: A06AX \(other drugs for constipation\), D01AC \(imidazole/triazole antifungals\), J07BX \(other viral vaccines\), and N06AX \(other antidepressants\)\. D01AC and J07BX lack explicit diagnostic indicators in the structured record \(antifungal prophylaxis and COVID\-19 vaccination are context\-dependent decisions\)\. N06AX \(antidepressants\) may reflect undocumented psychiatric history\. These false negatives highlight the challenge of predicting medications whose indications require implicit clinical reasoning beyond the available EHR data\.
