An AI agent for treatment reasoning over a biomedical tool universe

arXiv cs.AI 06/30/26, 04:00 AM Papers
Summary
This paper introduces an AI agent trained via reinforcement learning to reason over all FDA-approved drugs since 1939 for treatment recommendations, integrating disease context, comorbidities, and contraindications.
arXiv:2606.28692v1 Announce Type: new Abstract: Treatment reasoning underpins every therapeutic decision, integrating disease context, comorbidities, medications, contraindications, and evolving biomedical knowledge to select an appropriate therapy. It is inherently iterative: candidates are weighed against many constraints, revised as evidence emerges, and grounded in verifiable sources. Here we introduce ATHENA-R1, an AI agent for treatment reasoning across all FDA approved drugs since 1939, trained by reinforcement learning over a universe of 212 biomedical tools. At each step it identifies missing information, selects and runs relevant tools, and incorporates the evidence. To train it without human-annotated traces, we build a two-level self-learning framework: multi-agent systems construct the tools, tasks, and reasoning trajectories for supervised fine-tuning, then reinforcement learning with scientific feedback rewards reasoning quality (evidence gathering, grounded tool use, logical non-redundancy). Across five benchmarks of 3,168 drug reasoning tasks and 456 patient treatment cases, ATHENA-R1 outperforms language models and tool-use systems, reaching 94.7% accuracy on open-ended drug reasoning and 82.9% on treatment reasoning, 17.8 and 10.7 points above GPT-5. In blinded evaluations by experts from 28 rare disease organizations, it is preferred over reference models on all criteria, and physicians rated it favorably on complex hospitalized cardiovascular and infectious-disease cases. Adverse-event hypotheses it generated, tested in electronic health records from 5.4 million patients, reached adjusted odds ratios of 1.48-1.84, with no elevation among negative controls. Because it requires knowing what evidence to seek before concluding, treatment reasoning has long been hard for AI; we show it can be reframed as a learnable process of iterative evidence gathering that reinforcement learning can train AI to perform.
Cached at: 06/30/26, 05:31 AM
# An AI agent for treatment reasoning over a biomedical tool universe
Source: [https://arxiv.org/html/2606.28692](https://arxiv.org/html/2606.28692)
Ayush Noori1,2,3 [https://orcid.org/0000-0003-1420-1236](https://orcid.org/0000-0003-1420-1236); Richard Zhu1 [https://orcid.org/0009-0004-6190-8503](https://orcid.org/0009-0004-6190-8503); Curtis Ginder1,4 [https://orcid.org/0000-0001-8507-9624](https://orcid.org/0000-0001-8507-9624); Zhenglun Kong1 [https://orcid.org/0000-0002-8120-4456](https://orcid.org/0000-0002-8120-4456); Xiaorui Su1; Justin Kauffman5 [https://orcid.org/0009-0004-6371-4198](https://orcid.org/0009-0004-6371-4198); Benjamin S\. Glicksberg5,6,7 [https://orcid.org/0000-0003-4515-8090](https://orcid.org/0000-0003-4515-8090); Joshua Lampert5,6,8; Ankit Sakhuja5,9,10; Ashwin Sawant5,9,11 [https://orcid.org/0000-0003-1525-8541](https://orcid.org/0000-0003-1525-8541); ATHENA\-R1 Evaluation Consortium12; David A\. Clifton2,13 [https://orcid.org/0000-0002-9848-8555](https://orcid.org/0000-0002-9848-8555); Noa Dagan3,14,15 [https://orcid.org/0000-0001-8811-7825](https://orcid.org/0000-0001-8811-7825); Ran Balicer3,14,16 [https://orcid.org/0000-0002-7783-6362](https://orcid.org/0000-0002-7783-6362); Marinka Zitnik1,3,17,18,19,† [https://orcid.org/0000-0001-8530-7228](https://orcid.org/0000-0001-8530-7228)

1 Department of Biomedical Informatics, Harvard Medical School, Boston, MA
2 Department of Engineering Science, University of Oxford, Oxford, UK
3 The Ivan and Francesca Berkowitz Family Living Laboratory Collaboration at Harvard Medical School and Clalit Research Institute, Boston, MA, USA
4 Cardiovascular Division, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA
5 The Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
6 The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and Mount Sinai Health System, New York City, NY, USA
7 Mindich Child Health and Development Institute and the Departments of Pediatrics and Genetics & Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
8 Mount Sinai Fuster Heart Hospital, Icahn School of Medicine at Mount Sinai, New York, NY, USA
9 Mount Sinai AI Assurance Lab, Mount Sinai Health System, New York, NY, USA
10 Institute for Critical Care Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
11 Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
12 ATHENA\-R1 Evaluation Group \(the list of members and their affiliations appears in the Supplementary Information\)
13 Oxford Suzhou Centre for Advanced Research, University of Oxford, Suzhou, Jiangsu, China
14 Clalit Research Institute, Innovation Division, Clalit Health Services, Ramat Gan, Israel
15 Faculty of Computer and Information Science, Ben Gurion University of the Negev, Be’er Sheva, Israel
16 Faculty of Health Sciences, School of Public Health, Ben Gurion University of the Negev, Be’er Sheva, Israel
17 Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA
18 Broad Institute of MIT and Harvard, Cambridge, MA
19 Harvard Data Science Initiative, Cambridge, MA

†Correspondence: [marinka@hms\.harvard\.edu](mailto:marinka@hms.harvard.edu)

## Abstract

Treatment reasoning underpins every therapeutic decision in medicine, requiring the integration of disease context, comorbidities, concurrent medications, contraindications, and evolving biomedical knowledge to arrive at a therapy appropriate for an individual patient\. This process is inherently iterative, as candidate treatments must be evaluated against multiple constraints, revised as new evidence emerges, and grounded in sources that can be inspected and verified\. Here we introduce ATHENA\-R1, an AI agent for treatment reasoning across all FDA\-approved drugs since 1939, trained through reinforcement learning over a universe of 212 biomedical tools\. At each reasoning step, ATHENA\-R1 identifies missing information, selects and executes relevant tools, and incorporates the retrieved evidence before proceeding\. To train ATHENA\-R1 without human\-annotated reasoning traces, we develop a two\-level self\-learning framework in which multi\-agent systems construct tools, treatment tasks, and reasoning trajectories for supervised fine\-tuning, followed by reinforcement learning with scientific feedback using rewards for reasoning quality, including evidence gathering, grounded tool use, and logical non\-redundancy, to refine evidence\-seeking strategy\. Across five benchmarks spanning 3,168 drug reasoning tasks and 456 patient treatment cases, ATHENA\-R1 outperforms language models and tool\-use systems\. ATHENA\-R1 achieves 94\.7% accuracy on open\-ended drug reasoning and 82\.9% accuracy on treatment reasoning, exceeding GPT\-5 by 17\.8 and 10\.7 percentage points, respectively\. In blinded evaluations involving experts from 28 rare disease organizations, ATHENA\-R1 is preferred over reference models across all evaluation criteria\. Physicians rated ATHENA\-R1 favorably on complex hospitalized patient cases spanning cardiovascular management and infectious disease\. Adverse event hypotheses generated by ATHENA\-R1 were tested in electronic health records from 5\.4 million patients, with predicted associations reaching adjusted odds ratios of 1\.48–1\.84 and negative controls showing no elevation\. These results establish that treatment reasoning, long considered difficult for AI because it requires knowing what evidence to seek before a conclusion can be formed, can be reframed as a learnable process of iterative evidence gathering, and that reinforcement learning can train AI to perform it\.

## Main

Treatment reasoning is among the most demanding tasks in medicine, where selecting a therapy for an individual patient requires weighing disease context, patient characteristics, concurrent medications, safety constraints, and evolving evidence \[[1](https://arxiv.org/html/2606.28692#bib.bibx1),[2](https://arxiv.org/html/2606.28692#bib.bibx2)\]\. Unlike fact retrieval or pattern recognition, treatment reasoning is an iterative process in which candidate strategies must be gathered, evaluated against multiple constraints, and revised until the evidence supports a decision\.

Large language models \(LLMs\) access medical knowledge through pretraining \[[3](https://arxiv.org/html/2606.28692#bib.bibx3)\], biomedical alignment \[[4](https://arxiv.org/html/2606.28692#bib.bibx4),[5](https://arxiv.org/html/2606.28692#bib.bibx5),[6](https://arxiv.org/html/2606.28692#bib.bibx6),[7](https://arxiv.org/html/2606.28692#bib.bibx7)\], and agentic frameworks \[[8](https://arxiv.org/html/2606.28692#bib.bibx8),[9](https://arxiv.org/html/2606.28692#bib.bibx9)\]\. These models generate fluent responses and capture broad clinical patterns, but they rely on parametric knowledge stored in model weights, lack access to updated and vetted medical information, and can produce recommendations that fail to account for relevant contraindications, interactions or patient\-specific constraints\. Retrieval\-augmented generation \[[10](https://arxiv.org/html/2606.28692#bib.bibx10)\] and tool\-augmented LLMs \[[11](https://arxiv.org/html/2606.28692#bib.bibx11),[12](https://arxiv.org/html/2606.28692#bib.bibx12),[13](https://arxiv.org/html/2606.28692#bib.bibx13)\] can give LLMs access to information outside their model weights, such as medical documents, biomedical databases and software tools, at inference time\. However, access to biomedical tools does not by itself produce treatment reasoning\. A model must determine what evidence is needed, select the appropriate source, interpret the result in the context of accumulated evidence, and revise its analysis when evidence is incomplete or conflicting\. This capacity cannot be assumed from tool access alone and must be learned\.

We introduce ATHENA\-R1, an AI agent for treatment reasoning that combines multi\-step analysis with direct access to medical evidence\. Rather than producing answers in a single step, ATHENA\-R1 determines what information is needed, retrieves relevant evidence, and uses that evidence to update its analysis\. In each reasoning step, ATHENA\-R1 selects from a library of 212 biomedical tools, retrieves information about drugs, diseases and patient populations, interprets the returned evidence, and incorporates it into subsequent reasoning steps\. This allows ATHENA\-R1 to evaluate candidate treatments through iterative evidence gathering and analysis rather than relying solely on knowledge stored in model weights\.

Generating multi\-step treatment\-reasoning traces at the scale and diversity required for training cannot feasibly be done by human annotators, as each trace must specify what evidence is needed, which tools to call and how retrieved information should be interpreted across hundreds of tools and diverse drug, disease and patient contexts\. We therefore train ATHENA\-R1 through two sequential stages\. First, a multi\-agent system automatically constructs biomedical tools, treatment tasks and reasoning traces, yielding ATHENA\-R1\-Instruct, a dataset of 378,027 instruction\-tuning samples derived from 85,340 reasoning traces, comprising 177,626 reasoning steps and 281,695 tool calls grounded in all US FDA\-approved drugs since 1939\. After supervised fine\-tuning on ATHENA\-R1\-Instruct, ATHENA\-R1 is refined through reinforcement learning in a live 212\-tool environment, receiving rule\-based scientific feedback across six dimensions of reasoning quality, including answer correctness, evidence gathering, multi\-step reasoning and tool\-use validity\.

We evaluate ATHENA\-R1 on five datasets spanning drug reasoning and patient treatment cases\. DrugPC contains 3,168 treatment cases covering 11 treatment tasks, including indications, dosing, safety and pharmacology\. BrandPC and GenericPC replace drug names with brand and generic variants, and DescriptionPC replaces names with textual descriptions\. TreatmentPC contains 456 treatment cases in which the correct answer depends on patient\-specific constraints\. Across these datasets, ATHENA\-R1 consistently outperforms LLMs and tool\-use models in open\-ended evaluation\. On DrugPC, ATHENA\-R1 achieves 94\.7% accuracy, exceeding GPT\-5 \[[14](https://arxiv.org/html/2606.28692#bib.bibx14)\] by 17\.8 percentage points and DeepSeek\-R1 \(671B\) \[[15](https://arxiv.org/html/2606.28692#bib.bibx15)\] by 25\.9 percentage points\. On TreatmentPC, ATHENA\-R1 achieves 82\.9% accuracy, exceeding GPT\-5 by 10\.7 percentage points and DeepSeek\-R1 by 15\.4 percentage points\. ATHENA\-R1 generalizes across brand names, generic names and diverse drug descriptions \(BrandPC, GenericPC and DescriptionPC benchmarks; Extended Data Figure LABEL:fig:descriptionpc; Supplementary Note LABEL:sec:note1\_si\)\.

We evaluate ATHENA\-R1 in three real\-world settings\. First, experts from 28 rare disease organizations assessed blinded responses to rare disease treatment cases spanning neurodevelopmental disorders, epilepsies, metabolic diseases, rare cancers, channelopathies, and immune\-mediated diseases, and preferred ATHENA\-R1 over reference models across all eight evaluation criteria, with the largest gains in cognitive traceability and helpfulness of rationale\. Second, practicing physicians evaluated complex hospitalized patient cases in cardiovascular management and infectious disease, including post\-CABG patients with CKD, anticoagulated patients with surgical site infections, and post\-STEMI patients with severe asthma\. Third, we tested adverse event hypotheses generated by ATHENA\-R1 in longitudinal health records from 5\.4 million patients, prioritizing predictions where prior pharmacovigilance evidence was limited or absent; predicted associations reached adjusted odds ratios of 1\.48–1\.84 in the highest risk patient subpopulations, while negative controls remained near null\.

## Results

### ATHENA\-R1 reasons over treatment choices by gathering evidence step by step

ATHENA\-R1 performs treatment reasoning by combining step\-by\-step analysis with access to medical evidence \(Figure [1](https://arxiv.org/html/2606.28692#Sx4.F1)\)\. It calls tools from a 212\-tool biomedical library \(Supplementary Table LABEL:tab:tool\_list\) to retrieve evidence about drugs, diseases and patient populations from curated sources \[[16](https://arxiv.org/html/2606.28692#bib.bibx16),[17](https://arxiv.org/html/2606.28692#bib.bibx17)\]\. These tools support queries about indications, contraindications, drug\-drug interactions, pharmacology, adverse reactions, disease phenotypes, therapeutic targets and patient\-population restrictions\. For example, ATHENA\-R1 can retrieve a drug’s current approved indications, identify contraindications for a candidate therapy, check interactions between co\-medications, map a disease to associated phenotypes, and query target phenotypic evidence\. Because these tools are queried in real time, ATHENA\-R1 can incorporate current information from FDA prescribing information and biomedical knowledge bases rather than relying solely on knowledge encoded in model parameters\. ATHENA\-R1 uses the retrieved evidence to guide the next reasoning step, allowing mechanisms, interactions, contraindications and safety constraints to be evaluated together\.

At each step, ATHENA\-R1 determines what information is needed, selects relevant tools, retrieves evidence and incorporates the returned information into the analysis \[[18](https://arxiv.org/html/2606.28692#bib.bibx18)\]\. It continues this process until the evidence supports a final answer\. The output includes both the answer and a reasoning trace that records which evidence was retrieved and how it was used\.
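
The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two tool names, the toy label data, the drug names `drug_a` and `drug_b`, and the rule-based `choose_next_call` policy are all hypothetical stand-ins for the learned policy and the 212-tool library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy label data; a real system would query live sources.
TOY_LABELS = {
    "drug_a": {"indications": ["hypertension"], "contraindications": ["pregnancy"]},
    "drug_b": {"indications": ["hypertension"], "contraindications": []},
}

def get_indications(drug: str) -> List[str]:
    return TOY_LABELS[drug]["indications"]

def get_contraindications(drug: str) -> List[str]:
    return TOY_LABELS[drug]["contraindications"]

# Hypothetical two-tool library (the paper's library has 212 tools).
TOOLS: Dict[str, Callable[[str], List[str]]] = {
    "get_indications": get_indications,
    "get_contraindications": get_contraindications,
}

@dataclass
class Trace:
    # Each step records (tool name, argument, returned evidence).
    steps: List[Tuple[str, str, List[str]]] = field(default_factory=list)

def choose_next_call(candidates: List[str], trace: Trace) -> Optional[Tuple[str, str]]:
    """Stand-in policy: request any (tool, drug) pair not yet retrieved."""
    seen = {(tool, arg) for tool, arg, _ in trace.steps}
    for drug in candidates:
        for tool in TOOLS:
            if (tool, drug) not in seen:
                return tool, drug
    return None  # no missing information: stop gathering evidence

def run_agent(disease: str, patient_conditions: List[str],
              candidates: List[str], max_steps: int = 10):
    trace = Trace()
    for _ in range(max_steps):
        call = choose_next_call(candidates, trace)   # what is still missing?
        if call is None:
            break
        tool, drug = call
        evidence = TOOLS[tool](drug)                 # run the selected tool
        trace.steps.append((tool, drug, evidence))   # incorporate the evidence
    # Final answer: first candidate that is indicated and not contraindicated.
    by_call = {(tool, arg): ev for tool, arg, ev in trace.steps}
    for drug in candidates:
        indicated = disease in by_call.get(("get_indications", drug), [])
        conflicts = set(by_call.get(("get_contraindications", drug), [])) & set(patient_conditions)
        if indicated and not conflicts:
            return drug, trace
    return None, trace

answer, trace = run_agent("hypertension", ["pregnancy"], ["drug_a", "drug_b"])
```

In this toy run the returned trace holds four tool calls, and the answer is the candidate whose retrieved contraindications do not overlap the patient's conditions, mirroring the answer-plus-trace output described in the text.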

ATHENA\-R1 can also break complex treatment tasks into smaller analyses\. A patient scenario may require identifying candidate treatments, checking drug\-drug interactions, evaluating comorbidities, comparing safety warnings and applying patient\-specific constraints\. ATHENA\-R1 analyzes these components and combines the results into a final answer \(Figure [1](https://arxiv.org/html/2606.28692#Sx4.F1)\)\. Additional details of the inference process are provided in Methods LABEL:sec:skill\_txagent and Algorithm LABEL:alg:txagent\_inference; examples of key agentic abilities \(knowledge grounding, goal\-oriented tool selection, multi\-step reasoning and real\-time retrieval\) are shown in Supplementary Figure LABEL:fig:extend\_4abilities\.

### Self\-learning for treatment reasoning

Multi\-step treatment reasoning traces are too large and varied to annotate by hand \[[19](https://arxiv.org/html/2606.28692#bib.bibx19),[20](https://arxiv.org/html/2606.28692#bib.bibx20)\]\. Each trace must specify what evidence to retrieve, which tools to use, how to interpret the returned information and how to combine evidence across multiple reasoning steps\. ATHENA\-R1 is therefore trained through two levels of self\-learning that replace human\-written traces with generated reasoning trajectories \[[21](https://arxiv.org/html/2606.28692#bib.bibx21)\]\. The first level teaches ATHENA\-R1 the structure of treatment reasoning, including problem decomposition, evidence retrieval, tool use and evidence interpretation\. The second level teaches ATHENA\-R1 how to act within this structure by improving tool selection, evidence gathering and exploration of alternative reasoning paths\.

At the first level, ATHENA\-R1 automatically constructs its own training data\. Generating treatment reasoning traces directly would require a model that already solves the task\. Instead, a collection of agent systems generates biomedical tools, treatment tasks and multi\-step reasoning traces\. This process produces ATHENA\-R1\-Instruct, a dataset of 378,027 instruction tuning samples derived from 85,340 treatment tasks, comprising 177,626 reasoning steps and 281,695 tool calls grounded in FDA drug labels since 1939 \(Extended Data Figure LABEL:fig:multi\-agent\-system\)\. Supervised fine\-tuning on ATHENA\-R1\-Instruct yields the initial ATHENA\-R1 model\.

At the second level, ATHENA\-R1 refines its policy through reinforcement learning \(Extended Data Figure LABEL:fig:rl\-system\)\. During training, ATHENA\-R1 explores the 212 biomedical tools used at inference and generates multi\-turn reasoning trajectories for each prompt\. Each trajectory receives scientific feedback based on rewards for answer correctness \[[15](https://arxiv.org/html/2606.28692#bib.bibx15)\], output format validity \[[22](https://arxiv.org/html/2606.28692#bib.bibx22)\], evidence gathering, multi\-step reasoning, tool\-argument grounding and reasoning non\-redundancy\. Group relative policy optimization \[[23](https://arxiv.org/html/2606.28692#bib.bibx23),[24](https://arxiv.org/html/2606.28692#bib.bibx24)\] then increases the probability of higher\-scoring trajectories\. This process improves tool selection and evidence gathering across reasoning steps\. Training details are provided in Methods LABEL:sec:train\_set\.
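
The scoring and group-relative step can be illustrated with a short sketch. The six dimension names follow the text, but several choices here are assumptions rather than the paper's settings: the equal-weight sum, the per-dimension scores in [0, 1], the use of the population standard deviation, and the three example trajectories. The standardization itself is the usual group-relative advantage used in group relative policy optimization.

```python
from statistics import mean, pstdev
from typing import Dict, List

REWARD_DIMENSIONS = [
    "answer_correctness",
    "format_validity",
    "evidence_gathering",
    "multi_step_reasoning",
    "tool_argument_grounding",
    "non_redundancy",
]

def total_reward(scores: Dict[str, float]) -> float:
    """Equal-weight sum over the six dimensions (the weighting is an assumption)."""
    return sum(scores[name] for name in REWARD_DIMENSIONS)

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against the group sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three hypothetical trajectories sampled for one prompt.
group = [
    dict.fromkeys(REWARD_DIMENSIONS, 1.0),                                # good on every dimension
    {**dict.fromkeys(REWARD_DIMENSIONS, 1.0), "answer_correctness": 0.0}, # good process, wrong answer
    dict.fromkeys(REWARD_DIMENSIONS, 0.0),                                # poor on every dimension
]
rewards = [total_reward(scores) for scores in group]
advantages = group_relative_advantages(rewards)
```

Trajectories scoring above the group mean receive positive advantages and are made more likely by the policy update; those below the mean receive negative advantages.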

### ATHENA\-R1 outperforms language models on Drug Prescribing Card reasoning

We evaluate ATHENA\-R1 on DrugPC \(Drug Prescribing Cards\), a dataset of 3,168 treatment questions \(Figure [2](https://arxiv.org/html/2606.28692#Sx4.F2)a\)\. DrugPC covers 11 tasks of drug information, including drug overview, ingredients, warnings and safety, dependence and abuse, dosage and administration, use in specific populations, pharmacology, clinical information, nonclinical toxicology, patient information and storage and supply\. To reduce data leakage from pretraining \[[25](https://arxiv.org/html/2606.28692#bib.bibx25)\], the evaluation uses drugs approved by the FDA in 2024, while excluding these drugs from training \(Methods LABEL:sec:benchmark\_details\)\.

We evaluate models in an open\-ended setting \[[5](https://arxiv.org/html/2606.28692#bib.bibx5)\]\. Each treatment question is presented without answer choices\. The model generates a free\-form response, which is then mapped to the correct option among the original 4–5 answer choices \[[26](https://arxiv.org/html/2606.28692#bib.bibx26)\]\. Treatment questions are verified by human experts for validity \(Table LABEL:tab:example\_question\_type; additional details are provided in Methods LABEL:sec:benchmark\_details\)\.
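
A rough sketch of this open-ended protocol is given below. The mapping step is passed in as a `judge` function; the token-overlap judge used here is only a hypothetical stand-in for whatever mapping the evaluation actually uses, and the question, options and response are invented for illustration.

```python
import re
from typing import Callable, Sequence

def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_judge(response: str, options: Sequence[str]) -> int:
    """Stand-in judge: pick the option sharing the most tokens with the response."""
    response_tokens = tokens(response)
    scores = [len(response_tokens & tokens(option)) for option in options]
    return scores.index(max(scores))

def open_ended_accuracy(
    responses: Sequence[str],
    option_sets: Sequence[Sequence[str]],
    gold_indices: Sequence[int],
    judge: Callable[[str, Sequence[str]], int] = overlap_judge,
) -> float:
    """Map each free-form response to an option, then score against the gold option."""
    hits = sum(
        judge(response, options) == gold
        for response, options, gold in zip(responses, option_sets, gold_indices)
    )
    return hits / len(responses)

# Hypothetical question: the model never sees these options while answering.
options = [
    "Take with food",
    "Take on an empty stomach",
    "Take at bedtime only",
    "Take with grapefruit juice",
]
responses = ["The label advises an empty stomach, one hour before meals."]
accuracy = open_ended_accuracy(responses, [options], [1])
```

Keeping the judge pluggable makes it easy to swap the mapping method without changing how accuracy is computed.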

We compare ATHENA\-R1 to GPT\-5, DeepSeek\-R1 and Qwen3 across all 11 tasks \(Figure [2](https://arxiv.org/html/2606.28692#Sx4.F2)b\)\. ATHENA\-R1 achieves 94\.7% overall accuracy across the 3,168 questions\. GPT\-5 achieves 76\.9%, DeepSeek\-R1 achieves 68\.8% and Qwen3 achieves 48\.7%\. ATHENA\-R1 improves accuracy by 17\.8 percentage points over GPT\-5, 25\.9 percentage points over DeepSeek\-R1 and 46\.0 percentage points over Qwen3\. Performance remains consistently high across categories, including warnings and safety, dosage and administration, and use in specific populations\. These results show that drug reasoning benefits from iterative evidence gathering and tool use\. Rather than relying solely on parametric knowledge, ATHENA\-R1 retrieves and interprets FDA label information through multi\-step reasoning, improving performance on tasks that require integrating indications, dosing, safety information and population\-specific treatment constraints\. Ablations isolating the contribution of multi\-step reasoning, training trace length, inference step budget, tool execution, and tool library scaling are reported in Extended Data Figure LABEL:fig:avg\_infer\_step and Supplementary Note LABEL:sec:note3\_si\.

### ATHENA\-R1 outperforms language models on patient treatment selection

We evaluate ATHENA\-R1 on TreatmentPC, a dataset of 456 patient\-specific treatment cases\. Each case compares drugs indicated for the same disease but differing in properties relevant to treatment selection, including indications, use in specific populations, safety warnings, precautions, contraindications and drug interactions\. The correct answer depends on patient context\. A treatment may be appropriate for the disease but unsuitable because of pregnancy, comorbidity, dosing constraints or a contraindicated co\-medication \[[27](https://arxiv.org/html/2606.28692#bib.bibx27)\]\. We evaluate models in an open\-ended setting \[[5](https://arxiv.org/html/2606.28692#bib.bibx5),[26](https://arxiv.org/html/2606.28692#bib.bibx26)\], where the model generates a free\-form response that is subsequently mapped to one of the answer options \(Methods LABEL:sec:benchmark\_details\)\.

ATHENA\-R1 achieves 82\.9% accuracy on TreatmentPC, outperforming all reference models \(Figure [2](https://arxiv.org/html/2606.28692#Sx4.F2)c\)\. ATHENA\-R1 exceeds GPT\-5 by 10\.7 percentage points, DeepSeek\-R1 by 15\.4 percentage points, Qwen3\-Next by 22\.8 percentage points and Qwen3 by 43\.7 percentage points\. Tool\-use LLMs with access to the tool library perform substantially worse\. ToolACE\-8B \[[13](https://arxiv.org/html/2606.28692#bib.bibx13)\] achieves 13\.4% accuracy and WattTool\-8B \[[12](https://arxiv.org/html/2606.28692#bib.bibx12)\] achieves 5\.9%\.

These results reveal a central challenge of treatment reasoning: access to biomedical tools alone is insufficient\. Models must determine what evidence is needed, select relevant tools, interpret returned information and revise conclusions when evidence is incomplete, conflicting or unexpected\.

When GPT\-5 is given optional access to the tool library, it invokes tools on only 1% of treatment cases and its accuracy falls below its own no\-tool baseline\. When tool use is required, performance does not recover\. In contrast, ATHENA\-R1 invokes tools on every treatment case and integrates retrieved evidence into subsequent reasoning steps, achieving 82\.9% accuracy\. The limiting factor in LLM treatment reasoning is therefore not tool availability but the learned capacity to reason over tool outputs, and providing a frontier model with direct access to the tool library does not substitute for that capacity \(Figure [2](https://arxiv.org/html/2606.28692#Sx4.F2)d; Supplementary Note LABEL:sec:note3\_si\)\.

TreatmentPC requires joint reasoning over the patient case and drug properties\. DeepSeek\-R1 \[[15](https://arxiv.org/html/2606.28692#bib.bibx15)\] and GPT\-5 \[[14](https://arxiv.org/html/2606.28692#bib.bibx14)\] are designed for long chain\-of\-thought reasoning and test\-time scaling\. To enable multi\-step reasoning in DeepSeek\-R1, we prompt it using <think\> and </think\>\. Despite DeepSeek\-R1’s 671 billion parameters, ATHENA\-R1 outperforms it by 15\.4 percentage points \(82\.9% versus 67\.5%\)\. Unlike models that rely primarily on internal knowledge, ATHENA\-R1 retrieves FDA labels and drug annotations before answering, allowing each conclusion to be grounded in source evidence\. Extended Data Figure LABEL:fig:comparison\_deepseek illustrates this on a pediatric corticosteroid scenario in which DeepSeek\-R1 incorrectly judges the drug as safe, whereas ATHENA\-R1 retrieves the FDA label and identifies HPA axis suppression as a documented pediatric risk \[[28](https://arxiv.org/html/2606.28692#bib.bibx28),[29](https://arxiv.org/html/2606.28692#bib.bibx29)\]\.

Both levels of ATHENA\-R1 training contribute to TreatmentPC performance \(Figure [2](https://arxiv.org/html/2606.28692#Sx4.F2)e\)\. In this analysis, the model first generates a free\-form reasoning trace and answer, and then selects the answer option that best matches its own response \(the “self as judge” protocol; Figure [2](https://arxiv.org/html/2606.28692#Sx4.F2)d\)\. Under this setting, the Qwen3\-8B base model achieves 39\.2% accuracy\. The first level of ATHENA\-R1 self\-learning, supervised fine\-tuning on ATHENA\-R1\-Instruct, increases accuracy to 66\.5% \(\+27\.3 percentage points\)\. The second level of ATHENA\-R1 self\-learning, reinforcement learning with scientific feedback, further improves accuracy to 74\.8% \(\+8\.3 percentage points\)\.

The choice of answer\-extraction protocol affects absolute scores but not the direction of the result\. Under the “self as judge” protocol, ATHENA\-R1 achieves 74\.8% accuracy; under the “GPT\-5 as judge” protocol used for all baselines, ATHENA\-R1 achieves 82\.9%\. The 8\.1 percentage point difference confirms that extraction protocol contributes to absolute performance, but ATHENA\-R1’s “self as judge” score of 74\.8% still exceeds GPT\-5’s 72\.2% obtained under the more favorable “GPT\-5 as judge” protocol, indicating that ATHENA\-R1’s advantage over GPT\-5 holds regardless of which answer extraction method is applied\.
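
The percentage-point figures in this comparison can be checked directly from the reported accuracies. The snippet below is only a consistency check on numbers quoted in the text; the variable names are illustrative.

```python
# TreatmentPC accuracies (%) as reported in the text.
base_self_judge = 39.2     # Qwen3-8B base model, "self as judge"
sft_self_judge = 66.5      # after supervised fine-tuning, "self as judge"
rl_self_judge = 74.8       # after reinforcement learning, "self as judge"
athena_gpt5_judge = 82.9   # ATHENA-R1, "GPT-5 as judge"
gpt5_gpt5_judge = 72.2     # GPT-5, "GPT-5 as judge"

# Gains from each training level under the same protocol.
sft_gain = round(sft_self_judge - base_self_judge, 1)
rl_gain = round(rl_self_judge - sft_self_judge, 1)

# Effect of the extraction protocol on the ATHENA-R1 score.
protocol_gap = round(athena_gpt5_judge - rl_self_judge, 1)

# Margin over GPT-5 under the shared protocol, and under the less
# favorable cross-protocol comparison discussed in the text.
margin_same_protocol = round(athena_gpt5_judge - gpt5_gpt5_judge, 1)
margin_cross_protocol = round(rl_self_judge - gpt5_gpt5_judge, 1)
```

The cross-protocol margin stays positive, which is the basis for the claim that the direction of the result does not depend on the extraction method.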

### Disease experts prefer ATHENA\-R1 for rare disease treatment reasoning

We evaluated ATHENA\-R1 with disease experts from the Chan Zuckerberg Initiative Rare As One network, which brings together rare disease organizations\. Rare diseases often lack treatment pathways \[[30](https://arxiv.org/html/2606.28692#bib.bibx30)\], requiring treatment decisions to be made from limited and fragmented information about disease mechanisms, contraindications, drug interactions and rare disease patient risks \[[31](https://arxiv.org/html/2606.28692#bib.bibx31)\]\. Experts from participating organizations provided disease\-specific context used to construct treatment cases spanning neurodevelopmental disorders, epilepsies, metabolic diseases, rare cancers, channelopathies and immune\-mediated diseases\. The evaluation involved 29 experts from 28 disease organizations \(Supplementary Table LABEL:tab:thera\_consortium\), including clinicians, disease researchers and patient advocates with expertise spanning therapeutic development and clinical care\. In total, 23 evaluators completed blinded assessments, contributing 110 expert\-evaluated responses\.

ATHENA\-R1 and reference models generated responses with full reasoning traces\. The primary reference model was Qwen3\-8B, the base model from which ATHENA\-R1 is built, chosen to isolate the contribution of the tool library and self\-learning while holding the underlying language model fixed\. Additional comparisons to o3\-mini, Gemini\-2\.0\-Flash and DeepSeek\-R1 variants \(full list in Methods LABEL:sec:human\_eval\) yielded consistent preferences for ATHENA\-R1, though the smaller sample per model precludes formal statistical comparison\. Experts evaluated paired outputs in a blinded, arena\-based setting \(Figure [3](https://arxiv.org/html/2606.28692#Sx4.F3)a,b\) \[[32](https://arxiv.org/html/2606.28692#bib.bibx32),[5](https://arxiv.org/html/2606.28692#bib.bibx5)\]\. For each treatment case, evaluators viewed two responses without model identity and provided pairwise preferences and absolute ratings across eight criteria: task success, helpfulness of rationale, cognitive traceability, possibility of harm, alignment with clinical consensus, accuracy of content, completeness and clinical relevance \(Methods LABEL:sec:human\_eval, Supplementary Note LABEL:sec:note5\_si\) \[[4](https://arxiv.org/html/2606.28692#bib.bibx4),[33](https://arxiv.org/html/2606.28692#bib.bibx33)\]\. Pairwise comparisons used four options: “Model A is better,” “Model B is better,” “Both are equally good,” or “Neither did well\.” Absolute ratings used a 1–5 Likert scale and an “Unable to Judge” option\.

Experts preferred ATHENA\-R1 over reference models across all eight criteria \(Figure [3](https://arxiv.org/html/2606.28692#Sx4.F3)c\)\. Percentages were computed over all 110 evaluations; comparison results are reported in Note LABEL:sec:note5\_si and Table LABEL:tab:stat\_tests\. Preferences were strongest for cognitive traceability \(95\.5%\) and helpfulness of rationale \(94\.5%\), indicating that experts valued responses that exposed the evidence gathering and reasoning process\. Across the remaining criteria, ATHENA\-R1 was preferred in 57–66% of evaluations, whereas reference models were preferred in 16–20%\. Including ties, ATHENA\-R1 matched or exceeded reference models in 74–77% of evaluations for completeness \(66\.4% win, 10\.0% tie\), task success \(63\.6% win, 10\.9% tie\), possibility of harm \(61\.8% win, 11\.8% tie\), alignment with clinical consensus \(59\.1% win, 18\.2% tie\), accuracy of content \(58\.2% win, 18\.2% tie\) and clinical relevance \(57\.3% win, 18\.2% tie\)\.
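
As a quick check on the "matched or exceeded" range, the reported win and tie percentages can be converted back to counts out of 110 and summed. The counts below are inferred by rounding, not taken from the paper, so treat them as an assumption.

```python
N_EVALUATIONS = 110

# (win %, tie %) per criterion, as reported in the text.
reported = {
    "completeness": (66.4, 10.0),
    "task success": (63.6, 10.9),
    "possibility of harm": (61.8, 11.8),
    "alignment with clinical consensus": (59.1, 18.2),
    "accuracy of content": (58.2, 18.2),
    "clinical relevance": (57.3, 18.2),
}

def to_count(percent: float, n: int = N_EVALUATIONS) -> int:
    """Nearest whole number of evaluations for a reported percentage (inferred)."""
    return round(percent * n / 100)

summary = {}
for criterion, (win_pct, tie_pct) in reported.items():
    wins, ties = to_count(win_pct), to_count(tie_pct)
    summary[criterion] = {
        "wins": wins,
        "ties": ties,
        "win_or_tie_pct": round(100 * (wins + ties) / N_EVALUATIONS, 1),
    }

lowest = min(row["win_or_tie_pct"] for row in summary.values())
highest = max(row["win_or_tie_pct"] for row in summary.values())
```

The combined shares run from 73.6% to 77.3%, which rounds to the 74–77% range stated in the text.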

Absolute ratings confirmed that experts assigned higher quality scores to ATHENA\-R1 responses than to reference model responses\. ATHENA\-R1 achieves a mean score of $4.16 \pm 0.90$ out of 5, compared with $2.44 \pm 1.26$ for reference models \(Figure [3](https://arxiv.org/html/2606.28692#Sx4.F3)d\)\. Scores are highest for cognitive traceability \($4.67 \pm 0.61$\) and helpfulness of rationale \($4.58 \pm 0.69$\), and lowest for possibility of harm \($3.76 \pm 0.81$\)\. The largest gaps occur for cognitive traceability \($\Delta = 3.13$\) and helpfulness of rationale \($\Delta = 2.93$\), where reference models score $1.54 \pm 0.96$ and $1.65 \pm 1.09$\. ATHENA\-R1 also improves accuracy of content \($3.99$ vs\. $2.82$\), completeness \($3.88$ vs\. $2.46$\) and alignment with clinical consensus \($4.18$ vs\. $2.76$\)\. All differences are significant \(binomial test and Wilcoxon signed rank test, $P < 5 \times 10^{-5}$; Methods LABEL:sec:human\_eval, Supplementary Note LABEL:sec:note5\_si\)\.

### ATHENA\-R1 reasons through complex, real\-world treatment decisions

We next evaluated ATHENA\-R1 for patient cases where treatment choices require judgment across incomplete guidelines and competing physiological constraints\. We considered five clinical cases from real adult and neonatal hospitalized patients\. Three physicians independently rated responses on the three cases selected for formal evaluation, spanning cardiovascular management and infectious disease, using eight criteria on a 1–5 Likert scale: task success, helpfulness of rationale, cognitive traceability, possibility of harm, alignment with clinical consensus, accuracy of content, completeness and clinical relevance\. The three formally evaluated cases are shown in Figure [4](https://arxiv.org/html/2606.28692#Sx4.F4)\. Two additional cases, a preoperative polypharmacy case and a neonatal case rated by a single physician, and all five reasoning traces are provided in Supplementary Note LABEL:sec:note7\_si\. Case construction, rubric anchors, ethical framework and statistical treatment are described in Methods LABEL:sec:mount\_sinai\.

On each case, ATHENA\-R1 identified the principal therapeutic risk and proposed a recommendation grounded in cited drug\-label, adverse\-event, and drug\-mechanism evidence\. \(1\) In a post\-CABG patient with CKD stage 2, recent contrast\-induced nephropathy, and HFrEF \[[undefag](https://arxiv.org/html/2606.28692#bib.bibx34)\], ATHENA\-R1 selected enalapril over the pre\-admission lisinopril on the basis of reversible effects on BUN and creatinine and the absence of specific CKD contraindications in its label, in contrast to lisinopril’s more frequently reported renal dysfunction and hyperkalemia\. Reviewer\-level means across the eight criteria were 3\.50, 3\.38, and 3\.50, the most consistent agreement of the three cases \(standard deviation across reviewers = 0\.07; Figure [4](https://arxiv.org/html/2606.28692#Sx4.F4)a, b; Supplementary NoteLABEL:sec:note7\_si, Case 1\)\. \(2\) In a patient with a mechanical mitral valve on warfarin who developed a surgical\-site infection after total knee arthroplasty with gram\-positive cocci on wound culture, ATHENA\-R1 flagged the levofloxacin\-warfarin interaction \[[undefah](https://arxiv.org/html/2606.28692#bib.bibx35)\] as the decisive risk and proposed vancomycin, clindamycin, or linezolid \(alongside penicillin\-class beta\-lactams\) as alternatives effective against the cultured organism with more favorable profiles in anticoagulated patients\. Reviewer\-level means were 2\.75, 3\.63, and 4\.25 \(SD = 0\.75; Figure [4](https://arxiv.org/html/2606.28692#Sx4.F4)c, d; Supplementary NoteLABEL:sec:note7\_si, Case 2\)\. \(3\) In a post\-STEMI patient with severe persistent asthma and type 2 diabetes, ATHENA\-R1 ruled out propranolol on bronchoconstriction grounds, recommended β\-blockers with favorable profiles in asthma \(metoprolol, bisoprolol, or carvedilol\) \[[undefai](https://arxiv.org/html/2606.28692#bib.bibx36)\], and flagged hypoglycemia symptom masking as a secondary monitoring concern\. Reviewer\-level means were 4\.75, 4\.38, and 2\.17, the highest overall per\-reviewer mean of the three cases but also the widest spread \(SD = 1\.40; Figure [4](https://arxiv.org/html/2606.28692#Sx4.F4)e, f; Supplementary NoteLABEL:sec:note7\_si, Case 3\), with one reviewer scoring the response markedly below the other two\.
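The reviewer spreads quoted for the three cases can be checked directly from the reported reviewer\-level means. The sketch below assumes that "SD" is the sample standard deviation (n − 1 denominator) of the three reviewer\-level means; the text does not name the estimator, and the dictionary keys are illustrative labels only.

```python
from statistics import mean, stdev

# Reviewer-level means across the eight criteria, as reported in the text.
reviewer_means = {
    "case_1_post_cabg": [3.50, 3.38, 3.50],
    "case_2_warfarin_ssi": [2.75, 3.63, 4.25],
    "case_3_post_stemi_asthma": [4.75, 4.38, 2.17],
}


def reviewer_spread(scores):
    """Return (mean, sample SD) of reviewer-level means, rounded to 2 decimals."""
    return round(mean(scores), 2), round(stdev(scores), 2)


# Assumption: SD is the n-1 sample standard deviation across the three reviewers.
spread = {case: reviewer_spread(scores) for case, scores in reviewer_means.items()}
for case, (avg, sd) in spread.items():
    print(f"{case}: mean={avg:.2f}, SD={sd:.2f}")
```

Under this assumption the computed spreads match the reported values (0.07, 0.75 and 1.40), and Case 3 has the highest mean across reviewers, consistent with the text.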

Across the three cases, ATHENA\-R1 received high ratings for task success, with a mean score of 4\.63 ± 0\.52 across nine case–reviewer pairs\. Every reviewer scored every case at 4 or above for this criterion\. Helpfulness of rationale \(3\.75 ± 0\.71\) and alignment with clinical consensus were rated above 3 by all reviewers on all cases\. Across all 24 case–criterion cells, at least two of three reviewers rated 3 or above\. Reviewer disagreement was concentrated in cases without a single guideline\-endorsed treatment, consistent with the intended ambiguity of the selected scenarios\. These cases illustrate how ATHENA\-R1 reasons through individual treatment decisions rather than providing a systematic quantitative evaluation; the systematic comparison is the cross\-organizational expert study reported above\. Per\-criterion means, standard deviations and handling of reviewer\-missing ratings follow the analysis plan in MethodsLABEL:sec:mount\_sinai\.

### Population\-scale health records support treatment\-associated risk hypotheses

We next evaluated whether ATHENA\-R1 can generate clinically meaningful hypotheses about treatment\-associated adverse event risks\. Many adverse events arise not from a disease, comorbidity or medication alone, but from their interaction within specific patient populations\. To study this setting, we defined triadic patient cases consisting of a primary disease, a comorbidity and a medication \(Figure [5](https://arxiv.org/html/2606.28692#Sx4.F5)a\)\. For each case, ATHENA\-R1 generated candidate adverse events associated with the combined context\. To focus on context\-specific risks, we excluded adverse events predicted from the disease, comorbidity or medication individually and retained those specific to the full triad\. Candidate hypotheses were then screened against pharmacovigilance literature by a clinician and a general\-purpose LLM \[[undefaj](https://arxiv.org/html/2606.28692#bib.bibx37)\] and prioritized for analysis where supporting evidence was limited or absent from the prior literature, making EHR evaluations a test of hypothesis generation rather than confirmation of established drug safety signals \(Methods SectionLABEL:sec:ehr\_eval, Supplementary NoteLABEL:sec:si\_ae\_generation\)\.
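The triad\-specific filter described above amounts to a set difference. The following is a minimal sketch, not the authors' pipeline: the function name `triad_specific_events` and the example adverse events are hypothetical and chosen only to make the filtering step concrete.

```python
def triad_specific_events(triad, disease, comorbidity, medication):
    """Keep adverse events predicted for the full triad but for no single component."""
    single_component = set(disease) | set(comorbidity) | set(medication)
    return sorted(set(triad) - single_component)


# Hypothetical predictions, for illustration only (not model outputs).
predicted = {
    "triad": {"acute kidney failure", "hyperkalemia", "bradycardia", "gout flare"},
    "disease": {"stroke"},
    "comorbidity": {"gout flare"},
    "medication": {"bradycardia", "fatigue"},
}

retained = triad_specific_events(
    predicted["triad"],
    predicted["disease"],
    predicted["comorbidity"],
    predicted["medication"],
)
print(retained)
```

In this toy example, events already predicted from the comorbidity or the medication alone are dropped, and only events that appear solely in the combined context are carried forward to literature screening.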

We evaluated these hypotheses using cohort analyses \[[undefak](https://arxiv.org/html/2606.28692#bib.bibx38)\] in longitudinal health records from 5\.4 million patients \(Figure [5](https://arxiv.org/html/2606.28692#Sx4.F5)b; Methods SectionLABEL:sec:ehr\_eval\)\. For each predicted adverse event, we measured prevalence and estimated confounder\-adjusted odds ratios using multivariate regression\. Models adjusted for age, sex, socioeconomic status and outpatient healthcare utilization\. Residual confounding may remain because clinical factors, treatment history and disease severity are not fully captured by these variables \[[undefal](https://arxiv.org/html/2606.28692#bib.bibx39)\] \(Supplementary NoteLABEL:sec:note6\_si\)\. Across disease–comorbidity–drug contexts, adverse events predicted by ATHENA\-R1 showed higher prevalence in the most specific patient subpopulations and elevated adjusted odds ratios relative to broader comparison cohorts\.

Cohort analyses supported ATHENA\-R1’s prediction of increased risk of acute kidney failure in patients with hypertension and gout treated with β\-blockers \(OR = 1\.84, 95% CI: 1\.69–2\.00; Figure [5](https://arxiv.org/html/2606.28692#Sx4.F5)c, [6](https://arxiv.org/html/2606.28692#Sx4.F6)a\)\. Antihypertensive therapy has been associated with acute kidney injury \(AKI\), although prior analyses have not consistently identified a specific signal for β\-blockers \[[undefam](https://arxiv.org/html/2606.28692#bib.bibx40)\]\. β\-blockers are widely used in patients with hypertension and chronic kidney disease \(CKD\) and are generally considered hemodynamically neutral with respect to renal perfusion and glomerular filtration \[[undefan](https://arxiv.org/html/2606.28692#bib.bibx41)\]\. However, β\-blockers increase serum uric acid levels and are associated with incident gout \[[undefao](https://arxiv.org/html/2606.28692#bib.bibx42)\]\. Hyperuricemia is associated with renal microvascular dysfunction, inflammation and increased risk of AKI \[[undefap](https://arxiv.org/html/2606.28692#bib.bibx43)\]\. This supports a pathway in which elevated uric acid levels in patients with gout act as an effect modifier, identifying a subgroup in which β\-blocker\-associated metabolic effects may contribute to adverse renal outcomes\. Disease severity and baseline renal function are the variables most likely to contribute to residual confounding in this analysis, as both influence β\-blocker selection and independently predict renal outcomes\. Clinically, AKI, including severe forms extending to acute renal failure, carries substantial morbidity and mortality, particularly in patients with multiple comorbidities\.
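For readers less familiar with how an adjusted odds ratio and its interval relate to a logistic regression coefficient, the sketch below works through the acute kidney failure estimate. It assumes a Wald\-type 95% interval on the log\-odds scale, which the text does not state explicitly; the function names are illustrative.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def odds_ratio_ci(beta, se, z=Z_95):
    """Convert a log-odds coefficient and its standard error to (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def se_from_ci(lower, upper, z=Z_95):
    """Back out the standard error of the log-odds from a reported Wald interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Reported estimate for acute kidney failure: OR = 1.84, 95% CI 1.69-2.00.
# Assumption: the interval is a Wald interval, symmetric on the log scale.
beta = math.log(1.84)
se = se_from_ci(1.69, 2.00)
or_hat, lo, hi = odds_ratio_ci(beta, se)
print(f"OR={or_hat:.2f}, 95% CI {lo:.2f}-{hi:.2f}, SE(log OR)={se:.3f}")
```

The reported interval is close to symmetric around log(1.84), so the round trip recovers the published bounds to within rounding, which is consistent with (though not proof of) a Wald interval.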

For the same patient subpopulation, ATHENA\-R1 predicted increased risk of hyperkalemia, supported by cohort analyses \(OR = 1\.78, 95% CI: 1\.59–2\.00; Figure [5](https://arxiv.org/html/2606.28692#Sx4.F5)d, [6](https://arxiv.org/html/2606.28692#Sx4.F6)a\)\. β\-blockers have been associated with a modest but statistically significant increase in hyperkalemia risk after adjustment for comorbidities and monitoring frequency \[[undefaq](https://arxiv.org/html/2606.28692#bib.bibx44)\]\. Mechanistically, β2\-adrenergic blockade reduces cellular potassium uptake, impairing extrarenal potassium handling and increasing serum potassium levels \[[undefar](https://arxiv.org/html/2606.28692#bib.bibx45)\]\. As described above, β\-blockers increase uric acid levels and gout risk, which is associated with AKI\. Because renal excretion accounts for most potassium elimination, reduced kidney function further increases hyperkalemia risk\. This defines a sequential pathway from β\-blockade to hyperuricemia to AKI to impaired potassium clearance\. Clinically, hyperkalemia is a life\-threatening electrolyte disturbance associated with arrhythmias, sudden cardiac death and interruption of guideline\-directed therapies, making identification of high\-risk subgroups important for monitoring and prevention\.

Cohort analyses supported ATHENA\-R1’s prediction of increased risk of hepatocellular carcinoma in patients with diabetes and ischemic heart disease treated with DPP\-4 inhibitors \(OR = 1\.48, 95% CI: 1\.17–1\.88; Figure [5](https://arxiv.org/html/2606.28692#Sx4.F5)e, [6](https://arxiv.org/html/2606.28692#Sx4.F6)a\)\. Diabetes is a well\-established risk factor for hepatocellular carcinoma \(HCC\), driven by insulin resistance and chronic inflammation leading to steatohepatitis and cirrhosis \[[undefas](https://arxiv.org/html/2606.28692#bib.bibx46)\]\. DPP\-4 inhibitors are oral anti\-hyperglycemic agents, although their use has declined relative to SGLT\-2 inhibitors and GLP\-1 receptor agonists \[[undefat](https://arxiv.org/html/2606.28692#bib.bibx47)\]\. Evidence linking DPP\-4 inhibitors to HCC remains inconsistent, with randomized trial meta\-analyses showing no clear increase in risk compared to alternative therapies \[[undefau](https://arxiv.org/html/2606.28692#bib.bibx48)\]\. By contrast, GLP\-1 receptor agonists and SGLT\-2 inhibitors show signals of reduced HCC risk \[[undefav](https://arxiv.org/html/2606.28692#bib.bibx49)\]\. DPP\-4 is involved in immune regulation and tumor biology, providing a basis for context\-dependent effects \[[undefaw](https://arxiv.org/html/2606.28692#bib.bibx50)\]\. Ischemic heart disease likely identifies a population with greater metabolic burden and comorbidity rather than a direct interaction, although this association warrants further study\.

Prevalence analyses supported ATHENA\-R1’s prediction of increased risk of squamous cell carcinoma \(SCC\) in patients with hypertension and gout treated with diuretics, although adjusted regression did not reach statistical significance \(OR = 1\.08, 95% CI: 0\.66–1\.78; Figure [5](https://arxiv.org/html/2606.28692#Sx4.F5)f, [6](https://arxiv.org/html/2606.28692#Sx4.F6)a\)\. Thiazide diuretics, particularly hydrochlorothiazide, are associated with increased SCC risk in a dose\-dependent manner \[[undefax](https://arxiv.org/html/2606.28692#bib.bibx51)\]\. The mechanism likely involves photosensitization, which increases susceptibility to ultraviolet\-induced DNA damage\. Diuretics also increase uric acid levels and gout risk, and gout has been linked to increased cancer risk \[[undefay](https://arxiv.org/html/2606.28692#bib.bibx52)\]\. These effects combine to produce a layered risk structure\. Clinically, SCC can be locally invasive or metastatic, particularly in older patients with comorbidities\. The absence of a significant adjusted odds ratio does not rule out a true association; SCC is a relatively rare outcome in observational data and the analysis is likely underpowered to detect modest effects\. This result illustrates the range of confidence that should be applied to model\-generated hypotheses and motivates prospective validation in larger cohorts\.

Prevalence analyses supported ATHENA\-R1’s prediction of increased risk of liver failure in patients with hyperlipidemia and hypothyroidism treated with statins, although adjusted regression did not reach statistical significance \(OR = 1\.04, 95% CI: 0\.68–1\.58; Figure [5](https://arxiv.org/html/2606.28692#Sx4.F5)g, [6](https://arxiv.org/html/2606.28692#Sx4.F6)a\)\. Statins can cause liver enzyme elevations, but true liver failure is rare, with rates below 2 per 1 million patient\-years \[[undefaz](https://arxiv.org/html/2606.28692#bib.bibx53)\]\. Mechanistically, statins can induce hepatocellular injury through mitochondrial and metabolic effects\. Hypothyroidism is associated with steatotic liver disease and cirrhosis, which may increase baseline hepatic vulnerability\. This suggests a context in which statin exposure contributes to more severe hepatic outcomes in a susceptible subgroup\.

ATHENA\-R1 predicted increased risk of respiratory failure in patients with diabetes and chronic kidney disease treated with metformin; cohort analyses did not show a significant increase \(OR = 1\.00, 95% CI: 0\.92–1\.07; Figure [5](https://arxiv.org/html/2606.28692#Sx4.F5)h, [6](https://arxiv.org/html/2606.28692#Sx4.F6)a\)\. Metformin is associated with lactic acidosis in patients with impaired renal function\. Reduced clearance increases the risk of drug accumulation and acidosis\. Metabolic acidosis increases respiratory drive and can contribute to respiratory decompensation \[[undefaaa](https://arxiv.org/html/2606.28692#bib.bibx54)\]\. Observational studies also report increased respiratory complications in high\-risk populations with COPD\. These mechanisms support indirect pathways linking metformin to respiratory outcomes in specific patient groups\.

We validate the analysis using positive and negative controls\. Positive controls recover established clinical effects, including hyperkalemia from ACE inhibitor therapy in CKD \[[undefaab](https://arxiv.org/html/2606.28692#bib.bibx55)\], ischemic risk from inhaled β\-agonists \[[undefaac](https://arxiv.org/html/2606.28692#bib.bibx56)\], and acidosis from metformin in CKD \[[undefaad](https://arxiv.org/html/2606.28692#bib.bibx57)\]\. We also recover the cardiovascular benefits of GLP\-1RA and SGLT\-2 therapies \[[undefaae](https://arxiv.org/html/2606.28692#bib.bibx58), [undefaaf](https://arxiv.org/html/2606.28692#bib.bibx59), [undefaag](https://arxiv.org/html/2606.28692#bib.bibx60)\] \(Figure [6](https://arxiv.org/html/2606.28692#Sx4.F6)b\)\. Negative controls show no increase in adjusted risk, with estimates centered near null \(Figure [6](https://arxiv.org/html/2606.28692#Sx4.F6)c\)\. These results indicate that the evaluation approach does not generate spurious associations in the absence of plausible mechanisms, and they support its use for evaluating the plausibility of clinically meaningful adverse event risk\.
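The significance criterion applied to the six hypotheses (an odds ratio above 1 whose 95% CI excludes 1) can be restated compactly. The sketch below uses only the estimates reported in this section; the `classify` helper and its labels are illustrative names, not part of the authors' pipeline.

```python
# (odds ratio, CI lower, CI upper) as reported in the text for each hypothesis.
hypotheses = {
    "acute kidney failure": (1.84, 1.69, 2.00),
    "hyperkalemia": (1.78, 1.59, 2.00),
    "hepatocellular carcinoma": (1.48, 1.17, 1.88),
    "squamous cell carcinoma": (1.08, 0.66, 1.78),
    "liver failure": (1.04, 0.68, 1.58),
    "respiratory failure": (1.00, 0.92, 1.07),
}


def classify(odds_ratio, lower, upper):
    """Label an estimate by whether its 95% CI excludes the null value of 1."""
    if lower > 1.0:
        return "elevated"
    if upper < 1.0:
        return "reduced"
    return "not significant"


labels = {name: classify(*estimate) for name, estimate in hypotheses.items()}
significant = [name for name, label in labels.items() if label == "elevated"]
print(significant)
```

Three of the six hypotheses meet the criterion, while the remaining three have intervals that include 1 and are best read as unresolved rather than refuted.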

## Discussion

ATHENA\-R1 is trained to perform treatment reasoning by gathering evidence through tool use, interpreting returned information, and revising its analysis over multiple steps\. This process produces both a treatment recommendation and a reasoning trace that records how evidence contributes to the final conclusion\. ATHENA\-R1 addresses treatment reasoning tasks in which a treatment case is posed and evidence must be gathered before a conclusion can be reached, tasks that arise across drug selection, personalized treatment planning, and hypothesis generation about treatment\-associated risks\. It is not a point\-of\-care reference tool or a risk calculator, but a research system to evaluate whether AI can learn the evidence\-seeking process that underlies treatment decisions\. Across benchmark datasets, expert evaluations, clinical cases, and population\-scale analyses, ATHENA\-R1 outperformed language models and tool\-use systems\. Benchmark evaluations establish performance under controlled conditions, while expert evaluations in rare disease settings, physician ratings on complex hospitalized patient cases, and population\-scale EHR validation test whether that performance extends to open\-ended cases with real\-world clinical complexity\.

A central design choice in ATHENA\-R1 is the separation of reasoning from knowledge storage\. Rather than relying solely on knowledge encoded in model parameters, ATHENA\-R1 retrieves evidence from tools that query biomedical resources, including FDA drug labels and clinically validated knowledge bases, allowing conclusions to be grounded in retrievable evidence and updated as new information becomes available\. This separation also has implications for how model scale relates to reasoning performance\. Consistent performance of ATHENA\-R1 above DeepSeek\-R1 \(671B parameters\) and GPT\-5 suggests that targeted tool\-use training may be more effective than scaling alone for tasks that require iterative evidence gathering\. Together, these results show that treatment reasoning can be framed as an iterative process of evidence gathering and analysis, and that reinforcement learning can train AI to perform this process\.
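The separation of reasoning from knowledge storage can be pictured as a loop that checks what evidence is still missing, calls a tool to fetch it, and records each step in a trace. The sketch below is a toy illustration under stated assumptions, not the ATHENA\-R1 implementation: the tool names (`drug_label_lookup`, `interaction_lookup`), the `run_agent` function, and the fixed list of information needs are all hypothetical, and the trained agent chooses what to seek with a learned policy rather than a fixed list.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry: each tool answers one kind of information need.
# Real tools would query external biomedical resources; these return canned text.
TOOLS: Dict[str, Callable[[str], str]] = {
    "drug_label_lookup": lambda drug: f"label summary for {drug} (placeholder)",
    "interaction_lookup": lambda drug: f"interaction summary for {drug} (placeholder)",
}


def run_agent(drug: str, needs: List[str], max_steps: int = 8) -> Tuple[Dict[str, str], List[str]]:
    """Gather evidence for each need, recording every step in a trace."""
    evidence: Dict[str, str] = {}
    trace: List[str] = []
    for _ in range(max_steps):
        # Step 1: identify which pieces of evidence are still missing.
        missing = [need for need in needs if need not in evidence]
        if not missing:
            trace.append("conclude: all needed evidence gathered")
            break
        # Step 2: select a tool for the first missing need, if one exists.
        need = missing[0]
        tool = TOOLS.get(need)
        if tool is None:
            evidence[need] = "unavailable"
            trace.append(f"gap: no tool covers '{need}'")
            continue
        # Step 3: run the tool and incorporate what it returns.
        evidence[need] = tool(drug)
        trace.append(f"tool call: {need}({drug!r})")
    return evidence, trace


evidence, trace = run_agent(
    "metformin", ["drug_label_lookup", "interaction_lookup", "dosing_in_ckd"]
)
for line in trace:
    print(line)
```

Because knowledge lives in the tools rather than in the loop, swapping or updating a tool changes the evidence without retraining the reasoning step; the explicit "gap" entry also shows how a missing tool surfaces in the trace instead of being silently filled in.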

Several limitations warrant further investigation\. The quality of outputs depends on the coverage and reliability of ATHENA\-R1’s tool library\. Missing endpoints and retrieval errors can propagate through treatment reasoning \[[undefaah](https://arxiv.org/html/2606.28692#bib.bibx61)\]\. ATHENA\-R1 also does not quantify uncertainty\. Although grounding in retrieved evidence improves verifiability \(Extended Data FigureLABEL:fig:comparison\_deepseek\), uncertainty can arise from incomplete evidence, incorrect tool outputs, and errors in evidence synthesis \[[undefaai](https://arxiv.org/html/2606.28692#bib.bibx62)\]\. In the clinical case evaluations, reviewer disagreement was highest on cases without clear guideline\-endorsed answers, precisely the cases where quantified uncertainty would be most valuable \[[undeft](https://arxiv.org/html/2606.28692#bib.bibx21)\]\. In addition, training relies on generated reasoning traces rather than human\-written demonstrations\. This enables supervision at a scale that would otherwise be impractical, but generated traces may inherit biases from the generation process \[[undefaaj](https://arxiv.org/html/2606.28692#bib.bibx63)\]\.

ATHENA\-R1 operates on natural language inputs and does not incorporate imaging, laboratory time series, genomic measurements, or longitudinal patient records\. Extending treatment reasoning to multimodal patient data will require new representations and evaluation frameworks \[[undefaak](https://arxiv.org/html/2606.28692#bib.bibx64)\]\. Population\-scale analyses presented here to evaluate hypotheses generated by ATHENA\-R1 are observational\. Although these analyses adjust for patient demographics and healthcare utilization \(Supplementary NoteLABEL:sec:note6\_si\), residual confounding and measurement error may remain\. These results should be interpreted as support for hypothesis generation rather than causal inference\.

ATHENA\-R1 frames treatment reasoning as an evidence\-seeking process in which AI gathers and evaluates biomedical information before reaching a conclusion\. By separating treatment reasoning from knowledge storage, ATHENA\-R1 can incorporate new evidence while preserving a transparent record of how conclusions are formed\. Results presented here establish a principle that treatment reasoning, which requires knowing what evidence to seek before a recommendation can be formed, can be learned through reinforcement learning over a universe of biomedical tools\. As biomedical knowledge expands to encompass genomic, imaging, and longitudinal patient data, and as tool libraries expand to query these sources, training frameworks of this kind may support reasoning across an increasingly broad range of therapeutic contexts\.

Acknowledgements\. We gratefully acknowledge the support of NSF CAREER 2339524, ARPA\-H Biomedical Data Fabric \(BDF\) Toolbox Program, Harvard Data Science Initiative, Amazon Faculty Research, Google Research Scholar Program, AstraZeneca Research, Roche Alliance with Distinguished Scientists \(ROADS\) Program, Sanofi iDEA\-iTECH Award, GlaxoSmithKline Award, Boehringer Ingelheim Award, Merck Award, Optum AI Research Collaboration Award, Pfizer Research, Gates Foundation \(INV\-079038\), Chan Zuckerberg Initiative, John and Virginia Kaneb Fellowship at Harvard Medical School, Biswas Computational Biology Initiative in partnership with the Milken Institute, Collaborative Center for XDP at Massachusetts General Hospital, Harvard Medical School Dean’s Innovation Fund for the Use of Artificial Intelligence, and the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University\. A\.N\. was supported by the Rhodes Scholarship\. D\.A\.C\. was funded by an NIHR Research Professorship \(NIHR302440\), a Royal Academy of Engineering Research Chair, and the InnoHK Hong Kong Centre for Cerebro\-Cardiovascular Engineering, and was supported by the National Institute for Health Research Oxford Biomedical Research Centre and the Pandemic Sciences Institute at the University of Oxford\. This research was enabled by the AI Cluster at the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University\. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funders\. Figures [1](https://arxiv.org/html/2606.28692#Sx4.F1), [2](https://arxiv.org/html/2606.28692#Sx4.F2), [3](https://arxiv.org/html/2606.28692#Sx4.F3), and [5](https://arxiv.org/html/2606.28692#Sx4.F5) were created, in part, using Biorender\.com \(see [https://biorender\.com/zn1rz6u](https://biorender.com/zn1rz6u)\)\.

Ethics approval\. All parts of this study that relate to the use of Clalit Health Services data were approved by the Clalit Health Services Institutional Review Board \(Helsinki\) committee\. Work conducted at Mount Sinai was approved by the Mount Sinai Institutional Review Board \(STUDY\-20\-00338\)\.

Author contributions\. S\.G\. developed and implemented ATHENA\-R1 and performed all benchmarking analyses\. A\.N\., J\.K\., B\.G\., A\.S\. and J\.L\. developed and evaluated the clinical case scenarios\. S\.G\. and R\.Z\. developed the rare disease treatment reasoning evaluation\. Members of the ATHENA\-R1 Evaluation Consortium performed human evaluations in rare disease treatment reasoning\. A\.N\., N\.D\. and R\.B\. designed and performed the patient cohort analyses using electronic health records\. All authors discussed the results and contributed to the manuscript\. S\.G\. and M\.Z\. designed the study, and M\.Z\. supervised and led the overall study\.

Competing interests\. The authors declare no competing interests\.

![Refer to caption](https://arxiv.org/html/2606.28692v1/x1.png)Figure 1:lves precision treatment reasoning problems by retrieving and analyzing medical evidence from a biomedical tool universe\.For a patient treatment scenario, nerates a treatment recommendation together with a reasoning trace that records evidence retrieval, tool use, and intermediate analyses\. The example shows justing therapy for a 77\-year\-old man with type 2 diabetes and early chronic kidney disease \(eGFR 52 mL/min\) receiving metformin, an ACE inhibitor, and hydrochlorothiazide\. Green nodes denote human input and clinician interjections, blue nodes reasoning steps, and orange nodes tool calls that retrieve biomedical evidence\. es not follow a fixed sequence of operations\. It adaptively determines which evidence to collect, which analyses to perform, and which questions require further investigation based on information gathered in previous steps\. In this example, aluates the ACE inhibitor, hydrochlorothiazide, and alternative treatment options in parallel, eliminates unsupported hypotheses \(here, calcium\-channel blockers show no interaction with metformin\), and synthesizes the remaining evidence\. A clinician interjection \("Is lactic acidosis risk significant at this eGFR?"\) triggers additional evidence gathering and analysis before nerates its final recommendation\. Multiple orange nodes branching from a single blue node indicate parallel tool calls within one reasoning step\. Separate reasoning branches indicate concurrent analyses of different treatment considerations\. 
This adaptive process allows integrate patient characteristics, medications, contraindications, and biomedical evidence when evaluating treatment options\.![Refer to caption](https://arxiv.org/html/2606.28692v1/x2.png)Figure 2:tperforms reasoning models and tool\-use LLMs on drug prescribing and patient treatment benchmarks\.\(a\)Construction of the DrugPC and TreatmentPC benchmarks from FDA prescribing information\. Structured FDA drug labels were used to generate treatment questions across drug prescribing and patient\-specific treatment selection tasks\. Human review was used to refine the questions, answer choices, and explanations\.\(b\)Accuracy on the DrugPC open\-ended benchmark, comprising 3,168 questions across 11 drug\-prescribing tasks\. Each model generated a free\-form answer without access to answer options; an independent GPT\-5 instance then mapped the answer to one option for scoring, using the protocol shown ind\. hieved 94\.7% micro\-averaged accuracy across all questions, compared with 76\.9% for GPT\-5, 68\.8% for DeepSeek\-R1, and 48\.7% for Qwen3\. The 11 tasks cover drug overview, ingredients, warnings and safety, dependence and abuse, dosage and administration, use in specific populations, pharmacology, clinical information, nonclinical toxicology, patient\-focused information, and storage and supply\.\(c\)Accuracy on the TreatmentPC open\-ended benchmark, comprising 456 patient\-specific treatment scenarios\. hieved 82\.9% accuracy, outperforming reasoning models GPT\-5 \(72\.2%\), DeepSeek\-R1 \(67\.5%\), Qwen3\-Next \(60\.1%\), and Qwen3 \(39\.2%\), as well as tool\-use LLMs with full access to tool library, including ToolACE\-8B \(13\.4%\) and WattTool\-8B \(5\.9%\)\.\(d\)TreatmentPC accuracy under matched tool access and answerextraction conditions\. GPT\-5 reached 72\.2% without tools, 66\.9% when tools were available but optional, and 70\.2% when required to call a tool on every question\. 
Under optional tool access, GPT\-5 used tools on 1% of questions, whereas ed tools on every question\. Because open\-ended answers must be mapped to answer options before scoring, we compared two answer\-extraction protocols for n the ‘self as judge” protocol, pped its own free\-form answer to an option and reached 74\.8% accuracy\. In the ‘GPT\-5 as judge” protocol, an independent GPT\-5 instance performed the same mapping and ached 82\.9% accuracy, the protocol used for all baselines in c\. Both protocols score the selected option against the same gold answer key; they differ only in how answer extraction is performed\.\(e\)TreatmentPC accuracy across the two levels of lf\-learning, scored with the “self as judge” protocol fromd\. The Qwen3\-8B base model achieved 39\.2% accuracy\. Level 1, supervised fine\-tuning onATHENA\-R1\-Instruct, increased accuracy to 66\.5%\. Level 2, reinforcement learning with scientific feedback, further improved accuracy across RL steps 0\-60, reaching 74\.8% at step 60\. The y\-axis is broken to show the Qwen3 baseline and the RL trajectory on the same plot\. Each curacy inb\-ecomes from one independent rollout; sampling variability acrossn=5n\{=\}5independent rollouts is reported in Extended Data FigureLABEL:fig:reproducibilityand Supplementary NoteLABEL:sec:note8\_si\.![Refer to caption](https://arxiv.org/html/2606.28692v1/x3.png)Figure 3:Across all eight evaluation criteria, disease experts from 28 rare disease organizations prefer responses and reasoning traces over those of reference models\.\(a\)Design of the human evaluation of or each patient case and treatment\-development scenario, d reference models \(predominantly Qwen3\-8B; six reference models in total, listed in Online Methods\) independently generated a response and multi\-step reasoning trace\. Each case was routed to experts whose disease expertise matched the case\. 
We collected 110 expert\-evaluated responses from 23 evaluators, including disease experts from 28 rare disease organizations in the Chan Zuckerberg Initiative Rare As One network\.\(b\)Arena\-based evaluation interface\. For each treatment case, experts view two responses side by side with model identity hidden, select the preferred response \(pairwise preference\), and rate each response on a 1–5 scale\. Both judgments span eight criteria: task success, helpfulness of rationale, cognitive traceability, possibility of harm, alignment with clinical consensus, accuracy of content, completeness, and clinical relevance\.\(c\)Pairwise preferences across all eight criteria\. In blinded head\-to\-head comparisons, experts prefer er reference models on every criterion, with the largest margins for cognitive traceability \(95\.5%\) and helpfulness of rationale \(94\.5%\)\.\(d\)Absolute ratings across all eight criteria\. Experts rate tputs at a mean of4\.16±0\.904\.16\\pm 0\.90out of 5, versus2\.44±1\.262\.44\\pm 1\.26for reference models\.\(e\)Δ\\Deltarating between d the reference model\. Bars are colored by evaluation axis\.![[Uncaptioned image]](https://arxiv.org/html/2606.28692v1/x4.png)Figure 4:On complex, real\-world hospitalized\-patient cases in cardiovascular management and infectious disease, physician reviewers consistently rate treatment recommendations as successful\.Each case is a real hospitalized patient whose treatment decision requires weighing competing physiological constraints under incomplete guideline coverage\. In each case, entifies the principal therapeutic risk and proposes a recommendation grounded in cited label, adverse\-event, and mechanism evidence\. 
Panels pair each case \(left\) with its expert ratings \(right\)\.\(a\)A 67\-year\-old male on postoperative day 2 following three\-vessel CABG, with HFrEF, CKD stage 2 and recent contrast\-induced nephropathy, asking which ACE inhibitors can be safely administered postoperatively\.\(b\)Expert absolute ratings \(1–5 scale\) on the case inaacross eight criteria: task success, possibility of harm, completeness, helpfulness of rationale, cognitive traceability, clinical relevance, alignment with clinical consensus, and accuracy of content\. Bars show the mean across three physician reviewers; dots show individual reviewer scores\.\(c\)A 68\-year\-old woman with a mechanical mitral valve on warfarin, 4 days after elective total knee arthroplasty, developed a surgical\-site infection with gram\-positive cocci on wound culture; the infectious disease team recommended switching to levofloxacin\. Question: what are the risks of levofloxacin and what are the alternative antibiotics?\(d\)Expert absolute ratings on the case inc, across the same eight criteria; bars and dots as inb\.\(e\)A 51\-year\-old female on postoperative day 1 after PCI for STEMI with severe persistent asthma and type 2 diabetes, asking whichβ\\beta\-blocker is most appropriate for secondary prevention\.\(f\)Expert absolute ratings on the case ine, across the same eight criteria; bars and dots as inb\. Task success received a mean rating of4\.63±0\.524\.63\\pm 0\.52across the nine case\-reviewer pairs, and across all three cases and all eight criteria at least two of three reviewers rated response at 3 or above\. These cases illustrate reasoning on individual treatment decisions and complement the systematic expert evaluation in Figure[3](https://arxiv.org/html/2606.28692#Sx4.F3)\. 
Bars are colored by evaluation axis\.![Refer to caption](https://arxiv.org/html/2606.28692v1/FIG/FIG5.png)Figure 5:Adverse events predicted by r disease, comorbidity, and medication profiles occur at the highest prevalence in the most specific patient subpopulations across electronic health records from 5\.4 million patients\.\(a\)Workflow for generating adverse\-event hypotheses with e defined patient profiles, each specified by a primary disease, a comorbidity, and a medication, and used each profile to construct a contrastive prompt that directs predict adverse events attributable to the full combination rather than to any single component\. Candidate predictions were then scored and ranked by a clinician and a large language model\.\(b\)Retrospective validation pipeline using population\-scale EHRs\. Clinical entities \(diseases, comorbidities, drugs, and predicted adverse events\) are mapped to standardized medical codes\. We construct patient cohorts by identifying individuals who meet the specified clinical criteria, with index dates set by the timing of diagnoses and drug exposures\. Statistical analyses then quantify the risk of each predicted adverse event in exposed versus unexposed populations, using prevalence calculations and confounder\-adjusted logistic regression\.\(c\-h\)Evaluation of six adverse\-event hypotheses generated by ross patient populations drawn from electronic health records of more than 5\.4 million individuals in the Clalit Health Services\. Each panel shows the prompt provided to eft\), the predicted adverse event \(center\), and the observed prevalence of that event across five progressively more specific patient cohorts \(right\)\. In all six panels, prevalence is highest in the most specific cohort, indicating that entifies adverse events that are most prevalent in narrowly defined patient subpopulations\. 
Confounder\-adjusted effect estimates and control analyses for these hypotheses are shown in Figure [6](https://arxiv.org/html/2606.28692#Sx4.F6)\.

![Refer to caption](https://arxiv.org/html/2606.28692v1/x5.png)

Figure 6: Population\-scale electronic health records support adverse\-event risk predictions\. Forest plot of adjusted odds ratios \(OR\) with 95% confidence intervals \(CI\) for each adverse event predicted by ATHENA\-R1, shown alongside positive controls and negative controls \(red\)\. Positive controls are established drug\-risk associations that a calibrated pipeline should recover, and negative controls are biologically implausible associations that it should not\. All regression models were adjusted for demographic confounders, including age, sex, and socioeconomic status\. An OR above 1 whose CI excludes 1 denotes a statistically significant increase in risk\. \(a\) ATHENA\-R1 predictions: adverse\-event risks predicted by ATHENA\-R1 for patient profiles defined by a disease, a comorbidity, and a medication\. Three of six predicted associations reach statistical significance \(OR \> 1, 95% CI excluding 1\): acute kidney failure \(OR 1\.84; 95% CI 1\.69–2\.00\), hyperkalemia \(OR 1\.78; 95% CI 1\.59–2\.00\), and hepatocellular carcinoma \(OR 1\.48; 95% CI 1\.17–1\.88\)\. All six have OR ≥ 1\. \(b\) Positive controls: established clinical associations used to benchmark the EHR analysis pipeline, including known risks such as hyperkalemia from ACE inhibitor use in chronic kidney disease \(OR 1\.59\) and known protective effects such as reduced heart failure risk with SGLT\-2 inhibitors \(OR 0\.57\)\. \(c\) Negative controls: associations between the target drugs and unrelated medical conditions\. In all cases, OR = 1, with confidence intervals crossing the vertical line of no effect\.
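The significance rule stated in the caption (a CI that excludes 1) can be applied directly to the three reported intervals. The sketch below is illustrative rather than the authors' analysis code: the function names are hypothetical, and the standard-error calculation assumes Wald\-type intervals that are symmetric on the log\-odds scale, which the caption does not state.

```python
import math

# Two-sided 95% normal quantile, used only under the Wald-interval assumption.
Z_95 = 1.959963984540054


def classify(ci_low, ci_high):
    """Apply the caption's rule: the association is significant only if the
    95% CI excludes 1; the direction follows from which side of 1 it lies on."""
    if ci_low > 1.0:
        return "increased risk"
    if ci_high < 1.0:
        return "decreased risk"
    return "not significant"


def log_or_standard_error(ci_low, ci_high):
    """Back out the standard error of log(OR) from a 95% CI.

    Assumption: the interval is exp(log(OR) +/- Z_95 * SE).
    """
    return (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)


# (OR, CI lower, CI upper) as reported in panel a of Figure 6.
reported = {
    "acute kidney failure": (1.84, 1.69, 2.00),
    "hyperkalemia": (1.78, 1.59, 2.00),
    "hepatocellular carcinoma": (1.48, 1.17, 1.88),
}

for event, (odds_ratio, low, high) in reported.items():
    # Under the log-symmetry assumption, the geometric mean of the CI bounds
    # should reproduce the point estimate up to rounding.
    midpoint = math.sqrt(low * high)
    se = log_or_standard_error(low, high)
    print(f"{event}: {classify(low, high)}, OR={odds_ratio}, "
          f"CI midpoint={midpoint:.2f}, SE(log OR)={se:.3f}")
```

For all three reported associations, the geometric mean of the CI bounds matches the point estimate to two decimals, which is consistent with the log\-symmetric form assumed here.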