Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation


Summary

This paper presents an online adaptive clinical decision support AI system that integrates treatment effect estimation, digital twin simulation, and reinforcement learning to recommend treatments in a safe, clinician-supervised manner, validated on a synthetic simulator and the TCGA ovarian cancer dataset.

arXiv:2606.17405v1 Announce Type: new

Cached at: 06/17/26, 05:36 AM

# Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation
Source: [https://arxiv.org/html/2606.17405](https://arxiv.org/html/2606.17405)
Xinyu Qin†, Anil K\. Sood‡, Ruiheng Yu†, Sara Corvigno‡, Elaine Stur‡, and Lu Wang†,††,∗

###### Abstract

Clinical decision support AI systems \(CDSASs\) must adapt to evolving patient conditions in real\-time while adhering to strict safety constraints\. We present an online adaptive framework that integrates Treatment Effect \(TE\) estimation to quantify clinical benefits, a patient Digital Twin \(DT\) to simulate treatment trajectories, and Reinforcement Learning \(RL\) for sequential decision\-making\. The AI system is initially trained on historical medical records and operates in a continuous learning loop\. To ensure safety, a rule\-based module monitors vital signs and blocks contraindicated treatments\. Cases with strong internal model disagreement are flagged for clinician review, simulated in our experiments via a pre\-trained outcome model\. We validate our framework using both a synthetic clinical simulator and a real\-world ovarian cancer dataset from The Cancer Genome Atlas \(TCGA\)\. In both simulated and clinical settings, our method demonstrated superior effectiveness and stability in recommending treatments compared to standard computational baselines\. Furthermore, the AI system maintains low latency and requires expert consultation for only a minority of cases in our experimental validation, demonstrating its potential as a safe, clinician\-supervised tool for personalized medicine that continuously improves through practical use\.

## I Introduction

Clinical decisions arrive in sequence and involve risk \[Sutton et al\., [1998](https://arxiv.org/html/2606.17405#bib.bib48)\]\. Policies learned offline can be effective at deployment, yet dataset shift and limited coverage reduce their value as conditions evolve \[Levine et al\., [2020](https://arxiv.org/html/2606.17405#bib.bib52)\]\. Our objective is an online adaptive clinical decision support tool that learns during use while respecting safety\. Treatment Effect \(TE\) estimation serves as the primary metric for clinical benefit, ensuring the AI system prioritizes interventions that offer the greatest evidence\-based improvement for the patient under a clear counterfactual reference \[Hernan and Robins, [2020](https://arxiv.org/html/2606.17405#bib.bib51)\]\. A patient Digital Twin \(DT\) provides a virtual environment to simulate patient responses and predict potential future health states based on real\-time data \[Meijer et al\., [2023](https://arxiv.org/html/2606.17405#bib.bib50)\]\. Reinforcement Learning \(RL\) enables long\-term treatment planning by modeling the relative value of different clinical actions over time \[Sutton et al\., [1998](https://arxiv.org/html/2606.17405#bib.bib48), Chen et al\., [2022](https://arxiv.org/html/2606.17405#bib.bib2)\]\.

We link these parts into a single AI system focused on online learning with guardrails\. First, the AI system undergoes an offline training stage using historical medical records, ensuring that its recommendations remain within the boundaries of established clinical practices \[Fujimoto et al\., [2019](https://arxiv.org/html/2606.17405#bib.bib53)\]\. Second, during real\-time operation, the tool suggests treatments by pooling insights from multiple internal models and only requests human expert guidance when these models show high uncertainty\. This uncertainty is quantified by measuring the variation in predictions across the model ensemble, providing a reliable measure of confidence for clinical decision support \[Lakshminarayanan et al\., [2017](https://arxiv.org/html/2606.17405#bib.bib54), Chen et al\., [2025](https://arxiv.org/html/2606.17405#bib.bib7), Raza et al\., [2025](https://arxiv.org/html/2606.17405#bib.bib5)\]\. Third, the AI system performs frequent, stable updates on recent patient data to adapt to evolving conditions without reacting erratically to new information \[Jayaraman et al\., [2024](https://arxiv.org/html/2606.17405#bib.bib49)\]\. To minimize the workload for clinicians, an automated selection process identifies only the most informative and diverse cases for expert review \[Sener and Savarese, [2017](https://arxiv.org/html/2606.17405#bib.bib55)\]\. The tool exposes lightweight controls for query threshold, stream rate, and batch size, enabling simple, rapid runtime behavior changes without requiring full retraining\.

![Refer to caption](https://arxiv.org/html/2606.17405v1/framework.png)

Figure 1: Overview of the proposed DT\-powered treatment response AI system\. Dynamic multimodal data are ingested continuously to update the DT state, which supports treatment recommendation under safety constraints and uncertainty monitoring, and generates a treatment response report; an online updating loop flags uncertain cases for clinician follow\-up and uses accumulated feedback for model updates\.

![Refer to caption](https://arxiv.org/html/2606.17405v1/interface.png)

Figure 2: Tool overview of the deployed AI system\. The left panel provides the data upload and preview interface with controls to launch model training; the right sidebar integrates a Large Language Model \(LLM\)\-powered AI chatbot tool for interactive assistance and AI system guidance\. The top\-right panel shows patient data exploration with report generation and an in\-page preview \(a full report example is shown in Fig\. [3](https://arxiv.org/html/2606.17405#S4.F3)\)\. The middle\-right panel displays a monitoring dashboard for data and training status, and the bottom\-right panel offers parameter controls for configuring model training and settings\.

This paper presents an online adaptive decision support framework that integrates TE estimation, DT, and RL\. First, we establish a robust decision\-making core by combining a stable model initialized from historical data with a responsive online update mechanism\. Second, we integrate a patient DT that enables rapid and consistent health state simulations during real\-time operation\. Third, we guide the learning process using TE metrics, ensuring the AI system optimizes for clinically significant outcomes while maintaining safety through uncertainty monitoring and rule\-based constraints\.

To validate these contributions and address the gap between simulation and clinical practice, we evaluate the framework on two datasets: a controlled synthetic environment for systematic analysis, and a real\-world ovarian cancer treatment cohort derived from The Cancer Genome Atlas \(TCGA\) \[Network and others, [2011](https://arxiv.org/html/2606.17405#bib.bib60)\]\. We selected ovarian cancer as the real\-world case study because our clinical collaborator is an ovarian cancer specialist, providing a clinically grounded setting to validate our AI system’s decision\-making behavior and reporting workflow\. The ovarian cancer dataset presents clinical challenges, including infrequent positive treatment responses \(occurring in only 27\.5% of patients\), complex combinations of up to 11 treatments, and detailed patient profiles encoding clinical staging and performance status\. Our results demonstrate that the proposed method achieves significant improvements in both simulated and real\-world clinical settings, reflecting its practical applicability\. An overview of our method is shown in Fig\. [1](https://arxiv.org/html/2606.17405#S1.F1), and our main contributions are summarized below:

- Publicly deployed AI tool for immediate and broad access\. The complete implementation of our framework, encompassing all methodological details presented in this paper, is publicly deployed as an interactive web application \([https://huggingface\.co/spaces/KingmaoQ/RLDT](https://huggingface.co/spaces/KingmaoQ/RLDT)\)\. The tool is available on\-demand to any user without installation or registration\. The complete AI tool overview is shown in Fig\. [2](https://arxiv.org/html/2606.17405#S1.F2)\.
- A safety\-aware online evaluation loop for Digital Twins in Healthcare\. We integrate an uncertainty\-driven query mechanism with explicit rule\-based safety gates \(such as vital\-sign plausibility, medication dose bounds, and conflict checks\) to trigger conservative fallbacks before any potential clinical violation\.
- Uncertainty\-driven selective querying under real\-time constraints\. We formalize an automated querying process that identifies informative cases by evaluating the level of agreement across multiple internal models \[Lakshminarayanan et al\., [2017](https://arxiv.org/html/2606.17405#bib.bib54), Thuy and Benoit, [2024](https://arxiv.org/html/2606.17405#bib.bib34)\]\.
- Seamless transition from historical data to real\-time adaptation\. We initialize the AI system using high\-performing models trained on retrospective data and apply frequent, stable updates to balance the need for learning new patterns with the necessity of maintaining AI system stability \[Fujimoto et al\., [2019](https://arxiv.org/html/2606.17405#bib.bib53), Jayaraman et al\., [2024](https://arxiv.org/html/2606.17405#bib.bib49)\]\.
- Privacy\-preserving data processing\. We implement a module to de\-identify data at the point of entry, automatically removing direct identifiers and applying standard privacy protection techniques in compliance with the Health Insurance Portability and Accountability Act \(HIPAA\) Safe Harbor standards \[Portability and Act, [2012](https://arxiv.org/html/2606.17405#bib.bib68)\]\.

## II Methodology

### II\-A Offline Training with Three\-Stage Model Development

Before any model consumes data, we run a policy\-driven de\-identification pass so that all learning uses data meeting the HIPAA Safe Harbor de\-identification standard\. Specifically, we remove direct identifiers \(such as names and medical record numbers\), replace internal record identifiers with random study identifiers, reduce the detail of potentially identifying fields \(for example, we keep only the first three digits of Zone Improvement Plan \(ZIP\) codes and group ages into ranges\), and shift dates by a small offset with a fixed maximum to prevent re\-identification while preserving the relative timing of events\. As an additional safeguard, we verify $k$\-anonymity, meaning that each record is indistinguishable from at least $k-1$ other records on the fields that could indirectly identify a person\.
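The de-identification steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the deployed module: the record schema (`name`, `mrn`, `zip`, `age`, `visits`), the decade-wide age ranges, and the per-patient `shift_days` offset are hypothetical choices made for the example.

```python
import secrets
from collections import Counter
from datetime import date, timedelta


def deidentify(record, shift_days):
    """Drop direct identifiers and coarsen quasi-identifiers for one record.

    The input field names are hypothetical; only the coarsened fields are kept,
    so direct identifiers such as name and MRN never reach the output.
    """
    decade = (record["age"] // 10) * 10
    return {
        "study_id": secrets.token_hex(4),        # random study identifier
        "zip3": record["zip"][:3],               # first three ZIP digits only
        "age_range": f"{decade}-{decade + 9}",   # ages grouped into ranges
        # One offset per patient keeps the relative timing of events intact.
        "visits": [d + timedelta(days=shift_days) for d in record["visits"]],
    }


def is_k_anonymous(rows, quasi_ids, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())


records = [
    {"name": "A", "mrn": "001", "zip": "77030", "age": 42,
     "visits": [date(2010, 1, 5), date(2010, 2, 4)]},
    {"name": "B", "mrn": "002", "zip": "77054", "age": 47,
     "visits": [date(2011, 6, 1)]},
]
rows = [deidentify(r, shift_days=s) for r, s in zip(records, (3, -5))]
```

Here both records fall into the same `("770", "40-49")` group, so the table is 2-anonymous on those two fields but not 3-anonymous.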

#### Stage 1: Dynamics Model \(Ensemble of Five\)

We construct a patient DT that predicts the next state from recent history and the applied treatment\. The model is a Transformer encoder that receives a sequence of state vectors and the aligned action tokens, with a causal attention mask and a padding mask Lai et al\. \[[2026](https://arxiv.org/html/2606.17405#bib.bib3)\]\. At each step the network predicts a residual change, and we apply a strictly bounded update to improve stability during iterative multi\-step rollouts:

$$\mathbf{s}_{t+1} \;=\; \operatorname{clip}\Bigl(\mathbf{s}_{t} + 0.05\,\tanh\bigl(f_{\theta}(\mathbf{s}_{0:t}, a_{0:t})\bigr),\; 0,\; 1\Bigr). \tag{1}$$

Here $\mathbf{s}_{t}\in[0,1]^{d}$ is the normalized state and $a_{t}\in\{0,\dots,K-1\}$ is the discrete action\. The loss is computed only on valid timesteps within each sequence, using a binary mask that ignores padding\. We use a Smooth L1 objective over one\-step predictions:

$$\mathcal{L}_{\text{DT}}(\theta) \;=\; \frac{1}{|\Omega|}\sum_{(i,t)\in\Omega} \ell_{\text{smooth}}\!\left(\hat{\mathbf{s}}^{(i)}_{t+1},\, \mathbf{s}^{(i)}_{t+1}\right), \tag{2}$$

where $\Omega$ denotes all valid positions in the mini\-batch\. Training uses AdamW, gradient clipping, and a learning rate scheduler\. We train five independent models under different seeds and keep all five for evaluation\. During rollout we aggregate the predictions by the ensemble mean\. We also use the ensemble variance as an uncertainty signal\.
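As a rough illustration of the bounded update in Eq. 1 and the ensemble aggregation, the sketch below applies the clipped residual step for each ensemble member and then takes the per-feature mean and variance. The `residuals` argument stands in for the Transformer outputs $f_\theta$, which are not reproduced here; using the population variance is an assumption, since the text does not say which estimator is used.

```python
import math
from statistics import mean, pvariance


def bounded_step(state, residual, scale=0.05):
    """Eq. 1: move each feature by at most `scale`, then clip to [0, 1]."""
    return [min(1.0, max(0.0, s + scale * math.tanh(r)))
            for s, r in zip(state, residual)]


def ensemble_step(state, residuals):
    """Aggregate one-step predictions across ensemble members.

    `residuals` holds one residual vector per member (placeholder for the
    network outputs). Returns the per-feature mean and variance.
    """
    preds = [bounded_step(state, r) for r in residuals]
    mean_state = [mean(col) for col in zip(*preds)]
    var_state = [pvariance(col) for col in zip(*preds)]
    return mean_state, var_state


state = [0.5, 0.99]
residuals = [[10.0, 10.0], [-10.0, 10.0]]  # two members that disagree on feature 0
mean_state, var_state = ensemble_step(state, residuals)
```

Because $\tanh$ is bounded by 1, no feature can change by more than 0.05 per step regardless of the network output, which is what keeps long rollouts from diverging.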

#### Stage 2: Counterfactual Treatment Outcome and Reward Model

Network $r_{\phi}$ predicts the immediate outcome from $(\mathbf{s},a)$\. Let $\mathbf{z}_{\text{health}}=g_{\phi}(\mathbf{s})$ denote the health representation learned from state features\. We apply adversarial deconfounding with a discriminator $D_{\xi}(a \mid \mathbf{z}_{\text{health}})$, where $\mathcal{L}_{\text{adv}}$ is the action\-prediction cross\-entropy:

$$\min_{\phi}\;\max_{\xi}\;\; \mathbb{E}_{(\mathbf{s},a,y)\sim\mathcal{D}}\!\left[\,\bigl|\,r_{\phi}(\mathbf{s},a)-y\,\bigr| \;+\; \lambda_{\text{adv}}\,\mathrm{CE}\!\bigl(D_{\xi}(\cdot \mid \mathbf{z}_{\text{health}}),\,a\bigr)\right]. \tag{3}$$

Here $\mathbb{E}_{(\mathbf{s},a,y)\sim\mathcal{D}}$ denotes the expectation over the dataset $\mathcal{D}$, $|\cdot|$ is the absolute prediction error, and $\mathrm{CE}(\cdot,\cdot)$ is the cross\-entropy loss for training $D_{\xi}$ to predict $a$ from $\mathbf{z}_{\text{health}}$\. The weight $\lambda_{\text{adv}}>0$ balances predictive accuracy and adversarial regularization, reducing dependence on observed confounding structure in the learned representation\. It does not eliminate bias from unmeasured confounders\.

#### II\-A1 Stage 3: Offline Policy Learning with Batch\-Constrained $Q$\-learning \(BCQ\)

The core of our decision\-making logic is centered on the concept of Quality, represented by the quality value \($Q$\-value\)\. Specifically, $Q_{\psi}(\mathbf{s},a)$ denotes the predicted long\-term clinical benefit of applying a treatment action $a$ to a patient in a given health state $\mathbf{s}$ \(comprising vital signs and clinical covariates\), as estimated by a neural $Q$\-network with learned parameters $\psi$\. To maintain clinical safety, we utilize BCQ, which limits the AI system’s recommendations to a validated subset of actions $\mathcal{A}_{\text{valid}}(\mathbf{s})$\. The policy $\pi(\mathbf{s})$ then identifies the optimal intervention by maximizing this quality score within the safe set:

$$\begin{aligned}
\pi(\mathbf{s}) &= \arg\max_{a\in\mathcal{A}_{\text{valid}}(\mathbf{s})} Q_{\psi}(\mathbf{s},a), \\
\mathcal{A}_{\text{valid}}(\mathbf{s}) &= \{\,a\in\mathcal{A} \;:\; b(a \mid \mathbf{s})\geq\tau_{\text{supp}}\,\}.
\end{aligned} \tag{4}$$

Within this framework, $b(a \mid \mathbf{s})$ represents the behavior model, which characterizes the likelihood that a human expert would select action $a$ from the total set of possible interventions $\mathcal{A}$ for a patient in state $\mathbf{s}$\. The support threshold, $\tau_{\text{supp}}$, acts as a safety gate to exclude actions that lack sufficient evidence in historical clinical data, ensuring the model avoids unproven or potentially hazardous decisions\. This threshold is tuned during validation to effectively balance the optimization of treatment outcomes with rigorous safety requirements\.
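The support-constrained argmax in Eq. 4 can be illustrated with plain lists standing in for the $Q$-network and behavior-model outputs. This is a sketch, not the trained policy: the default `tau_supp=0.1` is a placeholder (the paper tunes the threshold on validation data), and returning `None` when no action has enough support is an assumed convention for handing off to a conservative fallback.

```python
def valid_actions(behavior_probs, tau_supp):
    """Actions whose behavior-model probability b(a|s) meets the support threshold."""
    return [a for a, p in enumerate(behavior_probs) if p >= tau_supp]


def bcq_policy(q_values, behavior_probs, tau_supp=0.1):
    """Eq. 4: pick the highest-Q action among sufficiently supported actions.

    Returns None when no action is supported (assumed fallback signal).
    """
    allowed = valid_actions(behavior_probs, tau_supp)
    if not allowed:
        return None
    return max(allowed, key=lambda a: q_values[a])


q_values = [1.0, 5.0, 3.0]           # action 1 has the highest Q-value...
behavior_probs = [0.50, 0.02, 0.48]  # ...but clinicians almost never chose it
choice = bcq_policy(q_values, behavior_probs)
```

In this toy case the unconstrained argmax would pick action 1, while the constrained policy picks action 2, the best option with adequate historical support.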

### II\-B Online Learning with Uncertainty\-Based Sampling

High\-uncertainty candidates \($\tilde{u}(s_{t})>\tau_{\text{query}}$\) are buffered; once the buffer reaches $k$ items \(the query batch size\), we apply $k$\-center selection \(uncertainty\-weighted farthest\-first\) to query a batch of diverse samples\.

#### II\-B1 Uncertainty\-Based Selective Querying

We maintain a *$Q$\-ensemble* of $H=5$ independently trained $Q$\-networks $\{Q_{\psi_{k}}\}_{k=1}^{H}$ initialized from the offline stage\. Greedy action selection uses the ensemble mean:

$$a_{t} \;=\; \arg\max_{a\in\mathcal{A}}\;\frac{1}{H}\sum_{k=1}^{H}Q_{\psi_{k}}(s_{t},a). \tag{5}$$

Ensemble mean and standard deviation:

$$\mu_{a}(s_{t}) \;=\; \frac{1}{H}\sum_{k=1}^{H}Q_{\psi_{k}}(s_{t},a), \tag{6}$$

$$\sigma_{a}(s_{t}) \;=\; \sqrt{\frac{1}{H-1}\sum_{k=1}^{H}\bigl(Q_{\psi_{k}}(s_{t},a)-\mu_{a}(s_{t})\bigr)^{2}}. \tag{7}$$

Coefficient of variation:

$$\mathrm{CV}_{a}(s_{t}) \;=\; \frac{\sigma_{a}(s_{t})}{|\mu_{a}(s_{t})|+\epsilon},\quad \epsilon=10^{-8}. \tag{8}$$

$\tanh$\-squashed uncertainty statistic:

$$\tilde{u}(s_{t}) \;=\; \tanh\Bigl(\max_{a\in\mathcal{A}}\mathrm{CV}_{a}(s_{t})\Bigr). \tag{9}$$
Query the expert if $\tilde{u}(s_{t})>\tau$ \(default $\tau=0.2$\)\. For BCQ without an ensemble, use $\hat{u}(s_{t})=\mathrm{Var}(\mathbf{s}_{t})/\max_{s\in\mathcal{D}}\mathrm{Var}(s)\in[0,1]$ with the same threshold\. The AI system supports two modes: *manual* mode, where clinicians provide treatment labels and expected outcomes through the web interface, and *automatic* mode, where the pre\-trained outcome model generates reward labels for queried transitions\. In our experimental validation, we use automatic mode to simulate expert feedback\.
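A compact way to read Eqs. 5 to 9 together is the sketch below, which takes a table of ensemble $Q$-values (one row per member, one column per action) and returns the greedy action and the squashed uncertainty. The nested-list layout and function names are illustrative assumptions; `statistics.stdev` uses the $H-1$ denominator, matching Eq. 7.

```python
import math
from statistics import mean, stdev


def greedy_action(q_ensemble):
    """Eq. 5: argmax over actions of the ensemble-mean Q-value."""
    means = [mean(per_action) for per_action in zip(*q_ensemble)]
    return max(range(len(means)), key=lambda a: means[a])


def squashed_uncertainty(q_ensemble, eps=1e-8):
    """Eqs. 6-9: tanh of the largest per-action coefficient of variation."""
    cvs = []
    for per_action in zip(*q_ensemble):      # Q-values of one action across members
        mu = mean(per_action)                # Eq. 6
        sigma = stdev(per_action)            # Eq. 7 (sample standard deviation)
        cvs.append(sigma / (abs(mu) + eps))  # Eq. 8
    return math.tanh(max(cvs))               # Eq. 9


def should_query(q_ensemble, tau=0.2):
    """Flag the case for expert review when uncertainty exceeds the threshold."""
    return squashed_uncertainty(q_ensemble) > tau


agree = [[1.0, 2.0]] * 5
disagree = [[1.0, 2.0], [3.0, 2.0], [1.0, 2.0], [3.0, 2.0], [2.0, 2.0]]
u_disagree = squashed_uncertainty(disagree)
```

For `disagree`, action 0 has mean 2 and sample standard deviation 1, so its coefficient of variation is about 0.5 and the squashed statistic is about $\tanh(0.5)$, well above the default threshold of 0.2.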

$k$\-center selection for batch size $k$\. Let $\mathcal{U}$ be the candidates exceeding the threshold and $d(\cdot,\cdot)$ the Euclidean distance:

$$\operatorname*{selected} \;=\; \arg\max_{\mathcal{B}\subseteq\mathcal{U},\,|\mathcal{B}|=k}\;\min_{\mathbf{s}\in\mathcal{U}\setminus\mathcal{B}}\;\max_{\mathbf{s}'\in\mathcal{B}} d(\mathbf{s},\mathbf{s}')\cdot\tilde{u}(\mathbf{s}). \tag{10}$$
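Solving the subset objective exactly is combinatorial, so a greedy farthest-first pass is a common practical substitute. The sketch below is one such heuristic under stated assumptions and is not claimed to be the authors' solver: it seeds the batch with the most uncertain candidate, then repeatedly adds the candidate whose uncertainty-weighted distance to the already selected set is largest.

```python
import math


def k_center_select(states, uncertainties, k):
    """Greedy uncertainty-weighted farthest-first selection (a heuristic).

    Returns the indices of up to k selected candidates.
    """
    n = len(states)
    if k <= 0 or n == 0:
        return []
    # Seed with the single most uncertain candidate (an assumed choice).
    selected = [max(range(n), key=lambda i: uncertainties[i])]
    while len(selected) < min(k, n):
        def score(i):
            # Distance to the nearest selected point, weighted by uncertainty.
            nearest = min(math.dist(states[i], states[j]) for j in selected)
            return nearest * uncertainties[i]
        remaining = [i for i in range(n) if i not in selected]
        selected.append(max(remaining, key=score))
    return selected


states = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
uncertainties = [0.9, 0.8, 0.5, 0.5]
batch = k_center_select(states, uncertainties, k=2)
```

In this example candidate 1 is almost as uncertain as candidate 0 but lies right next to it, so the second pick goes to the distant candidate 2 instead, which is the diversity effect the selection step is meant to provide.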

#### II\-B2 Incremental Model Updates

For Transformer $\hat{f}_{\theta}$ with layers $\{l_{1},\dots,l_{n}\}$, freeze $\theta_{1:n-2}$ and update only $\theta_{n-1:n}$:

$$\theta_{t+1}^{(n-1:n)} \;=\; \theta_{t}^{(n-1:n)} \;-\; \eta\,\nabla_{\theta_{n-1:n}}\mathcal{L}(\theta_{t};\,\mathcal{D}_{t}^{\text{new}}). \tag{11}$$

Exponential moving average for stability:

$$\bar{\theta}_{t+1} \;=\; \alpha\,\bar{\theta}_{t}+(1-\alpha)\,\theta_{t+1},\quad \alpha=0.99. \tag{12}$$
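To make the two update rules concrete, the toy sketch below treats each layer as a single scalar weight, takes one gradient step on the last two layers only (Eq. 11), and then blends the result into the averaged copy (Eq. 12). The layer names, the gradient values, and the learning rate `lr=0.1` are illustrative assumptions; the paper does not state $\eta$.

```python
def step_trainable(params, grads, trainable, lr):
    """Eq. 11: gradient step on the trainable layers; frozen layers are unchanged."""
    return {name: (w - lr * grads[name] if name in trainable else w)
            for name, w in params.items()}


def ema_update(ema_params, params, alpha=0.99):
    """Eq. 12: exponential moving average of the parameters."""
    return {name: alpha * ema_params[name] + (1 - alpha) * params[name]
            for name in params}


params = {"layer1": 1.0, "layer2": 1.0, "layer3": 1.0}
grads = {"layer1": 10.0, "layer2": 10.0, "layer3": 10.0}

# Freeze layer1; update only the last two layers.
new_params = step_trainable(params, grads, trainable={"layer2", "layer3"}, lr=0.1)
ema_params = ema_update(params, new_params)
```

With $\alpha=0.99$ the averaged weights move only 1% of the way toward each new iterate, so a single noisy update on recent data barely shifts the parameters that are actually served.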

#### II\-B3 Experience Replay with Prioritization

Labeled buffer $\mathcal{B}_{L}$ \(10K\) for expert\-validated transitions; weak buffer $\mathcal{B}_{W}$ \(50K\) for model predictions\. Prioritized sampling:

$$p(\tau_{i}) \;\propto\; \omega_{i}\cdot\exp\bigl(-\lambda_{t}\cdot(t-t_{i})\bigr), \tag{13}$$

where $\tau_{i}$ denotes the $i$\-th stored transition, collected at time $t_{i}$, $\omega_{i}$ is its uncertainty weight, and $\lambda_{t}$ controls temporal decay\.
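The prioritization rule can be sketched as follows: compute the unnormalized weight of each stored transition from its uncertainty weight and age, normalize, and sample with replacement. The function names, the uniform fallback when all weights are zero, and the decay value in the example are assumptions made for illustration.

```python
import math
import random


def sampling_probs(weights, timestamps, now, decay):
    """Eq. 13, normalized: p_i proportional to w_i * exp(-decay * (now - t_i))."""
    raw = [w * math.exp(-decay * (now - t)) for w, t in zip(weights, timestamps)]
    total = sum(raw)
    if total == 0:
        # Assumed fallback: sample uniformly if every weight is zero.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def sample_batch(buffer, probs, batch_size, rng):
    """Draw a replay batch with replacement according to the priorities."""
    return rng.choices(buffer, weights=probs, k=batch_size)


buffer = ["old_transition", "new_transition"]
probs = sampling_probs(weights=[1.0, 1.0], timestamps=[0, 10], now=10, decay=0.1)
batch = sample_batch(buffer, probs, batch_size=4, rng=random.Random(0))
```

With equal uncertainty weights, the transition that is 10 steps older is down-weighted by a factor of $e^{-1}$, so recent experience dominates the replay batch without older data being discarded.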

### II\-C Hot Parameter Adaptation

Three\-tier adaptation that avoids full retraining where possible\.

- Tier 1 \(instant\): uncertainty threshold $\tau$, batch size $B$, stream rate $r$, candidate actions $N$, perturbation bound $\Phi$\.
- Tier 2 \(fast fine\-tune, $M=500$ steps\): discount $\gamma$, target exponential moving average \(EMA\) $\rho$, regularization $\lambda_{\text{reg}}$, imitation balance $\beta$; recompute targets $y_{t}=r_{t}+\gamma\max_{a'\in\mathcal{A}_{N}(s_{t+1})}\min_{j\in\{1,2\}}Q_{\theta_{j}^{-}}(s_{t+1},a')$ on recent data\.
- Tier 3 \(full retrain\): architecture changes, action generator changes, feature space changes, major distribution shift, or substantial drift\.
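The Tier 2 target takes the minimum of two target $Q$-networks for each candidate action and then the maximum over candidates, which damps overestimation. A minimal sketch is shown below, with plain lists standing in for the two target networks' outputs at $s_{t+1}$; the `done` flag for terminal transitions is an added assumption that the formula in the text does not mention.

```python
def td_target(reward, gamma, q1_next, q2_next, candidate_actions, done=False):
    """y_t = r_t + gamma * max over candidates of min(Q1, Q2) at the next state."""
    if done or not candidate_actions:
        return reward  # assumed convention: no bootstrap at terminal states
    best = max(min(q1_next[a], q2_next[a]) for a in candidate_actions)
    return reward + gamma * best


q1_next = [1.0, 4.0, 2.0]
q2_next = [3.0, 2.0, 5.0]
# Only actions 0 and 1 are in the candidate set A_N(s_{t+1}).
y = td_target(reward=1.0, gamma=0.5, q1_next=q1_next, q2_next=q2_next,
              candidate_actions=[0, 1])
```

Here the pessimistic values for the candidates are 1.0 and 2.0, so the target is $1.0 + 0.5 \cdot 2.0 = 2.0$, even though one network rates action 1 at 4.0.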

### II\-D LLM Integration and Clinical Interface

#### II\-D1 LLM\-Based Interpretability

Tool\-augmented LLM approach for natural\-language explanations\. The LLM accesses functions \(optimal action retrieval, trajectory simulation, feature importance\) to generate rationales\. Local LLM server OpenAI \[[2024](https://arxiv.org/html/2606.17405#bib.bib46)\], Chen et al\. \[[2026](https://arxiv.org/html/2606.17405#bib.bib6)\], Cao et al\. \[[2026](https://arxiv.org/html/2606.17405#bib.bib4)\] with constraints: $<1200$ words, cite tool outputs, no hallucinated data\.

#### II\-D2 Human\-Computer Interface and Report Generation

Progressive disclosure interface: patient dashboard \(vital signs with abnormality flags\), treatment comparison panel \(side\-by\-side projections\), training monitor \(live metrics\)\. Three modes: consultation \(natural\-language queries\), configuration \(parameter tuning\), monitoring \(AI system performance\)\.

Auto\-generated Hypertext Markup Language \(HTML\) report: \(1\) patient profile with flagged vitals, \(2\) primary recommendation with confidence and expected outcome, \(3\) treatment comparison table, \(4\) decision rationale emphasizing abnormal metrics and contrasting alternatives, \(5\) trajectory visualizations of simulated biomarker evolution\.

## III Experiments and Results

### III\-A Evaluation Datasets

We evaluate on two complementary datasets: a controlled synthetic simulator for systematic analysis and a real\-world ovarian cancer cohort for clinical validation\.

#### III\-A1 Synthetic Clinical Simulator

10\-dimensional state space with physiologically relevant features: blood pressure \(BP $\sim\mathcal{N}(0.5,0.15^{2})$\), heart rate \(HR $\sim\mathcal{N}(0.5,0.1^{2})$\), glucose \($\sim\mathcal{N}(0.5,0.2^{2})$\), creatinine, hemoglobin, temperature, oxygen saturation \(SpO2\), age, gender, and Body Mass Index \(BMI\), all normalized to $[0,1]$\. Action space: $K=5$ treatments\. Reward structure incentivizes normal vital signs \(SpO2 $>0.9$ yields a bonus\) while penalizing abnormal values; SpO2 $<0.80$ triggers conservative fallback and mandatory expert query\. Dataset: 10,000 trajectories \(max horizon 50\), split 8,000/1,000/1,000 \(train/val/test\) by patient ID\.

#### III\-A2 Real\-World Ovarian Cancer Dataset

TCGA Ovarian Cancer cohort with 587 patients and 2,552 treatment events Network and others \[[2011](https://arxiv.org/html/2606.17405#bib.bib60)\]\. Drug names normalized into 11 therapeutic classes: platinum, taxanes, anthracyclines, antimetabolites, topoisomerase inhibitors, alkylating agents, anti\-angiogenics, hormonal agents, vinca alkaloids, radiation, and other, yielding $K=47$ treatment combinations via multi\-hot encoding\. State representation includes age, gender, tumor status, grade, stage, cumulative drug count, radiation history, Eastern Cooperative Oncology Group \(ECOG\) performance status, Karnofsky performance score, and reserved features\. Binary reward: 1 if tumor\-free transition, 0 otherwise \(27\.5% positive observed\)\. Split: 469/59/59 patients \(train/val/test, 80/10/10\)\. Most frequent regimens: platinum\-taxane \(32%\), platinum\-only \(27%\), triplet combinations \(11%\)\.

### III\-B Comparative Methods

Five methods evaluated with unified preprocessing, $\gamma=0.99$, five random seeds: Deep Q\-Network \(DQN\) Mnih et al\. \[[2015](https://arxiv.org/html/2606.17405#bib.bib41)\], Double Deep Q\-Network \(Double DQN\) Van Hasselt et al\. \[[2016](https://arxiv.org/html/2606.17405#bib.bib42)\], Neural Fitted Q\-Iteration \(NFQ\) Riedmiller \[[2005](https://arxiv.org/html/2606.17405#bib.bib43)\], Conservative Q\-Learning \(CQL\) Kumar et al\. \[[2020](https://arxiv.org/html/2606.17405#bib.bib40)\], and our BCQ\-based approach\. All methods trained on historical data using Transformer\-based dynamics ensembles and treatment outcome model with adversarial deconfounding\. Treatment strategies are evaluated via rollout in the learned DT environment, reporting discounted cumulative benefit\.

Safety monitoring: all recommendations verified against rule\-based clinical constraints\. For synthetic data, vital sign ranges \(BP $\in[0.3,0.8]$, HR $\in[0.4,0.7]$, glucose $\in[0.3,0.7]$, SpO2 $\in[0.85,1.0]$, temperature $\in[0.45,0.55]$\) and drug contraindication checks\. For ovarian cancer data, safety constraints include stage validity \(Stage I–IV only\), tumor status validity \(tumor\-free/with\-tumor only\), and a conservative eligibility gate requiring ECOG $\leq 2$ and age $\in[18,90]$ strictly\.
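The rule-based checks listed above are simple enough to express directly. The sketch below encodes the synthetic vital-sign ranges and the ovarian eligibility gate exactly as stated in the text; the dictionary keys, the string encodings for stage and tumor status, and the handling of substages such as "IIIC" (mapped to their parent stage) are assumptions, and drug contraindication checks are omitted because the paper does not list them.

```python
# Normalized vital-sign ranges for the synthetic simulator, as given in the text.
SYNTHETIC_RANGES = {
    "bp": (0.3, 0.8),
    "hr": (0.4, 0.7),
    "glucose": (0.3, 0.7),
    "spo2": (0.85, 1.0),
    "temperature": (0.45, 0.55),
}


def vital_violations(state):
    """Names of the vital signs that fall outside their allowed range."""
    return [name for name, (lo, hi) in SYNTHETIC_RANGES.items()
            if not lo <= state[name] <= hi]


def ovarian_eligible(stage, tumor_status, ecog, age):
    """Eligibility gate: valid stage and tumor status, ECOG <= 2, age in [18, 90]."""
    main_stage = stage.rstrip("ABC")  # assumed: "IIIC" counts as Stage III
    return (main_stage in {"I", "II", "III", "IV"}
            and tumor_status in {"tumor-free", "with-tumor"}
            and ecog <= 2
            and 18 <= age <= 90)


patient = {"bp": 0.5, "hr": 0.5, "glucose": 0.5, "spo2": 0.80, "temperature": 0.5}
flags = vital_violations(patient)
```

A non-empty `flags` list would block the recommendation and route the case to the conservative fallback; in this example only the oxygen saturation is out of range.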

### III\-C Offline Evaluation and Analysis

All methods are evaluated on held\-out test data by rolling out learned policies in the DT environment and measuring cumulative treatment benefit, defined as the total reward accumulated over the rollout with future rewards discounted, together with selection consistency\.
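For reference, the discounted cumulative benefit of a single rollout can be computed as below. The function name is illustrative, and the reward sequence would come from the outcome model during a DT rollout; the default $\gamma=0.99$ matches the value stated for the comparison.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over a rollout, accumulated backwards for stability."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps with reward 1 and gamma = 0.5: 1 + 0.5 + 0.25
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Averaging this quantity over held-out patients gives a mean return of the kind reported in the results.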

##### Treatment Selection Performance

Table [I](https://arxiv.org/html/2606.17405#S3.T1) summarizes offline and online results\. On the synthetic simulator, our method achieves mean return 37\.73, a 2\.8% improvement over Double DQN \($p=0.02$\) with higher consistency across patient cases \(Sharpe\-like index 3\.43 vs\. 3\.17\)\. On the ovarian cancer dataset, our method demonstrates substantially better treatment selection with 136% higher predicted benefit \(33\.26 vs\. 14\.06, $p<0.001$\)\. This advantage reflects the AI system’s ability to identify appropriate treatment combinations in sparse\-reward scenarios where only 27\.5% of treatment events lead to tumor\-free outcomes\. Our method also produces more consistent recommendations \(action entropy 0\.96 vs\. DQN’s 1\.58\), reliably suggesting the same treatment for similar patients rather than alternating between options, which is critical for maintaining physician confidence\. All methods maintained perfect safety \(100% compliance with vital sign constraints and contraindication checks\)\.

TABLE I: Comparative performance across offline and online evaluation\. Mean return measures cumulative treatment benefit; Query rate indicates overall expert consultation frequency\. Statistical significance: $^{*}p<0.05$, $^{**}p<0.01$ relative to our baseline method\.

### III\-D Online Learning Evaluation

We evaluated online adaptation by replaying test data chronologically under unified conditions\. Our uncertainty\-based querying achieved the lowest consultation rates: 13\.1% for synthetic cases and 39\.9% for ovarian cancer cases, representing 15\.5–37\.0% reductions versus baseline methods \(Table [I](https://arxiv.org/html/2606.17405#S3.T1)\)\. This efficiency stems from accurate identification of clinically ambiguous scenarios via ensemble disagreement \(Eq\. [9](https://arxiv.org/html/2606.17405#S2.E9)\)\. Higher query rates on ovarian data \(39\.9–57\.4% across methods\) reflect genuine clinical uncertainty in sparse\-outcome settings where treatment success is variable\. All methods maintained 100% compliance with clinical constraints throughout online operation\. On the synthetic simulator, we introduced a population shift after 1,000 cases \(simulating an influx of older, higher\-risk patients with shifted vital sign distributions\)\. Our method adapted effectively, accumulating substantially more labeled data \(1,620 samples vs\. 800–1,420 for baselines\) and executing more frequent updates \(80 blocks vs\. 39–70\) while maintaining rapid decision times, sustaining treatment quality despite the demographic shift\.

### III\-E Model Component Evaluation

We systematically assessed each component on held\-out test data\. The DT ensemble achieved strong predictive accuracy \($R^{2}=0.82$\) for forecasting patient state transitions across 500 test trajectories, with prediction error remaining controlled even at 5\-step projections \(mean squared error \(MSE\) = 0\.006\), validating the bounded update mechanism \(Eq\. 1\)\. The treatment outcome model attained $R^{2}=0.87$ across 7,395 treatment\-outcome observations with well\-calibrated uncertainty estimates \(expected calibration error \(ECE\) = 0\.105\), confirming reliable treatment benefit predictions on average\.

## IV Clinical Case Analysis and Validation

### IV\-A Representative Patient Cases

To assess clinical plausibility, we reviewed five representative cases from the held\-out TCGA test cohort\. These cases span different ages, stages, grades, and treatment patterns\. Table [II](https://arxiv.org/html/2606.17405#S4.T2) compares the AI system’s top\-ranked treatment class with the treatment recorded in the dataset\. The table is intended as a qualitative illustration of plausibility and alignment with observed practice, not as a population\-level accuracy analysis\. Across these cases, the model did not produce clinically implausible or unsupported treatment combinations\.

Across these representative cases, the recommended treatment class matched the historical clinical decision in all five instances\. These cases include patients aged 42–67 years, with advanced\-stage disease \(Stage IIIC–IV\), variable tumor grade \(2–3\), and diverse treatment strategies including platinum\-based monotherapy, platinum–taxane combinations, alternative agents, and triplet regimens\. Rather than implying optimality of the historical treatment, this concordance demonstrates that the AI system’s learned policy remains aligned with real\-world oncologic practice when operating under the same observational constraints\. Importantly, the model did not propose clinically implausible or unsupported treatment combinations, reinforcing the effectiveness of the behavior\-constrained action space and safety gating mechanisms\.

TABLE II: Five representative TCGA ovarian cancer cases used for qualitative comparison between the model recommendation and the recorded treatment\. Rec\. = Recommendation; Plat = Platinum; Tax = Taxane; Alt = Alternative; Rx = Treatment; yr = years; d = days; mo = months\.

### IV\-B Automated Clinical Report Generation

![Refer to caption](https://arxiv.org/html/2606.17405v1/V1.png)

Figure 3: AI\-generated patient report for TCGA\-04\-1367, integrating treatment rankings, clinical covariates, genomic profiles, and longitudinal outcomes\.

To illustrate how the AI system supports real\-world clinical workflows, we present the automatically generated decision support report for Case \#1 \(TCGA\-04\-1367\) in Fig\. [3](https://arxiv.org/html/2606.17405#S4.F3)\. This report demonstrates how complex model outputs, including DT simulations, TE estimates, and uncertainty metrics, are translated into a concise, clinician\-oriented summary\. The report ranks candidate treatment options according to predicted clinical response, where higher scores indicate greater expected benefit relative to a reference treatment\. Each option corresponds to a historically observed regimen within the dataset, ensuring interpretability and clinical relevance\. For this patient, the top\-ranked recommendation \(platinum \+ taxane; predicted response score = 0\.0086\) matched the treatment administered in clinical care\.

Beyond treatment ranking, the report integrates longitudinal clinical data \(2009–2014\), baseline covariates, and molecular context\. Genomic features include copy number \(CN\) status for key ovarian cancer–associated genes \(MYC, BRCA1, BRCA2, TP53, CCNE1\), where CN = 2\.0 indicates diploid status, CN \> 2\.0 amplification, and CN < 2\.0 deletion, following Cancer Genome Atlas Research Network \[[2011](https://arxiv.org/html/2606.17405#bib.bib60)\]\. These features are presented for contextualization rather than direct causal attribution, consistent with the observational nature of the dataset\.
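The copy number convention described above can be sketched as a small lookup. This is an illustrative reading of the stated rule only: the function name, the optional tolerance parameter, and the example CN values are assumptions and do not come from the paper or from TCGA-04-1367.

```python
def cn_status(copy_number, diploid=2.0, tol=0.0):
    """Map a copy number value to a status label.

    CN equal to `diploid` is diploid, above it is amplification, and
    below it is deletion. `tol` is a hypothetical tolerance band around
    the diploid value; the default of 0.0 reproduces the rule as stated.
    """
    if abs(copy_number - diploid) <= tol:
        return "diploid"
    return "amplification" if copy_number > diploid else "deletion"


# Hypothetical CN values for the five genes named in the report.
profile = {"MYC": 3.1, "BRCA1": 2.0, "BRCA2": 1.4, "TP53": 2.0, "CCNE1": 2.6}

for gene, cn in profile.items():
    print(f"{gene}: CN = {cn} ({cn_status(cn)})")
```

With real segment-level copy number data, values are rarely exactly 2.0, so a nonzero tolerance would likely be needed; the appropriate width is a modeling choice not addressed here.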

The report consolidates six integrated components: \(1\) *Treatment Alignment*, verifying concordance between AI system recommendation and actual treatment to establish trust through demonstrated alignment with clinical practice; \(2\) *Outcome Snapshot*, presenting tumor status trajectory with model\-generated evidence summary and qualitative confidence assessment; \(3\) *Visit\-Level Variables*, documenting longitudinal treatment history and response patterns across multiple clinical encounters; \(4\) *Clinical Covariates*, providing structured summaries of demographics, disease staging, histology, and baseline characteristics; \(5\) *Key Genomic Alterations*, contextualizing recommendations within the patient’s molecular profile through copy number status display, following Cancer Genome Atlas Research Network \[[2011](https://arxiv.org/html/2606.17405#bib.bib60)\]; and \(6\) *Treatment Response Ranking*, presenting the top 10 treatment options with predicted response scores where positive values indicate expected benefit relative to reference treatment and negative values suggest potentially inferior outcomes\. By integrating these elements into a unified interface, the AI system reduces the need for manual data synthesis across disparate sources and supports rapid clinical review\.

### IV\-C Clinical Interpretation and Expert Validation
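The Treatment Response Ranking component can be illustrated with a short sketch. It assumes the response score is the difference between a regimen's predicted outcome and that of the reference regimen; this section does not give the paper's exact scoring formula, so that definition, the function names, and all numeric values below are hypothetical.

```python
def rank_treatments(predicted_outcome, reference, top_k=10):
    """Rank regimens by predicted outcome relative to a reference regimen.

    ASSUMPTION: score = predicted_outcome[regimen] - predicted_outcome[reference].
    Returns up to `top_k` (regimen, score) pairs, highest score first.
    """
    if reference not in predicted_outcome:
        raise KeyError(f"reference regimen {reference!r} has no prediction")
    baseline = predicted_outcome[reference]
    scores = {
        regimen: value - baseline
        for regimen, value in predicted_outcome.items()
        if regimen != reference
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def interpret(score):
    """Read the sign of a score the way the report describes it."""
    if score > 0:
        return "expected benefit vs. reference"
    if score < 0:
        return "potentially inferior vs. reference"
    return "no predicted difference"


# Hypothetical predicted outcomes on an arbitrary scale.
predicted = {
    "platinum": 0.5,
    "platinum + taxane": 0.75,
    "alternative agent": 0.25,
}

for regimen, score in rank_treatments(predicted, reference="platinum"):
    print(f"{regimen}: {score:+.2f} ({interpret(score)})")
```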

To evaluate interpretability and clinical acceptability, the generated reports were reviewed by domain experts with experience in gynecologic oncology and clinical trial design, consistent with established guidance that clinically deployed Clinical Decision Support AI Systems \(CDSASs\) should be assessed for explainability, usability, and end\-user trust Amann et al\. \[[2020](https://arxiv.org/html/2606.17405#bib.bib61)\], Tonekaboni et al\. \[[2019](https://arxiv.org/html/2606.17405#bib.bib62)\]\. Expert feedback focused on three dimensions: \(i\) clinical plausibility of recommendations, \(ii\) clarity and completeness of the explanatory content, and \(iii\) perceived utility as a decision support aid rather than an autonomous decision\-maker, aligning with the prevailing view that an AI\-enabled CDSAS is intended to support rather than supplant clinician judgment Shortliffe and Sepúlveda \[[2018](https://arxiv.org/html/2606.17405#bib.bib63)\], Jones et al\. \[[2023](https://arxiv.org/html/2606.17405#bib.bib64)\]\.

Experts noted that the AI system’s recommendations were generally clinically plausible and remained within historically observed treatment classes, and that uncertainty\-driven escalation appropriately flagged ambiguous cases for human oversight, matching uncertainty\-based referral paradigms that defer high\-uncertainty cases to clinicians Zhang et al\. \[[2023](https://arxiv.org/html/2606.17405#bib.bib65)\]\. The presentation of alternative treatments with explicit comparative scores was viewed as particularly valuable for shared decision\-making and hypothesis generation, especially in settings with sparse or heterogeneous evidence Elwyn et al\. \[[2012](https://arxiv.org/html/2606.17405#bib.bib66)\]\. Importantly, experts emphasized that the AI system’s primary value lies in decision support rather than replacement of clinician judgment Shortliffe and Sepúlveda \[[2018](https://arxiv.org/html/2606.17405#bib.bib63)\], Jones et al\. \[[2023](https://arxiv.org/html/2606.17405#bib.bib64)\]\. The framework was perceived as most useful for synthesizing prior patient trajectories, exploring counterfactual treatment paths through DT simulation, and prioritizing options for further discussion or tumor board review\. These observations align with our AI system’s design philosophy, in which human expertise remains central and is selectively engaged when model uncertainty is high, which also helps mitigate well\-described risks such as over\-reliance on automated advice \(automation bias\) Jones et al\. \[[2023](https://arxiv.org/html/2606.17405#bib.bib64)\], Goddard et al\. \[[2012](https://arxiv.org/html/2606.17405#bib.bib67)\]\.

## V Conclusion
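The uncertainty-driven escalation discussed above can be sketched as a simple disagreement check over ensemble predictions. This is not the paper's implementation: measuring disagreement as the population standard deviation across ensemble members, the threshold value, and the function names are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def ensemble_disagreement(member_predictions):
    """Spread of the ensemble members' predictions for one case.

    ASSUMPTION: disagreement is the population standard deviation.
    """
    if len(member_predictions) < 2:
        raise ValueError("need at least two ensemble members")
    return pstdev(member_predictions)


def needs_clinician_review(member_predictions, threshold):
    """Flag a case for human review when disagreement exceeds `threshold`."""
    return ensemble_disagreement(member_predictions) > threshold


# Hypothetical predicted response scores from a five-member ensemble.
cases = {
    "case_A": [0.10, 0.11, 0.09, 0.10, 0.10],   # members agree
    "case_B": [0.40, -0.20, 0.05, 0.30, -0.10],  # members disagree
}
THRESHOLD = 0.05  # hypothetical; would be tuned on validation data

for case_id, preds in cases.items():
    flag = needs_clinician_review(preds, THRESHOLD)
    print(f"{case_id}: mean={mean(preds):.3f}, "
          f"disagreement={ensemble_disagreement(preds):.3f}, review={flag}")
```

The threshold controls the trade-off between clinician workload and the share of ambiguous cases that receive human oversight, so in practice it would be chosen against a target query rate.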

We present a clinical decision\-support framework that combines RL, a patient DT, and TE\-based rewards with human oversight\. On synthetic and retrospective ovarian cancer data, the method outperformed standard baselines while maintaining low query rates and rule\-based safety\. The results support the framework as a cohort\-specific decision\-support system within a learned DT environment\. Limitations include retrospective evaluation, possible bias from unmeasured confounders, and the lack of external or prospective validation\. Future work will study guideline\-aware constraints, out\-of\-distribution detection, and multi\-site prospective evaluation\.

## References

- J\. Amann, A\. Blasimme, E\. Vayena, D\. Frey, V\. I\. Madai, and P\. Consortium \(2020\) Explainability for artificial intelligence in healthcare: a multidisciplinary perspective\. BMC Medical Informatics and Decision Making 20\(1\), pp\. 310\. Cited by: [§IV\-C](https://arxiv.org/html/2606.17405#S4.SS3.p1.1)\.
- J\. Cao, Y\. Ma, X\. Li, Q\. Ren, and X\. Chen \(2026\) Task\-specific efficiency analysis: when small language models outperform large language models\. arXiv preprint arXiv:2603\.21389\. Cited by: [§II\-D1](https://arxiv.org/html/2606.17405#S2.SS4.SSS1.p1.1)\.
- Z\. Chen, J\. Cheng, H\. Amiri, K\. Nag, L\. Lin, S\. Liu, G\. Tolomei, and X\. Sun \(2025\) FROG: fair removal on graph\. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp\. 415–424\. Cited by: [§I](https://arxiv.org/html/2606.17405#S1.p2.1)\.
- Z\. Chen, J\. Cheng, Z\. Fan, H\. Amiri, Y\. Yao, X\. Sun, and Y\. Zhang \(2026\) CURE: circuit\-aware unlearning for LLM\-based recommendation\. arXiv preprint arXiv:2604\.04982\. Cited by: [§II\-D1](https://arxiv.org/html/2606.17405#S2.SS4.SSS1.p1.1)\.
- Z\. Chen, F\. Silvestri, J\. Wang, H\. Zhu, H\. Ahn, and G\. Tolomei \(2022\) Relax: reinforcement learning agent explainer for arbitrary predictive models\. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp\. 252–261\. Cited by: [§I](https://arxiv.org/html/2606.17405#S1.p1.1)\.
- G\. Elwyn, D\. Frosch, R\. Thomson, N\. Joseph\-Williams, A\. Lloyd, P\. Kinnersley, E\. Cording, D\. Tomson, C\. Dodd, S\. Rollnick, et al\. \(2012\) Shared decision making: a model for clinical practice\. Journal of General Internal Medicine 27\(10\), pp\. 1361–1367\. Cited by: [§IV\-C](https://arxiv.org/html/2606.17405#S4.SS3.p2.1)\.
- S\. Fujimoto, D\. Meger, and D\. Precup \(2019\) Off\-policy deep reinforcement learning without exploration\. In International Conference on Machine Learning, pp\. 2052–2062\. Cited by: [4th item](https://arxiv.org/html/2606.17405#S1.I1.i4.p1.1), [§I](https://arxiv.org/html/2606.17405#S1.p2.1)\.
- K\. Goddard, A\. Roudsari, and J\. C\. Wyatt \(2012\) Automation bias: a systematic review of frequency, effect mediators, and mitigators\. Journal of the American Medical Informatics Association 19\(1\), pp\. 121–127\. Cited by: [§IV\-C](https://arxiv.org/html/2606.17405#S4.SS3.p2.1)\.
- M\. Hernan and J\. Robins \(2020\) Causal inference: what if\. Chapman & Hall/CRC, Boca Raton\. Cited by: [§I](https://arxiv.org/html/2606.17405#S1.p1.1)\.
- P\. Jayaraman, J\. Desman, M\. Sabounchi, G\. N\. Nadkarni, and A\. Sakhuja \(2024\) A primer on reinforcement learning in medicine for clinicians\. NPJ Digital Medicine 7\(1\), pp\. 337\. Cited by: [4th item](https://arxiv.org/html/2606.17405#S1.I1.i4.p1.1), [§I](https://arxiv.org/html/2606.17405#S1.p2.1)\.
- C\. Jones, J\. Thornton, and J\. C\. Wyatt \(2023\) Artificial intelligence and clinical decision support: clinicians’ perspectives on trust, trustworthiness, and liability\. Medical Law Review 31\(4\), pp\. 501–520\. Cited by: [§IV\-C](https://arxiv.org/html/2606.17405#S4.SS3.p1.1), [§IV\-C](https://arxiv.org/html/2606.17405#S4.SS3.p2.1)\.
- A\. Kumar, A\. Zhou, G\. Tucker, and S\. Levine \(2020\) Conservative Q\-learning for offline reinforcement learning\. In NeurIPS, External Links: [Link](https://papers.nips.cc/paper/2020/hash/0d2b2061826a5df3221116a5085a6052-Abstract.html)\. Cited by: [§III\-B](https://arxiv.org/html/2606.17405#S3.SS2.p1.8)\.
- L\. Lai, Z\. Cheng, K\. Cheng, and X\. Qi \(2026\) Do transformers always win? An empirical study of semantic embeddings for short\-text e\-commerce reviews\. In 2026 9th International Symposium on Big Data and Applied Statistics \(ISBDAS\), pp\. 525–529\. External Links: [Document](https://dx.doi.org/10.1109/ISBDAS69350.2026.11484350)\. Cited by: [§II\-A](https://arxiv.org/html/2606.17405#S2.SS1.SSSx1.p1.1)\.
- B\. Lakshminarayanan, A\. Pritzel, and C\. Blundell \(2017\) Simple and scalable predictive uncertainty estimation using deep ensembles\. Vol\. 30\. Cited by: [3rd item](https://arxiv.org/html/2606.17405#S1.I1.i3.p1.1), [§I](https://arxiv.org/html/2606.17405#S1.p2.1)\.
- S\. Levine, A\. Kumar, G\. Tucker, and J\. Fu \(2020\) Offline reinforcement learning: tutorial, review, and perspectives on open problems\. Cited by: [§I](https://arxiv.org/html/2606.17405#S1.p1.1)\.
- C\. Meijer, H\. Uh, and S\. El Bouhaddani \(2023\) Digital twins in healthcare: methodological challenges and opportunities\. Journal of Personalized Medicine 13\(10\), pp\. 1522\. Cited by: [§I](https://arxiv.org/html/2606.17405#S1.p1.1)\.
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. A\. Rusu, J\. Veness, M\. G\. Bellemare, A\. Graves, M\. Riedmiller, A\. K\. Fidjeland, G\. Ostrovski, et al\. \(2015\) Human\-level control through deep reinforcement learning\. Nature 518\(7540\), pp\. 529–533\. Cited by: [§III\-B](https://arxiv.org/html/2606.17405#S3.SS2.p1.8)\.
- Cancer Genome Atlas Research Network \(2011\) Integrated genomic analyses of ovarian carcinoma\. Nature 474\(7353\), pp\. 609\. Cited by: [§I](https://arxiv.org/html/2606.17405#S1.p4.1), [§III\-A2](https://arxiv.org/html/2606.17405#S3.SS1.SSS2.p1.1), [§IV\-B](https://arxiv.org/html/2606.17405#S4.SS2.p2.2), [§IV\-B](https://arxiv.org/html/2606.17405#S4.SS2.p3.1)\.
- OpenAI \(2024\) OpenAI API documentation\. Note: [https://platform\.openai\.com/docs/](https://platform.openai.com/docs/) Accessed: 2025\-08\-18\. Cited by: [§II\-D1](https://arxiv.org/html/2606.17405#S2.SS4.SSS1.p1.1)\.
- I\. Portability and A\. Act \(2012\) Guidance regarding methods for de\-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act \(HIPAA\) privacy rule\. Human Health Services: Washington, DC, USA\. Cited by: [5th item](https://arxiv.org/html/2606.17405#S1.I1.i5.p1.1)\.
- W\. H\. Raza, A\. B\. Shah, Y\. Wen, Y\. Shen, J\. D\. M\. Lemus, M\. C\. Schiess, T\. M\. Ellmore, R\. Hu, and X\. Fu \(2025\) NeuroMoE: a transformer\-based mixture\-of\-experts framework for multi\-modal neurological disorder classification\. In 2025 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society \(EMBC\), pp\. 1–7\. Cited by: [§I](https://arxiv.org/html/2606.17405#S1.p2.1)\.
- M\. Riedmiller \(2005\) Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method\. In European Conference on Machine Learning, pp\. 317–328\. Cited by: [§III\-B](https://arxiv.org/html/2606.17405#S3.SS2.p1.8)\.
- O\. Sener and S\. Savarese \(2017\) Active learning for convolutional neural networks: a core\-set approach\. Cited by: [§I](https://arxiv.org/html/2606.17405#S1.p2.1)\.
- E\. H\. Shortliffe and M\. J\. Sepúlveda \(2018\) Clinical decision support in the era of artificial intelligence\. JAMA 320\(21\), pp\. 2199–2200\. Cited by: [§IV\-C](https://arxiv.org/html/2606.17405#S4.SS3.p1.1), [§IV\-C](https://arxiv.org/html/2606.17405#S4.SS3.p2.1)\.
- R\. S\. Sutton, A\. G\. Barto, et al\. \(1998\) Reinforcement learning: an introduction\. Vol\. 1, MIT Press, Cambridge\. Cited by: [§I](https://arxiv.org/html/2606.17405#S1.p1.1)\.
- A\. Thuy and D\. F\. Benoit \(2024\) Reliable uncertainty with cheaper neural network ensembles: a case study in industrial parts classification\. arXiv preprint arXiv:2403\.10182\. Cited by: [3rd item](https://arxiv.org/html/2606.17405#S1.I1.i3.p1.1)\.
- S\. Tonekaboni, S\. Joshi, M\. D\. McCradden, and A\. Goldenberg \(2019\) What clinicians want: contextualizing explainable machine learning for clinical end use\. In Machine Learning for Healthcare Conference, pp\. 359–380\. Cited by: [§IV\-C](https://arxiv.org/html/2606.17405#S4.SS3.p1.1)\.
- H\. Van Hasselt, A\. Guez, and D\. Silver \(2016\) Deep reinforcement learning with double Q\-learning\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 30\. Cited by: [§III\-B](https://arxiv.org/html/2606.17405#S3.SS2.p1.8)\.
- R\. Zhang, C\. Gatsonis, and J\. A\. Steingrimsson \(2023\) Role of calibration in uncertainty\-based referral for deep learning\. Statistical Methods in Medical Research 32\(5\), pp\. 927–943\. Cited by: [§IV\-C](https://arxiv.org/html/2606.17405#S4.SS3.p2.1)\.
