MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning
Summary
MedGuideX transforms clinical practice guidelines into executable decision logic to generate factual and counterfactual QA data for training medical LLMs, achieving a 10.28% relative improvement in average accuracy across clinical reasoning benchmarks.
# MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning
Source: [https://arxiv.org/html/2605.26567](https://arxiv.org/html/2605.26567)
Yuhao Shen1,2, Lang Cao1, Simo Du3, Yuqing Wang3, Juexiao Zhou2, Hao Peng1, Yue Guo1

1University of Illinois Urbana\-Champaign; 2The Chinese University of Hong Kong, Shenzhen; 3Albert Einstein College of Medicine
###### Abstract
Clinical practice guidelines \(CPGs\) encode evidence\-based decision logic that clinicians apply by evaluating patient variables, conditional criteria, and recommendation rules\. However, existing methods often use CPGs as free\-text training data or retrieval sources, underutilizing their procedural decision structure\. To better exploit this structure, we introduce a guideline\-derived training pipeline that transforms CPG recommendations into executable clinical decision logic and uses it to generate factual and counterfactual question\-answering data\. These data teach models both guideline\-supported decisions and how decisions change under different patient conditions\. Post\-training a medical LLM on the generated data yields MedGuideX\. Across four clinical reasoning benchmarks, MedGuideX achieves a 10\.28% relative improvement in average accuracy\. Physician evaluation further shows that MedGuideX better recovers clinician\-authored reasoning steps and produces physician\-preferred rationales in faithfulness, validity, completeness, and clarity\. Overall, our results show that executable decision logic from CPGs can be transformed into scalable supervision for building reliable medical LLMs\.
## 1 Introduction
Large language models \(LLMs\) Singh et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib38)\); Yang et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib52)\) have shown strong potential in the medical domain, including electronic health record understanding, clinical case reasoning, and medical decision support Cao et al\. \([2026](https://arxiv.org/html/2605.26567#bib.bib4)\); Wu et al\. \([2025b](https://arxiv.org/html/2605.26567#bib.bib50)\); Lai et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib22)\)\. However, reliable clinical reasoning remains challenging\. It requires models to integrate heterogeneous patient evidence, apply domain knowledge, compare plausible clinical decisions, handle uncertainty, and follow evidence\-based decision logic Bowen \([2006](https://arxiv.org/html/2605.26567#bib.bib2)\); Nendaz and Perrier \([2012](https://arxiv.org/html/2605.26567#bib.bib31)\); Sox et al\. \([2024](https://arxiv.org/html/2605.26567#bib.bib39)\)\. Existing medical LLM training often relies on large\-scale medical corpora, clinical notes, or case reports Chen et al\. \([2023](https://arxiv.org/html/2605.26567#bib.bib8)\); Han et al\. \([2023](https://arxiv.org/html/2605.26567#bib.bib15)\); Labrak et al\. \([2024](https://arxiv.org/html/2605.26567#bib.bib21)\); Garcia\-Gasulla et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib11)\); Wu et al\. \([2025b](https://arxiv.org/html/2605.26567#bib.bib50)\)\. While useful, these data sources provide reasoning supervision only implicitly: they are often noisy, heterogeneous, incomplete, and weakly aligned with the explicit decision procedures clinicians use in practice Chen et al\. \([2024](https://arxiv.org/html/2605.26567#bib.bib6)\); Lai et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib22)\); Gu et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib12)\); Li et al\. \([2026](https://arxiv.org/html/2605.26567#bib.bib24)\); Yoo and Woo \([2025](https://arxiv.org/html/2605.26567#bib.bib54)\); Yang et al\. \([2024](https://arxiv.org/html/2605.26567#bib.bib53)\)\. As a result, models may acquire broad medical knowledge without learning stable and generalizable clinical decision logic\.
Clinical practice guidelines \(CPGs\) offer a natural source of such decision logic\. In clinical practice, clinicians apply guidelines by identifying patient variables, evaluating conditional criteria, and following recommendation rules\. Thus, beyond textual medical knowledge, CPGs encode procedural decision structures for diagnosis, treatment, and disease management\. However, existing CPG\-based methods often underutilize this structure\. Retrieval\-augmented or prompting\-based methods treat CPGs as external knowledge sources Schubert et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib33)\); Deng et al\. \([2026](https://arxiv.org/html/2605.26567#bib.bib10)\); Oniani et al\. \([2024](https://arxiv.org/html/2605.26567#bib.bib32)\); Li et al\. \([2023a](https://arxiv.org/html/2605.26567#bib.bib23)\), while direct training on guideline text exposes models to the content but does not explicitly represent the variables, conditions, and decision rules that make guidelines operational Staniek et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib40)\); Chen et al\. \([2023](https://arxiv.org/html/2605.26567#bib.bib8)\)\. Therefore, the internal decision logic of CPGs remains underexploited as scalable supervision for medical LLMs \(see Appendix [B](https://arxiv.org/html/2605.26567#A2) for more related work\)\.
To better exploit this structure, we propose a guideline\-grounded training pipeline for building LLMs with stronger clinical reasoning ability\. We first collect high\-quality, publicly available CPGs and transform their recommendations into executable functions that represent structured clinical decision logic\. Each function operates over patient variables and produces guideline\-consistent decisions, enabling controlled data generation and automatic verification\. Based on these functions, we generate factual and counterfactual question\-answering instances\. Factual instances teach models guideline\-supported decisions, while counterfactual instances teach how decisions should change when key patient conditions are modified\. This design follows prior findings that counterfactual reasoning can improve model reasoning ability and expose failures that standard QA evaluation may miss Chen et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib7)\); You et al\. \([2026](https://arxiv.org/html/2605.26567#bib.bib55)\); Vashishtha et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib46)\)\.
Using this pipeline, we train MedGuideX, a medical LLM designed to internalize guideline\-grounded clinical decision logic\. Specifically, we post\-train the base model with supervised fine\-tuning \(SFT\) and reinforcement learning \(RL\) on the generated factual and counterfactual data\. Experiments on four clinical reasoning benchmarks show that MedGuideX substantially improves over its base model and achieves strong performance among open\-source medical LLMs\. Compared with Qwen3\.5\-9B Yang et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib52)\), MedGuideX\-9B achieves relative improvements of 26\.64%, 9\.45%, 4\.41%, and 10\.51% on MedCaseReasoning Wu et al\. \([2025b](https://arxiv.org/html/2605.26567#bib.bib50)\), MedQA Jin et al\. \([2021](https://arxiv.org/html/2605.26567#bib.bib19)\), MIMIC\-CDM\-FI Hager et al\. \([2024](https://arxiv.org/html/2605.26567#bib.bib14)\), and ER\-Reason Mehandru et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib29)\), respectively\. Notably, the larger relative gains appear on the lower\-accuracy benchmarks, MedCaseReasoning and ER\-Reason, suggesting that guideline\-derived supervision is particularly useful for challenging reasoning settings\. Physician evaluation further shows that MedGuideX better recovers clinician\-authored reasoning steps and produces physician\-preferred rationales in faithfulness, validity, completeness, and overall quality\.
In summary, our contributions are:
- We propose a guideline\-derived post\-training pipeline that transforms CPGs into executable clinical decision logic and uses it to generate factual and counterfactual QA supervision\.
- We train MedGuideX, a medical LLM that internalizes guideline\-grounded clinical decision logic through SFT and RL\.
- We conduct experiments on four medical reasoning benchmarks, showing that MedGuideX improves over the base model and similarly sized medical LLMs, while producing higher\-quality clinical rationales\.

Figure 1: Overview of MedGuideX\. Top: We transform raw CPGs into executable Python functions $f$ via an intermediate decision tree, then sample clinical variables $X$ and execute $f(X)$ to deterministically label both factual and counterfactual QA instances\. Bottom: We post\-train an LLM with SFT on the mixed QA and RL on factual QA, where re\-executing $f$ provides the reward\. This design turns each guideline into an executable verifier that grounds both labels and rewards in its decision logic\.
## 2 Preliminary
We define a guideline recommendation as an actionable clinical decision rule extracted from a complete CPG\. A recommendation specifies how a clinical decision should be made under certain patient conditions, such as diagnosis, treatment, screening, or disease management\. For example, a recommendation may state that a patient with severe symptoms should be referred for further evaluation\.
We formalize each guideline recommendation as a function
$$Y = f(X), \qquad f: \mathcal{X} \to \mathcal{Y}, \tag{1}$$

where $X = (x_1, \dots, x_n) \in \mathcal{X}$ denotes a vector of clinical variables, $f$ denotes the conditional decision logic encoded by the recommendation, and $Y \in \mathcal{Y}$ is the guideline\-prescribed output\. We instantiate $f$ as a finite decision tree $T_f$, whose internal nodes are atomic predicates $a \in \mathcal{A}_f$ over $X$ and whose leaves are outputs in $\mathcal{Y}$\. Each predicate takes the form $a_i = \mathbf{1}[g_i(X)]$, where $g_i$ is a guideline\-defined condition, such as $\textit{age} \geq 65$ or $\textit{eGFR} < 30$\.
Given an input $X$, executing $T_f$ activates a path

$$\pi_f(X) = (a_{i_1}, a_{i_2}, \dots, a_{i_k}) \subseteq \mathcal{A}_f, \tag{2}$$

whose predicates jointly select the leaf output $Y = f(X)$\. This path is the verifiable unit of guideline logic that we aim for the model to internalize: it should not only predict the correct output, but also reason through a path consistent with $\pi_f(X)$\.
We next describe how we collect CPGs and construct factual and counterfactual QA data from them \(§[3](https://arxiv.org/html/2605.26567#S3)\), followed by how we train MedGuideX using SFT and RL \(§[4](https://arxiv.org/html/2605.26567#S4)\)\. Figure [1](https://arxiv.org/html/2605.26567#S1.F1) illustrates the full pipeline\.
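To make the formalization concrete, here is a minimal sketch of a recommendation written as a function that returns both the leaf output and the activated predicate path. The rule itself, the output labels, and the names `Decision` and `example_rule` are hypothetical and chosen only for illustration; the thresholds reuse the example predicates mentioned above and do not come from any actual guideline.

```python
from typing import NamedTuple, Tuple


class Decision(NamedTuple):
    output: str             # leaf output Y = f(X)
    path: Tuple[str, ...]   # activated predicates along the executed path


def example_rule(age: float, egfr: float) -> Decision:
    """Hypothetical decision tree; illustrative only, not clinical advice."""
    if egfr < 30:
        return Decision("avoid_drug", ("eGFR < 30",))
    if age >= 65:
        return Decision("reduce_dose", ("not eGFR < 30", "age >= 65"))
    return Decision("standard_dose", ("not eGFR < 30", "not age >= 65"))


decision = example_rule(age=70, egfr=45)
print(decision.output, decision.path)
```

Returning the path alongside the output is one way to expose the executed branch so that a rationale can later be checked against it.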
## 3 Data Preparation
The central artifact of data preparation is an executable implementation of $f$, whose control flow mirrors the decision tree $T_f$\. This executable form enables downstream supervision by deterministically labeling synthesized questions through direct execution and verifying model predictions during training by re\-executing $f$ on the model\-stated intermediate variables\. In this formulation, the inputs correspond to structured patient information and clinical scenarios, while the executable function $f$ represents structured clinical reasoning grounded in guideline decision logic\.
### 3\.1 Guideline Curation
Our initial guideline source is an open CPG collection based on the corpus used to train MEDITRON Chen et al\. \([2023](https://arxiv.org/html/2605.26567#bib.bib8)\)\. However, many documents in this collection are noisy, low quality, or near\-duplicate guidelines, which can substantially degrade the quality of downstream QA data\. We therefore apply a curation pipeline to retain high\-quality guidelines\.
We first restrict the corpus to guidelines from U\.S\.\-based sources, since clinical recommendations may vary across countries, healthcare systems, and organizations\. Specifically, we retain guidelines from the Centers for Disease Control and Prevention \(CDC; [https://www\.cdc\.gov/](https://www.cdc.gov/)\) and PubMed \([https://pubmed\.ncbi\.nlm\.nih\.gov/](https://pubmed.ncbi.nlm.nih.gov/)\)\. We then use an LLM to extract structured metadata for each guideline, including the disease or drug, target age group, race, gender, and publication date\. Guidelines with identical metadata are treated as duplicates, and only the most recent version is retained\. In addition, we instruct the LLM to discard incomplete documents directly\. After curation, we obtain a filtered subset of CPGs\.
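The metadata-based deduplication step can be sketched as follows. This is a minimal illustration under stated assumptions: the field names (`topic`, `age_group`, `race`, `gender`, `published`) are hypothetical, and the sketch assumes that the duplicate key is built from every metadata field except the publication date, which is then used only to pick the most recent version.

```python
from datetime import date


def deduplicate(guidelines):
    """Keep the most recent guideline for each metadata key (assumed key fields)."""
    latest = {}
    for g in guidelines:
        key = (g["topic"], g["age_group"], g["race"], g["gender"])
        if key not in latest or g["published"] > latest[key]["published"]:
            latest[key] = g
    return list(latest.values())


docs = [
    {"id": "a", "topic": "influenza", "age_group": "adult", "race": "any",
     "gender": "any", "published": date(2015, 1, 1)},
    {"id": "b", "topic": "influenza", "age_group": "adult", "race": "any",
     "gender": "any", "published": date(2020, 6, 1)},
    {"id": "c", "topic": "influenza", "age_group": "pediatric", "race": "any",
     "gender": "any", "published": date(2012, 3, 1)},
]
kept = deduplicate(docs)
print(sorted(g["id"] for g in kept))
```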
### 3\.2 Executable Transformation
#### Recommendation Extraction\.
We first split each document into recommendation\-oriented chunks, where each chunk contains one or more complete guideline recommendations\. This produces a set of guideline recommendation passages\. An LLM extractor then identifies recommendation candidates from each chunk\. For each candidate, we ask the extractor to identify the target population, clinical condition, recommended action, relevant exceptions, and evidence grade, when available\. We then validate these candidates and retain only usable recommendations\. Specifically, we discard candidates that do not describe a concrete clinical action or cannot be expressed as a condition\-action rule\. We also remove near\-duplicates, where multiple recommendations describe highly similar populations, conditions, and actions\.
#### Decision\-Tree Validation\.
Each retained recommendation is converted into a decision tree $T_f$, which specifies the required input variables, decision conditions, and final outputs\. An LLM validator then checks whether the tree is complete, whether each condition is clear, whether every branch leads to an output, and whether all variables are supported by the source guideline\.
#### Compilation to Executable Function\.
Each validated tree is compiled into an executable Python function that takes the variables $X$ as input and returns the guideline output $f(X)$\. We further check whether the function is syntactically correct, executable on sampled inputs, and consistent with the original decision tree\.
### 3\.3 QA Synthesis
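One way to picture the consistency check is to interpret the decision tree directly and compare it with the compiled function on sampled inputs. The sketch below assumes a nested-dictionary tree encoding, a toy rule, and uniform integer sampling ranges; all of these (`TREE`, `run_tree`, `compiled`, `is_consistent`) are hypothetical and are not the paper's actual representation.

```python
import operator
import random

OPS = {"<": operator.lt, ">=": operator.ge}

# Hypothetical tree encoding: internal nodes are dicts, leaves are output strings.
TREE = {
    "var": "egfr", "op": "<", "value": 30,
    "yes": "avoid_drug",
    "no": {"var": "age", "op": ">=", "value": 65,
           "yes": "reduce_dose", "no": "standard_dose"},
}


def run_tree(node, x):
    """Walk the tree until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        holds = OPS[node["op"]](x[node["var"]], node["value"])
        node = node["yes"] if holds else node["no"]
    return node


def compiled(x):
    """Function form of the same toy rule."""
    if x["egfr"] < 30:
        return "avoid_drug"
    return "reduce_dose" if x["age"] >= 65 else "standard_dose"


def is_consistent(fn, tree, n=1000, seed=0):
    """Return True if fn and the tree agree on n randomly sampled inputs."""
    rng = random.Random(seed)
    for _ in range(n):
        x = {"age": rng.randint(18, 95), "egfr": rng.randint(5, 120)}
        if fn(x) != run_tree(tree, x):
            return False
    return True


print(is_consistent(compiled, TREE))
```

Random sampling can miss disagreements that occur only at exact thresholds, so a fuller check would probably also probe boundary values of each predicate.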
#### Factual QA Synthesis\.
For each executable function $f$, we sample complete clinical variable assignments $X$ and execute $f$ to obtain the guideline output $Y_{\text{obs}} = f(X)$\. A naive sampling strategy would produce too many easy or default cases, such as no\-action recommendations\. To avoid this imbalance, we enforce two constraints: \(1\) Path coverage: the generated data should cover all decision conditions in the tree, and \(2\) Output balance: no\-diagnosis outputs, meaning that no diagnosis is recommended for the current inputs, should not dominate the dataset\.
After applying the coverage and balancing constraints, we obtain the factual QA set\. For each sample, we use an LLM to generate a step\-by\-step reasoning trace from the underlying Python function $f$, input variables $X$, and executed output $Y_{\text{obs}}$\. The reasoning trace verbalizes the executed decision path $\pi_f(X)$ and is stored alongside the QA pair for SFT training\.
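The two sampling constraints could be approximated with per-path quotas, as in the sketch below. The rule, the quota sizes, the `no_action` default label, and the sampling ranges are all assumptions made for illustration; the source does not specify how coverage and balance are enforced.

```python
import random
from collections import Counter, defaultdict


def rule(x):
    """Toy rule returning (output, executed path); hypothetical."""
    if x["egfr"] < 30:
        return "avoid_drug", ("egfr<30",)
    if x["age"] >= 65:
        return "reduce_dose", ("egfr>=30", "age>=65")
    return "no_action", ("egfr>=30", "age<65")


def sample_factual(fn, per_path=40, default="no_action", default_cap=20,
                   seed=0, max_tries=20_000):
    """Rejection-sample inputs, filling a quota per path and capping the default output."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for _ in range(max_tries):
        x = {"age": rng.randint(18, 95), "egfr": rng.randint(5, 120)}
        y, path = fn(x)
        quota = default_cap if y == default else per_path
        if len(buckets[path]) < quota:
            buckets[path].append({"X": x, "Y_obs": y, "path": path})
    return [ex for bucket in buckets.values() for ex in bucket]


data = sample_factual(rule)
print(Counter(ex["Y_obs"] for ex in data))
```

Bucketing by path rather than by output keeps every branch represented even when several branches share the same output.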
#### Counterfactual QA Synthesis\.
We further generate counterfactual QA data to train the model to reason about hypothetical changes in patient conditions\. For each counterfactual example, we first sample a complete variable assignment $X$ over the inputs of $f$ and execute $f$ to obtain the factual outcome $Y_{\text{obs}} = f(X)$\. We then partition the variables into three disjoint parts:
- $X_{\text{obs}}$: observed variables that are shown and remain unchanged\.
- $X_{\text{hid}}$: hidden variables that are not shown\.
- $x_{\text{int}}$: a single observed variable modified by the intervention, meaning that its value is changed while all other observed variables are held fixed\.
We write $\hat{x}_{\text{int}}$ for the value of $x_{\text{int}}$ after intervention\. Only $x_{\text{int}}$ changes, while $X_{\text{obs}}$ and $X_{\text{hid}}$ are held fixed\. The factual and counterfactual outcomes can be written as

$$\begin{aligned} Y_{\text{obs}} &= f\big(X_{\text{obs}}, X_{\text{hid}}, x_{\text{int}}\big), \\ \hat{Y}_{\text{cf}} &= f\big(X_{\text{obs}}, X_{\text{hid}}, \hat{x}_{\text{int}}\big). \end{aligned} \tag{3}$$

By construction, $Y_{\text{obs}}$ and $\hat{Y}_{\text{cf}}$ differ only through the intervention on $x_{\text{int}}$\.
The model receives $X_{\text{obs}}$, the identity of the hidden variables, the original and intervened values $x_{\text{int}}$ and $\hat{x}_{\text{int}}$, and the factual outcome $Y_{\text{obs}}$\. It must then predict $\hat{Y}_{\text{cf}}$ through three steps:
1. Abduction\. Infer hidden values $X_{\text{hid}}$ that are consistent with the observed context and reproduce $Y_{\text{obs}}$ when executed through $f$\.
2. Intervention\. Replace $x_{\text{int}}$ with $\hat{x}_{\text{int}}$ while keeping $X_{\text{obs}}$ and $X_{\text{hid}}$ fixed\.
3. Prediction\. Execute $f$ on the resulting complete input to obtain the counterfactual outcome $\hat{Y}_{\text{cf}}$\.
At the data level, we execute $f$ on the factual input, apply the intervention, and execute $f$ again to generate interventional scenarios\. We retain only cases where the intervention changes the outcome and discard those with unchanged outputs\. Each counterfactual sample is paired with a reasoning trace generated by an LLM, which verbalizes the three steps of abduction, intervention, and prediction over the executable function $f$\. Overall data pipeline statistics are shown in Appendix [C](https://arxiv.org/html/2605.26567#A3)\.
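The data-level procedure (execute, intervene, execute again, keep only outcome-changing cases) can be sketched as below. The toy rule, the variable names, and the record layout returned by `make_counterfactual` are hypothetical; the sketch also assumes a single hidden variable for brevity.

```python
def rule(x):
    """Toy rule over three variables; hypothetical and illustrative only."""
    if not x["symptomatic"]:
        return "no_action"
    if x["egfr"] < 30:
        return "refer"
    return "reduce_dose" if x["age"] >= 65 else "standard_dose"


def make_counterfactual(fn, x, hidden, intervened, new_value):
    """Build one counterfactual record, or return None if the outcome is unchanged."""
    y_obs = fn(x)                              # factual outcome
    x_cf = {**x, intervened: new_value}        # only the intervened variable changes
    y_cf = fn(x_cf)                            # counterfactual outcome
    if y_cf == y_obs:
        return None                            # discard unchanged outputs
    observed = {k: v for k, v in x.items() if k not in (hidden, intervened)}
    return {
        "X_obs": observed,
        "hidden_names": [hidden],              # identity shown, value withheld
        "x_int": {intervened: x[intervened]},
        "x_int_new": {intervened: new_value},
        "Y_obs": y_obs,
        "Y_cf": y_cf,                          # label obtained by execution
        "X_hid_oracle": {hidden: x[hidden]},   # kept for verification only
    }


full = {"symptomatic": True, "egfr": 45, "age": 70}
example = make_counterfactual(rule, full, hidden="egfr", intervened="age", new_value=50)
print(example["Y_obs"], "->", example["Y_cf"])
```

In this toy case the factual outcome `reduce_dose` already implies that the hidden eGFR is not below 30, which is the kind of inference the abduction step asks for.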
## 4 Model Training
We post\-train MedGuideX using the prepared factual and counterfactual QA sets\.
#### Optimization objectives\.
For each guideline recommendation, we obtain an executable function $f$\. Given a complete patient scenario $X$, the function returns the guideline output

$$Y = f(X). \tag{4}$$

A factual QA instance asks the model to determine the guideline output for the complete patient scenario $X$\. For factual examples, each oracle trajectory contains the patient scenario, the executed guideline output, and a rationale that explains why the scenario leads to that output $Y$\.
For counterfactual QA examples, each trajectory follows an abduction–intervention–prediction structure\. The prompt provides $X_{\text{obs}}$, the original value $x_{\text{int}}$, the intervened value $\hat{x}_{\text{int}}$, and the factual guideline output $Y_{\text{obs}}$; the hidden variable is not shown to the model\. The model must first infer a hidden value that makes the factual observation consistent with the executable guideline, then apply the intervention while keeping the inferred hidden value fixed, and finally predict the counterfactual guideline output $\hat{Y}_{\text{cf}}$\.
The optimization objective is to train the model so that its predicted guideline output $\hat{Y}$ matches the oracle output $Y$ as closely as possible in both factual and counterfactual QA, thereby improving its ability to perform clinical reasoning aligned with the CPGs from which the synthetic data are derived\.
#### SFT\.
The SFT stage trains the model to imitate oracle trajectories synthesized from the executable guideline functions\. This stage serves as a cold start for subsequent RL training by injecting guideline\-grounded clinical knowledge and exposing the model to a structured clinical reasoning framework\.
#### RL\.
After SFT, we apply GRPO \(Shao et al\., [2024](https://arxiv.org/html/2605.26567#bib.bib35)\) to optimize the model on its own sampled responses\. For a sampled response $o$, we define a format reward $r_{\text{fmt}}(o)$ and an answer reward $r_{\text{answer}}(o)$\. The format reward checks whether $o$ satisfies the required response format, while the answer reward checks whether the parsed answer is correct\. The final reward is

$$r(o) = \begin{cases} -1, & r_{\text{fmt}}(o) = -1, \\ r_{\text{answer}}(o), & r_{\text{fmt}}(o) = 0. \end{cases} \tag{5}$$

Here, $r_{\text{fmt}}(o) = -1$ indicates an invalid response format, $r_{\text{fmt}}(o) = 0$ indicates a valid format, and $r_{\text{answer}}(o) \in \{0, 1\}$ indicates whether the task\-specific correctness check passes\.
For factual prompts, correctness depends only on the final guideline output\. Let $\hat{Y}(o)$ denote the final answer parsed from a sampled response $o$\. We define the factual answer reward as

$$r_{\text{F}}(o) = \mathbf{1}\!\left[\hat{Y}(o) \equiv Y^{\star}\right]. \tag{6}$$
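The gated reward and the factual answer check can be sketched as follows. The `<answer>...</answer>` response format and the case- and whitespace-insensitive comparison used to stand in for the equivalence check are assumptions of this sketch; the source does not specify the required format or the exact equivalence test.

```python
import re

# Assumed response format: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(response: str) -> int:
    """Return 0 for a valid format and -1 for an invalid one."""
    return 0 if ANSWER_RE.search(response) else -1


def factual_answer_reward(response: str, oracle: str) -> int:
    """Return 1 if the parsed answer matches the executed guideline output, else 0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0
    # Assumption: equivalence is approximated by normalized string equality.
    return int(match.group(1).strip().lower() == oracle.strip().lower())


def reward(response: str, oracle: str) -> int:
    """Format failures score -1; otherwise the answer reward is returned."""
    if format_reward(response) == -1:
        return -1
    return factual_answer_reward(response, oracle)


print(reward("Reasoning... <answer>Reduce_dose</answer>", "reduce_dose"))
```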
For counterfactual prompts, correctness requires more than the final answer\. Let $\hat{X}_{\text{hid}}(o)$ denote the hidden value inferred by the model, and let $\hat{Y}_{\text{cf}}(o)$ denote its final counterfactual answer\. The response is counted as correct only if the inferred hidden value matches the intended hidden state, the inferred hidden value together with the original factual context reproduces $Y_{\text{obs}}$, and the final counterfactual answer matches the executed counterfactual label:

$$\begin{aligned} r_{\text{CF}}(o) = {} & \mathbf{1}\!\left[\hat{X}_{\text{hid}}(o) = X_{\text{hid}}^{\star}\right] \\ & \cdot \mathbf{1}\!\left[f\big(X_{\text{base}}, \tilde{x}_{\text{int}}, \hat{X}_{\text{hid}}(o)\big) = Y_{\text{obs}}\right] \\ & \cdot \mathbf{1}\!\left[\hat{Y}_{\text{cf}}(o) = Y_{\text{cf}}^{\star}\right]. \end{aligned} \tag{7}$$

Here, $X_{\text{base}}$ and $\tilde{x}_{\text{int}}$ denote the original factual context, that is, the observed variables and the pre\-intervention value of the intervened variable, and $Y_{\text{cf}}^{\star}$ is the executed counterfactual label\. Thus, the counterfactual reward requires the model to recover the intended hidden value and predict the correct counterfactual output\.
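The three-part counterfactual check can be sketched as a product of indicator conditions. The toy rule, the boolean hidden variable, and the argument layout of `counterfactual_reward` are hypothetical; the sketch assumes the model's inferred hidden value and final answer have already been parsed from the response.

```python
def rule(x):
    """Toy rule with a categorical hidden variable; hypothetical."""
    if not x["symptomatic"]:
        return "no_action"
    if x["severe_renal_impairment"]:
        return "avoid_drug"
    return "reduce_dose" if x["age"] >= 65 else "standard_dose"


def counterfactual_reward(fn, x_obs, x_int, hid_pred, hid_oracle,
                          y_obs, y_cf_pred, y_cf_oracle):
    """Return 1 only if all three conditions hold, else 0."""
    hidden_ok = hid_pred == hid_oracle
    # Re-execute the function on the original factual context plus the inferred hidden value.
    reproduces = fn({**x_obs, **x_int, **hid_pred}) == y_obs
    answer_ok = y_cf_pred == y_cf_oracle
    return int(hidden_ok and reproduces and answer_ok)


x_obs = {"symptomatic": True}
x_int = {"age": 70}                                  # pre-intervention value
oracle_hidden = {"severe_renal_impairment": False}

good = counterfactual_reward(rule, x_obs, x_int, {"severe_renal_impairment": False},
                             oracle_hidden, "reduce_dose", "standard_dose", "standard_dose")
bad_hidden = counterfactual_reward(rule, x_obs, x_int, {"severe_renal_impairment": True},
                                   oracle_hidden, "reduce_dose", "standard_dose", "standard_dose")
print(good, bad_hidden)
```

A response with the right final answer but a wrong inferred hidden value scores 0 here, which is what distinguishes this check from the factual one.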
#### Training procedure\.
We perform SFT on the balanced factual and counterfactual QA mixture, then initialize GRPO from the SFT checkpoint\. The reward implementation supports both factual and counterfactual QA\. Our final MedGuideX configuration uses mixed factual and counterfactual SFT followed by factual GRPO, as selected by the ablation study in Appendix [F](https://arxiv.org/html/2605.26567#A6)\. Counterfactual GRPO is evaluated as an alternative configuration in the same ablation\.
## 5 Experiments
Table 1: Main results on four clinical reasoning benchmarks \(accuracy, %\)\. Bold marks the best score in each column, and italics indicate the second\-best score among open\-source medical LLMs\. Relative gains over the corresponding base model are shown in green\. MedGuideX achieves strong performance across benchmarks\.

### 5\.1 Experimental Setup
We evaluate MedGuideX on four clinical reasoning benchmarks spanning medical exam questions and real\-world case\-based diagnostic reasoning: MedQA Jin et al\. \([2021](https://arxiv.org/html/2605.26567#bib.bib19)\), MedCaseReasoning Wu et al\. \([2025b](https://arxiv.org/html/2605.26567#bib.bib50)\), MIMIC\-CDM\-FI Hager et al\. \([2024](https://arxiv.org/html/2605.26567#bib.bib14)\), and ER\-Reason Mehandru et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib29)\)\. These four benchmarks differ in data source, task format, reasoning depth, and difficulty, enabling a more comprehensive evaluation of MedGuideX\. Details and examples of these benchmarks are provided in Appendix [D](https://arxiv.org/html/2605.26567#A4)\. Notably, none of the training sets from these benchmarks are used during the training of MedGuideX\.
We compare MedGuideX with three groups of baselines, with sources listed in Table [1](https://arxiv.org/html/2605.26567#S5.T1)\. *Frontier LLMs* include strong proprietary or large\-scale general\-purpose models and serve as an upper\-performance reference\. To comply with institutional data governance requirements, all proprietary model inference is conducted through Azure AI\. *Open\-source Medical LLMs* are models specialized for the medical domain through medical pretraining or fine\-tuning, and constitute the most directly comparable baselines to MedGuideX at a similar scale\. *Base models* are the untuned Qwen3\.5\-4B and Qwen3\.5\-9B backbones used by MedGuideX\. We also compare against methods that leverage CPGs in different ways \(Appendix [H](https://arxiv.org/html/2605.26567#A8)\)\. The main results report final\-answer accuracy, measured by exact match or LLM\-as\-a\-judge evaluation\. Additional experimental details are provided in Appendix [E](https://arxiv.org/html/2605.26567#A5)\.
Figure 2: Training strategy ablation on data composition and training phases\. Each group of bars corresponds to a configuration specifying the data used in the SFT and RL phases\. *Mixed* denotes a jointly trained corpus with equal factual and counterfactual proportions, and $A \to B$ denotes applying data type $A$ before data type $B$ within a single phase\. The best overall configuration uses mixed\-data SFT followed by factual RL\.
### 5\.2 Main Results
Table [1](https://arxiv.org/html/2605.26567#S5.T1) reports accuracy on four clinical reasoning benchmarks\. Overall, MedGuideX consistently improves performance at both model scales\. At the 9B scale, MedGuideX improves over Qwen3\.5\-9B by 26\.64%, 9\.45%, 4\.41%, and 11\.67% on MedCaseReasoning, MedQA, MIMIC\-CDM\-FI, and ER\-Reason, respectively, increasing the average score by 10\.28%\. The 4B model shows the same trend, with relative improvements of 23\.35%, 6\.39%, 2\.49%, and 9\.75%, and an average\-score improvement of 7\.63%\. These results show that guideline\-derived post\-training provides a consistent improvement signal across both model scales\.
The relative gains are largest on MedCaseReasoning and ER\-Reason, the two benchmarks where the backbone accuracy is lowest\. This suggests that guideline\-derived decision supervision is particularly useful in more challenging reasoning settings\. At the same time, the gains on MIMIC\-CDM\-FI are smaller, likely because this benchmark already has high backbone performance and requires additional abilities beyond guideline decision logic, such as synthesizing noisy EHR evidence and mapping clinical findings to benchmark\-specific labels\. Thus, MedGuideX improves broadly, but the magnitude of improvement depends on how directly each benchmark matches the decision\-logic supervision used in training\.
Compared with open\-source medical LLMs, MedGuideX achieves strong performance\. MedGuideX\-9B obtains the best average score among all open\-source medical LLMs in the table and is the strongest open\-source model on MedCaseReasoning, MedQA, and MIMIC\-CDM\-FI, while remaining competitive on ER\-Reason\. MedGuideX\-4B also exceeds the average performance of all listed open\-source medical LLMs, showing that the proposed training pipeline is effective even at a smaller scale\. Although MedGuideX\-9B remains below the strongest proprietary systems on average, it narrows the gap substantially while using a compact open model\.
The guideline\-based baselines further clarify the importance of executable supervision\. Retrieval\-augmented guideline prompting and in\-context guideline demonstrations provide only modest improvements over Qwen3\.5\-9B, increasing the average score by 1\.56% and 1\.48%, respectively\. These methods expose the model to guideline text or guideline\-style examples at inference time, but do not train the model to internalize the underlying decision logic\. CPGPrompt Deng et al\. \([2026](https://arxiv.org/html/2605.26567#bib.bib10)\), which constructs a decision\-tree prompt from retrieved guideline recommendations, performs substantially worse in our setting, decreasing the average score by 24\.56%\. Direct fine\-tuning on raw CPG text also leads to negative transfer, decreasing the average score by 9\.08%, suggesting that textual exposure to guidelines alone is insufficient for learning operational clinical decision rules\.
The strongest guideline\-based baseline is RL with CPG\-derived process rewards, which improves the average score by 5\.37% over Qwen3\.5\-9B\. However, it still underperforms MedGuideX, whose average relative improvement is 10\.28%\. This baseline uses an LLM judge to assess consistency with retrieved guideline recommendations, but it does not execute guideline functions or verify counterfactual decision behavior\. The gap between this baseline and MedGuideX suggests that executable guideline functions provide a more precise and effective supervision signal than retrieval\-based or judge\-only uses of CPGs\. A controlled qualitative case study comparing MedGuideX with GPT\-5\.0 and MedReason\-8B is presented in Appendix [I](https://arxiv.org/html/2605.26567#A9)\.
#### Ablation study\.
Detailed ablations are reported in Appendix [F](https://arxiv.org/html/2605.26567#A6)\. The best configuration applies mixed factual/counterfactual SFT followed by factual RL, yielding gains of 8\.7 points on MedCaseReasoning and 6\.9 points on MedQA over Qwen3\.5\-9B\. This configuration is used for MedGuideX\-9B, supporting our design choices for both the executable QA supervision and the training strategy\.
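Since the text reports both absolute point gains and relative improvements, the relation between the two can be sketched as below. This assumes that "relative improvement" means the accuracy difference divided by the base accuracy; the helper names are hypothetical, and the implied base accuracies are rough back-of-the-envelope values derived from rounded figures, not numbers taken from Table 1.

```python
def relative_improvement(base_acc: float, new_acc: float) -> float:
    """Relative improvement in percent: 100 * (new - base) / base."""
    return 100.0 * (new_acc - base_acc) / base_acc


def implied_base_accuracy(point_gain: float, relative_gain_pct: float) -> float:
    """Base accuracy implied by an absolute gain (points) and a relative gain (%)."""
    return 100.0 * point_gain / relative_gain_pct


# Generic example: moving from 50.0 to 55.0 accuracy is a 10% relative improvement.
print(relative_improvement(50.0, 55.0))

# Rough consistency check using the reported 9B figures.
print(round(implied_base_accuracy(8.7, 26.64), 1))  # MedCaseReasoning, roughly 32.7
print(round(implied_base_accuracy(6.9, 9.45), 1))   # MedQA, roughly 73.0
```

Under this reading, the same absolute gain yields a larger relative improvement on a benchmark with lower base accuracy, which is consistent with the pattern the paper describes.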
Table 2: Paired answer transition analysis between Qwen3\.5 and MedGuideX\. Corrected denotes cases where MedGuideX changes a wrong Qwen3\.5 answer into a correct one, while Regressed denotes the opposite transition\. Unchanged denotes cases with no change in correctness status\. Net Gain is computed as Corrected minus Regressed\.
### 5\.3 Paired Answer Transition Analysis
To better understand the source of the performance gains, we conduct a paired comparison between MedGuideX and its Qwen3\.5 backbone on the same test instances\. For each benchmark, we categorize answer transitions into three groups: Corrected, Regressed, and Unchanged\.
As shown in Table [2](https://arxiv.org/html/2605.26567#S5.T2), MedGuideX consistently produces positive net gains across all benchmarks and both model scales\. At the 4B scale, MedGuideX corrects substantially more backbone errors than it introduces regressions, yielding net gains of 60 on MedCaseReasoning, 59 on MedQA, 20 on MIMIC\-CDM\-FI, and 23 on ER\-Reason\. The same trend holds at the 9B scale\. These results indicate that guideline\-derived post\-training improves performance primarily by correcting backbone errors rather than by introducing unpredictable answer shifts\.
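The transition categories and the net gain can be computed from two aligned lists of per-instance correctness flags, as in this minimal sketch. The function name and the toy inputs are hypothetical and are not taken from the paper's data.

```python
def transition_counts(base_correct, tuned_correct):
    """Count Corrected, Regressed, and Unchanged transitions on paired instances."""
    if len(base_correct) != len(tuned_correct):
        raise ValueError("both models must be scored on the same instances")
    counts = {"corrected": 0, "regressed": 0, "unchanged": 0}
    for base_ok, tuned_ok in zip(base_correct, tuned_correct):
        if not base_ok and tuned_ok:
            counts["corrected"] += 1      # wrong -> correct
        elif base_ok and not tuned_ok:
            counts["regressed"] += 1      # correct -> wrong
        else:
            counts["unchanged"] += 1      # same correctness status
    counts["net_gain"] = counts["corrected"] - counts["regressed"]
    return counts


base = [False, False, True, True, False]
tuned = [True, True, False, True, False]
print(transition_counts(base, tuned))
```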
Table 3: Clinical reasoning evaluation comparing MedGuideX\-9B and GPT\-5\.0 \(%\)\. Reasoning recall is the fraction of clinician\-authored reference reasoning steps recovered in the model rationale on MedCaseReasoning, scored by an LLM judge\. Physician pairwise win rates report blinded non\-tie preference rates of MedGuideX\-9B over GPT\-5\.0 on 30 case annotations\.
### 5\.4 Clinical Reasoning Process Evaluation
Final\-answer accuracy does not directly measure whether a model reaches its answer through clinically sound reasoning\. We therefore evaluate rationale quality using two complementary protocols: automatic reasoning recall against clinician\-authored reference traces and a blinded physician preference study\.
#### Reasoning recall\.
MedCaseReasoning includes clinician\-authored reasoning traces for each case, which allows us to measure *reasoning recall*: the fraction of salient reference reasoning steps recovered in the model\-generated rationale, scored by an LLM judge following the paper’s protocol \(Appendix [D](https://arxiv.org/html/2605.26567#A4)\)\. As reported in Table [3](https://arxiv.org/html/2605.26567#S5.T3), MedGuideX\-9B achieves a reasoning recall of 62\.77%, compared with 50\.78% for GPT\-5\.0\. This suggests that MedGuideX\-9B recovers more clinically relevant intermediate reasoning steps, indicating that guideline\-derived post\-training improves not only answer prediction but also the coverage of clinician\-aligned reasoning\.
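As a rough illustration of the metric, the sketch below computes recall as covered steps divided by total reference steps. The judge is passed in as a callable; `keyword_judge` is a toy substring check standing in for the LLM judge and is purely an assumption for this example.

```python
def reasoning_recall(reference_steps, rationale, is_covered):
    """Fraction of reference reasoning steps that the judge marks as covered."""
    if not reference_steps:
        raise ValueError("at least one reference step is required")
    covered = sum(1 for step in reference_steps if is_covered(step, rationale))
    return covered / len(reference_steps)


def keyword_judge(step, rationale):
    """Toy stand-in for the LLM judge: case-insensitive substring match."""
    return step.lower() in rationale.lower()


# Illustrative inputs only.
steps = ["adrenal mass", "elevated metanephrines", "paroxysmal hypertension"]
rationale = "An adrenal mass with elevated metanephrines suggests a catecholamine-secreting tumor."
print(round(reasoning_recall(steps, rationale, keyword_judge), 4))
```

In practice the callable would wrap a judge prompt that also accepts implicit coverage, which a substring match cannot detect.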
#### Blinded physician evaluation\.
We conduct a blinded pairwise physician evaluation on 30 MedCaseReasoning cases where both MedGuideX\-9B and GPT\-5\.0 produce the correct final diagnosis\. This design controls for final\-answer correctness and focuses the comparison on rationale quality\. Two practicing physicians each annotate 20 cases\. For each case, the physician compares two anonymized responses shown in randomized order and judges four dimensions: evidence faithfulness, reasoning validity, reasoning completeness, and reasoning clarity\. We treat Same as a tie and report non\-tie win rates of MedGuideX\-9B over GPT\-5\.0\.
As shown in Table [3](https://arxiv.org/html/2605.26567#S5.T3), MedGuideX\-9B is preferred in faithfulness \(85\.00%\), validity \(79\.31%\), completeness \(92\.31%\), and clarity \(51\.52%\), with an overall non\-tie win rate of 76\.86%\. The largest advantages appear in completeness and faithfulness, suggesting that MedGuideX\-9B provides more complete reasoning and better grounding in case evidence\. The clarity score is close to parity, indicating that MedGuideX\-9B maintains comparable presentation quality while improving the clinically substantive dimensions of rationale quality\. These results show that the proposed training improves reasoning quality in ways that are visible to practicing physicians and not captured by final\-answer accuracy alone\.
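A non-tie win rate is usually read as wins divided by wins plus losses, with ties excluded from the denominator. The sketch below follows that reading; the label strings `"A"`, `"B"`, and `"Same"` and the function name are assumptions, not the study's annotation schema.

```python
def non_tie_win_rate(labels, ours="A", theirs="B", tie="Same"):
    """Win rate of `ours` over `theirs`, ignoring annotations marked as ties."""
    wins = sum(1 for label in labels if label == ours)
    losses = sum(1 for label in labels if label == theirs)
    unknown = [label for label in labels if label not in (ours, theirs, tie)]
    if unknown:
        raise ValueError(f"unexpected labels: {unknown}")
    if wins + losses == 0:
        raise ValueError("all annotations are ties; win rate is undefined")
    return wins / (wins + losses)


# Toy annotations for one dimension, not data from the study.
annotations = ["A", "A", "Same", "B", "A"]
print(non_tie_win_rate(annotations))
```

Since ties are dropped, the denominator can differ across the four dimensions, so per-dimension rates are not directly comparable in sample size.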
## 6 Conclusion
Inspired by clinical practice, we introduced MedGuideX, a medical LLM trained to internalize decision logic from CPGs for clinical reasoning\. Instead of treating guidelines merely as unstructured textual knowledge, we transform high\-quality guidelines into executable and structured training supervision, including both factual and counterfactual QA instances that expose models to evidence\-based clinical decision rules\. Through SFT and RL post\-training, experiments on four benchmarks show that MedGuideX substantially improves over the base model and outperforms similarly sized medical LLMs\. Further human evaluations demonstrate that MedGuideX not only improves diagnostic accuracy, but also generates more clinically grounded, coherent, and reliable rationales\. These results suggest that CPGs can serve as a valuable source of scalable supervision for building more reliable medical reasoning models\.
Overall, our work highlights the value of moving beyond surface\-level medical knowledge toward structured, executable clinical decision logic\. We believe guideline\-derived supervision provides a promising direction for developing medical LLMs that reason more consistently with evidence\-based clinical practice\.
## Limitations
While our work demonstrates promising results, it still has limitations\. MedGuideX should be viewed as a research system for supporting studies of medical reasoning rather than as a substitute for clinical professionals\. Further validation, safety evaluation, regulatory review, and integration with clinical workflows are needed before deployment in production environments or real\-world clinical use\.
## References
- Anthropic \(2025\)Anthropic\. 2025\.Claude haiku 4\.5 \[large language model\]\.[https://claude\.ai/](https://claude.ai/)\.Released October 15, 2025\.
- Bowen \(2006\)Judith L Bowen\. 2006\.Educational strategies to promote clinical diagnostic reasoning\.*New England Journal of Medicine*, 355\(21\):2217–2225\.
- Cao \(2024\)Lang Cao\. 2024\.Graphreason: Enhancing reasoning capabilities of large language models through a graph\-based verification approach\.In*Proceedings of the 2nd Workshop on Natural Language Reasoning and Structured Explanations \(@ ACL 2024\)*, pages 1–12\.
- Cao et al\. \(2026\)Lang Cao, Qingyu Chen, and Yue Guo\. 2026\.Ehr\-rag: Bridging long\-horizon structured electronic health records and large language models via enhanced retrieval\-augmented generation\.*arXiv preprint arXiv:2601\.21340*\.
- Cao et al\. \(2025\)Lang Cao, Jingxian Xu, Hanbing Liu, Jinyu Wang, Mengyu Zhou, Haoyu Dong, Shi Han, and Dongmei Zhang\. 2025\.Formula\-r1: Incentivizing llm reasoning over complex tables with numerical computation via formula\-driven reinforcement learning\.*arXiv preprint arXiv:2505\.23667*\.
- Chen et al\. \(2024\)Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang\. 2024\.Huatuogpt\-o1, towards medical complex reasoning with llms\.*arXiv preprint arXiv:2412\.18925*\.
- Chen et al\. \(2025\)Yuefei Chen, Vivek K Singh, Jing Ma, and Ruxiang Tang\. 2025\.Counterbench: A benchmark for counterfactuals reasoning in large language models\.*arXiv preprint arXiv:2502\.11008*\.
- Chen et al\. \(2023\)Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, and 1 others\. 2023\.Meditron\-70b: Scaling medical pretraining for large language models\.*arXiv preprint arXiv:2311\.16079*\.
- Corbeil et al\. \(2025\)Jean\-Philippe Corbeil, Amin Dada, Jean\-Michel Attendu, Asma Ben Abacha, Alessandro Sordoni, Lucas Caccia, François Beaulieu, Thomas Lin, Jens Kleesiek, and Paul Vozila\. 2025\.A modular approach for clinical slms driven by synthetic data with pre\-instruction tuning, model merging, and clinical\-tasks alignment\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 19352–19374\.
- Deng et al\. \(2026\)Ruiqi Deng, Geoffrey Martin, Tony Wang, Gongbo Zhang, Yi Liu, Chunhua Weng, Yanshan Wang, Justin F Rousseau, and Yifan Peng\. 2026\.Cpgprompt: Translating clinical guidelines into llm\-executable decision support\.*arXiv preprint arXiv:2601\.03475*\.
- Garcia\-Gasulla et al\. \(2025\)Dario Garcia\-Gasulla, Jordi Bayarri\-Planas, Ashwin Kumar Gururajan, Enrique Lopez\-Cuena, Adrian Tormos, Daniel Hinjos, Pablo Bernabeu\-Perez, Anna Arias\-Duart, Pablo Agustin Martin\-Torres, Marta Gonzalez\-Mallo, and 1 others\. 2025\.The aloe family recipe for open and specialized healthcare llms\.*arXiv preprint arXiv:2505\.04388*\.
- Gu et al\. \(2025\)Boyang Gu, Hongjian Zhou, Bradley Max Segal, Jinge Wu, Zeyu Cao, Hantao Zhong, Lei Clifton, Fenglin Liu, and David A Clifton\. 2025\.Clinical\-r1: Empowering large language models for faithful and comprehensive reasoning with clinical objective relative policy optimization\.*arXiv preprint arXiv:2512\.00601*\.
- Guo et al\. \(2025\)Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others\. 2025\.Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning\.*arXiv preprint arXiv:2501\.12948*\.
- Hager et al\. \(2024\)Paul Hager, Friederike Jungmann, Robbie Holland, Kunal Bhagat, Inga Hubrecht, Manuel Knauer, Jakob Vielhauer, Marcus Makowski, Rickmer Braren, Georgios Kaissis, and 1 others\. 2024\.Evaluation and mitigation of the limitations of large language models in clinical decision\-making\.*Nature medicine*, 30\(9\):2613–2622\.
- Han et al\. \(2023\)Tianyu Han, Lisa C Adams, Jens\-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexei Figueroa, Alexander Löser, Daniel Truhn, and Keno K Bressem\. 2023\.Medalpaca–an open\-source collection of medical conversational ai models and training data\.*arXiv preprint arXiv:2304\.08247*\.
- Jiang et al\. \(2025a\)Pengcheng Jiang, Jiacheng Lin, Lang Cao, Runchu Tian, SeongKu Kang, Zifeng Wang, Jimeng Sun, and Jiawei Han\. 2025a\.Deepretrieval: Hacking real search engines and retrievers with large language models via reinforcement learning\.*arXiv preprint arXiv:2503\.00223*\.
- Jiang et al\. \(2025b\)Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, and 1 others\. 2025b\.Hulu\-med: A transparent generalist model towards holistic medical vision\-language understanding\.*arXiv preprint arXiv:2510\.08668*\.
- Jin et al\. \(2025\)Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han\. 2025\.Search\-r1: Training llms to reason and leverage search engines with reinforcement learning\.*arXiv preprint arXiv:2503\.09516*\.
- Jin et al\. \(2021\)Di Jin, Eileen Pan, Nassim Oufattole, Wei\-Hung Weng, Hanyi Fang, and Peter Szolovits\. 2021\.What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\.*Applied Sciences*, 11\(14\):6421\.
- Kaelbling et al\. \(1996\)Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore\. 1996\.Reinforcement learning: A survey\.*Journal of artificial intelligence research*, 4:237–285\.
- Labrak et al\. \(2024\)Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre\-Antoine Gourraud, Mickael Rouvier, and Richard Dufour\. 2024\.Biomistral: A collection of open\-source pretrained large language models for medical domains\.In*Findings of the association for computational linguistics: acl 2024*, pages 5848–5864\.
- Lai et al\. \(2025\)Yunghwei Lai, Kaiming Liu, Ziyue Wang, Weizhi Ma, and Yang Liu\. 2025\.Doctor\-r1: Mastering clinical inquiry with experiential agentic reinforcement learning\.*arXiv preprint arXiv:2510\.04284*\.
- Li et al\. \(2023a\)Binbin Li, Tianxin Meng, Xiaoming Shi, Jie Zhai, and Tong Ruan\. 2023a\.Meddm: Llm\-executable clinical guidance tree for clinical decision\-making\.*arXiv preprint arXiv:2312\.02441*\.
- Li et al\. \(2026\)Bingxuan Li, Simo Du, and Yue Guo\. 2026\.Joint optimization of reasoning and dual\-memory for self\-learning diagnostic agent\.*arXiv preprint arXiv:2604\.07269*\.
- Li et al\. \(2023b\)Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao\. 2023b\.Llava\-med: Training a large language\-and\-vision assistant for biomedicine in one day\.*Advances in Neural Information Processing Systems*, 36:28541–28564\.
- Li et al\. \(2025\)Wenliang Li, Rui Yan, Xu Zhang, Li Chen, Hongji Zhu, Jing Zhao, Junjun Li, Mengru Li, Wei Cao, Zihang Jiang, and 1 others\. 2025\.Macd: Multi\-agent clinical diagnosis with self\-learned knowledge for llm\.*arXiv preprint arXiv:2509\.20067*\.
- Lightman et al\. \(2024\)Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe\. 2024\.Let’s verify step by step\.In*International Conference on Learning Representations*, volume 2024, pages 39578–39601\.
- Liu et al\. \(2024\)Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, and 1 others\. 2024\.Deepseek\-v3 technical report\.*arXiv preprint arXiv:2412\.19437*\.
- Mehandru et al\. \(2025\)Nikita Mehandru, Niloufar Golchini, David Bamman, Travis Zack, Melanie F Molina, and Ahmed Alaa\. 2025\.Er\-reason: A benchmark dataset for llm\-based clinical reasoning in the emergency room\.*arXiv preprint arXiv:2505\.22919*\.
- Muennighoff et al\. \(2025\)Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei\-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto\. 2025\.s1: Simple test\-time scaling\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 20286–20332\.
- Nendaz and Perrier \(2012\)Mathieu Nendaz and Arnaud Perrier\. 2012\.Diagnostic errors and flaws in clinical reasoning: mechanisms and prevention in practice\.*Swiss medical weekly*, 142\(4344\):w13706–w13706\.
- Oniani et al\. \(2024\)David Oniani, Xizhi Wu, Shyam Visweswaran, Sumit Kapoor, Shravan Kooragayalu, Katelyn Polanska, and Yanshan Wang\. 2024\.Enhancing large language models for clinical decision support by incorporating clinical practice guidelines\.In*2024 IEEE 12th International Conference on Healthcare Informatics \(ICHI\)*, pages 694–702\. IEEE\.
- Schubert et al\. \(2025\)Marc Cicero Schubert, Stella Soyka, Wolfgang Wick, and Varun Venkataramani\. 2025\.Guideline\-incorporated large language model\-driven evaluation of medical records using medcheckllm\.*JMIR Formative Research*, 9:e53335\.
- Sellergren et al\. \(2025\)Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, and 1 others\. 2025\.Medgemma technical report\.*arXiv preprint arXiv:2507\.05201*\.
- Shao et al\. \(2024\)Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others\. 2024\.Deepseekmath: Pushing the limits of mathematical reasoning in open language models\.*arXiv preprint arXiv:2402\.03300*\.
- Shen et al\. \(2026\)Yuhao Shen, Zhangtianyi Chen, Yuanhao He, Yan Xu, Shuping Zhang, Liyuan Sun, Zijian Wang, Yinghao Zhu, Yuyuan Yang, Jiahe Qian, Ziwen Wang, Xinyuan Zhang, Wenbin Liu, Zongyuan Ge, Tao Lu, Siyuan Yan, and Juexiao Zhou\. 2026\.Trustworthy and fair skingpt\-r1 for democratizing dermatological reasoning across diverse ethnicities\.*arXiv preprint arXiv:2511\.15242*\.
- Sheng et al\. \(2025\)Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu\. 2025\.Hybridflow: A flexible and efficient rlhf framework\.In*Proceedings of the Twentieth European Conference on Computer Systems*, pages 1279–1297\.
- Singh et al\. \(2025\)Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El\-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, and 1 others\. 2025\.Openai gpt\-5 system card\.*arXiv preprint arXiv:2601\.03267*\.
- Sox et al\. \(2024\)Harold C Sox, Michael C Higgins, Douglas K Owens, and Gillian Sanders Schmidler\. 2024\.*Medical decision making*\.John Wiley & Sons\.
- Staniek et al\. \(2025\)Michael Staniek, Artem Sokolov, and Stefan Riezler\. 2025\.Training and evaluation of guideline\-based medical reasoning in llms\.*arXiv preprint arXiv:2512\.03838*\.
- Suzgun et al\. \(2023\)Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and 1 others\. 2023\.Challenging big\-bench tasks and whether chain\-of\-thought can solve them\.In*Findings of the Association for Computational Linguistics: ACL 2023*, pages 13003–13051\.
- Team et al\. \(2025\)Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, and 1 others\. 2025\.Kimi k2: Open agentic intelligence\.*arXiv preprint arXiv:2507\.20534*\.
- Tong et al\. \(2024\)Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He\. 2024\.Dart\-math: Difficulty\-aware rejection tuning for mathematical problem\-solving\.*Advances in Neural Information Processing Systems*, 37:7821–7846\.
- Tziakouri and Menolascina \(2025\)Anni Tziakouri and Filippo Menolascina\. 2025\.Reinforcement learning for clinical reasoning: Aligning llms with acr imaging appropriateness criteria\.*arXiv preprint arXiv:2510\.05194*\.
- Uesato et al\. \(2022\)Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins\. 2022\.Solving math word problems with process\-and outcome\-based feedback\.*arXiv preprint arXiv:2211\.14275*\.
- Vashishtha et al\. \(2025\)Aniket Vashishtha, Qirun Dai, Hongyuan Mei, Amit Sharma, Chenhao Tan, and Hao Peng\. 2025\.Executable counterfactuals: Improving llms’ causal reasoning through code\.*arXiv preprint arXiv:2510\.01539*\.
- Wei et al\. \(2022a\)Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, and 1 others\. 2022a\.Emergent abilities of large language models\.*arXiv preprint arXiv:2206\.07682*\.
- Wei et al\. \(2022b\)Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others\. 2022b\.Chain\-of\-thought prompting elicits reasoning in large language models\.*Advances in neural information processing systems*, 35:24824–24837\.
- Wu et al\. \(2025a\)Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang\-In Choi, and 1 others\. 2025a\.Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs\.*arXiv preprint arXiv:2504\.00993*\.
- Wu et al\. \(2025b\)Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J Tao, Min Woo Sun, Alejandro Lozano, and James Zou\. 2025b\.Medcasereasoning: Evaluating and learning diagnostic reasoning from clinical case reports\.*arXiv preprint arXiv:2505\.11733*\.
- Xu et al\. \(2025\)Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, and 1 others\. 2025\.Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning\.*arXiv preprint arXiv:2506\.07044*\.
- Yang et al\. \(2025\)An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others\. 2025\.Qwen3 technical report\.*arXiv preprint arXiv:2505\.09388*\.
- Yang et al\. \(2024\)Rui Yang, Han Zhong, Jiawei Xu, Amy Zhang, Chongjie Zhang, Lei Han, and Tong Zhang\. 2024\.Towards robust offline reinforcement learning under diverse data corruption\.In*International Conference on Learning Representations*, volume 2024, pages 15512–15543\.
- Yoo and Woo \(2025\)Gwangpyo Yoo and Honguk Woo\. 2025\.Model risk\-sensitive offline reinforcement learning\.In*The Thirteenth International Conference on Learning Representations*\.
- You et al\. \(2026\)Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion\-Barner, Hongyuan Mei, Hao Peng, and Yue Guo\. 2026\.Improving clinical diagnosis with counterfactual multi\-agent reasoning\.*arXiv preprint arXiv:2603\.27820*\.
- Yuan et al\. \(2023\)Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou\. 2023\.Scaling relationship on learning mathematical reasoning with large language models\.*arXiv preprint arXiv:2308\.01825*\.
## Appendix A Ethical Statement
This work aims to improve clinical reasoning in medical LLMs by transforming publicly available clinical practice guidelines into structured training supervision\. All guideline sources used in this study are publicly accessible and do not contain private patient information\. The factual and counterfactual QA instances are synthetically generated from executable guideline functions and are not derived from identifiable patient records\.
We acknowledge that medical LLMs may produce incorrect, incomplete, or potentially harmful recommendations, even when trained on guideline\-derived data\. Therefore, MedGuideX is intended for research use only and should not be used as a substitute for professional medical judgment, clinical diagnosis, or treatment decisions\. Any deployment in real\-world healthcare settings would require rigorous clinical validation, human oversight, bias and safety evaluation, and compliance with applicable medical regulations\.
## Appendix B Related Work
#### Post\-training for LLM Reasoning\.
LLMs have shown strong reasoning capabilities on complex tasks, especially when guided by intermediate reasoning steps such as chain\-of\-thought prompting Wei et al\. \([2022a](https://arxiv.org/html/2605.26567#bib.bib47)\); Suzgun et al\. \([2023](https://arxiv.org/html/2605.26567#bib.bib41)\); Wei et al\. \([2022b](https://arxiv.org/html/2605.26567#bib.bib48)\); Cao \([2024](https://arxiv.org/html/2605.26567#bib.bib3)\)\. Beyond inference\-time prompting, recent work improves reasoning through post\-training\. Supervised fine\-tuning \(SFT\) on high\-quality reasoning traces can help models learn more coherent reasoning behaviors, using either filtered self\-generated solutions Yuan et al\. \([2023](https://arxiv.org/html/2605.26567#bib.bib56)\); Tong et al\. \([2024](https://arxiv.org/html/2605.26567#bib.bib43)\) or rationales distilled from stronger models Muennighoff et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib30)\); Shen et al\. \([2026](https://arxiv.org/html/2605.26567#bib.bib36)\)\. RL Kaelbling et al\. \([1996](https://arxiv.org/html/2605.26567#bib.bib20)\) further optimizes toward task\-level objectives and has been shown to be effective for enhancing reasoning ability beyond next\-token prediction Lightman et al\. \([2024](https://arxiv.org/html/2605.26567#bib.bib27)\); Uesato et al\. \([2022](https://arxiv.org/html/2605.26567#bib.bib45)\); Guo et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib13)\)\. Our work follows this post\-training direction, but focuses on constructing high\-quality clinical reasoning supervision from evidence\-based medical guidelines\. Notably, DeepSeek\-R1 Guo et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib13)\) demonstrates that large\-scale RL can substantially improve the reasoning abilities of language models\. In downstream applications, DeepRetrieval Jiang et al\. \([2025a](https://arxiv.org/html/2605.26567#bib.bib16)\) and Search\-R1 Jin et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib18)\) apply RL to teach models to reason about interactions with search engines for information retrieval, while Formula\-R1 Cao et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib5)\) extends RL\-based reasoning to structured data reasoning\.
#### LLM for Clinical Reasoning\.
LLMs have increasingly been applied to healthcare tasks\. Prior work improves medical LLMs by adapting them to medical corpora, clinical notes, or case reports \(Chen et al\.,[2023](https://arxiv.org/html/2605.26567#bib.bib8); Han et al\.,[2023](https://arxiv.org/html/2605.26567#bib.bib15); Labrak et al\.,[2024](https://arxiv.org/html/2605.26567#bib.bib21); Garcia\-Gasulla et al\.,[2025](https://arxiv.org/html/2605.26567#bib.bib11)\)\. Other studies enhance clinical reasoning through retrieval\-augmented generation, multi\-agent diagnosis, or post\-training with clinical feedback Cao et al\. \([2026](https://arxiv.org/html/2605.26567#bib.bib4)\); Li et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib26)\); Chen et al\. \([2024](https://arxiv.org/html/2605.26567#bib.bib6)\); Lai et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib22)\)\. While these methods improve medical knowledge or case\-level reasoning, they often rely on heterogeneous clinical data whose reasoning supervision is implicit, noisy, or incomplete\.
#### Clinical Practice Guidelines for Medical LLMs\.
CPGs have also been used to enhance medical reasoning, primarily by retrieving guideline passages for prompting Schubert et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib33)\); Deng et al\. \([2026](https://arxiv.org/html/2605.26567#bib.bib10)\); Oniani et al\. \([2024](https://arxiv.org/html/2605.26567#bib.bib32)\); Li et al\. \([2023a](https://arxiv.org/html/2605.26567#bib.bib23)\), constructing guideline\-grounded rationales for SFT Staniek et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib40)\); Chen et al\. \([2023](https://arxiv.org/html/2605.26567#bib.bib8)\), or using guideline constraints as rewards during RL Tziakouri and Menolascina \([2025](https://arxiv.org/html/2605.26567#bib.bib44)\); Gu et al\. \([2025](https://arxiv.org/html/2605.26567#bib.bib12)\)\. These approaches typically treat guidelines as external knowledge sources, rationale references, or reward constraints\. In contrast, our work transforms guideline\-derived decision logic into factual and counterfactual question\-answering data for post\-training medical LLMs\.
## Appendix C Data Preparation Details
### C\.1 Guideline Curation
Our initial guideline source is an open CPG collection containing 37,970 documents\. After curation, we obtain a filtered subset of 841 CPGs, including 607 from CDC and 234 from PubMed\.
### C\.2 Executable Transformation
Starting from 841 curated documents, we split the corpus into 3,127 recommendation\-oriented chunks using a soft limit of 4,500 words per chunk, with at most 4 chunks per document and 2 recommendations per chunk\. The LLM extractor produces 4,750 recommendation candidates\. After validation, we retain 3,796 usable recommendations, discarding 827 candidates that lack a concrete condition\-action structure and removing 127 near\-duplicates with highly similar populations, conditions, and actions\.
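The chunking rule can be pictured with the sketch below. It is one plausible reading only: the paper does not specify how the soft limit is enforced, so splitting on paragraph boundaries, keeping an oversized paragraph whole, and dropping text beyond the fourth chunk are all assumptions, as is the function name `chunk_document`.

```python
def chunk_document(paragraphs, soft_limit=4500, max_chunks=4):
    """Greedily pack paragraphs into chunks of roughly `soft_limit` words.

    Assumed behaviour: a chunk is closed before a paragraph that would push
    it over the limit; a single oversized paragraph is kept whole ("soft"
    limit); anything after `max_chunks` chunks is dropped.
    """
    chunks, current, current_words = [], [], 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        if current and current_words + n_words > soft_limit:
            chunks.append("\n\n".join(current))
            if len(chunks) == max_chunks:
                return chunks
            current, current_words = [], 0
        current.append(paragraph)
        current_words += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Tiny limits so the behaviour is visible on toy text.
print(chunk_document(["a b c", "d e f", "g h i", "j k l"], soft_limit=5, max_chunks=2))
```

With these defaults, 841 documents can yield at most 3,364 chunks, which is consistent with the reported 3,127.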
Each of the 3,796 retained recommendations is converted into a decision tree $T_f$ specifying input variables, decision conditions, and final outputs\. After LLM\-based validation for completeness, condition clarity, branch coverage, and guideline support, 2,800 trees pass validation\.
The 2,800 validated trees are compiled into executable Python functions that map input variables $X$ to guideline outputs $f(X)$\. We check each function for syntactic correctness, executability on sampled inputs, and consistency with the original decision tree\. In total, 2,793 functions pass these checks and form our validated executable guideline knowledge base\.
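To make the three checks concrete, the sketch below validates a toy function against a toy tree. Everything in it is hypothetical: the rule, the variable names, the tree encoding as ordered (condition, output) branches, and the helper `validate_function` are illustrative assumptions, not the authors' pipeline or any real guideline.

```python
import ast
import itertools

# Hypothetical compiled guideline function (illustrative rule only).
FUNCTION_SOURCE = '''
def recommend(age_years, has_risk_factor):
    if age_years >= 65:
        return "offer_intervention"
    if has_risk_factor:
        return "offer_intervention"
    return "no_action"
'''

# Assumed tree encoding: ordered branches plus a default output.
TREE = {
    "branches": [
        (lambda x: x["age_years"] >= 65, "offer_intervention"),
        (lambda x: x["has_risk_factor"], "offer_intervention"),
    ],
    "default": "no_action",
}


def evaluate_tree(tree, x):
    """Return the output of the first branch whose condition holds."""
    for condition, output in tree["branches"]:
        if condition(x):
            return output
    return tree["default"]


def validate_function(source, name, tree, samples):
    """Check syntax, executability on samples, and agreement with the tree."""
    try:
        ast.parse(source)            # 1. syntactic correctness
        namespace = {}
        exec(source, namespace)      # define the function in isolation
    except Exception:
        return False
    func = namespace.get(name)
    if not callable(func):
        return False
    for x in samples:
        try:
            output = func(**x)       # 2. executability on sampled inputs
        except Exception:
            return False
        if output != evaluate_tree(tree, x):  # 3. consistency with the tree
            return False
    return True


SAMPLES = [
    {"age_years": age, "has_risk_factor": risk}
    for age, risk in itertools.product([30, 64, 65, 80], [False, True])
]
print(validate_function(FUNCTION_SOURCE, "recommend", TREE, SAMPLES))
```

Sampling values on both sides of each threshold (64 and 65 here) is what lets the consistency check catch off-by-one errors in compiled conditions.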
### C\.3 QA Synthesis
At the data level, executing $f$ before and after intervention yields 7,993 interventional scenarios\. Of these, 7,205 produce changed outcomes, while 788 unchanged\-output cases are discarded\. After applying the same balancing rule used for factual QA, we retain 5,028 counterfactual candidates\. Each sample is paired with a GPT\-5\.4\-mini\-generated reasoning trace that verbalizes the abduction, intervention, and prediction steps over the executable function $f$\. Overall data pipeline statistics are shown in Table [4](https://arxiv.org/html/2605.26567#A3.T4)\.
Table 4: Data pipeline statistics on the guideline corpus\. Each stage discards instances that fail automated checks\.
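The intervention step in C.3 amounts to running the same function on a factual input and on a copy with one variable changed, then keeping only scenarios whose output differs. The sketch below illustrates this under stated assumptions: the function `f`, its variables, and `make_counterfactuals` are hypothetical, and single-variable interventions are an assumed simplification.

```python
def f(x):
    """Hypothetical executable guideline function (illustrative rule only)."""
    if x["age_years"] >= 65 or x["has_risk_factor"]:
        return "offer_intervention"
    return "no_action"


def make_counterfactuals(func, factual, interventions):
    """Apply single-variable interventions and keep those that change the output."""
    baseline = func(factual)
    kept = []
    for variable, new_value in interventions:
        if factual.get(variable) == new_value:
            continue  # not an actual change to the input
        counterfactual = {**factual, variable: new_value}
        outcome = func(counterfactual)
        if outcome != baseline:  # unchanged-output cases are discarded
            kept.append({
                "intervention": {variable: new_value},
                "factual_output": baseline,
                "counterfactual_output": outcome,
            })
    return kept


factual = {"age_years": 40, "has_risk_factor": False}
interventions = [("age_years", 70), ("has_risk_factor", True), ("age_years", 50)]
for scenario in make_counterfactuals(f, factual, interventions):
    print(scenario)
```

In this toy run the change to age 50 leaves the output at `no_action` and is dropped, mirroring the 788 unchanged-output cases removed from the 7,993 scenarios.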
## Appendix D Benchmark Details and Examples
We evaluate MedGuideX on four external medical reasoning benchmarks\. All four are used only for evaluation and are disjoint from the executable guideline corpus used to train MedGuideX\.
- MedQA \(Jin et al\.,[2021](https://arxiv.org/html/2605.26567#bib.bib19)\): A multiple\-choice medical exam benchmark based on United States Medical Licensing Examination \(USMLE\)\-style questions\. It evaluates whether executable guideline\-grounded training transfers to exam\-style medical knowledge\. We report multiple\-choice accuracy on 1,273 test questions\.
- MedCaseReasoning \(Wu et al\.,[2025b](https://arxiv.org/html/2605.26567#bib.bib50)\): A long\-form open diagnostic reasoning benchmark based on open\-access clinical case reports from the New England Journal of Medicine Clinicopathological Conferences \(NEJM CPC\), primarily sourced from Massachusetts General Hospital\. It evaluates case\-based differential diagnosis\. We report diagnostic accuracy on the 897 cases in the test split\.
- MIMIC\-CDM\-FI \(Hager et al\.,[2024](https://arxiv.org/html/2605.26567#bib.bib14)\): A full\-information open clinical decision\-making benchmark derived from MIMIC\-IV, which is based on electronic health records from Beth Israel Deaconess Medical Center\. It evaluates full\-information clinical decision making\. We report diagnostic accuracy on a randomly sampled subset of 1,000 cases\.
- ER\-Reason \(Mehandru et al\.,[2025](https://arxiv.org/html/2605.26567#bib.bib29)\): An emergency\-department open diagnosis prediction benchmark derived from patient records at a large academic medical center\. It evaluates emergency\-room diagnosis prediction\. We report diagnostic accuracy on a randomly sampled subset of 1,000 cases\.
Unless otherwise stated, we use greedy decoding with a 1,024\-token generation budget, disable Qwen thinking mode, and use gpt\-5\.4\-mini for all LLM judges\. For each benchmark, we describe the task, split, and evaluation protocol, and provide one representative example\.
#### MedQA
is a USMLE\-style multiple\-choice medical QA benchmark\. We evaluate on the US test split of 1,273 questions\. Given a question stem and five options A through E, the model selects the single best answer after a chain\-of\-thought reasoning step\. Answer extraction is fully deterministic: a regex\-based parser strips reasoning blocks and extracts the final option letter or option text, with no LLM judge involved\. We report accuracy as the fraction of correctly answered questions\.
MedQA: USMLE\-style multiple\-choice QA
- System: You are a medical expert\. Answer the multiple\-choice question, thinking step by step before choosing one option from A to E\.
- User: Context\. A 65\-year\-old man with fever, productive cough, pleuritic chest pain, and right lower\-lobe consolidation on chest x\-ray\. Gram stain shows lancet\-shaped gram\-positive diplococci\. Question\. Which option is the single best answer? Options\. A: Mycoplasma pneumoniae; B: Streptococcus pneumoniae; C: Pseudomonas aeruginosa; D: Pneumocystis jirovecii; E: Histoplasma capsulatum\.
- Reference diagnosis: B
- Evaluation: The deterministic parser extracts option B from the model output and matches it against the gold answer B\. The prediction is counted as correct\.
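A deterministic parser of the kind described for MedQA could look like the sketch below. The regular expressions, the assumption that reasoning is wrapped in `<think>` tags, and the fallback order (letter first, then option text) are illustrative choices, not the parser used in the paper.

```python
import re

# Assumed reasoning-block format; removed before answer extraction.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# "answer is B", "Answer: (C)", "option D"; the letter itself must be uppercase A-E.
LETTER = re.compile(r"(?i:answer|option)\s*(?i:is|:)?\s*\(?([A-E])\)?(?![A-Za-z])")


def extract_choice(output, options):
    """Return the predicted option letter, or None if nothing can be parsed."""
    text = THINK_BLOCK.sub("", output)
    letters = LETTER.findall(text)
    if letters:
        return letters[-1]  # the final stated letter wins
    # Fallback: pick the option whose text appears last in the response.
    lowered = text.lower()
    best_letter, best_position = None, -1
    for letter, option_text in options.items():
        position = lowered.rfind(option_text.lower())
        if position > best_position:
            best_letter, best_position = letter, position
    return best_letter


options = {
    "A": "Mycoplasma pneumoniae",
    "B": "Streptococcus pneumoniae",
    "C": "Pseudomonas aeruginosa",
    "D": "Pneumocystis jirovecii",
    "E": "Histoplasma capsulatum",
}
print(extract_choice("<think>Lancet-shaped diplococci.</think> The answer is B.", options))
print(extract_choice("Most likely Streptococcus pneumoniae.", options))
```

Both calls return B on this toy input. The negative lookahead keeps phrases such as "Answer: Escherichia" from being misread as option E.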
#### MedCaseReasoning
evaluates long\-form diagnostic reasoning from open\-access clinical case reports\. Each example provides a clinical case prompt, a clinician\-authored reasoning trace, and a final diagnosis\. We evaluate on the official test split of 897 cases and use the official prompt template, which requests reasoning inside <think\> tags and the diagnosis inside <answer\> tags\. We report 1\-pass diagnostic accuracy under the official LLM\-as\-judge protocol, in which the judge compares the predicted and gold diagnoses and counts synonyms, abbreviations, and close medical paraphrases as correct\. We additionally report reasoning recall, computed by a separate judge that checks, for each clinician\-authored reasoning step, whether the model’s reasoning explicitly or implicitly covers it\.
MedCaseReasoning: long\-form case diagnosis
- User: Please reason through the following clinical case and predict the final diagnosis, returning reasoning inside <think\> tags and the diagnosis inside <answer\> tags\. Clinical case\. A middle\-aged patient presents with severe episodic headaches, palpitations, sweating, and paroxysmal hypertension\. Abdominal imaging reveals an adrenal mass\. Plasma metanephrines are markedly elevated\.
- Reference diagnosis: Pheochromocytoma
- Evaluation: The judge compares the predicted diagnosis against the gold diagnosis pheochromocytoma and finds them equivalent\. Reasoning recall is computed over the clinician\-authored reasoning steps; in this case all reference steps are covered, giving a recall of 1\.0\.
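Before judging, a response in the tagged format has to be split into its reasoning and diagnosis parts. The sketch below shows one way to do this; the function name `parse_response` and the choice to take the last `<answer>` block are assumptions, not part of the official template.

```python
import re

THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_response(output):
    """Split a tagged response into reasoning text and predicted diagnosis."""
    think_match = THINK.search(output)
    answers = ANSWER.findall(output)
    return {
        # Reasoning is what a recall judge would receive.
        "reasoning": think_match.group(1).strip() if think_match else "",
        # Diagnosis is what an accuracy judge would compare with the gold label.
        "diagnosis": answers[-1].strip() if answers else None,
    }


response = (
    "<think>Episodic hypertension, adrenal mass, elevated metanephrines.</think>"
    "<answer> Pheochromocytoma </answer>"
)
print(parse_response(response))
```

A response with no `<answer>` block yields `None` for the diagnosis, which an evaluation harness would typically score as incorrect rather than skip.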
#### MIMIC\-CDM\-FI
is a full\-information clinical decision\-making benchmark derived from the MIMIC\-IV\-Ext CDM dataset\. We evaluate on a fixed1,0001\{,\}000\-case test split balanced across four acute abdominal pathologies, with250250cases each of appendicitis, cholecystitis, pancreatitis, and diverticulitis\. The model receives all relevant patient information upfront, including history, physical examination, selected laboratory results, and abdominal imaging reports, and is asked to output a single final diagnosis of the most severe pathology with no further explanation\. Predictions are scored by an LLM\-as\-judge that identifies the primary acute abdominal diagnosis\. Synonyms, spelling variants, abbreviations, and clinically equivalent paraphrases count as correct; comorbidities or full billing\-code matches are not required\.
MIMIC\-CDM\-FI: full\-information clinical decision making

System: You are a medical AI assistant\. Based on the provided information, give a single final diagnosis of the most severe pathology, with no further information\.

User: Provide the most likely final diagnosis of the following patient\. History\. Acute right lower quadrant abdominal pain, nausea, anorexia, low\-grade fever\. Examination\. Tenderness at McBurney’s point with guarding\. Laboratory\. Elevated white blood cell count and C\-reactive protein\. Imaging\. CT abdomen shows a dilated appendix with wall thickening and periappendiceal fat stranding\.

Reference diagnosis: Acute appendicitis

Evaluation: The judge identifies the model’s response as a match for the gold diagnosis acute appendicitis\. The prediction is counted as correct\.
#### ER\-Reason
evaluates clinical reasoning in emergency\-department settings; we use Task 4, final ED diagnosis prediction\. We evaluate on a deterministic 1,000\-record split\. Each record provides age, sex, chief complaint, the current ED presentation, and a set of clinical notes spanning discharge summary, progress notes, history and physical, imaging, and consult\. The model receives all notes as context and outputs a single ED diagnosis in free text\. Our primary metric is LLM\-judge accuracy, with the judge counting synonyms, abbreviations, wording differences, and clinically equivalent specificity as correct\. The official Task 4 evaluator also provides a CMS ICD\-10\-CM crosswalk that yields exact match, normalized match, token F1, clinical\-cluster, body\-system, ICD, and HCC accuracy\. We do not adopt the crosswalk\-based HCC accuracy as our primary metric because the crosswalk maps only about 30% of the gold and predicted diagnoses to an HCC category\. The resulting score is computed over a small and non\-representative subset of cases, whereas LLM\-judge accuracy scores every case and better reflects clinically correct predictions\.
ER\-Reason \(Task 4\): emergency\-department diagnosis

System: You are an experienced ED physician\. Predict the most likely diagnosis for the current ED visit, using all available clinical notes as full information\.

User: ER\-Reason Task 4: Final Diagnosis\. Age 72, Female\. Chief complaint: shortness of breath\. Clinical notes\. History of heart failure; worsening dyspnea, orthopnea, bilateral leg edema, elevated BNP, pulmonary vascular congestion on chest x\-ray\. Current ED presentation\. Tachypneic with hypoxia and bibasilar crackles; no fever or focal infiltrate\. Output a single CMS\-standardized diagnosis as text, with no explanation\.

Reference diagnosis: Acute decompensated heart failure

Evaluation: The judge compares the model’s prediction acute decompensated heart failure against the gold acute congestive heart failure and counts them as clinically equivalent\. The prediction is counted as correct\.
Table 5: Summary of the four evaluation benchmarks\. \# Eval is the number of evaluated examples\. MedQA uses a fully deterministic answer parser with no LLM judge; the other three use an LLM\-as\-judge \(gpt\-5\.4\-mini, temperature 0\) for semantic diagnosis matching\. ER\-Reason also has an official CMS ICD\-10\-CM crosswalk evaluator available, but it is not used as the primary metric \(see Appendix [D](https://arxiv.org/html/2605.26567#A4)\)\.
#### Licenses and access\.
We confirm that all four evaluation benchmarks are used in accordance with their original licenses, and only for non\-commercial academic research\. MedQA is released under the MIT license via the official GitHub repository \([https://github\.com/jind11/MedQA](https://github.com/jind11/MedQA)\)\. MedCaseReasoning’s code is released under the MIT license and its dataset is released under CC\-BY 4\.0, derived from the PubMed Central Open Access Subset \(code: [https://github\.com/kevinwu23/Stanford\-MedCaseReasoning](https://github.com/kevinwu23/Stanford-MedCaseReasoning); dataset: [https://huggingface\.co/datasets/zou\-lab/MedCaseReasoning](https://huggingface.co/datasets/zou-lab/MedCaseReasoning)\)\. MIMIC\-CDM\-FI is distributed under the PhysioNet Credentialed Health Data License 1\.5\.0 with an accompanying Data Use Agreement, and access requires completion of the CITI Data or Specimens Only Research training; we accessed it as credentialed PhysioNet users and use it only for evaluation in this study \([https://physionet\.org/content/mimic\-iv\-ext\-cdm/](https://physionet.org/content/mimic-iv-ext-cdm/)\)\. ER\-Reason is distributed under the more restrictive PhysioNet Contributor Review Health Data License 1\.5\.0, which additionally requires per\-study review by the dataset contributors; we accessed it under this credentialed and reviewed access process and use it only for evaluation \([https://physionet\.org/content/er\-reason/1\.0\.0/](https://physionet.org/content/er-reason/1.0.0/)\)\. We do not redistribute any patient\-level records from MIMIC\-CDM\-FI or ER\-Reason, and all evaluation outputs reported in this paper are aggregate metrics rather than raw clinical content\.
#### Summary\.
Table [5](https://arxiv.org/html/2605.26567#A4.T5) summarizes the four benchmarks\. MedQA isolates broad exam\-style medical knowledge with deterministic scoring; MedCaseReasoning targets long\-form differential diagnosis with both accuracy and reasoning\-recall judging; MIMIC\-CDM\-FI measures full\-information acute abdominal decision making on a class\-balanced split; and ER\-Reason measures ED diagnosis prediction with both semantic and code\-crosswalk evaluation\.
## Appendix E Training Details
#### Training setup\.
We train MedGuideX\-4B and MedGuideX\-9B from Qwen3\.5\-4B and Qwen3\.5\-9B, respectively\. Both models use the same post\-training recipe: supervised fine\-tuning on the guideline\-derived factual and counterfactual QA mixture, followed by GRPO on factual guideline QA prompts\. Unless otherwise stated, the details below describe the 9B run; the 4B run follows the same data construction and optimization setup with the corresponding Qwen3\.5\-4B backbone\.
#### Supervised fine\-tuning\.
The SFT stage is designed to teach the model how to express the executable guideline logic in natural\-language reasoning\. We train on the balanced mixture of factual QA and counterfactual QA examples described in Section [3](https://arxiv.org/html/2605.26567#S3)\. For factual examples, the target response contains the guideline\-prescribed output and a rationale verbalizing the executed path $\pi_{f}(X)$\. For counterfactual examples, the target response follows the abduction–intervention–prediction structure, requiring the model to infer hidden variables, apply the intervention, and predict the new guideline output\. This stage therefore serves as a cold start for both the answer format and the reasoning pattern used in later RL\.
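To make the factual and counterfactual targets concrete, here is a minimal sketch under stated assumptions. The rule `toy_guideline`, its variables, and its outputs are invented for illustration and carry no clinical meaning; they are not taken from the paper's guideline corpus. The sketch covers only the intervention and prediction steps, since the abduction step (inferring hidden variables from a described case) requires a case description and is omitted here.

```python
from typing import Dict, List, Tuple

Patient = Dict[str, bool]


def toy_guideline(x: Patient) -> Tuple[str, List[str]]:
    """Hypothetical executable rule: returns (recommendation, executed path)."""
    path: List[str] = []
    if x["high_risk"]:
        path.append("high_risk is true")
        if x["contraindication"]:
            path.append("contraindication is true")
            return "refer_to_specialist", path
        path.append("contraindication is false")
        return "start_treatment", path
    path.append("high_risk is false")
    return "routine_follow_up", path


def counterfactual(x: Patient, variable: str, new_value: bool) -> Tuple[str, List[str]]:
    """Intervene on one variable, keep the rest fixed, and re-execute the rule."""
    x_cf = dict(x)  # copy so the factual patient is not mutated
    x_cf[variable] = new_value
    return toy_guideline(x_cf)


patient = {"high_risk": True, "contraindication": False}
factual_output, factual_path = toy_guideline(patient)
cf_output, cf_path = counterfactual(patient, "contraindication", True)
```

In this toy setup the executed path plays the role of the rationale to be verbalized, and the counterfactual call shows how the oracle output changes when a single patient variable is altered.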
All SFT runs use LoRA adaptation on all linear layers, with rank 16 and $\alpha=32$, bfloat16 precision, and gradient checkpointing\. We train with verl \(Sheng et al\., [2025](https://arxiv.org/html/2605.26567#bib.bib37)\) for 5 epochs, using a learning rate of $1\times 10^{-5}$, cosine warmup ratio 0\.03, maximum sequence length 2048, and global batch size 64\. For the 9B model, the SFT stage runs for approximately 16 hours on $8\times$ RTX 5090 GPUs, and the resulting checkpoint is used to initialize the RL stage\.
#### Reinforcement learning\.
After SFT, we apply GRPO \(Shao et al\., [2024](https://arxiv.org/html/2605.26567#bib.bib35)\) using the same verl training stack\. The final MedGuideX configuration uses factual RL, selected by the ablation study in Section [F](https://arxiv.org/html/2605.26567#A6)\. The RL prompts reuse the 4,963 factual prompts from the guideline\-derived SFT corpus\. Since all four evaluation benchmarks are external to the executable guideline corpus, this reuse does not introduce benchmark contamination\.
During RL, each prompt is sampled with multiple rollouts and scored by the reward function described in Section [4](https://arxiv.org/html/2605.26567#S4)\. The reward first enforces the required reasoning\-before\-answer format, then checks whether the parsed final guideline output matches the oracle output produced by the executable function\. For counterfactual RL ablations, the same reward implementation additionally verifies hidden\-variable recovery and executable counterfactual consistency, but this is not the final configuration used for MedGuideX in Table [1](https://arxiv.org/html/2605.26567#S5.T1)\.
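The format-then-answer check can be sketched as a gated reward. This is an assumption-laden illustration rather than the paper's reward implementation: the `<think>`/`<answer>` tag layout, the case-insensitive string match, and the binary reward values are all choices made here for the example.

```python
import re
from typing import Optional

# Assumed format: reasoning in <think> tags followed by the answer in <answer> tags.
FORMAT_RE = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def parse_answer(response: str) -> Optional[str]:
    """Return the answer text if the response follows the assumed format."""
    match = FORMAT_RE.match(response)
    return match.group(2).strip() if match else None


def factual_reward(response: str, oracle_output: str) -> float:
    """Hypothetical gated reward: 0 on format failure, else 1 if the answer matches."""
    answer = parse_answer(response)
    if answer is None:
        return 0.0  # format gate failed, the answer is not inspected
    return 1.0 if answer.lower() == oracle_output.lower() else 0.0


good = "<think>High risk, no contraindication.</think><answer>start_treatment</answer>"
wrong = "<think>Low risk.</think><answer>routine_follow_up</answer>"
unformatted = "<answer>start_treatment</answer>"
```

Here `oracle_output` stands for the value returned by executing the guideline function on the prompt's patient variables.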
The RL stage is initialized from the SFT checkpoint and trained for one epoch with learning rate $5\times 10^{-6}$\. We use a maximum prompt length of 1536, maximum response length of 1024, 24 rollouts per prompt, sampling temperature 1\.0, top\-$p=1.0$, one PPO update epoch, entropy coefficient 0\.0, and KL coefficient 0\.005 with the low\-variance KL estimator\. We evaluate and save checkpoints every 10 training steps, selecting the final checkpoint according to validation performance on MedCaseReasoning and MedQA\. For the 9B model, the RL stage runs for approximately 33 hours on $8\times$ RTX 5090 GPUs\.
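For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage used in its outcome-reward form (Shao et al., 2024): each rollout's reward is normalized by the mean and standard deviation of its own prompt group. The use of the population standard deviation, the `eps` value, and the example reward pattern are assumptions made here; the training stack's exact normalization may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout reward against its own prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


ROLLOUTS_PER_PROMPT = 24
NUM_PROMPTS = 4963
total_rollouts = NUM_PROMPTS * ROLLOUTS_PER_PROMPT  # responses sampled in one epoch

# Hypothetical group: 6 of the 24 rollouts receive reward 1, the rest 0.
rewards = [1.0] * 6 + [0.0] * 18
advantages = group_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_advantages([1.0] * ROLLOUTS_PER_PROMPT)
```

With binary rewards, prompts on which every rollout succeeds or every rollout fails yield zero advantages, so only prompts with mixed outcomes contribute gradient signal.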
## Appendix F Training Strategy Ablation
We ablate two components of the 9B training pipeline: the *data type* used in each phase, including factual QA, counterfactual QA, or a balanced mixture of both, and the *training phase*, including SFT, RL, or SFT followed by RL\. Figure [2](https://arxiv.org/html/2605.26567#S5.F2) reports accuracy changes relative to the Qwen3\.5\-9B backbone on MedCaseReasoning and MedQA\. We also include a *continued pretraining on raw guidelines* baseline that uses the same raw guideline text without generated QA supervision\. The goal of this ablation is to identify the most effective way to combine data composition and training phases for internalizing decision logic from executable clinical guidelines into LLMs for clinical reasoning\.
The results show that exposure to guideline text alone is insufficient: continued pretraining on raw CPG text substantially decreases performance, despite being a common strategy in prior medical LLM training \(Chen et al\., [2023](https://arxiv.org/html/2605.26567#bib.bib8)\)\. In contrast, most configurations using generated QA supervision improve over the backbone, suggesting that structured factual and counterfactual QA examples are important for transferring guideline knowledge into reasoning behavior\.
Among SFT\-only settings, the balanced mixture performs best on both benchmarks, indicating that factual and counterfactual examples provide complementary supervision during supervised training\. For RL\-only settings, factual QA gives the strongest gains, suggesting that factual examples provide a more stable optimization signal for RL\. The best overall configuration applies mixed\-data SFT followed by factual RL\. This configuration corresponds to MedGuideX\-9B in Table [1](https://arxiv.org/html/2605.26567#S5.T1) and supports our final training design\. We hypothesize that factual reasoning is more closely aligned with downstream clinical reasoning benchmarks, making it a more effective final\-stage RL objective before inference\.
## Appendix G Human Study Details
Case 010 \| Clinical question

A 29\-year\-old Indigenous Australian man presented with severe \(8/10\) epigastric pain radiating to his back\. History: asthma, type 2 diabetes, obesity, hypertriglyceridemia, prior hypertriglyceridemia\-induced pancreatitis three years earlier \(triglycerides 39 mmol/L\)\. Current smoker, denies alcohol\. On arrival: T 36\.5°C, HR 64, BP 138/83, RR 11, SpO2 99%, clinically dehydrated\. Post\-resuscitation labs: triglycerides 24\.1 mmol/L \(<1\.70\), cholesterol 8\.4 mmol/L, lipase 119 U/L \(8–78\), WBC 12\.3 $\times 10^{9}$/L, corrected calcium 2\.29 mmol/L, creatinine 68 µmol/L, glucose 21\.3 mmol/L, no ketonuria\. Contrast\-enhanced abdominal CT at 24 h: fatty liver, no features of pancreatitis\. *\[…full prompt shown to annotator…\]*

Ground\-truth final diagnosis: hyperlipemic abdominal crisis

Response A

*Reasoning:* The patient has classic features of pancreatitis \(severe epigastric pain radiating to the back, hypertriglyceridemia, prior episode\), but the lipase is only mildly elevated and the CT at 24 h shows no pancreatic changes\. Acute pancreatitis would typically show lipase 3–10$\times$ upper limit and imaging findings by this time\. However, lipemia can interfere with lipase assays and the classic presentation plus severe hypertriglyceridemia still favor an underlying pancreatitis process\. *\[…\]*

*Final diagnosis:* Acute pancreatitis secondary to hypertriglyceridemia

Response B

*Reasoning:* Multiple pancreatitis risk factors are present, but lipase is only mildly elevated and CT at 24 h shows no features of pancreatitis\. In a patient with severe hypertriglyceridemia, severe abdominal pain without imaging or biochemical evidence of pancreatitis is most consistent with a hyperlipemic abdominal crisis, a recognized entity that mimics pancreatitis clinically\. *\[…\]*

*Final diagnosis:* Hyperlipemic abdominal crisis

Case 010 \| Pairwise judgments

*Select Response A, Response B, or Same on each dimension\.*

Figure 3: Layout of one case page in the blinded physician questionnaire\. Each page shows the clinical question and the ground\-truth final diagnosis, two blinded responses \(one from MedGuideX\-9B and one from GPT\-5\.0, presentation order independently randomized per case\), and four mandatory single\-choice questions, one per evaluation dimension\. Annotators may select Response A, Response B, or Same, and can optionally provide free\-text elaboration for each judgment\. The ground\-truth diagnosis is shown to the annotator so that judgments compare reasoning quality against a known correct answer; sample selection is further restricted to cases on which both models produce a correct final diagnosis\. Case and response text are abbreviated here for space\.

#### Study design\.
We conduct a blinded pairwise comparison study to assess the clinical reasoning quality of MedGuideX\-9B against GPT\-5\.0\. The study covers 30 clinical cases sampled from the subset of the MedCaseReasoning test split on which both models produce a correct final diagnosis, so that the comparison isolates rationale quality from answer correctness\. For each case, an annotator is shown the case prompt, the ground\-truth diagnosis, and two anonymized responses, and judges which response is better along four reasoning dimensions\. Model identities are hidden, and the left/right presentation order of the two responses is independently randomized per case so that the annotator cannot infer which system produced which response\.
#### Annotators and case allocation\.
Annotations are provided by two practicing physicians\. Each physician annotates 20 cases, with 10 cases shared between them\. Concretely, physician 1 annotates cases 1–20 and physician 2 annotates cases 11–30, so that cases 11–20 are double annotated\. This design covers all 30 cases with 40 total annotations while keeping per\-annotator load manageable, and the shared 10 cases provide a within\-study measure of inter\-annotator agreement\. Annotators are not told the identity, size, or training procedure of either model\.
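The allocation arithmetic can be checked with a few lines of set algebra. The variable names are illustrative only.

```python
# Case IDs assigned to each physician (inclusive ranges from the text).
physician_1 = set(range(1, 21))   # cases 1-20
physician_2 = set(range(11, 31))  # cases 11-30

shared = physician_1 & physician_2      # double-annotated cases
all_cases = physician_1 | physician_2   # cases covered by at least one physician
total_annotations = len(physician_1) + len(physician_2)

print(len(shared), len(all_cases), total_annotations)  # prints: 10 30 40
```

Each of the four evaluation dimensions therefore receives 40 judgments, 20 of which fall on the 10 shared cases.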
#### Evaluation dimensions\.
The four dimensions and the instructions shown to each annotator are as follows\.
- Evidence Faithfulness\. Which response is more faithful to the evidence provided in the case? Prefer the response that grounds its reasoning in the stated history, symptoms, labs, imaging, and pathology, and penalize unsupported claims, invented findings, or conclusions that go beyond the provided evidence\.
- Reasoning Validity\. Which response has more valid clinical reasoning? Prefer the response whose diagnostic logic is medically sound, internally consistent, and appropriate for the task, and penalize incorrect causal links, medically implausible interpretations, or contradictions\.
- Reasoning Completeness\. Which response provides a more complete reasoning process? Prefer the response that covers the key positive and negative evidence, necessary intermediate steps, and important differential considerations, and penalize responses that skip critical evidence or jump to a diagnosis without adequate justification\.
- Reasoning Clarity\. Which response is more concise and focused while still being clinically useful? Prefer responses that are clear, organized, and free of unnecessary repetition or irrelevant detail, without rewarding brevity that omits important reasoning\.
#### Annotation protocol\.
The study is administered as a structured questionnaire\. Each case occupies a separate page containing, in order, the clinical *Question*, *Response A*, *Response B*, and four mandatory single\-choice questions, one per evaluation dimension\. For each dimension the annotator selects Response A, Response B, or Same\. Optional free\-text elaboration is allowed but not required\. Responses cannot be edited after submission, and no annotator email or identifying information is collected\.
#### Scoring\.
After collection, the blinded A/B labels are decoded back to model identities using a held\-out mapping that is never exposed during annotation\. For each dimension we report the *non\-tie win rate* of MedGuideX\-9B over GPT\-5\.0, defined as the number of annotations in which MedGuideX\-9B is preferred divided by the total number of non\-tie annotations,

$$\text{Win Rate}=\frac{n_{\text{MedGuideX-9B}}}{n_{\text{MedGuideX-9B}}+n_{\text{GPT-5.0}}},$$

where $n_{\text{MedGuideX-9B}}$ and $n_{\text{GPT-5.0}}$ are the counts of MedGuideX\-9B\-preferred and GPT\-5\.0\-preferred judgments, respectively\. Same judgments are reported but excluded from the denominator, which is standard for pairwise model comparisons where ties carry no signal about relative quality\. The aggregate *Overall* win rate in Table [3](https://arxiv.org/html/2605.26567#S5.T3) is computed by pooling all non\-tie judgments across the four dimensions and all 40 annotations\.
#### Per\-dimension judgment counts\.
Table [6](https://arxiv.org/html/2605.26567#A7.T6) reports the raw counts underlying the win rates in Table [3](https://arxiv.org/html/2605.26567#S5.T3)\. MedGuideX\-9B is preferred on every dimension, with the strongest margins on reasoning completeness \(36 vs 3\) and evidence faithfulness \(17 vs 3\)\. Reasoning clarity is the most contested dimension, with 17 preferences for MedGuideX\-9B, 16 for GPT\-5\.0, and 7 ties\.
Table 6: Raw judgment counts from the blinded pairwise physician evaluation\. Each cell is the number of annotations in which a given outcome was chosen\. The total per dimension is 40, corresponding to 30 cases with 10 double\-annotated cases\. The non\-tie win rates in Table [3](https://arxiv.org/html/2605.26567#S5.T3) are computed by excluding the Same column from the denominator\.
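As a worked instance of the non-tie win rate, the sketch below applies the definition to the reasoning-clarity counts quoted in the text (17 preferences for MedGuideX-9B, 16 for GPT-5.0, 7 ties). The function name `non_tie_win_rate` is illustrative.

```python
def non_tie_win_rate(n_model: int, n_baseline: int) -> float:
    """Share of decided (non-tie) judgments that prefer the model."""
    decided = n_model + n_baseline
    if decided == 0:
        raise ValueError("no non-tie judgments to score")
    return n_model / decided


clarity = {"MedGuideX-9B": 17, "GPT-5.0": 16, "Same": 7}
assert sum(clarity.values()) == 40  # one judgment per annotation

rate = non_tie_win_rate(clarity["MedGuideX-9B"], clarity["GPT-5.0"])
print(f"{rate:.4f}")  # prints: 0.5152
```

Because the 7 ties are dropped, the denominator is 33 rather than 40; counting ties in the denominator would instead give 17/40 = 0.425.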
#### Questionnaire example\.
Figure [3](https://arxiv.org/html/2605.26567#A7.F3) shows the layout of a representative case page as presented to the annotator\. The case prompt and the two candidate responses are reproduced in abbreviated form; in the actual questionnaire the full untruncated text is shown\.
## Appendix H Details of Guideline\-Based Baselines
#### Retrieval\-Augmented Guideline Prompting\.
We evaluate an inference\-time retrieval baseline that uses the same raw US\-only guideline chunks as external knowledge\. We build a TF\-IDF retriever over the 3,127 raw CPG chunks, using each chunk’s topic, section path, and guideline text as the retrieval corpus\. For each benchmark instance, we form a query from the test input only: the question stem and answer choices for MedQA, the case prompt for MedCaseReasoning, the patient information and clinical fields for MIMIC\-CDM\-FI, and the emergency\-department presentation and available notes for ER\-Reason\. Gold labels are never used in retrieval\. We retrieve the top\-3 chunks, truncate each retrieved snippet to 700 characters, and prepend them to the original benchmark prompt with an instruction that the snippets may or may not be relevant\. No model parameters are updated\. This baseline obtains 31\.88%, 74\.71%, 82\.50%, and 27\.40% on the four benchmarks, with an average accuracy of 54\.12%\.
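The retrieval-and-prepend procedure can be sketched in pure Python as follows. This is a minimal illustration, not the baseline's implementation: the tokenizer, the smoothed IDF variant, the cosine scoring, the instruction wording, and the names `TfidfRetriever` and `build_prompt` are all assumptions, since the text specifies only TF-IDF retrieval, top-3 chunks, and a 700-character truncation.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfRetriever:
    """Tiny TF-IDF index with cosine scoring (illustrative only)."""

    def __init__(self, chunks: List[str]):
        self.chunks = chunks
        docs = [Counter(tokenize(c)) for c in chunks]
        n = len(chunks)
        df = Counter(term for d in docs for term in d)  # document frequency
        # Smoothed IDF; one common variant, chosen here as an assumption.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.vectors = [self._weigh(d) for d in docs]

    def _weigh(self, counts: Counter) -> Dict[str, float]:
        vec = {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    def top_k(self, query: str, k: int = 3) -> List[str]:
        q = self._weigh(Counter(tokenize(query)))
        scores = [sum(w * d.get(t, 0.0) for t, w in q.items()) for d in self.vectors]
        ranked = sorted(range(len(self.chunks)), key=lambda i: scores[i], reverse=True)
        return [self.chunks[i] for i in ranked[:k]]


def build_prompt(retriever: TfidfRetriever, benchmark_prompt: str,
                 k: int = 3, max_chars: int = 700) -> str:
    """Prepend truncated top-k snippets to the unchanged benchmark prompt."""
    snippets = [c[:max_chars] for c in retriever.top_k(benchmark_prompt, k)]
    header = "The following guideline snippets may or may not be relevant.\n"
    body = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return f"{header}{body}\n\n{benchmark_prompt}"


# Toy chunks standing in for the guideline corpus.
chunks = [
    "Appendicitis: right lower quadrant pain and fever.",
    "Heart failure: dyspnea, orthopnea and edema.",
    "Pancreatitis: epigastric pain radiating to the back.",
    "Diverticulitis: left lower quadrant pain.",
]
retriever = TfidfRetriever(chunks)
query = "patient with dyspnea and orthopnea"
prompt = build_prompt(retriever, query)
```

The query is built from the test input alone, mirroring the constraint that gold labels never enter retrieval.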
#### In\-Context Guideline Demonstrations\.
We also evaluate an in\-context learning baseline that provides guideline\-derived examples at inference time without parameter updates\. We select three factual QA examples from the guideline\-derived factual QA set\. Each demonstration contains a patient/guideline query, a short reasoning trace, and the corresponding guideline output\. These three demonstrations are fixed across all test instances and are prepended before the original benchmark prompt, with an instruction that they are demonstrations only and are not facts about the target case\. Unlike RAG, this baseline does not retrieve case\-specific guideline passages; it tests whether a small number of guideline\-style demonstrations can induce the desired reasoning behavior in the base model\. This baseline obtains 33\.11%, 74\.94%, 80\.10%, and 28\.20% on the four benchmarks, with an average accuracy of 54\.08%\.
#### CPGPrompt\.
We adapt CPGPrompt \(Deng et al\., [2026](https://arxiv.org/html/2605.26567#bib.bib10)\) as a train\-free decision\-tree prompting baseline\. For each benchmark instance, we first retrieve the top\-3 guideline recommendations from the validated recommendation corpus using TF\-IDF retrieval\. We then prompt Qwen3\.5\-9B to construct a compact CPGPrompt\-style decision tree as a Python dictionary literal, including the selected recommendation ID, yes/no decision nodes, a final rationale, and a final answer\. The generated tree is parsed, normalized, and executed locally by a deterministic Python traverser; the reached terminal action or top\-level final answer is then scored under the same benchmark\-specific evaluation protocol as the main experiments\. This baseline does not use our precompiled executable functions and does not update model parameters\. It obtains 14\.16%, 59\.15%, 67\.60%, and 19\.90% on the four benchmarks, with an average accuracy of 40\.20%\.
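A decision tree emitted as a Python dictionary literal can be parsed and walked deterministically, as in the sketch below. The tree schema (`root`, `nodes`, `question`, `yes`, `no`, `action`), the placeholder identifiers, and the use of `ast.literal_eval` for parsing are assumptions made for illustration; the text does not specify the baseline's schema or parser.

```python
import ast
from typing import Any, Dict


def traverse(tree: Dict[str, Any], facts: Dict[str, bool]) -> str:
    """Follow yes/no edges from the root until a terminal action is reached."""
    node = tree["root"]
    seen = set()
    while True:
        if node in seen:
            raise ValueError(f"cycle detected at node {node!r}")
        seen.add(node)
        spec = tree["nodes"][node]
        if "action" in spec:
            return spec["action"]
        node = spec["yes"] if facts[spec["question"]] else spec["no"]


# Hypothetical model output: a dictionary literal with placeholder content.
generated = """
{
    "recommendation_id": "REC-000",
    "root": "n1",
    "nodes": {
        "n1": {"question": "finding_a_present", "yes": "n2", "no": "t_other"},
        "n2": {"question": "finding_b_present", "yes": "t_dx", "no": "t_other"},
        "t_dx": {"action": "diagnosis_x"},
        "t_other": {"action": "consider_alternatives"},
    },
    "final_answer": "diagnosis_x",
}
"""

# literal_eval accepts only literals, so generated text is never executed as code.
tree = ast.literal_eval(generated.strip())
reached = traverse(tree, {"finding_a_present": True, "finding_b_present": True})
fallback = traverse(tree, {"finding_a_present": False, "finding_b_present": True})
```

Parsing with `ast.literal_eval` rather than `eval` is a safety-minded choice for untrusted model output, and the cycle check guards against malformed trees that point back to an earlier node.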
#### Fine\-tuning with CPG\.
We include a text\-only fine\-tuning baseline that trains Qwen3\.5\-9B directly on raw CPG text\. The training data are constructed from guideline chunks, where the model is asked to continue or reproduce the raw guideline content under a medical guideline system prompt\. This baseline uses only unstructured guideline text and does not use extracted recommendations, decision trees, executable Python functions, generated factual/counterfactual QA pairs, or verifier\-based rewards\. Its goal is to test whether exposure to guideline text alone is sufficient to improve clinical reasoning\. We evaluate this baseline on MedCaseReasoning and MedQA, where it obtains 28\.87% and 65\.04%, respectively\. The results indicate that raw guideline text alone can lead to negative transfer compared with the Qwen3\.5\-9B backbone\.
#### RL with CPG\-Derived Process Rewards\.
Finally, we evaluate an RL baseline that uses CPG\-derived process rewards instead of our executable\-guideline supervision\. This baseline starts from the same mixed factual/counterfactual SFT checkpoint and applies GRPO on factual guideline QA prompts\. The reward function reuses the factual CoT format reward and final\-answer reward, then adds an LLM\-as\-a\-judge process reward: for each sampled response, we retrieve the most relevant validated guideline recommendation, using an exact CPG identifier when available and otherwise lexical overlap retrieval, and ask the judge whether the model’s reasoning process is consistent with the retrieved guideline\. The final reward is the sum of the format reward, answer reward, and process\-consistency reward\. Unlike MedGuideX, this baseline does not execute guideline functions to verify the model’s reasoning path or counterfactual behavior\. It obtains 37\.68%, 75\.73%, 83\.40%, and 27\.80% on the four benchmarks, with an average accuracy of 56\.15%\.
## Appendix I Case Study
We provide a controlled qualitative comparison on MedCaseReasoning, a long\-form diagnostic reasoning benchmark in which each model receives the same clinical case and must produce a final diagnosis\. To make the comparison directly auditable, we select three cases where MedGuideX, as shown in this section, is judged correct while both GPT\-5\.0 and MedReason\-8B are judged incorrect\. For each case, we preserve the complete clinical case, the complete raw model response, and the verbatim judge rationale for each prediction\.
Green boxes indicate correct MedGuideX outputs, and red boxes indicate incorrect baseline outputs\. These examples are not intended to replace aggregate evaluation; rather, they illustrate diagnostic distinctions captured by guideline\-derived post\-training but missed by the comparison models\.
### I\.1 Case 1: Central Serous Chorioretinopathy
Task Prompt: Please reason through the following clinical case and predict the final diagnosis\. Do not use tables or bullet points\. Return your answer in the following format: <think\> \[Reasoning here\] </think\> <answer\> \[Final diagnosis here\] </answer\>\.

Full Clinical Case

A 42\-year\-old previously healthy woman presented with a 4\-day history of blurring of vision and visual loss in her left eye, associated with mild ocular pain\. She denied systemic symptoms, drug use, and had no significant family, medical, or ocular history, except for a type\-A behavioral pattern\. She had been diagnosed elsewhere with presumed unilateral Vogt–Koyanagi–Harada disease and referred for further evaluation prior to treatment\.

On examination, best\-corrected visual acuity was 20/20 in the right eye and 20/40 in the left\. Cycloplegic refraction was \+0\.50 diopters in the right eye and \-1\.00 diopters in the left\. Intraocular pressures were normal bilaterally\. Slit\-lamp biomicroscopy showed a clinically shallow anterior chamber in the left eye and \+1 cells in the anterior vitreous; there was no anterior chamber inflammation\. Funduscopic examination of the left eye revealed large areas of neurosensory retinal elevation consistent with subretinal fluid involving the peripapillary region and extending beyond the posterior pole; the right eye fundus was unremarkable\.

Spectral\-domain optical coherence tomography of the left eye confirmed serous detachment of the neurosensory retina in the macular, peripapillary, and juxtapapillary regions\. Optical biometry demonstrated that the anterior chamber depth was 0\.62 mm shallower in the left eye than in the right, and axial length was 0\.16 mm shorter in the left eye\.
Model Outputs\.
MedGuideX \(correct\)

Prediction: Idiopathic central serous chorioretinopathy

Judge reason: Idiopathic central serous chorioretinopathy is the same diagnosis as central serous chorioretinopathy; the added idiopathic qualifier does not change the condition\.

Raw response: The patient is a 42\-year\-old woman with acute unilateral visual loss and blurring in the left eye, mild ocular pain, and a shallow anterior chamber\. She was previously diagnosed with Vogt\-Koyanagi\-Harada disease, but the current presentation is atypical for VKH, which typically presents with bilateral uveitis, meningismus, and systemic symptoms such as hearing loss or skin changes\. Here, the anterior chamber inflammation is minimal \(only \+1 cells\), and there is no significant vitritis or posterior uveitis on exam\. Instead, the key findings are a shallow anterior chamber, reduced axial length, and extensive serous retinal detachment involving the macula, peripapillary region, and beyond the posterior pole\. The shallow anterior chamber and shorter axial length in the affected eye suggest a structural abnormality rather than an inflammatory process\. These features are characteristic of a primary vitreoretinal disorder, specifically idiopathic central serous chorioretinopathy \(CSC\), which can present with serous retinal detachment and sometimes a shallow anterior chamber due to ciliary body effusion or altered aqueous dynamics\. However, the combination of shallow anterior chamber, reduced axial length, and extensive serous detachment is more consistent with a diagnosis of idiopathic central serous chorioretinopathy with associated anterior chamber shallowing, possibly due to ciliary body effusion or a variant of CSC\. Given the absence of significant uveitis and the presence of serous retinal detachment with shallow anterior chamber, the most likely diagnosis is idiopathic central serous chorioretinopathy\. <answer\>Idiopathic central serous chorioretinopathy</answer\>
GPT\-5\.0 \(wrong\)

Prediction: Posterior scleritis

Judge reason: Posterior scleritis is a different inflammatory ocular condition and is not a synonym or close paraphrase of central serous chorioretinopathy\.

Raw response: Reasoning: Unilateral serous retinal detachment with mild ocular pain, shallow anterior chamber, axial shortening, and a paradoxical myopic shift indicates anterior displacement from ciliochoroidal effusion, favoring posterior scleritis\. Mild vitreous cells and peripapillary/macular detachment support inflammation rather than CSC\. Vogt–Koyanagi–Harada is unlikely given the unilateral involvement and lack of significant anterior chamber inflammation or systemic features\. <answer\>Posterior scleritis</answer\>
MedReason\-8B \(wrong\)

Prediction: Persistent fetal vasculature \(PFV\)

Judge reason: Persistent fetal vasculature \(PFV\) is a different retinal developmental anomaly and is not a synonym or close paraphrase of central serous chorioretinopathy\.

Raw response: Based on the provided clinical details, the patient presents with unilateral visual impairment and blurring of vision in the left eye, along with mild ocular pain\. The key findings include:

1\. \*\*Shallow anterior chamber\*\* in the left eye compared to the right\.

2\. \*\*Subretinal fluid\*\* in the left eye, particularly around the optic disc and extending into the macula\.

3\. Normal intraocular pressure bilaterally\.

4\. No anterior chamber inflammation observed during slit\-lamp examination\.

5\. A history of being diagnosed with Vogt\-Koyanagi\-Harada \(VKH\) disease, which typically involves bilateral involvement and uveitis\.

The presence of a shallow anterior chamber suggests a structural abnormality, possibly related to the anatomy of the eye\. However, VKH disease is usually characterized by bilateral symptoms and inflammation, which are not evident here\. The absence of anterior chamber inflammation and the unilateral nature of the presentation make VKH less likely\.

Given these observations, the most plausible explanation for the unilateral subretinal fluid and shallow anterior chamber without inflammation could be a congenital anomaly such as persistent fetal vasculature \(PFV\), also known as persistant hyperplastic primary vitreous \(PHPV\)\. This condition can lead to structural abnormalities like a shallow anterior chamber and subretinal fluid due to improper development of the hyaloid artery and lens\-related anomalies\.

Therefore, the final diagnosis aligning with the clinical findings would be \*\*persistent fetal vasculature \(PFV\)\*\*\.
### I\.2 Case 2: Behcet’s Disease
Task Prompt: Please reason through the following clinical case and predict the final diagnosis\. Do not use tables or bullet points\. Return your answer in the following format: <think\> \[Reasoning here\] </think\> <answer\> \[Final diagnosis here\] </answer\>\.

Full Clinical Case

An 8\-year\-old boy of Brazilian descent with autism presented with an 8\-month history of painful oral ulcers\. He had a one\-time perianal ulcer, folliculitis on his scalp and buttocks, intermittent fever, and bloody stools\. He had no genital ulcers\. He was initially treated as PFAPA with steroids, but symptoms recurred when steroids were tapered\. Family history was noncontributory\.

On examination, he had multiple large oral ulcers on the tongue, palate, and lips\. Abdominal examination was normal, with no tenderness or masses; there were no perianal ulcers, arthritis, or other skin findings\. An extensive workup for systemic lupus erythematosus, herpes simplex virus, Epstein–Barr virus, tuberculosis, cyclic neutropenia, primary immunodeficiencies, and periodic fever syndromes was unremarkable\. Ophthalmological examination was negative, with no pathergy\. Buccal biopsies showed chronic inflammation with granulation tissue, no granulomas, and grew Fusobacterium periodontium\.

Prometheus IBD sgi testing demonstrated a pattern consistent with Crohn’s disease \(positive anti\-A4\-Fa2 IgG, anti\-Fax IgG, ATG16L1 heterozygous, NKX2\-3 heterozygous, ICAM\-1, SAA 179, elevated CRP; ASCA and ANCA negative\)\. Multiple endoscopies and colonoscopies with ileal intubation were negative\. Capsule endoscopy, CT of the abdomen, and MRI enterography showed no abnormalities\.

Despite treatment with steroids, colchicine, infliximab, and methotrexate–later escalated to high\-dose infliximab–the patient’s bloody stools worsened and he developed severe malnutrition \(BMI z\-score \-2\.38\)\. At month 18, a fourth colonoscopy revealed large, round ulcers in the terminal ileum with granulation tissue and chronic inflammatory infiltrates, but no granulomas or cobblestoning\.

Treatment was switched to ustekinumab, after which gastrointestinal symptoms improved\. At month 19, he developed bilateral hip effusions and later presented with headaches, emesis, and fevers\. A lumbar puncture showed an opening pressure of 39 cm H2O; brain MRI/MRA/MRV revealed extensive thrombosis of the superior sagittal sinus, straight sinus, and proximal right transverse sinus\. A hypercoagulability panel was negative\.
Model Outputs\.
MedGuideX \(correct\)
Prediction: Behçet’s disease
Judge reason: Exact match with the gold diagnosis\.
Raw response: The patient is an 8\-year\-old boy with a prolonged history of painful oral ulcers, bloody stools, and intermittent fever, initially misdiagnosed as PFAPA\. Despite negative workup for common infectious and autoimmune causes, and negative endoscopic and imaging studies for Crohn’s disease, he had a positive Prometheus IBD sgi test pattern consistent with Crohn’s disease\. However, the endoscopic findings were atypical: large round ulcers with granulation tissue and chronic inflammation but no granulomas or cobblestoning, and no ileal involvement on capsule endoscopy, CT, or MRI enterography\. His symptoms worsened with multiple Crohn’s\-directed therapies, including infliximab, and he developed severe malnutrition\. The key turning point was the development of bilateral hip effusions and then cerebral venous sinus thrombosis \(CVST\) with extensive thrombosis of the superior sagittal, straight, and right transverse sinuses, despite a negative hypercoagulability panel\. This presentation is highly suggestive of Behçet’s disease, which can present with oral ulcers, gastrointestinal involvement \(often ileal or colonic ulcers\), and thrombotic complications such as CVST\. The absence of genital ulcers does not exclude Behçet’s disease, as it can be atypical in children\. The negative ASCA and ANCA, along with the lack of granulomas, further supports a diagnosis of Behçet’s disease rather than Crohn’s disease\. The response to ustekinumab, which is effective in Behçet’s disease, also supports this diagnosis\. <answer\>Behçet’s disease</answer\>
GPT\-5\.0 \(wrong\)
Prediction: Crohn’s disease
Judge reason: Crohn’s disease is a different inflammatory bowel disease and not a synonym or direct equivalent of Behçet’s disease\.
Raw response: Reasoning: Severe, recurrent oral ulcers with bloody stools and later terminal ileal ulcers showing chronic inflammation, plus an IBD serologic pattern, point to Crohn’s disease; symptoms improved with ustekinumab\. Behçet disease is less likely given absent genital ulcers, negative pathergy and ophthalmologic exams, and GI findings ultimately localized to terminal ileum consistent with Crohn’s\. Cerebral venous sinus thrombosis is a known IBD complication\. <answer\>Crohn’s disease</answer\>
MedReason\-8B \(wrong\)
Prediction: Crohn’s disease
Judge reason: Crohn’s disease is not the same diagnosis as Behçet’s disease; they can overlap clinically but are distinct conditions\.
Raw response: Crohn’s disease
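The task prompt asks for reasoning inside `<think>` tags and the final diagnosis inside `<answer>` tags, yet the raw responses shown here do not always comply: the MedReason-8B response in this case contains no tags at all. The following is a minimal sketch of how a prediction string might be pulled out of such responses before it is passed to a judge. It is an illustration, not the authors' evaluation code; the function names `extract_answer` and `extract_prediction`, and the fallback to the whole response when no tag is found, are assumptions made for this example.

```python
import re
from typing import Optional

# Matches <answer>...</answer>, case-insensitively and across line breaks.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(raw_response: str) -> Optional[str]:
    """Return the text of the last <answer> block, or None if there is none."""
    matches = ANSWER_RE.findall(raw_response)
    if not matches:
        return None
    return matches[-1].strip()


def extract_prediction(raw_response: str) -> str:
    """Return the tagged answer if present.

    Assumption for this sketch: when the model ignores the requested format,
    the whole response (stripped) is treated as the prediction.
    """
    answer = extract_answer(raw_response)
    if answer is not None:
        return answer
    return raw_response.strip()


if __name__ == "__main__":
    tagged = "Reasoning text. <answer>Behçet's disease</answer>"
    untagged = "Crohn's disease"
    print(extract_prediction(tagged))
    print(extract_prediction(untagged))
```

Taking the last `<answer>` block rather than the first is a design choice for this sketch: it avoids picking up a tag that a model echoes back from the prompt before giving its actual answer.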
### I\.3 Case 3: Infective Endocarditis with Renal Embolization
Task Prompt: Please reason through the following clinical case and predict the final diagnosis\. Do not use tables or bullet points\. Return your answer in the following format: <think\> \[Reasoning here\] </think\> <answer\> \[Final diagnosis here\] </answer\>\.
Full Clinical Case: A 52\-year\-old woman with known unrepaired Type II tricuspid atresia with d\-transposition of the great arteries, hypoplastic right ventricle, unrestricted atrial and ventricular septal defects, and multi\-level pulmonic stenosis presented with sudden\-onset severe right upper quadrant abdominal pain without nausea, vomiting, diarrhea, or dysuria\. Two months earlier, she had flu\-like symptoms treated with doxycycline, and she reported poor dentition with no dental care for two years\. On examination, her blood pressure was 90/54 mmHg, she was afebrile, heart rate was 80 bpm, and oxygen saturation was 79% on room air\. A grade\-2 systolic ejection murmur was heard at the left upper sternal border\. She had cyanotic fingers bilaterally but no Osler nodes, Janeway lesions, or splinter hemorrhages\. Laboratory studies showed hemoglobin 17\.9 g/dL and C\-reactive protein 2\.2 mg/dL; lactate, white blood cell count, and creatinine were within reference ranges\. Computed tomography of the abdomen demonstrated an acute right renal infarct; cerebral computed tomography was negative\. She was anticoagulated with heparin\.
Model Outputs\.
MedGuideX \(correct\)
Prediction: Infective endocarditis with septic embolization to the right kidney
Judge reason: The prediction includes the gold diagnosis, infective endocarditis; the added septic embolization to the right kidney is a complication and does not change the core diagnosis\.
Raw response: The patient has a complex cyanotic congenital heart disease with unrepaired tricuspid atresia, d\-transposition of the great arteries, hypoplastic right ventricle, unrestricted atrial and ventricular septal defects, and multi\-level pulmonic stenosis\. This anatomy results in right\-to\-left shunting and systemic cyanosis, as evidenced by her low oxygen saturation and cyanotic fingers\. She presents with sudden severe right upper quadrant abdominal pain, and imaging reveals an acute right renal infarct\. Her history includes recent doxycycline use for flu\-like symptoms and poor dentition, which raises concern for a source of bacteremia\. Although she lacks classic peripheral stigmata of infective endocarditis such as Osler nodes or Janeway lesions, the presence of a new renal infarct in the setting of cyanotic heart disease is highly suggestive of septic emboli from infective endocarditis\. The murmur at the left upper sternal border may represent her baseline murmur from pulmonic stenosis, but the acute presentation with organ infarction points to embolic complications\. Given her cyanotic heart disease, she is at high risk for infective endocarditis due to turbulent flow and endothelial damage, and the renal infarct is a classic manifestation of septic emboli\. Therefore, the most likely diagnosis is infective endocarditis with septic embolization to the kidney\. <answer\>Infective endocarditis with septic embolization to the right kidney</answer\>
GPT\-5\.0 \(wrong\)
Prediction: Paradoxical systemic arterial embolism causing acute right renal infarction in cyanotic congenital heart disease \(unrepaired tricuspid atresia with d\-TGA
Judge reason: Predicted diagnosis is paradoxical systemic arterial embolism/renal infarction due to cyanotic congenital heart disease, which is not the same as infective endocarditis\.
Raw response: Reasoning: Unrepaired cyanotic congenital heart disease with right\-to\-left shunt allows venous thrombi to bypass the lungs and embolize systemically\. She is afebrile with normal WBC, only mildly elevated CRP, no peripheral stigmata of endocarditis, and acute CT\-proven renal infarct; anticoagulation was initiated for embolism\. Infective endocarditis with septic emboli is less likely given absence of fever and classic signs\. <answer\>Paradoxical systemic arterial embolism causing acute right renal infarction in cyanotic congenital heart disease \(unrepaired tricuspid atresia with d\-TGA\)</answer\>
MedReason\-8B \(wrong\)
Prediction: Intracardiac thrombus with paradoxical embolism
Judge reason: Predicted diagnosis is intracardiac thrombus with paradoxical embolism/renal infarction, which is not the same as infective endocarditis\.
Raw response:
\#\#\# Finding Reasoning Paths:
1\. \*\*Right upper quadrant abdominal pain\*\*: Consider causes such as gallbladder disease \(e\.g\., cholecystitis\), pancreatitis, peptic ulcer disease, or vascular events like mesenteric ischemia or renal infarction\.
2\. \*\*History of flu\-like illness and doxycycline use\*\*: Evaluate for complications related to recent antibiotic use \(e\.g\., drug\-induced side effects\) or secondary infections\.
3\. \*\*Cyanosis and low oxygen saturation\*\*: Investigate cardiac or pulmonary causes, including pulmonary embolism, chronic lung disease, or congenital heart defects\.
4\. \*\*Acute renal infarction on CT scan\*\*: Explore potential causes such as thromboembolism, vasculitis, or other vascular abnormalities\.
5\. \*\*Congenital heart disease history\*\*: Assess how underlying structural heart defects \(e\.g\., tricuspid atresia, transposition of the great arteries\) contribute to systemic issues like hypoxemia or embolic events\.
\-\-\-
\#\#\# Reasoning Process:
1\. \*\*Initial Presentation\*\*:
\- The patient has severe right upper quadrant abdominal pain, cyanosis, and hypoxemia \(oxygen saturation 79%\)\. These findings suggest a systemic issue involving both cardiovascular and possibly gastrointestinal systems\.
\- The absence of fever, nausea, vomiting, or diarrhea makes infectious or inflammatory causes less likely in the GI tract\.
2\. \*\*Cardiac History\*\*:
\- The patient has a complex congenital heart defect \(tricuspid atresia with transposition of the great arteries\)\. This condition predisposes her to cyanosis due to impaired oxygenation and potential complications like paradoxical emboli\.
\- Cyanosis and hypoxemia could also result from pulmonary hypertension or right\-to\-left shunting due to the congenital defect\.
3\. \*\*Renal Infarction\*\*:
\- The CT scan shows an acute right renal infarction, which is often caused by embolism, thrombosis, or vasculitis\. In this case, embolism is most likely given the systemic nature of the presentation\.
\- Emboli can originate from the heart, particularly in patients with congenital heart disease, where intracardiac thrombi may form due to stagnant blood flow or turbulent currents\.
4\. \*\*Doxycycline Use\*\*:
\- Doxycycline is not commonly associated with embolic phenomena, so it is unlikely to be directly responsible for the renal infarction\.
\- However, the flu\-like illness treated with doxycycline raises the possibility of a preceding viral infection that could have triggered a hypercoagulable state or endothelial damage, increasing the risk of thromboembolism\.
5\. \*\*Systemic Embolism\*\*:
\- The combination of cyanosis, hypoxemia, and renal infarction strongly suggests a systemic embolic event\. Given the patient’s congenital heart disease, paradoxical embolism \(via a right\-to\-left shunt\) is a plausible mechanism\.
\- Intracardiac thrombus formation is a common complication in patients with congenital heart defects, especially when there are atrial or ventricular septal defects allowing shunting of deoxygenated blood\.
6\. \*\*Exclusion of Other Causes\*\*:
\- There is no evidence of infection \(no fever, leukocytosis, or elevated CRP\)\.
\- No signs of pancreatitis or peptic ulcer disease based on the clinical presentation\.
\- The absence of neurological symptoms makes stroke or cerebral embolism less likely\.
\-\-\-
\#\#\# Conclusion:
The most likely diagnosis is \*\*intracardiac thrombus formation\*\*, leading to a paradoxical embolism causing the acute right renal infarction and systemic hypoxemia\. This conclusion aligns with the patient’s congenital heart disease, cyanosis, and renal infarction findings\.
Final Answer
Intracardiac thrombus