ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning

arXiv cs.AI Papers

Summary

ChatHealthAI is a multimodal reasoning framework that aligns structured EHR representations with a frozen LLM to enable grounded clinical reasoning while maintaining predictive performance.

arXiv:2606.02802v1 Announce Type: new

# ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning
Source: [https://arxiv.org/html/2606.02802](https://arxiv.org/html/2606.02802)
Bo\-Hong Wang1,2, Baicheng Peng1, Ruilin Wang1,2, Jun Bai1,2, Ziyang Song1,2, Yue Li1 1School of Computer Science, McGill University, Montreal, QC, Canada 2Mila \- Quebec AI Institute, Montreal, QC, Canada

###### Abstract

Large language models \(LLMs\) exhibit strong natural\-language reasoning abilities for clinical decision support, but struggle to effectively model structured longitudinal electronic health records \(EHRs\)\. In contrast, EHR foundation models can learn predictive patient representations, yet lack interpretable language\-based reasoning\. To bridge this gap, we propose ChatHealthAI, a multimodal reasoning framework that aligns structured EHR representations from a pretrained EHR foundation model with the semantic space of a frozen LLM through a task\-aware resampler\. By integrating longitudinal patient representations with refined clinical event descriptions, ChatHealthAI enables clinically grounded natural\-language reasoning while maintaining accurate patient prediction\. We evaluated ChatHealthAI on three clinical predictive tasks from the EHRSHOT benchmark\. Results show that ChatHealthAI improves reasoning quality and interpretability while preserving competitive predictive performance\. These findings highlight the potential of integrating EHR foundation models with pretrained LLMs for interpretable clinical prediction\.

Corresponding author: yueli@cs\.mcgill\.ca

## 1 Introduction

Large Language Models \(LLMs\) have shown strong natural\-language reasoning capabilities, making them increasingly appealing for clinical decision\-making \(Singhal et al\., [2023](https://arxiv.org/html/2606.02802#bib.bib1)\)\. However, clinical predictions often require not only output labels but also interpretable explanations grounded in patient records \(Lauritsen et al\., [2019](https://arxiv.org/html/2606.02802#bib.bib11)\)\. This is challenging for structured longitudinal electronic health records \(EHRs\), which consist of coded clinical events whose temporal order and clinical context are important for prediction \(Pang et al\., [2021](https://arxiv.org/html/2606.02802#bib.bib12); Choi et al\., [2016](https://arxiv.org/html/2606.02802#bib.bib17)\)\. Simply serializing EHR events into an LLM prompt can exceed context limits and lose temporal and structural patterns \(Wu et al\., [2024](https://arxiv.org/html/2606.02802#bib.bib13); Pang et al\., [2021](https://arxiv.org/html/2606.02802#bib.bib12)\)\. This motivates integrating LLM\-based reasoning with representations learned from structured EHR trajectories\.

EHR foundation models address this complementary modeling problem by pretraining on large\-scale structured clinical trajectories\. Models such as CLMBR\-T\-Base \(Wornow et al\., [2023](https://arxiv.org/html/2606.02802#bib.bib3)\), Med\-BERT \(Rasmy et al\., [2021](https://arxiv.org/html/2606.02802#bib.bib7)\), and BEHRT \(Li et al\., [2020](https://arxiv.org/html/2606.02802#bib.bib8)\) learn patient representations from longitudinal EHR data and have shown strong performance on downstream clinical prediction tasks\. These models capture predictive patterns from diagnosis, medication, laboratory, and other clinical event sequences, but their outputs are typically latent embeddings or risk scores rather than natural\-language explanations \(Li et al\., [2020](https://arxiv.org/html/2606.02802#bib.bib8); Rasmy et al\., [2021](https://arxiv.org/html/2606.02802#bib.bib7); Wornow et al\., [2023](https://arxiv.org/html/2606.02802#bib.bib3)\)\. This creates a methodological gap between predictive EHR representation learning and natural\-language clinical reasoning\. EHR embeddings and LLM token embeddings are learned from heterogeneous input spaces \(Alayrac et al\., [2022](https://arxiv.org/html/2606.02802#bib.bib2); Richard et al\., [2024](https://arxiv.org/html/2606.02802#bib.bib4)\) and training objectives, and therefore are not naturally aligned in a shared representation space\. Consequently, EHR embeddings cannot be directly treated as meaningful LLM inputs\. Effective clinical reasoning therefore requires explicit alignment between structured EHR representations and the LLM representation space\.

To address this challenge, we propose ChatHealthAI, a multimodal clinical reasoning framework that aligns structured EHR representations with a frozen open\-source LLM to enable grounded clinical prediction and reasoning generation\. Specifically, we use CLMBR\-T\-Base as the EHR foundation model and DeepSeek\-R1\-Distill\-Qwen\-14B as the LLM\. ChatHealthAI aligns patient representations learned from structured EHR trajectories with a frozen LLM through a task\-aware resampler \(Richard et al\., [2024](https://arxiv.org/html/2606.02802#bib.bib4)\)\. It further incorporates selected clinical events as textual evidence, enabling the model to generate grounded clinical reasoning and task\-specific prediction conclusions\. We evaluate ChatHealthAI on the EHRSHOT benchmark \(Wornow et al\., [2023](https://arxiv.org/html/2606.02802#bib.bib3)\), including length\-of\-stay \(LOS\), ICU admission, and 30\-day readmission prediction\. Results show that ChatHealthAI maintains competitive predictive performance and improves reasoning quality, highlighting EHR\-LLM alignment as a promising direction for interpretable clinical prediction\.

In summary, our contributions are:

- We propose ChatHealthAI, a multimodal clinical reasoning framework that aligns structured longitudinal EHR representations with a frozen LLM to support clinically grounded prediction and natural\-language reasoning\.
- We integrate task\-aware latent EHR representations with refined clinical event evidence to improve clinically grounded and interpretable reasoning over longitudinal patient trajectories\.
- We evaluate ChatHealthAI on multiple clinical prediction tasks from the EHRSHOT benchmark, including LOS, ICU admission, and 30\-day readmission\. Experimental results demonstrate improved reasoning quality while maintaining competitive predictive performance\.

## 2 Related Work

##### EHR representation learning\.

Prior EHR foundation models mainly focus on learning predictive patient embeddings from structured EHR data\. Transformer\-based EHR foundation models such as CEHR\-BERT \(Pang et al\., [2021](https://arxiv.org/html/2606.02802#bib.bib12)\), Med\-BERT \(Rasmy et al\., [2021](https://arxiv.org/html/2606.02802#bib.bib7)\), and CLMBR\-T\-Base \(Wornow et al\., [2023](https://arxiv.org/html/2606.02802#bib.bib3)\) learn longitudinal patient representations through large\-scale pretraining on structured EHR trajectories\. EHRSHOT provided a standard few\-shot evaluation benchmark used to evaluate CLMBR\-T\-Base \(Wornow et al\., [2023](https://arxiv.org/html/2606.02802#bib.bib3)\)\. However, these models mainly focus on predictive representation learning and do not explicitly support grounded natural\-language clinical reasoning generation\.

##### Clinical LLMs and representation alignment\.

Clinical LLMs show great potential for biomedical reasoning and for generating interpretations \(Singhal et al\., [2023](https://arxiv.org/html/2606.02802#bib.bib1)\)\. However, prompting alone may not be enough for clinical LLMs to capture latent longitudinal patterns and temporal structure in complex EHR trajectories\. Multimodal frameworks such as Flamingo \(Alayrac et al\., [2022](https://arxiv.org/html/2606.02802#bib.bib2)\) and ChatNT \(Richard et al\., [2024](https://arxiv.org/html/2606.02802#bib.bib4)\) suggest that lightweight alignment modules, such as the perceiver resampler, can help bridge pretrained non\-text representations with frozen LLMs\.

## 3 ChatHealthAI

![Refer to caption](https://arxiv.org/html/2606.02802v1/overview.png)

Figure 1: Overview of ChatHealthAI\. CLMBR\-T\-Base encodes structured EHR events into latent patient representations, which are aligned with a frozen open\-source LLM through a task\-aware resampler\. The frozen open\-source LLM additionally receives the task prompt, refined clinical events, and patient context as grounding information\. During training, teacher\-generated reasoning targets supervise the EHR\-side modules through next\-token prediction loss\. During inference, the model generates evidence, reasoning, and a final prediction conclusion\.

We propose ChatHealthAI, a multimodal framework that aligns longitudinal EHR representations with a frozen open\-source LLM to support clinical prediction with natural\-language reasoning \(Fig\. [1](https://arxiv.org/html/2606.02802#S3.F1)\)\. ChatHealthAI encodes longitudinal patient trajectories with CLMBR\-T\-Base, maps the resulting representations to a frozen open\-source LLM through a resampler, and uses refined clinical events as textual grounding for reasoning\.

### 3\.1 Problem Formulation

Let $\mathcal{E}=\{e_1,e_2,\ldots,e_T\}$ denote a patient’s longitudinal EHR trajectory, where each event $e_t$ contains structured clinical information such as laboratory results, vital signs, medications, or diagnosis codes observed at timestamp $t$\. Let $\mathcal{T}$ denote the textual inputs: the task prompt, the refined clinical events, and the patient context\.

Given both the structured EHR trajectory $\mathcal{E}$ and the textual inputs $\mathcal{T}$, ChatHealthAI aims to identify clinically relevant evidence from the longitudinal EHR trajectory, generate reasoning grounded in that evidence, and produce a prediction supported by both the structured EHR information and the reasoning process\.

### 3\.2 Clinical Event Retrieval and Refinement

Raw longitudinal EHR trajectories often contain lengthy, repetitive, and low\-information clinical events\. Directly serializing the full EHR trajectory into the LLM context is computationally inefficient and may dilute clinically important signals within long sequences\. In addition, limited context windows can lead to truncation of important events\.

To address this problem, we select a compact subset of clinically representative events from the longitudinal EHR trajectory for downstream clinical reasoning\. Specifically, we divide the EHR trajectory into temporal chunks within a configurable lookback window \(48 hours by default\)\. We retrieve chunks using RAG \(Lewis et al\., [2020](https://arxiv.org/html/2606.02802#bib.bib16)\) with the retrieval query "Retrieve chunks that best summarize the patient’s clinical course", and the retrieved chunks are then refined by an LLM to select at most 30 representative clinical events that best summarize the patient’s trajectory\. These *refined clinical events* serve as textual grounding signals for downstream clinical reasoning\.
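The chunking step can be illustrated with a minimal sketch. The function names, the `(timestamp, description)` event format, and the 6-hour chunk width are assumptions made for illustration; the paper specifies only the 48-hour default lookback window and the cap of 30 refined events. Retrieval and LLM-based refinement are not reproduced here.

```python
import math
from datetime import datetime, timedelta


def chunk_events(events, prediction_time, lookback_hours=48, chunk_hours=6):
    """Group (timestamp, description) events from the lookback window into temporal chunks.

    chunk_hours is a hypothetical parameter; the paper does not state a chunk width.
    """
    window_start = prediction_time - timedelta(hours=lookback_hours)
    n_chunks = math.ceil(lookback_hours / chunk_hours)
    chunks = [[] for _ in range(n_chunks)]
    for ts, desc in sorted(events, key=lambda e: e[0]):
        # Discard events outside the lookback window.
        if not (window_start <= ts <= prediction_time):
            continue
        idx = int((ts - window_start) / timedelta(hours=chunk_hours))
        # An event exactly at prediction_time falls into the last chunk.
        chunks[min(idx, n_chunks - 1)].append((ts, desc))
    return [chunk for chunk in chunks if chunk]


def cap_events(selected, max_events=30):
    """Guard that keeps at most max_events of whatever the refinement step returns."""
    return selected[:max_events]


t0 = datetime(2024, 1, 1)
demo_events = [
    (t0 - timedelta(hours=60), "outside the window"),
    (t0 - timedelta(hours=47), "lab: creatinine"),
    (t0 - timedelta(hours=46), "medication: vancomycin"),
    (t0 - timedelta(hours=1), "vitals: heart rate"),
]
demo_chunks = chunk_events(demo_events, prediction_time=t0)
```

In this toy example the event recorded 60 hours before prediction time is dropped, and the remaining three events fall into two chunks.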

### 3\.3 Task\-Aware EHR\-LLM Representation Alignment

We encode the patient’s event sequence $\mathcal{E}$ within the lookback window using CLMBR\-T\-Base, obtaining contextualized EHR embeddings $\mathbf{H}_{\text{CLMBR}}\in\mathbb{R}^{T\times d}$, where $T$ denotes the number of clinical events and $d=768$ is the EHR embedding dimension\. In our dataset, the number of clinical events varies substantially across patients, with $T\in[100,60000]$\.

EHR embeddings and LLM token embeddings are learned from heterogeneous input spaces and training objectives, and are therefore not naturally aligned in a shared representation space\. We initially explored using a trainable linear projection layer to directly map CLMBR embeddings into the LLM embedding dimension in order to bridge the representation gap between EHR embeddings and LLM token embeddings\. However, we found that simple linear projection was insufficient for the frozen LLM to effectively interpret longitudinal EHR representations, leading to unstable reasoning generation and degraded predictive performance\.

Therefore, we employ a perceiver resampler \(Fig\. [2](https://arxiv.org/html/2606.02802#S3.F2)\), inspired by Flamingo \(Alayrac et al\., [2022](https://arxiv.org/html/2606.02802#bib.bib2)\) and ChatNT \(Richard et al\., [2024](https://arxiv.org/html/2606.02802#bib.bib4)\), to align structured EHR representations with the frozen LLM\. The resampler contains $M=64$ learnable latent queries, $\mathbf{Q}\in\mathbb{R}^{M\times d}$, which attend to the CLMBR\-T\-Base embeddings through a cross\-attention module followed by a feed\-forward network \(FFN\) \(Vaswani et al\., [2017](https://arxiv.org/html/2606.02802#bib.bib22); Jaegle et al\., [2021](https://arxiv.org/html/2606.02802#bib.bib23)\):

$$\mathbf{Z}=\mathrm{CrossAttn}(\mathbf{Q},\mathbf{K},\mathbf{V}), \tag{1}$$

where $\mathbf{K}$ and $\mathbf{V}$ are obtained from the CLMBR\-T\-Base embeddings, $\mathbf{H}_{\text{CLMBR}}\in\mathbb{R}^{T\times d}$\. The resampler compresses the EHR trajectory $\mathbf{H}_{\text{CLMBR}}$ from $T$ embeddings to a fixed set of $M=64$ latent tokens, thereby reducing computational cost and meeting the context length requirement\. Then, we further condition the latent representation on a task\-prompt embedding to support task\-aware clinical reasoning\. We apply another cross\-attention layer followed by an FFN to produce a task\-aware latent representation:

$$\mathbf{Z}^{\prime}=\mathrm{CrossAttn}(\mathbf{Z},\mathbf{K},\mathbf{V}), \tag{2}$$

where $\mathbf{K}$ and $\mathbf{V}$ are now obtained from the task\-specific prompt embeddings, $\mathbf{P}\in\mathbb{R}^{L\times d}$, where $L$ is the number of prompt tokens\. This task\-aware cross\-attention enables the model to dynamically reweight latent resampler representations according to the target prediction task, thereby emphasizing different clinical information for different tasks\.
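The two cross-attention stages can be traced with a dependency-free sketch that follows only the tensor shapes. It is a simplification under stated assumptions, not the authors' implementation: single-head attention, no learned key/value projections, no FFN or residual connections, random stand-ins for all embeddings, and a reduced dimension of 16 instead of 768 so the pure-Python loops stay fast.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def cross_attn(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no projections, no FFN)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, M, L, d = 500, 64, 12, 16   # T events, M latents, L prompt tokens; d reduced for the demo
H = random_matrix(T, d, rng)   # stand-in for CLMBR-T-Base embeddings
Q = random_matrix(M, d, rng)   # latent queries (learned in the paper, random here)
P = random_matrix(L, d, rng)   # stand-in for task-prompt embeddings

Z = cross_attn(Q, H, H)        # Eq. (1): M latents summarize T event embeddings
Z_task = cross_attn(Z, P, P)   # Eq. (2): latents attend to the task prompt
print(len(Z), len(Z[0]), len(Z_task), len(Z_task[0]))  # 64 16 64 16
```

Whatever the value of $T$, the output always has $M$ rows, which is what lets trajectories with very different numbers of events fit the same fixed token budget.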

![Refer to caption](https://arxiv.org/html/2606.02802v1/resampler.png)

Figure 2: Task\-aware resampler\. Learnable latent queries first attend to CLMBR\-T\-Base embeddings to produce compact EHR latents, which then attend to the task prompt to generate task\-aware representations\.

### 3\.4 Clinical Reasoning Instruction Tuning

The input to the frozen LLM consists of four components: \(1\) the task prompt, \(2\) the task\-aware latent representations, \(3\) the patient\-level context such as patient age at prediction time, and \(4\) serialized refined clinical events with relative timestamps measured from the first observed event\. We use DeepSeek\-R1\-Distill\-Qwen\-14B as the frozen LLM backbone\.
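As an illustration of the textual components, the sketch below serializes refined events with offsets measured from the first observed event and joins them with the task prompt and patient age. The line format, field labels, and function names are assumptions; the paper does not specify its template. The task-aware latent representations enter the LLM at the embedding level and are therefore not part of this text.

```python
from datetime import datetime


def serialize_events(events):
    """Render (timestamp, description) pairs with hour offsets from the first event."""
    if not events:
        return ""
    ordered = sorted(events, key=lambda e: e[0])
    first_ts = ordered[0][0]
    lines = []
    for ts, desc in ordered:
        hours = (ts - first_ts).total_seconds() / 3600.0
        lines.append(f"[+{hours:.1f}h] {desc}")
    return "\n".join(lines)


def build_text_inputs(task_prompt, age_years, events):
    """Join the textual components into one prompt string (hypothetical layout)."""
    return "\n\n".join([
        f"Task: {task_prompt}",
        f"Patient context: age {age_years} at prediction time",
        "Refined clinical events:\n" + serialize_events(events),
    ])


demo = [
    (datetime(2024, 1, 1, 10, 30), "transfusion ordered"),
    (datetime(2024, 1, 1, 8, 0), "immunosuppressive therapy started"),
]
print(build_text_inputs("Will the stay last at least 7 days?", 60, demo))
```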

To supervise the generation of clinically grounded reasoning, we use GPT\-oss\-120B as a teacher model to generate structured reasoning targets \(Hsieh et al\., [2023](https://arxiv.org/html/2606.02802#bib.bib19); Wei et al\., [2022](https://arxiv.org/html/2606.02802#bib.bib18)\)\. Given the refined clinical events and the ground\-truth label, the teacher model produces structured answers consisting of: \(1\) clinical evidence, \(2\) step\-by\-step reasoning, and \(3\) a label\-consistent conclusion\. These teacher\-generated answers serve as supervision targets during training\. Given a tokenized teacher\-generated target sequence $y^{*}=(y^{*}_{1},\ldots,y^{*}_{N})$, we optimize the trainable ChatHealthAI modules with the next\-token prediction loss:

$$\mathcal{L}_{\mathrm{NTP}}=-\sum_{j=1}^{N}\log p_{\theta}\left(y^{*}_{j}\mid\mathbf{x}_{<j}\right), \tag{3}$$

where $\theta$ denotes the trainable parameters of the CLMBR\-T\-Base encoder and the task\-aware resampler, while the LLM backbone remains frozen\. At inference time, ChatHealthAI autoregressively generates clinical evidence, reasoning, and a final prediction conclusion\.
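A small numeric sketch of Eq. (3) follows, assuming we already have the probability the model assigns to each teacher target token given its preceding context. The function name and the three probabilities are invented for illustration.

```python
import math


def ntp_loss(target_token_probs):
    """Sum of negative log-probabilities of the teacher target tokens, as in Eq. (3)."""
    return -sum(math.log(p) for p in target_token_probs)


# Hypothetical probabilities for a three-token target sequence.
probs = [0.9, 0.5, 0.8]
loss = ntp_loss(probs)
print(round(loss, 4))  # 1.0217
```

Because the LLM backbone is frozen, gradients of this loss would update only the EHR encoder and resampler parameters; the sketch computes the loss value only.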

## 4 Experimental Setting

### 4\.1 Dataset

We evaluated ChatHealthAI on EHRSHOT \(Wornow et al\., [2023](https://arxiv.org/html/2606.02802#bib.bib3)\), which contains longitudinal structured EHR trajectories of diagnoses, medications, laboratory tests, procedures, and other clinical events\. EHRSHOT provides multiple standardized clinical prediction tasks, along with patient\-level CLMBR\-T\-Base embeddings pre\-trained on 2\.57 million deidentified EHRs\. We focused on three clinical prediction tasks: \(1\) length\-of\-stay \(LOS\) prediction, which predicts whether the patient’s hospitalization lasts at least 7 days, \(2\) ICU admission prediction, which predicts whether the patient will require intensive care unit admission, and \(3\) 30\-day readmission prediction, which predicts whether the patient will be readmitted within 30 days after discharge\. We split the data into train and test sets following the established EHRSHOT procedures\. Table [1](https://arxiv.org/html/2606.02802#S4.T1) summarizes dataset statistics\.

Table 1: Dataset statistics for the EHRSHOT benchmark with three clinical prediction tasks\. Pos\. denotes the positive ratio \(%\)\.

| Task | Pos\. | Train | Val | Test |
| --- | --- | --- | --- | --- |
| Length of Stay ≥ 7d | 25\.3% | 2569 | 2231 | 2195 |
| ICU admission | 4\.5% | 2402 | 2052 | 2037 |
| 30\-day readmission | 13\.0% | 2608 | 2206 | 2189 |

Table 2: Main results on EHRSHOT benchmark test sets reported in Precision \(P\) / Recall \(R\) / F1\. LLM\-based methods output discrete labels; CLMBR\+Linear uses threshold 0\.5\.

| Method | Setting | LOS P | LOS R | LOS F1 | ICU P | ICU R | ICU F1 | Readm\. P | Readm\. R | Readm\. F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLMBR \+ Linear | – | 0\.667 | 0\.286 | 0\.401 | 1\.0 | 0\.01 | 0\.02 | 0\.670 | 0\.110 | 0\.190 |
| Llama\-3\.1\-8B\-Instruct | zero\-shot | 0\.253 | 0\.996 | 0\.403 | 0\.05 | 1\.0 | 0\.09 | 0\.119 | 0\.987 | 0\.213 |
| Llama\-3\.1\-8B\-Instruct | finetuned | 0\.269 | 0\.794 | 0\.402 | 0\.05 | 0\.563 | 0\.083 | 0\.132 | 0\.507 | 0\.210 |
| BioMistral\-7B | zero\-shot | 0\.252 | 0\.976 | 0\.398 | 0\.05 | 1\.0 | 0\.09 | 0\.120 | 0\.981 | 0\.214 |
| BioMistral\-7B | finetuned | 0\.269 | 0\.795 | 0\.401 | 0\.082 | 0\.047 | 0\.060 | 0\.231 | 0\.142 | 0\.175 |
| DeepSeek\-R1\-14B | zero\-shot | 0\.258 | 0\.993 | 0\.409 | 0\.046 | 0\.835 | 0\.087 | 0\.120 | 0\.969 | 0\.214 |
| DeepSeek\-R1\-14B | finetuned | 0\.237 | 0\.925 | 0\.377 | 0\.066 | 0\.762 | 0\.121 | 0\.119 | 0\.783 | 0\.206 |
| ChatHealthAI | finetuned | 0\.442 | 0\.551 | 0\.491 | 0\.385 | 0\.136 | 0\.196 | 0\.208 | 0\.242 | 0\.224 |

### 4\.2 Baselines

We compared against CLMBR\+Linear, which applies a linear classifier with a prediction threshold of 0\.5 to frozen CLMBR\-T\-Base embeddings and serves as a structured EHR prediction baseline without natural\-language reasoning\. We also included prompt\-based LLM baselines that take serialized refined clinical events directly as input, without a resampler or an EHR representation\. We evaluated four open\-source models under zero\-shot and finetuned settings: Llama\-3\.1\-8B\-Instruct, selected as an open\-source instruction\-tuned LLM; DeepSeek\-R1\-Distill\-Qwen\-14B, selected as the frozen LLM backbone of ChatHealthAI with strong reasoning capabilities; BioMistral\-7B, selected as a backbone pretrained on large\-scale biomedical text; and GPT\-oss\-120B, selected as the teacher model for event refinement and instruction\-answer generation in ChatHealthAI\.

### 4\.3 Metrics

ChatHealthAI and the LLM baselines generate structured natural\-language outputs that include evidence, step\-by\-step reasoning, and conclusions, rather than prediction probabilities\. We therefore used Precision, Recall, and F1 as the predictive metrics for evaluation\. The prediction is obtained from the generated “Yes” or “No” token following “Conclusion:” in the output template\. For CLMBR\+Linear, which produces continuous prediction scores, we apply a fixed threshold of 0\.5 to obtain prediction labels\.
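The label extraction and metric computation described above can be sketched as follows. The regular expression and function names are assumptions; in particular, how the authors handle outputs that do not follow the template is not stated, so this sketch simply returns `None` for them and leaves the policy to the caller.

```python
import re

# Assumed pattern: a Yes/No token following "Conclusion:" in the generated text.
CONCLUSION = re.compile(r"Conclusion:\s*(Yes|No)\b", re.IGNORECASE)


def parse_label(output):
    """Return 1 for Yes, 0 for No, or None when the template is not followed."""
    match = CONCLUSION.search(output)
    if match is None:
        return None
    return 1 if match.group(1).lower() == "yes" else 0


def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 with the positive class encoded as 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


outputs = [
    "Evidence: transfusion.\nReasoning: high acuity.\nConclusion: Yes",
    "Evidence: routine monitoring.\nConclusion: No",
    "Evidence: culture ordered.\nConclusion: Yes",
    "Evidence: stable vitals.\nConclusion: No",
]
y_pred = [parse_label(o) for o in outputs]
y_true = [1, 1, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.5, 0.5, 0.5)
```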

### 4\.4 Reasoning Evaluation

We quantitatively evaluated the reasoning quality of each model using GPT\-5\.4, Claude Sonnet 4\.5, and Gemini 2\.5 Pro as independent judges under a structured rubric\-based protocol \(Liu et al\., [2023](https://arxiv.org/html/2606.02802#bib.bib14); Zheng et al\., [2023](https://arxiv.org/html/2606.02802#bib.bib15); Croxford and others, [2025](https://arxiv.org/html/2606.02802#bib.bib20)\)\. Judges are instructed to assess the model’s generated explanation and predicted result against the task prompt, the refined clinical events, and the ground\-truth label\. The explanation contains clinically relevant evidence, step\-by\-step reasoning, and a final conclusion, and is scored on eight criteria using a 1–5 rating scale\.

##### Reasoning Quality\.

We assessed the quality of each generated rationale across six dimensions: \(1\) Evidence grounding assesses whether claims are supported by the provided refined clinical events\. \(2\) Clinical relevance measures whether the selected evidence is pertinent to the prediction task\. \(3\) Temporal reasoning captures whether the explanation reflects progression, stability, escalation, or de\-escalation over time\. \(4\) Clinical coherence evaluates whether the explanation follows clinically plausible reasoning\. \(5\) Completeness evaluates whether sufficient evidence and reasoning support the conclusion\. \(6\) Safety of overclaiming penalizes unsupported claims or clinically misleading statements \(Asgari et al\., [2025](https://arxiv.org/html/2606.02802#bib.bib21)\)\.

##### Reasoning Utility\.

We also assessed whether the explanation supports the final prediction in a clinically meaningful way via two criteria: \(1\) Outcome alignment measures whether the conclusion matches the ground\-truth label, while \(2\) Clinical usefulness reflects whether the explanation aids a clinician in interpreting the model prediction\.

Table 3: Per\-dimension LLM\-judge reasoning scores on the LOS task \(1–5 scale, averaged over GPT\-5\.4, Claude Sonnet 4\.5, and Gemini 2\.5 Pro judges\)\. The first six score columns are Reasoning Quality dimensions and the last two are Reasoning Utility criteria\. Column abbreviations: Evid\. Ground\.: Evidence Grounding; Clin\. Relev\.: Clinical Relevance; Temp\. Reason\.: Temporal Reasoning; Clin\. Coher\.: Clinical Coherence; Compl\.: Completeness; Outcome Align\.: Outcome Alignment; Clin\. Useful\.: Clinical Usefulness\.

| Method | Setting | Evid\. Ground\. | Clin\. Relev\. | Temp\. Reason\. | Clin\. Coher\. | Compl\. | Safety | Outcome Align\. | Clin\. Useful\. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama\-3\.1\-8B\-Instruct | zero\-shot | 3\.655 | 3\.823 | 1\.766 | 3\.088 | 2\.540 | 3\.148 | 2\.153 | 2\.268 |
| Llama\-3\.1\-8B\-Instruct | finetuned | 3\.383 | 3\.811 | 2\.885 | 2\.685 | 2\.764 | 2\.891 | 2\.303 | 2\.132 |
| BioMistral\-7B | zero\-shot | 2\.710 | 3\.055 | 1\.307 | 2\.043 | 1\.478 | 2\.443 | 1\.862 | 1\.474 |
| BioMistral\-7B | finetuned | 2\.132 | 3\.321 | 2\.640 | 2\.347 | 2\.363 | 2\.097 | 3\.038 | 1\.925 |
| DeepSeek\-R1\-14B | zero\-shot | 3\.638 | 4\.057 | 3\.160 | 3\.233 | 3\.263 | 2\.826 | 2\.170 | 2\.318 |
| DeepSeek\-R1\-14B | finetuned | 3\.281 | 3\.740 | 2\.797 | 2\.730 | 2\.646 | 2\.933 | 1\.942 | 2\.060 |
| ChatHealthAI | finetuned | 3\.463 | 4\.165 | 3\.577 | 3\.195 | 3\.379 | 2\.778 | 3\.566 | 2\.817 |

## 5 Results

### 5\.1 Main Prediction Results

Prompting\-only LLM baselines, especially in the zero\-shot setting, generally achieve high recall but low precision \(Table [2](https://arxiv.org/html/2606.02802#S4.T2)\)\. This result indicates that prompting\-only LLMs tend to over\-predict positive outcomes when exposed to high\-risk clinical events such as severe diagnoses, medications, or procedures\. In some cases, zero\-shot models predict nearly all samples as positive, resulting in a poor precision\-recall trade\-off\. Instruction fine\-tuning mitigates this issue but remains inconsistent across tasks\. For example, the finetuned BioMistral\-7B improves precision on readmission prediction while substantially reducing recall\. In contrast, ChatHealthAI achieves a more balanced precision\-recall trade\-off and obtains the best F1 scores on all evaluated tasks among the compared methods\. These results suggest that incorporating longitudinal EHR representations helps capture latent temporal and clinical patterns that cannot be reliably inferred from refined clinical events alone, thereby improving clinical risk discrimination beyond prompting\-only LLMs\.
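The high-recall, low-precision pattern is what an always-positive classifier would produce. As a worked check, consider a model that labels every sample positive on a task with positive ratio $\pi$. Its precision equals $\pi$ and its recall equals 1, so

$$
F_1 = \frac{2 \cdot \pi \cdot 1}{\pi + 1} = \frac{2\pi}{1+\pi}.
$$

Taking the LOS positive ratio of 25.3% from Table 1 as an approximation of the test-set ratio (an assumption, since per-split ratios are not reported), this gives $F_1 = 0.506 / 1.253 \approx 0.404$, which is close to the zero-shot LOS F1 scores of 0.398 to 0.409 in Table 2.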

### 5\.2 Reasoning Ability

ChatHealthAI shows improved reasoning performance across multiple criteria, particularly in clinical relevance, temporal reasoning, completeness, outcome alignment, and clinical usefulness \(Table[3](https://arxiv.org/html/2606.02802#S4.T3)\)\. These improvements also lead to the highest overall reasoning performance, including average Reasoning Quality \(RQ\) score across six reasoning dimensions and average Reasoning Utility \(RU\) score across two reasoning utility criteria \(Fig\.[3](https://arxiv.org/html/2606.02802#S5.F3)\)\.
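The aggregate scores are plain averages over the per-dimension judge scores. The sketch below computes them for one row of Table 3 (DeepSeek-R1-14B, zero-shot); the dictionary keys and function name are illustrative, and the resulting averages are derived here by arithmetic rather than quoted from the paper.

```python
from statistics import mean

QUALITY_DIMS = [
    "evidence_grounding", "clinical_relevance", "temporal_reasoning",
    "clinical_coherence", "completeness", "safety",
]
UTILITY_DIMS = ["outcome_alignment", "clinical_usefulness"]


def summarize(scores):
    """Return (RQ, RU): unweighted means over the quality and utility criteria."""
    rq = mean(scores[k] for k in QUALITY_DIMS)
    ru = mean(scores[k] for k in UTILITY_DIMS)
    return rq, ru


# DeepSeek-R1-14B zero-shot row from Table 3 (already averaged over the three judges).
deepseek_zero_shot = {
    "evidence_grounding": 3.638, "clinical_relevance": 4.057,
    "temporal_reasoning": 3.160, "clinical_coherence": 3.233,
    "completeness": 3.263, "safety": 2.826,
    "outcome_alignment": 2.170, "clinical_usefulness": 2.318,
}
rq, ru = summarize(deepseek_zero_shot)
print(round(rq, 3), round(ru, 3))  # 3.363 2.244
```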

![Refer to caption](https://arxiv.org/html/2606.02802v1/bar_chart.png)

Figure 3: Average LLM\-judge evaluation results on the length\-of\-stay prediction task\. ChatHealthAI achieves the highest reasoning quality \(RQ\), reasoning utility \(RU\), and overall score among all compared baselines\.

Directly finetuning DeepSeek\-R1\-14B generally leads to lower reasoning quality and reasoning utility than the zero\-shot baseline\. This suggests that instruction finetuning alone may encourage the model to rely on isolated clinical events or salient medical terms, while failing to consistently capture the patient’s longitudinal EHR trajectory\.

In contrast, ChatHealthAI shows consistent improvement in both Reasoning Quality and Reasoning Utility\. Its gains in temporal reasoning suggest that aligned longitudinal EHR representations help the model reason more effectively about clinical events over time\. This improvement is particularly important for longitudinal clinical prediction tasks, where the significance of clinical events often depends on their temporal progression and relationships with other events\. ChatHealthAI also improves completeness and clinical relevance, indicating that longitudinal EHR priors help generate explanations that are more grounded in the patient\-specific clinical trajectory\. Overall, these results suggest that longitudinal EHR representations provide useful temporal signals for improving clinical reasoning quality and utility beyond prompting\-only LLMs\.

### 5\.3 Ablation Study

Table 4: Ablation results on the LOS task\.

| Variant | P | R | F1 | RQ | RU |
| --- | --- | --- | --- | --- | --- |
| Full ChatHealthAI | 0\.442 | 0\.551 | 0\.491 | 3\.665 | 3\.425 |
| Random CLMBR | 0\.361 | 0\.350 | 0\.355 | 3\.426 | 3\.192 |
| w/ linear proj\. | N/A | N/A | N/A | 1\.733 | 1\.726 |

Using the LOS task as an example, replacing CLMBR\-T\-Base embeddings with random embeddings reduces the F1 score from 0\.491 to 0\.355 \(Table [4](https://arxiv.org/html/2606.02802#S5.T4)\), indicating that the clinical priors learned by CLMBR\-T\-Base provide essential predictive information that cannot be recovered from refined clinical events alone\. As expected, when CLMBR\-T\-Base embeddings are replaced with random vectors, the reasoning quality and clinical utility scores remain relatively high\. This is because the reasoning quality is mainly supported by the frozen LLM and the refined clinical events, whereas latent EHR representations are more important for prediction discrimination\. However, compared with the zero\-shot and finetuned DeepSeek\-R1\-14B backbone used in ChatHealthAI, the full ChatHealthAI framework has a more balanced precision\-recall trade\-off and improved reasoning quality \(RQ\) and reasoning utility \(RU\)\. This observation suggests that, even without meaningful longitudinal EHR embeddings, the resampler can still learn to calibrate reasoning behavior and improve prediction\. In contrast, replacing the resampler with a simple linear projection layer causes the model to score very low on both RQ and RU, and results in invalid structured outputs that cannot be used to compute classification metrics\. This indicates that a linear layer does not compress and align CLMBR\-T\-Base embeddings into semantically meaningful latent tokens for the frozen LLM\. These results show that the resampler is not merely a dimensional projection module, but a critical alignment module for connecting structured EHR representations with LLM\-based reasoning\.

![Refer to caption](https://arxiv.org/html/2606.02802v1/case_study_both.png)

Figure 4: Case studies of ChatHealthAI\-generated clinical reasoning\. \(a\) A positive case involving immunosuppressive therapy and cardiometabolic instability\. \(b\) A negative case involving routine postpartum recovery\.

## 6 Case Studies

A representative positive LOS case involves immunosuppressive therapy and broad\-spectrum therapy, followed by transfusion\-related care, infection\-related testing, and continued cardiometabolic monitoring across different stages of the 48\-hour EHR timeline \(Figure[4](https://arxiv.org/html/2606.02802#S5.F4)a\)\. ChatHealthAI organizes these events into a structured explanation\. It first identifies early immunosuppressive therapy and transfusion as evidence of high\-acuity inpatient management\. It then connects these findings with later MRSA screening, bacterial culture, HLA compatibility testing, and ongoing vital\-sign monitoring, suggesting continued complication risk and persistent clinical instability\. Together, these temporally distributed findings suggest sustained treatment burden and failure to achieve early clinical stabilization\. Finally, the model combines these clinical signals with the patient’s age of 60 years to support the conclusion that the hospitalization is likely to exceed 7 days\. The case illustrates the intended behavior of ChatHealthAI by showing how it links multiple clinical events across the longitudinal timeline and produces a reasoning chain consistent with the final prediction\.

A representative negative LOS case reflects routine postpartum recovery without escalation to intensive treatment \(Figure[4](https://arxiv.org/html/2606.02802#S5.F4)b\)\. Although the patient received active delivery\-related management during the early stage of hospitalization, the later clinical events consisted mainly of routine postpartum monitoring, blood work, and standard supportive care\. ChatHealthAI correctly identifies the patient’s transition from active delivery management to stable postpartum recovery and concludes that this hospitalization is unlikely to exceed 7 days\. This example further demonstrates that ChatHealthAI can distinguish between temporary intensive treatment and persistent clinical instability\.

## 7 Conclusion

We developed ChatHealthAI, a multimodal clinical reasoning framework that aligns a structured longitudinal EHR representation with a frozen LLM to generate grounded clinical predictions and reasoning\. Across three EHRSHOT prediction tasks, ChatHealthAI achieves a more balanced precision\-recall trade\-off and obtains the best F1 score on LOS and ICU admission prediction\. Reasoning evaluation demonstrates that ChatHealthAI achieves the best overall Reasoning Quality and Reasoning Utility scores\. Our ablation study shows that the resampler is an essential alignment module, enabling the frozen LLM to effectively leverage structured EHR representations for reasoning and prediction\. Overall, ChatHealthAI demonstrates the potential of integrating EHR foundation models with LLMs to build clinical prediction systems that are both predictive and interpretable\.

## Limitations

Our proposed ChatHealthAI has several limitations\. First, ChatHealthAI is evaluated only on the EHRSHOT benchmark, and its generalizability to other EHR datasets and healthcare systems remains uncertain\. Second, reasoning quality is evaluated primarily using LLM\-based judges rather than expert human clinicians\. Although recent studies suggest that LLM judges can provide relatively consistent evaluation, they may not fully reflect real\-world clinical preferences or expert\-level reasoning assessment\. Third, the reasoning supervision used for instruction finetuning is generated by a teacher LLM rather than expert clinicians\. The generated explanations may still contain factual inaccuracies, incomplete reasoning, or clinically imperfect interpretations\. Finally, limitations in framework compatibility between CLMBR\-T\-Base and newer open\-source LLM architectures currently restrict evaluation on more recent frozen LLM backbones\.

## References

- J\. Alayrac, J\. Donahue, P\. Luc, A\. Miech, I\. Barr, Y\. Hasson, K\. Lenc, A\. Mensch, K\. Millican, M\. Reynolds, R\. Ring, E\. Rutherford, S\. Cabi, T\. Han, Z\. Gong, S\. Samangooei, M\. Monteiro, J\. Menick, S\. Borgeaud, A\. Brock, A\. Nematzadeh, S\. Sharifzadeh, M\. Binkowski, R\. Barreira, O\. Vinyals, A\. Zisserman, and K\. Simonyan \(2022\) Flamingo: a visual language model for few\-shot learning\. External Links: 2204\.14198, [Link](https://arxiv.org/abs/2204.14198) Cited by: [§1](https://arxiv.org/html/2606.02802#S1.p2.1), [§2](https://arxiv.org/html/2606.02802#S2.SS0.SSS0.Px2.p1.1), [§3\.3](https://arxiv.org/html/2606.02802#S3.SS3.p3.2)\.
- E\. Asgari, N\. Montaña\-Brown, M\. Dubois, et al\. \(2025\) A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation\. npj Digital Medicine 8\(1\), pp\. 274\. External Links: [Document](https://dx.doi.org/10.1038/s41746-025-01670-7), [Link](https://www.nature.com/articles/s41746-025-01670-7) Cited by: [§4\.4](https://arxiv.org/html/2606.02802#S4.SS4.SSS0.Px1.p1.1)\.
- E\. Choi, M\. T\. Bahadori, J\. Sun, J\. Kulas, A\. Schuetz, and W\. Stewart \(2016\) RETAIN: an interpretable predictive model for healthcare using reverse time attention mechanism\. In Advances in Neural Information Processing Systems, Vol\. 29\. External Links: [Link](https://proceedings.neurips.cc/paper/2016/hash/231141b34c82aa95e48810a9d1b33a79-Abstract.html) Cited by: [§1](https://arxiv.org/html/2606.02802#S1.p1.1)\.
- E\. Croxford et al\. \(2025\) Automating evaluation of AI text generation in healthcare with a large language model\. medRxiv\. External Links: [Document](https://dx.doi.org/10.1101/2025.04.22.25326219), [Link](https://www.medrxiv.org/content/10.1101/2025.04.22.25326219v1) Cited by: [§4\.4](https://arxiv.org/html/2606.02802#S4.SS4.p1.1)\.
- C\. Hsieh, C\. Li, C\. Yeh, H\. Nakhost, Y\. Fujii, A\. Ratner, R\. Krishna, C\. Lee, and T\. Pfister \(2023\) Distilling step\-by\-step\! Outperforming larger language models with less training data and smaller model sizes\. In Findings of the Association for Computational Linguistics: ACL 2023, pp\. 8003–8017\. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.507), [Link](https://aclanthology.org/2023.findings-acl.507) Cited by: [§3\.4](https://arxiv.org/html/2606.02802#S3.SS4.p2.1)\.
- A\. Jaegle, F\. Gimeno, A\. Brock, A\. Zisserman, O\. Vinyals, and J\. Carreira \(2021\) Perceiver: general perception with iterative attention\. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 139, pp\. 4651–4664\. External Links: [Link](https://proceedings.mlr.press/v139/jaegle21a.html) Cited by: [§3\.3](https://arxiv.org/html/2606.02802#S3.SS3.p3.2)\.
- S\. M\. Lauritsen, M\. Kristensen, M\. V\. Olsen, M\. S\. Larsen, K\. M\. Lauritsen, M\. J\. Jørgensen, J\. Lange, and B\. Thiesson \(2019\) Explainable artificial intelligence model to predict acute critical illness from electronic health records\. arXiv preprint arXiv:1912\.01266\. Cited by: [§1](https://arxiv.org/html/2606.02802#S1.p1.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela \(2020\) Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA\. External Links: ISBN 9781713829546 Cited by: [§3\.2](https://arxiv.org/html/2606.02802#S3.SS2.p2.1)\.
- Y\. Li, S\. Rao, J\. R\. A\. Solares, A\. Hassaine, R\. Ramakrishnan, D\. Canoy, Y\. Zhu, K\. Rahimi, and G\. Salimi\-Khorshidi \(2020\) BEHRT: transformer for electronic health records\. Scientific Reports 10\(1\), pp\. 7155\. External Links: [Document](https://dx.doi.org/10.1038/s41598-020-62922-y) Cited by: [§1](https://arxiv.org/html/2606.02802#S1.p2.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023\) G\-Eval: NLG evaluation using GPT\-4 with better human alignment\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 2511–2522\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.153/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.153) Cited by: [§4\.4](https://arxiv.org/html/2606.02802#S4.SS4.p1.1)\.
- C\. Pang, X\. Jiang, K\. S\. Kalluri, M\. Spotnitz, R\. Chen, A\. Perotte, and K\. Natarajan \(2021\) CEHR\-BERT: incorporating temporal information from structured EHR data to improve prediction tasks\. In Proceedings of Machine Learning for Health, pp\. 239–260\. Cited by: [§1](https://arxiv.org/html/2606.02802#S1.p1.1), [§2](https://arxiv.org/html/2606.02802#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Rasmy, Y\. Xiang, Z\. Xie, C\. Tao, and D\. Zhi \(2021\) Med\-BERT: pretrained contextualized embeddings on large\-scale structured electronic health records for disease prediction\. npj Digital Medicine 4\(1\), pp\. 86\. External Links: [Document](https://dx.doi.org/10.1038/s41746-021-00455-y) Cited by: [§1](https://arxiv.org/html/2606.02802#S1.p2.1), [§2](https://arxiv.org/html/2606.02802#S2.SS0.SSS0.Px1.p1.1)\.
- G\. Richard, B\. P\. de Almeida, H\. Dalla\-Torre, C\. Blum, L\. Hexemer, P\. Pandey, S\. Laurent, M\. Lopez, A\. Laterre, M\. Lang, U\. Şahin, K\. Beguir, and T\. Pierrot \(2024\) ChatNT: a multimodal conversational agent for DNA, RNA and protein tasks\. bioRxiv\. External Links: [Document](https://dx.doi.org/10.1101/2024.04.30.591835), [Link](https://www.biorxiv.org/content/early/2024/05/02/2024.04.30.591835), https://www\.biorxiv\.org/content/early/2024/05/02/2024\.04\.30\.591835\.full\.pdf Cited by: [§1](https://arxiv.org/html/2606.02802#S1.p2.1), [§1](https://arxiv.org/html/2606.02802#S1.p3.1), [§2](https://arxiv.org/html/2606.02802#S2.SS0.SSS0.Px2.p1.1), [§3\.3](https://arxiv.org/html/2606.02802#S3.SS3.p3.2)\.
- K\. Singhal, S\. Azizi, T\. Tu, et al\. \(2023\) Large language models encode clinical knowledge\. Nature 620, pp\. 172–180\. External Links: [Document](https://dx.doi.org/10.1038/s41586-023-06291-2) Cited by: [§1](https://arxiv.org/html/2606.02802#S1.p1.1), [§2](https://arxiv.org/html/2606.02802#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, L\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. In Advances in Neural Information Processing Systems, Vol\. 30\. External Links: [Link](https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html) Cited by: [§3\.3](https://arxiv.org/html/2606.02802#S3.SS3.p3.2)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In Advances in Neural Information Processing Systems, Vol\. 35, pp\. 24824–24837\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html) Cited by: [§3\.4](https://arxiv.org/html/2606.02802#S3.SS4.p2.1)\.
- M\. Wornow, R\. Thapa, E\. Steinberg, J\. Fries, and N\. Shah \(2023\) EHRSHOT: an EHR benchmark for few\-shot evaluation of foundation models\. In Advances in Neural Information Processing Systems, A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\), Vol\. 36, pp\. 67125–67137\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/d42db1f74df54cb992b3956eb7f15a6f-Paper-Datasets_and_Benchmarks.pdf) Cited by: [§1](https://arxiv.org/html/2606.02802#S1.p2.1), [§1](https://arxiv.org/html/2606.02802#S1.p3.1), [§2](https://arxiv.org/html/2606.02802#S2.SS0.SSS0.Px1.p1.1), [§4\.1](https://arxiv.org/html/2606.02802#S4.SS1.p1.1)\.
- Z\. Wu, A\. Dadu, M\. Nalls, F\. Faghri, and J\. Sun \(2024\) Instruction tuning large language models to understand electronic health records\. In Advances in Neural Information Processing Systems\. Cited by: [§1](https://arxiv.org/html/2606.02802#S1.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\) Judging LLM\-as\-a\-judge with MT\-Bench and Chatbot Arena\. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA\. Cited by: [§4\.4](https://arxiv.org/html/2606.02802#S4.SS4.p1.1)\.

## Appendix A

### A\.1 Prompt Templates

#### A\.1\.1 Clinical Event Refinement Prompt

You are selecting clinically important events for downstream reasoning\.

Input: Candidate temporal EHR chunks retrieved by retrieval\-augmented retrieval\.

Task: From the input chunks, select a compact but balanced set of clinically important events that best represents the patient’s clinical course around prediction time\.

Strict constraints

- Output at most 30 events total\.
- Preserve a balanced view of the case\.
- Include events suggesting:
  - high complexity, severe illness, invasive care, or major intervention
  - routine care, stable course, uncomplicated recovery, or lower clinical intensity

Selection rules

- Prefer events that characterize the overall clinical course\.
- Prefer diversity over redundancy\.
- Preserve diagnoses, procedures, medications, care setting, monitoring, imaging, and clinically meaningful notes\.
- Keep routine or low\-intensity events if they help distinguish routine versus complex clinical courses\.
- Remove only exact duplicates or clearly redundant repetitions\.
- Do not bias toward severe or critical events only\.

Event rules

- Only use event names already appearing in the input\.
- Do not paraphrase event names\.
- Do not merge events\.
- Preserve original timestamps\.
- Group events under their original timestamps\.
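Several of the constraints in this prompt (the 30\-event cap, reuse of input event names, preserved timestamps, no exact duplicates) can be checked mechanically once the LLM has responded. The following is a minimal sketch of such a check, not the authors' code. It assumes, hypothetically, that both the candidate chunks and the refined output are available as a mapping from timestamp string to a list of event names; the function name `check_refinement` and this data layout are illustrative only.

```python
from typing import Dict, List

MAX_EVENTS = 30  # "Output at most 30 events total" in the prompt


def check_refinement(
    candidates: Dict[str, List[str]],
    refined: Dict[str, List[str]],
    max_events: int = MAX_EVENTS,
) -> List[str]:
    """Return a list of constraint violations (empty if none are found).

    Both arguments map a timestamp string to the event names recorded at
    that timestamp. This layout is an assumption made for the sketch.
    """
    problems: List[str] = []

    # Constraint: bounded output size.
    total = sum(len(events) for events in refined.values())
    if total > max_events:
        problems.append(f"too many events: {total} > {max_events}")

    for timestamp, events in refined.items():
        # Constraint: original timestamps must be preserved.
        if timestamp not in candidates:
            problems.append(f"unknown timestamp: {timestamp}")
            continue
        allowed = set(candidates[timestamp])
        # Constraint: event names must appear verbatim in the input.
        for name in events:
            if name not in allowed:
                problems.append(f"event not in input at {timestamp}: {name}")
        # Constraint: exact duplicates should have been removed.
        if len(set(events)) != len(events):
            problems.append(f"exact duplicate at {timestamp}")
    return problems
```

Softer requirements, such as preserving a balanced view of the case, are judgment calls that a check of this kind does not cover.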

#### A\.1\.2 Reasoning Supervision Generation Prompt

You are generating reasoning supervision for a hospital length\-of\-stay task\.

Task:

- The question is whether this hospital visit will have a total length of stay of at least 7 days\.
- The label is already known and must be followed\.
- The goal is to generate structured reasoning supervision consistent with the known label\.

Output format

The generated answer must follow the structure:

- Evidence: 1\) … 2\) …
- Reasoning: Step 1: Using Evidence … Step 2: Using Evidence …
- Conclusion: Yes/No, …

Generation rules

- Use only information supported by the provided clinical events\.
- Do not hallucinate diagnoses, procedures, medications, or complications\.
- Each reasoning step must explicitly reference Evidence items\.
- Temporal progression, persistence, escalation, or lack of improvement must be incorporated into the reasoning process\.
- Avoid unsupported severity claims\.
- Use cautious clinical language such as “suggests”, “consistent with”, or “likely” when appropriate\.
- Conclusions must be based on treatment intensity, monitoring burden, procedures, and overall clinical complexity rather than explicit discharge timing\.
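Because the output format is fixed (Evidence, Reasoning, Conclusion), a generated explanation can be split into its parts and checked for the rule that every reasoning step cites an Evidence item. The sketch below is an illustration under assumptions, not the authors' pipeline: the names `parse_explanation` and `ungrounded_steps`, the regular expressions, and the sample text are all hypothetical, and only citations written as "Evidence <number>" are recognised.

```python
import re
from typing import List, NamedTuple


class Explanation(NamedTuple):
    evidence_ids: List[int]  # numbers of the "1) ...", "2) ..." items
    steps: List[str]         # text of each "Step k: ..." entry
    label: str               # "Yes" or "No" from the Conclusion line


SECTIONS = re.compile(
    r"Evidence:(?P<evidence>.*?)Reasoning:(?P<reasoning>.*?)Conclusion:(?P<conclusion>.*)",
    re.DOTALL,
)


def parse_explanation(text: str) -> Explanation:
    """Split an explanation into evidence ids, reasoning steps, and label."""
    match = SECTIONS.search(text)
    if match is None:
        raise ValueError("expected Evidence, Reasoning, and Conclusion sections")
    evidence_ids = [int(n) for n in re.findall(r"(?:^|\s)(\d+)\)", match["evidence"])]
    steps = [s.strip() for s in re.split(r"Step\s+\d+:", match["reasoning"]) if s.strip()]
    label = re.match(r"\s*(Yes|No)\b", match["conclusion"])
    if label is None:
        raise ValueError("Conclusion must start with Yes or No")
    return Explanation(evidence_ids, steps, label.group(1))


def ungrounded_steps(explanation: Explanation) -> List[int]:
    """Return 1-based indices of steps that cite no valid Evidence item."""
    known = set(explanation.evidence_ids)
    bad = []
    for index, step in enumerate(explanation.steps, start=1):
        cited = [int(n) for n in re.findall(r"Evidence\s+(\d+)", step)]
        if not cited or any(n not in known for n in cited):
            bad.append(index)
    return bad


# Hypothetical explanation text, written only to exercise the parser.
sample = (
    "Evidence: 1) Transfusion on day 1 2) MRSA screening on day 2\n"
    "Reasoning:\n"
    "Step 1: Using Evidence 1, transfusion suggests high-acuity care.\n"
    "Step 2: Using Evidence 2, continued testing suggests ongoing risk.\n"
    "Conclusion: Yes, the stay is likely to reach 7 days."
)
parsed = parse_explanation(sample)
print(parsed.label, ungrounded_steps(parsed))
```

A check like this only confirms that citations are present and point at existing items; whether the cited evidence actually supports the step remains a matter for the judge or a human reviewer.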

#### A\.1\.3 LLM Judge Prompt

You are evaluating a structured clinical reasoning explanation generated by an AI model\.

You will be given:

1. Task instruction\.
2. Patient clinical events\.
3. Ground\-truth label\.
4. Model prediction\.
5. Model\-generated explanation\.

Expected structure:

1. Evidence: a numbered list of clinical evidence items\.
2. Reasoning: step\-by\-step reasoning that refers back to the evidence\.
3. Conclusion: a final Yes/No prediction with a short explanation\.

Evaluate all dimensions on a 1–5 Likert scale\.

General scoring guide

- 1 = Poor: mostly incorrect, unsupported, incoherent, or misleading\.
- 2 = Weak: contains major problems, but a small portion is acceptable\.
- 3 = Fair: partially correct and usable, but with clear limitations\.
- 4 = Good: mostly correct, grounded, and useful, with minor issues\.
- 5 = Excellent: fully grounded, clinically coherent, complete, and useful\.

Dimensions

1. evidence\_grounding: Does each Evidence item and major reasoning claim come from the provided clinical events or patient context?
   - 1 = mostly hallucinated or unsupported\.
   - 2 = several important unsupported claims\.
   - 3 = partially grounded, but some claims are vague or weakly supported\.
   - 4 = mostly grounded, with only minor unsupported or vague claims\.
   - 5 = all major claims are clearly grounded in the provided events\.
2. clinical\_relevance: Is the selected evidence relevant to the task instruction?
   - 1 = mostly irrelevant evidence\.
   - 2 = limited relevance; many selected events do not help answer the task\.
   - 3 = mixed relevant and irrelevant evidence\.
   - 4 = mostly relevant evidence, with minor irrelevant details\.
   - 5 = highly relevant evidence that directly supports the clinical prediction\.
3. temporal\_reasoning: Does the explanation correctly reason over the event timeline?
   - 1 = ignores or misuses temporal order\.
   - 2 = mentions time but does not use it meaningfully\.
   - 3 = partially uses temporal sequence\.
   - 4 = mostly uses temporal progression correctly\.
   - 5 = clearly reasons over progression, escalation, de\-escalation, or stability over time\.
4. clinical\_coherence: Is the reasoning medically plausible and internally consistent?
   - 1 = clinically incoherent or contradictory\.
   - 2 = weak clinical logic with major gaps\.
   - 3 = partially coherent but unclear or incomplete\.
   - 4 = mostly coherent and medically plausible\.
   - 5 = clinically coherent, plausible, and internally consistent\.
5. completeness: Does the explanation provide enough evidence to justify the conclusion?
   - 1 = insufficient explanation\.
   - 2 = missing several important pieces of evidence\.
   - 3 = partially sufficient but incomplete\.
   - 4 = mostly complete with minor omissions\.
   - 5 = complete and well\-supported\.
6. safety\_overclaiming: Does the explanation avoid unsupported severity claims, diagnoses, causal claims, risk claims, or misleading wording?
   - 1 = frequent overclaiming or potentially unsafe claims\.
   - 2 = several unsupported severity or causal claims\.
   - 3 = some overclaiming or unsupported wording\.
   - 4 = mostly cautious, with minor wording issues\.
   - 5 = cautious, appropriately qualified, and not misleading\.
7. outcome\_alignment: Does the final prediction and explanation align with the ground\-truth label?
   - 1 = prediction/conclusion is wrong or explanation supports the wrong outcome\.
   - 2 = mostly misaligned with the correct outcome\.
   - 3 = partially aligned but ambiguous or weak\.
   - 4 = mostly aligned with the correct outcome\.
   - 5 = clearly aligned with the correct outcome\.
8. clinical\_usefulness: Would this explanation help a clinician understand the prediction?
   - 1 = misleading, unsafe, or clinically unhelpful\.
   - 2 = weak usefulness; may confuse the reader\.
   - 3 = somewhat useful but incomplete or partly misleading\.
   - 4 = clinically useful with minor limitations\.
   - 5 = highly useful, well grounded, and decision\-relevant\.

Task\-specific rules

- Use the task instruction to judge whether evidence is relevant\.
- For length\-of\-stay prediction, relevant evidence should reflect care intensity, clinical complexity, escalation or de\-escalation, procedures, monitoring burden, medication burden, complications, or stability\.
- Do not over\-reward routine vitals, routine labs, isolated medications, or copied event lists unless the explanation clearly connects them to the task\.
- Penalize unsupported claims such as “severe”, “high\-acuity”, “routine”, “low\-risk”, “major complication”, “no complication”, “prolonged QT”, or “organ failure” unless the provided events support that interpretation\.
- Absence\-based claims such as “no escalation”, “no organ failure”, or “no major complication” are allowed only when stated cautiously and based on the observed event timeline\.
- Patient age, prediction time, or demographic information may be used only if explicitly provided in the input\.
- Penalize reasoning steps that cite Evidence numbers but draw conclusions not supported by those evidence items\.
- Penalize the explanation if the Conclusion line is missing, inconsistent with the model prediction, or inconsistent with the preceding reasoning\.

Important rules

- Use only the provided clinical events and patient context as evidence\.
- Do not reward fluent writing if it is not grounded\.
- Do not reward a correct prediction if the explanation is unsupported\.
- If the model prediction is wrong, outcome\_alignment should be low\.
- If the explanation supports the wrong clinical conclusion, clinical\_usefulness should be low\.
- Return only valid JSON\.
- Do not wrap JSON in markdown\.
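The judge is asked to return only valid JSON with scores on a 1–5 scale, so its output can be validated before any aggregation. The sketch below shows one way this might be done. The prompt does not spell out the JSON schema, so a flat object keyed by the eight dimension names is an assumption, as are the function name `parse_judge_output` and the example scores. The plain mean computed at the end is for illustration only and is not the paper's Reasoning Quality or Reasoning Utility aggregation.

```python
import json
from statistics import mean
from typing import Dict

# The eight dimension names listed in the judge prompt.
DIMENSIONS = (
    "evidence_grounding",
    "clinical_relevance",
    "temporal_reasoning",
    "clinical_coherence",
    "completeness",
    "safety_overclaiming",
    "outcome_alignment",
    "clinical_usefulness",
)


def parse_judge_output(raw: str) -> Dict[str, int]:
    """Parse judge output and validate every dimension score.

    Assumes a flat JSON object such as {"evidence_grounding": 4, ...}.
    Raises ValueError for invalid JSON (including markdown-wrapped JSON),
    missing dimensions, or scores outside the 1-5 Likert range.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    scores: Dict[str, int] = {}
    for name in DIMENSIONS:
        if name not in data:
            raise ValueError(f"missing dimension: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
        scores[name] = value
    return scores


# Hypothetical judge response, used only to exercise the validator.
example = json.dumps({name: 4 for name in DIMENSIONS})
example_scores = parse_judge_output(example)
print(mean(example_scores.values()))
```

Rejecting malformed responses instead of repairing them keeps scoring failures visible, which may matter when comparing systems on small per-task sample sizes.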

### A\.2 Artifact and Data Usage

EHRSHOT is used under its corresponding data use agreement and access restrictions for research purposes only\. ChatHealthAI is intended solely for research and experimental evaluation, and this work does not publicly release processed patient data or unrestricted clinical artifacts\.

### A\.3 Computational Resources

ChatHealthAI uses CLMBR\-T\-Base together with DeepSeek\-R1\-Distill\-Qwen\-14B as the frozen LLM backbone\. Clinical event refinement and teacher\-generated reasoning supervision are produced using GPT\-oss\-120B\. Experiments were conducted on NVIDIA RTX 6000 Pro Blackwell 96GB GPUs\.

### A\.4 Implementation Details

Our implementation is based on PyTorch and HuggingFace Transformers\. CLMBR\-T\-Base is loaded through the FEMR framework\. Evaluation metrics are computed using scikit\-learn\. Training uses the AdamW optimizer with learning rate $1\times 10^{-5}$ and weight decay $0.01$\. Mixed\-precision bfloat16 training is used\.
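As a rough illustration of these settings, the sketch below wires AdamW (learning rate 1e-5, weight decay 0.01) and bfloat16 autocast into a single PyTorch training step. It is not the authors' training script: the tiny `nn.Linear` model, the random tensors, and the CPU device are hypothetical stand-ins chosen so the example runs without a GPU or the actual ChatHealthAI modules.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the trainable modules; not the paper's model.
model = nn.Linear(16, 2)

# Optimizer settings as stated in the implementation details.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

# Random placeholder batch: 4 examples, 16 features, binary labels.
inputs = torch.randn(4, 16)
targets = torch.randint(0, 2, (4,))

# bfloat16 mixed precision; device_type="cpu" keeps the sketch GPU-free.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(inputs)

# Compute the loss in float32 outside the autocast region.
loss = nn.functional.cross_entropy(logits.float(), targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

On a GPU the same pattern would use `device_type="cuda"`; bfloat16 autocast typically does not need a gradient scaler, unlike float16.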
