TreeText-CTS: Compact, Source-Traceable Tree-Path Evidence for Irregular Clinical Time-Series Prediction

arXiv cs.LG Papers

Summary

Introduces TreeText-CTS, a method that converts irregular EHR trajectories into compact, source-traceable tree-path evidence units without patient-level summarization. Achieves state-of-the-art AUROC and AUPRC among text-based EHR time-series interfaces on three clinical benchmarks.

arXiv:2605.20292v1 Announce Type: new Abstract: Numerical time-series models can effectively process irregular electronic health record (EHR) trajectories, but they do not naturally expose the measurements and temporal patterns supporting each risk estimate as readable evidence. Existing text-based interfaces improve readability, but typically rely on either raw serialization, which is lengthy and redundant, or patient-level free-form summaries, which are difficult to trace to source measurements and time windows. To bridge this gap, we introduce TreeText-CTS (Clinical Time-Series), which converts irregular EHR trajectories into human-readable, compact, source-traceable tree-path evidence units without patient-level summarization or inference-time autoregressive decoding. TreeText-CTS routes multi-scale window summaries through frozen XGBoost models and verbalizes activated tree paths as deterministic, source-traceable evidence units composed of threshold conditions. An evidence selector assembles an informative subset of these units, which a language-model encoder then integrates for prediction. Across PhysioNet 2012 mortality, MIMIC-III mortality, and PhysioNet 2019 sepsis-onset forecasting, TreeText-CTS achieves the best AUROC and AUPRC among evaluated text-based EHR time-series interfaces, improving AUPRC by 6.0 to 9.7 absolute percentage points over the strongest prior text-based interface while remaining competitive with numerical time-series models. Ablations show that tree-path evidence construction, evidence selection, and language-model composition each contribute to performance. Because every span passed to the language-model encoder is constructed from activated tree-path threshold conditions, TreeText-CTS makes the evidence supplied to the final predictor inspectable and source-traceable.


# TreeText-CTS: Compact, Source-Traceable Tree-Path Evidence for Irregular Clinical Time-Series Prediction
Source: [https://arxiv.org/html/2605.20292](https://arxiv.org/html/2605.20292)
Kwanhyung Lee¹,², Juhwan Choi², Jongheon Kim¹, Joohyung Lee², Hyeongwon Jang¹, Eunho Yang¹

¹Kim Jaechul Graduate School of AI, KAIST; ²AITRICS

kwanlee9209@kaist\.ac\.kr

###### Abstract

Numerical time\-series models can effectively process irregular electronic health record \(EHR\) trajectories, but they do not naturally expose the measurements and temporal patterns supporting each risk estimate as readable evidence\. Existing text\-based interfaces improve readability, but typically rely on either raw serialization, which is lengthy and redundant, or patient\-level free\-form summaries, which are difficult to trace to source measurements and time windows\. To bridge this gap, we introduce TreeText\-CTS \(Clinical Time\-Series\), which converts irregular EHR trajectories into human\-readable, compact, source\-traceable tree\-path evidence units without patient\-level summarization or inference\-time autoregressive decoding\. TreeText\-CTS routes multi\-scale window summaries through frozen XGBoost models and verbalizes activated tree paths as deterministic, source\-traceable evidence units composed of threshold conditions\. An evidence selector assembles an informative subset of these units, which a language\-model encoder then integrates for prediction\. Across PhysioNet 2012 mortality, MIMIC\-III mortality, and PhysioNet 2019 sepsis\-onset forecasting, TreeText\-CTS achieves the best AUROC and AUPRC among evaluated text\-based EHR time\-series interfaces, improving AUPRC by 6\.0 to 9\.7 absolute percentage points over the strongest prior text\-based interface while remaining competitive with numerical time\-series models\. Ablations show that tree\-path evidence construction, evidence selection, and language\-model composition each contribute to performance\. Because every span passed to the language\-model encoder is constructed from activated tree\-path threshold conditions, TreeText\-CTS makes the evidence supplied to the final predictor inspectable and source\-traceable\.

## 1 Introduction

Large language models \(LLMs\) have emerged as broadly useful foundation models across domains\. Trained on large\-scale corpora, they encode broad prior knowledge that can be especially valuable in healthcare, where labeled data are limited, costly, and difficult to share\. Moreover, their language outputs also provide a natural medium for reasoning, evidence presentation, and human inspection\. These properties motivate LLM use in healthcare prediction, where models should support both accuracy and interpretable decision\-making\. Recent EHR modeling efforts reflect this trend through instruction\-tuned EHR interfaces\[[26](https://arxiv.org/html/2605.20292#bib.bib36)\], long\-context clinical modeling\[[25](https://arxiv.org/html/2605.20292#bib.bib37)\], knowledge\-augmented prediction and reasoning\[[8](https://arxiv.org/html/2605.20292#bib.bib38),[9](https://arxiv.org/html/2605.20292#bib.bib39)\], and LLM\-derived representations for irregular ICU time series\[[7](https://arxiv.org/html/2605.20292#bib.bib23)\]\.

Irregular EHR time\-series prediction, however, remains dominated by numerical models that directly ingest values, timestamps, masks, and missingness patterns\[[2](https://arxiv.org/html/2605.20292#bib.bib4),[19](https://arxiv.org/html/2605.20292#bib.bib5),[6](https://arxiv.org/html/2605.20292#bib.bib6),[22](https://arxiv.org/html/2605.20292#bib.bib8),[28](https://arxiv.org/html/2605.20292#bib.bib10),[14](https://arxiv.org/html/2605.20292#bib.bib11),[12](https://arxiv.org/html/2605.20292#bib.bib12)\]\. Existing LLM\-based EHR interfaces often underperform these predictors, suggesting that language\-based methods must preserve the structured signals numerical models exploit\. A central challenge is representational: raw EHR time series are not naturally linguistic, but consist of irregular values, timestamps, labs, vitals, interventions, missingness patterns, and static covariates\. Applying LLMs to EHR time series, therefore, requires an interface that converts structured observations into language while preserving information for accurate and inspectable prediction\.

Existing EHR\-to\-language interfaces fall into two broad groups\. Rule\-based approaches serialize observations with fixed templates, preserving source values and avoiding patient\-level generation\[[4](https://arxiv.org/html/2605.20292#bib.bib20),[3](https://arxiv.org/html/2605.20292#bib.bib21)\]\. They are deterministic, but often heuristic and exhaustive: they verbalize observations because they are present, not because they are compact evidence\. This yields long, redundant inputs that are difficult to verify\. Model\-based approaches instead use LLMs to summarize, contextualize, or embed patient trajectories\[[7](https://arxiv.org/html/2605.20292#bib.bib23),[13](https://arxiv.org/html/2605.20292#bib.bib22)\]\. They can improve readability, but may require costly patient\-level generation, raise deployment and privacy concerns, and require faithfulness checks because generated text can omit or distort source measurements\[[11](https://arxiv.org/html/2605.20292#bib.bib45),[1](https://arxiv.org/html/2605.20292#bib.bib30),[24](https://arxiv.org/html/2605.20292#bib.bib46)\]\. These limitations point to a missing interface: compact like a summary, deterministic like a rule\-based representation, and traceable to source EHR windows\.

We propose TreeText\-CTS, a tree\-grounded language interface for irregular EHR time\-series prediction\. TreeText\-CTS uses trees to discover prediction\-relevant thresholds and language to expose activated tree paths as readable evidence\. It summarizes trajectories over multiple look\-back windows, routes the summaries through fixed window\-specific XGBoost models, and renders activated root\-to\-leaf paths as deterministic threshold predicates, e\.g\., “minimum MAP is at most 65” or “last heart rate is higher than 110\.” Each predicate is linked to its source window, tree, and leaf\. A Compact Evidence Selector \(CES\) assembles a compact subset of tree\-path evidence units, and a clinical language model classifier composes them for final prediction\.

This design combines strengths of prior interfaces\. Like rule\-based serialization, the classifier input is deterministic, reproducible, and source\-traceable\. Like model\-based summarization, it is selective and readable rather than a raw observation list\. Unlike patient\-level free\-form summarization, TreeText\-CTS requires no inference\-time autoregressive decoding and keeps deterministic predicate text as the auditable evidence anchor; optional clinical glosses are cached offline to provide clinically meaningful context, but never replace tree\-derived predicates\. Unlike recent LLM\-tree methods that use LLMs to generate features or refine tree rules\[[15](https://arxiv.org/html/2605.20292#bib.bib16),[27](https://arxiv.org/html/2605.20292#bib.bib17)\], TreeText\-CTS keeps the tree inventory fixed and uses a language encoder only after deterministic tree\-path evidence construction\.

##### Contributions\.

Our contributions are threefold\.

- Tree\-path evidence construction for irregular EHR time series\. We convert multi\-scale EHR window summaries into activated paths in fixed XGBoost models and render each path as deterministic predicate text, yielding readable evidence units linked to source windows, clinical variables, trees, and leaves\.
- Compact classifier\-input construction\. We introduce a selector that assembles a compact subset of tree\-path evidence units, enabling the LM classifier to compose informative conditions rather than redundant raw serializations or free\-form patient summaries\.
- Accuracy–traceability evaluation on ICU benchmarks\. Across three ICU prediction benchmarks, TreeText\-CTS achieves the strongest results among text\-based EHR time\-series interfaces and remains competitive with numerical irregular time\-series models, while preserving source traceability for every selected classifier\-input span\.

## 2 Related work

##### Numerical models for irregular clinical time series\.

Conventional models for EHR time\-series data handle missingness and uneven observation times through various strategies, such as recurrent decay, set functions, time\-aware attention, sparse\-event transformers, or variable graphs\[[2](https://arxiv.org/html/2605.20292#bib.bib4),[19](https://arxiv.org/html/2605.20292#bib.bib5),[6](https://arxiv.org/html/2605.20292#bib.bib6),[22](https://arxiv.org/html/2605.20292#bib.bib8),[28](https://arxiv.org/html/2605.20292#bib.bib10),[14](https://arxiv.org/html/2605.20292#bib.bib11),[12](https://arxiv.org/html/2605.20292#bib.bib12)\]\. These numerical time\-series models directly optimize prediction from clinical values, timestamps, masks, and missingness patterns and are strong baselines for clinical forecasting, but the evidence behind each prediction is not naturally exposed in a human\-readable form\.

##### Text\-based interfaces for EHR time\-series prediction\.

Recent text\-based EHR time\-series interfaces use LLM embeddings, verbalized observations, and generated patient representations\. WRDP evaluates LLM embeddings of numerical EHR features and finds that raw numerical features often remain competitive or stronger\[[4](https://arxiv.org/html/2605.20292#bib.bib20)\]; Decode Like a Clinician verbalizes temporal EHR observations for LLM fine\-tuning\[[3](https://arxiv.org/html/2605.20292#bib.bib21)\]; TimeCP uses LLM\-based contextualization for time\-series event prediction\[[13](https://arxiv.org/html/2605.20292#bib.bib22)\]; and Record2Vec embeds LLM\-generated clinical summaries of irregular ICU time series\[[7](https://arxiv.org/html/2605.20292#bib.bib23)\]\. These methods motivate text or language\-model interfaces for structured clinical data, with the potential to leverage pretrained language\-model priors\. However, they generally build patient\-level LM inputs or representations through serialization, verbalization, embedding, or generation\. In contrast, TreeText\-CTS separates cohort\-level structure learning from patient\-specific evidence construction\. Fixed tree models learn reusable split thresholds on window summaries; patient\-specific activated paths instantiate these thresholds as traceable textual evidence units, from which a selector retains a compact subset for the LM classifier\.

##### Language models and tree\-based structured prediction\.

Recent LLM\-tree hybrids use decision trees for LLM\-guided feature generation or rule refinement in tabular prediction\[[15](https://arxiv.org/html/2605.20292#bib.bib16),[27](https://arxiv.org/html/2605.20292#bib.bib17)\]\. TreeText\-CTS instead keeps trees fixed and treats activated paths as an intermediate evidence representation for irregular clinical time series\. This separates local evidence extraction from language\-based evidence composition, rather than using an LLM to modify the tabular predictor\. Prior tree\-based representation methods also use leaves or rules as derived features, but typically consume them as numerical indicators instead of rendering them as source\-traceable text for LM classification\.

## 3 Method

##### Overview\.

TreeText\-CTS builds a compact, human\-readable, source\-traceable text interface between irregular EHR trajectories and an LM classifier\. As shown in Figure [1](https://arxiv.org/html/2605.20292#S3.F1), multi\-scale window summaries are routed through fixed XGBoost models, activated paths are rendered as deterministic condition text by the Tree\-to\-Evidence Mapper \(TEM\), and the Compact Evidence Selector \(CES\) selects a compact subset of these evidence units\. The LM classifier predicts from a static\-covariate prefix followed by the selected evidence, avoiding raw trajectory serialization, patient\-level free\-form summarization, and inference\-time autoregressive decoding\. Because evidence selection is discrete, we train CES with a self\-critical policy\-gradient objective\.

![Refer to caption](https://arxiv.org/html/2605.20292v1/images/figure1.jpg)

Figure 1: Overview of TreeText\-CTS\. Irregular EHR trajectories are summarized over multiple windows, routed through fixed per\-window XGBoost models, and mapped by the Tree\-to\-Evidence Mapper \(TEM\) into deterministic, source\-traceable tree\-path evidence units\. The Compact Evidence Selector \(CES\) selects at most $K$ evidence units, which are concatenated after a static\-covariate prefix and read by a clinical language encoder without patient\-level free\-form generation\.

##### Problem setup\.

For patient $i$, let $\mathcal{O}_i=\{(v_j,t_j,x_j)\}_{j=1}^{n_i}$ denote the $n_i$ irregular EHR observations, where $v_j$ is the measured variable, $t_j$ is the observation time, and $x_j$ is the observed value\. Let $s_i$ denote static covariates and let $y_i\in\{0,1\}$ be the binary outcome\. We render $s_i$ as a deterministic text prefix $c_i$, such as `Age is ...`, prepended to the selected evidence sequence read by the LM classifier; only tree\-path evidence units count toward the evidence budget $K$\.

From $\mathcal{O}_i$, TreeText\-CTS constructs a candidate tree\-path evidence set $\mathcal{E}_i=\{e_{i1},\ldots,e_{iN_i}\}$, from which CES selects $S_i\subseteq\mathcal{E}_i$\. The final predictor is

$$
\hat{p}_i=\sigma\!\left(w^{\top}f_{\theta}\!\left(c_i\;\|\;\operatorname{concat}_{e\in\operatorname{sort}(S_i)}\mathrm{Text}(e)\right)+b\right),\tag{1}
$$

where $\hat{p}_i$ is the predicted probability of $y_i=1$, $\sigma$ is the sigmoid function, $\|$ denotes text concatenation, $\operatorname{sort}$ orders evidence by source time and window size, $f_{\theta}$ denotes the LM encoder representation, and $w,b$ are the linear classification head\.
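As a minimal sketch of how the text argument inside Eq. (1) could be assembled, the Python below sorts the selected evidence units by source-window endpoint and then by window size, and appends their text to the static prefix. The `EvidenceUnit` class, the `assemble_input` function, the ascending sort direction, the tie-breaking by tree and leaf index, and the single-space separator are illustrative assumptions rather than details stated in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvidenceUnit:
    """Hypothetical container for one tree-path evidence unit."""
    t: float    # source-window right endpoint, in hours
    W: float    # look-back window size, in hours
    tree: int   # tree index b
    leaf: int   # leaf index
    text: str   # rendered evidence text, Text(e)


def assemble_input(prefix: str, selected: List[EvidenceUnit], sep: str = " ") -> str:
    """Concatenate the static prefix c_i with the sorted evidence text.

    Assumption: ascending order by (t, W), with tree and leaf as tie-breakers.
    """
    ordered = sorted(selected, key=lambda e: (e.t, e.W, e.tree, e.leaf))
    return sep.join([prefix] + [e.text for e in ordered])


units = [
    EvidenceUnit(24, 8, 7, 26, "B"),
    EvidenceUnit(24, 1, 0, 3, "A"),
    EvidenceUnit(8, 8, 1, 2, "C"),
]
assembled = assemble_input("Age is 70.", units)
```

Because the ordering depends only on the stored source tuple, the same selected set always yields the same classifier input, which is what makes the input reproducible.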

##### Multi\-scale tree evidence\.

Clinical risk signals can appear at different temporal scales, from abrupt changes in short windows to sustained abnormalities over longer histories\. We therefore summarize each patient trajectory over multiple look\-back windows before extracting tree\-path evidence\. For each patient, candidate time point $t$, and look\-back window size $W$, we form a local summary over the source interval $[t-W,t]$, where $t$ is the right endpoint of the window and $W$ is the temporal scale\. For a fixed window size $W$, candidate windows are placed on a non\-overlapping grid\. For each variable and window size $W$, we compute nine fixed statistics over the corresponding interval $[t-W,t]$: last value, mean, standard deviation, minimum, maximum, count, net change, time since last observation, and missingness indicator\. For example, at $t=24$ h, the $W=1$ h and $W=8$ h summaries describe the intervals $[23,24]$ h and $[16,24]$ h, respectively\. The set of candidate prediction times follows the dataset\-specific evaluation protocol in Section [4](https://arxiv.org/html/2605.20292#S4)\.
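The nine window statistics can be illustrated with a short Python sketch for a single variable. The function name `window_summary`, the closed interval, the population standard deviation, the definition of net change as last minus first value, and the NaN placeholders for empty windows are all assumptions made for illustration; the paper does not specify these conventions in this section.

```python
import math
from statistics import mean, pstdev


def window_summary(times, values, t, W):
    """Nine statistics for one variable over the source interval [t - W, t]."""
    obs = sorted((tj, xj) for tj, xj in zip(times, values) if t - W <= tj <= t)
    if not obs:
        nan = math.nan
        # Assumption: an empty window is flagged as missing and filled with NaN.
        return {"last": nan, "mean": nan, "std": nan, "min": nan, "max": nan,
                "count": 0, "net_change": nan, "time_since_last": nan, "missing": 1}
    xs = [x for _, x in obs]
    return {
        "last": xs[-1],
        "mean": mean(xs),
        "std": pstdev(xs),                # assumption: population std
        "min": min(xs),
        "max": max(xs),
        "count": len(xs),
        "net_change": xs[-1] - xs[0],     # assumption: last minus first
        "time_since_last": t - obs[-1][0],
        "missing": 0,
    }


times = [15.0, 17.0, 20.0, 23.5]   # hours since admission (toy data)
values = [80, 70, 64, 66]          # e.g. MAP readings (toy data)
summary_8h = window_summary(times, values, t=24, W=8)   # interval [16, 24] h
summary_1h = window_summary(times, values, t=24, W=1)   # interval [23, 24] h
summary_empty = window_summary(times, values, t=10, W=1)
```

Running the same function for every variable and every $W$ would give the multi-scale summary rows that the window-specific tree ensembles consume.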

For each window size $W$, we train a separate XGBoost ensemble on dynamic summary rows $\{(\phi_W(i,t),y_i): i\in\mathcal{D}_{\mathrm{train}},\,t\in\mathcal{G}_i(W)\}$, where $\phi_W(i,t)$ denotes the nine\-statistic summary at source\-window endpoint $t$ and $\mathcal{G}_i(W)$ indexes the candidate window endpoints of patient $i$ at scale $W$\. After training, we freeze the trees, yielding a fixed inventory of threshold rules that later patients can only activate, not modify\.

For a patient\-endpoint\-window tuple $(i,t,W)$, each tree $b$ in the window\-specific XGBoost ensemble activates one leaf $\ell$\. The corresponding root\-to\-leaf path defines one tree\-path evidence unit indexed by the source tuple $(t,W,b,\ell)$\. The source tuple records provenance, while the path records the satisfied threshold conditions\. We define a simple *leaf score* for first\-stage candidate pruning,

$$
s_{\mathrm{leaf}}(e)=\log\!\left(1+n_{\mathrm{leaf}}(W,b,\ell)\right)\left|p_{\mathrm{leaf}}(W,b,\ell)-p_0\right|,\tag{2}
$$

where $p_0$ is the training\-set base rate, and $p_{\mathrm{leaf}}$ and $n_{\mathrm{leaf}}$ are the empirical positive rate and support of the leaf on the training split\. Within each $(i,t,W)$, we retain the top $M$ activated leaves by $s_{\mathrm{leaf}}$\.
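Eq. (2) and the top-$M$ pruning step can be sketched directly in Python. The tuple layout `(tree, leaf, n_leaf, p_leaf)` and the function names are hypothetical, and the leaf statistics and base rate below are toy numbers, not values from the paper.

```python
import math


def leaf_score(n_leaf: int, p_leaf: float, p0: float) -> float:
    """Eq. (2): log-support times absolute deviation from the base rate."""
    return math.log(1 + n_leaf) * abs(p_leaf - p0)


def top_m_leaves(activated, p0, M=5):
    """Keep the M activated leaves with the highest leaf score.

    `activated` is a list of (tree, leaf, n_leaf, p_leaf) tuples for one
    (patient, endpoint, window) tuple; this layout is an assumption.
    """
    ranked = sorted(activated, key=lambda a: leaf_score(a[2], a[3], p0), reverse=True)
    return ranked[:M]


activated = [
    (0, 3, 10, 0.12),    # small leaf, close to the base rate
    (1, 5, 500, 0.40),   # large leaf, far above the base rate
    (2, 1, 50, 0.02),    # medium leaf, below the base rate
]
kept = top_m_leaves(activated, p0=0.14, M=2)
```

The score favors leaves that are both well supported and far from the base rate in either direction, so strongly protective leaves can be retained alongside high-risk ones.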

##### Tree\-to\-Evidence Mapper\.

TEM maps each activated root\-to\-leaf path to a source\-traceable text unit for the LM classifier\. Each unit stores its source tuple $(t,W,b,\ell)$, identifying the source\-window endpoint, window size, tree, and leaf, so the original path is recoverable\. We refer to each rendered XGBoost split along an activated path as a tree\-path predicate, i\.e\., a deterministic threshold condition on a window\-summary feature\. Let $\mathrm{Pred}(e)$ denote the canonical deterministic rendering of the resulting path predicates\. Fixed templates convert raw split inequalities into natural\-language predicate text, e\.g\., $x>40$ becomes “$x$ is higher than 40\.” Path canonicalization merges redundant bounds and removes feature\-range restatements before caching; see Appendix [C\.3\.1](https://arxiv.org/html/2605.20292#A3.SS3.SSS1)\.
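A template-based renderer of this kind might look like the following Python sketch. The feature naming scheme `Variable_stat`, the two dictionaries, the `render_unit` function, and the `HH:00` time prefix for integer hours are assumptions chosen to mirror the examples in the text; the paper's actual templates and canonicalization rules are in its appendix and are not reproduced here.

```python
# Hypothetical template tables; only the wordings "is at most" and
# "is higher than" are taken from the examples in the text.
STAT_WORDS = {"min": "minimum", "max": "maximum", "last": "last", "mean": "mean"}
OP_WORDS = {"<=": "is at most", ">": "is higher than"}


def render_predicate(feature, op, threshold):
    """Render one split, e.g. ("MAP_min", "<=", 65) -> "minimum MAP is at most 65"."""
    variable, stat = feature.rsplit("_", 1)
    return f"{STAT_WORDS[stat]} {variable} {OP_WORDS[op]} {threshold}"


def render_unit(path, t, W, tree, leaf):
    """Render an activated path and keep its source tuple (t, W, b, leaf).

    Assumption: t and W are integer hours, so the interval prints as HH:00.
    """
    body = "; ".join(render_predicate(f, op, thr) for f, op, thr in path)
    return {
        "source": (t, W, tree, leaf),
        "text": f"From {t - W}:00 to {t}:00: {body}.",
    }


path = [("MAP_min", "<=", 65), ("HR_last", ">", 110), ("Lactate_max", ">", 2.0)]
unit = render_unit(path, t=24, W=8, tree=7, leaf=26)
```

Since rendering is a pure function of the frozen tree path, the text can be cached once per leaf and reused for every patient who activates it.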

Because not every path warrants a task\-relevant clinical gloss, we use a local LLM offline at the leaf level to annotate each predicate using $\mathrm{Pred}(e)$ and the task description\. The annotator outputs $g(e)\in\{0,1\}$ and produces $\mathrm{Gloss}(e)$ only when $g(e)=1$; in that case, the final text appends `cg:` $\mathrm{Gloss}(e)$ to $\mathrm{Pred}(e)$, and otherwise uses $\mathrm{Pred}(e)$ alone\. Glosses are cached, reused across patients, and never replace or modify the deterministic predicate text, which remains the source\-traceable evidence anchor\. Table [1](https://arxiv.org/html/2605.20292#S3.T1) shows one example; Appendix [C\.4](https://arxiv.org/html/2605.20292#A3.SS4) provides the annotation prompt\.

Table 1: Example TEM rendering for one activated PhysioNet 2019 leaf with $g(e)=1$\. The source tuple identifies the prediction time, window, tree, and leaf; the final evidence text unit is $\mathrm{Pred}(e)$ followed by `cg:` $\mathrm{Gloss}(e)$\.

| Field | Example |
| --- | --- |
| Source tuple | $(t=24\,\mathrm{h},\,W=8\,\mathrm{h},\,b=7,\,\ell=26)$, interval $[16,24]\,\mathrm{h}$ |
| Path predicates | `MAP_min <= 65; HR_last > 110; Lactate_max > 2.0` |
| $\mathrm{Pred}(e)$ | From 16:00 to 24:00: minimum MAP is at most 65; last HR is higher than 110; maximum lactate is higher than 2\.0\. |
| $\mathrm{Gloss}(e)$ | compatible with hypotension, tachycardia, and elevated lactate\. |

##### Compact Evidence Selector\.

Multi\-scale routing generates many activated paths per patient across prediction times, look\-back windows, and XGBoost trees\. The CES module selects a compact subset of these evidence units before they are assembled as the LM classifier input\. For each evidence unit $e_{ij}$, we construct a 72\-dimensional selector token $u_{ij}=[\,Pv_{ij};m_{ij}\,]$, where $v_{ij}$ is a fixed cached embedding of the corresponding tree\-path evidence text, $P$ is a trainable linear projection to 64 dimensions, and $m_{ij}\in\mathbb{R}^{8}$ contains scalar metadata\. The metadata covers temporal/window information, XGBoost leaf statistics, text length, and a binary gloss\-availability indicator; the exact fields are listed in Appendix [D\.1](https://arxiv.org/html/2605.20292#A4.SS1.SSS0.Px2)\. A lightweight Transformer contextualizes the candidate tokens and a linear gate produces margins and selection probabilities:

$$
h_{ij}=T_{\phi}(u_{i1},\ldots,u_{iN_i})_j,\qquad r_{ij}=w_r^{\top}h_{ij}+b_r,\qquad q_{ij}=\sigma(r_{ij}).
$$

For greedy assembly, CES keeps positive\-margin evidence units, supplements them with the highest\-margin remaining units when fewer than $K_{\min}$ are selected to avoid all\-reject inputs, and enforces the final budget by retaining the top\-$K$ units by margin\.
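The greedy assembly rule can be written as a few lines of Python over the margins $r_{ij}$ of one patient. The function name `greedy_select` and the choice to return indices in ascending order are assumptions; the defaults $K=30$ and $K_{\min}=5$ follow the configuration reported in Section 4.

```python
def greedy_select(margins, K=30, K_min=5):
    """Return indices of the evidence units kept by greedy assembly.

    1. Keep units with positive margin.
    2. If fewer than K_min are kept, fill up with the highest-margin rest.
    3. Cap the result at the K highest-margin units.
    """
    order = sorted(range(len(margins)), key=lambda j: margins[j], reverse=True)
    chosen = [j for j in order if margins[j] > 0]
    if len(chosen) < K_min:
        # Positive-margin units are a prefix of `order`, so taking the first
        # K_min entries adds exactly the highest-margin remaining units.
        chosen = order[:K_min]
    return sorted(chosen[:K])


kept_a = greedy_select([0.5, -1.0, 2.0, -0.2, -3.0], K=3, K_min=2)
kept_b = greedy_select([-1.0, -0.5, -2.0], K=3, K_min=2)    # all-reject case
kept_c = greedy_select([1.0, 2.0, 3.0, 4.0], K=2, K_min=1)  # budget cap
```

The floor guarantees that the classifier never receives an empty evidence suffix, while the cap bounds the input length regardless of how many candidates a patient has.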

##### Self\-critical selector learning\.

Evidence selection is discrete, and the quality of a selected set is observed only after the LM classifier reads the assembled input and incurs prediction loss\. We therefore train CES with a self\-critical policy\-gradient objective\[[17](https://arxiv.org/html/2605.20292#bib.bib9)\]\. For each patient, we sample selector actions $z^{s}_{ij}\sim\mathrm{Bernoulli}(q_{ij})$ and use the greedy baseline $z^{g}_{ij}=\mathbf{1}[r_{ij}>0]$\. Both action sets are passed through the same $K_{\min}$ floor and final top\-$K$ cap, yielding $S_i^{s}$ and $S_i^{g}$\.

The LM classifier reads two assemblies that share the static prefix $c_i$ and differ only in the selected tree\-path evidence suffix\. Let $\ell_i^{s}$ and $\ell_i^{g}$ denote the binary cross\-entropy task losses for the sampled and greedy assemblies\. We train the LM classifier and classification head on both:

$$
\mathcal{L}_{\mathrm{task}}=\frac{1}{2|\mathcal{B}|}\sum_{i\in\mathcal{B}}\left(\ell_i^{s}+\ell_i^{g}\right).\tag{3}
$$

The selector receives a normalized self\-critical advantage,

$$
\hat{A}_i=\operatorname{stopgrad}\!\left(\frac{(\ell_i^{g}-\ell_i^{s})-\mu_{\mathcal{B}}}{\sigma_{\mathcal{B}}+\epsilon}\right),\tag{4}
$$

where $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are mini\-batch statistics\. Thus, sampled evidence receives a positive advantage when it improves prediction loss over the greedy assembly for the same patient\. The selector loss is

$$
\mathcal{L}_{\mathrm{sel}}^{\mathrm{SC}}=-\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\hat{A}_i\sum_{j=1}^{N_i}\left[z^{s}_{ij}\log q_{ij}+(1-z^{s}_{ij})\log(1-q_{ij})\right].\tag{5}
$$

The full objective is

$$
\mathcal{L}=\mathcal{L}_{\mathrm{task}}+\lambda_{\mathrm{sel}}\mathcal{L}_{\mathrm{sel}}^{\mathrm{SC}}-\lambda_{\mathrm{ent}}\mathcal{H}(\pi_{\phi}),\tag{6}
$$

where $\mathcal{H}(\pi_{\phi})$ is the entropy of the Bernoulli selection policy\. The XGBoost models, predicate verbalizations, optional glosses, and evidence\-text embedding cache are fixed, while the selector\-token projection, CES, and the LM classifier are trained jointly\.
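To make Eqs. (4) and (5) concrete, the following framework-free Python sketch computes the normalized advantages and the selector loss on plain lists. It assumes that $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are the mini-batch mean and population standard deviation of $\ell_i^{g}-\ell_i^{s}$, which the text does not state explicitly; the function names and the toy numbers are hypothetical. In an autodiff framework the advantages would additionally be detached from the graph, which plain floats already are.

```python
import math
from statistics import mean, pstdev


def normalized_advantages(loss_greedy, loss_sampled, eps=1e-8):
    """Eq. (4): standardize (greedy loss - sampled loss) over the mini-batch."""
    raw = [g - s for g, s in zip(loss_greedy, loss_sampled)]
    mu, sd = mean(raw), pstdev(raw)   # assumption: population std
    return [(r - mu) / (sd + eps) for r in raw]


def selector_loss(advantages, probs, actions):
    """Eq. (5): advantage-weighted Bernoulli log-likelihood of sampled actions."""
    total = 0.0
    for adv, q_i, z_i in zip(advantages, probs, actions):
        log_prob = sum(math.log(q) if z else math.log(1.0 - q)
                       for q, z in zip(q_i, z_i))
        total += adv * log_prob
    return -total / len(advantages)


# Toy batch of two patients: sampling helped patient 0 and hurt patient 1.
adv = normalized_advantages(loss_greedy=[0.7, 0.4], loss_sampled=[0.5, 0.6])
loss = selector_loss(adv, probs=[[0.9], [0.5]], actions=[[1], [1]])
```

Minimizing this loss raises the probability of sampled selections that beat the greedy assembly and lowers it for selections that do worse.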

## 4 Experiments

##### Datasets and tasks\.

We evaluate on three public ICU benchmarks: PhysioNet 2012 \(P12\) in\-hospital mortality\[[20](https://arxiv.org/html/2605.20292#bib.bib24)\], MIMIC\-III in\-hospital mortality\[[10](https://arxiv.org/html/2605.20292#bib.bib25)\], and PhysioNet 2019 \(P19\) 6\-hour\-ahead sepsis onset\[[18](https://arxiv.org/html/2605.20292#bib.bib26)\]\. All methods use identical train/validation/test cohorts\. Appendix [A\.1](https://arxiv.org/html/2605.20292#A1.SS1) reports source preprocessing, cohort filters, excluded\-visit counts, split sizes, and positive rates\.

##### Baselines\.

We compare against two groups of baselines\. First, we evaluate text\-based EHR time\-series interfaces: WRDP\[[4](https://arxiv.org/html/2605.20292#bib.bib20)\], Record2Vec\[[7](https://arxiv.org/html/2605.20292#bib.bib23)\], Decode Like a Clinician\[[3](https://arxiv.org/html/2605.20292#bib.bib21)\], and TimeCP\[[13](https://arxiv.org/html/2605.20292#bib.bib22)\]\. For each interface family, we report a *faithful* variant that preserves the original\-style generator, embedder, and predictor, and an *adapted* variant that keeps the same patient text but replaces the final predictor with the BioClinical ModernBERT\[[21](https://arxiv.org/html/2605.20292#bib.bib19)\] used by TreeText\-CTS\. This adapted setting isolates representation quality from the downstream encoder\. Second, we compare against numerical irregular\-time\-series models: Last Observation Carried Forward \(LOCF\) LSTM\[[5](https://arxiv.org/html/2605.20292#bib.bib47)\], LOCF Transformer\[[23](https://arxiv.org/html/2605.20292#bib.bib48)\], GRU\-D\[[2](https://arxiv.org/html/2605.20292#bib.bib4)\], mTAND Transformer\[[19](https://arxiv.org/html/2605.20292#bib.bib5)\], SeFT\[[6](https://arxiv.org/html/2605.20292#bib.bib6)\], STraTS\[[22](https://arxiv.org/html/2605.20292#bib.bib8)\], KEDGN\[[14](https://arxiv.org/html/2605.20292#bib.bib11)\], and DuETT\[[12](https://arxiv.org/html/2605.20292#bib.bib12)\]\. Full text\-baseline interface instantiations are summarized in Table [A\.5](https://arxiv.org/html/2605.20292#A1.T5), and numerical\-baseline training protocols are given in Appendix [A\.2](https://arxiv.org/html/2605.20292#A1.SS2)\.

##### TreeText\-CTS configuration\.

We use the tree\-to\-evidence pipeline from Section [3](https://arxiv.org/html/2605.20292#S3)\. XGBoost ensembles are trained separately for seven look\-back windows, $W\in\{1,2,4,8,16,32,48\}$ hours, using the window\-summary bank in Section [3](https://arxiv.org/html/2605.20292#S3.SS0.SSS0.Px3)\. Each activated leaf is converted into deterministic predicate text, optionally augmented with a cached clinical gloss when the leaf is judged clinically meaningful\. For TreeText\-CTS, leaf\-level clinical\-meaningfulness judgments and optional glosses are generated offline with Qwen3\.5\-27B\[[16](https://arxiv.org/html/2605.20292#bib.bib49)\] and cached once per unique tree leaf\. These calls are not patient\-level generation and are not performed during validation/test inference\. For CES, we compute cached evidence\-text embeddings using Qwen3\-Embedding\-8B\[[29](https://arxiv.org/html/2605.20292#bib.bib50)\] and project them to 64 dimensions through the trainable selector projection described in Section [3](https://arxiv.org/html/2605.20292#S3)\.

For model\-based text baselines, Qwen3\.5\-27B is used to generate the corresponding patient\-level summaries or contextualized patient text according to each baseline interface; these patient\-level generation calls are counted in the online latency protocol when required at test time\.

The tree inventory, leaf verbalization cache, and evidence\-text embedding cache are built from the training split and then fixed for validation and test patients\. Unless stated otherwise, TreeText\-CTS uses the top $M=5$ candidate leaves per $(i,t,W)$, minimum\-selection floor $K_{\min}=5$, final evidence budget $K=30$, and BioClinical ModernBERT as the LM classifier\. Implementation details, the gloss annotation prompt, and sensitivity analyses are reported in Appendices [B\.3](https://arxiv.org/html/2605.20292#A2.SS3), [B\.4](https://arxiv.org/html/2605.20292#A2.SS4), [C\.4](https://arxiv.org/html/2605.20292#A3.SS4), and [D\.1](https://arxiv.org/html/2605.20292#A4.SS1)\.

##### Evaluation protocol\.

We report AUROC and AUPRC as the primary metrics\. Results are mean ± standard deviation over three independent random seeds\. For each seed, checkpoints are selected by validation AUROC and evaluated once on the held\-out test set\. Validation labels are used only for learning\-rate and best\-epoch model selection; test labels are used to estimate final performance\. Online latency is measured as end\-to\-end wall\-clock time per test patient after model and cache loading, with the full timing protocol in Appendix [D\.2](https://arxiv.org/html/2605.20292#A4.SS2)\.

## 5 Results and discussion

Table 2: Text\-based EHR time\-series interface comparison\. *Faithful* variants preserve the original\-style predictor, whereas *Adapted* variants keep the same text representation but use BioClinical ModernBERT\. Metrics are test mean ± std\. *No patient gen\.*, *No AR decode*, and *Traceable evid\.* indicate whether the method avoids patient\-level free\-form generation, avoids inference\-time autoregressive decoding, and exposes a readable, source\-traceable evidence sequence, respectively\. Online latency follows the end\-to\-end protocol in Appendix [D\.2](https://arxiv.org/html/2605.20292#A4.SS2)\.

| Method | No patient gen\. | No AR decode | Traceable evid\. | Online lat\. \(s/pt\) | P2012 AUROC | P2012 AUPRC | MIMIC\-III AUROC | MIMIC\-III AUPRC | P2019 AUROC | P2019 AUPRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WRDP\-Faithful | ✓ | ✓ | ✗ | 0\.299 | 0\.7715 ± 0\.0019 | 0\.3528 ± 0\.0001 | 0\.7543 ± 0\.0028 | 0\.2958 ± 0\.0035 | 0\.8509 ± 0\.0012 | 0\.3038 ± 0\.0096 |
| WRDP\-Adapted | ✓ | ✓ | ✗ | 0\.071 | 0\.8236 ± 0\.0008 | 0\.4636 ± 0\.0181 | 0\.7843 ± 0\.0017 | 0\.3580 ± 0\.0089 | 0\.8853 ± 0\.0042 | 0\.3792 ± 0\.0025 |
| Record2Vec\-Faithful | ✗ | ✗ | ✗ | 272\.308 | 0\.7720 ± 0\.0006 | 0\.4033 ± 0\.0020 | 0\.7839 ± 0\.0028 | 0\.3738 ± 0\.0009 | 0\.7450 ± 0\.0024 | 0\.1151 ± 0\.0029 |
| Record2Vec\-Adapted | ✗ | ✗ | ✗ | 272\.228 | 0\.7675 ± 0\.0109 | 0\.3788 ± 0\.0273 | 0\.8040 ± 0\.0005 | 0\.3692 ± 0\.0014 | 0\.7894 ± 0\.0025 | 0\.1636 ± 0\.0065 |
| Decode\-Faithful | ✓ | ✓ | ✗ | 0\.213 | 0\.7821 ± 0\.0217 | 0\.4376 ± 0\.0310 | 0\.8394 ± 0\.0126 | 0\.4292 ± 0\.0041 | 0\.7913 ± 0\.0236 | 0\.3266 ± 0\.0072 |
| Decode\-Adapted | ✓ | ✓ | ✗ | 0\.074 | 0\.7854 ± 0\.0031 | 0\.3598 ± 0\.0001 | 0\.7933 ± 0\.0026 | 0\.3629 ± 0\.0060 | 0\.8118 ± 0\.0041 | 0\.2448 ± 0\.0011 |
| TimeCP\-Faithful | ✗ | ✗ | ✗ | 273\.473 | 0\.5663 ± 0\.0037 | 0\.1579 ± 0\.0149 | 0\.6120 ± 0\.0020 | 0\.1453 ± 0\.0051 | 0\.5266 ± 0\.0117 | 0\.0457 ± 0\.0276 |
| TimeCP\-Adapted | ✗ | ✗ | ✗ | 272\.228 | 0\.7999 ± 0\.0213 | 0\.4156 ± 0\.0142 | 0\.8065 ± 0\.0086 | 0\.4110 ± 0\.0090 | 0\.7949 ± 0\.0006 | 0\.1520 ± 0\.0040 |
| TreeText\-CTS \(Ours\) | ✓ | ✓ | ✓ | 0\.094 | 0\.8571 ± 0\.0038 | 0\.5239 ± 0\.0004 | 0\.8579 ± 0\.0015 | 0\.5011 ± 0\.0090 | 0\.9066 ± 0\.0045 | 0\.4757 ± 0\.0248 |

##### Overview\.

We evaluate TreeText\-CTS along three axes: \(1\) whether tree\-path evidence units improve LM inputs over existing text interfaces for irregular EHR time series, \(2\) whether the gains come from Tree\-to\-Evidence Mapper, Compact Evidence Selection, optional clinical glosses, and language\-based composition, and \(3\) how its prediction accuracy, readability, and traceability compare with numerical time\-series models\.

##### Tree\-path evidence yields a stronger text\-based interface for irregular EHR trajectories\.

Table [2](https://arxiv.org/html/2605.20292#S5.T2) compares TreeText-CTS with existing text-based EHR time-series interfaces under faithful and adapted variants. TreeText-CTS obtains the highest AUROC and AUPRC point estimates across all three benchmarks among the evaluated text-interface baselines. Relative to the strongest text-interface baseline on each dataset (WRDP-Adapted on PhysioNet 2012, Decode-Faithful on MIMIC-III, and WRDP-Adapted on PhysioNet 2019), TreeText-CTS improves absolute AUROC by 0.034, 0.019, and 0.021, and absolute AUPRC by 0.060, 0.072, and 0.097, respectively. The adapted variants further control for the final LM classifier by replacing each baseline’s original predictor with the same BioClinical ModernBERT architecture used by TreeText-CTS; under this controlled LM-classifier setting, prior patient-level text interfaces still remain below the selected tree-path evidence interface.
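
The absolute gains quoted above can be re-derived from the Table 2 means. The following is a minimal sketch, assuming the reported gains are plain differences of the tabulated means rounded half-up to three decimals; the rounding rule and the helper name `absolute_gain` are assumptions, not stated in the paper.

```python
from decimal import Decimal, ROUND_HALF_UP

# (AUROC, AUPRC) test means copied from Table 2.
treetext = {
    "P2012": ("0.8571", "0.5239"),
    "MIMIC-III": ("0.8579", "0.5011"),
    "P2019": ("0.9066", "0.4757"),
}
# Strongest text-interface baseline per dataset, as named in the text.
strongest_baseline = {
    "P2012": ("WRDP-Adapted", "0.8236", "0.4636"),
    "MIMIC-III": ("Decode-Faithful", "0.8394", "0.4292"),
    "P2019": ("WRDP-Adapted", "0.8853", "0.3792"),
}


def absolute_gain(ours: str, baseline: str) -> Decimal:
    """Difference of two reported means, rounded half-up to three decimals (assumed rule)."""
    diff = Decimal(ours) - Decimal(baseline)
    return diff.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)


gains = {}
for dataset, (auroc, auprc) in treetext.items():
    name, base_auroc, base_auprc = strongest_baseline[dataset]
    gains[dataset] = (absolute_gain(auroc, base_auroc), absolute_gain(auprc, base_auprc))
    print(f"{dataset} vs {name}: AUROC +{gains[dataset][0]}, AUPRC +{gains[dataset][1]}")
```

Decimal arithmetic is used so that differences such as 0.8571 − 0.8236 = 0.0335 round the same way on every platform; binary floats can land on either side of the half-way point.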

Table [2](https://arxiv.org/html/2605.20292#S5.T2) also shows a favorable operational profile. TreeText-CTS is the only evaluated interface that exposes a compact selected sequence of readable source-traceable evidence units while requiring neither patient-level free-form generation nor inference-time autoregressive decoding. Raw-serialization interfaces such as WRDP and Decode retain source traceability, but they do so by passing redundant observation lists rather than compactly readable evidence units. TreeText-CTS’s measured online latency remains comparable to the fastest adapted baselines, while delivering consistently higher AUROC and AUPRC; in contrast to patient-level generative interfaces, it avoids large generation-time overhead through cached leaf-level evidence and a forward-only encoder pass.

##### What the component ablations establish\.

Table [3](https://arxiv.org/html/2605.20292#S5.T3) separates the contributions of tree-path evidence units, selection by CES, optional clinical glosses, and LM-classifier composition. First, the XGB-only aggregation controls show that the gain is not simply inherited from the frozen window-specific XGBoost models. Relative to the stronger XGB-only aggregation in each dataset, TreeText-CTS improves AUROC by 0.022–0.048 and AUPRC by 0.043–0.224. Second, learned selection by CES matters: replacing CES with leaf-score top-$K$ evidence consistently lowers performance by 0.017–0.034 AUROC and 0.032–0.126 AUPRC. Third, clinical glosses provide useful auxiliary hints: removing glosses while retaining deterministic predicate text causes smaller drops than removing learned selection. Fourth, the leaf-ID controls show that selected tree identities already carry strong predictive signal, but converting selected paths into readable evidence units improves AUPRC over a non-readable leaf-ID MLP on all three datasets.
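
The ranges in this paragraph follow from the Table 3 means. The sketch below recomputes them under two assumptions that the text does not spell out: "the stronger XGB-only aggregation" is read as the per-column maximum of the mean and max controls, and each gain is rounded half-up to three decimals. The names `gain_range` and `stronger_xgb` are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Column order: P2012 AUROC, P2012 AUPRC, MIMIC-III AUROC, MIMIC-III AUPRC,
# P2019 AUROC, P2019 AUPRC. Values copied from Table 3.
rows = {
    "full": "0.8571 0.5239 0.8579 0.5011 0.9066 0.4757",
    "leaf_score_topk_gloss": "0.8404 0.4915 0.8255 0.4398 0.8728 0.3499",
    "xgb_mean": "0.8355 0.4475 0.8346 0.4051 0.8583 0.2388",
    "xgb_max": "0.8316 0.4810 0.7925 0.4066 0.8435 0.2516",
}
table = {name: [Decimal(v) for v in cells.split()] for name, cells in rows.items()}


def gain_range(reference, metric_offset):
    """Min and max gain of the full model over `reference` for one metric
    (offset 0 = AUROC, offset 1 = AUPRC) across the three datasets."""
    step = Decimal("0.001")
    gains = [
        (table["full"][i] - reference[i]).quantize(step, rounding=ROUND_HALF_UP)
        for i in range(metric_offset, 6, 2)
    ]
    return min(gains), max(gains)


# Assumed reading: the stronger control is chosen separately for every column.
stronger_xgb = [max(a, b) for a, b in zip(table["xgb_mean"], table["xgb_max"])]

xgb_auroc = gain_range(stronger_xgb, 0)
xgb_auprc = gain_range(stronger_xgb, 1)
ces_auroc = gain_range(table["leaf_score_topk_gloss"], 0)
ces_auprc = gain_range(table["leaf_score_topk_gloss"], 1)
print(xgb_auroc, xgb_auprc, ces_auroc, ces_auprc)
```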

Together, these ablations support a local-to-global evidence composition view in which tree paths expose local evidence, CES assembles a budgeted subset from evidence-text embeddings and scalar metadata, and the LM classifier composes the selected source-traceable units. Tables [A.14](https://arxiv.org/html/2605.20292#A2.T14) and [A.15](https://arxiv.org/html/2605.20292#A2.T15) show that selector-side evidence representations and domain-aligned LM pretraining contribute.

Table 3: Component ablation and diagnostic controls. Cells report test AUROC/AUPRC means over independent seeds. CG denotes optional clinical glosses, CES the Compact Evidence Selector, LM the BioClinical ModernBERT classifier, and Tree the XGBoost tree-evidence source. All rows use the same final evidence budget unless noted. Bold marks the best point estimate in each metric column.

| Method | CG | CES | LM | Tree | P2012 AUROC | P2012 AUPRC | MIMIC-III AUROC | MIMIC-III AUPRC | P2019 AUROC | P2019 AUPRC |
|---|---|---|---|---|---|---|---|---|---|---|
| *Text-evidence pipeline ablations* | | | | | | | | | | |
| TreeText-CTS w/o clinical gloss | – | ✓ | ✓ | ✓ | 0.8531 | 0.4834 | 0.8449 | 0.4620 | 0.9039 | 0.4664 |
| Leaf-score top-$K$ evidence + gloss | ✓ | – | ✓ | ✓ | 0.8404 | 0.4915 | 0.8255 | 0.4398 | 0.8728 | 0.3499 |
| Leaf-score top-$K$ + predicate-only, no CG/CES | – | – | ✓ | ✓ | 0.8260 | 0.4798 | 0.8271 | 0.4350 | 0.8766 | 0.3661 |
| *Leaf-ID controls without an LM classifier* | | | | | | | | | | |
| Heuristic top-$K$ leaf-ID MLP | – | – | – | ✓ | 0.8224 | 0.4805 | 0.8215 | 0.4444 | 0.8931 | 0.4301 |
| CES-no-LM leaf-ID MLP | – | ✓ | – | ✓ | 0.8478 | 0.5061 | 0.8526 | 0.4822 | 0.8815 | 0.4194 |
| Reused-CES leaf-ID MLP† | – | ✓† | – | ✓ | 0.8528 | 0.5070 | **0.8597** | 0.4903 | 0.8819 | 0.4080 |
| *XGB-only aggregation controls* | | | | | | | | | | |
| XGB multi-window mean | – | – | – | ✓ | 0.8355 | 0.4475 | 0.8346 | 0.4051 | 0.8583 | 0.2388 |
| XGB multi-window max | – | – | – | ✓ | 0.8316 | 0.4810 | 0.7925 | 0.4066 | 0.8435 | 0.2516 |
| TreeText-CTS (Ours) | ✓ | ✓ | ✓ | ✓ | **0.8571** | **0.5239** | 0.8579 | **0.5011** | **0.9066** | **0.4757** |

A dash indicates that the component is absent or not applicable. For text-evidence rows without CES, evidence units are selected by fixed leaf-score top-$K$ under the same budget as TreeText-CTS. Leaf-ID MLP controls remove readable text evidence and the LM classifier: the selected leaves are encoded as a sparse multi-hot vector over source-window endpoints, window scales, tree indices, and leaf IDs, then passed to a 3-layer MLP. Heuristic top-$K$ uses fixed leaf-score selection. CES-no-LM trains the same compact selector architecture using only the leaf-ID MLP prediction objective. XGB multi-window mean and max are non-text controls that aggregate XGBoost probabilities over candidate endpoints and window sizes by mean or max, respectively, without a selector or LM reader. † Reused-CES freezes the selector policy learned by full TreeText-CTS and trains only the diagnostic MLP.
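
The two non-text control families described above can be illustrated as follows. This is a hedged sketch, not the paper's code: the probabilities and keys are made up, and treating each (endpoint, window scale, tree index, leaf ID) combination as one joint vocabulary entry is an assumption, since the note could also describe separate per-field blocks.

```python
from statistics import mean

# Hypothetical candidate grid: (endpoint_hour, window_hours) -> frozen-XGBoost probability.
window_probs = {
    (24, 4): 0.12,
    (24, 16): 0.30,
    (48, 4): 0.22,
    (48, 16): 0.41,
}


def aggregate(probs, how):
    """XGB-only control: pool window-level probabilities into one patient score."""
    values = list(probs.values())
    return mean(values) if how == "mean" else max(values)


def multi_hot(selected, vocabulary):
    """Leaf-ID control: 0/1 vector with one slot per (endpoint, scale, tree, leaf) key."""
    index = {key: i for i, key in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for key in selected:
        vec[index[key]] = 1
    return vec


# Hypothetical vocabulary of leaf keys and a selected subset.
vocabulary = [(24, 4, 0, 3), (24, 16, 0, 5), (48, 16, 1, 2)]
selected = [(24, 4, 0, 3), (48, 16, 1, 2)]

print(aggregate(window_probs, "mean"), aggregate(window_probs, "max"))
print(multi_hot(selected, vocabulary))
```

In the leaf-ID controls the resulting vector would feed the 3-layer MLP; neither control sees any evidence text.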

##### CES margins rank tree\-path evidence units beyond score, recency, or gloss shortcuts\.

Figure [2](https://arxiv.org/html/2605.20292#S5.F2) analyzes CES from two complementary views: a fixed-count evidence-budget sweep and the enrichment profile of the evidence units selected by the trained selector. In the left panel, every selector exposes exactly $K$ tree-path evidence units to the same LM classifier, thereby isolating evidence-ranking quality from input-length effects. On PhysioNet 2012, CES top-$K$ is the strongest selector throughout the practical range $K \geq 5$, outperforming leaf-score, recency, and random top-$K$ selection in AUPRC. The same qualitative AUROC pattern is reported in Appendix [B](https://arxiv.org/html/2605.20292#A2). Since the default $K=30$ operating point is already evaluated in Table [3](https://arxiv.org/html/2605.20292#S5.T3), this sweep is used mainly to interpret the selector, showing that the learned CES margins define a useful utility ordering over candidate evidence units rather than merely increasing the amount of text read by the classifier. The bottom-$K$ diagnostic further supports this interpretation, as selecting the lowest-margin evidence units from CES leads to weaker and more variable performance than selecting the highest-margin units.
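
The matched-budget comparison can be pictured as ranking the same candidate pool under different keys and always keeping exactly $K$ units. The sketch below is illustrative only: the `Candidate` fields, the margin and leaf-score values, and the `select` helper are hypothetical stand-ins for the selectors compared in the sweep.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    unit_id: str
    ces_margin: float  # learned selector margin (hypothetical values below)
    leaf_score: float  # heuristic leaf score (hypothetical values below)
    end_hour: float    # source-window endpoint; larger means more recent


def select(candidates, k, strategy, seed=0):
    """Return min(k, len(candidates)) units under one ranking rule."""
    if strategy == "random":
        return random.Random(seed).sample(candidates, min(k, len(candidates)))
    keys = {
        "ces_top": lambda c: -c.ces_margin,    # highest margins first
        "ces_bottom": lambda c: c.ces_margin,  # lowest margins first (diagnostic)
        "leaf_score": lambda c: -c.leaf_score,
        "recency": lambda c: -c.end_hour,
    }
    return sorted(candidates, key=keys[strategy])[:k]


pool = [
    Candidate("u1", 2.1, 0.30, 12.0),
    Candidate("u2", -0.4, 0.90, 47.0),
    Candidate("u3", 1.3, 0.10, 40.0),
    Candidate("u4", 0.2, 0.55, 30.0),
]
for strategy in ("ces_top", "ces_bottom", "leaf_score", "recency", "random"):
    print(strategy, [c.unit_id for c in select(pool, 2, strategy)])
```

Because every strategy returns the same number of units, any downstream difference reflects which units were chosen rather than how much text the classifier reads.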

The right panel profiles the final evidence units selected by the trained CES relative to the candidate pool, using the enrichment statistic defined in Appendix [B.3](https://arxiv.org/html/2605.20292#A2.SS3.SSS0.Px4). CES tends to select more recent evidence while suppressing old evidence, although this behavior is not reducible to a recency-only shortcut, since the explicit recency top-$K$ selector underperforms CES in the fixed-count sweep. The selected evidence is also concentrated in longer look-back windows, especially 16–48 hours, indicating that CES tends to select recent prediction times with summaries that capture sustained abnormalities rather than isolated short-window fluctuations. Gloss availability is not the dominant selection signal either. Glossed evidence is mildly enriched on PhysioNet 2012 and MIMIC-III, but depleted on PhysioNet 2019, where TreeText-CTS still achieves the strongest text-based AUPRC. Together, these results suggest that CES is neither a gloss detector nor a simple recency rule. Instead, it learns a task-dependent ranking over tree-path evidence units that combines temporal position, window scale, and path-level predictive utility.
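
The exact enrichment statistic is defined in Appendix B.3, which is not reproduced in this section. As a rough illustration of the idea only, the sketch below assumes a common form: the log2 ratio between a category's share among selected units and its share in the candidate pool. The function name, the category labels, and the counts are all hypothetical.

```python
import math
from collections import Counter


def log2_enrichment(selected, pool):
    """log2(share among selected units / share in the candidate pool), per category.

    Assumes both lists are non-empty. Positive values mean enriched,
    negative values mean depleted, and -inf means never selected.
    """
    selected_counts, pool_counts = Counter(selected), Counter(pool)
    result = {}
    for category, pool_count in pool_counts.items():
        selected_share = selected_counts[category] / len(selected)
        pool_share = pool_count / len(pool)
        if selected_share > 0:
            result[category] = math.log2(selected_share / pool_share)
        else:
            result[category] = float("-inf")
    return result


# Hypothetical look-back window labels for the pool and for the selected units.
pool_windows = ["4h"] * 50 + ["16h"] * 30 + ["48h"] * 20
selected_windows = ["4h"] * 5 + ["16h"] * 15 + ["48h"] * 10

enrichment = log2_enrichment(selected_windows, pool_windows)
print(enrichment)
```

In this toy pool the short 4h window is depleted and the longer windows are enriched, which mirrors the qualitative pattern described for the look-back scale.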

![Refer to caption](https://arxiv.org/html/2605.20292v1/x1.png)

Figure 2: Selector behavior and evidence efficiency. Left: P2012 AUPRC at matched evidence budgets, comparing CES-margin top-$K$ against leaf-score, recency, random, and bottom-$K$ selectors with the same LM classifier. Right: enrichment of selected evidence relative to the candidate pool.

##### Positioning against numerical time\-series models\.

Table [4](https://arxiv.org/html/2605.20292#S5.T4) positions TreeText-CTS against numerical time-series models. These models remain strong, with STraTS leading on PhysioNet 2012 and STraTS/GRU-D leading on PhysioNet 2019 AUROC. TreeText-CTS nevertheless obtains the best point estimates on MIMIC-III AUROC/AUPRC and PhysioNet 2019 AUPRC, while exposing a sequence of human-readable, source-traceable tree-path evidence units as the classifier input. We therefore do not claim uniform statistical superiority over numerical models, but rather a competitive operating point combining accuracy with readable, traceable evidence.

Table 4: Comparison with numerical baselines. Metrics are test mean ± std over three seeds.

| Method | P2012 AUROC | P2012 AUPRC | MIMIC-III AUROC | MIMIC-III AUPRC | P2019 AUROC | P2019 AUPRC |
|---|---|---|---|---|---|---|
| LOCF LSTM | 0.8192 ± 0.0044 | 0.4548 ± 0.0149 | 0.8240 ± 0.0016 | 0.4811 ± 0.0113 | 0.8514 ± 0.0043 | 0.2392 ± 0.0064 |
| LOCF Transformer | 0.8197 ± 0.0036 | 0.4327 ± 0.0037 | 0.8230 ± 0.0004 | 0.4465 ± 0.0081 | 0.8741 ± 0.0043 | 0.3296 ± 0.0100 |
| GRU-D | 0.8595 ± 0.0028 | 0.5383 ± 0.0082 | 0.8541 ± 0.0015 | 0.5008 ± 0.0098 | 0.9134 ± 0.0025 | 0.4637 ± 0.0160 |
| STraTS | 0.8702 ± 0.0041 | 0.5543 ± 0.0088 | 0.8459 ± 0.0081 | 0.4705 ± 0.0218 | 0.9174 ± 0.0004 | 0.4579 ± 0.0031 |
| mTAND | 0.8397 ± 0.0031 | 0.5100 ± 0.0036 | 0.8399 ± 0.0021 | 0.4603 ± 0.0067 | 0.8381 ± 0.0069 | 0.2629 ± 0.0097 |
| SEFT | 0.8278 ± 0.0069 | 0.4646 ± 0.0092 | 0.8146 ± 0.0075 | 0.4034 ± 0.0116 | 0.9041 ± 0.0024 | 0.3969 ± 0.0095 |
| KEDGN | 0.8661 ± 0.0037 | 0.5374 ± 0.0071 | 0.8425 ± 0.0085 | 0.4734 ± 0.0116 | 0.9084 ± 0.0047 | 0.3906 ± 0.0039 |
| DuETT | 0.8672 ± 0.0021 | 0.5420 ± 0.0057 | 0.8460 ± 0.0042 | 0.4576 ± 0.0073 | 0.9033 ± 0.0033 | 0.4368 ± 0.0296 |
| TreeText-CTS (Ours) | 0.8571 ± 0.0038 | 0.5239 ± 0.0004 | 0.8579 ± 0.0015 | 0.5011 ± 0.0090 | 0.9066 ± 0.0045 | 0.4757 ± 0.0248 |

##### Construction\-level auditability\.

TreeText-CTS provides construction-level auditability by making the selected evidence units the LM input itself. Each unit indexes its source time, look-back window, tree and leaf identifiers, deterministic path predicates, and any cached clinical gloss, allowing predictions to be inspected through human-readable, source-traceable evidence. Appendix [C.1](https://arxiv.org/html/2605.20292#A3.SS1) details these guarantees and provides a held-out evidence-card case study.
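
One way to picture such a unit is as a small immutable record whose text rendering is a pure function of its fields. The sketch below is an assumption-laden illustration: the class names, the feature names `lactate_max` and `map_min`, the thresholds, and the string template are hypothetical, and the actual verbalization format is documented in Appendix C rather than here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Predicate:
    feature: str      # window-summary feature name (hypothetical naming)
    op: str           # comparison operator, e.g. "<" or ">="
    threshold: float  # split threshold taken from the tree path


@dataclass(frozen=True)
class EvidenceUnit:
    source_hour: float           # prediction time at which the window ends
    lookback_hours: float        # look-back window length
    tree_id: int
    leaf_id: int
    predicates: Tuple[Predicate, ...]
    gloss: Optional[str] = None  # cached offline clinical gloss, if any

    def verbalize(self) -> str:
        """Deterministic rendering: the same unit always yields the same string."""
        conditions = " AND ".join(
            f"{p.feature} {p.op} {p.threshold:g}" for p in self.predicates
        )
        text = (
            f"[t={self.source_hour:g}h, window={self.lookback_hours:g}h, "
            f"tree={self.tree_id}, leaf={self.leaf_id}] {conditions}"
        )
        return f"{text} ({self.gloss})" if self.gloss else text


unit = EvidenceUnit(
    source_hour=48,
    lookback_hours=16,
    tree_id=7,
    leaf_id=12,
    predicates=(Predicate("lactate_max", ">=", 2.5), Predicate("map_min", "<", 65)),
)
print(unit.verbalize())
```

Keeping the identifiers inside the rendered string is what lets a reader map any span of the classifier input back to a specific window, tree, and leaf.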

##### Takeaway\.

The results support three claims\. First, tree\-path evidence units provide a stronger text\-based interface than evaluated raw\-serialization, generated, or per\-event patient\-text interfaces for the irregular EHR tasks\. Second, the final operating point depends on the combination of tree\-grounded local evidence, learned selection, optional gloss hints, and LM classifier composition\. Third, TreeText\-CTS narrows the accuracy gap to numerical time\-series models while exposing compact human\-readable, source\-traceable evidence\-unit classifier inputs\. The central empirical message is that language models can support clinical time\-series prediction by composing compact, deterministic tree evidence, rather than parsing raw irregular trajectories or generating patient\-level narratives\.

## 6 Limitations

TreeText-CTS exposes compact, human-readable, source-traceable classifier inputs, but such construction-level auditability does not guarantee clinical correctness, causality, fairness, or calibration. Clinical glosses are auxiliary offline hints; they may be incomplete or unsupported, and the canonical deterministic predicates remain the recoverable evidence anchors. The method is limited by its tree ensemble, summary bank, and candidate retrieval, so patterns absent from these components cannot be selected by CES. Our retrospective evaluation on three public ICU benchmarks supports the proposed text interface, not clinical deployment. The exposed evidence is for inspection, not a replacement for clinical judgment.

## 7 Conclusion

We introduced TreeText-CTS, a framework that converts irregular clinical time series into human-readable, source-traceable tree-path evidence, uses CES to select a compact subset, and applies a clinical LM classifier for prediction. Across three ICU benchmarks, TreeText-CTS achieves the strongest point estimates among evaluated text-based EHR time-series interfaces while remaining competitive with numerical time-series models. Ablations show that tree-path evidence, learned budgeted selection, optional clinical glosses, and LM composition contribute to performance. Overall, TreeText-CTS supports LM-based clinical time-series prediction through a deterministic, source-traceable evidence interface while maintaining competitive performance.

## References

- \[1\] (2025) A framework to assess clinical safety and hallucination rates of llms for medical text summarisation. npj Digital Medicine 8, pp. 274. DOI: [10.1038/s41746-025-01670-7](https://dx.doi.org/10.1038/s41746-025-01670-7).
- \[2\] Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu (2018) Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8(1), pp. 6085.
- \[3\] D. Fadlon, D. Dov, A. Bennett, D. Heller-Miron, G. Levy, K. Bar, and A. Weiss-Meilik (2025) Decode like a clinician: enhancing llm fine-tuning with temporal structured data representation. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pp. 1906–1922.
- \[4\] Y. Gao, S. Myers, S. Chen, D. Dligach, T. A. Miller, D. Bitterman, M. Churpek, and M. Afshar (2024) When raw data prevails: are large language model embeddings effective in numerical data representation for medical machine learning applications? In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 5414–5428.
- \[5\] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
- \[6\] M. Horn, M. Moor, C. Bock, B. Rieck, and K. Borgwardt (2020) Set functions for time series. In International Conference on Machine Learning, pp. 4353–4363.
- \[7\] Z. Ji, Y. Sun, A. C. K. Amaral, A. Goldenberg, and R. G. Krishnan (2026) Can we generate portable representations for clinical time series data using LLMs? In The Fourteenth International Conference on Learning Representations. Note: Poster. [Link](https://openreview.net/forum?id=pXw0uRTSKT).
- \[8\] P. Jiang, C. Xiao, A. R. Cross, and J. Sun (2024) GraphCare: enhancing healthcare predictions with personalized knowledge graphs. In International Conference on Learning Representations.
- \[9\] P. Jiang, C. Xiao, M. Jiang, P. Bhatia, T. Kass-Hout, J. Sun, and J. Han (2025) Reasoning-enhanced healthcare predictions with knowledge graph community retrieval. In International Conference on Learning Representations.
- \[10\] A. E. W. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-iii, a freely accessible critical care database. Scientific Data 3, pp. 160035.
- \[11\] J. Jonnagaddala and Z. S. Wong (2025) Privacy preserving strategies for electronic health records in the era of large language models. npj Digital Medicine 8(34). DOI: [10.1038/s41746-025-01429-0](https://dx.doi.org/10.1038/s41746-025-01429-0).
- \[12\] A. Labach, A. Pokhrel, X. S. Huang, S. Zuberi, S. E. Yi, M. Volkovs, T. Poutanen, and R. G. Krishnan (2023) DuETT: dual event time transformer for electronic health records. In Proceedings of the 8th Machine Learning for Healthcare Conference, Proceedings of Machine Learning Research, Vol. 219, pp. 403–422.
- \[13\] G. Lee, W. Yu, K. Shin, W. Cheng, and H. Chen (2025) TimeCAP: learning to contextualize, augment, and predict time series events with large language model agents. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, pp. 18082–18090. DOI: [10.1609/AAAI.V39I17.33989](https://dx.doi.org/10.1609/AAAI.V39I17.33989). [Link](https://doi.org/10.1609/aaai.v39i17.33989).
- \[14\] Y. Luo, Z. Liu, L. Wang, J. Zheng, B. Wu, and Q. Ma (2024) Knowledge-empowered dynamic graph network for irregularly sampled medical time series. In Advances in Neural Information Processing Systems.
- \[15\] J. Nam, K. Kim, S. Oh, J. Tack, J. Kim, and J. Shin (2024) Optimized feature generation for tabular data via llms with decision tree reasoning. In Advances in Neural Information Processing Systems.
- \[16\] Qwen Team (2026-02) Qwen3.5: towards native multimodal agents. [Link](https://qwen.ai/blog?id=qwen3.5).
- \[17\] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- \[18\] M. A. Reyna, C. Josef, R. Jeter, S. P. Shashikumar, M. B. Westover, S. Nemati, G. D. Clifford, and A. Sharma (2020) Early prediction of sepsis from clinical data: the physionet/computing in cardiology challenge 2019. Critical Care Medicine 48(2), pp. 210–217.
- \[19\] S. N. Shukla and B. M. Marlin (2021) Multi-time attention networks for irregularly sampled time series. In International Conference on Learning Representations.
- \[20\] I. Silva, G. Moody, D. J. Scott, L. A. Celi, and R. G. Mark (2012) Predicting in-hospital mortality of icu patients: the physionet/computing in cardiology challenge 2012. In Computing in Cardiology, pp. 245–248.
- \[21\] T. Sounack, J. Davis, B. Durieux, A. Chaffin, T. J. Pollard, E. Lehman, A. E. W. Johnson, M. McDermott, T. Naumann, and C. Lindvall (2025) BioClinical modernbert: a state-of-the-art long-context encoder for biomedical and clinical nlp. arXiv preprint arXiv:2506.10896.
- \[22\] S. Tipirneni and C. K. Reddy (2022) Self-supervised transformer for sparse and irregularly sampled multivariate clinical time-series. ACM Transactions on Knowledge Discovery from Data 16(6), pp. 1–17.
- \[23\] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30.
- \[24\] P. R. Vishwanath, S. Tiwari, T. G. Naik, S. Gupta, D. N. Thai, W. Zhao, S. Kwon, V. Ardulov, K. Tarabishy, A. McCallum, and W. Salloum (2024) Faithfulness hallucination detection in healthcare AI. In KDD-AIDSH Workshop.
- \[25\] M. Wornow, S. Bedi, M. A. Fuentes Hernandez, E. Steinberg, J. A. Fries, C. Ré, S. Koyejo, and N. Shah (2025) Context clues: evaluating long context models for clinical prediction tasks on ehr data. In International Conference on Learning Representations.
- \[26\] Z. Wu, A. Dadu, M. Nalls, F. Faghri, and J. Sun (2024) Instruction tuning large language models to understand electronic health records. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track.
- \[27\] H. Ye, J. Li, H. Zhao, D. Guo, and Y. Chang (2025) LLM meeting decision trees on tabular data. In Advances in Neural Information Processing Systems. Note: Spotlight. [Link](https://openreview.net/forum?id=SRDF3RV0KP).
- \[28\] X. Zhang, M. Zeman, T. Tsiligkaridis, and M. Zitnik (2022) Graph-guided network for irregularly sampled multivariate time series. In International Conference on Learning Representations.
- \[29\] Y. Zhang, M. Li, D. Long, X. Zhang, H. Lin, B. Yang, P. Xie, A. Yang, D. Liu, J. Lin, F. Huang, and J. Zhou (2025) Qwen3 embedding: advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176.

##### Appendix organization\.

The appendix is organized into four sections. Appendix [A](https://arxiv.org/html/2605.20292#A1) gives dataset, baseline, and full benchmark details. Appendix [B](https://arxiv.org/html/2605.20292#A2) collects component ablations, budget sweeps, selector analyses, LM classifier ablations, and retrieval sensitivity. Appendix [C](https://arxiv.org/html/2605.20292#A3) states the construction-level auditability guarantees and documents verbalization, predicate canonicalization, and clinical glossing. Appendix [D](https://arxiv.org/html/2605.20292#A4) reports implementation details, latency measurement, reproducibility, code and data access, compute resources, and licenses. Unless otherwise stated, reported values are test-set mean ± standard deviation over three independent random seeds. AUROC and AUPRC are the primary metrics; F1 is included for completeness.

## Appendix A Additional experimental setup and full benchmark results

### A.1 Datasets and uniform cohort filters

##### Source preprocessing\.

We use established sparse clinical time-series preprocessing conventions before applying our uniform cohort filters. PhysioNet 2012 and MIMIC-III mortality follow the SeFT preprocessing convention \[[6](https://arxiv.org/html/2605.20292#bib.bib6)\], and PhysioNet 2019 follows the Raindrop preprocessing convention \[[28](https://arxiv.org/html/2605.20292#bib.bib10)\]. After this source preprocessing step, all raw-text baselines, numerical ISMTS baselines, and tree-evidence methods are evaluated on the same post-filter patient cohorts.

Table A.1: Source preprocessing conventions used before applying uniform cohort filters.

| Dataset | Source preprocessing convention |
|---|---|
| PhysioNet 2012 | SeFT preprocessing pipeline \[[6](https://arxiv.org/html/2605.20292#bib.bib6)\] |
| MIMIC-III mortality | SeFT preprocessing pipeline \[[6](https://arxiv.org/html/2605.20292#bib.bib6)\] |
| PhysioNet 2019 | Raindrop preprocessing pipeline \[[28](https://arxiv.org/html/2605.20292#bib.bib10)\] |

##### Task protocol\.

Table [A.2](https://arxiv.org/html/2605.20292#A1.T2) summarizes the prediction protocol and leakage-prevention convention for each benchmark. For mortality prediction on PhysioNet 2012 and MIMIC-III, inputs are restricted to observations within the first 48 hours. For PhysioNet 2019 sepsis forecasting, we follow the Raindrop preprocessing convention, which constructs one patient-level prediction example per encounter using observations up to $T_{\max}=60$ h. We apply the same cohort-level exclusions and sparsity filter across all representation pipelines.

Table A.2: Task-specific prediction protocol.

| Dataset | Input horizon | Target | Leakage prevention |
|---|---|---|---|
| PhysioNet 2012 | first 48h | in-hospital mortality | no observations after 48h |
| MIMIC-III mortality | first 48h | in-hospital mortality | no observations after 48h |
| PhysioNet 2019 | first 60h | patient-level sepsis prediction | Raindrop preprocessing convention |

##### Uniform exclusions\.

After source preprocessing, we apply the same exclusion rules to every representation pipeline before constructing raw-text, numerical, or tree-evidence inputs. These filters remove samples for which at least one pipeline cannot construct a valid input, such as samples with too few non-missing measurements for stable window summaries or no measurements in the first 60h PhysioNet 2019 input window. Table [A.3](https://arxiv.org/html/2605.20292#A1.T3) reports the resulting exclusions, aggregated over all splits and affected datasets.
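
The horizon restriction and the exclusion rules can be expressed as two small predicates applied before any representation is built. This is a minimal sketch under stated assumptions: the `Observation` layout is hypothetical, every stored observation is taken to be non-missing, and treating the horizon boundary as inclusive is a guess. The thresholds (five measurements within 48h, at least one within 60h) come from Table A.3.

```python
from typing import List, NamedTuple


class Observation(NamedTuple):
    hour: float     # hours since admission
    variable: str
    value: float


def within_horizon(obs: List[Observation], horizon_h: float) -> List[Observation]:
    """Leakage guard: drop anything recorded after the input horizon."""
    return [o for o in obs if o.hour <= horizon_h]


def keep_sample(obs: List[Observation], horizon_h: float, min_measurements: int) -> bool:
    """Uniform exclusion rule: keep only samples with enough in-window measurements."""
    return len(within_horizon(obs, horizon_h)) >= min_measurements


# Thresholds taken from Table A.3; the dictionary names are illustrative.
MORTALITY_RULE = dict(horizon_h=48.0, min_measurements=5)  # PhysioNet 2012, MIMIC-III
SEPSIS_RULE = dict(horizon_h=60.0, min_measurements=1)     # PhysioNet 2019

patient = [
    Observation(1.0, "HR", 88.0),
    Observation(5.5, "HR", 92.0),
    Observation(30.0, "Lactate", 2.1),
    Observation(50.0, "HR", 101.0),
]
print(keep_sample(patient, **MORTALITY_RULE), keep_sample(patient, **SEPSIS_RULE))
```

Applying the same predicates to every pipeline is what keeps the post-filter cohorts identical across raw-text, numerical, and tree-evidence inputs.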

Table A.3: Uniform exclusion rules applied after source preprocessing and before constructing raw-text, numerical, and tree-evidence representations. Counts are aggregated over all train/validation/test splits and affected datasets.

| Criterion | Removed samples | Positives removed | Affected datasets |
|---|---|---|---|
| Fewer than five measurements in the 48h input window | 52 | 4 | PhysioNet 2012, MIMIC-III |
| No measurements in the first 60h input window | 3 | 1 | PhysioNet 2019 |

##### Post\-filter split statistics\.

Table [A.4](https://arxiv.org/html/2605.20292#A1.T4) reports the final patient-level splits after source preprocessing and uniform exclusions. Positive rates are computed after all cohort filters.

Table A.4: Post-filter split statistics. One sample is one patient-level prediction example. Positive rate is computed after all cohort filters.

| Dataset / split | Total | Positive | Negative | Positive rate |
|---|---|---|---|---|
| PhysioNet 2012 train | 7,670 | 1,093 | 6,577 | 14.25% |
| PhysioNet 2012 val | 1,916 | 273 | 1,643 | 14.25% |
| PhysioNet 2012 test | 2,400 | 341 | 2,059 | 14.21% |
| MIMIC-III train | 14,632 | 1,978 | 12,654 | 13.52% |
| MIMIC-III val | 3,208 | 435 | 2,773 | 13.56% |
| MIMIC-III test | 3,217 | 374 | 2,843 | 11.63% |
| PhysioNet 2019 train | 24,817 | 1,014 | 23,803 | 4.09% |
| PhysioNet 2019 val | 6,217 | 254 | 5,963 | 4.09% |
| PhysioNet 2019 test | 7,766 | 325 | 7,441 | 4.18% |

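
The split statistics in Table A.4 are internally checkable: positives plus negatives should equal the total, and the positive rate should equal positives over total. The short sketch below runs that check on the tabulated counts; the helper name `is_consistent` and the two-decimal rounding are assumptions made for the illustration.

```python
# (total, positive, negative, reported positive rate in %) copied from Table A.4.
splits = {
    "PhysioNet 2012 train": (7670, 1093, 6577, 14.25),
    "PhysioNet 2012 val": (1916, 273, 1643, 14.25),
    "PhysioNet 2012 test": (2400, 341, 2059, 14.21),
    "MIMIC-III train": (14632, 1978, 12654, 13.52),
    "MIMIC-III val": (3208, 435, 2773, 13.56),
    "MIMIC-III test": (3217, 374, 2843, 11.63),
    "PhysioNet 2019 train": (24817, 1014, 23803, 4.09),
    "PhysioNet 2019 val": (6217, 254, 5963, 4.09),
    "PhysioNet 2019 test": (7766, 325, 7441, 4.18),
}


def is_consistent(total, positive, negative, rate_pct):
    """Counts must add up and the rate must match positives / total at two decimals."""
    return positive + negative == total and round(100 * positive / total, 2) == rate_pct


inconsistent = [name for name, row in splits.items() if not is_consistent(*row)]
print(inconsistent)
```

An empty list means every row of the table agrees with its own counts.
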
##### Why identical cohorts matter\.

Raw text, numerical ISMTS inputs, and tree evidence are constructed by different preprocessing pipelines\. Using identical post\-filter patient cohorts ensures that reported differences are attributable to representation and model interface rather than to different underlying patient splits\.

##### Faithful versus adapted interfaces\.

Table [A.5](https://arxiv.org/html/2605.20292#A1.T5) summarizes how each text baseline is instantiated. Faithful variants preserve the baseline’s original-style prediction interface. Adapted variants keep the same patient text representation but replace the final predictor with BioClinical ModernBERT, matching TreeText-CTS’s reader. This separates representation quality from downstream classifier choice.

Table A.5: Text-representation baseline interfaces. Faithful variants preserve the original-style prediction interface; adapted variants keep the same patient text representation but use the same language model as TreeText-CTS.

| Baseline family | Text representation | Faithful predictor | Adapted predictor |
|---|---|---|---|
| WRDP | WRDP-style narrative | Llama-3-8B frozen mean pool + XGBoost | BioClinical ModernBERT fine-tune |
| Record2Vec | Qwen3.5-27B patient summary | Qwen3-Embedding + PatchTSMixer | BioClinical ModernBERT fine-tune |
| Decode | Exact per-event text with 6h bins | Llama-3-8B LoRA first-token $P(\text{Yes})$ | BioClinical ModernBERT fine-tune |
| TimeCP | Qwen3.5-27B contextualized patient text | Qwen3.5-27B answer-prediction log-probability | BioClinical ModernBERT fine-tune |

##### Shared static\-prefix protocol\.

All text-based methods are evaluated on the same filtered cohorts and receive the same deterministic static-covariate prefix, e.g., `Age is ..., Gender is ...`. For adapted encoder-based variants and TreeText-CTS, this prefix is prepended to the string read by BioClinical ModernBERT. For embedding-based variants, it is included in the text passed to Qwen3-Embedding-8B. For faithful decoder, scoring, or summarization variants, it is included immediately before the Llama-3-8B or Qwen3.5-27B call. Thus, differences among text approaches are not attributable to differential access to age or gender.
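
A shared prefix of this kind amounts to one deterministic formatting function reused by every text pipeline. The sketch below is illustrative: the function names, the trailing period, the single-space separator, and the example body strings are assumptions; only the "Age is ..., Gender is ..." pattern comes from the text.

```python
def static_prefix(age: int, gender: str) -> str:
    """Deterministic static-covariate prefix shared by all text methods."""
    return f"Age is {age}, Gender is {gender}."


def build_input(age: int, gender: str, body_text: str, sep: str = " ") -> str:
    """Prepend the same prefix to whatever text a given method produces."""
    return static_prefix(age, gender) + sep + body_text


# Hypothetical patient and two hypothetical method-specific bodies.
tree_input = build_input(63, "female", "<selected tree-path evidence units>")
raw_input = build_input(63, "female", "<raw serialized observations>")
print(tree_input)
print(raw_input)
```

Because both strings start with an identical prefix, any performance gap between methods has to come from the body text that follows it.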

##### Generation and embedding backbones\.

Generation and embedding models follow the corresponding representation family rather than a single universal generator\. We use Qwen3\.5\-27B for Qwen\-based patient\-summary or contextualization generation and for TreeText\-CTS’s offline leaf\-level clinical glosses\. We use Qwen3\-Embedding\-8B for Qwen embedding components, including Record2Vec\-style embedding and TreeText\-CTS’s leaf embeddings\. Llama\-based faithful interfaces retain their Llama\-3\-8B reader, embedding, or label\-scoring model\.

##### TimeCP versus TimeCAP naming\.

The TimeCAP paper introduces both TimeCP, a text-only Contextualize–Predict interface, and full TimeCAP, which additionally uses a time-series encoder, retrieval, and prediction fusion. We therefore report TimeCP in our text-representation benchmark to avoid mixing patient-text baselines with extra numerical modeling components, while citing the source paper as TimeCAP.

##### Full metrics\.

Table [A.6](https://arxiv.org/html/2605.20292#A1.T6) gives the full AUROC/AUPRC/F1 companion to the main text-baseline table.

Table A.6: Full text-baseline results. Each metric is mean ± standard deviation. TreeText-CTS (Ours) is listed in the last row.

| Method | PhysioNet 2012 AUROC | PhysioNet 2012 AUPRC | PhysioNet 2012 F1 | MIMIC-III AUROC | MIMIC-III AUPRC | MIMIC-III F1 | PhysioNet 2019 AUROC | PhysioNet 2019 AUPRC | PhysioNet 2019 F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WRDP-Faithful | 0.7715 ± 0.0019 | 0.3528 ± 0.0001 | 0.4102 ± 0.0042 | 0.7543 ± 0.0028 | 0.2958 ± 0.0035 | 0.3579 ± 0.0078 | 0.8509 ± 0.0012 | 0.3038 ± 0.0096 | 0.3615 ± 0.0077 |
| WRDP-Adapted | 0.8236 ± 0.0008 | 0.4636 ± 0.0181 | 0.4884 ± 0.0005 | 0.7843 ± 0.0017 | 0.3580 ± 0.0089 | 0.4046 ± 0.0028 | 0.8853 ± 0.0042 | 0.3792 ± 0.0025 | 0.4138 ± 0.0009 |
| Record2Vec-Faithful | 0.7720 ± 0.0006 | 0.4033 ± 0.0020 | 0.4288 ± 0.0046 | 0.7839 ± 0.0028 | 0.3738 ± 0.0009 | 0.4040 ± 0.0026 | 0.7450 ± 0.0024 | 0.1151 ± 0.0029 | 0.1912 ± 0.0039 |
| Record2Vec-Adapted | 0.7675 ± 0.0109 | 0.3788 ± 0.0273 | 0.4207 ± 0.0191 | 0.8040 ± 0.0005 | 0.3692 ± 0.0014 | 0.4167 ± 0.0093 | 0.7894 ± 0.0025 | 0.1636 ± 0.0065 | 0.2387 ± 0.0000 |
| Decode-Faithful | 0.7821 ± 0.0217 | 0.4376 ± 0.0310 | 0.3701 ± 0.1213 | 0.8394 ± 0.0126 | 0.4292 ± 0.0041 | 0.2077 ± 0.0000 | 0.7913 ± 0.0236 | 0.3266 ± 0.0072 | 0.3839 ± 0.0075 |
| Decode-Adapted | 0.7854 ± 0.0031 | 0.3598 ± 0.0001 | 0.4194 ± 0.0002 | 0.7933 ± 0.0026 | 0.3629 ± 0.0060 | 0.4223 ± 0.0004 | 0.8118 ± 0.0041 | 0.2448 ± 0.0011 | 0.2810 ± 0.0054 |
| TimeCP-Faithful | 0.5663 ± 0.0037 | 0.1579 ± 0.0149 | 0.2895 ± 0.0098 | 0.6120 ± 0.0020 | 0.1453 ± 0.0051 | 0.2641 ± 0.0035 | 0.5266 ± 0.0117 | 0.0457 ± 0.0276 | 0.0891 ± 0.0252 |
| TimeCP-Adapted | 0.7999 ± 0.0213 | 0.4156 ± 0.0142 | 0.4623 ± 0.0000 | 0.8065 ± 0.0086 | 0.4110 ± 0.0090 | 0.4442 ± 0.0000 | 0.7949 ± 0.0006 | 0.1520 ± 0.0040 | 0.2304 ± 0.0027 |
| TreeText-CTS (Ours) | 0.8571 ± 0.0038 | 0.5239 ± 0.0004 | 0.5264 ± 0.0023 | 0.8579 ± 0.0015 | 0.5011 ± 0.0090 | 0.4958 ± 0.0029 | 0.9066 ± 0.0045 | 0.4757 ± 0.0248 | 0.4994 ± 0.0383 |

##### Interpretation\.

The adapted rows show that simply replacing a predictor with BioClinical ModernBERT is not sufficient\. Adapted WRDP and adapted TimeCP improve over some faithful variants, but they remain below TreeText\-CTS because their representations are not source\-traceable tree\-path evidence\. TimeCP\-Faithful is a failure case on the 6\-hour PhysioNet 2019 sepsis forecast: its AUPRC is close to the test positive rate, indicating near\-random precision ranking under this highly imbalanced short\-horizon prompting setup\. We include both the faithful and adapted TimeCP rows to separate this answer\-scoring failure from the quality of the contextualized text representation\.
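The remark that an AUPRC close to the positive rate indicates near-random ranking can be checked numerically. The sketch below is illustrative only: the cohort size, the 5% positive rate, and the `average_precision` helper are assumptions, not values or code from the paper. It uses the step-wise average-precision estimator of AUPRC and scores a synthetic cohort with uniformly random scores.

```python
import random


def average_precision(labels, scores):
    """Step-wise average precision: mean of precision@rank over positive ranks."""
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")


rng = random.Random(0)
n, prevalence = 20000, 0.05  # hypothetical cohort size and positive rate
labels = [1 if rng.random() < prevalence else 0 for _ in range(n)]
random_scores = [rng.random() for _ in range(n)]  # an uninformative ranker

positive_rate = sum(labels) / n
ap_random = average_precision(labels, random_scores)
print(f"positive rate={positive_rate:.4f}, random-ranker AP={ap_random:.4f}")
```

Under these assumptions the random ranker's average precision stays close to the positive rate, which is why an AUPRC near the test prevalence is read as near-random precision ranking.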

### A.2 Numerical irregular-time-series baselines and full results

##### Full numerical comparison\.

Table [A.7](https://arxiv.org/html/2605.20292#A1.T7) reports AUROC/AUPRC/F1 against numerical irregular-time-series models. This is a positioning comparison: TreeText-CTS is not intended to dominate numerical models on every metric, but to provide competitive accuracy together with source-traceable textual evidence.

Table A.7: Full numerical ISMTS comparison. Each metric is mean ± standard deviation. TreeText-CTS (Ours) is listed in the last row.

| Method | PhysioNet 2012 AUROC | PhysioNet 2012 AUPRC | PhysioNet 2012 F1 | MIMIC-III AUROC | MIMIC-III AUPRC | MIMIC-III F1 | PhysioNet 2019 AUROC | PhysioNet 2019 AUPRC | PhysioNet 2019 F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LOCF LSTM | 0.8192 ± 0.0044 | 0.4548 ± 0.0149 | 0.4751 ± 0.0083 | 0.8240 ± 0.0016 | 0.4811 ± 0.0113 | 0.4706 ± 0.0081 | 0.8514 ± 0.0043 | 0.2392 ± 0.0064 | 0.3220 ± 0.0043 |
| LOCF Transformer | 0.8197 ± 0.0036 | 0.4327 ± 0.0037 | 0.4688 ± 0.0078 | 0.8230 ± 0.0004 | 0.4465 ± 0.0081 | 0.4641 ± 0.0066 | 0.8741 ± 0.0043 | 0.3296 ± 0.0100 | 0.3727 ± 0.0014 |
| GRU-D | 0.8595 ± 0.0028 | 0.5383 ± 0.0082 | 0.5402 ± 0.0107 | 0.8541 ± 0.0015 | 0.5008 ± 0.0098 | 0.4884 ± 0.0022 | 0.9134 ± 0.0025 | 0.4637 ± 0.0160 | 0.4734 ± 0.0138 |
| STraTS | 0.8702 ± 0.0041 | 0.5543 ± 0.0088 | 0.5489 ± 0.0089 | 0.8459 ± 0.0081 | 0.4705 ± 0.0218 | 0.4716 ± 0.0137 | 0.9174 ± 0.0004 | 0.4579 ± 0.0031 | 0.4645 ± 0.0050 |
| mTAND Transformer | 0.8397 ± 0.0031 | 0.5100 ± 0.0036 | 0.5130 ± 0.0051 | 0.8399 ± 0.0021 | 0.4603 ± 0.0067 | 0.4753 ± 0.0014 | 0.8381 ± 0.0069 | 0.2629 ± 0.0097 | 0.3401 ± 0.0150 |
| SEFT | 0.8278 ± 0.0069 | 0.4646 ± 0.0092 | 0.4733 ± 0.0103 | 0.8146 ± 0.0075 | 0.4034 ± 0.0116 | 0.4384 ± 0.0090 | 0.9041 ± 0.0024 | 0.3969 ± 0.0095 | 0.4451 ± 0.0113 |
| KEDGN | 0.8661 ± 0.0037 | 0.5374 ± 0.0071 | 0.5372 ± 0.0066 | 0.8425 ± 0.0085 | 0.4734 ± 0.0116 | 0.4819 ± 0.0163 | 0.9084 ± 0.0047 | 0.3906 ± 0.0039 | 0.4206 ± 0.0143 |
| DuETT | 0.8672 ± 0.0021 | 0.5420 ± 0.0057 | 0.5455 ± 0.0039 | 0.8460 ± 0.0042 | 0.4576 ± 0.0073 | 0.4820 ± 0.0078 | 0.9033 ± 0.0033 | 0.4368 ± 0.0296 | 0.4648 ± 0.0237 |
| TreeText-CTS (Ours) | 0.8571 ± 0.0038 | 0.5239 ± 0.0004 | 0.5264 ± 0.0023 | 0.8579 ± 0.0015 | 0.5011 ± 0.0090 | 0.4958 ± 0.0029 | 0.9066 ± 0.0045 | 0.4757 ± 0.0248 | 0.4994 ± 0.0383 |

##### Selected learning rates\.

Table [A.8](https://arxiv.org/html/2605.20292#A1.T8) reports the validation-selected learning rate for each numerical model and dataset. The sweep grid ranges from $10^{-3}$ to $10^{-5}$.

Table A.8: Validation-selected learning rates for numerical ISMTS baselines.

| Method | PhysioNet 2012 | MIMIC-III | PhysioNet 2019 |
| --- | --- | --- | --- |
| LOCF LSTM | $10^{-4}$ | $10^{-4}$ | $10^{-5}$ |
| LOCF Transformer | $10^{-4}$ | $10^{-4}$ | $10^{-5}$ |
| GRU-D | $10^{-4}$ | $10^{-4}$ | $10^{-4}$ |
| STraTS | $10^{-5}$ | $10^{-4}$ | $10^{-4}$ |
| mTAND Transformer | $10^{-3}$ | $10^{-4}$ | $10^{-4}$ |
| SEFT | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| KEDGN | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| DuETT | $10^{-3}$ | $10^{-4}$ | $10^{-4}$ |

##### Unified numerical training protocol.

For all numerical baselines (LOCF LSTM, LOCF Transformer, GRU-D, mTAND, SeFT, KEDGN, DuETT, and STraTS), we use the same patient cohorts and splits as TreeText-CTS. All models are trained with AdamW, weight decay $10^{-4}$, early stopping on validation AUROC, and three random seeds. Each input variable is normalized using statistics computed on the training split only.

Table A.9: Shared numerical-baseline training protocol unless a model-specific paper-faithful implementation requires otherwise.

| Item | Setting |
| --- | --- |
| Optimizer | AdamW |
| Learning-rate sweep | $10^{-3}$ to $10^{-5}$ |
| Epochs | 30 maximum |
| Early stopping | validation AUROC, patience 10 |
| Batch size | 64 train / 128 validation and test |
| Seeds | 3 independent seeds |
| Loss | binary cross-entropy with logits |
| Evaluation | best-validation-AUROC checkpoint, single held-out test pass |
| Normalization | per-variable z-transform from training split only |
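
The normalization entry of the protocol (per-variable z-transform from the training split only) can be sketched as follows. This is a minimal illustration under stated assumptions, not the baselines' actual preprocessing code: the helper names `fit_zscore` and `apply_zscore`, the use of `None` for missing values, and the unit-variance fallback for constant variables are all hypothetical, and every variable is assumed to have at least one training observation.

```python
import math


def fit_zscore(train_rows):
    """Compute per-variable (mean, std) from training rows only; None = missing."""
    n_vars = len(train_rows[0])
    stats = []
    for j in range(n_vars):
        observed = [row[j] for row in train_rows if row[j] is not None]
        mean = sum(observed) / len(observed)
        var = sum((x - mean) ** 2 for x in observed) / len(observed)
        # Assumed guard: fall back to std = 1.0 when a variable is constant.
        stats.append((mean, math.sqrt(var) or 1.0))
    return stats


def apply_zscore(rows, stats):
    """Apply training-split statistics to any split; missing values stay None."""
    return [
        [None if x is None else (x - mean) / std for x, (mean, std) in zip(row, stats)]
        for row in rows
    ]


train = [[1.0, 10.0], [3.0, None], [5.0, 10.0]]
stats = fit_zscore(train)
test_normalized = apply_zscore([[3.0, 12.0], [None, 10.0]], stats)
```

Fitting on the training rows and reusing `stats` for validation and test rows keeps held-out statistics out of the preprocessing step.
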

## Appendix B Additional ablations and selector analyses

### B.1 Full component ablations

##### Purpose of the ablations\.

The ablations separate four questions: whether gloss-gated clinical glosses help beyond deterministic predicates, whether learned hard evidence selection improves over fixed leaf-score top-$K$ selection, whether selected leaf identities alone are sufficient without a language reader, and whether XGB-only aggregation can replace the local-to-global tree-evidence pipeline. These ablations evaluate the contribution of learned budgeted selection; they do not compare all possible optimization methods for training a selector.

##### Text versus leaf identity\.

The leaf\-ID controls separate predictive selection from language composition\. A leaf\-ID MLP receives only sparse identifiers of selected leaves and therefore tests whether selected tree identities are sufficient for prediction\. These rows are strong, confirming that selected leaves are predictive\. However, they do not provide readable classifier inputs and underperform the full model in AUPRC on all three datasets\. The largest gap appears on PhysioNet 2019, where TreeText\-CTS improves AUPRC by 0\.0677 over the reused\-CES leaf\-ID MLP\.

### B.2 Budget sweeps and final selection enrichment

##### Budget sweep\.

Tables [A.10](https://arxiv.org/html/2605.20292#A2.T10) and [A.11](https://arxiv.org/html/2605.20292#A2.T11) report the PhysioNet 2012 budget sweep used for Figure [2](https://arxiv.org/html/2605.20292#S5.F2). This is a matched fixed-budget analysis: every selector exposes exactly $K$ evidence units to the same LM classifier under a maximum input length of 3072. The CES top-$K$ row ranks candidates by the trained gate margin $r_{ij}$, while heuristic rows rank candidates by leaf score, recency, or random order. The CES bottom-$K$ row selects the lowest-margin evidence and is included only as a diagnostic negative control. This analysis isolates evidence-ranking quality from the adaptive accept/reject behavior of the learned gate. The main paper plots AUPRC because the clinical tasks are imbalanced; Figure [A.1](https://arxiv.org/html/2605.20292#A2.F1) provides the AUROC companion.
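The matched fixed-budget comparison can be pictured with a small sketch in which every selector returns exactly $K$ units and only the ranking key differs. The candidate records, their field names (`margin`, `leaf_score`, `recency`), and the toy values are hypothetical; in particular, how the paper computes leaf score and recency is not specified here.

```python
import random


def top_k(candidates, key, k, reverse=True):
    """Return exactly k candidates ranked by `key` (highest first by default)."""
    return sorted(candidates, key=key, reverse=reverse)[:k]


# Hypothetical candidate evidence units for one patient.
candidates = [
    {"id": "e1", "margin": 2.1, "leaf_score": 0.3, "recency": 0.95},
    {"id": "e2", "margin": -0.5, "leaf_score": 0.9, "recency": 0.40},
    {"id": "e3", "margin": 1.2, "leaf_score": 0.1, "recency": 0.99},
    {"id": "e4", "margin": 0.2, "leaf_score": 0.6, "recency": 0.70},
]

K = 2
selectors = {
    "ces_top_k": top_k(candidates, lambda e: e["margin"], K),
    # Negative control: lowest learned margins first.
    "ces_bottom_k": top_k(candidates, lambda e: e["margin"], K, reverse=False),
    "leaf_score_top_k": top_k(candidates, lambda e: e["leaf_score"], K),
    "recency_top_k": top_k(candidates, lambda e: e["recency"], K),
    "random_top_k": random.Random(0).sample(candidates, K),
}

for name, units in selectors.items():
    print(name, [e["id"] for e in units])
```

Because each selector exposes the same number of units, differences in downstream metrics reflect ranking quality rather than how much evidence the reader receives.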

##### Interpretation\.

CES top-$K$ is the strongest selector across the practical budget range; the only exception in the sweep is AUPRC at $K=2$, where leaf-score top-$K$ is slightly higher (0.4190 versus 0.4099). At $K=15$, CES top-$K$ already exceeds every heuristic selector evaluated up to $K=35$ in both AUROC (0.8488 versus a heuristic maximum of 0.8415) and AUPRC (0.4980 versus 0.4915). In contrast, CES bottom-$K$ is both weaker and higher-variance, especially at small and middle budgets. This indicates that low learned margins correspond to unstable and weak classifier inputs, while high learned margins define a useful evidence-utility ordering.

Table A.10: PhysioNet 2012 AUROC budget sweep. Every row exposes exactly $K$ evidence units to the same LM classifier. CES bottom-$K$ selects the lowest learned-margin evidence and is included only as a diagnostic negative control.

| Selector | $K=2$ | $K=5$ | $K=10$ | $K=15$ | $K=20$ | $K=25$ | $K=30$ | $K=35$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Leaf-score top-$K$ | 0.7963 ± 0.0006 | 0.8030 ± 0.0070 | 0.8118 ± 0.0064 | 0.8187 ± 0.0009 | 0.8246 ± 0.0064 | 0.8356 ± 0.0022 | 0.8404 ± 0.0074 | 0.8415 ± 0.0032 |
| Random top-$K$ | 0.7409 ± 0.0076 | 0.7559 ± 0.0022 | 0.7980 ± 0.0018 | 0.8097 ± 0.0011 | 0.8156 ± 0.0011 | 0.8169 ± 0.0054 | 0.8162 ± 0.0041 | 0.8150 ± 0.0040 |
| Recency top-$K$ | 0.8063 ± 0.0056 | 0.8160 ± 0.0002 | 0.8127 ± 0.0021 | 0.8249 ± 0.0011 | 0.8237 ± 0.0027 | 0.8207 ± 0.0028 | 0.8239 ± 0.0017 | 0.8253 ± 0.0007 |
| CES top-$K$ (ours) | **0.8167** ± 0.0035 | **0.8388** ± 0.0036 | **0.8446** ± 0.0007 | **0.8488** ± 0.0030 | **0.8502** ± 0.0029 | **0.8519** ± 0.0014 | **0.8571** ± 0.0038 | **0.8561** ± 0.0010 |
| CES bottom-$K$ diagnostic | 0.7745 ± 0.0015 | 0.7884 ± 0.0021 | 0.7921 ± 0.0010 | 0.8017 ± 0.0138 | 0.7758 ± 0.0351 | 0.8110 ± 0.0215 | 0.8232 ± 0.0261 | 0.8167 ± 0.0005 |

Table A.11: PhysioNet 2012 AUPRC budget sweep. These values are plotted in Figure [2](https://arxiv.org/html/2605.20292#S5.F2). Every row exposes exactly $K$ evidence units to the same LM classifier. CES bottom-$K$ is a low-margin diagnostic negative control.

| Selector | $K=2$ | $K=5$ | $K=10$ | $K=15$ | $K=20$ | $K=25$ | $K=30$ | $K=35$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Leaf-score top-$K$ | **0.4190** ± 0.0014 | 0.4323 ± 0.0145 | 0.4506 ± 0.0031 | 0.4662 ± 0.0114 | 0.4657 ± 0.0270 | 0.4648 ± 0.0037 | 0.4915 ± 0.0112 | 0.4907 ± 0.0053 |
| Random top-$K$ | 0.3263 ± 0.0169 | 0.3469 ± 0.0052 | 0.3898 ± 0.0170 | 0.4082 ± 0.0021 | 0.4255 ± 0.0031 | 0.4311 ± 0.0166 | 0.4202 ± 0.0031 | 0.4133 ± 0.0122 |
| Recency top-$K$ | 0.4143 ± 0.0057 | 0.4278 ± 0.0108 | 0.4319 ± 0.0028 | 0.4568 ± 0.0002 | 0.4573 ± 0.0010 | 0.4479 ± 0.0090 | 0.4517 ± 0.0101 | 0.4601 ± 0.0018 |
| CES top-$K$ (ours) | 0.4099 ± 0.0070 | **0.4693** ± 0.0030 | **0.4889** ± 0.0039 | **0.4980** ± 0.0151 | **0.4974** ± 0.0039 | **0.5076** ± 0.0020 | **0.5239** ± 0.0004 | **0.5094** ± 0.0013 |
| CES bottom-$K$ diagnostic | 0.3427 ± 0.0441 | 0.3746 ± 0.0527 | 0.3902 ± 0.0253 | 0.3563 ± 0.0741 | 0.3866 ± 0.0309 | 0.4362 ± 0.0684 | 0.4419 ± 0.0076 | 0.4384 ± 0.0133 |

![Refer to caption](https://arxiv.org/html/2605.20292v1/x2.png)

Figure A.1: PhysioNet 2012 AUROC budget sweep. The learned CES remains above fixed leaf-score, recency, and random top-$K$ selectors across the practical budget range. We keep AUPRC as the main-paper Figure [2](https://arxiv.org/html/2605.20292#S5.F2) metric because the clinical tasks are imbalanced.

##### Final selection enrichment.

Table [A.12](https://arxiv.org/html/2605.20292#A2.T12) reports selected-to-candidate enrichment ratios. Values above 1 indicate selection preference relative to the candidate pool.

Table A.12: Final selection enrichment ratio. Values above 1 indicate preference relative to the candidate pool.

| Category | Bucket | PhysioNet 2012 | MIMIC-III | PhysioNet 2019 |
| --- | --- | --- | --- | --- |
| Glossability | Glossable, $g(e)=1$ | 1.27 | 1.11 | 0.52 |
| Glossability | Non-glossable, $g(e)=0$ | 0.78 | 0.92 | 1.14 |
| Recency | recent $\geq 0.9$ | 2.81 | 3.24 | 2.31 |
| Recency | mid 0.6–0.9 | 0.84 | 0.72 | 0.71 |
| Recency | old $<0.6$ | 0.34 | 0.45 | 0.22 |
| Window | 1h | 0.25 | 0.16 | 0.28 |
| Window | 2h | 0.50 | 0.37 | 0.39 |
| Window | 4h | 0.76 | 0.82 | 0.97 |
| Window | 8h | 1.59 | 2.29 | 2.12 |
| Window | 16h | 4.12 | 6.78 | 3.12 |
| Window | 32h | 7.02 | 11.23 | 3.75 |
| Window | 48h | 7.18 | 11.25 | 3.99 |

##### Interpretation.

The selector consistently enriches recent and long-window evidence. Glossability behavior is task-dependent: PhysioNet 2012 and MIMIC-III mildly enrich glossable evidence, whereas PhysioNet 2019 strongly depletes glossable evidence. This indicates that CES is not mechanically selecting every clinically glossed leaf.

### B.3 Minimum-selection floor and selector-side input ablations

##### Minimum\-selection floor\.

The Compact Evidence Selector uses a minimum-selection floor $K_{\min}=5$ before the final budget cap $K=30$. This floor is not an additional evidence budget; it prevents all-reject collapse, where the selector rejects nearly all candidates and the reader receives an empty or uninformative text. Table [A.13](https://arxiv.org/html/2605.20292#A2.T13) ablates this floor on PhysioNet 2012.
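One way to picture the interaction between the floor and the cap is the sketch below. It is an assumption-laden illustration rather than the paper's selector: the accept rule (margin greater than zero), the function name `select_with_floor_and_cap`, and the choice to fill the floor with the highest-margin rejected candidates are all hypothetical.

```python
def select_with_floor_and_cap(margins, k_min=5, k_max=30):
    """Return indices of selected candidates, highest gate margin first.

    Assumed rule: the gate accepts candidates with margin > 0. If fewer than
    k_min are accepted, the best-ranked rejected candidates fill the floor;
    if more than k_max are accepted, only the top k_max are kept.
    """
    ranked = sorted(range(len(margins)), key=lambda i: margins[i], reverse=True)
    n_accepted = sum(1 for m in margins if m > 0)
    n_keep = min(max(n_accepted, k_min), k_max, len(margins))
    return ranked[:n_keep]


# All-reject case: the floor still passes five units to the reader.
all_rejected = [-1.0, -0.2, -3.0, -0.5, -2.0, -0.1, -4.0, -0.3]
print(select_with_floor_and_cap(all_rejected))

# Over-accept case: the cap limits the reader input to 30 units.
print(len(select_with_floor_and_cap([1.0] * 40)))
```

In this reading the floor never raises the budget above $K$; it only changes what happens when the gate accepts too few candidates.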

Table A.13: Minimum-selection floor ablation on PhysioNet 2012. All rows use Predicate+gloss evidence, BioClinical ModernBERT, top-$M=5$ candidate retrieval, and final selected-evidence budget $K=30$. Without the floor, one of three seeds collapses; with $K_{\min}=5$, the main setting remains stable.

| Setting | Seeds | Collapsed seeds | Test AUROC | Test AUPRC | Test F1 |
| --- | --- | --- | --- | --- | --- |
| No minimum-selection floor, $K_{\min}=0$ | 3 | 1/3 | 0.7322 ± 0.1640 | 0.3852 ± 0.1691 | 0.4236 ± 0.1238 |
| Collapsed seed under $K_{\min}=0$ | 1 | 1/1 | 0.5004 | 0.1466 | 0.2488 |
| With minimum-selection floor, $K_{\min}=5$ | 3 | 0/3 | **0.8571 ± 0.0038** | **0.5239 ± 0.0004** | **0.5264 ± 0.0023** |

##### Selector\-side inputs\.

The selector receives a projected leaf embedding and eight scalar metadata features. Table [A.14](https://arxiv.org/html/2605.20292#A2.T14) masks one selector-side input group at a time. The text fed to the reader remains unchanged.

Table A.14: Selector-side input masking. Text fed to the reader is unchanged; only CES inputs are masked. Each dataset cell is AUROC/AUPRC.

| Removed selector input | PhysioNet 2012 AUROC | PhysioNet 2012 AUPRC | MIMIC-III AUROC | MIMIC-III AUPRC | PhysioNet 2019 AUROC | PhysioNet 2019 AUPRC |
| --- | --- | --- | --- | --- | --- | --- |
| None | 0.8571 ± 0.0038 | 0.5239 ± 0.0004 | 0.8579 ± 0.0015 | 0.5011 ± 0.0090 | 0.9066 ± 0.0045 | 0.4757 ± 0.0248 |
| Glossable scalar | 0.8534 ± 0.0047 | 0.5086 ± 0.0017 | 0.8461 ± 0.0039 | 0.4416 ± 0.0202 | 0.9116 ± 0.0026 | 0.4517 ± 0.0060 |
| XGB statistics bundle | 0.8312 ± 0.0119 | 0.4548 ± 0.0160 | 0.8268 ± 0.0088 | 0.4362 ± 0.0017 | 0.9040 ± 0.0018 | 0.4464 ± 0.0356 |
| Leaf embedding | 0.8550 ± 0.0010 | 0.5173 ± 0.0051 | 0.8479 ± 0.0016 | 0.4798 ± 0.0137 | 0.9094 ± 0.0028 | 0.4467 ± 0.0193 |

##### Interpretation\.

Removing the minimum-selection floor makes selector training brittle: one seed collapses to near-random AUROC and very low AUPRC. The floor prevents this failure mode while preserving the final selected-evidence budget $K=30$. For selector-side inputs, masking XGB statistics causes the largest drop on PhysioNet 2012 and MIMIC-III, showing that tree-side predictive metadata is important. Masking the leaf embedding reduces AUPRC on all three datasets, with the largest AUPRC drop on PhysioNet 2019, although PhysioNet 2019 AUROC slightly increases when the embedding is removed. We therefore interpret the leaf embedding as improving selected-evidence ranking for precision-recall behavior rather than uniformly improving every discrimination metric. The glossability scalar is useful but task-dependent, consistent with the enrichment analysis above.

##### Selection enrichment statistic\.

For the selector-profile analysis in Figure [2](https://arxiv.org/html/2605.20292#S5.F2), we measure how often a category appears among the final selected evidence units relative to how often it appears in the candidate pool available to CES. Let $C_{i}$ denote the candidate tree-path evidence pool for patient $i$, and let $S_{i}\subseteq C_{i}$ denote the final evidence units selected by CES after the minimum-count floor and top-$K$ budget cap. For any categorical attribute $a(e)$, such as recency bin, look-back window, or gloss availability, the candidate-pool frequency and selected-evidence frequency of category $c$ are computed as

$$p_{\mathrm{pool}}(c)=\frac{\sum_{i}\sum_{e\in C_{i}}\mathbf{1}[a(e)=c]}{\sum_{i}|C_{i}|},\qquad p_{\mathrm{sel}}(c)=\frac{\sum_{i}\sum_{e\in S_{i}}\mathbf{1}[a(e)=c]}{\sum_{i}|S_{i}|}.$$

We define enrichment as the ratio

$$\mathrm{Enrich}(c)=\frac{p_{\mathrm{sel}}(c)}{p_{\mathrm{pool}}(c)}.$$

Thus, $\mathrm{Enrich}(c)>1$ means that category $c$ is over-represented in the selected evidence relative to its availability in the candidate pool, whereas $\mathrm{Enrich}(c)<1$ means that it is under-represented. This normalization is important because some categories, such as particular look-back windows or glossed evidence units, may already be more common in the candidate pool before selection.
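
The enrichment ratio defined above can be computed directly from per-patient pools and selections. The sketch below is a minimal illustration, assuming evidence units are stored as dictionaries and that at least one unit is selected overall; the function name `enrichment`, the `window` field, and the toy data are hypothetical.

```python
from collections import Counter


def enrichment(pools, selected, attribute):
    """Compute Enrich(c) = p_sel(c) / p_pool(c) for every category in the pool.

    pools, selected: per-patient lists of evidence units (C_i and S_i).
    attribute: function mapping an evidence unit e to its category a(e).
    """
    pool_counts = Counter(attribute(e) for units in pools for e in units)
    sel_counts = Counter(attribute(e) for units in selected for e in units)
    n_pool = sum(pool_counts.values())
    n_sel = sum(sel_counts.values())
    return {
        c: (sel_counts[c] / n_sel) / (pool_counts[c] / n_pool)
        for c in pool_counts
    }


# Toy example with two patients and a look-back-window attribute.
pools = [
    [{"window": "1h"}, {"window": "1h"}, {"window": "48h"}],
    [{"window": "1h"}, {"window": "48h"}, {"window": "1h"}],
]
selected = [
    [{"window": "48h"}],
    [{"window": "48h"}, {"window": "1h"}],
]
ratios = enrichment(pools, selected, lambda e: e["window"])
print(ratios)
```

In this toy case the 48h window makes up one third of the pool but two thirds of the selection, so its enrichment is 2, while the 1h window is under-represented with enrichment 0.5.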

### B.4 LM classifier and candidate-retrieval sensitivity

##### LM classifier ablation\.

This ablation swaps the final reader while keeping Predicate+gloss evidence, top-$M=5$ candidate retrieval, CES, and budget $K=30$ fixed. It tests whether the gain comes from the selected evidence representation alone, from generic language pretraining, or from domain-aligned clinical pretraining.

Table A.15: Language-reader ablation. BioClinical ModernBERT is the default reader in TreeText-CTS. Random-init ModernBERT uses the same architecture and tokenizer but initializes the encoder from configuration rather than pretrained weights.

| Dataset | LM backbone | AUROC | AUPRC | F1 |
| --- | --- | --- | --- | --- |
| PhysioNet 2012 | BioClinical ModernBERT | **0.8571 ± 0.0038** | **0.5239 ± 0.0004** | **0.5264 ± 0.0023** |
| PhysioNet 2012 | random-init ModernBERT | 0.8497 ± 0.0029 | 0.5054 ± 0.0209 | 0.5144 ± 0.0070 |
| PhysioNet 2012 | ModernBERT | 0.8486 ± 0.0026 | 0.4857 ± 0.0021 | 0.5083 ± 0.0064 |
| MIMIC-III | BioClinical ModernBERT | **0.8579 ± 0.0015** | **0.5011 ± 0.0090** | **0.4958 ± 0.0029** |
| MIMIC-III | random-init ModernBERT | 0.8470 ± 0.0009 | 0.4879 ± 0.0001 | 0.4827 ± 0.0009 |
| MIMIC-III | ModernBERT | 0.8483 ± 0.0014 | 0.4486 ± 0.0011 | 0.4827 ± 0.0053 |
| PhysioNet 2019 | BioClinical ModernBERT | **0.9066 ± 0.0045** | **0.4757 ± 0.0248** | **0.4994 ± 0.0383** |
| PhysioNet 2019 | random-init ModernBERT | 0.8983 ± 0.0051 | 0.4088 ± 0.0323 | 0.4572 ± 0.0053 |
| PhysioNet 2019 | ModernBERT | 0.8936 ± 0.0054 | 0.4007 ± 0.0042 | 0.4571 ± 0.0039 |

##### Candidate retrieval sensitivity.

Before CES applies the final selected-evidence budget $K=30$, we retrieve a finite candidate pool of activated leaves for each patient-time-window tuple $(i,t,W)$. The main experiments use top-$M=5$ retrieval. Increasing $M$ gives the selector more input tokens, which increases peak GPU memory, but it does not improve AUPRC on PhysioNet 2012.

Table A.16: Candidate pre-pruning sensitivity on PhysioNet 2012. All rows use Predicate+gloss evidence, BioClinical ModernBERT, final selected-evidence budget $K=30$, maximum input length 3072, 20 training epochs, and 3 random seeds. Main experiments use top-$M=5$. Peak memory is measured with `nvidia-smi memory.used` during one training epoch and therefore includes PyTorch caching allocator reserved blocks.

| Candidate retrieval | P2012 AUROC | P2012 AUPRC | Leaves per $(i,t,W)$ | Train time / epoch | Peak GPU memory |
| --- | --- | --- | --- | --- | --- |
| top-$M=5$ | 0.8571 ± 0.0038 | **0.5239 ± 0.0004** | 5 | 19.1 min | 35,787 MiB (≈ 34.9 GB) |
| top-$M=10$ | **0.8584 ± 0.0003** | 0.5069 ± 0.0047 | 10 | 20.2 min | 39,197 MiB (≈ 38.3 GB) |
| top-$M=20$ | 0.8551 ± 0.0022 | 0.4999 ± 0.0045 | 20 | 20.4 min | 40,015 MiB (≈ 39.1 GB) |
| top-$M=30$ | 0.8483 ± 0.0030 | 0.5002 ± 0.0254 | 30 | 20.5 min | 43,875 MiB (≈ 42.9 GB) |

##### Interpretation\.

BioClinical ModernBERT is the strongest reader on all three datasets. Random-init ModernBERT outperforms vanilla ModernBERT in AUPRC, suggesting that generic language pretraining is not automatically aligned with the structured evidence language produced by tree-path predicates. However, random-init does not match BioClinical ModernBERT, especially on PhysioNet 2019. The random-init reader is close to the reused-CES leaf-ID MLP control in AUPRC, which indicates that selected leaf identities already carry strong predictive information. The full BioClinical reader improves beyond both controls most strongly on PhysioNet 2019. For retrieval, top-$M=5$ achieves the best AUPRC while using the smallest selector input and the lowest peak GPU memory. Larger candidate pools expose more leaves per patient-time-window context, but the additional candidates do not improve AUPRC on PhysioNet 2012.

### B.5 Representation-level case example

This section shows one held\-out PhysioNet 2012 test case to illustrate how different input interfaces expose the same ICU trajectory\. The example is intended as a qualitative representation comparison rather than an additional performance claim\. We replace the original record identifier with a case label and show shortened derived representations rather than the complete raw trajectory\.

##### Case overview\.

The case, denoted P12\-B, is a mortality\-negative held\-out PhysioNet 2012 example\. The displayed trajectory contains early tachycardia, intermittent low urine output, declining oxygenation measurements, thrombocytopenia, and later stabilizing neurological and renal markers\. The panels below compare: \(i\) selected tree\-path evidence used by TreeText\-CTS, \(ii\) a TimeCP\-style patient\-level contextual summary from our text\-based baselines, \(iii\) raw textual serialization from Decode\- and WRDP\-style baselines, and \(iv\) a forward\-filled numerical matrix used by common imputation\-based numerical baselines\.

Table A.17: Representation statistics for the P12-B held-out case. Lengths are reported in the native units logged by each representation pipeline and are not directly comparable tokenizer counts.

| Interface | Baseline family | Logged size | Source granularity |
| --- | --- | --- | --- |
| TreeText-CTS | TreeText-CTS | 30 / 204 selected units; 1668 subwords | source window + tree path |
| Patient-level summary | TimeCP contextual patient text | 992 characters | patient-level narrative |
| Raw text serialization | Decode per-event serialization | 2689 characters | 6h event/time-bin entries |
| Raw text serialization | WRDP hourly narrative serialization | 13057 characters | 48-hour feature rows |
| Numerical matrix | forward-fill imputation input | $48\times 37$ values | feature-by-time matrix |

##### TreeText-CTS selected tree-path evidence.

Table A.18 shows representative selected evidence units from TreeText-CTS. Each unit is tied to a source time window and consists of deterministic threshold conditions. We include both glossed and predicate-only units; four of the displayed units include a clinical gloss and six are predicate-only. The gloss is auxiliary and is never used to replace the deterministic conditions.

Table A.18: Representative TreeText-CTS selected evidence units for P12-B. This table shows 10 selected units ordered by source-window end time.

| Source_window | Selected tree-path evidence unit | Gloss? |
| --- | --- | --- |
| 05:00–07:00 | Urine mean is at most 46.6667, Urine last is higher than 0.0000, Urine mean is higher than 17.5000, and HR mean is higher than 105.5000. cg: Low but present urine output with reduced mean output accompanies tachycardia, suggesting a pattern of compromised perfusion and circulatory stress. | yes |
| 04:00–08:00 | HR mean is higher than 110.0000, Urine max is at most 55.0000, Weight ts_gap is higher than 0.6500, RespRate min is at most 15.0000, and Weight last is at most 131.8000. cg: Tachycardia combined with markedly restricted urine output and a low respiratory rate minimum suggests haemodynamic stress with reduced renal output in a lower-weight patient; weight data is temporally sparse. | yes |
| 07:00–15:00 | FiO2 last is at most 0.5500, HR mean is higher than 110.0909, and NIMAP mean is at most 100.0000. cg: Sustained tachycardia with non-elevated non-invasive mean arterial pressure in patients receiving lower supplemental oxygen. | yes |
| 09:00–15:00 | RespRate min is higher than 10.0000, NIMAP max is at most 75.0000, and Urine mean is higher than 61.6667. | no |
| 11:00–27:00 | Creatinine mean is at most 1.4750, Urine mean is higher than 66.3636, and Temp min is higher than 34.4000. | no |
| 20:00–27:00 | Urine mean is higher than 45.0000, RespRate min is higher than 8.0000, and RespRate mean is at most 25.0000. | no |
| 29:00–31:00 | RespRate mean is higher than 9.0000, Urine mean is higher than 82.5000, RespRate mean is at most 21.0000, and Urine max is at most 340.0000. | no |
| 00:00–43:00 | GCS mean is higher than 8.0526, Urine mean is higher than 57.7500, BUN mean is at most 17.5000, Bilirubin delta is at most 1.7000, and Lactate last is at most 4.9000. cg: Preserved consciousness, adequate urine output, low BUN suggesting absence of azotemia, stable bilirubin, and lactate not reaching severe hyperlactatemia together indicate relatively stable metabolic and organ function status. | yes |
| 11:00–43:00 | BUN last is at most 28.0000, and Platelets min is at most 366.0000. | no |
| 11:00–43:00 | GCS last is higher than 14.0000, Temp mean is higher than 36.8429, Urine max is higher than 155.0000, and TroponinI min is at most 4.6000. | no |

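
The predicate text in these evidence units follows a regular template, which the sketch below imitates. It is an illustration inferred from the displayed strings, not the paper's verbalizer: the function name `verbalize_path`, the `(feature, op, threshold)` tuple encoding, and the mapping of `<=` and `>` to "is at most" and "is higher than" are assumptions.

```python
def verbalize_path(conditions):
    """Render tree-path threshold conditions as one deterministic sentence.

    conditions: list of (feature, op, threshold) with op in {"<=", ">"}.
    The tuple encoding is hypothetical; thresholds use four decimals to
    match the displayed evidence units.
    """
    phrase = {"<=": "is at most", ">": "is higher than"}
    parts = [f"{feat} {phrase[op]} {thr:.4f}" for feat, op, thr in conditions]
    if len(parts) == 1:
        return parts[0] + "."
    return ", ".join(parts[:-1]) + ", and " + parts[-1] + "."


path = [
    ("RespRate min", ">", 10.0),
    ("NIMAP max", "<=", 75.0),
    ("Urine mean", ">", 61.6667),
]
print(verbalize_path(path))
```

Because the rendering is a pure function of the path conditions, the same activated path always yields the same text, which is what keeps each unit traceable to its thresholds.
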
##### Patient\-level summary: TimeCP contextual patient text\.

The following excerpt comes from the TimeCP\-style contextual patient\-text baseline used in our text\-based baseline group\. This representation is compact and fluent, but the clinical narrative is patient\-level and not explicitly tied to recoverable source windows\. Boldface marks statements that are either not directly supported by the displayed source excerpts or appear clinically stronger than the underlying measurements shown here\.

> This 42-year-old patient presents with significant early hemodynamic instability characterized by **severe hypertension (systolic peaks $>200$ mmHg)** and tachycardia, which gradually trended toward normotension but remained labile with persistent tachycardia throughout the 48-hour window. Respiratory status shows a concerning trajectory of worsening hypoxemia (PaO2 declining from 198 to 72 mmHg) accompanied by **progressive tachypnea**, suggesting deteriorating gas exchange despite maintained oxygen saturation. Concurrently, the patient exhibits signs of **evolving multi-organ stress**, including a sharp decline in hematocrit and platelets, persistent hyperbilirubinemia, and fluctuating oliguria that intermittently dropped below 40 mL/hr. Although acid-base status remained relatively compensated with only mild respiratory acidosis, the combination of sustained hemodynamic volatility, worsening respiratory mechanics, and declining hematologic parameters indicates a **high-risk clinical course**.

##### Raw textual serialization: Decode and WRDP baselines\.

The raw\-text panel excerpts two text\-based baselines: Decode\-style per\-event serialization and WRDP\-style hourly narrative serialization\. These inputs retain source observations, but they are long and redundant\. Repeated or slowly varying values are restated many times, and related clinical signals are distributed across many lines\.

> WRDP-style hourly narrative serialization: Clinical prediction task: in-hospital mortality. Observation window: first 48 hours of ICU admission. Patient demographics: Age: 42.00, Gender: Male, ICUType: SICU. Weight: 83.48, 83.48, 83.48, 83.48, 83.48, 83.48, 83.48, 83.48, 83.48, 83.48, 83.48, 83.48, ... FiO2: 0.54, 0.54, 0.54, 0.54, 0.54, 0.54, 0.54, 0.54, 0.54, 0.54, 0.54, 0.54, ... MechVent: 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, ... Lactate: 2.00, 2.00, 2.00, 2.00, 2.00, 2.00, 2.00, 2.00, 2.00, 2.00, 2.00, 2.00, ... TroponinI: 8.29, 8.29, 8.29, 8.29, 8.29, 8.29, 8.29, 8.29, 8.29, 8.29, 8.29, 8.29, ...
>
> Decode-style 6h per-event serialization: at 0h: ALP = 100; ALT = 17; AST = 43; BUN = 11; Bilirubin = 2.20; Creatinine = 0.60; GCS = 14; HR = 109.25; PaO2 = 169; Platelets = 102; at 6h: ALP = 102; ALT = 18; AST = 47; BUN = 12; Bilirubin = 2.70; Creatinine = 0.60; HR = 117; PaO2 = 119; Platelets = 106; at 12h: DiasABP = 81.17; GCS = 14; HR = 107; MAP = 99.17; RespRate = 18.33; SysABP = 137.33; Temp = 38.05; Urine = 103; at 18h: DiasABP = 79.83; GCS = 14; HCT = 27.20; HR = 101.50; MAP = 98.17; RespRate = 20.17; SysABP = 134.67; Urine = 98; at 24h: BUN = 11; Creatinine = 0.60; GCS = 15; HCT = 26.50; PaO2 = 72; Platelets = 80; RespRate = 19.67; Urine = 60; at 36h: Albumin = 2.80; Bilirubin = 2.50; GCS = 15; HCT = 26.90; HR = 89.67; RespRate = 22.17; Urine = 102.50.

##### Numerical matrix representation: forward\-fill imputation input\.

The numerical panel shows a forward\-filled hourly matrix, a common input format for imputation\-based LSTM or Transformer baselines\. This representation is compact for numerical computation, but it is not a readable classifier input\. Columns are time points and rows are variables\.
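A minimal sketch of how a forward\-filled hourly row could be built from irregular observations. The function name `forward_fill_hourly`, its arguments, and the use of a caller\-supplied default before the first observation are assumptions of this sketch, not details taken from the paper.

```python
from typing import Dict, List, Tuple


def forward_fill_hourly(
    events: Dict[str, List[Tuple[float, float]]],
    n_hours: int,
    defaults: Dict[str, float],
) -> Dict[str, List[float]]:
    """Build one row per variable with one value per hourly step.

    events maps a variable name to (time_in_hours, value) pairs.
    defaults gives the value used before the first observation
    (an assumption of this sketch).
    """
    matrix = {}
    for name, obs in events.items():
        obs = sorted(obs)
        row, current, idx = [], defaults[name], 0
        for hour in range(n_hours):
            # Consume every observation recorded up to and including this hour.
            while idx < len(obs) and obs[idx][0] <= hour:
                current = obs[idx][1]
                idx += 1
            row.append(current)
        matrix[name] = row
    return matrix


demo = forward_fill_hourly(
    {"PaO2": [(1.0, 169.0), (4.0, 119.0)]},
    n_hours=6,
    defaults={"PaO2": 150.0},
)
print(demo["PaO2"])
```

Each value is carried forward until a newer observation arrives, which is why slowly sampled labs appear as long runs of identical entries in such a matrix.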

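Table A\.19: Forward\-filled numerical matrix excerpt for P12\-B\. The full numerical input has 48 time steps and 37 variables; this table shows a subset of variables and 6\-hour time columns\.

| Feature | 0h | 6h | 12h | 18h | 24h | 30h | 36h | 42h |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Weight | 83\.48 | 83\.48 | 83\.48 | 83\.48 | 83\.48 | 83\.48 | 83\.48 | 83\.48 |
| ALP | 119\.06 | 100\.00 | 102\.00 | 102\.00 | 102\.00 | 102\.00 | 102\.00 | 102\.00 |
| ALT | 360\.38 | 17\.00 | 18\.00 | 18\.00 | 18\.00 | 18\.00 | 18\.00 | 18\.00 |
| AST | 500\.89 | 43\.00 | 47\.00 | 47\.00 | 47\.00 | 47\.00 | 47\.00 | 47\.00 |
| BUN | 27\.03 | 11\.00 | 12\.00 | 12\.00 | 12\.00 | 11\.00 | 11\.00 | 11\.00 |
| Bilirubin | 2\.87 | 2\.20 | 2\.70 | 2\.70 | 2\.70 | 2\.70 | 2\.70 | 2\.50 |
| Creatinine | 1\.46 | 0\.60 | 0\.60 | 0\.60 | 0\.60 | 0\.60 | 0\.60 | 0\.60 |
| DiasABP | 59\.54 | 78\.00 | 80\.00 | 83\.00 | 80\.00 | 75\.00 | 80\.00 | 81\.00 |
| FiO2 | 0\.54 | 0\.54 | 0\.54 | 0\.54 | 0\.54 | 0\.54 | 0\.54 | 0\.54 |
| GCS | 11\.38 | 14\.00 | 14\.00 | 14\.00 | 15\.00 | 15\.00 | 15\.00 | 15\.00 |
| Glucose | 141\.17 | 131\.00 | 116\.00 | 116\.00 | 116\.00 | 108\.00 | 108\.00 | 108\.00 |
| HCT | 30\.70 | 31\.40 | 33\.70 | 33\.70 | 27\.20 | 26\.50 | 26\.50 | 27\.30 |
| HR | 86\.51 | 122\.00 | 107\.00 | 106\.00 | 99\.00 | 96\.00 | 89\.00 | 88\.00 |
| Lactate | 2\.00 | 2\.00 | 2\.00 | 2\.00 | 2\.00 | 2\.00 | 2\.00 | 2\.00 |
| MAP | 80\.24 | 97\.00 | 99\.00 | 100\.00 | 97\.00 | 94\.00 | 111\.00 | 99\.00 |
| MechVent | 1\.00 | 1\.00 | 1\.00 | 1\.00 | 1\.00 | 1\.00 | 1\.00 | 1\.00 |
| NIMAP | 77\.50 | 128\.30 | 128\.30 | 128\.30 | 128\.30 | 128\.30 | 128\.30 | 86\.00 |
| NISysABP | 119\.97 | 175\.00 | 175\.00 | 175\.00 | 175\.00 | 175\.00 | 175\.00 | 110\.00 |
| PaO2 | 198\.00 | 140\.00 | 119\.00 | 119\.00 | 119\.00 | 72\.00 | 72\.00 | 72\.00 |
| Platelets | 190\.57 | 102\.00 | 106\.00 | 106\.00 | 106\.00 | 80\.00 | 80\.00 | 80\.00 |
| RespRate | 19\.58 | 15\.00 | 16\.00 | 21\.00 | 21\.00 | 20\.00 | 21\.00 | 20\.00 |
| Temp | 37\.09 | 36\.80 | 37\.10 | 38\.40 | 38\.00 | 37\.70 | 38\.40 | 38\.70 |
| Urine | 115\.64 | 40\.00 | 50\.00 | 100\.00 | 80\.00 | 60\.00 | 50\.00 | 160\.00 |
| pH | 7\.47 | 7\.37 | 7\.37 | 7\.37 | 7\.37 | 7\.44 | 7\.44 | 7\.44 |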

##### Takeaway\.

This case illustrates the interface trade\-off\. The numerical matrix is compact for numerical models but difficult to inspect as evidence\. Raw serialization preserves source observations but produces long, repetitive inputs in which the relevant trajectory patterns are scattered across many lines\. Patient\-level contextual summaries are concise and readable, but can introduce over\-specific clinical claims that are not directly tied to recoverable source windows\. In contrast, TreeText\-CTS exposes a selected set of source\-window\-grounded tree\-path evidence units\. The LM classifier reads these selected units directly, and each unit remains traceable through its source window and deterministic tree\-derived conditions\.

## Appendix C Traceability, verbalization, and clinical glossing

### C\.1 Construction\-level auditability

##### Source traceability\.

TreeText\-CTS auditability is an input\-construction property rather than a post\-hoc explanation\. Each selected evidence unit is assembled from a cached leaf\-level evidence record that stores the source tuple $(t,W,b,\ell)$ and the canonical deterministic predicate text $\mathrm{Pred}(e)$\. Therefore, selected evidence can be traced back to the source\-window endpoint, window size, tree, leaf, and path conditions\. This guarantee is not a claim that raw\-serialization baselines lack source provenance; rather, TreeText\-CTS combines provenance with compact selected evidence units\. It also does not imply a causal explanation or a clinically validated rationale\.

Table A\.20: Auditability properties guaranteed by the evidence\-construction pipeline\.

| Property | Guarantee |
| --- | --- |
| Source tuple retained | Every selected evidence unit stores $(t,W,b,\ell)$ |
| Predicate anchor retained | Every selected evidence unit contains canonical deterministic predicate text $\mathrm{Pred}(e)$ |
| Full path recoverable | The stored source tuple indexes the original XGBoost tree inventory |
| Gloss gated by glossable flag | Clinical gloss is appended only when $g(e)=1$ |
| No patient\-level free\-form generation | Leaf verbalization and embedding are cached offline at the leaf level |

##### Interpretation\.

The clinical gloss is not the evidence anchor\. Even for glossable leaves, the canonical deterministic predicate text remains in the selected evidence unit, and the original tree path remains recoverable through the source tuple and tree inventory\. This is why the method exposes auditable classifier inputs rather than post\-hoc saliency maps\.
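As an illustration of this construction\-level traceability, the following sketch stores the source tuple with each evidence record and looks up the original path in a tree inventory. The names `EvidenceRecord`, `TreeInventory`, and `trace`, the inventory key layout, the raw path strings, and the reading of the window as the look\-back interval ending at $t$ are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EvidenceRecord:
    t: int           # source-window endpoint (hours)
    W: int           # window size (hours)
    b: int           # tree index
    leaf: int        # leaf index within tree b
    pred: str        # canonical deterministic predicate text
    g: int = 0       # glossable flag
    gloss: str = ""  # cached clinical gloss (auxiliary)


# Hypothetical inventory: (W, b, leaf) -> ordered raw split conditions.
TreeInventory = Dict[Tuple[int, int, int], List[str]]


def trace(record: EvidenceRecord, inventory: TreeInventory) -> dict:
    """Recover the source window and the original tree path for a record."""
    return {
        # Assumes a look-back window of size W that ends at time t.
        "window": (record.t - record.W, record.t),
        "tree": record.b,
        "leaf": record.leaf,
        "raw_path": inventory[(record.W, record.b, record.leaf)],
    }


# Illustrative entry; the raw condition strings are made up for the demo.
inventory: TreeInventory = {
    (2, 25, 57): ["NISysABP__min <= 103", "HR__mean > 108.5", "Weight__last <= 143.4"],
}
record = EvidenceRecord(t=33, W=2, b=25, leaf=57, pred="(predicate text)", g=1)
print(trace(record, inventory))
```

Because the lookup uses only the stored tuple, dropping or editing the gloss does not affect what can be recovered.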

### C\.2 Source\-grounded held\-out case study

Figure [A\.2](https://arxiv.org/html/2605.20292#A3.F2) shows an anonymized held\-out PhysioNet 2012 mortality case\. TreeText\-CTS predicts the positive outcome with $\hat{p}=0.83$, while Decode\-Adapted, a controlled adapted text baseline using the same reader family, predicts $\hat{p}=0.21$\. The selected evidence cards expose the source time, window, tree, leaf, canonical deterministic predicates, and gloss\-gated auxiliary clinical text used as classifier input\. This example illustrates the construction\-level auditability benefit of TreeText\-CTS: the model exposes a compact set of source\-grounded findings that can be inspected before the prediction is interpreted\.

> Held\-out PhysioNet 2012 case\. Label: mortality positive\. TreeText\-CTS: $\hat{p}=0.83$ ✓; Decode\-Adapted: $\hat{p}=0.21$ ✗\.
>
> Case summary\. For anonymized test case P12\-test, TreeText\-CTS selects 30 evidence units; 22 have $g(e)=1$, and the selected set spans all seven window sizes $W\in\{1,2,4,8,16,32,48\}$ h\. Below are the three highest\-logit selected evidence units, sorted by source time; they are shown to illustrate source traceability rather than to summarize the full multi\-scale selection distribution\.
>
> Card 1\. Source $(t{=}33\,\mathrm{h}, W{=}2\,\mathrm{h}, b{=}25, \ell{=}57)$; $g(e){=}1$; leaf score 2\.22\.
>
> Pred\. non\-invasive systolic BP minimum is at most 103; heart\-rate mean is higher than 108\.5; weight last value is at most 143\.4\.
>
> Gloss\. patient with tachycardia and low non\-invasive systolic pressure, suggesting possible hemodynamic instability\.
>
> Card 2\. Source $(t{=}40\,\mathrm{h}, W{=}2\,\mathrm{h}, b{=}1, \ell{=}33)$; $g(e){=}1$; leaf score 2\.37\.
>
> Pred\. Respiratory\-rate mean is at most 10; GCS maximum is at most 10; GCS last value is higher than 3; GCS last value is at most 7\.
>
> Gloss\. Severely depressed level of consciousness across both peak and recent GCS values, combined with markedly reduced respiratory rate, consistent with a profound neurological impairment pattern\.
>
> Card 3\. Source $(t{=}44\,\mathrm{h}, W{=}2\,\mathrm{h}, b{=}17, \ell{=}44)$; $g(e){=}1$; leaf score 3\.64\.
>
> Pred\. Urine maximum is at most 80; FiO2 minimum is higher than 0\.502; lactate minimum is higher than 6\.1; PaCO2 time\-gap is higher than 0\.1\.
>
> Gloss\. Markedly elevated minimum lactate indicates hyperlactatemia; combined with reduced urine output and high minimum inspired oxygen fraction, this pattern suggests concurrent tissue hypoperfusion and impaired oxygenation\.
>
> Note\. The gloss is auxiliary\. The deterministic predicate text and source tuple are retained for every selected evidence unit\. Decode\-Adapted reads the same patient’s serialized event stream but does not expose source\-indexed selected evidence\.

Figure A\.2: A source\-grounded prediction case from the PhysioNet 2012 test set\. TreeText\-CTS correctly predicts in\-hospital mortality, while Decode\-Adapted assigns a low probability\. Each selected evidence card is tied to a source time, window, tree, and leaf, and preserves thresholded predicate text\. The case illustrates input\-level auditability rather than a post\-hoc explanation\.

### C\.3 Tree\-to\-evidence inventory and predicate canonicalization

##### Leaf inventory\.

Table [A\.21](https://arxiv.org/html/2605.20292#A3.T21) reports the number of XGBoost leaves available for verbalization\. Each leaf has canonical deterministic predicate text $\mathrm{Pred}(e)$\. The glossable flag indicates whether the offline LLM judged that the path supports a named clinical\-state gloss\.

Table A\.21: Total XGBoost leaves per dataset and window\. Each leaf can be mapped to a reusable evidence unit\.

| Dataset | 1h | 2h | 4h | 8h | 16h | 32h | 48h | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PhysioNet 2012 | 892 | 847 | 841 | 807 | 703 | 521 | 562 | 5,173 |
| MIMIC\-III | 916 | 926 | 903 | 886 | 822 | 646 | 648 | 5,747 |
| PhysioNet 2019 | 906 | 883 | 850 | 815 | 738 | 670 | 590 | 5,452 |
| All datasets | 2,714 | 2,656 | 2,594 | 2,508 | 2,263 | 1,837 | 1,800 | 16,372 |

Table A\.22: Glossable and non\-glossable leaf counts\. $g(e)=1$ means the leaf path supports a clinical gloss; $g(e)=0$ means only canonical deterministic predicate text is used in Predicate\+gloss\.

| Dataset | Total leaves | $g(e)=1$ | $g(e)=0$ | $g(e)=1$ rate |
| --- | --- | --- | --- | --- |
| PhysioNet 2012 | 5,173 | 2,349 | 2,824 | 45\.4% |
| MIMIC\-III | 5,747 | 3,118 | 2,629 | 54\.3% |
| PhysioNet 2019 | 5,452 | 1,786 | 3,666 | 32\.8% |
| All datasets | 16,372 | 7,253 | 9,119 | 44\.3% |

#### C\.3\.1 Predicate rendering and canonicalization

##### Deterministic predicate rendering\.

For each activated root\-to\-leaf path, we recover the ordered split conditions from the fixed XGBoost tree inventory\. Each raw split inequality is rendered by a deterministic template using the feature name and summary statistic\. For example, $x>40$ is rendered as “$x$ is higher than 40,” and $x\leq 40$ is rendered as “$x$ is at most 40\.” This rendering step uses no LLM: the same tree path always maps to the same predicate text after canonicalization\.
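The template rendering described above can be sketched as follows. This is an illustrative version only: the tuple layout, the display names, and the helper names `render_split` and `render_path` are assumptions, and only the two phrasings quoted in the text are taken from the paper.

```python
from typing import List, Tuple

# (display name, summary statistic, operator, threshold); names are illustrative.
Split = Tuple[str, str, str, float]

TEMPLATES = {
    ">": "{name} {stat} is higher than {thr:g}",
    "<=": "{name} {stat} is at most {thr:g}",
}


def render_split(name: str, stat: str, op: str, thr: float) -> str:
    """Render one split inequality with a fixed template (no LLM involved)."""
    return TEMPLATES[op].format(name=name, stat=stat, thr=thr)


def render_path(path: List[Split]) -> str:
    """Join the ordered split conditions of a root-to-leaf path."""
    return "; ".join(render_split(*split) for split in path) + "."


print(render_path([
    ("heart-rate", "mean", ">", 108.5),
    ("GCS", "last value", "<=", 7),
]))
```

Since the output depends only on the split tuples, the same path always yields the same string.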

##### Path canonicalization\.

Before caching $\mathrm{Pred}(e)$, we simplify the raw conjunction into a canonical predicate set\. For each feature\-summary key, multiple lower\-bound constraints are replaced by the tightest lower bound, and multiple upper\-bound constraints are replaced by the tightest upper bound\. Thus, a path containing both `SBP > 40` and `SBP > 50` is rendered using only `SBP > 50`\. We also remove predicates that only restate known feasible feature ranges or glossary\-defined boundary constraints, such as an upper\-bound constraint at the maximum possible score or a lower\-bound constraint at the minimum possible score\.
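A minimal sketch of this canonicalization, assuming splits are stored as `(key, operator, threshold)` tuples with operators `>` and `<=`. The function name `canonicalize`, the `feasible` range table, and the exact vacuity test (dropping `<= thr` when `thr` is at or above the feasible maximum, and `> thr` when `thr` is below the feasible minimum) are one possible reading of the rule, not the paper's implementation.

```python
from typing import Dict, List, Optional, Tuple

Split = Tuple[str, str, float]  # (feature-summary key, operator, threshold)


def canonicalize(
    path: List[Split],
    feasible: Optional[Dict[str, Tuple[float, float]]] = None,
) -> List[Split]:
    """Keep the tightest bound per key and drop range-restating clauses."""
    feasible = feasible or {}
    lower: Dict[str, float] = {}
    upper: Dict[str, float] = {}
    order: List[str] = []
    for key, op, thr in path:
        if key not in order:
            order.append(key)
        if op == ">":
            lower[key] = max(lower.get(key, thr), thr)  # tightest lower bound
        elif op == "<=":
            upper[key] = min(upper.get(key, thr), thr)  # tightest upper bound
        else:
            raise ValueError(f"unsupported operator: {op}")
    out: List[Split] = []
    for key in order:
        lo, hi = feasible.get(key, (float("-inf"), float("inf")))
        if key in lower and lower[key] >= lo:
            out.append((key, ">", lower[key]))
        if key in upper and upper[key] < hi:
            out.append((key, "<=", upper[key]))
    return out


print(canonicalize([("SBP__min", ">", 40), ("SBP__min", ">", 50)]))
```

With a feasible range such as `(3, 15)` for a GCS key, a clause like `GCS__max <= 15` would be dropped while `GCS__last <= 7` is kept.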

##### Scope and verbalization rule\.

Canonicalization only removes redundant or range\-bound predicate clauses; it does not add clinical interpretation\. The final $\mathrm{Pred}(e)$ remains a deterministic rendering of tree\-derived conditions\. Clinical glossing is applied afterward and stored separately through $g(e)$ and $\mathrm{Gloss}(e)$\. The original uncanonicalized path remains recoverable from the source tuple $(t,W,b,\ell)$ and the fixed tree inventory\. The Predicate\+gloss evidence text always contains the canonical deterministic predicate text $\mathrm{Pred}(e)$\. For glossable leaves, we append a clinical gloss:

$$\mathrm{text}(e)=\begin{cases}\mathrm{Pred}(e)\;\texttt{ cg: }\mathrm{Gloss}(e), & g(e)=1,\\ \mathrm{Pred}(e), & g(e)=0.\end{cases}$$

Thus, the gloss is an auxiliary semantic hint, while the canonical predicate text remains the auditable anchor\.
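The case rule for the evidence text can be written as a small function. The name `evidence_text` and the choice to fall back to the bare predicate when a gloss string is empty are assumptions of this sketch; the ` cg: ` separator follows the formula.

```python
def evidence_text(pred: str, gloss: str = "", g: int = 0) -> str:
    """Compose Predicate+gloss text: the predicate is always present."""
    if g == 1 and gloss:
        return f"{pred} cg: {gloss}"
    return pred


print(evidence_text("GCS last value is at most 7.", "Depressed consciousness.", g=1))
print(evidence_text("GCS last value is at most 7.", g=0))
```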

##### Example\.

A glossable path such as `GCS__last` $\leq 9$, `Urine__mean` $\leq 70.1667$, and `BUN__mean` $\leq 21.6667$ is rendered as canonical deterministic predicate text and may receive a clinical gloss describing impaired consciousness with reduced urine output\. A non\-glossable path is rendered as predicate text only, avoiding forced clinical labels where the tree path is predictive but not semantically specific\. If a raw path contains redundant clauses such as `SBP > 40` and `SBP > 50`, only the tighter condition is kept in $\mathrm{Pred}(e)$\.

### C\.4 Clinical gloss annotation prompt

##### Clinical gloss annotation prompt\.

For each unique leaf predicate, we query a local LLM with the following prompt and cache the returned annotation\.

> SYSTEM
>
> You are a clinical gloss annotator\. You are given \(i\) a TASK description and \(ii\) a single decision\-tree path rendered as deterministic predicate text $\mathrm{Pred}(e)$\. Decide whether the predicate conjunction has a clinically meaningful interpretation for the given TASK\. If so, set $g(e)=1$ and write a short clinical gloss; otherwise, set $g(e)=0$ and leave the gloss empty\.
>
> Set $g(e)=1$ only when the predicate conjunction has a recognizable clinical interpretation that is informative for the TASK, such as a known physiologic pattern, organ\-system derangement, treatment response, or risk signal\. Set $g(e)=0$ when the path is only a generic threshold combination, weakly related to the TASK, or clinically speculative\.
>
> Rules for $\mathrm{Gloss}(e)$ when $g(e)=1$:
>
> - Use one sentence, at most 30 words\.
> - Describe the physiologic or clinical state; do not restate numeric thresholds\.
> - Do not mention the tree, leaf, window index, model, or prediction label\.
> - Do not add clinical facts unsupported by the predicates or feature glossary\.
> - If unsure, set $g(e)=0$ rather than writing a vague gloss\.
>
> Output strict JSON with exactly these keys and nothing else: `{"g": 0 or 1, "gloss": "<string, empty if g=0>"}`
>
> USER
>
> TASK: `{task_description}`
>
> FEATURE GLOSSARY: `{feature_glossary}`
>
> PATH PREDICATES \($\mathrm{Pred}(e)$\): `{pred_text}`
>
> Return the JSON object now\.
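Because the prompt asks for strict JSON, the cached annotation can be validated before use. The sketch below is one hypothetical way to do so; the function name `parse_gloss_annotation` and the policy of demoting an empty or over\-long gloss to $g(e)=0$ are assumptions, chosen to mirror the prompt's "if unsure, set $g(e)=0$" rule.

```python
import json


def parse_gloss_annotation(raw: str, max_words: int = 30) -> dict:
    """Validate the annotator's JSON reply and normalize it."""
    obj = json.loads(raw)
    if not isinstance(obj, dict) or set(obj) != {"g", "gloss"}:
        raise ValueError("expected exactly the keys 'g' and 'gloss'")
    g, gloss = obj["g"], obj["gloss"]
    # Reject booleans explicitly, since True == 1 in Python.
    if type(g) is not int or g not in (0, 1) or not isinstance(gloss, str):
        raise ValueError("g must be 0 or 1 and gloss must be a string")
    gloss = gloss.strip()
    # Conservative fallback (assumption): unusable gloss means non-glossable.
    if g == 1 and (not gloss or len(gloss.split()) > max_words):
        g = 0
    return {"g": g, "gloss": gloss if g == 1 else ""}


print(parse_gloss_annotation('{"g": 1, "gloss": "Tachycardia with low systolic pressure."}'))
```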

### C\.5 Representation\-level case examples

We provide qualitative examples from three held\-out PhysioNet 2012 cases to illustrate how different EHR time\-series interfaces present the same underlying patient trajectory\. These examples are intended to compare input representations rather than to establish additional quantitative claims\. We replace record identifiers with case labels, coarsen static demographics in displayed excerpts, and do not reproduce complete raw patient trajectories\. Outcome labels are shown only to orient the reader and are not part of the model input\.

Table A\.23: Held\-out PhysioNet 2012 representation examples\. TreeText\-CTS exposes a fixed\-budget set of selected tree\-path evidence units\. Summary methods produce compact patient\-level narratives, raw\-serialization methods enumerate observations as text, and numerical methods consume a forward\-filled hourly matrix\. Lengths are reported in the native units logged by each pipeline and are not directly comparable tokenizer counts\.

| Case | Displayed demographics | Outcome | Ours: selected units | Ours: length | Summary length | Raw\-text length |
| --- | --- | --- | --- | --- | --- | --- |
| P12\-A | older female, CSRU | mortality positive | 30 / 222 | 1617 subwords | 906 chars | 2198 / 12932 chars |
| P12\-B | middle\-aged male, SICU | mortality negative | 30 / 204 | 1668 subwords | 992 chars | 2689 / 13057 chars |
| P12\-C | young female, SICU | mortality positive | 30 / 200 | 2479 subwords | 947 chars | 4592 / 12970 chars |

Raw\-text length reports Decode\-style per\-event serialization / WRDP\-style hourly narrative serialization\. Numerical baselines consume a $48\times 37$ forward\-filled hourly matrix for these PhysioNet 2012 cases; only short matrix excerpts are displayed below\.

## Appendix D Implementation, latency, reproducibility, and resources

### D\.1 Implementation details

##### Main hyperparameters\.

Table [A\.24](https://arxiv.org/html/2605.20292#A4.T24) summarizes the main implementation settings\. We use Predicate\+gloss as shorthand for the main evidence style: canonical deterministic path predicates plus an optional cached clinical gloss\.

Table A\.24: Main implementation settings for TreeText\-CTS\.

| Item | Setting |
| --- | --- |
| Evidence style | Predicate\+gloss: canonical predicate text \+ optional clinical gloss |
| Window sizes | $\{1,2,4,8,16,32,48\}$ hours |
| Window summary bank | last value, mean, std, min, max, count, net change, time since last observation, missingness |
| Tree model | separate XGBoost model per window size |
| Leaf embedding model | Qwen3\-Embedding\-8B, projected to 64 dimensions |
| Selector token | 64\-d leaf embedding \+ 8 scalar metadata features |
| Selector architecture | 2\-layer Transformer, $d_{\mathrm{model}}=128$, 4 attention heads |
| Evidence assembly | learned gate margins \+ minimum\-selection floor \+ budget cap |
| LM Classifier | BioClinical ModernBERT\-base |
| Max input length | 3072 subword tokens |
| Candidate retrieval | top\-$M=5$ leaves per $(i,t,W)$ |
| Final evidence budget | $K=30$ |
| Minimum selected evidence | $K_{\min}=5$ evidence units |
| LM Classifier learning rate | $10^{-5}$ |
| Selector/projection learning rate | $10^{-4}$ |
| Optimizer | AdamW |
| Batch size | 8 |
| Early stopping | validation metric with patience 10 |
| Trainable modules | leaf projection, CES, reader, classification head |
| Advantage normalization | mini\-batch mean/std of $\Delta_{i}$ |

##### Selector metadata\.

Each candidate evidence unit is represented to CES by a cached leaf\-text embedding and eight scalar metadata features: recency score, leaf score, absolute base\-rate deviation $|p_{\mathrm{leaf}}-p_{0}|$, signed leaf\-risk direction $p_{\mathrm{leaf}}-p_{0}$, log leaf support, normalized window size, log subword count, and the glossability flag $g(e)$\.
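A sketch of how such a selector input could be assembled, following the feature list in the paragraph above. The function name `selector_token`, the precomputed `recency` argument, the `log1p` transforms, and normalizing the window by 48 hours are assumptions; the paper's exact scaling is not given here.

```python
import math
from typing import List, Sequence


def selector_token(
    embedding: Sequence[float],
    recency: float,
    leaf_score: float,
    p_leaf: float,
    p0: float,
    leaf_support: int,
    window: int,
    n_subwords: int,
    g: int,
    max_window: int = 48,
) -> List[float]:
    """Concatenate a 64-d projected embedding with eight scalar features."""
    if len(embedding) != 64:
        raise ValueError("expected a 64-dimensional projected embedding")
    meta = [
        recency,                    # recency score (precomputed elsewhere)
        leaf_score,                 # XGBoost leaf score
        abs(p_leaf - p0),           # absolute base-rate deviation
        p_leaf - p0,                # signed leaf-risk direction
        math.log1p(leaf_support),   # log leaf support (log1p is an assumption)
        window / max_window,        # normalized window size (assumed scaling)
        math.log1p(n_subwords),     # log subword count (log1p is an assumption)
        float(g),                   # glossability flag
    ]
    return list(embedding) + meta


token = selector_token([0.0] * 64, 0.9, 2.22, 0.40, 0.15, 120, 2, 48, 1)
print(len(token))
```

The resulting vector has 64 + 8 = 72 entries per candidate unit.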

Table A\.25: Scalar metadata used by the Compact Evidence Selector\. Each tree\-path evidence unit is represented by a 64\-dimensional projected evidence\-unit embedding concatenated with the following eight scalar metadata features\.

| Index | Feature | Description |
| --- | --- | --- |
| 0 | recency | Temporal recency of the evidence unit relative to the prediction time |
| 1 | leaf\_score | XGBoost leaf score associated with the activated path |
| 2 | abs\_leaf\_score | Absolute magnitude of the leaf score |
| 3 | leaf\_direction | Signed direction of the leaf score |
| 4 | log\_leaf\_support | Log\-transformed number of training samples assigned to the leaf |
| 5 | window\_norm | Normalized look\-back window size |
| 6 | log\_predicate\_count | Text\-length proxy based on the number of rendered path\-predicate tokens |
| 7 | has\_clinical\_gloss | Binary indicator of whether a cached clinical gloss is available |

##### Optional subword budget\.

The implementation supports an optional one\-sided subword\-budget penalty when `--subword_budget` is specified\. This option is disabled in the main experiments, where the selected\-evidence count is controlled by the hard evidence budget $K=30$\.

##### XGBoost hyperparameters\.

Table [A\.26](https://arxiv.org/html/2605.20292#A4.T26) reports the XGBoost hyperparameters used to train the fixed tree ensembles before TreeText\-CTS\. We fit each ensemble on the training split and use the validation split only for early stopping\. XGBoost consumes the per\-variable nine\-dimensional window summaries\. If a variable is unobserved within a look\-back window, we set `count=0`, the missingness indicator to 1, `delta=0`, and `ts_last_gap=W`, while filling `last`, `mean`, `std`, `min`, and `max` with the fixed value $-1$\.
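The nine\-dimensional summary and its missing\-value convention can be sketched for a single variable as follows. The function name `window_summary`, the half\-open window $(t-W, t]$, the use of population standard deviation, and defining the net change as last minus first value are assumptions; the fill values for an unobserved variable follow the text.

```python
import statistics
from typing import Dict, List, Tuple


def window_summary(obs: List[Tuple[float, float]], t: float, W: float) -> Dict[str, float]:
    """Summarize (time, value) observations that fall inside (t - W, t]."""
    inside = sorted((ts, v) for ts, v in obs if t - W < ts <= t)
    if not inside:
        # Unobserved variable: fixed fill values as described in the text.
        return {
            "last": -1.0, "mean": -1.0, "std": -1.0, "min": -1.0, "max": -1.0,
            "count": 0, "delta": 0.0, "ts_last_gap": float(W), "missing": 1,
        }
    values = [v for _, v in inside]
    return {
        "last": values[-1],
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),      # population std (assumption)
        "min": min(values),
        "max": max(values),
        "count": len(values),
        "delta": values[-1] - values[0],       # net change (assumed definition)
        "ts_last_gap": t - inside[-1][0],      # time since last observation
        "missing": 0,
    }


print(window_summary([(7.0, 100.0), (9.0, 110.0)], t=10.0, W=4.0))
print(window_summary([], t=10.0, W=4.0))
```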

Table A\.26: XGBoost hyperparameter configuration\.

| Hyperparameter | Value |
| --- | --- |
| max\_depth | 5 |
| n\_estimators | 30 |
| learning\_rate | 0\.1 |
| subsample | 0\.8 |
| colsample\_bytree | 0\.8 |
| min\_child\_weight | 5 |
| gamma | 0 |
| scale\_pos\_weight | $N_{\text{neg}}/N_{\text{pos}}$ \(per dataset/window\) |
| eval\_metric | logloss |
| early\_stopping\_rounds | 10 \(on validation\) |
| random\_state | 42 |
| Multi\-scale routing | |
| Window sizes $W$ \(h\) | $\{1,2,4,8,16,32,48\}$ |

##### Offline and online LLM use\.

Leaf clinical glosses are generated offline at the tree\-leaf level, not at the patient level\. At inference, the model retrieves cached evidence, applies CES, assembles selected evidence text, and runs the encoder\. The final prediction\-time model does not perform generative decoding\. The exact gloss annotation prompt is reported in Appendix [C\.4](https://arxiv.org/html/2605.20292#A3.SS4)\.

### D\.2 Latency measurement protocol

##### Measurement setup\.

Latency is measured as end\-to\-end online inference time per patient on the first 32 patients of the filtered PhysioNet 2019 test split, using batch size 1, fp16 weights and activations, one discarded warm\-up sample, and `torch.cuda.synchronize()` barriers around GPU calls\. We report mean $\pm$ std over 32 measured samples on NVIDIA RTX A6000 GPUs\. All methods use one A6000 by default; Qwen3\.5\-27B\-based baselines use $2\times$ A6000 sharding because their fp16 weights do not fit on one 48 GB card\. Timing includes all online preprocessing and model calls required for prediction, while offline caches such as TreeText\-CTS leaf verbalizations and leaf embeddings are excluded\.
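A timing harness in the spirit of this protocol might look like the sketch below. It is framework\-agnostic: `measure_latency`, `predict`, and `sync` are hypothetical names, and in a PyTorch setting one would pass `torch.cuda.synchronize` as `sync`. Whether the reported std is the population or sample form is not stated, so the population form used here is an assumption.

```python
import statistics
import time
from typing import Callable, Iterable, Tuple


def measure_latency(
    predict: Callable[[object], object],
    samples: Iterable[object],
    sync: Callable[[], None] = lambda: None,
    warmup: int = 1,
) -> Tuple[float, float]:
    """Return (mean, std) of per-sample wall-clock latency in seconds."""
    samples = list(samples)
    # Warm-up calls are executed but their timings are discarded.
    for sample in samples[:warmup]:
        predict(sample)
        sync()
    timings = []
    for sample in samples:
        sync()                      # barrier before starting the clock
        start = time.perf_counter()
        predict(sample)
        sync()                      # barrier so queued GPU work is included
        timings.append(time.perf_counter() - start)
    return statistics.fmean(timings), statistics.pstdev(timings)


mean_s, std_s = measure_latency(lambda x: sum(range(1000)), range(32))
print(f"{mean_s * 1e3:.3f} ms +/- {std_s * 1e3:.3f} ms")
```

The barriers matter because GPU kernels are launched asynchronously; without them the clock could stop before the work finishes.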

### D\.3 Reproducibility, code release, and data access

##### Reproducibility scope\.

All main results use the same filtered patient splits reported in Appendix [A\.1](https://arxiv.org/html/2605.20292#A1.SS1)\. The XGBoost models, leaf verbalization cache, and leaf embedding cache are created before patient\-level training\. During patient\-level training, the leaf projection, CES, LM classifier, and classification head are optimized jointly from the first epoch\.

##### Cached artifacts\.

Leaf verbalization and leaf embeddings are computed once per tree leaf, not through patient\-level free\-form generation\. At inference, the model retrieves candidate evidence, applies CES under budget $K=30$, assembles selected evidence text, and runs the encoder classifier\.
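The budgeted assembly step can be illustrated with a simplified rule. This sketch assumes each candidate has a scalar selector score, that a candidate passes the gate when its score exceeds a threshold, and that the floor $K_{\min}=5$ and cap $K=30$ are enforced by rank; the function name `assemble_evidence` and this gating rule are assumptions rather than the paper's learned gate margins.

```python
from typing import List, Sequence


def assemble_evidence(
    scores: Sequence[float],
    k_max: int = 30,
    k_min: int = 5,
    threshold: float = 0.0,
) -> List[int]:
    """Return indices of selected candidates, in their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_passed = sum(1 for s in scores if s > threshold)
    # Apply the minimum-selection floor, then the budget cap.
    n_keep = min(k_max, max(n_passed, k_min), len(ranked))
    return sorted(ranked[:n_keep])


print(assemble_evidence([0.5, -1.0, 2.0, -0.2, -3.0, 0.1, -0.5]))
```

In the example only three candidates pass the gate, so the floor tops the selection up to five using the next\-highest scores.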

##### Planned training command after code release\.

The main TreeText\-CTS training configuration is summarized by the following command\-line arguments, which will be supported by the released implementation:

```
python train_treetext_cts.py \
  --dataset {physionet2012,mimic3_mortality,physionet2019} \
  --train_mode rl_selection \
  --model_name bioclinicalmodernbert \
  --max_length 3072 \
  --rl_top_k 5 \
  --vb_budget 30 \
  --rl_min_sel_count 5 \
  --rl_lr 1e-4 \
  --lr 1e-5 \
  --batch_size 8 \
  --seed_num 3
```

### D\.4 Compute resources and licenses

##### Hardware\.

All TreeText\-CTS training and evaluation runs use a single NVIDIA RTX A6000 GPU with 48 GB memory on a host with an Intel Xeon Silver 4310 CPU and 503 GB system RAM\. Latency measurements use the same A6000\. The Qwen3\.5\-27B generative baselines require two A6000 GPUs sharded via Hugging Face `device_map="auto"` because the fp16 model does not fit on a single 48 GB device\.

##### Per\-run wall\-clock cost\.

A single TreeText\-CTS training run with the canonical configuration \(BioClinical ModernBERT, maximum length 3072, batch size 8, $K=30$, $K_{\min}=5$, and top\-$M=5$ candidate retrieval\) costs approximately 19–21 minutes per epoch on one A6000, dominated by the BioClinical ModernBERT forward/backward pass\. End\-to\-end runtime per seed is approximately 6\.4 hours on PhysioNet 2012, 8\.4 hours on MIMIC\-III, and 5\.0 hours on PhysioNet 2019\. The three\-seed main TreeText\-CTS results therefore consume approximately 60 GPU\-hours\. The XGBoost trees, leaf\-verbalization cache, and leaf\-embedding cache are computed once per dataset, requiring approximately 2 CPU\-hours for trees and 3 GPU\-hours for Qwen3\-Embedding\-8B leaf embeddings, and are reused across TreeText\-CTS runs\.
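The 60 GPU\-hour figure follows from the per\-seed runtimes above; a quick check, using only the numbers stated in this paragraph:

```python
hours_per_seed = {"PhysioNet 2012": 6.4, "MIMIC-III": 8.4, "PhysioNet 2019": 5.0}
n_seeds = 3

# One GPU per run, so GPU-hours equal wall-clock hours summed over runs.
total_gpu_hours = n_seeds * sum(hours_per_seed.values())
print(f"{total_gpu_hours:.1f} GPU-hours")
```

This gives roughly 59\.4 GPU\-hours, consistent with the rounded value of about 60\.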

##### Total project budget\.

Including the main results, ablations, budget sweeps, candidate\-retrieval sensitivity, reader\-backbone swaps, selector\-input masking, and exploratory configurations that did not enter the paper, the full project consumed approximately 2,800 A6000 GPU\-hours\. Numerical\-baseline reimplementations across eight ISMTS models, three datasets, three seeds, and three learning rates account for an additional approximately 900 GPU\-hours\. Latency measurements add approximately 10 GPU\-hours for the Qwen\-summary baselines\.

##### Dataset licenses and data\-use terms\.

PhysioNet 2012 v1\.0\.0 is distributed through PhysioNet under the Open Data Commons Attribution License v1\.0\. MIMIC\-III v1\.4 is a PhysioNet restricted\-access resource under the PhysioNet Credentialed Health Data License v1\.5\.0 and the PhysioNet Credentialed Health Data Use Agreement v1\.5\.0; access requires credentialing and CITI “Data or Specimens Only Research” training\. PhysioNet 2019 v1\.0\.0 is distributed through the official PhysioNet Challenge repository with its included license file\. We do not redistribute raw data from any benchmark\.

##### Compliance\.

All clinical datasets are used only for the prediction tasks studied in the paper, with no attempt at patient re\-identification, no linkage to external databases, and no redistribution of raw records\. Leaf verbalizations and clinical glosses are cached at the tree\-leaf level and are tied to anonymized tree/leaf indices rather than patient identifiers\. Pretrained\-model usage follows the corresponding model licenses and access terms\.
