Hierarchical Modeling of ICD Codes in EHR Foundation Models

arXiv cs.AI 06/16/26, 04:00 AM Papers
Summary
This paper investigates explicit encoding of ICD-10-CM hierarchy in EHR foundation models, using hierarchical token augmentation and graph-based code representations. Experiments on MIMIC-IV and eICU show improvements over flat code representations for in-domain and cross-dataset prediction tasks.
arXiv:2606.15447v1 Announce Type: new Abstract: Electronic health record foundation models typically treat ICD diagnosis codes as flat tokens, overlooking the clinically meaningful hierarchical structure that captures disease families, subcategories, and fine-grained diagnostic detail. As a result, existing EHR representation learning methods do not explicitly exploit the hierarchical structure already present in the coding system. In this work, we study ICD-10-CM hierarchy as a general inductive bias for clinical representation learning. We investigate two complementary mechanisms for incorporating hierarchy: first, by augmenting diagnosis sequences in a BERT-style transformer with tokens corresponding to different levels of the ICD hierarchy, and second, by injecting hierarchy into graph-based code representations through hierarchy-aware edges combined with diagnosis co-occurrence structure. Across these settings, we evaluate whether explicit hierarchy improves downstream prediction, which levels of the hierarchy are most useful, whether hierarchy encoding improves transfer across datasets, and how hierarchy reshapes embedding similarity structure. We conduct experiments on two large-scale real-world clinical datasets: MIMIC-IV, used for pretraining and in-domain evaluation, and eICU, used to assess cross-dataset transfer via frozen encoder probing. Our findings show that explicitly encoding ICD hierarchy improves over flat code representations in both in-domain and cross-dataset settings, while revealing that the most useful level of hierarchy depends on both the task and the modeling approach. More broadly, we focus on hierarchy-aware EHR representation learning and show that the benefits of encoding hierarchy are generalizable across modeling settings and hierarchy levels.
Cached at: 06/16/26, 11:45 AM
# Hierarchical Modeling of ICD Codes in EHR Foundation Models
Source: [https://arxiv.org/html/2606.15447](https://arxiv.org/html/2606.15447)

Dong Gyun Kang¹ \(dkang335@gatech\.edu\), Rudra Pratap Singh¹ \(rudra\.singh@gatech\.edu\), Shruthi Kashinath Hiremath² \(shruthi\.hiremath@optum\.com\), Katrin Hänsel² \(katrin\_haensel@optum\.com\), Thomas Plötz¹ \(thomas\.ploetz@gatech\.edu\)

¹ School of Interactive Computing, Georgia Institute of Technology, Atlanta, United States

² Optum AI, United States

###### Abstract

Electronic health record foundation models typically treat ICD diagnosis codes as flat tokens, overlooking the clinically meaningful hierarchical structure that captures disease families, subcategories, and fine\-grained diagnostic detail\. As a result, existing EHR representation learning methods do not explicitly exploit the hierarchical structure already present in the coding system\. In this work, we study ICD\-10\-CM hierarchy as a *general inductive bias* for clinical representation learning\. We investigate two complementary mechanisms for incorporating hierarchy: first, by augmenting diagnosis sequences in a BERT\-style transformer with tokens corresponding to different levels of the ICD hierarchy, and second, by injecting hierarchy into graph\-based code representations through hierarchy\-aware edges combined with diagnosis co\-occurrence structure\. Across these settings, we evaluate whether explicit hierarchy improves downstream prediction, which levels of the hierarchy are most useful, whether hierarchy encoding improves transfer across datasets, and how hierarchy reshapes embedding similarity structure\. We conduct experiments on two large\-scale real\-world clinical datasets: MIMIC\-IV, used for pretraining and in\-domain evaluation, and eICU, used to assess cross\-dataset transfer via frozen encoder probing\. Our findings show that explicitly encoding ICD hierarchy improves over flat code representations in both in\-domain and cross\-dataset settings, while revealing that the most useful level of hierarchy depends on both the task and the modeling approach\. More broadly, we focus on hierarchy\-aware EHR representation learning and show that the benefits of encoding hierarchy are generalizable across modeling settings and hierarchy levels\. Our code is available [here](https://meghathukral.github.io/HICD_EHR_FMs/)\.

## 1 Introduction

Electronic health records \(EHRs\) provide a longitudinal view of patients’ clinical history through diagnoses, procedures, medications, and encounters\. Advances in transformer\-based EHR models, such as BEHRT and Med\-BERT, have shown that large\-scale pretraining can learn useful patient representations for downstream clinical prediction tasks \(li2020behrt; rasmy2021medbert\)\. However, much of this progress still treats structured clinical codes as flat symbols, even when those codes are drawn from ontologies that were explicitly designed to encode clinical hierarchy\. This creates a mismatch between the structure of the data and the structure presented to the model\. For instance, ICD codes are not arbitrary identifiers: they organize diseases into clinically meaningful families, subgroups, and increasingly specific categories \(who\_icd10\) \(see Figure [1](https://arxiv.org/html/2606.15447#S3.F1)\)\. A code, therefore, carries both a fine\-grained diagnosis and a path through a broader clinical taxonomy\. In practice, however, most EHR models represent ICD codes as independent tokens, ignoring this hierarchical structure\. As a result, related diagnoses that should share information may instead be learned as isolated symbols\. This is especially problematic in clinical data, where many diagnoses are sparse, long\-tailed, and institution\-specific \(choi2016doctor\)\.

We hypothesize that explicitly modelling hierarchy could benefit EHR representation learning in several ways\. First, the hierarchy provides a natural inductive bias for relationships between diseases, so rare diagnosis codes can benefit from data associated with clinically similar codes \(perotte2014diagnosis\)\. Second, broader and intermediate groupings may capture care pathways or anatomical systems that are more predictive than the most specific code alone \(song2019medical\)\. Third, hierarchy can improve robustness under *domain shift*, since higher\-level disease families are often more stable across institutions \(ostrominski2021coding\)\. More broadly, encoding hierarchy can help learned embeddings better reflect clinical semantics rather than only empirical co\-occurrence\.

Prior work has explored incorporating ICD code hierarchy into healthcare representation learning, such as GRAM \(choi2017gram\), primarily through graph\-based approaches that embed ontological structure into code representations\. However, these methods predate the current generation of transformer\-based EHR foundation models, and it remains underexplored how the ICD hierarchy should be systematically incorporated into such models, and which granularity of hierarchy proves most beneficial\.

To address this gap, we propose and evaluate two complementary strategies for injecting ICD hierarchy into contemporary EHR representation learning\. The first, HICD\-BERT \(Hierarchical ICD\-BERT\), injects hierarchy directly into token representations in a BERT\-style encoder, embedding each level of the ICD hierarchy as an additive token embedding alongside the diagnosis code\. The second, HICD\-Graph \(Hierarchical ICD\-Graph\), encodes hierarchy relationally via a diagnosis co\-occurrence graph augmented with ontology\-derived edges, learning hierarchy\-aware code embeddings that initialise a patient\-level Transformer\. Together, these approaches allow us to compare whether hierarchical information is best consumed as a token\-level signal or as a relational structure over the code vocabulary\.

Our results show that hierarchy is a robustly effective inductive bias for EHR representation learning: across both architectures and both prediction tasks, the large majority of hierarchy\-augmented configurations outperform their no\-hierarchy baselines\. Both models benefit from hierarchy, with graph\-based encoding leveraging all hierarchy levels and token\-based encoding benefiting most from the finest granularity level\. Crucially, hierarchy also improves cross\-dataset transfer, specifically for graph\-based hierarchy encoding, which transfers robustly from MIMIC\-IV to eICU\.

- We investigate ICD code hierarchy as an inductive bias for EHR representation learning across two architectures, two clinical prediction tasks, and a systematic ablation over three granularity levels, showing that hierarchy significantly improves performance in 26 of 28 comparisons\.
- We examine the benefit of hierarchy under distributional shift by evaluating cross\-dataset transfer from MIMIC\-IV to eICU under a frozen\-encoder probe protocol, showing that graph\-based hierarchy encoding transfers robustly to new datasets and task settings\.
- Through embedding analysis of the learned code representations, we show that hierarchy produces more coherent code clusters, and that the configurations with the tightest clusters also achieve the strongest downstream performance\.

### Generalizable Insights about Machine Learning in the Context of Healthcare

Our findings suggest two broader lessons for machine learning in health\. First, structured clinical ontologies such as ICD provide useful inductive biases for EHR foundation models, and even lightweight ways of encoding that structure can yield consistent gains over flat code representations\. Second, incorporating clinically meaningful structure can also help under distribution shift, and the mechanism of incorporation matters: graph\-based hierarchy encoding transfers more robustly across datasets than token\-level hierarchy injection\. These insights are relevant beyond ICD\-10\-CM and can be extended to other structured clinical vocabularies, such as ATC and procedure ontologies, and to other representation learning methods\.

## 2 Related Work

### 2\.1 Sequential and Transformer\-based EHR representation learning

Early work on longitudinal EHR modeling established the importance of learning from temporally ordered patient histories\. Doctor AI \(choi2016doctor\) used recurrent neural networks to predict future clinical events from visit sequences, while RETAIN \(choi2016retain\) introduced a reverse\-time attention mechanism to improve interpretability in healthcare prediction\. Dipole \(ma2017dipole\) further demonstrated that attention\-based sequence models can improve diagnosis prediction by capturing dependencies across visits\. These methods showed that sequential structure in EHRs is highly informative, but they generally treated diagnosis codes as flat symbols\.

More recently, Transformer\-based models have become a central approach for structured EHR representation learning\. BEHRT \(li2020behrt\) was an early and influential framework that represents patient histories as sequences of medical codes and applies self\-attention to capture dependencies across visits\. Med\-BERT \(rasmy2021medbert\) further showed that masked pretraining on large EHR corpora can produce reusable representations for downstream prediction tasks\. Subsequent models such as CEHR\-BERT \(pang2021cehr\), ExBEHRT \(rupp2023exbehrt\), and larger\-scale foundation models such as Foresight \(kraljevic2024foresight\) and ETHOS \(renc2024zero\) have further scaled this paradigm with richer input representations and larger training corpora\. Despite these advances, Transformer\-based EHR models have generally treated diagnosis codes as atomic identifiers\.

### 2\.2 Ontology\-aware and Hierarchy\-aware Medical Representation Learning

A parallel line of work has explored how medical ontologies can improve representation learning\. Many of these methods were developed in the pre\-foundation\-model era and are trained end\-to\-end for a single downstream task rather than as reusable pretrained encoders\. GRAM \(choi2017gram\) incorporates ontology ancestors via attention\-weighted aggregation within a recurrent patient encoder, and introduced an intuition central to our study: clinically related diagnoses should share information through their position in a hierarchy\. KAME \(ma2018kame\) extended this with a knowledge\-based attention mechanism for diagnosis prediction\. MiME \(choi2018mime\) explored related ideas by structuring medical codes around clinically meaningful groupings\. Among ontology\-aware methods that use pretraining, G\-BERT \(shang2019gbert\) applies graph\-augmented BERT\-style pretraining for medication recommendation, though it does not study how different ontology levels contribute\.

A complementary line of work has explored purely data\-driven approaches to code relationships, learning structure from co\-occurrence rather than from an external taxonomy\. Med2Vec \(choi2016med2vec\) demonstrated that skip\-gram\-style co\-occurrence embeddings can recover clinically meaningful code and visit representations without any ontology supervision\. Subsequent graph\-based methods have modeled empirical dependencies among diagnoses for prediction: \(poulain2024graph\) builds patient\-level heterogeneous graphs that jointly encode diagnoses, procedures, and demographics, learning representations through message passing over clinically grounded relations, while \(xi2025breaking\) constructs disease co\-occurrence graphs from patient records and applies graph neural networks to capture higher\-order relational signal beyond pairwise embeddings\. These approaches recover relational structure directly from data, but unlike ontology\-aware methods, they do not leverage existing domain knowledge\-based hierarchy\.

Across these lines of work, prior methods typically rely on a single source of hierarchical signal\. Even when pretrained, they generally do not ablate across hierarchy levels or explicitly examine how hierarchy itself shapes the learned representation\. This leaves open whether hierarchy helps consistently across tasks and architectures, which levels of the ICD hierarchy contribute most, and how hierarchy\-aware representations behave under cross\-dataset transfer\. In contrast, we evaluate two paradigms head\-to\-head within modern Transformer\-based EHR foundation models: HICD\-BERT injects hierarchy at the token\-embedding level via string\-prefix truncation, requiring no ontology access at training time, while HICD\-Graph combines ontology\-derived hierarchy edges with empirical PMI\-weighted co\-occurrence edges in a hybrid diagnosis graph, learning hierarchy\-aware code embeddings via a graph convolutional network\. Across both paradigms, we systematically ablate three hierarchy granularity levels, evaluate on two in\-domain tasks and one cross\-dataset transfer task, and analyze how hierarchy reshapes the learned embedding geometry, isolating not only *whether* hierarchy helps, but *which* encoding paradigm makes it most useful, *which* levels contribute, and *whether* ontology information is strictly necessary\.

## 3 Methods

We represent each patient as a temporally ordered sequence of visits, $\mathcal{V} = (v_1, \dots, v_T)$, where each visit $v_t$ contains one or more ICD diagnosis codes\. Our goal is to learn patient representations that preserve both temporal context and the hierarchical structure of diagnosis ontologies, and to evaluate whether such structure improves downstream prediction and generalization\. We study two complementary mechanisms for injecting ICD hierarchy into EHR models, which we collectively refer to as Hierarchical ICD models \(HICD\)\. The first is a *token\-level* approach, HICD\-BERT, which extends BEHRT \(li2020behrt\) by adding hierarchy\-specific embedding tokens for each diagnosis code \(Figure [2](https://arxiv.org/html/2606.15447#S3.F2)\(a\)\)\. The second is a *graph\-level* approach, HICD\-Graph, which constructs a hierarchy\-augmented diagnosis co\-occurrence graph, learns code embeddings with a graph convolutional network \(GCN\) \(kipf2017semi\), and uses those embeddings to initialise a Transformer\-based patient encoder \(Figure [2](https://arxiv.org/html/2606.15447#S3.F2)\(b\)\)\. Full implementation details and hyperparameters are provided in Appendix [B](https://arxiv.org/html/2606.15447#A2)\.

### 3\.1 ICD Hierarchy Levels

ICD\-10\-CM codes are organised hierarchically \(who\_icd10\)\. Every code belongs to a *chapter* \(a broad body\-system or etiology grouping, e\.g\. Chapter XIX, *Injury, Poisoning and External Causes*\), partitioned into officially defined *blocks* of related categories \(e\.g\. S70–S79, *Injuries to the hip and thigh*\), which contain three\-character *categories* \(e\.g\. S72, *Fracture of femur*\)\. Characters after the decimal may encode etiology, anatomic site, severity, and encounter details \(Fig\. [1](https://arxiv.org/html/2606.15447#S3.F1)\)\. Each leaf code is nested inside progressively coarser clinical groupings, and this nesting is the inductive bias we inject\.

![Refer to caption](https://arxiv.org/html/2606.15447v1/sections/images/icd_10_code.png)

Figure 1: Structure of an ICD\-10\-CM diagnosis code\.

For each diagnosis code, we define three nested hierarchy levels G0, G1, G2, ordered from coarsest to finest\. These levels are relative\-granularity indices; each architecture operationalises them through its own mechanism \(details below\)\. Each ablation setting enables or disables a subset of these levels, yielding all eight combinations $(\texttt{G0}, \texttt{G1}, \texttt{G2}) \in \{0,1\}^{3}$, including the non\-hierarchical baseline $(0,0,0)$, equivalent to BEHRT \(li2020behrt\) for HICD\-BERT and to a standard GCN\-Transformer without hierarchy edges for HICD\-Graph\. This setup compares two paradigms: data\-driven token encoding, where hierarchy is derived directly from diagnosis code strings, and hybrid graph\-based encoding, where hierarchy is injected through ontology\-derived graph nodes alongside empirical co\-occurrence structure\.
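The eight ablation settings can be enumerated mechanically\. The following is a minimal sketch; the helper names `ablation_configs` and `config_name` are illustrative and not taken from the paper's code\.

```python
from itertools import product

LEVELS = ("G0", "G1", "G2")


def ablation_configs():
    """Return all 2^3 on/off settings as dicts, e.g. {"G0": 1, "G1": 0, "G2": 1}."""
    return [dict(zip(LEVELS, bits)) for bits in product((0, 1), repeat=len(LEVELS))]


def config_name(cfg):
    """Readable label such as 'G0+G2'; the all-off setting is the baseline."""
    enabled = [lvl for lvl in LEVELS if cfg[lvl]]
    return "+".join(enabled) if enabled else "baseline"


configs = ablation_configs()
names = [config_name(cfg) for cfg in configs]
```

Seven of the eight settings enable at least one level; the remaining one is the no\-hierarchy baseline\.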

##### HICD\-BERT \(data\-driven\)\.

Hierarchy is injected as auxiliary token embeddings using string\-prefix truncation \(Fig\. [2](https://arxiv.org/html/2606.15447#S3.F2)\(a\)\), requiring no external ontology at training time\. G0 is the 1\-character alphabetic prefix, G1 the 2\-character prefix, and G2 the 3\-character ICD category\. These act as data\-derived surrogates for increasing hierarchy depth rather than official chapter or block identifiers\. For example, S72\.001A maps to $(\texttt{G0}, \texttt{G1}, \texttt{G2}) = (\texttt{S}, \texttt{S7}, \texttt{S72})$\.
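The prefix truncation described above is simple enough to sketch directly\. The function name `hicd_bert_levels` and the normalisation steps \(upper\-casing, stripping the decimal point\) are assumptions for illustration, not details confirmed by the paper\.

```python
def hicd_bert_levels(code: str) -> tuple[str, str, str]:
    """Map an ICD-10-CM code to (G0, G1, G2) by string-prefix truncation."""
    # Assumed normalisation: drop whitespace and the decimal point, upper-case.
    normalized = code.strip().upper().replace(".", "")
    if len(normalized) < 3:
        raise ValueError(f"expected at least a 3-character category, got {code!r}")
    # 1-character prefix, 2-character prefix, 3-character category.
    return normalized[:1], normalized[:2], normalized[:3]


levels = hicd_bert_levels("S72.001A")
```

Because the mapping only needs the code string, no ontology lookup is involved, which matches the "data\-driven" framing of this variant\.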

##### HICD\-Graph \(hybrid ontology\-based\)\.

Hierarchy is represented as graph nodes connected to leaf diagnosis codes by weighted edges \(Fig\. [2](https://arxiv.org/html/2606.15447#S3.F2)\(b\)\)\. These nodes are derived from the ICD\-10\-CM ontology and are incorporated into a graph that also encodes diagnosis co\-occurrence structure\. G0 denotes the ICD\-10 chapter, G1 the official ICD\-10\-CM block, and G2 the 2\-character prefix group\. For example, S72\.001A maps to $(\texttt{G0}, \texttt{G1}, \texttt{G2}) = (\text{Chapter XIX}, \text{S70–S79}, \text{S7}x)$\. Although G1 and G2 are closely related, they are not identical: G1 follows official ICD\-10\-CM block boundaries, while G2 groups codes by shared first two characters\. A worked example of how leaf codes share hierarchy nodes in HICD\-Graph is provided in Appendix [A](https://arxiv.org/html/2606.15447#A1)\.
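As a rough illustration of the ontology\-based mapping, the sketch below resolves a code to its chapter, block, and 2\-character prefix group\. The lookup tables are hypothetical miniatures containing only the example from the text \(the S00 to T88 chapter range is the ICD\-10\-CM range for this chapter\); a real pipeline would load the full chapter and block definitions\.

```python
# Hypothetical miniature ontology tables: (first category, last category, label).
CHAPTERS = [("S00", "T88", "Chapter XIX")]
BLOCKS = [("S70", "S79", "S70-S79")]


def _lookup(category, table):
    """Return the label whose category range contains `category`, else None."""
    for first, last, label in table:
        if first <= category <= last:
            return label
    return None


def hicd_graph_levels(code):
    """Map a code to (G0 chapter, G1 official block, G2 two-character prefix group)."""
    category = code.strip().upper().replace(".", "")[:3]
    return _lookup(category, CHAPTERS), _lookup(category, BLOCKS), category[:2] + "x"


example = hicd_graph_levels("S72.001A")
```

Two leaf codes with the same block label would attach to the same G1 node, which is how related diagnoses come to share neighbours in the graph\.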

All analyses and comparisons are made *within* each model, and we interpret the levels by relative granularity \(coarse to fine\) rather than assuming the same label corresponds to identical clinical content across architectures\.

### 3\.2 HICD\-BERT: Hierarchy\-aware BEHRT

HICD\-BERT extends BEHRT \(li2020behrt\), a Transformer encoder for longitudinal EHR sequences\. In BEHRT, each diagnosis token is represented by the sum of the token, age, positional, and segment embeddings\. We augment this representation with up to three hierarchy\-specific embeddings:

$$\mathbf{h}^{(0)}_{i} = \mathbf{e}_{\text{code}}(c_i) + \mathbf{e}_{\text{age}}(a_i) + \mathbf{e}_{\text{pos}}(p_i) + \mathbf{e}_{\text{seg}}(s_i) + \mathbb{I}[G0]\,\mathbf{e}_{G0}(g_i^{(0)}) + \mathbb{I}[G1]\,\mathbf{e}_{G1}(g_i^{(1)}) + \mathbb{I}[G2]\,\mathbf{e}_{G2}(g_i^{(2)}), \tag{1}$$

where $c_i$ is the diagnosis token, $a_i$ is age, $p_i$ is position, $s_i$ is the segment identifier, and $g_i^{(0)}, g_i^{(1)}, g_i^{(2)}$ are the group identifiers at hierarchy levels $G0$, $G1$, and $G2$\. This design preserves the original BEHRT tokenization while directly injecting the coarse\-to\-fine ICD structure into the embedding layer\. The resulting sequence is processed by a stack of Transformer encoder blocks \(vaswani2017attention\)\. Each block contains multi\-head self\-attention, a position\-wise feed\-forward network, residual connections, and layer normalization\. The hierarchy\-aware embedding layer is the only change to the BEHRT backbone\. Additional architectural and training details are provided in Appendix [B\.1](https://arxiv.org/html/2606.15447#A2.SS1)\.
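Equation \(1\) amounts to summing one lookup per embedding table, with the hierarchy tables gated by on/off flags\. The sketch below uses plain Python lists and randomly initialised toy tables; the dimension, table contents, and function names are assumptions for illustration, and a real model would use learned embedding layers\.

```python
import random

DIM = 4
rng = random.Random(0)


def make_table(keys):
    """Toy embedding table: one random DIM-dimensional vector per key."""
    return {k: [rng.uniform(-1, 1) for _ in range(DIM)] for k in keys}


tables = {
    "code": make_table(["S72.001A"]),
    "age": make_table([67]),
    "pos": make_table([0]),
    "seg": make_table([0]),
    "G0": make_table(["S"]),
    "G1": make_table(["S7"]),
    "G2": make_table(["S72"]),
}


def embed_token(token, enabled):
    """Sum the base embeddings plus each hierarchy embedding whose flag is on."""
    parts = [tables[name][token[name]] for name in ("code", "age", "pos", "seg")]
    parts += [tables[lvl][token[lvl]] for lvl in ("G0", "G1", "G2") if enabled[lvl]]
    return [sum(values) for values in zip(*parts)]


token = {"code": "S72.001A", "age": 67, "pos": 0, "seg": 0,
         "G0": "S", "G1": "S7", "G2": "S72"}
baseline = embed_token(token, {"G0": 0, "G1": 0, "G2": 0})
with_g2 = embed_token(token, {"G0": 0, "G1": 0, "G2": 1})
```

With all flags off the result reduces to the four\-term BEHRT sum, so the baseline is recovered as a special case of the same layer\.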

We pretrain HICD\-BERT using a masked language modeling \(MLM\) objective over the diagnosis sequence\. Given a masked sequence $\tilde{\mathbf{x}}$, the model predicts the original diagnosis token at masked positions:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p(c_i \mid \tilde{\mathbf{x}}), \tag{2}$$

where $\mathcal{M}$ denotes the set of masked positions\.
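A minimal sketch of Equation \(2\), assuming the model emits one list of vocabulary logits per position and that $p(c_i \mid \tilde{\mathbf{x}})$ is a softmax over those logits\. The function name and data layout are illustrative\.

```python
import math


def mlm_loss(logits, targets, masked_positions):
    """Sum of negative log-probabilities of the true tokens at masked positions.

    logits[i] is a list of vocabulary scores at position i;
    targets[i] is the index of the original token at position i.
    """
    total = 0.0
    for i in masked_positions:
        row = logits[i]
        peak = max(row)
        # Log-sum-exp with the max subtracted for numerical stability.
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total -= row[targets[i]] - log_z
    return total


# Uniform logits over a 4-token vocabulary; positions 0 and 2 are masked.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = mlm_loss(uniform_logits, targets=[1, 3, 2], masked_positions=[0, 2])
```

Unmasked positions contribute nothing to the loss, so only the masked diagnosis tokens drive the pretraining signal\.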

For downstream prediction, we initialize the encoder from the pretrained MLM checkpoint and attach a task\-specific multilayer perceptron to the final \[CLS\] representation\. Binary tasks are optimized with binary cross\-entropy with logits, while multiclass tasks are optimized with cross\-entropy\.

![Refer to caption](https://arxiv.org/html/2606.15447v1/x1.png)

\(a\) HICD\-BERT

![Refer to caption](https://arxiv.org/html/2606.15447v1/sections/images/HICD-Graph.png)

\(b\) HICD\-Graph

Figure 2: Comparison of HICD\-BERT and HICD\-Graph\.

### 3\.3 HICD\-Graph: Graph\-based Hierarchy Modeling

Our second model injects ICD hierarchy at the code\-graph level rather than the token\-embedding level\. We first construct a diagnosis co\-occurrence graph over ICD codes, where horizontal edges reflect empirical co\-occurrence within patient visits and are weighted by pointwise mutual information \(PMI\):

$$\text{PMI}(a,b) = \log\frac{P(a,b)}{P(a)P(b)} + \epsilon, \tag{3}$$

where $P(a)$ and $P(b)$ are marginal code frequencies and $P(a,b)$ is their joint co\-occurrence probability\. We retain only positive PMI edges above a percentile threshold to suppress weak associations\. We augment this empirical graph with uniform\-weight ICD\-10 hierarchy edges linking each diagnosis node to up to three enabled ancestor levels, G2, G1, and G0 \(Section [3\.1](https://arxiv.org/html/2606.15447#S3.SS1), Fig\. [4](https://arxiv.org/html/2606.15447#A1.F4)\), yielding a hybrid graph construction that combines data\-driven PMI edges with ontological hierarchy edges\. Given the resulting graph, we learn code embeddings using a two\-layer GCN \(kipf2017semi\) trained with a link\-prediction objective:
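The hybrid graph construction can be sketched in two steps: PMI\-weighted co\-occurrence edges, then uniform\-weight hierarchy edges\. In the sketch below, probabilities are estimated as the fraction of visits containing a code or pair, $\epsilon$ is added outside the logarithm as written in Equation \(3\), and the percentile cutoff rule is one simple choice; these, along with all function names, are assumptions rather than the paper's exact procedure\.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_edges(visits, percentile=0.0, eps=0.0):
    """Return {(a, b): pmi} for positive-PMI pairs at or above a percentile cutoff."""
    n = len(visits)
    code_count, pair_count = Counter(), Counter()
    for visit in visits:
        codes = sorted(set(visit))
        code_count.update(codes)
        pair_count.update(combinations(codes, 2))
    scores = {}
    for (a, b), n_ab in pair_count.items():
        pmi = math.log((n_ab / n) / ((code_count[a] / n) * (code_count[b] / n))) + eps
        if pmi > 0:
            scores[(a, b)] = pmi
    if not scores:
        return {}
    ranked = sorted(scores.values())
    cutoff = ranked[min(int(percentile * len(ranked)), len(ranked) - 1)]
    return {pair: s for pair, s in scores.items() if s >= cutoff}


def add_hierarchy_edges(edges, ancestors, weight=1.0):
    """Add a uniform-weight edge from each code to each of its enabled ancestors."""
    merged = dict(edges)
    for code, parents in ancestors.items():
        for parent in parents:
            merged[(code, parent)] = weight
    return merged


visits = [["A", "B"], ["A", "B"], ["C", "D"], ["A", "C"]]
edges = pmi_edges(visits)
hybrid = add_hierarchy_edges(edges, {"A": ["chapter_1"], "B": ["chapter_1"]})
```

In this toy input the pair \(A, C\) has negative PMI and is dropped, while A and B end up linked both empirically and through a shared ancestor node\.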

$$\mathcal{L}_{\text{graph}} = -\sum_{(u,v) \in \mathcal{E}} \log \sigma(\mathbf{z}_u^{\top}\mathbf{z}_v) - \sum_{(u,v^{-}) \notin \mathcal{E}} \log\left(1 - \sigma(\mathbf{z}_u^{\top}\mathbf{z}_{v^{-}})\right), \tag{4}$$

where $\mathbf{z}_u$ and $\mathbf{z}_v$ denote node embeddings and $\mathcal{E}$ is the set of graph edges\.
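Equation \(4\) can be evaluated directly once positive edges and sampled non\-edges are given\. The sketch below takes the negative pairs as an explicit argument; how negatives are sampled is not specified in this section, so that part is left to the caller\.

```python
import math


def _log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def link_prediction_loss(z, pos_edges, neg_edges):
    """z maps node -> embedding; pos_edges are graph edges, neg_edges are non-edges."""
    loss = 0.0
    for u, v in pos_edges:
        loss -= _log_sigmoid(_dot(z[u], z[v]))
    for u, v in neg_edges:
        # log(1 - sigmoid(x)) equals log(sigmoid(-x)).
        loss -= _log_sigmoid(-_dot(z[u], z[v]))
    return loss


zeros = {"a": [0.0, 0.0], "b": [0.0, 0.0], "c": [0.0, 0.0]}
loss_at_zero = link_prediction_loss(zeros, [("a", "b")], [("a", "c")])
```

With all\-zero embeddings every term equals $\log 2$, which gives a convenient sanity value before training\.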

The learned code embeddings are then used to initialise the embedding table of a patient\-level Transformer encoder\. For each visit, we compute a visit\-level representation by applying masked mean pooling over the embeddings of the diagnosis codes \(and, when enabled, their hierarchy ancestor codes\) observed in that visit\. The temporally ordered sequence of visit representations is then processed by a Transformer encoder, followed by attention\-based pooling across visits to produce a patient\-level representation, which is passed to a task\-specific classification head\. Additional implementation details are provided in Appendix[B\.2](https://arxiv.org/html/2606.15447#A2.SS2)\.
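The visit\-level pooling step can be sketched as a masked mean, assuming each visit is padded to a fixed number of code slots with a 0/1 mask marking real entries\. The padding convention and the zero vector returned for an empty visit are assumptions for illustration\.

```python
def masked_mean_pool(code_vectors, mask):
    """Average the vectors whose mask entry is 1; padded slots (mask 0) are ignored."""
    dim = len(code_vectors[0])
    kept = [vec for vec, m in zip(code_vectors, mask) if m]
    if not kept:
        # Assumed convention: an empty visit pools to the zero vector.
        return [0.0] * dim
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real code embeddings followed by one padding slot.
visit_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
visit_repr = masked_mean_pool(visit_vectors, mask=[1, 1, 0])
```

When hierarchy ancestors are enabled, their embeddings would simply appear as additional unmasked slots in the same visit\.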

### 3\.4 Experimental Setup

#### 3\.4\.1 Datasets

##### MIMIC\-IV

We use data from the Medical Information Mart for Intensive Care IV \(MIMIC\-IV\), a large, de\-identified clinical database developed at Beth Israel Deaconess Medical Center in Boston, Massachusetts \(johnson\_mimic\-iv\_2023\)\. MIMIC\-IV contains electronic health records for hospital admissions between 2008 and 2019\. For this work, we restrict our use of MIMIC\-IV to *diagnosis information* from the hospital\-wide \(hosp\) module\. This module covers all inpatient admissions, not only ICU stays, and stores structured diagnosis codes assigned to each hospital encounter\. Diagnoses are encoded using International Classification of Diseases, Ninth Revision, Clinical Modification \(ICD\-9\-CM\) \(mapped to ICD\-10 in our work\) and Tenth Revision, Clinical Modification \(ICD\-10\-CM\)\. Each diagnosis code is linked to dictionary tables providing a textual description and hierarchical groupings, which we leverage to define clinical conditions and construct label spaces\.

##### eICU Collaborative Research Database

The eICU Collaborative Research Database \(eICU\-CRD\) is a multi\-center critical care database containing de\-identified health data for over 200,000 ICU admissions from more than 200 hospitals across the United States \(Pollard2018\)\. Data were collected through the Philips eICU telehealth program and span a wide range of ICU types, including medical, surgical, cardiac, and mixed units\. eICU\-CRD provides diagnosis codes recorded using ICD\-9\-CM and ICD\-10\-CM, as well as proprietary diagnosis fields used within the telehealth platform\.

Both MIMIC\-IV and eICU\-CRD are de\-identified, hosted on PhysioNet \(goldberger2000physiobank\), and available to credentialed researchers who complete the required data use training and agreements\.

#### 3\.4\.2 Task Definitions

We evaluate all models on two binary classification tasks defined over patient visit sequences from MIMIC\-IV, and one binary classification task on eICU for transfer evaluation\. For all MIMIC\-IV tasks, the final visit of each patient is excluded from the prediction set, as no subsequent outcome label can be observed\.

##### 30\-Day Readmission \(MIMIC\-IV\)\.

Given a patient’s visit history up to and including visit $v_t$, the model predicts whether the patient will be readmitted within 30 days of discharge\. The binary label is defined as

$$y_t^{\text{readmit}} = \mathbb{I}\!\left[\mathrm{admittime}(v_{t+1}) - \mathrm{dischtime}(v_t) < 30\text{ days}\right]. \tag{5}$$

This task serves as a standard clinical benchmark for assessing whether hierarchy\-aware representations capture patient risk signals that are predictive of near\-term acute care utilization\.
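Equation \(5\) translates into a short labelling routine over one patient's time\-ordered visits\. The tuple layout `(admittime, dischtime)` and the function name are assumptions for illustration; the final visit receives no label, consistent with its exclusion from the prediction set\.

```python
from datetime import datetime, timedelta


def readmission_labels(visits, window_days=30):
    """visits: list of (admittime, dischtime) tuples sorted by admittime for one patient.

    Returns one label per visit except the last, which has no observable outcome.
    """
    window = timedelta(days=window_days)
    labels = []
    for current, following in zip(visits, visits[1:]):
        gap = following[0] - current[1]  # next admission minus current discharge
        labels.append(int(gap < window))  # strict inequality, as in Equation (5)
    return labels


patient = [
    (datetime(2020, 1, 1), datetime(2020, 1, 5)),
    (datetime(2020, 1, 20), datetime(2020, 1, 25)),
    (datetime(2020, 6, 1), datetime(2020, 6, 3)),
]
labels = readmission_labels(patient)
```

A gap of exactly 30 days is labelled 0 under the strict inequality\.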

##### Emergency\-Admission Prediction \(MIMIC\-IV\)\.

Given a patient’s visit history up to and including visit $v_t$, the model predicts whether the immediately following visit $v_{t+1}$ is an emergency admission\. MIMIC\-IV records nine admission types; we define emergency admissions as the union of URGENT, EW EMER\., EU OBSERVATION, OBSERVATION ADMIT, AMBULATORY OBSERVATION, DIRECT EMER\., and DIRECT OBSERVATION, based on the visit type information provided by MIMIC\-IV\. The binary label is

$$y_t^{\text{type}} = \mathbb{I}\!\left[\mathrm{admission\_type}(v_{t+1}) \in \mathcal{T}_{\text{EM}}\right], \tag{6}$$

where $\mathcal{T}_{\text{EM}}$ denotes the set of emergency admission types defined above\. This task evaluates whether the learned representations encode temporal diagnostic patterns predictive of the urgency of future care\.
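The label in Equation \(6\) is a set\-membership test on the next visit's admission type\. The sketch below uses the seven admission\-type strings listed above; the function name and the list\-of\-strings input layout are illustrative assumptions\.

```python
# The set T_EM of emergency admission types, as listed in the task definition.
EMERGENCY_TYPES = frozenset({
    "URGENT", "EW EMER.", "EU OBSERVATION", "OBSERVATION ADMIT",
    "AMBULATORY OBSERVATION", "DIRECT EMER.", "DIRECT OBSERVATION",
})


def emergency_labels(admission_types):
    """admission_types: time-ordered admission_type strings for one patient.

    The label for visit t depends on visit t+1, so the last visit gets no label.
    """
    return [int(next_type in EMERGENCY_TYPES) for next_type in admission_types[1:]]


history = ["ELECTIVE", "EW EMER.", "SURGICAL SAME DAY ADMISSION"]
type_labels = emergency_labels(history)
```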

##### ICU Readmission Prediction \(eICU\)\.

The eICU Collaborative Research Database organizes data at the level of individual ICU unit stays, with multiple unit stays possible within a single hospital admission, linked by a common patientHealthSystemStayID\. Given the diagnosis codes observed during ICU unit stay $u_k$ for a patient, the model predicts whether the patient will be readmitted to the ICU within the same hospital admission\. The binary label is

$$y_k^{\text{readmit}} = \mathbb{I}\!\left[u_k \text{ is not the final ICU unit stay within the hospital admission}\right]. \tag{7}$$

This task is used exclusively for transfer evaluation: models are pretrained on MIMIC\-IV and only a classifier head is fine\-tuned on eICU, with the encoder kept frozen\.
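Equation \(7\) can be sketched by grouping unit stays by hospital admission and marking every stay except the last\. The input tuple layout and the `order_key` field used to sort stays within an admission are assumptions; the section does not state which eICU column defines the ordering\.

```python
from collections import defaultdict


def icu_readmission_labels(unit_stays):
    """unit_stays: iterable of (hospital_stay_id, unit_stay_id, order_key) tuples.

    Returns {unit_stay_id: label}, where label is 1 unless the stay is the
    final ICU unit stay within its hospital admission.
    """
    by_admission = defaultdict(list)
    for hospital_stay_id, unit_stay_id, order_key in unit_stays:
        by_admission[hospital_stay_id].append((order_key, unit_stay_id))
    labels = {}
    for stays in by_admission.values():
        stays.sort()  # chronological order within the hospital admission
        for idx, (_, unit_stay_id) in enumerate(stays):
            labels[unit_stay_id] = int(idx < len(stays) - 1)
    return labels


stays = [(1, "u2", 2), (1, "u1", 1), (2, "u3", 1)]
icu_labels = icu_readmission_labels(stays)
```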

We split the data into an 80:10:10 ratio to create a train/validation/test split shared across all hierarchy ablations for a model\. For HICD\-BERT, we first pretrain the model with masked language modeling \(see Section [3\.2](https://arxiv.org/html/2606.15447#S3.SS2)\) and then fine\-tune it for each downstream task\. For HICD\-Graph, we first learn hierarchy\-aware code embeddings from the diagnosis graph and then use these embeddings to initialize the downstream Transformer\-based patient encoder\. In HICD\-BERT, we evaluate all subsets of the three hierarchy levels G0, G1, and G2, which correspond to enabling or disabling hierarchy embeddings; in HICD\-Graph, they correspond to enabling or disabling the associated hierarchy edges and ancestor nodes\.

We evaluate performance using F1 and AUROC scores\. For transfer experiments, models trained on MIMIC\-IV are evaluated on eICU without modifying the hierarchy construction pipeline\. Exact optimization settings and hyperparameters are provided in Appendix [B](https://arxiv.org/html/2606.15447#A2)\.

## 4 Results

We report the in\-domain downstream performance of hierarchy\-aware modeling, cross\-dataset transfer under a frozen\-encoder probe setting, and embedding\-level analyses\. We evaluate all eight combinations of the three ICD\-10 hierarchy levels \(G0, G1, G2\) across both HICD\-BERT and HICD\-Graph, resulting in 28 pairwise comparisons \(7 hierarchy\-enabled configurations × 2 models × 2 tasks\) against the no\-hierarchy baseline within each paradigm: BEHRT \(li2020behrt\) for HICD\-BERT and a GCN \+ Transformer encoder for HICD\-Graph\. We report statistical significance using the paired DeLong test \(delong1988comparing\) for all pairwise comparisons of binary AUROC on the same test set\. For F1\-macro, we use a paired bootstrap over 1,000 test\-set resamples and a two\-sided p\-value, following prior work \(berg2012empirical\)\.
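To make the F1\-macro test concrete, the sketch below shows one common form of a paired bootstrap with a two\-sided p\-value: both models are scored on the same resampled indices, and the p\-value is the share of resamples whose centred difference is at least as extreme as the observed one\. The exact variant used in the paper is not specified here, so the centring rule, the add\-one smoothing, and all names are assumptions\.

```python
import random


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over the two classes 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap_p(y_true, pred_a, pred_b, n_resamples=1000, seed=0):
    """Return (observed F1-macro difference, two-sided bootstrap p-value)."""
    rng = random.Random(seed)
    n = len(y_true)
    observed = f1_macro(y_true, pred_a) - f1_macro(y_true, pred_b)
    extreme = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # same indices for both models
        yt = [y_true[i] for i in idx]
        delta = (f1_macro(yt, [pred_a[i] for i in idx])
                 - f1_macro(yt, [pred_b[i] for i in idx]))
        # Centre the resampled difference on the observed one (shift method).
        if abs(delta - observed) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_resamples + 1)


y_true = [0, 1] * 20
same_diff, same_p = paired_bootstrap_p(y_true, y_true, y_true, n_resamples=200)
```

Pairing matters because both models see identical resamples, so variance shared across models cancels in the difference\.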

Table 1: 30\-day readmission results for HICD\-BERT and HICD\-Graph across hierarchy ablations\. \* $p < 0.05$ from pairwise DeLong tests on AUROC and paired bootstrap tests on F1\-macro, both against the no\-hierarchy baseline\.

### 4\.1 30\-day Readmission Task

Table [1](https://arxiv.org/html/2606.15447#S4.T1) shows that adding hierarchical structure improves 30\-day readmission prediction for both HICD\-BERT and HICD\-Graph relative to the no\-hierarchy baseline \(DeLong p<0\.05 for 6 of 7 HICD\-BERT variants and all 7 HICD\-Graph variants\)\.

For HICD\-BERT, multi\-level hierarchy outperforms single\-level, with the finest level driving the gains\. The strongest F1\-macro is achieved by the two\-level G0\+G2 configuration \(0\.5953\), while the best AUROC is obtained when all hierarchy levels are enabled \(0\.6923\)\. Among single\-level configurations, the finest level G2 yields the strongest improvement \(AUROC 0\.6921\), while the mid\-level G1 is the only configuration that does not significantly improve over baseline, though it contributes meaningfully when combined with other levels\. This suggests that for token\-level hierarchy injection, fine\-grained groupings are the primary driver of performance gains, and that mid\-level groupings serve as connective structure between coarser and finer hierarchy rather than as standalone signal\.

For HICD\-Graph, more hierarchy levels yield progressively better readmission prediction\. All seven hierarchy configurations yield statistically significant improvements over the no\-hierarchy baseline, and a clear monotonic pattern emerges: single\-level configurations improve AUROC by 1\.3% on average over the baseline, two\-level configurations by 1\.6%, and the all\-levels G0\+G1\+G2 configuration by 1\.7%\. This progressive improvement suggests that each additional hierarchy level contributes complementary structure to the diagnosis graph, and that the graph\-based architecture is able to integrate information from multiple granularity levels simultaneously\.

Taken together, the hierarchy benefits are consistent across two different pretraining objectives: a Transformer\-based encoder pretrained with masked language modeling \(HICD\-BERT\) and a GCN\-initialized Transformer pretrained with link prediction on a diagnosis co\-occurrence graph \(HICD\-Graph\)\. This consistency supports the view that ICD hierarchy is a useful inductive bias for readmission prediction\. While the absolute AUROC improvements are modest, they are statistically robust and architecturally consistent; we discuss the practical significance of these effect sizes in Section [5](https://arxiv.org/html/2606.15447#S5)\.

Table 2: Emergency admission prediction results for HICD\-BERT and HICD\-Graph across hierarchy ablations\. \* p<0\.05 from pairwise DeLong tests on AUROC and paired bootstrap tests on F1\-macro, both against the no\-hierarchy baseline\.

### 4\.2 Emergency Admission Prediction

We report the results for emergency admission prediction in Table[2](https://arxiv.org/html/2606.15447#S4.T2)\. Here again, we note that hierarchy improves over the no\-hierarchy baseline, but the pattern of which configurations perform best differs from the readmission task\.

For HICD\-BERT, a single hierarchy level appears to be as effective as the full hierarchy\. Unlike the readmission setting, where the all\-levels configuration achieves the best AUROC, the single\-level G2 model performs as strongly as the all\-levels variant on emergency\-visit prediction, with both achieving comparable AUROC\. As in readmission, the mid\-level G1 alone does not significantly improve AUROC over baseline, though it does yield a significant F1\-macro gain\. This task\-dependent shift suggests that emergency\-visit prediction draws most of its signal from a single level of abstraction, and that adding coarser hierarchy levels on top of the finest grouping provides no additional benefit for this task\.

For HICD\-Graph, the monotonic trend from readmission holds\. All seven variants significantly outperform the no\-hierarchy baseline \(all p<0\.001\), and as in readmission, AUROC improves progressively from single\-level to two\-level to all\-levels configurations\. The best individual configuration is the two\-level G1\+G2, which achieves comparable or slightly better AUROC than the all\-levels variant\. This suggests that for emergency\-visit prediction, the combination of intermediate and fine\-grained hierarchy captures the most relevant diagnostic structure, and that adding the coarsest level \(ICD chapter\) on top provides no further benefit for this task\.

Across both in\-domain tasks, hierarchy\-aware models significantly outperform the no\-hierarchy baseline in 26 of 28 pairwise AUROC comparisons, establishing ICD hierarchy as a robustly effective inductive bias across the tasks and architectures studied\. Our systematic ablation further reveals that while hierarchy consistently improves over baseline, the optimal configuration varies by task, reinforcing the value of treating hierarchy levels as controllable design choices\.

Table 3: Cross\-dataset transfer: probe results on the eICU readmission task for HICD\-BERT and HICD\-Graph across hierarchy ablations\. \* p<0\.05 from pairwise DeLong tests on AUROC and paired bootstrap tests on F1\-macro, both against the no\-hierarchy baseline\.

### 4\.3 Cross\-Dataset Transfer Results

Table [3](https://arxiv.org/html/2606.15447#S4.T3) presents cross\-dataset transfer results on eICU under a frozen\-encoder probe protocol\. In this setting, the encoder pretrained on MIMIC\-IV is kept frozen, and only a task\-specific classifier head is trained using eICU ICU\-readmission labels\. This evaluation measures whether the hierarchy\-aware representations learned on one institution’s data capture structure that generalises to a different patient population\. We note that both HICD\-BERT and HICD\-Graph retain meaningful predictive signal under this setting, with all configurations outperforming chance\-level prediction \(AUROC > 0\.55\)\. However, the pattern of which hierarchy configurations benefit transfer differs between the two architectures and from the in\-domain results\.

Hierarchy benefits transfer more in HICD\-Graph than in HICD\-BERT\. For HICD\-Graph, 6 of 7 hierarchy variants significantly improve over the no\-hierarchy baseline, with the all\-levels G0\+G1\+G2 configuration achieving the strongest AUROC \(0\.587\)\. This indicates that the graph\-based hierarchy encoding produces representations whose structure transfers robustly across datasets\. In contrast, for HICD\-BERT, only the coarsest single\-level configuration G0 improves AUROC over baseline \(0\.571\); the remaining 6 variants are statistically indistinguishable from the no\-hierarchy baseline, and F1\-macro results are mixed, with only some variants improving F1 \(G2, G1\+G2, G0\+G1\+G2\)\. This is notable because HICD\-BERT is the stronger beneficiary of hierarchy in the in\-domain setting, where 6 of 7 variants achieve improvements, yet these gains do not survive the distributional shift to eICU\. The asymmetry may reflect a difference in what the two encoders learn: the co\-occurrence graph underlying HICD\-Graph captures inter\-code relationships that can be more institution\-invariant, whereas the sequential token patterns learned by HICD\-BERT are more tightly coupled to MIMIC\-IV visit structures\. Consistent with this interpretation, only the coarsest level G0 transfers for HICD\-BERT, as coarse groupings can be less susceptible to dataset\-specific coding variation than finer\-grained levels\.

Hierarchy\-aware pretrained representations capture transferable diagnosis structure\. Despite the modest absolute AUROC values \(0\.56–0\.59\), these results indicate that hierarchy\-augmented pretraining produces representations that transfer across datasets more robustly than the flat baseline\. The frozen encoder has not seen eICU patients or diagnoses during pretraining, yet hierarchy\-aware configurations, particularly in HICD\-Graph, achieve statistically significant improvements over the no\-hierarchy baseline\. This supports the view that the ICD hierarchy provides an inductive bias that is not merely corpus\-specific but reflects underlying relationships that hold across datasets\.

### 4\.4 Embedding Analysis

![Refer to caption](https://arxiv.org/html/2606.15447v1/sections/images/figA_pairwise_sim.png)
\(a\) Pairwise similarity\.

![Refer to caption](https://arxiv.org/html/2606.15447v1/sections/images/figB_intra_g2_sim.png)
\(b\) Within\-group similarity\.

Figure 3: Embedding analysis for HICD\-Graph\. All hierarchy\-aware variants increase semantic clustering over the baseline, with G1\+G2 achieving the strongest local clustering\.

We analyze the learned GCN node embeddings in HICD\-Graph for leaf ICD\-10 diagnosis codes to assess how hierarchical supervision changes the structure of the diagnosis\-code representation space\. We study pairwise cosine similarity under two ICD relationship types, defined by the shared ancestor in the ICD\-10 ontology: codes sharing the same 2\-character prefix \(e\.g\., all codes beginning with S7\), which corresponds to the model’s G2 grouping level, and codes from the same ICD\-10 chapter but with different 2\-character prefixes\.
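The two relationship types can be computed from an embedding matrix roughly as follows. This is a sketch under stated assumptions: `mean_similarity_by_relation` and the `chapter_of` lookup are hypothetical names, and the toy codes and vectors are illustrative rather than taken from the trained model.

```python
import numpy as np
from itertools import combinations

def mean_similarity_by_relation(codes, emb, chapter_of):
    """Mean pairwise cosine similarity for the two ICD relationship types."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = {"same_prefix": [], "same_chapter_diff_prefix": []}
    for i, j in combinations(range(len(codes)), 2):
        a, b = codes[i], codes[j]
        cos = float(unit[i] @ unit[j])
        if a[:2] == b[:2]:                      # shared 2-character prefix (G2)
            sims["same_prefix"].append(cos)
        elif chapter_of[a] == chapter_of[b]:    # same chapter, different prefix
            sims["same_chapter_diff_prefix"].append(cos)
    return {k: float(np.mean(v)) if v else float("nan") for k, v in sims.items()}

# Toy example: three injury codes (Chapter XIX) and one circulatory code (Chapter IX).
codes = ["S72.0", "S72.1", "S82.0", "I50.2"]
chapter_of = {"S72.0": "XIX", "S72.1": "XIX", "S82.0": "XIX", "I50.2": "IX"}
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
result = mean_similarity_by_relation(codes, emb, chapter_of)
```

Pairs from different chapters fall into neither group, which matches the two relationship types defined above.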

Hierarchy strengthens clinically meaningful similarity structure\. Figure [3](https://arxiv.org/html/2606.15447#S4.F3)\(a\) compares the mean cosine similarity across these two relationship types for the no\-hierarchy baseline and the G0\+G1\+G2 configuration\. Relative to the baseline, the G0\+G1\+G2 configuration increases similarity in both groups, from 0\.286 to 0\.357 among codes sharing the same 2\-character prefix, and from 0\.267 to 0\.344 among codes within the same chapter but with different prefixes\. The resulting embedding space preserves a meaningful similarity gradient: codes sharing the finest ancestor \(G2\) are more similar than broader within\-chapter codes, which supports the view that hierarchical supervision strengthens the clinically relevant organization of the embedding space\.

All hierarchy configurations improve local code similarity\. Figure [3](https://arxiv.org/html/2606.15447#S4.F3)\(b\) examines how each hierarchy configuration affects within\-group pairwise similarity among diagnosis codes\. All hierarchy\-aware variants improve over the no\-hierarchy baseline \(0\.286\), confirming that explicit hierarchical information consistently pulls related diagnoses closer together in embedding space\. The strongest effect is obtained when the mid\-level and fine\-level hierarchies are combined \(G1\+G2, 0\.384\), indicating that these two levels together capture the most clinically relevant grouping structure for organising related diagnosis codes in embedding space\.

Embedding structure predicts downstream performance\. Notably, these embedding\-level patterns are also predictive of downstream task performance\. On readmission prediction \(Table [1](https://arxiv.org/html/2606.15447#S4.T1)\), the G1\+G2 configuration, which produces the tightest local code clusters in embedding space, also achieves among the strongest AUROC for HICD\-Graph, while the coarsest level G0, which yields the smallest embedding similarity gains, delivers the weakest single\-level task performance\. This alignment between representation geometry and predictive accuracy suggests that hierarchy improves HICD\-Graph performance by shaping the embedding space in clinically meaningful ways that benefit downstream discrimination\.

Hierarchy effects are heterogeneous across chapters in our pretraining data\. Figure [5](https://arxiv.org/html/2606.15447#A1.F5) \(Appendix\) shows the intra\-chapter cosine similarity for each ICD\-10 chapter across all eight hierarchy configurations on our pretraining data, MIMIC\-IV\. Across the majority of chapters, cosine similarity increases with the addition of hierarchy, confirming the global trend observed in Figure [3](https://arxiv.org/html/2606.15447#S4.F3)\.

However, the magnitude of improvement is heterogeneous: chapters with higher baseline similarity, such as Chapter XVIII \(0\.48 → 0\.61, \+27%\) and Chapter XXII \(0\.70 → 0\.90, \+29%\), exhibit comparatively modest relative gains, while lower\-baseline chapters such as Chapter XVI \(0\.09 → 0\.26, \+189%\) and Chapter XIX \(0\.05 → 0\.10, \+100%\) show substantially larger proportional improvements under hierarchy\. This suggests that hierarchical supervision is slightly more beneficial in chapters with greater code diversity, where the representation space is less coherent at baseline, and the model can be optimized more effectively by incorporating ontological structure\.
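The relative gains quoted above follow from the before and after similarities as (after − before) / before. The short check below reproduces them from the four pairs of values given in the text; the variable and function names are illustrative.

```python
# (baseline similarity, similarity with hierarchy) per chapter, as quoted in the text
chapters = {
    "XVIII": (0.48, 0.61),
    "XXII": (0.70, 0.90),
    "XVI": (0.09, 0.26),
    "XIX": (0.05, 0.10),
}

def relative_gain_pct(before, after):
    """Relative improvement in percent."""
    return 100.0 * (after - before) / before

gains = {ch: round(relative_gain_pct(b, a)) for ch, (b, a) in chapters.items()}
print(gains)
```

Note that the absolute gain for Chapter XXII (0.20) is larger than for Chapter XVI (0.17); the ordering described in the text holds for proportional, not absolute, improvement.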

## 5 Discussion

This work investigates ICD hierarchy as an inductive bias for EHR representation learning, evaluating two complementary mechanisms: token\-level injection \(HICD\-BERT\) and graph\-level encoding \(HICD\-Graph\)\. Our results support the view that the hierarchical structure already present in ICD coding systems is an underutilised signal in EHR foundation models, and that modeling it explicitly benefits downstream performance\.

##### Key Findings\.

Across both in\-domain tasks, the finest hierarchy level G2 is consistently the strongest single\-level configuration for HICD\-BERT, while HICD\-Graph benefits progressively from adding more hierarchy levels\. For HICD\-BERT, the mid\-level G1 provides no significant AUROC improvement in isolation on either task, though it contributes when combined with other levels, suggesting that mid\-level groupings serve as connective structure rather than a standalone signal\. The optimal hierarchy depth is also task\-dependent: readmission benefits from combining all levels, while emergency\-visit prediction achieves comparable performance from a single fine\-grained level alone\. On cross\-dataset transfer, hierarchy benefits HICD\-Graph robustly \(6 of 7 variants significant\), but the benefits largely vanish for HICD\-BERT, where only the coarsest level G0 transfers\. This suggests that graph\-based relational structure, anchored in the ICD ontology, is more dataset\-invariant than sequential token patterns\. The embedding analysis provides mechanistic support: hierarchy constrains the embedding space to respect clinical relationships, and the configuration producing the tightest local code clusters \(G1\+G2\) also achieves among the strongest downstream AUROC, which is consistent with the gains arising from meaningful representational structure\.

##### Practical significance of effect sizes\.

Readmission and emergency\-visit prediction are inherently difficult, multifactorial outcomes that depend on many factors beyond diagnosis codes alone\. Against this backdrop, the AUROC improvements from hierarchy are statistically robust \(26 of 28 comparisons significant\) and consistent across two models and two tasks, supporting hierarchy as a general inductive bias\. Importantly, in our setup, we incorporated hierarchy using only minimal modifications to existing architectures: additive token embeddings for HICD\-BERT and additional graph edges for HICD\-Graph, with no changes to the core encoder, pretraining objective, or training pipeline\. This intentionally low\-cost integration, combined with the consistent improvements observed, suggests that the ICD hierarchy should be included as a strong prior in EHR representation learning\.

##### Limitations\.

Our evaluation covers two binary clinical tasks on MIMIC\-IV and one transfer task on eICU; the generalizability of our findings to other clinical outcomes, institutions, and EHR systems remains to be validated\. Our analysis is restricted to ICD diagnosis codes; whether the observed benefits extend to other coded data in EHR foundation models remains an open question\. The hierarchy levels \(G0, G1, G2\) are intentionally defined differently across HICD\-Graph and HICD\-BERT to evaluate two hierarchy construction strategies: ontology\-derived grouping versus lightweight data\-driven prefix grouping\. All comparisons are therefore made within each model, while future work could analyze other encodings across architectures to isolate the effect of the hierarchy definition itself\. Our hierarchy experiments are limited to a fixed set of manually defined levels, so we do not test whether the optimal hierarchy depth could be learned automatically for a given task or model\.

##### Future Work\.

A natural next step is to extend this framework to other structured clinical vocabularies with analogous ontological structure, such as ATC codes for medications, which organize drugs into therapeutic, pharmacological, and chemical subgroups across multiple levels\. We also plan to explore deeper integration mechanisms that go beyond the additive approaches studied here, including hierarchy\-aware pretraining objectives that explicitly optimize for ontological coherence, adaptive mechanisms that learn to weight or select different hierarchy levels during training, and multi\-ontology fusion that jointly encodes diagnosis, medication, and procedure hierarchies within a single model\. We believe such approaches could further amplify the practical gains observed in this work\.

## References

## Appendix A HICD\-Graph Hierarchy: Example

Figure [4](https://arxiv.org/html/2606.15447#A1.F4) illustrates how two leaf diagnosis codes share hierarchy nodes in HICD\-Graph\. Consider I50\.21 \(*Acute systolic \(congestive\) heart failure*\) and I50\.32 \(*Chronic diastolic \(congestive\) heart failure*\)\. Both leaves connect, via weighted edges, to the same G2 node \(I5xx prefix group\), the same G1 node \(official block I30\-I52, *Other forms of heart disease*\), and the same G0 node \(Chapter IX, *Diseases of the circulatory system*\)\. Shared ancestors propagate information between clinically related leaves during message passing, while unrelated leaves remain separated in the hierarchy subgraph\.
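The ancestor lookup for this example can be sketched as follows. The two lookup tables are hypothetical, single-entry excerpts that cover only the heart-failure example; a real pipeline would load the full ICD-10-CM block and chapter tables, and the function name `ancestors` is illustrative.

```python
# Hypothetical excerpts: only the entries needed for the I50.x example.
G1_BLOCKS = {("I30", "I52"): "Other forms of heart disease"}
G0_CHAPTERS = {"I": "IX"}  # simplification: first letter -> chapter

def ancestors(code):
    """Return the G0/G1/G2 hierarchy nodes that a leaf code attaches to."""
    category = code.split(".")[0]          # e.g. "I50"
    g2 = code[:2]                          # 2-character prefix group, e.g. "I5"
    # Same-length alphanumeric categories compare correctly as strings.
    g1 = next((f"{lo}-{hi}" for (lo, hi) in G1_BLOCKS if lo <= category <= hi), None)
    g0 = G0_CHAPTERS.get(code[0])
    return {"G0": g0, "G1": g1, "G2": g2}

shared = ancestors("I50.21")
```

Both `ancestors("I50.21")` and `ancestors("I50.32")` resolve to the same three nodes, which is what lets message passing share information between the two leaves.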

![Refer to caption](https://arxiv.org/html/2606.15447v1/x2.png)
Figure 4: Example of the HICD\-Graph hierarchy subgraph\.

![Refer to caption](https://arxiv.org/html/2606.15447v1/sections/images/fig4_intra_cosine.png)
Figure 5: Mean intra\-chapter cosine similarity among ICD\-10 leaf codes in our pretraining data of MIMIC\-IV, computed across all eight hierarchy configurations \(G0/G1/G2 ablations\)\. Each row corresponds to one ICD\-10 chapter, and each column to a hierarchy configuration ordered as in the main results tables\. Higher values \(darker\) indicate that codes within the same chapter are embedded more tightly together\. Hierarchy increases similarity across most chapters, with the largest gains in diverse chapters such as Chapter XIX \(*Injury and poisoning*, 5,937 codes\)\.

## Appendix B Implementation Details

### B\.1 HICD\-BERT Architecture and Training

HICD\-BERT extends a BEHRT\-style encoder by summing diagnosis\-token, age, position, and segment embeddings, with optional additive embeddings for the three hierarchy levels G0, G1, and G2\. The contextualized sequence is processed by a stack of Transformer blocks, and the pretraining objective is masked language modeling \(MLM\)\. For downstream prediction, the encoder is initialized from the pretrained checkpoint and a task\-specific multilayer perceptron is applied to the final \[CLS\] representation\.

#### B\.1\.1 Architecture

HICD\-BERT builds on a BEHRT\-style Transformer encoder and augments the standard token, age, position, and segment embeddings with up to three hierarchy\-aware embedding channels corresponding to G0, G1, and G2\. These hierarchy embeddings are added to the input representation and are learned jointly with the rest of the model\. For downstream prediction, the contextualized \[CLS\] representation is passed through a task\-specific multilayer perceptron classifier\.
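The additive input representation can be illustrated with plain NumPy lookups. This is a sketch of the summation only, not the authors' model code: the table sizes, embedding dimension, index values and the `build_input_embeddings` name are all illustrative assumptions, and in the real model the tables would be trainable parameters.

```python
import numpy as np

LEVELS = ("G0", "G1", "G2")
BASE = ("token", "age", "position", "segment")

def build_input_embeddings(ids, tables, enabled=(1, 1, 1)):
    """Sum the base embeddings, then add one embedding per enabled hierarchy level."""
    x = sum(tables[name][ids[name]] for name in BASE)
    for flag, level in zip(enabled, LEVELS):
        if flag:
            x = x + tables[level][ids[level]]
    return x  # shape: (sequence_length, dim)

# Toy setup: vocabulary sizes and dimension are illustrative, not the paper's.
rng = np.random.default_rng(0)
sizes = {"token": 50, "age": 100, "position": 16, "segment": 2,
         "G0": 22, "G1": 30, "G2": 40}
tables = {name: rng.normal(size=(n, 8)) for name, n in sizes.items()}
ids = {
    "token": np.array([5, 9, 2]), "age": np.array([63, 63, 64]),
    "position": np.array([0, 0, 1]), "segment": np.array([0, 0, 1]),
    "G0": np.array([8, 8, 3]), "G1": np.array([12, 12, 4]), "G2": np.array([20, 21, 7]),
}
flat = build_input_embeddings(ids, tables, enabled=(0, 0, 0))     # no-hierarchy baseline
g2_only = build_input_embeddings(ids, tables, enabled=(0, 0, 1))  # finest level only
```

Because the hierarchy channels are purely additive, disabling all three recovers the baseline input, which keeps the ablations comparable within the same encoder.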

Table 4: HICD\-BERT architecture details\.

#### B\.1\.2 MLM Pretraining

We pretrain HICD\-BERT with a masked language modeling objective over diagnosis sequences\. The model is optimized to recover masked diagnosis tokens from their clinical context while jointly conditioning on age, position, segment, and the enabled hierarchy\-level embeddings\. Table [5](https://arxiv.org/html/2606.15447#A2.T5) summarizes the pretraining configuration\.

Table 5: HICD\-BERT MLM pretraining hyperparameters\.

#### B\.1\.3 Task Fine\-Tuning

For downstream tasks, we initialize the encoder from the best MLM checkpoint and optimize a task\-specific classification objective\. Binary tasks use BCEWithLogitsLoss with a positive class weight $w^{+} = N_{\text{neg}}^{\text{train}} / N_{\text{pos}}^{\text{train}}$, computed from the training split, to address label imbalance\. The best checkpoint is selected by validation AUROC\. For binary tasks, a decision threshold is swept over $[0.01, 0.99]$ \(200 steps\) on the validation set and stored in the checkpoint for test\-set evaluation\. Table [6](https://arxiv.org/html/2606.15447#A2.T6) lists the full fine\-tuning hyperparameters\.
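The class weight and threshold sweep can be sketched as below. The text does not say which metric the sweep maximizes; F1-macro is assumed here because it is the thresholded metric reported in the results. Function names are hypothetical.

```python
import numpy as np

def positive_class_weight(y_train):
    """w+ = N_neg / N_pos, computed on the training split only."""
    n_pos = int(np.sum(y_train == 1))
    n_neg = int(np.sum(y_train == 0))
    return n_neg / n_pos

def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 for binary labels {0, 1}."""
    scores = []
    for cls in (0, 1):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

def sweep_threshold(y_val, prob_val, lo=0.01, hi=0.99, steps=200):
    """Pick the validation threshold that maximizes F1-macro (assumed metric)."""
    thresholds = np.linspace(lo, hi, steps)
    scores = [f1_macro(y_val, (prob_val >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return float(thresholds[best]), float(scores[best])
```

In PyTorch, the weight would be passed as the `pos_weight` tensor of `torch.nn.BCEWithLogitsLoss`. Selecting the threshold on the validation set and reusing it unchanged on the test set avoids tuning on test labels.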

Table 6: HICD\-BERT downstream fine\-tuning hyperparameters\.

### B\.2 HICD\-Graph: Graph\-Augmented Hierarchy Model

The graph\-based model has two stages\. First, we construct a diagnosis co\-occurrence graph over ICD roots using PMI\-weighted edges, then augment the graph with hierarchy edges corresponding to the enabled G0, G1, and G2 levels\. Second, we train a two\-layer GCN for node embedding learning and use the resulting embeddings to initialize a patient\-level Transformer\.

#### B\.2\.1 Graph Construction

We construct an undirected diagnosis graph whose nodes correspond to ICD root codes\. Horizontal edges are weighted by PMI computed from within\-admission co\-occurrence, and vertical edges are added to encode enabled hierarchy levels\. Table [7](https://arxiv.org/html/2606.15447#A2.T7) summarizes the graph construction settings\.
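The horizontal edge weights can be illustrated with a small PMI computation over admissions. This is a sketch, assuming PMI is estimated from admission-level counts and that only edges above a PMI threshold are kept; the threshold `min_pmi`, the function name and the toy admissions are assumptions, since the actual settings are listed in a table not reproduced here.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(admissions, min_pmi=0.0):
    """PMI-weighted undirected edges from within-admission code co-occurrence.

    PMI(a, b) = log( p(a, b) / (p(a) p(b)) ), with probabilities estimated as
    the fraction of admissions containing the code or the pair.
    """
    n = len(admissions)
    code_count, pair_count = Counter(), Counter()
    for codes in admissions:
        unique = sorted(set(codes))             # count each code once per admission
        code_count.update(unique)
        pair_count.update(combinations(unique, 2))
    edges = {}
    for (a, b), n_ab in pair_count.items():
        pmi = math.log((n_ab * n) / (code_count[a] * code_count[b]))
        if pmi > min_pmi:                        # assumed filter: keep positive PMI
            edges[(a, b)] = pmi
    return edges

# Toy admissions over ICD root codes (illustrative only).
toy_admissions = [["I50", "N18"], ["I50", "N18"], ["J18"], ["E11"]]
toy_edges = pmi_edges(toy_admissions)
```

Sorting the codes within each admission gives every undirected pair a single canonical key, so `("I50", "N18")` and `("N18", "I50")` are never counted separately.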

Table 7: Diagnosis graph construction hyperparameters\.

#### B\.2\.2 GCN Embedding Training

We learn node embeddings with a two\-layer GCN trained using a link\-prediction objective over observed and sampled non\-edge pairs\. The resulting embeddings are used to initialize the downstream patient encoder\. The training configuration is given in Table [8](https://arxiv.org/html/2606.15447#A2.T8)\.

Table 8: GCN embedding training hyperparameters\.

#### B\.2\.3 Transformer Fine\-Tuning on Graph Embeddings

The learned graph embeddings initialize the diagnosis embedding table of a patient\-level Transformer operating over temporally ordered visits\. Visit representations are formed by masked mean pooling over code embeddings, and patient representations are obtained through attention pooling across visits\. Table [9](https://arxiv.org/html/2606.15447#A2.T9) provides the fine\-tuning details\.
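The two pooling steps can be sketched in NumPy as follows. This illustration omits the Transformer layers that sit between them, and the attention pooling form (a softmax over dot products with a single query vector, which would be learned in practice) is an assumption; the text does not specify the exact parameterization.

```python
import numpy as np

def masked_mean_pool(code_emb, mask):
    """Average code embeddings per visit, ignoring padded slots.

    code_emb: (visits, max_codes, dim); mask: (visits, max_codes), 1 = real code.
    """
    m = mask[..., None].astype(code_emb.dtype)
    total = (code_emb * m).sum(axis=1)
    count = np.clip(m.sum(axis=1), 1.0, None)   # avoid division by zero
    return total / count

def attention_pool(visit_emb, query):
    """Softmax-weighted sum of visit vectors (query is an assumed learned vector)."""
    scores = visit_emb @ query
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ visit_emb

# Toy patient: two visits, up to three codes each, embedding dimension 2.
code_emb = np.array([
    [[1.0, 0.0], [3.0, 0.0], [99.0, 99.0]],   # third slot is padding
    [[0.0, 4.0], [99.0, 99.0], [99.0, 99.0]],
])
mask = np.array([[1, 1, 0], [1, 0, 0]])
visits = masked_mean_pool(code_emb, mask)
patient = attention_pool(visits, query=np.zeros(2))
```

With a zero query the attention weights are uniform, so the patient vector reduces to the plain mean of the visit vectors; a trained query would weight some visits more than others.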

Table 9: Patient\-level Transformer hyperparameters for the graph\-based model\.

### B\.3 Statistical Testing

For all pairwise comparisons of AUROC between a hierarchy\-augmented variant and the no\-hierarchy baseline evaluated on the same test set, we use the paired DeLong test \(delong1988comparing\), which provides a closed\-form test for the difference between two correlated AUROC estimates\. Our implementation was verified against the reference implementation of Sun and Xu \(sun2014fast; [https://github\.com/yandexdataschool/roc\_comparison](https://github.com/yandexdataschool/roc_comparison)\), with all p\-values agreeing to machine precision \($<10^{-15}$\) across all comparisons\. We report significance at the p<0\.05 threshold\.
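For readers unfamiliar with the test, a direct implementation of the paired DeLong statistic is sketched below. It uses the straightforward quadratic-time placement-value formulation rather than the fast algorithm of Sun and Xu, and it is not the authors' code; the function names are illustrative.

```python
import math
import numpy as np

def _placements(pos, neg):
    """Placement values: psi(x, y) = 1 if x > y, 0.5 if x == y, else 0."""
    psi = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
    return psi.mean(axis=1), psi.mean(axis=0)   # per positive, per negative

def delong_paired_test(y_true, score_a, score_b):
    """Return (auc_a, auc_b, z, two-sided p) for two models on the same cases."""
    y = np.asarray(y_true).astype(bool)
    v10, v01, aucs = [], [], []
    for s in (np.asarray(score_a, float), np.asarray(score_b, float)):
        per_pos, per_neg = _placements(s[y], s[~y])
        v10.append(per_pos)
        v01.append(per_neg)
        aucs.append(float(per_pos.mean()))       # AUROC = mean placement value
    m, n = int(y.sum()), int((~y).sum())
    # 2x2 covariance of the two AUROC estimates (np.cov uses ddof=1 by default).
    cov = np.cov(np.vstack(v10)) / m + np.cov(np.vstack(v01)) / n
    var = cov[0, 0] + cov[1, 1] - 2 * cov[0, 1]
    if var <= 0:                                 # e.g. identical score vectors
        return aucs[0], aucs[1], 0.0, 1.0
    z = (aucs[0] - aucs[1]) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided normal p-value
    return aucs[0], aucs[1], z, p
```

The covariance term is what makes the test paired: both models are scored on the same cases, so their AUROC estimates are correlated, and ignoring that correlation would misstate the variance of the difference.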

### B\.4 Hierarchy Ablations

For both HICD\-BERT and the graph\-augmented model, we evaluate all eight hierarchy settings:

\(0,0,0\), \(0,0,1\), \(0,1,0\), \(0,1,1\), \(1,0,0\), \(1,0,1\), \(1,1,0\), \(1,1,1\)\.

Here, \(0,0,0\) corresponds to the non\-hierarchical baseline, while \(1,1,1\) uses the full hierarchy\. In HICD\-BERT these settings determine which hierarchy embeddings are added to the token representation\. In the graph\-based model, they determine which hierarchy edges are added to the diagnosis graph and which ancestor nodes are included during sequence tokenization\.
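The eight settings can be enumerated programmatically. The sketch below assumes that each tuple is ordered as (G0, G1, G2), which the text does not state explicitly, and the configuration names are illustrative labels.

```python
from itertools import product

LEVELS = ("G0", "G1", "G2")  # assumed tuple order

def config_name(flags):
    """Readable label for a hierarchy setting, e.g. (0, 1, 1) -> 'G1+G2'."""
    enabled = [level for level, on in zip(LEVELS, flags) if on]
    return "+".join(enabled) if enabled else "no-hierarchy"

settings = list(product((0, 1), repeat=3))   # (0,0,0), (0,0,1), ..., (1,1,1)
names = [config_name(flags) for flags in settings]

for flags, name in zip(settings, names):
    print(flags, name)
```

`itertools.product` yields the tuples in the same order as the list above, so the first entry is the non-hierarchical baseline and the last is the full hierarchy.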