Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling

arXiv cs.AI 06/16/26, 04:00 AM Papers
Summary
Introduces a foundation model–driven framework for cross-modal representation alignment between CT imaging and longitudinal EHR data for time-to-event prediction, evaluating fusion strategies on pulmonary embolism and cardiovascular disease cohorts.
arXiv:2606.15038v1 Announce Type: new
Cached at: 06/16/26, 11:43 AM
# Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling
Source: [https://arxiv.org/html/2606.15038](https://arxiv.org/html/2606.15038)
Weijie Chen, David Le, Amara Tariq, Alex Wallace, Matthew Stib, Juan Maria Farina, Chadi Ayoub, Reza Arsanjani, Imon Banerjee

Arizona State University; Mayo Clinic

###### Abstract

Accurate time\-to\-event \(TTE\) prediction from multimodal clinical data remains challenging due to modality imbalance and distribution shift\. We introduce a foundation model–driven framework for cross\-modal representation alignment between CT imaging and longitudinal EHR data, designed to generalize across tasks and institutions\. CT and EHR modalities are encoded independently using domain\-specific foundation models and aligned in a shared latent space through four principled fusion strategies: late fusion, contrastive alignment, cross\-attention, and co\-attention\. We evaluate two clinically distinct TTE tasks—pulmonary embolism \(PE\) mortality and cardiovascular disease \(CVD\) outcomes—on large\-scale multi\-institutional cohorts \(PE: N=3,099 train; 1,098 internal; 435 external; CVD: N=2,951 train; 837 internal; 682 external\)\. Fusion consistently improves concordance index by 1\.5–5\.4% over unimodal baselines when modalities contribute comparably\. Overall, contrastive multimodal fusion—particularly with CLMBR representations—provided the most consistent and statistically robust improvements, especially for PE mortality prediction\. For MACE, cross\-attention \(one\-hot\) achieved the highest internal performance and image\-guided co\-attention achieved the best external performance\. We therefore introduce a generalizable foundation model–based cross\-modal alignment framework and provide the first systematic analysis of fusion behavior under modality imbalance in TTE prediction\. Our results establish task\-aware multimodal alignment as a necessary design principle for robust generalization and scalable clinical deployment\.

## 1 Introduction

Prognosis in healthcare requires estimating both the likelihood and timing of adverse outcomes. For high-risk conditions such as pulmonary embolism (PE) and cardiovascular disease (CVD), patient trajectories are heterogeneous, and accurate temporal risk stratification is essential for guiding monitoring, intervention, and resource allocation. Time-to-event (TTE) modeling provides a natural framework for estimating individualized risk over time, unlike binary classification approaches that predict only event occurrence \[[20](https://arxiv.org/html/2606.15038#bib.bib9), [15](https://arxiv.org/html/2606.15038#bib.bib10)\]. Traditional risk scores rely on static tabular variables and often omit high-dimensional imaging data and longitudinal context \[[16](https://arxiv.org/html/2606.15038#bib.bib12), [9](https://arxiv.org/html/2606.15038#bib.bib14)\]. In PE, scores such as the simplified Pulmonary Embolism Severity Index (sPESI) show variable calibration, while emerging evidence suggests that combining CT pulmonary angiography (CTPA) with clinical data can improve outcome prediction \[[23](https://arxiv.org/html/2606.15038#bib.bib23), [4](https://arxiv.org/html/2606.15038#bib.bib25)\]. Combining traditional risk scores with CT-derived biomarkers can likewise improve prognostic performance \[[5](https://arxiv.org/html/2606.15038#bib.bib57)\].

Multimodal deep learning offers a principled approach to integrating complementary prognostic information: imaging captures structural and spatial markers of disease severity, while longitudinal EHR data encode comorbidity, treatment history, and temporal dynamics. Existing multimodal approaches often select fusion strategies heuristically, and the interaction between fusion mechanism and temporal objective remains underexplored \[[15](https://arxiv.org/html/2606.15038#bib.bib10), [7](https://arxiv.org/html/2606.15038#bib.bib21)\]. More recently, joint and attention-based fusion methods \[[11](https://arxiv.org/html/2606.15038#bib.bib38), [17](https://arxiv.org/html/2606.15038#bib.bib51)\] have emerged as powerful strategies for integrating heterogeneous modalities. These approaches employ mechanisms such as cross-attention \[[19](https://arxiv.org/html/2606.15038#bib.bib50)\], co-attention \[[10](https://arxiv.org/html/2606.15038#bib.bib40), [8](https://arxiv.org/html/2606.15038#bib.bib41)\], and contrastive alignment \[[21](https://arxiv.org/html/2606.15038#bib.bib42), [6](https://arxiv.org/html/2606.15038#bib.bib43)\] to dynamically re-weight contributions from each modality based on contextual relevance. However, foundation models for 3D imaging and clinical sequences are typically pretrained on generic objectives and are not optimized for TTE prediction, leaving latent representations unstructured for temporal risk modeling. Cross-modal alignment can reshape these embeddings toward temporally predictive signals, but how the choice of alignment mechanism interacts with the survival objective is likewise underexplored \[[15](https://arxiv.org/html/2606.15038#bib.bib10), [7](https://arxiv.org/html/2606.15038#bib.bib21)\].

We propose a multimodal framework for time-to-event (TTE) prediction that systematically integrates supervised fusion strategies (contrastive alignment, cross-attention, and bi-directional co-attention) and evaluates them against traditional concatenation. Using mortality following pulmonary embolism (PE) and long-term cardiovascular outcomes as benchmark tasks, we demonstrate how the choice of fusion strategy impacts temporal risk modeling under distribution shift. These findings provide practical guidance for designing effective multimodal survival models that leverage complementary imaging and EHR information.

## 2 Method

Figure [1](https://arxiv.org/html/2606.15038#S2.F1) illustrates the multimodal fusion framework for time-to-event prediction. CT scans and EHR data are first encoded using domain-specific foundation models \[[13](https://arxiv.org/html/2606.15038#bib.bib2)\], producing image and clinical embeddings that are fused in latent space and optimized using a survival objective \[[12](https://arxiv.org/html/2606.15038#bib.bib1)\]. Gradients are propagated only through the fusion and task-specific layers, while the pretrained foundation encoders remain frozen in all experiments.

### 2.1 Foundation models - Image and EHR representation

*2D Medical Image Foundation Model – MedImageInsight:* We use MedImageInsight (MII) \[[1](https://arxiv.org/html/2606.15038#bib.bib5)\], a pretrained medical imaging foundation model that encodes 2D slices into $1\times 1024$ embeddings. For 3D CT volumes, axial slices containing cardiac structures are first selected. Soft-tissue windowing is applied by clipping voxel intensities to $[-1350, 150]$ HU for PE and $[-125, 225]$ HU for MACE. Each slice is independently encoded by MII, producing $N\times 1024$ embeddings, which are averaged across the z-dimension to obtain a single $1\times 1024$ volume-level representation.

*EHR Foundation Model – CLMBR Features:* We use a pretrained CLMBR-T-base model \[[2](https://arxiv.org/html/2606.15038#bib.bib19)\], an autoregressive Transformer trained via self-supervised next-code prediction on longitudinal structured EHR data. The model encodes time-ordered clinical codes and produces a 768-dimensional patient-level embedding. We use the publicly released Stanford-trained checkpoint and extract fixed embeddings without additional fine-tuning. As an alternative EHR representation, we also explore a manually curated one-hot encoding, in which task-specific features (demographics, laboratory results, medications, diagnoses, and procedure codes) are selected and binarized into one-hot vectors for downstream modeling.
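The CT preprocessing described above (clip each slice to a HU window, encode it, then average slice embeddings along z) can be illustrated with a minimal sketch. Everything beyond the clipping range and the mean pooling is an assumption made for illustration: `toy_encoder` is a hypothetical stand-in for the frozen slice encoder, and rescaling the clipped values to $[0, 1]$ is not stated in the text.

```python
def window_hu(slice_hu, lo, hi):
    """Clip HU values to [lo, hi], then rescale to [0, 1] (rescaling is an assumption)."""
    span = hi - lo
    return [[(min(max(v, lo), hi) - lo) / span for v in row] for row in slice_hu]


def pool_volume(slice_embeddings):
    """Average N slice embeddings of length D into a single length-D vector."""
    if not slice_embeddings:
        raise ValueError("need at least one slice embedding")
    n = len(slice_embeddings)
    return [sum(dim) / n for dim in zip(*slice_embeddings)]


def toy_encoder(slice_01):
    """Hypothetical stand-in for the frozen slice encoder: returns [mean, max]."""
    flat = [v for row in slice_01 for v in row]
    return [sum(flat) / len(flat), max(flat)]


PE_WINDOW = (-1350.0, 150.0)  # HU range quoted in the text for PE
volume = [                     # two tiny 2x2 "slices" in HU
    [[-2000.0, 0.0], [150.0, 400.0]],
    [[-1350.0, -600.0], [0.0, 150.0]],
]
embeddings = [toy_encoder(window_hu(s, *PE_WINDOW)) for s in volume]
volume_embedding = pool_volume(embeddings)  # one vector per CT volume
```

With a real encoder each slice would map to a 1024-dimensional vector, and `pool_volume` would return the $1\times 1024$ volume-level representation.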

![Refer to caption](https://arxiv.org/html/2606.15038v1/MICCAI2026-Latex-Template/figures/Pipeline.png)

![Refer to caption](https://arxiv.org/html/2606.15038v1/MICCAI2026-Latex-Template/figures/model1_concat.png)

\(a\) Traditional Concat

![Refer to caption](https://arxiv.org/html/2606.15038v1/MICCAI2026-Latex-Template/figures/model2_cl.png)

\(b\) Proposed Contrastive

![Refer to caption](https://arxiv.org/html/2606.15038v1/MICCAI2026-Latex-Template/figures/model3_coattn.png)

\(c\) Proposed Co\-Attention

![Refer to caption](https://arxiv.org/html/2606.15038v1/MICCAI2026-Latex-Template/figures/model4_cross.png)

\(d\) Proposed Cross\-Attention

![Refer to caption](https://arxiv.org/html/2606.15038v1/MICCAI2026-Latex-Template/figures/legen.png)

Figure 1: Overview of the multimodal survival framework and fusion strategies. Cross-modal alignment of chest CT and longitudinal EHR using domain-specific foundation encoders to produce shared embeddings for time-to-event prediction, with image-only and EHR-only baselines derived from the respective encoders. Fusion variants in latent space: (a) traditional concatenation of embeddings, and the proposed (b) contrastive learning, (c) co-attention, and (d) symmetric cross-attention.

### 2.2 Survival Model: Time-to-Event (TTE) with cross-modal alignment

For TTE modeling, we apply a multilayer perceptron (MLP) prediction head to either unimodal or fused embeddings and optimize a negative log-likelihood (NLL) survival objective that accounts for right-censored observations. Hidden-layer widths are scaled proportionally to the input embedding dimensionality, ensuring balanced model capacity across heterogeneous feature types. This formulation allows probabilistic estimation of event times by modeling both the hazard function and the censoring mechanism. The NLL loss is backpropagated through the prediction head and the fusion modules, so the joint representation is optimized end-to-end under the survival objective and shaped by temporally and clinically relevant features, improving predictive performance under both observed and censored conditions.
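To make the role of right-censoring in an NLL survival objective concrete, the sketch below uses a discrete-time hazard formulation. This is an assumption chosen for illustration, since the text does not specify the exact likelihood: the prediction head is assumed to output one hazard per time bin, an observed event contributes survival through the earlier bins plus the hazard in the event bin, and a censored subject contributes survival through the censoring bin only. The function name and arguments are hypothetical.

```python
import math


def discrete_time_nll(hazards, last_bin, observed):
    """NLL for one subject under a discrete-time hazard model (illustrative assumption).

    hazards  : per-bin conditional event probabilities, each in (0, 1)
    last_bin : index of the bin containing the event or censoring time
    observed : True if the event occurred, False if right-censored
    """
    # Survive every bin before the last one.
    log_lik = sum(math.log(1.0 - hazards[k]) for k in range(last_bin))
    if observed:
        log_lik += math.log(hazards[last_bin])        # event happens in last_bin
    else:
        log_lik += math.log(1.0 - hazards[last_bin])  # still event-free when censored
    return -log_lik


hazards = [0.1, 0.2, 0.3]
nll_event = discrete_time_nll(hazards, last_bin=1, observed=True)
nll_censored = discrete_time_nll(hazards, last_bin=1, observed=False)
```

Censored subjects still contribute gradient through the survival terms, which is how such a loss uses partially observed follow-up rather than discarding it.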

*Contrastive:* Inspired by CLIP-style training \[[14](https://arxiv.org/html/2606.15038#bib.bib53), [22](https://arxiv.org/html/2606.15038#bib.bib54)\], we designed a 2-step training procedure in which we first align image and EHR embeddings using a symmetric contrastive objective. Given a batch of size $N$, the $i$-th image and $i$-th EHR form a positive pair, while all other combinations act as negatives. The similarity scores and the loss are defined as:

$$S_{ij}=\frac{\mathrm{sim}\left(h_{i}^{\mathrm{img}},h_{j}^{\mathrm{ehr}}\right)}{\tau},\quad\mathcal{L}=-\frac{1}{N}\sum_{i=1}^{N}\left[\log\frac{e^{S_{ii}}}{\sum_{j=1}^{N}e^{S_{ij}}}+\log\frac{e^{S_{ii}}}{\sum_{j=1}^{N}e^{S_{ji}}}\right]$$

where $\tau$ is the temperature. After contrastive pre-alignment, the fused embeddings are used to train the time-to-event (TTE) survival model.
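The symmetric objective can be checked on a toy batch with the sketch below. It assumes cosine similarity for $\mathrm{sim}(\cdot,\cdot)$ and an illustrative temperature of 0.07; neither value is stated in the text, and the function names are hypothetical. Embeddings are assumed to be non-zero vectors already projected to a common dimension.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def log_softmax_at(scores, idx):
    """log(exp(scores[idx]) / sum_j exp(scores[j])), computed stably."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[idx] - log_sum


def symmetric_contrastive_loss(h_img, h_ehr, tau=0.07):
    """-(1/N) sum_i [ log softmax over row i at i + log softmax over column i at i ]."""
    n = len(h_img)
    S = [[cosine(h_img[i], h_ehr[j]) / tau for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        row = S[i]                         # image i against every EHR: sum_j e^{S_ij}
        col = [S[j][i] for j in range(n)]  # every image against EHR i: sum_j e^{S_ji}
        total += log_softmax_at(row, i) + log_softmax_at(col, i)
    return -total / n


img = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = symmetric_contrastive_loss(img, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = symmetric_contrastive_loss(img, [[0.0, 1.0], [1.0, 0.0]])
```

Matched pairs give a loss near zero, while swapping the EHR rows makes every positive pair the least similar one and drives the loss up.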

*Cross-attention:* In an end-to-end setting, we fuse multimodal embeddings (imaging + EHR) using cross-attention so that each modality can attend to important signals in the other, and then train this fused representation under a TTE objective. This lets the model learn, in a *supervised way*, which features from each modality embedding are predictive of when an event is likely to occur. Given the foundation model encodings of the image and EHR, $h_{\text{img}}=f_{\text{enc,img}}(x_{\text{img}})$ and $h_{\text{ehr}}=f_{\text{enc,ehr}}(x_{\text{ehr}})$, we compute the fused feature $h_{\text{fused}}$ with cross-attention as

$$h_{\mathrm{fused}}=\mathrm{concat}\bigl(h_{\mathrm{ehr}},\,A_{\mathrm{ehr}\leftarrow\mathrm{img}},\,h_{\mathrm{img}},\,A_{\mathrm{img}\leftarrow\mathrm{ehr}}\bigr),$$

where

$$Q_{\mathrm{ehr}}=W_{Q}^{(e)}\,h_{\mathrm{ehr}},\quad K_{\mathrm{img}}=W_{K}^{(e)}\,h_{\mathrm{img}},\quad V_{\mathrm{img}}=W_{V}^{(e)}\,h_{\mathrm{img}}$$

$$A_{\mathrm{ehr}\leftarrow\mathrm{img}}=\mathrm{softmax}\Bigl(\frac{Q_{\mathrm{ehr}}\,K_{\mathrm{img}}^{\top}}{\sqrt{d_{k}}}\Bigr)\,V_{\mathrm{img}}$$

$$Q_{\mathrm{img}}=W_{Q}^{(i)}\,h_{\mathrm{img}},\quad K_{\mathrm{ehr}}=W_{K}^{(i)}\,h_{\mathrm{ehr}},\quad V_{\mathrm{ehr}}=W_{V}^{(i)}\,h_{\mathrm{ehr}}$$

$$A_{\mathrm{img}\leftarrow\mathrm{ehr}}=\mathrm{softmax}\Bigl(\frac{Q_{\mathrm{img}}\,K_{\mathrm{ehr}}^{\top}}{\sqrt{d_{k}}}\Bigr)\,V_{\mathrm{ehr}}$$

Here $W_{Q}^{(e)},W_{K}^{(e)},W_{V}^{(e)}$ are the projection matrices for the EHR-to-Image cross-attention mechanism, where queries are derived from EHR embeddings and keys and values from image embeddings. The reverse, Image-to-EHR cross-attention, uses $W_{Q}^{(i)},W_{K}^{(i)},W_{V}^{(i)}$ with the roles of the modalities swapped. The outputs of these attention mechanisms, $A_{\mathrm{ehr}\leftarrow\mathrm{img}}$ and $A_{\mathrm{img}\leftarrow\mathrm{ehr}}$, are concatenated with their respective modality embeddings to form the fused feature vector $h_{\text{fused}}$.

This fused representation is then used for time-to-event (TTE) prediction, with the loss formulated as:

$$\mathcal{L}_{\mathrm{TTE}}=-\sum_{i=1}^{N}\log p\left(T_{i},\delta_{i}\mid h_{\text{fused},i}\right)+\lambda\|\theta\|_{2}^{2},$$

where $T_{i}$ denotes the observed or censored time for subject $i$, $\delta_{i}$ is the event indicator, $\lambda$ is the regularization hyperparameter, and $\theta$ represents the model's trainable parameters across projection layers and attention mechanisms.
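The two attention directions and the four-way concatenation can be traced on toy inputs with the following sketch. Several choices here are assumptions for illustration only: each modality is represented as a short list of token vectors (the text does not say how a pooled embedding is tokenized), tokens are mean-pooled before concatenation, vectors use a row convention (token times matrix), and the projections are toy identity matrices. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project(tokens, W):
    """Row-vector convention: each token (length d_in) times W (d_in x d_k)."""
    cols = list(zip(*W))
    return [[dot(t, c) for c in cols] for t in tokens]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(h_query, h_context, W_q, W_k, W_v):
    """softmax(Q K^T / sqrt(d_k)) V, with Q from h_query and K, V from h_context."""
    Q, K, V = project(h_query, W_q), project(h_context, W_k), project(h_context, W_v)
    scale = math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        weights = softmax([dot(q, k) / scale for k in K])
        out.append([dot(weights, col) for col in zip(*V)])
    return out


def mean_pool(tokens):
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def cross_attention_fuse(h_ehr, h_img, params_e, params_i):
    a_ehr_from_img = attend(h_ehr, h_img, *params_e)  # EHR queries, image keys/values
    a_img_from_ehr = attend(h_img, h_ehr, *params_i)  # image queries, EHR keys/values
    # concat(h_ehr, A_ehr<-img, h_img, A_img<-ehr) after pooling tokens (assumption)
    return (mean_pool(h_ehr) + mean_pool(a_ehr_from_img)
            + mean_pool(h_img) + mean_pool(a_img_from_ehr))


I2 = [[1.0, 0.0], [0.0, 1.0]]      # toy identity projections, d_k = 2
h_ehr = [[1.0, 0.0]]                # one EHR token
h_img = [[1.0, 0.0], [0.0, 1.0]]    # two image tokens
h_fused = cross_attention_fuse(h_ehr, h_img, (I2, I2, I2), (I2, I2, I2))
```

Note that when the context side holds a single token, the softmax weight is 1 and the attended output equals the projected value, so the number of context tokens determines how much re-weighting attention can perform.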

*Co-attention:* To preserve the primary knowledge from one modality and use the other as a guiding modality, we adapted the co-attention mechanism. In this model, one modality (e.g., $h_{\mathrm{ehr}}$) serves as the query, while the other ($h_{\mathrm{img}}$) functions as both the key and value in the co-attention computation:

$$\text{CoAttn}_{\mathrm{ehr}\to\mathrm{img}}(h_{\mathrm{ehr}},\,h_{\mathrm{img}})=\mathrm{softmax}\left(\frac{Q_{\mathrm{ehr}}\,K_{\mathrm{img}}^{\top}}{\sqrt{d_{k}}}\right)V_{\mathrm{img}}$$

where $Q_{\mathrm{ehr}}=W_{Q}^{(e)}\,h_{\mathrm{ehr}}$, $K_{\mathrm{img}}=W_{K}^{(e)}\,h_{\mathrm{img}}$, $V_{\mathrm{img}}=W_{V}^{(e)}\,h_{\mathrm{img}}$, and $W_{Q}^{(e)},W_{K}^{(e)},W_{V}^{(e)}\in\mathbb{R}^{d_{k}\times d_{k}}$ are trainable weight matrices. The attention scores are computed by comparing the projected query $Q_{\mathrm{ehr}}$ with the projected key $K_{\mathrm{img}}$, scaled by $\sqrt{d_{k}}$ for numerical stability. Applying the softmax operation then yields the co-attention matrix

$$A_{\mathrm{coattn}}=\mathrm{softmax}\left(\frac{Q_{\mathrm{ehr}}\,K_{\mathrm{img}}^{\top}}{\sqrt{d_{k}}}\right),$$

which assigns importance weights over different regions of the image modality. These weights are then used to compute a weighted sum of the image features, producing a refined representation $\hat{h}_{\text{img}}$:

$$\hat{h}_{\text{img}}=A_{\mathrm{coattn}}\,V_{\mathrm{img}}=\mathrm{softmax}\left(\frac{W_{Q}^{(e)}\,h_{\mathrm{ehr}}\,\bigl(W_{K}^{(e)}\,h_{\mathrm{img}}\bigr)^{\top}}{\sqrt{d_{k}}}\right)W_{V}^{(e)}\,h_{\mathrm{img}}.$$

Deriving the value term from the image ensures that the model focuses on the most informative regions of the imaging modality given the EHR features. The image and EHR modalities can, in principle, be swapped, since the attention is learned in embedding space.

As with cross-attention, a fused embedding is generated after incorporating the co-attention, $h_{\mathrm{fused}}=\mathrm{concat}(h_{\mathrm{img}},A_{\mathrm{coattn}})$, and is finally fed into the TTE model to optimize the overall loss $\mathcal{L}_{\mathrm{TTE}}$.
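A single guided direction can be sketched as below, with one EHR vector as the query and a list of image token vectors as keys and values. The tokenization of the image embedding, the identity projection matrices, and the final concatenation of a mean-pooled image embedding with the refined representation are all assumptions for illustration; they represent one plausible reading of the fused embedding rather than the authors' exact construction. Function names are hypothetical.

```python
import math


def matvec(W, x):
    """Column-vector convention as in the text: returns W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def co_attention(h_query, context_tokens, W_q, W_k, W_v):
    """Return (attention weights over context tokens, refined context vector)."""
    q = matvec(W_q, h_query)
    keys = [matvec(W_k, t) for t in context_tokens]
    values = [matvec(W_v, t) for t in context_tokens]
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    refined = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, refined


I2 = [[1.0, 0.0], [0.0, 1.0]]          # toy identity projections
h_ehr = [0.0, 2.0]                      # guiding (query) modality
img_tokens = [[1.0, 0.0], [0.0, 1.0]]   # guided modality: keys and values
weights, h_img_refined = co_attention(h_ehr, img_tokens, I2, I2, I2)

# One possible fused vector: pooled image embedding followed by the refined one.
h_img_pooled = [sum(col) / len(img_tokens) for col in zip(*img_tokens)]
h_fused = h_img_pooled + h_img_refined
```

Swapping the arguments (image vector as query, EHR tokens as context) gives the other guided variant under the same assumptions.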

## 3 Results

*Clinical use-case 1: Major adverse cardiac event (MACE) prediction.* Traditional MACE risk models rely on predefined tabular variables (e.g., PCE) or imaging alone (e.g., CAC) to predict risk of myocardial infarction, ischemic stroke, heart failure, and cardiac mortality, limiting personalization and underutilizing longitudinal EHR data. While chest CT captures detailed anatomy, EHR provides temporal clinical context, though it is often incomplete in asymptomatic patients. Integrating both modalities via foundation models enables complementary, personalized risk stratification. We developed the model using retrospective multi-site data (3,947 patients; 4,133 thoracic CTs; 46% MACE rate) and evaluated generalization on ED patients undergoing routine non-cardiac chest CTs (665 patients; 682 CTs; 28% MACE rate). Temporal EHR features were extracted per AHA guidelines, including demographics, vitals, labs, comorbidities, and medications.

*Clinical use-case 2: Mortality prediction in pulmonary embolism (PE).* PE carries substantial mortality risk (1–35% depending on severity), underscoring the need for accurate stratification. Existing tools (e.g., PESI) emphasize clinical variables but rarely integrate high-dimensional imaging data. We assembled a private dataset of all CT pulmonary angiography studies with a confirmed positive acute PE diagnosis (3,764 patients; 4,542 studies; 7% 1-year mortality), with linked temporal EHR data mapped to the OMOP CDM for standardized learning. External validation was performed on a random sample of positive acute PE cases from the INSPECT dataset \[[3](https://arxiv.org/html/2606.15038#bib.bib56)\] (396 patients; 435 studies; 19% 1-year mortality), enabling evaluation of multimodal mortality prediction across sites.

*Quantitative Performance:* As the baseline multimodal integration strategy, we evaluated simple feature concatenation (Fig. [1](https://arxiv.org/html/2606.15038#S2.F1)), which directly combines independently pretrained image and EHR embeddings into a joint representation. This establishes the multimodal reference against which the 2-step contrastive and attention-based fusion methods are compared. The frozen image-only encoder with supervised training of survival layers is used as the unimodal baseline, because the heterogeneity in EHR encoding strategies (e.g., one-hot representations versus CLMBR-based embeddings) could otherwise introduce confounding variability in baseline comparisons.

Table 1: Prediction performance for PE and MACE reported as mean (95% CI) for Internal (Institution A) and External (INSPECT) cohorts. Unimodal baselines (Image-only and EHR-only) are shown separately from multimodal fusion models. ∗ indicates $p<0.05$ versus Image-only, and + indicates $p<0.05$ versus linear concatenation fusion as the reference baseline. Best performance within each cohort and outcome is shown in **bold**.

| Group | EHR representation | Model | Internal PE | Internal MACE | External PE | External MACE |
|---|---|---|---|---|---|---|
| Image-only baseline | | Image only (unimodal ref.) | 0.837 (0.828–0.847) | 0.742 (0.737–0.748) | 0.710 (0.697–0.722) | 0.725 (0.718–0.732) |
| EHR-only (unimodal structured data) | One-hot | EHR only | 0.748∗ (0.735–0.760) | 0.732∗ (0.725–0.739) | 0.620∗ (0.610–0.630) | 0.691∗ (0.685–0.698) |
| EHR-only (unimodal structured data) | CLMBR | EHR only | 0.775∗ (0.763–0.787) | 0.502∗ (0.498–0.505) | 0.641∗ (0.628–0.654) | 0.497∗ (0.497–0.498) |
| Multimodal fusion | One-hot | Concatenation (cross-modal ref.) | 0.819∗ (0.810–0.828) | 0.790∗ (0.786–0.794) | 0.717 (0.708–0.726) | 0.733 (0.725–0.740) |
| Multimodal fusion | One-hot | Contrastive learning | 0.847+ (0.840–0.854) | 0.794 (0.790–0.798) | 0.738∗+ (0.724–0.752) | 0.722 (0.714–0.729) |
| Multimodal fusion | One-hot | Cross-attention | 0.846+ (0.839–0.854) | **0.796**∗ (0.791–0.801) | 0.719 (0.709–0.729) | 0.731 (0.724–0.738) |
| Multimodal fusion | One-hot | Co-attention (image guide) | 0.839+ (0.830–0.848) | 0.791 (0.786–0.796) | 0.693∗+ (0.685–0.700) | **0.740**∗ (0.735–0.746) |
| Multimodal fusion | One-hot | Co-attention (EHR guide) | 0.784∗+ (0.774–0.795) | 0.791 (0.787–0.795) | 0.653∗+ (0.641–0.664) | 0.736∗ (0.730–0.741) |
| Multimodal fusion | CLMBR | Concatenation (cross-modal ref.) | 0.846 (0.837–0.855) | 0.757∗ (0.751–0.763) | 0.734 (0.724–0.744) | 0.721 (0.712–0.730) |
| Multimodal fusion | CLMBR | Contrastive learning | **0.862**∗+ (0.856–0.868) | 0.759∗ (0.754–0.764) | **0.743**∗+ (0.731–0.754) | 0.727 (0.721–0.734) |
| Multimodal fusion | CLMBR | Cross-attention | 0.846 (0.838–0.854) | 0.762∗ (0.757–0.767) | 0.723 (0.707–0.739) | 0.725 (0.718–0.731) |
| Multimodal fusion | CLMBR | Co-attention (image guide) | 0.842 (0.836–0.848) | 0.741 (0.733–0.749) | 0.712 (0.701–0.723) | 0.724 (0.715–0.732) |
| Multimodal fusion | CLMBR | Co-attention (EHR guide) | 0.824+ (0.814–0.834) | 0.754 (0.749–0.759) | 0.724∗ (0.715–0.733) | 0.726 (0.718–0.734) |
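For readers less familiar with the concordance index named in the abstract, the following sketch shows one common way to compute it under right-censoring. It assumes Harrell's pairwise definition, with ties in predicted risk counted as half-concordant and ties in time ignored; whether the reported numbers use exactly this variant is not stated in the text, and the function name is hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell-style C-index (illustrative assumption about the metric variant).

    A pair (i, j) is comparable when subject i had an observed event and
    times[i] < times[j]; it is concordant when the model assigns i a higher risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [1.0, 2.0, 3.0]
events = [1, 1, 0]  # third subject is right-censored
c_perfect = concordance_index(times, events, [0.9, 0.5, 0.1])
c_reversed = concordance_index(times, events, [0.1, 0.5, 0.9])
c_constant = concordance_index(times, events, [0.3, 0.3, 0.3])
```

A value of 0.5 corresponds to uninformative ranking, which is a useful reference point when reading the table.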

Across both PE and MACE tasks, multimodal fusion models generally improved prediction performance over unimodal baselines, with statistically significant gains observed across several settings. EHR-only models performed significantly worse than the image-only reference for both PE and MACE (p<0.05). For PE prediction, 2-step contrastive learning consistently achieved the strongest results, with the CLMBR-based contrastive model demonstrating the best overall performance internally (AUC 0.862) and externally (AUC 0.743), significantly outperforming both image-only and linear concatenation (p<0.05). For MACE, improvements were more modest, though cross-attention (one-hot) achieved the highest internal performance (AUC 0.796, p<0.05 vs. image-only) and image-guided co-attention achieved the best external performance (AUC 0.740, p<0.05 vs. image-only). One-hot representations provided more stable MACE prediction, likely due to the use of hand-curated PREVENT features employed in current clinical risk scoring \[[18](https://arxiv.org/html/2606.15038#bib.bib55)\]. Overall, contrastive multimodal fusion, particularly with CLMBR representations, provided the most consistent and statistically robust improvements, especially for PE and external generalization. Statistically significant gains over traditional linear concatenation of the foundation-model feature space underscore the value of task-driven cross-modal alignment, and performance trends suggest that the optimal fusion strategy is task-dependent and that fusion performs better when trained with a supervised objective.

We analyzed gradient-based saliency maps to assess whether multimodal alignment altered spatial attention. Compared to image-only models, cross-attention fusion shifted attention toward clinically relevant regions (Fig. [2](https://arxiv.org/html/2606.15038#S3.F2)). For PE mortality prediction, the image-only model primarily highlighted the structure of the pulmonary arteries, whereas the fusion model localized embolic burden. For MACE risk estimation, the fusion model emphasized pulmonary artery dilation in addition to cardiac structures, yielding more confident predictions. Although saliency maps do not establish causality, these findings suggest that integrating EHR information guides imaging features toward task-relevant anatomy.

![Refer to caption](https://arxiv.org/html/2606.15038v1/MICCAI2026-Latex-Template/figures/Silency_maps.png)

Figure 2: Comparative saliency maps for MACE and pulmonary embolism (PE) outcome prediction using image-only and multimodal fusion models. *P* denotes the predicted risk score for patients who subsequently developed adverse events.

## 4 Discussion

Vision–language pretraining aligns modalities using self-supervised objectives (e.g., contrastive losses), but such alignment does not necessarily transfer optimally to complex time-to-event (TTE) prediction, where prognostic signals may differ from pretraining objectives. While concatenation of pre-aligned spaces provides a useful baseline, it may fail to capture fine-grained task-relevant interactions. In contrast, supervised attention-based fusion enables frozen foundation embeddings to be selectively aligned for survival prediction without full fine-tuning.

Across MACE and PE mortality tasks, both modality importance and optimal fusion strategy were task\-dependent\. CT imaging was more discriminative for MACE, reflecting the prognostic value of cardiac morphology and coronary calcification, whereas longitudinal EHR features were more informative for PE mortality, which depends on systemic comorbidities and treatment context\. EHR representation choice further influenced performance: curated one\-hot cardiovascular variables provided stable MACE prediction, while CLMBR foundational embeddings, which capture broader comorbidity patterns, better predicted PE\-related mortality but underperformed for MACE\. Results underscore a limitation of foundation models—representations optimized for general sequence modeling may fail to retain task\-specific survival signals without targeted alignment\.

Attention-based strategies surpassed simple concatenation by adaptively weighting modalities, enhancing discrimination and robustness under modality imbalance. Bidirectional cross-attention achieved the highest internal performance for MACE, while co-attention improved external generalization when modality importance was asymmetric, highlighting that optimal fusion is task- and cohort-dependent. CLMBR-based 2-step contrastive fusion provided the most consistent and statistically robust gains, particularly for PE and external validation. Saliency analyses supported these findings: fusion shifted attention toward clinically relevant regions, localizing embolic burden for PE and emphasizing pulmonary artery dilation and cardiac morphology for MACE, suggesting that EHR integration guides imaging representations to task-relevant anatomy.

External validation revealed persistent generalization gaps, likely driven by heterogeneity in imaging protocols, population characteristics, and EHR completeness, underscoring the need for domain adaptation and harmonization strategies for cross-site deployment with foundation-model backbones. Additional limitations include cohort size, event-rate imbalance, and potential residual confounding in retrospective data. Prospective validation and calibration assessment will be essential before clinical translation. Experimental results indicate that even with powerful self-supervised foundation models, supervised, task-aware cross-modal learning is critical for robust and clinically meaningful survival modeling in multimodal settings.

## References

- \[1\] N\. C\. F\. Codella, Y\. Jin, S\. Jain, Y\. Gu, H\. H\. Lee, A\. B\. Abacha, A\. Santamaria\-Pang, W\. Guyman, N\. Sangani, S\. Zhang, H\. Poon, S\. Hyland, S\. Bannur, J\. Alvarez\-Valle, X\. Li, J\. Garrett, A\. McMillan, G\. Rajguru, M\. Maddi, N\. Vijayrania, R\. Bhimai, N\. Mecklenburg, R\. Jain, D\. Holstein, N\. Gaur, V\. Aski, J\. Hwang, T\. Lin, I\. Tarapov, M\. Lungren, and M\. Wei \(2024\) MedImageInsight: an open\-source embedding model for general domain medical imaging\. External Links: 2410\.06542, [Link](https://arxiv.org/abs/2410.06542) Cited by: [§2\.1](https://arxiv.org/html/2606.15038#S2.SS1.p1.5)\.
- \[2\] L\. L\. Guo, E\. Steinberg, S\. L\. Fleming, J\. Posada, J\. Lemmon, S\. R\. Pfohl, N\. Shah, J\. Fries, and L\. Sung \(2023\) EHR foundation models improve robustness in the presence of temporal distribution shift\. Scientific Reports 13\(1\), pp\. 3767\. External Links: [Document](https://dx.doi.org/10.1038/s41598-023-30820-8), [Link](https://www.nature.com/articles/s41598-023-30820-8) Cited by: [§2\.1](https://arxiv.org/html/2606.15038#S2.SS1.p1.5)\.
- \[3\] S\. Huang, Z\. Huo, E\. Steinberg, C\. Chiang, M\. P\. Lungren, C\. P\. Langlotz, S\. Yeung, N\. H\. Shah, and J\. A\. Fries \(2023\) INSPECT: a multimodal dataset for pulmonary embolism diagnosis and prognosis\. arXiv preprint arXiv:2311\.10798\. Cited by: [§3](https://arxiv.org/html/2606.15038#S3.p1.1)\.
- \[4\] S\. Huang et al\. \(2023\) INSPECT: a multimodal dataset for patient outcome prediction of pulmonary embolisms\. Note: NeurIPS 2023 Datasets & Benchmarks \(poster\) External Links: [Link](https://neurips.cc/virtual/2023/poster/73704) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p1.1)\.
- \[5\] S\. Huang, A\. Pareek, S\. Seyyedi, I\. Banerjee, and M\. P\. Lungren \(2020\) Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines\. NPJ Digital Medicine 3\(1\), pp\. 136\. Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p1.1)\.
- \[6\] W\. Huang, C\. Li, H\. Zhou, H\. Yang, J\. Liu, Y\. Liang, H\. Zheng, S\. Zhang, and S\. Wang \(2024\) Enhancing representation in radiography\-reports foundation model: a granular alignment algorithm using masked contrastive learning\. Nature Communications 15\(1\), pp\. 7620\. External Links: [Document](https://dx.doi.org/10.1038/s41467-024-51749-0) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p2.1)\.
- \[7\] Z\. Huo, J\. A\. Fries, A\. Lozano, J\. M\. J\. Valanarasu, E\. Steinberg, L\. Blankemeier, A\. S\. Chaudhari, C\. P\. Langlotz, and N\. H\. Shah \(2024\) Time\-to\-Event pretraining for 3D medical imaging\. External Links: 2411\.09361, [Link](https://arxiv.org/abs/2411.09361) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p2.1)\.
- \[8\] Z\. Ji, Y\. Ge, C\. Chukwudi, S\. M\. Zhang, Y\. Peng, J\. Zhu, H\. Zaki, X\. Zhang, S\. Yang, X\. Wang, Y\. Chen, and J\. Zhao \(2025\) Counterfactual bidirectional co\-attention transformer for integrative histology–genomic cancer risk stratification\. IEEE Journal of Biomedical and Health Informatics 29\(8\), pp\. 5862–5874\. External Links: [Document](https://dx.doi.org/10.1109/JBHI.2025.3548048) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p2.1)\.
- \[9\] S\. S\. Khan, K\. Matsushita, Y\. Sang, S\. H\. Ballew, M\. E\. Grams, A\. Surapaneni, M\. J\. Blaha, A\. P\. Carson, A\. R\. Chang, E\. Ciemins, A\. S\. Go, O\. M\. Gutierrez, S\. Hwang, S\. K\. Jassal, C\. P\. Kovesdy, D\. M\. Lloyd\-Jones, M\. G\. Shlipak, L\. P\. Palaniappan, L\. Sperling, S\. S\. Virani, K\. Tuttle, I\. J\. Neeland, S\. L\. Chow, J\. Rangaswami, M\. J\. Pencina, C\. E\. Ndumele, J\. Coresh, Chronic Kidney Disease Prognosis Consortium, and American Heart Association Cardiovascular\-Kidney\-Metabolic Science Advisory Group \(2024\) Development and validation of the American Heart Association’s PREVENT equations\. Circulation 149\(6\), pp\. 430–449\. External Links: [Document](https://dx.doi.org/10.1161/CIRCULATIONAHA.123.067626), [Link](https://www.ahajournals.org/doi/abs/10.1161/CIRCULATIONAHA.123.067626), https://www\.ahajournals\.org/doi/pdf/10\.1161/CIRCULATIONAHA\.123\.067626 Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p1.1)\.
- \[10\] Z\. Li, Y\. Jiang, M\. Lu, R\. Li, and Y\. Xia \(2023\) Survival prediction via hierarchical multimodal co\-attention transformer: a computational histology–radiology solution\. IEEE Transactions on Medical Imaging 42\(9\), pp\. 2678–2689\. External Links: [Document](https://dx.doi.org/10.1109/TMI.2023.3263010) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p2.1)\.
- \[11\] P\. P\. Liang, A\. Zadeh, and L\. Morency \(2024\) Foundations & trends in multimodal machine learning: principles, challenges, and open questions\. ACM Computing Surveys 56\(10\), pp\. 1–42\. External Links: [Document](https://dx.doi.org/10.1145/3656580) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p2.1)\.
- \[12\] M\. Monod, P\. Krusche, Q\. Cao, B\. Sahiner, N\. Petrick, D\. Ohlssen, and T\. Coroller \(2024\) TorchSurv: a lightweight package for deep survival analysis\. Journal of Open Source Software 9\(104\), pp\. 7341\. External Links: [Document](https://dx.doi.org/10.21105/joss.07341) Cited by: [§2](https://arxiv.org/html/2606.15038#S2.p1.1)\.
- \[13\] S\. Neupane, S\. Mitra, S\. Mittal, M\. Gaur, et al\. \(2025\) MedInsight: a multi\-source context augmentation framework for generating patient\-centric medical responses using large language models\. ACM Transactions on Computing for Healthcare 6\(6\)\. External Links: [Document](https://dx.doi.org/10.1145/3709365) Cited by: [§2](https://arxiv.org/html/2606.15038#S2.p1.1)\.
- \[14\] A\. Shah, S\. Sra, R\. Chellappa, and A\. Cherian \(2022\) Max\-margin contrastive learning\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 36, pp\. 8220–8230\. Cited by: [§2\.2](https://arxiv.org/html/2606.15038#S2.SS2.p2.4)\.
- \[15\] E\. Steinberg, J\. A\. Fries, Y\. Xu, and N\. H\. Shah \(2024\) MOTOR: a time\-to\-event foundation model for structured medical records\. In ICLR, External Links: [Link](https://openreview.net/forum?id=NialiwI2V6) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p1.1), [§1](https://arxiv.org/html/2606.15038#S1.p2.1)\.
- \[16\] P\. Stone et al\. \(2022\) The accuracy of clinician predictions of survival in the PiPS2 study\. PLOS ONE\. External Links: [Document](https://dx.doi.org/10.1371/journal.pone.0267050), [Link](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0267050) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p1.1)\.
- \[17\] Z\. Sun, M\. Lin, Q\. Zhu, Q\. Xie, F\. Wang, Z\. Lu, and Y\. Peng \(2023\) A scoping review on multimodal deep learning in biomedical images and texts\. Journal of Biomedical Informatics 146, pp\. 104482\. External Links: ISSN 1532\-0464, [Document](https://dx.doi.org/10.1016/j.jbi.2023.104482), [Link](https://www.sciencedirect.com/science/article/pii/S1532046423002034) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p2.1)\.
- \[18\] T\. F\. Thomsen, M\. Davidsen, H\. Ibsen, T\. Jørgensen, G\. Jensen, and K\. Borch\-Johnsen \(2001\) A new method for CHD prediction and prevention based on regional risk scores and randomized clinical trials; PRECARD® and the Copenhagen Risk Score\. European Journal of Cardiovascular Prevention & Rehabilitation 8\(5\), pp\. 291–297\. Cited by: [§3](https://arxiv.org/html/2606.15038#S3.p3.1)\.
- \[19\] Q\. Wang, K\. Chen, W\. Dou, and Y\. Ma \(2023\) Cross\-attention based multi\-resolution feature fusion model for self\-supervised cervical OCT image classification\. IEEE/ACM Transactions on Computational Biology and Bioinformatics 20\(4\), pp\. 2541–2554\. External Links: [Document](https://dx.doi.org/10.1109/TCBB.2023.3246979) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p2.1)\.
- \[20\] S\. Wiegrebe, P\. Kopper, R\. Sonabend, B\. Bischl, and A\. Bender \(2024\) Deep learning for survival analysis: a review\. Artificial Intelligence Review\. External Links: [Document](https://dx.doi.org/10.1007/s10462-023-10681-3), [Link](https://link.springer.com/article/10.1007/s10462-023-10681-3) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p1.1)\.
- \[21\] P\. Yang, X\. Yin, H\. Lu, Z\. Hu, X\. Zhang, R\. Jiang, and H\. Lv \(2022\) CS\-CO: a hybrid self\-supervised visual representation learning method for H&E\-stained histopathological images\. Medical Image Analysis 81, pp\. 102539\. External Links: [Document](https://dx.doi.org/10.1016/j.media.2022.102539) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p2.1)\.
- \[22\] J\. Zhang, X\. Lan, X\. Qu, Y\. Cheng, M\. Feng, and B\. Hooi \(2024\) Learning the unlearned: mitigating feature suppression in contrastive learning\. In European Conference on Computer Vision, pp\. 35–52\. Cited by: [§2\.2](https://arxiv.org/html/2606.15038#S2.SS2.p2.4)\.
- \[23\] Y\. Zhang, Y\. Chen, H\. Chen, C\. Dong, X\. Hu, X\. Xu, L\. Zhu, Z\. Cheng, D\. Wang, Z\. Zhang, et al\. \(2024\) Performance of the simplified pulmonary embolism severity index in predicting 30\-day mortality after acute pulmonary embolism: validation from a large\-scale cohort\. European Journal of Internal Medicine 124, pp\. 46–53\. External Links: [Document](https://dx.doi.org/10.1016/j.ejim.2024.01.037), [Link](https://pubmed.ncbi.nlm.nih.gov/38350784/) Cited by: [§1](https://arxiv.org/html/2606.15038#S1.p1.1)\.