Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis
Summary
This paper systematically evaluates foundation model representations for multimodal cancer analysis, benchmarking unimodal and multimodal fusion strategies on real-world cohorts, and assessing trustworthiness via conformal prediction.
# Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis
Source: [https://arxiv.org/html/2606.17115](https://arxiv.org/html/2606.17115)
Affiliations: \(1\) The Alan Turing Institute, London, United Kingdom; \(2\) University of Bristol, Bristol, United Kingdom; \(3\) University of Manchester, Manchester, United Kingdom; \(4\) The Institute of Cancer Research, London, United Kingdom; \(5\) Genentech, United States

Email: tchakraborty@turing\.ac\.uk

###### Abstract
Foundation models \(FMs\) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored\. This work systematically evaluates FM\-based representations on a suite of computational pathology tasks across two real\-world commercial cohorts, IH\-BC and IH\-NSCLC, drawn from the licensed in\-house \(IH\) oncology dataset\. The analysis focuses on two modalities, whole\-slide images and transcriptomic profiles, drawn from the IH multimodal data\. We first benchmark unimodal probing performance across five FMs on eight downstream classification tasks, and find that image and omics representations carry complementary predictive signals\. Then we investigate whether multimodal fusion can yield additional gains over unimodal baselines by comparing three image\-omics fusion strategies built on paired representations\. The trustworthiness of selected unimodal and multimodal pipelines is further assessed through conformal prediction\. Our results show that FM representations achieve competitive performance on out\-of\-distribution data and that multimodal fusion helps mainly when no single modality dominates the signal\. Conformal prediction reveals that in the majority of cases where a point prediction fails, the true diagnosis remains recoverable within the prediction set, reinforcing the value of uncertainty\-aware inference for clinical support\.
## 1 Introduction
Artificial intelligence has shown promising performance across a range of cancer diagnosis applications, from medical imaging \[[4](https://arxiv.org/html/2606.17115#bib.bib2), [45](https://arxiv.org/html/2606.17115#bib.bib1)\] and pathology \[[9](https://arxiv.org/html/2606.17115#bib.bib3)\] to molecular and genomic interpretation \[[10](https://arxiv.org/html/2606.17115#bib.bib4), [35](https://arxiv.org/html/2606.17115#bib.bib5)\] and clinical outcome prediction \[[48](https://arxiv.org/html/2606.17115#bib.bib6)\]\. A common workflow in these methods applies a modality\-specific encoder to extract data representations, which are then consumed by either a unimodal or multimodal learning module for downstream prediction\. The expressiveness of the encoded representation largely determines the quality of the downstream predictor, and prior work has extensively studied how encoder architecture and capacity match each data modality \[[27](https://arxiv.org/html/2606.17115#bib.bib32), [59](https://arxiv.org/html/2606.17115#bib.bib33)\]\.
As medical datasets continue to grow in number and scale \[[22](https://arxiv.org/html/2606.17115#bib.bib27)\], training a dedicated encoder from scratch for every new task becomes computationally expensive\. To address this bottleneck, recent work has started to explore medical Foundation Models \(FMs\), which are pretrained on large medical corpora and used as training\-free feature extractors across heterogeneous downstream tasks\. Both pathology FMs like CONCH \[[42](https://arxiv.org/html/2606.17115#bib.bib8)\] and transcriptomic FMs like UCE \[[49](https://arxiv.org/html/2606.17115#bib.bib28)\] have reported promising transferability, though evaluations are mainly performed on public benchmarks that may overlap with, or lie close to, the pretraining corpora of the FMs\.
In real\-world scenarios, however, there are also datasets collected from industrial and commercial sources that, due to different data collection pipelines, follow different distributions from the public ones\. Whether FMs can generalize to such unseen data remains an underexplored question\. This gap motivates our first question: *whether FM representations can transfer to out\-of\-distribution \(OOD\) datasets and yield reliable representations under probing\.* We study this question using an in\-house \(IH\) real\-world dataset of multi\-cancer cases with paired whole\-slide H&E images and transcriptomic profiles\. We probe image representations from four image FMs and omics representations from three transcriptomic encoders and evaluate their performance on eight downstream classification tasks\.
The evaluation on unimodal probing shows that image and omics representations carry complementary signals\. This is consistent with the real\-world setting where prognosis and treatment response are jointly determined by multiple modalities such as morphology, molecular state, and clinical context\. Prior work has therefore explored multimodal learning, applying fusion strategies such as concatenation \[[17](https://arxiv.org/html/2606.17115#bib.bib7)\] and cross\-modal attention \[[18](https://arxiv.org/html/2606.17115#bib.bib35)\] to incorporate information across modalities\. We accordingly ask *whether image\-omics fused representations can obtain additional predictive performance over the unimodal representations\.* By pairing image and omics representations with three fusion strategies, we find that multimodal fusion delivers stronger performance on some tasks\.
The evaluations so far focus on predictive performance \(e\.g\., accuracy\)\. However, in high\-stakes domains like medical diagnosis, high utility alone does not guarantee trustworthiness: a model may still be miscalibrated or unfair across demographic subgroups\. This leads us to further ask: *how trustworthy are the unimodal and multimodal techniques studied above?* We apply conformal prediction \[[3](https://arxiv.org/html/2606.17115#bib.bib39)\] to assess predictive uncertainty and quantify the gap in predictive performance across subgroups\. The selected pipelines achieve meaningful coverage guarantees and broadly similar performance across groups, while still revealing task\-specific disparities for future work\.
Figure 1: Overview of the Experimental Setups\.

Figure [1](https://arxiv.org/html/2606.17115#S1.F1) illustrates the overall workflow of our work, which proceeds in three parts: unimodal probing of image and omics representations across eight downstream tasks \(§[3](https://arxiv.org/html/2606.17115#S3)\), multimodal learning comparisons of three image\-omics fusion strategies built on paired image\-omics representations \(§[4](https://arxiv.org/html/2606.17115#S4)\), and trustworthiness evaluation of selected unimodal and multimodal techniques via conformal analysis \(§[5](https://arxiv.org/html/2606.17115#S5)\)\.
## 2 Experimental Setup
Datasets and Preprocessing Methods\. This study used the US\-based deidentified Flatiron Health\-Caris Life Sciences breast cancer and non\-small cell lung cancer Clinical\-Molecular Database \(CMDB\)\. Clinical data from the Flatiron Health Research Database \[[26](https://arxiv.org/html/2606.17115#bib.bib80), [44](https://arxiv.org/html/2606.17115#bib.bib81), [60](https://arxiv.org/html/2606.17115#bib.bib82)\] are linked to molecular data, derived from Caris Life Sciences’ MI Profile™ comprehensive profiling in the CMDB by probabilistic matching, providing a deidentified dataset \[[12](https://arxiv.org/html/2606.17115#bib.bib77), [11](https://arxiv.org/html/2606.17115#bib.bib78), [13](https://arxiv.org/html/2606.17115#bib.bib79)\]\. Based on cancer type, the collection is split into two in\-house cohorts, referred to as IH\-BC and IH\-NSCLC\. Each IH subset contains multimodal genomic data for the corresponding cancer\. This study takes two modalities, H&E images and omics data, as input, and information about cancer subtypes, biopsy sites, and biomarkers defines the downstream prediction tasks\. Specifically, IH\-BC includes LOH, Biomarker PR status, PIK3CA status, Biopsy Site, and Breast Cancer Subtype identification tasks, while IH\-NSCLC includes Biopsy Site, Tumor Site, and TMB identification tasks\. The datasets are split into training, calibration, validation, and test sets at a ratio of 7:3:1:1\. The multimodal learning methods are trained on the training set and evaluated on the test set\. More detailed preprocessing steps are reported in Appendix [0\.B\.1](https://arxiv.org/html/2606.17115#Pt0.A2.SS1)\.
Modelling Methods\. The experiment consists of two stages: representation extraction and representation learning\. For representation extraction, we evaluated four tile\-level foundation models \(CONCH \[[42](https://arxiv.org/html/2606.17115#bib.bib8)\], UNI \[[16](https://arxiv.org/html/2606.17115#bib.bib30)\], Virchow \[[55](https://arxiv.org/html/2606.17115#bib.bib29)\] and MUSK \[[57](https://arxiv.org/html/2606.17115#bib.bib31)\]\) for H&E WSI data and one omics foundation model \(UCE \[[49](https://arxiv.org/html/2606.17115#bib.bib28)\]\) for omics data\. We also obtained omics representations using scVI \[[41](https://arxiv.org/html/2606.17115#bib.bib37)\] and Principal Component Analysis \(PCA\) \[[1](https://arxiv.org/html/2606.17115#bib.bib54)\] for comparison\. In the representation learning stage, we considered five methods, each combined with different representation backbones, including two unimodal methods: H&E image\-based Multiple Instance Learning \(HEMIL\) and Multilayer Perceptron for omics data \(GeneMLP\), and three multimodal methods: concatenation\-based fusion \(CONTACT\), Multimodal Co\-Attention Transformer \[[18](https://arxiv.org/html/2606.17115#bib.bib35)\] \(MCAT\), and Late Fusion Multiple Instance Learning \[[46](https://arxiv.org/html/2606.17115#bib.bib34)\] \(LateMIL\)\. All methods are designed for the same set of tasks, and further method details are provided in Appendix [0\.B\.2](https://arxiv.org/html/2606.17115#Pt0.A2.SS2)\.
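To make the difference between concatenation-based and late fusion concrete, the following is a minimal, generic sketch and not the paper's implementation. It assumes mean pooling as a stand-in for the MIL aggregator, single linear heads as stand-ins for the classifiers, and hand-picked toy weights; the names `mean_pool`, `concat_fusion`, and `late_fusion` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Head = Tuple[List[Vector], Vector]  # (weight rows, biases), one row per class


def mean_pool(tiles: Sequence[Vector]) -> Vector:
    """Collapse a bag of tile embeddings into one slide-level vector.

    Mean pooling is an assumed stand-in for a learned MIL aggregator.
    """
    n = len(tiles)
    return [sum(col) / n for col in zip(*tiles)]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x: Sequence[float], head: Head) -> Vector:
    """Apply one linear layer: one logit per class."""
    weights, bias = head
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def concat_fusion(slide_vec: Vector, omics_vec: Vector, head: Head) -> Vector:
    """Early fusion: a single classifier over the concatenated vectors."""
    return softmax(linear(list(slide_vec) + list(omics_vec), head))


def late_fusion(slide_vec: Vector, omics_vec: Vector,
                image_head: Head, omics_head: Head) -> Vector:
    """Late fusion: average the class probabilities of two unimodal heads."""
    p_image = softmax(linear(slide_vec, image_head))
    p_omics = softmax(linear(omics_vec, omics_head))
    return [(a + b) / 2 for a, b in zip(p_image, p_omics)]


# Toy inputs (hypothetical values): two 2-d tile embeddings, one 1-d omics vector.
tiles = [[1.0, 0.0], [3.0, 2.0]]
omics = [0.5]
slide = mean_pool(tiles)

joint_head = ([[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]], [0.0, 0.0])
image_head = ([[0.2, -0.1], [-0.3, 0.5]], [0.0, 0.0])
omics_head = ([[0.4], [0.1]], [0.0, 0.0])

p_early = concat_fusion(slide, omics, joint_head)
p_late = late_fusion(slide, omics, image_head, omics_head)
print(p_early, p_late)
```

In this sketch, early fusion lets one classifier weigh both modalities jointly, whereas late fusion keeps the modality-specific predictors separate and combines only their outputs.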
Evaluation Metrics\. Accuracy \(ACC\) and AUC are used as the main evaluation metrics for model utility performance of downstream classification tasks\. The ROC curve is applied to illustrate the trade\-off between the true positive rate and false positive rate at different classification thresholds\. More detailed formulas can be found in Appendix [0\.B\.3](https://arxiv.org/html/2606.17115#Pt0.A2.SS3)\.
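As an illustration of these metrics, the sketch below computes accuracy, the ROC points, and the trapezoidal AUC for a binary task using only the standard library. The labels, scores, and the 0.5 decision threshold are toy assumptions; how the paper aggregates AUC for multiclass tasks is specified in its appendix and is not reproduced here.

```python
from typing import List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_curve(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    y_true holds binary labels (1 = positive); scores are positive-class scores.
    Tied scores are grouped so that each distinct threshold yields one point.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example (hypothetical labels and scores).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
y_pred = [int(s >= 0.5) for s in scores]  # assumed threshold of 0.5

print(accuracy(y_true, y_pred), auc(roc_curve(y_true, scores)))
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality over all thresholds, which is one reason both are reported side by side.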
## 3 Experiments on Unimodal Probing
Table 1: Unimodal Probing Performance on IH\-BC and IH\-NSCLC Tasks\. We report both ACC and AUC\.

Table [1](https://arxiv.org/html/2606.17115#S3.T1) summarizes the unimodal probing performance of four types of tile representations and three omics representations across eight downstream tasks\.
Image foundation models achieve broadly comparable performance, though results vary considerably across cancer types: AUC exceeds 0\.9 on IH\-BC Biopsy Site across all image FMs, whereas IH\-NSCLC Biopsy Site AUC ranges only from 0\.6325 to 0\.6478\. Among tile representations, differences between image FMs are small relative to differences across tasks, indicating that downstream task difficulty is the key factor shaping unimodal probing performance\. Omics representations provide strong unimodal signal on several tasks, and in some cases outperform image\-based methods\. For example, the highest accuracy on the BC\-PIK3CA task among image\-based approaches is 0\.7533, whereas the best omics\-based result reaches 0\.7933\. The classical PCA baseline outperforms the learned transcriptomic encoders on most tasks and achieves the highest AUC on LOH \(0\.7794\), PR \(0\.8159\), PIK3CA \(0\.7921\), Subtype \(0\.8955\), and TMB \(0\.7277\)\.
We also compared ROC curves \(Figure [2](https://arxiv.org/html/2606.17115#S3.F2)\) and observed a consistent pattern: image representations show more stable performance with relatively small variance across FMs\. All three omics representation methods \(UCE, PCA, and scVI\) outperform direct modeling on the raw full gene expression profile on BC\-LOH\. However, the foundation model UCE underperforms the non\-foundation\-model approaches scVI and PCA\. The pattern observed in the industrial IH dataset aligns with prior discussions on public datasets, suggesting that PCA can be better suited for capturing biological perturbations than existing omics foundation models \[[8](https://arxiv.org/html/2606.17115#bib.bib38)\]\. This indicates that building effective transcriptomic foundation models remains an open challenge\.
Figure 2: ROC Comparison on BC\-LOH Task\.
## 4 Experiments on Multimodal Fusion
Figure 3: Performance Comparisons of Unimodal \(GeneMLP, HEMIL\) and Multimodal \(MCAT, CONTACT, LateMIL\) Methods\.

The above results show that omics and image modalities each have strengths on different tasks\. We next investigate whether fusing their representations yields additional benefit\. Figure [3](https://arxiv.org/html/2606.17115#S4.F3) compares unimodal and multimodal performance when CONCH is used as the image backbone and PCA/scVI as the omics backbone\. The complete results can be found in Table [5](https://arxiv.org/html/2606.17115#Pt0.A3.T5) in the Appendix\.
Among the fusion strategies, LateMIL is more consistent across tasks than CONTACT and MCAT\. Their comparison with unimodal methods is mixed: multimodal fusion outperforms the best unimodal baseline on some tasks, but there are also cases where it shows no appreciable gain or even underperforms a unimodal model\. For instance, LateMIL achieves the highest AUC on BC\-LOH \(under CONCH\+PCA\) and remains competitive on the NSCLC\-Biopsy Site task\. On the BC\-Subtype task, methods reach broadly comparable accuracy and AUC, with fusion offering marginal gains over the strongest unimodal baseline\. In contrast, on the NSCLC\-TMB task, GeneMLP \(PCA\) attains the highest ACC, with MCAT and LateMIL falling notably below it\. This suggests that when a single modality carries the dominant predictive signal, fusion can weaken rather than strengthen the representation\. The same pattern holds under UNI\+PCA, where the relative ranking of unimodal and multimodal methods again varies across tasks\. These results indicate that multimodal fusion is not universally beneficial and that its utility depends on the relative informativeness of each modality for the target task\.
## 5 Experiments on Uncertainty Quantification
Table 2: Task\-level conformal performance at $\alpha=0.10$, averaged over all models\.
Table 3: Conformal performance by model at $\alpha=0.10$\. Each cell: Coverage / Avg\. Set Size\.
Point predictions are insufficient for high\-stakes oncology decision support systems, as models can still make high\-confidence errors\. Uncertainty quantification \(UQ\) addresses this by providing calibrated confidence estimates alongside each prediction \[[6](https://arxiv.org/html/2606.17115#bib.bib53)\]\. We analyse uncertainty using split conformal prediction \(CP\) \[[3](https://arxiv.org/html/2606.17115#bib.bib39)\], a model\-agnostic framework that wraps any trained classifier and produces a coverage\-guaranteed prediction set $\mathcal{C}(x)$ rather than a single top\-1 class\. We apply it to four tasks spanning diverse label structures: LOH, Biopsy Site, Subtype, and Tumor Site\. We report results at $\alpha=0.10$ \(90% coverage target\); additional $\alpha$ values are in Appendix [0\.C\.1](https://arxiv.org/html/2606.17115#Pt0.A3.SS1)\. We evaluate conformal prediction using three metrics: empirical coverage, average set size, and singleton rate\. Empirical coverage measures how often the true label lies in the prediction set, while average set size and singleton rate quantify how informative and specific those sets are\. Further details are provided in Appendix [0\.B\.4](https://arxiv.org/html/2606.17115#Pt0.A2.SS4)\.
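The split conformal procedure and the three metrics can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact configuration: the nonconformity score is taken to be one minus the softmax probability of the true class (a common choice), the helper names are hypothetical, and all probabilities are toy values.

```python
import math
from typing import Dict, List, Sequence


def conformal_threshold(cal_probs: Sequence[Sequence[float]],
                        cal_labels: Sequence[int],
                        alpha: float) -> float:
    """Calibrate the score threshold on held-out calibration data.

    Assumed nonconformity score: 1 - softmax probability of the true class.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or +inf when the calibration set is too small for the requested alpha.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    return scores[k - 1] if k <= n else float("inf")


def prediction_set(probs: Sequence[float], q_hat: float) -> List[int]:
    """All classes whose nonconformity score does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= q_hat]


def evaluate(test_probs: Sequence[Sequence[float]],
             test_labels: Sequence[int],
             q_hat: float) -> Dict[str, float]:
    """Empirical coverage, average set size, and singleton rate."""
    sets = [prediction_set(p, q_hat) for p in test_probs]
    n = len(sets)
    return {
        "coverage": sum(y in s for s, y in zip(sets, test_labels)) / n,
        "avg_set_size": sum(len(s) for s in sets) / n,
        "singleton_rate": sum(len(s) == 1 for s in sets) / n,
    }


# Toy calibration data (hypothetical): true class is 0, with these probabilities.
true_probs = [0.9, 0.8, 0.85, 0.7, 0.6, 0.95, 0.75, 0.5, 0.4]
cal_probs = [[p, (1 - p) / 2, (1 - p) / 2] for p in true_probs]
cal_labels = [0] * len(true_probs)
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.10)

# Toy test data (hypothetical).
test_probs = [[0.7, 0.2, 0.1], [0.45, 0.45, 0.1], [0.35, 0.2, 0.45]]
test_labels = [0, 1, 0]
metrics = evaluate(test_probs, test_labels, q_hat)
print(q_hat, metrics)
```

The coverage guarantee is marginal and relies on calibration and test samples being exchangeable; the threshold is fitted once on the calibration split and reused unchanged at test time.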
As shown in Table [2](https://arxiv.org/html/2606.17115#S5.T2), all four multiclass tasks achieve mean coverage at or above the nominal 0\.90 target, and none exhibits a negative aggregate coverage gap\. The two NSCLC endpoints deviate most strongly from the nominal target: NSCLC Biopsy Site and NSCLC Tumor Site have mean coverages of 0\.931 and 0\.915, respectively\. This over\-coverage is directly linked to their near\-zero singleton rates; models are rarely confident enough to commit to a single class, so prediction sets routinely span two or three classes, driving coverage well above the nominal level at the cost of efficiency\. This behaviour is consistent with the intrinsic difficulty of discriminating between lung subregions in routine clinical samples, which often yields broader, more overlapping probability distributions and therefore larger, more conservative conformal sets\. The breast cancer \(BC\) tasks are substantially tighter\. BC\-LOH and BC\-Subtype lie within 0\.003 of the target coverage at the aggregate level, with average set sizes of approximately 2\.0 out of three and four classes, respectively, indicating a more favourable reliability\-efficiency trade\-off\. This tighter behaviour aligns with the higher predictive accuracy on BC tasks and suggests that the available image and omics features provide a stronger signal for distinguishing LOH status and molecular subtypes than for separating lung subregions, enabling models to concentrate probability mass more sharply and yield more efficient prediction sets\. The rescue rate \(how often the true label is recovered when the top\-1 prediction is wrong\) reinforces these results\. NSCLC tasks benefit most from larger sets \(87\.3% for Biopsy Site, 84\.8% for Tumor Site\), while BC tasks still recover the true label in over 72% of failures, confirming that conformal prediction provides a meaningful safety net across all tasks\.
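The rescue rate defined above can be computed as in the following sketch. The function name and the toy inputs are hypothetical, and the prediction sets are assumed to have been produced by a conformal procedure beforehand.

```python
from typing import List, Sequence


def rescue_rate(probs: Sequence[Sequence[float]],
                pred_sets: Sequence[Sequence[int]],
                labels: Sequence[int]) -> float:
    """Among samples whose top-1 prediction is wrong, the fraction whose
    prediction set still contains the true label.

    Returns NaN when there are no top-1 failures.
    """
    rescued = failures = 0
    for p, s, y in zip(probs, pred_sets, labels):
        top1 = max(range(len(p)), key=p.__getitem__)
        if top1 != y:
            failures += 1
            rescued += y in s
    return rescued / failures if failures else float("nan")


# Toy example (hypothetical): three of four top-1 predictions are wrong,
# and one of those three failures has the true label in its prediction set.
probs = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
pred_sets: List[List[int]] = [[0], [0, 1], [2], [0, 1]]
labels = [0, 1, 0, 2]
rate = rescue_rate(probs, pred_sets, labels)
print(rate)
```

Because larger prediction sets mechanically raise this quantity, it is best read alongside the average set size rather than on its own.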
Table [3](https://arxiv.org/html/2606.17115#S5.T3) disaggregates these results by model family\. For any given task, the results are stable across families: the spread in average prediction\-set size across the five families is less than 0\.5, indicating that efficiency is largely architecture\-agnostic\. On the BC tasks, where coverage is closest to the nominal target, CONTACT produces the most compact sets but falls marginally below 0\.90 on both LOH and Subtype, with LateMIL also under\-covering slightly on Subtype\. HEMIL is the most conservative family: it attains the highest coverage on BC Subtype \(0\.912\) but at the cost of the largest average set size \(2\.34\), consistent with a more dispersed softmax distribution than other models\. For the NSCLC tasks, all families substantially exceed the target coverage, in line with the task\-level analysis\. The main source of variation is LateMIL, which reaches 0\.942 on NSCLC Biopsy Site but drops to 0\.910 on NSCLC Tumor Site, while GeneMLP is the only family to approach the lower bound on NSCLC Tumor Site\. Taken together, these family\-level patterns suggest that conformal behaviour is largely robust to the choice of backbone: architectural differences induce only modest changes in coverage and set size, which are small compared with the much larger task\-specific effects\.
## 6 Conclusion
We evaluated foundation model representations for multimodal cancer analysis on two real\-world IH cohorts in three parts: unimodal probing, image\-omics representation fusion, and method trustworthiness\. The results show that image FMs transfer reasonably to unseen data, while the classic PCA baseline still matches or even beats omics FMs in performance\. Fusion helps when both modalities contribute to the prediction, but it can underperform when the signal from one modality dominates\. CP provides an architecture\-agnostic uncertainty layer in which task difficulty, rather than model choice, governs prediction\-set size and coverage, with prediction sets recovering the true label in the majority of cases where the top\-1 prediction fails\. Exploring new representation alignment methods and unified multimodal FMs to improve the performance of multimodal methods would be helpful future directions\.
## Acknowledgments
Portions of this research were conducted with the advanced computing resources provided by the CSCoE Converge platform\. We also thank the CSCoE Scientific Computing staff for technical support\. We thank Cyrus Manuel and Evan Liu for generating and providing access to the image embeddings\.
## Impact Statement
By coupling predictive evaluation with conformal uncertainty quantification, the study supports a shift from single point predictions to prediction sets with marginal coverage guarantees that preserve the true diagnosis in the majority of failure cases, reducing the risk of silently discarding the correct label and supporting safer human–AI collaboration in clinician decision support\.
## References
- \[1\] H\. Abdi and L\. J\. Williams \(2010\) Principal component analysis\. Wiley interdisciplinary reviews: computational statistics 2\(4\), pp\. 433–459\. Cited by: [§2](https://arxiv.org/html/2606.17115#S2.p2.1)\.
- \[2\] A\. N\. Angelopoulos, S\. Bates, A\. Fisch, L\. Lei, and T\. Schuster \(2022\) Conformal risk control\. arXiv preprint arXiv:2208\.02814\. Cited by: [§0\.A\.3](https://arxiv.org/html/2606.17115#Pt0.A1.SS3.p2.1)\.
- \[3\] A\. N\. Angelopoulos and S\. Bates \(2021\) A gentle introduction to conformal prediction and distribution\-free uncertainty quantification\. arXiv\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2107.07511), [Link](https://arxiv.org/abs/2107.07511) Cited by: [§0\.A\.3](https://arxiv.org/html/2606.17115#Pt0.A1.SS3.p2.1), [§1](https://arxiv.org/html/2606.17115#S1.p5.1), [§5](https://arxiv.org/html/2606.17115#S5.p1.3)\.
- \[4\] D\. Ardila, A\. P\. Kiraly, S\. Bharadwaj, B\. Choi, J\. J\. Reicher, L\. Peng, D\. Tse, M\. Etemadi, W\. Ye, G\. Corrado, et al\. \(2019\) End\-to\-end lung cancer screening with three\-dimensional deep learning on low\-dose chest computed tomography\. Nature medicine 25\(6\), pp\. 954–961\. Cited by: [§1](https://arxiv.org/html/2606.17115#S1.p1.1)\.
- \[5\] N\. A\. Bagegni, A\. A\. Davis, K\. K\. Clifton, and F\. O\. Ademuyiwa \(2022\-04\) Targeted treatment for high\-risk early\-stage triple\-negative breast cancer: spotlight on pembrolizumab\. Breast Cancer: Targets and Therapy Volume 14, pp\. 113–123\. External Links: ISSN 1179\-1314, [Link](http://dx.doi.org/10.2147/BCTT.S293597), [Document](https://dx.doi.org/10.2147/bctt.s293597) Cited by: [§0\.C\.1](https://arxiv.org/html/2606.17115#Pt0.A3.SS1.p8.4)\.
- \[6\] E\. Begoli, T\. Bhattacharya, and D\. Kusnezov \(2019\-01\) The need for uncertainty quantification in machine\-assisted medical decision making\. Nature Machine Intelligence 1\(1\), pp\. 20–23\. External Links: ISSN 2522\-5839, [Link](http://dx.doi.org/10.1038/s42256-018-0004-1), [Document](https://dx.doi.org/10.1038/s42256-018-0004-1) Cited by: [§5](https://arxiv.org/html/2606.17115#S5.p1.3)\.
- \[7\] E\. Begoli, T\. Bhattacharya, and D\. Kusnezov \(2019\) The need for uncertainty quantification in machine\-learning for clinical applications\. Nature Machine Intelligence 1\(1\), pp\. 20–23\. Cited by: [§0\.A\.3](https://arxiv.org/html/2606.17115#Pt0.A1.SS3.p1.1)\.
- \[8\] I\. Bendidi, S\. Whitfield, K\. Kenyon\-Dean, H\. B\. Yedder, Y\. E\. Mesbahi, E\. Noutahi, and A\. K\. Denton \(2024\) Benchmarking transcriptomics foundation models for perturbation analysis: one pca still rules them all\. arXiv preprint arXiv:2410\.13956\. Cited by: [§3](https://arxiv.org/html/2606.17115#S3.p3.1.1.1)\.
- \[9\] G\. Campanella, M\. G\. Hanna, L\. Geneslaw, A\. Miraflor, V\. Werneck Krauss Silva, K\. J\. Busam, E\. Brogi, V\. E\. Reuter, D\. S\. Klimstra, and T\. J\. Fuchs \(2019\) Clinical\-grade computational pathology using weakly supervised deep learning on whole slide images\. Nature medicine 25\(8\), pp\. 1301–1309\. Cited by: [§1](https://arxiv.org/html/2606.17115#S1.p1.1)\.
- \[10\] D\. Capper, D\. T\. Jones, M\. Sill, V\. Hovestadt, D\. Schrimpf, D\. Sturm, C\. Koelsche, F\. Sahm, L\. Chavez, D\. E\. Reuss, et al\. \(2018\) DNA methylation\-based classification of central nervous system tumours\. Nature 555\(7697\), pp\. 469–474\. Cited by: [§1](https://arxiv.org/html/2606.17115#S1.p1.1)\.
- \[11\] Caris Life Sciences\. MI cancer seek\. Note: carislifesciences\.com\. Accessed May 31, 2026\. External Links: [Link](https://www.carislifesciences.com/physicians/physician-tests/mi-cancer-seek/) Cited by: [§2](https://arxiv.org/html/2606.17115#S2.p1.1.2)\.
- \[12\] Caris Life Sciences\. MI profile\. Note: carislifesciences\.com\. Accessed May 31, 2026\. External Links: [Link](https://www.carislifesciences.com/physicians/physician-tests/mi-profile/) Cited by: [§2](https://arxiv.org/html/2606.17115#S2.p1.1.2)\.
- \[13\] Caris Life Sciences\. MI tumor seek hybrid\. Note: carislifesciences\.com\. Accessed May 31, 2026\. External Links: [Link](https://www.carislifesciences.com/physicians/physician-tests/mi-tumor-seek-hybrid/) Cited by: [§2](https://arxiv.org/html/2606.17115#S2.p1.1.2)\.
- \[14\] A\. Cheerla and O\. Gevaert \(2019\) Deep learning with multimodal representation for pancancer prognosis prediction\. Bioinformatics 35\(14\), pp\. i446–i454\. Cited by: [§0\.A\.2](https://arxiv.org/html/2606.17115#Pt0.A1.SS2.p1.1)\.
- \[15\] H\. Chen, M\. S\. Venkatesh, J\. Gómez Ortega, S\. V\. Mahesh, T\. N\. Nandi, R\. K\. Madduri, K\. Pelka, and C\. V\. Theodoris \(2026\) Scaling and quantization of large\-scale foundation model enables resource\-efficient predictions in network biology\. Nature Computational Science, pp\. 1–14\. Cited by: [§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p1.1)\.
- \[16\] R\. J\. Chen, T\. Ding, M\. Y\. Lu, D\. F\. Williamson, G\. Jaume, A\. H\. Song, B\. Chen, A\. Zhang, D\. Shao, M\. Shaban, et al\. \(2024\) Towards a general\-purpose foundation model for computational pathology\. Nature medicine 30\(3\), pp\. 850–862\. Cited by: [§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p1.1), [§2](https://arxiv.org/html/2606.17115#S2.p2.1)\.
- \[17\] R\. J\. Chen, M\. Y\. Lu, J\. Wang, D\. F\. Williamson, S\. J\. Rodig, N\. I\. Lindeman, and F\. Mahmood \(2020\) Pathomic fusion: an integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis\. IEEE transactions on medical imaging 41\(4\), pp\. 757–770\. Cited by: [§0\.A\.2](https://arxiv.org/html/2606.17115#Pt0.A1.SS2.p1.1), [§1](https://arxiv.org/html/2606.17115#S1.p4.1)\.
- \[18\] R\. J\. Chen, M\. Y\. Lu, W\. Weng, T\. Y\. Chen, D\. F\. Williamson, T\. Manz, M\. Shady, and F\. Mahmood \(2021\) Multimodal co\-attention transformer for survival prediction in gigapixel whole slide images\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 4015–4025\. Cited by: [§0\.A\.2](https://arxiv.org/html/2606.17115#Pt0.A1.SS2.p1.1), [§1](https://arxiv.org/html/2606.17115#S1.p4.1), [§2](https://arxiv.org/html/2606.17115#S2.p2.1)\.
- \[19\] R\. J\. Chen, M\. Y\. Lu, D\. F\. Williamson, T\. Y\. Chen, J\. Lipkova, Z\. Noor, M\. Shaban, M\. Shady, M\. Williams, B\. Joo, et al\. \(2022\) Pan\-cancer integrative histology\-genomic analysis via multimodal deep learning\. Cancer cell 40\(8\), pp\. 865–878\. Cited by: [§0\.A\.2](https://arxiv.org/html/2606.17115#Pt0.A1.SS2.p1.1)\.
- \[20\] H\. Cui, C\. Wang, H\. Maan, K\. Pang, F\. Luo, N\. Duan, and B\. Wang \(2024\) ScGPT: toward building a foundation model for single\-cell multi\-omics using generative ai\. Nature methods 21\(8\), pp\. 1470–1480\. Cited by: [§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p1.1)\.
- \[21\] T\. Dawood, C\. Chen, B\. S\. Sidhu, B\. Ruijsink, J\. Gould, B\. Porter, M\. K\. Elliott, V\. Mehta, C\. A\. Rinaldi, E\. Puyol\-Antón, R\. Razavi, and A\. P\. King \(2023\-08\) Uncertainty aware training to improve deep learning model calibration for classification of cardiac mr images\. Medical Image Analysis 88, pp\. 102861\. External Links: ISSN 1361\-8415, [Link](http://dx.doi.org/10.1016/j.media.2023.102861), [Document](https://dx.doi.org/10.1016/j.media.2023.102861) Cited by: [§0\.B\.4](https://arxiv.org/html/2606.17115#Pt0.A2.SS4.p11.5), [§0\.C\.1](https://arxiv.org/html/2606.17115#Pt0.A3.SS1.p3.10)\.
- \[22\] Z\. Deng, C\. Tang, Z\. Huang, J\. Lin, Y\. Chen, J\. Ning, C\. Ma, J\. Liu, W\. Li, Y\. Zhu, et al\. \(2026\) Project imaging\-x: a survey of 1000\+ open\-access medical imaging datasets for foundation model development\. arXiv preprint arXiv:2603\.27460\. Cited by: [§1](https://arxiv.org/html/2606.17115#S1.p2.1)\.
- \[23\] S\. Dey, C\. R\. S\. Banerji, P\. Basuchowdhuri, S\. K\. Saha, D\. Parashar, and T\. Chakraborti \(2025\-02\) Generating crossmodal gene expression from cancer histopathology improves multimodal ai predictions\. \(arXiv:2502\.00568\)\. Note: arXiv:2502\.00568 \[cs\] External Links: [Link](http://arxiv.org/abs/2502.00568), [Document](https://dx.doi.org/10.48550/arXiv.2502.00568) Cited by: [§0\.A\.3](https://arxiv.org/html/2606.17115#Pt0.A1.SS3.p2.1)\.
- \[24\] T\. Ding, S\. J\. Wagner, A\. H\. Song, R\. J\. Chen, M\. Y\. Lu, A\. Zhang, A\. J\. Vaidya, G\. Jaume, M\. Shaban, A\. Kim, et al\. \(2025\) A multimodal whole\-slide foundation model for pathology\. Nature medicine, pp\. 1–13\. Cited by: [§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p1.1)\.
- \[25\] B\. Duvieusart, F\. Krones, G\. Parsons, L\. Tarassenko, B\. W\. Papież, and A\. Mahdi \(2022\) Multimodal cardiomegaly classification with image\-derived digital biomarkers\. In Annual Conference on Medical Image Understanding and Analysis, pp\. 13–27\. Cited by: [§0\.A\.2](https://arxiv.org/html/2606.17115#Pt0.A1.SS2.p1.1)\.
- \[26\] Flatiron Health \(2025\-03\) Database characterization guide\. Note: Flatiron\.com\. Published March 18, 2025\. Accessed May 31, 2026\. External Links: [Link](https://flatiron.com/database-characterization) Cited by: [§2](https://arxiv.org/html/2606.17115#S2.p1.1.2)\.
- \[27\] X\. Geng, H\. Liu, L\. Lee, D\. Schuurmans, S\. Levine, and P\. Abbeel \(2022\) Multimodal masked autoencoders learn transferable representations\. arXiv preprint arXiv:2205\.14204\. Cited by: [§1](https://arxiv.org/html/2606.17115#S1.p1.1)\.
- \[28\] C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\) On calibration of modern neural networks\. CoRR abs/1706\.04599\. External Links: [Link](http://arxiv.org/abs/1706.04599), 1706\.04599 Cited by: [§0\.B\.4](https://arxiv.org/html/2606.17115#Pt0.A2.SS4.p10.4), [§0\.C\.1](https://arxiv.org/html/2606.17115#Pt0.A3.SS1.p3.10)\.
- \[29\] C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\) On calibration of modern neural networks\. In International Conference on Machine Learning, pp\. 1321–1330\. Cited by: [§0\.A\.3](https://arxiv.org/html/2606.17115#Pt0.A1.SS3.p1.1)\.
- \[30\] K\. Hemker, N\. Simidjievski, and M\. Jamnik \(2024\) HEALNet: multimodal fusion for heterogeneous biomedical data\. Advances in Neural Information Processing Systems 37, pp\. 64479–64498\. Cited by: [§0\.A\.2](https://arxiv.org/html/2606.17115#Pt0.A1.SS2.p1.1)\.
- \[31\] H\. Hermessi, O\. Mourali, and E\. Zagrouba \(2021\) Multimodal medical image fusion review: theoretical background and recent advances\. Signal Processing 183, pp\. 108036\. Cited by: [§0\.A\.2](https://arxiv.org/html/2606.17115#Pt0.A1.SS2.p1.1)\.
- \[32\] S\. Huang, A\. Pareek, S\. Seyyedi, I\. Banerjee, and M\. P\. Lungren \(2020\) Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines\. NPJ digital medicine 3\(1\), pp\. 136\. Cited by: [§0\.A\.2](https://arxiv.org/html/2606.17115#Pt0.A1.SS2.p1.1)\.
- \[33\] S\. Huang, A\. Pareek, R\. Zamanian, I\. Banerjee, and M\. P\. Lungren \(2020\) Multimodal fusion with deep neural networks for leveraging ct imaging and electronic health record: a case\-study in pulmonary embolism detection\. Scientific reports 10\(1\), pp\. 22147\. Cited by: [§0\.A\.2](https://arxiv.org/html/2606.17115#Pt0.A1.SS2.p1.1)\.
- \[34\] G\. Jaume, A\. Vaidya, R\. J\. Chen, D\. F\. Williamson, P\. P\. Liang, and F\. Mahmood \(2024\) Modeling dense multimodal interactions between biological pathways and histology for survival prediction\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 11579–11590\. Cited by: [§0\.A\.2](https://arxiv.org/html/2606.17115#Pt0.A1.SS2.p1.1)\.
- \[35\] W\. Jiao, G\. Atwal, P\. Polak, R\. Karlic, E\. Cuppen, A\. Danyi, J\. de Ridder, C\. van Herpen, M\. P\. Lolkema, et al\. \(2020\) A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns\. Nature communications 11\(1\), pp\. 728\. Cited by: [§1](https://arxiv.org/html/2606.17115#S1.p1.1)\.
- \[36\] M\. Karasikov, J\. van Doorn, N\. Känzig, M\. Erdal Cesur, H\. M\. Horlings, R\. Berke, F\. Tang, and S\. Otálora \(2025\) Training state\-of\-the\-art pathology foundation models with orders of magnitude less data\. In International Conference on Medical Image Computing and Computer\-Assisted Intervention, pp\. 573–583\. Cited by: [§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p2.1)\.
- \[37\] K\. Z\. Kedzierska, L\. Crawford, A\. P\. Amini, and A\. X\. Lu \(2025\) Zero\-shot evaluation reveals limitations of single\-cell foundation models\. Genome Biology 26\(1\), pp\. 101\. Cited by: [§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p2.1)\.
- \[38\] F\. Krones, U\. Marikkar, G\. Parsons, A\. Szmul, and A\. Mahdi \(2025\) Review of multimodal machine learning approaches in healthcare\. Information Fusion 114, pp\. 102690\. Cited by: [§0\.A\.2](https://arxiv.org/html/2606.17115#Pt0.A1.SS2.p1.1)\.
- \[39\] P\. Langley \(2000\) Crafting papers on machine learning\. In Proceedings of the 17th International Conference on Machine Learning \(ICML 2000\), P\. Langley \(Ed\.\), Stanford, CA, pp\. 1207–1216\. Cited by: [§0\.C\.1](https://arxiv.org/html/2606.17115#Pt0.A3.SS1.p9.1)\.
- \[40\] C\. Leibig, V\. Allken, M\. S\. Ayhan, P\. Berens, and S\. Wahl \(2017\) Leveraging uncertainty information from deep neural networks for disease detection\. Scientific Reports 7\(1\), pp\. 17816\. Cited by: [§0\.A\.3](https://arxiv.org/html/2606.17115#Pt0.A1.SS3.p1.1)\.
- \[41\]R\. Lopez, J\. Regier, M\. B\. Cole, M\. I\. Jordan, and N\. Yosef\(2018\)Deep generative modeling for single\-cell transcriptomics\.Nature methods15\(12\),pp\. 1053–1058\.Cited by:[§2](https://arxiv.org/html/2606.17115#S2.p2.1)\.
- \[42\]M\. Y\. Lu, B\. Chen, D\. F\. Williamson, R\. J\. Chen, I\. Liang, T\. Ding, G\. Jaume, I\. Odintsov, L\. P\. Le, G\. Gerber,et al\.\(2024\)A visual\-language foundation model for computational pathology\.Nature medicine30\(3\),pp\. 863–874\.Cited by:[§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.17115#S1.p2.1),[§2](https://arxiv.org/html/2606.17115#S2.p2.1)\.
- \[43\]M\. Y\. Lu, B\. Chen, D\. F\. Williamson, R\. J\. Chen, M\. Zhao, A\. K\. Chow, K\. Ikemura, A\. Kim, D\. Pouli, A\. Patel,et al\.\(2024\)A multimodal generative ai copilot for human pathology\.Nature634\(8033\),pp\. 466–473\.Cited by:[§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p2.1)\.
- \[44\]X\. Ma, L\. Long, S\. Moon, B\. J\. Adamson, and S\. S\. Baxi\(2020\)Comparison of population characteristics in real\-world clinical oncology databases in the us: flatiron health, seer, and npcr\.MedRxiv,pp\. 2020–03\.Cited by:[§2](https://arxiv.org/html/2606.17115#S2.p1.1.2)\.
- \[45\]S\. M\. McKinney, M\. Sieniek, V\. Godbole, J\. Godwin, N\. Antropova, H\. Ashrafian, T\. Back, M\. Chesus, G\. S\. Corrado, A\. Darzi,et al\.\(2020\)International evaluation of an ai system for breast cancer screening\.Nature577\(7788\),pp\. 89–94\.Cited by:[§1](https://arxiv.org/html/2606.17115#S1.p1.1)\.
- \[46\]R\. Naidoo, O\. Fourkioti, M\. D\. Vries, and C\. Bakal\(2024\-06 Oct\)SurvivMIL: a multimodal, multiple instance learning pipeline for survival outcome of neuroblastoma patients\.InProceedings of the MICCAI Workshop on Computational Pathology,F\. Ciompi, N\. Khalili, L\. Studer, M\. Poceviciute, A\. Khan, M\. Veta, Y\. Jiao, N\. Haj\-Hosseini, H\. Chen, S\. Raza, I\. Minhas, N\. Burlutskiy, V\. Vilaplana, B\. Brattoli, H\. Muller, M\. Atzori, S\. Raza, and F\. Minhas \(Eds\.\),Proceedings of Machine Learning Research, Vol\.254,pp\. 131–141\.External Links:[Link](https://proceedings.mlr.press/v254/naidoo24a.html)Cited by:[§0\.B\.2\.2](https://arxiv.org/html/2606.17115#Pt0.A2.SS2.SSS2.p6.6),[§2](https://arxiv.org/html/2606.17115#S2.p2.1)\.
- \[47\]D\. Peeters, N\. Alves, K\. V\. Venkadesh, R\. Dinnessen, Z\. Saghir, E\. T\. Scholten, C\. Schaefer\-Prokop, R\. Vliegenthart, M\. Prokop, and C\. Jacobs\(2024\)Enhancing a deep learning model for pulmonary nodule malignancy risk estimation in chest ct with uncertainty estimation\.European Radiology34\(10\),pp\. 6639–6651\.External Links:ISSN 1432\-1084,[Document](https://dx.doi.org/10.1007/s00330-024-10714-7)Cited by:[§0\.A\.3](https://arxiv.org/html/2606.17115#Pt0.A1.SS3.p1.1)\.
- \[48\]D\. Placido, B\. Yuan, J\. X\. Hjaltelin, C\. Zheng, A\. D\. Haue, P\. J\. Chmura, C\. Yuan, J\. Kim, R\. Umeton, G\. Antell,et al\.\(2023\)A deep learning algorithm to predict risk of pancreatic cancer from disease trajectories\.Nature medicine29\(5\),pp\. 1113–1122\.Cited by:[§1](https://arxiv.org/html/2606.17115#S1.p1.1)\.
- \[49\]Y\. Rosen, Y\. Roohani, A\. Agarwal, L\. Samotorčan, T\. S\. Consortium, S\. R\. Quake, and J\. Leskovec\(2023\)Universal cell embeddings: a foundation model for cell biology\.BioRxiv,pp\. 2023–11\.Cited by:[§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.17115#S1.p2.1),[§2](https://arxiv.org/html/2606.17115#S2.p2.1)\.
- \[50\]A\. G\. Roy, S\. Conjeti, N\. Navab, and C\. Wachinger\(2019\)Bayesian QuickNAT: model uncertainty in deep whole\-brain segmentation for structure\-wise quality control\.NeuroImage195,pp\. 11–22\.Cited by:[§0\.A\.3](https://arxiv.org/html/2606.17115#Pt0.A1.SS3.p1.1)\.
- \[51\]A\. Sellergren, C\. Gao, F\. Mahvar, T\. Kohlberger, F\. Jamil, M\. Traverse, A\. Tono, B\. Sadjad, L\. Yang, C\. Lau,et al\.\(2026\)Medgemma 1\.5 technical report\.arXiv preprint arXiv:2604\.05081\.Cited by:[§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p2.1)\.
- \[52\]A\. Sellergren, S\. Kazemzadeh, T\. Jaroensri, A\. Kiraly, M\. Traverse, T\. Kohlberger, S\. Xu, F\. Jamil, C\. Hughes, C\. Lau,et al\.\(2025\)Medgemma technical report\.arXiv preprint arXiv:2507\.05201\.Cited by:[§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p2.1)\.
- \[53\]G\. Shaikovski, A\. Casson, K\. Severson, E\. Zimmermann, Y\. K\. Wang, J\. D\. Kunz, J\. A\. Retamero, G\. Oakley, D\. Klimstra, C\. Kanan,et al\.\(2024\)Prism: a multi\-modal generative foundation model for slide\-level histopathology\.arXiv preprint arXiv:2405\.10254\.Cited by:[§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p1.1)\.
- \[54\]C\. V\. Theodoris, L\. Xiao, A\. Chopra, M\. D\. Chaffin, Z\. R\. Al Sayed, M\. C\. Hill, H\. Mantineo, E\. M\. Brydon, Z\. Zeng, X\. S\. Liu,et al\.\(2023\)Transfer learning enables predictions in network biology\.Nature618\(7965\),pp\. 616–624\.Cited by:[§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p1.1)\.
- \[55\]E\. Vorontsov, A\. Bozkurt, A\. Casson, G\. Shaikovski, M\. Zelechowski, S\. Liu, K\. Severson, E\. Zimmermann, J\. Hall, N\. Tenenholtz,et al\.\(2023\)Virchow: a million\-slide digital pathology foundation model\.arXiv preprint arXiv:2309\.07778\.Cited by:[§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.17115#S2.p2.1)\.
- \[56\]X\. Wang, S\. Yang, J\. Zhang, M\. Wang, J\. Zhang, W\. Yang, J\. Huang, and X\. Han\(2022\)Transformer\-based unsupervised contrastive learning for histopathological image classification\.Medical image analysis81,pp\. 102559\.Cited by:[§0\.A\.1](https://arxiv.org/html/2606.17115#Pt0.A1.SS1.p1.1)\.
- \[57\]J\. Xiang, X\. Wang, X\. Zhang, Y\. Xi, F\. Eweje, Y\. Chen, Y\. Li, C\. Bergstrom, M\. Gopaulchan, T\. Kim,et al\.\(2025\)A vision–language foundation model for precision oncology\.Nature638\(8051\),pp\. 769–778\.Cited by:[§2](https://arxiv.org/html/2606.17115#S2.p2.1)\.
- \[58\]R\. Zahari, J\. Cox, and B\. Obara\(2023\)Quantifying the uncertainty in 3d ct lung cancer images classification\.In2023 IEEE 13th International Conference on Pattern Recognition Systems \(ICPRS\),Vol\.,pp\. 1–7\.External Links:[Document](https://dx.doi.org/10.1109/ICPRS58416.2023.10179053)Cited by:[§0\.A\.3](https://arxiv.org/html/2606.17115#Pt0.A1.SS3.p1.1)\.
- \[59\]C\. Zhang, Z\. Yang, X\. He, and L\. Deng\(2020\)Multimodal intelligence: representation learning, information fusion, and applications\.IEEE Journal of Selected Topics in Signal Processing14\(3\),pp\. 478–493\.Cited by:[§1](https://arxiv.org/html/2606.17115#S1.p1.1)\.
- \[60\]Q\. Zhang, A\. Gossai, S\. Monroe, N\. C\. Nussbaum, and C\. M\. Parrinello\(2021\)Validation analysis of a composite real\-world mortality endpoint for patients with cancer in the united states\.Health services research56\(6\),pp\. 1281–1287\.Cited by:[§2](https://arxiv.org/html/2606.17115#S2.p1.1.2)\.
## Appendix 0.A Related Work
### 0.A.1 Medical Foundation Models
Foundation models (FMs) for computational pathology can be broadly categorized into tile-level and whole-slide-level approaches. Tile-level FMs learn patch representations from histology crops; representative examples include the contrastive-learning-based CTransPath \[[56](https://arxiv.org/html/2606.17115#bib.bib43)\], the self-supervised UNI \[[16](https://arxiv.org/html/2606.17115#bib.bib30)\], CONtrastive learning from Captions for Histopathology (CONCH) \[[42](https://arxiv.org/html/2606.17115#bib.bib8)\], and Virchow \[[55](https://arxiv.org/html/2606.17115#bib.bib29)\]. Slide-level FMs aggregate tile features into holistic slide embeddings; examples include TITAN \[[24](https://arxiv.org/html/2606.17115#bib.bib9)\] and PRISM \[[53](https://arxiv.org/html/2606.17115#bib.bib44)\]. Among transcriptomic FMs, UCE \[[49](https://arxiv.org/html/2606.17115#bib.bib28)\] learns universal cell embeddings across species, scGPT \[[20](https://arxiv.org/html/2606.17115#bib.bib42)\] pretrains on over 33 million cells, and Geneformer \[[54](https://arxiv.org/html/2606.17115#bib.bib45),[15](https://arxiv.org/html/2606.17115#bib.bib46)\] encodes rank-value gene tokens for network-level prediction.
Beyond using FMs as feature extractors, a growing line of work explores FMs as interactive generators. PathChat \[[43](https://arxiv.org/html/2606.17115#bib.bib47)\] pairs a pathology vision encoder with a large language model to enable conversational diagnosis, while MedGemma \[[52](https://arxiv.org/html/2606.17115#bib.bib51),[51](https://arxiv.org/html/2606.17115#bib.bib50)\] extends biomedical image-text reasoning to broader medical domains. Despite these advances, recent studies \[[37](https://arxiv.org/html/2606.17115#bib.bib49),[36](https://arxiv.org/html/2606.17115#bib.bib48)\] have raised concerns that biomedical FMs can suffer from hallucinations, misdiagnosis, modality misalignment, and, in some settings, representations that underperform simpler or classical baselines. This motivates a closer examination of how FM representations behave on downstream classification tasks.
### 0.A.2 Multimodal Learning for Downstream Tasks
Multimodal learning methods can be categorized into early fusion, intermediate fusion, late fusion, and hybrid approaches, according to the stage at which the modalities are combined \[[38](https://arxiv.org/html/2606.17115#bib.bib76),[32](https://arxiv.org/html/2606.17115#bib.bib67),[31](https://arxiv.org/html/2606.17115#bib.bib68)\]. Early fusion merges raw or minimally processed inputs via feature concatenation or stacking before they enter the main model. For example, \[[25](https://arxiv.org/html/2606.17115#bib.bib66)\] uses XGBoost on image-derived biomarkers together with ICU tabular data (vital signs, laboratory values, and metadata) for cardiomegaly classification. Another practice is to use an unsupervised encoder to compress data from different modalities (e.g., clinical records and gene expression) into a single feature vector before feeding it to a survival prediction network \[[14](https://arxiv.org/html/2606.17115#bib.bib69)\]. Intermediate fusion first extracts modality-specific representations through separate encoders, then combines them in a shared feature space. Common examples include Pathomic Fusion \[[17](https://arxiv.org/html/2606.17115#bib.bib7)\] and MCAT \[[18](https://arxiv.org/html/2606.17115#bib.bib35)\]. These methods learn unimodal feature representations with individual encoders and then fuse them through a fusion mechanism to train the final model \[[38](https://arxiv.org/html/2606.17115#bib.bib76)\]. Late fusion operates at the decision level, combining independent per-modality predictions through voting, averaging, or learned meta-classifiers \[[33](https://arxiv.org/html/2606.17115#bib.bib72)\]. Hybrid methods mix multiple stages; examples include PORPOISE \[[19](https://arxiv.org/html/2606.17115#bib.bib73)\], HEALNet \[[30](https://arxiv.org/html/2606.17115#bib.bib75)\], and SurvPath \[[34](https://arxiv.org/html/2606.17115#bib.bib74)\].
### 0.A.3 Uncertainty Quantification in Medical AI
Deploying machine learning in high-stakes medical settings requires not only high predictive accuracy, but also reliable indicators of when a prediction can be trusted \[[7](https://arxiv.org/html/2606.17115#bib.bib55)\]. Early work applied approximate Bayesian methods, notably Monte Carlo dropout and deep ensembles, to clinical imaging, demonstrating that uncertainty estimates can flag unreliable predictions across a range of tasks, including diabetic retinopathy screening \[[40](https://arxiv.org/html/2606.17115#bib.bib56)\], brain lesion segmentation \[[50](https://arxiv.org/html/2606.17115#bib.bib57)\], and lung cancer and skin lesion analysis \[[58](https://arxiv.org/html/2606.17115#bib.bib64),[47](https://arxiv.org/html/2606.17115#bib.bib65)\]. Calibration has emerged as a complementary concern: models trained with cross-entropy tend to be overconfident, making post-hoc recalibration a standard practice \[[29](https://arxiv.org/html/2606.17115#bib.bib58)\].
Conformal prediction (CP) \[[3](https://arxiv.org/html/2606.17115#bib.bib39)\] provides finite-sample, distribution-free coverage guarantees without assumptions on the underlying model or data-generating process, making it well suited to heterogeneous clinical datasets. Recent work has extended CP from classification to risk-bounded interval estimation; \[[2](https://arxiv.org/html/2606.17115#bib.bib62)\] shows how multimodal survival models can be wrapped to yield prediction intervals with a bounded false-coverage rate, and \[[23](https://arxiv.org/html/2606.17115#bib.bib63)\] applies CP to PathGen-based multimodal predictors to obtain calibrated grade sets and survival-risk intervals in computational pathology and transcriptomics.
## Appendix 0.B Implementation Details
This appendix details the implementation of our benchmark: data preparation and exploratory analysis of the IH data (§[0.B.1](https://arxiv.org/html/2606.17115#Pt0.A2.SS1)), the representation extraction and learning methods used in the experiments (§[0.B.2](https://arxiv.org/html/2606.17115#Pt0.A2.SS2)), the evaluation metrics for utility performance (§[0.B.3](https://arxiv.org/html/2606.17115#Pt0.A2.SS3)), and conformal prediction for uncertainty quantification (§[0.B.4](https://arxiv.org/html/2606.17115#Pt0.A2.SS4)).
### 0.B.1 In-house (IH) Datasets
This subsection describes how the IH-BC and IH-NSCLC cohorts are prepared for our experiments. We first specify notation and preprocessing steps, and then report exploratory statistics that summarize the distribution of the IH dataset.
The data that support the findings of this study were originated by and are the property of Flatiron Health, Inc. and Caris Life Sciences. Requests for data sharing by license or by permission for the specific purpose of replicating results in this manuscript can be submitted to PublicationsDataAccess@flatiron.com and cmdb-caris@flatiron.com.
#### 0.B.1.1 Data Preprocessing
Both the IH-BC and IH-NSCLC datasets follow the format $\mathcal{D}=\{(\mathbf{x}_{i,\text{img}},\,\mathbf{x}_{i,\text{omics}},\,y_{i})\}_{i=1}^{N}$, describing multimodal information from $N$ cases, each indexed by a unique `case_id` $i\in\{1,\ldots,N\}$. Here $y_{i}\in\mathcal{Y}$ is the label for the downstream classification task, where $\mathcal{Y}$ denotes the task-specific label space (enumerated later in Table [4](https://arxiv.org/html/2606.17115#Pt0.A2.T4)). Each case $i$ is linked to the corresponding raw image data $\mathbf{x}_{i,\text{img}}$ and raw omics data $\mathbf{x}_{i,\text{omics}}$. This study uses H&E whole-slide images as image data and TPM RNA gene expression as omics data.
The dataset is split into training, calibration, validation, and test sets with an approximate ratio of 7:3:1:1. Specifically, IH-BC includes 3,747 unique case IDs, with 2,147 training, 1,000 calibration, 300 validation, and 300 test samples. IH-NSCLC includes 3,887 unique case IDs, with 2,287 training, 1,000 calibration, 300 validation, and 300 test samples. To keep the sampled subsets consistent with the original cohorts, we apply a stratified split to preserve data distributions.
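The stratified four-way split can be illustrated with a minimal sketch. It assumes a single label per case and stratification on that label; the section does not state the stratification variable or the tooling, so the helper `stratified_split` and the toy labels below are hypothetical. Only the split sizes are taken from the IH-BC figures above.

```python
import random
from collections import defaultdict

def stratified_split(labels, sizes, seed=42):
    """Split case indices into named subsets while preserving class proportions.

    labels: one class label per case (toy input, not the real cohort).
    sizes:  dict mapping split name -> target number of cases.
    """
    total = sum(sizes.values())
    assert total == len(labels), "split sizes must sum to the cohort size"
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    splits = {name: [] for name in sizes}
    leftovers = []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        start = 0
        for name, size in sizes.items():
            # Floor of this class's proportional share for this split.
            k = len(idxs) * size // total
            splits[name].extend(idxs[start:start + k])
            start += k
        leftovers.extend(idxs[start:])
    # Hand out rounding leftovers so each split reaches its exact target size.
    rng.shuffle(leftovers)
    for name, size in sizes.items():
        need = size - len(splits[name])
        splits[name].extend(leftovers[:need])
        leftovers = leftovers[need:]
    return splits

# Split sizes reported for IH-BC; the labels are synthetic placeholders.
sizes = {"train": 2147, "calibration": 1000, "validation": 300, "test": 300}
labels = ["Primary"] * 2600 + ["Metastatic"] * 1147
splits = stratified_split(labels, sizes)
```

Each split then contains roughly the same class mix as the full cohort, which is the property the stratified split is meant to preserve.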
Table [4](https://arxiv.org/html/2606.17115#Pt0.A2.T4) enumerates all eight downstream classification tasks considered in this study, together with their label spaces $\mathcal{Y}$ and class cardinalities $|\mathcal{Y}|$. For each downstream task, we recompute the class label distribution on the sampled subset and exclude any class that accounts for less than 8% of the samples, as such classes contain too few samples for the model to learn from effectively.
Table 4: Statistics of the IH-BC and IH-NSCLC downstream classification task classes. $|\mathcal{Y}|$ denotes the cardinality of the task-specific label space $\mathcal{Y}$, i.e., the number of distinct classes for that downstream task.

| Dataset | Task | $\lvert\mathcal{Y}\rvert$ | Class Names |
|---|---|---|---|
| IH-BC | Biopsy Site | 2 | Metastatic / Primary |
| IH-BC | Subtypes | 4 | HR-Positive HER2-Low / HR-Positive HER2-Negative / Indeterminate / Triple Negative |
| IH-BC | Biomarker PR | 2 | False / True |
| IH-BC | Biomarker PIK3CA | 2 | False / True |
| IH-BC | LOH | 3 | Equivocal / High / Low |
| IH-NSCLC | Biopsy Site | 3 | Lower lobe, lung / Lung, NOS / Upper lobe, lung |
| IH-NSCLC | Tumor Site | 3 | Lower lobe, lung / Lung, NOS / Upper lobe, lung |
| IH-NSCLC | TMB | 2 | TMB-High / TMB-Low |
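The class-exclusion rule (dropping any class below 8% of the sampled subset) can be sketched as follows. This is an illustrative helper under the assumption that the fraction is computed once on the full subset before filtering; `filter_rare_classes` and the toy labels are hypothetical names, not part of the original pipeline.

```python
from collections import Counter

def filter_rare_classes(labels, min_fraction=0.08):
    """Return indices of cases whose class covers at least min_fraction of the subset."""
    counts = Counter(labels)
    n = len(labels)
    kept = {c for c, k in counts.items() if k / n >= min_fraction}
    keep_idx = [i for i, lab in enumerate(labels) if lab in kept]
    return keep_idx, sorted(kept)

# Toy subset: class "C" holds 7% of the cases and is therefore excluded.
toy_labels = ["A"] * 60 + ["B"] * 33 + ["C"] * 7
keep_idx, kept_classes = filter_rare_classes(toy_labels)
```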
### 0.B.2 Models Implementation
This subsection details the models behind our benchmark. We first describe the frozen image and omics foundation models used to produce case-level representations (§[0.B.2.1](https://arxiv.org/html/2606.17115#Pt0.A2.SS2.SSS1)), then specify the unimodal and multimodal learning methods trained on top of these representations (§[0.B.2.2](https://arxiv.org/html/2606.17115#Pt0.A2.SS2.SSS2)), and finally list the computing resources used for all runs (§[0.B.2.3](https://arxiv.org/html/2606.17115#Pt0.A2.SS2.SSS3)).
#### 0.B.2.1 Representation Generation Models
All representations are extracted with frozen encoders before downstream learning. For each case $i$, the raw image $\mathbf{x}_{i,\text{img}}$ is encoded into an image representation $\mathbf{z}_{i,\text{img}}\in\mathbb{R}^{B_{i}\times d_{\text{img}}}$, a bag of $B_{i}$ tile embeddings, where each tile embedding is a $d_{\text{img}}$-dimensional vector and the bag size $B_{i}$ varies across cases. The raw omics data $\mathbf{x}_{i,\text{omics}}$ is encoded into a case-level omics representation $\mathbf{z}_{i,\text{omics}}\in\mathbb{R}^{d_{\text{omics}}}$.
To obtain the image representation $\mathbf{z}_{i,\text{img}}$, each H&E whole-slide image is divided into tiles, and a pretrained pathology encoder maps each retained tile to a feature vector, forming a bag of $d_{\text{img}}$-dimensional tile embeddings. When multiple slides are available for the same case, all matched slide bags are used during training, while one slide bag is used during validation and testing.
We extract image representations from four foundation models (CONCH, UNI, Virchow, MUSK). CONCH-v1.5 is a pathology vision-language foundation model trained on histology image-text pairs. Virchow is a large pathology foundation model trained at scale on cancer histopathology. UNI2 is a general-purpose pathology encoder for H&E tile representations. MUSK is a multimodal pathology foundation model for image and text representation learning.
We evaluate three omics representation generation methods (UCE, PCA, scVI). UCE is a cell foundation model that generates 1280-dimensional transcriptomic representations. PCA is a dimensionality reduction method that maps the gene expression matrix to 2000 principal components. scVI is a generative model based on a variational autoencoder; it is fitted with 3000 highly variable genes and provides a 1000-dimensional latent representation.
#### 0.B.2.2 Representation Learning Models
Figure [4](https://arxiv.org/html/2606.17115#Pt0.A2.F4) presents the detailed workflow of this work. We consider five representation learning methods: two unimodal methods, HEMIL and GeneMLP, and three multimodal methods, CONTACT, MCAT, and LateMIL. All models are trained on the training split and final results are evaluated on the test split, with the random seed fixed to 42 throughout.
Figure 4: The Detailed Workflow
HEMIL is an H&E-based unimodal method that applies gated-attention multiple instance learning to the image representation bag $\mathbf{z}_{i,\text{img}}$. Each tile vector is projected by a linear layer with widths $\langle d_{\text{img}},256\rangle$, followed by GELU and dropout. The attention module has two parallel linear maps of widths $\langle 256,256\rangle$, one followed by tanh and the other by sigmoid; their elementwise product is mapped to one attention logit per tile, and a softmax normalizes the logits across all tiles. The bag representation is the attention-weighted sum of the projected tile vectors. The classifier head has widths $\langle 256,128,|\mathcal{Y}|\rangle$, with LayerNorm, GELU, and dropout after the hidden layer. HEMIL is trained with weight decay $1\times 10^{-3}$ and dropout 0.4 for 100 epochs. We apply early stopping with patience 20 on validation accuracy.
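The gated-attention pooling step described for HEMIL can be sketched in plain Python. This is a minimal illustration of the pooling arithmetic only: it omits the input projection, GELU, dropout, bias terms, and the classifier head, and the weight matrices `V`, `U`, the vector `w`, and the toy tiles are arbitrary placeholders rather than trained parameters.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gated_attention_pool(tiles, V, U, w):
    """Pool a bag of tile vectors with gated attention.

    tiles: list of B projected tile vectors, each of length d.
    V, U:  (h x d) weights of the tanh and sigmoid branches.
    w:     length-h vector mapping gated features to one logit per tile.
    """
    logits = []
    for h in tiles:
        t = [math.tanh(a) for a in matvec(V, h)]
        g = [1.0 / (1.0 + math.exp(-a)) for a in matvec(U, h)]
        # Elementwise product of the two branches, mapped to a scalar logit.
        logits.append(sum(wi * ti * gi for wi, ti, gi in zip(w, t, g)))
    # Softmax over tiles (shifted by the max for numerical stability).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    attn = [e / z for e in exps]
    d = len(tiles[0])
    # Bag representation: attention-weighted sum of the tile vectors.
    bag = [sum(a * h[j] for a, h in zip(attn, tiles)) for j in range(d)]
    return bag, attn

tiles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.5, -0.5], [0.3, 0.8]]
U = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, -1.0]
bag, attn = gated_attention_pool(tiles, V, U, w)
```

Because the attention weights form a probability distribution over tiles, the bag vector is a convex combination of the tile vectors, and the weights themselves indicate which tiles drive the prediction.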
GeneMLP is an omics-based multilayer perceptron that receives the omics representation $\mathbf{z}_{i,\text{omics}}$ and has widths $\langle d_{\text{omics}},512,256,128,|\mathcal{Y}|\rangle$, with LayerNorm, ReLU, and dropout 0.2 after each hidden linear layer. GeneMLP is trained with weight decay $1\times 10^{-4}$ for 200 epochs. We apply early stopping with patience 20 on the validation set.
CONTACT is a multimodal fusion strategy that concatenates the omics representation with every image tile embedding, then applies a fusion encoder with widths $\langle d_{\text{img}}+d_{\text{omics}},\,512\rangle$, with LayerNorm, GELU, and dropout after each linear layer. The fused tile bag is passed to the same MIL structure as HEMIL. CONTACT is trained with weight decay $1\times 10^{-2}$ for 100 epochs, with early stopping (patience 8) on the validation set.
MCAT is a co-attention fusion model that exchanges information between image and omics tokens before MIL pooling. The image input projection maps each tile with widths $\langle d_{\text{img}},512\rangle$ using LayerNorm, GELU, and dropout; the omics input projection maps $\mathbf{z}_{i,\text{omics}}$ to width 512. Learnable modality embeddings are added to both streams, and learnable positional embeddings are added to the omics tokens. The encoder stacks 2 co-attention blocks: in each block, image tokens attend to omics tokens via multi-head attention, and omics tokens attend back via another multi-head attention (attention dropout 0.1); tile and omics self-attention are disabled by default. Each stream then passes through an FFN with hidden width 2048. The final image tokens are normalized and mapped with widths $\langle 512,\,\max(d_{\text{img}},d_{\text{omics}})\rangle$ with LayerNorm, GELU, and dropout. The fused tile bag is classified by the same architecture as HEMIL. MCAT is trained with weight decay $1\times 10^{-2}$ for 100 epochs, dropout 0.5, omics feature dropout 0.2, label smoothing 0.05, and gradient clipping at max norm 1.0. Early stopping is applied with patience 8 on the validation set.
LateMIL is a late-fusion variant from \[[46](https://arxiv.org/html/2606.17115#bib.bib34)\] that combines WSI logits and omics logits. The WSI branch follows a dual-stream MIL structure: the instance classifier is a linear map with widths $\langle d_{\text{img}},|\mathcal{Y}|\rangle$ producing tile-level class logits, and the bag classifier constructs a query for each tile via a two-layer network with widths $\langle d_{\text{img}},128,128\rangle$, using ReLU after the first linear layer and tanh after the second. For each class, the tile with the highest instance logit is selected as the important class tile; attention scores between all tile queries and the important class tile queries are scaled and normalized over tiles, with an identity value path by default. The class-specific bag representations are passed through a convolution with kernel size $d_{\text{img}}$ to produce WSI bag logits. The omics branch matches the architecture of GeneMLP (with LayerNorm, ReLU, and dropout 0.2). The final prediction is a fixed weighted sum of the two branch logits, with a default weight of 0.5 for each branch. The optimizer uses separate parameter groups with WSI learning rate $2\times 10^{-4}$ and omics learning rate $2\times 10^{-3}$, and weight decay $1\times 10^{-4}$, for 100 epochs.
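The final late-fusion step of LateMIL, a fixed weighted sum of the two branch logits, can be written out as a short sketch. The branch logits below are made-up numbers, and `late_fusion` is a hypothetical helper name; the two branches themselves (dual-stream MIL and GeneMLP) are not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def late_fusion(wsi_logits, omics_logits, w_wsi=0.5, w_omics=0.5):
    """Combine per-branch class logits with fixed weights (default 0.5 each)."""
    fused = [w_wsi * a + w_omics * b for a, b in zip(wsi_logits, omics_logits)]
    return fused, softmax(fused)

# Toy example: the branches disagree, and the fused logits favour class 1.
wsi_logits = [2.0, 0.0, -1.0]
omics_logits = [0.0, 3.0, -1.0]
fused, probs = late_fusion(wsi_logits, omics_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

With equal weights the fused logits are the average of the two branches, so a confident branch can override a weakly opposed one, as in the toy example.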
#### 0.B.2.3 Computing Resources
All analyses and modelling computations are performed on the data licensor’s licensed compute platform\. The modality learning methods are run in parallel on an HPC platform with 8 A10G GPUs with 24 GB memory each, 196 CPU cores, and 2 TB system memory\.
### 0.B.3 Evaluation Metrics
Utility is evaluated on the held-out test set for each downstream classification task. We report accuracy (ACC) and area under the ROC curve (AUC) as scalar metrics, and use ROC curves to show the threshold-dependent trade-off between true positive rate (TPR) and false positive rate (FPR). Let $N$ be the number of test cases, $y_{i}$ the true label, $\hat{y}_{i}$ the predicted label, and $p_{ic}$ the predicted probability assigned to class $c$.
##### Accuracy \(ACC\)\.
Accuracy measures the fraction of test cases whose predicted label matches the ground-truth label, $\operatorname{ACC}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\{\hat{y}_{i}=y_{i}\}$.
##### ROC curves\.
ROC curves evaluate probability scores over all possible classification thresholds. For a given class $c\in\mathcal{Y}$, cases with $y_{i}=c$ are treated as positives and all other cases are treated as negatives. At threshold $\tau$, the true positive rate (TPR) and false positive rate (FPR) are defined as
$$\operatorname{TPR}_{c}(\tau)=\frac{\sum_{i=1}^{N}\mathbf{1}\{y_{i}=c,\ p_{ic}\geq\tau\}}{\sum_{i=1}^{N}\mathbf{1}\{y_{i}=c\}},\qquad\operatorname{FPR}_{c}(\tau)=\frac{\sum_{i=1}^{N}\mathbf{1}\{y_{i}\neq c,\ p_{ic}\geq\tau\}}{\sum_{i=1}^{N}\mathbf{1}\{y_{i}\neq c\}}.$$
The ROC curve is obtained by plotting TPR against FPR as $\tau$ varies. Binary tasks use the positive-class ROC curve. Multi-class tasks use a macro one-vs-rest ROC curve, $\overline{\operatorname{TPR}}(u)=\frac{1}{|\mathcal{Y}|}\sum_{c\in\mathcal{Y}}\operatorname{TPR}_{c}(u)$, after each class-specific ROC curve is reparameterised as $\operatorname{TPR}_{c}(u)$ by inverting the monotone non-increasing relation $u=\operatorname{FPR}_{c}(\tau)$, which eliminates $\tau$. The diagonal line $\operatorname{TPR}=\operatorname{FPR}$ is included as the random-ranking reference.
##### Area under the ROC curve \(AUC\)\.
AUC is the area under the ROC curve and summarizes how well the model ranks positive cases above negative cases using predicted probabilities rather than hard labels. For class $c\in\mathcal{Y}$, $\operatorname{AUC}_{c}=\int_{0}^{1}\operatorname{TPR}_{c}(u)\,du$, where $u$ denotes the false positive rate. For binary tasks, AUC is computed from the positive-class probability, following the positive class specified per task in Table [4](https://arxiv.org/html/2606.17115#Pt0.A2.T4). For multi-class tasks, we use a one-vs-rest macro average, $\operatorname{AUC}_{\mathrm{macro}}=\frac{1}{|\mathcal{Y}|}\sum_{c\in\mathcal{Y}}\operatorname{AUC}_{c}$, where $\mathcal{Y}$ is the task label space. This gives equal weight to each class and avoids letting larger classes dominate the reported AUC.
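The ROC and AUC definitions above translate into a short threshold sweep followed by trapezoidal integration. The sketch below is a plain-Python illustration, not the evaluation code used for the reported results; the function names and toy scores are hypothetical, and it assumes every class has at least one positive and one negative case.

```python
def roc_points(y_true, scores, positive):
    """Return (FPR, TPR) pairs as the threshold sweeps from high to low."""
    pos = sum(1 for y in y_true if y == positive)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == positive and s >= tau)
        fp = sum(1 for y, s in zip(y_true, scores) if y != positive and s >= tau)
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Integrate TPR over FPR with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def macro_ovr_auc(y_true, probs, classes):
    """One-vs-rest AUC per class, averaged with equal class weights."""
    aucs = []
    for k, c in enumerate(classes):
        class_scores = [p[k] for p in probs]
        aucs.append(auc_trapezoid(roc_points(y_true, class_scores, c)))
    return sum(aucs) / len(aucs)

# Binary toy case: one of the four positive/negative pairs is mis-ranked.
binary_auc = auc_trapezoid(roc_points([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1], positive=1))

# Three-class toy case in which every class is ranked perfectly.
macro_auc = macro_ovr_auc(
    ["a", "b", "c"],
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
    classes=["a", "b", "c"],
)
```

In the binary example three of the four positive-negative pairs are ordered correctly, which matches the pairwise-ranking interpretation of AUC.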
### 0.B.4 Conformal Quantification
We evaluate our classifiers in an uncertainty-aware setting using split conformal prediction at miscoverage levels $\alpha\in\{0.05,0.10,0.20\}$. For each $\alpha$, prediction sets are formed by including all classes whose nonconformity score does not exceed a calibrated threshold, guaranteeing marginal coverage $\Pr(y\in\mathcal{C}(x))\geq 1-\alpha$.
The nonconformity score for a sample $x$ and class $c$ is defined as
$$s(x,c)=1-p_{\theta}(c\mid x),\tag{1}$$
where $p_{\theta}(c\mid x)$ is the softmax probability assigned to class $c$. On the calibration set, each score is evaluated at the true label, i.e. $s_{i}=s(x_{i},y_{i})$. Given calibration scores $\{s_{i}\}_{i=1}^{n_{\mathrm{cal}}}$, the conformal threshold $\hat{q}$ is computed as the empirical quantile of the calibration scores at level $q$, where
$$q=\frac{\left\lceil(n_{\mathrm{cal}}+1)(1-\alpha)\right\rceil}{n_{\mathrm{cal}}}\tag{2}$$
is the finite-sample corrected quantile level. For a test sample $x$, class $c$ is then included in the prediction set $\mathcal{C}(x)$ if and only if
$$s(x,c)\leq\hat{q}.\tag{3}$$
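Equations (1) to (3) can be sketched as a calibration step plus a set-construction step. The following is a minimal illustration under the assumption that the quantile at level $q$ is taken as the $\lceil(n_{\mathrm{cal}}+1)(1-\alpha)\rceil$-th smallest calibration score; the function names and the tiny calibration set are hypothetical.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate q_hat from softmax outputs and true labels of the calibration set."""
    # Nonconformity score at the true label: s_i = 1 - p(y_i | x_i).
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic
    if k > n:
        # Too few calibration points for this alpha: every class is admitted.
        return float("inf")
    return scores[k - 1]

def prediction_set(probs, q_hat):
    """Include every class whose score 1 - p(c | x) does not exceed q_hat."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= q_hat]

# Toy calibration set with four cases and two classes.
cal_probs = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.6, 0.4]]
cal_labels = [0, 1, 0, 1]
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)

confident_set = prediction_set([0.9, 0.1], q_hat)
ambiguous_set = prediction_set([0.5, 0.5], q_hat)
```

A confident prediction yields a singleton set, while an ambiguous one yields both classes, which is how the prediction set can still contain the true label when the point prediction is wrong.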
We evaluate conformal prediction results using the following metrics\.
Coverage measures the fraction of test samples for which the true label falls inside the prediction set, and is the primary metric for verifying that the nominal guarantee is satisfied,
$$\mathrm{Coverage}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{y_{i}\in\mathcal{C}(x_{i})\},\tag{4}$$
where $n$ is the number of test samples.
Target coverage is the nominal coverage level imposed by the choice of miscoverage rate $\alpha$, and it is used as the reference against which empirical coverage is compared: $\mathrm{Target}=1-\alpha$.
Coverage gap quantifies the deviation of empirical coverage from the nominal target, $\mathrm{CoverageGap}=\mathrm{Coverage}-(1-\alpha)$. Positive values indicate over-coverage; negative values indicate a violation of the guarantee.
Average prediction set sizemeasures prediction efficiency, how many classes must the model include, on average, to achieve the required coverage\. Smaller values indicate sharper, more informative predictions:
AvgSetSize=1n∑i=1n\|𝒞\(xi\)\|\.\\mathrm\{AvgSetSize\}=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\lvert\\mathcal\{C\}\(x\_\{i\}\)\\rvert\.\(5\)
Singleton ratemeasures the fraction of samples for which the model is confident enough to return exactly one class\. High singleton rates indicate decisive, well\-concentrated predictions:
SingletonRate=1n∑i=1n𝟏\{\|𝒞\(xi\)\|=1\}\.\\mathrm\{SingletonRate\}=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\mathbf\{1\}\\\{\\lvert\\mathcal\{C\}\(x\_\{i\}\)\\rvert=1\\\}\.\(6\)
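The set-level metrics in Eqs. (4)–(6), together with the target and coverage gap, reduce to a few counting operations. A minimal sketch follows; the function name, dictionary keys, and toy prediction sets are hypothetical and chosen only for illustration.

```python
def conformal_metrics(pred_sets, labels, alpha):
    """Coverage, target, gap, average set size and singleton rate."""
    n = len(labels)
    coverage = sum(y in s for s, y in zip(pred_sets, labels)) / n
    return {
        "coverage": coverage,                                       # Eq. (4)
        "target": 1 - alpha,
        "coverage_gap": coverage - (1 - alpha),
        "avg_set_size": sum(len(s) for s in pred_sets) / n,          # Eq. (5)
        "singleton_rate": sum(len(s) == 1 for s in pred_sets) / n,   # Eq. (6)
    }

# Toy example: four test samples, three classes.
pred_sets = [{0}, {0, 1}, {2}, {1, 2}]
labels = [0, 1, 1, 2]
m = conformal_metrics(pred_sets, labels, alpha=0.10)
print(m)
```

In this toy case the third sample is missed, so coverage is 0.75 and the gap against the 0.90 target is negative.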
We also report the exact achievable coverage level $\delta_{\mathrm{exact}}$, which differs from the nominal target $\delta_{\mathrm{target}}=1-\alpha$ due to the discreteness of the calibration quantile:
$$\delta_{\mathrm{exact}}=\frac{\left\lceil(n_{\mathrm{cal}}+1)\,\delta_{\mathrm{target}}\right\rceil}{n_{\mathrm{cal}}+1}.\tag{7}$$
The slack $\delta_{\mathrm{exact}}-\delta_{\mathrm{target}}$ quantifies the unavoidable over-coverage introduced by finite-sample rounding, while the empirical gap $\mathrm{Coverage}-\delta_{\mathrm{target}}$ measures the total deviation from the nominal target. Together, these diagnostics allow us to distinguish genuine coverage violations from artefacts of finite calibration sets.
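To give a feel for the size of the rounding slack in Eq. (7), the sketch below evaluates it for a hypothetical calibration set of 100 samples. The calibration-set size is an assumption made for illustration (it is not taken from the paper), as are the function name and the floating-point tolerance.

```python
import math

def exact_coverage(n_cal, alpha):
    """Exact achievable coverage level from Eq. (7)."""
    target = 1 - alpha
    # The tiny tolerance (an assumption) guards the ceiling against float noise.
    k = math.ceil((n_cal + 1) * target - 1e-9)
    return k / (n_cal + 1)

n_cal = 100  # hypothetical calibration-set size
delta_exact = exact_coverage(n_cal, alpha=0.10)
slack = delta_exact - (1 - 0.10)
print(delta_exact, slack)
```

Here $\lceil 101\times 0.9\rceil=91$, so $\delta_{\mathrm{exact}}=91/101\approx 0.901$ and the slack is about one tenth of a percentage point; the slack can never exceed $1/(n_{\mathrm{cal}}+1)$.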
*Expected Calibration Error (ECE)* \[[28](https://arxiv.org/html/2606.17115#bib.bib40)\] quantifies how well the model's predicted probabilities agree with empirical accuracies. Test samples are partitioned into $M$ bins $\beta_1,\dots,\beta_M$ according to their predicted confidence, and the ECE is defined as
$$\mathrm{ECE}=\sum_{m=1}^{M}\frac{|\beta_m|}{n}\left|\mathrm{acc}(\beta_m)-\mathrm{conf}(\beta_m)\right|,\tag{8}$$
where $|\beta_m|$ is the number of samples in bin $\beta_m$, $\mathrm{acc}(\beta_m)$ and $\mathrm{conf}(\beta_m)$ are the accuracy and the mean confidence within that bin, and $n$ is the total number of samples. This measures the average, sample-weighted discrepancy between predicted confidence and realised accuracy across bins.
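Eq. (8) can be sketched with equal-width confidence bins. The binning scheme, the number of bins, and the use of the top-1 probability as the confidence are assumptions here, since the text does not specify them; the function name is hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins [m/M, (m+1)/M); the last bin also holds 1.0."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing to the sum
        acc = sum(hit for _, hit in b) / len(b)
        avg_conf = sum(conf for conf, _ in b) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)
    return ece

# Toy example: top-1 confidence and whether the top-1 prediction was correct.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(ece)
```

Both occupied bins are off by 0.05 (one under-confident, one over-confident), and each holds half the samples, so the toy ECE is 0.05.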
*Brier score* \[[21](https://arxiv.org/html/2606.17115#bib.bib41)\] provides a complementary measure of probabilistic accuracy and calibration. It is defined as
$$\mathrm{Brier}=\frac{1}{N}\sum_{i=1}^{N}\lVert p_i-y_i\rVert_2^{2},\tag{9}$$
where $N$ is the number of samples, $p_i$ is the predicted probability vector for sample $i$ (in the multi-class case), and $y_i$ is the corresponding one-hot encoded ground-truth label, so that the squared term sums the squared differences over classes. Lower Brier scores indicate that the predicted probabilities are closer to the observed outcomes, reflecting better-calibrated and more accurate probabilistic predictions.
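A minimal sketch of the multi-class Brier score in Eq. (9), reading the squared term as the squared Euclidean distance between the probability vector and the one-hot label. That reading is an assumption, though it is consistent with $p_i$ and $y_i$ being vectors; the function name and toy inputs are hypothetical.

```python
def multiclass_brier(probs, labels):
    """Mean over samples of the summed squared differences across classes."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_c - (1.0 if c == y else 0.0)) ** 2
                     for c, p_c in enumerate(p))
    return total / len(labels)

# Toy example: one confident correct prediction, one confident wrong one.
probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
labels = [0, 1]
brier = multiclass_brier(probs, labels)
print(brier)
```

Under this convention the score lies in $[0,2]$: the first toy sample contributes 0.14, the second 1.46, giving a mean of 0.80.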
To quantify the practical advantage of conformal prediction over a single point estimate, we compute the *rescue rate*: among all test samples with an incorrect top-1 prediction, the fraction for which the true class still appears in the conformal prediction set ($\alpha=0.10$). A high rescue rate indicates that, even when the model is wrong, the correct answer is preserved among the plausible candidates, reducing the risk of silently discarding the true diagnosis in a clinical workflow.
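The rescue rate defined above can be computed directly from the top-1 predictions and the prediction sets. The sketch below uses hypothetical names and toy data; returning NaN when there are no top-1 errors is a design choice made here, not something the text specifies.

```python
def rescue_rate(top1_preds, pred_sets, labels):
    """Among samples with a wrong top-1 prediction, the fraction whose
    prediction set still contains the true class."""
    wrong = [(s, y) for t, s, y in zip(top1_preds, pred_sets, labels) if t != y]
    if not wrong:
        return float("nan")  # undefined when the point predictor makes no errors
    return sum(y in s for s, y in wrong) / len(wrong)

# Toy example: three of the four top-1 predictions are wrong.
top1_preds = [0, 1, 2, 0]
pred_sets = [{0}, {1, 2}, {0, 2}, {0}]
labels = [0, 2, 0, 1]
rate = rescue_rate(top1_preds, pred_sets, labels)
print(rate)
```

Two of the three misclassified toy samples keep the true class in their set, so the rescue rate is 2/3.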
## Appendix C: Full Results
Table [5](https://arxiv.org/html/2606.17115#Pt0.A3.T5) shows detailed results for all multimodal techniques with different image and omics representation backbones.
Overall, models on BC tasks show better predictive performance than models on NSCLC tasks. This difference may reflect multiple factors such as task definition, label noise, and demographic composition. A subgroup analysis would be needed to isolate demographic effects, because the BC models are effectively built on a more demographically homogeneous population than the NSCLC models. Owing to the nature of the disease, all patients in the BC dataset we extracted are female, so the model is trained and evaluated on a single-gender cohort. In contrast, the NSCLC dataset contains a more balanced distribution of male and female patients, but demographic information is excluded from modelling. Similar clinical presentations may correspond to different classification labels across genders, which would make the prediction task inherently more challenging. Incorporating demographic information into the modelling process and evaluating variations in predictive performance across subgroups would be beneficial for building more robust predictive models.
Across downstream tasks, we observe that task labels anchored in tissue context tend to favor image representations, whereas task labels anchored in molecular state tend to favor omics representations. For instance, the image-based HEMIL probe overall outperforms the omics-based GeneMLP probe on the Biopsy Site identification tasks for both the NSCLC and BC datasets. This may be because biopsy site is primarily determined by surrounding tissue context and tumor morphology, both of which favor image-based probing. By contrast, biomarker-status tasks, such as BC-PR, target molecular states whose primary substrate is transcriptional, and therefore favor omics-based probing. Accordingly, GeneMLP outperforms HEMIL on BC-PR status prediction.
Table 5: Multimodal fusion performance on IH-BC and IH-NSCLC tasks. Each cell reports ACC/AUC.
Figure [5](https://arxiv.org/html/2606.17115#Pt0.A3.F5) reports macro-averaged ROC curves of the two unimodal baselines (GeneMLP for omics, HEMIL for WSI) across four representative tasks under different encoder backbones. Overall, the patterns across tasks are consistent with those observed in the main text. The various image-based foundation models show stable predictive performance. For omics representations, all three encoded versions outperform using raw gene expression directly. That said, the omics foundation model UCE generally underperforms PCA and scVI. This is likely because UCE is pre-trained and generates representations without any further training, whereas PCA and scVI are fitted on the training set before producing representations. Fine-tuning omics foundation models on domain-specific data in future work may help them generate more informative embeddings.
Figure 5: ROC curves of unimodal baselines across different tasks (BC-Subtype, BC-LOH, NSCLC-TMB, NSCLC-Biopsy). Here 'RAW' refers to directly feeding the full set of gene TPM values as omics features.
### C.1 Conformal Prediction
Figure [6](https://arxiv.org/html/2606.17115#Pt0.A3.F6) extends the single-level analysis to the full range of target coverages. The same task-level structure observed at $\alpha=0.10$ is evident across all three panels: the BC tasks track the nominal targets closely, whereas the NSCLC tasks consistently overshoot. A likely explanation is that their harder multi-site discrimination task leads to more variable, higher nonconformity scores and thus larger, more conservative prediction sets.
At $\alpha=0.05$ (target $0.95$), BC configurations cluster tightly between $0.95$ and $0.97$. The only under-coverage, and a marginal one, is GeneMLP on NSCLC Tumour Site (coverage $0.9471$, gap $-0.003$), while LateMIL on NSCLC Biopsy Site shows the strongest over-coverage ($0.9776$, gap $+0.028$; Table [6](https://arxiv.org/html/2606.17115#Pt0.A3.T6)). At $\alpha=0.10$ (target $0.90$), BC families again form a narrow band around the target, whereas NSCLC families generally overshoot by roughly 2–4 percentage points; GeneMLP on NSCLC Tumour Site ($0.8986$, gap $-0.001$) is the only near-miss. At $\alpha=0.20$ (target $0.80$), NSCLC over-coverage remains high, while CONTACT and MCAT on BC LOH dip slightly below target (gaps $-0.010$ and $-0.012$), making this level the most prone to under-coverage on BC tasks (Table [6](https://arxiv.org/html/2606.17115#Pt0.A3.T6)). The relative ordering of model families within each task remains stable and the spread of coverages does not grow systematically with $\alpha$, indicating that the qualitative conclusions from the $\alpha=0.10$ setting generalise across the coverage sweep.
Figure 6: Empirical coverage across models and miscoverage levels $\alpha\in\{0.05,0.10,0.20\}$ (target coverage $1-\alpha$). Each point is the mean coverage averaged over encoder configurations for a given model and task.
Table [6](https://arxiv.org/html/2606.17115#Pt0.A3.T6) also reports probabilistic calibration via the expected calibration error (ECE) \[[28](https://arxiv.org/html/2606.17115#bib.bib40)\] and the multi-class Brier score \[[21](https://arxiv.org/html/2606.17115#bib.bib41)\], both of which are independent of $\alpha$ and therefore reflect intrinsic model properties rather than the choice of coverage level. CONTACT and MCAT are consistently well calibrated, maintaining ECE values in the range $0.04$–$0.07$ across all four tasks, whereas LateMIL and GeneMLP are notably less reliable, with ECE reaching $0.144$ and $0.160$ on the BC tasks and peaking at $0.204$ for GeneMLP on NSCLC Tumour Site. The Brier score follows the same pattern. NSCLC tasks have substantially higher values ($0.62$–$0.77$) than BC tasks ($0.46$–$0.59$), consistent with their increased difficulty, and within each task the ordering of families matches the ECE ranking, with CONTACT and MCAT achieving the lowest Brier scores and GeneMLP the highest.
These probabilistic calibration results mirror the conformal results. Models with broad, poorly concentrated softmax distributions, as indicated by high ECE and Brier scores, require larger thresholds $\hat{q}$ to include the true class, increasing prediction-set sizes and producing over-coverage. This is most evident for GeneMLP on NSCLC Tumour Site, where high ECE ($0.204$) and Brier score ($0.773$) coincide with the largest prediction sets in that task (average size $\approx 2.58$–$2.77$ across $\alpha$ levels; Table [6](https://arxiv.org/html/2606.17115#Pt0.A3.T6)) and with the only marginal under-coverage at $\alpha=0.05$ (coverage $0.947$, gap $-0.003$). Conversely, MCAT on NSCLC Biopsy Site combines the lowest ECE in that task ($0.044$) with the smallest prediction sets, illustrating how better probabilistic calibration translates directly into tighter and more informative prediction sets.
Table 6: Conformal prediction metrics per task and model family across coverage levels ($\alpha\in\{0.05,0.10,0.20\}$). Each $\alpha$-block reports: empirical coverage (Cov.), coverage gap $\Delta=\text{cov}-(1-\alpha)$ (Gap), average prediction-set size (Set), and singleton rate (Sing.). Fixed columns (right): expected calibration error (ECE), multi-class Brier score, and top-1 accuracy (Acc.), which are independent of $\alpha$. Results are averaged over encoder configurations.
Figure [7](https://arxiv.org/html/2606.17115#Pt0.A3.F7) illustrates the top-1 error, which counts how often the single most likely class is wrong, and the conformal miss rate, which counts how often the true label falls outside the prediction set $\mathcal{C}(x)$. The difference between the two reflects the uncertainty absorbed by the prediction set: cases where the classifier misclassifies the top label but the conformal wrapper still includes the true class. Across the BC tasks (top row), top-1 error ranges from approximately $35\%$ to $43\%$ depending on the model family, yet the conformal miss rate remains close to the nominal $10\%$ target for all five families. The hatched bars are short and nearly identical, indicating that the choice of architecture has virtually no impact on coverage reliability at this level. The NSCLC tasks (bottom row) show a different pattern. Top-1 error increases to $52$–$60\%$, reflecting the harder multi-site discrimination, while conformal miss rates fall below the target to about $7$–$8\%$, consistent with the systematic over-coverage reported in Table [3](https://arxiv.org/html/2606.17115#S5.T3). The gap between solid and hatched bars is therefore largest on the NSCLC tasks, where the conformal wrapper absorbs substantially more uncertainty, covering many samples that the point predictor misclassifies. This gap is also highly consistent across model families, suggesting that the score distributions of all architectures are broad enough that the true label typically remains inside $\mathcal{C}(x)$ regardless of backbone choice.
Figure 7: Error decomposition by task and model family. Each panel shows one of the four multiclass tasks; within each panel, bars are grouped by model family (colour). Solid bars represent the top-1 classification error ($1-\text{accuracy}$); hatched bars represent the conformal miss rate ($1-\text{coverage}$) at $\alpha=0.10$. The gap between the two bars quantifies the uncertainty absorbed by the prediction set: samples misclassified by the point predictor that are nonetheless covered by $\mathcal{C}(x)$ because the true class retains a softmax score of at least $1-\hat{q}$.
From the rescue rates in Table [6](https://arxiv.org/html/2606.17115#Pt0.A3.T6), we observe that across all four multiclass tasks and all model configurations, $79.7\%$ of misclassified samples were *rescued*, meaning that the conformal set still contained the correct label even when the single-best prediction was wrong. The rescue rate varied by clinical task: $72.6\%$ for BC Subtype, $75.7\%$ for BC LOH, $87.3\%$ for NSCLC Biopsy Site, and $84.8\%$ for NSCLC Tumour Site. These numbers indicate that, in the large majority of failure cases, a hard classifier would commit to an incorrect label, whereas conformal prediction would still flag the true diagnosis as a plausible outcome and defer the final decision to a human reviewer.
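The reported rescue rates can be cross-checked against the approximate top-1 error and miss-rate ranges quoted for Figure 7. The sketch below is a rough consistency check, not part of the authors' pipeline: it assumes that every uncovered sample is also a top-1 error (a correctly predicted top class is never dropped from the set), and it uses only the rounded ranges stated in the text, so the outputs are indicative bounds rather than exact values.

```python
def implied_rescue_rate(top1_error, miss_rate):
    """Share of top-1 errors still covered by the prediction set, assuming
    all uncovered samples are also top-1 errors (an assumption)."""
    if top1_error <= 0:
        raise ValueError("top1_error must be positive")
    return (top1_error - miss_rate) / top1_error

# Approximate ranges quoted in the text at alpha = 0.10.
bc_range = (implied_rescue_rate(0.35, 0.10), implied_rescue_rate(0.43, 0.10))
nsclc_range = (implied_rescue_rate(0.52, 0.08), implied_rescue_rate(0.60, 0.07))
print(bc_range, nsclc_range)
```

Under these rough inputs the implied ranges are about 71–77% for the BC tasks and 85–88% for the NSCLC tasks, which bracket the task-level rescue rates reported above.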
Two patient-level examples from the BC Subtype task illustrate the clinical impact of this behaviour. In the first example, a HEMIL model assigned a softmax probability of $0.368$ to HR+/HER2-Low and $0.364$ to the true class HR+/HER2-Negative, a very small margin of $\Delta p=0.004$. A standard classifier would pick HR+/HER2-Low and be wrong. A false HR+/HER2-Low label would wrongly suggest that the patient is eligible for trastuzumab deruxtecan (T-DXd), which is approved for HER2-low breast cancer (see, for example, the FDA approval summary for fam-trastuzumab deruxtecan-nxki (Enhertu): [https://www\.fda\.gov/drugs/resources\-information\-approved\-drugs/fda\-approves\-fam\-trastuzumab\-deruxtecan\-nxki\-her2\-low\-breast\-cancer](https://www.fda.gov/drugs/resources-information-approved-drugs/fda-approves-fam-trastuzumab-deruxtecan-nxki-her2-low-breast-cancer)), and could send them down an inappropriate treatment pathway. By contrast, the conformal prediction set at $\hat{q}=0.860$ was $\mathcal{C}(x)=\{\text{HR+/HER2-Low},\,\text{HR+/HER2-Neg},\,\text{Indeterminate}\}$, a three-class output that clearly shows HER2 status is uncertain and that additional IHC or FISH testing is needed before deciding on treatment.
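The membership rule in Eq. (3) can be checked by hand for this patient. The short worked calculation below uses only the reported values; the bounds on the Indeterminate probability are an inference from set membership and from the probabilities summing to at most one, not a figure reported in the paper.

```latex
\begin{align*}
s(x,c)\le\hat{q}
  &\iff p_{\theta}(c\mid x)\ \ge\ 1-\hat{q} = 1-0.860 = 0.140,\\
p_{\theta}(\text{HR+/HER2-Low}\mid x) &= 0.368 \ \ge\ 0.140 \quad\Rightarrow\quad \text{included},\\
p_{\theta}(\text{HR+/HER2-Neg}\mid x) &= 0.364 \ \ge\ 0.140 \quad\Rightarrow\quad \text{included},\\
0.140 \ \le\ p_{\theta}(\text{Indeterminate}\mid x) &\le 1-0.368-0.364 = 0.268 .
\end{align*}
```

Any class with softmax probability of at least $0.140$ therefore enters the set, which is why two near-tied classes and a third moderately probable class all appear together.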
In the second example, the same model assigned a probability of $0.336$ to Indeterminate and $0.331$ to the true class, Triple Negative, again a tiny margin ($\Delta p\approx 0.005$). Triple Negative Breast Cancer (TNBC) is defined by the absence of estrogen receptor, progesterone receptor, and HER2 overexpression, and is managed very differently from an Indeterminate case. Patients with TNBC may receive intensive systemic therapies such as chemotherapy. An Indeterminate prediction would instead trigger further biomarker workup and delay the start of TNBC-directed therapy, which is problematic given the aggressive course of the disease \[[5](https://arxiv.org/html/2606.17115#bib.bib52)\]. The conformal prediction set $\mathcal{C}(x)=\{\text{HR+/HER2-Low},\,\text{Indeterminate},\,\text{Triple Neg.}\}$ keeps Triple Negative as an active diagnostic option alongside Indeterminate, encouraging clinicians to pursue both possibilities in parallel rather than committing to a single ambiguous label. Taken together, these examples show that conformal prediction sets do more than improve statistical coverage: they translate model uncertainty into concrete clinical prompts for individual patients.