Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction

arXiv cs.AI Papers

Summary

This paper presents a data-efficient anatomy-aware benchmark for cardiac pathology prediction on the ACDC MRI dataset, showing that under limited labels, anatomical representation matters more than model complexity.

arXiv:2606.06509v1 Announce Type: cross Abstract: Numerous medical imaging problems must be solved under limited labels and constrained compute, yet it remains unclear whether performance gains are driven mainly by more expressive models or by better representation of clinically meaningful anatomy. We study this question through a low-data anatomy-aware benchmark for 5-class cardiac pathology prediction on the public ACDC MRI dataset. Using segmentation-derived patient descriptors from the right ventricle, myocardium, and left ventricle, we compare anatomy-specific and multi-structure representations across linear, kernel, and tree-based classifiers. We find that under limited label settings, representation dominates complexity. These results suggest that in resource-constrained healthcare settings, identifying and representing the most informative anatomy may matter more than the increasing complexity of the model alone.

Cached at: 06/08/26, 09:15 AM

# Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction
Source: [https://arxiv.org/html/2606.06509](https://arxiv.org/html/2606.06509)
###### Abstract

Numerous medical imaging problems must be solved under limited labels and constrained compute, yet it remains unclear whether performance gains are driven mainly by more expressive models or by better representation of clinically meaningful anatomy\. We study this question through a low\-data anatomy\-aware benchmark for 5\-class cardiac pathology prediction on the public ACDC MRI dataset\. Using segmentation\-derived patient descriptors from the right ventricle, myocardium, and left ventricle, we compare anatomy\-specific and multi\-structure representations across linear, kernel, and tree\-based classifiers\. We find that under limited label settings, representation dominates complexity\. These results suggest that in resource\-constrained healthcare settings, identifying and representing the most informative anatomy may matter more than the increasing complexity of the model alone\.

Machine Learning, ICML

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.06509v1/ACDC_MRI_data-4.png)

Figure 1: Representation before complexity. Our benchmark asks whether, under limited labels, selecting the right anatomical representation matters more than increasing model complexity. We illustrate this by decomposing a representative short-axis cardiac MR image into RV-only, MYO-only, LV-only, and ALL-structures views, which form the basis of the anatomy ablation study. Background cardiac MR image adapted from Adomat et al. ([2026](https://arxiv.org/html/2606.06509#bib.bib48)).

In the low-label medical imaging regime (Jin et al., [2026](https://arxiv.org/html/2606.06509#bib.bib15); Zhou et al., [2021](https://arxiv.org/html/2606.06509#bib.bib16)), researchers often attempt to improve performance using more complex models. However, it remains unclear whether the real bottleneck is the complexity of the model (Varoquaux and Cheplygina, [2022](https://arxiv.org/html/2606.06509#bib.bib17)) or how the clinical structure is represented. In cardiac imaging (Buja and Butany, [2022](https://arxiv.org/html/2606.06509#bib.bib18); Counseller and Aboelkassem, [2023](https://arxiv.org/html/2606.06509#bib.bib19); Flachskampf et al., [2015](https://arxiv.org/html/2606.06509#bib.bib20)), this question is especially relevant because pathology is expressed through anatomically meaningful structure rather than arbitrary input variation.

At the same time, in AI for medical imaging, practical bottlenecks often lie not only in model design but also in data preparation, annotation, and deployment infrastructure (Willemink et al., [2020](https://arxiv.org/html/2606.06509#bib.bib36)). These constraints are especially acute in resource-constrained healthcare settings, where limited radiology infrastructure can hinder the adoption of compute-intensive AI pipelines (Yousef and Schmollgruber, [2024](https://arxiv.org/html/2606.06509#bib.bib38)).

In the present work, inspired by previous benchmark-driven progress in medical imaging (Bernard et al., [2018](https://arxiv.org/html/2606.06509#bib.bib9); Blagec et al., [2023](https://arxiv.org/html/2606.06509#bib.bib39)), we address this question through a reproducible low-data benchmark built on the Automated Cardiac Diagnosis Challenge (ACDC) MRI dataset (Bernard et al., [2018](https://arxiv.org/html/2606.06509#bib.bib9)). The intuition behind our benchmark is illustrated in [Figure 1](https://arxiv.org/html/2606.06509#S1.F1), where a representative cardiac MR image is decomposed into structure-specific views to ask which anatomy carries the dominant predictive signal.

This question is also well motivated by prior cardiac MRI diagnosis pipelines built on segmentation-derived representations. In particular, Isensee et al. ([2017](https://arxiv.org/html/2606.06509#bib.bib26)) and Khened et al. ([2019](https://arxiv.org/html/2606.06509#bib.bib27)) combined segmentation outputs with clinically inspired handcrafted features for automatic disease assessment on ACDC, while Zheng et al. ([2019](https://arxiv.org/html/2606.06509#bib.bib30)) showed that explainable cardiac pathology classification can be achieved by combining shape-related features with motion characterization. These works demonstrate that anatomically grounded descriptors can support interpretable pathology prediction; our focus is to understand, under limited labels, which anatomical structures carry the dominant predictive signal and how much that matters relative to classifier complexity.

#### Our Contributions\.

We construct an anatomy-aware benchmark for low-data cardiac pathology prediction using ACDC segmentation masks. We then show that myocardial morphology is the strongest single-structure source of predictive signal, while the multi-structure anatomical representation yields the best overall performance. Finally, we show that simple handcrafted inter-phase delta features do not improve over static multi-structure descriptors, and we validate the benchmark using label-shuffle controls, confusion analysis, and patient-level visualization.

## 2 Anatomy-Aware Benchmark Setup & Models

### 2.1 Dataset and Task

We consider the balanced 5-class pathology setting on ACDC, whose classes are dilated cardiomyopathy (DCM), hypertrophic cardiomyopathy (HCM), myocardial infarction (MINF), normal subjects (NOR), and abnormal right ventricle (RV). Each class contains 20 patients, giving a total of 100 subjects. For brevity, we use the abbreviations DCM, HCM, MINF, NOR, and RV throughout the remainder of the paper. For each patient, we use the annotated segmentation masks provided at the labeled cardiac phases and build patient-level features from the three principal anatomical structures: the right ventricle (RV), the myocardium (MYO), and the left ventricle (LV).

#### Task\.

Let $\{(x_i, y_i)\}_{i=1}^{N}$ denote the patient-level dataset, where $y_i \in \{1, \dots, K\}$ is the pathology label and $x_i$ denotes the cardiac MRI study for patient $i$; $N$ is the number of patients and $K$ denotes the number of classes. From each study, we construct anatomy-aware feature representations

$$\phi_r(x_i) \in \mathbb{R}^{d_r}, \qquad r \in \mathcal{R},$$

where $\mathcal{R} \coloneqq \{\text{RV-only},~\text{MYO-only},~\text{LV-only},~\text{ALL}\}$. For each representation $r$, we train classifiers $f_\theta^{(r)} \in \mathcal{F}$, where $\mathcal{F}$ includes linear, kernel, and tree-based models, to predict

$$\hat{y}_i = f_\theta^{(r)}(\phi_r(x_i)).$$

Our goal is not only to maximize predictive performance, but also to identify which anatomical representation $r$ contributes the strongest signal under limited labels and whether the variation across representations is larger than the variation across classifier families.

### 2.2 Patient-Level Anatomical Features

To isolate the structure-specific signal, we define four primary feature configurations: RV-only, MYO-only, LV-only, and ALL-structures, where the latter concatenates descriptors from all three anatomical compartments; see [Figure 2](https://arxiv.org/html/2606.06509#S2.F2) for a representative visual.

![Refer to caption](https://arxiv.org/html/2606.06509v1/fig_feature_configurations.png)

Figure 2: A representative visual for the primary anatomical feature configurations used in the benchmark.

[Figure 2](https://arxiv.org/html/2606.06509#S2.F2) is constructed from a representative ACDC subject by selecting a labeled cardiac phase and displaying the slice with the largest total annotated area. The same underlying MRI slice is then shown in four structure-selection settings: RV-only, MYO-only, LV-only, and ALL-structures.

For each labeled frame and each anatomical structure, we extract simple shape descriptors from the binary segmentation mask, including area, area fraction, aspect ratio, principal-axis statistics, elongation, compactness, circularity, extent, and radial distance summaries. Slice-wise features are aggregated to the patient level by mean and standard deviation across the slices, together with the number of slices containing the structure.
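The descriptor extraction and patient-level aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `slice_descriptors` and `patient_features`, the feature-naming scheme, and the exact formulas (for example, elongation as the square root of the ratio of principal-axis variances) are assumptions, and perimeter-based descriptors such as compactness and circularity are omitted.

```python
import numpy as np

def slice_descriptors(mask):
    """Shape descriptors for one binary 2D mask (assumed definitions)."""
    mask = np.asarray(mask, dtype=bool)
    area = int(mask.sum())
    if area == 0:
        return None  # structure absent on this slice
    rows, cols = np.nonzero(mask)
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    # Distances from the centroid to every foreground pixel.
    radial = np.hypot(rows - rows.mean(), cols - cols.mean())
    # Principal-axis variances from the covariance of pixel coordinates.
    if area > 1:
        eigvals = np.linalg.eigvalsh(np.cov(np.vstack([rows, cols])))
        minor, major = np.clip(eigvals, 0.0, None)
    else:
        minor = major = 0.0
    # Degenerate shapes get NaN, left for downstream imputation.
    elongation = float(np.sqrt(major / minor)) if minor > 0 else float("nan")
    return {
        "area": float(area),
        "area_fraction": area / mask.size,
        "aspect_ratio": float(width / height),
        "extent": float(area / (height * width)),
        "elongation": elongation,
        "radial_mean": float(radial.mean()),
        "radial_std": float(radial.std()),
    }

def patient_features(masks, prefix):
    """Aggregate slice-wise descriptors by mean and std, plus the slice count."""
    per_slice = [d for d in map(slice_descriptors, masks) if d is not None]
    feats = {f"{prefix}_n_slices": float(len(per_slice))}
    if not per_slice:
        return feats
    for key in per_slice[0]:
        values = np.array([d[key] for d in per_slice])
        feats[f"{prefix}_{key}_mean"] = float(values.mean())
        feats[f"{prefix}_{key}_std"] = float(values.std())
    return feats
```

Under this sketch, an ALL-structures vector would simply merge the dictionaries produced with the prefixes `RV`, `MYO`, and `LV`, while a single-structure configuration keeps only one of them.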

### 2.3 Evaluation Protocol

We evaluate models using accuracy, balanced accuracy, and macro-F1 under 5-fold stratified cross-validation, following standard multiclass classification metrics (Sokolova and Lapalme, [2009](https://arxiv.org/html/2606.06509#bib.bib32); Powers, [2020](https://arxiv.org/html/2606.06509#bib.bib33)); see [Appendix A](https://arxiv.org/html/2606.06509#A1) for their definitions. All preprocessing steps, including median imputation and feature standardization when required, are fit within each training fold and then applied to the corresponding validation fold. For label-fraction experiments, we additionally repeat random subsampling at each fraction and report the mean and standard deviation across repeated runs.
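The fold-wise protocol above, in which imputation and standardization statistics come from the training fold only, can be sketched as below. This is an illustrative NumPy-only version under stated assumptions: the helper names are hypothetical, the fold assignment is a simple per-class round-robin, and `nearest_centroid` is a stand-in classifier used only to keep the example self-contained. It is not one of the paper's three model families.

```python
import numpy as np

def stratified_folds(y, n_splits=5, seed=0):
    """Assign each sample to a fold so every fold has a similar class mix."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    folds = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        folds[idx] = np.arange(len(idx)) % n_splits
    return folds

def fit_preprocessor(X_train):
    """Learn median imputation and standardization from training rows only."""
    median = np.nanmedian(X_train, axis=0)
    filled = np.where(np.isnan(X_train), median, X_train)
    mean = filled.mean(axis=0)
    std = filled.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return median, mean, std

def apply_preprocessor(X, params):
    """Apply previously fitted statistics without refitting them."""
    median, mean, std = params
    filled = np.where(np.isnan(X), median, X)
    return (filled - mean) / std

def nearest_centroid(X_train, y_train, X_val):
    """Stand-in classifier: predict the class with the closest mean vector."""
    labels = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in labels])
    dists = np.linalg.norm(X_val[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)]

def cross_validate(X, y, fit_predict, n_splits=5, seed=0):
    """Return per-fold accuracy with preprocessing refit inside each fold."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    folds = stratified_folds(y, n_splits, seed)
    scores = []
    for k in range(n_splits):
        train, val = folds != k, folds == k
        params = fit_preprocessor(X[train])  # validation rows never seen here
        y_hat = fit_predict(apply_preprocessor(X[train], params), y[train],
                            apply_preprocessor(X[val], params))
        scores.append(float(np.mean(y_hat == y[val])))
    return scores
```

The key design point is that `fit_preprocessor` only ever receives training rows, so no validation statistics leak into the imputation medians or scaling parameters.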

### 2.4 Models

We compare three lightweight model families: multinomial logistic regression as a linear baseline (Hosmer Jr et al., [2013](https://arxiv.org/html/2606.06509#bib.bib22); Hastie, [2009](https://arxiv.org/html/2606.06509#bib.bib31)), RBF-SVM as a nonlinear kernel baseline (Cortes and Vapnik, [1995](https://arxiv.org/html/2606.06509#bib.bib10)), and random forest as a tree-based nonlinear baseline (Breiman, [2001](https://arxiv.org/html/2606.06509#bib.bib11)). Additional details of the models are given in [Table 1](https://arxiv.org/html/2606.06509#A1.T1) ([Appendix A.2](https://arxiv.org/html/2606.06509#A1.SS2)). Beyond aggregate predictive performance, we evaluate label efficiency, anatomy ablation, dynamic-feature enhancement, sanity checks, robustness, and feature-level interpretability.

## 3 Results

Our results address four complementary questions: whether the benchmark remains informative under limited labels, which anatomical structure carries the strongest predictive signal, whether simple inter\-phase dynamic summaries add value beyond static anatomy\-aware features, and whether the observed gains survive basic sanity checks\.

### 3.1 Label Efficiency

We begin by evaluating whether segmentation-derived anatomical descriptors carry a meaningful pathological signal in the low-data regime. Across repeated label-fraction sweeps, performance remains above chance ($1/5 = 0.2$, shown by the blue dashed line in [Figure 3](https://arxiv.org/html/2606.06509#S3.F3)) and improves gradually as additional labels are incorporated. This shows that the benchmark is neither trivial nor noise-dominated, and that simple patient-level anatomical features already support nontrivial multiclass prediction. A lightweight end-to-end ResNet-18 (He et al., [2016](https://arxiv.org/html/2606.06509#bib.bib49)) baseline trained on representative raw MRI slices performed substantially worse than all three anatomy-aware baselines ([Appendix E](https://arxiv.org/html/2606.06509#A5), [Table 2](https://arxiv.org/html/2606.06509#A5.T2)), reinforcing the value of explicit anatomical representation in the low-label regime.

![Refer to caption](https://arxiv.org/html/2606.06509v1/x1.png)

Figure 3: Balanced accuracy across label fractions for 5-class ACDC pathology prediction using the all-structure representation.

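A label-fraction sweep of the kind summarized in Figure 3 can be sketched as follows. The sketch assumes stratified subsampling of the training labels and a user-supplied `evaluate` callback that trains on the chosen indices and returns a score; both are assumptions for illustration, since the text specifies only repeated random subsampling with mean and standard deviation reported.

```python
import numpy as np

def subsample_labels(y, fraction, rng):
    """Pick a per-class random subset of indices at the given label fraction."""
    y = np.asarray(y)
    chosen = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        n_keep = max(1, int(round(fraction * len(idx))))  # keep at least one
        chosen.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(chosen))

def label_fraction_sweep(y_train, fractions, evaluate, n_repeats=10, seed=0):
    """Map each fraction to (mean, std) of the score over repeated subsamples."""
    rng = np.random.default_rng(seed)
    summary = {}
    for frac in fractions:
        scores = [evaluate(subsample_labels(y_train, frac, rng))
                  for _ in range(n_repeats)]
        summary[frac] = (float(np.mean(scores)), float(np.std(scores)))
    return summary
```

Here `evaluate` would typically fit a classifier on the selected training patients and score it on the held-out fold, so the sweep composes naturally with cross-validation.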
### 3.2 Which Anatomy Matters?

We now turn to the anatomy ablation study. Among single-structure feature sets, MYO-only performs best, substantially outperforming both LV-only and RV-only representations, while the full multi-structure representation performs best overall. This suggests that the pathology signal is not distributed uniformly throughout the anatomy. Instead, myocardial morphology appears to concentrate the strongest single-structure information in this benchmark.

From [Figure 4](https://arxiv.org/html/2606.06509#S3.F4), we see that the myocardial descriptors are the strongest single-structure feature set, while the combined multi-structure representation performs best overall. In particular, the gain from moving from RV-only to MYO-only is much larger than the gain from switching among linear, kernel, and tree-based classifiers once the representation is fixed. A finer-grained decomposition of feature importance by anatomical structure and descriptor family is provided in [Figure 13](https://arxiv.org/html/2606.06509#A4.F13) ([Appendix D](https://arxiv.org/html/2606.06509#A4)), further supporting the dominant role of myocardial descriptors.

This result is important for two reasons. First, on scientific grounds, it suggests that the benchmark is driven less by generic whole-heart geometry and more by structure-specific morphology, with the myocardium serving as the most informative individual component. Second, on practical grounds, it provides an informative lesson for low-resource medical ML: when building simplified or compute-efficient pipelines under limited labels, the myocardium may be the most valuable single anatomical target to prioritize. The all-structure representation still performs best, but the MYO result identifies where much of the predictive signal is already concentrated.

![Refer to caption](https://arxiv.org/html/2606.06509v1/x2.png)

Figure 4: Cross-validated balanced accuracy across anatomical feature sets for 5-class ACDC pathology prediction. Myocardium is the strongest single-structure representation, while combining RV, myocardium, and LV yields the best overall performance.

### 3.3 Do Explicit Dynamic Features Help?

To assess whether simple cardiac phase dynamics add information beyond static anatomical representation, we augment the full feature set with explicit inter-phase delta and ratio descriptors. These additions do not materially improve over the static multi-structure representation. We interpret this negative result cautiously.

One possibility is that ACDC pathology groups are already strongly reflected in static morphology, particularly in myocardial structure, so that simple phase\-difference summaries contribute little additional information\. Another possibility is that our handcrafted dynamic descriptors are too compressed to preserve the richer spatial deformation patterns present across phases\. Thus, our result should not be read as evidence that dynamics are uninformative in general; rather, it shows that simple low\-dimensional inter\-phase summaries do not outperform already\-strong anatomy\-aware static representations in this benchmark\.
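To make the notion of inter-phase delta and ratio descriptors concrete, the sketch below augments two static feature vectors with their difference and ratio. It assumes the two labeled phases are end-diastole (ED) and end-systole (ES), which are the annotated phases in ACDC; the function name `dynamic_features`, the ordering of the output, and the small `eps` stabilizer are illustrative choices rather than details from the paper.

```python
import numpy as np

def dynamic_features(ed, es, eps=1e-8):
    """Concatenate static ED/ES features with inter-phase delta and ratio terms."""
    ed = np.asarray(ed, dtype=float)
    es = np.asarray(es, dtype=float)
    delta = es - ed          # absolute change between phases
    ratio = es / (ed + eps)  # relative change; eps guards against division by zero
    return np.concatenate([ed, es, delta, ratio])
```

Each static descriptor contributes only two extra scalars here, which illustrates how compressed such summaries are compared with the full spatial deformation between phases.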

As a sanity check, under random label permutation the balanced accuracy drops from $0.870 \pm 0.057$ to $0.230 \pm 0.057$, which is close to the chance level of $0.2$ for a balanced 5-class task. This supports the interpretation that the observed gains arise from a genuine anatomical signal rather than leakage or spurious shortcut cues in the dataset.
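A label-permutation control of this kind can be sketched as follows. The example is a minimal illustration on arbitrary features: it uses a leave-one-out nearest-centroid classifier as a stand-in model (an assumption, not one of the paper's classifiers) and reports plain accuracy, which coincides with balanced accuracy when all classes have the same size, as they do here because permuting labels preserves class counts.

```python
import numpy as np

def loo_nearest_centroid_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in model)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    labels = np.unique(y)
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i  # hold out sample i
        centroids = np.stack([X[keep & (y == c)].mean(axis=0) for c in labels])
        pred = labels[np.linalg.norm(centroids - X[i], axis=1).argmin()]
        hits += int(pred == y[i])
    return hits / len(y)

def label_shuffle_control(X, y, n_permutations=20, seed=0):
    """Compare the real-label score with scores under randomly permuted labels."""
    rng = np.random.default_rng(seed)
    real = loo_nearest_centroid_accuracy(X, y)
    shuffled = [loo_nearest_centroid_accuracy(X, rng.permutation(y))
                for _ in range(n_permutations)]
    return real, float(np.mean(shuffled)), float(np.std(shuffled))
```

If the pipeline leaked label information through preprocessing or feature construction, the permuted-label score would stay well above chance, which is what the control is designed to detect.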

### 3.4 Why Does MYO Matter?

A plausible explanation for the MYO result is that several ACDC pathologies are expressed strongly through the morphology of the myocardial wall rather than through chamber geometry. MYO descriptors therefore capture clinically meaningful variation in shape, extent, circularity, elongation, and radial-distance structure, making the myocardium the most informative single anatomical compartment in the benchmark. [Figure 5](https://arxiv.org/html/2606.06509#S3.F5) helps explain the anatomy ablation result quantitatively by aggregating feature importance at the structure level.

![Refer to caption](https://arxiv.org/html/2606.06509v1/x3.png)

Figure 5: Grouped feature importance by anatomical structure. Summed absolute logistic-regression coefficients are highest for myocardium, reinforcing the quantitative ablation result that MYO is the strongest single-structure source of predictive signal.
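The structure-level aggregation behind Figure 5 can be sketched as a sum of absolute coefficients per anatomical group. The sketch assumes a coefficient matrix of shape (classes, features) from a fitted multinomial logistic regression and feature names prefixed with `RV_`, `MYO_`, or `LV_`; the naming scheme and the choice to sum over classes are assumptions for illustration.

```python
import numpy as np

def grouped_importance(coef, feature_names, groups=("RV", "MYO", "LV")):
    """Sum |coefficient| over classes and features, grouped by name prefix."""
    coef = np.abs(np.atleast_2d(np.asarray(coef, dtype=float)))
    per_feature = coef.sum(axis=0)  # total magnitude of each feature across classes
    totals = {group: 0.0 for group in groups}
    for name, value in zip(feature_names, per_feature):
        prefix = name.split("_", 1)[0]
        if prefix in totals:
            totals[prefix] += float(value)
    return totals
```

Because coefficient magnitudes depend on feature scale, a comparison like this is only meaningful when the features were standardized before fitting, as in the evaluation protocol of Section 2.3.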

## 4 Discussion and Limitations

Our key finding is that, in this low-data cardiac pathology benchmark, the dominant factor is the anatomical representation rather than the complexity of the classifier. Our reproducible benchmark experiments suggest that kernel and tree-based models provide limited gains beyond a strong anatomy-aware representation, while explicit handcrafted dynamic summaries add little further improvement. Our study also has limitations: we consider a single public dataset, rely on handcrafted segmentation-derived descriptors rather than raw-image end-to-end learning, and study dynamics only through simple inter-phase summaries. Future work can extend this framework to additional datasets, uncertainty-aware analysis, richer temporal descriptors, and external validation across institutions.

## 5 Conclusion

We empirically show that across anatomy ablations, model comparisons, dynamic\-feature tests, and sanity checks, the representation choice matters more than classifier complexity\. These results suggest that in low\-label structured medical learning, identifying the right anatomical factor may be more important than choosing a more expressive classifier\.

## Acknowledgments

The author acknowledges the kind effort of Payal Ghosh (PG) in making [Figure 1](https://arxiv.org/html/2606.06509#S1.F1). The author also thanks PG for helpful discussions and suggestions that were incorporated in the paper.

## Impact Statement

Our work has a broader impact in the design of practical medical AI systems\.

Disease-dependent anatomy-aware modeling. More generally, different cardiac conditions can shift the anatomical priority, suggesting that future low-label benchmarks should identify and emphasize structures that carry the most clinically meaningful information for the task at hand. This perspective may be relevant for Global South settings, where data analysis capacity is limited and where anatomically focused representations may offer a more practical path toward deployable medical AI.

Data-efficient medical AI. Our central message suggests that practical gains in medical AI may come not only from larger architectures but also from choosing the right anatomical representation. Recent work in radiology and medical imaging has emphasized that limited data, limited expert labeling, and limited annotation resources present challenges for real-world model development (Candemir et al., [2021](https://arxiv.org/html/2606.06509#bib.bib40); Willemink et al., [2020](https://arxiv.org/html/2606.06509#bib.bib36)).

Resource-constrained healthcare settings. Our results support a complementary actionable principle in resource-constrained healthcare settings: instead of assuming that better performance requires heavier end-to-end models, it may be more effective to identify the anatomical structures that carry the clinically meaningful signal and represent them explicitly.

## References

- F. Adomat, C. Schaub, T. Hoh, X. Fischer, R. Guggenberger, R. Manka, M. Eberhard, and L. Weber (2026). Cardiac MR function analysis with DL-based super resolution reconstruction: application in the clinical setting. The International Journal of Cardiovascular Imaging, pp. 1–11. [Link](https://link.springer.com/article/10.1007/s10554-026-03642-8)
- O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester, et al. (2018). Deep Learning Techniques for Automatic MRI Cardiac Multi-Structures Segmentation and Diagnosis: Is the Problem Solved? IEEE Transactions on Medical Imaging 37(11), pp. 2514–2525. [Link](https://ieeexplore.ieee.org/document/8360453/)
- K. Blagec, J. Kraiger, W. Frühwirt, and M. Samwald (2023). Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals. Journal of Biomedical Informatics 137, pp. 104274. [Link](https://www.sciencedirect.com/science/article/pii/S1532046422002799)
- L. Breiman (2001). Random forests. Machine learning 45(1), pp. 5–32.
- L. M. Buja and J. Butany (2022). Cardiovascular pathology. Academic Press.
- S. Candemir, X. V. Nguyen, L. R. Folio, and L. M. Prevedello (2021). Training Strategies for Radiology Deep Learning Models in Data-limited Scenarios. Radiology: Artificial Intelligence 3(6), pp. e210014. [Link](https://pubs.rsna.org/doi/10.1148/ryai.2021210014)
- C. Cortes and V. Vapnik (1995). Support-vector networks. Machine learning 20(3), pp. 273–297. [Link](https://link.springer.com/article/10.1023/A:1022627411411)
- Q. Counseller and Y. Aboelkassem (2023). Recent technologies in cardiac imaging. Frontiers in medical technology 4, pp. 984492. [Link](https://www.frontiersin.org/journals/medical-technology/articles/10.3389/fmedt.2022.984492/full)
- F. A. Flachskampf, T. Biering-Sørensen, S. D. Solomon, O. Duvernoy, T. Bjerner, and O. A. Smiseth (2015). Cardiac Imaging to Evaluate Left Ventricular Diastolic Function. JACC: Cardiovascular Imaging 8(9), pp. 1071–1093. [Link](https://www.jacc.org/doi/epdf/10.1016/j.jcmg.2015.07.004)
- T. Hastie (2009). The elements of statistical learning: data mining, inference, and prediction. Springer.
- K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- D. W. Hosmer Jr, S. Lemeshow, and R. X. Sturdivant (2013). Applied logistic regression. John Wiley & Sons.
- F. Isensee, P. F. Jaeger, P. M. Full, I. Wolf, S. Engelhardt, and K. H. Maier-Hein (2017). Automatic Cardiac Disease Assessment on cine-MRI via Time-Series Segmentation and Domain Specific Features. In International workshop on statistical atlases and computational models of the heart, pp. 120–129. [Link](https://link.springer.com/chapter/10.1007/978-3-319-75541-0_13)
- C. Jin, Z. Guo, Y. Lin, L. Luo, and H. Chen (2026). Learning with less supervision: A survey of label-efficient learning for medical image analysis. Medical Image Analysis, pp. 104062. [Link](https://www.sciencedirect.com/science/article/abs/pii/S1361841526001301?via%3Dihub)
- M. Khened, V. A. Kollerathu, and G. Krishnamurthi (2019). Fully convolutional multi-scale residual DenseNets for cardiac segmentation and automated cardiac diagnosis using ensemble of classifiers. Medical Image Analysis 51, pp. 21–45. [Link](https://www.sciencedirect.com/science/article/abs/pii/S136184151830848X?via%3Dihub)
- D. M. Powers (2020). EVALUATION: from precision, recall and f-measure to roc, informedness, markedness & correlation. arXiv preprint arXiv:2010.16061. [Link](https://arxiv.org/pdf/2010.16061)
- M. Sokolova and G. Lapalme (2009). A systematic analysis of performance measures for classification tasks. Information processing & management 45(4), pp. 427–437.
- G. Varoquaux and V. Cheplygina (2022). Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ digital medicine 5(1), pp. 48. [Link](https://www.nature.com/articles/s41746-022-00592-y)
- M. J. Willemink, W. A. Koszek, C. Hardell, J. Wu, D. Fleischmann, H. Harvey, L. R. Folio, R. M. Summers, D. L. Rubin, and M. P. Lungren (2020). Preparing Medical Imaging Data for Machine Learning. Radiology 295(1), pp. 4–15. [Link](https://pubs.rsna.org/doi/10.1148/radiol.2020192224)
- K. Yousef and S. Schmollgruber (2024). Artificial Intelligence in Low- and Middle-Income Countries: Reducing the Gaps in Health Care, Research, and Education. International Journal of Critical Care 18(2), pp. 1–3. [Link](https://wfccn-ijcc.com/index.php/ijcc/article/view/981)
- Q. Zheng, H. Delingette, and N. Ayache (2019). Explainable cardiac pathology classification on cine mri with motion characterization by semi-supervised learning of apparent flow. Medical Image Analysis 56, pp. 80–95. [Link](https://www.sciencedirect.com/science/article/abs/pii/S1361841519300519?via%3Dihub)
- S. K. Zhou, H. Greenspan, C. Davatzikos, J. S. Duncan, B. Van Ginneken, A. Madabhushi, J. L. Prince, D. Rueckert, and R. M. Summers (2021). A Review of Deep Learning in Medical Imaging: Imaging Traits, Technology Trends, Case Studies With Progress Highlights, and Future Promises. Proceedings of the IEEE 109(5), pp. 820–838. [Link](https://ieeexplore.ieee.org/document/9363915)

## Appendix A Evaluation and Model Details

### A.1 Metric Definitions

Let $C_{ij}$ denote the confusion matrix entry counting examples whose true class is $i$ and predicted class is $j$. For class $i$, the recall is defined as

$$\mathrm{Recall}_i = \frac{C_{ii}}{\sum_{j} C_{ij}}.$$

Then, *balanced accuracy* $\mathrm{BA}$ is defined as the average recall across classes:

$$\mathrm{BA} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{Recall}_i,$$

where $K$ is the number of classes. This metric is appropriate here because the benchmark is multiclass and class-conditional performance is of primary interest. In the usual manner, *accuracy* is defined as

$$\mathrm{Acc} = \frac{\sum_{i} C_{ii}}{\sum_{i,j} C_{ij}}.$$

For macro-F1, let the precision for class $i$ be

$$\mathrm{Precision}_i = \frac{C_{ii}}{\sum_{j} C_{ji}},$$

and define the classwise F1 score by

$$\mathrm{F1}_i = \frac{2\,\mathrm{Precision}_i\,\mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}.$$

The macro-F1 score is then

$$\mathrm{Macro\text{-}F1} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{F1}_i.$$
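The three metrics above can be computed directly from a confusion matrix, as in the short sketch below. The function name `metrics_from_confusion` is hypothetical, and the convention of assigning zero precision or F1 to a class with no predictions or no correct predictions is an assumption added to avoid division by zero.

```python
import numpy as np

def metrics_from_confusion(C):
    """Accuracy, balanced accuracy, and macro-F1 from C[i, j] (true i, predicted j)."""
    C = np.asarray(C, dtype=float)
    tp = np.diag(C)
    true_counts = C.sum(axis=1)  # row sums: examples per true class
    pred_counts = C.sum(axis=0)  # column sums: examples per predicted class
    recall = np.divide(tp, true_counts, out=np.zeros_like(tp), where=true_counts > 0)
    precision = np.divide(tp, pred_counts, out=np.zeros_like(tp), where=pred_counts > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": float(tp.sum() / C.sum()),
        "balanced_accuracy": float(recall.mean()),
        "macro_f1": float(f1.mean()),
    }
```

For a perfectly balanced dataset, accuracy and balanced accuracy coincide, since every class contributes the same number of rows to the confusion matrix.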

### A.2 Model Hyperparameters

This subsection gives further details of the models considered in the paper.

Table 1:Model families and representative hyperparameter settings used in the main experiments\.

## Appendix B Additional Qualitative Analysis

To visually complement the anatomical ablation study, we align and average structure-specific masks within each pathology class and plot the resulting class prototypes, shown in [Figure 6](https://arxiv.org/html/2606.06509#A2.F6).

![Refer to caption](https://arxiv.org/html/2606.06509v1/x4.png)

Figure 6: Aligned class prototypes for RV, myocardium, and LV across ACDC pathologies. Masks were centered and size-normalized before averaging. Myocardial contours exhibit the clearest class-dependent variation, consistent with the quantitative finding that myocardium is the strongest single-structure feature set.

### B.1 Representative Failure Cases

To complement the confusion analysis, we visualize representative misclassified patients to illustrate the kinds of anatomical ambiguity that remain under the all-structure representation; this is shown in [Figure 7](https://arxiv.org/html/2606.06509#A2.F7).

![Refer to caption](https://arxiv.org/html/2606.06509v1/x5.png)

Figure 7: Representative misclassified patients under the all-structure model. RV, myocardium, and LV are shown in red, green, and blue, respectively. The errors are concentrated in anatomically ambiguous cases rather than random failures, especially for MINF-related confusions.

## Appendix C Additional Quantitative Analysis

### C.1 Normalized Confusion Matrix

To characterize residual class-wise ambiguity, we compute the row-normalized confusion matrix. Let $C_{ij}$ denote the number of examples whose true class is $i$ and predicted class is $j$. We normalize each row by the total number of examples in the corresponding true class:

$$\widetilde{C}_{ij} = \frac{C_{ij}}{\sum_{k} C_{ik}}.$$

Thus, $\widetilde{C}_{ij}$ represents the fraction of class-$i$ examples assigned to class $j$, and each row sums to one. [Figure 8](https://arxiv.org/html/2606.06509#A3.F8) shows that the remaining errors are structured rather than random. DCM and RV are recovered especially well, while MINF and NOR account for most of the residual ambiguity. This pattern suggests that the benchmark is meaningful but not saturated, with failures concentrated in a small number of clinically plausible class pairs.

![Refer to caption](https://arxiv.org/html/2606.06509v1/x6.png)

Figure 8: Normalized cross-validated confusion matrix for 5-class ACDC pathology prediction using the all-structure anatomical representation. Most classes are well separated, while residual errors are concentrated in a small number of structured class pairs, particularly NOR–RV and MINF–DCM, indicating a meaningful but nontrivial benchmark.

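Row normalization and the identification of the most confused class pairs can be sketched as follows. The helper names `row_normalize` and `top_confusions` are hypothetical, and the counts used in any example call are placeholders, not values from the paper.

```python
import numpy as np

def row_normalize(C):
    """Divide each row of C by its sum; rows with no examples stay all zero."""
    C = np.asarray(C, dtype=float)
    row_sums = C.sum(axis=1, keepdims=True)
    return np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)

def top_confusions(C, labels, k=3):
    """Return up to k (true, predicted, rate) triples with the largest error rates."""
    norm = row_normalize(C)
    np.fill_diagonal(norm, 0.0)  # ignore correct predictions
    order = np.argsort(norm, axis=None)[::-1][:k]
    pairs = []
    for flat in order:
        i, j = np.unravel_index(flat, norm.shape)
        if norm[i, j] > 0:
            pairs.append((labels[i], labels[j], float(norm[i, j])))
    return pairs
```

Ranking off-diagonal entries this way makes it easy to check whether errors cluster in a few class pairs or are spread evenly, which is the distinction the paragraph above draws.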
### C\.2Robustness to Imperfect Segmentations

To assess whether the anatomy\-aware pipeline remains reliable under realistic segmentation imperfections, we simulate mild boundary perturbations through mask erosion and dilation and re\-evaluate the benchmark under the same cross\-validation protocol; the results are shown in [Figure 9](https://arxiv.org/html/2606.06509#A3.F9)\. Performance remains relatively stable under these perturbations, suggesting that the anatomy\-aware representation is not unduly brittle to realistic contour variability\. This is particularly relevant for deployment in resource\-constrained settings, where segmentation quality may vary due to annotation differences or heterogeneous imaging conditions\.

![Refer to caption](https://arxiv.org/html/2606.06509v1/x7.png) Figure 9: Robustness to simulated mask perturbations\. We erode and dilate segmentation masks by small amounts to mimic annotation disagreement or lower\-quality imaging conditions\. Performance remains relatively stable across mild perturbations, suggesting that the anatomy\-aware pipeline is robust to realistic segmentation noise\.
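One way to simulate such boundary perturbations is sketched below. It assumes a binary 2D mask per structure and a 4-connected structuring element; the paper does not specify the perturbation sizes or the element used, so the helper names (`dilate`, `erode`, `perturb`) and the toy mask are illustrative only. In practice `scipy.ndimage.binary_erosion` and `scipy.ndimage.binary_dilation` offer the same operations; plain NumPy is used here to keep the sketch dependency-free.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 4-connected (cross-shaped) structuring element."""
    out = np.asarray(mask, dtype=bool)
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        out = (
            padded[1:-1, 1:-1]
            | padded[:-2, 1:-1]   # neighbor above
            | padded[2:, 1:-1]    # neighbor below
            | padded[1:-1, :-2]   # neighbor to the left
            | padded[1:-1, 2:]    # neighbor to the right
        )
    return out

def erode(mask, iterations=1):
    """Binary erosion as the complement of dilating the background.

    Assumption: pixels outside the image count as foreground, so a mask is
    never eroded from the image border inward.
    """
    return ~dilate(~np.asarray(mask, dtype=bool), iterations)

def perturb(mask, step):
    """step > 0 dilates by `step` pixels, step < 0 erodes, 0 returns a copy."""
    if step > 0:
        return dilate(mask, step)
    if step < 0:
        return erode(mask, -step)
    return np.asarray(mask, dtype=bool).copy()

# Hypothetical toy mask: a 3x3 square centered in a 7x7 image.
mask = np.zeros((7, 7), dtype=bool)
mask[2:5, 2:5] = True
areas = {step: int(perturb(mask, step).sum()) for step in (-1, 0, 1)}
```

Descriptors would then be recomputed from each perturbed mask and passed through the same cross-validation protocol, so any change in performance is attributable to the contour shift alone.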
### C\.3 Feature\-Level and Representation Analysis

[Figure 10](https://arxiv.org/html/2606.06509#A3.F10) refines the myocardium \(MYO\) finding by showing which individual descriptors drive the classification\. Many of the top\-ranked features are myocardial in origin, with only a smaller contribution from RV\-derived quantities\. This suggests that the predictive value of MYO is not a coarse artifact of grouping, but is expressed through specific shape statistics extracted from the myocardium itself\.

![Refer to caption](https://arxiv.org/html/2606.06509v1/x8.png) Figure 10: Top anatomical features driving classification under the logistic\-regression model, ranked by mean absolute coefficient magnitude\. Many of the most influential descriptors arise from myocardial morphology, with additional contribution from a smaller set of RV\-derived features\.
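A ranking by mean absolute coefficient magnitude could be computed roughly as follows. The sketch assumes a coefficient matrix of shape `(n_classes, n_features)`, as exposed by a fitted multinomial logistic regression (for example scikit-learn's `LogisticRegression.coef_`), and assumes the mean is taken across classes on standardized features. The feature names and coefficient values are made up for illustration.

```python
import numpy as np

def rank_features(coef, feature_names, top_k=3):
    """Rank features by mean |coefficient| across classes, largest first."""
    importance = np.abs(np.asarray(coef, dtype=float)).mean(axis=0)
    order = np.argsort(importance)[::-1][:top_k]
    return [(feature_names[i], float(importance[i])) for i in order]

# Hypothetical coefficients: 3 classes x 4 features.
coef = np.array([
    [ 0.9, -0.1,  0.2, 0.0],
    [-0.7,  0.3, -0.1, 0.1],
    [ 0.2, -0.2,  0.6, 0.0],
])
names = ["myo_circularity", "rv_area", "myo_radial_std", "lv_area"]
ranking = rank_features(coef, names, top_k=3)
ranked_names = [name for name, _ in ranking]
```

Coefficient magnitudes are only comparable across features when the inputs share a common scale, which is why standardization is assumed here.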
### C\.4 Patient embedding

[Figure 11](https://arxiv.org/html/2606.06509#A3.F11) shows that the resulting patient\-level feature space has visible class structure, supporting the interpretation that the representation captures clinically meaningful anatomical organization rather than random variation\.

![Refer to caption](https://arxiv.org/html/2606.06509v1/x9.png) Figure 11: PCA embedding of patients in the all\-structure anatomical feature space\. The classes exhibit visible organization without becoming trivially separable, indicating that the anatomy\-aware representation captures meaningful pathology structure while preserving nontrivial class overlap\.
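For readers who want to reproduce this kind of view, a 2D PCA embedding of a patient-by-feature matrix can be sketched with an SVD. Standardizing each feature first is an assumption (the paper does not state its preprocessing), and the random matrix below merely stands in for real descriptors.

```python
import numpy as np

def pca_embed(features, n_components=2):
    """Project standardized features onto their top principal components."""
    X = np.asarray(features, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # guard against constant features
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    embedding = Z @ Vt[:n_components].T
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return embedding, explained_ratio[:n_components]

# Placeholder data: 20 hypothetical patients with 6 descriptors each.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 6))
embedding, explained = pca_embed(features)
```

Plotting the two embedding columns colored by pathology label gives a scatter of the same kind as the figure; the explained-variance ratios indicate how much structure the 2D view discards.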
### C\.5 Most important myocardial descriptor families

[Figure 12](https://arxiv.org/html/2606.06509#A3.F12) further decomposes the myocardial contribution into descriptor families\. Radial\-distance variability, extent, circularity, elongation, and compactness dominate, indicating that the MYO signal arises from geometry and morphological variation rather than from a single scalar quantity\. This deepens the interpretability of the benchmark and suggests which types of myocardial structure are most informative with limited labels\.

![Refer to caption](https://arxiv.org/html/2606.06509v1/x10.png) Figure 12: Most important myocardial descriptor families under the logistic\-regression model\. Radial\-distance variability, extent, circularity, elongation, and compactness emerge as the most influential myocardial descriptor groups, suggesting that the predictive value of MYO arises from geometry rather than a single scalar measurement alone\.

## Appendix D Structure\-by\-Descriptor\-Family Importance

In [Figure 13](https://arxiv.org/html/2606.06509#A4.F13), the heatmap provides a finer\-grained interpretability view of the anatomy\-aware benchmark and further highlights the dominant contribution of myocardial descriptors\.
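A structure-by-family heatmap of this kind could be assembled as sketched below: per-fold mean absolute coefficients are summed within each (structure, descriptor family) cell and then averaged over folds. The mapping from features to structures and families, the function name `family_importance`, and the toy coefficients are all hypothetical; the paper's exact grouping is not reproduced here.

```python
import numpy as np

def family_importance(fold_coefs, feature_meta, structures, families):
    """Sum mean |coef| per (structure, family) cell, then average over folds.

    fold_coefs:   list of (n_classes, n_features) arrays, one per CV fold.
    feature_meta: one (structure, family) pair per feature column.
    """
    # Per-fold, per-feature importance: mean |coef| across classes.
    per_feature = np.stack([np.abs(c).mean(axis=0) for c in fold_coefs])
    heat = np.zeros((len(fold_coefs), len(structures), len(families)))
    for j, (structure, family) in enumerate(feature_meta):
        s, f = structures.index(structure), families.index(family)
        heat[:, s, f] += per_feature[:, j]
    return heat.mean(axis=0)

# Hypothetical setup: 2 folds, 2 classes, 4 features.
structures = ["RV", "MYO", "LV"]
families = ["circularity", "extent"]
feature_meta = [("MYO", "circularity"), ("MYO", "circularity"),
                ("MYO", "extent"), ("RV", "extent")]
fold_coefs = [
    np.array([[1.0, -1.0,  2.0, 0.0], [ 1.0, 1.0, -2.0, 2.0]]),
    np.array([[3.0,  1.0,  0.0, 0.0], [-1.0, 1.0,  0.0, 2.0]]),
]
heat = family_importance(fold_coefs, feature_meta, structures, families)
```

Summing within a cell means families with more member features can accumulate larger totals, so cells are best compared with the family sizes in mind.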

![Refer to caption](https://arxiv.org/html/2606.06509v1/x11.png) Figure 13: Structure\-by\-descriptor\-family importance analysis using multinomial logistic regression coefficients\. Each cell reports the summed mean absolute coefficient importance for a descriptor family within a given anatomical structure, averaged across cross\-validation folds\.

### D\.1 Robustness

![Refer to caption](https://arxiv.org/html/2606.06509v1/x12.png) Figure 14: Robustness to simulated mask perturbations\. Each cell reports mean cross\-validation balanced accuracy $\pm$ standard deviation under mild erosion and dilation of the segmentation masks\. Performance remains stable across perturbation settings for logistic regression, RBF\-SVM, and random forest, suggesting that the anatomy\-aware pipeline is robust to modest contour variability\.

## Appendix E End\-to\-End Image Baseline

To contextualize the anatomy\-aware baselines, we additionally compare them with a lightweight end\-to\-end image classifier based on ResNet\-18 \(11,175,941 parameters\), trained on representative raw MRI slices under the same low\-label protocol\.

Table 2: Comparison of lightweight anatomy\-aware baselines and a small end\-to\-end image baseline \(ResNet\-18\)\. The anatomy\-aware models operate on segmentation\-derived patient descriptors, whereas ResNet\-18 operates on representative raw MRI slices\. This comparison contextualizes the tradeoff between predictive performance and computational cost in low\-resource settings\.
