MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion

arXiv cs.LG Papers

Summary

MuteBench is a benchmark for evaluating multimodal fusion models under modality missing and within-modality missing conditions across clinical datasets. It provides insights into architecture robustness and suggests that diffusion-based imputation can help.

arXiv:2605.15235v1 Announce Type: new

# MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion
Source: [https://arxiv.org/html/2605.15235](https://arxiv.org/html/2605.15235)
Wugeng Zheng (University of Central Florida, wgzheng@ucf.edu), Ziwen Kan (University of Central Florida, zi605672@ucf.edu), Tianlong Chen (University of North Carolina at Chapel Hill, tianlong@cs.unc.edu), Chen Chen (University of Central Florida, chen.chen@crcv.ucf.edu), Song Wang (University of Central Florida, song.wang@ucf.edu)

###### Abstract

Multimodal physiological data powers clinical AI systems from intensive care units to wearable devices, but sensors routinely fail in practice\. Two failure modes are common: modality missing, where an entire channel is absent, and within\-modality missing, where a contiguous time segment is lost\. No existing benchmark evaluates multiple fusion architectures under both failure modes at controlled severity levels across diverse clinical datasets\. We present MuteBench, a benchmark covering 9 datasets from 7 clinical domains, 6 fusion architectures, and 2 missing\-data modes over 125,000 samples\. Through this benchmark, we find that architecture family is the strongest predictor of robustness, outweighing parameter count\. Channel\-independent models tolerate modality missing well but can be sensitive to within\-modality missing, especially on short sequences\. Curriculum modality dropout protects reliably only up to the maximum dropout rate used in training\. We also find that channel count, sequence length, and modality alignment jointly determine which failure mode poses the greater threat\. Finally, a PTB\-XL case study suggests that diffusion\-based imputation can improve downstream classification under within\-modality missing, with the largest gains for models whose expert routing is most sensitive to corrupted inputs, though broader validation across datasets remains an open direction\. MuteBench provides practitioners with concrete guidance for both selecting existing architectures and informing the design of future robust multimodal fusion methods\.

## 1 Introduction

Multimodal physiological data is central to modern clinical AI \[[1](https://arxiv.org/html/2605.15235#bib.bib9), [37](https://arxiv.org/html/2605.15235#bib.bib10)\], combining ECG, EEG, PPG, and structured records to support disease prediction, severity assessment, and health monitoring \[[47](https://arxiv.org/html/2605.15235#bib.bib11), [33](https://arxiv.org/html/2605.15235#bib.bib12), [9](https://arxiv.org/html/2605.15235#bib.bib13)\]. Yet sensors routinely fail in practice. Leads detach, devices lose contact, and transmission errors drop segments of data \[[7](https://arxiv.org/html/2605.15235#bib.bib15), [41](https://arxiv.org/html/2605.15235#bib.bib16)\], resulting in two distinct failure modes. *Modality missing* occurs when an entire channel is absent \[[50](https://arxiv.org/html/2605.15235#bib.bib38)\], while *within-modality missing* refers to the loss of a contiguous time segment within an active channel \[[5](https://arxiv.org/html/2605.15235#bib.bib39)\]. Both failures arise during deployment and data collection alike, making incomplete multimodal data a common feature of clinical datasets.

Despite this, most multimodal models are trained and evaluated assuming complete modality availability \[[45](https://arxiv.org/html/2605.15235#bib.bib2), [11](https://arxiv.org/html/2605.15235#bib.bib48)\], overlooking the incompleteness common in clinical practice. Even when data are complete, training dynamics tend to favor certain modalities over others \[[44](https://arxiv.org/html/2605.15235#bib.bib1), [16](https://arxiv.org/html/2605.15235#bib.bib46)\], so many multimodal models in practice rely on one dominant modality while leaving others underused. Missing modalities therefore worsen these existing imbalances rather than creating an entirely new challenge. Recent evidence further shows that the two failure modes produce inconsistent method rankings \[[3](https://arxiv.org/html/2605.15235#bib.bib14)\], yet no existing work jointly evaluates multiple architectures under both modes at controlled severity levels across diverse clinical signals (Table [1](https://arxiv.org/html/2605.15235#S1.T1)), leaving practitioners without clear guidance for building robust multimodal fusion systems.

To address this gap, we introduce the Modality Unavailability Tolerance Evaluation Benchmark \(MuteBench\), covering 9 datasets from 7 clinical domains, over 125,000 total samples across datasets ranging from 515 to 64,726 recordings, and 6 fusion models from three architecture families: channel\-independent models, Mixture\-of\-Experts fusion models, and shared\-specific decomposition models\. All models are evaluated under two missingness conditions at controlled severity levels, yielding 270 evaluation configurations replicated across three independent random seeds \(810 total runs\)\. The datasets span channel counts from 2 to 78, sequence lengths from 48 to 3,000 time steps, and sampling rates from once\-per\-hour clinical observations to 4,000 Hz physiological waveforms, covering binary, multi\-class, and multi\-label tasks across all three modality alignment types\. A framework\-agnostic missingness library injects identical deterministic patterns across all codebases, ensuring that performance differences reflect architectural choices rather than data\-pipeline artifacts\. Our main contributions are:

- The first benchmark to jointly evaluate robustness under various missingness conditions across multiple fusion architectures and clinical datasets. MuteBench is the first to cross multiple axes simultaneously: 9 datasets from 7 clinical domains, 6 fusion architectures from three model families, and both modality missing and within-modality missing at controlled severity levels. A framework-agnostic missingness library and unified evaluation protocol ensure that observed performance differences reflect architecture, not pipeline artifacts.
- Empirical insights on what drives robustness. Benchmarking across all nine datasets reveals that architecture family, not parameter count, is the primary driver of robustness. Channel-independent models are most stable under modality missing, while shared-specific models best handle within-modality missing on homogeneous signals. Dataset channel count and sequence length jointly determine which failure mode is more damaging. Curriculum modality dropout provides strong protection, but only up to the missing rate covered during training.
- Diffusion imputation as a potential remedy for within-modality missing. In a PTB-XL case study with three models, diffusion-based imputation improves downstream classification under within-modality missing, with gains scaling with missing severity. The largest gains appear for models whose expert routing is most sensitive to corrupted inputs, while architectures that already handle within-modality gaps structurally benefit less. The benefit is negligible under modality missing, where an entirely absent channel provides little conditioning signal for reconstruction.

Table 1: Feature comparison of multimodal robustness benchmarks. MuteBench is the first to evaluate robustness under various missingness conditions at controlled severity levels across multiple fusion architectures and clinical physiological datasets.

In light of these findings, MuteBench provides concrete, dataset-aware guidance for selecting fusion architectures under real-world missingness. Beyond evaluation, our findings directly inform the design of future robust multimodal fusion models by identifying structural properties (modality alignment type, channel count, and sequence length) that govern sensitivity to each failure mode. Specifically, our results suggest that these dataset properties should be treated as first-class design inputs when developing robust fusion architectures, rather than being handled post hoc through generic missing-data strategies. Together, MuteBench and these findings establish the first systematic foundation for benchmarking and designing robust multimodal clinical fusion systems under realistic sensor failure conditions.

## 2 Related Work

Clinical Multimodal Benchmarks. Clinical multimodal benchmarks fall into two categories. The first covers medical question answering \[[18](https://arxiv.org/html/2605.15235#bib.bib51), [36](https://arxiv.org/html/2605.15235#bib.bib50)\], LLM and LVLM reasoning on clinical images and guidelines \[[40](https://arxiv.org/html/2605.15235#bib.bib49), [15](https://arxiv.org/html/2605.15235#bib.bib52), [24](https://arxiv.org/html/2605.15235#bib.bib33), [10](https://arxiv.org/html/2605.15235#bib.bib34), [17](https://arxiv.org/html/2605.15235#bib.bib35)\], and multimodal QA over EHR tables paired with radiology images \[[2](https://arxiv.org/html/2605.15235#bib.bib53)\]. These all focus on text-based or knowledge-driven reasoning without continuously sampled physiological signals. The second targets task-oriented clinical prediction by combining time series, waveforms, and structured features \[[6](https://arxiv.org/html/2605.15235#bib.bib36), [50](https://arxiv.org/html/2605.15235#bib.bib38), [46](https://arxiv.org/html/2605.15235#bib.bib19)\]. Liang et al. \[[25](https://arxiv.org/html/2605.15235#bib.bib37)\] introduce a general-purpose benchmark with robustness evaluation but do not separate modality missing from within-modality missing patterns and do not focus on clinical physiological signals. No existing benchmark systematically studies how different missing patterns interact with signal structure across multiple fusion architectures, leaving an important and unaddressed gap in multimodal clinical evaluation.

Missing Data in Clinical Time Series. Most multimodal healthcare prediction models are developed and benchmarked assuming complete modality availability \[[45](https://arxiv.org/html/2605.15235#bib.bib2)\], yet this assumption rarely holds in real clinical practice. Missing data in clinical physiological signals takes two forms: modality missing (an entire sensor stream absent) and within-modality missing (contiguous temporal gaps within an active channel). Single-modality methods address these challenges through missingness-aware recurrent models \[[5](https://arxiv.org/html/2605.15235#bib.bib39)\], continuous-time attention networks \[[34](https://arxiv.org/html/2605.15235#bib.bib40)\], and graph-based sensor-dropout encoders \[[51](https://arxiv.org/html/2605.15235#bib.bib41)\]. For multimodal settings, existing methods include generative reconstruction of absent modalities \[[26](https://arxiv.org/html/2605.15235#bib.bib42)\], shared-specific feature decomposition \[[43](https://arxiv.org/html/2605.15235#bib.bib23), [48](https://arxiv.org/html/2605.15235#bib.bib43)\], and prompt-based adaptation \[[22](https://arxiv.org/html/2605.15235#bib.bib44)\]. These methods are each proposed and evaluated in isolation on individual datasets, without a unified comparison across architectures or clinical signal structures.

Benchmark for Robustness. Two gaps remain. First, no benchmark jointly evaluates multiple fusion architectures under both missing patterns at controlled severity levels on clinical physiological signals. MultiBench \[[25](https://arxiv.org/html/2605.15235#bib.bib37)\] and MC-BEC \[[6](https://arxiv.org/html/2605.15235#bib.bib36)\] each address only one of these dimensions. Second, no benchmark accounts for how dataset structure (modality alignment type, channel count, and sequence length) interacts with missing pattern type to determine degradation. Without this, it is unclear which architectural choices suit which deployment scenario. We address both gaps with nine clinical datasets across three modality alignment structures, six fusion architectures, and two missing patterns at two severity levels. Table [1](https://arxiv.org/html/2605.15235#S1.T1) details this comparison. As Section 4 shows, architecture rankings can reverse between the two failure modes, making isolated evaluation misleading.

## 3 Dataset

![Refer to caption](https://arxiv.org/html/2605.15235v1/x1.png)

Figure 1: Overview of MuteBench. We evaluate 9 datasets spanning 7 clinical domains and three representative fusion architecture families: channel-independent models, Mixture-of-Experts fusion models, and shared-specific decomposition models, under two systematic missingness patterns at controlled severity levels.

This section describes the datasets used in MuteBench. We select 9 datasets from 7 clinical domains (Figure [1](https://arxiv.org/html/2605.15235#S3.F1), left), covering a broad range of real-world conditions, so that our robustness findings reflect general trends rather than dataset-specific artifacts. The following subsections describe our selection criteria and modality definitions.

### 3.1 Dataset Selection and Modality Definitions

We curate 9 datasets spanning 7 domains \(e\.g\., ICU, cardiology, wearable activity tracking\) and multiple task formats \(binary, multi\-class, multi\-label\) to measure both spatial and temporal robustness\. This diversity ensures that the evaluated fusion strategies are tested across varied input structures and clinical objectives, enabling a thorough assessment of generalization across diverse clinical settings\.

We categorize the datasets into three modality alignment types to isolate how structural alignment affects a model’s ability to recover missing information. Type 1 (Homogeneous and Aligned) datasets share the same format and time axis (e.g., 12-lead ECG), testing pure spatial compensation. Type 2 (Heterogeneous and Aligned) datasets have different physical domains but synchronized time axes (e.g., EEG and respiration), testing cross-domain temporal reasoning. Type 3 (Heterogeneous and Unaligned) datasets lack both spatial and temporal synchronization (e.g., asynchronous clinical records combined with static metadata), testing models under the most challenging alignment conditions. Detailed selection criteria and formal modality definitions are in Appendix [C](https://arxiv.org/html/2605.15235#A3).

![Refer to caption](https://arxiv.org/html/2605.15235v1/x2.png)

Figure 2: Evaluation of missing-data conditions. Left (Complete): Original, fully observed signals. Middle (Modality missing): Entire modalities (e.g., B and E) are dropped with probability $p$, simulating whole-sensor failures. Right (Within-modality missing): Contiguous time segments are masked independently per channel, simulating transient interruptions like motion artifacts.

### 3.2 Dataset Statistics

The benchmark spans 9 datasets from 7 clinical and physiological domains, totalling over 125,000 samples. Dataset sizes differ by more than two orders of magnitude: HAR-UP \[[27](https://arxiv.org/html/2605.15235#bib.bib24)\] is the smallest at 515 samples and PPG-DaLiA \[[30](https://arxiv.org/html/2605.15235#bib.bib28)\] the largest at 64,726. ICU datasets fall in the mid-range, with MIMIC-IV \[[19](https://arxiv.org/html/2605.15235#bib.bib30)\] contributing 5,100 patient episodes and Challenge-2012 \[[35](https://arxiv.org/html/2605.15235#bib.bib31)\] about 4,000 hospital stays. The cardiac datasets are similarly sized: PTB-XL \[[42](https://arxiv.org/html/2605.15235#bib.bib25)\] provides 21,837 recordings and Chapman-Shaoxing \[[52](https://arxiv.org/html/2605.15235#bib.bib26)\] 10,646. Task formats are equally diverse. Three datasets target binary classification: HAR-UP, MIMIC-IV, and Challenge-2012. Four require multi-class prediction with 3 to 9 categories: Sleep-EDF \[[21](https://arxiv.org/html/2605.15235#bib.bib27)\] distinguishes 5 sleep stages, PPG-DaLiA covers 9 activity types, WESAD \[[31](https://arxiv.org/html/2605.15235#bib.bib29)\] recognizes 3 affective states, and CirCor \[[28](https://arxiv.org/html/2605.15235#bib.bib32)\] grades 3 murmur severities. The remaining two, PTB-XL and Chapman-Shaoxing, use multi-label prediction with 5 and 7 diagnostic label groups per recording. All datasets are evaluated with Macro-AUROC as the primary metric and Macro-F1 reported alongside.
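As a reference point for the two metrics named above, the following is a minimal sketch of Macro-AUROC and Macro-F1 over a binary indicator matrix (multi-class labels would be one-hot encoded first). The averaging convention is not spelled out in this section, so the per-class (one-vs-rest) averaging, the 0.5 decision threshold in the usage lines, and all function names are assumptions made for illustration.

```python
from typing import Sequence


def binary_auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(Y: Sequence[Sequence[int]], S: Sequence[Sequence[float]]) -> float:
    """Unweighted mean of per-class AUROC over the columns of Y (labels) and S (scores)."""
    k = len(Y[0])
    return sum(
        binary_auroc([row[j] for row in Y], [row[j] for row in S]) for j in range(k)
    ) / k


def macro_f1(Y: Sequence[Sequence[int]], P: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-class F1 over the columns of Y (labels) and P (hard predictions)."""
    k = len(Y[0])
    f1s = []
    for j in range(k):
        tp = sum(1 for y, p in zip(Y, P) if y[j] == 1 and p[j] == 1)
        fp = sum(1 for y, p in zip(Y, P) if y[j] == 0 and p[j] == 1)
        fn = sum(1 for y, p in zip(Y, P) if y[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)  # a class never predicted scores 0
    return sum(f1s) / k


# Toy two-class example (made-up numbers, not benchmark data).
Y = [[1, 0], [0, 1], [1, 0], [0, 1]]
S = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
P = [[int(s >= 0.5) for s in row] for row in S]  # assumed 0.5 threshold
print(macro_auroc(Y, S), macro_f1(Y, P))
```

Because a class that is never predicted contributes an F1 of 0 to the average, Macro-F1 can collapse even when the ranking-based AUROC remains reasonable.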

The datasets also vary substantially in input structure. Channel counts range from 2 in WESAD (wrist EDA and BVP sensors) to 78 in CirCor (64 mel-spectrogram bins plus 14 static demographic channels). Sequence lengths span from 48 time steps in MIMIC-IV and Challenge-2012 (hourly ICU bins) to 3,000 steps in Sleep-EDF (30-second windows at 100 Hz). Other examples include PTB-XL and Chapman-Shaoxing with 12 ECG leads at $T=250$ and $T=1{,}000$ respectively, PPG-DaLiA with 9 wearable channels at $T=256$, and HAR-UP with 30 IMU channels from 5 body-worn sensors at $T=140$. Sampling rates vary from once-per-hour clinical observations to 4,000 Hz phonocardiogram waveforms. Challenge-2012 has extreme temporal sparsity, with only $\approx 13.9\%$ of per-cell measurements observed, making it the most irregularly sampled dataset in the benchmark. This structural variation is deliberate: it tests models across low- and high-channel regimes, short- and long-horizon reasoning, and all three modality alignment types, from homogeneous synchronized signals like PTB-XL and Chapman-Shaoxing to heterogeneous unaligned mixtures of time series and static embeddings in MIMIC-IV and CirCor. Full per-dataset statistics are in Appendix [C](https://arxiv.org/html/2605.15235#A3).

## 4 Experiments

### 4.1 Experimental Setup

This section presents the experimental results and analysis. We evaluate model robustness against incomplete data across diverse settings, guided by three questions: (a) Datasets: Do models show consistent robustness across dataset types and clinical domains? (b) Models: Can different fusion architectures maintain robustness under missing data? (c) Missingness: Do models respond differently to modality missing versus within-modality missing? The datasets selected and model configurations used throughout this work are described as follows.

##### Datasets:

We use 9 datasets covering the domains and modality types defined in Section [3](https://arxiv.org/html/2605.15235#S3). These include medical data (MIMIC-IV, Challenge-2012), heart signals (PTB-XL, Chapman-Shaoxing), cardiac auscultation (CirCor), brain signals (Sleep-EDF), wearable data (PPG-DaLiA, WESAD), and activity tracking (HAR-UP). Categorized by modality structure, these datasets span three types: Type 1 (homogeneous and aligned), Type 2 (heterogeneous and aligned), and Type 3 (heterogeneous and unaligned), as detailed in Appendix [C](https://arxiv.org/html/2605.15235#A3). Table [2](https://arxiv.org/html/2605.15235#S4.T2) lists the specific details for all 9 datasets.

Table 2: Overview of the specific datasets evaluated in the benchmark. The table is sorted by modality type and task type.

Table 3: Summary of baseline model architectures and fusion strategies.

##### Models:

We evaluate 6 representative multimodal fusion architectures spanning multiple design paradigms (Figure [1](https://arxiv.org/html/2605.15235#S3.F1), right): channel-independent models (CLIMB \[[8](https://arxiv.org/html/2605.15235#bib.bib18)\], MIRA \[[23](https://arxiv.org/html/2605.15235#bib.bib22)\]), a shared-specific decomposition model (ShaSpec \[[43](https://arxiv.org/html/2605.15235#bib.bib23)\]), and MoE-fusion models (Flex-MoE \[[49](https://arxiv.org/html/2605.15235#bib.bib20)\], FuseMoE \[[13](https://arxiv.org/html/2605.15235#bib.bib21)\], Maestro \[[29](https://arxiv.org/html/2605.15235#bib.bib17)\]). Table [3](https://arxiv.org/html/2605.15235#S4.T3) summarizes their architectures and key fusion strategies; complete implementation details and hyperparameter settings are in Appendix [B](https://arxiv.org/html/2605.15235#A2).

##### Missing Data and Evaluation Protocol:

We evaluate two missing-data conditions (Figure [2](https://arxiv.org/html/2605.15235#S3.F2)). Under *modality missing*, each channel is independently dropped with probability $p \in \{0.2, 0.5\}$, simulating whole-sensor failures; at least one channel is always retained. Under *within-modality missing*, non-overlapping contiguous blocks (each 5–10% of $T$) are masked per channel up to a total fraction $b \in \{0.2, 0.5\}$, simulating transient signal interruptions; static-feature channels such as CirCor demographics are excluded. All patterns are generated deterministically from a (seed, sample\_id) pair and applied uniformly to train, validation, and test splits, so every model receives identical masks. Maestro is the sole exception, using curriculum modality dropout during training while evaluating under the same protocol. Note that modality missing operates at channel level rather than strict modality level; see Appendix [D.4.1](https://arxiv.org/html/2605.15235#A4.SS4.SSS1). Full details are in Appendix [D](https://arxiv.org/html/2605.15235#A4).
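The protocol above can be sketched as two small mask generators. This is a minimal illustration under stated assumptions, not the benchmark's missingness library: the function names, the string-based seeding scheme, the rule for choosing the retained channel, and the way gaps are placed between blocks are all hypothetical. Only the drop probability, the 5–10% block length, the total masked fraction, and the dependence on a (seed, sample_id) pair come from the text.

```python
import random


def modality_mask(num_channels, p, seed, sample_id):
    """Return a keep-flag per channel (True = kept); each channel is dropped with probability p."""
    rng = random.Random(f"modality|{seed}|{sample_id}")  # assumed seeding scheme
    keep = [rng.random() >= p for _ in range(num_channels)]
    if not any(keep):
        # At least one channel is always retained; which one is an assumption here.
        keep[rng.randrange(num_channels)] = True
    return keep


def block_mask(seq_len, b, seed, sample_id, channel):
    """Return an observed-flag per time step (True = observed) with a fraction b masked in blocks."""
    rng = random.Random(f"block|{seed}|{sample_id}|{channel}")
    target = round(b * seq_len)                # total number of masked steps
    lo = max(1, round(0.05 * seq_len))         # shortest block: about 5% of T
    hi = max(lo, round(0.10 * seq_len))        # longest block: about 10% of T
    lengths, remaining = [], target
    while remaining > 0:
        length = min(rng.randint(lo, hi), remaining)  # the final block may be shorter
        lengths.append(length)
        remaining -= length
    # Split the observed steps into len(lengths) + 1 gaps so blocks can never overlap.
    free = seq_len - target
    cuts = sorted(rng.randint(0, free) for _ in lengths)
    gaps = [right - left for left, right in zip([0] + cuts, cuts + [free])]
    observed = []
    for gap, length in zip(gaps, lengths):
        observed += [True] * gap + [False] * length
    observed += [True] * gaps[-1]
    return observed


keep = modality_mask(num_channels=12, p=0.5, seed=0, sample_id=7)
obs = block_mask(seq_len=1000, b=0.5, seed=0, sample_id=7, channel=3)
print(sum(keep), obs.count(False))
```

Seeding a fresh generator from the identifying tuple, rather than sharing one global generator, is what lets every codebase regenerate the same mask for the same sample regardless of data-loading order. In this sketch a zero-length gap makes two blocks adjacent, which still satisfies the non-overlap requirement.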

### 4.2 Main Results

We evaluate six models on nine datasets\. We report AUROC and Macro\-F1 under three conditions: clean data, modality missing, and within\-modality missing\. The tables report these as AUC and F1 columns for each model \(AUC: AUROC; F1: Macro\-F1\)\.
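The run count reported for the benchmark (270 configurations, 810 runs over three seeds) can be checked with a short enumeration. The decomposition below, 9 datasets by 6 models by 5 conditions (clean plus two missing modes at two severity levels), is an assumption that happens to reproduce the stated totals; the source does not spell it out, and the seed values are placeholders.

```python
from itertools import product

datasets = [
    "PTB-XL", "Chapman-Shaoxing", "Sleep-EDF", "PPG-DaLiA", "WESAD",
    "HAR-UP", "MIMIC-IV", "Challenge-2012", "CirCor",
]
models = ["CLIMB", "MIRA", "ShaSpec", "Flex-MoE", "FuseMoE", "Maestro"]
# Assumed condition grid: one clean baseline plus each mode at 20% and 50%.
conditions = [
    ("clean", 0.0),
    ("modality", 0.2), ("modality", 0.5),
    ("within-modality", 0.2), ("within-modality", 0.5),
]
seeds = [0, 1, 2]  # placeholder seed values

configs = list(product(datasets, models, conditions))
runs = list(product(configs, seeds))
print(len(configs), len(runs))
```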

##### Clean Data Results\.

Table [4](https://arxiv.org/html/2605.15235#S4.T4) shows clean-data baselines. No single model dominates: ShaSpec leads on Sleep-EDF, PPG-DaLiA, and Chapman; Maestro on PTB-XL, WESAD, HAR-UP, Challenge-2012, and CirCor; FuseMoE leads on MIMIC-IV (0.811 AUROC). These clean-data scores serve as the no-missing baseline shared by both modality-missing and within-modality-missing evaluations.

Table 4: Clean data results on nine datasets, averaged over three independent seeds. AUC: AUROC; F1: Macro-F1. Bold denotes the best result per dataset.

Table 5: Modality missing results on nine datasets at 20% and 50% drop rates, averaged over three independent seeds. AUC: AUROC; F1: Macro-F1. Bold denotes the best result per dataset per rate.

##### Modality Missing Results\.

Table [5](https://arxiv.org/html/2605.15235#S4.T5) shows results at 20% and 50% drop rates. Channel-independent models show the strongest robustness: on Chapman at 50% drop, CLIMB loses only $\Delta\text{AUROC} = -0.014$ and MIRA only $\Delta\text{AUROC} = -0.017$. This reflects their design of encoding each channel independently so that dropped modalities do not corrupt remaining representations. Maestro, trained with curriculum dropout, also maintains competitive performance across most datasets. In contrast, fusion-centric architectures suffer more: FuseMoE loses 0.075 AUROC on PTB-XL at 50% drop, and ShaSpec shows instability on Challenge-2012, where Macro-F1 collapses to 0.000 at 20% drop. This suggests that shared-parameter fusion models are more sensitive to the complete absence of expected input streams.

##### Within\-Modality Missing Results\.

Table [6](https://arxiv.org/html/2605.15235#S4.T6) shows within-modality missing results. ShaSpec is most robust on ECG datasets: PTB-XL AUROC drops only $\Delta = -0.007$ at the 50% block rate, likely because masked temporal segments in ECG signals preserve enough morphological context for classification. Flex-MoE maintains stability on MIMIC-IV across both block rates ($\Delta\text{AUROC} \leq 0.010$), suggesting that its mixture-of-experts routing can compensate for localized temporal gaps. However, Maestro shows large F1 degradation on Sleep-EDF at block 50% (F1: 0.443 vs. 0.698 clean, $\Delta = -0.255$), indicating that curriculum-dropout-based robustness does not fully transfer to within-modality missing.

Table 6: Within-modality missing results on nine datasets at 20% and 50% block rates, averaged over three independent seeds. AUC: AUROC; F1: Macro-F1. Bold denotes the best result per dataset per rate.

### 4.3 Insights

From the benchmark results across nine datasets and six fusion architectures, we identify three consistent trends. All trends are stable across the three independent seeds; per-seed results are in Appendix [E.1](https://arxiv.org/html/2605.15235#A5.SS1).

##### Architecture family predicts robustness better than parameter count\.

Channel-independent models (CLIMB, MIRA) are the most stable under modality missing, as isolated channel encoding prevents cross-channel interference. Shared-specific models handle within-modality missing best on homogeneous correlated signals (PTB-XL: $\Delta\text{Block50\%} = -0.006$). FuseMoE ($\approx 256$M parameters) is consistently among the most vulnerable models under both conditions, confirming that scale does not substitute for structural robustness design. Per-family degradation is detailed in Appendix [E.2](https://arxiv.org/html/2605.15235#A5.SS2).

##### Dataset structure determines which missing type causes more harm\.

Datasets with many channels and short sequences (e.g., Challenge-2012: $C=42$, $T=48$) suffer more from modality missing. Each channel carries dense information, while a missing block covers only 2 to 5 time steps and therefore removes comparatively little information. In contrast, long-sequence datasets (e.g., Sleep-EDF: $T=3000$) are also vulnerable to within-modality missing, as models relying on temporal continuity are more affected when a contiguous segment is lost. Per-dataset breakdowns are in Appendix [E](https://arxiv.org/html/2605.15235#A5).
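The "2 to 5 time steps" figure follows from the 5–10% block length applied to $T=48$. The sketch below works through that arithmetic for the two datasets named above; the helper name is hypothetical, and the step durations (hourly bins for Challenge-2012, 100 Hz for Sleep-EDF) are taken from the dataset statistics.

```python
def block_stats(seq_len, step_seconds, b, lo_frac=0.05, hi_frac=0.10):
    """Block length range in steps and seconds, plus how many blocks reach a total fraction b."""
    lo, hi = lo_frac * seq_len, hi_frac * seq_len
    return {
        "steps": (lo, hi),
        "seconds": (lo * step_seconds, hi * step_seconds),
        # Fewest blocks if all are longest, most if all are shortest.
        "blocks": (b / hi_frac, b / lo_frac),
    }


challenge = block_stats(seq_len=48, step_seconds=3600, b=0.5)   # hourly ICU bins
sleep_edf = block_stats(seq_len=3000, step_seconds=0.01, b=0.5)  # 100 Hz samples

print(challenge["steps"])    # roughly 2.4 to 4.8 steps, i.e. about 2 to 5 hourly bins
print(sleep_edf["steps"])    # roughly 150 to 300 samples
print(sleep_edf["seconds"])  # roughly 1.5 to 3 seconds of signal per block
```

The number of blocks needed to reach a given fraction is the same for both datasets (between 5 and 10 at $b=0.5$); what differs is how many consecutive steps each block removes.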

##### Curriculum dropout protection is bounded by its training configuration\.

Models using curriculum modality dropout (Appendix [D.5](https://arxiv.org/html/2605.15235#A4.SS5.SSS0.Px3)), such as Maestro, show substantially lower degradation under modality missing (Challenge-2012: $\Delta\text{Mod50\%} = -0.095$ vs. $-0.172$ for FuseMoE). However, this protection only holds within the maximum dropout rate seen during training. On PPG-DaLiA (5 modalities), a 50% modality missing rate removes 2.5 modalities on average (in this case, 3 modalities would be removed), which exceeds Maestro’s training upper bound of $\approx 40\%$ (at most 2 modalities). As a result, Maestro’s degradation becomes the worst in the dataset ($\Delta = -0.099$). The curriculum upper bound must cover the worst expected missing rate at test time.
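The PPG-DaLiA numbers above can be reproduced with a short calculation. As a simplifying assumption, the sketch treats the 5 modalities as the units that are dropped independently, as the paragraph does, even though the protocol operates at channel level; the helper name is hypothetical.

```python
import math


def prob_more_than(k, n, p):
    """P(X > k) for X ~ Binomial(n, p): more than k of n modalities are dropped."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1, n + 1))


n_modalities = 5
train_cap = math.floor(0.40 * n_modalities + 1e-9)  # at most 2 modalities dropped in training
expected_at_50 = 0.5 * n_modalities                 # 2.5 modalities dropped on average at test

print(train_cap, expected_at_50)
print(prob_more_than(train_cap, n_modalities, 0.5))  # share of samples beyond the training cap
print(prob_more_than(train_cap, n_modalities, 0.2))
```

Under these assumptions, half of the test samples at a 50% rate lose more modalities than the curriculum ever presented, compared with under 6% at a 20% rate. The "at least one retained" rule does not change these figures, since the all-dropped case becomes four dropped, which is still above the cap.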

### 4.4 Degradation Analysis

![Refer to caption](https://arxiv.org/html/2605.15235v1/x3.png)

Figure 3: Degradation analysis. Left: Radar chart of AUROC drop ($\Delta\text{AUROC} = \text{clean} - \text{missing}$, averaged over three seeds) across all 9 datasets under modality and within-modality missing at 20% and 50% rates; larger area indicates greater overall sensitivity, and each axis corresponds to one dataset. Right: Detailed degradation trajectory of Flex-MoE on PPG-DaLiA: both AUROC and Macro-F1 decline steeply as missing rate increases, with modality missing inducing sharper drops than within-modality missing.

Figure [3](https://arxiv.org/html/2605.15235#S4.F3) (left) summarizes AUROC degradation across all datasets and models, showing that degradation varies by dataset structure and architecture family (see Tables [19](https://arxiv.org/html/2605.15235#A5.T19)–[20](https://arxiv.org/html/2605.15235#A5.T20)). Figure [3](https://arxiv.org/html/2605.15235#S4.F3) (right) examines Flex-MoE on PPG-DaLiA as a representative case study. Macro-F1 drops more sharply than AUROC under both conditions (modality 50%: $\Delta\text{F1} = -0.152$ vs. $\Delta\text{AUROC} = -0.054$), and modality missing causes a steeper decline than within-modality missing ($\Delta\text{F1} = -0.076$ at within-modality 50%). The F1–AUROC gap arises because AUROC measures ranking ability and is relatively insensitive to shifts in predicted score magnitudes, whereas F1 directly reflects classification decision quality; when expert routing loses an entire input modality, the model tends toward majority-class predictions and F1 collapses faster.

The sharper drop under modality missing reflects the complementary nature of PPG-DaLiA’s modalities: removing an entire channel (e.g., PPG, skin temperature, or accelerometer) deprives the MoE gating mechanism of one expert’s input entirely, while within-modality missing only corrupts a local temporal window and leaves the gating structure intact. The degradation trajectory also steepens nonlinearly from 20% to 50% under modality missing, whereas within-modality missing degrades more gradually, suggesting that MoE-fusion architectures are disproportionately sensitive to complete channel loss as missing rate increases.
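The F1–AUROC gap described above can be reproduced on a toy example. The sketch below uses made-up scores, not benchmark outputs: uniformly shrinking every score (a hypothetical stand-in for a model that becomes less confident when inputs go missing) leaves the ranking, and hence AUROC, unchanged, while thresholded predictions drift toward the negative class and Macro-F1 falls. The 0.5 threshold and the 0.6 shrink factor are assumptions.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))


def macro_f1(labels, preds):
    """Unweighted mean of the F1 scores of class 0 and class 1."""
    f1s = []
    for c in (0, 1):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / 2


def threshold(scores, t=0.5):
    return [int(s >= t) for s in scores]


labels = [1, 1, 1, 0, 0, 0]
clean = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
shrunk = [0.6 * s for s in clean]  # same ordering, lower magnitudes

print(auroc(labels, clean), macro_f1(labels, threshold(clean)))
print(auroc(labels, shrunk), macro_f1(labels, threshold(shrunk)))
```

In this toy case the ranking metric stays perfect while two of the three positives fall below the threshold, which is the mechanism behind the observation that F1 degrades faster than AUROC.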

### 4.5 Imputation Effect

We evaluate an impute-then-predict strategy on PTB-XL. A conditional score-based diffusion model \[[14](https://arxiv.org/html/2605.15235#bib.bib3), [39](https://arxiv.org/html/2605.15235#bib.bib4)\] is trained at native ECG resolution ($T=1000$, 50 diffusion steps) and conditioned on observed channels to reconstruct missing segments offline; each downstream classifier then trains on the pre-imputed signals (Appendix [F.1](https://arxiv.org/html/2605.15235#A6.SS1)). Results are reported in Tables [7](https://arxiv.org/html/2605.15235#S4.T7)–[8](https://arxiv.org/html/2605.15235#S4.T8). Under modality missing, the strategy offers no benefit: AUROC drops by 0.004 to 0.028 across all settings, and Macro-F1 degrades more sharply. Without cross-modal conditioning guidance, reconstructing a fully absent modality is out of distribution and yields unreliable estimates. Models trained on zero-filled inputs learn to isolate missingness, so low-quality reconstructions instead corrupt the representation space. Under within-modality missing, imputation improves all three models on PTB-XL. Gains scale with the missing rate: Flex-MoE recovers the most ($+0.038$ AUROC at 50%), as its expert routing is most sensitive to corrupted inputs. CLIMB improves moderately ($+0.014$ AUROC at 50%). ShaSpec benefits least, gaining only $+0.001$ AUROC at 50%, because its shared-specific decomposition already handles within-lead gaps. These gains do not close the gap to clean performance, so imputation serves as a partial fix rather than a complete solution. These divergent outcomes suggest applying imputation selectively. Whether these patterns hold beyond PTB-XL remains open.
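The difference between the zero-filled ("Raw") input and the pre-imputed input can be sketched for a single channel. The imputer below is plain linear interpolation, used only as a simple stand-in; it is not the conditional diffusion model described above, and it conditions only on the same channel rather than on the other observed channels. All function names are hypothetical.

```python
def zero_fill(x, observed):
    """Raw baseline: masked steps are replaced by zeros."""
    return [v if o else 0.0 for v, o in zip(x, observed)]


def interpolate_fill(x, observed):
    """Stand-in imputer (NOT diffusion): linearly interpolate masked steps from observed ones."""
    idx = [i for i, o in enumerate(observed) if o]
    if not idx:
        # Entire channel absent: nothing to condition on within this channel.
        return [0.0] * len(x)
    out = list(x)
    for i, o in enumerate(observed):
        if o:
            continue  # observed values are never overwritten
        left = max((j for j in idx if j < i), default=None)
        right = min((j for j in idx if j > i), default=None)
        if left is None:
            out[i] = x[right]
        elif right is None:
            out[i] = x[left]
        else:
            w = (i - left) / (right - left)
            out[i] = (1 - w) * x[left] + w * x[right]
    return out


signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
observed = [True, True, False, False, True, True]  # one contiguous within-channel gap

raw_input = zero_fill(signal, observed)
imputed_input = interpolate_fill(signal, observed)
print(raw_input)
print(imputed_input)
```

In an impute-then-predict pipeline of this shape, the classifier is trained on `imputed_input` instead of `raw_input`. The fallback for a fully unobserved channel mirrors the point made above: a within-channel gap has neighbouring context to reconstruct from, whereas an entirely absent channel offers none.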

Table 7: Effect of diffusion\-based imputation on PTB\-XL under modality missing\. Raw: models trained on zero\-filled dropped\-modality input\. Diffusion: same model trained on diffusion\-reconstructed input\. $\Delta$ denotes the performance change from Raw to Diffusion Imputed\. AUC: AUROC; F1: Macro\-F1\. Three seeds; mean ± std\.

Table 8: Effect of diffusion\-based imputation on PTB\-XL under within\-modality missing\. Raw: models trained on zero\-filled corrupted input\. Diffusion: models trained on diffusion\-reconstructed input\. $\Delta$ denotes the performance change from Raw to Diffusion Imputed\. AUC: AUROC; F1: Macro\-F1\. Three seeds; mean ± std\.

## 5 Conclusion

We introduce MuteBench, a benchmark evaluating multimodal fusion architectures under two missing\-data conditions across 9 clinical datasets, 6 models, and 810 experimental runs\. Architecture family predicts robustness better than parameter count: channel\-independent models handle modality missing well, while shared\-specific models excel at within\-modality missing on homogeneous signals but degrade on short or imbalanced sequences\. A PTB\-XL case study suggests that diffusion imputation can improve within\-modality missing recovery but provides little benefit when an entire channel is absent\. Validating this finding broadly across more diverse datasets and settings remains important future work\. We recommend matching architecture choice to the dominant failure mode and setting curriculum dropout bounds to the worst expected missing rate\.

Our findings also suggest several directions for future work: designing architectures that balance shared and modality\-specific representations across heterogeneous signals, developing fusion mechanisms that degrade gracefully without explicit dropout curricula, and extending diffusion\-based imputation beyond a single dataset\. By releasing our complete benchmark suite, missingness library, and pretrained checkpoints, we hope MuteBench provides the practical guidance that real\-world practitioners need and establishes a reproducible and extensible foundation for developing the next generation of robust multimodal fusion models for clinical AI\.

## References

- \[1\]J\. N\. Acosta, G\. J\. Falcone, P\. Rajpurkar, and E\. J\. Topol\(2022\)Multimodal biomedical AI\.Nature Medicine28\(9\),pp\. 1773–1784\.External Links:[Document](https://dx.doi.org/10.1038/s41591-022-01981-2),[Link](https://www.nature.com/articles/s41591-022-01981-2)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p1.1)\.
- \[2\]S\. Bae, D\. Kyung, J\. Ryu, E\. Cho, G\. Lee, S\. Kweon, J\. Oh, L\. Ji, E\. Chang, T\. Kim, and E\. Choi\(2023\)EHRXQA: a multi\-modal question answering dataset for electronic health records with chest x\-ray images\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 3867–3880\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/0c007ebef1d11fd48da6ce4f54687db6-Paper-Datasets_and_Benchmarks.pdf)Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p1.1)\.
- \[3\]\(2026\)Benchmarking imputation strategies for missing time\-series data in critical care using real\-world\-inspired scenarios\.Scientific Reports\.External Links:[Document](https://dx.doi.org/10.1038/s41598-026-39035-z)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p2.1)\.
- \[4\]Y\. Bengio, J\. Louradour, R\. Collobert, and J\. Weston\(2009\)Curriculum learning\.InProceedings of the 26th Annual International Conference on Machine Learning,ICML ’09,New York, NY, USA,pp\. 41–48\.External Links:ISBN 9781605585161,[Document](https://dx.doi.org/10.1145/1553374.1553380)Cited by:[§B\.6](https://arxiv.org/html/2605.15235#A2.SS6.SSS0.Px2.p1.1)\.
- \[5\]Z\. Che, S\. Purushotham, K\. Cho, D\. Sontag, and Y\. Liu\(2018\)Recurrent neural networks for multivariate time series with missing values\.Scientific Reports8\(1\),pp\. 6085\.External Links:[Link](https://www.nature.com/articles/s41598-018-24271-9),[Document](https://dx.doi.org/10.1038/s41598-018-24271-9)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p1.1),[§2](https://arxiv.org/html/2605.15235#S2.p2.1)\.
- \[6\]E\. Chen, A\. Kansal, J\. Chen, B\. T\. Jin, J\. R\. Reisler, D\. A\. Kim, and P\. Rajpurkar\(2023\)Multimodal clinical benchmark for emergency care \(mc\-bec\): a comprehensive benchmark for evaluating foundation models in emergency medicine\.External Links:2311\.04937,[Link](https://arxiv.org/abs/2311.04937)Cited by:[Table 1](https://arxiv.org/html/2605.15235#S1.T1.3.4.4.1),[§2](https://arxiv.org/html/2605.15235#S2.p1.1),[§2](https://arxiv.org/html/2605.15235#S2.p3.1)\.
- \[7\]J\. Chromik, S\. A\. I\. Klopfenstein, B\. Pfitzner, Z\. C\. Sinno, B\. Arnrich, F\. Balzer, and A\. Poncette\(2022\)Computational approaches to alleviate alarm fatigue in intensive care medicine: a systematic literature review\.Frontiers in Digital Health4,pp\. 843747\.External Links:[Document](https://dx.doi.org/10.3389/fdgth.2022.843747),[Link](https://doi.org/10.3389/fdgth.2022.843747)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p1.1)\.
- \[8\]W\. Dai, P\. Chen, M\. Lu, D\. A\. Li, H\. Wei, H\. Cui, and P\. P\. Liang\(2025\)CLIMB: data foundations for large scale multimodal clinical foundation models\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=TcvjOSePic)Cited by:[Table 9](https://arxiv.org/html/2605.15235#A2.T9.1.1.4),[Table 1](https://arxiv.org/html/2605.15235#S1.T1.3.5.5.1),[§4\.1](https://arxiv.org/html/2605.15235#S4.SS1.SSS0.Px2.p1.1)\.
- \[9\]J\. Dunn, L\. Kidzinski, R\. Runge, D\. Witt, J\. L\. Hicks, S\. M\. Schüssler\-Fiorenza Rose, X\. Li, A\. Bahmani, S\. L\. Delp, T\. Hastie, and M\. P\. Snyder\(2021\)Wearable sensors enable personalized predictions of clinical laboratory measurements\.Nature Medicine27\(6\),pp\. 1105–1112\.External Links:[Document](https://dx.doi.org/10.1038/s41591-021-01339-0),[Link](https://doi.org/10.1038/s41591-021-01339-0)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p1.1)\.
- \[10\]D\. Fast, L\. C\. Adams, F\. Busch, C\. Fallon, M\. Huppertz, R\. Siepmann, P\. Prucker, N\. Bayerl, D\. Truhn, M\. Makowski,et al\.\(2024\)Autonomous medical evaluation for guideline adherence of large language models\.NPJ Digital Medicine7\(1\),pp\. 358\.Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p1.1)\.
- \[11\]S\. G\. Finlayson, A\. Subbaswamy, K\. Singh, J\. Bowers, A\. Kupke, J\. Zittrain, I\. S\. Kohane, and S\. Saria\(2021\)The clinician and dataset shift in artificial intelligence\.New England Journal of Medicine385\(3\),pp\. 283–286\.External Links:[Document](https://dx.doi.org/10.1056/NEJMc2104626)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p2.1)\.
- \[12\]A\. L\. Goldberger, L\. A\. N\. Amaral, L\. Glass, J\. M\. Hausdorff, P\. Ch\. Ivanov, R\. G\. Mark, J\. E\. Mietus, G\. B\. Moody, C\. Peng, and H\. E\. Stanley\(2000\)PhysioBank, physiotoolkit, and physionet\.Circulation101\(23\),pp\. e215–e220\.External Links:[Document](https://dx.doi.org/10.1161/01.CIR.101.23.e215)Cited by:[1st item](https://arxiv.org/html/2605.15235#A3.I3.i1.p1.1),[2nd item](https://arxiv.org/html/2605.15235#A3.I4.i2.p1.1)\.
- \[13\]X\. Han, H\. Nguyen, C\. Harris, N\. Ho, and S\. Saria\(2024\)FuseMoE: mixture\-of\-experts transformers for fleximodal fusion\.arXiv preprint arXiv:2402\.03226\.External Links:[Link](https://arxiv.org/abs/2402.03226)Cited by:[§B\.5](https://arxiv.org/html/2605.15235#A2.SS5.p1.1),[Table 9](https://arxiv.org/html/2605.15235#A2.T9.5.5.4),[item 2](https://arxiv.org/html/2605.15235#A3.I4.i1.I1.i2.p1.1),[§4\.1](https://arxiv.org/html/2605.15235#S4.SS1.SSS0.Px2.p1.1)\.
- \[14\]J\. Ho, A\. Jain, and P\. Abbeel\(2020\)Denoising diffusion probabilistic models\.InAdvances in Neural Information Processing Systems,Vol\.33,pp\. 6840–6851\.Cited by:[§F\.1](https://arxiv.org/html/2605.15235#A6.SS1.p1.2),[§4\.5](https://arxiv.org/html/2605.15235#S4.SS5.p1.4)\.
- \[15\]Y\. Hu, T\. Li, Q\. Lu, W\. Shao, J\. He, Y\. Qiao, and P\. Luo\(2024\)OmniMedVQA: a new large\-scale comprehensive evaluation benchmark for medical lvlm\.In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 22170–22183\.External Links:[Document](https://dx.doi.org/10.1109/CVPR52733.2024.02093)Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p1.1)\.
- \[16\]Y\. Huang, J\. Lin, C\. Zhou, H\. Yang, and L\. Huang\(2022\-17–23 Jul\)Modality competition: what makes joint training of multi\-modal network fail in deep learning? \(Provably\)\.InProceedings of the 39th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.162,pp\. 9226–9259\.External Links:[Link](https://proceedings.mlr.press/v162/huang22e.html)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p2.1)\.
- \[17\]J\. Jiang, J\. Zhang, Y\. Bi, J\. Bai, W\. Liu, W\. Jin, Z\. Xue, Y\. Liu, X\. Hu, and S\. Yan\(2026\)M3CoTBench: benchmark chain\-of\-thought of MLLMs in medical image understanding\.arXiv preprint arXiv:2601\.08758\.Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p1.1)\.
- \[18\]D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits\(2021\)What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\.Applied Sciences11\(14\)\.External Links:[Link](https://www.mdpi.com/2076-3417/11/14/6421),ISSN 2076\-3417,[Document](https://dx.doi.org/10.3390/app11146421)Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p1.1)\.
- \[19\]A\. Johnson, L\. Bulgarelli, T\. Pollard, B\. Gow, B\. Moody, S\. Horng, L\. A\. Celi, and R\. Mark\(2024\-10\)MIMIC\-IV\.PhysioNet\.Note:Version 3\.1External Links:[Document](https://dx.doi.org/10.13026/kpb9-mt58),[Link](https://doi.org/10.13026/kpb9-mt58)Cited by:[1st item](https://arxiv.org/html/2605.15235#A3.I4.i1.p1.1),[§3\.2](https://arxiv.org/html/2605.15235#S3.SS2.p1.1)\.
- \[20\]A\. Johnson, M\. Lungren, Y\. Peng, Z\. Lu, R\. Mark, S\. Berkowitz, and S\. Horng\(2019\)MIMIC\-cxr\-jpg: chest radiographs with structured labels\.PhysioNet\.External Links:[Document](https://dx.doi.org/10.13026/8360-t248)Cited by:[item 2](https://arxiv.org/html/2605.15235#A3.I4.i1.I1.i2.p1.1)\.
- \[21\]B\. Kemp, A\.H\. Zwinderman, B\. Tuk, H\.A\.C\. Kamphuisen, and J\.J\.L\. Oberye\(2000\)Analysis of a sleep\-dependent neuronal feedback loop: the slow\-wave microcontinuity of the eeg\.IEEE Transactions on Biomedical Engineering47\(9\),pp\. 1185–1194\.External Links:[Document](https://dx.doi.org/10.1109/10.867928),[Link](https://physionet.org/content/sleep-edfx/1.0.0/)Cited by:[1st item](https://arxiv.org/html/2605.15235#A3.I3.i1.p1.1),[§3\.2](https://arxiv.org/html/2605.15235#S3.SS2.p1.1)\.
- \[22\]Y\. Lee, Y\. Tsai, W\. Chiu, and C\. Lee\(2023\)Multimodal prompting with missing modalities for visual recognition\.InIEEE Conference on Computer Vision and Pattern Recognition \(CVPR\),Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p2.1)\.
- \[23\]H\. Li, B\. Deng, C\. Xu, Z\. Feng, V\. Schlegel, Y\. Huang, Y\. Sun, J\. Sun, K\. Yang, Y\. Yu,et al\.\(2025\)MIRA: medical time series foundation model for real\-world health data\.arXiv preprint arXiv:2506\.07584\.External Links:[Link](https://arxiv.org/abs/2506.07584)Cited by:[§B\.3](https://arxiv.org/html/2605.15235#A2.SS3.p1.1),[Table 9](https://arxiv.org/html/2605.15235#A2.T9.2.2.4),[§4\.1](https://arxiv.org/html/2605.15235#S4.SS1.SSS0.Px2.p1.1)\.
- \[24\]X\. Li, M\. Gao, Y\. Hao, T\. Li, G\. Wan, Z\. Wang, and Y\. Wang\(2025\)MedGUIDE: benchmarking clinical decision\-making in large language models\.arXiv preprint arXiv:2505\.11613\.Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p1.1)\.
- \[25\]P\. P\. Liang, Y\. Lyu, X\. Fan, A\. Agarwal, Y\. Cheng, L\. Morency, and R\. Salakhutdinov\(2023\)MULTIZOO & MULTIBENCH: a standardized toolkit for multimodal deep learning\.Journal of Machine Learning Research24,pp\. 1–7\.Cited by:[Table 1](https://arxiv.org/html/2605.15235#S1.T1.3.2.2.1),[§2](https://arxiv.org/html/2605.15235#S2.p1.1),[§2](https://arxiv.org/html/2605.15235#S2.p3.1)\.
- \[26\]M\. Ma, J\. Ren, L\. Zhao, S\. Tulyakov, C\. Wu, and X\. Peng\(2021\)SMIL: multimodal learning with severely missing modality\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.35,pp\. 2302–2310\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/16330),[Document](https://dx.doi.org/10.1609/aaai.v35i3.16330)Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p2.1)\.
- \[27\]L\. Martínez\-Villaseñor, H\. Ponce, J\. Brieva, E\. Moya\-Albor, J\. Núñez\-Martínez, and C\. Peñafort\-Asturiano\(2019\)UP\-fall detection dataset: a multimodal approach\.Sensors19\(9\)\.External Links:[Link](https://www.mdpi.com/1424-8220/19/9/1988),ISSN 1424\-8220,[Document](https://dx.doi.org/10.3390/s19091988)Cited by:[1st item](https://arxiv.org/html/2605.15235#A3.I2.i1.p1.1),[§3\.2](https://arxiv.org/html/2605.15235#S3.SS2.p1.1)\.
- \[28\]J\. Oliveira, F\. Renna, P\. Costa, M\. Nogueira, A\. C\. Oliveira, A\. Elola, C\. Ferreira, A\. Jorge, A\. Bahrami Rad, M\. Reyna, R\. Sameni, G\. Clifford, and M\. Coimbra\(2022\-05\)The CirCor DigiScope Phonocardiogram Dataset\.PhysioNet\.Note:Version 1\.0\.3External Links:[Document](https://dx.doi.org/10.13026/tshs-mw03),[Link](https://doi.org/10.13026/tshs-mw03)Cited by:[3rd item](https://arxiv.org/html/2605.15235#A3.I4.i3.p1.1),[§3\.2](https://arxiv.org/html/2605.15235#S3.SS2.p1.1)\.
- \[29\]A\. P\. Payal Mohapatra, S\. Xia, and Q\. Zhu\(2025\)MAESTRO: adaptive sparse attention and robust learning for multimodal dynamic time series\.InNeurIPS,Cited by:[§B\.6](https://arxiv.org/html/2605.15235#A2.SS6.p1.1),[Table 9](https://arxiv.org/html/2605.15235#A2.T9.6.6.4),[§4\.1](https://arxiv.org/html/2605.15235#S4.SS1.SSS0.Px2.p1.1)\.
- \[30\]A\. Reiss, I\. Indlekofer, and P\. Schmidt\(2019\)PPG\-DaLiA\.Note: UCI Machine Learning Repository\. DOI: https://doi\.org/10\.24432/C53890\.External Links:[Link](https://archive.ics.uci.edu/dataset/495/ppg+dalia)Cited by:[2nd item](https://arxiv.org/html/2605.15235#A3.I3.i2.p1.1),[§3\.2](https://arxiv.org/html/2605.15235#S3.SS2.p1.1)\.
- \[31\]P\. Schmidt, A\. Reiss, R\. Duerichen, C\. Marberger, and K\. Van Laerhoven\(2018\)Introducing wesad, a multimodal dataset for wearable stress and affect detection\.InProceedings of the 20th ACM International Conference on Multimodal Interaction,ICMI ’18,New York, NY, USA,pp\. 400–408\.External Links:ISBN 9781450356923,[Link](https://doi.org/10.1145/3242969.3242985),[Document](https://dx.doi.org/10.1145/3242969.3242985)Cited by:[3rd item](https://arxiv.org/html/2605.15235#A3.I3.i3.p1.1),[§3\.2](https://arxiv.org/html/2605.15235#S3.SS2.p1.1)\.
- \[32\]N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. Le, G\. Hinton, and J\. Dean\(2017\)Outrageously large neural networks: the sparsely\-gated mixture\-of\-experts layer\.arXiv preprint arXiv:1701\.06538\.Cited by:[1st item](https://arxiv.org/html/2605.15235#A2.I2.i1.p1.1),[3rd item](https://arxiv.org/html/2605.15235#A2.I5.i3.p1.2),[§B\.6](https://arxiv.org/html/2605.15235#A2.SS6.p2.1)\.
- \[33\]B\. Shickel, T\. J\. Loftus, L\. Adhikari, T\. Ozrazgat\-Baslanti, A\. Bihorac, and P\. Rashidi\(2019\)DeepSOFA: a continuous acuity score for critically ill patients using clinically interpretable deep learning\.Scientific Reports9\(1\),pp\. 1879\.External Links:[Document](https://dx.doi.org/10.1038/s41598-019-38491-0),[Link](https://pmc.ncbi.nlm.nih.gov/articles/PMC6372608/)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p1.1)\.
- \[34\]S\. N\. Shukla and B\. M\. Marlin\(2021\)Multi\-time attention networks for irregularly sampled time series\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=4c0J6lwQ4_)Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p2.1)\.
- \[35\]I\. Silva, G\. Moody, D\. J\. Scott, L\. A\. Celi, and R\. G\. Mark\(2012\)Predicting in\-hospital mortality of icu patients: the physionet/computing in cardiology challenge 2012\.In2012 Computing in Cardiology,pp\. 245–248\.External Links:[Link](https://physionet.org/content/challenge-2012/1.0.0/)Cited by:[2nd item](https://arxiv.org/html/2605.15235#A3.I4.i2.p1.1),[§3\.2](https://arxiv.org/html/2605.15235#S3.SS2.p1.1)\.
- \[36\]K\. Singhal, S\. Azizi, T\. Tu, S\. Mahdavi, J\. Wei, H\. W\. Chung, N\. Scales, A\. Tanwani, H\. Cole\-Lewis, P\. Payne, S\. Pfohl, M\. Seneviratne, P\. Gamble, C\. Kelly, A\. A\. H\. Babiker, N\. Schaerli, A\. Chowdhery, P\. Mansfield, D\. Demner\-Fushman, B\. Aguera\-Arcas, D\. Webster, G\. Corrado, Y\. Matias, K\. Chou, J\. Gottweis, N\. Tomašev, Y\. Liu, A\. Rajkomar, J\. Barral, C\. Semturs, A\. Karthikesalingam, and V\. Natarajan\(2023\)Large language models encode clinical knowledge\.Nature\.External Links:[Link](https://www.nature.com/articles/s41586-023-06291-2)Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p1.1)\.
- \[37\]L\. R\. Soenksen, Y\. Ma, C\. Zeng, L\. Boussioux, K\. Villalobos Carballo, L\. Na, H\. M\. Wiberg, M\. L\. Li, I\. Fuentes, and D\. Bertsimas\(2022\)Integrated multimodal artificial intelligence framework for healthcare applications\.NPJ Digital Medicine5\(1\),pp\. 149\.External Links:[Document](https://dx.doi.org/10.1038/s41746-022-00689-4),[Link](https://www.nature.com/articles/s41746-022-00689-4)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p1.1)\.
- \[38\]N\. Strodthoff, P\. Wagner, T\. Schaeffter, and W\. Samek\(2021\)Deep learning for ecg analysis: benchmarks and insights from ptb\-xl\.IEEE Journal of Biomedical and Health Informatics25\(5\),pp\. 1519–1528\.External Links:[Document](https://dx.doi.org/10.1109/JBHI.2020.3022989)Cited by:[2nd item](https://arxiv.org/html/2605.15235#A3.I2.i2.p1.1)\.
- \[39\]Y\. Tashiro, J\. Song, Y\. Song, and S\. Ermon\(2021\)Csdi: conditional score\-based diffusion models for probabilistic time series imputation\.Advances in neural information processing systems34,pp\. 24804–24816\.Cited by:[§F\.1](https://arxiv.org/html/2605.15235#A6.SS1.p1.2),[§4\.5](https://arxiv.org/html/2605.15235#S4.SS5.p1.4)\.
- \[40\]T\. Tu, S\. Azizi, D\. Driess, M\. Schaekermann, M\. Amin, P\. Chang, A\. Carroll, C\. Lau, R\. Tanno, I\. Ktena, A\. Palepu, B\. Mustafa, A\. Chowdhery, Y\. Liu, S\. Kornblith, D\. Fleet, P\. Mansfield, S\. Prakash, R\. Wong, S\. Virmani, C\. Semturs, S\. S\. Mahdavi, B\. Green, E\. Dominowska, B\. A\. y Arcas, J\. Barral, D\. Webster, G\. S\. Corrado, Y\. Matias, K\. Singhal, P\. Florence, A\. Karthikesalingam, and V\. Natarajan\(2024\)Towards generalist biomedical ai\.NEJM AI1\(3\),pp\. AIoa2300138\.External Links:[Document](https://dx.doi.org/10.1056/AIoa2300138)Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p1.1)\.
- \[41\]G\. Vila, C\. Godin, S\. Charbonnier, and A\. Campagne\(2021\)Real\-time quality index to control data loss in real\-life cardiac monitoring applications\.Sensors21\(16\),pp\. 5357\.External Links:[Document](https://dx.doi.org/10.3390/s21165357),[Link](https://doi.org/10.3390/s21165357)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p1.1)\.
- \[42\]P\. Wagner, N\. Strodthoff, R\. Bousseljot, W\. Samek, and T\. Schaeffter\(2020\-04\)PTB\-XL, a large publicly available electrocardiography dataset\.PhysioNet\.Note:Version 1\.0\.1External Links:[Document](https://dx.doi.org/10.13026/x4td-x982),[Link](https://doi.org/10.13026/x4td-x982)Cited by:[2nd item](https://arxiv.org/html/2605.15235#A3.I2.i2.p1.1),[§3\.2](https://arxiv.org/html/2605.15235#S3.SS2.p1.1)\.
- \[43\]H\. Wang, Y\. Chen, C\. Ma, J\. Avery, L\. Hull, and G\. Carneiro\(2023\)Multi\-modal learning with missing modality via shared\-specific feature modelling\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 15878–15887\.Cited by:[§B\.4](https://arxiv.org/html/2605.15235#A2.SS4.p1.1),[Table 9](https://arxiv.org/html/2605.15235#A2.T9.3.3.4),[§2](https://arxiv.org/html/2605.15235#S2.p2.1),[§4\.1](https://arxiv.org/html/2605.15235#S4.SS1.SSS0.Px2.p1.1)\.
- \[44\]W\. Wang, D\. Tran, and M\. Feiszli\(2020\)What makes training multi\-modal classification networks hard?\.In2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 12692–12702\.External Links:[Document](https://dx.doi.org/10.1109/CVPR42600.2020.01271)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p2.1)\.
- \[45\]Y\. Wang, C\. Yin, and P\. Zhang\(2024\)Multimodal risk prediction with physiological signals, medical images and clinical notes\.Heliyon10\(5\),pp\. e26772\.External Links:ISSN 2405\-8440,[Document](https://dx.doi.org/10.1016/j.heliyon.2024.e26772)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p2.1),[§2](https://arxiv.org/html/2605.15235#S2.p2.1)\.
- \[46\]K\. Wantlin, C\. Wu, S\. Huang, O\. Banerjee, F\. Dadabhoy, V\. V\. Mehta, R\. W\. Han, F\. Cao, R\. R\. Narayan, E\. Colak, A\. Adamson, L\. Heacock, G\. H\. Tison, A\. Tamkin, and P\. Rajpurkar\(2023\)BenchMD: a benchmark for modality\-agnostic learning on medical images and sensors\.External Links:2304\.08486Cited by:[§B\.1](https://arxiv.org/html/2605.15235#A2.SS1.p1.1),[Table 1](https://arxiv.org/html/2605.15235#S1.T1.3.3.3.1),[§2](https://arxiv.org/html/2605.15235#S2.p1.1)\.
- \[47\]E\. Warner, J\. Lee, W\. Hsu, T\. Syeda\-Mahmood, C\. E\. Kahn Jr, O\. Gevaert, and A\. Rao\(2024\)Multimodal machine learning in image\-based and clinical biomedicine: survey and prospects\.International Journal of Computer Vision132\(9\),pp\. 3753–3769\.External Links:[Document](https://dx.doi.org/10.1007/s11263-024-02032-8),[Link](https://link.springer.com/article/10.1007/s11263-024-02032-8)Cited by:[§1](https://arxiv.org/html/2605.15235#S1.p1.1)\.
- \[48\]W\. Yao, K\. Yin, W\. K\. Cheung, J\. Liu, and J\. Qin\(2024\)DrFuse: learning disentangled representation for clinical multi\-modal fusion with missing modality and modal inconsistency\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 16416–16424\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/29578),[Document](https://dx.doi.org/10.1609/aaai.v38i15.29578)Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p2.1)\.
- \[49\]S\. Yun, I\. Choi, J\. Peng, Y\. Wu, J\. Bao, Q\. Zhang, J\. Xin, Q\. Long, and T\. Chen\(2024\)Flex\-moe: modeling arbitrary modality combination via the flexible mixture\-of\-experts\.External Links:2410\.08245,[Link](https://arxiv.org/abs/2410.08245)Cited by:[§B\.2](https://arxiv.org/html/2605.15235#A2.SS2.p1.1),[Table 9](https://arxiv.org/html/2605.15235#A2.T9.4.4.4),[§4\.1](https://arxiv.org/html/2605.15235#S4.SS1.SSS0.Px2.p1.1)\.
- \[50\]C\. Zhang, X\. Chu, L\. Ma, Y\. Zhu, Y\. Wang, J\. Wang, and J\. Zhao\(2022\)M3Care: learning with missing modalities in multimodal healthcare data\.InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,KDD ’22,pp\. 2418–2428\.External Links:[Document](https://dx.doi.org/10.1145/3534678.3539388)Cited by:[Table 1](https://arxiv.org/html/2605.15235#S1.T1.3.6.6.1),[§1](https://arxiv.org/html/2605.15235#S1.p1.1),[§2](https://arxiv.org/html/2605.15235#S2.p1.1)\.
- \[51\]X\. Zhang, M\. Zeman, T\. Tsiligkaridis, and M\. Zitnik\(2022\)Graph\-guided network for irregularly sampled multivariate time series\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Kwm8I7dU-l5)Cited by:[§2](https://arxiv.org/html/2605.15235#S2.p2.1)\.
- \[52\]J\. Zheng, H\. Guo, and H\. Chu\(2022\-08\)A large scale 12\-lead electrocardiogram database for arrhythmia study\.PhysioNet\.Note:Version 1\.0\.0External Links:[Document](https://dx.doi.org/10.13026/wgex-er52),[Link](https://doi.org/10.13026/wgex-er52)Cited by:[3rd item](https://arxiv.org/html/2605.15235#A3.I2.i3.p1.1),[§3\.2](https://arxiv.org/html/2605.15235#S3.SS2.p1.1)\.

## Appendix A Broader Impacts

MuteBench provides practitioners with concrete, dataset\-aware guidance for selecting multimodal fusion architectures that remain reliable under realistic sensor failures\. Making these failure modes explicit and quantifiable reduces the risk of deploying clinical AI systems that degrade silently, which could otherwise lead to missed diagnoses or delayed alerts in ICU monitoring, cardiac screening, and wearable health applications\. The released code and missingness library also lower the barrier for other researchers to conduct robustness evaluations, encouraging more systematic stress\-testing in clinical ML\.

The benchmark characterizes the conditions under which each architecture is most vulnerable, which in principle could be used to identify operating regimes where clinical AI systems are most susceptible to failure\. However, all evaluated datasets and model codebases are already publicly available, so the benchmark does not introduce new attack surfaces\. We do not foresee direct paths to harmful misuse\.

## Appendix B Detailed Baseline Model Architectures

This appendix provides detailed technical descriptions of the six representative multimodal fusion architectures evaluated in our main experiments \(Section [4\.1](https://arxiv.org/html/2605.15235#S4.SS1)\)\. For each baseline, we describe the architectural backbone, the fusion strategy, and the specific mechanisms for handling missing modalities\.

All six models are adopted from publicly available or previously published work and evaluated purely as baselines; none is a contribution of this paper\. Each model is cloned from its official repository \(or internal codebase in the case of Maestro, which is cited as a prior work\) and adapted to our unified multi\-dataset evaluation protocol with minimal modification: we replace the original classification head with a task\-appropriate output layer and apply the shared missingness injection interface described in Appendix [D\.3](https://arxiv.org/html/2605.15235#A4.SS3)\. Hyperparameter selection follows the published recommendations for each model; no model receives preferential tuning\. Table [9](https://arxiv.org/html/2605.15235#A2.T9) summarises the origin, code availability, pretraining status, and approximate parameter count for each baseline\.

Table 9: Summary of baseline model provenance and adaptation\. All models are evaluated without pretraining on external data\. Parameter counts are approximate and dataset\-independent \(i\.e\., excluding the final classification head\)\. Adaptation level: *minimal* = head replacement only; *moderate* = head replacement plus task\-specific loss\.

### B\.1 CLIMB

CLIMB \(Clinical Large\-scale Integrative Multi\-modal Benchmark\) is built on the BenchMD framework \[[46](https://arxiv.org/html/2605.15235#bib.bib19)\]\. It is designed as a modality\-agnostic benchmark framework covering diverse medical data types under a single evaluation suite\.

##### Architecture\.

- Backbone: A Domain Agnostic Transformer \(DAT\) with embedding dimension 256, a depth of 12 Transformer layers, 8 attention heads, and MLP dimension 512\.
- Channel\-independent encoding: Each time\-series channel is independently projected to a 256\-dimensional embedding space\. Channels never interact during encoding; only the aggregation step combines them\.
- Aggregation: Mean pooling across all channel embeddings produces a single fixed\-size representation, which is passed to the classification head\.

##### Missing modality handling\.

CLIMB does not implement explicit modality dropout\. Its channel\-independent encoding provides implicit robustness: absent channels are zeroed at the input level, and mean pooling naturally down\-weights missing inputs without requiring training\-time dropout\.
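The interaction between channel\-independent encoding and mean pooling can be illustrated with a toy sketch\. It assumes a bias\-free linear projection in place of the Transformer encoder and a 2\-dimensional embedding in place of 256; `embed_channel`, `mean_pool`, and the weights are hypothetical, and in the real model biases and nonlinearities mean a zeroed channel does not map exactly to a zero embedding\.

```python
def embed_channel(series, weights):
    """Project one channel on its own: bias-free linear map to a d-dim embedding."""
    return [sum(w * v for w, v in zip(row, series)) for row in weights]

def mean_pool(embeddings):
    """Average the per-channel embeddings into one fixed-size vector."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]

# Shared projection (d=2) applied to every channel independently.
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.5, 0.5, 0.0]]
channels = [[1.0, 2.0, 3.0, 4.0],
            [2.0, 2.0, 2.0, 2.0],
            [0.0, 1.0, 0.0, 1.0]]

full = mean_pool([embed_channel(c, weights) for c in channels])

# Channel 1 is absent and zeroed at the input level.
channels_missing = [channels[0], [0.0] * 4, channels[2]]
partial = mean_pool([embed_channel(c, weights) for c in channels_missing])
```

Because no channel's embedding depends on any other channel, zeroing channel 1 leaves the embeddings of channels 0 and 2 unchanged; only the missing channel's contribution to the pooled vector disappears\.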

### B\.2Flex\-MoE

Flex\-MoE \[[49](https://arxiv.org/html/2605.15235#bib.bib20)\] is a sparse Mixture\-of\-Experts framework specifically designed to handle arbitrary combinations of missing modalities at inference time \(NeurIPS 2024 Spotlight\)\. It builds combinatorial robustness directly into the MoE routing mechanism\.

##### Architecture\.

- Backbone: Standard Transformer encoder layers; the MoE module \[[32](https://arxiv.org/html/2605.15235#bib.bib5)\] replaces the feed\-forward sub\-layer\.
- Missing Modality Bank: For each observed modality subset, a learned embedding bridges the gap between the observed combination and the full\-modality representation, preventing the model from collapsing to any single dominant modality\.
- G\-Router \(Generalized Router\): Trained exclusively on complete\-modality samples to inject generalized cross\-modal knowledge into the expert pool, ensuring experts learn coherent cross\-modal representations before specialization\.
- S\-Router \(Specialized Router\): Uses top\-1 hard routing to assign each incomplete modality combination to the single most appropriate expert\. The expert index is determined by the observed modality combination index, providing a deterministic and interpretable routing scheme\.

##### Missing modality handling\.

The Missing Modality Bank explicitly models every possible subset of observed modalities, making Flex\-MoE inherently robust to any combination of missing inputs without additional training\-time dropout\.
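The deterministic routing described for the S\-Router can be sketched as a lookup from the observed\-modality pattern to an expert\. The bitmask encoding in `combination_index`, the modulo mapping in `s_route`, and the modality names are assumptions made for illustration; the source states only that the expert index is determined by the combination index, not how\.

```python
import itertools

def combination_index(observed):
    """Map a tuple of booleans (one per modality) to a unique integer in [0, 2**M)."""
    idx = 0
    for bit, present in enumerate(observed):
        if present:
            idx |= 1 << bit
    return idx

def s_route(observed, num_experts):
    """Top-1 hard routing: the same observed combination always gets the same expert.

    The modulo is a hypothetical way to fold 2**M combinations onto the experts.
    """
    return combination_index(observed) % num_experts

modalities = ("ecg", "ppg", "accel")   # hypothetical modality order
observed = (True, False, True)          # ppg is missing
combo = combination_index(observed)
expert = s_route(observed, num_experts=4)

# Every subset of three modalities receives a distinct combination index.
all_combos = sorted(combination_index(m)
                    for m in itertools.product((False, True), repeat=3))
```

A Missing Modality Bank in this picture would be a table of learned embeddings keyed by the same combination index, so that routing and bank lookup share one deterministic key\.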

### B\.3MIRA

MIRA \[[23](https://arxiv.org/html/2605.15235#bib.bib22)\] is a large\-scale foundation model for medical time series, pretrained on 454 billion time points from heterogeneous clinical sources \(ICU waveforms, hospital EHR, and epidemiological signals; NeurIPS 2025\)\. Its design priority is zero\-shot generalization across datasets with different sampling rates, signal types, and temporal horizons\.

##### Architecture\.

- Continuous\-Time Rotary Positional Encoding \(CT\-RoPE\): Extends standard RoPE to encode exact continuous\-valued timestamps rather than integer positions, allowing a single model to natively process signals sampled at 250 Hz \(ECG\), 32 Hz \(wearables\), once per minute \(vitals\), and once per day \(lab values\) within the same attention computation\.
- Frequency\-Specialized Mixture\-of\-Experts: Each expert specializes in a different temporal frequency regime \(e\.g\., high\-frequency cardiac waveforms vs\. low\-frequency clinical trends\)\. Frequency\-aware routing ensures physiologically distinct signal types are processed by appropriate expert sub\-networks rather than competing for the same representation space\.
- Neural ODE extrapolation: After encoding the observed time points, a Neural ODE \(via `torchdiffeq`\) models latent dynamics continuously, enabling inference at arbitrary timestamps without requiring a fixed\-horizon output head\. For classification fine\-tuning, the continuous forecasting head is replaced with a task\-specific classification head\.
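The idea behind CT\-RoPE can be sketched by taking the standard rotary encoding and feeding it a real\-valued timestamp instead of an integer position\. The function `ct_rope`, the frequency schedule with `base=10000.0`, and the choice of seconds as the time unit are assumptions carried over from standard RoPE; MIRA's exact parameterization is not given in this section\.

```python
import math

def ct_rope(vec, t, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to a real timestamp t."""
    d = len(vec)
    if d % 2:
        raise ValueError("embedding dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)            # one frequency per pair of dimensions
        c, s = math.cos(t * theta), math.sin(t * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Two ECG samples 2 steps apart at 250 Hz (timestamps in seconds) ...
near_start = dot(ct_rope(q, 1 / 250), ct_rope(k, 3 / 250))
# ... and the same pair shifted one minute later in the recording.
one_minute_later = dot(ct_rope(q, 60 + 1 / 250), ct_rope(k, 60 + 3 / 250))
```

The two scores agree up to floating\-point error because the rotated dot product depends only on the time difference, which is what lets streams with different sampling rates share a single attention computation\.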

##### Missing modality handling\.

MIRA uses a mask field in its data format to indicate missing time points\. Masked positions are excluded from attention computation via attention masking, so partial\-modality inputs are processed without architectural changes\.
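Attention masking of this kind can be sketched for a single query with scaled dot\-product attention\. The function `masked_attention` and the toy inputs are illustrative assumptions; the sketch shows the generic technique of setting masked scores to negative infinity before the softmax, not MIRA's code\.

```python
import math

def masked_attention(query, keys, values, observed):
    """Single-query attention in which masked positions get exactly zero weight."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    scores = [s if m else float("-inf") for s, m in zip(scores, observed)]
    peak = max(scores)
    if peak == float("-inf"):
        raise ValueError("at least one position must be observed")
    exps = [math.exp(s - peak) for s in scores]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return out, weights

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
observed = [True, True, False]        # the third time point is missing

out, weights = masked_attention(query, keys, values, observed)

# Whatever value sits at the masked position has no effect on the output.
values_alt = [[1.0, 0.0], [0.0, 1.0], [-7.0, 9.0]]
out_alt, _ = masked_attention(query, keys, values_alt, observed)
```

Since the masked position receives zero weight, the placeholder stored there \(zero, noise, or a stale reading\) never reaches the output, which is why no architectural change is needed for partial inputs\.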

### B\.4ShaSpec

ShaSpec \[[43](https://arxiv.org/html/2605.15235#bib.bib23)\] was originally designed for 3D MRI brain tumor segmentation under missing\-modality conditions \(CVPR 2023\)\. Its core contribution is an explicit decomposition of features into shared and modality\-specific components\. In our benchmark, we adapt this framework to 1D temporal signals\.

##### Architecture\.

- •Shared encoder: A stack of lightweight 1D convolutional blocks \(ConvBlock1d\) learns modality\-agnostic temporal representations using InstanceNorm1d\. The ResNet\-50 backbone and ASPP module from the original 3D model are not carried over\.
- •Modality\-specific encoders: Each modality has its own Encoder1D branch \(identical in structure to the shared encoder\) that captures modality\-exclusive temporal features\.
- •Shared\-specific fusion \(CompositionalLayer1D\): Shared and specific features are fused via a residual connection, $\text{fused}=\text{shared}+\text{conv}(\text{cat}(\text{shared},\,\text{specific}))$, preserving the core ShaSpec design principle\.
- •Classification head: Global average pooling over the fused temporal features followed by a linear layer\.
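
The residual shared-specific fusion can be sketched in a few lines. This is an illustration under stated assumptions rather than the CompositionalLayer1D code: the convolution is taken to have kernel size 1, features are plain nested lists of shape [channels][time], and the helper names `pointwise_conv` and `fuse` are hypothetical.

```python
def pointwise_conv(x, weight, bias):
    """Kernel-size-1 convolution. x is [C_in][T], weight is [C_out][C_in]."""
    steps = len(x[0])
    return [
        [bias[o] + sum(weight[o][i] * x[i][t] for i in range(len(x)))
         for t in range(steps)]
        for o in range(len(weight))
    ]

def fuse(shared, specific, weight, bias):
    """fused = shared + conv(cat(shared, specific)), cat along the channel axis."""
    stacked = shared + specific  # list concatenation stacks the channels
    projected = pointwise_conv(stacked, weight, bias)
    return [[s + p for s, p in zip(s_row, p_row)]
            for s_row, p_row in zip(shared, projected)]

shared = [[1.0, 2.0], [3.0, 4.0]]       # 2 channels, 2 time steps
specific = [[0.5, 0.5], [1.0, -1.0]]
weight = [[0.0, 0.0, 1.0, 0.0],         # output 0 copies specific channel 0
          [0.0, 0.0, 0.0, 1.0]]         # output 1 copies specific channel 1
bias = [0.0, 0.0]
fused = fuse(shared, specific, weight, bias)
```

The residual term means the shared features pass through unchanged whenever the convolution output is zero, which is consistent with the fallback to shared features described for missing modalities.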

##### Missing modality handling\.

Missing modalities are handled by zeroing out the corresponding input channels\. The modality\-specific encoder for a missing input still runs on the zero input; the model falls back to the shared encoder features from available modalities when constructing the fused representation\. No explicit modality dropout warmup is implemented\.

### B\.5 FuseMoE

FuseMoE is a multimodal time\-series fusion model built on a cross\-modal Transformer with Sparse Mixture\-of\-Experts feed\-forward layers\[[13](https://arxiv.org/html/2605.15235#bib.bib21)\]\. Its design targets irregularly sampled and heterogeneous physiological signals by combining cross\-attention fusion with learned expert specialization\.

##### Architecture\.

- •Per\-modality projection: Each modality is independently projected to a shared embedding space via a Conv1d layer, preserving local temporal structure before cross\-modal interaction\.
- •Token type embeddings: A learned embedding is added to each token to identify its source modality, providing the cross\-encoder with explicit modality identity information\.
- •TransformerCrossEncoder with Sparse MoE: The fusion module uses multi\-head cross\-attention where query tokens attend to key\-value pairs from all modalities\. The feed\-forward sub\-layer is replaced by a Sparse MoE block \[[32](https://arxiv.org/html/2605.15235#bib.bib5)\] with top\-$k$ gating \($k=1$, 4 experts\), allowing different experts to specialize in different inter\-modality interaction patterns\.
- •mTAND \(Multi\-Time Attention\): A multiTimeAttention module handles irregular sampling by learning continuous\-time attention weights, enabling the model to process asynchronous sensor streams\.

Model configuration: embed\_dim=128, num\_heads=8, num\_layers=2, num\_experts=4, top\_k=1\.
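
To make the top-k gating concrete, the sketch below routes one token through 4 toy experts with k=1. It is a simplified illustration, not FuseMoE's code: the gate logits are given directly instead of coming from a learned router, the softmax is renormalized over the selected experts only (an assumption), and `top_k_route` and `sparse_moe` are hypothetical names.

```python
import math

def top_k_route(gate_logits, k=1):
    """Return (expert_index, weight) pairs for the k largest gate logits."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in ranked)
    exps = [math.exp(gate_logits[i] - peak) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]

def sparse_moe(token, gate_logits, experts, k=1):
    """Weighted sum of the selected experts' outputs for one token."""
    out = [0.0] * len(token)
    for idx, w in top_k_route(gate_logits, k):
        y = experts[idx](token)  # only the selected experts are evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out

# Four toy experts standing in for feed-forward sub-networks.
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
    lambda v: [0.0 for _ in v],
]
routed = top_k_route([0.1, 2.3, -1.0, 0.7], k=1)
output = sparse_moe([1.0, -2.0], [0.1, 2.3, -1.0, 0.7], experts, k=1)
```

With k=1 each token is processed by exactly one expert, so the per-token compute stays close to that of a single feed-forward layer regardless of the number of experts.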

##### Missing modality handling\.

Missing\-modality tokens are replaced with learned zero\-padded embeddings; an attention mask prevents the model from attending to missing\-modality positions in the cross\-encoder\.

### B\.6 Maestro

Maestro \[[29](https://arxiv.org/html/2605.15235#bib.bib17)\] is a previously published multi\-dataset medical time\-series classifier included here solely as a representative MoE\-fusion baseline\. It is not a contribution of this paper, and its hyperparameters and training protocol are held to the same standard of transparency as all other baselines\. Its stronger clean\-data performance on several datasets reflects its architectural design choices, not preferential treatment in our evaluation\.

Maestro is built on a custom multi\-stage sparse architecture\. It does not rely on any standard backbone \(e\.g\., Transformer, Informer, ResNet\); instead, it combines SAX symbolic tokenization, per\-modality sparse attention encoding, cross\-modal sparse attention, and loss\-free Sparse MoE routing \[[32](https://arxiv.org/html/2605.15235#bib.bib5)\]\. It is designed to scale efficiently to long physiological sequences while maintaining per\-modality specialization\.

##### Architecture\.

- •SAX Symbolic Tokenization: Each modality’s time series is compressed into a symbolic sequence via Symbolic Aggregate approXimation \(SAX\)\. A dedicated reserved symbol represents missing modalities, enabling the model to natively distinguish absent from present inputs at the token level\.
- •ModalityPositionalEncoder: Combines sinusoidal temporal positional encodings with learned modality embeddings \(one per sensor/channel type\), giving each token a two\-part identity: *when* it occurred and *which modality* it belongs to\.
- •Per\-modality Sparse Attention Encoder: Each modality is encoded independently by a sparse self\-attention module whose attention budget is dynamically controlled by a modality\-aware gate, followed by 1D convolution and max\-pooling distillation for sequence compression\. This reduces per\-modality self\-attention complexity from $O(L^{2})$ to $O(L\log L)$\.
- •Cross\-modal Sparse Attention: Token sequences from all modalities are concatenated and processed by a sparse cross\-modal multi\-head attention layer, enabling inter\-modality information exchange at $O(L\log L)$ complexity\.
- •Sparse MoE Routing: A top\-$k$ Sparse MoE block \(4 experts\) replaces the standard feed\-forward layer\. Experts automatically specialize by modality combination without auxiliary balancing losses \(loss\-free MoE\)\.
- •CrossAttnTransformerClf: The top\-level classifier uses cross\-attention between a learned class token and the encoded multi\-modality token sequence, followed by a linear classification head\.

Model configuration: d\_model=32, nhead=16, num\_layers=2, num\_experts=4\.

##### Missing modality handling\.

Maestro applies two complementary mechanisms for missing modality handling\. At inference time, missing modality streams are replaced with a learned reserved token produced by SAX symbolic tokenization, allowing the model to natively accept any combination of present and absent modalities without architectural changes\. During training, curriculum modality dropout \[[4](https://arxiv.org/html/2605.15235#bib.bib45)\] linearly increases the per\-modality dropout probability from 0 to a maximum of 40% over a warmup schedule of 10 epochs, after which it remains fixed\. Dropped modality streams are zeroed out and excluded from attention computation via an attention mask\. This structured exposure to progressively harder missing\-data scenarios is the primary source of Maestro’s robustness advantage over other MoE\-fusion models at test time\.
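
The curriculum schedule described above reduces to a clipped linear ramp. The sketch below assumes zero-based epoch indexing and that at least one modality is always kept; neither detail is stated in the text, and the function names are hypothetical.

```python
import random

def dropout_probability(epoch, p_max=0.4, warmup_epochs=10):
    """Linear ramp from 0 at epoch 0 to p_max at warmup_epochs, then constant."""
    return p_max * min(epoch, warmup_epochs) / warmup_epochs

def sample_modality_mask(n_modalities, p_drop, rng, min_kept=1):
    """Draw a 0/1 presence mask, re-enabling modalities until min_kept survive."""
    mask = [0 if rng.random() < p_drop else 1 for _ in range(n_modalities)]
    dropped = [i for i, m in enumerate(mask) if m == 0]
    rng.shuffle(dropped)
    while sum(mask) < min_kept and dropped:
        mask[dropped.pop()] = 1  # assumption: never drop every modality
    return mask

schedule = [dropout_probability(e) for e in (0, 5, 10, 25)]
mask = sample_modality_mask(4, dropout_probability(25), random.Random(0))
```

Since the ramp saturates at 40%, the model rarely sees samples with a much larger fraction of modalities removed during training, which fits the finding that the protection holds reliably only up to the maximum training dropout rate.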

## Appendix C Detailed Dataset Specifications

We provide full specifications for all nine datasets below, organized by modality type\. For each dataset we describe its clinical significance, label semantics, channel configuration, dataset scale, and the preprocessing steps applied before training\. The primary evaluation metric is Macro\-AUROC for all datasets, with Macro\-F1 reported alongside\.

### C\.1 Dataset Selection Criteria

We use three selection rules to create a rigorous testbed that measures both spatial and temporal robustness:

- •Domain Diversity: We include multiple fields \(intensive care, heart signals, activity tracking\) to show that a model’s spatial compensation ability holds across different physical signals\.
- •Data Shape: We select datasets with various channel counts and sequence lengths to test the limits of temporal reasoning\. Datasets with many channels evaluate spatial redundancy, while those with long time steps measure how well models interpolate missing blocks\.
- •Task Types: We include binary, multi\-class, and multi\-label tasks to verify that our conclusions hold true regardless of the clinical objective\.

### C\.2 Dataset Overall Statistics

The benchmark covers 9 datasets from 7 clinical and physiological domains, totalling over 125,000 samples\. Table [2](https://arxiv.org/html/2605.15235#S4.T2) summarizes their key properties\.

Scale and Task Diversity\. Dataset sizes range from 515 samples \(HAR\-UP\) to 64,726 samples \(PPG\-DaLiA\)\. The benchmark covers three task formats: binary classification \(HAR\-UP, MIMIC\-IV, Challenge\-2012\), multi\-class classification with 3 to 9 categories \(Sleep\-EDF, PPG\-DaLiA, WESAD, CirCor\), and multi\-label prediction \(PTB\-XL, Chapman\-Shaoxing\)\. All datasets are evaluated with Macro\-AUROC as the primary metric, with Macro\-F1 reported alongside\.
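
For reference, Macro-AUROC can be computed as the unweighted mean of per-class AUROC values. The sketch below uses the pairwise-ranking definition of AUROC and a one-vs-rest layout for multi-class labels; the benchmark's actual evaluation code is not shown in this appendix, so the function names and the handling of ties are assumptions.

```python
def binary_auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auroc(label_matrix, score_matrix):
    """Unweighted mean of per-class (or per-label) AUROC; rows are samples."""
    n_classes = len(label_matrix[0])
    per_class = [
        binary_auroc([row[c] for row in label_matrix],
                     [row[c] for row in score_matrix])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes

# Multi-label toy example: 4 samples, 2 labels.
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.3, 0.4], [0.6, 0.8], [0.7, 0.5]]
result = macro_auroc(labels, scores)
```

Macro averaging weights every class equally, so a rare class such as the mortality label contributes as much as a frequent one. The pairwise loop is quadratic in the number of samples and is meant for illustration only.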

Channel and Sequence Diversity\. Channel counts range from 2 \(WESAD\) to 78 \(CirCor\), providing a testbed for both low\-dimensional and high\-dimensional fusion\. Sequence lengths range from 48 time steps \(MIMIC\-IV, Challenge\-2012\) to 3000 time steps \(Sleep\-EDF\)\. Sampling rates span from once\-per\-hour clinical observations to 4,000 Hz physiological waveforms\. This variation deliberately stresses models across different spatial redundancy levels and temporal resolutions, ensuring that benchmark conclusions generalize beyond any single signal regime\.

### C\.3 Formal Modality Type Definitions

We categorize the data into three types to isolate how structural alignment affects a model’s ability to handle missing information:

##### Type 1: Homogeneous and Aligned Datasets\.

All modalities in this group share the same physical signal format \(inertial motion or ECG voltage\) and are recorded synchronously along a common time axis\.

- •HAR\-UP \(UP\-Fall Detection Dataset\)\. HAR\-UP \[[27](https://arxiv.org/html/2605.15235#bib.bib24)\] is a multimodal fall\-detection dataset from the Autonomous University of Puebla \(Mexico\), designed for ambient\-assisted living applications\. Falls are one of the leading causes of injury\-related mortality in the elderly, and automated detection from body\-worn sensors can trigger emergency alerts within seconds\. The dataset covers controlled fall events and typical daily living activities across 17 subjects\. The task is binary classification\. Label 1 \(Fall\) covers forward, backward, and lateral fall events caused by tripping, stumbling, or loss of balance\. Label 0 \(Non\-Fall\) covers activities of daily living including walking, sitting, standing, and postural transitions\. Five inertial measurement unit \(IMU\) sensors are placed at five body positions: ankle, right pocket, belt, necklace, and wrist\. Each sensor records a 3\-axis accelerometer signal and a 3\-axis gyroscope signal, yielding $5\times 6=30$ channels in total\. Because all channels share the same physical signal type \(IMU motion\), this is a Type 1 \(homogeneous and aligned\) dataset\. After sliding\-window segmentation the dataset contains 515 samples\. Each sample has shape $\mathbb{R}^{30\times 140}$ \($C=30$, $T=140$\), corresponding to approximately 2\.8 seconds at ${\approx}50$ Hz\. We use all 30 sensor channels without resampling\. A fixed sliding window of 140 time steps is applied to the raw sensor streams\. Z\-score normalization is applied to each channel independently\. Data splits are stratified by subject to prevent data leakage\.
- •PTB\-XL\. PTB\-XL \[[42](https://arxiv.org/html/2605.15235#bib.bib25)\] is the largest publicly available annotated clinical 12\-lead ECG dataset, released by the Physikalisch\-Technische Bundesanstalt \(PTB\) in Berlin\. It covers the full diagnostic spectrum encountered in cardiology practice and serves as the de facto standard benchmark for ECG multi\-label classification \[[38](https://arxiv.org/html/2605.15235#bib.bib47)\]\. Automated ECG interpretation has direct clinical value in screening large populations and supporting cardiologists in high\-volume environments\. The task is multi\-label classification across five diagnostic superclasses derived from the 71 SCP\-ECG codes in the original annotation\. A patient may simultaneously carry multiple diagnoses:
  - –NORM: Normal ECG, no pathological finding identified\.
  - –MI \(Myocardial Infarction\): ST\-segment elevation, pathological Q\-waves, or T\-wave inversion consistent with current or prior infarction\.
  - –STTC \(ST/T\-wave Change\): Non\-specific repolarization abnormalities including ST depression or elevation not meeting STEMI criteria, and diffuse T\-wave changes\.
  - –CD \(Conduction Disturbance\): Bundle branch blocks \(LBBB/RBBB\), fascicular blocks, Wolff\-Parkinson\-White syndrome, and first\-/second\-/third\-degree AV conduction delays\.
  - –HYP \(Hypertrophy\): Left or right ventricular hypertrophy and left or right atrial enlargement\.

  The standard clinical 12\-lead ECG configuration is used: limb leads I, II, III, aVR, aVL, aVF, and precordial leads V1–V6 \($C=12$\)\. All leads are derived from body\-surface electrodes and are synchronously sampled, making this a Type 1 \(homogeneous and aligned\) dataset\. The dataset contains 21,837 recordings from 18,885 patients\. After preprocessing each sample has shape $\mathbb{R}^{12\times 250}$ \($C=12$, $T=250$\)\. Original signals are recorded at 500 Hz \(5000 time steps per 10\-second recording\)\. We resize the sequence to $T=250$ via linear interpolation to align input shapes across all model architectures\. Z\-score normalization is applied per lead\. We use the official 10\-fold stratified cross\-validation split provided with the dataset\.
- •Chapman\-Shaoxing\. The Chapman\-Shaoxing ECG dataset \[[52](https://arxiv.org/html/2605.15235#bib.bib26)\] is a large\-scale 12\-lead ECG corpus jointly collected by Chapman University \(USA\) and Shaoxing People’s Hospital \(China\)\. It is designed specifically for cardiac arrhythmia and conduction\-abnormality classification and provides a realistic distribution of rhythm types from outpatient and inpatient clinical settings\. Automated arrhythmia detection reduces missed diagnoses and enables earlier intervention for life\-threatening conditions such as atrial fibrillation and STEMI\. The task is multi\-label classification with 7 diagnostic label groups derived by mapping SNOMED CT codes via generate\_labels\.py; a recording may receive multiple simultaneous labels \(47% of recordings carry $\geq 2$ labels\):
  - –Normal: Sinus bradycardia, normal sinus rhythm, or sinus tachycardia\.
  - –CD \(Conduction Disturbance\): First\-degree AV block, complete left/right bundle branch block, Wolff–Parkinson–White, and related conduction delays\.
  - –MI \(Myocardial Infarction\): ST\-elevation patterns consistent with current or prior myocardial infarction\.
  - –STTC \(ST/T\-wave Change\): ST\-segment depression or elevation and T\-wave changes not meeting MI criteria\.
  - –Other: Remaining arrhythmia codes not captured by the above groups \(e\.g\., atrial bigeminy, aberrant ventricular conduction, early repolarisation\)\.
  - –AFib \(Atrial Fibrillation\): Atrial fibrillation and atrial flutter\.
  - –HYP \(Hypertrophy\): Left or right ventricular hypertrophy and atrial enlargement\.

  All 12 ECG leads are used \($C=12$\) and recorded synchronously, making this a Type 1 \(homogeneous and aligned\) dataset\. The dataset contains 10,646 recordings\. After preprocessing each sample has shape $\mathbb{R}^{12\times 1000}$ \($C=12$, $T=1000$\)\. Original 500 Hz signals are downsampled to 100 Hz using anti\-aliasing filtering, yielding $T=1000$ time steps per 10\-second recording\. Downsampling suppresses high\-frequency noise and reduces overfitting\. Z\-score normalization is applied per lead\.
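
The PTB-XL resizing step (5000 to 250 time steps by linear interpolation, followed by per-lead z-scoring) can be sketched as follows. The endpoint-aligned sampling grid is an assumption, since the text does not state which interpolation convention is used, and `resize_linear` and `zscore` are hypothetical helper names.

```python
def resize_linear(signal, new_len):
    """Resample a 1-D sequence to new_len points, keeping both endpoints fixed."""
    old_len = len(signal)
    if new_len == 1 or old_len == 1:
        return [float(signal[0])] * new_len
    out = []
    for j in range(new_len):
        pos = j * (old_len - 1) / (new_len - 1)  # fractional index into signal
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out

def zscore(signal):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = sum(signal) / len(signal)
    sigma = (sum((v - mu) ** 2 for v in signal) / len(signal)) ** 0.5
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in signal]

lead = [float(i % 100) for i in range(5000)]  # toy stand-in for one ECG lead
resized = zscore(resize_linear(lead, 250))
small = resize_linear([0.0, 10.0], 5)
```

Plain linear interpolation applies no low-pass filter before decimation, unlike the anti-aliased downsampling described for Chapman-Shaoxing, so the two datasets are preprocessed differently even though both start from 500 Hz recordings.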

##### Type 2: Heterogeneous and Aligned Datasets\.

In this group, distinct physiological signal types are recorded simultaneously at a shared \(or unified\) sampling rate\. After resampling all channels share a common time axis, but they originate from fundamentally different physiological processes\.

- •Sleep\-EDF\. The Sleep\-EDF dataset \[[21](https://arxiv.org/html/2605.15235#bib.bib27)\] is a publicly available polysomnography \(PSG\) corpus from PhysioNet \[[12](https://arxiv.org/html/2605.15235#bib.bib54)\], originating from a prospective study of sleep and ageing in healthy subjects\. Automated sleep\-stage scoring is clinically critical for diagnosing sleep disorders such as insomnia, obstructive sleep apnoea, and narcolepsy\. Manual PSG scoring by trained technicians is time\-consuming, expensive, and subject to inter\-rater variability\. The task is five\-class sleep\-stage classification following the AASM scoring rules\. Each 30\-second epoch receives exactly one label:
  - –W \(Wake\): Subject is awake; EEG shows high\-frequency, low\-amplitude activity and voluntary eye movements\.
  - –N1 \(NREM Stage 1\): Sleep onset; theta waves \(4–7 Hz\) dominate the EEG, muscle tone decreases, and slow eye movements appear\.
  - –N2 \(NREM Stage 2\): Established light sleep; defined by K\-complexes and sleep spindles \(12–15 Hz bursts\) in the EEG\.
  - –N3 \(NREM Stage 3 / Slow\-Wave Sleep\): Deep sleep; high\-amplitude delta waves \($<2$ Hz, $>75\,\mu$V\) occupy $\geq 20\%$ of the epoch\.
  - –R \(REM\): Rapid Eye Movement sleep; EEG resembles wake but EMG shows near\-complete skeletal muscle atonia, and sawtooth waves may appear\.

  Five channels are used, each from a distinct physiological source:
  - –EEG Fpz\-Cz: Frontal\-central electroencephalogram; primary signal for staging N1–N3 and REM\.
  - –EEG Pz\-Oz: Parietal\-occipital EEG; captures occipital alpha rhythm \(8–13 Hz\) during relaxed wakefulness\.
  - –EOG \(horizontal\): Electrooculogram recording horizontal eye movements; distinguishes REM eye movements from wakefulness saccades\.
  - –EMG \(submental chin\): Chin electromyogram measuring skeletal muscle tone; key discriminator for REM atonia\.
  - –Resp \(oro\-nasal\): Oro\-nasal airflow thermistor recording respiratory rate and pattern; aids detection of apnoeic events\.

  All five channels are recorded at 100 Hz simultaneously but measure different physiological phenomena, making this a Type 2 \(heterogeneous and aligned\) dataset\. After removing transition epochs at recording boundaries, the dataset contains 10,918 30\-second epochs from 78 subjects \(153 overnight recordings\)\. Each sample has shape $\mathbb{R}^{5\times 3000}$ \($C=5$, $T=3{,}000$\)\. All channels are natively sampled at 100 Hz; no resampling is required\. Each 30\-second epoch yields $T=100\times 30=3{,}000$ time steps\. Z\-score normalization is applied to each channel independently\. Data splits are stratified by subject\.
- •PPG\-DaLiA\. PPG\-DaLiA \(PPG Dataset for Life Activities\) \[[30](https://arxiv.org/html/2605.15235#bib.bib28)\] is a multimodal wearable dataset designed to advance activity recognition and heart\-rate estimation under real\-world, free\-living conditions\. It provides synchronized data from a chest\-worn device and a wrist\-worn device across naturalistic daily activities\. Reliable activity recognition from low\-cost wrist PPG devices is essential for consumer health wearables and for correcting motion artifacts in continuous heart\-rate monitoring\. The task is nine\-class activity classification\. Each sample window is assigned one activity label via majority vote over the 4 Hz label signal within the window:
  - –Transient: Activity\-transition segments that do not belong to a stable activity state; the most frequent class \(27\.8% of training windows\)\.
  - –Sitting: Stationary sedentary posture\.
  - –Ascending stairs: Upward stair climbing\.
  - –Descending stairs: Downward stair descent\.
  - –Table soccer: Upper\-limb dominated gaming activity with minimal lower\-body motion\.
  - –Cycling: Stationary or outdoor pedalling\.
  - –Driving: Seated vehicle operation\.
  - –Lunch break: Low\-activity resting or eating period\.
  - –Walking: Moderate\-intensity locomotion at a self\-selected pace\.

  Nine channels are drawn from two wearable devices\. The wrist device \(Empatica E4\) contributes BVP/PPG \(1 channel\), a 3\-axis accelerometer \(3 channels\), electrodermal activity \(EDA, 1 channel\), and skin temperature \(TEMP, 1 channel\), totalling 6 channels\. The chest device \(RespiBAN\) contributes a 3\-axis accelerometer \(3 channels\)\. Native sampling rates differ across sensors \(e\.g\., 64 Hz for wrist BVP vs\. higher rates for chest ACC\), and the two devices measure physiologically distinct signals\. After resampling to a unified rate, this dataset is Type 2 \(heterogeneous and aligned\)\. The dataset contains 15 subjects\. After sliding\-window segmentation the dataset yields 64,726 samples\. Each sample has shape $\mathbb{R}^{9\times 256}$ \($C=9$, $T=256$\)\. All nine channels are resampled to a unified 32 Hz using downsampling and linear interpolation\. An 8\-second sliding window with a 2\-second stride produces $T=32\times 8=256$ time steps per sample\. Z\-score normalization is applied to each channel independently\. Data splits are stratified by subject\.
- •WESAD\. WESAD \(Wearable Stress and Affect Detection\) \[[31](https://arxiv.org/html/2605.15235#bib.bib29)\] by Schmidt et al\. is a benchmark for automated, real\-time stress and affect recognition from wrist\-worn physiological sensors\. Stress detection is clinically relevant for mental health monitoring, early burnout prevention, and cardiovascular risk management, especially when deployed passively on consumer wearables without user intervention\. The task is three\-class affective\-state classification:
  - –Baseline \(Neutral\): Resting state; subjects sit quietly or read neutral materials\. This condition establishes each subject’s individual physiological baseline\.
  - –Stress: Acute psychological stress induced by the Trier Social Stress Test \(TSST\), which combines public speaking and mental arithmetic performed in front of an evaluator panel\. TSST reliably elevates cortisol and activates the sympathetic nervous system\.
  - –Amusement: Mild positive affect induced by a curated selection of short comedy video clips, intended as a low\-arousal positive affect condition contrasting with the high\-arousal stress condition\.

  Two wrist physiological channels are used\. EDA \(Electrodermal Activity\) measures skin conductance and reflects sympathetic nervous system arousal; conductance rises within seconds of an acute stressor, making it the most discriminative channel for stress detection\. BVP \(Blood Volume Pulse / PPG\) captures cardiac activity from the wrist; stress\-induced sympathetic activation increases heart rate and alters heart\-rate variability features embedded in the BVP waveform\. Both channels are recorded simultaneously by the Empatica E4 wrist device but reflect different physiological pathways, making this a Type 2 \(heterogeneous and aligned\) dataset\. The dataset contains 15 subjects\. After sliding\-window segmentation the dataset yields 4,387 samples\. Each sample has shape $\mathbb{R}^{2\times 480}$ \($C=2$, $T=480$\)\. We use the wrist EDA and BVP channels only\. All channels are resampled to 32 Hz\. A 15\-second sliding window with a 7\.5\-second stride produces $T=32\times 15=480$ time steps\. Z\-score normalization is applied at the sample level to each channel independently\. Data splits are stratified by subject\.
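
The window arithmetic for PPG-DaLiA and WESAD above can be checked with a short sketch. It assumes that only full windows are kept and that the first window starts at index 0; the one-minute recording length is arbitrary, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(n_samples, rate_hz, window_s, stride_s):
    """Return (start, end) index pairs for all full windows over a recording."""
    win = int(round(rate_hz * window_s))    # window length in samples
    step = int(round(rate_hz * stride_s))   # stride in samples
    return [(s, s + win) for s in range(0, n_samples - win + 1, step)]

# One minute of signal at the unified 32 Hz rate.
ppg_dalia = sliding_windows(n_samples=32 * 60, rate_hz=32, window_s=8, stride_s=2)
wesad = sliding_windows(n_samples=32 * 60, rate_hz=32, window_s=15, stride_s=7.5)
```

Each PPG-DaLiA window spans 256 samples with 75% overlap between neighbours, and each WESAD window spans 480 samples with 50% overlap. Overlapping windows from the same subject are highly correlated, which is why splitting by subject matters for these datasets.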

##### Type 3: Heterogeneous and Unaligned Datasets\.

In this group, modalities differ in both physiological origin and temporal resolution and cannot be placed on a common synchronised time axis without fundamental structural transformation\. Examples include datasets that mix regularly sampled time series with static embeddings or with irregularly timed clinical measurements\.

- •MIMIC\-IV\. MIMIC\-IV \(Medical Information Mart for Intensive Care IV\) \[[19](https://arxiv.org/html/2605.15235#bib.bib30)\] is a comprehensive de\-identified critical\-care database developed by Beth Israel Deaconess Medical Center \(BIDMC\) in partnership with MIT\. It records the full clinical trajectory of more than 40,000 ICU admissions including vital signs, laboratory values, medications, clinical notes, chest radiographs, and ECG waveforms\. In\-hospital mortality prediction is a fundamental decision\-support problem that informs prognosis, triage, and palliative\-care planning\. The task is binary in\-hospital mortality prediction\. The positive label \(1\) indicates that the patient dies during the current hospital admission; the negative label \(0\) indicates survival to discharge\. The dataset is severely class\-imbalanced with an approximate mortality rate of 13%, reflecting real\-world ICU mortality distributions\. Four fundamentally heterogeneous modalities are present:
  1. Clinical time series \($C=30$, $T=48$\): Vital signs and laboratory values aggregated into 1\-hour bins over the first 48 hours of ICU admission\. The 30 channels include heart rate, systolic/diastolic/mean arterial blood pressure, respiratory rate, body temperature, SpO2, and key biochemical markers such as glucose, creatinine, potassium, sodium, and bicarbonate\.
  2. Chest X\-ray features \(1024\-D static vector\): Visual embeddings pre\-extracted from the most recent chest radiograph sourced from MIMIC\-CXR\-JPG \[[20](https://arxiv.org/html/2605.15235#bib.bib8)\], following the multimodal configuration of Han et al\. \[[13](https://arxiv.org/html/2605.15235#bib.bib21)\], using a pretrained thoracic image encoder, capturing structural lung and cardiac pathology including effusions, cardiomegaly, and consolidations\.
  3. ECG features \(256\-D static vector\): Temporal embeddings pre\-extracted from the 12\-lead ECG recording closest to ICU admission time, encoding arrhythmia and ischaemia patterns in a compact representation\.
  4. Clinical text features \(768\-D static vector\): Semantic embeddings pre\-extracted from clinical notes \(nursing notes, discharge summaries\) using a pretrained BERT\-based clinical language model, encoding free\-text observations not captured by structured variables\.

  Modalities 2–4 are static vectors with no time axis and cannot be aligned with the hourly time series, placing this dataset in Type 3 \(heterogeneous and unaligned\)\. After filtering and preprocessing the dataset contains 5,100 ICU admission episodes\. The time\-series component has shape $\mathbb{R}^{30\times 48}$; the three embedding modalities are fused separately by each model architecture\. Clinical measurements are aggregated into 1\-hour bins; missing values within a stay are forward\-filled, and any remaining gaps are set to zero\. Z\-score normalization is applied to the time\-series channels using training\-split statistics\. The embedding vectors are used directly without further normalisation\.
- •CinC Challenge 2012 \(Challenge\-2012\)\. The PhysioNet \[[12](https://arxiv.org/html/2605.15235#bib.bib54)\] / Computing in Cardiology Challenge 2012 \[[35](https://arxiv.org/html/2605.15235#bib.bib31)\] is a canonical benchmark for ICU mortality prediction, providing irregularly sampled multivariate clinical time series from general medical, cardiac, and surgical ICUs\. Its extreme feature sparsity makes it directly relevant to the missingness study in this paper\. The task is binary ICU mortality prediction\. The positive label \(1\) indicates in\-hospital death; the negative label \(0\) indicates survival to discharge\. The positive \(mortality\) rate is approximately 13\.9%\. Forty\-two clinical variables are recorded per patient \($C=42$\), grouped into three categories\. General descriptors \(quasi\-static\) include age, gender, height, weight, ICU type \(cardiac surgery, medical\-surgical, or other\), and SAPS\-I score\. Time\-varying vital signs include heart rate, systolic/diastolic/mean arterial blood pressure, respiratory rate, body temperature, SpO2/SaO2, Glasgow Coma Scale \(GCS\), mechanical ventilation flag, and fraction of inspired oxygen \(FiO2\)\. Time\-varying laboratory values include blood urea nitrogen \(BUN\), creatinine, glucose, potassium \(K\), sodium \(Na\), pH, PaO2, PaCO2, bicarbonate \(HCO3\), lactate, total bilirubin, hematocrit \(HCT\), white blood cell count \(WBC\), and magnesium \(Mg\)\. Clinical measurements are taken at irregular intervals dictated by clinical decisions rather than a fixed schedule\. After binning into hourly slots, the resulting $42\times 48$ grid is naturally sparse, with most patient\-hours containing no measurements for most variables\. This irregular temporal structure combined with the mixture of static and dynamic variables makes this a Type 3 \(heterogeneous and unaligned\) dataset\. The dataset contains approximately 4,000 ICU patient stays \(set\-a\)\. After preprocessing each sample has shape $\mathbb{R}^{42\times 48}$ \($C=42$, $T=48$\)\. Irregular clinical measurements are grouped into 1\-hour bins; if a bin contains multiple observations for the same variable, they are averaged\. Bins with no observation are set to zero without forward\-filling or imputation\. This deliberate choice ensures that zero values represent genuine structural missingness rather than imputed estimates, isolating the effect of missing data in a controlled way\. Z\-score normalization is applied to each of the 42 channels using training\-split statistics\.
- •CirCor DigiScope \(CirCor\)\. CirCor DigiScope \[[28](https://arxiv.org/html/2605.15235#bib.bib32)\] is a cardiac auscultation dataset from the PhysioNet/Computing in Cardiology Challenge 2022, targeting murmur detection from phonocardiogram \(PCG, i\.e\., heart sound\) recordings\. It covers a large pediatric cardiac screening campaign conducted in Brazil\. Early detection of congenital and acquired cardiac abnormalities through heart auscultation is a critical but resource\-limited clinical task in low\- and middle\-income settings, motivating automated screening tools\. The task is three\-class murmur classification\. Label 0 \(Absent\) indicates that no murmur is detected at any auscultation location and accounts for 75\.6% of recordings\. Label 1 \(Unknown\) indicates insufficient audio quality to determine murmur presence \(5\.0%\)\. Label 2 \(Present\) indicates a clearly audible murmur at one or more locations \(19\.4%\)\. The class distribution is severely imbalanced, requiring class\-balanced loss weighting\. Each recording is a single\-channel PCG waveform captured at one of four standard auscultation sites: Aortic Valve \(AV\), Pulmonary Valve \(PV\), Tricuspid Valve \(TV\), or Mitral Valve \(MV\)\. In the preprocessed format, each recording is transformed into a log\-mel spectrogram with 64 mel\-frequency bins, yielding a time\-frequency feature tensor\. An additional 14 static demographic and metadata features are broadcast as constant channels across the time dimension and appended to the spectrogram, comprising age group one\-hot encoding \(6 channels for Neonate, Infant, Child, Adolescent, Young Adult, Unknown\), sex \(1 channel\), height and weight \(2 channels\), pregnancy status \(1 channel\), and recording location one\-hot encoding \(4 channels for AV, PV, TV, MV\)\. The resulting input has $C=64+14=78$ channels\. The 14 demographic channels are static scalars with no temporal structure; they possess a fundamentally different format and resolution from the time\-varying mel spectrogram channels, making temporal alignment between the two channel groups meaningless\. This mixture of a dynamic time\-frequency modality and static patient\-level metadata makes this a Type 3 \(heterogeneous and unaligned\) dataset\. The original collection contains 942 patients and 3,163 WAV recordings at 4,000 Hz\. After excluding 45 recordings that exceed the 20\-second duration cap, 3,118 recordings are retained\. Each sample has shape $\mathbb{R}^{78\times 625}$ \($C=78$, $T=625$\)\. Raw PCG waveforms are recorded at 4,000 Hz\. Each recording is converted to a log\-mel spectrogram using $n_{\text{fft}}=512$ and $\text{hop\_length}=128$, yielding $T=625$ time frames for a 20\-second clip\. Recordings shorter than 20 seconds are zero\-padded; longer recordings are truncated\. The spectrogram is normalized using global per\-channel mean and standard deviation computed from the training split\. Static demographic features are broadcast across the $T$ dimension at runtime\. Data splits are patient\-level \(seed=42\) to prevent data leakage across recordings from the same patient\.
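
The hourly binning rule described for Challenge-2012 above (average repeated observations, leave empty bins at zero) can be sketched as follows. The input format of (time in hours, variable index, value) triples and the returned validity mask are assumptions for illustration, and `bin_hourly` is a hypothetical name.

```python
import math

def bin_hourly(observations, n_vars, n_hours=48):
    """observations: iterable of (time_in_hours, variable_index, value)."""
    sums = [[0.0] * n_hours for _ in range(n_vars)]
    counts = [[0] * n_hours for _ in range(n_vars)]
    for t, var, value in observations:
        hour = math.floor(t)
        if 0 <= hour < n_hours:  # ignore anything outside the 48-hour window
            sums[var][hour] += value
            counts[var][hour] += 1
    # Mean of the observations in each bin; empty bins stay at zero.
    grid = [[sums[v][h] / counts[v][h] if counts[v][h] else 0.0
             for h in range(n_hours)] for v in range(n_vars)]
    # 1 where at least one measurement was observed, 0 where the bin is empty.
    mask = [[1 if counts[v][h] else 0 for h in range(n_hours)]
            for v in range(n_vars)]
    return grid, mask

obs = [(0.2, 0, 80.0), (0.7, 0, 90.0), (5.5, 0, 70.0), (47.9, 1, 1.2)]
grid, mask = bin_hourly(obs, n_vars=2)
```

Keeping a mask alongside the grid is what lets a zero in an empty bin be told apart from a measured value that happens to be zero after normalization. The MIMIC-IV pipeline differs at this point, since it forward-fills within a stay before zeroing the remaining gaps.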

## Appendix D Protocol Adaptation and Implementation

### D\.1 Compute Resources

All main experiments are conducted on NVIDIA B200 GPUs \(178\.4 GiB VRAM\) on the University of Florida HiPerGator cluster\. Each job occupies one B200 GPU, with approximately three jobs running concurrently\. The full benchmark of 810 experimental runs \(6 models $\times$ 9 datasets $\times$ 3 seeds $\times$ 5 missing\-data conditions\) completes in approximately one month of wall\-clock time\. A small number of early exploratory runs use NVIDIA L4 GPUs \(22 GiB\); these are superseded by the B200 runs and do not contribute to any reported result\.

To ensure a rigorous and fair comparison, we implement a unified evaluation protocol\. The primary goal is to ensure that the model architecture is the only variable across all experiments\.

### D.2 Unified Data Interface

We use a standardized pipeline to ensure all models process the exact same data splits and missingness patterns. We generate unified index files (e.g., `split_train.json`) and a `meta.json` for each dataset. Models load raw data on the fly using these shared indices rather than a single forced intermediate format. After preprocessing, all datasets are saved as shared cache files on disk. All six baseline models read from these exact same cached files, guaranteeing absolute data-level fairness across the entire evaluation.

### D.3 Framework-Agnostic Missingness Library

We implement missingness injection as a single Python module that is copied into all six model projects\. This ensures that the mathematical logic for modality dropping and time\-block masking is byte\-for\-byte identical across all frameworks, regardless of each project’s underlying training infrastructure\.

Each sample is represented as a dict with four fields:

- `x` $\in\mathbb{R}^{C\times T}$: raw channel signals.
- `mask` $\in\{0,1\}^{C\times T}$: per-timestep validity mask (1 = valid, 0 = padded or missing).
- `mod_mask` $\in\{0,1\}^{C}$: per-channel presence mask (1 = present, 0 = dropped).
- `ts_mod` $\in\{\texttt{True},\texttt{False}\}^{C}$: flags marking each channel as a time-series modality. Channels with `ts_mod = False` are skipped by the time-block injector.

The injectors update `mask` and `mod_mask` in place. Downstream model code treats these fields either by multiplying the input tensor by the mask before projection or by passing it as an attention mask.
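The four-field sample layout can be sketched as a plain dictionary of NumPy arrays. The helper names `make_sample` and `masked_input` are hypothetical, and the dtypes are assumptions; only the field names and their meanings come from the description above.

```python
import numpy as np


def make_sample(x, ts_mod=None):
    """Build a fully observed sample dict for a (C, T) signal array."""
    C, T = x.shape
    return {
        "x": x.astype(np.float32),
        "mask": np.ones((C, T), dtype=np.uint8),   # 1 = valid timestep
        "mod_mask": np.ones(C, dtype=np.uint8),    # 1 = channel present
        "ts_mod": (np.ones(C, dtype=bool) if ts_mod is None
                   else np.asarray(ts_mod, dtype=bool)),
    }


def masked_input(sample):
    """Zero out invalid entries by multiplying the input by the mask."""
    return sample["x"] * sample["mask"]


sample = make_sample(np.ones((3, 8)))
sample["mod_mask"][1] = 0      # channel 1 dropped entirely
sample["mask"][1, :] = 0
sample["mask"][0, 2:5] = 0     # a 3-step gap in channel 0
x_in = masked_input(sample)
```

Multiplying by the mask is only one of the two options named above; a model that uses attention would instead forward `sample["mask"]` as its attention mask.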

#### D.3.1 The `__getitem__` Interface

The unified entry point is `apply_missingness(sample, mode, *, p_mod, block_n, block_m, block_n_max, seed, sample_id, min_kept)`, which dispatches on `mode` $\in$ {`none`, `modality`, `block`}. It is called inside `Dataset.__getitem__` immediately after the raw sample is loaded from the cache, so missingness is injected on-the-fly during data loading rather than written to disk. This allows any missing rate to be evaluated without storing additional copies of the data. All six model projects follow the same call pattern, so models with entirely different internal formats (channel-independent projections, cross-modal attention, 1-D CNN encoders) receive the identical missingness pattern without any format-specific modification to the library.
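The call pattern can be illustrated with a framework-free wrapper. This is a sketch under assumptions: `MissingnessDataset` and `toy_inject` are hypothetical names, the dataset index is used as `sample_id`, and the toy injector merely stands in for the real `apply_missingness`.

```python
import copy
import numpy as np


class MissingnessDataset:
    """Hypothetical wrapper: load a cached sample, then inject missingness."""

    def __init__(self, cache, inject_fn, mode="none", seed=42, **inject_kwargs):
        self.cache = cache            # list-like store of clean samples
        self.inject_fn = inject_fn    # stand-in for apply_missingness
        self.mode = mode
        self.seed = seed
        self.inject_kwargs = inject_kwargs

    def __len__(self):
        return len(self.cache)

    def __getitem__(self, idx):
        # Copy first so the shared cache is never modified in place.
        sample = copy.deepcopy(self.cache[idx])
        if self.mode == "none":
            return sample
        return self.inject_fn(sample, self.mode, seed=self.seed,
                              sample_id=idx, **self.inject_kwargs)


def toy_inject(sample, mode, *, seed, sample_id, **_):
    """Toy injector for the demo only: always drops channel 0."""
    sample["mod_mask"][0] = 0
    sample["mask"][0, :] = 0
    return sample


cache = [{"x": np.ones((2, 4)),
          "mask": np.ones((2, 4), dtype=np.uint8),
          "mod_mask": np.ones(2, dtype=np.uint8)}]
corrupted = MissingnessDataset(cache, toy_inject, mode="modality")[0]
clean = MissingnessDataset(cache, toy_inject, mode="none")[0]
```

Because the corruption happens on a copy at load time, the same cache serves every missing rate, which matches the storage argument made above.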

#### D.3.2 Deterministic Stochasticity

For a given pair (`seed`, `sample_id`), the generated missingness pattern is identical across all models and runs. Before each injection, a local random generator is constructed. The multiplicative folding ensures that nearby sample indices produce uncorrelated patterns. Because the same formula is used in every model project, the mask applied to sample $i$ under a given configuration is identical whether the sample is loaded by CLIMB, MIRA, or any other model, guaranteeing a fair cross-model comparison.
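The folding formula is not reproduced in this section, so the following is not the benchmark's generator. It is a minimal sketch of the general idea, assuming a multiply-and-add fold with a hypothetical constant `_FOLD`; the function name `local_rng` is likewise illustrative.

```python
import numpy as np

# Hypothetical odd multiplier; the benchmark's actual constant is not given here.
_FOLD = 1_000_003


def local_rng(seed, sample_id):
    """Build a generator that depends only on (seed, sample_id)."""
    folded = (seed * _FOLD + sample_id) % (2 ** 63)
    return np.random.default_rng(folded)


a = local_rng(42, 7).random(4)
b = local_rng(42, 7).random(4)   # same pair, so the same stream
c = local_rng(42, 8).random(4)   # neighbouring sample, different stream
```

NumPy's `np.random.SeedSequence` also accepts a sequence of integers as entropy, which may be a simpler route to the same per-sample determinism.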

### D.4 Implementation of Missingness Modes

#### D.4.1 Modality Missing

Our implementation of modality missing is a channel-level missing completely at random (MCAR) simulation in which each input channel is independently dropped to approximate whole-sensor failures. We use this channel-level approach because it provides a controlled and reproducible proxy for modality-level loss across all nine datasets, which differ widely in channel count and channel type. We note that this simulation does not always match the strict semantic meaning of a modality. In PTB-XL, for example, each of the 12 ECG leads measures the same cardiac activity from a different electrode position, so dropping one lead is more precisely a channel-level event than the loss of an independent modality. In CirCor, removing a mel-spectrogram frequency bin removes part of a single audio representation rather than an entire sensor source. A stricter protocol would group channels by their physical source and drop all channels from that source at once, capturing correlated source-level failures. We treat this source-level correlated missingness as a known limitation of the current simulation and leave it as future work.

For each currently present channel ($\texttt{mod\_mask}[c]=1$), the injector performs an independent Bernoulli trial with probability $p_{\text{mod}}$. A successful trial sets $\texttt{mod\_mask}[c]\leftarrow 0$ and $\texttt{mask}[c,\,:\,]\leftarrow\mathbf{0}$, zeroing the entire channel row in the validity mask. A safeguard enforces $\texttt{min\_kept}=1$: if the Bernoulli draws would drop all present channels, the injector randomly restores one dropped channel so the model always receives at least one valid input. Under this strategy, the realized number of dropped channels follows approximately $\text{Binomial}(n_{\text{present}},\,p_{\text{mod}})$.
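The channel-level MCAR procedure can be sketched as follows. The function name `drop_modalities` is hypothetical, and the way restored channels are chosen (uniformly among the dropped ones) is an assumption consistent with, but not confirmed by, the description above.

```python
import numpy as np


def drop_modalities(sample, p_mod, rng, min_kept=1):
    """Drop each present channel independently with probability p_mod."""
    present = np.flatnonzero(sample["mod_mask"] == 1)
    drops = present[rng.random(present.size) < p_mod]

    # Safeguard: restore random channels if fewer than min_kept would survive.
    n_restore = min(min_kept - (present.size - drops.size), drops.size)
    if n_restore > 0:
        restore = rng.choice(drops, size=n_restore, replace=False)
        drops = np.setdiff1d(drops, restore)

    sample["mod_mask"][drops] = 0
    sample["mask"][drops, :] = 0   # zero the whole row of the validity mask
    return sample


# Extreme case: p_mod = 1.0 would drop everything, so one channel is restored.
sample = {"mask": np.ones((5, 4), dtype=np.uint8),
          "mod_mask": np.ones(5, dtype=np.uint8)}
drop_modalities(sample, p_mod=1.0, rng=np.random.default_rng(0))

# Baseline case: p_mod = 0.0 leaves the sample untouched.
intact = {"mask": np.ones((5, 4), dtype=np.uint8),
          "mod_mask": np.ones(5, dtype=np.uint8)}
drop_modalities(intact, p_mod=0.0, rng=np.random.default_rng(0))
```

The safeguard is also why the dropped-channel count is only approximately binomial: the all-dropped outcome is redirected to leave `min_kept` channels in place.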

#### D.4.2 Within-Modality Missing

This mode carves non-overlapping contiguous segments independently in each time-series channel, simulating asynchronous sensor interruptions. For a channel of length $T$, the procedure is:

1. Compute the block length range: $\ell_{\text{min}}=\lceil\texttt{block\_n}\cdot T\rceil$, $\ell_{\text{max}}=\lceil\texttt{block\_n\_max}\cdot T\rceil$. We use $\texttt{block\_n}=0.05$ and $\texttt{block\_n\_max}=0.10$, so each block covers 5–10% of $T$.
2. Estimate the number of blocks required to cover fraction `block_m` of the sequence: $k=\left\lfloor\frac{\texttt{block\_m}\cdot T}{(\ell_{\text{min}}+\ell_{\text{max}})/2}\right\rceil$.
3. For each block, uniformly sample a start position and check for overlap with already-placed blocks. If an overlap is found, resample up to 64 times; if no valid position is found, stop placing further blocks for this channel.
4. Set $\texttt{mask}[c,\,\text{start}:\text{end}]\leftarrow 0$ for each placed block.

Each channel uses an independent sub-generator: before iterating over channels, the shared `rng` draws one 64-bit seed per channel upfront, and each channel's block placement proceeds from its own `np.random.default_rng`. This ensures that different channels miss different time windows while the entire per-sample pattern remains reproducible from (`seed`, `sample_id`).
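The four steps and the per-channel sub-generators can be sketched as below. Several details are assumptions: block lengths are drawn uniformly from $[\ell_{\text{min}}, \ell_{\text{max}}]$ once per block, Python's `round` stands in for nearest-integer rounding (it differs only on exact ties), and per-channel seeds are drawn from the non-negative `int64` range rather than the full 64 bits. The function names are hypothetical.

```python
import math
import numpy as np


def block_lengths(T, block_n=0.05, block_n_max=0.10):
    """Step 1: minimum and maximum block length in timesteps."""
    return math.ceil(block_n * T), math.ceil(block_n_max * T)


def num_blocks(T, block_m, l_min, l_max):
    """Step 2: blocks needed to cover roughly block_m of the sequence."""
    return int(round(block_m * T / ((l_min + l_max) / 2)))


def mask_blocks(sample, block_m, rng, block_n=0.05, block_n_max=0.10,
                max_tries=64):
    """Steps 3-4: place non-overlapping blocks in each time-series channel."""
    C, T = sample["mask"].shape
    l_min, l_max = block_lengths(T, block_n, block_n_max)
    k = num_blocks(T, block_m, l_min, l_max)
    # One seed per channel, drawn upfront from the shared generator.
    child_seeds = rng.integers(0, np.iinfo(np.int64).max, size=C)
    for c in range(C):
        if not sample["ts_mod"][c]:
            continue                      # static channels are skipped
        ch_rng = np.random.default_rng(int(child_seeds[c]))
        placed = []
        for _ in range(k):
            length = int(ch_rng.integers(l_min, l_max + 1))
            for _ in range(max_tries):
                start = int(ch_rng.integers(0, T - length + 1))
                end = start + length
                if all(end <= s or start >= e for s, e in placed):
                    placed.append((start, end))
                    break
            else:
                break                     # no valid position: stop this channel
        for s, e in placed:
            sample["mask"][c, s:e] = 0
    return sample


sample = {"mask": np.ones((3, 100), dtype=np.uint8),
          "ts_mod": np.array([True, True, False])}
mask_blocks(sample, block_m=0.2, rng=np.random.default_rng(0))
missing = 100 - sample["mask"].sum(axis=1).astype(int)
```

Because the block count is estimated from the mean block length, the realized missing fraction in a channel only approximates `block_m`, and it can fall short when placement stops early.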

### D.5 Standardized Training Protocols

##### Loss functions and class imbalance\.

For multi-class tasks (Sleep-EDF, PPG-DaLiA, WESAD, CirCor), we use cross-entropy with inverse-frequency class weights. For binary tasks (HAR-UP, MIMIC-IV, Challenge-2012), we use binary cross-entropy or two-class cross-entropy (depending on the model architecture) with a scalar positive-class weight. For multi-label tasks (PTB-XL, Chapman-Shaoxing), we use `BCEWithLogitsLoss` with a per-class positive weight:

$$w_{\text{pos}}=n_{\text{neg}}/n_{\text{pos}},$$

computed from the training split. This is particularly important for clinical datasets with severe class imbalance, such as MIMIC-IV (positive rate $\approx 13\%$, $w_{\text{pos}}\approx 6.6$) and Challenge-2012 (positive rate $\approx 13.9\%$, $w_{\text{pos}}\approx 6.2$).
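The positive-class weight is simple to compute from training labels. The sketch below assumes a binary label matrix of shape `(N, K)`; `pos_weight` and `weight_from_rate` are hypothetical helper names, and the guard against empty classes is an added assumption.

```python
import numpy as np


def pos_weight(labels):
    """Per-class w_pos = n_neg / n_pos for a binary (N, K) label matrix."""
    labels = np.asarray(labels)
    n_pos = labels.sum(axis=0)
    n_neg = labels.shape[0] - n_pos
    return n_neg / np.maximum(n_pos, 1)   # assumption: avoid division by zero


def weight_from_rate(rate):
    """Same weight expressed through the positive rate: (1 - rate) / rate."""
    return (1.0 - rate) / rate


# Toy labels (illustrative only): class 0 has 1 positive, class 1 has 2.
w = pos_weight([[1, 0], [0, 0], [0, 1], [0, 1]])
w_challenge = weight_from_rate(0.139)     # positive rate quoted in the text
```

In PyTorch, a tensor built from such weights is what the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss` expects. Since the quoted positive rates are rounded, weights recomputed from them may differ slightly from the reported values.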

##### Checkpoint selection\.

All six models select the best checkpoint by the highest validation Macro\-AUROC score, regardless of the primary test metric\. This common stopping criterion avoids per\-model tuning of the selection rule and ensures that reported test results reflect each model at its peak validation state\.

##### Maestro training protocol\.

Maestro diverges from the standard protocol only during training. Rather than receiving externally injected missing masks, Maestro applies its own curriculum modality dropout internally: the per-channel drop probability starts at 0 and increases linearly to a maximum of 40% over the first 10 warmup epochs, after which it remains fixed. This schedule encourages the model to gradually adapt to missing inputs without being exposed to severe dropout from the start. Critically, this internal dropout is applied only during the forward pass of training batches and is disabled at inference time. For validation and test evaluation, Maestro receives the same externally injected missing masks as all other models, generated from the shared (`seed`, `sample_id`) deterministic protocol. This means that all reported validation and test AUROC numbers for Maestro are directly comparable to those of other models under identical missing conditions.
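The warmup schedule can be written as a one-line ramp. This is a sketch, assuming zero-based epoch indexing with the maximum reached exactly at epoch 10; the function name `curriculum_drop_prob` is hypothetical and the actual Maestro code may index epochs differently.

```python
def curriculum_drop_prob(epoch, max_p=0.40, warmup_epochs=10):
    """Linear ramp from 0 to max_p over the warmup, then constant."""
    if warmup_epochs <= 0:
        return max_p
    return max_p * min(epoch / warmup_epochs, 1.0)


# Drop probability at a few epochs of a hypothetical 30-epoch run.
schedule = [curriculum_drop_prob(e) for e in (0, 5, 10, 25)]
```

Under this schedule the model never trains with a drop rate above 40%, which is the bound referred to when the main text says curriculum dropout protects only up to its training-time maximum.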

## Appendix E Detailed Result

### E.1 Per-Dataset Per-Run Results

This section reports full numerical results for all nine datasets across all five missing-data settings and six models. Every experiment is repeated under three independent random seeds: r1 (seed = 42), r2 (seed = 2026), and r3 (seed = 114514), and each run is listed separately so that run-to-run variance is visible. For each seed and setting, we report two metrics: AUROC (macro one-vs-rest) and Macro-F1. The five settings are Clean (no missing data), Modality 20% and 50% (channel-level MCAR at the respective drop probability), and Within-modality 20% and 50% (block-level masking covering 20% or 50% of each channel's time steps). Bold entries mark the best AUROC and best Macro-F1 within each setting row. The superscript $\dagger$ marks ShaSpec runs on Challenge-2012 where training collapsed to predicting a single class (F1 = 0); $\ddagger$ marks isolated CLIMB or MIRA runs that similarly collapsed to all-positive prediction. Full results for each of the nine datasets are presented in Tables [10](https://arxiv.org/html/2605.15235#A5.T10)–[18](https://arxiv.org/html/2605.15235#A5.T18).

Table 10: HAR-UP: AUC and Macro-F1 (AUC: AUROC) (binary fall detection).

Table 11: PTB-XL: AUC and Macro-F1 (AUC: AUROC) (multi-label, 5 classes).

Table 12: Chapman-Shaoxing: AUC and Macro-F1 (AUC: AUROC) (multi-label, 7 classes).

Table 13: Sleep-EDF: AUC and Macro-F1 (AUC: AUROC) (5-class, multiclass).

Table 14: PPG-DaLiA: AUC and Macro-F1 (AUC: AUROC) (9-class HAR).

Table 15: WESAD: AUC and Macro-F1 (AUC: AUROC) (3-class affect recognition).

Table 16: MIMIC-IV: AUC and Macro-F1 (AUC: AUROC) (binary IHM-48, 4 modalities).

Table 17: Challenge-2012: AUC and Macro-F1 (AUC: AUROC) (binary ICU mortality). $\dagger$: ShaSpec collapse (all-negative / all-positive prediction; F1 $\approx$ 0). $\ddagger$: CLIMB or MIRA collapses to all-positive in that run.

Table 18: CirCor: AUC and Macro-F1 (AUC: AUROC) (3-class murmur detection).

### E.2 Performance Degradation by Model Family

Tables [19](https://arxiv.org/html/2605.15235#A5.T19) and [20](https://arxiv.org/html/2605.15235#A5.T20) provide a quantitative breakdown of performance degradation grouped by architecture family across all nine datasets and both missingness modes. We assign the six models to three families based on their fusion design: channel-independent (Chan.-Indep.: CLIMB and MIRA), shared-specific (Shared-Spec.: ShaSpec), and mixture-of-experts fusion (MoE-Fusion: Flex-MoE, FuseMoE, and Maestro). For each model $m$, dataset $d$, and missing rate $r\in\{20\%,50\%\}$, we first compute per-model degradation $\delta_{m,d,r}=\overline{\text{AUROC}}_{\text{clean}}-\overline{\text{AUROC}}_{\text{missing}}$, where each term is the mean test AUROC over three independent seeds ($s\in\{42,2026,114514\}$). The value reported in each cell is then the mean of $\delta_{m,d,r}$ over all models belonging to that family. A positive entry indicates that the family loses that many AUROC points relative to the clean baseline; a negative entry indicates a marginal improvement under missing conditions, which can occur when the clean baseline is itself unstable due to class imbalance or small dataset size, as seen for Shared-Spec. on Challenge-2012. Across both tables, MoE-Fusion models tend to show larger degradation than channel-independent or shared-specific models in most settings, especially at the 50% missing rate, which quantitatively supports the claim that architecture family is the strongest predictor of robustness to missing data.
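The two-stage averaging (over seeds, then over family members) can be sketched as follows. The family assignment comes from the text; the AUROC numbers are made-up placeholders chosen only to exercise the code, not results from the paper, and the function names are hypothetical.

```python
import numpy as np

FAMILIES = {
    "Chan.-Indep.": ["CLIMB", "MIRA"],
    "Shared-Spec.": ["ShaSpec"],
    "MoE-Fusion": ["Flex-MoE", "FuseMoE", "Maestro"],
}


def model_degradation(clean_auroc, missing_auroc):
    """delta = mean clean AUROC - mean missing AUROC, each over seeds."""
    return float(np.mean(clean_auroc) - np.mean(missing_auroc))


def family_degradation(per_model_delta):
    """Mean of per-model deltas within each architecture family."""
    return {fam: float(np.mean([per_model_delta[m] for m in members]))
            for fam, members in FAMILIES.items()}


# Placeholder per-seed AUROCs for one dataset and one missing rate.
clean = {m: [0.90, 0.90, 0.90] for ms in FAMILIES.values() for m in ms}
missing = {"CLIMB": [0.88] * 3, "MIRA": [0.86] * 3, "ShaSpec": [0.85] * 3,
           "Flex-MoE": [0.80] * 3, "FuseMoE": [0.82] * 3, "Maestro": [0.84] * 3}

deltas = {m: model_degradation(clean[m], missing[m]) for m in clean}
by_family = family_degradation(deltas)
```

Averaging within a family weights each model equally, so the Shared-Spec. cell reflects a single model while the MoE-Fusion cell averages three.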

Table 19: Mean $\Delta$AUROC (clean $-$ missing) by model family under modality missing. Chan.-Indep.: CLIMB and MIRA; Shared-Spec.: ShaSpec; MoE-Fusion: Flex-MoE, FuseMoE, and Maestro. Larger values indicate greater performance drop.

Table 20: Mean $\Delta$AUROC (clean $-$ missing) by model family under within-modality missing. Family definitions follow Table [19](https://arxiv.org/html/2605.15235#A5.T19).

## Appendix F Detailed Diffusion Imputation Results

### F.1 Imputation Model and Pipeline

We train a conditional score-based diffusion model \[[14](https://arxiv.org/html/2605.15235#bib.bib3),[39](https://arxiv.org/html/2605.15235#bib.bib4)\] on PTB-XL independently of all benchmark classifiers. The model operates at native ECG resolution ($T=1000$, 100 Hz $\times$ 10 s) across all 12 leads and is conditioned on observed channels to fill missing segments on a per-channel basis, with no cross-lead conditioning.

##### Training\.

Missing regions are simulated with a contiguous block pattern: each affected channel has one or more blocks zeroed out, where each block spans 5–10% of the signal length (50–100 timesteps), and the total missing fraction per channel is set to 20% or 50%. The model is trained for 20 epochs with batch size 16 and 50 diffusion steps ($\beta$ linearly spaced from $10^{-4}$ to $0.02$). The architecture uses 4 diffusion layers, 64 channels, 8 attention heads, and embedding dimensions of 128 (time) and 16 (feature).
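The noise schedule stated above can be written out directly. The sketch assumes the standard DDPM-style forward process, $x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon$, which the text does not spell out; `q_sample` is a hypothetical helper and the all-zero input is a placeholder for a 12-lead ECG.

```python
import numpy as np

NUM_STEPS = 50                                  # diffusion steps, from the text
betas = np.linspace(1e-4, 0.02, NUM_STEPS)      # linear schedule, from the text
alpha_bar = np.cumprod(1.0 - betas)             # cumulative signal retention


def q_sample(x0, t, rng):
    """Noise x0 to step t under the assumed DDPM-style forward process."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps


x0 = np.zeros((12, 1000))                       # placeholder: 12 leads, T = 1000
x_t = q_sample(x0, t=NUM_STEPS - 1, rng=np.random.default_rng(0))
```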

##### Offline imputation pipeline\.

After training, the model imputes the entire PTB\-XL train, validation, and test splits and saves the results as pre\-computed arrays\. Each benchmark classifier loads these pre\-imputed signals directly, with on\-the\-fly corruption disabled\. This two\-stage design keeps the imputation and classification training processes fully decoupled\.

##### Modality\-missing condition\.

For the modality\-missing experiment, the same within\-channel model is applied to leads that are entirely absent\. This is an out\-of\-distribution use: the model is trained to fill short intra\-channel blocks and has no observed signal to condition on when a full lead group is dropped\. The resulting reconstruction is drawn from an unconstrained prior, which explains the performance degradation reported in Table[7](https://arxiv.org/html/2605.15235#S4.T7)\.

### F.2 Per-Run Results

This section provides per-run details for the diffusion-based imputation experiment described in Section [4.5](https://arxiv.org/html/2605.15235#S4.SS5). Three downstream classifiers (CLIMB, Flex-MoE, ShaSpec) are evaluated under both modality missing and within-modality missing conditions, across three independent seeds. Table [21](https://arxiv.org/html/2605.15235#A6.T21) shows per-run results under modality missing; Table [22](https://arxiv.org/html/2605.15235#A6.T22) shows per-run results under within-modality missing. Notation: Raw is a model trained on zero-filled corrupted input; Diff. is a model trained on diffusion-reconstructed input.

Table 21: Per-run modality missing vs. diffusion-based modality imputation on PTB-XL, three seeds. Raw: model trained on zero-filled dropped-modality input. Diff.: same model trained on diffusion-reconstructed input. $\Delta$ denotes performance change from Raw to Diff. AUC: AUROC; F1: Macro-F1. r1 = 42, r2 = 2026, r3 = 114514.

Table 22: Per-run within-modality missing vs. diffusion-based imputation on PTB-XL, three seeds. Raw: model trained on zero-filled corrupted input. Diff.: same model trained on diffusion-reconstructed input. $\Delta$ denotes performance change from Raw to Diff. AUC: AUROC; F1: Macro-F1. r1 = 42, r2 = 2026, r3 = 114514.

##### Cross-seed stability.

Comparing the three seeds reveals a clear pattern of consistency. For CLIMB and Flex-MoE, the imputation improvement over raw missing is present in all nine seed–rate combinations; no seed shows a regression. The single largest gain for Flex-MoE occurs at seed = 114514, reaching $\Delta$AUROC $\approx +0.047$ at both rates (20%: $0.7694-0.7225$; 50%: $0.7478-0.7006$), explaining the relatively high variance in the F1 column for that model at 20% ($\pm .026$). For ShaSpec, the improvements are uniformly small ($\Delta$AUROC $<0.010$ in every individual run), confirming that shared-specific decomposition already exploits cross-lead redundancy effectively, leaving little room for diffusion-based reconstruction to add value. Across all 18 within-modality per-run comparisons (3 models $\times$ 2 rates $\times$ 3 seeds), diffusion-based imputation does not degrade AUROC relative to the matched raw-missing baseline in any of the evaluated runs. Note that this holds for within-modality missing in this PTB-XL experiment; under modality missing, imputation degrades performance across all evaluated settings (Table [7](https://arxiv.org/html/2605.15235#S4.T7)), as discussed in Section [4.5](https://arxiv.org/html/2605.15235#S4.SS5).

## NeurIPS Paper Checklist

1. 1\.Claims
2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
3. Answer:\[Yes\]
4. Justification: The abstract and introduction precisely state the benchmark scope \(9 datasets, 6 models, 2 missing\-data modes, 810 runs\) and the three key findings \(architecture family as the strongest robustness predictor; dataset structure determining which failure mode dominates; diffusion imputation helping under within\-modality missing but not modality missing\)\. These claims are directly supported by the experimental results in Sections[4\.2](https://arxiv.org/html/2605.15235#S4.SS2)–[4\.5](https://arxiv.org/html/2605.15235#S4.SS5)\.
6. 2\.Limitations
7. Question: Does the paper discuss the limitations of the work performed by the authors?
8. Answer:\[Yes\]
9. Justification: Limitations are discussed in Section[5](https://arxiv.org/html/2605.15235#S5): curriculum modality dropout provides bounded protection only up to its training\-time maximum rate; shared\-specific models collapse on short or severely imbalanced sequences; diffusion\-based imputation degrades performance under modality missing\. The benchmark scope is also bounded to six architectures and two missing\-data severities, which is acknowledged in the conclusion\.
11. 3\.Theory assumptions and proofs
12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof?
13. Answer:\[N/A\]
14. Justification: This is an empirical benchmark paper; it contains no theorems, lemmas, or formal proofs\.
16. 4\.Experimental result reproducibility
17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper \(regardless of whether the code and data are provided or not\)?
18. Answer:\[Yes\]
19. Justification: Full reproduction information is provided: model architectures in Appendix[B](https://arxiv.org/html/2605.15235#A2), dataset preprocessing in Appendix[C](https://arxiv.org/html/2605.15235#A3), missingness injection algorithm in Appendix[D](https://arxiv.org/html/2605.15235#A4), training protocols \(loss functions, class weighting, checkpoint selection\) in Appendix[D](https://arxiv.org/html/2605.15235#A4), and three random seeds \(42, 2026, 114514\) with per\-seed results in Appendix[E](https://arxiv.org/html/2605.15235#A5)\. Code and pretrained weights are publicly available on GitHub; the repository URL is withheld from the paper to preserve double\-blind anonymity and will be included in the camera\-ready version\.
21. 5\.Open access to data and code
22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
23. Answer:\[Yes\]
24. Justification: The full codebase \(missingness injection library, dataset loaders, training scripts, and pretrained model checkpoints\) is publicly available on GitHub\. The repository URL is omitted from the paper to preserve double\-blind anonymity and will be added in the camera\-ready version\. All nine datasets are accessible from their original sources under their respective licenses; eight are publicly available without restriction, while MIMIC\-IV requires credentialed access via the PhysioNet Credentialed Health Data License\. Instructions for data preparation and experiment execution are provided in Appendix[D](https://arxiv.org/html/2605.15235#A4)\.
26. 6\.Experimental setting/details
27. Question: Does the paper specify all the training and test details \(e\.g\., data splits, hyperparameters, how they were chosen, type of optimizer\) necessary to understand the results?
28. Answer:\[Yes\]
29. Justification: Dataset splits, loss functions, class\-weighting strategies, checkpoint selection criterion \(best validation Macro\-AUROC\), and missingness injection parameters are fully specified in Appendices[B](https://arxiv.org/html/2605.15235#A2)–[D](https://arxiv.org/html/2605.15235#A4)\. Per\-dataset preprocessing steps \(resampling, normalization, window size\) are detailed in Appendix[C](https://arxiv.org/html/2605.15235#A3)\. Model\-specific hyperparameters \(embedding dimension, number of layers, expert count\) are listed in Appendix[B](https://arxiv.org/html/2605.15235#A2)\.
31. 7\.Experiment statistical significance
32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
33. Answer:\[Yes\]
34. Justification: All main-table results (Tables [4](https://arxiv.org/html/2605.15235#S4.T4)–[6](https://arxiv.org/html/2605.15235#S4.T6)) are averaged over three independent random seeds (42, 2026, 114514) with fixed data splits; the sources of variability are random weight initialization and the per-sample missingness patterns, which are seeded by the same experiment seed. Imputation results (Tables [7](https://arxiv.org/html/2605.15235#S4.T7)–[8](https://arxiv.org/html/2605.15235#S4.T8)) report mean $\pm$ std across the three seeds. Full per-seed breakdowns are provided in Appendix [E](https://arxiv.org/html/2605.15235#A5), confirming that architecture rankings are stable across initializations.
36. 8\. Experiments compute resources
37. Question: For each experiment, does the paper provide sufficient information on the computer resources \(type of compute workers, memory, time of execution\) needed to reproduce the experiments?
38. Answer: \[Yes\]
39. Justification: All main experiments were conducted on NVIDIA B200 GPUs \(178\.4 GiB VRAM\) on the University of Florida HiPerGator cluster, with approximately three jobs running concurrently \(one B200 per job\)\. The full benchmark of 810 experimental runs \(6 models × 9 datasets × 3 seeds × 5 missing\-data conditions\) completed in approximately one month of wall\-clock time\. A small number of early exploratory runs used NVIDIA L4 GPUs \(22 GiB\); these were superseded by the B200 runs and do not contribute to any reported result\.
40. Guidelines:
    - The answer \[N/A\] means that the paper does not include experiments\.
    - The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage\.
    - The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute\.
    - The paper should disclose whether the full research project required more compute than the experiments reported in the paper \(e\.g\., preliminary or failed experiments that didn’t make it into the paper\)\.
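41. 9\. Code of ethics
42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
43. Answer: \[Yes\]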
44. Justification: All datasets used are publicly released, fully de\-identified, and collected under the ethical oversight of their original studies\. No new human\-subject data was collected\. The benchmark promotes safer deployment of clinical AI systems under realistic sensor failure conditions, which is directly aligned with societal benefit\.
45. Guidelines:
    - The answer \[N/A\] means that the authors have not reviewed the NeurIPS Code of Ethics\.
    - If the authors answer \[No\], they should explain the special circumstances that require a deviation from the Code of Ethics\.
    - The authors should make sure to preserve anonymity \(e\.g\., if there is a special consideration due to laws or regulations in their jurisdiction\)\.
46. 10\. Broader impacts
47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
48. Answer: \[Yes\]
49. Justification: A dedicated Broader Impacts section discusses positive impacts \(enabling selection of robust clinical AI architectures, reducing risk of silent failure in ICU and wearable settings\) and negative impacts \(potential identification of vulnerable operating regimes, mitigated by the defensive framing of the benchmark and the public availability of all underlying assets\)\.
50. Guidelines:
    - The answer \[N/A\] means that there is no societal impact of the work performed\.
    - If the authors answer \[N/A\] or \[No\], they should explain why their work has no societal impact or why the paper does not address societal impact\.
    - Examples of negative societal impacts include potential malicious or unintended uses \(e\.g\., disinformation, generating fake profiles, surveillance\), fairness considerations \(e\.g\., deployment of technologies that could make decisions that unfairly impact specific groups\), privacy considerations, and security considerations\.
    - The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments\. However, if there is a direct path to any negative applications, the authors should point it out\. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate Deepfakes for disinformation\. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster\.
    - The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from \(intentional or unintentional\) misuse of the technology\.
    - If there are negative societal impacts, the authors could also discuss possible mitigation strategies \(e\.g\., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML\)\.
51. 11\. Safeguards
52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse \(e\.g\., pre\-trained language models, image generators, or scraped datasets\)?
53. Answer: \[N/A\]
54. Justification: The released assets \(missingness library, data loaders, pretrained classifiers on de\-identified clinical datasets\) pose no meaningful risk of misuse; they are diagnostic tools for benchmarking robustness, not generative or dual\-use models\.
55. Guidelines:
    - The answer \[N/A\] means that the paper poses no such risks\.
    - Released models that have a high risk for misuse or dual\-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters\.
    - Datasets that have been scraped from the Internet could pose safety risks\. The authors should describe how they avoided releasing unsafe images\.
    - We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort\.
56. 12\. Licenses for existing assets
57. Question: Are the creators or original owners of assets \(e\.g\., code, data, models\), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
58. Answer: \[Yes\]
59. Justification: All nine datasets and all six baseline model codebases are cited with their original publications\. Dataset licenses: MIMIC\-IV is under the PhysioNet Credentialed Health Data License; PTB\-XL and Chapman\-Shaoxing are under Creative Commons Attribution 4\.0 International \(CC BY 4\.0\); Sleep\-EDF, CirCor, and Challenge\-2012 are under the Open Data Commons Attribution License v1\.0 \(ODC\-BY\); PPG\-DaLiA and WESAD are under CC BY 4\.0\. HAR\-UP is distributed via a public link by the original authors with no explicit license stated\. Baseline model repositories are used under their respective open\-source licenses \(MIT or Apache 2\.0\)\.
60. Guidelines:
    - The answer \[N/A\] means that the paper does not use existing assets\.
    - The authors should cite the original paper that produced the code package or dataset\.
    - The authors should state which version of the asset is used and, if possible, include a URL\.
    - The name of the license \(e\.g\., CC\-BY 4\.0\) should be included for each asset\.
    - For scraped data from a particular source \(e\.g\., website\), the copyright and terms of service of that source should be provided\.
    - If assets are released, the license, copyright information, and terms of use in the package should be provided\. For popular datasets, [paperswithcode\.com/datasets](https://arxiv.org/html/2605.15235v1/paperswithcode.com/datasets) has curated licenses for some datasets\. Their licensing guide can help determine the license of a dataset\.
    - For existing datasets that are re\-packaged, both the original license and the license of the derived asset \(if it has changed\) should be provided\.
    - If this information is not available online, the authors are encouraged to reach out to the asset’s creators\.
61. 13\. New assets
62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
63. Answer: \[Yes\]
64. Justification: The benchmark releases three new assets: \(1\) a framework\-agnostic missingness injection library, \(2\) unified dataset loaders and index files for all nine datasets, and \(3\) pretrained model checkpoints for all six architectures across all datasets\. Their design and usage are documented in Appendix [D](https://arxiv.org/html/2605.15235#A4) and in the repository README\. Assets are anonymized at submission time\.
65. Guidelines:
    - The answer \[N/A\] means that the paper does not release new assets\.
    - Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates\. This includes details about training, license, limitations, etc\.
    - The paper should discuss whether and how consent was obtained from people whose asset is used\.
    - At submission time, remember to anonymize your assets \(if applicable\)\. You can either create an anonymized URL or include an anonymized zip file\.
66. 14\. Crowdsourcing and research with human subjects
67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation \(if any\)?
68. Answer: \[N/A\]
69. Justification: This paper involves no crowdsourcing and no new human\-subject experiments\. All datasets are pre\-existing, de\-identified clinical and physiological recordings collected under the ethical oversight of their original studies\.
70. Guidelines:
    - The answer \[N/A\] means that the paper does not involve crowdsourcing nor research with human subjects\.
    - Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper\.
    - According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector\.
71. 15\. Institutional review board \(IRB\) approvals or equivalent for research with human subjects
72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained?
73. Answer: \[N/A\]
74. Justification: No new human\-subject research was conducted\. All datasets were collected under IRB approval or equivalent ethical review by their original data providers; this paper only reuses fully de\-identified, publicly released data\.
75. Guidelines:
    - The answer \[N/A\] means that the paper does not involve crowdsourcing nor research with human subjects\.
    - Depending on the country in which research is conducted, IRB approval \(or equivalent\) may be required for any human subjects research\. If you obtained IRB approval, you should clearly state this in the paper\.
    - We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution\.
    - For initial submissions, do not include any information that would break anonymity \(if applicable\), such as the institution conducting the review\.
76. 16\. Declaration of LLM usage
77. Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required\.
78. Answer: \[N/A\]
79. Justification: LLMs are not a component of the benchmark methodology\. The models evaluated are clinical time\-series fusion architectures\. Any use of LLMs was limited to writing assistance and does not affect the methodology, experimental results, or scientific conclusions\.
80. Guidelines:
    - The answer \[N/A\] means that the core method development in this research does not involve LLMs as any important, original, or non\-standard components\.
    - Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described\.
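
The seed protocol in the statistical\-significance justification and the run count in the compute\-resources justification can be checked with a few lines of Python\. The following is a minimal sketch, not the authors' code: the model, dataset, and condition indices are placeholders, the per\-seed scores are made\-up illustration values, and the use of the sample standard deviation is an assumption \(the checklist does not say which estimator is used\)\.

```python
from itertools import product
from statistics import mean, stdev

# Values stated in the checklist justifications.
SEEDS = (42, 2026, 114514)
N_MODELS, N_DATASETS, N_CONDITIONS = 6, 9, 5

# Enumerate the run grid with placeholder indices; real model and
# dataset names are deliberately not assumed here.
runs = list(product(range(N_MODELS), range(N_DATASETS), SEEDS, range(N_CONDITIONS)))
print(len(runs))  # 6 * 9 * 3 * 5 runs


def summarize(per_seed_scores):
    """Return (mean, sample std) of one metric across seeds.

    Assumption: sample standard deviation (n - 1 denominator).
    """
    values = list(per_seed_scores.values())
    return mean(values), stdev(values)


# Hypothetical scores for a single (model, dataset, condition) cell.
example_scores = {42: 0.80, 2026: 0.82, 114514: 0.84}
avg, std = summarize(example_scores)
print(f"{avg:.3f} ± {std:.3f}")
```

With only three seeds the spread estimate is coarse, which is one reason the per\-seed breakdowns in the appendix are useful alongside the aggregated tables\.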
