ReTAMamba: Reliability-Aware Temporal Aggregation with Mamba for Irregular Clinical Time Series Prediction
Summary
Proposes ReTAMamba, a method using reliability-aware temporal aggregation with Mamba for irregular clinical time series prediction, achieving significant AUPRC gains on MIMIC-IV, eICU, and PhysioNet 2012.
Cached at: 05/19/26, 06:42 AM
# ReTAMamba: Reliability-Aware Temporal Aggregation with Mamba for Irregular Clinical Time Series Prediction
Source: [https://arxiv.org/html/2605.16380](https://arxiv.org/html/2605.16380)
\(2026\)
###### Abstract\.
Clinical time\-series data are difficult to model with methods designed for regular sequences because they exhibit irregular sampling, frequent missing values, and heterogeneous observation patterns across variables\. Existing approaches commonly use observation masks and time\-gap information, but they do not continuously capture the decaying reliability of past observations or consistently organize multi\-resolution information within a coherent temporal context during aggregation\. To address these limitations, we propose Reliability\-aware Temporal Aggregation with Mamba \(ReTAMamba\), which reconstructs clinical time series as time\-variable token sequences, estimates observation reliability from missingness and elapsed time, and augments interval summaries with statistical descriptors\. Chronological Weaving is used to integrate short\- and long\-term temporal information within a coherent temporal context, and a budgeted token router is applied to constrain sequence length while preserving informative summaries\. Experiments on MIMIC\-IV, eICU, and PhysioNet 2012 show that ReTAMamba consistently improves AUPRC over strong baselines, with average relative gains of 7\.51%, 7\.80%, and 10\.15%, respectively\. Cohort\-level and patient\-level analyses on eICU further showed that the learned mean decay for more dynamic signals, such as heart rate and blood pressure, was 24\.3% larger than that for relatively static signals, such as laboratory test variables\. These findings suggest that effective prediction in irregular clinical time series requires modeling not only what was measured, but also when and how it was observed, including information freshness and observation timeliness\.
Irregular clinical time series, Mortality prediction, Reliability\-aware modeling, Multi\-scale temporal aggregation, Mamba
## 1. Introduction
Clinical time-series data collected through the widespread adoption of electronic health record (EHR) systems have become a key source for characterizing changes in patient physiological status and play an important role in severity assessment, risk prediction, and early clinical decision support (Pungitore and Subbian, [2023](https://arxiv.org/html/2605.16380#bib.bib1); Ma et al., [2019](https://arxiv.org/html/2605.16380#bib.bib2)). This is especially critical in intensive care units (ICUs), where patient conditions can change rapidly, making accurate prediction of outcomes such as in-hospital mortality within the first 24–48 hours of admission an important task (Harutyunyan et al., [2019](https://arxiv.org/html/2605.16380#bib.bib3)).
Unlike regular time series, clinical time series are inherently irregular and sparse: measurement frequencies differ across variables, only a subset of variables may be observed at a given time, and intervals between measurements are highly uneven (Ghassemi et al., [2015](https://arxiv.org/html/2605.16380#bib.bib4)). Moreover, missingness is not merely the absence of information, but can reflect the observation process and clinical decision-making, thereby functioning as informative missingness (Che et al., [2018](https://arxiv.org/html/2605.16380#bib.bib5)). Clinical time-series prediction therefore requires modeling not only observed values, but also the observation structure, including which variables were measured and when, together with the freshness of information as time elapses after the last observation.
Accordingly, prior research has moved toward jointly modeling irregular observation structures, the meaning of missingness, and temporal context across multiple resolutions. Early approaches mainly regularized data through hourly aggregation or forward imputation and then applied general-purpose models, but such methods failed to preserve the clinical context embedded in the original observation patterns (Lipton et al., [2015](https://arxiv.org/html/2605.16380#bib.bib6)). Later studies attempted to model irregular sampling more directly, either by interpolating irregular observations at reference time points or by unfolding them into event sequences (Shukla and Marlin, [2019](https://arxiv.org/html/2605.16380#bib.bib7); Zhang et al., [2021](https://arxiv.org/html/2605.16380#bib.bib8)). However, these methods struggled to simultaneously handle cross-variable heterogeneity in observation frequency, the computational burden of long-sequence processing, and the complexity of clinical observation structures (Horn et al., [2020](https://arxiv.org/html/2605.16380#bib.bib9)). Because missingness in clinical time series often functions as informative missingness, recent approaches have increasingly emphasized both the meaning of missingness itself and the validity of information as time elapses (Che et al., [2018](https://arxiv.org/html/2605.16380#bib.bib5)). In addition, accurate assessment of patient status requires temporal aggregation and multi-scale modeling that capture both short-term physiological changes and long-term disease trajectories (Zhang et al., [2023](https://arxiv.org/html/2605.16380#bib.bib10)).
Nevertheless, these studies did not sufficiently resolve how to incorporate the reliability of observational signals consistently throughout representation learning or how to align information generated at different temporal resolutions along the true temporal axis while efficiently controlling the resulting increase in sequence length (Zhang et al., [2023](https://arxiv.org/html/2605.16380#bib.bib10)).
To address these challenges, this paper proposes Reliability-aware Temporal Aggregation with Mamba (ReTAMamba), a unified framework for modeling irregular clinical time series as a reliability-aware multi-scale sequence. ReTAMamba represents irregular multivariate records as time-variable token sequences, preserving variable-specific observation intervals and missingness patterns without collapsing them into conventional aggregation-based inputs. It then estimates time-varying observation reliability from missingness and elapsed time, incorporates this reliability into multi-resolution temporal aggregation, and reorders summary tokens from different temporal resolutions through Chronological Weaving. Finally, it applies budgeted token routing, using soft routing during training and hard top-$k$ selection during inference, before Mamba encoding. Through this design, ReTAMamba jointly models observation structure, information reliability, recency, multi-scale temporal context, and budgeted sequence compression within a single predictive framework.
The main contributions are as follows:
- Proposing a unified token-sequence framework for irregular clinical time series that more directly preserves sparse observation structure and variable-specific missingness than conventional aggregation-based representations.
- Introducing a reliability-aware temporal aggregation mechanism that continuously estimates observation validity from missingness and elapsed time under variable-specific decay and incorporates it into multi-resolution summary construction.
- Developing a multi-scale sequence modeling strategy centered on Chronological Weaving, which reorders interval summaries from different temporal resolutions in a time-ordered manner and incorporates budgeted token routing, using soft routing during training and hard top-$k$ selection during inference, before sequence encoding.
The remainder of this paper is organized as follows: Section 2 reviews related work, Section 3 describes the proposed ReTAMamba architecture, Section 4 presents the experimental setup and results, and Section 5 concludes the paper\.
## 2. Related Work
### 2.1. Irregular Clinical Time-Series Modeling
Early studies on irregularly sampled time series mainly regularized data through hourly aggregation or forward imputation, but such approaches were criticized for discarding meaningful clinical context embedded in observation patterns (Lipton et al., [2015](https://arxiv.org/html/2605.16380#bib.bib6)). Later work attempted to learn sampling irregularity directly by modifying RNN state updates or introducing time-aware attention (Che et al., [2018](https://arxiv.org/html/2605.16380#bib.bib5); Shukla and Marlin, [2021](https://arxiv.org/html/2605.16380#bib.bib11)). However, these methods still struggled to address large cross-variable differences in observation frequency, known as inter-series discrepancy (Horn et al., [2020](https://arxiv.org/html/2605.16380#bib.bib9)). More recent approaches have followed two directions: interpolation-based methods, such as IP-Nets (Shukla and Marlin, [2019](https://arxiv.org/html/2605.16380#bib.bib7)) and mTAN (Shukla and Marlin, [2021](https://arxiv.org/html/2605.16380#bib.bib11)), which map irregular observations to predefined reference times, and unfolding-based methods, such as SeFT (Horn et al., [2020](https://arxiv.org/html/2605.16380#bib.bib9)), which directly represent data as $(\text{value}, \text{variable}, \text{time})$ tuple sequences. Interpolation-based methods forcibly merged information at fixed reference points, limiting their ability to jointly capture fine- and coarse-grained temporal patterns (Zhang et al., [2023](https://arxiv.org/html/2605.16380#bib.bib10)).
Unfolding-based methods preserve raw observation patterns more faithfully, but in ICUs, where dozens of physiological variables are monitored over long periods, sequence length grows rapidly with the number of observations. This causes substantial memory cost and computational bottlenecks in Transformer-based architectures with $O(L^{2})$ complexity and hinders real-time clinical decision support (Vaswani et al., [2017](https://arxiv.org/html/2605.16380#bib.bib12)). Recently, linear-complexity state space models such as Mamba (Gu and Dao, [2024](https://arxiv.org/html/2605.16380#bib.bib13)) have emerged as promising alternatives for large-scale sequence modeling, but effectively integrating them with clinical observation structures characterized by extreme imbalance and missingness remains an important challenge.
### 2.2. Reliability-Aware Modeling
Unlike regular time series, clinical time series exhibit variable-specific measurement intervals, and missingness itself can function as informative missingness shaped by clinical decision-making (Groenwold, [2020](https://arxiv.org/html/2605.16380#bib.bib14); Tan et al., [2023](https://arxiv.org/html/2605.16380#bib.bib15)). Early approaches, such as GRU-D (Che et al., [2018](https://arxiv.org/html/2605.16380#bib.bib5)), model information decay using observation masks and time gaps, while more recent methods, such as Raindrop (Zhang et al., [2021](https://arxiv.org/html/2605.16380#bib.bib8)), use graph structures to learn interactions among asynchronous observations. However, these methods still struggle to address frequency differences across observational signals and their time-varying value in a structurally consistent manner (Zhang et al., [2023](https://arxiv.org/html/2605.16380#bib.bib10)). In practice, rapidly changing variables such as heart rate or respiratory rate have fundamentally different information half-lives from blood test variables that evolve on a daily scale (Shi et al., [2025](https://arxiv.org/html/2605.16380#bib.bib16)). Consequently, although effective for short-term missing-value correction, these methods may still distort information by failing to preserve observation validity consistently throughout deeper representation learning (Zhang et al., [2023](https://arxiv.org/html/2605.16380#bib.bib10)). This calls for a framework that quantifies the exponentially decaying validity of past observations in terms of a variable-specific continuous reliability. Dynamically controlling uncertain missing information and integrating such reliability-based observation structures throughout large-scale sequence modeling remain important challenges.
At the same time, the need to jointly capture rapid, short-term physiological changes and long-term disease trajectories has highlighted the importance of multi-scale analysis in clinical time-series modeling (Zhang et al., [2023](https://arxiv.org/html/2605.16380#bib.bib10)).
Figure 1. Detailed architecture of ReTAMamba.
### 2.3. Multi-scale Modeling
In EHR research, capturing patient status across multiple temporal resolutions, from minute-level physiological changes to multi-day disease trajectories, is critical for accurate prognosis prediction (Zhang et al., [2023](https://arxiv.org/html/2605.16380#bib.bib10)). Recent efforts have incorporated multi-scale mechanisms into general-purpose models such as Transformers, but these models are mainly designed for regularly sampled data and remain inefficient in sparse, irregular clinical settings (Horn et al., [2020](https://arxiv.org/html/2605.16380#bib.bib9); Vaswani et al., [2017](https://arxiv.org/html/2605.16380#bib.bib12)). Multi-scale architectures tailored to clinical irregularity, including Warpformer (Zhang et al., [2023](https://arxiv.org/html/2605.16380#bib.bib10)), have also been proposed, yet they still construct resolution-specific summaries independently or combine them in parallel. Moreover, clinical signals inherently exhibit temporal misalignment due to differences in sampling rates and measurement precision (Zhang et al., [2021](https://arxiv.org/html/2605.16380#bib.bib8); Lee et al., [2025](https://arxiv.org/html/2605.16380#bib.bib17)). DTW-based alignment has been studied to address this issue, but most methods are limited to retrospective matching between regular time series and are therefore unsuitable for causal modeling with real-time inference (Sakoe and Chiba, [1978](https://arxiv.org/html/2605.16380#bib.bib18); Senin, [2008](https://arxiv.org/html/2605.16380#bib.bib19)). Supporting multi-resolution learning in irregular clinical time series therefore requires more than parallel multi-scale summarization. It requires a framework that can continuously account for time-varying reliability, organize cross-scale summaries along the temporal axis, and efficiently control sequence length.
Related studies have also explored complementary directions such as efficient recurrent architectures, self-supervised representation learning, and alternative sequence representations (Joshi and Hauskrecht, [2025](https://arxiv.org/html/2605.16380#bib.bib30); Liu et al., [2026](https://arxiv.org/html/2605.16380#bib.bib31); Li et al., [2023](https://arxiv.org/html/2605.16380#bib.bib32)), further highlighting the need for scalable and structure-aware learning under irregular observation patterns.
## 3. Proposed Model
Figure [1](https://arxiv.org/html/2605.16380#S2.F1) illustrates the overall architecture of ReTAMamba. Each input sample consists of values $X \in \mathbb{R}^{L \times V}$, an observation mask $m \in \{0,1\}^{L \times V}$, measurement times $t \in \mathbb{R}^{L}$, and time gaps $\Delta \in \mathbb{R}_{\geq 0}^{L \times V}$, where $\Delta$ denotes the elapsed time since the last observation of each variable. Here, $L$ and $V$ denote the sequence length and number of variables, respectively. ReTAMamba transforms this irregular multivariate time series into a reliability-aware multi-scale token sequence for prediction. It first reconstructs the input as a time-variable token sequence aligned with the observation structure, then aggregates tokenized observations across multiple temporal resolutions under continuously estimated observation reliability, and finally compresses the resulting summaries under a fixed budget before sequence encoding. Through this process, ReTAMamba jointly models observation structure, information reliability, recency, and cross-scale temporal context within a single predictive pipeline.
### 3.1. Tokenization
The Tokenization stage converts an irregular multivariate time series into a time\-variable token sequence that preserves variable\-wise observation structure\. Rather than collapsing the input into conventional interval\-level summaries at the tokenization stage, it represents each time\-variable cell on the shared temporal axis as a token unit, allowing sparse observations and variable\-specific missingness patterns to be retained explicitly in the sequence representation\. Tokenization consists of three steps: Event Flattening, Normalization, and Event Token Embedding\.
Event Flattening. Tokenization does not aggregate variables into a single time-step representation. Instead, each time-variable cell is treated as a separate token unit and rearranged into a one-dimensional sequence aligned with the time-variable structure. The total number of token units generated from one sample is defined as follows:

$$N = L \times V, \tag{1}$$

where $N$ denotes the total number of time-variable token units generated from one sample. The observed values, observation mask, and variable-wise time gaps are flattened into one-dimensional sequences:

$$x^{e} = \mathrm{reshape}(X), \quad m^{e} = \mathrm{reshape}(m), \quad \Delta^{e} = \mathrm{reshape}(\Delta), \tag{2}$$

where $x^{e}$, $m^{e}$, and $\Delta^{e}$ are one-dimensional rearrangements of the value matrix $X$, observation mask $m$, and variable-wise time-gap matrix $\Delta$ according to the token sequence order. Meanwhile, the temporal positions $t \in \mathbb{R}^{L}$ are defined only along the temporal axis. Each time value is therefore repeated across all variables and rearranged into a one-dimensional sequence aligned with the time-variable cell order to construct the event-aligned timestamp $t^{e}$:

$$t^{e} \in \mathbb{R}^{N}, \qquad t^{e} = \mathrm{reshape}(t \otimes \mathbf{1}_{V}), \tag{3}$$

where $\mathbf{1}_{V}$ is an all-one vector of length $V$, and $\otimes$ denotes repetition of the temporal axis along the variable axis. In addition, a variable index sequence $v^{e} \in \{0, \dots, V-1\}^{N}$ is defined to indicate the variable associated with each token unit. This preserves variable-specific identity for each element in the flattened sequence and is later used to model variable-dependent freshness decay.
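As a rough illustration of Eqs. (1) to (3), the sketch below flattens a toy sample using plain Python lists. The time-major cell order, the function name `flatten_events`, and all sample values are assumptions made for this example; the text above does not fix the reshape order.

```python
# Toy sketch of Event Flattening (Eqs. 1-3); time-major order is an assumption.
def flatten_events(X, m, delta, t):
    """Flatten L x V inputs into length-N sequences, with N = L * V."""
    L, V = len(X), len(X[0])
    cells = [(l, v) for l in range(L) for v in range(V)]  # time-major cell order
    x_e = [X[l][v] for l, v in cells]
    m_e = [m[l][v] for l, v in cells]
    gap_e = [delta[l][v] for l, v in cells]
    t_e = [t[l] for l, _ in cells]  # each timestamp repeated V times (Eq. 3)
    v_e = [v for _, v in cells]     # variable index of each token unit
    return x_e, m_e, gap_e, t_e, v_e


# Hypothetical sample: L = 3 time steps, V = 2 variables, times in minutes.
X = [[80.0, 0.0], [82.0, 1.2], [82.0, 1.2]]
m = [[1, 0], [1, 1], [0, 0]]
delta = [[0.0, 0.0], [0.0, 0.0], [30.0, 30.0]]
t = [0.0, 30.0, 60.0]

x_e, m_e, gap_e, t_e, v_e = flatten_events(X, m, delta, t)
print(len(x_e), t_e, v_e)
```

With this ordering, consecutive token units share a timestamp and cycle through the variable indices, which is one way to keep the variable identity recoverable after flattening.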
Normalization. Since the temporal position within the observation window and the elapsed time since the previous observation have different meanings and ranges, time-related inputs are normalized for stable integration within a single token representation. Let $T_{\max}$ denote the maximum length of the observation window in minutes. The event-aligned time is linearly transformed from the absolute range $[0, T_{\max}]$ to $[-1, 1]$:

$$\tau_{i} = \mathrm{clip}\!\left(\frac{2 t_{i}^{e}}{T_{\max}} - 1,\, -1,\, 1\right), \tag{4}$$

where $t_{i}^{e}$ is the absolute measurement time of the $i$-th token unit, and $\tau_{i}$ is the normalized temporal position. Here, $\mathrm{clip}(\cdot)$ restricts a value to the specified range. Next, the raw time gap $\Delta_{i}^{e}$ may exhibit a distribution skewed toward large values in clinical data. Log scaling is therefore applied to mitigate this heavy-tailed property, after which the result is normalized to the range $[0, 1]$:

$$\bar{\Delta}_{i} = \mathrm{clip}\!\left(\frac{\log(1 + \Delta_{i}^{e})}{\log(1 + T_{\max})},\, 0,\, 1\right), \tag{5}$$

where $\Delta_{i}^{e}$ is the raw time gap of the $i$-th token unit, and $\bar{\Delta}_{i}$ is the normalized staleness magnitude in $[0, 1]$. The normalized value is then remapped to $[-1, 1]$ so that it can be encoded together with $\tau_{i}$ on a consistent scale:

$$\delta_{i} = 2\bar{\Delta}_{i} - 1, \tag{6}$$

where $\delta_{i}$ is the normalized gap representation used for staleness embedding. By using $\tau_{i}$ and $\delta_{i}$ separately, the model can distinguish the absolute temporal position of each token unit from the recency of its information.
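The normalization in Eqs. (4) to (6) can be checked numerically with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and the 48-hour window (`T_MAX = 2880.0` minutes) is an assumed value chosen for illustration, not one stated in this section.

```python
import math


def clip(x, lo, hi):
    """Restrict x to the closed range [lo, hi]."""
    return max(lo, min(hi, x))


def normalize_time(t_e, t_max):
    """Eq. (4): map an absolute time in [0, t_max] to tau in [-1, 1]."""
    return clip(2.0 * t_e / t_max - 1.0, -1.0, 1.0)


def normalize_gap(gap, t_max):
    """Eqs. (5)-(6): log-scaled staleness in [0, 1] and its remap to [-1, 1]."""
    gap_bar = clip(math.log1p(gap) / math.log1p(t_max), 0.0, 1.0)
    return gap_bar, 2.0 * gap_bar - 1.0


T_MAX = 2880.0  # assumed observation window: 48 hours, in minutes

tau_mid = normalize_time(1440.0, T_MAX)             # middle of the window
gap_bar_0, delta_0 = normalize_gap(0.0, T_MAX)      # just observed
gap_bar_60, delta_60 = normalize_gap(60.0, T_MAX)   # one hour stale
print(tau_mid, delta_0, round(delta_60, 3))
```

Because of the log scaling, a one-hour gap already covers roughly half of the $[0, 1]$ staleness range under this assumed window, so short gaps are resolved more finely than long ones.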
For entries with $m_{i}^{e} = 0$, the value input is forward-filled from the most recent observation when available; otherwise, it is set to 0 after variable-wise normalization. These inputs are treated as placeholders rather than observed measurements, and their contribution is later modulated by the observation mask and elapsed-time signals through the Reliability Gate.
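The placeholder rule can be sketched for a single variable as follows. The function name `forward_fill` and the toy values are hypothetical, and the sketch assumes the values have already been normalized per variable, so that 0 is the neutral fallback.

```python
def forward_fill(values, mask):
    """Fill missing entries of one variable, given in time order.

    Missing entries take the most recent observed value; entries before
    the first observation fall back to 0.0 (assumed post-normalization).
    """
    filled, last = [], None
    for x, observed in zip(values, mask):
        if observed:
            last = x
            filled.append(x)
        else:
            filled.append(last if last is not None else 0.0)
    return filled


# Hypothetical normalized series; 7.0 marks slots whose raw content is ignored.
values = [7.0, 0.3, 7.0, -0.1, 7.0]
mask = [0, 1, 0, 1, 0]
placeholders = forward_fill(values, mask)
print(placeholders)
```

The filled values are only placeholders: the mask is kept alongside them so that a later stage can still tell observed entries from imputed ones.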
Event Token Embedding. Each token unit is embedded by combining the normalized temporal representation $\tau_{i}$, value input $x_{i}^{e}$, variable index $v_{i}^{e}$, and staleness representation $\delta_{i}$. The continuous inputs $\tau_{i}$, $x_{i}^{e}$, and $\delta_{i}$ are transformed through continuous value embeddings, while the variable identity $v_{i}^{e}$ is represented by a learnable embedding corresponding to each variable index:

$$\begin{aligned} e_{i}^{(t)} &= \mathrm{CVE}_{t}(\tau_{i}), & e_{i}^{(x)} &= \mathrm{CVE}_{x}(x_{i}^{e}), \\ e_{i}^{(v)} &= \mathrm{Emb}(v_{i}^{e}), & e_{i}^{(\Delta)} &= \mathrm{CVE}_{\Delta}(\delta_{i}), \end{aligned} \tag{7}$$

where $e_{i}^{(t)}$, $e_{i}^{(x)}$, $e_{i}^{(v)}$, and $e_{i}^{(\Delta)} \in \mathbb{R}^{D}$ are independent $D$-dimensional embedding vectors corresponding to the temporal position, value input, variable identity, and staleness signal, respectively. Here, $\mathrm{CVE}_{t}(\cdot)$, $\mathrm{CVE}_{x}(\cdot)$, and $\mathrm{CVE}_{\Delta}(\cdot)$ denote embedding functions that map continuous scalar inputs to $D$-dimensional vectors, and $\mathrm{Emb}(\cdot)$ denotes a learnable embedding function for variable indices. The final event embedding is computed as follows:

$$z_{i} = e_{i}^{(t)} + e_{i}^{(x)} + e_{i}^{(v)} + e_{i}^{(\Delta)}, \tag{8}$$

where $z_{i} \in \mathbb{R}^{D}$ is the final representation of the $i$-th token unit. The full event token embedding sequence is then denoted by $Z$:

$$Z = [z_{1}, \dots, z_{N}]^{\top} \in \mathbb{R}^{N \times D}, \tag{9}$$

where $N$ is the total number of token units within the observation window and $D$ is the embedding dimension. $Z$ serves as the base representation for subsequent stages to preserve time-variable information, incorporate observation reliability, and construct summary tokens across multiple temporal resolutions.
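The additive structure of Eq. (8) is illustrated below with stand-in embeddings. Note the assumptions: the internal form of the continuous value embeddings is not specified in this section, so each one is replaced here by a fixed random affine map from a scalar to a $D$-dimensional vector, and the variable embedding by a random lookup table. All names and sizes are hypothetical, and nothing is trained.

```python
import random

D = 4  # embedding dimension (toy value)
V = 3  # number of variables (toy value)
rng = random.Random(0)


def make_cve():
    """Stand-in continuous value embedding: scalar s -> w * s + b in R^D."""
    w = [rng.uniform(-1.0, 1.0) for _ in range(D)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(D)]
    return lambda s: [wi * s + bi for wi, bi in zip(w, b)]


cve_t, cve_x, cve_gap = make_cve(), make_cve(), make_cve()
var_table = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(V)]


def embed_token(tau, x, v, delta):
    """Eq. (8): element-wise sum of the four component embeddings."""
    parts = [cve_t(tau), cve_x(x), var_table[v], cve_gap(delta)]
    return [sum(component) for component in zip(*parts)]


z = embed_token(tau=0.0, x=0.5, v=1, delta=-1.0)
print(len(z))
```

Summing rather than concatenating keeps every token at dimension $D$ regardless of how many signals are combined, which is the property Eq. (8) relies on.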
### 3.2. Reliability Gate
The Reliability Gate converts missingness and elapsed time into a continuous reliability weight used during aggregation. Its primary role is to distinguish observed entries from imputed placeholders and to downweight uncertain missing information, rather than to encode all recency effects within the gate itself. To this end, a positive decay rate $\lambda_{c}$ is learned for each variable channel $c \in \{0, \dots, V-1\}$:

$$\lambda_{c} = \mathrm{softplus}(w_{c}) + \lambda_{\min}, \tag{10}$$

where $w_{c}$ is a learnable scalar parameter for channel $c$, and $\lambda_{c} > 0$ is the decay rate of that channel. $\mathrm{softplus}(\cdot)$ enforces positivity, and $\lambda_{\min} > 0$ prevents excessively small decay. For each token unit $i$, the reliability weight is computed using the decay rate $\lambda_{v_{i}^{e}}$ corresponding to its variable index $v_{i}^{e}$:

$$\mathrm{rel}_{w,i} = m_{i}^{e} + (1 - m_{i}^{e}) \exp\!\left(-\lambda_{v_{i}^{e}} \Delta_{i}^{e}\right), \tag{11}$$

where $\mathrm{rel}_{w,i} \in \mathbb{R}$ is the reliability weight of the $i$-th token unit. If $m_{i}^{e} = 1$, the corresponding time-variable entry is observed, and thus $\mathrm{rel}_{w,i} = 1$. If $m_{i}^{e} = 0$, the entry is missing, and the reliability decreases exponentially as $\Delta_{i}^{e}$ increases. Thus, $\lambda_{v_{i}^{e}}$ controls the variable-specific decay rate of reliability, while $\Delta_{i}^{e}$ reflects the staleness of missing information. Collecting all token-wise weights yields the output of the Reliability Gate:

$$\mathrm{rel}_{w} = [\mathrm{rel}_{w,1}, \dots, \mathrm{rel}_{w,N}]^{\top} \in \mathbb{R}^{N}, \tag{12}$$

where $\mathrm{rel}_{w}$ is the reliability vector over the full flattened sequence. It is used in the Multi-scale Aggregation stage to modulate the contribution of each token unit when time-variable entries are aggregated into interval-level summaries. This design places greater emphasis on directly observed data while suppressing uncertain missing information. For observed entries, recency is modeled separately through the $\Delta t$ embedding and the bucket-level mean staleness used in multi-scale aggregation. Accordingly, the gate and staleness-related features play complementary roles rather than encoding the same notion twice.
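A small numeric sketch of Eqs. (10) and (11) follows. The parameter values (`w_c = -4.0` and `-7.0`, `lam_min = 1e-3`) are invented for illustration and are not learned or reported values; gaps are taken in minutes, matching the raw $\Delta_{i}^{e}$.

```python
import math


def softplus(w):
    """softplus(w) = log(1 + exp(w)), which is always positive."""
    return math.log1p(math.exp(w))


def decay_rate(w_c, lam_min=1e-3):
    """Eq. (10): positive per-channel decay rate (lam_min is an assumed value)."""
    return softplus(w_c) + lam_min


def reliability(m_e, gap, lam):
    """Eq. (11): 1 for observed entries, exponential decay for missing ones."""
    return m_e + (1 - m_e) * math.exp(-lam * gap)


lam_fast = decay_rate(-4.0)  # hypothetical fast-decaying channel
lam_slow = decay_rate(-7.0)  # hypothetical slow-decaying channel

rel_observed = reliability(1, 60.0, lam_fast)  # observed: gate stays at 1
rel_fast = reliability(0, 60.0, lam_fast)      # missing, one hour stale
rel_slow = reliability(0, 60.0, lam_slow)
print(rel_observed, round(rel_fast, 3), round(rel_slow, 3))
```

For the same one-hour gap, the channel with the larger decay rate ends up with the smaller weight, which is the variable-specific behavior the gate is meant to express.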
### 3.3. Multi-scale Aggregation
Multi-scale Aggregation constructs compressed summary-token sequences at multiple temporal resolutions while preserving time-variable information from the flattened input. It consists of five submodules: Bucketing, Weighted Pooling, Stats Augment, Concat Scales, and Chronological Weaving. Let $S = \{s_{1}, s_{2}, \dots, s_{M}\}$ denote the set of temporal resolutions, where each $s \in S$ is a temporal scale in minutes. The specific scale values are given in the experimental setup.
Bucketing. Bucketing assigns each token unit to a time interval at a given temporal scale. To compute the bucket index of the $i$-th token unit at scale $s$, the normalized time $\tau_{i}$ is first restored to its position in minutes:

$$u_{i} = \frac{\tau_{i} + 1}{2} T_{\max}, \tag{13}$$

where $u_{i}$ is the restored temporal position in minutes. The bucket index of token unit $i$ at scale $s$ is then defined as

$$b_{i}^{(s)} = \left\lfloor \frac{u_{i}}{s} \right\rfloor, \tag{14}$$

where $b_{i}^{(s)}$ is the bucket index and $\lfloor \cdot \rfloor$ denotes the floor operation. Thus, bucket assignment follows a left-closed, right-open convention, so that each token unit is assigned to exactly one interval according to its restored time position. The set of indices assigned to bucket $k$ at scale $s$ is

$$B_{k}^{(s)} = \{\, i \mid b_{i}^{(s)} = k \,\}, \tag{15}$$

where $B_{k}^{(s)}$ is the index set of token units in bucket $k$ at scale $s$. In addition, the center time of each bucket is defined as

$$\theta_{k}^{(s)} = \left(k + \frac{1}{2}\right) s, \tag{16}$$

where $\theta_{k}^{(s)}$ is the center time of bucket $k$ at scale $s$ and is later used for temporal ordering across scales.
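To make the bucket arithmetic of Eqs. (13), (14), and (16) concrete, the sketch below places one token at two scales. The scales (60 and 240 minutes) and the 2880-minute window are assumed values for illustration only.

```python
import math


def bucket_index(tau, scale, t_max):
    """Eqs. (13)-(14): restore minutes from tau, then floor-divide by the scale."""
    u = (tau + 1.0) / 2.0 * t_max
    return math.floor(u / scale)


def bucket_center(k, scale):
    """Eq. (16): center time of bucket k at the given scale, in minutes."""
    return (k + 0.5) * scale


T_MAX = 2880.0  # assumed window length in minutes
tau = -0.9375   # corresponds to t = 90 min under this T_MAX

k_60 = bucket_index(tau, 60.0, T_MAX)    # 90 min falls in [60, 120)
k_240 = bucket_index(tau, 240.0, T_MAX)  # 90 min falls in [0, 240)
print(k_60, bucket_center(k_60, 60.0), k_240, bucket_center(k_240, 240.0))
```

The same token lands in bucket 1 at the 60-minute scale and bucket 0 at the 240-minute scale, with centers at 90 and 120 minutes. One caveat worth checking in practice: a token at exactly $\tau_{i} = 1$ maps to index $T_{\max}/s$, one past the last full bucket, so an implementation may need to handle that boundary explicitly.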
Weighted Pooling. Weighted Pooling aggregates token units within each bucket in a reliability-aware manner to form a single vector representation for the corresponding interval:

$$\mu_{k}^{(s)} = \frac{\sum_{i \in B_{k}^{(s)}} \mathrm{rel}_{w,i}\, z_{i}}{\sum_{i \in B_{k}^{(s)}} \mathrm{rel}_{w,i} + \varepsilon}, \tag{17}$$

where $\mu_{k}^{(s)} \in \mathbb{R}^{D}$ is the pooled representation of bucket $k$ at scale $s$, $z_{i}$ is the $i$-th token embedding, $\mathrm{rel}_{w,i}$ is its reliability weight, and $\varepsilon > 0$ is a small constant to prevent division by zero. If $B_{k}^{(s)}$ is empty, no token is generated for that bucket and it is excluded in later concatenation.
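Eq. (17) is a weighted mean, which the following sketch computes for one non-empty bucket. The function name, the two-dimensional toy tokens, and `eps = 1e-8` are assumptions for illustration.

```python
def weighted_pool(tokens, weights, eps=1e-8):
    """Eq. (17): reliability-weighted mean of the token vectors in one bucket."""
    denom = sum(weights) + eps
    dim = len(tokens[0])
    return [
        sum(w * z[d] for w, z in zip(weights, tokens)) / denom
        for d in range(dim)
    ]


# Hypothetical bucket: one observed token (weight 1.0) and one stale
# placeholder token (weight 0.5), each with D = 2.
tokens = [[1.0, 0.0], [3.0, 2.0]]
weights = [1.0, 0.5]
mu = weighted_pool(tokens, weights)
print([round(x, 4) for x in mu])
```

The pooled vector sits closer to the observed token than a plain average would, since the placeholder contributes with only half the weight. Empty buckets are assumed to be skipped by the caller, as described above.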
Stats Augment. To enrich mean-pooled bucket representations, this module incorporates a dispersion signal and bucket-level statistics that summarize observation density and freshness. The dispersion signal $\nu_{k}^{(s)}$ is computed from the reliability-weighted second moment:

$$\nu_{k}^{(s)} = \left( \frac{\sum_{i \in B_{k}^{(s)}} \mathrm{rel}_{w,i}\, (z_{i} \odot z_{i})}{\sum_{i \in B_{k}^{(s)}} \mathrm{rel}_{w,i} + \varepsilon} - \mu_{k}^{(s)} \odot \mu_{k}^{(s)} \right)_{+}, \tag{18}$$

where $\nu_{k}^{(s)} \in \mathbb{R}^{D}$ reflects variance-like feature behavior within the bucket, $\odot$ denotes element-wise product, and $(\cdot)_{+}$ denotes element-wise nonnegative clipping. The dispersion information is then transformed in a scale-specific manner and added to the bucket representation:

$$\mu_{k}^{(s)} \leftarrow \mu_{k}^{(s)} + \phi_{s}\!\left(\nu_{k}^{(s)}\right), \tag{19}$$

where $\phi_{s}(\cdot)$ is a scale-specific transformation implemented with layer normalization and a learnable projection.
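The dispersion signal of Eq. (18) can be sketched as a weighted second moment minus the squared weighted mean, clipped at zero. Only Eq. (18) is covered here; the learned transformation $\phi_{s}$ of Eq. (19) is omitted. The function name, toy tokens, and `eps` value are assumptions.

```python
def weighted_dispersion(tokens, weights, eps=1e-8):
    """Eq. (18): clipped (weighted second moment - squared weighted mean)."""
    denom = sum(weights) + eps
    dim = len(tokens[0])
    mu = [sum(w * z[d] for w, z in zip(weights, tokens)) / denom
          for d in range(dim)]
    second = [sum(w * z[d] * z[d] for w, z in zip(weights, tokens)) / denom
              for d in range(dim)]
    # Clip at zero so rounding error cannot produce a negative dispersion.
    return [max(s - m * m, 0.0) for s, m in zip(second, mu)]


# Hypothetical bucket with D = 2: the first feature varies, the second does not.
tokens = [[1.0, 2.0], [3.0, 2.0]]
weights = [1.0, 1.0]
nu = weighted_dispersion(tokens, weights)
print([round(x, 4) for x in nu])
```

With equal weights this reduces to the population variance per feature, so the first feature gives a value near 1 and the constant feature a value near 0.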
Bucket-level measurement patterns are also incorporated using observation count, coverage, effective count, and mean staleness:

$$n_{k,\mathrm{obs}}^{(s)} = \sum_{i \in B_{k}^{(s)}} m_{i}^{e}, \qquad n_{k,\mathrm{all}}^{(s)} = \sum_{i \in B_{k}^{(s)}} 1, \tag{20}$$

where $n_{k,\mathrm{obs}}^{(s)}$ and $n_{k,\mathrm{all}}^{(s)}$ denote the numbers of observed and total entries, respectively. Since empty buckets are removed in advance, $n_{k,\mathrm{all}}^{(s)} \geq 1$. Coverage is defined as

$$\rho_{k}^{(s)} = \frac{n_{k,\mathrm{obs}}^{(s)}}{n_{k,\mathrm{all}}^{(s)}}, \tag{21}$$

where $\rho_{k}^{(s)} \in [0, 1]$ is the proportion of observed entries in the bucket. The reliability-weighted effective count $w_{k}^{(s)}$ and mean staleness $\bar{\Delta}_{k}^{(s)}$ are defined as

$$w_{k}^{(s)} = \sum_{i \in B_{k}^{(s)}} \mathrm{rel}_{w,i}, \tag{22}$$

$$\bar{\Delta}_{k}^{(s)} = \frac{\sum_{i \in B_{k}^{(s)}} \mathrm{rel}_{w,i}\, \bar{\Delta}_{i}}{\sum_{i \in B_{k}^{(s)}} \mathrm{rel}_{w,i} + \varepsilon}, \tag{23}$$

and these statistics are concatenated as

$$g_{k}^{(s)} = \left[ \log\!\bigl(1 + n_{k,\mathrm{obs}}^{(s)}\bigr),\; \rho_{k}^{(s)},\; \log\!\bigl(1 + w_{k}^{(s)}\bigr),\; \bar{\Delta}_{k}^{(s)} \right], \tag{24}$$

where $g_{k}^{(s)} \in \mathbb{R}^{4}$ is the bucket-level statistics vector. The statistics embedding produced by an MLP is added to the bucket token. Each bucket also receives a bucket-center-time embedding and a scale embedding to preserve temporal position and resolution information.
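The four statistics of Eqs. (20) to (24) can be computed for one bucket as sketched below. The function name and the toy inputs are hypothetical; the subsequent MLP and the time and scale embeddings are not modeled.

```python
import math


def bucket_stats(mask, rel, gap_bar, eps=1e-8):
    """Eqs. (20)-(24): statistics vector g for one non-empty bucket."""
    n_obs = sum(mask)                 # observed entries (Eq. 20)
    n_all = len(mask)                 # all entries, at least 1 (Eq. 20)
    coverage = n_obs / n_all          # Eq. (21)
    w_eff = sum(rel)                  # effective count (Eq. 22)
    mean_stale = sum(r * g for r, g in zip(rel, gap_bar)) / (w_eff + eps)  # Eq. (23)
    return [math.log1p(n_obs), coverage, math.log1p(w_eff), mean_stale]


# Hypothetical bucket with four token units, two of them observed.
mask = [1, 0, 1, 0]
rel = [1.0, 0.5, 1.0, 0.25]
gap_bar = [0.0, 0.4, 0.0, 0.8]  # normalized staleness in [0, 1]
g = bucket_stats(mask, rel, gap_bar)
print([round(x, 4) for x in g])
```

In this example coverage is 0.5 while the effective count is 2.75, showing how partially reliable placeholders add to the effective count without counting as observations.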
Concat Scales and Chronological Weaving. This module integrates the summary tokens generated at each scale and reorders them along the temporal axis so that the backbone can process multi-resolution information in a more coherent temporal context. Let $Z^{(s)}$ denote the valid bucket-token sequence generated at scale $s\in S=\{s_{1},\dots,s_{M}\}$. The full multi-scale token sequence, together with its corresponding mask and bucket-center-time sequence, is defined as

$$Z=[Z^{(s_{1})};Z^{(s_{2})};\dots;Z^{(s_{M})}]\in\mathbb{R}^{N_{\mathrm{tok}}\times D}, \tag{25}$$

$$M=[M^{(s_{1})};M^{(s_{2})};\dots;M^{(s_{M})}]\in\{0,1\}^{N_{\mathrm{tok}}}, \tag{26}$$

$$T=[T^{(s_{1})};T^{(s_{2})};\dots;T^{(s_{M})}]\in\mathbb{R}^{N_{\mathrm{tok}}}, \tag{27}$$

where $Z^{(s)}$ is the sequence of valid bucket tokens at scale $s$, and $M^{(s)}$ and $T^{(s)}$ are the corresponding valid mask and ordered sequence of bucket center times $\theta_{k}^{(s)}$ defined in ([16](https://arxiv.org/html/2605.16380#S3.E16)). Although empty buckets are removed in advance, the valid-token mask $M$ is retained for notational consistency with later token selection and sequence processing. Here, $D$ is the token embedding dimension, and $N_{\mathrm{tok}}$ is the total number of valid tokens across all scales.

Simple concatenation arranges tokens in scale-wise blocks, so summaries describing the same or nearby periods at different resolutions may be far apart in the sequence. To address this, Chronological Weaving reorders all tokens according to $T$, so that summaries centered around similar time points become adjacent regardless of resolution. Because each bucket token is constructed only from entries within its own interval and weaving is applied only after summary construction, this step does not mix raw observations across intervals. As a result, the backbone can process nearby cross-scale summaries in a coherent time-ordered context without additional cross-interval aggregation.
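A minimal NumPy sketch of the concatenation in (25) to (27) followed by Chronological Weaving is given below. The function name `chronological_weave` and the stable tie-breaking rule for tokens with equal center times are assumptions; the text does not specify how ties are resolved.

```python
import numpy as np

def chronological_weave(tokens_per_scale, times_per_scale):
    """Concatenate per-scale tokens (Eqs. 25-27) and reorder them by center time.

    tokens_per_scale : list of arrays, each of shape (N_s, D)
    times_per_scale  : list of arrays, each of shape (N_s,), bucket center times
    Returns the woven tokens, their center times, and each token's scale index.
    """
    Z = np.concatenate(tokens_per_scale, axis=0)
    T = np.concatenate(times_per_scale, axis=0)
    scale_id = np.concatenate(
        [np.full(len(t), s) for s, t in enumerate(times_per_scale)]
    )
    # Stable sort: tokens with equal center times keep their scale-block order.
    # This tie-breaking rule is an assumption, not taken from the paper.
    order = np.argsort(T, kind="stable")
    return Z[order], T[order], scale_id[order]

# Toy example over a 4-hour window with 60-, 120-, and 240-minute buckets.
tokens = [np.zeros((4, 2)), np.ones((2, 2)), np.full((1, 2), 2.0)]
times = [np.array([30.0, 90.0, 150.0, 210.0]),
         np.array([60.0, 180.0]),
         np.array([120.0])]
Z_woven, T_woven, scale_woven = chronological_weave(tokens, times)
```

After weaving, the center times are 30, 60, 90, 120, 150, 180, 210 minutes, so the 120-minute and 240-minute summaries sit between the 60-minute summaries they overlap instead of forming separate blocks at the end of the sequence.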
### 3\.4\.Sequence Modeling
The Sequence Modeling stage takes as input the multi-scale token sequence $Z$, the corresponding mask $M$, and the temporal position sequence $T$. Its goal is to compress the multi-scale token sequence under a fixed budget and encode it with a Mamba backbone for final prediction.
Budgeted Token Router. Although multi-scale aggregation summarizes information across multiple temporal resolutions, the resulting token sequence can still be long depending on the number of scales and buckets. To address this, a budgeted token router is applied before sequence encoding. This module reduces computation while preserving predictive information by assigning routing scores to tokens and retaining a compact subset under a fixed token budget. The importance score $s_{j}$ of each token $Z_{j}$ is computed as follows:

$$s_{j}=w_{r}^{\top}Z_{j}+b_{r}, \tag{28}$$

where $w_{r}\in\mathbb{R}^{D}$ and $b_{r}\in\mathbb{R}$ are learnable router parameters. During training, the routing scores are converted into differentiable soft routing weights, which are used to modulate token representations so that the router can be optimized jointly with the downstream prediction objective. During inference, hard top-$k$ selection is applied based on the routing scores to construct a compact sequence with a fixed token budget. Here, $k$ is a hyperparameter. In this way, soft routing is used to maintain differentiability during training, whereas hard top-$k$ routing is used to enforce fixed-budget sequence compression at inference time.

The selected tokens are then reordered chronologically to form $(Z^{\prime},M^{\prime},T^{\prime})$. Here, $Z^{\prime}\in\mathbb{R}^{N^{\prime}\times D}$ is the selected token sequence, $M^{\prime}\in\{0,1\}^{N^{\prime}}$ is the corresponding mask, and $T^{\prime}\in\mathbb{R}^{N^{\prime}}$ contains the temporal positions of the selected tokens. In addition, $N^{\prime}\leq k$ denotes the number of valid tokens after routing. Since routing is based on token importance, the selected order may not follow time. Chronological reordering restores a time-ordered sequence before Mamba encoding.
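The routing step can be sketched as follows. This is an illustrative NumPy version, not the authors' code: the function name `route_tokens` is hypothetical, and the sigmoid gate used for the training branch is an assumption, since the text only states that scores are converted into differentiable soft routing weights without giving their form.

```python
import numpy as np

def route_tokens(Z, T, w_r, b_r, k, training=False):
    """Score tokens with Eq. (28), then apply soft or hard top-k routing."""
    Z = np.asarray(Z, dtype=float)
    T = np.asarray(T, dtype=float)
    scores = Z @ np.asarray(w_r, dtype=float) + b_r   # s_j = w_r^T Z_j + b_r
    if training:
        # Assumed soft routing: a sigmoid gate that rescales every token.
        gate = 1.0 / (1.0 + np.exp(-scores))
        return Z * gate[:, None], T
    # Hard top-k: keep the k highest-scoring tokens (all of them if fewer than k).
    k = min(k, len(scores))
    keep = np.argsort(-scores, kind="stable")[:k]
    # Restore chronological order among the kept tokens.
    keep = keep[np.argsort(T[keep], kind="stable")]
    return Z[keep], T[keep]

# Toy example: five tokens, budget k = 3, router reads the first feature only.
Z_toy = np.array([[0.1, 0.0], [0.9, 0.0], [0.5, 0.0], [0.8, 0.0], [0.2, 0.0]])
T_toy = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
Z_sel, T_sel = route_tokens(Z_toy, T_toy, w_r=[1.0, 0.0], b_r=0.0, k=3)
```

In the toy example, the three highest-scoring tokens are those at times 20, 40, and 30 (in score order); after chronological reordering the retained sequence is at times 20, 30, 40.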
Mamba Backbone and Prediction. The reordered token sequence $Z^{\prime}$ is fed into the Mamba backbone. Mamba (Gu and Dao, [2024](https://arxiv.org/html/2605.16380#bib.bib13)) is used as the sequence encoder to efficiently capture temporal dependencies among the selected summary tokens. The backbone output sequence is written as follows:

$$H=\mathrm{Mamba}(Z^{\prime})\in\mathbb{R}^{N^{\prime}\times D}, \tag{29}$$

where $H$ is the backbone output sequence, and $h_{j}\in\mathbb{R}^{D}$ is the $j$-th output token representation. The final prediction uses the representation of the last valid token:

$$h_{\mathrm{pool}}=h_{N^{\prime}}, \tag{30}$$

where $h_{\mathrm{pool}}\in\mathbb{R}^{D}$ is the sample-level pooled representation. After chronological reordering, the last token corresponds to the latest retained summary in the reordered sequence. This design preserves the most recent summarized patient state after temporal reordering while keeping the prediction head simple. The logit and prediction are then computed through a linear output head:

$$\mathrm{logit}=W_{o}h_{\mathrm{pool}}+b_{o},\qquad\hat{y}=\sigma(\mathrm{logit}), \tag{31}$$

where $W_{o}$ and $b_{o}$ are output-head parameters, $\mathrm{logit}\in\mathbb{R}$ is the pre-sigmoid score for binary classification, and $\hat{y}\in(0,1)$ is the final predicted probability.
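The pooling and output head in (30) and (31) can be sketched in NumPy as below. The backbone itself is not reproduced: `H` stands in for the Mamba output of (29), and the function name `predict_from_hidden` and the explicit `valid_mask` argument are assumptions made for illustration.

```python
import numpy as np

def predict_from_hidden(H, valid_mask, W_o, b_o):
    """Pool the last valid token (Eq. 30) and apply the sigmoid head (Eq. 31)."""
    H = np.asarray(H, dtype=float)
    valid_idx = np.flatnonzero(np.asarray(valid_mask))
    if valid_idx.size == 0:
        raise ValueError("at least one valid token is required")
    h_pool = H[valid_idx[-1]]                               # last valid token
    logit = float(np.asarray(W_o, dtype=float) @ h_pool + b_o)  # linear head
    y_hat = 1.0 / (1.0 + np.exp(-logit))                    # predicted probability
    return logit, y_hat

# Toy example: three token slots, the last one is padding.
H_toy = np.array([[0.0, 0.0], [1.0, 2.0], [9.0, 9.0]])
logit, y_hat = predict_from_hidden(H_toy, valid_mask=[1, 1, 0], W_o=[0.5, -0.25], b_o=0.0)
```

Here the padded third slot is ignored, the pooled vector is the second token, and the resulting logit of 0 maps to a probability of 0.5.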
## 4\.Experiments
This section evaluates ReTAMamba on three clinical time\-series benchmarks\. After describing the experimental setup, it presents overall performance comparisons, component\-wise ablations, and efficiency analysis\. It then examines differences in observation patterns between survivors and non\-survivors to interpret model behavior, and finally quantifies the effect of the multi\-resolution temporal design\.
### 4\.1\.Experimental Setup
Datasets. Experiments were conducted on three ICU clinical time-series benchmarks with irregular sampling and missing values: MIMIC-IV (Johnson et al., [2023](https://arxiv.org/html/2605.16380#bib.bib20)), eICU (Pollard et al., [2018](https://arxiv.org/html/2605.16380#bib.bib21)), and PhysioNet 2012 (Silva et al., [2012](https://arxiv.org/html/2605.16380#bib.bib22)). The prediction task was unified as in-hospital mortality prediction across all datasets. For each patient, only the first 48 hours after ICU admission were used as input to reflect an early risk assessment setting. The input consisted of 17 clinical variables commonly used in prior clinical time-series studies, and only variables available across all three datasets were retained to ensure a consistent evaluation setting (Harutyunyan et al., [2019](https://arxiv.org/html/2605.16380#bib.bib3)). Cases with missing outcome labels, invalid ICU stay information, or insufficient observations within the 48-hour window were excluded during preprocessing. These datasets differ in cohort scale, missingness patterns, and observation density, providing complementary testbeds for evaluating robustness under a unified early-risk prediction setting across heterogeneous ICU environments.
Baselines. To evaluate the proposed design from multiple perspectives, the baselines were organized into three groups. First, we included general-purpose predictive backbones, namely XGBoost (Chen and Guestrin, [2016](https://arxiv.org/html/2605.16380#bib.bib23)), LSTM (Hochreiter and Schmidhuber, [1997](https://arxiv.org/html/2605.16380#bib.bib24)), Transformer (Vaswani et al., [2017](https://arxiv.org/html/2605.16380#bib.bib12)), and Mamba (Gu and Dao, [2024](https://arxiv.org/html/2605.16380#bib.bib13)). Second, we considered clinical time-series models designed to handle irregularity and missingness, including GRU-D (Che et al., [2018](https://arxiv.org/html/2605.16380#bib.bib5)), IP-Nets (Shukla and Marlin, [2019](https://arxiv.org/html/2605.16380#bib.bib7)), mTAN (Shukla and Marlin, [2021](https://arxiv.org/html/2605.16380#bib.bib11)), SeFT (Horn et al., [2020](https://arxiv.org/html/2605.16380#bib.bib9)), and Raindrop (Zhang et al., [2021](https://arxiv.org/html/2605.16380#bib.bib8)). Third, we included Warpformer (Zhang et al., [2023](https://arxiv.org/html/2605.16380#bib.bib10)) as the most relevant comparator because it explicitly combines temporal aggregation with multi-scale representation learning for irregular clinical time series.
Evaluation Metrics\.AUROC and AUPRC were used to evaluate binary classification performance\. Because in\-hospital mortality prediction is highly imbalanced, AUPRC was treated as the primary metric\.
Table 1. Temporal input types used by different models.

| Model | Mask | Time gap | Timestamp |
| --- | --- | --- | --- |
| XGBoost | ✓ | – | – |
| LSTM | ✓ | ✓ | – |
| Transformer | ✓ | ✓ | – |
| Mamba | ✓ | ✓ | – |
| GRU-D | ✓ | ✓ | – |
| IP-Nets | ✓ | ✓ | – |
| mTAN | ✓ | ✓ | ✓ |
| SeFT | ✓ | ✓ | ✓ |
| Raindrop | ✓ | ✓ | ✓ |
| Warpformer | ✓ | ✓ | ✓ |
| ReTAMamba | ✓ | ✓ | ✓ |

Table 2. Main results on three clinical time-series benchmarks.

| Model | MIMIC-IV AUROC | MIMIC-IV AUPRC | eICU AUROC | eICU AUPRC | PhysioNet 2012 AUROC | PhysioNet 2012 AUPRC |
| --- | --- | --- | --- | --- | --- | --- |
| XGBoost | 0.815 ± 0.001 | 0.481 ± 0.002 | 0.764 ± 0.001 | 0.343 ± 0.001 | 0.804 ± 0.001 | 0.443 ± 0.001 |
| LSTM | 0.850 ± 0.002 | 0.515 ± 0.001 | 0.781 ± 0.003 | 0.350 ± 0.004 | 0.826 ± 0.005 | 0.458 ± 0.017 |
| Transformer | 0.852 ± 0.003 | 0.523 ± 0.005 | 0.784 ± 0.002 | 0.361 ± 0.001 | 0.820 ± 0.008 | 0.468 ± 0.012 |
| Mamba | 0.840 ± 0.002 | 0.490 ± 0.008 | 0.774 ± 0.004 | 0.351 ± 0.006 | 0.814 ± 0.007 | 0.468 ± 0.014 |
| GRU-D | 0.854 ± 0.002 | 0.517 ± 0.005 | 0.783 ± 0.004 | 0.358 ± 0.007 | 0.831 ± 0.003 | 0.487 ± 0.011 |
| IP-Nets | 0.856 ± 0.002 | 0.529 ± 0.004 | 0.787 ± 0.004 | 0.362 ± 0.003 | 0.825 ± 0.004 | 0.463 ± 0.007 |
| mTAN | 0.848 ± 0.009 | 0.511 ± 0.008 | 0.778 ± 0.005 | 0.354 ± 0.005 | 0.829 ± 0.006 | 0.472 ± 0.010 |
| SeFT | 0.853 ± 0.001 | 0.514 ± 0.005 | 0.783 ± 0.005 | 0.359 ± 0.009 | 0.833 ± 0.005 | 0.501 ± 0.013 |
| Raindrop | 0.841 ± 0.003 | 0.489 ± 0.010 | 0.775 ± 0.005 | 0.352 ± 0.007 | 0.821 ± 0.003 | 0.476 ± 0.009 |
| Warpformer | 0.855 ± 0.004 | 0.528 ± 0.014 | 0.784 ± 0.002 | 0.372 ± 0.004 | 0.831 ± 0.007 | 0.485 ± 0.020 |
| ReTAMamba | **0.861 ± 0.001** | **0.548 ± 0.005** | **0.793 ± 0.004** | **0.384 ± 0.003** | **0.847 ± 0.002** | **0.520 ± 0.013** |

Note. Best results are shown in bold. All results are reported as mean ± standard deviation over five random seeds.
Table 3. Ablation results of key components in ReTAMamba on three clinical time-series benchmarks.

| Variant | MIMIC-IV AUROC | MIMIC-IV AUPRC | eICU AUROC | eICU AUPRC | PhysioNet 2012 AUROC | PhysioNet 2012 AUPRC |
| --- | --- | --- | --- | --- | --- | --- |
| ReTAMamba (Full) | **0.861 ± 0.001** | **0.548 ± 0.005** | **0.793 ± 0.004** | **0.384 ± 0.003** | **0.847 ± 0.002** | **0.520 ± 0.013** |
| w/o Reliability Gate | 0.850 ± 0.003 | 0.516 ± 0.012 | 0.784 ± 0.002 | 0.367 ± 0.004 | 0.823 ± 0.004 | 0.487 ± 0.015 |
| w/o $\Delta t$ Embedding | 0.858 ± 0.002 | 0.528 ± 0.007 | 0.786 ± 0.003 | 0.369 ± 0.003 | 0.842 ± 0.005 | 0.508 ± 0.014 |
| w/o Stats Augment | 0.855 ± 0.003 | 0.535 ± 0.006 | 0.787 ± 0.003 | 0.369 ± 0.004 | 0.841 ± 0.004 | 0.507 ± 0.012 |
| w/o Token Router | 0.858 ± 0.002 | 0.538 ± 0.006 | 0.780 ± 0.002 | 0.361 ± 0.004 | 0.841 ± 0.003 | 0.512 ± 0.010 |
| w/o Chronological Weaving | 0.847 ± 0.003 | 0.520 ± 0.005 | 0.775 ± 0.003 | 0.352 ± 0.004 | 0.826 ± 0.007 | 0.501 ± 0.010 |

Note. Each variant removes one key component from ReTAMamba. Best results are shown in bold. All results are reported as mean ± standard deviation over five random seeds.
Implementation Details\.All models were trained and evaluated under the same task definition, patient\-level train/validation/test splits \(0\.70/0\.10/0\.20\), 48\-hour observation window, variable set, and preprocessing pipeline to ensure fair comparison\. The temporal input types used by different models are summarized in Table[1](https://arxiv.org/html/2605.16380#S4.T1)\. The shared preprocessing pipeline was applied to the value inputs of all models, while temporal inputs such as masks, time gaps, and timestamps were additionally provided only when required by each architecture\. Missing values were forward\-filled from the most recent observation when available; unresolved values were then set to 0 after variable\-wise z\-score normalization using training\-set statistics, while missingness and elapsed time were preserved through mask and time\-gap features\. All neural models were optimized with AdamW using a batch size of 64 and trained for up to 50 epochs with early stopping based on validation performance; the best validation checkpoint was used for test evaluation\. Results are reported as averages over five random seeds\. Hyperparameters for all models were tuned on the validation set using Optuna under matched search budgets\. The final hyperparameter settings of ReTAMamba are summarized in Appendix A\.1 \(Table[A1](https://arxiv.org/html/2605.16380#A1.T1)\)\.
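The value preprocessing described above (forward fill, z-score normalization with training-set statistics, zero fill for unresolved values, plus mask and time-gap features) can be sketched as follows. This is a simplified illustration that assumes each patient is already laid out on a regular grid of shape `(T, V)` with `NaN` for missing entries; the function name `preprocess`, the grid layout, and the convention of counting the gap in grid steps (starting from 1 before the first observation) are assumptions, not details given in the paper.

```python
import numpy as np

def preprocess(x, train_mean, train_std):
    """Forward-fill, z-score, and zero-fill one patient's (T, V) array."""
    x = np.asarray(x, dtype=float)
    train_mean = np.asarray(train_mean, dtype=float)
    train_std = np.asarray(train_std, dtype=float)
    mask = (~np.isnan(x)).astype(float)      # 1 where a value was observed
    n_steps, n_vars = x.shape
    filled = x.copy()
    gap = np.zeros_like(x)                   # grid steps since the last observation
    for v in range(n_vars):
        last, since = np.nan, 0.0
        for t in range(n_steps):
            if mask[t, v] == 1:
                last, since = x[t, v], 0.0
            else:
                since += 1.0
                filled[t, v] = last          # stays NaN before the first observation
            gap[t, v] = since
    z = (filled - train_mean) / train_std    # training-set statistics only
    z = np.where(np.isnan(z), 0.0, z)        # unresolved values become 0
    return z, mask, gap

# Toy patient: 3 time steps, 2 variables.
x_toy = [[np.nan, 1.0], [2.0, np.nan], [np.nan, np.nan]]
z, mask, gap = preprocess(x_toy, train_mean=[1.0, 0.0], train_std=[1.0, 2.0])
```

Because the zero fill is applied after normalization, an unresolved value is placed at the training-set mean of its variable, while `mask` and `gap` keep the information that it was never observed.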
### 4\.2\.Main Results
This section compares the overall predictive performance of ReTAMamba against existing models. Table [2](https://arxiv.org/html/2605.16380#S4.T2) summarizes the results on the three clinical time-series benchmarks. ReTAMamba achieves the best AUROC and AUPRC on MIMIC-IV, eICU, and PhysioNet 2012, demonstrating consistently strong discriminative performance across diverse clinical time-series settings.
Particular emphasis is placed on AUPRC, which is more informative than AUROC under severe class imbalance. Relative to the mean AUPRC of the ten baseline models, ReTAMamba improves AUPRC by 7.51%, 7.80%, and 10.15% on MIMIC-IV, eICU, and PhysioNet 2012, respectively. Compared with Warpformer, the most relevant comparator, it achieves absolute AUPRC gains of 0.020, 0.012, and 0.035. Warpformer is the strongest baseline by AUPRC only on eICU; against the strongest baselines on the other two datasets, IP-Nets on MIMIC-IV (0.529) and SeFT on PhysioNet 2012 (0.501), the absolute gain is 0.019 in both cases. A paired t-test against the strongest competing model on each dataset showed that the AUPRC improvements were statistically significant on all three datasets ($p<0.05$). These results support the effectiveness of jointly modeling observation reliability, recency, multi-scale temporal context, and budgeted sequence compression for irregular clinical time-series prediction. A supplementary calibration analysis on the eICU dataset is provided in Table [A4](https://arxiv.org/html/2605.16380#A1.T4).
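The average relative gains can be checked directly from the mean AUPRC values in Table 2. The short script below assumes that "average relative gain" means the gain of ReTAMamba over the mean AUPRC of the ten baselines; under that reading the reported percentages are reproduced. The variable and function names are illustrative.

```python
# Mean AUPRC values copied from Table 2 (ten baselines per dataset, in table order).
baseline_auprc = {
    "MIMIC-IV": [0.481, 0.515, 0.523, 0.490, 0.517, 0.529, 0.511, 0.514, 0.489, 0.528],
    "eICU": [0.343, 0.350, 0.361, 0.351, 0.358, 0.362, 0.354, 0.359, 0.352, 0.372],
    "PhysioNet 2012": [0.443, 0.458, 0.468, 0.468, 0.487, 0.463, 0.472, 0.501, 0.476, 0.485],
}
retamamba_auprc = {"MIMIC-IV": 0.548, "eICU": 0.384, "PhysioNet 2012": 0.520}

def relative_gain_pct(ours, baselines):
    """Relative gain of `ours` over the mean of `baselines`, in percent."""
    mean_base = sum(baselines) / len(baselines)
    return 100.0 * (ours - mean_base) / mean_base

gains = {
    name: round(relative_gain_pct(retamamba_auprc[name], values), 2)
    for name, values in baseline_auprc.items()
}
print(gains)
```

The baseline means are 0.5097, 0.3562, and 0.4721, which give gains of 7.51%, 7.80%, and 10.15%, matching the figures quoted in the text.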
### 4\.3\.Ablation Study
This section analyzes the contribution of each component in ReTAMamba through ablation experiments that remove one key module at a time\. Table[3](https://arxiv.org/html/2605.16380#S4.T3)shows that the full ReTAMamba achieves the best performance on all three datasets\.
Removing any single component leads to performance degradation, indicating that each design element contributes to the overall effectiveness of the model\. In particular, notable drops are observed in w/o Reliability Gate and w/o Chronological Weaving, highlighting the importance of reliability\-aware weighting and chronological ordering of multi\-resolution tokens\. The degradation is especially clear in AUPRC, which is more sensitive to class imbalance in clinical prediction tasks\.
To further validate the role of Chronological Weaving, we additionally compared it with random reordering on eICU\. As shown in Table[4](https://arxiv.org/html/2605.16380#S4.T4), chronological ordering achieved the best performance, outperforming both no weaving and random reordering\. This suggests that the benefit of the module comes not merely from reordering tokens, but from arranging multi\-resolution summaries in a coherent time\-ordered manner\.
Performance also decreases in w/o Stats Augment across all three datasets, suggesting that bucket-level statistics, such as within-bucket variance, observation frequency, coverage, and mean staleness, provide useful summary information for irregular clinical time series. The w/o $\Delta t$ Embedding variant likewise shows consistent performance reductions, indicating the importance of encoding recency in the event representation. The performance drop in w/o Token Router further suggests that budgeted token routing helps preserve predictive information while controlling sequence length before Mamba encoding. Supplementary statistics on pre-routing token counts and token budget sensitivity are provided in Table [A2](https://arxiv.org/html/2605.16380#A1.T2) and Table [A3](https://arxiv.org/html/2605.16380#A1.T3), respectively.
Table 4. Effect of Chronological Weaving on eICU.

| Weaving mode | AUROC | AUPRC |
| --- | --- | --- |
| Chronological | 0.793 ± 0.004 | 0.384 ± 0.003 |
| None (w/o weaving) | 0.775 ± 0.003 | 0.352 ± 0.004 |
| Random | 0.784 ± 0.002 | 0.364 ± 0.003 |
### 4\.4\.Efficiency Analysis
This section evaluates the computational efficiency of ReTAMamba in terms of parameter count, training peak memory, training step time, and inference latency, in comparison with existing models on eICU\.
Table[5](https://arxiv.org/html/2605.16380#S4.T5)and Figure[2](https://arxiv.org/html/2605.16380#S4.F2)summarize the efficiency results\. Training peak memory was measured as the maximum allocated GPU memory during training\. Training step time included forward pass, backward pass, and optimizer update, and inference latency was evaluated on test samples with batch size 1 under the same implementation and execution setting\. All results were measured on a single NVIDIA RTX PRO 5000 Blackwell 48 GB GPU using CUDA with bf16 automatic mixed precision\. ReTAMamba maintains moderate computational cost with 157K parameters and 163 MB of training peak memory, while Warpformer shows the highest resource consumption with 240K parameters and 844 MB\. Raindrop also requires more memory than ReTAMamba\.
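Table 5. Efficiency comparison of different models.

| Model | Params (K) | Train Peak Mem (MB) |
| --- | --- | --- |
| LSTM | 63 | 54 |
| Transformer | 70 | 38 |
| Mamba | 36 | 32 |
| GRU-D | 20 | 31 |
| IP-Nets | 23 | 49 |
| mTAN | 18 | 55 |
| SeFT | 20 | 99 |
| Raindrop | 113 | 199 |
| Warpformer | 240 | 844 |
| ReTAMamba | 157 | 163 |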
Figure 2. Efficiency comparison on the eICU dataset: (a) training step time; (b) inference latency.

As shown in Figure [2](https://arxiv.org/html/2605.16380#S4.F2), ReTAMamba maintains lower training step time and inference latency than GRU-D, mTAN, Raindrop, and Warpformer, although it remains slower than simpler models such as LSTM, Transformer, Mamba, and IP-Nets. These results should be interpreted as controlled wall-clock measurements, since actual efficiency depends not only on parameter count but also on model-specific computation patterns for irregular-sequence processing. Combined with the accuracy results in Table [2](https://arxiv.org/html/2605.16380#S4.T2), these findings suggest that ReTAMamba achieves the best predictive performance while maintaining competitive computational efficiency.
### 4\.5\.Temporal Patterns and Model Behavior
#### 4\.5\.1\.Cohort Differences
This section examines survivor\-non\-survivor differences in observation patterns during the first 48 hours after ICU admission on eICU\. To highlight characteristics more pronounced in deceased patients, each metric is visualized as non\-survivors minus survivors\.

Figure 3. Temporal cohort differences in the eICU dataset: (a) $\Delta$ Coverage; (b) $\Delta$ Staleness.

Figure [3](https://arxiv.org/html/2605.16380#S4.F3) shows the temporal mean differences in coverage and staleness. Positive values indicate larger values in non-survivors, whereas negative values indicate larger values in survivors. Higher $\Delta$ coverage indicates more frequent observation in non-survivors, whereas lower $\Delta$ staleness indicates more recent observation.
Non-survivors generally exhibited higher coverage together with lower staleness for several variables, particularly FiO2, pH, body temperature, and blood pressure. This suggests that these variables were monitored more frequently and more recently in non-survivors, consistent with intensified clinical attention to respiratory status, acid-base balance, and hemodynamic stability in critically ill patients (Shi et al., [2025](https://arxiv.org/html/2605.16380#bib.bib16); Kipnis et al., [2012](https://arxiv.org/html/2605.16380#bib.bib25)).
Overall, these results suggest that predictive signals reside not only in measured values but also in the observation process itself. In other words, patient risk may be reflected not only in the blood pressure or pH value itself, but also in how often and how recently clinicians chose to measure it. The temporal differences in coverage and staleness therefore directly motivate the need to model missingness and information freshness jointly in irregular clinical time series (Sisk et al., [2021](https://arxiv.org/html/2605.16380#bib.bib26)).
Table 6. Group-wise summary of learned decay patterns on eICU.

| Group | Mean decay | Mean coverage | Mean gap (h) |
| --- | --- | --- | --- |
| Vital | 0.2044 | 0.6341 | 1.21 |
| Lab | 0.1644 | 0.1136 | 6.25 |

Table 7. Performance comparison across different multi-scale configurations.

| Temporal Scales (min) | MIMIC-IV AUROC | MIMIC-IV AUPRC | eICU AUROC | eICU AUPRC | PhysioNet 2012 AUROC | PhysioNet 2012 AUPRC |
| --- | --- | --- | --- | --- | --- | --- |
| Base: {60} | 0.854 ± 0.003 | 0.530 ± 0.007 | 0.783 ± 0.003 | 0.366 ± 0.005 | 0.842 ± 0.006 | 0.503 ± 0.016 |
| Medium: {60, 120} | 0.856 ± 0.003 | 0.541 ± 0.004 | 0.784 ± 0.004 | 0.371 ± 0.006 | 0.842 ± 0.004 | 0.513 ± 0.012 |
| Long: {60, 120, 240} | 0.861 ± 0.001 | 0.548 ± 0.005 | 0.793 ± 0.004 | 0.384 ± 0.003 | 0.847 ± 0.002 | 0.520 ± 0.013 |
| Extended: {60, 120, 240, 480} | 0.857 ± 0.004 | 0.540 ± 0.008 | 0.784 ± 0.002 | 0.373 ± 0.002 | 0.839 ± 0.005 | 0.509 ± 0.011 |
#### 4\.5\.2\.Learned Reliability Patterns
To examine how the Reliability Gate models information freshness, Table[6](https://arxiv.org/html/2605.16380#S4.T6)summarizes the learned decay patterns across major feature groups on eICU\. Frequently measured bedside signals in the Vital group, such as heart rate, respiratory rate, blood pressure, and temperature, showed larger decay values, whereas laboratory variables in the Lab group, such as pH and glucose, showed smaller decay values and substantially longer observation gaps\. This suggests that the gate learned to reduce the reliability of rapidly outdated signals more aggressively, while preserving the usefulness of more slowly updated variables for longer periods\.
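As a quick arithmetic check on the group-level numbers in Table 6, the snippet below computes how much larger the mean decay of the Vital group is than that of the Lab group, together with the ratio of their mean observation gaps. The variable names are illustrative.

```python
# Values copied from Table 6 (eICU).
vital_decay, lab_decay = 0.2044, 0.1644   # mean learned decay per group
vital_gap_h, lab_gap_h = 1.21, 6.25       # mean observation gap in hours

# Relative increase of the Vital-group decay over the Lab-group decay.
relative_increase_pct = 100.0 * (vital_decay - lab_decay) / lab_decay
# How many times longer the Lab-group observation gap is.
gap_ratio = lab_gap_h / vital_gap_h

print(round(relative_increase_pct, 1), round(gap_ratio, 2))
```

The relative increase evaluates to about 24.3%, which agrees with the figure quoted in the abstract, while the mean gap of laboratory variables is roughly five times that of vital signs.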
#### 4\.5\.3\.Case Analysis
To illustrate how these reliability patterns appear at the individual\-patient level, representative survivor and non\-survivor cases were selected from correctly classified test samples whose predicted probabilities were closest to the class\-wise median, in order to avoid cherry\-picking highly atypical examples\.
In the survivor case, token selection was distributed relatively evenly across time, with 60m, 120m, and 240m tokens retained throughout the sequence, as shown in Figure[4](https://arxiv.org/html/2605.16380#S4.F4)\(a\)\. Consistently, the mean gate remained high, whereas the mean staleness stayed low with little fluctuation, as shown in Figure[4](https://arxiv.org/html/2605.16380#S4.F4)\(b\)\. This pattern suggests that, under a stable observation flow, recent information was continuously regarded as reliable, allowing the model to use temporal context broadly rather than concentrating on a small number of intervals\.

Figure 4. Case study of model behavior for survivor (a, b) and non-survivor (c, d): (a) survivor tokens; (b) survivor gate/staleness; (c) non-survivor tokens; (d) non-survivor gate/staleness.

In contrast, in the non-survivor case, token selection became increasingly concentrated toward the later period, with shorter-resolution tokens, particularly 60m and 120m, being preferentially retained, as shown in Figure [4](https://arxiv.org/html/2605.16380#S4.F4)(c). Figure [4](https://arxiv.org/html/2605.16380#S4.F4)(d) further shows that mean staleness varied more substantially, whereas mean gate was generally lower than in the survivor case, although it remained relatively sustained when recent observations were available. This suggests that, under increasing observation irregularity, the model selectively retained information according to both recency and reliability. In particular, the concentration of short-resolution tokens in the later period is consistent with clinical decision-making, where abrupt recent physiological changes often play a central role in assessing mortality risk during rapid deterioration (Churpek et al., [2016](https://arxiv.org/html/2605.16380#bib.bib27); Brekke et al., [2019](https://arxiv.org/html/2605.16380#bib.bib28)).
Overall, these cases show that ReTAMamba reflects cohort\-level differences in observation patterns at the individual\-patient level through gate values, staleness, and multi\-scale token selection\. This supports the ability of the proposed framework to capture short\-term acute signs in irregular clinical time series\.
### 4\.6\.Multi\-scale Effects
This section analyzes the effect of multi-scale temporal configurations on model performance using different combinations of temporal scales. The single-scale setting was fixed at 60 minutes, reflecting that vital signs in the ICU are generally documented on an hourly basis (Zhang et al., [2024](https://arxiv.org/html/2605.16380#bib.bib29)). Multi-scale settings were then constructed by progressively extending this base configuration to {60, 120}, {60, 120, 240}, and {60, 120, 240, 480}.
As shown in Table [7](https://arxiv.org/html/2605.16380#S4.T7), multi-scale configurations generally outperformed the single-scale setting across all datasets. In particular, the combination of 60, 120, and 240 minutes achieved the best AUROC and AUPRC on MIMIC-IV, eICU, and PhysioNet 2012, indicating that jointly modeling short-term changes and longer-term trends is effective for clinical time-series prediction. In contrast, adding the 480-minute scale did not yield further gains: every metric was lower than with {60, 120, 240}, and PhysioNet 2012 AUROC (0.839) fell below the single-scale baseline (0.842). This suggests that overly coarse temporal resolution may oversmooth short-term changes while increasing information redundancy, reducing representational efficiency even with token routing.
Overall, these results show that multi\-scale aggregation is important for capturing diverse temporal patterns in irregular clinical time series, and that the combination of short\-, medium\-, and long\-range scales provides the best trade\-off between predictive performance and efficiency\.
## 5\.Conclusion
This paper proposes ReTAMamba for effective modeling of clinical time series with irregular sampling and missingness\. ReTAMamba reconstructs clinical time series as time\-variable token sequences, models observation reliability from missingness and elapsed time, and combines multi\-scale aggregation with chronological token organization and budgeted routing\. Through this design, it preserves rich temporal context while efficiently controlling sequence length\.
Experiments on three clinical time\-series benchmarks, MIMIC\-IV, eICU, and PhysioNet 2012, showed that ReTAMamba consistently outperforms existing baselines in both AUROC and AUPRC under a unified in\-hospital mortality prediction setting, with statistically significant AUPRC improvements over the strongest competing models on all three datasets\. The ablation study further confirmed that the Reliability Gate, Stats Augment, Chronological Weaving, and Token Router each contribute to the performance gains\. Efficiency and temporal pattern analyses also showed that the proposed model provides a practical trade\-off among predictive performance, computational cost, and the effective use of observation patterns\.
These findings suggest that effective prediction from irregular clinical time series requires modeling not only measured values themselves but also observation structure, observation reliability, information recency, multi\-scale temporal context, and budgeted sequence compression\. Although this study focused on mortality prediction under a unified early\-risk assessment setting, the same framework could be extended to other clinical tasks such as decompensation, length of stay, and readmission, and could further be examined for irregular time\-series data in other domains\.
## Appendix A Additional Experimental Details
### A\.1\.Final Hyperparameters
Table[A1](https://arxiv.org/html/2605.16380#A1.T1)summarizes the final hyperparameter settings used for ReTAMamba\.
Table A1. Final hyperparameter settings for ReTAMamba.

| Hyperparameter | Search Range | Final Selection |
| --- | --- | --- |
| lr | $[3\times 10^{-5},\,3\times 10^{-3}]$ (log scale) | $5.904\times 10^{-4}$ |
| weight decay | $[10^{-6},\,10^{-2}]$ (log scale) | $1.709\times 10^{-6}$ |
| dropout | $[0.00,\,0.30]$ | $0.032$ |
| grad clip | $\{0.5,\,1.0,\,2.0,\,5.0\}$ | $0.5$ |
| $d_{\text{model}}$ | $\{64,\,96,\,128,\,160,\,192,\,256\}$ | $96$ |
| $n_{\text{layers}}$ | $[2,\,8]$ | $5$ |
| $d_{\text{state}}$ | $\{16,\,32,\,64,\,96\}$ | $96$ |
| $d_{\text{conv}}$ | $\{2,\,3,\,4\}$ | $2$ |
| expand | $\{1,\,2,\,3,\,4\}$ | $3$ |
| init decay logit | $[-6.0,\,-1.0]$ | $-2.717$ |
| $\lambda_{\min}$ | $[10^{-4},\,10^{-2}]$ (log scale) | $4.289\times 10^{-4}$ |
| token budget $k$ | $\{8,\,16,\,32,\,48\}$ | $32$ |
| temporal scales | 4 candidate sets | $\{60,120,240\}$ |
| pooling | {attn, mean, last} | last |
### A\.2\.Token Budget Sensitivity
To examine the effect of routing budget, we conducted a supplementary sensitivity analysis by varying $k$ while keeping the other model settings fixed. For reference, Table [A2](https://arxiv.org/html/2605.16380#A1.T2) reports the mean and standard deviation of the number of valid multi-scale tokens before token routing.

Table A2. Pre-routing token statistics across datasets.

| Dataset | Mean tokens | Std tokens |
| --- | --- | --- |
| MIMIC-IV | 46.70 | 3.31 |
| eICU | 46.72 | 5.09 |
| PhysioNet 2012 | 44.60 | 6.71 |

Table A3. Token budget sensitivity analysis.

| $k$ (budget) | MIMIC-IV AUROC | MIMIC-IV AUPRC | eICU AUROC | eICU AUPRC | PhysioNet 2012 AUROC | PhysioNet 2012 AUPRC |
| --- | --- | --- | --- | --- | --- | --- |
| router off | 0.858 | 0.538 | 0.780 | 0.361 | 0.841 | 0.512 |
| 8 | 0.832 | 0.518 | 0.776 | 0.358 | 0.831 | 0.496 |
| 16 | 0.849 | 0.531 | 0.783 | 0.364 | 0.838 | 0.507 |
| 32 | 0.861 | 0.548 | 0.793 | 0.384 | 0.847 | 0.520 |
| 48 | 0.855 | 0.536 | 0.781 | 0.362 | 0.842 | 0.514 |

Table [A3](https://arxiv.org/html/2605.16380#A1.T3) summarizes predictive performance under different routing budgets. Overly small budgets degraded performance across datasets, while performance improved at intermediate budgets. The best AUROC and AUPRC were consistently achieved at $k=32$, supporting the use of a moderate token budget as an effective trade-off between predictive performance and sequence compression.
### A\.3\. Calibration Results
To complement the discrimination results reported in the main text, we additionally evaluated probability calibration on the eICU dataset\. We chose eICU for this supplementary analysis because it is a multi\-center benchmark with substantial heterogeneity in observation patterns, making calibration assessment particularly informative under diverse clinical settings\. We report the Brier score and expected calibration error \(ECE\), where the Brier score reflects the overall accuracy of predicted probabilities and ECE measures the agreement between predicted confidence and empirical outcomes across probability bins\. Lower values indicate better calibration\.
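For readers who want the two metrics in concrete form, the following is a minimal sketch of the Brier score and a binned ECE for binary outcomes\. The appendix does not state the number of bins or the exact ECE variant, so the 10 equal\-width bins and the comparison of mean predicted probability against observed event rate per bin are assumptions; the function names are illustrative\.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: weighted mean gap between mean predicted probability and
    observed event rate, over equal-width probability bins (assumed variant)."""
    n = len(probs)
    sum_p = [0.0] * n_bins
    sum_y = [0.0] * n_bins
    count = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sum_p[b] += p
        sum_y[b] += y
        count[b] += 1
    ece = 0.0
    for sp, sy, c in zip(sum_p, sum_y, count):
        if c:  # empty bins contribute nothing
            ece += (c / n) * abs(sp / c - sy / c)
    return ece


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.8, 0.3]
    labels = [0, 1, 1, 0]
    print(brier_score(probs, labels))
    print(expected_calibration_error(probs, labels))
```

Both functions return 0 for perfectly confident and correct predictions, and larger values indicate worse calibration, matching the "lower is better" reading used here\.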
Table A4\. Calibration results on the eICU dataset\. Lower is better for both metrics\.

| Model | Brier ↓ | ECE ↓ |
| --- | --- | --- |
| XGBoost | $0.233 \pm 0.000$ | $0.351 \pm 0.000$ |
| LSTM | $0.175 \pm 0.015$ | $0.248 \pm 0.034$ |
| Transformer | $0.183 \pm 0.015$ | $0.257 \pm 0.023$ |
| Mamba | $0.181 \pm 0.008$ | $0.262 \pm 0.008$ |
| GRU\-D | $0.182 \pm 0.011$ | $0.254 \pm 0.020$ |
| IP\-Nets | $0.187 \pm 0.018$ | $0.274 \pm 0.035$ |
| mTAN | $0.183 \pm 0.048$ | $0.288 \pm 0.075$ |
| SeFT | $0.202 \pm 0.012$ | $0.300 \pm 0.018$ |
| Raindrop | $0.215 \pm 0.043$ | $0.314 \pm 0.063$ |
| Warpformer | $0.172 \pm 0.024$ | $0.238 \pm 0.045$ |
| ReTAMamba | $\mathbf{0.162 \pm 0.011}$ | $\mathbf{0.221 \pm 0.018}$ |

Table [A4](https://arxiv.org/html/2605.16380#A1.T4) reports the calibration results on eICU\. ReTAMamba achieved the lowest Brier score and the lowest ECE among all compared models, indicating the best calibration quality in this supplementary evaluation\. Compared with the average of the baseline models, ReTAMamba reduced the Brier score by 15\.3% and the ECE by 20\.7%\. Compared with the strongest baseline, Warpformer, it further reduced the Brier score from 0\.172 to 0\.162 and the ECE from 0\.238 to 0\.221\. These results suggest that the proposed model not only improves discriminative performance but also provides more reliable probability estimates on the eICU benchmark\.
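The relative reductions quoted above can be reproduced from the mean values in Table A4\. The short check below uses only the reported means \(standard deviations are ignored\) and takes an unweighted average over the ten baselines; the variable and function names are illustrative\.

```python
# Mean (Brier, ECE) values transcribed from Table A4; standard deviations omitted.
BASELINES = {
    "XGBoost": (0.233, 0.351),
    "LSTM": (0.175, 0.248),
    "Transformer": (0.183, 0.257),
    "Mamba": (0.181, 0.262),
    "GRU-D": (0.182, 0.254),
    "IP-Nets": (0.187, 0.274),
    "mTAN": (0.183, 0.288),
    "SeFT": (0.202, 0.300),
    "Raindrop": (0.215, 0.314),
    "Warpformer": (0.172, 0.238),
}
RETAMAMBA = (0.162, 0.221)


def relative_reduction(reference: float, new: float) -> float:
    """Percentage by which `new` is lower than `reference`."""
    return 100.0 * (reference - new) / reference


# Unweighted average over the ten baselines.
mean_brier = sum(b for b, _ in BASELINES.values()) / len(BASELINES)
mean_ece = sum(e for _, e in BASELINES.values()) / len(BASELINES)

brier_vs_mean = relative_reduction(mean_brier, RETAMAMBA[0])
ece_vs_mean = relative_reduction(mean_ece, RETAMAMBA[1])
brier_vs_warpformer = relative_reduction(BASELINES["Warpformer"][0], RETAMAMBA[0])
ece_vs_warpformer = relative_reduction(BASELINES["Warpformer"][1], RETAMAMBA[1])

if __name__ == "__main__":
    print(f"Brier vs. baseline mean: {brier_vs_mean:.1f}%")
    print(f"ECE vs. baseline mean:   {ece_vs_mean:.1f}%")
    print(f"Brier vs. Warpformer:    {brier_vs_warpformer:.1f}%")
    print(f"ECE vs. Warpformer:      {ece_vs_warpformer:.1f}%")
```

The first two figures match the 15\.3% and 20\.7% reductions stated in the text; against Warpformer alone, the same arithmetic gives relative reductions of roughly 5\.8% for the Brier score and 7\.1% for ECE\.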
## References
- I\. J\. Brekke, L\. H\. Puntervoll, P\. B\. Pedersen, J\. Kellett, and M\. Brabrand \(2019\) The value of vital sign trends in predicting and monitoring clinical deterioration: a systematic review\. PloS One 14\(1\), pp\. e0210875\. Cited by: [§4\.5\.3](https://arxiv.org/html/2605.16380#S4.SS5.SSS3.p3.1)\.
- Z\. Che, S\. Purushotham, K\. Cho, D\. Sontag, and Y\. Liu \(2018\) Recurrent neural networks for multivariate time series with missing values\. Scientific Reports 8\(1\), pp\. 6085\. Cited by: [§1](https://arxiv.org/html/2605.16380#S1.p2.1), [§1](https://arxiv.org/html/2605.16380#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.16380#S2.SS1.p1.2), [§2\.2](https://arxiv.org/html/2605.16380#S2.SS2.p1.1), [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p2.1)\.
- T\. Chen and C\. Guestrin \(2016\) XGBoost: a scalable tree boosting system\. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp\. 785–794\. Cited by: [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p2.1)\.
- M\. M\. Churpek, R\. Adhikari, and D\. P\. Edelson \(2016\) The value of vital sign trends for detecting clinical deterioration on the wards\. Resuscitation 102, pp\. 1–5\. Cited by: [§4\.5\.3](https://arxiv.org/html/2605.16380#S4.SS5.SSS3.p3.1)\.
- M\. Ghassemi, M\. Pimentel, T\. Naumann, T\. Brennan, D\. Clifton, P\. Szolovits, and M\. Feng \(2015\) A multivariate timeseries modeling approach to severity of illness assessment and forecasting in ICU with sparse, heterogeneous clinical data\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 29\. Cited by: [§1](https://arxiv.org/html/2605.16380#S1.p2.1)\.
- R\. H\. Groenwold \(2020\) Informative missingness in electronic health record systems: the curse of knowing\. Diagnostic and Prognostic Research 4\(1\), pp\. 8\. Cited by: [§2\.2](https://arxiv.org/html/2605.16380#S2.SS2.p1.1)\.
- A\. Gu and T\. Dao \(2024\) Mamba: linear\-time sequence modeling with selective state spaces\. In First Conference on Language Modeling\. Cited by: [§2\.1](https://arxiv.org/html/2605.16380#S2.SS1.p1.2), [§3\.4](https://arxiv.org/html/2605.16380#S3.SS4.p4.1), [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p2.1)\.
- H\. Harutyunyan, H\. Khachatrian, D\. C\. Kale, G\. Ver Steeg, and A\. Galstyan \(2019\) Multitask learning and benchmarking with clinical time series data\. Scientific Data 6\(1\), pp\. 96\. Cited by: [§1](https://arxiv.org/html/2605.16380#S1.p1.1), [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p1.1)\.
- S\. Hochreiter and J\. Schmidhuber \(1997\) Long short\-term memory\. Neural Computation 9\(8\), pp\. 1735–1780\. Cited by: [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p2.1)\.
- M\. Horn, M\. Moor, C\. Bock, B\. Rieck, and K\. Borgwardt \(2020\) Set functions for time series\. In International Conference on Machine Learning, pp\. 4353–4363\. Cited by: [§1](https://arxiv.org/html/2605.16380#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.16380#S2.SS1.p1.2), [§2\.3](https://arxiv.org/html/2605.16380#S2.SS3.p1.1), [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p2.1)\.
- A\. E\. Johnson, L\. Bulgarelli, L\. Shen, A\. Gayles, A\. Shammout, S\. Horng, T\. J\. Pollard, S\. Hao, B\. Moody, and B\. Gow \(2023\) MIMIC\-IV, a freely accessible electronic health record dataset\. Scientific Data 10\(1\), pp\. 1\. Cited by: [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p1.1)\.
- A\. Joshi and M\. Hauskrecht \(2025\) Still competitive: revisiting recurrent models for irregular time series prediction\. arXiv preprint arXiv:2510\.16161\. Cited by: [§2\.3](https://arxiv.org/html/2605.16380#S2.SS3.p1.1)\.
- E\. Kipnis, D\. Ramsingh, M\. Bhargava, E\. Dincer, M\. Cannesson, A\. Broccard, B\. Vallet, K\. Bendjelid, and R\. Thibault \(2012\) Monitoring in the intensive care\. Critical Care Research and Practice 2012\(1\), pp\. 473507\. Cited by: [§4\.5\.1](https://arxiv.org/html/2605.16380#S4.SS5.SSS1.p3.1)\.
- S\. Lee, K\. Min, Y\. Son, and H\. Do \(2025\) Adaptive time encoding for irregular multivariate time\-series classification\. In The Thirty\-Ninth Annual Conference on Neural Information Processing Systems\. Cited by: [§2\.3](https://arxiv.org/html/2605.16380#S2.SS3.p1.1)\.
- Z\. Li, S\. Li, and X\. Yan \(2023\) Time series as images: vision transformer for irregularly sampled time series\. Advances in Neural Information Processing Systems 36, pp\. 49187–49204\. Cited by: [§2\.3](https://arxiv.org/html/2605.16380#S2.SS3.p1.1)\.
- Z\. C\. Lipton, D\. C\. Kale, C\. Elkan, and R\. Wetzel \(2015\) Learning to diagnose with LSTM recurrent neural networks\. arXiv preprint arXiv:1511\.03677\. Cited by: [§1](https://arxiv.org/html/2605.16380#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.16380#S2.SS1.p1.2)\.
- J\. Liu, M\. Cao, and S\. Chen \(2026\) Beyond observations: reconstruction error\-guided irregularly sampled time series representation learning\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 40, pp\. 23712–23720\. Cited by: [§2\.3](https://arxiv.org/html/2605.16380#S2.SS3.p1.1)\.
- J\. Ma, D\. K\. Lee, M\. E\. Perkins, M\. A\. Pisani, and E\. Pinker \(2019\) Using the shapes of clinical data trajectories to predict mortality in ICUs\. Critical Care Explorations 1\(4\), pp\. e0010\. Cited by: [§1](https://arxiv.org/html/2605.16380#S1.p1.1)\.
- T\. J\. Pollard, A\. E\. Johnson, J\. D\. Raffa, L\. A\. Celi, R\. G\. Mark, and O\. Badawi \(2018\) The eICU collaborative research database, a freely available multi\-center database for critical care research\. Scientific Data 5\(1\), pp\. 180178\. Cited by: [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p1.1)\.
- S\. Pungitore and V\. Subbian \(2023\) Assessment of prediction tasks and time window selection in temporal modeling of electronic health record data: a systematic review\. Journal of Healthcare Informatics Research 7\(3\), pp\. 313–331\. Cited by: [§1](https://arxiv.org/html/2605.16380#S1.p1.1)\.
- H\. Sakoe and S\. Chiba \(1978\) Dynamic programming algorithm optimization for spoken word recognition\. IEEE Transactions on Acoustics, Speech, and Signal Processing 26\(1\), pp\. 43–49\. Cited by: [§2\.3](https://arxiv.org/html/2605.16380#S2.SS3.p1.1)\.
- P\. Senin \(2008\) Dynamic time warping algorithm review\. Information and Computer Science Department, University of Hawaii at Manoa 855\(1\-23\), pp\. 40\. Cited by: [§2\.3](https://arxiv.org/html/2605.16380#S2.SS3.p1.1)\.
- J\. Shi, A\. E\. Hubbard, N\. Fong, and R\. Pirracchio \(2025\) Implicit bias in ICU electronic health record data: measurement frequencies and missing data rates of clinical variables\. BMC Medical Informatics and Decision Making 25\(1\), pp\. 241\. Cited by: [§2\.2](https://arxiv.org/html/2605.16380#S2.SS2.p1.1), [§4\.5\.1](https://arxiv.org/html/2605.16380#S4.SS5.SSS1.p3.1)\.
- S\. N\. Shukla and B\. M\. Marlin \(2019\) Interpolation\-prediction networks for irregularly sampled time series\. arXiv preprint arXiv:1909\.07782\. Cited by: [§1](https://arxiv.org/html/2605.16380#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.16380#S2.SS1.p1.2), [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p2.1)\.
- S\. N\. Shukla and B\. M\. Marlin \(2021\) Multi\-time attention networks for irregularly sampled time series\. arXiv preprint arXiv:2101\.10318\. Cited by: [§2\.1](https://arxiv.org/html/2605.16380#S2.SS1.p1.2), [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p2.1)\.
- I\. Silva, G\. Moody, D\. J\. Scott, L\. A\. Celi, and R\. G\. Mark \(2012\) Predicting in\-hospital mortality of ICU patients: the PhysioNet/Computing in Cardiology Challenge 2012\. In 2012 Computing in Cardiology, pp\. 245–248\. Cited by: [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p1.1)\.
- R\. Sisk, L\. Lin, M\. Sperrin, J\. K\. Barrett, B\. Tom, K\. Diaz\-Ordaz, N\. Peek, and G\. P\. Martin \(2021\) Informative presence and observation in routine health data: a review of methodology for clinical risk prediction\. Journal of the American Medical Informatics Association 28\(1\), pp\. 155–166\. Cited by: [§4\.5\.1](https://arxiv.org/html/2605.16380#S4.SS5.SSS1.p4.1)\.
- A\. L\. Tan, E\. J\. Getzen, M\. R\. Hutch, Z\. H\. Strasser, A\. Gutiérrez\-Sacristán, T\. T\. Le, A\. Dagliati, M\. Morris, D\. A\. Hanauer, and B\. Moal \(2023\) Informative missingness: what can we learn from patterns in missing laboratory data in the electronic health record?\. Journal of Biomedical Informatics 139, pp\. 104306\. Cited by: [§2\.2](https://arxiv.org/html/2605.16380#S2.SS2.p1.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. In Advances in Neural Information Processing Systems, Vol\. 30\. Cited by: [§2\.1](https://arxiv.org/html/2605.16380#S2.SS1.p1.2), [§2\.3](https://arxiv.org/html/2605.16380#S2.SS3.p1.1), [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p2.1)\.
- J\. Zhang, S\. Zheng, W\. Cao, J\. Bian, and J\. Li \(2023\) Warpformer: a multi\-scale modeling approach for irregular clinical time series\. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 3273–3285\. Cited by: [§1](https://arxiv.org/html/2605.16380#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.16380#S2.SS1.p1.2), [§2\.2](https://arxiv.org/html/2605.16380#S2.SS2.p1.1), [§2\.3](https://arxiv.org/html/2605.16380#S2.SS3.p1.1), [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p2.1)\.
- S\. Zhang, Y\. Y\. Quan, and J\. Chen \(2024\) Construction and application of an ICU nursing electronic medical record quality control system in a Chinese tertiary hospital: a prospective controlled trial\. BMC Nursing 23\(1\), pp\. 493\. Cited by: [§4\.6](https://arxiv.org/html/2605.16380#S4.SS6.p1.1)\.
- X\. Zhang, M\. Zeman, T\. Tsiligkaridis, and M\. Zitnik \(2021\) Graph\-guided network for irregularly sampled multivariate time series\. arXiv preprint arXiv:2110\.05357\. Cited by: [§1](https://arxiv.org/html/2605.16380#S1.p3.1), [§2\.2](https://arxiv.org/html/2605.16380#S2.SS2.p1.1), [§2\.3](https://arxiv.org/html/2605.16380#S2.SS3.p1.1), [§4\.1](https://arxiv.org/html/2605.16380#S4.SS1.p2.1)\.