Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models

arXiv cs.LG 06/05/26, 04:00 AM Papers
prognostics health-management foundation-models tabular-data data-efficiency time-series industrial-ai
Summary
This paper proposes a framework for applying tabular foundation models to industrial time series for prognostics and health management, demonstrating strong performance and data efficiency across multiple PHM tasks.
arXiv:2606.05481v1 Announce Type: new
Cached at: 06/05/26, 08:11 AM
# Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models
Source: [https://arxiv.org/html/2606.05481](https://arxiv.org/html/2606.05481)
Lev Telyatnikov, Leandro Von Krannichfeldt, Olga Fink

IMOS Lab, EPFL, Lausanne, Switzerland

\[cor1\] Equal contribution and corresponding author

###### Abstract

Data\-driven Prognostics and Health Management \(PHM\) uses time\-varying condition\-monitoring data to diagnose system states and estimate remaining useful life in engineered assets\. These tasks are central to maintenance planning, but industrial PHM data are often fragmented, partially observed, and poorly labeled, which hinders supervised learning\. Foundation models offer a route toward reusable predictive systems, yet most time\-series foundation models are designed for forecasting and assume long, coherent, regularly sampled sequences\. To address this gap, we propose a framework for applying Tabular Foundation Models to industrial time series using in\-context learning, and we evaluate them on a variety of PHM tasks\. By converting raw unit\-level signals into tabular rows, we show that these models perform well across multiple tasks, including prognostics and diagnostics, and are highly data efficient\. We compare them directly with sequence models, transformer baselines, and gradient\-boosted trees under a common evaluation protocol\. The results indicate that tabular foundation models achieve the best average ranks across prognostic and diagnostic tasks\. Our findings further show that PFN\-based models are competitive in low\-data regimes, that temporal context can be preserved in the tabular representation, and that performance depends on representative context construction under subsampling\. These results demonstrate that tabular foundation models provide a practical and general interface for heterogeneous PHM problems\.

###### keywords:

Tabular Foundation Models, Prognostics, Time Series, Transformer

## 1 Introduction

Prognostics and Health Management \(PHM\) has matured into an established framework for supporting maintenance and operational decisions in complex engineering systems\. In industrial practice, PHM solutions are typically based on condition monitoring data collected from heterogeneous sensor networks and are used to support fault detection, diagnostics, and, where feasible, remaining useful life \(RUL\) prediction\[[67](https://arxiv.org/html/2606.05481#bib.bib67),[34](https://arxiv.org/html/2606.05481#bib.bib34)\]\. These outputs are then integrated into maintenance planning and asset management processes to reduce downtime and maintenance costs while maintaining high system availability\.

A cornerstone of PHM is the availability of degradation trajectories and time\-to\-failure information, as these data capture the temporal evolution of system health and form the basis for many prognostic approaches\[[34](https://arxiv.org/html/2606.05481#bib.bib34)\]\. Despite their importance, such trajectories are rarely available in operational environments\. Failures are relatively rare events, assets are often maintained or replaced before reaching end of life, and historical data may be fragmented or incomplete\. As a consequence, many prognostic approaches developed and evaluated in academic settings cannot be readily deployed in practice\[[19](https://arxiv.org/html/2606.05481#bib.bib19)\]\. Instead, most industrial PHM implementations today primarily focus on fault detection, where data from normal operation are abundant, and the objective is to identify deviations from healthy behavior rather than to predict precise failure times\.

Recent advances in machine learning, particularly deep learning, have aimed to reduce reliance on handcrafted features by learning representations directly from data\[[32](https://arxiv.org/html/2606.05481#bib.bib32),[65](https://arxiv.org/html/2606.05481#bib.bib65),[55](https://arxiv.org/html/2606.05481#bib.bib55)\]\. While such approaches have demonstrated strong performance in controlled settings, their success often relies on assumptions that are difficult to satisfy in industrial environments\. In particular, training deep models from scratch typically requires large amounts of labeled data that cover the full range of operating conditions and fault types\. In industrial PHM, however, labeled failure data are scarce, system configurations evolve over time, and degradation trajectories are rarely observed until the end of life\[[35](https://arxiv.org/html/2606.05481#bib.bib35)\]\. As a result, models trained under one set of conditions often require costly retraining, adaptation, and validation before they can be deployed across fleets of assets\.

In current industrial PHM practice, models are often developed and trained for individual components or specific systems, reflecting differences in design, sensing setups, and operating conditions\[[24](https://arxiv.org/html/2606.05481#bib.bib24),[40](https://arxiv.org/html/2606.05481#bib.bib40),[63](https://arxiv.org/html/2606.05481#bib.bib63)\]\. However, in real deployments these models must operate across fleets of assets and multi\-component systems within the same organization\. Assets frequently differ in age, configuration, and operating regimes, and their behavior may evolve over time due to changes in usage patterns, maintenance strategies, or environmental conditions\[[19](https://arxiv.org/html/2606.05481#bib.bib19)\]\. Continuously retraining and revalidating separate models for each component, asset, or configuration is costly and difficult to scale, particularly in safety\-critical applications\. This lack of scalability remains one of the key barriers to broader adoption of advanced PHM methods in industry\. Learning approaches that can generalize across assets and adapt to new conditions without repeated retraining are therefore of high practical relevance\.

Under these constraints, in\-context learning provides an alternative to training models from scratch\[[13](https://arxiv.org/html/2606.05481#bib.bib13),[36](https://arxiv.org/html/2606.05481#bib.bib36)\]\. Rather than updating model parameters, foundation models adapt their behavior at inference time by conditioning on a small set of task\-specific examples\[[66](https://arxiv.org/html/2606.05481#bib.bib66)\]\. This mechanism is particularly well suited to PHM applications, where labeled data are limited, system behavior varies across assets, and rapid adaptation to changing operating conditions is required\. By enabling task adaptation without retraining or parameter updates, in\-context learning reduces deployment and maintenance overhead while remaining compatible with established PHM workflows\.
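As a toy illustration of adapting at inference time by conditioning on examples rather than updating parameters, the sketch below uses a nearest\-neighbour predictor as a stand\-in for a pre\-trained model\. This is not a foundation model and not the method evaluated in this work; the function name `predict_in_context` and all data values are hypothetical\.

```python
import math


def predict_in_context(context, query, k=3):
    """Predict a target for `query` using only labelled context rows.

    No parameters are fitted or stored: swapping the context set changes
    the prediction, which is the property in-context learning relies on.
    (A k-NN average is only a stand-in for a pre-trained model.)
    """
    # Rank context rows by Euclidean distance to the query features.
    ranked = sorted(context, key=lambda row: math.dist(row[0], query))
    nearest = ranked[:k]
    # Average the targets of the k closest context rows.
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical (features, RUL) snapshots from two different assets.
context_asset_a = [((0.1, 0.2), 90.0), ((0.5, 0.6), 50.0), ((0.9, 1.0), 10.0)]
context_asset_b = [((0.1, 0.2), 40.0), ((0.5, 0.6), 20.0), ((0.9, 1.0), 5.0)]

query = (0.5, 0.6)
pred_a = predict_in_context(context_asset_a, query, k=1)
pred_b = predict_in_context(context_asset_b, query, k=1)
```

The same query yields different predictions under the two context sets, even though no model state changes between the calls\.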

While time\-series foundation models have recently been proposed and extended to in\-context learning\[[1](https://arxiv.org/html/2606.05481#bib.bib1)\], their applicability in industrial PHM is constrained by both data assumptions and task formulation\[[41](https://arxiv.org/html/2606.05481#bib.bib41),[56](https://arxiv.org/html/2606.05481#bib.bib56)\]\. These models are typically designed for forecasting tasks, where the objective is to predict future signal values over a temporal horizon\[[59](https://arxiv.org/html/2606.05481#bib.bib59),[12](https://arxiv.org/html/2606.05481#bib.bib12),[37](https://arxiv.org/html/2606.05481#bib.bib37)\], and effective adaptation relies on long, coherent temporal sequences with stable dynamics, consistent sensor availability, and well\-aligned sampling across channels\. In contrast, prognostics problems are typically framed as regression tasks, such as estimating health indicators or remaining useful life from the current system state, often without access to complete or continuous degradation trajectories\. Moreover, in industrial PHM settings, condition monitoring data are frequently irregular, affected by missing values, and disrupted by maintenance actions or changes in operating conditions\. These factors limit the suitability of time\-series foundation models, particularly for prognostics applications\.

At the same time, the absence of continuous run\-to\-failure trajectories motivates learning paradigms that operate on snapshots of system state rather than complete time\-to\-failure trajectories\. In practice, PHM models often rely on tabular representations composed of aggregated sensor features, condition indicators, and contextual variables\[[50](https://arxiv.org/html/2606.05481#bib.bib50),[65](https://arxiv.org/html/2606.05481#bib.bib65)\]\. Such representations allow information to be pooled across time and assets, are robust to interruptions caused by maintenance actions, and do not require strict temporal alignment between sensors\. This is particularly important in industrial environments, where sensor signals often exhibit different sampling rates, heterogeneous temporal dynamics, and varying levels of reliability, and where data streams are frequently affected by missing values due to sensor faults, communication issues, or changes in instrumentation\[[24](https://arxiv.org/html/2606.05481#bib.bib24),[40](https://arxiv.org/html/2606.05481#bib.bib40),[63](https://arxiv.org/html/2606.05481#bib.bib63)\]\. Handling these characteristics in sequential models typically requires additional assumptions, resampling, or imputation strategies, whereas tabular representations and models can more naturally accommodate incomplete, irregular, and heterogeneous observations\.

Within this setting, tabular foundation models combined with in\-context learning provide a strong alternative to conventional end\-to\-end training paradigms\[[26](https://arxiv.org/html/2606.05481#bib.bib26),[27](https://arxiv.org/html/2606.05481#bib.bib27),[38](https://arxiv.org/html/2606.05481#bib.bib38)\]\. By leveraging a single pre\-trained model across tasks and assets, these approaches reduce the need for repeated architecture\-specific hyperparameter optimization; the remaining tuning is concentrated mainly on tabular\-shape choices such as lookback length, feature construction, and context size\. Adaptation to evolving operating conditions, changing sensor availability, or new assets can be achieved through contextual examples at inference time, rather than through parameter updates, thereby avoiding costly retraining cycles and simplifying deployment and lifecycle management\.

## 2 Related Work

### 2\.1 Time\-series Foundation Models

Recently, the broader field of machine learning has witnessed the emergence of *Foundation Models* \(FMs\) as powerful, general\-purpose architectures, leading to a profound impact across multiple disciplines in science\[[66](https://arxiv.org/html/2606.05481#bib.bib66)\]\. FMs are large neural networks pre\-trained on diverse and heterogeneous datasets; the unprecedented scale of this pretraining equips them with emergent general\-purpose capabilities, allowing them to learn general, transferable latent representations and priors that can be reused across tasks and datasets\. Especially for Natural Language Processing, this often allows FMs to outperform domain\-specialized models with minimal task\-specific adaptation\[[43](https://arxiv.org/html/2606.05481#bib.bib43)\]\.

There has been substantial interest in reproducing this success in time\-series analysis through generalist Time\-series Foundation Models \(TSFMs\)\. Inference for these models is broadly analogous to that of large language models: continuous\-time measurements are sampled, embedded as discrete high\-dimensional tokens, and fed to a deep learning model composed of self\- and cross\-attention layers and multilayer perceptrons\. TSFMs extend Time\-series Transformer architectures beyond task\-specific design to enable zero\-shot or few\-shot generalization across datasets and domains\. In general, research on TSFMs can be divided into three categories\[[41](https://arxiv.org/html/2606.05481#bib.bib41)\]\. The first category repurposes language models for time\-series tasks, such as Time\-LLM\[[30](https://arxiv.org/html/2606.05481#bib.bib30)\]\. Here, the time series is converted into a word\-like structure, divided into patches, and used as input tokens for the language foundation model\. The second category leverages pre\-training on a large variety of data\. These works train either exclusively on real\-world data\[[59](https://arxiv.org/html/2606.05481#bib.bib59)\], only on synthetic data\[[14](https://arxiv.org/html/2606.05481#bib.bib14)\], or on a combination of real measurements and synthetically generated samples\[[12](https://arxiv.org/html/2606.05481#bib.bib12),[2](https://arxiv.org/html/2606.05481#bib.bib2)\]\. The third category creates domain\-specific TSFMs by training directly on domain\-relevant datasets, such as BioBert for biomedical literature\[[33](https://arxiv.org/html/2606.05481#bib.bib33)\] or PowerPM for power system tasks\[[54](https://arxiv.org/html/2606.05481#bib.bib54)\]\. The idea is to improve performance on domain\-specific tasks, at the cost of reduced generalization ability\.

The majority of TSFM works focus on univariate forecasting by treating each time series independently\[[17](https://arxiv.org/html/2606.05481#bib.bib17)\], and consequently do not make effective use of the dependencies between the channels of a multivariate time\-series input\. Some publications try to mitigate this limitation by constructing a single time series from the multivariate series through a flattening operation\[[59](https://arxiv.org/html/2606.05481#bib.bib59)\] or subsampling\[[37](https://arxiv.org/html/2606.05481#bib.bib37)\]\. Recent works recognize the potential of in\-context learning for integrating covariate information, and the field is gradually shifting towards in\-context learning for TSFMs\[[1](https://arxiv.org/html/2606.05481#bib.bib1)\]\.

### 2\.2 Tabular Foundation Models

Tabular Foundation Models \(TFMs\) have recently emerged as transformer\-based FMs for two\-dimensional data, including tables and relational databases\. TFMs are trained to predict cells in tables by leveraging contextual information from adjacent rows and columns, allowing them to fill missing values, extrapolate rows and columns, and detect anomalies\. This is achieved by training on vast amounts of tabular data, coupled with a two\-dimensional attention mechanism that captures both row\-level and column\-level dependencies\. These capabilities effectively make TFMs universal pattern recognizers for table\-like data, and have achieved state\-of\-the\-art performance on multiple tabular prediction tasks\.

TFMs differ in how their attention mechanisms are implemented, as well as in the nature of their training data, leading to multiple implementations in the literature\. *Tabular Prior\-Fitted Network* \(TabPFN\)\[[26](https://arxiv.org/html/2606.05481#bib.bib26),[27](https://arxiv.org/html/2606.05481#bib.bib27)\] emulates Bayesian inference on tables by applying attention to raw table data, and is trained on massive causally\-generated synthetic datasets; *TabDPT* combines synthetic and real datasets to improve generalization and one\-shot capacity\[[38](https://arxiv.org/html/2606.05481#bib.bib38)\]; *TabICL* uses statistical row embeddings to compress tables, enabling scalable extensions to hundreds of thousands of samples without fine\-tuning\[[45](https://arxiv.org/html/2606.05481#bib.bib45)\]; *Carte* embeds tables with graph\-like structure to extract semantic connections between feature columns\[[31](https://arxiv.org/html/2606.05481#bib.bib31)\]\.

The success of TFMs has led to these models being used for a variety of two\-dimensional signals, including multivariate time series\. This typically takes place by representing time series as tables and embedding time markings as additional feature columns\. The authors of \[[28](https://arxiv.org/html/2606.05481#bib.bib28)\] demonstrate on small\-scale, scalar time series that a simple tabularization and feature\-based representation of time series allows TabPFN to match or outperform specialized forecasting models, highlighting the strength of large\-scale synthetic pretraining\. Similarly, \[[7](https://arxiv.org/html/2606.05481#bib.bib7)\] further adapt TabPFN for time\-series forecasting by explicitly encoding time\-periodic features as sinusoidal table features, showing consistent improvements over baseline TabPFN and other time\-series forecasting models for small quasi\-periodic time series\.
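To make the idea of encoding periodic time information as extra table columns concrete, the following is a minimal sketch\. It assumes a single known period and a sine/cosine pair per row; the function name `add_periodic_features` and the sample readings are hypothetical, and this is not claimed to be the exact encoding used in the cited works\.

```python
import math


def add_periodic_features(rows, period):
    """Append sin/cos encodings of the row's time index to each table row.

    `rows` is a list of feature lists ordered in time; `period` is the
    assumed number of steps in one cycle (an illustrative parameter).
    """
    table = []
    for t, row in enumerate(rows):
        angle = 2.0 * math.pi * t / period
        # The (sin, cos) pair places time index t on the unit circle, so
        # steps one full period apart receive identical time features.
        table.append(list(row) + [math.sin(angle), math.cos(angle)])
    return table


# Hypothetical scalar readings taken at four consecutive time steps.
readings = [[20.1], [20.4], [21.0], [20.7]]
table = add_periodic_features(readings, period=4)
```

Each row keeps its original features and gains two columns, so a tabular model can relate rows that fall at the same phase of the cycle\.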

While the use of tabular foundation models for time\-series forecasting has recently gained attention, it remains unclear how to fully exploit the time dependencies within and across time series beyond simple periodic indexing or feature lagging, especially under the computational constraints of tabular foundation models\.

### 2\.3 Foundation Models for PHM

The success of FMs for sequence modeling has led to substantial interest in using them for industrial time series across multiple applications\. Despite numerous FM architectures being proposed to enhance state\-of\-the\-art performance in various time\-series\-related tasks\[[56](https://arxiv.org/html/2606.05481#bib.bib56)\], all\-encompassing FMs for PHM have seen limited adoption in industry, where classical models such as Particle Filters, LSTMs, and CNNs are still the most widely adopted\[[47](https://arxiv.org/html/2606.05481#bib.bib47)\]\. Recent studies in the PHM literature have begun to utilize TSFMs\. An example is UniFault\[[16](https://arxiv.org/html/2606.05481#bib.bib16)\], which proposes large\-scale pretraining for rotating machinery fault diagnosis, with the goal of improving cross\-domain transfer and few\-shot adaptation when only limited labeled data are available for a new machine or condition\. Another example is Yao and Han \[[62](https://arxiv.org/html/2606.05481#bib.bib62)\], who conceptually outline how the generalization capabilities of TSFMs can streamline the prognostics and maintenance of wind turbines\. The strong performance of TFMs has also led to exploratory applications in PHM\. Magadán et al\. \[[39](https://arxiv.org/html/2606.05481#bib.bib39)\] use TabPFN for early fault classification in rotating machinery, showing that it maintains strong diagnostic accuracy even with very limited training samples\. Sun et al\. \[[51](https://arxiv.org/html/2606.05481#bib.bib51)\] propose a hybrid TimeGAN–TabPFN framework for slewing bearing fault diagnosis using audible sound signals, where generative data augmentation combined with TabPFN significantly improves classification performance under small\-sample conditions\.

Although tabular foundation models are slowly gaining traction in the PHM field with domain\-specific pretraining and several diagnostics application studies, a systematic benchmarking study with comprehensive evaluations in both prognostics and diagnostics is still missing\.

### 2\.4 Benchmarks in PHM

In recent years, several efforts have been made to propose systematic frameworks and benchmarking evaluations in PHM\. An early review study was presented by Ramasso and Saxena \[[46](https://arxiv.org/html/2606.05481#bib.bib46)\], who analyze prognostic methods that predate modern deep learning and were developed using the C\-MAPSS turbofan degradation datasets\[[48](https://arxiv.org/html/2606.05481#bib.bib48)\]\. The study examines how the C\-MAPSS dataset has been used across prior work, clarifies differences among the dataset variants, and reconstructs benchmark results from the PHM 2008 Data Challenge\. It also highlights inconsistencies in dataset selection, preprocessing, and metric usage, and provides guidelines for more consistent benchmarking\. More recently, Qiao et al\. \[[44](https://arxiv.org/html/2606.05481#bib.bib44)\] carried out a comparative benchmarking evaluation of deep\-learning models for bearing fault diagnosis and fault prognosis using the XJTU\-SY\[[61](https://arxiv.org/html/2606.05481#bib.bib61)\] and CWRU\[[8](https://arxiv.org/html/2606.05481#bib.bib8)\] datasets\. The study evaluated CNN, LSTM, ResNet, Transformer, and hybrid architectures, identifying ResCNN\-LSTM as a strong model for both diagnosis and prognosis\. However, its scope is limited to conventional deep\-learning architectures and does not consider TSFMs or TFMs\. PDMBench\[[63](https://arxiv.org/html/2606.05481#bib.bib63)\] proposes a standardized platform for predictive maintenance research that integrates 14 public datasets covering bearings, motors, gearboxes, and multi\-component systems, together with 22 baseline models ranging from traditional machine\-learning methods to modern Transformer\-based architectures\. The framework addresses practical challenges such as irregular sampling rates, imbalanced fault distributions, heterogeneous sensor modalities, and deployment constraints, while focusing on fault classification and remaining useful life prediction\. Its experimental analysis shows that Transformer architectures perform strongly on structured bearing datasets but are less robust when applied to noisy motor signals, whereas lightweight models provide attractive accuracy–efficiency trade\-offs for deployment\. Although PDMBench provides a systematic benchmarking framework, it standardizes neither the evaluation protocol nor the task semantics, and it does not evaluate foundation\-model approaches\. Beyond dataset\-level standardization, recent work has also addressed the reproducible implementation of conventional PHM methods\. Telyatnikov et al\. \[[52](https://arxiv.org/html/2606.05481#bib.bib52)\] introduce PICID as a modular PHM evaluation framework for specifying datasets, preprocessing, target construction, model interfaces, and metrics under a shared executable protocol\. This framework has been used to implement and evaluate conventional PHM papers as reproducible benchmark components\[[53](https://arxiv.org/html/2606.05481#bib.bib53)\]\.

Several conceptual and experimental benchmarking studies have thus been published, but they focus strongly on standard deep\-learning models and Transformer architectures\. A rigorous and reproducible investigation of tabular foundation models for prognostics and diagnostics has therefore not yet been carried out\.

In this work, we use PICID as the evaluation infrastructure for assessing tabular foundation models under the same task, dataset, preprocessing, and metric interfaces as their conventional PHM competitors\. In particular, PICID’s fit–predict model interface enables tabular foundation models to be benchmarked against conventional baselines through the same executable protocol, making conventional PHM methods available as standardized competitors rather than as isolated, paper\-specific implementations\.
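The benefit of a shared fit–predict interface can be sketched as follows\. This is a hypothetical illustration only: the `FitPredictModel` protocol, both stand\-in models, and the `mean_absolute_error` helper are assumptions made for this example and do not reproduce PICID's actual API\.

```python
import math
from typing import List, Protocol, Sequence

Row = Sequence[float]


class FitPredictModel(Protocol):
    """Hypothetical shared interface; not PICID's actual API."""

    def fit(self, X: Sequence[Row], y: Sequence[float]) -> None: ...

    def predict(self, X: Sequence[Row]) -> List[float]: ...


class MeanBaseline:
    """Trained stand-in: fit() estimates one parameter from the data."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)

    def predict(self, X):
        return [self.mean_ for _ in X]


class NearestContextModel:
    """In-context stand-in: fit() only stores the labelled context rows."""

    def fit(self, X, y):
        self.context_ = list(zip(X, y))

    def predict(self, X):
        # Return the target of the closest stored context row per query.
        return [min(self.context_, key=lambda c: math.dist(c[0], q))[1] for q in X]


def mean_absolute_error(model: FitPredictModel, train, test) -> float:
    """Run any conforming model through the same fit -> predict -> score path."""
    (X_train, y_train), (X_test, y_test) = train, test
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return sum(abs(p - t) for p, t in zip(preds, y_test)) / len(y_test)


# Hypothetical one-feature RUL data.
train = ([(0.0,), (1.0,), (2.0,)], [30.0, 20.0, 10.0])
test = ([(0.1,), (1.9,)], [29.0, 11.0])
scores = {
    type(m).__name__: mean_absolute_error(m, train, test)
    for m in (MeanBaseline(), NearestContextModel())
}
```

Because both models expose the same two methods, the evaluation routine does not need to know whether `fit` updates parameters or merely stores a context set\.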

## 3 Methodology

In this work, we formalize a unified methodology for transforming raw PHM time\-series into representations compatible with both sequence\-based models and tabular foundation models\. The central objective is to ensure that all models are evaluated on information\-equivalent inputs, thereby isolating the impact of the modeling architecture from differences in preprocessing\.

A key challenge arises from the mismatch between the inherently sequential nature of PHM data and the requirements of tabular foundation models, which operate on collections of independent samples and, in the case of in\-context learning, rely on well\-defined context sets\. To bridge this gap, we introduce a complete data processing pipeline that converts continuous time\-series into supervised samples that can be interpreted consistently across model families\. The pipeline spans feature extraction, target alignment, sequence construction, and tabularization, and allows us to compare fundamentally different model families under controlled and consistent conditions\.

Figure [1](https://arxiv.org/html/2606.05481#S3.F1) provides a high\-level overview of the proposed framework\. This section is structured as a step\-by\-step pipeline\. We first define how raw signals are transformed into aligned feature–target sequences\. We then describe how supervised samples are constructed through sequence slicing, and finally how these samples are converted into tabular representations via the tabularization operator\.

![Refer to caption](https://arxiv.org/html/2606.05481v1/x1.png)

Figure 1: Overview of the unified evaluation pipeline\. Unit\-level PHM signals are transformed into aligned feature–target windows and evaluated either by trained sequence models or by in\-context tabular foundation models after tabularization\. Validation data are used for model selection or tabular\-shape selection before final test evaluation\.

### 3\.1 Data Pre\-Processing Pipeline

We formalize tabularization as a multi\-stage transformation pipeline that maps raw time\-series data into supervised learning representations\. Specifically, the pipeline consists of three stages: \(i\) transforming raw signals into a compact feature representation \($\mathcal{G}$\), \(ii\) slicing the transformed features into temporal sequences \($\mathcal{S}$\), and \(iii\) mapping these sequences into tabular representations via the tabularization operator \($\mathcal{T}$\)\. A key aspect of this formulation is that target alignment is defined explicitly\. This allows us to accommodate feature transformations that modify the temporal grid \(e\.g\., time\-frequency representations\), while remaining agnostic to the specific alignment convention\.

Importantly, this explicit formulation ensures that the resulting samples are:

- well\-defined as supervised learning instances, and
- directly usable as context elements for tabular foundation models\.
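The three stages can be sketched end to end as follows\. This is a minimal illustration under stated assumptions: a rolling mean stands in for the feature transformation $\mathcal{G}$, windowing is strict \(no left\-padding\), and the tabularization simply flattens each window in time order\. The function names and the toy signal are hypothetical\.

```python
def transform_features(x, w, stride):
    """Stage (i), G: rolling mean per sensor over windows of w raw steps.

    A rolling mean is only a placeholder for whatever feature extraction
    a task actually uses.
    """
    n_sensors = len(x[0])
    features = []
    for start in range(0, len(x) - w + 1, stride):
        window = x[start:start + w]
        features.append([sum(r[m] for r in window) / w for m in range(n_sensors)])
    return features


def slice_sequences(z, l_seq, step):
    """Stage (ii), S: strict windows of l_seq transformed steps."""
    return [z[k:k + l_seq] for k in range(0, len(z) - l_seq + 1, step)]


def tabularize(window):
    """Stage (iii), T: flatten an (l_seq x F) window into one table row."""
    return [value for feature_step in window for value in feature_step]


# Toy unit with T = 10 raw steps and M = 2 sensors: x(t) = [t, 2t].
x = [[float(t), 2.0 * t] for t in range(10)]

z = transform_features(x, w=2, stride=2)        # T' = 5 feature steps, F = 2
windows = slice_sequences(z, l_seq=3, step=1)   # 3 windows of 3 steps each
rows = [tabularize(win) for win in windows]     # each row has 3 * 2 = 6 columns
```

In this sketch every row has length $L_{\mathrm{seq}} F$, so the temporal context of a window is preserved as ordered column groups rather than as a sequence axis\.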

Table [1](https://arxiv.org/html/2606.05481#S3.T1) summarizes the notation used throughout this pipeline\.

Table 1: Summary of Notation

| Symbol | Description |
| --- | --- |
| **Raw Data Space** | |
| $T$ | Total duration \(number of timestamps\) of the raw unit lifecycle |
| $t$ | Discrete raw time index, $t\in\{1,\dots,T\}$ |
| $\bm{x}(t)$ | Raw sensor vector at time $t$, $\bm{x}(t)\in\mathbb{R}^{M}$ |
| $y(t)$ | Raw target value at time $t$ |
| **Data Transformation \($\mathcal{G},\mathcal{H}$\)** | |
| $T'$ | Total number of steps in the transformed time\-series |
| $w$ | History window size \(receptive field\) for feature extraction |
| $s_{\mathcal{G}}$ | Signal processing stride \(step size for sliding the transformation window\) |
| $\Psi,\Phi$ | Collections of fitted parameters for feature and target transformations |
| $\mathcal{G}(\cdot;\Psi)$ | Feature transformation operator for sensor signals |
| $\bm{z}(j)$ | Transformed feature vector at feature\-step $j$, $\bm{z}(j)\in\mathbb{R}^{F}$ |
| $z_{y}(j)$ | Transformed target value at feature\-step $j$ |
| $a(j)$ | Raw\-time support associated with feature\-step $j$ \(timestamp or interval\) |
| $\mathcal{A}(\cdot)$ | Alignment/aggregation operator used within the target pipeline |
| $\mathcal{H}(\cdot;\Phi)$ | Target transformation operator \(e\.g\., RUL normalization or scaling\) |
| $\widetilde{\mathcal{H}}(\cdot;\Phi)$ | Target pipeline that produces one label per transformed index using $a(j)$ |
| **Dataset Slicing \($\mathcal{S}$\)** | |
| $L_{\mathrm{seq}}$ | History length \(number of transformed steps per input window\) |
| $\Delta$ | Stride between consecutive window starts in transformed time |
| $\rho$ | Warm\-start depth \(left\-padding allowance\); $\rho=0$ gives strict windowing |
| $\delta$ | Supervision offset \(steps after window end\); $\delta=0$ gives end\-of\-window |
| $L_{\mathrm{pred}}$ | Supervision segment length \(default $L_{\mathrm{pred}}=1$ for PHM window labels\) |
| $\bm{W}_{m}$ | Temporal feature window |
| $N_{\mathrm{slices}}$ | Total number of windows extracted from the unit |
| $\mathcal{K}$ | Admissible set of window start indices in transformed time |
| **Tabularization \($\mathcal{T}$\)** | |
| $\bm{X}_{m}$ | Final tabular input vector for the model |
| $\mathcal{D}_{\mathrm{seq}}$ | Dataset in sequence format \(for time\-series models\) |
| $\mathcal{D}_{\mathrm{tab}}$ | Dataset in tabular format \(for tabular models\) |
| $D_{\mathrm{tab}}$ | Tabular dimensionality, $D_{\mathrm{tab}}=L_{\mathrm{seq}}F$ |
| **Modeling & Evaluation** | |
| $f_{\theta}(\cdot)$ | Deep Learning Model \(Sequence\-based\) |
| $f_{\phi}(\cdot)$ | Foundation Model \(Tabular/In\-Context\) |
| $\mathcal{C}$ | Context set for In\-Context Learning |
| $\mathcal{L}$ | Loss function |

#### 3\.1\.1 Problem Formulation: Single Unit Perspective

We begin by formalizing the data representation for a single functional unit monitored continuously over its full operational lifecycle\. This setting serves as the fundamental building block for the subsequent pipeline, which transforms raw time\-series into supervised learning samples\. The raw data consists of a multivariate time\-series of sensor measurements $\mathcal{X}$ and a corresponding target signal $\mathcal{Y}$:

$$\mathcal{X}=\{\bm{x}(t)\}_{t=1}^{T},\quad \mathcal{Y}=\{y(t)\}_{t=1}^{T} \tag{1}$$

where $T$ denotes the total number of recorded timestamps\. At each time step $t$, the system state is represented by a vector of sensor observations $\bm{x}(t)\in\mathbb{R}^{M}$, defined as:

$$\bm{x}(t)=[x_{1}(t),x_{2}(t),\dots,x_{M}(t)]^{\top} \tag{2}$$

where $x_{m}(t)$ represents the scalar measurement of the $m$\-th sensor at time $t$\. These channels may correspond to endogenous system variables or exogenous covariates\. The target signal $y(t)$ encodes the system health state at time $t$\. Depending on the PHM task, this may represent a continuous quantity \(e\.g\., remaining useful life for prognostics\) or a discrete label \(e\.g\., fault type for diagnostics\)\. This formulation defines the raw data space on which all subsequent transformations operate\. In particular, it establishes the relationship between sensor observations and the target signal at the original temporal resolution, which will serve as the reference for all downstream processing steps\.

Example. Consider a lithium-ion battery cell from the UNIBO21 dataset [[57](https://arxiv.org/html/2606.05481#bib.bib57)] monitored over its operational lifetime. During each discharge cycle, three signals are recorded at regular intervals: voltage, discharge current, and cell temperature, so $M=3$ in this example. The target $y(t)$ is the remaining cumulative discharge throughput, or ah-RUL, which decreases from the first recorded cycle toward zero at end-of-life. This raw pair $(\mathcal{X},\mathcal{Y})$ serves as input to the pipeline.

The goal of this stage is to transform the raw time\-series into a feature representation that can be used to construct supervised learning samples while ensuring consistency between inputs and targets\. In PHM, preprocessing often constitutes a substantial part of the methodological pipeline\. It includes operations such as data cleaning, normalization, resampling, windowing, detrending, denoising, and signal\-processing transforms\. Although these stages typically receive less emphasis than the predictive model, they are at least equally important for final performance and reproducibility\. A central challenge is that feature extraction procedures may modify the temporal structure of the data, while the target signal remains defined on the original timeline\. As a result, we must explicitly define how features and targets are brought into correspondence\.

#### 3.1.2 Feature and Target Transformation Operators

The first stage pre\-processes the raw time\-series into a feature representation that can be used to construct supervised learning samples while ensuring consistency between inputs and targets\. In PHM, preprocessing is a critical component of the pipeline, often encompassing data cleaning, normalization, resampling, windowing, detrending, denoising, and signal\-processing transformations\. Despite receiving less attention than the predictive model, these steps are equally important for performance and reproducibility\. To address feature extraction and target processing, we decompose the transformation into two operators:

- a feature transformation $\mathcal{G}$
- a target transformation pipeline $\mathcal{H}$

Target alignment is treated explicitly as part of the pipeline rather than being implicitly induced by the feature representation\. This design decouples feature extraction from supervision design, thereby supporting arbitrary feature transformations, including those that alter temporal resolution, while preserving a consistent definition of the target under different alignment conventions\.

We define the feature transformation $\mathcal{G}$ as a composition of $P$ processing functions:

$$\mathcal{G}(\cdot;\Psi)=g_{P}(\cdot;\psi_{P})\circ\dots\circ g_{1}(\cdot;\psi_{1}) \tag{3}$$

where $\Psi$ denotes the set of fitted parameters (e.g., normalization statistics). Applying the feature transformation $\mathcal{G}$ to the raw sensor series yields a transformed feature series:

$$\mathcal{Z}=\{\bm{z}(j)\}_{j=1}^{T'}=\mathcal{G}(\mathcal{X};\Psi),\quad \bm{z}(j)\in\mathbb{R}^{F}\ (j=1,\dots,T'). \tag{4}$$

Similarly, we define the target transformation operator $\mathcal{H}$ as a composition of $Q$ distinct processing functions:

$$\mathcal{H}(\cdot;\Phi)=h_{Q}(\cdot;\phi_{Q})\circ\dots\circ h_{1}(\cdot;\phi_{1}) \tag{5}$$

where $\Phi$ denotes the set of fitted parameters used for target preprocessing, estimated exclusively from the training partition (e.g., selection, scaling or clipping parameters). Importantly, $\mathcal{H}$ operates on the target signal itself and does not perform temporal alignment. Its role is to transform the target values into a suitable representation for learning.

We have now introduced the feature and target transformation operators\. However, many commonly used feature transformations, such as time–frequency representations \(e\.g\., STFT or wavelets\), operate on temporal windows and therefore produce one feature vector per window rather than per original timestamp\. This leads to a mismatch between the feature timeline and the raw target timeline\. Without a precise alignment strategy, the definition of supervision becomes ambiguous\. In addition, it is often necessary to pre\-process the target signal before it is used for supervision\. For example, in prognostics tasks, the raw target \(e\.g\., remaining useful life\) may be clipped, normalized, or rescaled to improve numerical stability and learning behavior\. Together, these considerations necessitate an explicit alignment mechanism\.

##### Temporal alignment

We explicitly define how features and targets are brought into correspondence. To this end, each transformed index $j$ is associated with a raw-time support $a(j)\subseteq\{1,\dots,T\}$, which specifies the portion of the original time axis from which the feature $\bm{z}(j)$ is derived.

The support $a(j)$ may take one of two forms: it can be a single timestamp $a(j)\in\{1,\dots,T\}$ (e.g., the end or center of a window), or an interval $a(j)=[\underline{t}_{j},\overline{t}_{j}]$ (e.g., the window used to compute the feature).

To map this support to a supervision signal, we introduce an alignment operator $\mathcal{A}$ that extracts a representative target value from the raw target trajectory $\mathcal{Y}$ over the support $a(j)$. This yields an aligned target value at index $j$:

$$z_{y}(j)=\mathcal{A}(\mathcal{Y},a(j)). \tag{6}$$

Depending on the form of $a(j)$, $\mathcal{A}$ may correspond to pointwise sampling (for timestamp supports) or aggregation over a window (e.g., last-value, mean, or max for interval supports). This alignment step is used solely to define supervision and is not provided as input to the model.
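The two forms of support can be illustrated with a minimal sketch of an alignment operator. The function name `align_target`, its `how` argument, and the toy target values are hypothetical and chosen only for illustration; time indices are 1-based, as in the text.

```python
import numpy as np

def align_target(y, support, how="last"):
    """Hypothetical alignment operator A: map a raw-time support to one target value.

    y       : 1-D array of raw targets, where y[t - 1] stores y(t) (1-based time)
    support : int t (timestamp support) or tuple (t_lo, t_hi) (inclusive interval)
    how     : aggregation rule used for interval supports
    """
    y = np.asarray(y)
    if isinstance(support, (int, np.integer)):
        return y[support - 1]                  # pointwise sampling
    t_lo, t_hi = support
    segment = y[t_lo - 1:t_hi]                 # inclusive interval [t_lo, t_hi]
    if how == "last":
        return segment[-1]
    if how == "mean":
        return segment.mean()
    if how == "max":
        return segment.max()
    if how == "majority":                      # intended for discrete labels
        values, counts = np.unique(segment, return_counts=True)
        return values[np.argmax(counts)]
    raise ValueError(f"unknown aggregation: {how}")

rul = np.array([9.0, 8.0, 7.0, 6.0, 5.0, 4.0])   # toy RUL trajectory
end_of_window = align_target(rul, 4)             # timestamp support a(j) = 4
window_mean = align_target(rul, (2, 4), "mean")  # interval support a(j) = [2, 4]
```

For a monotonically decreasing target such as RUL, the mean over a window is larger than the end-of-window value, which is one reason the alignment convention has to be stated explicitly.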

##### Feature\-to\-raw\-time mapping via temporal alignment and temporal indexing\.

When $\mathcal{G}$ modifies the temporal resolution, each transformed index $j$ must be related back to the original time axis. Using the temporal alignment defined above, we associate each feature vector $\bm{z}(j)$ with its corresponding raw-time support $a(j)\subseteq\{1,\dots,T\}$, which specifies the portion of the original time-series from which the feature was derived.

The resulting feature dimension $F$, as seen by the model after applying $\mathcal{G}$, is determined by the final stage operator $g_{P}$ (e.g., $F=M\times\text{features per channel}$). Each stage $g_{p}$ operates either pointwise in time (e.g., scaling) or over temporal windows (e.g., time–frequency transformations such as STFT or wavelets). Consequently, the transformation stage may alter both the feature dimension and the temporal resolution.

In many practical implementations, the index set $\{1,\dots,T'\}$ of the transformed sequence $\mathcal{Z}$ is induced by one or more windowed stages within the pipeline (e.g., STFT or wavelets), which operate with a history length $w$ and stride $s_{\mathcal{G}}$, thereby producing one feature vector per window, yielding

$$T'=\left\lfloor\frac{T-w}{s_{\mathcal{G}}}\right\rfloor+1. \tag{7}$$

For example, if the windowed stage uses window endpoints $\overline{t}_{j}=w+(j-1)s_{\mathcal{G}}$, then the raw-time support can be written as $a(j)=[\overline{t}_{j}-w+1,\overline{t}_{j}]$ (interval support), or $a(j)=\overline{t}_{j}$ under an end-of-window convention (timestamp support). Pointwise stages in $\mathcal{G}$ (e.g., scaling) do not affect $T'$ and are applied on whichever grid is current.
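As a quick check of Eq. (7) and the endpoint formula, the following sketch enumerates the interval supports of a single windowed stage. The helper name `window_supports` and the example values are illustrative assumptions.

```python
def window_supports(T, w, stride):
    """Interval supports a(j) = [t_end - w + 1, t_end] for j = 1..T' (1-based time)."""
    if T < w:
        return []                               # no complete window fits
    n_windows = (T - w) // stride + 1           # T' from Eq. (7)
    supports = []
    for j in range(1, n_windows + 1):
        t_end = w + (j - 1) * stride            # window endpoint
        supports.append((t_end - w + 1, t_end))
    return supports

supports = window_supports(T=10, w=4, stride=3)
```

With $T=10$, $w=4$ and $s_{\mathcal{G}}=3$, this gives $T'=3$ windows with endpoints 4, 7 and 10; under the end-of-window convention only the second entry of each pair is used for supervision.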

##### Target alignment via raw\-time mapping

Akin to the feature-to-raw-time mapping defined above, target alignment is treated by associating each transformed feature index $j$ with a raw-time support $a(j)$. The target at index $j$ is then constructed from the raw target trajectory $\mathcal{Y}$ using this support, ensuring that each transformed feature is paired with a well-defined supervision signal. This design decouples feature extraction from supervision design, thereby supporting arbitrary feature transformations, including those that alter temporal resolution, while preserving a consistent definition of the target under different alignment conventions.

Because $\mathcal{G}$ may change the number and meaning of time steps, the target pipeline must produce one label per transformed index $j$. This explicit alignment is particularly critical for prognostics tasks, where targets such as Remaining Useful Life depend on precise temporal positioning (e.g., assigning the RUL at the end of a window) and scaling. In diagnostics tasks, labels are often constant or slowly varying over intervals and are therefore less sensitive to the exact alignment choice. We therefore allow the target pipeline to depend on the raw target series and the associated support $a(j)$:

$$z_{y}(j)=\widetilde{\mathcal{H}}\big(\mathcal{Y},a(j);\Phi\big),\quad j=1,\dots,T'. \tag{8}$$

Here, $\widetilde{\mathcal{H}}$ is a composition of $Q$ stages (e.g., clipping, scaling, calibration) and may include the previously defined alignment/aggregation operator $\mathcal{A}$ at an arbitrary position in the composition. A common special case is

$$z_{y}(j)=\mathcal{A}\Big(\mathcal{H}\big(\mathcal{Y};\,\Phi\big),a(j)\Big), \tag{9}$$

in which the target series is transformed pointwise before alignment over $a(j)$. Other orderings (e.g., aggregating over $a(j)$ first and transforming the aggregated value afterwards) are equally admissible and task-dependent. Concretely, $\mathcal{A}$ may implement pointwise sampling $\mathcal{A}(\mathcal{Y},a(j))=y(a(j))$ when $a(j)$ is a timestamp, or window aggregation (e.g., mean/last/max/majority) over $\{y(t)\}_{t\in a(j)}$ when $a(j)$ is an interval. In our experiments, unless otherwise stated, we use an end-of-window convention for window-derived features (i.e., $a(j)=\overline{t}_{j}$).
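The ordering of transformation and aggregation inside $\widetilde{\mathcal{H}}$ matters whenever a stage is nonlinear. The sketch below contrasts the ordering of Eq. (9) with the reverse ordering on one toy window; the clipping threshold, the scale, and the helper `clip_and_scale` are hypothetical values chosen for illustration.

```python
import numpy as np

def clip_and_scale(y, cap, scale):
    """A toy pointwise target transform H: clip from above, then rescale."""
    return np.minimum(y, cap) / scale

y_window = np.array([100.0, 200.0, 300.0])   # raw targets over one support a(j)
cap, scale = 150.0, 150.0

# Eq. (9): transform pointwise, then aggregate (mean) over the support.
transform_then_aggregate = clip_and_scale(y_window, cap, scale).mean()
# Reverse ordering: aggregate first, then transform the aggregated value.
aggregate_then_transform = clip_and_scale(y_window.mean(), cap, scale)

# With pointwise sampling (end-of-window), both orderings coincide.
transform_then_sample = clip_and_scale(y_window, cap, scale)[-1]
sample_then_transform = clip_and_scale(y_window[-1], cap, scale)
```

Here the mean-aggregated label is about 0.89 in the first ordering and 1.0 in the second, while the end-of-window label is 1.0 in both. Under an end-of-window convention with pointwise target transforms, the choice of ordering therefore has no effect.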

Target alignment yields the aligned target series $\mathcal{Y}'=\{z_{y}(j)\}_{j=1}^{T'}$ paired with $\mathcal{Z}$. After this stage, the target alignment remains fixed and is no longer modified.

Leakage Policy. All fitted parameters $\Psi$ and $\Phi$ (and any statistics used by $\mathcal{A}$, e.g., thresholds or calibration) are estimated exclusively using the training partition. Once estimated, they are frozen and applied unchanged to validation and test partitions.

Example. The feature pipeline $\mathcal{G}$ is a two-stage composition. The first stage $g_{1}$ applies min-max normalization channel-wise using training-partition statistics and preserves the current temporal grid. The second stage $g_{2}$ computes $K$ time-domain statistics (mean, maximum, RMS, standard deviation, and others) over a sliding window of width $w$ raw samples, expanding the $M$ raw channels into $F=MK$ summary features at each transformed index $j$. The support of $\bm{z}(j)$ is therefore the interval $a(j)=[\overline{t}_{j}-w+1,\overline{t}_{j}]$. For target alignment, we adopt an end-of-window convention, so $a(j)=\overline{t}_{j}$ for supervision and $z_{y}(j)=\widetilde{\mathcal{H}}(\mathcal{Y},\overline{t}_{j};\Phi)$. In this example, the target transformation inside $\widetilde{\mathcal{H}}$ scales ah-RUL by dividing by the maximum value observed in the training partition, with this scaling constant included in $\Phi$. Because ah-RUL has a natural lower bound of zero at end-of-life but no fixed upper bound, a cell near end-of-life yields $z_{y}(j)\approx 0$, while a fresh cell yields $z_{y}(j)>0$ and may yield $z_{y}(j)>1$ for test cells with longer lifetimes than any training cell.
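A minimal NumPy sketch of this example is given below. It assumes $K=4$ statistics (mean, maximum, RMS, standard deviation), synthetic random signals in place of real battery data, and hypothetical function names; it is not the implementation used in the paper.

```python
import numpy as np

def fit_minmax(x_train):
    """Fit channel-wise min-max statistics (Psi) on the training partition only."""
    lo, hi = x_train.min(axis=0), x_train.max(axis=0)
    return lo, np.where(hi > lo, hi - lo, 1.0)    # guard against constant channels

def window_stats(x, w, stride):
    """g2: K = 4 statistics per channel over sliding windows, shape (T', M * K)."""
    rows = []
    for end in range(w, x.shape[0] + 1, stride):   # `end` is the 1-based endpoint
        seg = x[end - w:end]
        stats = [seg.mean(0), seg.max(0), np.sqrt((seg ** 2).mean(0)), seg.std(0)]
        rows.append(np.concatenate(stats))
    return np.asarray(rows)

def transform_unit(x, y, psi, y_scale, w, stride):
    lo, span = psi
    z = window_stats((x - lo) / span, w, stride)   # G = g2 applied after g1
    ends = np.arange(w, x.shape[0] + 1, stride)    # a(j): end-of-window timestamps
    z_y = y[ends - 1] / y_scale                    # sample at a(j), then scale
    return z, z_y

rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(50, 3)), np.linspace(10.0, 0.0, 50)
x_test, y_test = rng.normal(size=(60, 3)), np.linspace(12.0, 0.0, 60)

psi, y_scale = fit_minmax(x_train), y_train.max()  # Psi and Phi from training data
z_test, zy_test = transform_unit(x_test, y_test, psi, y_scale, w=8, stride=4)
```

The synthetic test unit starts with a larger target than any training value, so its first scaled labels exceed 1, mirroring the remark about test cells with longer lifetimes.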

#### 3.1.3 Sequence Slicing ($\mathcal{S}$)

Following the pre-processing stage, the goal of this stage is to construct supervised learning samples for predictive modeling by forming the final input–target pairs from the transformed feature sequence. Rather than predicting from a single time step, each input consists of a finite history window of length $L_{\mathrm{seq}}$ (i.e., how much past context is provided), which is paired with a corresponding target value derived from the aligned target sequence. The feature window is created by the slicing operator $\mathcal{S}$, which is applied to the transformed feature sequence $\mathcal{Z}=\{\bm{z}(j)\}_{j=1}^{T'}$. Note that this bound on the window length is imposed at the data level by the input representation and may differ from the effective lookback utilized at the model level due to architectural or optimization constraints. To formalize the construction of $\mathcal{S}$, we introduce a set of additional parameters besides $L_{\mathrm{seq}}$ that control how windows are extracted:

- $\Delta\in\mathbb{N}$: the stride between consecutive windows (controls overlap and dataset size),
- $\rho\in\mathbb{Z}_{\geq 0}$: an optional warm-start depth (left-padding allowance) that allows windows to start before the first valid index via padding, with $\rho=0$ corresponding to strict windowing,
- $\delta\in\mathbb{Z}_{\geq 0}$: a supervision offset that determines how far into the future (relative to the window end) the target is read,
- $L_{\mathrm{pred}}\in\mathbb{N}$: the length of the supervision segment, where $L_{\mathrm{pred}}=1$ corresponds to a single window-level label.

In the PHM tasks considered here, we use window-level prediction with $L_{\mathrm{pred}}=1$, i.e., a single target value per window, and typically $\delta=0$, meaning the target corresponds to the end of the window. However, the more general formulation allows for multi-step prediction or auxiliary objectives without modifying the preprocessing pipeline.

##### Valid start indices and dataset size\.

To determine the final dataset length, we now characterize the set of valid start indices for window extraction on the transformed timeline, given $(L_{\mathrm{seq}},\Delta,\rho,\delta,L_{\mathrm{pred}})$ and transformed length $T'$:

$$L_{\mathrm{req}}\triangleq L_{\mathrm{seq}}+\delta+L_{\mathrm{pred}}-1. \tag{10}$$

$$k_{m}\triangleq 1-\rho+(m-1)\Delta,\qquad m=1,2,\dots \tag{11}$$

$$k_{m}+L_{\mathrm{req}}-1\leq T'. \tag{12}$$

$$N_{\mathrm{slices}}\triangleq\max\!\left\{\,0,\ \left\lfloor\frac{T'-L_{\mathrm{req}}+\rho}{\Delta}\right\rfloor+1\right\},\qquad\mathcal{K}\triangleq\{k_{m}\}_{m=1}^{N_{\mathrm{slices}}}. \tag{13}$$

Here, $L_{\mathrm{req}}$ is the required right-side coverage in transformed time under the chosen supervision configuration; $k_{m}$ enumerates candidate window starts with stride $\Delta$ and warm-start depth $\rho$ (so the earliest start is $1-\rho$); the inequality enforces that each window has sufficient remaining trajectory length; $N_{\mathrm{slices}}$ is the resulting number of admissible windows; and $\mathcal{K}$ is the set of admissible start indices.
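Equations (10) to (13) translate directly into a few lines of code. The following sketch, with a hypothetical helper `admissible_starts`, returns the set $\mathcal{K}$ as a list of 1-based start indices.

```python
def admissible_starts(T_prime, L_seq, stride=1, rho=0, delta=0, L_pred=1):
    """Return the admissible window starts K on the transformed timeline."""
    L_req = L_seq + delta + L_pred - 1                         # Eq. (10)
    n_slices = max(0, (T_prime - L_req + rho) // stride + 1)   # Eq. (13)
    # Eq. (11) with a 0-based loop variable, i.e. m - 1 is replaced by m.
    return [1 - rho + m * stride for m in range(n_slices)]

strict = admissible_starts(T_prime=10, L_seq=4, stride=2)
padded_short = admissible_starts(T_prime=10, L_seq=4, rho=3)
padded_long = admissible_starts(T_prime=10, L_seq=6, rho=5)
```

With $\delta=0$, $L_{\mathrm{pred}}=1$, $\Delta=1$ and $\rho=L_{\mathrm{seq}}-1$, the number of windows equals $T'$ for any $L_{\mathrm{seq}}$, which is what makes comparisons across window lengths use the same number of samples.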

##### Left\-padding for the feature history \(abstract\)\.

One challenge when comparing models with different context window sizes is that, without additional measures, changing $L_{\mathrm{seq}}$ alters the (testing) dataset size, leading to unfair comparisons between models. To address this, we introduce left padding. If $\rho>0$, some history indices may satisfy $k_{m}+i<1$ for an offset $i\in\{0,\dots,L_{\mathrm{seq}}-1\}$ within the window. We therefore define a left-padded feature extension via an abstract padding operator $\mathcal{P}$:

$$\widetilde{\bm{z}}(j)=\begin{cases}\bm{z}(j),&j\geq 1,\\ \mathcal{P}(\mathcal{Z},j),&j\leq 0.\end{cases} \tag{14}$$

For each admissible start $k_{m}\in\mathcal{K}$, the feature window is

$$\bm{W}_{m}=[\widetilde{\bm{z}}(k_{m}),\widetilde{\bm{z}}(k_{m}+1),\dots,\widetilde{\bm{z}}(k_{m}+L_{\mathrm{seq}}-1)]^{\top}\in\mathbb{R}^{L_{\mathrm{seq}}\times F}. \tag{15}$$

For window-level supervision, the supervision index and label are

$$j_{\mathrm{sup}}(k_{m})=k_{m}+L_{\mathrm{seq}}-1+\delta,\qquad y_{m}=z_{y}\!\big(j_{\mathrm{sup}}(k_{m})\big). \tag{16}$$

The label is guaranteed to come from a real (non-padded) target index for all $m$ provided $\rho\leq L_{\mathrm{seq}}-1+\delta$, since the earliest supervision index is $j_{\mathrm{sup}}(k_{1})=L_{\mathrm{seq}}+\delta-\rho$.

The resulting windowed dataset for a single unit is

$$\mathcal{D}_{\mathrm{seq}}\triangleq\{(\bm{W}_{m},y_{m})\}_{m=1}^{N_{\mathrm{slices}}}. \tag{17}$$

The output of this stage is the set of temporal windows $\mathcal{W}=\{\bm{W}_{m}\}_{m=1}^{N_{\mathrm{slices}}}$ paired with labels $\{y_{m}\}_{m=1}^{N_{\mathrm{slices}}}$.
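The slicing stage can be sketched as follows for window-level supervision ($L_{\mathrm{pred}}=1$). The padding operator $\mathcal{P}$ is left abstract in the text; this sketch assumes edge replication (repeating the first real feature vector), which is only one possible choice, and the function name `slice_windows` is hypothetical.

```python
import numpy as np

def slice_windows(z, z_y, L_seq, stride=1, rho=0, delta=0):
    """Build (W_m, y_m) pairs; z has shape (T', F) and z_y has shape (T',)."""
    T_prime = z.shape[0]
    if rho > L_seq - 1 + delta:
        raise ValueError("rho too large: the first label would be a padded index")
    # Assumed padding operator P: repeat the first real feature vector rho times.
    z_pad = np.concatenate([np.repeat(z[:1], rho, axis=0), z], axis=0)
    windows, labels = [], []
    k = 1 - rho                                  # 1-based start index, Eq. (11)
    while k + L_seq + delta - 1 <= T_prime:      # Eq. (12) with L_pred = 1
        lo = k - 1 + rho                         # position of index k in z_pad
        windows.append(z_pad[lo:lo + L_seq])     # Eq. (15)
        j_sup = k + L_seq - 1 + delta            # Eq. (16)
        labels.append(z_y[j_sup - 1])
        k += stride
    return np.asarray(windows), np.asarray(labels)

z = np.arange(12.0).reshape(6, 2)                # T' = 6, F = 2
z_y = np.arange(6.0)[::-1]                       # toy decreasing target
W_pad, y_pad = slice_windows(z, z_y, L_seq=3, rho=2)
W_strict, y_strict = slice_windows(z, z_y, L_seq=3)
```

With $\rho=L_{\mathrm{seq}}-1=2$ every transformed index receives a window, so six samples are produced; strict windowing ($\rho=0$) yields four.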

Example. Each extracted window $\bm{W}_{m}=[\bm{z}(m),\dots,\bm{z}(m+L_{\mathrm{seq}}-1)]^{\top}\in\mathbb{R}^{L_{\mathrm{seq}}\times F}$ stacks $L_{\mathrm{seq}}$ consecutive feature vectors and is paired with the ah-RUL label $y_{m}=z_{y}(m+L_{\mathrm{seq}}-1)$ at the window end. With stride $\Delta=1$, consecutive windows shift by one step, so labels decrease slowly as the battery ages. A time-series Transformer (or LSTM) consumes $\bm{W}_{m}$ directly as its input sequence, treating each row as one time step.

#### 3.1.4 Tabularization Schema ($\mathcal{T}$)

To support heterogeneous model classes, we define two representations of each sample. Sequence models (e.g., LSTMs and Transformers) operate directly on the temporal window $\bm{W}_{m}\in\mathbb{R}^{L_{\mathrm{seq}}\times F}$. By contrast, tabular foundation models are pre-trained on row-wise tabular data, where each example is represented by a fixed-dimensional feature vector and inference is performed over sets of such rows. They therefore assume no explicit sequence axis in the input. For compatibility with this input schema, each window $\bm{W}_{m}$ is mapped to a vector representation that preserves all entries and encodes temporal order through a fixed column ordering.

We define the tabularization operator $\mathcal{T}:\mathbb{R}^{L_{\mathrm{seq}}\times F}\to\mathbb{R}^{D_{\mathrm{tab}}}$ to transform the sequence $\bm{W}_{m}$ into a flat tabular sample $\bm{X}_{m}$, where $D_{\mathrm{tab}}=L_{\mathrm{seq}}F$:

$$\bm{X}_{m}=\mathcal{T}(\bm{W}_{m})=[z_{1}(k_{m}),\dots,z_{F}(k_{m}),\dots,z_{1}(k_{m}+L_{\mathrm{seq}}-1),\dots,z_{F}(k_{m}+L_{\mathrm{seq}}-1)]^{\top}, \tag{18}$$

where $z_{f}(j)$ denotes the $f$-th component of $\bm{z}(j)$ and we use a time-major ordering for reproducibility. The labels remain paired with the tabularized rows; only the input window is flattened by $\mathcal{T}$.

Example. The window $\bm{W}_{m}\in\mathbb{R}^{L_{\mathrm{seq}}\times F}$ is flattened in time-major order to yield the tabular vector $\bm{X}_{m}=\mathcal{T}(\bm{W}_{m})\in\mathbb{R}^{D_{\mathrm{tab}}}$ with $D_{\mathrm{tab}}=L_{\mathrm{seq}}F$. The first $F$ entries encode the windowed statistics at the earliest step in the window, and the last $F$ entries encode those at the most recent step. The pair $(\bm{X}_{m},y_{m})$ is one element of $\mathcal{D}_{\mathrm{tab}}$. At inference time, TabPFN or TabDPT conditions on an admissible training-only context set $\mathcal{C}\subset\mathcal{D}_{\mathrm{tab}}^{\text{train}}$ to predict the ah-RUL of a query battery window.
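In NumPy, the time-major flattening of Eq. (18) corresponds to a row-major (C-order) reshape of each window. The sketch below uses hypothetical helper names and a toy array, and also shows that the operation is invertible.

```python
import numpy as np

def tabularize(windows):
    """Flatten windows of shape (N, L_seq, F) into rows of shape (N, L_seq * F)."""
    n, L_seq, F = windows.shape
    # C-order reshape: all F features of step 1, then all F features of step 2, ...
    return windows.reshape(n, L_seq * F)

def untabularize(rows, L_seq, F):
    """Inverse of `tabularize`: restore the (N, L_seq, F) window layout."""
    return rows.reshape(rows.shape[0], L_seq, F)

windows = np.arange(24).reshape(2, 3, 4)    # N = 2, L_seq = 3, F = 4
X = tabularize(windows)
restored = untabularize(X, L_seq=3, F=4)
```

Because the mapping is a pure reshape, the sequence and tabular representations carry the same entries; only the explicit time axis is removed.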

#### 3.1.5 Dual Dataset Representation

The application of the transformation pipeline yields two information-equivalent (up to reshaping) sample representations: the temporal window $\bm{W}_{m}$ and its tabularized counterpart $\bm{X}_{m}$. At the dataset level, these induce the sequence dataset $\mathcal{D}_{\mathrm{seq}}$ and the tabular dataset $\mathcal{D}_{\mathrm{tab}}$. The final dimension of the tabular vector is $D_{\mathrm{tab}}=L_{\mathrm{seq}}F$. One may then select the representation that matches the target model class:

- Sequence Dataset: $\mathcal{D}_{\mathrm{seq}}=\{(\bm{W}_{m},y_{m})\}_{m=1}^{N_{\mathrm{slices}}}$, which retains the explicit temporal matrix structure produced by applying $\mathcal{G}$ and slicing $\mathcal{S}$ to the raw data and using the aligned target series $\mathcal{Y}'$.
- Tabular Dataset: $\mathcal{D}_{\mathrm{tab}}=\{(\bm{X}_{m},y_{m})\}_{m=1}^{N_{\mathrm{slices}}}$, which extends the pipeline with the flattening operator $\mathcal{T}$, making the data structurally compatible with tabular learning algorithms.

These formats contain identical information content, allowing for flexible model selection without data loss. Both representations inherit the same admissible start set $\mathcal{K}$ and the same split membership for each sample. The total number of samples $N_{\mathrm{slices}}$ in both $\mathcal{D}_{\mathrm{seq}}$ and $\mathcal{D}_{\mathrm{tab}}$ is directly controlled by the slicing stride $\Delta$. By adjusting $\Delta$, we can effectively perform subsampling to manage computational constraints or reduce redundancy without altering the fundamental structure of the samples.

##### Context sets for tabular foundation models \(leakage constraints\)\.

When using tabular foundation models with in-context learning, predictions take the form $\hat{y}_{m}=f_{\phi}(\mathcal{C},\bm{X}_{m})$, where $\mathcal{C}\subseteq\mathcal{D}_{\mathrm{tab}}^{\text{train}}$ is the set of in-context samples. To prevent leakage and ensure comparability, $\mathcal{C}$ must be drawn exclusively from the training partition; no validation/test samples may appear in $\mathcal{C}$. Under intra-unit temporal splitting, context selection must additionally respect time, i.e., it must not use samples derived from future timestamps relative to the supervision index $j_{\mathrm{sup}}(k_{m})$ of the query sample. The context size and sampling/retrieval strategy are fixed in advance, pre-cached for efficiency, and applied consistently across methods.
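One way to enforce these constraints is to filter the training table before any subsampling. The sketch below is illustrative only: the function `select_context`, the per-unit interpretation of "future", and the fixed-seed uniform subsampling are assumptions, not the retrieval strategy used in the paper.

```python
import numpy as np

def select_context(train_units, train_jsup, query_unit, query_jsup, max_size, seed=0):
    """Return indices into the training table that are admissible as context."""
    train_units = np.asarray(train_units)
    train_jsup = np.asarray(train_jsup)
    # Assumption: "future" is judged within the query's own unit, using the
    # supervision index on the transformed timeline.
    is_future = (train_units == query_unit) & (train_jsup > query_jsup)
    candidates = np.flatnonzero(~is_future)
    if candidates.size > max_size:
        rng = np.random.default_rng(seed)        # fixed seed: reproducible context
        candidates = np.sort(rng.choice(candidates, size=max_size, replace=False))
    return candidates

units = [0, 0, 0, 1, 1]                          # unit id of each training row
jsup = [3, 4, 5, 3, 4]                           # supervision index of each row
full_context = select_context(units, jsup, query_unit=0, query_jsup=4, max_size=10)
small_context = select_context(units, jsup, query_unit=0, query_jsup=4, max_size=2)
```

Only training rows are ever passed in, so the train-only constraint holds by construction; the time filter then removes row 2, whose supervision index lies after the query.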

#### 3.1.6 Generalization to Multiple Units and Partitions

For a dataset with $U$ heterogeneous units, we apply the operators $\mathcal{G}$, $\mathcal{S}$, and $\mathcal{T}$ independently to each unit $u$. To learn across multiple units, we define the global dataset $\mathcal{D}_{\mathrm{global}}$ as the union of these processed unit-specific subsets. Depending on the chosen representation (sequence or tabular), this is defined as:

$$\mathcal{D}_{\mathrm{global}}=\bigcup_{u=1}^{U}\mathcal{D}_{\mathrm{unit}}^{(u)} \tag{19}$$

where $\mathcal{D}_{\mathrm{unit}}^{(u)}$ is either $\mathcal{D}_{\mathrm{seq}}^{(u)}$ or $\mathcal{D}_{\mathrm{tab}}^{(u)}$.

We formally define the partitioning of the global dataset into disjoint training, validation, and testing subsets:

$$\mathcal{D}_{\mathrm{global}}=\mathcal{D}^{\text{train}}_{\text{global}}\cup\mathcal{D}^{\text{val}}_{\text{global}}\cup\mathcal{D}^{\text{test}}_{\text{global}} \tag{20}$$

This partitioning can be performed using two distinct strategies depending on the application constraints:

- Inter-Unit Splitting: The set of units $\{1,\dots,U\}$ is divided into disjoint subsets. All samples derived from a specific unit belong exclusively to one partition. This tests generalization to unseen machines.
- Intra-Unit Splitting: The splitting occurs along the temporal dimension within each unit. This evaluates generalization across time on known machines under a causal protocol (no future information in training-time preprocessing).

Prevention of Data Leakage: Crucially, the transformation parameters $\Psi=\{\psi_{1},\dots,\psi_{P}\}$ (introduced in Stage 1) are estimated solely from the training partition $\mathcal{D}^{\text{train}}_{\text{global}}$. Likewise, any fitted target parameters $\Phi$ and any choices inside $\widetilde{\mathcal{H}}$ (including alignment/aggregation design and any statistics it requires) are determined using only the training partition. These fixed parameters are then applied to transform the validation and test partitions, ensuring no information from future or unseen data influences the feature extraction process. For intra-unit splitting, "training partition" refers to the training time range within each unit, so that no future timestamps contribute to fitted preprocessing statistics.
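The two splitting strategies and the train-only fitting rule can be sketched with boolean masks over raw samples. For brevity the sketch uses a two-way train/test split and omits the validation partition; the function names, the `train_fraction` parameter, and the toy data are illustrative assumptions.

```python
import numpy as np

def inter_unit_split(unit_ids, test_units):
    """Whole units are assigned to exactly one partition."""
    test_mask = np.isin(np.asarray(unit_ids), list(test_units))
    return ~test_mask, test_mask

def intra_unit_split(unit_ids, time_idx, train_fraction):
    """Within each unit, the earliest samples form the training time range."""
    unit_ids, time_idx = np.asarray(unit_ids), np.asarray(time_idx)
    train_mask = np.zeros(unit_ids.shape[0], dtype=bool)
    for u in np.unique(unit_ids):
        idx = np.flatnonzero(unit_ids == u)
        idx = idx[np.argsort(time_idx[idx], kind="stable")]
        n_train = int(np.floor(train_fraction * idx.size))
        train_mask[idx[:n_train]] = True
    return train_mask, ~train_mask

units = np.array([0, 0, 0, 0, 1, 1, 1, 1])
times = np.array([1, 2, 3, 4, 1, 2, 3, 4])
x = np.arange(8.0).reshape(8, 1)                  # one toy sensor channel

train_a, test_a = inter_unit_split(units, test_units={1})
train_b, test_b = intra_unit_split(units, times, train_fraction=0.5)

# Fit normalization statistics on the training mask only, then freeze them.
lo, hi = x[train_b].min(axis=0), x[train_b].max(axis=0)
x_scaled = (x - lo) / (hi - lo)                   # test values may leave [0, 1]
```

In the intra-unit case the test samples of unit 1 exceed the fitted maximum, so their scaled values are greater than 1; this is expected when statistics are frozen after fitting on the training range.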

Due to the framework's abstraction layers, models receive individual windows $\bm{W}_{m}$ identically regardless of which unit they originate from. However, the framework is unit-sensitive: unit identities are tracked throughout the pipeline to support per-unit evaluation (e.g., per-unit MAE on battery and bearing datasets; Appendix [C.6.2](https://arxiv.org/html/2606.05481#A3.SS6.SSS2)). For clarity and without loss of generality, throughout the remainder of this work we refer to the datasets as $\mathcal{D}_{\mathrm{seq}}$ or $\mathcal{D}_{\mathrm{tab}}$ without explicit mention of the global split superscripts, unless the context requires distinguishing between partitions.

#### 3.1.7 Task-Specific Target Definitions

The goal of the learning task is to map each input window $\bm{W}_{m}$ (or its tabular representation $\bm{X}_{m}$) to an associated target variable. The precise definition of this target depends on the specific PHM task and how supervision is aligned with the input window. In particular, targets may represent either future system evolution or the current system state, and can be defined at a specific time index or aggregated over the window.

- Prognostics: The target is a scalar $y_{m}\in\mathbb{R}_{\geq 0}$, representing a monotonically decreasing value such as the Remaining Useful Life (RUL) associated with the end of the window $\bm{W}_{m}$. Unless otherwise stated, supervision is read at the window-level supervision index $j_{\mathrm{sup}}(k_{m})=k_{m}+L_{\mathrm{seq}}-1+\delta$, and the label is defined as $y_{m}=z_{y}\!\big(j_{\mathrm{sup}}(k_{m})\big)$. The mapping from $j_{\mathrm{sup}}(k_{m})$ to raw time (and any windowing/aggregation) is governed by $a\!\big(j_{\mathrm{sup}}(k_{m})\big)$ and the target pipeline $\widetilde{\mathcal{H}}$.
- Diagnostics: The target is a discrete class label $y_{m}\in\{0,1,\dots,K-1\}$, indicating the specific fault type present within the window $\bm{W}_{m}$. If the diagnostic label is defined over the window (rather than pointwise), this is implemented within $\widetilde{\mathcal{H}}$ by choosing $\mathcal{A}$ to aggregate labels over the supports corresponding to the input window (e.g., last-value, max-over-window, or majority vote), rather than implicitly redefining $\mathcal{H}$.

## 4 Training and Implementation

### 4.1 Supervised Learning with Sequential Models

Conventional sequence models, such as 1D-CNNs, LSTMs, and Time-Series Transformers, are trained to minimize a task-specific loss over the entire sequence dataset $\mathcal{D}_{\mathrm{seq}}^{\text{train}}$. These models do not operate on the tabularized data $\bm{X}_{k}$; instead, they ingest the temporal sequence $\bm{W}_{k}$ directly.

Let $f_{\theta}$ denote a model parameterized by $\theta$. The model produces a prediction $\hat{y}_{k}=f_{\theta}(\bm{W}_{k})$. The optimal parameters $\theta^{*}$ are obtained by minimizing the empirical risk over the training sequence dataset:

$$\theta^{*}=\operatorname*{arg\,min}_{\theta}\frac{1}{|\mathcal{D}_{\mathrm{seq}}^{\text{train}}|}\sum_{(\bm{W}_{k},y_{k})\in\mathcal{D}_{\mathrm{seq}}^{\text{train}}}\ell\left(f_{\theta}(\bm{W}_{k}),y_{k}\right) \tag{21}$$

where $\ell(\cdot,\cdot)$ is a task-dependent loss function (e.g., Mean Squared Error for prognostics or Cross-Entropy for diagnostics). We typically employ stochastic gradient descent variants (e.g., AdamW) to solve this optimization problem.

### 4.2 Tabular Foundation Models: The In-Context Learning Paradigm

Tabular Foundation Models (e.g., TabPFN) fundamentally operate via In-Context Learning (ICL). In this paradigm, the model $f_{\phi}$, parameterized by weights $\phi$, makes predictions for a query instance $\bm{X}_{q}$ by conditioning on a context set $\mathcal{C}$ of labeled support examples. The prediction is given by:

$$\hat{y}_{q}=f_{\phi}(\mathcal{C},\bm{X}_{q}) \tag{22}$$

Crucially, to ensure a valid evaluation and prevent data leakage, the context set $\mathcal{C}$ must always be drawn strictly from the training partition of the tabular dataset:

$$\mathcal{C}=\{(\bm{X}_{j},y_{j})\}_{j=1}^{C}\subset\mathcal{D}_{\mathrm{tab}}^{\text{train}} \tag{23}$$

This mechanism allows the model to adapt its posterior distribution to the specific task at inference time. We utilize this general paradigm in two distinct modes:

#### 4.2.1 Pre-trained In-Context Learning (Zero-Shot)

We use the foundation model with its original, pre-trained weights $\phi$ (frozen). The model adapts to the specific PHM task solely through the information provided in the context set $\mathcal{C}$ (drawn from $\mathcal{D}_{\mathrm{tab}}^{\text{train}}$) at inference time.

$$\hat{y}_{q}=f_{\phi}(\mathcal{C},\bm{X}_{q}) \tag{24}$$

The context size is typically restricted (e.g., $|\mathcal{C}|\leq 4\times 10^{4}$). If the available training data in $\mathcal{D}_{\mathrm{tab}}^{\text{train}}$ exceeds this limit, we construct $\mathcal{C}$ by subsampling the tabularized training windows.

For the data-efficiency experiments, we vary this subsampling explicitly through a subset ratio $r_{\mathrm{sub}}\in(0,1]$, applied after sequence construction and tabularization. With aggregate subsampling, $\max(1,\lfloor r_{\mathrm{sub}}N\rfloor)$ windows are sampled uniformly without replacement from the $N$ eligible training windows, and the selected indices are sorted before constructing the context or training set. With blockwise subsampling, the same total number of windows is distributed over $B$ contiguous index blocks with randomly sampled block lengths and gaps. The selected windows therefore remain locally contiguous while covering only parts of the training trajectories. This operation reduces the number of rows supplied to the model, but it does not alter the feature dimension, sequence length, or contents of any selected window. Test windows are not subsampled in these experiments.
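The two subsampling schemes can be sketched as follows. The aggregate variant follows the description above directly. For the blockwise variant, the paper does not state how block lengths and gaps are drawn, so the random integer compositions used here are an assumption, and the function names are illustrative.

```python
import numpy as np


def aggregate_subsample(n_windows, r_sub, rng):
    """Uniformly sample max(1, floor(r_sub * N)) window indices, sorted."""
    n_keep = max(1, int(np.floor(r_sub * n_windows)))
    idx = rng.choice(n_windows, size=n_keep, replace=False)
    return np.sort(idx)


def blockwise_subsample(n_windows, r_sub, n_blocks, rng):
    """Spread the same budget over contiguous blocks with random lengths and gaps.

    Assumption: lengths and gaps are random integer compositions. Gaps may
    be zero, in which case neighbouring blocks touch.
    """
    n_keep = max(1, int(np.floor(r_sub * n_windows)))
    n_blocks = min(n_blocks, n_keep)
    # Block lengths: positive integers that sum to n_keep.
    cuts = np.sort(rng.choice(n_keep - 1, size=n_blocks - 1, replace=False)) + 1
    lengths = np.diff(np.concatenate(([0], cuts, [n_keep])))
    # Gaps: non-negative integers that sum to the number of unselected
    # windows, placed before, between, and after the blocks.
    n_free = n_windows - n_keep
    bars = np.sort(rng.integers(0, n_free + 1, size=n_blocks))
    gaps = np.diff(np.concatenate(([0], bars, [n_free])))
    idx, start = [], 0
    for b in range(n_blocks):
        start += int(gaps[b])
        idx.extend(range(start, start + int(lengths[b])))
        start += int(lengths[b])
    return np.asarray(idx)


rng = np.random.default_rng(0)
agg = aggregate_subsample(200, 0.25, rng)               # 50 scattered windows
blk = blockwise_subsample(200, 0.25, n_blocks=4, rng=rng)  # 50 windows in <= 4 runs
```

Both functions return row indices only, so the selected windows keep their full feature dimension and contents, as stated above.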

### 4.3 Evaluation Protocol

To ensure a fair and standardized comparison across disparate modeling architectures, we employ a unified evaluation protocol. Regardless of whether a model consumes sequential data ($\bm{W}_{k}\in\mathcal{D}_{\mathrm{seq}}$) or tabular data ($\bm{X}_{k}\in\mathcal{D}_{\mathrm{tab}}$), all metrics are calculated on the same set of test instances derived from the held-out partition $\mathcal{D}^{\text{test}}$.

We define a task-specific loss function $\ell(\hat{y},y)$ (e.g., squared error for regression or 0-1 loss for classification). The aggregate test loss $\mathcal{L}_{test}$ is computed over the test set, ensuring a consistent evaluation logic for both model families:

$$\mathcal{L}_{test}=\frac{1}{|\mathcal{D}^{\text{test}}|}\sum_{k\in\mathcal{D}^{\text{test}}}\ell(\hat{y}_{k},y_{k}),\quad\text{where }\hat{y}_{k}=\begin{cases}f_{\theta}(\bm{W}_{k})&\text{if }f\text{ is Sequential}\\ f_{\phi}(\mathcal{C},\bm{X}_{k})&\text{if }f\text{ is Tabular}\end{cases} \tag{25}$$

For Prognostics, we employ standard regression metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). For Diagnostics, we utilize classification metrics including Accuracy and Macro-F1 score, the latter to account for class imbalance. For cross-dataset comparisons, we report the average rank obtained by ranking models within each dataset and averaging the resulting ranks across datasets, with rank 1 denoting the best model for a given dataset.
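The metrics and the average-rank aggregation described above can be written compactly. This is a sketch with NumPy only; the function names are illustrative, and the tie handling in `average_rank` (tied models share the mean of the ranks they span) is an assumption, since the tie-breaking rule is not stated.

```python
import numpy as np


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.union1d(y_true, y_pred):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))


def average_rank(scores, higher_is_better=False):
    """Rank models within each dataset (rank 1 = best), then average.

    scores has shape (n_models, n_datasets). Assumption: tied models
    receive the mean of the ranks they span.
    """
    scores = np.asarray(scores, dtype=float)
    signed = -scores if higher_is_better else scores
    ranks = np.empty_like(signed)
    for j in range(signed.shape[1]):
        col = signed[:, j]
        better = (col[None, :] < col[:, None]).sum(axis=1)    # strictly better models
        tied = (col[None, :] == col[:, None]).sum(axis=1) - 1  # other models with equal score
        ranks[:, j] = 1 + better + 0.5 * tied
    return ranks.mean(axis=1)


# Three hypothetical models on two datasets, error metric (lower is better).
toy_ranks = average_rank([[1.0, 3.0], [2.0, 1.0], [3.0, 2.0]])
```

For error metrics such as MAE the default `higher_is_better=False` applies; for Macro-F1 the flag is set to `True` so that rank 1 still denotes the best model.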

Figure 2: Our tabularization scheme (diagram stages: Sampling + Aggregation, Windowing, Tabularization; table columns: Features, Labels {RUL, HS}; axis: Time). Time-series features are aggregated into their statistics on a per-window basis, along with the corresponding labels. These are then composed into a row in our table, providing richer context.
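A minimal sketch of the windowing and aggregation step in Figure 2 is given below. The choice of statistics (mean, standard deviation, minimum, maximum), the `stride` parameter, and taking the label at the last time step of each window are assumptions made for illustration; the function name `tabularize` is hypothetical.

```python
import numpy as np


def tabularize(signal, labels, window, stride=1):
    """Turn one unit's series into tabular rows of per-window statistics.

    signal: array of shape (T, n_sensors); labels: array of shape (T,).
    Each row holds [mean, std, min, max] per sensor for one window; the
    label is taken at the last time step of that window (an assumption).
    """
    signal = np.asarray(signal, dtype=float)
    labels = np.asarray(labels)
    if len(signal) < window:
        raise ValueError("series is shorter than the window length")
    rows, targets = [], []
    for end in range(window, len(signal) + 1, stride):
        w = signal[end - window:end]
        stats = np.concatenate(
            [w.mean(axis=0), w.std(axis=0), w.min(axis=0), w.max(axis=0)]
        )
        rows.append(stats)
        targets.append(labels[end - 1])
    return np.vstack(rows), np.asarray(targets)


# Toy unit: 6 time steps, 2 sensors, RUL counting down from 5 to 0.
signal = np.column_stack([np.arange(6.0), 2.0 * np.arange(6.0)])
rul = np.arange(5, -1, -1)
X, y = tabularize(signal, rul, window=3)
```

Each window becomes one table row, so the same rows can serve as context or query for a tabular model regardless of whether the label is an RUL value or a health state.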

## 5 Practical Implementation Details

### 5.1 Unified Evaluation Pipeline Details

All baselines are trained from scratch and evaluated alongside all foundation models using the evaluation infrastructure of Telyatnikov et al. [[52](https://arxiv.org/html/2606.05481#bib.bib52)], under identical preprocessing, splitting, windowing, and metric computation (Section [4.3](https://arxiv.org/html/2606.05481#S4.SS3)), ensuring that performance differences reflect modeling choices rather than data preparation. Model training is performed using PyTorch with its implementation of the AdamW optimizer. The framework uses an adapted version of the `Trainer` provided by Lightning ([https://lightning.ai/docs/pytorch/stable/](https://lightning.ai/docs/pytorch/stable/)), extended to accommodate models that use a Scikit-learn-like `fit()` and `predict()` API. Our training regimen for transformer models incorporates a custom learning-rate schedule that includes a warm-up phase and reduces the learning rate upon reaching a plateau, as this step improved performance in the original forecasting tasks. We use min-max or z-score normalization, depending on the dataset. All preprocessing, feature extraction, and sequence slicing, represented by the operators $\mathcal{G}$ and $\mathcal{S}$, are shared among the models. The tabular foundation models and XGBoost share the same tabularization schema $\mathcal{T}$. Hyper-parameter optimization was performed using grid search, with initial parameter ranges selected from the original works. Hyper-parameter tuning, which includes the number of columns and rows for tabular in-context learning, was performed on the validation set. Dataset split and target-construction metadata are summarized in Section [A](https://arxiv.org/html/2606.05481#A1) (e.g., Tables [3](https://arxiv.org/html/2606.05481#A1.T3), [4](https://arxiv.org/html/2606.05481#A1.T4), [6](https://arxiv.org/html/2606.05481#A1.T6), and [7](https://arxiv.org/html/2606.05481#A1.T7)). We report means and standard deviations computed over five random seeds.

### 5.2 Benchmark Models

This section introduces the model families considered in our study and formalizes the data-processing pipeline used to compare them fairly. Since our objective is to benchmark *tabular foundation models* against established *time-series models* on PHM tasks, the key challenge is to construct two representations of the same underlying problem: a sequence representation for temporal models and a tabular representation for foundation models. For completeness, we also include strong non-transformer baselines commonly used in PHM.

#### 5.2.1 Tabular Foundation Models

Recent progress in Prior-Fitted Networks has led to a growing family of tabular foundation models (TFMs), including several alternatives to the original TabPFN framework (e.g., [[45](https://arxiv.org/html/2606.05481#bib.bib45),[3](https://arxiv.org/html/2606.05481#bib.bib3)]). In this work, we evaluate the following two representatives:

- TabPFN: Tabular Prior-Fitted Networks [[26](https://arxiv.org/html/2606.05481#bib.bib26),[27](https://arxiv.org/html/2606.05481#bib.bib27)] are designed to approximate Bayesian inference on tabular datasets. Trained on large-scale causally generated synthetic data, TabPFN supports both classification and regression through in-context learning and often outperforms strong tree-based methods. More details are given in Appendix [B.3](https://arxiv.org/html/2606.05481#A2.SS3).
- TabDPT: A transformer-based tabular foundation model trained on real-world datasets that uses retrieval-based self-supervised pre-training and in-context learning to generalize to unseen tabular datasets, including both classification and regression, with no task-specific training or hyper-parameter tuning [[38](https://arxiv.org/html/2606.05481#bib.bib38)]. More details are given in Appendix [B.3](https://arxiv.org/html/2606.05481#A2.SS3).

#### 5.2.2 Transformer models

For our experiments, we benchmark tabular foundation models against state\-of\-the\-art \(SOTA\) models in long\-range time\-series forecasting\. These models had to be adapted to perform regression and classification, as described in Section 3\. Within the family of transformer\-based models, these are:

- PatchTST: A channel-independent “patching” Transformer that tokenizes univariate time series into subseries-level patches, enabling longer receptive fields with lower attention cost [[42](https://arxiv.org/html/2606.05481#bib.bib42)].
- Crossformer: A Transformer for multivariate time series that tokenizes inputs with cross-dimensional embeddings and applies a two-stage attention layer to model both cross-time and cross-feature dependencies [[64](https://arxiv.org/html/2606.05481#bib.bib64)].
- Spacetimeformer: A long-range Transformer that jointly learns temporal and spatial interactions by treating spatiotemporal values as tokens, combining sequence and graph-like reasoning [[23](https://arxiv.org/html/2606.05481#bib.bib23)].

For completeness, we include non\-transformer\-based models in our comparisons\. These encompass classical ML architectures such as CNNs and LSTMs\[[32](https://arxiv.org/html/2606.05481#bib.bib32)\], which have found wide adoption in the industry\[[34](https://arxiv.org/html/2606.05481#bib.bib34),[55](https://arxiv.org/html/2606.05481#bib.bib55)\], as well as statistical and tabular fit\-predict baselines, and more recent models:

- 1D-CNN: Convolution-based models that slide learnable kernels over temporal inputs to extract local trends, scale efficiently over long histories, and capture multi-resolution features through residual blocks.
- LSTM: Recurrent networks with gated memory cells that propagate and update a hidden state step by step, enabling nonlinear modeling of sequences with adaptive retention of long-term information. We implement the bidirectional variant, which integrates future covariates effectively [[49](https://arxiv.org/html/2606.05481#bib.bib49)].
- Statistical regression baselines: Linear, polynomial, and exponential regression are included as low-capacity reference models.
- MLP: A multilayer perceptron is included as a non-sequential neural baseline. It uses fully connected layers to model nonlinear interactions among the flattened temporal features.
- XGBoost: A classical tree-based baseline for tabular tasks. XGBoost uses gradient boosting with regularized decision trees and is evaluated on the same tabularized rows as the tabular foundation models [[10](https://arxiv.org/html/2606.05481#bib.bib10)].
- TiDE: A recently introduced dense residual model built on MLP-based encoder-decoders and quasi-linear networks for long-term forecasting [[11](https://arxiv.org/html/2606.05481#bib.bib11)].

### 5.3 Adapting Baseline Models to Prognostics and Diagnostics tasks in PHM

The sequence models considered here—Bi\-LSTM, 1D\-CNN, TiDE, and all Transformer\-based architectures—were originally designed for time\-series forecasting and are applied directly to the sequential input𝑾k\{\\bm\{W\}\}\_\{k\}\. We adapt them to PHM regression and classification tasks by replacing their forecasting heads with task\-specific regression or classification heads\.

In addition, the original encoder-decoder forecasting formulation does not directly apply to prognostics and diagnostics tasks in PHM. The past values of the target variable, namely the remaining useful life, are unavailable at inference time because the degradation and thus the time to failure are unknown prior to asset breakdown. Moreover, no future information is considered, as the objective is to regress onto the target at the same time index $k$. We therefore modify encoder-decoder transformer architectures by removing the encoder and directly feeding the sequential data $\bm{W}_{k}$ to the decoder. This step is not necessary for PatchTST, since it is a decoder-only transformer by design. However, its channel-independent processing cannot be applied to the regression target for the same reason. Therefore, we concatenate decoder latents and train a separate regression or classification head. TiDE introduces an attention-free multilayer perceptron (MLP)-based encoder-decoder architecture. Without lookback data, we disable TiDE’s encoder pathway as well and rely only on the model’s dynamic covariate pathway to process $\bm{W}_{k}$. Since all Transformer models produce point predictions for forecasting (auto-regressively), we use the same head for regression by simulating a prediction horizon of one. For diagnostics tasks, we replace the regression head with a classification head with $n$ output classes.

## 6 Results and Analysis

This section evaluates whether the proposed tabularization enables tabular foundation models to serve as a unified interface for PHM tasks. We compare tabular foundation models against multi-task baselines, including sequence models, transformer models, and a conventional gradient boosting framework, on prognostics and diagnostics benchmarks. We then analyze whether tabularizing PHM tasks is effective in practical scenarios by examining its behavior under limited training data, the role of temporal context retained in the tabular representation, and the structure of the predictive uncertainty produced by TabPFN. Full result tables, transformation schemas, and hyperparameter details are provided in Appendix [C](https://arxiv.org/html/2606.05481#A3).

#### 6.0.1 Overall Benchmark Performance

We first compare tabular foundation models with the baselines across twelve diagnostics and prognostics scenarios. Table [2](https://arxiv.org/html/2606.05481#S6.T2) reports performance for all 13 models across the 12 datasets. Prognostics results are evaluated using normalized MAE ($\times 100$; top block, $\downarrow$), while diagnostics results are evaluated using F1 score ($\times 100$; bottom block, $\uparrow$), defined according to [Section 4.3](https://arxiv.org/html/2606.05481#S4.SS3).

The prognostics results show a heterogeneous performance landscape: no single architecture dominates all datasets, and strong dataset-specific baselines remain competitive. Nevertheless, the tabular foundation models obtain the best average ranks, with TabDPT ranking first overall (average rank 2.67) and TabPFN second (average rank 3.33). A plausible explanation for this aggregate difference is that TabPFN’s synthetic pre-training priors are not designed specifically for PHM data. In contrast, TabDPT is pre-trained on real tabular datasets and constructs row-level contexts. This difference in pre-training and context construction may better match the heterogeneous row distributions induced by PHM tabularization. In addition to the best overall ranking, the tabular foundation models also perform best on certain datasets. TabPFN gives the lowest normalized MAE on PHME20 (1.95 $\pm$ 0.03) and Unibo (3.72 $\pm$ 0.06), while TabDPT gives the lowest normalized MAE on N-CMAPSS Prognostics (NC-P) (6.85 $\pm$ 0.02). Conventional sequence and transformer models remain strongest on selected tasks, with STF leading NC-DS02 (4.89 $\pm$ 0.10), TiDE leading NB14 (3.44 $\pm$ 0.17), and LSTM leading XJTU-SY (21.89 $\pm$ 0.40).

Table 2: Main evaluation results. Top block: prognostics, measured by MAE in the normalized target space ($\times 100$, $\downarrow$). Bottom block: diagnostics, measured by F1 score ($\times 100$, $\uparrow$). Models are grouped by family: simple baselines, deep sequence models, transformers, tabular models, and tabular foundation models.

| Model | NC-DS02 ↓ | NC-P ↓ | NB14 ↓ | PHME20 ↓ | Unibo ↓ | XJTU-SY ↓ | Avg rank |
|---|---|---|---|---|---|---|---|
| Linear | 10.13 ± 0.14 | 16.11 ± 0.60 | 41.69 ± 12.02 | 12.19 ± 0.36 | 27.59 ± 14.36 | 76.80 ± 60.41 | 12.50 |
| Exp | 5.35 ± 0.06 | 10.96 ± 0.09 | 30.47 ± 47.76 | 8.82 ± 0.52 | 12.19 ± 0.31 | 27.22 ± 4.06 | 9.67 |
| MLP | 6.37 ± 0.23 | 13.17 ± 0.78 | 14.38 ± 9.77 | 4.62 ± 1.15 | 12.50 ± 0.76 | 30.64 ± 2.67 | 10.33 |
| LSTM | 4.93 ± 0.13 | 7.56 ± 0.31 | 3.80 ± 0.22 | 3.73 ± 0.98 | 6.50 ± 0.16 | 21.89 ± 0.40 | 3.67 |
| CNN-1D | 5.33 ± 0.37 | 7.53 ± 0.22 | 8.89 ± 1.70 | 5.35 ± 3.71 | 12.41 ± 1.15 | 31.02 ± 8.25 | 8.67 |
| TiDE | 5.29 ± 0.22 | 7.62 ± 0.20 | 3.44 ± 0.17 | 4.20 ± 0.66 | 6.46 ± 0.78 | 25.11 ± 2.38 | 5.17 |
| TST | 5.31 ± 0.13 | 7.02 ± 0.17 | 6.28 ± 0.25 | 4.11 ± 0.84 | 7.23 ± 0.39 | 33.30 ± 7.72 | 7.00 |
| STF | 4.89 ± 0.10 | 7.35 ± 1.16 | 10.67 ± 3.16 | 3.91 ± 1.00 | 8.89 ± 0.81 | 28.49 ± 4.01 | 6.17 |
| CF | 5.76 ± 0.51 | 9.98 ± 0.57 | 3.57 ± 0.07 | 3.87 ± 0.85 | 5.58 ± 1.08 | 22.09 ± 1.06 | 5.00 |
| PTST | 16.62 ± 0.04 | 21.55 ± 0.03 | 5.22 ± 0.10 | 15.09 ± 1.13 | 11.18 ± 1.11 | 25.42 ± 1.48 | 10.33 |
| XGBoost | 8.52 ± 0.00 | 15.24 ± 0.00 | 4.48 ± 0.00 | 2.68 ± 0.00 | 4.06 ± 0.00 | 24.59 ± 0.00 | 6.50 |
| TabPFN | 4.96 ± 0.04 | 7.79 ± 0.04 | 3.91 ± 0.03 | 1.95 ± 0.03 | 3.72 ± 0.06 | 22.27 ± 0.35 | 3.33 |
| TabDPT | 5.07 ± 0.06 | 6.85 ± 0.02 | 3.63 ± 0.04 | 2.19 ± 0.01 | 3.94 ± 0.05 | 23.24 ± 0.45 | 2.67 |

| Model | NC-D ↑ | HSF15-A ↑ | HSF15-C ↑ | HSF15-P ↑ | HSF15-V ↑ | MZVAV ↑ | Avg rank |
|---|---|---|---|---|---|---|---|
| Linear | 72.34 ± 3.04 | 58.75 ± 1.81 | 98.35 ± 0.82 | 54.40 ± 11.97 | 32.55 ± 2.36 | 39.89 ± 8.77 | 7.33 |
| MLP | 79.49 ± 1.80 | 91.02 ± 2.25 | 99.91 ± 0.14 | 97.32 ± 0.60 | 80.99 ± 29.38 | 60.10 ± 6.39 | 5.00 |
| LSTM | 88.84 ± 0.73 | 94.59 ± 0.97 | 100.00 ± 0.00 | 95.94 ± 2.83 | 97.35 ± 3.32 | 51.31 ± 6.01 | 3.83 |
| CNN-1D | 87.53 ± 2.83 | 94.03 ± 1.98 | 100.00 ± 0.00 | 98.73 ± 0.52 | 97.92 ± 0.83 | 66.11 ± 5.74 | 3.00 |
| TiDE | 32.57 ± 4.65 | 42.90 ± 5.07 | 61.57 ± 10.41 | 59.37 ± 7.68 | 42.11 ± 14.36 | 25.19 ± 5.29 | 8.17 |
| TST | 26.34 ± 3.96 | 37.37 ± 4.38 | 59.18 ± 10.29 | 46.07 ± 4.13 | 35.40 ± 5.77 | 24.93 ± 4.38 | 10.00 |
| STF | 24.55 ± 3.60 | 40.01 ± 5.75 | 65.94 ± 20.61 | 50.39 ± 11.54 | 37.34 ± 5.68 | 38.01 ± 5.56 | 8.67 |
| CF | 23.74 ± 0.93 | 25.04 ± 5.28 | 59.19 ± 10.05 | 29.46 ± 2.61 | 23.98 ± 2.95 | 17.34 ± 4.24 | 11.50 |
| PTST | 19.57 ± 0.26 | 31.56 ± 4.04 | 41.57 ± 5.18 | 41.22 ± 6.63 | 26.20 ± 2.44 | 25.81 ± 4.15 | 11.00 |
| XGBoost | 48.13 ± 0.00 | 98.07 ± 0.00 | 100.00 ± 0.00 | 99.66 ± 0.00 | 99.65 ± 0.00 | 57.08 ± 0.00 | 3.17 |
| TabPFN | 67.15 ± 1.46 | 99.47 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 58.32 ± 2.44 | 2.33 |
| TabDPT | 85.21 ± 0.16 | 96.66 ± 1.03 | 100.00 ± 0.00 | 99.06 ± 0.25 | 98.92 ± 0.35 | 71.29 ± 0.48 | 2.33 |

The diagnostic results show a similar pattern, with tabular foundation models achieving the best aggregate ranks. TabPFN and TabDPT share the best average rank across the six diagnostic tasks (2.33 each). TabPFN achieves perfect or near-perfect Macro-F1 on the HSF15 component-level tasks (HSF15-A, HSF15-C, HSF15-P, and HSF15-V), where, interestingly, the transformer-based sequence models perform markedly worse. TabDPT achieves the highest Macro-F1 on MZVAV. Deep sequence models remain competitive on NC-D, where LSTM and CNN-1D outperform both tabular foundation models. XGBoost remains a strong conventional fit-predict baseline on several HSF15 tasks, but is generally outperformed by the tabular foundation models. Overall, [Table 2](https://arxiv.org/html/2606.05481#S6.T2) supports the conclusion that tabular foundation models provide the most consistent cross-task performance, expressed through aggregate robustness rather than uniform dominance on every dataset.

#### 6.0.2 Data Efficiency

[Figure 5](https://arxiv.org/html/2606.05481#S6.F5) evaluates prognostic performance on PHME20 and Unibo as the fraction of available training windows is varied under the protocol described in [Section 4.2.1](https://arxiv.org/html/2606.05481#S4.SS2.SSS1), using both aggregate random subsampling and blockwise subsampling. The PFN-based models are already competitive at very small data fractions, and TabPFN is particularly strong in the low-data regime. For PHME20, TabPFN and TabDPT achieve competitive performance using only 1% of the training data, with performance saturating once 10% of the data is available. On the Unibo dataset, TabDPT requires 10% of the data to reach competitive performance, whereas TabPFN already performs competitively with 1%. Both PFN-based models again show a performance plateau at approximately 10% of the training data. Under random subsampling, the MAE decreases smoothly as additional windows are included, before reaching a saturation regime. This suggests that a relatively small but representative support set is sufficient for strong performance. In contrast, this behavior is not observed under blockwise subsampling, where the sampled data may fail to cover the test distribution. These results indicate that tabular foundation models are highly sensitive to the in-context data distribution. This context sensitivity is further reflected in the main benchmark results: TabPFN’s performance on NC-DS02 is notably weaker than on NC-P, where the broader multi-source training distribution provides a richer and more representative context set.

The same subset-ratio protocol is evaluated on MZVAV using Macro-F1, as shown in the bottom row of [Fig. 5](https://arxiv.org/html/2606.05481#S6.F5). Macro-F1 generally increases with larger data fractions, but the blockwise setting exhibits larger variance than random subsampling for the PFN models. This indicates that blockwise subsampling changes the class coverage of the context more strongly than aggregate subsampling, which is consistent with the class-balance analysis in [Fig. 6](https://arxiv.org/html/2606.05481#S6.F6).
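The class-coverage effect can be made concrete with a small sketch. The label sequence below is synthetic and hypothetical (four fault classes occurring in long consecutive runs); it is not the MZVAV data, and a single contiguous block is used as the simplest blockwise case.

```python
import numpy as np


def class_fractions(labels, idx, n_classes):
    """Share of each class among the selected context rows."""
    counts = np.bincount(np.asarray(labels)[idx], minlength=n_classes)
    return counts / counts.sum()


# Hypothetical label sequence in which each class occupies one long run.
labels = np.repeat([0, 1, 2, 3], 250)  # 1000 windows, 4 classes
rng = np.random.default_rng(0)

# Aggregate (uniform) subsampling of 100 windows.
uniform_idx = np.sort(rng.choice(len(labels), size=100, replace=False))

# Blockwise subsampling with a single contiguous block of 100 windows.
block_start = int(rng.integers(0, len(labels) - 100 + 1))
block_idx = np.arange(block_start, block_start + 100)

uniform_cov = class_fractions(labels, uniform_idx, n_classes=4)
block_cov = class_fractions(labels, block_idx, n_classes=4)
```

In this toy setting a block of 100 windows can touch at most two of the four class runs, whereas the uniform sample draws from all of them, which illustrates why the context built from blocks can miss classes that appear at test time.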

#### 6.0.3 The Effect of Tabularizing Time Dependencies

Figure [7](https://arxiv.org/html/2606.05481#S6.F7) shows that increasing the sequence length improves performance on PHME20, Unibo, and XJTU-SY, confirming that the tabularized representation can preserve task-relevant temporal information. On PHME20, TabPFN improves rapidly as the sequence length increases and then saturates, whereas TabDPT deteriorates after short histories. On Unibo, TabPFN remains consistently below TabDPT in MAE across the evaluated sequence lengths, with only limited additional gains after the shortest nontrivial histories. On XJTU-SY, TabPFN benefits substantially from longer histories, while TabDPT is less monotonic but outperforms TabPFN in absolute MAE. These patterns indicate that TabPFN is better able to exploit temporal context when history carries degradation information, which is consistent with the model’s cell-wise embedding of the flattened sensor-time grid, allowing TabPFN to extract meaningful causal temporal relations. TabDPT embeds complete rows and is therefore less directly aligned with temporal locality after flattening. For N-CMAPSS DS02, neither model benefits from longer histories; TabPFN becomes worse as sequence length increases, and TabDPT shows only small non-monotonic variation.

This behavior is consistent with the structure of N-CMAPSS. Degradation is modeled primarily at the flight-cycle scale, while high-frequency variation within a flight is dominated by changing operating conditions, e.g., altitude. Consequently, extending the history window over consecutive samples from the same flight may add limited degradation information. Capturing meaningful temporal degradation trends in N-CMAPSS requires context over many flight cycles rather than only adjacent samples, which is beyond the sequence lengths evaluated here.

#### 6.0.4 Data Imputation

We evaluate two strategies for handling missing values\. The first applies last\-observation\-carried\-forward \(LOCF\) imputation to all models uniformly: missing feature entries are filled with the most recent observed value before any model receives the input\. The second passes the raw NaN\-valued features directly to TabPFN v2, which handles missing values internally: NaN entries are replaced by the training\-partition feature mean and a per\-feature binary missingness indicator is appended as extra input features\. The pre\-trained transformer has seen this missingness representation during training and can therefore condition its predictions on the missingness pattern\. All other models receive the LOCF\-imputed inputs, as they do not support NaN\-valued inputs\. We refer to the two TabPFN variants as*TabPFN*\(LOCF\-imputed\) and*TabPFN\+NaN*\(TabPFN\-internal handling\) respectively\.
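The two missing-value representations described above can be sketched with NumPy. The second function mirrors only the textual description given here (mean filling plus a binary missingness indicator); it is not TabPFN's internal code, and both function names are illustrative.

```python
import numpy as np


def locf_impute(X):
    """Last-observation-carried-forward along time (axis 0), per feature.

    Leading NaNs, which have no earlier observation, are left as NaN.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    # For each cell, row index of the most recent observed value (0 if none yet).
    row_idx = np.where(observed, np.arange(X.shape[0])[:, None], 0)
    last_seen = np.maximum.accumulate(row_idx, axis=0)
    return X[last_seen, np.arange(X.shape[1])[None, :]]


def mean_with_indicator(X, train_mean):
    """Fill NaNs with training-partition feature means and append one
    binary missingness column per feature."""
    X = np.asarray(X, dtype=float)
    train_mean = np.asarray(train_mean, dtype=float)
    missing = np.isnan(X)
    filled = np.where(missing, train_mean[None, :], X)
    return np.hstack([filled, missing.astype(float)])


X_train = np.array([[1.0, np.nan], [np.nan, 2.0], [3.0, np.nan]])
X_locf = locf_impute(X_train)
X_flagged = mean_with_indicator(X_train, np.nanmean(X_train, axis=0))
```

LOCF keeps the feature dimension unchanged, while the indicator representation doubles it; this matters for tabular in-context models whose cost depends on the number of columns.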

[Figure 3](https://arxiv.org/html/2606.05481#S6.F3) reports the effect of missing data on normalized MAE for PHME20. TabPFN remains the strongest model under this missing-data setting, followed by TabDPT, indicating that the tabular foundation models retain their advantage when incomplete inputs are present. Unexpectedly, however, the TabPFN+NaN variant underperforms the imputed TabPFN configuration, indicating that TabPFN’s built-in missing-value handling is not always effective. For this dataset and preprocessing pipeline, conventional imputation appears to provide a more informative representation than passing missingness directly to the model.

![Refer to caption](https://arxiv.org/html/2606.05481v1/x2.png)

Figure 3: Effect of imputation on normalized MAE for PHME20. TabPFN+NaN indicates that we use the NaN token of TabPFN to represent missing values, while TabPFN indicates that we impute missing values.

#### 6.0.5 Probabilistic Interpretation of the TabPFN Output

TabPFN represents continuous regression targets through a discretized predictive distribution rather than a single scalar; quantiles and point estimates are derived from this distribution. In [Fig. 4](https://arxiv.org/html/2606.05481#S6.F4), we interpret this predictive distribution qualitatively, examining how probability mass concentrates across RUL values as a function of time. In both datasets, the probability mass concentrates along repeated descending trajectories, indicating that the model recovers the monotonic run-to-failure structure of individual test units. For PHME20, the 20-step lookback produces more localized probability bands, whereas the one-step setting is more diffuse with broader vertical spread, particularly in early- and mid-life regions. For Unibo, both settings show descending trajectories and pronounced probability mass near the low-RUL endpoint. Similar to PHME20, the longer lookback localizes the bands more tightly along the descending trajectory. These qualitative observations indicate that temporal context sharpens the predictive distribution, although they do not constitute a formal calibration analysis.
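How a point estimate and quantiles follow from a discretized predictive distribution can be shown with a short sketch. The bin edges and probabilities below are hypothetical, the function name is illustrative, and probability mass is assumed to be uniform within each bin; this is a generic construction, not TabPFN's own post-processing.

```python
import numpy as np


def summarize_binned_distribution(bin_edges, probs, quantiles=(0.1, 0.5, 0.9)):
    """Mean and quantiles of a discretized predictive distribution.

    bin_edges has length K + 1 and probs has length K (summing to 1).
    Assumption: mass is spread uniformly inside each bin.
    """
    bin_edges = np.asarray(bin_edges, dtype=float)
    probs = np.asarray(probs, dtype=float)
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    mean = float(np.sum(probs * centers))
    # Piecewise-linear CDF evaluated at the bin edges, then inverted.
    cdf = np.concatenate(([0.0], np.cumsum(probs)))
    q_values = np.interp(quantiles, cdf, bin_edges)
    return mean, q_values


# Hypothetical RUL distribution over four bins of width 10.
edges = [0, 10, 20, 30, 40]
probs = [0.1, 0.2, 0.4, 0.3]
rul_mean, rul_q = summarize_binned_distribution(edges, probs)
```

The spread between the lower and upper quantiles gives a simple numeric counterpart to the width of the probability bands discussed above.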

![Refer to caption](https://arxiv.org/html/2606.05481v1/figures/quantile_distribution_heatmap_phme20.png)
(a) PHME20

![Refer to caption](https://arxiv.org/html/2606.05481v1/figures/quantile_distribution_heatmap_unibo.png)
(b) Unibo

Figure 4: Predictive quantile distribution heatmaps for PHME20 and Unibo. RUL distribution with and without tabularization of time.

![Refer to caption](https://arxiv.org/html/2606.05481v1/x3.png)
(a) PHME20

![Refer to caption](https://arxiv.org/html/2606.05481v1/x4.png)
(b) PHME20, blockwise

![Refer to caption](https://arxiv.org/html/2606.05481v1/x5.png)
(c) Unibo

![Refer to caption](https://arxiv.org/html/2606.05481v1/x6.png)
(d) Unibo, blockwise

![Refer to caption](https://arxiv.org/html/2606.05481v1/x7.png)
(e) MZVAV

![Refer to caption](https://arxiv.org/html/2606.05481v1/x8.png)
(f) MZVAV, blockwise

Figure 5: Scaling behavior for PHME20, Unibo, and MZVAV. The left column shows scaling curves under uniform subsampling; the right column shows scaling curves under blockwise subsampling. We show the top 5 models for each dataset, which include TabPFN and TabDPT in all cases. A complete version with all models is included in the Appendix as [Fig. 8](https://arxiv.org/html/2606.05481#A3.F8).

![Refer to caption](https://arxiv.org/html/2606.05481v1/x9.png)
(a) Aggregate

![Refer to caption](https://arxiv.org/html/2606.05481v1/x10.png)
(b) Blockwise

Figure 6: Class balance in the MZVAV subset-ratio experiment, comparing uniform and blockwise subsampling (seed: 72).

![Refer to caption](https://arxiv.org/html/2606.05481v1/x11.png)
(a) PHME20

![Refer to caption](https://arxiv.org/html/2606.05481v1/x12.png)
(b) Unibo

![Refer to caption](https://arxiv.org/html/2606.05481v1/x13.png)
(c) N-CMAPSS DS02

![Refer to caption](https://arxiv.org/html/2606.05481v1/x14.png)
(d) XJTU-SY

Figure 7: Effect of sequence length on normalized MAE for selected prognostics datasets.

## 7 Conclusion & Discussion

Our experiments demonstrate that tabularization acts as a flexible and effective framework for PHM tasks, allowing prognostics and diagnostics to be performed under a single representation\. Additionally, our experiments indicate that our tabularization scheme enables Tabular Foundation Models \(TFMs\) to achieve strong performance with limited hyperparameter tuning concentrated mainly on the tabular shape\. The results further show that TFMs exhibit the generalization capacity and data efficiency expected of foundation models, achieving the best average ranks without task\-specific training and transferring effectively to new tasks through in\-context learning\.

Across both prognostics and diagnostics benchmarks, TFMs remain highly competitive even under severe data scarcity and in the presence of missing values, with or without explicit imputation. This combination of sample efficiency, cross-task robustness, and few-shot generalization makes them attractive for industrial settings where labeled data is scarce and failure modes are heterogeneous.

This performance, however, depends critically on the distributional richness of the in-context table. Because these models learn from the provided context at inference time, the quality and coverage of that context directly determine prediction quality. Therefore, a context that does not cover the operating regimes, degradation stages, or fault modes present in the test data will result in degraded predictions. The N-CMAPSS families provide direct evidence of this: NC-DS02 is restricted to a single N-CMAPSS dataset file, with a limited set of units and failure-mode configurations, whereas NC-P pools multiple N-CMAPSS sources and fault configurations, yielding a richer and more varied training distribution. Notably, TabDPT achieves the best normalized MAE on NC-P among all evaluated models, underlining that tabular foundation models can excel when context diversity is sufficient. This distributional perspective is also consistent with the sequence-length analysis on NC-DS02, where additional temporal context provides only minimal benefit. Once the inference-time distribution is sufficiently covered, adding more samples or variables appears to provide little additional information for the model’s posterior. Taken together with the data-efficiency results, this finding is practically encouraging: because tabular foundation models can exploit small but well-chosen context sets, providing even a modest number of targeted samples that cover rare operating regimes or fault modes may be sufficient to unlock strong generalization in those settings.

More generally, PHM performance is strongly influenced by the preprocessing pipeline, making the choice of input transformations a central component of the evaluation protocol\. The transformations selected in our experiments span both time and frequency domains to provide a comprehensive representation of the input signal\. Since the primary objective of this study is cross\-task robustness and sample efficiency rather than exhaustive per\-dataset optimization, the shared transform set is intentionally conservative\. This ensures fair comparison across model families rather than maximizing performance on any individual benchmark\. Identifying more general transformation sets that transfer robustly across PHM datasets, architectures, and operating regimes remains an important direction for future work\.

The computational efficiency of TFMs is governed by the number of in-context samples and by the dimensionality of each tabularized row. In the present experiments, subsampling is applied only at the window level, and selected context rows retain the full flattened sensor-time representation. The finite inference-time budget therefore constrains the number of context samples that can be used, which can limit attainable accuracy. The current tabularization remains restrictive, as it does not allow individual sensors or time steps to be subsampled within a window. Extending the approach with per-variate or per-timestep subsampling could reduce $D_{\mathrm{tab}}$, alleviate the computational bottleneck, and preserve more context samples by retaining only the most informative features. Future work may also develop more advanced context-selection strategies that preserve operating-regime, degradation-stage, and class coverage while remaining within the computational limits of TFM inference.

A significant advantage of PFN\-based models is the dual utility of the validation set\. In traditional deep learning, the validation set is strictly partitioned for hyperparameter tuning and early stopping to mitigate overfitting, effectively isolating it from the final inference phase\. In contrast, PFNs allow the validation data to be repurposed\. Once the optimal configuration \(e\.g\., tabularization parameters\) is identified, the validation set can be integrated into the model’s context\. This enables the model to leverage a richer set of labeled points during inference on the test set, maximizing the informative value of all available data\. A systematic study of how validation\-set reuse interacts with context size, distributional shift, and tabularization choices remains an open and practically relevant research direction\.
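The validation\-set reuse described above can be sketched as follows\. This is a minimal illustration, not the authors' implementation: the `Row` type, the `build_context` helper, and the toy rows are hypothetical, and the in\-context model itself is left out\.

```python
from typing import List, Tuple

# Hypothetical row type: (feature vector, label).
Row = Tuple[List[float], float]


def build_context(train: List[Row], val: List[Row], reuse_validation: bool) -> List[Row]:
    """Return the labeled rows that would be handed to an in-context model."""
    context = list(train)
    if reuse_validation:
        # After configuration selection, validation rows become extra context.
        context.extend(val)
    return context


# Toy data, for illustration only.
train_rows: List[Row] = [([0.1, 0.2], 1.0), ([0.3, 0.1], 0.8)]
val_rows: List[Row] = [([0.5, 0.4], 0.5)]

# Phase 1: choose tabularization settings; validation rows are only scored.
selection_context = build_context(train_rows, val_rows, reuse_validation=False)

# Phase 2: the configuration is fixed; validation rows join the context for test inference.
final_context = build_context(train_rows, val_rows, reuse_validation=True)

print(len(selection_context), len(final_context))
```

The two phases are kept separate so that validation labels never influence the scores used to pick the configuration\.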

Finally, scalable PHM modeling may not be limited to replacing conventional methods with foundation models\. Recent agentic PHM systems and benchmarks provide a complementary direction by automating parts of model configuration, tool orchestration, method reproduction, and evaluation under executable interfaces \[[9](https://arxiv.org/html/2606.05481#bib.bib9),[18](https://arxiv.org/html/2606.05481#bib.bib18),[53](https://arxiv.org/html/2606.05481#bib.bib53)\]\. Such systems may make conventional methods more viable at scale by reducing the manual effort required to translate paper\-specific assumptions or user\-defined PHM tasks into standardized preprocessing, model, target, and evaluation configurations\.

## Appendix

### Appendix Contents

- [A\. Dataset Protocols](https://arxiv.org/html/2606.05481#A1)
  - [A\.1 Battery datasets](https://arxiv.org/html/2606.05481#A1.SS1)
  - [A\.2 Bearing datasets](https://arxiv.org/html/2606.05481#A1.SS2)
  - [A\.3 N\-CMAPSS DS02 and multi\-source families](https://arxiv.org/html/2606.05481#A1.SS3)
  - [A\.4 Hydraulic diagnostics \(HSF15\)](https://arxiv.org/html/2606.05481#A1.SS4)
  - [A\.5 PHME20](https://arxiv.org/html/2606.05481#A1.SS5)
  - [A\.6 MZVAV](https://arxiv.org/html/2606.05481#A1.SS6)
- [B\. Implementation and Model Details](https://arxiv.org/html/2606.05481#A2)
  - [B\.1 Sequential models and in\-context tabular models](https://arxiv.org/html/2606.05481#A2.SS1)
  - [B\.2 Evaluation protocol and baseline adaptation](https://arxiv.org/html/2606.05481#A2.SS2)
  - [B\.3 TabPFN and TabDPT details](https://arxiv.org/html/2606.05481#A2.SS3)
- [C\. Additional Experimental Results](https://arxiv.org/html/2606.05481#A3)
  - [C\.1 Experimental setup](https://arxiv.org/html/2606.05481#A3.SS1)
  - [C\.2 Transformation schemas](https://arxiv.org/html/2606.05481#A3.SS2)
  - [C\.3 Hyperparameter search](https://arxiv.org/html/2606.05481#A3.SS3)
  - [C\.4 Reading the result tables](https://arxiv.org/html/2606.05481#A3.SS4)
  - [C\.5 Diagnostics results](https://arxiv.org/html/2606.05481#A3.SS5)
  - [C\.6 Prognostics results](https://arxiv.org/html/2606.05481#A3.SS6)
  - [C\.7 Data\-efficiency scaling](https://arxiv.org/html/2606.05481#A3.SS7)
- [D\. Reproducibility](https://arxiv.org/html/2606.05481#A4)
  - [D\.1 Framework\-based execution](https://arxiv.org/html/2606.05481#A4.SS1)
  - [D\.2 Configuration\-fixed experiments](https://arxiv.org/html/2606.05481#A4.SS2)
  - [D\.3 Shared preprocessing and evaluation boundaries](https://arxiv.org/html/2606.05481#A4.SS3)
- [E\. Data and Code Access](https://arxiv.org/html/2606.05481#A5)
  - [E\.1 Code and protocol availability](https://arxiv.org/html/2606.05481#A5.SS1)
  - [E\.2 Third\-party datasets](https://arxiv.org/html/2606.05481#A5.SS2)

## Appendix A Dataset Protocols

This section records the dataset contract behind the main benchmark table\. For each evaluated family, it states the source data, split logic, target definition, normalization, and metrics needed to interpret the reported prognostics or diagnostics results\.

### A\.1 Battery Datasets

The battery prognostics tasks use NB14 and UNIBO21\. Both datasets are treated as run\-to\-failure battery degradation problems with an operation\-dependent RUL target based on remaining cumulative discharge\.

NB14\. Description\. The NASA Randomized Battery Usage dataset \[[5](https://arxiv.org/html/2606.05481#bib.bib5)\] consists of 28 lithium cobalt oxide 18650 cells organized into 7 operational groups\. Battery aging is induced through repeated charge and randomized discharge cycles with loading periods of 5 minutes\. After 1500 loading periods, reference characterization cycles are used to assess battery health in terms of capacity\. The signals are recorded at 0\.1 Hz and comprise voltage, current, and temperature measured during the aging tests\.

Split\. Following Bosello et al\. \[[6](https://arxiv.org/html/2606.05481#bib.bib6)\], Group 3 is excluded, and cells RW3 and RW20 are removed because of corrupted or unusable measurements\. The remaining 22 cells are split into training, validation, and test units; the exact split is shown in [Table 3](https://arxiv.org/html/2606.05481#A1.T3)\.

- RW3 is excluded because of corrupted temperature measurements\.
- RW20 is excluded because its sensor data report zero for nearly its entire life\.

Table 3: Train/Validation/Test Split of the NASA Randomized Battery Dataset\.

| Group | Original Cells | Training Set | Validation Set | Test Set |
| --- | --- | --- | --- | --- |
| Group 1 | RW1, RW2, RW7, RW8 | RW1, RW2 | RW7 | RW8 |
| Group 2 | RW3, RW4, RW5, RW6 | RW4 \(RW3 excl\.\) | RW5 | RW6 |
| Group 3 | RW9, RW10, RW11, RW12 | Excluded \(unrealistic profile\) | | |
| Group 4 | RW13, RW14, RW15, RW16 | RW13, RW14 | RW15 | RW16 |
| Group 5 | RW17, RW18, RW19, RW20 | RW17 \(RW20 excl\.\) | RW18 | RW19 |
| Group 6 | RW21, RW22, RW23, RW24 | RW21, RW22 | RW23 | RW24 |
| Group 7 | RW25, RW26, RW27, RW28 | RW25, RW26 | RW27 | RW28 |
| Total | 28 Batteries | 10 Batteries | 6 Batteries | 6 Batteries |

Target, normalization, and metrics\. NB14 uses the shared battery ah\-RUL target defined below\. Input features are min\-max normalized, and performance is reported with MAE, MSE, and RMSE on $Q_{RUL}$\.

UNIBO21\. Description\. The UNIBO Powertools dataset \[[57](https://arxiv.org/html/2606.05481#bib.bib57)\] contains laboratory data from 30 lithium\-ion batteries used to study batteries under cleaning\-equipment conditions\. The cells differ by manufacturer, type, capacity, and test regime and can be categorized into 7 groups\. The experimental protocol comprises charging at a 1\.8 A current to a cutoff voltage of 4\.2 V and discharging at a 5 A current until a defined end\-of\-discharge voltage\. The health indicators capacity and resistance were measured by performing 100 reference cycles\. The collected signals are sampled at 0\.1 Hz and include temperature, voltage, current, and energy measured during both charging and discharging\.

Split\. Following Bosello et al\. \[[6](https://arxiv.org/html/2606.05481#bib.bib6)\], cells 019, 047, and 049 are excluded because of data corruption or incomplete end\-of\-life cycling\. The remaining 27 cells are split into training, validation, and test units, with the exact split shown in [Table 4](https://arxiv.org/html/2606.05481#A1.T4)\.

1. 019: corrupted data\.
2. 047, 049: not cycled to end\-of\-life at the time of dataset construction\.

Table 4: Train/Validation/Test Split of the UNIBO21 Dataset\.

| Group | Original Cells | Training Set | Validation Set | Test Set |
| --- | --- | --- | --- | --- |
| DM\-3\.0\-S | 000, 001, 002, 003 | 000, 001 | 002 | 003 |
| DM\-3\.0\-H | 009, 010, 011 | 009 | 010 | 011 |
| DM\-3\.0\-P | 013\-017, \(047, 049\) | 014, 015, 017 | 016 | 013 |
| EE\-2\.85\-S | 006, 007, 008, 042 | 007, 008 | 042 | 006 |
| EE\-2\.85\-H | 043, 044 | 043 | none | 044 |
| DP\-2\.00\-S | 018, 036\-039, 050, 051, \(019\) | 018, 036, 037, 038, 050 | 051 | 039 |
| DM\-4\.00\-S | 040, 041 | 040 | none | 041 |
| Total | 27 \(\+3 excl\.\) | 15 Batteries | 5 Batteries | 7 Batteries |

Target, normalization, and metrics\. UNIBO21 uses the same ah\-RUL formulation as NB14, with dataset\-specific implementation details for end\-of\-life and current integration\. Input features are min\-max normalized, and performance is reported with MAE, MSE, and RMSE on $Q_{RUL}$\.

Target creation and metrics\. Following Bosello et al\. \[[6](https://arxiv.org/html/2606.05481#bib.bib6)\], both battery datasets use normalized remaining cumulative discharge, or ah\-RUL, instead of a purely time\- or cycle\-based target\. For any cycle $n$, ah\-RUL is computed as:

$$Q_{RUL}(n) = Q_{acc}(n_{EoL}) - Q_{acc}(n) \tag{26}$$

where $Q_{RUL}(n) = 0$ for cycles $n \geq n_{EoL}$\. The cumulative discharge throughput is computed by integrating discharge current and normalizing by nominal capacity:

$$Q_{acc}(n) = \frac{1}{Q_{nom}} \sum_{i=1}^{n} \left( \int_{t_{i}}^{t_{i+1}} I_{d}(t)\,dt \right) \tag{27}$$

The NB14 and UNIBO21 implementations differ in the definition of $n_{EoL}$ and in how discharge current is selected, so the dataset\-specific target rules are stated together with the split definitions\.
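As a rough illustration of Eqs\. \(26\) and \(27\), the sketch below integrates discharge current per cycle with the trapezoidal rule and derives the ah\-RUL target\. It is a sketch under stated assumptions rather than the benchmark code: the function names are hypothetical, discharge current is assumed to be given as a non\-negative magnitude in amperes, time is in seconds, and $n_{EoL}$ is passed in because its definition is dataset\-specific\.

```python
from typing import List, Sequence


def discharge_throughput(times_s: Sequence[float], current_a: Sequence[float]) -> float:
    """Trapezoidal integral of discharge current over one cycle, in ampere-hours."""
    total_as = 0.0
    for k in range(1, len(times_s)):
        dt = times_s[k] - times_s[k - 1]
        total_as += 0.5 * (current_a[k] + current_a[k - 1]) * dt
    return total_as / 3600.0  # ampere-seconds -> ampere-hours


def ah_rul(cycle_ah: Sequence[float], q_nom_ah: float, n_eol: int) -> List[float]:
    """Eq. (26): Q_RUL(n) = Q_acc(n_EoL) - Q_acc(n), and 0 for n >= n_EoL.

    cycle_ah[i] is the discharge throughput of cycle i + 1 (cycles are 1-indexed).
    Q_acc is the running sum normalized by nominal capacity, as in Eq. (27).
    """
    q_acc: List[float] = []
    running = 0.0
    for ah in cycle_ah:
        running += ah
        q_acc.append(running / q_nom_ah)
    q_eol = q_acc[n_eol - 1]
    return [q_eol - q if n < n_eol else 0.0 for n, q in enumerate(q_acc, start=1)]


# Toy example: a constant 2 A discharge for one hour gives 2 Ah of throughput.
one_cycle_ah = discharge_throughput([0.0, 1800.0, 3600.0], [2.0, 2.0, 2.0])

# Four hypothetical cycles, nominal capacity 2 Ah, end of life at cycle 3.
targets = ah_rul([2.0, 2.0, 1.0, 1.0], q_nom_ah=2.0, n_eol=3)
print(one_cycle_ah, targets)
```

Because the target counts remaining normalized throughput rather than remaining cycles, two cells with the same cycle count but different loads receive different targets\.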

### A\.2 Bearing Datasets

The evaluated bearing prognostics task is XJTU\-SY\. It provides high\-frequency vibration measurements from accelerated run\-to\-failure bearing experiments and is evaluated as a health\-indicator regression task\.

XJTU\-SY\. Description\. XJTU\-SY, introduced by Yaguo et al\. \[[61](https://arxiv.org/html/2606.05481#bib.bib61)\], provides 15 run\-to\-failure bearings under three operating conditions\. Each bearing is recorded through horizontal and vertical high\-frequency vibration channels until failure at a sampling rate of 25\.6 kHz\. In this benchmark, each acquisition is represented as a two\-channel vibration segment of shape $(32768, 2)$, and the operating conditions and bearing lifetimes used for target construction are summarized below\.

Preprocessing\. The preprocessing pipeline standard\-scales the raw vibration channels, computes a cumulative\-sum feature, extracts time\-domain and spectral descriptors, and min\-max rescales the concatenated descriptor block\. Fit\-predict models reuse the same processed representation through a tabularized variant\.

Split\. The reported XJTU\-SY experiments use the configured PHMD split\. The training bearings are 1\_3, 1\_4, 2\_1, 2\_4, 2\_5, 3\_1, 3\_2, and 3\_3; the validation bearings are 1\_1, 1\_2, and 3\_5; and the test bearings are 1\_5, 2\_2, 2\_3, and 3\_4\.

Table 5: XJTU\-SY Operating Conditions\.

| Condition ID | Engine Speed \(rpm\) | Radial Load \(N\) |
| --- | --- | --- |
| Condition 1 | 2100 | 12,000 |
| Condition 2 | 2250 | 11,000 |
| Condition 3 | 2400 | 10,000 |

Table 6: XJTU\-SY bearing lifetimes \(min\), used for regression targets and HI normalization\.

| Condition | Bearing Name | Bearing Lifetime \(min\) |
| --- | --- | --- |
| Condition 1 | Bearing 1\_1 | 123 |
| Condition 1 | Bearing 1\_2 | 161 |
| Condition 1 | Bearing 1\_3 | 158 |
| Condition 1 | Bearing 1\_4 | 122 |
| Condition 1 | Bearing 1\_5 | 52 |
| Condition 2 | Bearing 2\_1 | 491 |
| Condition 2 | Bearing 2\_2 | 161 |
| Condition 2 | Bearing 2\_3 | 533 |
| Condition 2 | Bearing 2\_4 | 42 |
| Condition 2 | Bearing 2\_5 | 339 |
| Condition 3 | Bearing 3\_1 | 2538 |
| Condition 3 | Bearing 3\_2 | 2496 |
| Condition 3 | Bearing 3\_3 | 371 |
| Condition 3 | Bearing 3\_4 | 1515 |
| Condition 3 | Bearing 3\_5 | 114 |

Target, normalization, and metrics\. Runtime is converted into the HI target below using the bearing lifetime table\. XJTU\-SY is evaluated with standard regression metrics such as MAE and RMSE between the true HI and the predicted HI for each sample\.

Target creation and metrics\. For XJTU\-SY, the target variable is formulated as a normalized health indicator \(HI\), linearly decreasing from 1 for a healthy bearing to 0 at failure:

$$HI = 1 - \frac{\text{Runtime}}{\text{Total Lifetime}} \tag{28}$$

For XJTU\-SY, Total Lifetime is the Bearing Lifetime \(min\) in [Table 6](https://arxiv.org/html/2606.05481#A1.T6)\.
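A small sketch of Eq\. \(28\) follows, using the lifetimes listed in Table 6\. The dictionary and function names are hypothetical, and the range check on runtime is an added assumption rather than part of the stated protocol\.

```python
# Bearing lifetimes in minutes, copied from Table 6.
LIFETIME_MIN = {
    "1_1": 123, "1_2": 161, "1_3": 158, "1_4": 122, "1_5": 52,
    "2_1": 491, "2_2": 161, "2_3": 533, "2_4": 42, "2_5": 339,
    "3_1": 2538, "3_2": 2496, "3_3": 371, "3_4": 1515, "3_5": 114,
}


def health_indicator(bearing: str, runtime_min: float) -> float:
    """Eq. (28): HI = 1 - runtime / total lifetime, so HI runs from 1 down to 0."""
    total = LIFETIME_MIN[bearing]
    if not 0 <= runtime_min <= total:
        raise ValueError("runtime must lie between 0 and the bearing lifetime")
    return 1.0 - runtime_min / total


# Bearing 1_5 at the start, halfway point, and end of its 52-minute life.
hi_values = [health_indicator("1_5", t) for t in (0, 26, 52)]
print(hi_values)
```

Normalizing by each bearing's own lifetime puts short\-lived and long\-lived bearings on the same target scale, which is why the lifetime table is part of the protocol\.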

### A\.3 N\-CMAPSS DS02 and Multi\-Source Families

N\-CMAPSS\. Description\. N\-CMAPSS \[[4](https://arxiv.org/html/2606.05481#bib.bib4),[20](https://arxiv.org/html/2606.05481#bib.bib20)\] is a NASA benchmark family derived from C\-MAPSS turbofan engine simulations\. It provides multivariate time series with run\-to\-failure trajectories, operating conditions, fault modes, and ground\-truth RUL\. This work uses this source family in two scopes: NC\-DS02 denotes the DS02 prognostics protocol, while NC\-P and NC\-D denote the broader N\-CMAPSS prognostics and diagnostics settings reported in the main results\.

Split\. The explicit split table below documents the DS02 protocol because it is the N\-CMAPSS split currently tabulated in the manuscript\. DS02 is split into five training units, one validation unit, and three test units, with the split shown in [Table 7](https://arxiv.org/html/2606.05481#A1.T7)\. The broader N\-CMAPSS settings remain part of the same source family and are reported separately in the main results as NC\-P and NC\-D\.

Table 7: N\-CMAPSS\-02 data splits, flight classes, and fault types \(notation aligned with the main text\)\.

| Split | Unit number | Flight class |
| --- | --- | --- |
| Train | 2 | 2 |
| Train | 5 | 2 |
| Train | 10 | 2 |
| Train | 16 | 2 |
| Train | 20 | 2 |
| Validation | 18 | 2 |
| Test | 11 | 2 |
| Test | 14 | 2 |
| Test | 15 | 2 |

Target, normalization, and metrics\. For NC\-DS02 and NC\-P, the target is RUL, measured as the remaining flights or cycles before end\-of\-life\. Features are min\-max normalized, and prognostics metrics are MAE, MSE, and RMSE of estimated RUL against the true value\. For NC\-D, the target is a diagnostic class label and the main\-results metric is F1 score\.

### A\.4 Hydraulic Diagnostics \(HSF15\)

HSF15\. Description\. HSF15 \[[25](https://arxiv.org/html/2606.05481#bib.bib25)\] is a hydraulic\-system condition\-monitoring benchmark based on a laboratory test rig with multivariate sensor measurements\. This work evaluates it as four component\-level diagnostics tasks: accumulator \(HSF15\-A\), cooler \(HSF15\-C\), pump \(HSF15\-P\), and valve \(HSF15\-V\)\.

Split\. All four HSF15 tasks use the same datasource family and split protocol, but differ in the target component and number of fault classes\. The tasks are reported separately in the main benchmark table because each component defines a distinct diagnostic classification problem\.

Target, normalization, and metrics\. The target is the component\-specific fault class\. Features are scaled according to the shared preprocessing pipeline, and the tabular fit\-predict pipeline summarizes sensor bursts through time\-domain and spectral descriptors before tabularization\. The main\-results metric is F1 score\.

### A\.5 PHME20

PHME20\. Description\. PHME20 \[[29](https://arxiv.org/html/2606.05481#bib.bib29)\] is the PHM Society 2020 European Conference Data Challenge dataset\. It records an experimental industrial filtration system in which a particulate\-laden gas stream progressively clogs a filter, increasing differential pressure until an operational threshold is reached\. Each run captures one filter lifetime under varying dust and feed conditions\.

Split\. This work follows the challenge\-provided split, which assigns disjoint filter runs to training, validation, and test partitions\. No additional dataset\-specific filtering is applied in this protocol\.

Target, normalization, and metrics\. PHME20 is used as a prognostics task with a direct per\-timestep RUL target\. The preprocessing pipeline min\-max scales the sensor channels and RUL target\. The task logs MAE, MSE, RMSE, and `nasa_score`\.

### A\.6 MZVAV

MZVAV\. Description\. MZVAV is an automated fault detection and diagnostics dataset for a small commercial building with a multi\-zone variable air volume system, generated by Drexel University in the ASHRAE 1312 project \[[21](https://arxiv.org/html/2606.05481#bib.bib21)\]\. It is a simulated building\-fault dataset featuring three air\-handling units with 18 sensors, collected over 26 days across summer, winter, and transition seasons at a one\-minute resolution\. The operational faults were manually imposed on the control system\. The collected signals include outdoor air temperature, supply air temperature and set\-point, mixed air and return air temperature, supply air fan status and speed control, return air fan status and speed control, exhaust air damper control, outdoor air and return air damper control, cooling and heating coil valve control, supply air duct static pressure and set point, occupancy mode indicator, and fault detection ground truth\. On this basis, we formulate a multi\-class classification problem with the fault detection ground truth as the label\.

Split\. Individual faults are grouped into the 4 classes "Outdoor Air Damper Stuck", "Heating Coil Valve Leaking", "Cooling Coil Valve Leaking", and "Unfaulted"\. The number of days per group is shown in [Table 8](https://arxiv.org/html/2606.05481#A1.T8)\. The benchmark uses a stratified train/validation/test split over days with a test size of 20%\.

Table 8: Number of faulty days per fault group\.

| Fault Category | Fault days |
| --- | --- |
| Outdoor Air Damper Stuck | 5 |
| Heating Coil Valve Leaking | 3 |
| Cooling Coil Valve Leaking | 5 |
| Unfaulted | 13 |

Target, normalization, and metrics\. The target is the fault detection ground truth\. Features are min\-max scaled to $[0,1]$\. Metrics are F1\-score, precision, recall, and accuracy\.

## Appendix B Implementation and Model Details

### B\.1 Sequential Models and In\-Context Tabular Models

This subsection collects the implementation details needed to connect the protocol above to the evaluated model families\. The key distinction is the input representation: sequence models consume $\mathcal{D}_{\mathrm{seq}}$, while tabular foundation models and tabular baselines consume $\mathcal{D}_{\mathrm{tab}}$\.

Conventional sequence models, such as 1D\-CNNs, LSTMs, and Time\-Series Transformers, are trained to minimize a task\-specific loss over the training sequence dataset $\mathcal{D}_{\mathrm{seq}}^{\text{train}}$\. These models do not operate on tabularized rows $\bm{X}_{m}$; instead, they ingest the temporal window $\bm{W}_{m}$ directly\.

Let $f_{\theta}$ denote a model parameterized by $\theta$\. The model produces a prediction $\hat{y}_{m} = f_{\theta}(\bm{W}_{m})$\. The optimal parameters $\theta^{*}$ are obtained by minimizing empirical risk over the training sequence dataset:

$$\theta^{*} = \operatorname*{arg\,min}_{\theta} \frac{1}{|\mathcal{D}_{\mathrm{seq}}^{\text{train}}|} \sum_{(\bm{W}_{m}, y_{m}) \in \mathcal{D}_{\mathrm{seq}}^{\text{train}}} \ell\left(f_{\theta}(\bm{W}_{m}), y_{m}\right) \tag{29}$$

where $\ell(\cdot,\cdot)$ is a task\-dependent loss function, such as mean squared error for prognostics or cross\-entropy for diagnostics\. We typically employ stochastic gradient descent variants such as AdamW to solve this optimization problem\.

Tabular foundation models operate through in\-context learning\. In this paradigm, the model $f_{\phi}$, parameterized by weights $\phi$, predicts for a query row $\bm{X}_{q}$ by conditioning on a context set $\mathcal{C}$ of labeled support examples:

$$\hat{y}_{q} = f_{\phi}(\mathcal{C}, \bm{X}_{q}) \tag{30}$$

To ensure a valid evaluation and prevent leakage, the context set must be drawn strictly from the training partition of the tabular dataset:

$$\mathcal{C} = \{(\bm{X}_{j}, y_{j})\}_{j=1}^{C} \subset \mathcal{D}_{\mathrm{tab}}^{\text{train}}. \tag{31}$$

XGBoost is included as a classical tree\-based tabular baseline \[[10](https://arxiv.org/html/2606.05481#bib.bib10)\]\. It uses the same tabularized rows $\bm{X}_{m}$ as the tabular foundation models, but learns task\-specific boosted trees from the training partition rather than using in\-context prediction\.
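The in\-context interface of Eqs\. \(30\) and \(31\) can be sketched as below\. This is only an illustration of the calling convention and the training\-only constraint: the row layout, the helper names, and the budget parameter are hypothetical, and a nearest\-neighbor lookup stands in for the pretrained model $f_{\phi}$, which it does not approximate\.

```python
import random
from typing import List, Set, Tuple

# Hypothetical row layout: (unit id, features X_j, label y_j).
Row = Tuple[str, List[float], float]


def build_training_context(rows: List[Row], train_units: Set[str],
                           budget: int, seed: int = 0) -> List[Row]:
    """Eq. (31): keep only rows from training units, then cap the context size."""
    eligible = [row for row in rows if row[0] in train_units]
    if len(eligible) <= budget:
        return eligible
    return random.Random(seed).sample(eligible, budget)


def predict_in_context(context: List[Row], x_query: List[float]) -> float:
    """Stand-in for f_phi(C, X_q): return the label of the closest context row."""
    nearest = min(
        context,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[1], x_query)),
    )
    return nearest[2]


# Toy rows; unit "u9" plays the role of a held-out test unit.
rows: List[Row] = [
    ("u1", [0.0, 0.0], 1.0),
    ("u1", [1.0, 1.0], 0.5),
    ("u2", [2.0, 2.0], 0.2),
    ("u9", [1.0, 1.0], 0.9),
]
context = build_training_context(rows, train_units={"u1", "u2"}, budget=10)
prediction = predict_in_context(context, [0.9, 0.9])
print(len(context), prediction)
```

Filtering by unit before sampling means a test unit can never enter the context, whatever the budget or seed\.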

### B\.2 Evaluation Protocol and Baseline Adaptation

All models are evaluated on the same held\-out instances derived from $\mathcal{D}^{\text{test}}$\. The only difference is the representation supplied to the model: sequence models consume $\bm{W}_{m}$, while tabular models consume $\bm{X}_{m}$ and, when applicable, a training\-only context set $\mathcal{C}$\.

For a task\-specific loss $\ell(\hat{y}, y)$, the aggregate test loss is computed over the same test index set for both model families:

$$\mathcal{L}_{\mathrm{test}} = \frac{1}{|\mathcal{D}^{\text{test}}|} \sum_{m \in \mathcal{D}^{\text{test}}} \ell(\hat{y}_{m}, y_{m}), \quad \text{where } \hat{y}_{m} = \begin{cases} f_{\theta}(\bm{W}_{m}) & \text{if } f \text{ is a sequence model}, \\ f_{\phi}(\mathcal{C}, \bm{X}_{m}) & \text{if } f \text{ is an in-context tabular model}. \end{cases} \tag{32}$$

For prognostics, this work reports regression metrics such as RMSE and MAE\. For diagnostics, it reports classification metrics such as accuracy and macro\-F1\. For cross\-dataset comparisons, models are ranked within each dataset and the average rank is reported, with rank 1 denoting the best model for a given dataset\.
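The average\-rank aggregation can be sketched as follows\. The model names, dataset names, and scores are placeholders invented for the example, and the tie rule \(tied models share the mean of their positions\) is an assumption, since the text does not say how ties are handled\.

```python
from typing import Dict


def rank_within_dataset(model_scores: Dict[str, float],
                        lower_is_better: bool = True) -> Dict[str, float]:
    """Rank models on one dataset; rank 1 is best, ties share the mean position."""
    ordered = sorted(model_scores.items(), key=lambda kv: kv[1],
                     reverse=not lower_is_better)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2.0  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared
        i = j + 1
    return ranks


def average_ranks(scores: Dict[str, Dict[str, float]],
                  lower_is_better: bool = True) -> Dict[str, float]:
    """Average each model's per-dataset rank over the datasets it appears in."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for model_scores in scores.values():
        for model, rank in rank_within_dataset(model_scores, lower_is_better).items():
            totals[model] = totals.get(model, 0.0) + rank
            counts[model] = counts.get(model, 0) + 1
    return {model: totals[model] / counts[model] for model in totals}


# Placeholder error values (lower is better); not results from the paper.
toy_scores = {
    "dataset_a": {"model_x": 0.10, "model_y": 0.20, "model_z": 0.30},
    "dataset_b": {"model_x": 0.50, "model_y": 0.40, "model_z": 0.40},
}
avg = average_ranks(toy_scores)
print(avg)
```

For metrics where higher is better, such as F1, the same functions would be called with `lower_is_better=False`\.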

Baseline models trained from scratch, namely Bi\-LSTM, 1D\-CNN, TiDE, and Transformer\-based models, are applied directly to the sequence windows $\bm{W}_{m}$\. Some Transformer models, including the Time\-Series Transformer, Crossformer, and Spacetimeformer, use an encoder–decoder formulation in their original forecasting setting\. That formulation assumes historical target values, historical covariates, and future\-horizon covariates\.

This formulation is not directly applicable to PHM prognostics and diagnostics because the past values of the target variable, such as remaining useful life, are unavailable at inference time, and no future information is used\. We therefore adapt encoder–decoder Transformer architectures by removing the encoder and directly feeding $\bm{W}_{m}$ to the decoder\. This step is not needed for PatchTST, which is decoder\-only by design\. For PatchTST, channel\-independent processing cannot be applied to the regression target for the same reason, so decoder latents are concatenated and a separate regression or classification head is trained\. For TiDE, we disable the encoder pathway and use the dynamic covariate pathway to process $\bm{W}_{m}$\. For diagnostics, the regression head is replaced with a classification head that outputs $n$ classes\.

The transformer training regimen uses a warm\-up phase and reduces the learning rate upon plateau\. Min\-max or z\-score normalization is selected depending on the dataset\. All preprocessing, feature extraction, and sequence slicing, represented by $\mathcal{G}$ and $\mathcal{S}$, are shared among the models; tabular foundation models and XGBoost additionally share the tabularization schema $\mathcal{T}$\.

### B\.3 TabPFN and TabDPT Details

TabPFN \[[26](https://arxiv.org/html/2606.05481#bib.bib26),[27](https://arxiv.org/html/2606.05481#bib.bib27)\] is a Prior\-Data Fitted Network \(PFN\) trained on a large\-scale synthetic dataset that consists of millions of small tabular instances\. These tabular instances are generated from randomly sampled structural causal models\. TabPFN therefore leverages the structural diversity seen during pre\-training to implicitly understand and generalize to the relational structure of unseen tabular data\. This enables the network to efficiently approximate the Bayesian posterior for the chosen prior:

$$p(y^{\text{test}} \mid \mathcal{D}_{\mathrm{tab}}^{\text{test}}, \mathcal{Y}_{\mathrm{tab}}^{\text{train}}, \mathcal{D}_{\mathrm{tab}}^{\text{train}}) \tag{33}$$

Internally, TabPFN converts tabular cells into a sequence of high\-dimensional feature tokens by mapping every categorical or scalar attribute to a fixed\-dimensional embedding\. The attribute embedding strategy is therefore similar to Spacetimeformer, which flattens spatial\-temporal time sequences and embeds each signal and time coordinate in $\bm{W}_{k} \in \mathbb{R}^{L_{\mathrm{seq}} \times F}$ for a total of $L_{\mathrm{seq}} \times F$ embeddings\. In TabPFN, the final embedding is obtained by multiplying each raw feature value with an attribute\-specific offset vector that places all attributes in a shared latent space\. TabPFN is a univariate model\. For multivariate outputs in PHM, we therefore parallelize the process by fitting one model per variate\. To further enhance robustness, we use an ensemble of eight TabPFN estimators\.

We also evaluate TabDPT \[[38](https://arxiv.org/html/2606.05481#bib.bib38)\], a tabular foundation model that has been trained purely on real data\. Due to its capability to perform classification and regression, we employ TabDPT as a retrieval\-based discriminatively pretrained transformer that operates on full row representations\. TabDPT supports up to $D_{\max} = 100$ features, so when $D_{\mathrm{tab}}$ exceeds this limit, TabDPT applies Principal Component Analysis to obtain a projection with exactly $D_{\max}$ dimensions\. TabDPT separately embeds features and labels linearly into a 768\-dimensional space and combines them through element\-wise addition\.

During inference, TabDPT builds a local context by selecting the top $K$ most similar training rows using the FAISS library \[[15](https://arxiv.org/html/2606.05481#bib.bib15)\]\. The full transformer sequence is formed by concatenating the embedded training samples and the embedded evaluation sample, where the training embeddings are shifted by their label embeddings and the evaluation sample is kept unlabeled\. As for TabPFN, we use an ensemble of eight TabDPT estimators, each trained on a random subset of columns, and aggregate their outputs through a weighted average\. We ensure that there is no overlap between the datasets evaluated in this work and the pre\-training data of TabPFN\.
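The retrieval step can be illustrated with a brute\-force nearest\-neighbor search\. This sketch does not use FAISS and is not TabDPT's implementation: the function name is hypothetical, and Euclidean distance is an assumed similarity measure that the text does not specify\.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical labeled training row: (features, label).
Row = Tuple[List[float], str]


def top_k_context(train_rows: Sequence[Row], query: Sequence[float], k: int) -> List[Row]:
    """Return the k training rows closest to the query, nearest first."""
    def distance(features: Sequence[float]) -> float:
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(features, query)))

    return sorted(train_rows, key=lambda row: distance(row[0]))[:k]


# Toy training rows; the labels are arbitrary tags.
train_rows: List[Row] = [
    ([0.0, 0.0], "a"),
    ([1.0, 1.0], "b"),
    ([5.0, 5.0], "c"),
]
local_context = top_k_context(train_rows, query=[0.9, 0.9], k=2)
print(local_context)
```

A per\-query context of this kind keeps the transformer sequence short while concentrating it on the rows most relevant to the evaluation sample\. An exhaustive sort like this costs one pass over the training table per query, which is the cost an indexed search library is meant to reduce\.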

## Appendix C Additional Experimental Results

This appendix presents the complete experimental setup and results for the benchmark evaluation\. It documents the evaluated experiment families, preprocessing schemas, and hyperparameter search spaces\. For the condensed main\-text summary, see Section [6](https://arxiv.org/html/2606.05481#S6); for dataset\-level protocol details, see Appendix [A](https://arxiv.org/html/2606.05481#A1)\.

### C\.1 Experimental setup

##### Experiment definitions\.

Each experiment family is defined by a Hydra configuration \[[60](https://arxiv.org/html/2606.05481#bib.bib60)\] that specifies the datasource, transform pipeline, model, evaluator, seed set, and hyperparameter search space\. The benchmark covers gradient\-trained and fit\-predict models for both diagnostics and prognostics tasks\. XGBoost uses a separate fit\-predict configuration that reuses the same experiment definitions and transform families as the other tabular models\.

##### Learning tasks and datasets\.

The evaluation covers two PHM task categories\. Diagnostics comprises multiclass fault classification on MZVAV \[[22](https://arxiv.org/html/2606.05481#bib.bib22)\], four component\-specific hydraulic diagnostics tasks on HSF15 \[[25](https://arxiv.org/html/2606.05481#bib.bib25)\] \(accumulator, cooler, pump, and valve\), and concept classification on N\-CMAPSS Multi \[[4](https://arxiv.org/html/2606.05481#bib.bib4),[20](https://arxiv.org/html/2606.05481#bib.bib20)\]\. Prognostics comprises ah\-RUL regression on NB14 \[[5](https://arxiv.org/html/2606.05481#bib.bib5)\] and UNIBO21 \[[58](https://arxiv.org/html/2606.05481#bib.bib58)\], direct RUL regression on PHME20 \[[29](https://arxiv.org/html/2606.05481#bib.bib29)\], RUL prediction on the N\-CMAPSS Multi and DS02 families, and bearing prognostics on XJTU\-SY \[[61](https://arxiv.org/html/2606.05481#bib.bib61)\]\.

Target semantics differ by family\. NB14 and UNIBO21 predict ah\-RUL, i\.e\., remaining cumulative discharge throughput in Ampere\-hours; PHME20 and the N\-CMAPSS families predict remaining useful life; and the bearing family transforms runtime trajectories into a normalized degradation target through the configured health\-index transform, while evaluation is reported with the per\-unit metric suite\. All families use disjoint train/validation/test entities according to their datasource definitions, with MZVAV as the only day\-stratified diagnostics exception\.

##### Models\.

The benchmark compares five model families: \(i\) simple baselines \(Linear, Exp, MLP\), \(ii\) deep sequence models \(LSTM, CNN\-1D, TiDE\), \(iii\) transformers \(TST, STF, CF, PTST\), \(iv\) tabular models \(XGBoost\), and \(v\) tabular foundation models \(TabPFN, TabDPT\)\. Linear is regression\-only and is therefore omitted from diagnostics; for diagnostics, a linear classifier serves the analogous baseline role\.

##### Training and evaluation\.

Every model–dataset configuration is repeated over five random seeds\. Gradient\-trained models search over $\texttt{seq\_len} \in \{1, 10, 50\}$ and $\texttt{lr} \in \{0.001, 0.0005, 0.0001\}$, with batch size 512, a maximum of 200 epochs, and early stopping\. Fit\-predict models search over context/stride pairs $(1,1)$, $(5,1)$, $(10,5)$, $(20,5)$, and $(50,50)$\. Diagnostics selects the best configuration on val/f1; prognostics \(generic RUL objective\) selects on val/loss\. Evaluation uses three evaluator families: classification \(diagnostics\), rul \(direct RUL regression\), and per\_unit \(battery and bearing families\)\.
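
The two search spaces and the selection rule can be written down as a short sketch\. The grid values and metric keys come from the text above; the function names and the `(config, metrics)` result format are hypothetical and do not reflect the framework's actual API\.

```python
from itertools import product

SEQ_LENS = [1, 10, 50]
LEARNING_RATES = [0.001, 0.0005, 0.0001]
CONTEXT_STRIDE_PAIRS = [(1, 1), (5, 1), (10, 5), (20, 5), (50, 50)]

def gradient_grid():
    """Full Cartesian grid for gradient-trained models (3 x 3 = 9 configs)."""
    return [{"seq_len": s, "lr": lr} for s, lr in product(SEQ_LENS, LEARNING_RATES)]

def fit_predict_grid():
    """Fit-predict models search the listed pairs only, not a Cartesian grid."""
    return [{"context": c, "stride": s} for c, s in CONTEXT_STRIDE_PAIRS]

def select_best(results, task):
    """Pick a config from (config, metrics) pairs.

    Diagnostics maximizes val/f1; prognostics minimizes val/loss.
    """
    if task == "diagnostics":
        return max(results, key=lambda r: r[1]["val/f1"])[0]
    return min(results, key=lambda r: r[1]["val/loss"])[0]
```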

### C\.2 Transformation schemas

All experiment families obey the same leakage\-prevention invariant: any fitted normalization or feature\-extraction statistic is estimated on the training partition only and then reused unchanged for validation and test\. The subsections below describe the preprocessing pipeline applied to each dataset group\. Across all families, the transformed tensors are subsequently windowed according to the sequence\-length and stride settings described in Section [C\.3](https://arxiv.org/html/2606.05481#A3.SS3)\.
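
A minimal sketch of the leakage\-prevention invariant, assuming a plain min\-max scaler\. The class name `TrainOnlyMinMax` is hypothetical and is not the framework's MinMaxScaler\.

```python
import numpy as np

class TrainOnlyMinMax:
    """Min-max scaler whose statistics come from the training partition only."""

    def fit(self, X_train):
        X_train = np.asarray(X_train, dtype=float)
        self.lo = X_train.min(axis=0)
        self.span = X_train.max(axis=0) - self.lo
        # Guard against constant columns to avoid division by zero.
        self.span[self.span == 0] = 1.0
        return self

    def transform(self, X):
        # Validation and test rows reuse the frozen training statistics,
        # so their scaled values may fall outside [0, 1].
        return (np.asarray(X, dtype=float) - self.lo) / self.span
```

Calling `fit` once on the training rows and `transform` on every partition keeps validation and test statistics out of the fitted parameters\.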

#### C\.2\.1 Battery datasets \(NB14 and UNIBO21\)

For both NB14 and UNIBO21, raw sensor channels and the ah\-RUL target are min\-max scaled \(MinMaxScaler\)\. Two descriptor branches are then extracted from the cycle traces: time\-domain statistics \(TimeStatsTransform\), namely mean, variance, kurtosis, peak factor, and related summaries, and frequency\-domain statistics via FFT \(SpectralStatsTransform\)\. The resulting descriptor vectors are concatenated \(ConcatenateTransform\) and re\-scaled with a final min\-max transform\. For fit\-predict models, the pipeline additionally tabularizes the processed history into a single feature vector for one\-shot inference \(TimeseriesTabularizer\)\. This pipeline corresponds to the combined/combined\_fit\_predict configuration group\.
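
The two descriptor branches can be sketched as follows\. The time\-domain statistics are the ones named above; the exact definitions \(non\-excess kurtosis, peak factor as peak over RMS\) and the choice of spectral summaries are assumptions, and the function names are illustrative rather than the framework's TimeStatsTransform or SpectralStatsTransform\.

```python
import numpy as np

def time_stats(x):
    """Mean, variance, kurtosis, and peak factor of a 1-D trace."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    var = x.var()
    centered = x - mean
    # Non-excess kurtosis: fourth central moment over squared variance.
    kurt = np.mean(centered ** 4) / var ** 2 if var > 0 else 0.0
    rms = np.sqrt(np.mean(x ** 2))
    # Peak (crest) factor: largest absolute amplitude relative to RMS.
    peak_factor = np.max(np.abs(x)) / rms if rms > 0 else 0.0
    return np.array([mean, var, kurt, peak_factor])

def spectral_stats(x):
    """Summaries of the one-sided FFT magnitude spectrum (assumed choice)."""
    mag = np.abs(np.fft.rfft(np.asarray(x, dtype=float)))
    return np.array([mag.mean(), mag.var(), mag.max()])

def descriptors(x):
    """Concatenate both branches into one descriptor vector."""
    return np.concatenate([time_stats(x), spectral_stats(x)])
```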

#### C\.2\.2 Bearing dataset \(XJTU\-SY\)

For XJTU\-SY, raw vibration channels are standardized with training\-partition statistics \(StandardScaler\)\. Runtime targets are converted into a normalized degradation trajectory via HealthIndexTransform\. An additional cumulative\-sum feature is computed and min\-max scaled, and time\-domain plus spectral descriptors \(TimeStatsTransform, SpectralStatsTransform\) are extracted from the standardized vibration signals\. The resulting features are concatenated and min\-max rescaled\. For fit\-predict models, the pipeline tabularizes the combined representation while preserving unit identifiers for per\-unit evaluation\. This pipeline corresponds to the combined/combined\_fit\_predict configuration group\.

#### C\.2\.3 N\-CMAPSS families

Sensor features and operating descriptors are standardized with fixed N\-CMAPSS scalers \(N\_CMAPSSDescriptorsScaler, StandardScaler\), and the RUL label is multiplied by the constant factor 0\.01 \(ConstantScaler\)\. Each flight is then temporally aggregated with non\-overlapping windows of 60 timesteps \(WindowedAggregationTransform\), and the aggregated sensor features and descriptors are concatenated into a single input representation\. For the multi\-source diagnostics family, the transform additionally builds unified concept classes from the source\-specific concept annotations\. Fit\-predict models tabularize the aggregated histories after the same scaling and concatenation stages\. This pipeline corresponds to the depater2023\_default/depater2023\_fit\_predict\_history configuration group\.
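
The aggregation and label scaling can be sketched as below\. The window length of 60 and the factor 0\.01 come from the text; using the mean as the aggregation statistic and dropping a trailing partial window are assumptions, and the function names are illustrative\.

```python
import numpy as np

def windowed_mean(signal, window=60):
    """Aggregate a (T,) or (T, C) flight into non-overlapping window means."""
    signal = np.asarray(signal, dtype=float)
    signal = signal.reshape(len(signal), -1)  # force shape (T, C)
    n_windows = len(signal) // window
    # Assumption: a trailing partial window is discarded.
    trimmed = signal[: n_windows * window]
    return trimmed.reshape(n_windows, window, signal.shape[1]).mean(axis=1)

def scale_rul(rul, factor=0.01):
    """Constant scaling of the RUL label."""
    return np.asarray(rul, dtype=float) * factor
```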

#### C\.2\.4 Building diagnostics \(MZVAV\)

No additional feature engineering is applied for MZVAV\. The pipeline rescales the sensor channels with a precomputed dataset\-specific min\-max scaler \(MinMaxScalerMZVAV\) and routes the scalar target directly to the fault\-classification key consumed by the diagnostics models\. For fit\-predict models, the pipeline tabularizes the resulting history windows without introducing a separate descriptor stage\. This pipeline corresponds to the default/fit\_predict\_history configuration group\.

#### C\.2\.5 Hydraulic diagnostics \(HSF15\)

All four HSF15 component tasks share the same preprocessing pipeline\. Features are min\-max scaled \(MinMaxScaler\), the component\-specific target is reduced to the last label in each window and assigned to the fault\-classification key, and the sensor burst is summarized through time\-domain statistics \(TimeStatsTransform\) and spectral statistics \(SpectralStatsTransform\), followed by concatenation and a final min\-max rescaling\. For fit\-predict models, the pipeline tabularizes the statistics\-based representation for XGBoost, TabPFN, and TabDPT\. This pipeline corresponds to the default/statistics\_fit\_predict configuration group\.

#### C\.2\.6 PHM challenge prognostics \(PHME20\)

Both features and the direct RUL target are min\-max scaled \(MinMaxScaler\), with no additional handcrafted feature extraction\. For fit\-predict models, the pipeline adds history tabularization for one\-shot inference\. This pipeline corresponds to the normalize\_feature\_target/normalize\_feature\_target\_fit\_predict configuration group\.
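
History tabularization can be pictured as sliding a window of `context` timesteps with a given `stride` over one unit's series and flattening each window into a row\. The sketch below is a hypothetical stand\-in for TimeseriesTabularizer; labeling each row with the target at the window's last timestep is an assumption \(the text states this rule explicitly only for HSF15\)\.

```python
import numpy as np

def tabularize(series, targets, context, stride):
    """Turn a (T, C) series into rows of context * C features."""
    series = np.asarray(series, dtype=float)
    series = series.reshape(len(series), -1)  # also accepts 1-D input
    targets = np.asarray(targets)
    rows, labels = [], []
    for end in range(context, len(series) + 1, stride):
        # Flatten the window so temporal order is encoded in column position.
        rows.append(series[end - context:end].reshape(-1))
        labels.append(targets[end - 1])
    if not rows:  # series shorter than one context window
        return np.empty((0, context * series.shape[1])), targets[:0]
    return np.stack(rows), np.asarray(labels)
```

With the pair $(1,1)$ every timestep becomes its own row; with $(50,50)$ the windows do not overlap\.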

### C\.3 Hyperparameter search

The benchmark exposes two hyperparameter\-search families: a gradient\-trained family for sequence models and simple neural baselines, and a fit\-predict family for tabular/foundation models including XGBoost\. Table [9](https://arxiv.org/html/2606.05481#A3.T9) summarizes the search spaces\.

Table 9: Hyperparameter search families used in the benchmark evaluation\.

| Search family | Models | Search space | Fixed settings | Selection rule |
| --- | --- | --- | --- | --- |
| Gradient\-trained | LSTM, 1D\-CNN, STF, Crossformer, Timeseries Transformer, TiDE, PatchTST, MLP, linear classifier, linear regression, exponential regression | seq\_len $\in \{1, 10, 50\}$; lr $\in \{0.001, 0.0005, 0.0001\}$ | Batch size 512; max 200 epochs; early stopping | val/f1 for diagnostics; val/loss for prognostics |
| Fit\-predict | XGBoost, TabPFN, TabDPT | Context/stride pairs $(1,1)$, $(5,1)$, $(10,5)$, $(20,5)$, $(50,50)$ | Five seeds; deterministic transform reuse; one\-shot fit/predict evaluation | val/f1 for diagnostics; val/loss for prognostics |

All successful runs persist the resolved configuration, Hydra override trace, run metadata, and reproduction instructions alongside the predictions and evaluator outputs\. The corresponding artifact protocol is described in Appendix [D](https://arxiv.org/html/2606.05481#A4)\.

### C\.4 Reading the result tables

The result tables in Sections [C\.5](https://arxiv.org/html/2606.05481#A3.SS5)–[C\.6](https://arxiv.org/html/2606.05481#A3.SS6) share a common reading convention, summarized here once so individual captions can stay terse\.

##### Dataset abbreviations\.

Column headers use short identifiers consistently across both the main\-text and appendix tables\. Their full meanings are:

| Abbreviation | Task | Source dataset \(domain\) |
| --- | --- | --- |
| NC\-DS02 | Prognostics | N\-CMAPSS DS02 \(turbofan engine RUL\) |
| NC\-P | Prognostics | N\-CMAPSS Multi\-source \(turbofan engine RUL\) |
| NB14 | Prognostics | NASA Randomized Battery Usage \(battery ah\-RUL\) |
| PHME20 | Prognostics | PHM 2020 Challenge \(industrial filtration RUL\) |
| Unibo | Prognostics | UNIBO Powertools \(battery ah\-RUL\) |
| XJTU\-SY | Prognostics | XJTU\-SY \(bearing degradation\) |
| NC\-D | Diagnostics | N\-CMAPSS Multi\-source \(turbofan concept classification\) |
| HSF15\-A | Diagnostics | HSF15 \(hydraulic accumulator, 4\-way\) |
| HSF15\-C | Diagnostics | HSF15 \(hydraulic cooler, 3\-way\) |
| HSF15\-P | Diagnostics | HSF15 \(hydraulic pump, 3\-way\) |
| HSF15\-V | Diagnostics | HSF15 \(hydraulic valve, 4\-way\) |
| MZVAV | Diagnostics | MZVAV \(multi\-zone HVAC fault, 4\-way\) |

##### Cell convention\.

Each cell reports mean ± std over five independent seeds\. Bold cells are the best in their column; underlined cells are the second best\. Rows are grouped by model family in the order: simple baselines, deep sequence models, transformers, tabular models, tabular foundation models\.

##### Metric scaling and direction\.

Diagnostics metrics \(F1, AUROC, Accuracy\) are reported on a 0–100 scale; prognostics MAE/MSE are reported both in the framework’s normalized target space \(reported $\times 100$\) and in original engineering units \(denormalized\) for practitioner interpretation\. Arrows in column headers indicate metric direction \(↓ lower is better, ↑ higher is better\)\. The *Avg rank* column is the mean of per\-task ranks across the columns of that table\.
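
The *Avg rank* computation can be sketched as follows\. The text specifies only that it is the mean of per\-task ranks; giving tied models the average of the ranks they span is an assumption, and `average_ranks` is an illustrative name\.

```python
import numpy as np

def average_ranks(scores, higher_is_better):
    """Mean per-task rank for a (n_models, n_tasks) score matrix."""
    scores = np.asarray(scores, dtype=float)
    # Flip the sign so that rank 1 always means "best".
    signed = -scores if higher_is_better else scores
    ranks = np.empty_like(signed)
    for j in range(signed.shape[1]):
        col = signed[:, j]
        order = np.argsort(col, kind="stable")
        r = np.empty(len(col))
        r[order] = np.arange(1, len(col) + 1)
        # Assumption: tied values share the mean of the ranks they span.
        for v in np.unique(col):
            tie = col == v
            r[tie] = r[tie].mean()
        ranks[:, j] = r
    return ranks.mean(axis=1)
```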

### C\.5 Diagnostics results

Diagnostics is evaluated by F1 \(the headline metric in Section [6](https://arxiv.org/html/2606.05481#S6)\) and complemented here by AUROC and Accuracy\. The three metrics agree on the top group: TabDPT, TabPFN, CNN\-1D, and XGBoost cluster within 1\.0 average\-rank of each other on F1, with LSTM joining the leading tier on F1 and Accuracy\. MZVAV \(multi\-zone HVAC fault classification under a day\-stratified split\) is the hardest task in every metric and the only family where the gap to chance is small\. The metric\-robustness check therefore supports the F1 choice in the main text\.

Table 10: F1 score on diagnostics \(↑\)\. Same numbers as the diagnostics floor of Table [2](https://arxiv.org/html/2606.05481#S6.T2), reproduced here with full per\-task breakdown\.

| Model | NC\-D | HSF15\-A | HSF15\-C | HSF15\-P | HSF15\-V | MZVAV | Average rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Linear | 72\.34 ± 3\.04 | 58\.75 ± 1\.81 | 98\.35 ± 0\.82 | 54\.40 ± 11\.97 | 32\.55 ± 2\.36 | 39\.89 ± 8\.77 | 7\.33 |
| MLP | 79\.49 ± 1\.80 | 91\.02 ± 2\.25 | 99\.91 ± 0\.14 | 97\.32 ± 0\.60 | 80\.99 ± 29\.38 | 60\.10 ± 6\.39 | 5\.00 |
| LSTM | 88\.84 ± 0\.73 | 94\.59 ± 0\.97 | 100\.00 ± 0\.00 | 95\.94 ± 2\.83 | 97\.35 ± 3\.32 | 51\.31 ± 6\.01 | 3\.83 |
| CNN\-1D | 87\.53 ± 2\.83 | 94\.03 ± 1\.98 | 100\.00 ± 0\.00 | 98\.73 ± 0\.52 | 97\.92 ± 0\.83 | 66\.11 ± 5\.74 | 3\.00 |
| TiDE | 32\.57 ± 4\.65 | 42\.90 ± 5\.07 | 61\.57 ± 10\.41 | 59\.37 ± 7\.68 | 42\.11 ± 14\.36 | 25\.19 ± 5\.29 | 8\.17 |
| TST | 26\.34 ± 3\.96 | 37\.37 ± 4\.38 | 59\.18 ± 10\.29 | 46\.07 ± 4\.13 | 35\.40 ± 5\.77 | 24\.93 ± 4\.38 | 10\.00 |
| STF | 24\.55 ± 3\.60 | 40\.01 ± 5\.75 | 65\.94 ± 20\.61 | 50\.39 ± 11\.54 | 37\.34 ± 5\.68 | 38\.01 ± 5\.56 | 8\.67 |
| CF | 23\.74 ± 0\.93 | 25\.04 ± 5\.28 | 59\.19 ± 10\.05 | 29\.46 ± 2\.61 | 23\.98 ± 2\.95 | 17\.34 ± 4\.24 | 11\.50 |
| PTST | 19\.57 ± 0\.26 | 31\.56 ± 4\.04 | 41\.57 ± 5\.18 | 41\.22 ± 6\.63 | 26\.20 ± 2\.44 | 25\.81 ± 4\.15 | 11\.00 |
| XGBoost | 48\.13 ± 0\.00 | 98\.07 ± 0\.00 | 100\.00 ± 0\.00 | 99\.66 ± 0\.00 | 99\.65 ± 0\.00 | 57\.08 ± 0\.00 | 3\.17 |
| TabPFN | 67\.15 ± 1\.46 | 99\.47 ± 0\.00 | 100\.00 ± 0\.00 | 100\.00 ± 0\.00 | 100\.00 ± 0\.00 | 58\.32 ± 2\.44 | 2\.33 |
| TabDPT | 85\.21 ± 0\.16 | 96\.66 ± 1\.03 | 100\.00 ± 0\.00 | 99\.06 ± 0\.25 | 98\.92 ± 0\.35 | 71\.29 ± 0\.48 | 2\.33 |

Table 11: AUROC on diagnostics \(↑\)\.

| Model | NC\-D | HSF15\-A | HSF15\-C | HSF15\-P | HSF15\-V | MZVAV | Average rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Linear | 93\.76 ± 1\.29 | 85\.10 ± 1\.35 | 99\.43 ± 0\.63 | 77\.41 ± 12\.99 | 64\.74 ± 4\.49 | 71\.47 ± 9\.58 | 7\.17 |
| MLP | 96\.31 ± 0\.67 | 98\.95 ± 0\.39 | 100\.00 ± 0\.00 | 99\.84 ± 0\.07 | 90\.02 ± 21\.11 | 85\.24 ± 2\.03 | 5\.17 |
| LSTM | 98\.41 ± 0\.16 | 99\.54 ± 0\.11 | 100\.00 ± 0\.00 | 99\.68 ± 0\.27 | 99\.88 ± 0\.17 | 86\.32 ± 4\.06 | 3\.33 |
| CNN\-1D | 98\.11 ± 0\.62 | 99\.44 ± 0\.19 | 100\.00 ± 0\.00 | 99\.94 ± 0\.06 | 99\.94 ± 0\.05 | 85\.45 ± 2\.29 | 3\.17 |
| TiDE | 62\.71 ± 3\.84 | 69\.00 ± 5\.66 | 81\.46 ± 5\.14 | 78\.78 ± 11\.24 | 64\.91 ± 11\.98 | 60\.77 ± 9\.35 | 8\.00 |
| TST | 57\.03 ± 2\.91 | 62\.99 ± 6\.37 | 79\.87 ± 9\.04 | 65\.30 ± 6\.07 | 61\.60 ± 5\.19 | 57\.44 ± 9\.01 | 9\.33 |
| STF | 55\.23 ± 3\.04 | 58\.44 ± 5\.10 | 81\.96 ± 14\.13 | 64\.97 ± 9\.06 | 59\.98 ± 15\.81 | 62\.61 ± 10\.35 | 9\.50 |
| CF | 52\.00 ± 0\.87 | 53\.48 ± 5\.54 | 73\.59 ± 8\.17 | 37\.35 ± 0\.53 | 51\.76 ± 0\.94 | 47\.64 ± 6\.33 | 11\.67 |
| PTST | 50\.01 ± 0\.24 | 58\.74 ± 1\.64 | 59\.83 ± 4\.32 | 62\.24 ± 5\.03 | 52\.84 ± 2\.86 | 52\.97 ± 5\.01 | 11\.17 |
| XGBoost | 82\.56 ± 0\.00 | 99\.94 ± 0\.00 | 100\.00 ± 0\.00 | 100\.00 ± 0\.00 | 100\.00 ± 0\.00 | 84\.51 ± 0\.00 | 3\.17 |
| TabPFN | 93\.97 ± 0\.31 | 100\.00 ± 0\.00 | 100\.00 ± 0\.00 | 100\.00 ± 0\.00 | 100\.00 ± 0\.00 | 83\.67 ± 0\.77 | 2\.67 |
| TabDPT | 97\.38 ± 0\.06 | 99\.91 ± 0\.05 | 100\.00 ± 0\.00 | 99\.99 ± 0\.00 | 99\.99 ± 0\.00 | 88\.57 ± 0\.73 | 2\.50 |

Table 12: Accuracy on diagnostics \(↑\)\.

| Model | NC\-D | HSF15\-A | HSF15\-C | HSF15\-P | HSF15\-V | MZVAV | Average rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Linear | 73\.79 ± 2\.35 | 61\.47 ± 1\.91 | 98\.36 ± 0\.82 | 63\.08 ± 14\.59 | 45\.93 ± 2\.01 | 61\.56 ± 4\.94 | 7\.33 |
| MLP | 79\.72 ± 1\.82 | 92\.39 ± 2\.06 | 99\.91 ± 0\.14 | 98\.05 ± 0\.41 | 85\.28 ± 22\.19 | 71\.54 ± 4\.90 | 5\.00 |
| LSTM | 88\.42 ± 0\.72 | 95\.42 ± 0\.82 | 100\.00 ± 0\.00 | 96\.96 ± 1\.94 | 97\.64 ± 3\.10 | 70\.71 ± 4\.77 | 3\.50 |
| CNN\-1D | 86\.91 ± 2\.91 | 95\.00 ± 1\.57 | 100\.00 ± 0\.00 | 99\.09 ± 0\.36 | 98\.38 ± 0\.67 | 77\.81 ± 3\.10 | 3\.00 |
| TiDE | 35\.30 ± 3\.73 | 49\.24 ± 10\.75 | 65\.39 ± 8\.98 | 70\.93 ± 3\.78 | 57\.08 ± 10\.30 | 30\.15 ± 7\.47 | 8\.50 |
| TST | 26\.80 ± 3\.37 | 43\.36 ± 8\.23 | 62\.53 ± 10\.12 | 52\.24 ± 6\.62 | 45\.19 ± 6\.74 | 45\.19 ± 9\.25 | 9\.83 |
| STF | 24\.82 ± 3\.12 | 47\.02 ± 9\.33 | 72\.30 ± 15\.62 | 64\.90 ± 9\.85 | 57\.59 ± 4\.35 | 50\.34 ± 9\.76 | 8\.50 |
| CF | 28\.90 ± 3\.18 | 30\.71 ± 5\.80 | 61\.83 ± 10\.46 | 35\.15 ± 4\.49 | 33\.06 ± 3\.25 | 27\.02 ± 10\.45 | 11\.17 |
| PTST | 21\.56 ± 0\.93 | 34\.41 ± 4\.03 | 42\.44 ± 4\.66 | 43\.36 ± 7\.01 | 31\.85 ± 5\.87 | 39\.70 ± 5\.78 | 11\.33 |
| XGBoost | 54\.75 ± 0\.00 | 98\.53 ± 0\.00 | 100\.00 ± 0\.00 | 99\.77 ± 0\.00 | 99\.77 ± 0\.00 | 62\.62 ± 0\.00 | 3\.33 |
| TabPFN | 70\.61 ± 1\.07 | 99\.58 ± 0\.00 | 100\.00 ± 0\.00 | 100\.00 ± 0\.00 | 100\.00 ± 0\.00 | 65\.55 ± 1\.96 | 2\.50 |
| TabDPT | 84\.74 ± 0\.16 | 97\.27 ± 0\.83 | 100\.00 ± 0\.00 | 99\.27 ± 0\.19 | 99\.17 ± 0\.26 | 80\.10 ± 0\.36 | 2\.33 |

##### Cross\-family observations\.

Across the three diagnostics metrics, the model families separate clearly\. Tabular foundation models \(TabDPT, TabPFN\) and XGBoost rank in the top tier even though they consume tabularized windows rather than raw sequences; this is most visible on HSF15 \(where many models reach 100 F1\), but it also holds on harder tasks such as NC\-D and MZVAV\. Deep sequence models split into a strong pair \(LSTM, CNN\-1D\) and a weaker TiDE variant under this training budget\. By contrast, the transformer family \(TST, STF, CF, PTST\) performs near chance on most diagnostics tasks, despite consuming identical inputs under the same protocol\. Simple baselines \(MLP, Linear\) remain competitive on several HSF15 components, reinforcing that hydraulic component diagnostics is comparatively easy relative to MZVAV and NC\-D\. Finally, the robustness check is consistent: F1, AUROC, and Accuracy induce nearly identical rankings, so the headline conclusions do not depend on the specific metric choice\.

### C\.6 Prognostics results

The prognostics metric surface is reported in three families\. Aggregate MAE and MSE \(normalized and denormalized\) are computed by pooling all predictions and all units before averaging\. Per\-unit aggregation, defined only on battery and bearing tasks where the framework’s per\_unit evaluator is active, computes one error per monitored unit before averaging across units; this tightens rankings among leading models and reduces the influence of long\-trajectory units that dominate window\-level pools\. Domain\-specific scores \(NASA, PHM\) are reported with the per\-task scoping enforced by the framework’s metric registry\.
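
The difference between pooled and per\-unit aggregation is purely one of averaging order, as the following sketch shows\. The function names are illustrative and do not correspond to the framework's per\_unit evaluator\.

```python
import numpy as np

def pooled_mae(y_true, y_pred):
    """Average absolute error over all windows of all units at once."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(err.mean())

def per_unit_mae(y_true, y_pred, unit_ids):
    """One MAE per unit first, then the mean across units."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    unit_ids = np.asarray(unit_ids)
    unit_means = [err[unit_ids == u].mean() for u in np.unique(unit_ids)]
    return float(np.mean(unit_means))
```

For a unit with four windows of error 1 and a unit with one window of error 6, the pooled MAE is 2\.0 while the per\-unit MAE is 3\.5: the long trajectory dominates the pool but counts only once per unit\.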

#### C\.6\.1 Aggregate errors

Tables [13](https://arxiv.org/html/2606.05481#A3.T13) and [14](https://arxiv.org/html/2606.05481#A3.T14) report MAE and MSE in a two\-block format\. The top block reports errors in the normalized target space \(reported $\times 100$\), which is used to compute cross\-task *Avg rank*; the bottom block reports the same predictions in original engineering units for practitioner interpretation\. Because MSE penalizes large residuals quadratically, it is sensitive to rare catastrophic predictions; we report it alongside MAE to make these failure modes visible\. It nevertheless preserves the headline ordering among the leading models\.
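
For a min\-max\-scaled target, the two blocks are related by the target range, since $|y - \hat{y}| = (y_{\max} - y_{\min})\,|y_{\text{norm}} - \hat{y}_{\text{norm}}|$\. The sketch below checks this identity numerically; the range values are made up for illustration, and families whose targets pass through other transforms \(such as the bearing health index\) need not follow this linear relation\.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))))

def denorm_minmax(y_norm, y_min, y_max):
    """Invert min-max scaling of the target."""
    return np.asarray(y_norm, dtype=float) * (y_max - y_min) + y_min

# Hypothetical target range, used only to illustrate the identity.
y_min, y_max = 0.0, 200.0
true_norm = np.array([0.5, 1.0])
pred_norm = np.array([0.4, 0.8])

mae_norm = mae(true_norm, pred_norm)
mae_orig = mae(denorm_minmax(true_norm, y_min, y_max),
               denorm_minmax(pred_norm, y_min, y_max))
```

The same reasoning explains why a target scaled by the constant 0\.01 and reported $\times 100$ shows identical values in both blocks, as the NC\-DS02 and NC\-P columns do\.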

Table 13: MAE on prognostics\. Top block: normalized target space \($\times 100$\) \(↓\); bottom block: original engineering units \(↓\)\. Same numbers as the prognostics floor of Table [2](https://arxiv.org/html/2606.05481#S6.T2) \(top\), reproduced with full per\-task breakdown\.

Normalized target space \(↓\)

| Model | NC\-DS02 | NC\-P | NB14 | PHME20 | Unibo | XJTU\-SY | Average rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Linear | 10\.13 ± 0\.14 | 16\.11 ± 0\.60 | 41\.69 ± 12\.02 | 12\.19 ± 0\.36 | 27\.59 ± 14\.36 | 76\.80 ± 60\.41 | 12\.50 |
| Exp | 5\.35 ± 0\.06 | 10\.96 ± 0\.09 | 30\.47 ± 47\.76 | 8\.82 ± 0\.52 | 12\.19 ± 0\.31 | 27\.22 ± 4\.06 | 9\.67 |
| MLP | 6\.37 ± 0\.23 | 13\.17 ± 0\.78 | 14\.38 ± 9\.77 | 4\.62 ± 1\.15 | 12\.50 ± 0\.76 | 30\.64 ± 2\.67 | 10\.33 |
| LSTM | 4\.93 ± 0\.13 | 7\.56 ± 0\.31 | 3\.80 ± 0\.22 | 3\.73 ± 0\.98 | 6\.50 ± 0\.16 | 21\.89 ± 0\.40 | 3\.67 |
| CNN\-1D | 5\.33 ± 0\.37 | 7\.53 ± 0\.22 | 8\.89 ± 1\.70 | 5\.35 ± 3\.71 | 12\.41 ± 1\.15 | 31\.02 ± 8\.25 | 8\.67 |
| TiDE | 5\.29 ± 0\.22 | 7\.62 ± 0\.20 | 3\.44 ± 0\.17 | 4\.20 ± 0\.66 | 6\.46 ± 0\.78 | 25\.11 ± 2\.38 | 5\.17 |
| TST | 5\.31 ± 0\.13 | 7\.02 ± 0\.17 | 6\.28 ± 0\.25 | 4\.11 ± 0\.84 | 7\.23 ± 0\.39 | 33\.30 ± 7\.72 | 7\.00 |
| STF | 4\.89 ± 0\.10 | 7\.35 ± 1\.16 | 10\.67 ± 3\.16 | 3\.91 ± 1\.00 | 8\.89 ± 0\.81 | 28\.49 ± 4\.01 | 6\.17 |
| CF | 5\.76 ± 0\.51 | 9\.98 ± 0\.57 | 3\.57 ± 0\.07 | 3\.87 ± 0\.85 | 5\.58 ± 1\.08 | 22\.09 ± 1\.06 | 5\.00 |
| PTST | 16\.62 ± 0\.04 | 21\.55 ± 0\.03 | 5\.22 ± 0\.10 | 15\.09 ± 1\.13 | 11\.18 ± 1\.11 | 25\.42 ± 1\.48 | 10\.33 |
| XGBoost | 8\.52 ± 0\.00 | 15\.24 ± 0\.00 | 4\.48 ± 0\.00 | 2\.68 ± 0\.00 | 4\.06 ± 0\.00 | 24\.59 ± 0\.00 | 6\.50 |
| TabPFN | 4\.96 ± 0\.04 | 7\.79 ± 0\.04 | 3\.91 ± 0\.03 | 1\.95 ± 0\.03 | 3\.72 ± 0\.06 | 22\.27 ± 0\.35 | 3\.33 |
| TabDPT | 5\.07 ± 0\.06 | 6\.85 ± 0\.02 | 3\.63 ± 0\.04 | 2\.19 ± 0\.01 | 3\.94 ± 0\.05 | 23\.24 ± 0\.45 | 2\.67 |

Original engineering units \(↓\)

| Model | NC\-DS02 | NC\-P | NB14 | PHME20 | Unibo | XJTU\-SY | Average rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Linear | 10\.13 ± 0\.14 | 16\.11 ± 0\.60 | 451\.64 ± 130\.15 | 43\.13 ± 1\.27 | 135\.80 ± 70\.68 | 907\.40 ± 774\.36 | 12\.50 |
| Exp | 5\.35 ± 0\.06 | 10\.96 ± 0\.09 | 330\.02 ± 517\.36 | 31\.21 ± 1\.85 | 60\.02 ± 1\.51 | 349\.71 ± 60\.98 | 9\.67 |
| MLP | 6\.37 ± 0\.23 | 13\.17 ± 0\.78 | 155\.77 ± 105\.84 | 16\.35 ± 4\.06 | 61\.51 ± 3\.72 | 384\.95 ± 40\.56 | 10\.50 |
| LSTM | 4\.93 ± 0\.13 | 7\.56 ± 0\.31 | 41\.13 ± 2\.34 | 13\.19 ± 3\.47 | 32\.00 ± 0\.77 | 271\.33 ± 3\.45 | 3\.67 |
| CNN\-1D | 5\.33 ± 0\.37 | 7\.53 ± 0\.22 | 96\.25 ± 18\.38 | 18\.95 ± 13\.12 | 61\.10 ± 5\.66 | 380\.05 ± 96\.92 | 8\.50 |
| TiDE | 5\.29 ± 0\.22 | 7\.62 ± 0\.20 | 37\.22 ± 1\.80 | 14\.86 ± 2\.35 | 31\.78 ± 3\.86 | 320\.41 ± 24\.80 | 5\.33 |
| TST | 5\.31 ± 0\.13 | 7\.02 ± 0\.17 | 68\.00 ± 2\.71 | 14\.56 ± 2\.96 | 35\.60 ± 1\.94 | 424\.88 ± 100\.25 | 7\.00 |
| STF | 4\.89 ± 0\.10 | 7\.35 ± 1\.16 | 115\.61 ± 34\.23 | 13\.83 ± 3\.53 | 43\.76 ± 4\.01 | 361\.41 ± 54\.88 | 6\.17 |
| CF | 5\.76 ± 0\.51 | 9\.98 ± 0\.57 | 38\.70 ± 0\.80 | 13\.71 ± 3\.02 | 27\.48 ± 5\.31 | 282\.20 ± 19\.10 | 5\.00 |
| PTST | 16\.62 ± 0\.04 | 21\.55 ± 0\.03 | 56\.53 ± 1\.09 | 53\.40 ± 3\.99 | 55\.02 ± 5\.47 | 312\.68 ± 23\.46 | 10\.17 |
| XGBoost | 8\.52 ± 0\.00 | 15\.24 ± 0\.00 | 48\.54 ± 0\.00 | 9\.48 ± 0\.00 | 19\.98 ± 0\.00 | 311\.19 ± 0\.00 | 6\.50 |
| TabPFN | 4\.96 ± 0\.04 | 7\.79 ± 0\.04 | 42\.32 ± 0\.28 | 6\.89 ± 0\.09 | 18\.33 ± 0\.30 | 292\.43 ± 5\.26 | 3\.50 |
| TabDPT | 5\.07 ± 0\.06 | 6\.85 ± 0\.02 | 39\.35 ± 0\.47 | 7\.75 ± 0\.04 | 19\.42 ± 0\.24 | 282\.83 ± 5\.84 | 2\.50 |

Table 14: MSE on prognostics\. Top block: normalized target space \($\times 100$\) \(↓\); bottom block: original engineering units \(↓\)\. MSE redistributes weight onto large per\-window errors but preserves the leading\-model ranking from MAE\.

Normalized target space \(↓\)

| Model | NC\-DS02 | NC\-P | NB14 | PHME20 | Unibo | XJTU\-SY | Average rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Linear | 1\.37 ± 0\.04 | 3\.95 ± 0\.38 | 32\.16 ± 16\.14 | 2\.13 ± 0\.10 | 17\.83 ± 17\.83 | 107\.37 ± 128\.29 | 12\.50 |
| Exp | 0\.47 ± 0\.01 | 1\.92 ± 0\.03 | 28\.60 ± 61\.10 | 1\.18 ± 0\.13 | 2\.59 ± 0\.08 | 11\.21 ± 3\.67 | 9\.50 |
| MLP | 0\.76 ± 0\.05 | 2\.77 ± 0\.35 | 5\.09 ± 6\.91 | 0\.35 ± 0\.15 | 2\.91 ± 0\.51 | 14\.80 ± 3\.26 | 10\.50 |
| LSTM | 0\.43 ± 0\.02 | 1\.04 ± 0\.09 | 0\.27 ± 0\.03 | 0\.23 ± 0\.11 | 1\.38 ± 0\.07 | 6\.80 ± 0\.22 | 3\.83 |
| CNN\-1D | 0\.48 ± 0\.05 | 1\.01 ± 0\.06 | 1\.31 ± 0\.60 | 0\.60 ± 0\.75 | 2\.54 ± 0\.48 | 14\.34 ± 8\.16 | 8\.33 |
| TiDE | 0\.47 ± 0\.04 | 1\.09 ± 0\.04 | 0\.22 ± 0\.01 | 0\.28 ± 0\.10 | 1\.34 ± 0\.21 | 9\.68 ± 1\.58 | 5\.00 |
| TST | 0\.46 ± 0\.02 | 0\.91 ± 0\.05 | 0\.71 ± 0\.05 | 0\.29 ± 0\.11 | 1\.47 ± 0\.20 | 16\.52 ± 6\.55 | 6\.83 |
| STF | 0\.41 ± 0\.02 | 0\.99 ± 0\.29 | 1\.92 ± 1\.15 | 0\.26 ± 0\.16 | 1\.92 ± 0\.34 | 13\.79 ± 4\.66 | 6\.17 |
| CF | 0\.57 ± 0\.06 | 1\.83 ± 0\.18 | 0\.24 ± 0\.02 | 0\.26 ± 0\.10 | 0\.82 ± 0\.34 | 7\.50 ± 1\.02 | 5\.00 |
| PTST | 3\.71 ± 0\.02 | 6\.34 ± 0\.02 | 0\.45 ± 0\.02 | 3\.58 ± 0\.46 | 2\.06 ± 0\.41 | 9\.80 ± 1\.67 | 10\.33 |
| XGBoost | 1\.02 ± 0\.00 | 3\.55 ± 0\.00 | 0\.33 ± 0\.00 | 0\.13 ± 0\.00 | 0\.61 ± 0\.00 | 9\.04 ± 0\.00 | 6\.17 |
| TabPFN | 0\.44 ± 0\.01 | 1\.14 ± 0\.01 | 0\.27 ± 0\.00 | 0\.06 ± 0\.00 | 0\.72 ± 0\.03 | 7\.55 ± 0\.22 | 3\.50 |
| TabDPT | 0\.50 ± 0\.01 | 0\.90 ± 0\.01 | 0\.25 ± 0\.01 | 0\.10 ± 0\.00 | 0\.69 ± 0\.01 | 8\.63 ± 0\.53 | 3\.33 |

Original engineering units \(↓\)

| Model | NC\-DS02 | NC\-P | NB14 | PHME20 | Unibo | XJTU\-SY | Average rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Linear | 136\.91 ± 4\.13 | 394\.76 ± 38\.18 | 377375\.42 ± 189407\.02 | 2662\.38 ± 123\.32 | 43203\.12 ± 43216\.43 | 1877895\.28 ± 2475205\.90 | 12\.50 |
| Exp | 47\.10 ± 1\.17 | 191\.83 ± 2\.71 | 335612\.34 ± 716896\.98 | 1482\.75 ± 160\.37 | 6278\.27 ± 188\.56 | 228864\.76 ± 83234\.84 | 9\.50 |
| MLP | 75\.94 ± 5\.37 | 276\.90 ± 34\.56 | 59680\.85 ± 81102\.09 | 444\.59 ± 194\.00 | 7060\.46 ± 1246\.54 | 287760\.52 ± 74993\.08 | 10\.50 |
| LSTM | 43\.44 ± 2\.21 | 104\.08 ± 8\.97 | 3208\.61 ± 369\.69 | 293\.94 ± 136\.27 | 3344\.16 ± 173\.37 | 123688\.33 ± 2403\.85 | 3\.83 |
| CNN\-1D | 48\.09 ± 4\.82 | 101\.08 ± 5\.83 | 15328\.28 ± 7016\.36 | 749\.70 ± 938\.67 | 6162\.49 ± 1160\.94 | 254182\.17 ± 130965\.65 | 8\.17 |
| TiDE | 46\.60 ± 3\.56 | 109\.02 ± 4\.13 | 2633\.01 ± 107\.72 | 347\.09 ± 128\.23 | 3238\.60 ± 518\.75 | 194915\.73 ± 28818\.57 | 5\.17 |
| TST | 46\.15 ± 2\.03 | 91\.22 ± 4\.96 | 8293\.91 ± 557\.49 | 358\.65 ± 139\.75 | 3565\.53 ± 491\.39 | 331603\.13 ± 135837\.29 | 6\.83 |
| STF | 41\.31 ± 2\.25 | 99\.50 ± 29\.06 | 22492\.01 ± 13489\.49 | 329\.53 ± 200\.78 | 4641\.65 ± 816\.21 | 275295\.87 ± 99592\.97 | 6\.33 |
| CF | 56\.62 ± 6\.20 | 182\.98 ± 17\.52 | 2873\.02 ± 197\.96 | 320\.71 ± 120\.56 | 1986\.17 ± 825\.15 | 149462\.60 ± 25698\.73 | 5\.00 |
| PTST | 370\.79 ± 2\.11 | 634\.40 ± 2\.30 | 5338\.39 ± 225\.45 | 4483\.61 ± 577\.87 | 4999\.49 ± 987\.11 | 182128\.52 ± 38677\.98 | 10\.17 |
| XGBoost | 102\.50 ± 0\.00 | 355\.02 ± 0\.00 | 3823\.81 ± 0\.00 | 161\.62 ± 0\.00 | 1484\.77 ± 0\.00 | 176032\.70 ± 0\.00 | 6\.17 |
| TabPFN | 44\.17 ± 0\.59 | 114\.50 ± 1\.47 | 3121\.41 ± 41\.35 | 79\.84 ± 1\.56 | 1744\.71 ± 73\.12 | 158581\.64 ± 4950\.91 | 3\.50 |
| TabDPT | 50\.30 ± 0\.90 | 90\.43 ± 0\.62 | 2875\.42 ± 67\.81 | 121\.85 ± 2\.64 | 1675\.44 ± 32\.22 | 163786\.30 ± 10020\.04 | 3\.33 |

##### Cross\-family observations\.

Prognostics rankings are more compressed than diagnostics\. TabDPT, TabPFN, and LSTM lead on combined rank, but CF, TiDE, and STF remain within a few rank points, and TiDE and STF each take a column\-best on normalized MAE \(NB14 and NC\-DS02, respectively\)\. Two patterns stand out\. First, the transformer family that collapses on diagnostics is competitive on prognostics \(e\.g\., STF is best on NC\-DS02 and TST is near the top on NC\-P\), suggesting a task\-specific failure mode rather than an architecture\-wide limitation\. Second, simple baselines \(Linear, Exp\) degrade more on prognostics than on diagnostics: Linear has the worst average rank on normalized MAE \(12\.50\), and Exp also stays in the lower tier \(9\.67\)\. Switching from MAE to MSE tightens the leading group and penalizes rare catastrophic errors, but it does not change the top of the leaderboard\.

#### C\.6\.2 Per\-unit aggregation \(battery and bearing families\)

Battery \(NB14, UNIBO21\) and bearing \(XJTU\-SY\) prognostics are evaluated trajectory\-level rather than window\-level: the framework’s per\_unit evaluator computes one error per monitored unit and then aggregates\. Table [15](https://arxiv.org/html/2606.05481#A3.T15) reports the per\-unit\-mean MAE on these three families in the same two\-floor form \(normalized top, denormalized bottom\)\. The values are produced from the same predictions as the aggregate tables above; the difference is purely in the aggregation order \(per\-unit\-then\-mean vs\. pooled\)\. The corresponding MSE per\-unit\-mean variants are omitted from the appendix as they preserve the same ranking with no additional insight\.

Table 15: Per\-unit\-mean MAE on battery and bearing prognostics\. Top block: normalized target space \($\times 100$\) \(↓\); bottom block: original engineering units \(↓\)\. Computed by the per\_unit evaluator: one error per monitored unit, then averaged across units\.

Normalized target space \(↓\)

| Model | NB14 | Unibo | XJTU\-SY | Average rank |
| --- | --- | --- | --- | --- |
| Linear | 38\.58 ± 9\.96 | 31\.66 ± 21\.46 | 75\.37 ± 47\.26 | 13\.00 |
| Exp | 30\.74 ± 48\.55 | 10\.93 ± 0\.18 | 21\.38 ± 1\.44 | 10\.00 |
| MLP | 14\.83 ± 9\.66 | 13\.88 ± 1\.78 | 25\.21 ± 2\.73 | 11\.33 |
| LSTM | 4\.49 ± 0\.32 | 7\.02 ± 0\.71 | 18\.57 ± 0\.50 | 4\.33 |
| CNN\-1D | 9\.12 ± 1\.43 | 11\.74 ± 1\.15 | 26\.13 ± 6\.63 | 10\.67 |
| TiDE | 4\.02 ± 0\.12 | 6\.30 ± 0\.64 | 21\.28 ± 2\.55 | 4\.33 |
| TST | 7\.20 ± 0\.26 | 7\.39 ± 1\.18 | 25\.08 ± 6\.06 | 8\.33 |
| STF | 10\.16 ± 3\.03 | 8\.10 ± 0\.72 | 21\.44 ± 2\.55 | 9\.00 |
| CF | 4\.14 ± 0\.22 | 5\.43 ± 1\.01 | 18\.87 ± 0\.43 | 3\.67 |
| PTST | 5\.57 ± 0\.15 | 10\.56 ± 0\.97 | 21\.13 ± 1\.16 | 7\.33 |
| XGBoost | 4\.82 ± 0\.00 | 3\.59 ± 0\.00 | 19\.20 ± 0\.00 | 4\.33 |
| TabPFN | 4\.07 ± 0\.02 | 3\.25 ± 0\.07 | 17\.83 ± 0\.13 | 1\.33 |
| TabDPT | 4\.09 ± 0\.08 | 3\.57 ± 0\.15 | 20\.09 ± 0\.30 | 3\.33 |

Original engineering units \(↓\)

| Model | NB14 | Unibo | XJTU\-SY | Average rank |
| --- | --- | --- | --- | --- |
| Linear | 417\.94 ± 107\.84 | 155\.85 ± 105\.63 | 434\.14 ± 341\.49 | 13\.00 |
| Exp | 333\.02 ± 525\.87 | 53\.79 ± 0\.91 | 153\.84 ± 22\.97 | 10\.00 |
| MLP | 160\.68 ± 104\.69 | 68\.31 ± 8\.74 | 173\.19 ± 15\.11 | 11\.00 |
| LSTM | 48\.67 ± 3\.42 | 34\.55 ± 3\.51 | 123\.73 ± 2\.27 | 4\.00 |
| CNN\-1D | 98\.76 ± 15\.52 | 57\.78 ± 5\.68 | 175\.36 ± 46\.62 | 10\.33 |
| TiDE | 43\.52 ± 1\.29 | 31\.02 ± 3\.17 | 141\.91 ± 13\.43 | 4\.00 |
| TST | 77\.96 ± 2\.77 | 36\.37 ± 5\.79 | 188\.24 ± 43\.65 | 9\.00 |
| STF | 110\.11 ± 32\.79 | 39\.87 ± 3\.52 | 161\.01 ± 22\.65 | 9\.00 |
| CF | 44\.84 ± 2\.43 | 26\.75 ± 4\.98 | 124\.85 ± 6\.01 | 3\.33 |
| PTST | 60\.29 ± 1\.65 | 51\.99 ± 4\.78 | 143\.67 ± 8\.34 | 7\.67 |
| XGBoost | 52\.24 ± 0\.00 | 17\.68 ± 0\.00 | 138\.98 ± 0\.00 | 4\.67 |
| TabPFN | 44\.10 ± 0\.25 | 16\.00 ± 0\.32 | 125\.89 ± 1\.97 | 2\.00 |
| TabDPT | 44\.26 ± 0\.83 | 17\.58 ± 0\.72 | 131\.36 ± 2\.54 | 3\.00 |

Per\-unit aggregation tightens the gap among the top three \(TabDPT, TabPFN, LSTM\) on NB14 and UNIBO21, where pooled\-MAE differences were inflated by long\-trajectory units, and reorders the middle of the table on XJTU\-SY\. The headline that tabular foundation models lead on average rank is unchanged\.

#### C\.6\.3 Domain\-specific prognostic scores

Two community\-standard scores are reported on the families they apply to\. The *NASA score* is defined for direct\-RUL targets and is reported on the N\-CMAPSS families \(NC\-DS02, NC\-P\) and PHME20; it asymmetrically penalizes late predictions\. The *PHM score* is the bearing/battery\-prognostics convention and is reported on NB14, UNIBO21, and XJTU\-SY\. Per\-task scoping is enforced by the framework’s metric registry \(Section [4\.3](https://arxiv.org/html/2606.05481#S4.SS3)\); the two scores are presented together in Table [16](https://arxiv.org/html/2606.05481#A3.T16) as two blocks of one table because they apply to disjoint family sets\.

Table 16: Domain\-specific prognostic scores\. Top block: NASA score on direct\-RUL families \(NC\-DS02, NC\-P, PHME20; ↓\); bottom block: PHM score \($\times 100$\) on battery and bearing prognostics \(NB14, UNIBO21, XJTU\-SY; ↑\)\. Per\-task scoping is enforced by the framework’s metric registry\. Note that the two scores apply to disjoint family sets and use opposite directions\.

NASA score on direct\-RUL families \(↓\)

| Model | NC\-DS02 | NC\-P | PHME20 | Average rank |
| --- | --- | --- | --- | --- |
| Linear | 2\.03 ± 0\.07 | 7\.90 ± 2\.31 | 3044\.09 ± 2526\.30 | 11\.00 |
| Exp | 0\.85 ± 0\.02 | 2\.69 ± 0\.04 | 229\.49 ± 39\.40 | 6\.67 |
| MLP | 2\.53 ± 1\.62 | 4910\.51 ± 10876\.24 | 481046\.65 ± 1027047\.25 | 12\.33 |
| LSTM | 0\.81 ± 0\.04 | 1\.46 ± 0\.13 | 8\.50 ± 5\.84 | 4\.00 |
| CNN\-1D | 0\.87 ± 0\.09 | 1\.44 ± 0\.09 | 2709\.03 ± 5992\.51 | 7\.00 |
| TiDE | 0\.86 ± 0\.06 | 1\.69 ± 0\.06 | 11\.82 ± 9\.43 | 6\.00 |
| TST | 0\.85 ± 0\.03 | 1\.26 ± 0\.09 | 12\.64 ± 7\.24 | 4\.00 |
| STF | 0\.78 ± 0\.03 | 1\.30 ± 0\.33 | 21\.05 ± 30\.49 | 3\.33 |
| CF | 1\.01 ± 0\.11 | 2\.99 ± 0\.18 | 2215\.75 ± 4653\.64 | 9\.00 |
| PTST | 6\.11 ± 0\.12 | 10\.69 ± 0\.04 | 19528004\.00 ± 11187183\.05 | 12\.67 |
| XGBoost | 1\.62 ± 0\.00 | 6\.10 ± 0\.00 | 3\.95 ± 0\.00 | 7\.33 |
| TabPFN | 0\.80 ± 0\.01 | 1\.49 ± 0\.03 | 1\.21 ± 0\.02 | 3\.00 |
| TabDPT | 0\.91 ± 0\.02 | 1\.31 ± 0\.01 | 5\.94 ± 2\.04 | 4\.67 |

PHM score on battery and bearing prognostics \(↑\)

| Model | NB14 | Unibo | XJTU\-SY | Average rank |
| --- | --- | --- | --- | --- |
| Linear | 5\.38 ± 1\.51 | 5\.85 ± 2\.55 | 11\.40 ± 10\.24 | 13\.00 |
| Exp | 14\.45 ± 8\.13 | 9\.91 ± 0\.30 | 19\.15 ± 4\.44 | 10\.67 |
| MLP | 15\.46 ± 6\.30 | 10\.85 ± 1\.11 | 18\.27 ± 2\.20 | 10\.00 |
| LSTM | 31\.28 ± 2\.21 | 14\.74 ± 0\.46 | 24\.83 ± 0\.78 | 3\.00 |
| CNN\-1D | 18\.29 ± 2\.08 | 10\.70 ± 1\.32 | 18\.36 ± 4\.77 | 10\.00 |
| TiDE | 36\.28 ± 1\.72 | 14\.63 ± 2\.39 | 22\.46 ± 2\.77 | 4\.33 |
| TST | 24\.31 ± 0\.92 | 13\.71 ± 1\.41 | 14\.89 ± 7\.81 | 8\.67 |
| STF | 14\.34 ± 6\.88 | 10\.79 ± 1\.27 | 21\.58 ± 3\.29 | 9\.67 |
| CF | 33\.45 ± 1\.31 | 17\.21 ± 1\.78 | 24\.72 ± 1\.78 | 2\.67 |
| PTST | 24\.17 ± 0\.71 | 12\.57 ± 1\.66 | 23\.41 ± 1\.61 | 7\.00 |
| XGBoost | 24\.49 ± 0\.00 | 20\.24 ± 0\.00 | 20\.14 ± 0\.00 | 5\.67 |
| TabPFN | 26\.03 ± 0\.28 | 21\.59 ± 0\.41 | 24\.40 ± 0\.63 | 3\.33 |
| TabDPT | 29\.16 ± 0\.40 | 20\.35 ± 0\.73 | 24\.42 ± 1\.01 | 3\.00 |

Domain\-specific scores mostly track MAE/MSE, but they surface two important failure modes\. On the NASA score, a small number of catastrophically late predictions on PHME20 causes MLP and PTST to degrade by orders of magnitude under the asymmetric penalty, a pattern that MAE/MSE can hide\. On the PHM score, CF achieves the best average rank \(2\.67\) despite being mid\-tier on MAE \(average rank 5\.00\), reflecting the score’s emphasis on end\-of\-life behavior\. Reporting symmetric errors alongside domain scores therefore highlights behaviors that any single metric would miss\.
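
The asymmetry can be illustrated with the scoring function commonly used for the C\-MAPSS benchmarks, which penalizes late predictions \(predicted RUL above the true RUL\) with a steeper exponential than early ones\. The constants 13 and 10 are the usual C\-MAPSS values; whether the framework uses exactly this form, and whether it averages rather than sums over samples, is an assumption\.

```python
import numpy as np

def nasa_score(rul_true, rul_pred, a_early=13.0, a_late=10.0):
    """Asymmetric exponential RUL score (mean over samples; assumed variant)."""
    d = np.asarray(rul_pred, dtype=float) - np.asarray(rul_true, dtype=float)
    # d < 0: early prediction, milder penalty; d >= 0: late, steeper penalty.
    penalties = np.where(d < 0, np.exp(-d / a_early) - 1.0, np.exp(d / a_late) - 1.0)
    return float(np.mean(penalties))
```

Because the penalty grows exponentially in the error, a handful of very late predictions can dominate the score, which is consistent with the blow\-ups seen for MLP and PTST on PHME20\.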

### C\.7 Data\-Efficiency Scaling

This section supports the main\-paper claim that data efficiency depends not only on the number of available training windows, but also on whether the context covers the relevant operating regimes, degradation stages, and diagnostic classes\. The additional scaling figure compares aggregate and blockwise subsampling for PHME20, Unibo, and MZVAV\. It extends the main\-results discussion by showing when a small context remains representative and when blockwise subsampling removes important trajectory segments or class coverage\.

![Refer to caption](https://arxiv.org/html/2606.05481v1/x15.png) \(a\) PHME20
![Refer to caption](https://arxiv.org/html/2606.05481v1/x16.png) \(b\) PHME20, blockwise
![Refer to caption](https://arxiv.org/html/2606.05481v1/x17.png) \(c\) Unibo
![Refer to caption](https://arxiv.org/html/2606.05481v1/x18.png) \(d\) Unibo, blockwise
![Refer to caption](https://arxiv.org/html/2606.05481v1/x19.png) \(e\) MZVAV
![Refer to caption](https://arxiv.org/html/2606.05481v1/x20.png) \(f\) MZVAV, blockwise

Figure 8: Data\-efficiency scaling under aggregate and blockwise context subsampling for PHME20, Unibo, and MZVAV\. Each row compares aggregate random subsampling with contiguous blockwise subsampling, highlighting when small contexts preserve or lose coverage of trajectories or diagnostic classes\.
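
The two subsampling modes can be contrasted with a small sketch\. It assumes that aggregate subsampling draws rows uniformly without replacement and that blockwise subsampling keeps a single contiguous block; both function names and the single\-block choice are illustrative assumptions\.

```python
import numpy as np

def aggregate_subsample(n_rows, n_keep, rng):
    """Uniform random subset of row indices, returned in sorted order."""
    return np.sort(rng.choice(n_rows, size=n_keep, replace=False))

def blockwise_subsample(n_rows, n_keep, rng):
    """One contiguous block of n_keep rows starting at a random offset."""
    start = rng.integers(0, n_rows - n_keep + 1)
    return np.arange(start, start + n_keep)

rng = np.random.default_rng(0)
# Toy context ordered by class: four classes with 25 rows each.
labels = np.repeat([0, 1, 2, 3], 25)
block = blockwise_subsample(len(labels), 20, rng)
random_rows = aggregate_subsample(len(labels), 20, rng)
classes_in_block = np.unique(labels[block])
```

In this toy ordering a block of 20 rows can touch at most two of the four classes, whereas random rows are spread over the whole context; this is the coverage loss that the blockwise panels highlight\.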

## Appendix D Reproducibility

### D\.1 Framework\-Based Execution

All reported experiments are executed through the PICID evaluation framework \[[52](https://arxiv.org/html/2606.05481#bib.bib52)\]\. In this paper, reproducibility therefore means that the benchmark choices described in the method and dataset appendices are fixed as executable experiment configurations rather than reconstructed manually from prose\.

### D\.2 Configuration\-Fixed Experiments

Each experiment is wrapped in a resolved configuration that fixes the datasource, split, transform pipeline, target construction, sequence\-slicing parameters, tabularization rule, model family, hyperparameters, evaluator, seed, and validation\-selection rule\. These configuration choices define the experiment from datasource construction to metric computation, so a reported number is tied to a concrete protocol instance\.

The run record should identify:

- the resolved configuration, including command\-line overrides;
- the code revision and dependency environment;
- the dataset source or local path configuration used for execution;
- the saved evaluation summaries, predictions, and plots used in the manuscript\.

### D\.3 Shared Preprocessing and Evaluation Boundaries

The same configured preprocessing and evaluation boundaries are used across the model families compared in the paper\. Dataset splits are reported in the dataset\-protocol section above, model selection uses validation data, and final metrics are computed on the held\-out test partition\.

The leakage constraints are those defined in the formal protocol\. Parameters fitted by feature transformations $\mathcal{G}$, target transformations $\mathcal{H}$, target\-alignment choices inside $\widetilde{\mathcal{H}}$, normalization statistics, and tabularization choices are determined only from the training partition\. Once fitted, these choices are frozen and applied unchanged to validation and test partitions\.
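The fit-on-train, freeze, then apply pattern can be sketched with a simple standardizer. The `FrozenStandardizer` class and the synthetic arrays below are assumptions for illustration; the paper's transformations are more general than per-feature standardization.

```python
import numpy as np

class FrozenStandardizer:
    """Hypothetical transform: statistics come from the training partition only."""

    def fit(self, x_train):
        self.mean_ = x_train.mean(axis=0)
        self.std_ = x_train.std(axis=0)
        self.std_[self.std_ == 0] = 1.0  # guard against constant features
        return self

    def transform(self, x):
        # Uses the frozen training statistics; nothing is re-estimated here.
        return (x - self.mean_) / self.std_

rng = np.random.default_rng(0)
x_train = rng.normal(5.0, 2.0, size=(200, 3))
x_val = rng.normal(6.0, 2.0, size=(50, 3))   # synthetic, deliberately shifted
x_test = rng.normal(7.0, 2.0, size=(50, 3))  # synthetic, deliberately shifted

scaler = FrozenStandardizer().fit(x_train)
z_train, z_val, z_test = (scaler.transform(x) for x in (x_train, x_val, x_test))
```

Because the statistics are frozen, the shifted validation and test partitions are not re-centered to zero mean; refitting on them would hide that shift and leak held-out information into preprocessing.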

For tabular in\-context models, labeled context examples are drawn only from the training partition\. For intra\-unit temporal splits, context selection must also respect temporal order so that no future\-derived sample is used to predict an earlier query\. Machine\-specific paths are kept separate from the experimental protocol through path configurations, so local storage layout does not change the scientific comparison\.
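A minimal sketch of a temporally valid context filter for intra-unit temporal splits follows. It assumes each context row carries a unit identifier and a timestamp, and that the ordering rule is applied only within the query's own unit; both assumptions, and the function name, are illustrative rather than taken from the framework.

```python
import numpy as np

def valid_context_mask(ctx_unit, ctx_time, query_unit, query_time):
    """Mark context rows that may be used to predict the given query.

    Assumed rule: rows from other units are kept as they are, while rows
    from the query's own unit must be strictly earlier than the query.
    """
    same_unit = ctx_unit == query_unit
    return ~same_unit | (ctx_time < query_time)

# Toy training-partition context: unit id and time index per row.
ctx_unit = np.array([0, 0, 0, 1, 1])
ctx_time = np.array([1, 2, 3, 1, 5])

mask = valid_context_mask(ctx_unit, ctx_time, query_unit=0, query_time=3)
# mask -> [True, True, False, True, True]
```

If targets or features are built from windows that extend forward in time, the comparison would need to use the window end rather than the row timestamp.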

## Appendix E Data and Code Access

The public repository is available at [github\.com/EPFL\-IMOS/picid](https://github.com/EPFL-IMOS/picid)\. An archival artifact identifier can be added before submission if one is created\.

### E\.1 Code and Protocol Availability

The implementation is distributed as a Python project with version\-controlled experiment configurations\. These configurations define the datasource, preprocessing pipeline, task definition, model, and evaluator used for each reported benchmark setting, while machine\-specific dataset and artifact locations are supplied separately through path configurations\.

### E\.2 Third\-Party Datasets

The evaluated datasets are third\-party benchmark datasets and should be obtained according to their respective source terms rather than redistributed as part of the paper\. They are NB14\[[5](https://arxiv.org/html/2606.05481#bib.bib5)\], UNIBO21\[[57](https://arxiv.org/html/2606.05481#bib.bib57)\], XJTU\-SY\[[61](https://arxiv.org/html/2606.05481#bib.bib61)\], N\-CMAPSS\[[4](https://arxiv.org/html/2606.05481#bib.bib4),[20](https://arxiv.org/html/2606.05481#bib.bib20)\], HSF15\[[25](https://arxiv.org/html/2606.05481#bib.bib25)\], PHME20\[[29](https://arxiv.org/html/2606.05481#bib.bib29)\], and MZVAV\[[21](https://arxiv.org/html/2606.05481#bib.bib21)\]\.

The implementation supports local dataset paths through the selected path configuration\. Dataset access therefore remains separate from the benchmark protocol: users obtain the third\-party data from the appropriate source, place or cache them locally, and then run the same configured preprocessing and evaluation pipeline\.

## References

- Ansari et al\. \[2025\]Ansari, A\.F\., Shchur, O\., Küken, J\., Auer, A\., Han, B\., Mercado, P\., Rangapuram, S\.S\., Shen, H\., Stella, L\., Zhang, X\., Goswami, M\., Kapoor, S\., Maddix, D\.C\., Guerron, P\., Hu, T\., Yin, J\., Erickson, N\., Desai, P\.M\., Wang, H\., Rangwala, H\., Karypis, G\., Wang, Y\., Bohlke\-Schneider, M\., 2025\.Chronos\-2: From univariate to universal forecasting\.URL:[https://arxiv\.org/abs/2510\.15821](https://arxiv.org/abs/2510.15821),[arXiv:2510\.15821](http://arxiv.org/abs/2510.15821)\.
- Ansari et al\. \[2024\]Ansari, A\.F\., Stella, L\., Turkmen, A\.C\., Zhang, X\., Mercado, P\., Shen, H\., Shchur, O\., Rangapuram, S\.S\., Arango, S\.P\., Kapoor, S\., Zschiegner, J\., Maddix, D\.C\., Wang, H\., Mahoney, M\.W\., Torkkola, K\., Wilson, A\.G\., Bohlke\-Schneider, M\., Wang, B\., 2024\.Chronos: Learning the language of time series\.URL:[https://openreview\.net/forum?id=gerNCVqqtR](https://openreview.net/forum?id=gerNCVqqtR)\. Expert Certification\.
- Arbel et al\. \[2026\]Arbel, M\., Salinas, D\., Hutter, F\., 2026\.Equitabpfn: A target\-permutation equivariant prior fitted network, in: Advances in Neural Information Processing Systems, Curran Associates, Inc\.\. pp\. 62586–62609\.URL:[https://proceedings\.neurips\.cc/paper\_files/paper/2025/file/5a66c7adffdbde9dd5e78820cbf6935c\-Paper\-Conference\.pdf](https://proceedings.neurips.cc/paper_files/paper/2025/file/5a66c7adffdbde9dd5e78820cbf6935c-Paper-Conference.pdf)\.
- Arias Chao et al\. \[2021\]Arias Chao, M\., Kulkarni, C\., Goebel, K\., Fink, O\., 2021\.Aircraft engine run\-to\-failure dataset under real flight conditions for prognostics and diagnostics\.Data 6, 5\.doi:[10\.3390/data6010005](https://doi.org/10.3390/data6010005)\.
- Bole et al\. \[2014\]Bole, B\., Kulkarni, C\.S\., Daigle, M\., 2014\.Adaptation of an electrochemistry\-based li\-ion battery model to account for deterioration observed under randomized use\.Annual Conference of the PHM Society 6\.doi:[10\.36001/phmconf\.2014\.v6i1\.2490](https://doi.org/10.36001/phmconf.2014.v6i1.2490)\.
- Bosello et al\. \[2023\]Bosello, M\., Falcomer, C\., Rossi, C\., Pau, G\., 2023\.To charge or to sell? ev pack useful life estimation via lstms, cnns, and autoencoders\.Energies 16, 2837\.doi:[10\.3390/en16062837](https://doi.org/10.3390/en16062837)\.
- Cai et al\. \[2025\]Cai, S\., Sun, X\., Zhong, H\., 2025\.Explore the time series forecasting potential of tabpfn leveraging the intrinsic periodicity of data, in: ICML 2025 Workshop on Foundation Models for Structured Data \(FMSD\)\.URL:[https://openreview\.net/forum?id=7JGD1kNlzU](https://openreview.net/forum?id=7JGD1kNlzU)\. workshop paper\.
- \[8\]Case Western Reserve University Bearing Data Center, \.Case Western Reserve University Bearing Data Center Website\.[https://engineering\.case\.edu/bearingdatacenter/download\-data\-file](https://engineering.case.edu/bearingdatacenter/download-data-file)\.Accessed: 2026\-05\-29\.
- Cha et al\. \[2025\]Cha, M\., Yoon, S\.i\., Kim, S\., Kang, D\., Nam, K\., Lee, T\., Kim, J\.Y\., 2025\.Large language model\-based autonomous agent for prognostics and health management\.Machines 13, 831\.URL:[https://www\.mdpi\.com/2075\-1702/13/9/831](https://www.mdpi.com/2075-1702/13/9/831), doi:[10\.3390/machines13090831](https://doi.org/10.3390/machines13090831)\.
- Chen and Guestrin \[2016\]Chen, T\., Guestrin, C\., 2016\.Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp\. 785–794\.doi:[10\.1145/2939672\.2939785](https://doi.org/10.1145/2939672.2939785)\.
- Das et al\. \[2023\]Das, A\., Kong, W\., Leach, A\., Mathur, S\.K\., Sen, R\., Yu, R\., 2023\.Long\-term forecasting with tide: Time\-series dense encoder\.Transactions on Machine Learning Research \.
- Das et al\. \[2024\]Das, A\., Kong, W\., Sen, R\., Zhou, Y\., 2024\.A decoder\-only foundation model for time\-series forecasting, in: Salakhutdinov, R\., Kolter, Z\., Heller, K\., Weller, A\., Oliver, N\., Scarlett, J\., Berkenkamp, F\. \(Eds\.\), Proceedings of the 41st International Conference on Machine Learning, PMLR\. pp\. 10148–10167\.URL:[https://proceedings\.mlr\.press/v235/das24c\.html](https://proceedings.mlr.press/v235/das24c.html)\.
- Dong et al\. \[2024\]Dong, Q\., Li, L\., Dai, D\., Zheng, C\., Ma, J\., Li, R\., Xia, H\., Xu, J\., Wu, Z\., Chang, B\., et al\., 2024\.A survey on in\-context learning, in: Proceedings of the 2024 conference on empirical methods in natural language processing, pp\. 1107–1128\.
- Dooley et al\. \[2024\]Dooley, S\., Khurana, G\.S\., Mohapatra, C\., Naidu, S\.V\., White, C\., 2024\.Forecastpfn: Synthetically\-trained zero\-shot forecasting\.Advances in Neural Information Processing Systems 36\.
- Douze et al\. \[2024\]Douze, M\., Guzhva, A\., Deng, C\., Johnson, J\., Szilvasy, G\., Mazaré, P\.E\., Lomeli, M\., Hosseini, L\., Jégou, H\., 2024\.The faiss library\.URL:[https://arxiv\.org/abs/2401\.08281](https://arxiv.org/abs/2401.08281), doi:[10\.48550/arXiv\.2401\.08281](https://doi.org/10.48550/arXiv.2401.08281),[arXiv:2401\.08281](http://arxiv.org/abs/2401.08281)\.
- Eldele et al\. \[2025\]Eldele, E\., Ragab, M\., Qing, X\., Chen, Z\., Wu, M\., Li, X\., Lee, J\., et al\., 2025\.Unifault: A fault diagnosis foundation model from bearing data\.arXiv preprint arXiv:2504\.01373 \.
- \[17\]Feng, C\., Huang, L\., Krompass, D\., \.Only the curve shape matters: Training foundation models for zero\-shot multivariate time series forecasting through next curve shape prediction\.URL:[http://arxiv\.org/abs/2402\.07570](http://arxiv.org/abs/2402.07570),[arXiv:2402\.07570 \[cs\]](http://arxiv.org/abs/2402.07570)\.
- Feng et al\. \[2026\]Feng, T\., Chen, Y\., Tsai, C\.Y\., Sun, Y\., Das, A\., El Maghraoui, K\., Lin, S\., Patel, D\., 2026\.PHMForge: Evaluating llm agents on industrial prognostics through mcp\-native, algorithm\-grounded tools\.URL:[https://arxiv\.org/abs/2604\.01532](https://arxiv.org/abs/2604.01532), doi:[10\.48550/arXiv\.2604\.01532](https://doi.org/10.48550/arXiv.2604.01532),[arXiv:2604\.01532](http://arxiv.org/abs/2604.01532)\.
- Fink et al\. \[2026\]Fink, O\., Nejjar, I\., Sharma, V\., Faghih Niresi, K\., Sun, H\., Dong, H\., Xu, C\., Wei, A\., Bizzi, A\., Theiler, R\., Tian, Y\., Von Krannichfeldt, L\., Ma, Z\., Garmaev, S\., Zhang, Z\., Zhao, M\., Steiner, K\., Kesmen, Y\., 2026\.From physics to machine learning and back: Part II \- Learning and observational bias in prognostics and health management \(PHM\)\.Reliability Engineering & System Safety 274, 112376\.doi:[10\.1016/j\.ress\.2026\.112376](https://doi.org/10.1016/j.ress.2026.112376)\.
- Frederick et al\. \[2007\]Frederick, D\.K\., DeCastro, J\.A\., Litt, J\.S\., 2007\.User’s guide for the commercial modular aero\-propulsion system simulation \(C\-MAPSS\)\.Technical Report NASA/TM—2007\-215026\. NASA Glenn Research Center\.
- Granderson et al\. \[2020a\]Granderson, J\., Lin, G\., Harding, A\., Im, P\., Chen, Y\., 2020a\.Building fault detection data to aid diagnostic algorithm creation and performance testing\.Scientific Data 7, 65\.URL:[https://www\.nature\.com/articles/s41597\-020\-0398\-6](https://www.nature.com/articles/s41597-020-0398-6), doi:[10\.1038/s41597\-020\-0398\-6](https://doi.org/10.1038/s41597-020-0398-6)\.
- Granderson et al\. \[2020b\]Granderson, J\., Lin, G\., Harding, A\., Im, P\., Chen, Y\., 2020b\.Dataset for building fault detection and diagnostics algorithm creation and performance testing\.URL:[https://figshare\.com/articles/dataset/LBNLDataSynthesisInventory\_pdf/11752740](https://figshare.com/articles/dataset/LBNLDataSynthesisInventory_pdf/11752740), doi:[10\.6084/m9\.figshare\.11752740\.v3](https://doi.org/10.6084/m9.figshare.11752740.v3)\.
- Grigsby et al\. \[2021\]Grigsby, J\., Wang, Z\., Qi, Y\., 2021\.Long\-range transformers for dynamic spatiotemporal forecasting\.[arXiv:2109\.12218](http://arxiv.org/abs/2109.12218)\.
- Hagmeyer et al\. \[2021\]Hagmeyer, S\., Mauthe, F\., Zeiler, P\., 2021\.Creation of publicly available data sets for prognostics and diagnostics addressing data scenarios relevant to industrial applications\.International Journal of Prognostics and Health Management 12\.
- Helwig et al\. \[2015\]Helwig, N\., Pignanelli, E\., Schütze, A\., 2015\.Condition monitoring of a complex hydraulic system using multivariate statistics, in: 2015 IEEE International Instrumentation and Measurement Technology Conference \(I2MTC\) Proceedings, pp\. 210–215\.doi:[10\.1109/I2MTC\.2015\.7151267](https://doi.org/10.1109/I2MTC.2015.7151267)\.
- Hollmann et al\. \[2023\]Hollmann, N\., Müller, S\., Eggensperger, K\., Hutter, F\., 2023\.Tabpfn: A transformer that solves small tabular classification problems in a second, in: The Eleventh International Conference on Learning Representations\.
- Hollmann et al\. \[2025\]Hollmann, N\., Müller, S\., Purucker, L\., Krishnakumar, A\., Körfer, M\., Hoo, S\.B\., Schirrmeister, R\.T\., Hutter, F\., 2025\.Accurate predictions on small data with a tabular foundation model\.Nature 637, 319–326\.
- Hoo et al\. \[2024\]Hoo, S\.B\., Müller, S\., Salinas, D\., Hutter, F\., 2024\.The tabular foundation model tabpfn outperforms specialized time series forecasting models based on simple features, in: NeurIPS 2024 Workshop on Table Representation Learning \(TRL\)\.URL:[https://neurips\.cc/virtual/2024/103164](https://neurips.cc/virtual/2024/103164)\. NeurIPS 2024 workshop paper\.
- İnce et al\. \[2020\]İnce, K\., Sirkeci, E\., Genç, Y\., 2020\.Remaining useful life prediction for experimental filtration system: A data challenge\.PHM Society European Conference 5\.doi:[10\.36001/phme\.2020\.v5i1\.1317](https://doi.org/10.36001/phme.2020.v5i1.1317)\.
- Jin et al\. \[2024\]Jin, M\., Wang, S\., Ma, L\., Chu, Z\., Zhang, J\., Shi, X\., Chen, P\.Y\., Liang, Y\., Li, Y\.F\., Pan, S\., Wen, Q\., 2024\.Time\-llm: Time series forecasting by reprogramming large language models, in: Kim, B\., Yue, Y\., Chaudhuri, S\., Fragkiadaki, K\., Khan, M\., Sun, Y\. \(Eds\.\), International Conference on Learning Representations, pp\. 23857–23880\.URL:[https://proceedings\.iclr\.cc/paper\_files/paper/2024/file/680b2a8135b9c71278a09cafb605869e\-Paper\-Conference\.pdf](https://proceedings.iclr.cc/paper_files/paper/2024/file/680b2a8135b9c71278a09cafb605869e-Paper-Conference.pdf)\.
- Kim et al\. \[2024\]Kim, M\.J\., Grinsztajn, L\., Varoquaux, G\., 2024\.Carte: Pretraining and transfer for tabular learning, in: Forty\-first International Conference on Machine Learning\.
- LeCun et al\. \[2015\]LeCun, Y\., Bengio, Y\., Hinton, G\., 2015\.Deep learning\.Nature 521, 436–444\.doi:[10\.1038/nature14539](https://doi.org/10.1038/nature14539)\.
- Lee et al\. \[2019\]Lee, J\., Yoon, W\., Kim, S\., Kim, D\., Kim, S\., So, C\.H\., Kang, J\., 2019\.Biobert: a pre\-trained biomedical language representation model for biomedical text mining\.Bioinformatics 36, 1234–1240\.URL:[http://dx\.doi\.org/10\.1093/bioinformatics/btz682](http://dx.doi.org/10.1093/bioinformatics/btz682), doi:[10\.1093/bioinformatics/btz682](https://doi.org/10.1093/bioinformatics/btz682)\.
- Lei et al\. \[2018\]Lei, Y\., Li, N\., Guo, L\., Li, N\., Yan, T\., Lin, J\., 2018\.Machinery health prognostics: A systematic review from data acquisition to remaining useful life prediction\.Mechanical Systems and Signal Processing 104, 799–834\.doi:[10\.1016/j\.ymssp\.2017\.11\.016](https://doi.org/10.1016/j.ymssp.2017.11.016)\.
- Li et al\. \[2024\]Li, C\., Li, S\., Feng, Y\., Gryllias, K\., Gu, F\., Pecht, M\., 2024\.Small data challenges for intelligent prognostics and health management: a review\.Artificial Intelligence Review 57, 214\.
- Liu et al\. \[2025\]Liu, D\., Qu, G\., Xu, Y\., Qiu, T\., Ding, S\., Guo, K\., 2025\.ICL4RUL: In\-context learning\-based aircraft engine remaining useful life prediction\.IEEE Internet of Things Journal 12, 29766–29783\.URL:[https://ieeexplore\.ieee\.org/document/10998956](https://ieeexplore.ieee.org/document/10998956), doi:[10\.1109/JIOT\.2025\.3569131](https://doi.org/10.1109/JIOT.2025.3569131)\.
- \[37\]Liu, Y\., Zhang, H\., Li, C\., Huang, X\., Wang, J\., Long, M\., \.Timer: Generative pre\-trained transformers are large time series models\.URL:[http://arxiv\.org/abs/2402\.02368](http://arxiv.org/abs/2402.02368),[arXiv:2402\.02368 \[cs, stat\]](http://arxiv.org/abs/2402.02368)\.
- Ma et al\. \[2025\]Ma, J\., Thomas, V\., Hosseinzadeh, R\., Labach, A\., Kamkari, H\., Cresswell, J\.C\., Golestan, K\., Yu, G\., Caterini, A\.L\., Volkovs, M\., 2025\.TabDPT: Scaling tabular foundation models on real data URL:[https://openreview\.net/forum?id=pIZxEOZCId](https://openreview.net/forum?id=pIZxEOZCId)\.
- Magadán et al\. \[2023\]Magadán, L\., Roldán\-Gómez, J\., Granda, J\., Suárez, F\., 2023\.Early fault classification in rotating machinery with limited data using tabpfn\.IEEE Sensors Journal 23, 30960–30970\.doi:[10\.1109/JSEN\.2023\.3331100](https://doi.org/10.1109/JSEN.2023.3331100)\.
- Mauthe et al\. \[2025\]Mauthe, F\., Steinmann, L\., Neu, M\., Zeiler, P\., 2025\.Overview and analysis of publicly available degradation data sets for tasks within prognostics and health management, in: 35th european safety and reliability conference\.\(accepted\)\. Research Publishing\.
- \[41\]Miller, J\.A\., Aldosari, M\., Saeed, F\., Barna, N\.H\., Rana, S\., Arpinar, I\.B\., Liu, N\., \.A survey of deep learning and foundation models for time series forecasting\.URL:[http://arxiv\.org/abs/2401\.13912](http://arxiv.org/abs/2401.13912), doi:[10\.48550/arXiv\.2401\.13912](https://doi.org/10.48550/arXiv.2401.13912),[arXiv:2401\.13912 \[cs\]](http://arxiv.org/abs/2401.13912)\.
- Nie et al\. \[2023\]Nie, Y\., H\. Nguyen, N\., Sinthong, P\., Kalagnanam, J\., 2023\.A time series is worth 64 words: Long\-term forecasting with transformers, in: International Conference on Learning Representations\.
- Nori et al\. \[2023\]Nori, H\., Lee, Y\.T\., Zhang, S\., Carignan, D\., Edgar, R\., Fusi, N\., King, N\., Larson, J\., Li, Y\., Liu, W\., Luo, R\., McKinney, S\.M\., Ness, R\.O\., Poon, H\., Qin, T\., Usuyama, N\., White, C\., Horvitz, E\., 2023\.Can generalist foundation models outcompete special\-purpose tuning? case study in medicine\.arXiv preprint arXiv:2311\.16452 URL:[https://arxiv\.org/abs/2311\.16452](https://arxiv.org/abs/2311.16452)\.
- Qiao et al\. \[2025\]Qiao, X\., Liow, H\.Y\., Jauw, V\.L\., Lim, C\.S\., 2025\.A comparative study of deep learning model based equipment fault diagnosis and prognosis\.International Journal of Prognostics and Health Management 16\.doi:[10\.36001/IJPHM\.2025\.v16i1\.4254](https://doi.org/10.36001/IJPHM.2025.v16i1.4254)\.
- Qu et al\. \[2025\]Qu, J\., Holzmüller, D\., Varoquaux, G\., Morvan, M\.L\., 2025\.Tabicl: A tabular foundation model for in\-context learning on large data, in: Proceedings of the 42nd International Conference on Machine Learning \(ICML 2025\)\.URL:[https://arxiv\.org/abs/2502\.05564](https://arxiv.org/abs/2502.05564)\. accepted at ICML 2025\.
- Ramasso and Saxena \[2014\]Ramasso, E\., Saxena, A\., 2014\.Performance Benchmarking and Analysis of Prognostic Methods for CMAPSS Datasets\.International Journal of Prognostics and Health Management 5\.URL:[https://papers\.phmsociety\.org/index\.php/ijphm/article/view/2236](https://papers.phmsociety.org/index.php/ijphm/article/view/2236), doi:[10\.36001/ijphm\.2014\.v5i2\.2236](https://doi.org/10.36001/ijphm.2014.v5i2.2236)\.
- Salinas\-Camus et al\. \[2025\]Salinas\-Camus, M\., Goebel, K\., Eleftheroglou, N\., 2025\.A comprehensive review and evaluation framework for data\-driven prognostics: Uncertainty, robustness, interpretability, and feasibility\.Mechanical Systems and Signal Processing 237, 113015\.doi:[10\.1016/j\.ymssp\.2025\.113015](https://doi.org/10.1016/j.ymssp.2025.113015)\.
- Saxena et al\. \[2008\]Saxena, A\., Goebel, K\., Simon, D\., Eklund, N\., 2008\.Damage propagation modeling for aircraft engine run\-to\-failure simulation, in: 2008 International Conference on Prognostics and Health Management, pp\. 1–9\.doi:[10\.1109/PHM\.2008\.4711414](https://doi.org/10.1109/PHM.2008.4711414)\.
- Siami\-Namini et al\. \[2019\]Siami\-Namini, S\., Tavakoli, N\., Siami Namin, A\., 2019\.The performance of lstm and bilstm in forecasting time series, in: 2019 IEEE International Conference on Big Data \(Big Data\), pp\. 3285–3292\.doi:[10\.1109/BigData47090\.2019\.9005997](https://doi.org/10.1109/BigData47090.2019.9005997)\.
- Sim et al\. \[2020\]Sim, J\., Kim, S\., Park, H\.J\., Choi, J\.H\., 2020\.A tutorial for feature engineering in the prognostics and health management of gears and bearings\.Applied Sciences 10, 5639\.doi:[10\.3390/app10165639](https://doi.org/10.3390/app10165639)\.
- Sun et al\. \[2025\]Sun, L\., Wu, J\., Wang, J\., Wen, S\., Li, G\., Liu, Y\., 2025\.Fault diagnosis of slewing bearing using audible sound signal based on time gan–tabpfn method\.Journal of Vibration and Acoustics 147, 041002\.doi:[10\.1115/1\.4068223](https://doi.org/10.1115/1.4068223)\.
- Telyatnikov et al\. \[2026\]Telyatnikov, L\., Theiler, R\., Von Krannichfeldt, L\., Fink, O\., 2026\.Picid: A modular evaluation infrastructure for reproducible phm across tasks and domains\.URL:[https://arxiv\.org/abs/2605\.28345](https://arxiv.org/abs/2605.28345), doi:[10\.48550/arXiv\.2605\.28345](https://doi.org/10.48550/arXiv.2605.28345),[arXiv:2605\.28345](http://arxiv.org/abs/2605.28345)\.
- Theiler et al\. \[2026\]Theiler, R\., Comito, L\., Leko, D\., Von Krannichfeldt, L\., Telyatnikov, L\., Fink, O\., 2026\.From paper to benchmark: agentic, framework\-based reproduction of under\-specified methods in machine health intelligence\.URL:[https://arxiv\.org/abs/2605\.28371](https://arxiv.org/abs/2605.28371), doi:[10\.48550/arXiv\.2605\.28371](https://doi.org/10.48550/arXiv.2605.28371),[arXiv:2605\.28371](http://arxiv.org/abs/2605.28371)\.
- Tu et al\. \[2024\]Tu, S\., Zhang, Y\., Zhang, J\., Fu, Z\., Zhang, Y\., Yang, Y\., 2024\.Powerpm: Foundation model for power systems, in: Globerson, A\., Mackey, L\., Belgrave, D\., Fan, A\., Paquet, U\., Tomczak, J\., Zhang, C\. \(Eds\.\), Advances in Neural Information Processing Systems, Curran Associates, Inc\.\. pp\. 115233–115260\.URL:[https://proceedings\.neurips\.cc/paper\_files/paper/2024/file/d0a2279c9f7ded859bcbf878c3c3d1ed\-Paper\-Conference\.pdf](https://proceedings.neurips.cc/paper_files/paper/2024/file/d0a2279c9f7ded859bcbf878c3c3d1ed-Paper-Conference.pdf), doi:[10\.52202/079017\-3659](https://doi.org/10.52202/079017-3659)\.
- Wang et al\. \[2018\]Wang, J\., Ma, Y\., Zhang, L\., Gao, R\.X\., Wu, D\., 2018\.Deep learning for smart manufacturing: Methods and applications\.Journal of Manufacturing Systems 48, 144–156\.doi:[10\.1016/j\.jmsy\.2018\.01\.003](https://doi.org/10.1016/j.jmsy.2018.01.003)\.
- Wang et al\. \[2026\]Wang, Y\., Wu, H\., Dong, J\., Liu, Y\., Wang, C\., Long, M\., Wang, J\., 2026\.Deep time series models: A comprehensive survey and benchmark\.IEEE Transactions on Pattern Analysis and Machine Intelligence URL:[https://doi\.org/10\.1109/TPAMI\.2026\.3690845](https://doi.org/10.1109/TPAMI.2026.3690845), doi:[10\.1109/TPAMI\.2026\.3690845](https://doi.org/10.1109/TPAMI.2026.3690845)\.
- Wong et al\. \[2021a\]Wong, K\.L\., Bosello, M\., Tse, R\., Falcomer, C\., Rossi, C\., Pau, G\., 2021a\.Li\-ion batteries state\-of\-charge estimation using deep lstm at various battery specifications and discharge cycles, in: Proceedings of the Conference on Information Technology for Social Good, Association for Computing Machinery, New York, NY, USA\. p\. 85–90\.URL:[https://doi\.org/10\.1145/3462203\.3475878](https://doi.org/10.1145/3462203.3475878), doi:[10\.1145/3462203\.3475878](https://doi.org/10.1145/3462203.3475878)\.
- Wong et al\. \[2021b\]Wong, K\.L\., Bosello, M\., Tse, R\., Falcomer, C\., Rossi, C\., Pau, G\., 2021b\.Li\-ion batteries state\-of\-charge estimation using deep lstm at various battery specifications and discharge cycles, in: Proceedings of the Conference on Information Technology for Social Good, Association for Computing Machinery, New York, NY, USA\. p\. 85–90\.URL:[https://doi\.org/10\.1145/3462203\.3475878](https://doi.org/10.1145/3462203.3475878), doi:[10\.1145/3462203\.3475878](https://doi.org/10.1145/3462203.3475878)\.
- Woo et al\. \[2024\]Woo, G\., Liu, C\., Kumar, A\., Xiong, C\., Savarese, S\., Sahoo, D\., 2024\.Unified training of universal time series forecasting transformers, in: Salakhutdinov, R\., Kolter, Z\., Heller, K\., Weller, A\., Oliver, N\., Scarlett, J\., Berkenkamp, F\. \(Eds\.\), Proceedings of the 41st International Conference on Machine Learning, PMLR\. pp\. 53140–53164\.URL:[https://proceedings\.mlr\.press/v235/woo24a\.html](https://proceedings.mlr.press/v235/woo24a.html)\.
- Yadan \[2019\]Yadan, O\., 2019\.Hydra \- a framework for elegantly configuring complex applications\.Github\.URL:[https://github\.com/facebookresearch/hydra](https://github.com/facebookresearch/hydra)\.
- Yaguo et al\. \[2019\]Yaguo, L\., Tianyu, H\., Biao, W\., Naipeng, L\., Tao, Y\., Jun, Y\., 2019\.Xjtu\-sy rolling element bearing accelerated life test datasets: A tutorial\.Journal of Mechanical Engineering 55, 1–6\.
- Yao and Han \[2026\]Yao, J\., Han, T\., 2026\.Utilizing large\-scale foundation models for prognostics and health management in wind turbines: Techniques, challenges, and future directions\.Renewable and Sustainable Energy Reviews 227, 116527\.doi:[10\.1016/j\.rser\.2025\.116527](https://doi.org/10.1016/j.rser.2025.116527)\.
- Zhang et al\. \[2025\]Zhang, S\., Wang, T\., Kulkarni, A\., Adams, S\., Bhattacharya, S\., Tiyyagura, S\.R\., Bowen, E\., Veeramani, B\., Zhou, D\., 2025\.PDMBench: A Standardized Platform for Predictive Maintenance Research\.URL:[https://openreview\.net/forum?id=oJhj8wOCNB](https://openreview.net/forum?id=oJhj8wOCNB)\. OpenReview preprint\.
- Zhang and Yan \[2023\]Zhang, Y\., Yan, J\., 2023\.Crossformer: Transformer utilizing cross\-dimension dependency for multivariate time series forecasting, in: The eleventh international conference on learning representations\.
- Zhao et al\. \[2019\]Zhao, R\., Yan, R\., Chen, Z\., Mao, K\., Wang, P\., Gao, R\.X\., 2019\.Deep learning and its applications to machine health monitoring\.Mechanical Systems and Signal Processing 115, 213–237\.doi:[10\.1016/j\.ymssp\.2018\.05\.050](https://doi.org/10.1016/j.ymssp.2018.05.050)\.
- Zhou et al\. \[2025\]Zhou, C\., Li, Q\., Li, C\., Yu, J\., Liu, Y\., Wang, G\., Zhang, K\., Ji, C\., Yan, Q\., He, L\., et al\., 2025\.A comprehensive survey on pretrained foundation models: A history from bert to chatgpt\.International Journal of Machine Learning and Cybernetics 16, 9851–9915\.
- Zonta et al\. \[2020\]Zonta, T\., da Costa, C\.A\., da Rosa Righi, R\., de Lima, M\.J\., da Trindade, E\.S\., Li, G\., 2020\.Predictive maintenance in the industry 4\.0: A systematic literature review\.Computers & Industrial Engineering 150, 106889\.doi:[10\.1016/j\.cie\.2020\.106889](https://doi.org/10.1016/j.cie.2020.106889)\.