TPA-AD: A Two-Stage Pseudo Anomaly-Guided Method for Bearing Time-Series Anomaly Detection

arXiv cs.LG Papers

Summary

TPA-AD is a two-stage pseudo anomaly-guided method for bearing time-series anomaly detection that generates pseudo-anomalous windows near normal boundaries using reconstruction models and contrastive learning, then scores anomalies with KNN—without requiring real anomaly samples during training. It is evaluated on bearing fault and degradation datasets, including high-speed train axle-box bearing data.

arXiv:2606.04073v1 Announce Type: new Abstract: This paper proposes a two-stage pseudo anomaly-guided anomaly detection method (Two-stage Pseudo Anomaly-guided Anomaly Detection, TPA-AD) for axle-box bearing time-series anomaly detection (time series anomaly detection, TSAD) under the setting where only normal samples are available for training. The method first generates pseudo-anomalous windows near the normal boundary using a reconstruction model and per-feature target-error control. It then learns anomaly-sensitive representations through contrastive learning between normal and pseudo-anomalous windows, and finally produces window-level and point-level anomaly scores using k-nearest neighbors (KNN). Compared with existing methods that rely on known fault categories, real anomaly priors, or random anomaly injection, TPA-AD improves the separability of the normal boundary by constructing pseudo-anomalies in boundary neighborhoods and can jointly handle continuous and discrete features in mixed-variable scenarios. The main experiments are conducted on bearing fault detection datasets and degradation-process datasets, with an additional exploratory extension on $13$ public TSAD datasets. The results show that the proposed method yields relatively stable anomaly responses, is sensitive to degradation evolution, and demonstrates a certain degree of broader applicability on public TSAD benchmarks and real high-speed-train-related bearing data.

Cached at: 06/05/26, 02:20 AM

# TPA-AD: A Two-Stage Pseudo Anomaly-Guided Method for Bearing Time-Series Anomaly Detection
Source: [https://arxiv.org/html/2606.04073](https://arxiv.org/html/2606.04073)
###### Abstract

This paper proposes a two\-stage pseudo anomaly\-guided anomaly detection method \(Two\-stage Pseudo Anomaly\-guided Anomaly Detection, TPA\-AD\) for axle\-box bearing time\-series anomaly detection \(time series anomaly detection, TSAD\) under the setting where only normal samples are available for training\. The method first generates pseudo\-anomalous windows near the normal boundary using a reconstruction model and per\-feature target\-error control\. It then learns anomaly\-sensitive representations through contrastive learning between normal and pseudo\-anomalous windows, and finally produces window\-level and point\-level anomaly scores using k\-nearest neighbors \(KNN\)\. Compared with existing methods that rely on known fault categories, real anomaly priors, or random anomaly injection, TPA\-AD improves the separability of the normal boundary by constructing pseudo\-anomalies in boundary neighborhoods and can jointly handle continuous and discrete features in mixed\-variable scenarios\. The main experiments are conducted on bearing fault detection datasets and degradation\-process datasets, with an additional exploratory extension on 13 public TSAD datasets\. The results show that the proposed method yields relatively stable anomaly responses, is sensitive to degradation evolution, and demonstrates a certain degree of broader applicability on public TSAD benchmarks and real high\-speed\-train\-related bearing data\.

###### keywords:

fault detection , bearing fault detection , high\-speed railway fault detection , TSAD , contrastive learning

Journal: Journal of Industrial Information Integration

Affiliations:

\[1\] School of Ocean Engineering, Harbin Institute of Technology, West Wenhua Road, Weihai 264209, Shandong, China

\[2\] Technical Center, Bogie Development Department, CRRC Qingdao Sifang Locomotive and Rolling Stock Co\., Ltd, Jinhong East Road, Qingdao 266111, Shandong, China

\[3\] Qingdao University, No\. 308 Ningxia Road, Qingdao 266071, Shandong, China

###### Highlights

We propose a two\-stage pseudo anomaly\-guided framework for scenarios with normal\-only training, combining reconstruction\-driven pseudo\-anomaly generation with contrastive representation learning to improve the separability of axle\-box bearing anomalies without relying on real anomaly samples\.

We design a per\-feature target\-error control and pseudo\-anomaly filtering mechanism to generate boundary samples with controllable deviation magnitudes and a more continuous distribution in the continuous\-feature subspace, thereby alleviating the problems of scattered injected anomalies and unstable boundaries in traditional anomaly injection\.

We construct an experimental system covering fault detection, degradation detection, and an extension on public TSAD datasets, systematically evaluating the proposed method on multi\-condition bearing data and long\-horizon degradation sequences, while also analyzing its broader applicability to generic time\-series anomaly detection tasks\.

## 1 Introduction

Axle\-box bearings of high\-speed trains are critical rotating components in the bogie system, and their service condition directly affects operational safety, ride comfort, and maintenance cost\. During long\-term operation, axle\-box bearings are influenced by high\-speed rotation, wheel–rail excitation, load variation, track irregularity, and environmental noise\. Under these conditions, abnormal states such as local damage, wear, and lubrication degradation may gradually accumulate and eventually evolve into severe faults\. Vibration signals can sensitively reflect impact, modulation, and non\-stationary dynamic characteristics inside the bearing\. Therefore, state monitoring and anomaly detection based on vibration signals collected by axle\-box accelerometers are of great practical importance\. In recent years, rolling\-bearing fault diagnosis has gradually shifted from handcrafted feature extraction to deep representation learning\. Related surveys indicate that convolutional neural networks, attention mechanisms, graph neural networks, and Transformer\-based models have become important tools for intelligent bearing diagnosis Zhao et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib1)\)\. Under complex operating conditions, previous studies have improved model robustness from perspectives such as variable speed and sample imbalance Dong et al\. \([2024a](https://arxiv.org/html/2606.04073#bib.bib2)\), pseudo\-label\-based time–frequency supervised contrastive learning and unsupervised domain adaptation Pang et al\. \([2024](https://arxiv.org/html/2606.04073#bib.bib3)\), and dynamic\-model\-assisted disentanglement Xu et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib4)\)\. In terms of model design, methods such as convolution with cross\-fusion Transformer Lin et al\. \([2023](https://arxiv.org/html/2606.04073#bib.bib5)\), GAF combined with CNN\-ViT Zhou et al\. \([2024a](https://arxiv.org/html/2606.04073#bib.bib6)\), and few\-shot learning frameworks Li et al\. \([2024](https://arxiv.org/html/2606.04073#bib.bib7)\) have been used to enhance vibration\-signal feature representation\. In multi\-sensor and industrial scenarios, approaches such as multi\-sensor frequency\-domain fusion Dai et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib8)\), VMD combined with lightweight networks Wang and Feng \([2024](https://arxiv.org/html/2606.04073#bib.bib9)\), lightweight contrastive Transformer Dong et al\. \([2024b](https://arxiv.org/html/2606.04073#bib.bib10)\), and multi\-source residual convolutional fusion networks Ye et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib11)\) further address noise, limited samples, and multi\-source information fusion\. For high\-speed\-train bogie bearings, methods such as AGFCN consider diagnosis under complex conditions with strong noise and varying loads He et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib12)\)\. In addition, zero\-shot diagnosis Li et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib13)\), frequency\-pattern graph modeling Liu et al\. \([2025a](https://arxiv.org/html/2606.04073#bib.bib14)\), and complex\-domain completion with unsupervised time–frequency alignment under cross\-domain missing\-data settings Wang et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib15)\) all suggest that real bearing\-monitoring scenarios often involve insufficient labels, cross\-condition shifts, scarce anomaly samples, and multi\-source noise interference simultaneously\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/Introducation.png)

Figure 1: Schematic comparison between conventional fault injection and the pseudo\-anomaly generation mechanism proposed in this paper\. The former usually produces scattered negative samples and thus struggles to form a stable boundary outside the normal manifold\. The latter generates pseudo\-anomalous samples with multiple strength levels in the reconstruction\-error space, thereby forming a more continuous anomalous boundary band around the normal region\.

However, most existing bearing fault diagnosis studies target the identification of known fault categories and usually rely on fault samples, class labels, or source\-domain fault knowledge\. In real high\-speed\-train maintenance scenarios, severe fault samples are scarce, anomaly labels are difficult to obtain, and many anomalies do not belong to predefined fault categories\. Therefore, modeling axle\-box vibration monitoring as a time\-series anomaly detection \(TSAD\) problem is more consistent with practical application needs\. In recent years, deep TSAD methods have developed into a relatively complete research landscape, covering prediction\-based, reconstruction\-based, generative, and representation\-learning paradigms Zamanzadeh Darban et al\. \([2024](https://arxiv.org/html/2606.04073#bib.bib16)\)\. At the same time, the reliability of anomaly\-detection benchmarks, evaluation metrics, and experimental protocols has also drawn increasing attention Liu and Paparrizos \([2024](https://arxiv.org/html/2606.04073#bib.bib17)\); Qiu et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib18)\)\. In terms of specific methods, multi\-pattern normality learning in the frequency domain Chen et al\. \([2024](https://arxiv.org/html/2606.04073#bib.bib19)\), joint self\-supervised modeling in the time, frequency, and residual domains Sun et al\. \([2024](https://arxiv.org/html/2606.04073#bib.bib20)\), and time–frequency masked autoencoders Fang et al\. \([2024](https://arxiv.org/html/2606.04073#bib.bib21)\) all highlight the importance of frequency structure and multi\-domain representation for anomaly detection\. ImDiffusion combines imputation mechanisms with diffusion models for multivariate TSAD Chen et al\. \([2023](https://arxiv.org/html/2606.04073#bib.bib22)\), while METER addresses concept drift in online anomaly detection through dynamic concept adaptation Zhu et al\. \([2023](https://arxiv.org/html/2606.04073#bib.bib23)\)\. For multivariate time series, graph\-alignment methods are used to characterize changes in channel dependencies Wang et al\. \([2024](https://arxiv.org/html/2606.04073#bib.bib24)\), and CATCH captures fine\-grained frequency features and channel correlations through frequency\-domain patching and channel\-fusion modules Wu et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib25)\)\. In addition, recent methods such as DADA Shentu et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib26)\), hybrid prototype learning Shen \([2025](https://arxiv.org/html/2606.04073#bib.bib27)\), CARLA self\-supervised contrastive representation learning Darban et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib28)\), KAN\-AD Zhou et al\. \([2024b](https://arxiv.org/html/2606.04073#bib.bib29)\), and RTDetector based on reconstruction trends Liu et al\. \([2025b](https://arxiv.org/html/2606.04073#bib.bib30)\) have further advanced TSAD from the perspectives of generic anomaly detection, prototype constraints, contrastive learning, and suppression of reconstruction overfitting\.

Although the above methods have achieved promising results in generic anomaly detection and mechanical fault diagnosis, two key limitations remain for axle\-box bearing monitoring scenarios in which only normal training samples are available\. First, many methods implicitly rely on real anomalies, anomaly\-injection priors, or relatively stable fault templates\. Second, even when artificial negative samples are introduced, they are often scattered and thus fail to form a continuous and stable discriminative boundary outside the normal manifold\. Therefore, a key issue in improving the discriminative capability of TSAD for axle\-box bearings is how to construct “near\-anomalous” reference samples with genuine boundary significance without relying on real anomaly samples\.

As illustrated in Fig\.[1](https://arxiv.org/html/2606.04073#S1.F1), conventional fault injection is usually highly task\-specific, and the generated samples tend to resemble several known fault patterns\. When unseen anomalies or weak anomalies at the early stage of degradation appear during testing, such models may fail to maintain stable discrimination\. In contrast, this paper does not directly fabricate specific fault morphologies\. Instead, it generates pseudo\-anomalous windows with controllable deviation magnitudes based on the reconstruction behavior of normal samples within multiple feature\-wise error intervals\. The resulting pseudo\-anomalous samples are more concentrated in the neighborhood just outside the normal manifold, thereby providing continuous and stable negative references for subsequent contrastive representation learning\.

To more clearly position this work with respect to existing research on bearing fault diagnosis and time\-series anomaly detection, Tables [1](https://arxiv.org/html/2606.04073#S1.T1) and [2](https://arxiv.org/html/2606.04073#S1.T2) summarize two groups of representative studies that are most relevant to this paper\.

Table 1: Key references in bearing fault diagnosis

| Reference | Scenario/Problem | Method | Relevance to this work |
| --- | --- | --- | --- |
| Dong et al\. \([2024a](https://arxiv.org/html/2606.04073#bib.bib2)\) | Variable speed and sample imbalance | Multi\-scale dynamic supervised contrastive learning for enhancing discrimination among different state representations\. | Supports the effectiveness of contrastive learning in bearing diagnosis under complex operating conditions\. |
| Pang et al\. \([2024](https://arxiv.org/html/2606.04073#bib.bib3)\) | Unsupervised domain adaptation under variable speed | Combines pseudo labels with time–frequency supervised contrastive learning to alleviate cross\-condition distribution shifts\. | Related to our use of pseudo\-anomaly/pseudo\-label information for representation learning\. |
| He et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib12)\) | Complex operating conditions of high\-speed\-train bogie bearings | Fault diagnosis for high\-speed\-train bogie bearings under strong noise and varying loads\. | Closest to the application scenario of this paper and supports the background of axle\-box bearing vibration monitoring\. |
| Li et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib13)\) | Zero\-shot bearing diagnosis | Uses fault\-spectrum knowledge and self\-driven contrastive learning under the absence of fault samples\. | Supports the need to exploit weak priors or pseudo\-information when real fault samples are scarce\. |
| Wang et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib15)\) | Cross\-domain diagnosis with missing data | Addresses cross\-domain and missing\-data issues through complex\-domain completion and unsupervised time–frequency alignment\. | Supports the need to handle distribution shifts, missingness, and unsupervised alignment in real industrial data\. |

Table 2: Key references in time\-series anomaly detection

| Reference | Scenario/Problem | Method | Relevance to this work |
| --- | --- | --- | --- |
| Zamanzadeh Darban et al\. \([2024](https://arxiv.org/html/2606.04073#bib.bib16)\) | Survey of deep TSAD | Systematically summarizes prediction\-based, reconstruction\-based, generative, and representation\-learning methods\. | Provides the overall TSAD research context and problem definition\. |
| Liu and Paparrizos \([2024](https://arxiv.org/html/2606.04073#bib.bib17)\) | TSAD benchmarks and evaluation | Analyzes how datasets, metrics, and evaluation protocols affect result reliability\. | Supports the need for point\-level scoring and careful metric design in this work\. |
| Chen et al\. \([2024](https://arxiv.org/html/2606.04073#bib.bib19)\) | Multi\-pattern normality learning | Learns multiple normal patterns in the frequency domain to improve anomaly detection efficiency in complex systems\. | Relevant to multiple normal operating modes and frequency structure in axle\-box vibration signals\. |
| Wu et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib25)\) | Multivariate frequency\-domain and channel relations | Uses frequency\-domain patching and channel\-fusion modules to capture frequency features and channel correlations\. | Supports the importance of frequency\-domain and channel information in multi\-channel vibration signals\. |
| Darban et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib28)\) | Self\-supervised contrastive TSAD | Obtains discriminative time\-series representations through anomaly injection and contrastive learning\. | Related to our idea of constructing boundaries in the embedding space using pseudo\-anomalous negative samples\. |

In summary, this paper establishes a TSAD framework for high\-speed\-train axle\-box bearing monitoring centered on “pseudo\-anomaly construction–contrastive representation learning–point\-level anomaly scoring\.” The main contributions of this paper are as follows:

1. We propose a two\-stage pseudo anomaly\-guided framework for scenarios with normal\-only training, combining reconstruction\-driven pseudo\-anomaly generation with contrastive representation learning to improve the separability of axle\-box bearing anomalies without relying on real anomaly samples\.
2. We design a per\-feature target\-error control and pseudo\-anomaly filtering mechanism to generate boundary samples with controllable deviation magnitudes and a more continuous distribution in the continuous\-feature subspace, thereby alleviating the problems of scattered injected anomalies and unstable boundaries in traditional anomaly injection\.
3. We construct an experimental system covering fault detection, degradation detection, and an extension on public TSAD datasets, systematically evaluating the proposed method on multi\-condition bearing data and long\-horizon degradation sequences, while also analyzing its broader applicability to generic time\-series anomaly detection tasks\.

## 2 Method

This paper studies time\-series anomaly detection under the setting where only normal samples are used for training\. Given normal training sequences, the goal is to assign higher scores to anomalous time points in the test sequence without using test labels\. To this end, we propose a two\-stage pseudo anomaly\-guided representation learning framework, as shown in Fig\.[2](https://arxiv.org/html/2606.04073#S2.F2)\. In the first stage, a reconstruction model is learned in the continuous subspace, and pseudo\-anomalous windows near the normal boundary are generated through per\-feature error control\. In the second stage, contrastive representation learning is performed using normal windows and pseudo\-anomalous windows, and anomaly scoring is carried out in the embedding space based on KNN distances\. By constructing “near\-anomalous” reference samples, the framework enhances the separability of the normal boundary without relying on any real anomaly samples\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/overall_framework.png)

Figure 2: The proposed two\-stage framework\. The upper part shows reconstruction\-driven pseudo\-anomalous window generation, and the lower part shows normal–pseudo\-anomalous contrastive representation learning and KNN scoring\. Stage 1 generates controlled pseudo\-anomalous samples from normal windows, while Stage 2 pulls normal neighborhoods closer, pushes pseudo\-anomalous samples farther away in the embedding space, and outputs both window\-level and point\-level anomaly scores\.

### 2\.1 Problem formulation

Let the training sequence and test sequence be denoted by $\mathbf{X}^{\mathrm{tr}}=\{\mathbf{x}^{\mathrm{tr}}_{t}\}_{t=1}^{T_{\mathrm{tr}}}$ and $\mathbf{X}^{\mathrm{te}}=\{\mathbf{x}^{\mathrm{te}}_{t}\}_{t=1}^{T_{\mathrm{te}}}$, respectively, where $\mathbf{x}_{t}\in\mathbb{R}^{D}$\. The training set contains only normal operating states, while the test set may contain anomalous segments\. The test labels $\mathbf{y}^{\mathrm{te}}\in\{0,1\}^{T_{\mathrm{te}}}$ are used only for final evaluation and do not participate in model training or pseudo\-anomaly generation\.

Given a window length $W$ and a sliding stride $s$, the sequence is segmented into a set of windows

$$\mathbf{X}_{i}=[\mathbf{x}_{t_{i}},\mathbf{x}_{t_{i}+1},\ldots,\mathbf{x}_{t_{i}+W-1}]\in\mathbb{R}^{W\times D},\quad t_{i}=1+(i-1)s. \tag{1}$$

During evaluation, the window label is determined by the point labels within its covered interval, namely $y_{i}=\max_{t\in[t_{i},t_{i}+W-1]}y_{t}$\. The goal of this paper is to learn a point\-level anomaly scoring function $S(t)$ such that anomalous segments receive higher scores\.
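The windowing of Eq. (1) and the max-based window label can be sketched in a few lines of plain Python. This is only an illustration: the function names are hypothetical, indices are 0-based rather than 1-based, and dropping trailing points that do not fill a complete window is an assumption, since the text does not say how the tail is handled.

```python
def sliding_windows(series, window, stride):
    """Split a sequence into windows X_i; start indices are 0-based (t_i - 1 in Eq. 1)."""
    # Assumption: trailing points that cannot fill a full window are dropped.
    starts = range(0, len(series) - window + 1, stride)
    return [series[t:t + window] for t in starts]


def window_labels(point_labels, window, stride):
    """Window label y_i = max of the point labels covered by the window (evaluation only)."""
    return [max(w) for w in sliding_windows(point_labels, window, stride)]


# Toy data: 8 time steps, D = 2 features, one anomalous point at 0-based index 5.
series = [[float(t), 0.5 * t] for t in range(8)]
labels = [0, 0, 0, 0, 0, 1, 0, 0]

windows = sliding_windows(series, window=4, stride=2)
y = window_labels(labels, window=4, stride=2)
print(len(windows), y)
```

With a stride smaller than the window length, a single anomalous point marks every window that covers it, which is why two of the three toy windows are labeled anomalous.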

### 2\.2 Preprocessing for mixed features

Industrial sensing data often contain both continuous features and low\-cardinality discrete features\. We therefore first identify the feature type according to the cardinality observed in the training set: if the number of unique values in the $j$\-th dimension of the training set does not exceed a threshold $K_{\max}$, that dimension is treated as discrete; otherwise, it is treated as continuous\. Let the index sets of continuous and discrete features be $\mathcal{C}$ and $\mathcal{D}$, respectively, satisfying $\mathcal{C}\cup\mathcal{D}=\{1,\ldots,D\}$ and $\mathcal{C}\cap\mathcal{D}=\emptyset$\.

Stage 1 performs reconstruction and pseudo\-anomaly generation only in the continuous subspace\. For $j\in\mathcal{C}$, feature\-wise scaling is applied using training\-set statistics, i\.e\., $\tilde{x}_{t,j}=\frac{x_{t,j}-a_{j}}{b_{j}}$, where $a_{j}$ and $b_{j}$ are estimated from the training set\. The framework supports three scaling schemes: z\-score, min–max, and robust min–max\. Discrete features are not involved in the continuous editing procedure of Stage 1; instead, they are modeled and scored later through an independent discrete KNN branch\.
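A minimal sketch of this preprocessing, assuming the z-score variant of the scaling (so that $a_j$ is the training mean and $b_j$ the training standard deviation). The function names, the toy data, the value of `k_max`, and the guard against a zero standard deviation are illustrative assumptions, not details taken from the paper.

```python
import statistics


def split_feature_types(train_rows, k_max):
    """Return (continuous_idx, discrete_idx) from unique-value counts in the training set."""
    continuous, discrete = [], []
    for j in range(len(train_rows[0])):
        n_unique = len({row[j] for row in train_rows})
        (discrete if n_unique <= k_max else continuous).append(j)
    return continuous, discrete


def fit_zscore(train_rows, continuous):
    """Estimate (a_j, b_j) = (mean, std) per continuous feature, from training data only."""
    params = {}
    for j in continuous:
        column = [row[j] for row in train_rows]
        std = statistics.pstdev(column)
        # Assumption: fall back to b_j = 1 for a constant column to avoid dividing by zero.
        params[j] = (statistics.fmean(column), std if std > 0 else 1.0)
    return params


def scale_row(row, params):
    """Apply x~ = (x - a_j) / b_j to the continuous features of one time step."""
    return {j: (row[j] - a) / b for j, (a, b) in params.items()}


# Toy training set: feature 0 is continuous, feature 1 takes only two values.
train_rows = [[0.1, 0], [0.4, 1], [0.9, 0], [1.6, 1], [2.5, 0]]
continuous, discrete = split_feature_types(train_rows, k_max=2)
params = fit_zscore(train_rows, continuous)
scaled = [scale_row(row, params)[0] for row in train_rows]
print(continuous, discrete)
```

The same `params` would be reused unchanged on test data, so that no test statistics leak into the scaling.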

### 2\.3 Stage 1: Reconstruction\-driven pseudo\-anomalous window generation

As shown in Fig\.[3](https://arxiv.org/html/2606.04073#S2.F3), Stage 1 consists of four steps: continuous\-feature normalization, reconstruction\-model training, per\-feature error control, and pseudo\-anomalous window collection\. The goal of this stage is not to directly perform anomaly discrimination, but rather to generate a set of boundary samples with controllable deviation magnitudes around the normal manifold, thereby providing effective negative references for the representation learning in Stage 2\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/stage1.png)

Figure 3: Stage 1 pipeline\. The left part shows data normalization in the continuous subspace and the reconstruction model; the middle part shows a shared controller based on per\-feature states; and the right part shows the pseudo\-anomalous window collection process based on target\-error constraints and error\-bin balancing\.

#### 2\.3\.1 Continuous\-subspace reconstruction model

Let the representation of window $\mathbf{X}_{i}$ in the continuous subspace be denoted by $\tilde{\mathbf{X}}^{\mathcal{C}}_{i}\in\mathbb{R}^{W\times|\mathcal{C}|}$\. For each input window, an initial hidden representation is first obtained through a linear layer, i\.e\., $\mathbf{H}^{(0)}=\mathrm{Linear}\big(\tilde{\mathbf{X}}^{\mathcal{C}}_{i}\big)$\. Then, an ExtremeKAN module is introduced to enhance amplitude\-related nonlinear responses, and its output is injected into the Transformer encoder as a residual term:

$$\mathbf{H}^{(1)}=\mathrm{TransformerEncoder}\left(\mathrm{PE}\big(\mathbf{H}^{(0)}+\mathrm{EKAN}(\mathbf{H}^{(0)})\big)\right). \tag{2}$$

Here, $\mathrm{PE}(\cdot)$ denotes positional encoding and $\mathrm{EKAN}(\cdot)$ denotes the ExtremeKAN nonlinear enhancement module\.

To preserve the independent reconstruction capability of each sensor dimension, a dedicated decoding head is assigned to every continuous feature\. Let $\mathbf{h}_{i,t}$ denote the hidden state of $\mathbf{H}^{(1)}$ at time step $t$\. The reconstruction result for the $j$\-th continuous feature is then $\hat{x}_{i,t,j}=D_{j}(\mathbf{h}_{i,t})$, where $j\in\mathcal{C}$ and $D_{j}(\cdot)$ is the feature\-specific decoding mapping\. Accordingly, the window\-wise, feature\-wise reconstruction error is defined as

$$e_{i,j}=\frac{1}{W}\sum_{t=1}^{W}\left(\tilde{x}_{i,t,j}-\hat{x}_{i,t,j}\right)^{2},\qquad j\in\mathcal{C}. \tag{3}$$

The reconstruction model is trained using the mean squared error over continuous features:

$$\mathcal{L}_{\mathrm{rec}}=\frac{1}{|\mathcal{C}|}\sum_{j\in\mathcal{C}}\frac{1}{N}\sum_{i=1}^{N}e_{i,j}. \tag{4}$$
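To make the two error quantities concrete, the following sketch computes the per-feature window error of Eq. (3) and the training loss of Eq. (4) for given reconstructions. It deliberately leaves out the network itself: the reconstructions are supplied as plain lists, and the function names are hypothetical.

```python
def feature_errors(window, recon):
    """e_{i,j} of Eq. (3): mean squared residual over the W time steps, one value per feature."""
    n_steps, n_feat = len(window), len(window[0])
    return [
        sum((window[t][j] - recon[t][j]) ** 2 for t in range(n_steps)) / n_steps
        for j in range(n_feat)
    ]


def reconstruction_loss(windows, recons):
    """L_rec of Eq. (4): average of e_{i,j} over all N windows and all continuous features."""
    errors = [feature_errors(x, r) for x, r in zip(windows, recons)]
    n_feat = len(errors[0])
    return sum(sum(e) for e in errors) / (len(errors) * n_feat)


# One window with W = 2 steps and 2 continuous features; the reconstruction is all zeros.
x = [[1.0, 0.0], [2.0, 0.0]]
x_hat = [[0.0, 0.0], [0.0, 0.0]]

e = feature_errors(x, x_hat)
loss = reconstruction_loss([x], [x_hat])
print(e, loss)
```

Keeping the error as a vector over features, rather than collapsing it to one number per window, is what later allows a separate target error to be set for each feature.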

#### 2\.3\.2 Target\-error modeling and per\-feature control

To generate pseudo\-anomalous windows that deviate from the normal manifold without becoming excessively distorted, this paper does not directly impose arbitrary perturbations on the raw signal\. Instead, a target error interval is specified for each continuous feature in the reconstruction\-error space\. Specifically, the reconstruction\-error distribution on the training set is used to construct target intervals $\tau_{j}^{\mathrm{low}}=Q_{q_{l}}\left(\{e_{i,j}^{\mathrm{train}}\}_{i}\right)$ and $\tau_{j}^{\mathrm{high}}=Q_{q_{u}}\left(\{e_{i,j}^{\mathrm{train}}\}_{i}\right)$, where $Q_{q}(\cdot)$ denotes the $q$\-th quantile\. A target error is then sampled for each window and each edited feature as $\tau_{i,j}=\tau_{j}^{\mathrm{low}}+u_{i,j}\left(\tau_{j}^{\mathrm{high}}-\tau_{j}^{\mathrm{low}}\right)$, where $u_{i,j}\in[0,1]$ can be obtained by uniform sampling, Beta sampling, or grid sampling\. This design is mainly motivated by an empirical phenomenon observed in bearing data: the upper tail of the reconstruction\-error distribution for normal training windows is often closer to the boundary\-strength range required for subsequent pseudo\-anomaly generation\. Therefore, the higher\-quantile interval of reconstruction errors on normal training samples in Stage 1 can serve as an empirical reference for the target error strength of pseudo\-anomalies\. It should be emphasized that this reference is derived solely from the normal training data, does not rely on anomaly labels, and does not constitute the final anomaly decision threshold\.
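The target-error construction can be illustrated with a small stdlib-only sketch. The quantile levels 0.90 and 0.99, the linear-interpolation quantile, the uniform choice of $u_{i,j}$, and all function names are assumptions made for the example; the text above does not fix $q_l$ and $q_u$.

```python
import random


def quantile(values, q):
    """Linear-interpolated q-quantile (0 <= q <= 1) of a list of numbers."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def target_interval(train_errors_j, q_low, q_high):
    """[tau_low, tau_high] for one feature, from its training reconstruction errors."""
    return quantile(train_errors_j, q_low), quantile(train_errors_j, q_high)


def sample_target(tau_low, tau_high, rng):
    """tau_{i,j} = tau_low + u * (tau_high - tau_low), with u drawn uniformly from [0, 1)."""
    return tau_low + rng.random() * (tau_high - tau_low)


# Hypothetical training errors for one feature: 0.01, 0.02, ..., 1.00.
train_errors = [k / 100 for k in range(1, 101)]
tau_low, tau_high = target_interval(train_errors, q_low=0.90, q_high=0.99)

rng = random.Random(0)
targets = [sample_target(tau_low, tau_high, rng) for _ in range(5)]
print(tau_low, tau_high)
```

Swapping `rng.random()` for `rng.betavariate(a, b)` or for a fixed grid of values would give the Beta and grid variants mentioned in the text.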

As shown in the middle part of Fig\.[3](https://arxiv.org/html/2606.04073#S2.F3), we further adopt a shared per\-feature actor–critic controller to adaptively adjust the editing step size for each continuous feature\. Let $\mathcal{C}_{e}\subseteq\mathcal{C}$ denote the subset of features being edited in the current window\. At the $r$\-th iteration, the state vector corresponding to the $j$\-th feature of window $i$ is defined as

$$\mathbf{s}^{(r)}_{i,j}=\big[e^{(r)}_{i,j},\ \tau_{i,j},\ e^{(r)}_{i,j}-\tau_{i,j},\ \delta^{(r)}_{i,j},\ d^{(r)}_{i,j},\ r/R,\ \bar{e}^{(r)}_{i},\ \mathrm{std}(\mathbf{e}^{(r)}_{i})\big], \tag{5}$$

where $\delta^{(r)}_{i,j}$ is the mean absolute reconstruction residual, $d^{(r)}_{i,j}\in\{-1,+1\}$ is the current editing direction, $R$ is the maximum number of iterations, and $\bar{e}^{(r)}_{i}$ and $\mathrm{std}(\mathbf{e}^{(r)}_{i})$ are the mean and standard deviation of the per\-feature errors within the current window, respectively\. All features share the same actor parameters so that the control policy remains consistent and transferable across different sensor dimensions\.

Given the state $\mathbf{s}^{(r)}_{i,j}$, the actor outputs a step\-size multiplier $m^{(r)}_{i,j}=m_{\min}+\left(m_{\max}-m_{\min}\right)\pi_{\phi}\big(\mathbf{s}^{(r)}_{i,j}\big)$, which is then used to obtain the actual update step $\eta^{(r)}_{i,j}=\mathrm{clip}\left(\eta_{0}m^{(r)}_{i,j},\eta_{\min},\eta_{\max}\right)$, where $\eta_{0}$ is the base step size\. The editing direction is determined by the relative position between the current error and the target error:

$$d^{(r)}_{i,j}=\begin{cases}+1,&e^{(r)}_{i,j}\leq\tau_{i,j},\\ -1,&e^{(r)}_{i,j}>\tau_{i,j}.\end{cases} \tag{6}$$

The update applied to the $j$\-th continuous feature in the original window is therefore

$$\tilde{x}^{(r+1)}_{i,t,j}=\tilde{x}^{(r)}_{i,t,j}+\eta^{(r)}_{i,j}d^{(r)}_{i,j}\left(\tilde{x}^{(r)}_{i,t,j}-\hat{x}^{(r)}_{i,t,j}\right). \tag{7}$$

If the error crosses the target value before and after the update, a crossing indicator is defined as

$$c^{(r)}_{i,j}=\mathbb{I}\left[\left(e^{(r)}_{i,j}-\tau_{i,j}\right)\left(e^{(r+1)}_{i,j}-\tau_{i,j}\right)\leq 0\right]. \tag{8}$$

The corresponding per\-feature reward is defined as

$$R^{(r)}_{i,j}=-\left|e^{(r+1)}_{i,j}-\tau_{i,j}\right|+\lambda_{\mathrm{old}}\left|e^{(r)}_{i,j}-\tau_{i,j}\right|+\lambda_{\mathrm{cross}}c^{(r)}_{i,j}. \tag{9}$$

This reward encourages the updated error to approach the target interval while preserving an incentive for effective crossing behavior\. The controller is trained using experience replay and a target network\.
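One editing iteration for a single feature, following Eqs. (6) to (9), can be sketched as below. Several simplifications are assumptions of this example and not part of the method as described: the actor output is a fixed number in place of the learned policy $\pi_{\phi}$, the reconstruction is held fixed instead of being recomputed by the model after the edit, and all numeric settings and function names are hypothetical.

```python
def clip(value, low, high):
    return max(low, min(high, value))


def mse(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def step_size(policy_out, eta0, m_min, m_max, eta_min, eta_max):
    """eta = clip(eta0 * m, eta_min, eta_max), with m = m_min + (m_max - m_min) * pi(s)."""
    m = m_min + (m_max - m_min) * policy_out
    return clip(eta0 * m, eta_min, eta_max)


def edit_feature(x, x_hat, err, tau, eta):
    """Eqs. (6)-(7): d = +1 amplifies the residual x - x_hat, d = -1 shrinks it."""
    d = 1 if err <= tau else -1
    return [xt + eta * d * (xt - xh) for xt, xh in zip(x, x_hat)], d


def reward(err_old, err_new, tau, lam_old, lam_cross):
    """Eqs. (8)-(9): distance to the target after the step, plus progress and crossing terms."""
    crossed = 1 if (err_old - tau) * (err_new - tau) <= 0 else 0
    return -abs(err_new - tau) + lam_old * abs(err_old - tau) + lam_cross * crossed


# Toy feature over W = 2 steps; the reconstruction is held fixed (assumption).
x, x_hat, tau = [1.0, -1.0], [0.0, 0.0], 2.0
err_old = mse(x, x_hat)                              # 1.0, below the target
eta = step_size(0.5, eta0=0.2, m_min=0.5, m_max=1.5, eta_min=0.01, eta_max=1.0)
x_new, d = edit_feature(x, x_hat, err_old, tau, eta)
err_new = mse(x_new, x_hat)
r = reward(err_old, err_new, tau, lam_old=0.5, lam_cross=1.0)
print(d, err_new, r)
```

In this toy step the error moves from 1.0 toward the target 2.0 without crossing it, so the reward contains only the distance and progress terms.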

#### 2.3.3 Pseudo-anomalous window filtering

After completing $R$ iterations, the deviation between the final error and the target error is computed for each candidate pseudo-anomalous window as $\Delta_{i,j}=\left|e_{i,j}^{(R)}-\tau_{i,j}\right|$. If the $j$-th feature satisfies

$$\Delta_{i,j}\leq\max\left(\epsilon_{\mathrm{abs}},\epsilon_{\mathrm{rel}}|\tau_{i,j}|\right),\tag{10}$$

then that feature is regarded as having hit the target. The hit rate of window $i$ is further defined as

$$h_{i}=\frac{1}{|\mathcal{C}_{e}|}\sum_{j\in\mathcal{C}_{e}}\mathbb{I}\left[\Delta_{i,j}\leq\max\left(\epsilon_{\mathrm{abs}},\epsilon_{\mathrm{rel}}|\tau_{i,j}|\right)\right].\tag{11}$$

Only candidate windows whose hit rate exceeds a threshold are retained as valid pseudo-anomalous samples. To prevent samples from concentrating in a single error-strength region, we further adopt an error-bin balancing strategy to collect pseudo-anomalous windows evenly across different target-error intervals, as shown on the right side of Fig. [3](https://arxiv.org/html/2606.04073#S2.F3). The final pseudo-anomalous window set is denoted by $\mathcal{P}=\{\mathbf{P}_{k}\}_{k=1}^{N_{p}}$, where $\mathbf{P}_{k}\in\mathbb{R}^{W\times|\mathcal{C}|}$.
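The hit test in Eq. (10) and the hit rate in Eq. (11) can be sketched as below. This is an illustrative NumPy version only: the function names and the threshold `min_hit_rate` are assumptions, since the text does not state the threshold value.

```python
import numpy as np

def hit_rate(final_err, tau, eps_abs, eps_rel):
    """Fraction of error-controlled features that hit their target, per window.

    final_err, tau: arrays of shape (n_windows, n_features).
    """
    delta = np.abs(final_err - tau)                       # deviation from target
    tol = np.maximum(eps_abs, eps_rel * np.abs(tau))      # mixed abs/rel tolerance
    return (delta <= tol).mean(axis=1)

def keep_mask(final_err, tau, eps_abs, eps_rel, min_hit_rate):
    """Boolean mask of candidate windows whose hit rate exceeds the threshold."""
    return hit_rate(final_err, tau, eps_abs, eps_rel) > min_hit_rate
```

The mixed tolerance keeps the test meaningful for both small targets, where the absolute term dominates, and large targets, where the relative term dominates.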

### 2.4 Stage 2: Normal–pseudo-anomalous contrastive representation learning

As shown in Fig\.[4](https://arxiv.org/html/2606.04073#S2.F4), Stage 2 uses normal windows and pseudo\-anomalous windows generated in Stage 1 to learn an anomaly\-sensitive representation space\. The left side of the figure corresponds to training\-sample construction and the contrastive objective, while the right side corresponds to KNN\-based anomaly scoring during testing\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/stage2.png)

Figure 4: Stage 2 pipeline. The left side shows normal–pseudo-anomalous contrastive training, where normal neighbors are pulled closer while pseudo-anomalous windows and hard normal samples are pushed farther away. The right side shows KNN-based anomaly scoring at test time, where each test window obtains an anomaly score according to its distance to the normal sample bank.

#### 2.4.1 Training-sample construction and contrastive objective

Let the normal window bank be $\mathcal{N}$ and the pseudo-anomalous window bank be $\mathcal{P}$. For a window $\mathbf{X}$ in the continuous subspace, its embedding is defined as $\mathbf{z}=g_{\psi}(f_{\theta}(\mathbf{X}))\in\mathbb{R}^{m}$, where $f_{\theta}$ is the window encoder and $g_{\psi}$ is the projection head. In the experiments, a CNN is used as the default encoder.

In each training iteration, an anchor window $\mathbf{X}^{a}$ is sampled from $\mathcal{N}$. Its corresponding positive sample $\mathbf{X}^{+}$ is selected from its normal neighbors, i.e., a nearest-neighbor list is first constructed for normal samples in the raw window space, and then one neighboring window is randomly chosen as the positive sample. The pseudo-anomalous negative sample $\mathbf{X}^{p-}$ is selected by hard negative mining from the pseudo-anomalous candidate pool as the sample closest to the current anchor in the embedding space. Meanwhile, the farthest normal sample from the current anchor is selected from the normal candidate pool as a normal-hard negative sample $\mathbf{X}^{h-}$.

Given a triplet $(\mathbf{X}^{a},\mathbf{X}^{+},\mathbf{X}^{-})$, the triplet loss is defined as

$$\mathcal{L}_{\mathrm{tri}}(\mathbf{X}^{a},\mathbf{X}^{+},\mathbf{X}^{-};m)=\max\left(0,\|\mathbf{z}^{a}-\mathbf{z}^{+}\|_{2}-\|\mathbf{z}^{a}-\mathbf{z}^{-}\|_{2}+m\right).\tag{12}$$

The overall contrastive objective consists of a pseudo-anomalous negative term and a normal-hard negative term:

$$\mathcal{L}_{\mathrm{stage2}}=\mathcal{L}_{\mathrm{tri}}(\mathbf{X}^{a},\mathbf{X}^{+},\mathbf{X}^{p-};m_{p})+\lambda\mathcal{L}_{\mathrm{tri}}(\mathbf{X}^{a},\mathbf{X}^{+},\mathbf{X}^{h-};m_{h}),\tag{13}$$

where $m_{p}$ and $m_{h}$ denote the pseudo-anomaly margin and the normal-hard margin, respectively, and $\lambda$ is a weighting coefficient. This loss keeps normal neighborhoods compact while pushing pseudo-anomalous boundary samples and difficult normal samples away from the anchor, thereby producing a clearer distance structure.
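For illustration, Eqs. (12) and (13) and the two negative-selection rules can be sketched on precomputed embeddings as follows. This is a forward-only NumPy sketch with hypothetical function names; an actual training loop would use an autodiff framework and batch the selection, neither of which is shown here.

```python
import numpy as np

def triplet_loss(z_a, z_pos, z_neg, margin):
    """Eq. (12): hinge on the gap between positive and negative distances."""
    d_pos = np.linalg.norm(z_a - z_pos, axis=-1)
    d_neg = np.linalg.norm(z_a - z_neg, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

def closest_in_pool(z_a, z_pool):
    """Hard pseudo-anomalous negative: pool embedding nearest to the anchor."""
    return z_pool[np.argmin(np.linalg.norm(z_pool - z_a, axis=1))]

def farthest_in_pool(z_a, z_pool):
    """Normal-hard negative as described in the text: farthest normal embedding."""
    return z_pool[np.argmax(np.linalg.norm(z_pool - z_a, axis=1))]

def stage2_loss(z_a, z_pos, z_pneg, z_hneg, m_p, m_h, lam):
    """Eq. (13): pseudo-anomalous term plus weighted normal-hard term."""
    return triplet_loss(z_a, z_pos, z_pneg, m_p) + lam * triplet_loss(z_a, z_pos, z_hneg, m_h)
```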

#### 2.4.2 KNN anomaly scoring and point-level aggregation

After training, normal windows, pseudo-anomalous windows, and test windows are encoded to obtain the embedding sets $\mathcal{Z}_{N}$, $\mathcal{Z}_{P}$, and $\mathcal{Z}_{T}$, respectively. For a test-window embedding $\mathbf{z}^{t}_{i}$, its average KNN distance to the normal sample bank is defined as

$$d_{N}(\mathbf{z}^{t}_{i})=\frac{1}{k}\sum_{\mathbf{z}\in\mathcal{N}_{k}(\mathbf{z}^{t}_{i};\mathcal{Z}_{N})}\|\mathbf{z}^{t}_{i}-\mathbf{z}\|_{2},\tag{14}$$

and its average KNN distance to the pseudo-anomalous sample bank is defined as

$$d_{P}(\mathbf{z}^{t}_{i})=\frac{1}{k}\sum_{\mathbf{z}\in\mathcal{N}_{k}(\mathbf{z}^{t}_{i};\mathcal{Z}_{P})}\|\mathbf{z}^{t}_{i}-\mathbf{z}\|_{2}.\tag{15}$$

By default, this paper adopts a window-level anomaly score based on the distance to the normal sample bank. Let $d_{N}(\mathbf{z}^{t}_{i})$ denote the average KNN distance from the test-window embedding $\mathbf{z}^{t}_{i}$ to the normal sample bank. After robust normalization with quantile clipping, the final window-level anomaly score is obtained as $S_{i}=\mathrm{robust\_minmax}\big(d_{N}(\mathbf{z}^{t}_{i})\big)$. Hereafter, $S_{i}$ is referred to as the *normal score*. It should be noted that $\mathrm{robust\_minmax}(\cdot)$ here operates on the anomaly-distance scores output by Stage 2, which is different from the robust scaling used earlier for input-feature preprocessing. As a complement, a relative-distance score can also be defined as $S^{\mathrm{rel}}_{i}=\frac{d_{N}(\mathbf{z}^{t}_{i})}{d_{N}(\mathbf{z}^{t}_{i})+d_{P}(\mathbf{z}^{t}_{i})+\epsilon}$, which is referred to below as the *relative score*. In addition, a distance-weighted KNN classifier can be trained on $\mathcal{Z}_{N}\cup\mathcal{Z}_{P}$ to obtain an auxiliary *OOC score*.
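The normal score and the relative score can be sketched with a brute-force KNN search as below. The function names mirror the text, but the clipping quantiles in `robust_minmax` (1st and 99th percentiles) are an assumption, since the section does not state them; a brute-force distance matrix is also only practical for small banks.

```python
import numpy as np

def knn_mean_distance(queries, bank, k):
    """Eqs. (14)/(15): mean Euclidean distance to the k nearest bank embeddings."""
    dists = np.linalg.norm(queries[:, None, :] - bank[None, :, :], axis=2)
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)

def robust_minmax(scores, q_low=1.0, q_high=99.0):
    """Min-max scaling between two quantiles, clipped to [0, 1] (quantiles assumed)."""
    lo, hi = np.percentile(scores, [q_low, q_high])
    if hi <= lo:
        return np.zeros_like(scores, dtype=float)
    return np.clip((scores - lo) / (hi - lo), 0.0, 1.0)

def relative_score(d_n, d_p, eps=1e-8):
    """Relative-distance score: near 0 close to normals, near 1 close to pseudo-anomalies."""
    return d_n / (d_n + d_p + eps)
```

A typical call would be `robust_minmax(knn_mean_distance(z_test, z_normal, k=5))`, where `z_test` and `z_normal` are placeholder names for the encoded test and normal-training windows.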

Since the model outputs window\-level scores while the evaluation uses point\-level labels, the point\-level anomaly score is obtained by averaging the scores of all windows covering that time point:

$$S(t)=\frac{1}{|\Omega(t)|}\sum_{i\in\Omega(t)}S_{i},\qquad\Omega(t)=\{i\mid t_{i}\leq t\leq t_{i}+W-1\}.\tag{16}$$
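A direct reading of Eq. (16) is an overlap average: each window adds its score to every time point it covers, and the sum is divided by the number of covering windows. The sketch below assumes integer window start indices; points covered by no window are returned as NaN, which is a choice of this sketch rather than something the text specifies.

```python
import numpy as np

def point_scores(window_scores, starts, W, T):
    """Average the scores of all windows that cover each of the T time points."""
    total = np.zeros(T)
    count = np.zeros(T)
    for score, start in zip(window_scores, starts):
        total[start:start + W] += score   # window covers t_i .. t_i + W - 1
        count[start:start + W] += 1
    out = np.full(T, np.nan)              # uncovered points stay NaN
    covered = count > 0
    out[covered] = total[covered] / count[covered]
    return out
```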

### 2.5 Discrete-feature processing and score fusion

For fully discrete data, the continuous reconstruction and pseudo-anomaly generation of Stage 1 are no longer applicable. In this case, each discrete dimension is first one-hot encoded according to the training-set categories, and each window is flattened into a vector. The anomaly score is then defined by comparing the test window against the bank of normal training windows using KNN:

$$S_{\mathrm{disc}}(\mathbf{X}^{t}_{i})=\frac{1}{k}\sum_{\mathbf{X}\in\mathcal{N}_{k}(\mathbf{X}^{t}_{i};\mathcal{N}_{\mathrm{disc}})}d(\mathbf{X}^{t}_{i},\mathbf{X}),\tag{17}$$

where $d(\cdot,\cdot)$ is Hamming distance by default, but can also be replaced with Euclidean distance.

For mixed-feature data, anomaly scores are computed separately by the continuous branch and the discrete branch. Let the continuous score produced by Stage 2 be $S_{\mathrm{cont}}$, and let the discrete KNN score be $S_{\mathrm{disc}}$. After normalizing the two scores to $[0,1]$, denote them by $\bar{S}_{\mathrm{cont}}=\mathrm{Norm}_{01}(S_{\mathrm{cont}})$ and $\bar{S}_{\mathrm{disc}}=\mathrm{Norm}_{01}(S_{\mathrm{disc}})$, respectively. Score-level fusion is then performed using $S_{\mathrm{fuse}}=\max\left(\bar{S}_{\mathrm{cont}},\bar{S}_{\mathrm{disc}}\right)$. This strategy avoids the combinatorial explosion that would arise from generating pseudo-anomalous samples in the discrete state space, while preserving the ability to jointly sense continuous anomalies and discrete-state anomalies.
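The discrete branch and the max fusion can be sketched as follows. Several details are assumptions of this sketch rather than statements from the paper: categories unseen in training encode to an all-zero block, Hamming distance is taken as the unnormalized count of differing one-hot entries, and $\mathrm{Norm}_{01}$ is implemented as plain min-max scaling.

```python
import numpy as np

def one_hot_windows(windows, categories):
    """One-hot encode integer windows of shape (N, W, D) and flatten each window.

    categories: list of D arrays holding the training-set categories per dimension.
    """
    parts = [windows[:, :, j, None] == np.asarray(cats)[None, None, :]
             for j, cats in enumerate(categories)]
    return np.concatenate(parts, axis=2).reshape(len(windows), -1).astype(np.uint8)

def hamming_knn_score(test_vecs, train_vecs, k):
    """Eq. (17) with Hamming distance: mean distance to the k nearest normal windows."""
    dists = (test_vecs[:, None, :] != train_vecs[None, :, :]).sum(axis=2)
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)

def norm01(scores):
    """Min-max scaling to [0, 1] (assumed form of Norm01)."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def fuse(s_cont, s_disc):
    """Score-level fusion: element-wise maximum of the normalized branch scores."""
    return np.maximum(norm01(s_cont), norm01(s_disc))
```

Taking the maximum means a window is flagged when either branch considers it anomalous, so a purely discrete-state anomaly is not averaged away by a quiet continuous branch.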

### 2.6 Evaluation metrics

The experiments use AUROC, AUPR, and best F1 to evaluate the ranking ability of anomaly scores and the performance of threshold\-based detection, and further use VUS\-ROC and VUS\-PR to evaluate detection robustness at the anomaly\-interval level\. These metrics can be computed on either window\-level or point\-level anomaly scores depending on the experimental setting, as detailed in the experimental section\. Together, they reflect model performance from three aspects: ranking capability, interval tolerance, and threshold\-based detection performance\.

## 3 Experiments and Results Analysis

This section evaluates the effectiveness of the proposed method from four aspects: bearing fault detection, degradation\-process detection, ablation analysis, and hyperparameter sensitivity\. The central question is whether the proposed two\-stage pseudo anomaly\-guided representation learning framework can produce stable anomaly scores on different rotating machinery datasets and maintain sensitivity to anomaly evolution in complex degradation scenarios when trained using only normal samples\. Results on public TSAD datasets are presented later in the discussion section as a supplementary analysis of the method’s broader applicability\. Unless otherwise specified, all experiments use normal samples as the training set, while the test set consists of normal segments and fault/anomalous segments\. Labels are used only for metric computation and visualization during testing\.

### 3.1 Experimental datasets and evaluation settings

#### 3.1.1 Dataset overview

To systematically evaluate the applicability of the proposed method under different equipment conditions, fault patterns, and temporal scales, we construct an experimental suite centered on bearing fault detection datasets and degradation-process datasets. The fault detection part includes the CWRU (Case Western Reserve University) bearing dataset Case Western Reserve University Bearing Data Center ([2024](https://arxiv.org/html/2606.04073#bib.bib31)), the HTBF dataset related to high-speed-train bogie faults, the PHM2009 gearbox challenge dataset PHM Society ([2009](https://arxiv.org/html/2606.04073#bib.bib32)), and the REALBOX axle-box bearing field-measurement dataset. The degradation-detection part includes the XJTU-SY full-life bearing degradation dataset Wang ([2021](https://arxiv.org/html/2606.04073#bib.bib33)) and the IMS run-to-failure bearing dataset NASA Open Data Portal ([2023](https://arxiv.org/html/2606.04073#bib.bib34)). In addition, the discussion section later provides an extended analysis on simulated anomaly data and $13$ public time-series datasets from the TSB-AD benchmark to examine the broader applicability of the method when transferred from bearing vibration scenarios to more general TSAD tasks.

The four fault-detection datasets correspond to different validation goals. CWRU was collected on a standard test rig and contains clearly defined fault types with a relatively high signal-to-noise ratio, making it suitable for validating the model's basic discriminative ability on typical bearing faults. In this work, we use the normal baseline and the DE channel from the $12\,\mathrm{kHz}$ drive-end and fan-end fault files, covering ball, inner-race, and outer-race fault states; the $48\,\mathrm{kHz}$ drive-end data are not used. HTBF comes from a mechanical system related to high-speed-train bogies and contains gearbox faults, axle-box faults, and compound fault combinations under multiple operating conditions. We use five synchronized vibration channels to assess model stability in multi-channel and multi-fault-combination scenarios. PHM2009 is based on a generic industrial gearbox. We use the two acceleration channels at the input and output ends and exclude the tachometer channel; spur 1 and helical 1 are treated as normal classes, while spur 2–8 and helical 2–6 are treated as fault classes. REALBOX consists of measured axial acceleration vibration records from an axle-box bearing of a high-speed train, sampled at $10\,\mathrm{kHz}$, with an inner-race failure as the fault type. Because measured fault samples are limited, this dataset is mainly used to verify the applicability of the method in small-sample real engineering scenarios.

Unlike fixed fault-category detection, degradation-process detection involves anomalous states that usually do not appear abruptly, but instead evolve gradually over a long period. The XJTU-SY dataset contains complete run-to-failure sequences for $15$ bearings under $3$ speed/load conditions, with both horizontal and vertical vibration channels recorded at each acquisition. The IMS dataset contains run-to-failure bearing sequences over an even longer time scale, making it suitable for examining the model's response to weak degradation, long transitional states, and late-stage failure. We do not interpret degradation-detection results as remaining useful life prediction; instead, anomaly scores are viewed as the degree to which test windows deviate from the early-life normal training distribution.

#### 3.1.2 Construction of fault-detection datasets

To conform to the one-class anomaly-detection setting, all fault-detection experiments train the model using only normal-state samples, while fault samples are used only for evaluation at test time. Rather than directly using the complete raw sequences from the public datasets, we extract fixed-length segments from the raw files according to a unified protocol to construct a normal training segment and a normal/fault test sequence. Specifically, a normal training segment $X_{\mathrm{tr}}^{\mathrm{N}}$ is first extracted from the normal-state data. This segment is used for Stage 1 reconstruction-model training, Stage 2 representation-encoder training, and the construction of the KNN normal reference bank. The remaining normal test segment is then concatenated directly with fault segments along the time dimension to form a long test sequence. For multi-channel data, the channel dimension remains unchanged during concatenation, and only the time dimension is extended.

Let the retained normal test segment be $X_{\mathrm{te}}^{\mathrm{N}}$, and let the $m$-th fault segment be $X_{\mathrm{te}}^{\mathrm{F},m}$. The test sequence for fault detection is then constructed as

$$X_{\mathrm{te}}=[X_{\mathrm{te}}^{\mathrm{N}};X_{\mathrm{te}}^{\mathrm{F},1};X_{\mathrm{te}}^{\mathrm{F},2};\cdots;X_{\mathrm{te}}^{\mathrm{F},M}],\tag{18}$$

where $[\cdot;\cdot]$ denotes concatenation along the time dimension. The corresponding point-level labels are defined as

$$y_{t}=\begin{cases}0,&x_{t}\in X_{\mathrm{te}}^{\mathrm{N}},\\ 1,&x_{t}\in X_{\mathrm{te}}^{\mathrm{F},m}.\end{cases}\tag{19}$$

The normal test segment participates in the final computation of AUROC, AUPR, best F1, Precision, and Recall as the negative class, but it is not used for parameter updates in Stage 1 or Stage 2 and is not added to the KNN normal reference bank. The construction of test sequences for each fault-detection dataset is summarized in Table [3](https://arxiv.org/html/2606.04073#S3.T3).

Table 3: Construction of test sequences in the fault-detection experiments

| Dataset | Normal segments ($10^{4}$ points) | Fault/anomalous segments ($10^{4}$ points) | Description of test-sequence construction |
| --- | --- | --- | --- |
| CWRU | $4\times 5.0$ | $105\times 2.0$ | The first $20.0\times 10^{4}$ sampling points form the normal test segment, followed by $105$ fault segments covering ball, inner-race, and outer-race faults. |
| HTBF | $3\times 15.0$ | $84\times 1.0$ | The front part consists of normal test segments under three operating conditions, followed by $84$ fault combinations; the five vibration channels remain time-aligned. |
| PHM2009 | $2\times 10.0$ | $40.0$ | Helical 1 and spur 1 form the normal test segment; the anomalous part is obtained by approximately balanced concatenation of $12$ fault classes from helical 2–6 and spur 2–8. |
| REALBOX | $7.0+12.0+1.0$ | $40.0$ | The first $20.0\times 10^{4}$ sampling points are formed by three normal fragments, followed by an inner-race fault fragment; only samples in the speed interval $295$–$305$ are retained during construction. |

Note: The values for normal and fault/anomalous segments in the table are measured in units of $10^{4}$ sampling points. For example, $4\times 5.0$ indicates four fragments, each with length $5.0\times 10^{4}$. The test fragments are concatenated sequentially along the time dimension only for presentation and for plotting continuous anomaly-score curves. During window construction, however, each original fragment is segmented separately, and no windows are created across concatenation boundaries between different fragments. Window labels are determined by the point-level labels of the underlying fragments. CWRU uses the DE single channel, while the other multi-channel datasets keep channel synchronization during concatenation.

The above test sequences are concatenated sequentially along the time dimension only for description and result visualization\. No additional padding, smoothing segments, or transition segments are inserted\. This design facilitates unified presentation of the temporal relationship between the normal segment and the subsequent fault segments, as well as continuous plotting of anomaly\-score curves\.

#### 3.1.3 Construction of degradation-detection datasets

For the XJTU-SY and IMS degradation datasets, the test sequences preserve the original temporal continuity of the full-life process and do not artificially concatenate normal files with fault files. The model uses only early-life snapshots as the normal training segment, and the remaining snapshots are treated as the test segment. The degradation onset is automatically determined using snapshot-level vibration-amplitude statistics. Specifically, statistics such as RMS, p95, and maximum amplitude are computed for each snapshot, compared by ratio against statistics from the early normal segment, and an abnormal increase is required to persist for at least $3$ snapshots (sustain $=3$) to reduce the influence of isolated shocks on onset detection. Sampling points after the degradation onset are labeled as anomalous, while test samples before the onset that do not belong to the training segment are labeled as normal.
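The onset rule described above can be sketched as follows. Only the statistics (RMS, p95, maximum amplitude), the ratio comparison, and sustain $=3$ come from the text. The ratio threshold of 2.0, the median baseline over the early snapshots, and the rule that any one statistic exceeding the threshold flags a snapshot are all assumptions made for this illustration.

```python
import numpy as np

def snapshot_stats(snapshot):
    """RMS, 95th-percentile amplitude, and maximum amplitude of one snapshot."""
    amp = np.abs(snapshot)
    return np.array([np.sqrt(np.mean(snapshot ** 2)), np.percentile(amp, 95), amp.max()])

def detect_onset(snapshots, n_early, ratio_thresh=2.0, sustain=3):
    """Index of the first snapshot of a sustained amplitude increase, or None.

    ratio_thresh and the median baseline are assumed values, not from the paper.
    """
    stats = np.array([snapshot_stats(s) for s in snapshots])
    baseline = np.median(stats[:n_early], axis=0)            # early-life reference
    flagged = (stats / baseline > ratio_thresh).any(axis=1)
    run = 0
    for i, is_high in enumerate(flagged):
        run = run + 1 if is_high else 0                      # length of current run
        if run >= sustain:
            return i - sustain + 1                           # start of the run
    return None
```

Requiring a run of consecutive flagged snapshots is what lets a single isolated shock pass without moving the onset.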

The XJTU-SY samples used in this paper include Bearing1-1, Bearing1-3, Bearing2-2, Bearing2-5, Bearing3-3, and Bearing3-5, covering three operating conditions: $35$ Hz/$12$ kN, $37.5$ Hz/$11$ kN, and $40$ Hz/$10$ kN. Except for Bearing3-5, which uses the first $23$ snapshots as the training segment due to an earlier detected degradation onset, all other XJTU-SY samples use the first $30$ snapshots for training. For IMS, Bearing4 from the 1st-test is used, with the first $200$ snapshots for training and the remaining snapshots for testing. The training lengths and degradation-onset settings are listed in Table [4](https://arxiv.org/html/2606.04073#S3.T4).

Table 4: Training-segment and degradation-onset settings for degradation detection

| Dataset | Sample | Number of training snapshots | Total snapshots | Degradation onset |
| --- | --- | --- | --- | --- |
| XJTU-SY | Bearing1-1 | 30 | 123 | 77 |
| XJTU-SY | Bearing1-3 | 30 | 158 | 89 |
| XJTU-SY | Bearing2-2 | 30 | 161 | 49 |
| XJTU-SY | Bearing2-5 | 30 | 339 | 182 |
| XJTU-SY | Bearing3-3 | 30 | 371 | 343 |
| XJTU-SY | Bearing3-5 | 23 | 114 | 24 |
| IMS | Bearing4 | 200 | 2156 | 1436 |

Note: Sampling points after the degradation onset are labeled as anomalous, and the window label is determined by whether the window contains anomalous points. For Bearing3-5, the training segment is restricted to the period before the degradation onset because the onset is detected early.

#### 3.1.4 Window segmentation, label generation, and normalization

All sequences are segmented using sliding windows with length $L=512$ and stride $H=256$. The $j$-th window is defined as

$$W_{j}=X_{s_{j}:s_{j}+L-1,:},\quad s_{j}=jH.\tag{20}$$

Sampling points at the tail that do not form a complete window are discarded. Window-level labels are generated using a conservative rule: if any time point within the window belongs to an anomalous interval, that window is labeled as anomalous, i.e.,

$$Y_{j}=\mathbb{I}\left(\sum_{t=s_{j}}^{s_{j}+L-1}y_{t}>0\right).\tag{21}$$

Therefore, a window that crosses a normal/fault concatenation boundary or a degradation-onset boundary is labeled anomalous as long as it contains any anomalous point. All main experimental metrics are computed on window-level anomaly scores and window-level labels.
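The windowing and labeling rules in Eqs. (20) and (21) can be sketched as below; the function names are illustrative. Discarding the incomplete tail gives $\lfloor (T-L)/H\rfloor+1$ windows for a sequence of length $T\geq L$, which for $T=400000$ and $T=2300000$ yields $1561$ and $8983$, the CWRU training- and test-window counts reported in Table 5.

```python
import numpy as np

def num_windows(T, L=512, H=256):
    """Number of complete windows in a sequence of length T (tail discarded)."""
    return 0 if T < L else (T - L) // H + 1

def sliding_windows(X, L=512, H=256):
    """Eq. (20): stack windows X[jH : jH + L] from an array of shape (T, C)."""
    n = num_windows(len(X), L, H)
    return np.stack([X[j * H: j * H + L] for j in range(n)])

def window_labels(y, L=512, H=256):
    """Eq. (21): a window is anomalous if it contains any anomalous point."""
    n = num_windows(len(y), L, H)
    return np.array([int(np.sum(y[j * H: j * H + L]) > 0) for j in range(n)])
```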

To avoid leakage of test information, normalization statistics for continuous features are estimated only from the normal training segment and then fixed for the test sequence and for baseline methods\. Multi\-channel data are normalized channel\-wise\. Stage 2 applies the same input transformation using the training\-normal statistics saved from Stage 1\. Baseline methods likewise estimate normalization parameters from the normal training windows and compute anomaly scores on the same test windows\. For the window\-energy heatmaps shown in the figures, we compute the RMS energy for each window and feature channel as

$$r_{i,c}=\sqrt{\frac{1}{L}\sum_{t=1}^{L}x_{i+t,c}^{2}},\tag{22}$$

and then use the $5$th and $99$th percentiles of the RMS energy of the same channel in the training windows as lower and upper bounds to robustly scale the test-window values $r_{i,c}$ and clip them to $[0,1]$. This quantity is denoted as scaled RMS energy in the figures and is used only to visualize input-energy changes across windows and channels; it does not participate in Stage 2 KNN anomaly scoring and is not included in the final evaluation metrics.
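The scaled RMS energy used for the heatmaps can be sketched as follows, assuming windows are stored as an array of shape (number of windows, $L$, channels). The small floor on the denominator is an addition of this sketch to avoid division by zero when the training energy is constant.

```python
import numpy as np

def window_rms(windows):
    """Eq. (22): RMS energy per window and channel; input (N, L, C), output (N, C)."""
    return np.sqrt(np.mean(windows ** 2, axis=1))

def scaled_rms_energy(test_windows, train_windows):
    """Scale test RMS by the 5th/99th training percentiles per channel, clip to [0, 1]."""
    r_train = window_rms(train_windows)
    r_test = window_rms(test_windows)
    lo = np.percentile(r_train, 5, axis=0)
    hi = np.percentile(r_train, 99, axis=0)
    return np.clip((r_test - lo) / np.maximum(hi - lo, 1e-12), 0.0, 1.0)
```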

#### 3.1.5 Training settings and pseudo-anomalous samples

The proposed method consists of two training stages. Stage 1 trains a reconstruction model on normal training windows to generate pseudo-anomalous windows with controllable deviation strength. Stage 2 uses the normal training windows together with the pseudo-anomalous windows generated in Stage 1 to train the representation encoder and compute anomaly scores in the embedding space using a normal reference bank. Unless otherwise specified, the Stage 1 reconstruction model is trained for $12$ epochs, including $4$ epochs of pretraining for the continuous branch, and the Stage 2 representation encoder is trained for $12$ epochs. In the main experiments, the *normal score* with $k=5$ (i.e., the anomaly score based on the KNN distance to the normal sample bank) is used as the window-level anomaly score. The KNN normal reference bank consists only of embeddings of normal training windows and does not include normal test windows or fault windows.

Pseudo-anomalous windows are generated only from normal training windows; no fault test samples or test labels are used. The number of pseudo-anomalous windows for each dataset is jointly determined by Stage 1 generation and error-interval filtering. For CWRU, HTBF, and REALBOX, pseudo-anomalous samples are balanced across five error bins, with $2400$ windows retained per bin, yielding a total of $12000$ pseudo-anomalous windows. For PHM2009, the acceptable sample numbers after error-bin filtering are $866$, $890$, $970$, $992$, and $1006$ in the five bins, respectively, so the final number of retained pseudo-anomalous windows is $4724$. For the XJTU-SY and IMS degradation experiments, $6500$ pseudo-anomalous windows are retained in each case. Table [5](https://arxiv.org/html/2606.04073#S3.T5) summarizes the main data scales and training configurations for all datasets.

Table 5: Experimental scale and main configurations of each dataset

| Dataset | Sample/subset | Dimension index | Training sequence length | Test sequence length | Number of training windows | Number of test windows | Number of pseudo-anomalous windows |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CWRU | – | [1] | 400000 | 2300000 | 1561 | 8983 | 12000 |
| HTBF | – | [1,2,3,4,5] | 900000 | 1290000 | 3514 | 5038 | 12000 |
| PHM2009 | – | [1,2] | 400000 | 600000 | 1561 | 2342 | 4724 |
| REALBOX | – | [1] | 400000 | 600000 | 1561 | 2342 | 12000 |
| XJTU-SY | Bearing1-1 | [1,2] | 983040 | 3047424 | 3839 | 11903 | 6500 |
| IMS | Bearing4 | [1,2] | 4096000 | 40058880 | 15999 | 156479 | 6500 |

Note: The dimension index indicates the channel numbers used for modeling; for example, [1,2] means that Channels 1 and 2 are used. Unless otherwise specified, all experiments use sliding windows with length $L=512$ and stride $H=256$.
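The paper does not spell out the error-bin balancing rule behind the pseudo-anomalous window counts. One reading that is consistent with the reported numbers is a per-bin cap: each bin keeps at most a fixed quota of accepted windows, and bins with fewer accepted windows keep all of them. The sketch below implements that assumed rule; the function name, the random subsampling, and the seed are illustrative choices.

```python
import numpy as np

def balance_bins(bin_ids, n_bins, quota, seed=0):
    """Keep at most `quota` window indices per error bin (assumed balancing rule)."""
    rng = np.random.default_rng(seed)
    kept = []
    for b in range(n_bins):
        idx = np.flatnonzero(bin_ids == b)          # accepted windows in bin b
        if len(idx) > quota:
            idx = rng.choice(idx, size=quota, replace=False)
        kept.append(np.sort(idx))
    return np.concatenate(kept)
```

Under this rule, five bins that each hold at least $2400$ accepted windows give $5\times 2400=12000$ retained windows, and bins holding $866$, $890$, $970$, $992$, and $1006$ accepted windows give $4724$, matching the CWRU and PHM2009 totals stated above.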

#### 3.1.6 Evaluation metrics and visualization

The quantitative evaluation metrics include AUROC, AUPR, best F1, Precision, and Recall. Among them, AUROC and AUPR measure the ranking ability of anomaly scores on normal and anomalous windows and do not depend on a fixed threshold. Best F1 is obtained by sweeping thresholds over the anomaly scores of the test windows, and Precision and Recall are reported at the threshold that maximizes F1. Therefore, best F1 and its corresponding Precision/Recall reflect the upper-bound thresholding performance under the given test labels, rather than the performance of a fixed threshold determined in advance for unsupervised deployment. For the TSAD extension analysis in the later discussion, VUS-ROC and VUS-PR Boniol et al. ([2025](https://arxiv.org/html/2606.04073#bib.bib35)) are further reported to characterize detection quality in the neighborhood of anomalous intervals.
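The best-F1 sweep can be sketched as follows. The sketch assumes that every distinct score is tried as a threshold and that a window is predicted anomalous when its score is greater than or equal to the threshold; the paper does not specify the threshold grid or the tie-breaking rule.

```python
import numpy as np

def best_f1(scores, labels):
    """Return (F1, precision, recall, threshold) at the F1-maximizing threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    best = (0.0, 0.0, 0.0, None)
    for thr in np.unique(scores):
        pred = scores >= thr
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best[0]:
            best = (f1, precision, recall, thr)
    return best
```

Because the threshold is chosen with access to the test labels, the value is an upper bound on what a threshold fixed before deployment could achieve, as noted in the text.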

In addition to quantitative metrics, we conduct qualitative analysis using anomaly\-score curves and representation\-space visualization\. Anomaly\-score curves are used to observe the temporal response of the model to fault segments, degradation stages, or anomalous intervals\. t\-SNE or PCA visualization is used to analyze the relative distributions of normal training windows, normal test windows, pseudo\-anomalous windows, and real anomalous windows in the representation space\. The fault\-detection result figures follow a unified layout: Part A presents the scaled RMS energy heatmap together with the anomaly\-score curves of the proposed method and baseline methods, where light background shading indicates anomalous intervals and dashed lines denote the posterior best\-F1 threshold obtained by scanning on the test\-window labels, used only to facilitate interpretation of the anomaly\-score curves; Part B presents radar charts for AUROC, AUPR, Precision, Recall, and best F1; and Part C shows representation\-space visualization\. It should be emphasized again that normal test windows participate in final metric evaluation but are not used for model training and are not added to the KNN normal reference bank\.

To ensure consistency across methods, the bearing fault-detection and degradation-detection experiments use fixed data splits, fixed window settings, and a fixed random seed of $42$. Deterministic baseline methods are reported directly under this fixed protocol. For deep models involving random initialization or random sampling, the random seed is fixed to ensure reproducibility. Repeated runs are used only to check implementation stability and are not treated as independent repeated trials. All baseline methods use the same training windows, test windows, and window-level labels as the proposed method and are evaluated under the same protocol using AUROC, AUPR, best F1, Precision, and Recall.

#### 3.1.7 Computational cost

To complement the implementation analysis, we further report the inference cost of each method for completing one full scoring pass over all test windows on CWRU, as shown in Table [6](https://arxiv.org/html/2606.04073#S3.T6). This experiment scores all $8983$ test windows once, and training/fitting time as well as reference-bank construction cost are excluded for all methods. For the proposed method, only the online inference of Stage 2 is counted, i.e., test-window encoding and KNN distance computation against the normal sample bank; Stage 1 pseudo-anomaly generation is excluded. On CWRU, the proposed method requires only $0.113\,\mathrm{s}$ on average to complete one full scoring pass over all test windows, corresponding to $0.0126\,\mathrm{ms/window}$, with peak GPU memory of about $169.3\,\mathrm{MB}$. Compared with other methods, our method is clearly faster than heavier deep sequence baselines such as Adjacent Transformer, Transformer AE, and TranAD, but slower than KNN Distance, LOF Novelty, and Deep SVDD, which do not require reconstruction decoding or use lighter scoring schemes. This indicates that even after introducing encoder forward passes and KNN retrieval against the normal bank, the online inference cost of the proposed method remains within an acceptable range.

Table 6: Comparison of inference cost for one complete test-window scoring pass on CWRU. The statistics are based on scoring all $8983$ test windows once; for the proposed method, only Stage 2 inference is counted and Stage 1 pseudo-anomaly generation is excluded; training/fitting and reference-bank construction costs are not included for any method. (GPU: 3080 Ti)

| Method | Model params | Reference bank size | Test windows | Inference time (s) | ms/window | Peak GPU mem (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| Deep SVDD | 213,568 | 0 | 8,983 | 0.044 | 0.0049 | 13.5 |
| KNN Distance | 0 | 1,561 | 8,983 | 0.062 | 0.0069 | 0.0 |
| LOF Novelty | 0 | 1,561 | 8,983 | 0.066 | 0.0073 | 0.0 |
| TPA-AD (Stage 2 only) | 123,840 | 1,561 | 8,983 | 0.113 | 0.0126 | 169.3 |
| Isolation Forest | 0 | 1,561 | 8,983 | 0.226 | 0.0251 | 0.0 |
| One-class SVM | 0 | 1,561 | 8,983 | 0.360 | 0.0401 | 0.0 |
| TranAD | 401 | 0 | 8,983 | 0.493 | 0.0549 | 302.3 |
| Transformer AE | 67,265 | 0 | 8,983 | 0.548 | 0.0610 | 141.9 |
| Adjacent Transformer | 67,265 | 0 | 8,983 | 0.961 | 0.1070 | 1261.4 |

### 3.2 Bearing fault detection experiments

This subsection evaluates the proposed method on four datasets, namely CWRU, HTBF, PHM2009, and REALBOX, to assess its discriminative ability on fault samples\. The analysis focuses on two issues: first, whether anomaly scores can form a stable and clear boundary between normal and fault segments; and second, whether the representation space learned in Stage 2 can effectively push real fault samples away from the normal region\. The data presentation and detection results are discussed below on a dataset\-by\-dataset basis\.

#### 3\.2\.1 CWRU dataset

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_cw_data.png)

Figure 5: Illustration of the CWRU dataset\. A shows the experimental device, B shows the long sequence obtained by concatenating samples from different classes, and C shows the corresponding FFT spectra\. The figure highlights the time\-domain and frequency\-domain differences between normal samples and different fault types\.

The CWRU dataset was released by the Case Western Reserve University Bearing Data Center \(Case Western Reserve University Bearing Data Center, [2024](https://arxiv.org/html/2606.04073#bib.bib31)\)\. Its test rig consists of a $2$ hp motor, a torque transducer/encoder, and a dynamometer\. Faults of different sizes were introduced by electro\-discharge machining on the drive\-end and fan\-end bearings, and vibration signals were collected under loads of $0$–$3$ hp\. The original repository provides both $12\,\mathrm{kHz}$ and $48\,\mathrm{kHz}$ sampled data and covers multiple fault locations such as ball, inner\-race, and outer\-race faults\. In this work, we use the normal baseline and the DE channel from the $12\,\mathrm{kHz}$ drive\-end and fan\-end fault files, covering ball, inner\-race, and outer\-race fault states, while the $48\,\mathrm{kHz}$ drive\-end data are not used\. A training set is constructed from normal segments, and the remaining normal segments are concatenated with fault segments to form the test sequence\. Figure [5](https://arxiv.org/html/2606.04073#S3.F5) shows the experimental setup, the concatenated long\-horizon sequence, and its FFT spectrum\. As can be seen, normal and fault samples differ clearly in both time\-domain amplitude and frequency\-domain structure, making CWRU a benchmark dataset for validating basic fault separability\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_cw_result.png)

Figure 6: Fault\-detection results on the CWRU dataset\. A shows the window\-energy heatmap together with the anomaly\-score curves of the proposed method and baseline methods; B compares AUROC, AUPR, Precision, Recall, and best F1 across methods; and C visualizes the low\-dimensional distributions of normal training windows, normal test windows, pseudo\-anomalous windows, and real anomalous windows\.

Figure [6](https://arxiv.org/html/2606.04073#S3.F6) shows that the proposed method maintains anomaly scores close to zero during the normal stage of the CWRU test sequence, rapidly crosses the threshold when the fault segment begins, and then remains in a high\-score region for most of the remaining sequence, with only brief local drops around a few mixed windows\. Part A further shows that the jump in the proposed anomaly score is broadly aligned with the entrance into the high\-energy region in the heatmap, thereby producing a clear stage boundary among the normal segment, transition segment, and fault segment\. By contrast, Deep SVDD, KNN Distance, LOF Novelty, Adjacent Transformer, and Transformer AE all respond strongly around the first fault injection, but their scores later fall back close to zero for many fault windows, behaving more like instantaneous amplification of local transients than stable characterization of persistent abnormal states\. One\-class SVM maintains an elevated background score over a long interval, which weakens threshold interpretability, while TranAD mainly produces isolated spikes and lacks continuity\. Therefore, in this relatively ideal test\-rig scenario, the main advantage of the proposed method is not simply generating an earlier spike at a single point, but rather integrating local anomaly evidence into a sustained high\-score plateau, thereby balancing low false alarms with a clear stage boundary\. 
The radar chart in Part B is almost saturated along the outer ring, indicating that on strongly separable data such as CWRU, most methods can already achieve very high AUROC, AUPR, and Recall\. Hence, the key issue here is not whether a method can detect anomalies at all, but whether it can maintain more stable Precision and best F1 while preserving high Recall\. Although the outward expansion of the proposed method on the radar chart is only slightly better than that of several baselines, this slight advantage is consistent with the smoother high\-score plateau observed in Part A\. The t\-SNE visualization in Part C provides more interpretable evidence: normal training windows and normal test windows overlap heavily in the center, pseudo\-anomalous windows are mainly distributed along the outer edge of the normal cluster, and real anomalous windows form an independent arc\-shaped region far from the center\. This structure suggests that the pseudo\-anomalous windows generated in Stage 1 do not need to coincide point by point with the real fault cluster; rather, they act more like a learnable expanded boundary around the normal region, thereby providing an effective directional constraint for the discriminative representation learning in Stage 2\. It should also be noted that CWRU is a relatively ideal laboratory dataset, so this result mainly serves to validate the effectiveness of the method on typical bearing\-fault scenarios\.
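The relationship between Precision, Recall, and best F1 discussed above can be made concrete with a small threshold scan. The sketch below is an illustration under assumptions rather than the paper's evaluation code: the helper `best_f1`, the toy scores, and the rule "a window is flagged when its score is at or above the threshold" are all hypothetical choices, and the paper does not specify how its best-F1 threshold is searched.

```python
def best_f1(scores, labels):
    """Scan each observed score as a threshold.

    Returns (best F1, threshold, precision, recall) for the rule
    "flag a window as anomalous when score >= threshold".
    """
    best = (0.0, None, 0.0, 0.0)
    for thr in sorted(set(scores)):
        pred = [s >= thr for s in scores]
        tp = sum(1 for p, y in zip(pred, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(pred, labels) if p and y == 0)
        fn = sum(1 for p, y in zip(pred, labels) if not p and y == 1)
        if tp == 0:
            continue  # F1 is undefined or zero without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, thr, precision, recall)
    return best


# Toy window-level scores: one normal window (0.40) scores above a fault window (0.30).
scores = [0.05, 0.10, 0.08, 0.40, 0.90, 0.85, 0.30, 0.95]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

f1, thr, precision, recall = best_f1(scores, labels)
print(f1, thr, precision, recall)
```

In this toy case Recall reaches 1.0 at the best threshold while Precision stays below 1.0 because of a single elevated normal window, which mirrors the point that differences among methods on saturated data show up mainly in Precision and best F1.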

#### 3\.2\.2 HTBF dataset

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_htbf_data.png)

Figure 7: Illustration of the HTBF dataset\. A shows the experimental setup, B shows the long sequence obtained by concatenating samples from different classes, and C shows the corresponding FFT spectra\. This dataset contains high\-speed\-train\-bogie\-related vibration signals under multiple operating conditions and is used to evaluate the model’s detection ability in multi\-channel, multi\-fault\-category scenarios\.

The HTBF dataset is used to evaluate performance on a mechanical system related to high\-speed\-train bogies\. It is built around the bogie drivetrain and covers key components such as the gearbox, axle box, main shaft, and wheelset\. In this work, five synchronized vibration channels are used to construct long sequences\. Compared with the single test\-rig bearing data in CWRU, HTBF contains both control states and multiple gearbox\-/axle\-box\-related fault categories, with stronger operating\-condition disturbance and more pronounced channel coupling\. Figure [7](https://arxiv.org/html/2606.04073#S3.F7) shows the experimental setup, the long sequence obtained by concatenating multiple categories, and the corresponding FFT spectra\. As the figure indicates, HTBF exhibits not only differences among fault categories but also more significant changes in operating conditions and channels, making it a more challenging detection scenario than single\-rig bearing data\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_htbf_result.png)

Figure 8: Fault\-detection results on the HTBF dataset\. A shows the window\-energy heatmap and the anomaly\-score curves of different methods, B shows the multi\-metric radar chart, and C shows the representation\-space distributions of normal training windows, normal test windows, pseudo\-anomalous windows, and real anomalous windows\.

Figure [8](https://arxiv.org/html/2606.04073#S3.F8) shows that the boundary between normal and anomalous samples is weaker on HTBF than on CWRU, yet the proposed method still forms relatively continuous score responses over multiple anomalous intervals\. Part A shows that the proposed method maintains a low background score in most of the early normal interval\. After about $1800$ windows and during subsequent condition/fault switching intervals, its score rises along with local energy enhancement and remains at a medium\-to\-high level over clearly anomalous segments\. In contrast, Deep SVDD and TranAD exhibit strong fluctuations over long intervals, already producing many isolated high peaks during normal segments\. KNN Distance, LOF Novelty, Isolation Forest, and One\-class SVM can sense overall distribution shifts, but they are also more likely to map operating\-condition switching and channel\-amplitude differences to elevated background scores, thereby limiting Precision\. Adjacent Transformer and Transformer AE respond more sparsely and often peak only at a few sharp impacts, providing insufficient coverage for weak anomalies spread across wider temporal ranges\. In other words, in a multi\-channel and strongly perturbed scenario such as HTBF, the main advantage of the proposed method lies in its balance of “low background \+ continuous response”: it neither prematurely raises the whole sequence to a high\-score level nor relies solely on a few isolated spikes to indicate anomalies\. 
The radar chart in Part B further shows that this advantage mainly comes from simultaneous improvements in Precision, best F1, AUROC, and AUPR, rather than from simply achieving higher Recall\. In fact, multiple baselines already approach the outer ring on Recall, indicating that they are not short of the ability to “flag anomalies\.” The more important difference is whether they also raise many normal windows to high scores\. The proposed method keeps Recall high while placing Precision and best F1 farther outward, indicating that its gains mainly come from better false\-alarm control\. The visualization in Part C is consistent with this observation: normal training windows and normal test windows form a large left\-side manifold; pseudo\-anomalous windows are distributed more along the right\-side outer edge; and real anomalous windows partly lie adjacent to this outer edge, while another part still mixes into the left\-side normal manifold\. This structure suggests that HTBF indeed contains a subset of pronounced anomalies that can be effectively captured by the pseudo\-anomalous boundary, but it also contains weak anomalies or transitional samples that remain close to the normal state\. In such a case, the proposed method improves the separability of the major anomalous samples, but cannot yet completely separate all anomalous samples\.
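The window-energy heatmaps in Part A of the result figures summarize how strong each channel is in each window. The paper does not define the energy measure, so the sketch below assumes mean squared amplitude per channel per window; the helper `window_energy` and the toy two-channel windows are hypothetical.

```python
def window_energy(windows):
    """Mean squared amplitude of every channel in every window.

    `windows` is a list of windows; each window is a list of channels;
    each channel is a list of samples. Returns one row per window.
    """
    return [[sum(x * x for x in channel) / len(channel) for channel in window]
            for window in windows]


# Two toy windows with two channels each: a quiet one and an impact-like one.
quiet = [[0.1, -0.1, 0.1, -0.1], [0.0, 0.0, 0.0, 0.0]]
impact = [[1.0, -1.0, 1.0, -1.0], [0.5, -0.5, 0.5, -0.5]]

energy = window_energy([quiet, impact])
print(energy)
```

Plotting the resulting rows as columns of an image (channels on the vertical axis, window index on the horizontal axis) gives a heatmap of the kind shown alongside the anomaly-score curves.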

#### 3\.2\.3 PHM2009 dataset

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_phm2009_data.png)

Figure 9: Illustration of the PHM2009 dataset\. A shows the experimental setup, B shows the long sequence obtained by concatenating samples from different classes, and C shows the corresponding FFT spectra\. The figure illustrates the input characteristics of complex rotating\-machinery fault data from a gearbox system\.

The PHM2009 dataset comes from the 2009 Gearbox Challenge released by the PHM Society \(PHM Society, [2009](https://arxiv.org/html/2606.04073#bib.bib32)\)\. It is based on a generic industrial gearbox and synchronously records two vibration channels on the retaining plates of the input and output shafts as well as a tachometer pulse, covering shaft speeds of $30$–$50$ Hz and both high\- and low\-load conditions\. The faulty objects are no longer limited to a single bearing but belong to a more complex rotating transmission system\. In this paper, the two vibration sequences are recast into a one\-class anomaly\-detection task to provide supplementary cross\-equipment validation\. Figure [9](https://arxiv.org/html/2606.04073#S3.F9) shows the experimental setup, the long sequence obtained by concatenating samples from different states, and the corresponding FFT spectra\. Compared with CWRU, some anomalous samples in PHM2009 are more similar to normal samples in both the time and frequency domains, making it useful for testing method robustness under weak\-difference fault scenarios\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_phm2009_result.png)

Figure 10: Fault\-detection results on the PHM2009 dataset\. A shows the window\-energy heatmap and the anomaly\-score curves of different methods, B shows the multi\-metric radar chart, and C shows the representation\-space distributions of normal training windows, normal test windows, pseudo\-anomalous windows, and real anomalous windows\. In this dataset, normal windows and some anomalous windows are highly similar\.

Figure [10](https://arxiv.org/html/2606.04073#S3.F10) indicates that the metric differences among methods are relatively small on PHM2009, and the anomaly\-score curves do not exhibit a fully separated pattern\. Part A shows that the proposed method maintains a low background score over most normal windows in the early portion and produces concentrated, relatively strong peak responses mainly in the high\-energy region around $1700$–$1950$, so its high scores are more focused on locally complex anomalies corresponding to the bright regions of the heatmap\. By contrast, Deep SVDD, KNN Distance, Isolation Forest, and LOF Novelty are more likely to maintain moderately high scores over a wider temporal range, suggesting that they can sense overall distribution perturbations but find it difficult to further separate truly stronger anomalous segments from ordinary fluctuations\. One\-class SVM oscillates strongly across almost the entire test segment, making thresholding less meaningful\. Adjacent Transformer, Transformer AE, and TranAD tend to respond only to a few sharp events with discrete peaks and therefore under\-cover sustained but not strong anomalies\. Hence, the advantage of the proposed method on this dataset is no longer “complete separation,” but rather its ability to prioritize relatively more suspicious compound\-fault windows into a high\-score region without substantially raising the normal background\. 
The radar chart in Part B confirms this interpretation: the proposed method maintains a slight outward expansion on AUROC and AUPR, and its Recall is close to that of the best baseline, but its gains in Precision and best F1 are not large\. This suggests that the difficulty of this dataset is not complete miss detection, but the fact that many anomalous windows differ from normal windows only weakly, making it hard for any method to produce a particularly sharp threshold boundary\. The visualization in Part C further reveals the source of this difficulty: normal training windows, normal test windows, and real anomalous windows together form several intertwined ring\-like manifolds, while pseudo\-anomalous windows are more concentrated on an outer branch toward the right\. In other words, pseudo\-anomalous windows provide the model with one dominant anomaly direction, but real complex faults do not deviate along only that single direction\. As a result, a considerable portion of real anomalous windows remain close to the normal manifold\. This finding is consistent with the characteristics of PHM2009, where some fault windows in compound transmission systems exhibit only local or relatively weak frequency\-domain differences\. Overall, the proposed method provides a useful boundary reference for PHM2009, but pseudo\-anomalous windows still cannot completely substitute for real complex\-fault windows\.

#### 3\.2\.4 REALBOX field\-measurement data

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_realbox_data.png)

Figure 11: Illustration of the REALBOX field\-measurement data\. A shows the real data\-acquisition scenario, B shows the long sequence formed by concatenating normal and fault samples, and C shows the corresponding FFT spectra\. The data come from axial acceleration vibration records of an axle\-box bearing of a high\-speed train, sampled at $10\,\mathrm{kHz}$, with an inner\-race failure as the fault type; part of the vertical\-axis information is hidden at the request of the data provider\.

The REALBOX dataset consists of axial vibration records from an axle\-box bearing of a high\-speed train, sampled at $10\,\mathrm{kHz}$ by an accelerometer, with an inner\-race failure as the fault type\. Compared with public laboratory datasets, its main characteristic is the extreme scarcity of measured fault samples\. In real high\-speed\-train maintenance, severe fault samples are inherently rare, and related anomalies are often discovered or handled early by other monitoring means, so publicly available vibration fault records for analysis are very limited\. The REALBOX data used in this paper come from one such retained fault\-vibration record under a condition where temperature monitoring failed\. Figure [11](https://arxiv.org/html/2606.04073#S3.F11) shows the data\-collection scenario, the concatenated normal/fault time series, and the corresponding FFT spectra\. Due to confidentiality requirements from the data provider, some vertical\-axis information in the figure is hidden\. Despite its limited sample size, this dataset is closer to real maintenance conditions and is therefore suitable for examining the applicability of the method in scarce real engineering scenarios\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_realbox_result.png)

Figure 12: Fault\-detection results on the REALBOX field\-measurement data\. A shows the window\-energy heatmap and the anomaly\-score curves of different methods, B shows the multi\-metric radar chart, and C shows the representation\-space distributions of normal training windows, normal test windows, pseudo\-anomalous windows, and real anomalous windows\. The dataset comes from scarce vibration records of a high\-speed\-train axle\-box bearing with an inner\-race failure\.

Figure [12](https://arxiv.org/html/2606.04073#S3.F12) shows that the proposed method forms a relatively clear normal/anomalous boundary on REALBOX\. In Part A, the proposed method maintains low scores in the early normal interval, then exhibits a clear step\-like rise after about $800$ windows, and subsequently remains on a relatively stable medium\-to\-high plateau during the fault segment, with even higher peaks only when local impacts intensify\. This “boundary first, plateau later” pattern makes the threshold easy to interpret\. By contrast, although Isolation Forest, KNN Distance, LOF Novelty, and One\-class SVM can also sense the distribution shift in the later segment, they are more prone to issues such as elevated background scores in the normal segment, near saturation in the later segment, or compressed contrast between earlier and later parts\. Deep SVDD fluctuates more strongly inside the fault segment\. Adjacent Transformer and Transformer AE behave more like sparse pulses, while TranAD produces spikes scattered throughout the whole sequence, making it difficult to form a stable anomaly plateau\. 
Therefore, in this scarce field\-measurement scenario, the main advantage of the proposed method is that it neither excessively raises the normal segment nor represents the fault segment merely as a few isolated peaks; instead, it produces a continuous response that is more favorable for thresholding and engineering interpretation\. The radar chart in Part B shows that, except for TranAD, many methods are already close to the outer ring, so the chart is somewhat saturated here as well\. Nevertheless, the proposed method still maintains the outermost contour on AUROC, AUPR, best F1, and Precision, while Recall does not decrease despite the tighter boundary\. This indicates that its gains come mainly from clearer separation rather than from simply producing more anomaly alarms\. The visualization in Part C provides an even more practically meaningful explanation: normal training windows lie in the middle, normal test windows form a neighboring but still continuous normal manifold on the right, pseudo\-anomalous windows mainly occupy the left side and outer regions, and real anomalous windows are pushed as a whole into an independent cluster below\. In other words, the pseudo\-anomalous windows do not simply duplicate the real fault cluster; rather, together with the real anomalies, they surround the normal manifold from different directions, thereby encouraging a wider safety margin in the representation space\. Since REALBOX contains only one publicly usable vibration fault record, this experiment should be regarded as engineering validation rather than a large\-sample statistical generalization result\.

Taken together, the results on the four fault\-detection datasets show that the proposed method performs most stably on CWRU and REALBOX, while still maintaining favorable metrics on more complex datasets such as HTBF and PHM2009, although some anomalous samples still overlap with normal ones\. If one looks only at the score curves in Part A of the figures, conventional one\-class classification or distance\-based methods are more easily affected by operating\-condition switching and overall amplitude drift and thus tend to raise the background level, while reconstruction\-/prediction\-based temporal models more often appear as isolated spikes or locally delayed responses\. In contrast, the proposed method more stably achieves a response pattern of “low background in normal segments and continuous elevation in anomalous segments\.” If Parts B and C are considered together, radar charts on strongly separable datasets such as CWRU and REALBOX are often close to saturation, and the real distinction among methods lies in whether the representation space forms a clear normal core and anomalous outer edge\. On complex datasets such as HTBF and PHM2009, the outward expansions in Precision/best F1 or AUROC/AUPR on the radar charts correspond more directly to whether anomalous samples truly leave the normal manifold in Part C\. This is consistent with the design goal of our method: pseudo\-anomalous samples are not direct substitutes for real fault samples, but controllable boundary samples for one\-class detection, used to enhance the discriminability of the normal representation space\.

### 3\.3 Degradation\-process detection experiments

In addition to fixed fault\-category detection, we further validate the ability of the proposed method to respond to life\-cycle evolution on the XJTU\-SY and IMS degradation datasets\. The key difference between degradation data and fault\-classification data is that anomalies usually do not emerge abruptly, but instead develop gradually over a relatively long period\. Therefore, an ideal anomaly score should not only rise during the late failure stage, but should also exhibit continuous trend changes during at least part of the early degradation stage\.

#### 3\.3\.1 XJTU\-SY data presentation and experimental split

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_xj_data.png)

Figure 13: Illustration of the XJTU\-SY degradation data\. A shows the experimental setup, B shows the full\-life degradation trends, and C and D show representative raw sequences and their FFT spectra\. The figure is intended to illustrate the temporal evolution characteristics in the degradation experiments\.

The XJTU\-SY dataset was jointly released by Xi’an Jiaotong University and Changxing Sumyoung \(Wang, [2021](https://arxiv.org/html/2606.04073#bib.bib33)\)\. It contains complete run\-to\-failure sequences for $15$ bearings under $3$ speed/load conditions\. At each acquisition, both horizontal and vertical acceleration channels are recorded, with a sampling frequency of $25.6\,\mathrm{kHz}$ and a signal length of $32768$ saved every minute\. Part B of Fig\. [13](https://arxiv.org/html/2606.04073#S3.F13) shows that different bearings exhibit markedly different degradation trajectories: some remain stable for a long time and then rise sharply, while others show sustained gradual growth\. Parts C and D further show that the changes in time\-domain impacts and frequency\-domain energy concentration in the late degradation stage are also inconsistent across samples\. Therefore, this dataset is more suitable for validating the adaptability of anomaly scores to different degradation trends rather than evaluating only a single failure pattern\.

We select Bearing1\-1, Bearing1\-3, Bearing2\-2, Bearing2\-5, Bearing3\-3, and Bearing3\-5 for validation, covering different degradation trajectories under the three operating conditions of $35$ Hz/$12$ kN, $37.5$ Hz/$11$ kN, and $40$ Hz/$10$ kN\. For each sample, the model is trained on early\-life normal snapshots, and the remaining sequence is used for testing\. Bearing3\-5 uses the first $23$ snapshots as training data because an earlier degradation onset is detected, whereas all other samples use the first $30$ snapshots\. The window length and stride are $512$ and $256$, respectively\. The target error interval for Stage 1 is set to $[0.0008, 0.006]$, and pseudo\-anomalous windows are sampled from this interval\. The purpose of this setting is not to directly predict remaining useful life, but to test whether the anomaly score can reflect deviations in health state relative to the early normal state\.
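The windowing described above (length $512$, stride $256$, applied to snapshots of $32768$ samples) can be sketched as follows. This is a hypothetical illustration: the helper `sliding_windows` is not from the paper, the placeholder signal is all zeros, and it is an assumption that windows are cut within each snapshot and never span two snapshots.

```python
def sliding_windows(signal, length=512, stride=256):
    """Return every full window of `length` samples, starting every `stride` samples."""
    last_start = len(signal) - length
    return [signal[i:i + length] for i in range(0, last_start + 1, stride)]


snapshot_len = 32768                 # samples saved per one-minute snapshot
snapshot = [0.0] * snapshot_len      # placeholder for one channel of one snapshot

windows = sliding_windows(snapshot)
windows_per_snapshot = len(windows)

# Assumption: windows never cross snapshot boundaries, so counts simply add up.
n_train_default = 30 * windows_per_snapshot      # first 30 snapshots
n_train_bearing3_5 = 23 * windows_per_snapshot   # first 23 snapshots (Bearing3-5)

print(windows_per_snapshot, n_train_default, n_train_bearing3_5)
```

Under these assumptions each snapshot yields $(32768 - 512)/256 + 1 = 127$ windows per channel, so the training sets would contain $3810$ and $2921$ windows, respectively; the paper itself does not report these counts.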

#### 3\.3\.2 Multi\-bearing degradation detection results on XJTU\-SY

Figures[14](https://arxiv.org/html/2606.04073#S3.F14)to[19](https://arxiv.org/html/2606.04073#S3.F19)present the detection results for the six XJTU\-SY bearing samples\. Each figure contains anomaly\-score curves, metric comparisons, and representation\-space visualizations\. The anomaly\-score curves are used to observe detection responses as time evolves, while t\-SNE is used to help assess the relative positions of normal training windows, normal test windows, pseudo\-anomalous windows, and real anomalous windows in the representation space\. Part A of these figures also reveals that the baseline methods in degradation tasks roughly fall into three groups: one group, such as One\-class SVM and Isolation Forest, tends to maintain an elevated background or saturate too early over long intervals; a second group, such as Deep SVDD, KNN Distance, and LOF Novelty, can sense late\-stage distribution shifts but often responds with delayed rises or strong fluctuations depending on the sample; and a third group, such as Adjacent Transformer, Transformer AE, and TranAD, tends to produce local spikes\. Parts B and C play a slightly different role here than in fault detection: because many samples contain a high proportion of late\-stage degradation windows, radar charts often become nearly saturated for multiple methods, so Part B is mainly useful for checking whether a method sacrifices Precision/best F1 in order to detect early signs\. The more interpretable evidence lies in Part C, namely whether real anomalous windows leave the normal manifold to form one or more outer branches, and whether pseudo\-anomalous windows push the boundary outward without fragmenting the normal cluster\. The goal of the proposed method is not necessarily to cross the threshold earliest on every sample, but to make the score evolution consistent with the stage\-wise strengthening of the degradation process\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_xj_11_result.png)

Figure 14: Degradation\-detection results for XJTU\-SY Bearing1\-1\. The proposed method maintains low scores during the normal stage and produces a steadily increasing anomaly response in the late degradation stage; a slight score rise appears in some pre\-failure windows\.

For Bearing1\-1 in Fig\. [14](https://arxiv.org/html/2606.04073#S3.F14), a clear boundary appears after about $6000$ windows: the window\-energy heatmap in Part A gradually changes from mixed colors into a sustained high\-energy region, and the anomaly score of the proposed method correspondingly jumps from a low value near the threshold to a stable high level\. Compared with the baselines, Isolation Forest also exhibits a rapid transition near the failure point, but its plateau saturates earlier and has a smaller dynamic range in the later segment\. Deep SVDD, KNN Distance, and LOF Novelty increase gradually in the later stage, but do so more slowly and with stronger fluctuations\. One\-class SVM maintains an elevated background far earlier, while Adjacent Transformer and Transformer AE are overly flat for most of the sequence and only respond near the very end\. By contrast, the proposed method preserves a low background in the early stage and forms a smooth, sustained high\-score plateau after failure, which is more consistent with the stage characteristics of samples exhibiting rapid late\-stage failure\. In Part B, the metrics of many methods are already close to the outer ring except for TranAD, indicating that detecting the late fault stage of this sample is not particularly difficult in itself\. Thus, the key here is not the absolute value of any single metric, but whether high scores can be achieved while preserving interpretability across the degradation stages\. 
Part C shows that normal training windows and normal test windows are mainly concentrated in a left\-side core region, pseudo\-anomalous windows form a wider envelope around them, and real anomalous windows gather into a slender independent cluster on the right\. This structure indicates that the model has learned not a random dispersion, but a clear separation direction from the normal core toward the late failure region\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_xj_13_result.png)

Figure 15: Degradation\-detection results for XJTU\-SY Bearing1\-3\. The anomaly\-score curve and representation\-space visualization jointly show the model’s response to the degradation stages\.

Unlike the jump\-like transition of Bearing1\-1, Bearing1\-3 in Fig\. [15](https://arxiv.org/html/2606.04073#S3.F15) exhibits a smoother degradation trend\. The proposed method does not rise sharply immediately after the training segment ends; instead, it grows slowly over a long period and approaches the high\-score region only in the later part, which is consistent with the stable degradation trend marked in the figure\. The baseline comparison in Part A shows that Deep SVDD, KNN Distance, LOF Novelty, Adjacent Transformer, and Transformer AE tend to surge only near the end, compressing the long progressive degradation into what appears to be a late abrupt change\. One\-class SVM and Isolation Forest rise earlier, but their background is higher in the early and middle stages, making it difficult to distinguish slight degradation from severe degradation\. In contrast, the proposed method preserves a continuous score transition from weak deviation to strong degradation, making it more suitable for describing this kind of long and smooth degradation evolution\. Part B is again nearly saturated, indicating that many methods can identify the severe late\-stage degradation if only final labels are considered\. However, Part C better reveals the difference: real anomalous samples form a smooth arc\-shaped manifold on the right and continue to separate from the dense normal/pseudo\-anomalous cloud on the left, rather than separating abruptly only at the end as with some baselines\. This continuously unfolding geometry is consistent with the gradual rise observed in Part A\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_xj_22_result.png)

Figure 16: Degradation\-detection results for XJTU\-SY Bearing2\-2\. The proposed method shows continuous score variation before and after degradation and can be used to observe early degradation trends\.

Bearing2\-2 in Fig\. [16](https://arxiv.org/html/2606.04073#S3.F16) exhibits a more typical early\-warning pattern\. Part A shows that before the arrival of the large\-scale high\-score region, the proposed anomaly score has already risen continuously in the earlier stage and has crossed the posterior best\-F1 threshold shown in the figure; the corresponding heatmap also begins to exhibit local energy enhancement after about $2000$ windows\. Notably, some baseline methods also respond early, but in different ways\. One\-class SVM quickly approaches saturation at an early stage and loses the ability to distinguish mild degradation from severe late\-stage degradation\. Isolation Forest, KNN Distance, LOF Novelty, and Transformer AE all show sustained growth later on, but more often remain on a long intermediate plateau, and their stage boundaries are less clear than those of the proposed method\. Deep SVDD, Adjacent Transformer, and TranAD respond more strongly only later\. In contrast, the proposed method preserves two distinct rising levels between early warning and later strengthening, making it more suitable for highlighting weak precursor signals in degradation monitoring\. Part B suggests that most methods still achieve relatively high final AUROC/AUPR values, so the real value here lies in whether an interpretable signal can be produced earlier without significantly sacrificing Precision/best F1\. Part C shows that real anomalous windows do not form a single cluster, but instead split into multiple substructures such as an upper\-right arc and a lower branch, echoing the two\-stage rise seen in Part A\. 
The pseudo\-anomalous windows mainly remain close to the outer boundary of the normal manifold, suggesting that they play a boundary\-expansion role rather than directly substituting for these specific anomalous submodes\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_xj_25_result.png)

Figure 17: Degradation-detection results for XJTU-SY Bearing2-5. The figure shows anomaly scores, metric comparisons, and representation-space distributions, with the late anomalous stage overall receiving higher scores.

Bearing2-5 has a longer life sequence, and its late degradation does not occur as a single jump but strengthens in multiple stages. In Fig. [17](https://arxiv.org/html/2606.04073#S3.F17), the anomaly score of the proposed method begins to rise continuously after about $2\times 10^{4}$ windows, and several stepwise high-score intervals appear in the middle and later stages, consistent with the fact that the two channels enter high-energy regions successively in the heatmap. Part A also shows that Deep SVDD, KNN Distance, and LOF Novelty can follow the general worsening trend, but with stronger fluctuations and blurrier boundaries among different stages. Isolation Forest rises clearly in the middle stage but also exhibits more local drops. One-class SVM enters the high-score region too early and approaches saturation quickly, while Adjacent Transformer and Transformer AE strengthen only in a relatively late stage. Thus, for this kind of multi-stage degradation sample, the advantage of the proposed method lies in preserving the form of progressive worsening rather than compressing the entire late stage into a single uniformly high-score state. Part B again appears nearly saturated near the outer ring, indicating that if only late-life labels are aggregated, many methods can score highly. However, Part C reveals finer structure: real anomalous samples do not collapse into a single group but instead split into multiple isolated island-like branches, indicating that the late-stage degradation indeed contains multiple states with different severities, while normal and pseudo-anomalous samples remain mainly in a relatively compact left-side core region.
This suggests that the proposed method separates normal and anomalous states while also preserving stage hierarchy inside the anomalous region.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_xj_33_result.png)

Figure 18: Degradation-detection results for XJTU-SY Bearing3-3. The model shows progressively stronger anomaly responses before and after failure.

Bearing3-3 in Fig. [18](https://arxiv.org/html/2606.04073#S3.F18) remains approximately normal over most of the test interval, exhibits only one local high-score pulse before the actual failure, and then rapidly enters a sustained high-score region near the end. The comparison in Part A especially highlights the difference between the proposed method and the baselines: Deep SVDD, KNN Distance, LOF Novelty, Isolation Forest, Adjacent Transformer, and Transformer AE mostly surge only at the end and show almost no significant response to the precursor; One-class SVM also rises rapidly at the end, but has a higher background beforehand; TranAD is dominated almost entirely by the final spikes. This sample suggests that the precursors of some degradation sequences do not manifest as a long gradual rise, but rather as a combination of sparse precursors and a final abrupt change. In such a case, the value of the proposed method lies not only in identifying the final failure, but also in separating a brief, local, yet meaningful early deviation from a long low-background stage. The radar chart in Part B barely differentiates the methods on this sample, which also shows that final summary statistics alone do not fully reflect the value of precursor sensitivity. Part C is more interpretable: real anomalous samples on the right split into several relatively compact small clusters, while normal and pseudo-anomalous samples on the left remain as a dominant manifold, indicating that both a small number of early precursors and severe late-stage failure are mapped outside the normal region without breaking the consistency of the normal samples.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_xj_35_result.png)

Figure 19: Degradation-detection results for XJTU-SY Bearing3-5. The proposed method yields stable high scores in the late degradation stage and score elevations in part of the early stage.

The test segment of Bearing3-5 exhibits sustained anomaly responses above the threshold shortly after it begins, indicating that distribution shift has already occurred relatively early after the training segment ends. Although the anomaly score of the proposed method fluctuates in the middle stage, its overall trend continues to move toward higher-score regions and eventually forms a stable high-score plateau near the end. The comparison in Part A shows that One-class SVM, KNN Distance, and Isolation Forest also enter the high-score region early, but they are more likely to maintain an elevated background over a long interval, compressing the difference between early deviation and later severe degradation. Deep SVDD, Adjacent Transformer, and Transformer AE rise later, while TranAD mainly produces a large number of fluctuating spikes. In contrast, under this challenging setting where the test segment is already degraded for almost its entire duration, the proposed method still preserves a hierarchy from moderately high scores to an even higher plateau, meaning that it can indicate not only that the sequence has already deviated from normality, but also whether the degradation continues to intensify. In Part B, almost all methods except TranAD are again close to the outer ring, which further shows that overall AUROC/AUPR alone is insufficient for reflecting which method better captures degradation hierarchy in long degradation sequences.
Part C more clearly indicates that real anomalous windows form a continuous arc-shaped branch on the right, whereas normal training windows, normal test windows, and pseudo-anomalous windows are distributed among several adjacent subclusters on the left. This structure of “multiple but connected normal subclusters and an overall outward shift of anomalies” is consistent with the continuous progression from early deviation to severe late-stage degradation shown in Part A.

Across the six XJTU\-SY samples, three typical response patterns can be observed\. Bearing1\-1 and Bearing3\-3 are closer to the pattern of “long stable period followed by rapid failure\.” Bearing1\-3 and Bearing2\-5 exhibit more continuous progressive degradation\. Bearing2\-2 and Bearing3\-5 provide more explicit early warning signals before formal failure\. If one summarizes the shapes of the baseline score curves in Part A, One\-class SVM and some tree\-/distance\-based methods are more likely to show elevated backgrounds or overly early saturation, compressing the hierarchy among degradation stages\. Deep SVDD, KNN Distance, and LOF Novelty are sensitive to overall shifts, but their rise timing and fluctuation strength are not stable across samples\. Adjacent Transformer, Transformer AE, and TranAD are more likely to express degradation responses as local spikes or late concentrated surges\. If Parts B and C are considered together, many radar charts for XJTU\-SY are already near the performance ceiling; thus, they more strongly reflect whether a method can detect late\-stage degradation than whether it can capture early warnings\. What more directly reflects method differences is whether real anomalous windows form one or more outer branches detached from the normal manifold in Part C\. Compared with the baselines, the proposed method more consistently preserves the temporal order of degradation stages, enabling anomaly scores not only to detect late\-stage fault states but also to capture weak pre\-failure signs in some samples\. At the same time, we do not interpret these scores as direct estimates of remaining useful life, nor do we claim that they can precisely locate the degradation onset\. A more appropriate interpretation is that the score curve characterizes the degree to which test windows deviate from the early normal training distribution\. 
Low\-dimensional visualization also shows that, for most samples, real anomalous windows form relatively clear boundaries away from normal windows, while pseudo\-anomalous windows serve as boundary references\.

#### 3.3.3 IMS degradation data experiment

The IMS bearing degradation dataset was provided by the Center for Intelligent Maintenance Systems (IMS) at the University of Cincinnati and later released through the NASA open data portal (NASA Open Data Portal, [2023](https://arxiv.org/html/2606.04073#bib.bib34)). We select the Bearing4 sequence as a representative long-life degradation sample to examine the stability of the proposed method under long-duration weak degradation and transitional states.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_ims_data.png)

Figure 20: Illustration of the IMS bearing degradation data. The figure shows the experimental setup, long-sequence degradation trend, time-domain signal, and frequency-domain changes in the IMS run-to-failure data, and is intended to illustrate the input characteristics of long-sequence degradation scenarios.

Figure [20](https://arxiv.org/html/2606.04073#S3.F20) highlights three key characteristics of IMS Bearing4. First, the RMS degradation trajectory in Part C remains relatively stable over most of the early and middle stages, but already exhibits scattered spikes before the official degradation onset, indicating that the dataset does not consist of only two sharply separated stages, i.e., “normal” and “failed.” Second, the comparison of normal and anomalous windows in Part B shows that anomalous samples have enhanced time-domain impact amplitudes and stronger dominant frequency peaks, although the enhancement is not as drastic as in some XJTU-SY samples. Finally, the figure suggests that IMS is better suited as a validation scenario for long-duration gradual degradation, where the key issue is model tolerance to weak signs and transitional states.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_ims_result.png)

Figure 21: Degradation-detection results on the IMS bearing data. The proposed method captures part of the anomaly response that strengthens over time, although some anomalous windows still receive scores close to those of normal windows.

The IMS test sequence is much longer and contains more transitional states in which normal and anomalous samples overlap. Taking Bearing4 as an example, the training-set size is $4096000\times 2$, the test-set size is $40058880\times 2$, and the window length and stride remain $512$ and $256$, respectively. Part A of Fig. [21](https://arxiv.org/html/2606.04073#S3.F21) shows that the window energy of the two channels as a whole enters a higher region after about $10^{5}$ windows, while the anomaly score of the proposed method already exhibits multiple isolated peaks before that point, echoing the early warning signs before failure marked in Fig. [20](https://arxiv.org/html/2606.04073#S3.F20). After about $1.05\times 10^{5}$ windows, many windows continuously enter the high-score region, indicating that the model can accumulate responses to progressively worsening degradation. Compared with the baselines, the difference is again evident here. Deep SVDD and TranAD show strong fluctuations or dense peaks over a long early interval, making them prone to mapping normal fluctuations in long sequences to high scores. One-class SVM maintains an elevated background over almost the entire test segment, leaving little room for threshold discrimination. KNN Distance, LOF Novelty, and Isolation Forest also show an overall rise in the later stage, but more in the form of gradual elevation on top of a relatively high background, making them less targeted at early sparse weak signs. Adjacent Transformer and Transformer AE respond mainly to larger regions in later stages and provide insufficient coverage for smaller early signs.
In contrast, the strength of the proposed method on IMS is not that all anomalous windows are clearly separated, but that it simultaneously preserves two types of information: sparse early warning spikes and a sustained high\-score platform in the late stage\. The radar chart in Part B supports this observation: the proposed method maintains a relatively large outward area on AUROC, AUPR, best F1, and Precision, whereas methods such as Adjacent Transformer are more aggressive on Recall at the expense of Precision, and Deep SVDD lags on multiple metrics\. This suggests that, in long\-sequence degradation scenarios such as IMS, the more valuable property is not to crudely label more windows as anomalous, but to maintain a relatively balanced Precision–Recall trade\-off while preserving strong ranking ability\. The t\-SNE visualization in Part C further shows that a substantial portion of real anomalous windows have already detached from the normal manifold along a right\-side arc\-shaped branch, while another portion remains close to or even partially overlaps with the region of normal test windows\. Meanwhile, pseudo\-anomalous windows mainly lie outside the normal manifold in another direction\. This structure indicates that degradation in IMS does not follow a single, clear anomaly trajectory, but simultaneously includes late\-stage states that have clearly moved away from normality and transitional states that are still close to normal\. Therefore, for long\-life degradation data such as IMS, anomaly\-detection scores are better interpreted as references to degradation trends rather than as the sole basis for pointwise failure decisions\.
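To make the window counts quoted above concrete, the following minimal sketch computes how many windows a window length of 512 and a stride of 256 produce for the stated Bearing4 sequence lengths. It assumes that only full windows are kept and that no padding is applied, which this section does not state; the function names `num_windows` and `window_bounds` are illustrative and not taken from the paper.

```python
def num_windows(n_samples: int, window: int = 512, stride: int = 256) -> int:
    """Number of full windows obtained by sliding over n_samples points.

    Assumption: incomplete trailing windows are discarded (no padding).
    """
    if n_samples < window:
        return 0
    return (n_samples - window) // stride + 1


def window_bounds(n_samples: int, window: int = 512, stride: int = 256):
    """Yield (start, end) index pairs, end exclusive, for each full window."""
    for start in range(0, n_samples - window + 1, stride):
        yield start, start + window


# Sequence lengths (per channel) quoted for IMS Bearing4 in the text.
train_windows = num_windows(4_096_000)
test_windows = num_windows(40_058_880)
print(train_windows, test_windows)
```

Under these assumptions the test sequence yields roughly $1.56\times 10^{5}$ windows, which is in line with the discussion of behavior after about $10^{5}$ windows.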

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_overall_metrics.png)

Figure 22: Summary of the core metrics across multiple datasets. A–E compare the results of $9$ methods on AUROC, AUPR, best F1, Precision, and Recall across $11$ datasets, and F shows the radar chart of cross-dataset averages. This figure summarizes the overall performance of the proposed method on bearing fault detection and degradation detection tasks.

#### 3.3.4 Experimental summary

Figure [22](https://arxiv.org/html/2606.04073#S3.F22) provides a unified summary of the core metrics from the above $11$ fault-detection and degradation-detection experiments. Panels A–E correspond to the five evaluation metrics, and Panel F is the averaged radar chart over different metrics. This figure is not intended to replace the dataset-by-dataset anomaly-score analysis given earlier, but rather to compare the overall trend of different methods under a unified coordinate system. Overall, the proposed method occupies the largest area in the averaged radar chart and achieves the highest average performance on AUROC, AUPR, best F1, and Precision, with average values of $0.9673$, $0.9778$, $0.9562$, and $0.9485$, respectively. The average Recall is $0.9695$, which is not the highest among all methods but still remains high. This indicates that the strength of the proposed method does not lie mainly in more aggressive recall, but rather in a more stable balance of ranking ability, threshold-based detection performance, and Precision–Recall trade-off across datasets.

A closer examination of the single-dataset results shows that the proposed method achieves the best AUROC and AUPR on $5$ datasets each, and the best best-F1 and Precision on $4$ datasets each, indicating that its advantage does not come from only one type of metric. This advantage is more pronounced on datasets that better differentiate methods, such as HTBF, PHM2009, and REALBOX. On IMS Bearing4 and some XJTU-SY subsets, however, the gaps between the proposed method and LOF Novelty, KNN Distance, and One-class SVM are smaller, and some individual metrics are slightly better for those baselines. In other words, this summary figure is better suited to support the conclusion that the proposed method is superior in overall cross-dataset stability and comprehensive performance, rather than to suggest that it has absolute superiority in every scenario and on every metric. In addition, the larger dispersion among methods on IMS, HTBF, and REALBOX also indicates that these datasets are more discriminative in terms of anomaly-detection capability and threshold-selection stability.
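The ranking and threshold-based metrics summarized above can be illustrated with a small, dependency-free sketch. It computes AUROC through the pairwise rank statistic and finds a posterior best-F1 threshold by sweeping every observed score. The exact evaluation protocol of the paper (for example, how thresholds are enumerated) is not given in this section, so the functions `auroc` and `best_f1` and the toy scores below are illustrative assumptions only.

```python
def auroc(scores, labels):
    """AUROC as the probability that an anomaly outscores a normal sample (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_f1(scores, labels):
    """Sweep each observed score as a threshold; return (F1, threshold, precision, recall).

    The threshold is chosen with access to the labels, so it is an optimistic
    (posterior) operating point rather than a deployable one.
    """
    best = (0.0, None, 0.0, 0.0)
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, thr, precision, recall)
    return best


# Toy window-level scores: one normal window (0.4) outranks one anomaly (0.35).
scores = [0.1, 0.4, 0.35, 0.3, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
roc = auroc(scores, labels)
f1, thr, precision, recall = best_f1(scores, labels)
print(roc, f1, thr, precision, recall)
```

Because the best-F1 threshold is selected after the fact, two methods with similar AUROC can still differ in Precision at that threshold, which is the kind of difference the per-dataset comparison above points to.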

### 3.4 Ablation studies

To further explain the effectiveness of the proposed framework, we conduct ablation analyses from three perspectives: how pseudo\-anomalies are generated, whether pseudo\-anomalies form effective boundaries in the representation space, and how such boundaries are translated into final anomaly scores\. Unlike the main experiments, which focus on final detection metrics, the ablation studies focus more on the mechanism of the method itself\. Ideally, pseudo\-anomalous samples should not simply be “easy negatives” that lie far away from normal samples; instead, they should lie outside the normal manifold, exhibit some directional consistency with possible real anomalies, and be transformed into a stable distance structure after Stage 2 training\. Accordingly, the following analysis combines two\-dimensional embedding distributions, distance statistics, PCA/Sankey coverage relations, anomaly\-score curves, and metrics such as AUROC, AUPR, and best F1\.

#### 3.4.1 Ablation I: pseudo-anomaly injection strategy

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_ablation_pseudo.png)

Figure 23: Ablation on pseudo-anomaly injection strategies. The upper row corresponds to HTBF and the lower row to XJTU-SY Bearing1-1. The figure compares six strategies (our method, CARLA, removing error-bin balancing, removing positive-pair pulling, random pseudo-anomalous negatives, and random pseudo-anomalies) in terms of the distributions of normal training samples, pseudo-anomalies, validation-normal samples, and real anomalies in the two-dimensional embedding space.

Figure [23](https://arxiv.org/html/2606.04073#S3.F23) compares the proposed pseudo-anomaly generation method, CARLA anomaly injection, and several alternative strategies obtained by removing key design components. The core question in this group of experiments is whether the negative samples generated by different anomaly-injection schemes truly act as “boundary samples.” For bearing anomaly detection under normal-only training, pseudo-anomalies do not need to reproduce the real fault morphology point by point during testing. More importantly, they should form a continuous, learnable reference region outside the normal sample manifold without deviating excessively from the physical data distribution. If pseudo-anomalies are too close to normal samples, Stage 2 cannot learn a sufficiently discriminative margin; if they are too random or too far from the normal manifold, the model is more likely to learn artifacts unrelated to real faults, resulting in representations that are easy to separate during training but unstable during testing.

Table 7: Quantitative results of Ablation I on HTBF and XJTU-SY Bearing1-1.

| Setting | HTBF AUROC | HTBF AUPR | HTBF Best F1 | XJTU-SY Bearing1-1 AUROC | XJTU-SY Bearing1-1 AUPR | XJTU-SY Bearing1-1 Best F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Proposed method | 0.9711 | 0.9842 | 0.9347 | 0.9889 | 0.9954 | 0.9784 |
| CARLA pseudo-anomalies | 0.9694 | 0.9655 | 0.9187 | 0.9595 | 0.9686 | 0.9548 |
| Without error-bin balancing | 0.9618 | 0.9815 | 0.9251 | 0.9889 | 0.9954 | 0.9777 |
| Without positive pulling | 0.8663 | 0.9228 | 0.8543 | 0.9777 | 0.9897 | 0.9742 |
| Random pseudo-anomalous negatives | 0.7935 | 0.8667 | 0.8426 | 0.9470 | 0.9327 | 0.9590 |
| Random pseudo-anomalies | 0.8416 | 0.9091 | 0.8541 | 0.9914 | 0.9913 | 0.9771 |

Table [7](https://arxiv.org/html/2606.04073#S3.T7) further reports the quantitative results of the above six strategies on HTBF and XJTU-SY Bearing1-1. On the more complex HTBF dataset, the proposed method achieves the highest AUROC (0.9711), while remaining close to CARLA pseudo-injection on AUPR (0.9842) and best F1 (0.9347). After removing error-bin balancing, all metrics decrease slightly, indicating that simply generating boundary samples is not sufficient for stable detection; balanced coverage of different boundary strengths also affects the final performance. By contrast, random pseudo-anomalies, random pseudo-anomalous negatives, and removing positive pulling all lead to more obvious performance degradation, suggesting that when negative samples are not properly aligned with the normal boundary, or when the normal neighborhood is not sufficiently compacted, the representation learned in Stage 2 becomes more easily influenced by artifacts from random perturbations. On XJTU-SY Bearing1-1, all six strategies perform relatively well overall, indicating that the anomaly direction of this degradation sample is relatively concentrated and that several pseudo-anomaly construction schemes can provide useful training signals.
Even so, the proposed method still achieves the highest best F1 (0.9784), whereas in Table 7 only the random pseudo-anomaly variant is slightly higher on AUROC (0.9914 versus 0.9889), suggesting that on relatively easy data, the difference among pseudo-anomaly strategies is reflected more in boundary details and threshold-based results than in global separability itself.

The distributions on HTBF and XJTU-SY Bearing1-1 in Fig. [23](https://arxiv.org/html/2606.04073#S3.F23) show that the pseudo-anomalous samples generated by the proposed method tend to expand along the outside of the normal samples and maintain some directional consistency with the main shift directions of real anomalies. Taking HTBF as an example, the normal training samples and validation-normal samples form a relatively compact normal region, the proposed pseudo-anomalies are mainly distributed outside that region, and the real anomalies lie further along or adjacent to the outer edge. This indicates that the samples generated in Stage 1 are not simple random perturbations, but are instead extrapolated along vulnerable directions of the normal manifold in the reconstruction-error space, thereby providing hard negatives that are close to real anomaly directions for Stage 2. In XJTU-SY Bearing1-1, the life-stage difference between real degradation samples and normal samples is more pronounced, and the proposed method likewise constructs negative references near the normal region, making it easier for real degradation windows to be pushed away from the normal core in the embedding space.

By contrast, CARLA’s contextual, global, seasonal, shapelet, and trend injection schemes impose stronger template priors and can generate multiple semantically explicit types of time\-series anomalies, but these templates do not necessarily align with the reconstruction\-residual boundary in bearing vibration data\. CARLA may produce effective negative samples in scenarios where trend anomalies or shapelet anomalies are more pronounced; however, in multi\-channel and strongly perturbed data such as HTBF, template\-based injection can easily create several discrete anomaly directions, and the overlap between pseudo\-anomalies and real anomalies may occur only in local regions\. In other words, CARLA has the advantage of anomaly\-pattern diversity, whereas its limitation is that anomaly directions are determined by predefined transformations\. The advantage of the proposed method lies in the fact that anomaly strength is adaptively derived from the reconstruction\-error distribution of the normal training samples, which is more suitable for constructing bearing\-specific pseudo\-anomalies close to the normal boundary\.

The results after removing key modules further show the necessity of boundary construction\. Without error\-bin balancing, pseudo\-anomalies are more likely to concentrate in the error intervals that are easiest for the controller to hit, leading to insufficient samples along some boundary directions\. Although such pseudo\-anomalies may still separate from normal samples, their uneven coverage makes Stage 2 biased toward a few anomaly\-strength regions and weakens generalization to complex real anomalies\. Removing positive pulling weakens the compactness constraint among normal neighbors, so the repulsion term from pseudo\-anomalies is more likely to spread the normal samples apart, leading to elevated background scores or reduced threshold stability\. RandomNegative and RandomPseudo illustrate the limitations of random negative samples and random pseudo\-anomalies, respectively: in two\-dimensional plots, they may sometimes appear clearly separated, but that separation often comes from “easy differences” inconsistent with the structure of normal reconstruction residuals, so the model may learn random perturbation artifacts rather than the true boundary between bearing anomalies and the normal manifold\.

Therefore, this ablation study indicates that pseudo\-anomaly quality mainly depends on three factors: first, the pseudo\-anomalies should be located in the outer\-neighbor region relative to the normal manifold rather than arbitrarily far away; second, their strength should cover multiple reconstruction\-error intervals rather than concentrate at a single difficulty level; and third, the normal samples themselves should remain compact, otherwise negative\-sample repulsion is converted into elevated background scores\. The proposed target\-error control, error\-bin balancing, and normal positive\-pulling term jointly satisfy these conditions, which explains their greater stability compared with template\-based injection and random injection\.
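The second factor, coverage of multiple reconstruction-error intervals, can be sketched as a simple capped selection per error bin. This is one plausible balancing scheme written under stated assumptions, not the paper's actual procedure: the bin edges, the per-bin cap, and the function name `balance_by_error_bin` are hypothetical, and candidates are assumed to already carry a scalar reconstruction error.

```python
import random
from collections import defaultdict


def balance_by_error_bin(errors, bin_edges, per_bin, seed=0):
    """Return sorted indices of candidates, keeping at most per_bin from each error bin.

    errors: reconstruction error of each candidate pseudo-anomalous window.
    bin_edges: increasing edges; bin b covers [bin_edges[b], bin_edges[b + 1]).
    Candidates whose error falls outside every bin are dropped.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for idx, err in enumerate(errors):
        for b in range(len(bin_edges) - 1):
            if bin_edges[b] <= err < bin_edges[b + 1]:
                bins[b].append(idx)
                break
    selected = []
    for b in sorted(bins):
        members = bins[b]
        rng.shuffle(members)  # random subset so no candidate order is favored
        selected.extend(members[:per_bin])
    return sorted(selected)


# Toy candidates: most errors land in the lowest (easiest-to-hit) bin.
errors = [1.1, 1.2, 1.15, 1.3, 1.25, 1.05, 1.6, 1.7, 2.2]
edges = [1.0, 1.5, 2.0, 2.5]
kept = balance_by_error_bin(errors, edges, per_bin=2)
counts = [sum(edges[b] <= errors[i] < edges[b + 1] for i in kept) for b in range(3)]
print(kept, counts)
```

In this toy case the over-represented low-error bin is capped at two candidates, so the retained set is spread more evenly over the three error intervals than the raw candidates are.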

#### 3.4.2 Ablation II: shared encoder design and CARLA coverage relation

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_ablation_encoder.png)

Figure 24: Ablation on the shared encoder design and the coverage relation with CARLA. Panels A and B show the nearest-training-sample distance and the intra-pseudo-anomaly distance statistics for pseudo-anomalies generated by different Stage 1 encoders, respectively. Panel C shows the difference between the proposed method and CARLA pseudo-anomalies in PCA space and sector-wise coverage relations.

Figure [24](https://arxiv.org/html/2606.04073#S3.F24) further analyzes the effect of the Stage 1 representation structure from two perspectives. The Nearest Train-Data Distance in Panel A measures the distance from a pseudo-anomaly to its nearest normal training sample and can be understood as whether the pseudo-anomaly is sufficiently close to the normal boundary. The Pseudo-Pseudo Distance in Panel B measures the dispersion among pseudo-anomalous samples and can be understood as whether the pseudo-anomalies have sufficient coverage. An effective pseudo-anomaly generator should not simply maximize both quantities: if the distance to the nearest normal sample is too small, pseudo-anomalies become almost indistinguishable from normal samples and can introduce label noise; if it is too large, pseudo-anomalies turn into easy negatives and fail to provide fine-grained boundary information. Likewise, if the intra-pseudo-anomaly distance is too small, coverage is insufficient; if it is too large, the sample distribution may become fragmented and lose boundary continuity.
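The two distance statistics described above can be sketched as follows. The sketch assumes Euclidean distance in the embedding space and takes the Pseudo-Pseudo Distance to be the set of all pairwise distances among pseudo-anomalies; the paper may use a different metric or aggregation, and the function names are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_train_distance(pseudo, train):
    """For each pseudo-anomaly, the distance to its nearest normal training embedding."""
    return [min(euclidean(p, t) for t in train) for p in pseudo]


def pseudo_pseudo_distance(pseudo):
    """All pairwise distances among pseudo-anomalies, used as a dispersion measure."""
    return [
        euclidean(pseudo[i], pseudo[j])
        for i in range(len(pseudo))
        for j in range(i + 1, len(pseudo))
    ]


# Toy 2-D embeddings: a small normal cluster and two pseudo-anomalies outside it.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
pseudo = [(2.0, 0.0), (0.0, 3.0)]
to_train = nearest_train_distance(pseudo, train)
among_pseudo = pseudo_pseudo_distance(pseudo)
print(to_train, among_pseudo)
```

Boxplots of these two lists per encoder would correspond to the kind of comparison the panels describe: the first list indicates boundary proximity, the second indicates coverage.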

The boxplots for different encoders show that the shared\-encoder structure adopted in this paper achieves a relatively balanced trade\-off between “staying close to the normal boundary” and “maintaining pseudo\-anomaly diversity\.” Structures such as residual MLP generate pseudo\-anomalies that lie closer to normal training samples, suggesting that their perturbations remain closer to normality, but the insufficient boundary margin may weaken the contrastive training signal\. GRU, CNN, or some more complex encoders can generate more dispersed pseudo\-anomalies, but if they place pseudo\-anomalies too far from the normal samples, they are more likely to produce easy negatives unrelated to real bearing degradation\. In the proposed method, a shared encoder is combined with feature\-specific decoding heads, allowing different sensor dimensions to learn normal patterns under a common temporal context while preserving the independence of per\-feature reconstruction errors\. The resulting pseudo\-anomalies are thus constrained by the same normal manifold while still allowing each channel to deviate in a controllable way along its own residual direction, making them more suitable for subsequent contrastive learning in a unified embedding space\.

The PCA and Sankey analyses in Fig. [24](https://arxiv.org/html/2606.04073#S3.F24) further indicate that the proposed method and CARLA are not simply “two implementations of the same anomaly injection.” CARLA’s five types of injected samples occupy several semantically explicit sectors in PCA space, such as trend, shapelet, seasonal, global, and contextual directions. These directions are helpful for simulating typical time-series anomalies, but in bearing-vibration tasks they do not necessarily cover all residual boundaries outside the normal manifold. The samples generated by the proposed method use reconstruction error as the coordinate and do not assume specific fault templates. As a result, some regions in PCA space are adjacent to or overlap with CARLA, while others fall into no-match regions not covered by CARLA. The varying widths of the flows from different CARLA types into overlap, nearby, uncovered, and no-match regions in the Sankey diagram suggest that the two methods have both common and complementary coverage.

This observation has two implications\. First, the proposed method covers some anomaly directions also represented by CARLA, which means that the pseudo\-anomalies obtained by extrapolating reconstruction error are not semantically meaningless random perturbations; they bear some correspondence to anomaly types such as trends, shape changes, or contextual shifts\. Second, the proposed method also covers regions that CARLA cannot easily enumerate explicitly, indicating that the anomaly boundary of bearings cannot be fully represented by a small set of predefined time\-series transformations\. In real bearing monitoring, faults may appear as compound changes in energy, frequency, impact sparsity, channel coupling, and condition disturbance\. Using the upper tail of reconstruction errors on normal training samples as the basis for extrapolation makes it possible to build a boundary closer to the equipment’s own data distribution without relying on fault templates\. Therefore, the contribution of the proposed shared\-encoder and error\-control mechanism is not merely that they generate more pseudo\-anomalies, but that they place these pseudo\-anomalies at more reasonable boundary locations and with more reasonable coverage in the representation space\.

#### 3.4.3 Ablation III: Stage 2 contrastive loss and anomaly score

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_ablation_loss_score.png)

Figure 25: Ablation on the Stage 2 contrastive loss and anomaly score. Panels A and B show the anomaly-score curves under different contrastive losses on HTBF and CWRU, respectively. Panels C, D, and E compare the comprehensive metrics of the *normal score*, *relative score*, and *OOC score* on the two datasets.

Figure [25](https://arxiv.org/html/2606.04073#S3.F25) compares three contrastive losses in Stage 2 (Triplet-Margin, Pairwise-Margin, and InfoNCE) and further compares three anomaly scores, namely the *normal score*, *relative score*, and *OOC score*. This experiment reveals an important phenomenon: pseudo-anomalies can serve as boundary references during training, but may not be suitable as “real anomaly prototypes” that directly participate in scoring at test time. Therefore, the loss function and the anomaly score must be consistent with the role played by pseudo-anomalies.

From the score curves over time, all three losses achieve nearly saturated detection performance on CWRU, with Triplet\-Margin, Pairwise\-Margin, and InfoNCE all reaching AUROC, AUPR, and best F1 of 1\.0000\. This indicates that the difference between normal and fault samples is strong in CWRU, so sufficiently separable representations can be learned even with different contrastive objectives\. Therefore, CWRU is better suited to verifying that the model has basic detection capability than to distinguishing which Stage 2 loss is superior\. The HTBF results are more informative: Triplet\-Margin achieves AUROC=0\.9711, AUPR=0\.9842, and best F1=0\.9346, clearly outperforming Pairwise\-Margin \(AUROC=0\.8057, AUPR=0\.8864, best F1=0\.8342\) and InfoNCE \(AUROC=0\.7590, AUPR=0\.8329, best F1=0\.8358\)\. This difference shows that for multi\-channel data under strong operating\-condition disturbance, it is not enough simply to separate normal and pseudo\-anomalous samples; what matters more is learning local relative distance relations\.

Triplet\-Margin has the advantage that it simultaneously constrains the relative distances among anchor, positive, and negative samples: the normal neighbor is required to be closer to the anchor than the pseudo\-anomalous negative sample, with an explicit margin retained\. This objective is highly consistent with the final *normal score* adopted in this paper: during training, the local normal neighborhood is compacted and the boundary samples are pushed away; during testing, the anomaly degree is measured by the KNN distance from the test window to the normal sample bank\. Pairwise\-Margin mainly emphasizes absolute attraction or repulsion between sample pairs and lacks the ranking constraint among anchor, positive, and negative, which makes it more likely that the internal structure of the normal samples is not sufficiently compacted\. Although pseudo\-anomalies are pushed away, the ranking of real anomalies can remain unstable\. InfoNCE relies on in\-batch negatives and a temperature parameter; when the number of pseudo\-anomalies is large and their difficulty is uneven, the loss may be dominated by many easy negatives\. As a result, the model may learn a global separation trend, but the background scores of normal segments are more easily elevated, leading to decreases in Precision/best F1 and ranking metrics on HTBF\.
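The ranking constraint described above can be sketched in a few lines of NumPy\. This is a minimal illustration, not the paper's implementation: the Euclidean distance, the margin of `1.0`, and the function name `triplet_margin_loss` are assumptions made for the example\.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge loss max(0, d(a, p) - d(a, n) + margin) over a batch.

    anchor, positive: embeddings of normal windows (positive = normal neighbor).
    negative: embeddings of pseudo-anomalous windows.
    All arrays have shape (batch, dim). Euclidean distance is an assumption.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


# Toy 2-D embeddings (hypothetical values, for illustration only).
anchor = np.array([[0.0, 0.0], [1.0, 0.0]])
positive = np.array([[0.1, 0.0], [1.0, 0.2]])
far_negative = np.array([[3.0, 0.0], [1.0, 4.0]])   # well outside the margin
near_negative = np.array([[0.5, 0.0], [1.0, 0.6]])  # inside the margin

loss_far = triplet_margin_loss(anchor, positive, far_negative)
loss_near = triplet_margin_loss(anchor, positive, near_negative)
print(loss_far, loss_near)
```

When the pseudo\-anomalous negative already lies farther from the anchor than the normal neighbor by at least the margin, the triplet contributes zero loss; only negatives near the boundary still contribute\.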

The comparison among different anomaly scores supports the same explanation\. The *normal score* uses only the distance from a test window to the normal training sample bank and therefore directly answers the one\-class anomaly\-detection question “Has this window deviated from the normal manifold?” The *relative score* uses distances to both the normal bank and the pseudo\-anomalous bank, and the *OOC score* further treats normal and pseudo\-anomalous samples as two reference categories\. These two scores work well on strongly separable data such as CWRU, but are less stable than the *normal score* on complex data such as HTBF\. The reason is that the main role of the Stage 1 pseudo\-anomalies is to shape the boundary, not to exhaust all possible real anomaly categories\. Real fault windows may deviate along some pseudo\-anomaly directions, but may also deviate along directions not fully covered by CARLA or by the pseudo\-anomalies generated in this paper\. In that case, if the pseudo\-anomaly bank is used as an anomaly prototype in scoring, the inconsistency between real anomalies and pseudo\-anomalies can partially offset the deviation from the normal bank\.
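As a hedged sketch of the *normal score*, the snippet below scores each test embedding by its mean Euclidean distance to the `k` nearest embeddings in a normal bank\. The aggregation \(mean over neighbors\), the value of `k`, the random toy embeddings, and the name `knn_normal_score` are assumptions; this section only states that the score is a KNN distance to the normal sample bank\.

```python
import numpy as np


def knn_normal_score(test_emb, normal_bank, k=5):
    """Mean Euclidean distance from each test embedding to its k nearest
    neighbors in the normal bank. Larger values mean 'farther from normal'."""
    # Pairwise distances, shape (n_test, n_bank).
    diffs = test_emb[:, None, :] - normal_bank[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    k = min(k, normal_bank.shape[0])
    # np.partition places the k smallest distances in the first k columns.
    nearest = np.partition(dists, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)


# Synthetic embeddings (illustrative only, not bearing data).
rng = np.random.default_rng(0)
normal_bank = rng.normal(0.0, 1.0, size=(200, 8))
inlier = rng.normal(0.0, 1.0, size=(5, 8))
outlier = rng.normal(6.0, 1.0, size=(5, 8))

scores_in = knn_normal_score(inlier, normal_bank)
scores_out = knn_normal_score(outlier, normal_bank)
print(scores_in.round(2), scores_out.round(2))
```

Note that no pseudo\-anomalous bank appears in this score, which matches the role assigned to pseudo\-anomalies in the text: they shape the representation during training and are not used as prototypes at test time\.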

Therefore, the final choice in this paper is the combination of Triplet\-Margin and the *normal score*\. This combination provides stronger consistency between the training objective and the test\-time scoring: Triplet\-Margin compacts the normal neighborhood and establishes a margin outside it, while the *normal score* measures distance with respect to this compacted normal neighborhood\. In this framework, pseudo\-anomalies play the role of “boundary\-shaping samples” rather than “anomaly\-category templates\.” This conclusion also explains why, in our framework, the value of generating high\-quality pseudo\-anomalies is mainly manifested in the representation\-learning stage rather than in simply adding pseudo\-anomalies to the test\-time KNN classifier\.

#### 3\.4\.4 Ablation IV: RL step\-size controller

To further analyze the specific role of the reinforcement\-learning step\-size controller in pseudo\-anomaly generation, we construct control experiments on CWRU and HTBF in which RL\-based step control is removed\. The two settings are: 1\) TPA\-AD w/ RL step controller, where the actor in Stage 1 adaptively controls the perturbation step size for continuous features according to the current reconstruction error, target error, and perturbation progress; and 2\) TPA\-AD w/o RL step controller, where the actor controller is removed and the perturbation magnitude is computed directly using an analytic target\-ratio step\-size strategy\. To eliminate the effect of differences in the number of pseudo\-anomalous samples, both settings collect a fixed number of $3000$ pseudo\-anomalous windows with target\-error bin filtering and pseudo\-bin balancing turned off\. Thus, the focus of this experiment is not “which setting generates more pseudo\-anomalies,” but rather “given the same number of pseudo\-anomalous samples, does RL\-based step control improve their effective diversity, source coverage, and boundary\-crossing ability?”

![Refer to caption](https://arxiv.org/html/2606.04073v1/rl_step_controller_ablation_main.png)

Figure 26: Ablation results for the RL step\-size controller\. The left column reports the relative gains of RL over no\-RL on five Stage 1 diversity indicators, including source coverage, average distance from pseudo\-anomalies to the nearest normal training samples, intra\-pseudo\-anomaly nearest\-neighbor distance, PCA 95% coverage area, and target\-error boundary\-crossing hit rate\. The middle and right columns show the PCA distributions with and without the RL step\-size controller, respectively\. The figure shows that, under a fixed number of pseudo\-anomalies, the main benefit of RL step control lies in improving the dispersion of boundary samples, their separation from the normal manifold, and their ability to cross the target\-error boundary\. It should be noted that the PCA plots are used only for qualitative visualization and not as standalone quantitative evidence\.

Table 8: Compact metrics for the RL step\-size\-controller ablation\. Each cell reports the raw values as “RL / No\-RL,” and the better RL value is highlighted in bold\.

| Dataset | Source coverage | Pseudo\-anomaly–normal NN distance | Intra\-pseudo\-anomaly NN distance | PCA 95% area | Boundary\-crossing hit rate |
| --- | --- | --- | --- | --- | --- |
| CWRU | **0\.735** / 0\.386 | **17\.296** / 6\.528 | **18\.033** / 3\.183 | **955\.4** / 864\.3 | **0\.760** / 0\.265 |
| HTBF | **0\.775** / 0\.762 | **13\.938** / 9\.547 | **55\.339** / 53\.401 | **3157\.0** / 3036\.0 | **0\.347** / 0\.214 |

The left column of Fig\. [26](https://arxiv.org/html/2606.04073#S3.F26) reports indicators with different units in terms of relative gains of RL over no\-RL, making it easier to compare the magnitude of improvement across indicators\. Table [8](https://arxiv.org/html/2606.04073#S3.T8) gives the corresponding raw values\. Overall, the RL step\-size controller improves not the number of pseudo\-anomalous samples, but their effective diversity and boundary expression ability\.

Taking CWRU as an example, the RL step\-size controller brings clear improvements on all five indicators\. Specifically, source coverage increases from 0\.386 to 0\.735, indicating that pseudo\-anomalous samples come from a broader variety of normal training windows instead of repeatedly perturbing only a few samples\. The average distance from pseudo\-anomalies to their nearest normal training samples increases from 6\.528 to 17\.296, showing that the RL\-controlled pseudo\-anomalies deviate more fully from the normal manifold\. The intra\-pseudo\-anomaly nearest\-neighbor distance increases from 3\.183 to 18\.033, indicating greater dispersion and lower redundancy among samples\. The boundary\-crossing hit rate increases from 0\.265 to 0\.760, showing that RL step control makes it easier for samples to cross the target\-error boundary\. In relative terms, on CWRU the intra\-pseudo\-anomaly nearest\-neighbor distance reaches about 5\.67 times its no\-RL value, the boundary\-crossing hit rate about 2\.86 times, and the pseudo\-anomaly–normal nearest\-neighbor distance about 2\.65 times, while source coverage gains about 90\.2%\. These results jointly show that on more standard bearing\-fault data, the RL step\-size controller can substantially enhance the coverage, dispersion, and boundary character of pseudo\-anomalous samples\.
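The relative figures quoted above can be approximately reproduced from the rounded values in Table 8\. The short sketch below computes, for each indicator, the RL/No\-RL ratio and the corresponding percentage gain; the dictionary keys and the helper name `ratio_and_gain` are illustrative\. Because the table entries are rounded, the last digit may differ slightly from the text \(for example, the rounded source\-coverage values for CWRU give roughly 90\.4% rather than 90\.2%\), presumably because the reported gains were computed from unrounded values\.

```python
# Raw (RL, No-RL) values copied from Table 8.
table8 = {
    "CWRU": {
        "source_coverage": (0.735, 0.386),
        "pseudo_normal_nn_dist": (17.296, 6.528),
        "intra_pseudo_nn_dist": (18.033, 3.183),
        "pca95_area": (955.4, 864.3),
        "boundary_hit_rate": (0.760, 0.265),
    },
    "HTBF": {
        "source_coverage": (0.775, 0.762),
        "pseudo_normal_nn_dist": (13.938, 9.547),
        "intra_pseudo_nn_dist": (55.339, 53.401),
        "pca95_area": (3157.0, 3036.0),
        "boundary_hit_rate": (0.347, 0.214),
    },
}


def ratio_and_gain(rl, no_rl):
    """Return (RL / No-RL ratio, relative gain in percent)."""
    ratio = rl / no_rl
    return ratio, (ratio - 1.0) * 100.0


for dataset, rows in table8.items():
    for name, (rl, no_rl) in rows.items():
        ratio, gain = ratio_and_gain(rl, no_rl)
        print(f"{dataset:5s} {name:22s} ratio={ratio:5.2f}x  gain={gain:+7.1f}%")
```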

The conclusion on HTBF is consistent with that on CWRU, although the gains are distributed more unevenly across indicators\. Table [8](https://arxiv.org/html/2606.04073#S3.T8) shows that source coverage on HTBF increases only slightly from 0\.762 to 0\.775, and the improvements in intra\-pseudo\-anomaly nearest\-neighbor distance and PCA 95% area are also relatively moderate\. By contrast, the boundary\-crossing hit rate increases from 0\.214 to 0\.347, and the pseudo\-anomaly–normal nearest\-neighbor distance increases from 9\.547 to 13\.938, corresponding to relative gains of about 62\.6% and 46\.0%, respectively\. This suggests that in more complex multi\-channel scenarios, the main benefit of the RL step\-size controller is not necessarily to cover more training\-source windows, but rather to push pseudo\-anomalies away from the normal manifold more effectively and improve their ability to cross the target\-error boundary\. In other words, on HTBF the role of the RL step\-size controller is more about improving the “effective strength” of boundary samples than substantially expanding source coverage\.

The PCA results in Fig\. [26](https://arxiv.org/html/2606.04073#S3.F26) provide intuitive support for these findings\. On CWRU, after introducing the RL step\-size controller, pseudo\-anomalous samples expand over a wider region around the normal samples and become more dispersed\. Without RL control, pseudo\-anomalies still deviate from the normal region, but their overall coverage is narrower and local clustering is more obvious\. On HTBF, both methods can form a relatively clear ring\-like or outward\-expanding distribution, but the RL\-controlled pseudo\-anomalies cover the outer\-edge regions and cross\-boundary directions more fully, whereas the no\-RL version tends to concentrate in a few relatively fixed regions\. It should again be emphasized that PCA visualizations are used only to qualitatively illustrate pseudo\-anomaly distribution trends; the conclusions are still primarily supported by the quantitative indicators in Table [8](https://arxiv.org/html/2606.04073#S3.T8)\.

Overall, under a fixed number of pseudo\-anomalous samples, the RL step\-size controller mainly improves the separation of pseudo\-anomalies from the normal manifold, their internal dispersion, and their ability to cross the target\-error boundary, while on some datasets it further increases source coverage\. This indicates that its main contribution lies in improving the effective diversity of boundary samples rather than simply increasing the number of pseudo\-anomalous samples\.
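To make three of the diversity indicators concrete, the sketch below computes source coverage, the mean pseudo\-anomaly–normal nearest\-neighbor distance, and the mean intra\-pseudo\-anomaly nearest\-neighbor distance on synthetic data\. The precise definitions are assumptions: in particular, source coverage is taken here as the fraction of normal training windows that served as the source of at least one pseudo\-anomaly, and Euclidean distance is used throughout\. The function names and toy data are hypothetical\.

```python
import numpy as np


def nearest_distance(queries, bank, exclude_self=False):
    """Distance from each query to its nearest point in bank."""
    d = np.linalg.norm(queries[:, None, :] - bank[None, :, :], axis=2)
    if exclude_self:
        # Only valid when queries and bank are the same set.
        np.fill_diagonal(d, np.inf)
    return d.min(axis=1)


def diversity_indicators(pseudo, normal, source_idx):
    """pseudo: (n_pseudo, dim); normal: (n_normal, dim);
    source_idx[i] is the index of the normal window that pseudo[i] came from."""
    return {
        # Assumed definition: share of normal windows used as a source.
        "source_coverage": len(np.unique(source_idx)) / normal.shape[0],
        "pseudo_normal_nn_dist": float(nearest_distance(pseudo, normal).mean()),
        "intra_pseudo_nn_dist": float(
            nearest_distance(pseudo, pseudo, exclude_self=True).mean()
        ),
    }


rng = np.random.default_rng(0)
normal = rng.normal(size=(100, 4))

# "Spread" generator: many sources, large perturbations.
spread_src = rng.integers(0, 100, size=60)
spread = normal[spread_src] + rng.normal(scale=2.0, size=(60, 4))

# "Narrow" generator: few sources, small perturbations.
narrow_src = rng.integers(0, 10, size=60)
narrow = normal[narrow_src] + rng.normal(scale=0.2, size=(60, 4))

m_spread = diversity_indicators(spread, normal, spread_src)
m_narrow = diversity_indicators(narrow, normal, narrow_src)
print(m_spread)
print(m_narrow)
```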

#### 3\.4\.5 Ablation V: empirical basis of the Stage 1 target reconstruction\-error interval

Stage 1 requires a pre\-specified target reconstruction\-error interval to control the deviation strength of pseudo\-anomalous windows relative to normal windows\. If this interval is too low, the generated pseudo\-anomalies may still remain close to the normal manifold and fail to form an effective boundary\. If it is too high, the generated pseudo\-anomalies are more likely to lie too far away from normal samples, weakening the subsequent representation\-learning constraint toward real degradation directions\. To examine whether the interval setting has an empirical basis, we further analyze the distributions of window\-level reconstruction errors for normal training windows \(Train\-N\), validation\-normal windows \(Val\-N\), and validation fault/degradation windows \(Val\-F\)\. The purpose here is not to prove that reconstruction error itself is the final anomaly detector, but rather to verify whether high\-quantile statistics of reconstruction errors on normal training windows can serve as an empirical reference for the target error strength used in Stage 1 pseudo\-anomaly generation\.

For each dataset, we reuse the trained Stage 1 reconstruction model and compute window\-level reconstruction errors under the same window settings \($L=512$, $H=256$\)\. For multi\-channel data, the maximum channel\-wise error is used as the window error to avoid masking local fault responses through channel averaging\. In Fig\. [27](https://arxiv.org/html/2606.04073#S3.F27), the blue, green, and pink histograms denote the reconstruction\-error distributions of Train\-N, Val\-N, and Val\-F, respectively\. The black dashed and dash\-dotted lines denote the $Q95$ and $Q99$ of Train\-N, respectively, and the purple shaded region denotes the target\-error interval currently used for pseudo\-anomaly generation in Stage 1\. The horizontal axis is $\log_{10}(\text{window reconstruction error})$, so a rightward shift indicates a global increase in window reconstruction error\.
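The window\-level error described above can be sketched as follows\. This is an illustration under stated assumptions: $H$ is interpreted as the hop \(stride\) between windows, the per\-channel error is a mean squared error, and the signal and its "reconstruction" are synthetic; the name `window_errors` is hypothetical\.

```python
import numpy as np


def window_errors(x, x_hat, length=512, hop=256):
    """Window-level reconstruction error for a (time, channels) signal.

    For each window, compute the per-channel mean squared error and keep
    the maximum over channels, so a fault confined to one channel is not
    diluted by averaging across channels.
    """
    errs = []
    for start in range(0, x.shape[0] - length + 1, hop):
        sq = (x[start:start + length] - x_hat[start:start + length]) ** 2
        per_channel = sq.mean(axis=0)
        errs.append(per_channel.max())
    return np.asarray(errs)


# Synthetic 3-channel signal with a small uniform reconstruction offset.
t = np.arange(2048)
x = np.stack([np.sin(0.01 * (c + 1) * t) for c in range(3)], axis=1)
x_hat = x + 0.01
# Inject a large reconstruction error into channel 2 only, samples 1024..1535.
x_hat[1024:1536, 2] += 1.0

errs = window_errors(x, x_hat)
log_errs = np.log10(errs)  # the scale used on the horizontal axis of Fig. 27
print(errs.round(4))
```

With channel averaging, the fully affected window would report roughly one third of the error shown here, which is the masking effect the maximum is meant to avoid\.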

![Refer to caption](https://arxiv.org/html/2606.04073v1/bearing_recon_error_threshold_core.png)

Figure 27: Analysis of the empirical basis of the Stage 1 target reconstruction\-error interval\. The figure compares the window\-level reconstruction\-error distributions of Train\-N, Val\-N, and Val\-F on different datasets\. The black dashed/dash\-dotted lines denote the Train\-N $Q95$/$Q99$, and the purple shaded region denotes the current Stage 1 target\-error interval\. The figure illustrates that the high\-quantile region of normal\-training reconstruction error can serve as an empirical reference for the pseudo\-anomaly target error on some bearing datasets, but this reference is neither a unified cross\-dataset threshold nor the final anomaly decision threshold\.

Table 9: Compact results for the reconstruction\-error target\-interval ablation\. Train\-N $Q99$ denotes the high\-quantile reference of reconstruction error on normal training windows; Val\-N FPR@Train\-N $Q99$ denotes the proportion of validation\-normal windows lying to the right of the Train\-N $Q99$ reference line; Val\-F $Q10$/Train\-N $Q99$ measures the rightward shift of the lower quantile of fault/degradation windows relative to the high quantile of normal training windows; and Val\-F below Train\-N $Q99$ denotes the proportion of fault/degradation windows below Train\-N $Q99$\. This table is used only to analyze the empirical basis of the Stage 1 target\-error interval and does not correspond to a final deployment threshold\.

| Dataset | Train\-N $Q99$ | Val\-N FPR@Train\-N $Q99$ \(%\) | Val\-F $Q10$/Train\-N $Q99$ | Val\-F below Train\-N $Q99$ \(%\) |
| --- | --- | --- | --- | --- |
| CWRU | 0\.0049 | 0\.38 | 3\.46 | 2\.66 |
| HTBF | 0\.0066 | 1\.88 | 0\.10 | 72\.00 |
| PHM2009 | 0\.0735 | 1\.15 | 0\.01 | 85\.79 |
| REALBOX | 0\.0114 | 0\.00 | 0\.03 | 92\.83 |
| XJTU\-SY Bearing1\-1 | 0\.0145 | 4\.74 | 42\.11 | 1\.14 |
| XJTU\-SY Bearing1\-3 | 0\.0020 | 41\.44 | 36\.03 | 0\.00 |
| XJTU\-SY Bearing2\-2 | 0\.0020 | 12\.18 | 166\.41 | 0\.22 |
| XJTU\-SY Bearing2\-5 | 0\.0037 | 14\.56 | 108\.05 | 0\.00 |
| XJTU\-SY Bearing3\-3 | 0\.0022 | 0\.82 | 4475\.29 | 0\.00 |
| XJTU\-SY Bearing3\-5 | 0\.0042 | 74\.80 | 11\.98 | 0\.16 |
| IMS Bearing4 | 0\.0011 | 4\.38 | 0\.25 | 27\.60 |

Table [9](https://arxiv.org/html/2606.04073#S3.T9) and Fig\. [27](https://arxiv.org/html/2606.04073#S3.F27) jointly indicate that on CWRU and on most XJTU\-SY degradation samples, the reconstruction errors of fault/degradation windows shift clearly to the right relative to the high quantiles of the normal training windows\. Taking CWRU as an example, the $10$th\-percentile error of Val\-F is about $3.46$ times the $99$th\-percentile error of Train\-N, and only $2.66\%$ of fault windows lie below Train\-N $Q99$; meanwhile, Val\-N FPR@Train\-N $Q99$ is only $0.38\%$, suggesting that the high quantiles of normal training errors also provide a relatively stable conservative reference for normal windows on this dataset\. The phenomenon is even more pronounced for most XJTU\-SY samples\. For example, Val\-F $Q10$/Train\-N $Q99$ reaches $42.11$, $166.41$, and $4475.29$ for Bearing1\-1, Bearing2\-2, and Bearing3\-3, respectively, while the corresponding Val\-F below Train\-N $Q99$ values are only $1.14\%$, $0.22\%$, and $0.00\%$\. This suggests that, on these datasets, the upper tail of normal\-training reconstruction errors can naturally serve as a reference for pseudo\-anomaly strength\.

On the other hand, HTBF, PHM2009, and REALBOX provide clear counterexamples\. The Val\-F $Q10$/Train\-N $Q99$ value of HTBF is only $0.10$, and $72.00\%$ of fault windows lie below Train\-N $Q99$; the corresponding proportions for PHM2009 and REALBOX are even higher, reaching $85.79\%$ and $92.83\%$, respectively\. This indicates that in these datasets there are many real fault windows with low reconstruction errors, so the high quantiles of normal training errors cannot be interpreted as a unified final anomaly decision threshold\. At the same time, Val\-N FPR@Train\-N $Q99$ varies substantially across datasets\. For example, it reaches $41.44\%$ and $74.80\%$ on XJTU\-SY Bearing1\-3 and Bearing3\-5, respectively, further showing that the role of Train\-N $Q99$ is better understood as an empirical reference line for Stage 1 rather than a stable deployment threshold across datasets\.
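The four statistics reported in Table 9 follow directly from three arrays of window\-level reconstruction errors\. The sketch below computes them on synthetic log\-normal errors for a "separable" and an "entangled" fault set; the data, the function name `interval_reference_stats`, and the dictionary keys are illustrative assumptions, and the numbers it prints are not results from the paper\.

```python
import numpy as np


def interval_reference_stats(train_n, val_n, val_f):
    """Table 9-style statistics from window-level reconstruction errors."""
    q99 = np.quantile(train_n, 0.99)
    return {
        "train_n_q99": float(q99),
        # Share of validation-normal windows above the Train-N Q99 line.
        "val_n_fpr_pct": float(np.mean(val_n > q99) * 100.0),
        # Rightward shift of the lower tail of fault windows.
        "val_f_q10_over_q99": float(np.quantile(val_f, 0.10) / q99),
        # Share of fault windows that stay below the Train-N Q99 line.
        "val_f_below_pct": float(np.mean(val_f < q99) * 100.0),
    }


rng = np.random.default_rng(0)
train_n = rng.lognormal(mean=-6.0, sigma=0.3, size=5000)
val_n = rng.lognormal(mean=-6.0, sigma=0.3, size=1000)

# Separable case: fault errors are far larger than normal errors.
val_f_separable = rng.lognormal(mean=-3.0, sigma=0.3, size=1000)
# Entangled case: many fault windows reconstruct as well as normal ones.
val_f_entangled = rng.lognormal(mean=-6.2, sigma=0.5, size=1000)

stats_sep = interval_reference_stats(train_n, val_n, val_f_separable)
stats_ent = interval_reference_stats(train_n, val_n, val_f_entangled)
print(stats_sep)
print(stats_ent)
```

In the separable case the ratio exceeds one and almost no fault window falls below the reference line, mirroring the CWRU and XJTU\-SY pattern; in the entangled case most fault windows fall below it, mirroring the HTBF, PHM2009, and REALBOX pattern\.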

This is consistent with the two\-stage design of the proposed method\. The target\-error interval in Stage 1 is not directly used for the final decision; instead, it is used to generate pseudo\-anomalous windows with controlled deviation relative to normal samples, thereby establishing a boundary band outside the normal manifold for representation learning\. The final anomaly score is still determined by the KNN distance to the normal sample bank in the Stage 2 representation space\. Therefore, this experiment supports a more restrained conclusion: in bearing data, high\-quantile statistics of reconstruction errors on normal training windows can provide an empirical reference for the Stage 1 pseudo\-anomaly target\-error interval, thereby reducing the need to completely handcraft the pseudo\-anomaly strength\. However, because fault patterns, operating\-condition disturbances, and degradation processes vary substantially across datasets, this reference is neither a unified theoretical threshold across datasets nor the final anomaly decision threshold\. The IMS Bearing4 result in Fig\. [27](https://arxiv.org/html/2606.04073#S3.F27) can be regarded as a supplementary observation, but should not be used as the main argument to further strengthen the conclusion above\.

#### 3\.4\.6 Ablation VI: CARLA\-style pseudo\-anomalies versus the original Stage 2 logic

To further examine whether the performance of the proposed method comes merely from external pseudo\-anomalous samples and whether CARLA’s original Stage 2 logic is suitable for bearing scenarios, we add two groups of CARLA\-related comparison experiments\. In the first group, the pseudo\-anomaly generation in Stage 1 is replaced with CARLA\-style pseudo\-anomalous windows, while the representation learning in Stage 2 and the anomaly score based on the KNN distance to normal samples are kept unchanged; this setting is denoted as *CARLA pseudo \+ ours Stage 2*\. In the second group, an adapted version of CARLA’s original classification/clustering\-based Stage 2 is further adopted under the same training\-normal\-window and test\-window protocol; this setting is denoted as *CARLA original Stage 2 adapter*\. These two control settings are used, respectively, to test whether external pseudo\-anomalous samples can directly replace the target\-error\-controlled generation strategy of this paper, and whether CARLA’s own Stage 2 discrimination logic can stably replace the normal\-sample KNN representation score used in this paper\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/carla_baseline_contrast.png)

Figure 28: Comparison of score curves for CARLA\-style pseudo\-anomaly injection and the original CARLA Stage 2 logic on different bearing datasets\. The left column shows CARLA anomaly injection $\rightarrow$ Ours Stage 2, i\.e\., only the pseudo\-anomaly generation is replaced while keeping our Stage 2 representation learning and the *normal score* \(the anomaly score based on KNN distance to normal samples\)\. The right column shows the CARLA original Stage 2 adapter, i\.e\., the adapted version using CARLA’s original classification/clustering\-based Stage 2 logic\. Background shading indicates anomalous or degradation stages, and the curves qualitatively illustrate the differences in score stability between the two CARLA\-related schemes on different datasets\.

Table 10: Comparison of the proposed method and two CARLA\-related controls on $11$ bearing datasets\. Best F1 is a posterior separability metric obtained by scanning thresholds on the test scores and should not be interpreted as deployment\-threshold performance\. Bold indicates the best result on the same dataset and metric based on the original unrounded values\.

| Dataset | TPA\-AD AUROC | TPA\-AD AUPR | TPA\-AD Best F1 | CARLA pseudo \+ ours Stage 2 AUROC | CARLA pseudo \+ ours Stage 2 AUPR | CARLA pseudo \+ ours Stage 2 Best F1 | CARLA original Stage 2 adapter AUROC | CARLA original Stage 2 adapter AUPR | CARLA original Stage 2 adapter Best F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CWRU | 1\.0000 | 1\.0000 | 1\.0000 | 1\.0000 | 1\.0000 | 1\.0000 | 0\.9998 | 1\.0000 | 0\.9986 |
| HTBF | 0\.9711 | 0\.9842 | 0\.9347 | 0\.9665 | 0\.9841 | 0\.9344 | 0\.4905 | 0\.6474 | 0\.8149 |
| PHM2009 | 0\.7522 | 0\.8751 | 0\.8028 | 0\.7223 | 0\.8245 | 0\.7614 | 0\.6486 | 0\.7758 | 0\.8115 |
| REALBOX | 0\.9996 | 0\.9998 | 0\.9981 | 0\.9999 | 0\.9999 | 0\.9978 | 0\.0910 | 0\.4711 | 0\.8048 |
| XJTU\-SY Bearing1\-1 | 0\.9980 | 0\.9984 | 0\.9902 | 0\.9982 | 0\.9985 | 0\.9904 | 0\.9977 | 0\.9982 | 0\.9895 |
| XJTU\-SY Bearing1\-3 | 0\.9977 | 0\.9981 | 0\.9776 | 0\.9973 | 0\.9978 | 0\.9756 | 0\.9978 | 0\.9981 | 0\.9784 |
| XJTU\-SY Bearing2\-2 | 0\.9979 | 0\.9996 | 0\.9874 | 0\.9981 | 0\.9997 | 0\.9886 | 0\.9986 | 0\.9998 | 0\.9909 |
| XJTU\-SY Bearing2\-5 | 0\.9999 | 0\.9999 | 0\.9970 | 0\.9999 | 0\.9999 | 0\.9957 | 0\.9990 | 0\.9990 | 0\.9849 |
| XJTU\-SY Bearing3\-3 | 1\.0000 | 0\.9999 | 0\.9972 | 1\.0000 | 0\.9998 | 0\.9969 | 0\.0000 | 0\.0814 | 0\.1518 |
| XJTU\-SY Bearing3\-5 | 0\.9941 | 0\.9999 | 0\.9959 | 0\.9956 | 0\.9999 | 0\.9969 | 0\.1071 | 0\.9697 | 0\.9945 |
| IMS Bearing4 | 0\.9297 | 0\.9011 | 0\.8376 | 0\.8828 | 0\.7845 | 0\.7699 | 0\.8845 | 0\.8741 | 0\.7723 |

Table [10](https://arxiv.org/html/2606.04073#S3.T10) and Fig\. [28](https://arxiv.org/html/2606.04073#S3.F28) together show that *CARLA pseudo \+ ours Stage 2* clearly outperforms *CARLA original Stage 2 adapter* overall, indicating that simply replacing our Stage 2 with CARLA’s original classification/clustering\-style logic does not work stably under the bearing\-window protocol used in this paper\. In terms of grouped average results, *CARLA pseudo \+ ours Stage 2* achieves average AUROC/AUPR/best F1 of $0.9222/0.9521/0.9234$ on fault detection and $0.9817/0.9686/0.9591$ on degradation detection, whereas *CARLA original Stage 2 adapter* reaches only $0.5575/0.7236/0.8575$ and $0.7121/0.8457/0.8375$, respectively\. This difference is also visually evident in the score curves\. For example, on CWRU and REALBOX, the left column forms relatively clear anomaly boundaries, whereas the right column is more prone to sustained high platforms, compressed scores, or unstable ranking inside anomalous intervals\.
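The grouped averages quoted above can be checked against the rounded entries of Table 10\. In the sketch below, the grouping is an assumption inferred from the numbers: treating CWRU, HTBF, PHM2009, and REALBOX as the fault\-detection group and the remaining seven datasets as the degradation group reproduces the reported averages to within rounding\. Variable and function names are illustrative\.

```python
from statistics import mean

# (AUROC, AUPR, best F1) copied from Table 10.
carla_pseudo = {
    "CWRU": (1.0000, 1.0000, 1.0000),
    "HTBF": (0.9665, 0.9841, 0.9344),
    "PHM2009": (0.7223, 0.8245, 0.7614),
    "REALBOX": (0.9999, 0.9999, 0.9978),
    "XJTU-SY Bearing1-1": (0.9982, 0.9985, 0.9904),
    "XJTU-SY Bearing1-3": (0.9973, 0.9978, 0.9756),
    "XJTU-SY Bearing2-2": (0.9981, 0.9997, 0.9886),
    "XJTU-SY Bearing2-5": (0.9999, 0.9999, 0.9957),
    "XJTU-SY Bearing3-3": (1.0000, 0.9998, 0.9969),
    "XJTU-SY Bearing3-5": (0.9956, 0.9999, 0.9969),
    "IMS Bearing4": (0.8828, 0.7845, 0.7699),
}
carla_original = {
    "CWRU": (0.9998, 1.0000, 0.9986),
    "HTBF": (0.4905, 0.6474, 0.8149),
    "PHM2009": (0.6486, 0.7758, 0.8115),
    "REALBOX": (0.0910, 0.4711, 0.8048),
    "XJTU-SY Bearing1-1": (0.9977, 0.9982, 0.9895),
    "XJTU-SY Bearing1-3": (0.9978, 0.9981, 0.9784),
    "XJTU-SY Bearing2-2": (0.9986, 0.9998, 0.9909),
    "XJTU-SY Bearing2-5": (0.9990, 0.9990, 0.9849),
    "XJTU-SY Bearing3-3": (0.0000, 0.0814, 0.1518),
    "XJTU-SY Bearing3-5": (0.1071, 0.9697, 0.9945),
    "IMS Bearing4": (0.8845, 0.8741, 0.7723),
}

# Assumed grouping (inferred, not stated in this section).
FAULT = ["CWRU", "HTBF", "PHM2009", "REALBOX"]
DEGRADATION = [name for name in carla_pseudo if name not in FAULT]


def group_mean(table, names):
    """Unweighted mean of (AUROC, AUPR, best F1) over the named datasets."""
    return tuple(mean(table[n][i] for n in names) for i in range(3))


for label, table in [("CARLA pseudo + ours Stage 2", carla_pseudo),
                     ("CARLA original Stage 2 adapter", carla_original)]:
    for group_name, names in [("fault", FAULT), ("degradation", DEGRADATION)]:
        auroc, aupr, f1 = group_mean(table, names)
        print(f"{label:32s} {group_name:12s} {auroc:.4f}/{aupr:.4f}/{f1:.4f}")
```

Because the inputs are already rounded to four decimals, the last printed digit may differ by one from the values in the text\.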

Looking further at the relation between *CARLA pseudo \+ ours Stage 2* and the proposed method, one can see that external pseudo\-anomalous samples can indeed provide useful boundary information on some datasets, but their benefit is clearly dataset\-dependent\. This scheme performs well on CWRU, REALBOX, and most XJTU\-SY degradation samples, indicating that CARLA\-style pseudo\-anomalies are not entirely ineffective\. However, the proposed method remains superior on HTBF, PHM2009, and IMS Bearing4\. For example, on HTBF, TPA\-AD achieves AUROC/AUPR/best F1 of $0.9711/0.9842/0.9347$, whereas *CARLA pseudo \+ ours Stage 2* obtains $0.9665/0.9841/0.9344$; on PHM2009, TPA\-AD obtains $0.7522/0.8751/0.8028$, while *CARLA pseudo \+ ours Stage 2* drops to $0.7223/0.8245/0.7614$; and on IMS Bearing4, TPA\-AD achieves $0.9297/0.9011/0.8376$, while *CARLA pseudo \+ ours Stage 2* decreases to $0.8828/0.7845/0.7699$\. These results indicate that the performance of the proposed method does not simply come from “introducing arbitrary pseudo\-anomalous samples”; rather, the specific construction of pseudo\-anomalies and their deviation strength relative to the normal manifold still significantly affect the subsequent representation learning\.

On the other hand, *CARLA original Stage 2 adapter* can achieve relatively high results on CWRU and some XJTU\-SY samples, but it shows clear instability on multiple datasets\. For example, its AUROC is only $0.0910$ on REALBOX, nearly $0$ on XJTU\-SY Bearing3\-3, and only $0.1071$ on XJTU\-SY Bearing3\-5\. Correspondingly, the right column of Fig\. [28](https://arxiv.org/html/2606.04073#S3.F28) shows obvious score instability, score flattening, or ranking inversion on these datasets\. This indicates that CARLA’s original classification/clustering\-style Stage 2 logic is more sensitive to dataset structure, anomaly proportion, and degradation process, and cannot be stably transferred to normal\-only bearing scenarios\. Taken together, these two CARLA\-related controls support a more robust attribution of the advantage of the proposed method to the combination of controlled pseudo\-anomaly generation under target\-error constraints and the Stage 2 representation score based on KNN distances to normal samples, rather than to a generic recipe of “arbitrary pseudo\-anomalies \+ arbitrary Stage 2\.”
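An AUROC close to zero corresponds to ranking inversion: anomalous windows systematically receive lower scores than normal ones\. The following pure\-Python sketch, on made\-up toy scores, computes AUROC as a pairwise ranking probability and best F1 by scanning thresholds over the observed scores\. It also illustrates that best F1 can stay high under an inverted ranking when anomalies dominate the test set, because flagging every window is one of the scanned thresholds; whether this mechanism accounts for any specific Table 10 entry is not established here\. The function names and toy data are illustrative assumptions\.

```python
def auroc(scores, labels):
    """Probability that a random anomaly outscores a random normal sample
    (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_f1(scores, labels):
    """Best F1 over thresholds placed at every observed score
    (a sample is flagged as anomalous when score >= threshold)."""
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp:
            best = max(best, 2 * tp / (2 * tp + fp + fn))
    return best


# Toy test set: 2 normal points followed by 8 anomalous points.
labels = [0] * 2 + [1] * 8
good_scores = [0.1, 0.2, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
inverted_scores = [-s for s in good_scores]  # same scores, ranking reversed

print(auroc(good_scores, labels), best_f1(good_scores, labels))
print(auroc(inverted_scores, labels), best_f1(inverted_scores, labels))
```

This is one reason the caption of Table 10 treats best F1 as a posterior separability metric rather than as deployment\-threshold performance, and why ranking metrics such as AUROC are the more informative signal of instability\.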

### 3\.5 Hyperparameter sensitivity analysis

The target\-error interval in Stage 1 is one of the more important empirical hyperparameters of the proposed method\. As defined in the method section, we construct the target interval as $\tau_{j}^{\mathrm{low}}=Q_{q_{l}}(\{e_{i,j}^{\mathrm{train}}\}_{i})$ and $\tau_{j}^{\mathrm{high}}=Q_{q_{u}}(\{e_{i,j}^{\mathrm{train}}\}_{i})$ from the per\-feature reconstruction\-error distribution of the normal training samples, and then sample target errors within this interval to generate pseudo\-anomalies\. It should be emphasized that $q_{l}$ and $q_{u}$ are not fault\-alarm thresholds used at deployment time; rather, they are unsupervised empirical parameters used in Stage 1 to construct a pseudo\-anomaly boundary band\. They depend only on the reconstruction\-error distribution of the normal training data\.
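A minimal sketch of this interval construction is given below\. The quantile levels `q_low=0.90` and `q_high=0.99` are placeholders rather than the paper's settings, uniform sampling inside the interval is a simplifying assumption, and the synthetic error matrix and function names are hypothetical\.

```python
import numpy as np


def target_error_interval(train_errors, q_low=0.90, q_high=0.99):
    """Per-feature interval [tau_low_j, tau_high_j] from normal training errors.

    train_errors has shape (n_windows, n_features); entry (i, j) is the
    reconstruction error of feature j on normal training window i.
    """
    tau_low = np.quantile(train_errors, q_low, axis=0)
    tau_high = np.quantile(train_errors, q_high, axis=0)
    return tau_low, tau_high


def sample_target_errors(tau_low, tau_high, n_samples, rng):
    """Draw per-feature target errors uniformly inside the interval
    (uniform sampling is an assumption made for this sketch)."""
    return rng.uniform(tau_low, tau_high, size=(n_samples, tau_low.shape[0]))


rng = np.random.default_rng(0)
# Synthetic per-feature errors for 1000 normal windows and 3 features.
train_errors = rng.lognormal(mean=-5.0, sigma=0.5, size=(1000, 3))

tau_low, tau_high = target_error_interval(train_errors)
targets = sample_target_errors(tau_low, tau_high, n_samples=16, rng=rng)
print(tau_low, tau_high)
print(targets.shape)
```

Only normal training errors enter the computation, which is why the interval can be set without any labeled fault data\.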

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_hyperparameter_sensitivity.png)

Figure 29: Hyperparameter sensitivity analysis of the target\-error quantile interval\. Each subplot shows the AUROC, AUPR, and best\-F1 results obtained by different combinations of the lower quantile $q_{l}$ and upper quantile $q_{u}$ on CWRU, XJTU\-SY, PHM2009, HTBF, REALBOX, and the cross\-dataset average\.

Figure [29](https://arxiv.org/html/2606.04073#S3.F29) shows the effect of different combinations of lower and upper quantiles on AUROC, AUPR, and best F1\. Overall, the role of the target\-error interval can be understood from two dimensions: boundary strength and boundary coverage\. A relatively low upper quantile makes the pseudo\-anomalies too close to the main body of the normal reconstruction\-error distribution\. Although such samples are difficult, they often lack sufficient anomaly magnitude and can easily cause normal operating fluctuations to be treated as negative samples, ultimately resulting in insufficient margin between normal and anomalous states\. By contrast, an excessively high and overly narrow interval concentrates pseudo\-anomalies in the extreme tail of the error distribution\. Although such pseudo\-anomalies can serve as strong negatives, they may sacrifice boundary continuity and make the model overly biased toward strong anomalies while weakening its response to early degradation or complex weak anomalies\. Therefore, a stable target interval usually needs to satisfy both of the following: the upper bound should be sufficiently high to ensure that pseudo\-anomalies leave the normal core, while the lower bound should not be too high so that pseudo\-anomalies can cover multiple boundary levels ranging from weak shifts to strong shifts\.

The sensitivity differences across datasets further indicate that this hyperparameter is related to the intrinsic fault separability and operating\-condition complexity of the data\. CWRU, XJTU\-SY, and REALBOX maintain high metrics under many valid quantile combinations, suggesting that these datasets contain relatively clear amplitude or degradation differences between normal and anomalous states\. As long as Stage 1 can generate pseudo\-anomalies outside the normal region, Stage 2 can produce stable separation\. PHM2009 and HTBF are more sensitive to the target\-error interval\. PHM2009 represents a gearbox compound\-fault scenario in which real anomalies and normal samples are locally entangled, while HTBF contains both multi\-channel coupling and operating\-condition disturbance\. For such data, overly low target\-error intervals easily generate pseudo\-anomalies that still resemble normal data, leading to elevated score backgrounds\. If only extremely high quantiles are used, pseudo\-anomalies may become easy negatives along only a few directions and fail to cover the multiple deviation modes of real anomalies\. Therefore, a moderately wide interval with a mid\-to\-high upper quantile is usually more robust\.

From the cross\-dataset average, the better\-performing region is not an isolated single point but rather a stable area composed of medium lower quantiles and relatively high upper quantiles\. This observation supports the empirical thresholding strategy adopted in this paper: for bearing vibration data, the upper tail of the reconstruction\-error distribution of normal training windows can be treated as a range of “candidate anomaly strengths near the normal boundary,” and bin\-wise sampling can be carried out within this region instead of manually setting perturbations with fixed amplitudes\. The advantage of this strategy is that reconstruction\-error values can vary greatly across datasets, channels, and scales, whereas quantiles are relatively comparable\. In addition, upper\-tail quantiles naturally include the hardest\-to\-reconstruct normal operating samples, which helps generate difficult pseudo\-anomalies closer to early degradation\.

Based on the above results, we provide the following practical interpretation for bearing scenarios\. If a dataset exhibits clear fault differences and relatively stable normal operating conditions, a relatively broad mid\-to\-high quantile interval can yield stable results\. If strong operating\-condition disturbances or multi\-channel coupling are present, overly low upper quantiles should be avoided, and the lower quantile can be moderately increased to reduce overlap with the main normal region\. If the goal is to enhance sensitivity to early degradation, one should not rely exclusively on extreme high quantiles; some lower\-strength boundary samples should also be retained\. In other words, the role of the target\-error interval in Stage 1 is not to locate the final anomaly decision line, but to define the “training boundary band” for pseudo\-anomaly generation\. Figure[29](https://arxiv.org/html/2606.04073#S3.F29)shows that this boundary band has a reasonably wide usable range across multiple bearing datasets, suggesting that the proposed method is not overly dependent on the target\-error quantiles\. However, on more complex datasets such as PHM2009 and HTBF, a proper choice of a mid\-to\-high quantile interval can still significantly improve boundary quality and final detection stability\.

## 4 Discussion and extended analysis

As a discussion\-oriented extension, we further examine the broader applicability of the proposed method on public time\-series anomaly detection datasets\. This part does not replace the main conclusions drawn from the bearing fault\-detection and degradation\-detection experiments above; rather, it supplements them by observing how the framework transfers to non\-bearing data\. To this end, we select 13 representative public datasets from the TSB\-AD benchmark Liu and Paparrizos \([2024](https://arxiv.org/html/2606.04073#bib.bib17)\) and adopt the data organization scheme of TSB\-UAD Paparrizos et al\. \([2022](https://arxiv.org/html/2606.04073#bib.bib36)\) for the univariate sequences\. The selected data include both univariate sequences converted from classification datasets and multivariate sequences from spacecraft telemetry, wearable sensing, Web\-service KPIs, industrial system monitoring, and cluster runtime logs\. The corresponding anomaly types include point anomalies, interval anomalies, trend drifts, and pattern switches\.

#### 4\.0\.1 Experimental setup for the TSAD extension

The TSAD extension analysis uses 13 public data sources or subsets from the TSB\-AD benchmark, namely UCR, YAHOO, WSD, CATSv2, Daphnet, Exathlon, NEK, IOPS, LTDB, MSL, SMAP, SMD, and PSM\. Unlike the manually constructed normal/fault concatenated sequences used in the bearing fault\-detection experiments, the TSAD data mainly follow the existing training segments, test segments, and point\-level anomaly labels provided by the public data sources\. If an official training/test split is available, we adopt it directly; otherwise, the training and test segments are constructed according to the fixed protocol implemented in the code\. Only the training segment is used during training, and anomaly\-detection metrics are computed on point\-level labels during testing\. Here, both training size and test size refer to the number of point\-level samples, and the anomaly ratio is calculated from the point\-level labels in the test segment\. For multivariate data, the number of effective features actually involved in the current implementation is reported for both continuous and discrete dimensions\.

In the TSAD experiments, the model likewise outputs a window\-level anomaly score using the robustly normalized *normal score*, which is then mapped to a point\-level anomaly score by averaging over overlapping windows, and the evaluation metrics are computed against point\-level labels\. Window labels are still generated using the conservative rule that a window is marked anomalous if it contains at least one anomalous point\. The robust normalization here still acts on the anomaly\-distance scores output by Stage 2 rather than on input\-feature preprocessing\. This transformation mainly affects the numerical scale of the scores and the threshold location; its influence on ranking metrics such as AUROC and AUPR is relatively small, although quantile clipping may cause ties among extreme scores\.
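The score mapping and window\-labeling rule in the paragraph above can be sketched as follows\. This is a minimal illustration under stated assumptions: windows are taken at a fixed stride starting from index 0, and the robust normalization is assumed to be median/IQR scaling with quantile clipping at the 1% and 99% levels\. The function names and these constants are hypothetical and are not taken from the paper's code\.

```python
import numpy as np

def window_scores_to_points(window_scores, window_size, stride, n_points):
    """Average window-level scores over all windows that cover each point."""
    total = np.zeros(n_points)
    count = np.zeros(n_points)
    for i, score in enumerate(window_scores):
        start = i * stride
        total[start:start + window_size] += score
        count[start:start + window_size] += 1
    # Points not covered by any window keep a score of 0.
    return np.divide(total, count, out=np.zeros(n_points), where=count > 0)

def window_labels(point_labels, window_size, stride):
    """Conservative rule: a window is anomalous if any point in it is anomalous."""
    point_labels = np.asarray(point_labels)
    starts = range(0, len(point_labels) - window_size + 1, stride)
    return np.array([int(point_labels[s:s + window_size].any()) for s in starts])

def robust_normalize(scores, clip_q=(0.01, 0.99)):
    """Assumed form of robust normalization: quantile clipping, then median/IQR scaling."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = np.quantile(scores, clip_q)
    clipped = np.clip(scores, lo, hi)
    q1, med, q3 = np.quantile(clipped, [0.25, 0.5, 0.75])
    iqr = q3 - q1
    return (clipped - med) / (iqr if iqr > 0 else 1.0)
```

Under this assumed form the transformation is monotone non\-decreasing, so the score ordering is preserved except for ties introduced by clipping, which is consistent with the small effect on AUROC and AUPR noted above\.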

The TSAD extension reports five metrics: VUS\-PR, VUS\-ROC, AUPR, AUROC, and Point\-F1\. Among them, AUPR and AUROC are point\-level ranking metrics, and Point\-F1 is the point\-level best F1 obtained by sweeping thresholds over the point\-level anomaly scores on the test set\. This threshold depends on the test labels and is used only to measure the upper bound of threshold\-based detection performance; it does not represent a fixed threshold that can be determined without labels at deployment time\. VUS\-ROC and VUS\-PR are computed using a range\-based VUS implementation with point\-level labels and point\-level anomaly scores as input\. In the implementation, RangeAUC\_volume is used to compute the volumetric metrics, and the volume window is set to twice the sliding\-window size\. In most experiments, the VUS window parameter is 40, corresponding to a volume window of 80\. Conventional AUC and Point\-F1 are computed without point\-adjust, event\-adjust, or other post\-processing\.
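The Point\-F1 definition above, the best F1 over a label\-dependent threshold sweep with no point\-adjust, can be made concrete with a small NumPy sketch\. The function name `best_point_f1` and the convention that a point is predicted anomalous when its score is at least the threshold are assumptions made for illustration\.

```python
import numpy as np

def best_point_f1(scores, labels):
    """Best point-level F1 over all thresholds t, predicting anomaly when score >= t."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")
    sorted_scores = scores[order]
    tp = np.cumsum(labels[order])        # true positives if the top-i points are flagged
    fp = np.cumsum(1 - labels[order])    # false positives if the top-i points are flagged
    # Evaluate only at the last index of each run of tied scores,
    # so tied points are always flagged together.
    last_of_tie = np.r_[sorted_scores[1:] != sorted_scores[:-1], True]
    tp, fp = tp[last_of_tie], fp[last_of_tie]
    precision = tp / (tp + fp)
    recall = tp / max(labels.sum(), 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    best = int(np.argmax(f1))
    return float(f1[best]), float(sorted_scores[last_of_tie][best])
```

Since the returned threshold is selected using the test labels, the value is an upper bound on threshold\-based performance rather than a deployable operating point\.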

Unlike the bearing fault\-detection and degradation\-detection experiments, the TSAD extension results are obtained with repeated random runs\. Specifically, under fixed data splits and input\-preprocessing protocols, the Stage 2 representation learning is repeated 5 times\. The randomness mainly comes from model initialization, mini\-batch construction, or sampling during training\. The TSAD results in the tables are reported as the mean and standard deviation over the 5 Stage 2 runs\. This setting is used to reflect the random fluctuations introduced by deep representation learning on generic TSAD data, whereas the main bearing experiments are reported under a fixed protocol and repeated runs are used only as stability checks\.

#### 4\.0\.2 Processing strategy for continuous and discrete variables

The TSAD experiments adopt a separated continuous/discrete branch strategy\. For continuous features, the model follows the Stage 1 pseudo\-anomaly generation and Stage 2 representation\-learning pipeline, and for multivariate data it uses per\-feature reconstruction and normalization to reduce the influence of different scales on reconstruction errors and KNN distances\. For fully discrete data, continuous pseudo\-anomalies are no longer generated; instead, KNN distances are computed directly on one\-hot encoded discrete windows\. For mixed data containing both continuous and discrete variables, continuous scores and discrete KNN scores are first obtained separately and then normalized and fused\. This strategy preserves the advantage of target\-error control for continuous variables while avoiding semantically inappropriate continuous perturbations on discrete state variables\.
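As a rough sketch of the discrete branch and the fusion step described above, the following code one\-hot encodes windows of an integer state sequence, scores test windows by their mean distance to the nearest training windows, and fuses two score streams after rescaling\. The Euclidean metric, the choice `k=3`, the median/IQR rescaling, and the equal\-weight average are all assumptions made for illustration; the paper does not specify these details in this section\.

```python
import numpy as np

def one_hot_windows(states, n_states, window_size, stride=1):
    """Slice an integer state sequence into windows and one-hot encode each window."""
    states = np.asarray(states, dtype=int)
    starts = range(0, len(states) - window_size + 1, stride)
    eye = np.eye(n_states)
    return np.stack([eye[states[s:s + window_size]].ravel() for s in starts])

def knn_score(queries, bank, k=3):
    """Mean Euclidean distance from each query window to its k nearest bank windows."""
    dists = np.linalg.norm(queries[:, None, :] - bank[None, :, :], axis=-1)
    k = min(k, bank.shape[0])
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)

def fuse(cont_scores, disc_scores):
    """Assumed fusion rule: rescale each branch by median/IQR, then average."""
    def rescale(s):
        s = np.asarray(s, dtype=float)
        q1, med, q3 = np.quantile(s, [0.25, 0.5, 0.75])
        return (s - med) / (q3 - q1 if q3 > q1 else 1.0)
    return 0.5 * (rescale(cont_scores) + rescale(disc_scores))
```

For example, if the training sequence only alternates between states 0 and 1, a test window made up entirely of an unseen state 2 receives a strictly larger KNN score than a window that repeats the training pattern\.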

#### 4\.0\.3 Simulated anomaly experiment

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_tsad_simulation.png) Figure 30: Anomaly\-detection results on simulated TSAD data\. The top row shows the input time series and predefined anomalous intervals, where red shading indicates anomalous intervals and red circles mark the major anomaly patterns; the middle row shows anomaly\-score curves under different score\-mapping schemes; and the bottom row shows representation\-space visualization, the score distributions of normal and anomalous samples, and integrated detection curves for point anomalies and pattern anomalies\. This experiment covers five typical forms of time\-series anomalies: global point anomalies, contextual point anomalies, shapelet anomalies, periodic anomalies, and trend anomalies\.

To further analyze the basic response of the proposed method to different anomaly patterns, we construct a set of controlled simulated time\-series anomaly experiments\. Unlike the real\-data experiments, the anomaly positions, anomaly durations, and anomaly types in the simulated experiments are predefined, making them more suitable for observing whether the anomaly scores align with the anomalous intervals and whether the model can respond to both point\-level anomalies and pattern\-level anomalies\. This experiment is not intended as a substitute for evaluating generalization in real complex scenarios; rather, it complements the analysis by illustrating the detection behavior of the proposed pseudo anomaly\-guided representation\-learning framework under typical TSAD anomaly patterns\.

Figure [30](https://arxiv.org/html/2606.04073#S4.F30) presents the results of the simulated time\-series anomaly experiments\. We divide the simulated anomalies into two categories: point anomalies and pattern anomalies\. Point anomalies further include global point anomalies and contextual point anomalies\. A global point anomaly refers to a single sampling point whose amplitude clearly deviates from the normal range of the entire sequence\. A contextual point anomaly refers to a sampling point whose global amplitude may not be extreme, but which is inconsistent with the normal pattern under its local context or periodic phase\. Pattern anomalies include shapelet anomalies, periodic anomalies, and trend anomalies\. Shapelet anomalies correspond to changes in local short\-subsequence shapes; periodic anomalies correspond to disruptions of local periodic structure or anomalous phase/amplitude patterns; and trend anomalies correspond to deviations in the overall direction or growth rate within a local time interval\.
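The five anomaly forms defined above can be reproduced on a toy signal\. The sketch below injects one instance of each type into a noisy sine wave; the base signal, anomaly positions, durations, and magnitudes are all hypothetical choices for illustration and are not the simulation settings behind Figure 30\.

```python
import numpy as np

def make_simulated_series(n=2000, period=50, seed=0):
    """Noisy sine wave with five injected anomaly types and point-level labels."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    x = np.sin(2 * np.pi * t / period) + 0.05 * rng.standard_normal(n)
    y = np.zeros(n, dtype=int)

    # Global point anomaly: amplitude far outside the range of the whole series.
    x[200] += 5.0
    y[200] = 1

    # Contextual point anomaly: value inside the global range, but placed
    # near a trough of the sine wave where a value close to -1 is expected.
    x[537] = 0.9
    y[537] = 1

    # Shapelet anomaly: one period replaced by a square wave of the same phase.
    seg = slice(800, 850)
    x[seg] = np.sign(np.sin(2 * np.pi * t[seg] / period))
    y[seg] = 1

    # Periodic anomaly: frequency doubled within the interval.
    seg = slice(1200, 1300)
    x[seg] = np.sin(4 * np.pi * t[seg] / period)
    y[seg] = 1

    # Trend anomaly: linear ramp added on top of the normal oscillation.
    seg = slice(1600, 1700)
    x[seg] += np.linspace(0.0, 2.0, 100)
    y[seg] = 1

    return x, y
```

Because positions and durations are fixed in advance, the returned labels `y` give exact anomalous intervals against which score curves can be aligned\.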

The anomaly\-score curves in Fig\. [30](https://arxiv.org/html/2606.04073#S4.F30) show that the proposed method produces clear score increases in all five simulated anomalous intervals\. In the global point\-anomaly scenario, the anomaly score forms sharp peaks near the anomalous points, indicating that the model can identify local abrupt changes with significant amplitude deviations\. In the contextual point\-anomaly scenario, the anomaly score not only responds to the local anomalous point itself but also remains at a relatively high level within the contextual interval, showing that the model does not rely solely on global amplitude magnitude but can use the local temporal background to judge anomalousness\. For shapelet anomalies, periodic anomalies, and trend anomalies, the anomaly score rises continuously within the anomalous intervals instead of producing only isolated spikes at individual time points, indicating that the learned representation can capture pattern\-level deviations such as local shapes, periodic structures, and trend changes\.

The representation\-space visualization at the bottom of Fig\. [30](https://arxiv.org/html/2606.04073#S4.F30) further verifies these observations\. Normal training samples and normal test samples are mainly distributed in nearby regions, pseudo\-anomalous samples lie outside the normal region, and real anomalous samples are separated from normal samples to some extent in the representation space\. This suggests that the pseudo\-anomalous samples generated in Stage 1 provide effective boundary constraints for Stage 2 representation learning, enabling the model to form distinguishable embedding distributions when facing different types of real anomalies\. The boxplots for point anomalies and pattern anomalies also show that anomalous windows have higher anomaly scores overall than normal windows, indicating that the proposed method has good response consistency for both point\-level and pattern\-level anomaly detection tasks\.

It should be noted that the anomalous boundaries and anomaly types in the simulated experiments are relatively explicit, and the data complexity is lower than that in real industrial scenarios\. Therefore, the main significance of this experiment is to verify the interpretability of the proposed method’s response to different anomaly forms rather than to prove its ultimate performance on real data\. The actual anomaly\-detection capability in real scenarios still needs to be comprehensively evaluated together with the quantitative results on CWRU, HTBF, PHM2009, REALBOX, XJTU\-SY, IMS, and the public TSAD datasets\.

#### 4\.0\.4 Score visualization on public TSAD data

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_tsad_score_01.png) Figure 31: Example anomaly scores on public TSAD data I\. The figure shows the anomaly\-score curves and annotated anomalous intervals on representative time series\.

Figure [31](https://arxiv.org/html/2606.04073#S4.F31) summarizes multiple representative sequences from univariate KPIs, multivariate sensor data, and system monitoring data\. For sequences with more explicit interval anomalies, such as Exathlon and PSM, the anomaly scores typically form relatively long high\-score plateaus inside the annotated intervals\. For sequences dominated by sparse spikes, such as IOPS and MSL, the score response is closer to narrow pulses\. The results on LTDB, Daphnet, and SMAP further indicate that when background fluctuations are strong or anomaly density is high, the model can still produce interpretable score increases around major anomalous intervals, although it does not necessarily produce ideal step\-like responses at every anomaly boundary\.

Figure [32](https://arxiv.org/html/2606.04073#S4.F32) further illustrates differences among three types of public univariate sequences\. Isolated anomaly points in UCR correspond to obvious local peaks, suggesting that the model remains highly sensitive to sudden point anomalies\. In WSD sequences, high\-score peaks are sparser, indicating that against a complex periodic background the model tends to respond strongly only to the most salient deviations\. In YAHOO samples, multiple interval anomalies correspond to relatively smooth high\-score plateaus, showing that the method is more stable on sustained anomalies\. Taken together, these two figures mainly serve as qualitative evidence that the model has indeed been tested on public sequences with different anomaly densities and background complexities, and that it can produce response patterns matched to the anomaly forms\. Final cross\-dataset performance should still be judged according to the metrics in Table [11](https://arxiv.org/html/2606.04073#S4.T11)\.

![Refer to caption](https://arxiv.org/html/2606.04073v1/fig_tsad_score_02.png) Figure 32: Example anomaly scores on public TSAD data II\. The figure further shows the model’s detection results on different datasets or sequences\.

#### 4\.0\.5 Quantitative metrics on public TSAD data

The quantitative TSAD results are shown in Table [11](https://arxiv.org/html/2606.04073#S4.T11)\. The table summarizes the average dimensionality and the corresponding AUROC, AUPR, Point\-F1, VUS\-ROC, and VUS\-PR for the 13 public TSAD data subsets\. AUROC and AUPR measure point\-level ranking ability, Point\-F1 reflects point\-level threshold\-based detection performance, and VUS\-ROC and VUS\-PR Boniol et al\. \([2025](https://arxiv.org/html/2606.04073#bib.bib35)\) further account for the detection quality in the neighborhood of anomalous intervals\. We additionally report VUS metrics because anomalies in TSAD often occur in interval form, and point\-level metrics alone may not fully reflect the model’s overall response around anomalous segments\.

Table 11: Detection results on 13 public TSAD datasets\. All results are reported as mean and standard deviation over 5 runs, and larger values indicate better performance for all metrics\.

| Data subset | Average dimension | AUROC | AUPR | Point\-F1 | VUS\-ROC | VUS\-PR |
| --- | --- | --- | --- | --- | --- | --- |
| UCR | 1 | 0\.9048 ± 0\.0017 | 0\.3386 ± 0\.0098 | 0\.4125 ± 0\.0123 | 0\.9125 ± 0\.0017 | 0\.3423 ± 0\.0086 |
| YAHOO | 1 | 0\.9173 ± 0\.0134 | 0\.5325 ± 0\.0148 | 0\.6048 ± 0\.0186 | 0\.9148 ± 0\.0123 | 0\.6512 ± 0\.0154 |
| WSD | 1 | 0\.6452 ± 0\.0486 | 0\.1512 ± 0\.0123 | 0\.1292 ± 0\.0470 | 0\.6852 ± 0\.0142 | 0\.1158 ± 0\.0451 |
| CATSv2 | 17 | 0\.7428 ± 0\.0098 | 0\.5144 ± 0\.0132 | 0\.5642 ± 0\.0386 | 0\.7473 ± 0\.0083 | 0\.4109 ± 0\.0288 |
| Daphnet | 9 | 0\.9226 ± 0\.0095 | 0\.4460 ± 0\.0217 | 0\.5123 ± 0\.0348 | 0\.9294 ± 0\.0122 | 0\.4643 ± 0\.0395 |
| Exathlon | 20\.16 | 0\.9765 ± 0\.0004 | 0\.8550 ± 0\.0015 | 0\.8549 ± 0\.0042 | 0\.9782 ± 0\.0007 | 0\.8416 ± 0\.0052 |
| NEK | 1 | 0\.8345 ± 0\.0086 | 0\.6712 ± 0\.0785 | 0\.7564 ± 0\.0042 | 0\.8124 ± 0\.0048 | 0\.6435 ± 0\.0235 |
| IOPS | 1 | 0\.8595 ± 0\.0105 | 0\.4596 ± 0\.0230 | 0\.6153 ± 0\.0153 | 0\.8196 ± 0\.0154 | 0\.1484 ± 0\.0212 |
| LTDB | 2\.25 | 0\.8844 ± 0\.0034 | 0\.7608 ± 0\.0158 | 0\.7332 ± 0\.0067 | 0\.8635 ± 0\.0039 | 0\.7301 ± 0\.0042 |
| MSL | 55 | 0\.5337 ± 0\.0047 | 0\.2543 ± 0\.0230 | 0\.4891 ± 0\.0191 | 0\.4299 ± 0\.0039 | 0\.2643 ± 0\.0115 |
| SMAP | 25 | 0\.9264 ± 0\.0105 | 0\.2397 ± 0\.0117 | 0\.3529 ± 0\.0145 | 0\.9253 ± 0\.0088 | 0\.2208 ± 0\.0072 |
| SMD | 38 | 0\.7621 ± 0\.0150 | 0\.4598 ± 0\.0123 | 0\.5761 ± 0\.0143 | 0\.5914 ± 0\.0046 | 0\.2191 ± 0\.0275 |
| PSM | 25 | 0\.6451 ± 0\.0221 | 0\.1966 ± 0\.0254 | 0\.4273 ± 0\.0184 | 0\.5832 ± 0\.0269 | 0\.2238 ± 0\.0114 |
| Average | – | 0\.8042 ± 0\.0122 | 0\.4618 ± 0\.0202 | 0\.5513 ± 0\.0191 | 0\.7733 ± 0\.0091 | 0\.4112 ± 0\.0192 |

Overall, this TSAD extension analysis indicates that the proposed method has a certain degree of broader applicability to general time\-series anomaly detection tasks, although its performance is still clearly affected by data type, anomaly density, and feature scale\. Compared with bearing data, anomaly patterns in public TSAD datasets are more diverse, and some datasets also contain sparse anomalies, long\-period drift, or discrete state switching\. Therefore, the results in this part are more appropriate as supplementary evidence of transfer performance rather than as a replacement for the main conclusions drawn in bearing scenarios\.

### 4\.1 Limitations and discussion

Although the proposed method achieves promising results on bearing time\-series anomaly detection under normal\-only training, several aspects still merit further discussion and improvement\.

##### 1\) Reconstruction and pseudo\-anomaly generation for discrete variables remain difficult\.

We attempted to introduce a Gumbel\-Softmax mechanism into the decoder to reconstruct discrete samples and generate corresponding pseudo\-anomalous samples\. However, preliminary experiments show that stable reconstruction on discrete variables usually requires substantially more samples than continuous variables\. As the sample size further increases, training cost, search space, and generation\-quality control also become more difficult\. Therefore, reconstruction modeling and pseudo\-anomaly generation for discrete variables still require further study\.

##### 2\) The method is relatively sensitive to shifts in the normal\-sample distribution\.

On some TSAD datasets, one important reason for weaker performance is that when the normal samples at test time shift substantially relative to the normal training distribution, some samples that should still be considered normal may also receive high anomaly scores\. Such shifts weaken the reference role of the Stage 1 target\-error interval and further affect the stability of Stage 2 representation learning\. In the future, techniques such as transfer calibration or test\-time adaptation may be introduced to mitigate the influence of shifts in the normal distribution, although whether these directions can stably improve performance still requires further verification\.

##### 3\) The fusion mechanism for continuous and discrete scores is still incomplete\.

For data containing both continuous and discrete variables, the anomaly scores output by the continuous and discrete branches are not fully consistent in either statistical scale or semantic meaning\. Direct fusion can therefore introduce substantial bias\. Building a more reasonable cross\-branch calibration and fusion mechanism remains an important issue for future work\. This issue is also closely related to the ability to reconstruct discrete variables effectively and to generate pseudo\-anomalies for them\.

##### 4\) Validation on real engineering data is still limited\.

This paper validates the method on only one real high\-speed\-train bearing case, and the scale of real engineering samples remains limited\. This limitation also indirectly reflects the fact that anomalous samples are often extremely scarce in real scenarios, which is precisely an important motivation for adopting the normal\-only training setting and for designing TPA\-AD\. Nevertheless, from the perspective of engineering deployment and generalization, further evaluation on larger\-scale real data under more operating conditions is still necessary\.

##### 5\) The method shows some promise for engineering deployment\.

From a deployment perspective, the proposed method does not rely on large quantities of fault samples\. In a high\-speed\-train scenario, it would be sufficient to collect a portion of normal bearing data from the early operation stage of the train to complete Stage 1 and Stage 2 training\. In terms of training\-sample demand, data\-storage cost, and per\-inference computational cost, the method places relatively modest demands on deployment resources and therefore has potential for further evaluation in online or edge\-deployment settings\. Even so, its practical engineering applicability still needs to be further verified on larger\-scale field data under more diverse operating conditions\.

## 5 Conclusion

This paper proposes TPA\-AD, a two\-stage pseudo anomaly\-guided anomaly detection method for bearing time\-series anomaly detection under the normal\-only training setting\. In Stage 1, pseudo\-anomalous windows near the normal boundary are generated through reconstruction\-based target\-error control, and in Stage 2, normal and pseudo\-anomalous windows are jointly used for contrastive representation learning, with anomaly detection performed by measuring the distance to the normal sample bank\. Experimental results on fault\-detection and degradation\-detection tasks show that the proposed method achieves competitive and stable performance across multiple bearing datasets, while the ablation studies confirm the importance of target\-error\-controlled pseudo\-anomaly generation, representation learning, and normal\-sample distance scoring\. The discussion\-oriented extension further suggests that the method has certain external applicability to general time\-series anomaly detection tasks\. Future work will focus on robustness to distribution shift, mixed\-variable modeling, and validation on larger\-scale real engineering data\.

## Acknowledgments

This work was supported by the National Key Research and Development Program of China \(2023YFB3308100\), the China State Railway Group Co\., Ltd\. Science and Technology Research and Development Program Project \(K2024J011\), the National Key R&D Program of China \(2023YFB4302400\), and the Natural Science Foundation of Shandong Province \(ZR2023ME124\)\.

## CRediT authorship contribution statement

Xiancheng Wang: Conceptualization, Methodology, Software, Formal analysis, Investigation, Visualization, Writing \- original draft\. Zhibo Zhang: Data providing\. Other authors: Supervision, Validation, Writing \- review & editing\.

## Data availability

The data that support the findings of this study are available from the authors upon reasonable request\. Due to data access restrictions, the REALBOX dataset cannot be publicly shared or redistributed by the authors\.

## Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper\.

## References

- Zhao et al\. \(2025\) Jiangdong Zhao, Wenming Wang, Ji Huang, and Xiaolu Ma\. A comprehensive review of deep learning\-based fault diagnosis approaches for rolling bearings: Advancements and challenges\. *AIP Advances*, 15\(2\), 2025\.
- Dong et al\. \(2024a\) Yutong Dong, Hongkai Jiang, Renhe Yao, Mingzhe Mu, and Qiao Yang\. Rolling bearing intelligent fault diagnosis towards variable speed and imbalanced samples using multiscale dynamic supervised contrast learning\. *Reliability Engineering & System Safety*, 243:109805, 2024a\.
- Pang et al\. \(2024\) Bin Pang, Qiuhai Liu, Zhenduo Sun, Zhenli Xu, and Ziyang Hao\. Time\-frequency supervised contrastive learning via pseudo\-labeling: An unsupervised domain adaptation network for rolling bearing fault diagnosis under time\-varying speeds\. *Advanced Engineering Informatics*, 59:102304, 2024\.
- Xu et al\. \(2025\) Yuhui Xu, Yimin Jiang, Tangbin Xia, Dong Wang, Zhen Chen, Ershun Pan, and Lifeng Xi\. Dynamic model\-assisted disentanglement framework for rolling bearing fault diagnosis under time\-varying speed conditions\. *Mechanical Systems and Signal Processing*, 230:112588, 2025\.
- Lin et al\. \(2023\) Tantao Lin, Yongsheng Zhu, Zhijun Ren, Kai Huang, and Dawei Gao\. Ccft: The convolution and cross\-fusion transformer for fault diagnosis of bearings\. *IEEE/ASME Transactions on Mechatronics*, 29\(3\):2161–2172, 2023\.
- Zhou et al\. \(2024a\) Zijun Zhou, Qingsong Ai, Ping Lou, Jianmin Hu, and Junwei Yan\. A novel method for rolling bearing fault diagnosis based on gramian angular field and cnn\-vit\. *Sensors*, 24\(12\):3967, 2024a\.
- Li et al\. \(2024\) Yang Li, Xiaojiao Gu, and Yonghe Wei\. A deep learning\-based method for bearing fault diagnosis with few\-shot learning\. *Sensors*, 24\(23\):7516, 2024\.
- Dai et al\. \(2025\) Miao Dai, Hangyeol Jo, Moonsuk Kim, and Sang\-Woo Ban\. Msff\-net: Multi\-sensor frequency\-domain feature fusion network with lightweight 1d cnn for bearing fault diagnosis\. *Sensors*, 25\(14\):4348, 2025\.
- Wang and Feng \(2024\) Shouqi Wang and Zhigang Feng\. Multi\-sensor fusion rolling bearing intelligent fault diagnosis based on vmd and ultra\-lightweight googlenet in industrial environments\. *Digital Signal Processing*, 145:104306, 2024\.
- Dong et al\. \(2024b\) Yutong Dong, Hongkai Jiang, Mingzhe Mu, and Xin Wang\. Multi\-sensor data fusion\-enabled lightweight convolutional double regularization contrast transformer for aerospace bearing small samples fault diagnosis\. *Advanced Engineering Informatics*, 62:102573, 2024b\.
- Ye et al\. \(2025\) Maoyou Ye, Xiaoan Yan, Xing Hua, Dong Jiang, Ling Xiang, and Ning Chen\. Mrcfn: A multi\-sensor residual convolutional fusion network for intelligent fault diagnosis of bearings in noisy and small sample scenarios\. *Expert Systems with Applications*, 259:125214, 2025\.
- He et al\. \(2025\) Deqiang He, Jinxin Wu, Zhenzhen Jin, ChengGeng Huang, Zexian Wei, and Cai Yi\. Agfcn: A bearing fault diagnosis method for high\-speed train bogie under complex working conditions\. *Reliability Engineering & System Safety*, 258:110907, 2025\.
- Li et al\. \(2025\) Guoqiang Li, Meirong Wei, Defeng Wu, Yiwei Cheng, Jun Wu, and Jin Yan\. Zero\-sample fault diagnosis of rolling bearings via fault spectrum knowledge and autonomous contrastive learning\. *Expert Systems with Applications*, 275:127080, 2025\.
- Liu et al\. \(2025a\) Yanlei Liu, Yonggang Xu, Miaorui Yang, Hong Jiang, and Kun Zhang\. Frequency pattern graph spectrum model and its applications in rolling bearing fault diagnosis\. *Mechanical Systems and Signal Processing*, 240:113426, 2025a\.
- Wang et al\. \(2025\) Xuan Wang, Zhanqiang Hou, Gao Liu, Junwei Xue, Qingsong Li, Xuezhong Wu, and Dingbang Xiao\. Intelligent diagnosis of rolling bearings under cross\-domain missing data: A lightweight complex domain imputation and unsupervised time–frequency alignment approach\. *Mechanical Systems and Signal Processing*, 241:113504, 2025\.
- Zamanzadeh Darban et al\. \(2024\) Zahra Zamanzadeh Darban, Geoffrey I Webb, Shirui Pan, Charu Aggarwal, and Mahsa Salehi\. Deep learning for time series anomaly detection: A survey\. *ACM Computing Surveys*, 57\(1\):1–42, 2024\.
- Liu and Paparrizos \(2024\) Qinghua Liu and John Paparrizos\. The elephant in the room: Towards a reliable time\-series anomaly detection benchmark\. *Advances in Neural Information Processing Systems*, 37:108231–108261, 2024\.
- Qiu et al\. \(2025\) Xiangfei Qiu, Zhe Li, Wanghui Qiu, Shiyan Hu, Lekui Zhou, Xingjian Wu, Zhengyu Li, Chenjuan Guo, Aoying Zhou, Zhenli Sheng, et al\. Tab: Unified benchmarking of time series anomaly detection methods\. *arXiv preprint arXiv:2506\.18046*, 2025\.
- Chen et al\. \(2024\) Feiyi Chen, Yingying Zhang, Zhen Qin, Lunting Fan, Renhe Jiang, Yuxuan Liang, Qingsong Wen, and Shuiguang Deng\. Learning multi\-pattern normalities in the frequency domain for efficient time series anomaly detection\. In *2024 IEEE 40th International Conference on Data Engineering \(ICDE\)*, pages 747–760\. IEEE, 2024\.
- Sun et al\. \(2024\) Yuting Sun, Guansong Pang, Guanhua Ye, Tong Chen, Xia Hu, and Hongzhi Yin\. Unraveling the ‘anomaly’ in time series anomaly detection: a self\-supervised tri\-domain solution\. In *2024 IEEE 40th International Conference on Data Engineering \(ICDE\)*, pages 981–994\. IEEE, 2024\.
- Fang et al\. \(2024\) Yuchen Fang, Jiandong Xie, Yan Zhao, Lu Chen, Yunjun Gao, and Kai Zheng\. Temporal\-frequency masked autoencoders for time series anomaly detection\. In *2024 IEEE 40th International Conference on Data Engineering \(ICDE\)*, pages 1228–1241\. IEEE, 2024\.
- Chen et al\. \(2023\) Yuhang Chen, Chaoyun Zhang, Minghua Ma, Yudong Liu, Ruomeng Ding, Bowen Li, Shilin He, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang\. Imdiffusion: Imputed diffusion models for multivariate time series anomaly detection\. *arXiv preprint arXiv:2307\.00754*, 2023\.
- Zhu et al\. \(2023\) Jiaqi Zhu, Shaofeng Cai, Fang Deng, Beng Chin Ooi, and Wenqiao Zhang\. Meter: A dynamic concept adaptation framework for online anomaly detection\. *arXiv preprint arXiv:2312\.16831*, 2023\.
- Wang et al\. \(2024\) Yuanyi Wang, Haifeng Sun, Chengsen Wang, Mengde Zhu, Jingyu Wang, Wei Tang, Qi Qi, Zirui Zhuang, and Jianxin Liao\. Interdependency matters: graph alignment for multivariate time series anomaly detection\. In *2024 IEEE International Conference on Data Mining \(ICDM\)*, pages 869–874\. IEEE, 2024\.
- Wu et al\. \(2025\) Xingjian Wu, Xiangfei Qiu, Zhengyu Li, Yihang Wang, Jilin Hu, Chenjuan Guo, Hui Xiong, and Bin Yang\. Catch: Channel\-aware multivariate time series anomaly detection via frequency patching\. In *International Conference on Learning Representations*, volume 2025, pages 17017–17045, 2025\.
- Shentu et al\. \(2025\) Qichao Shentu, Beibu Li, Kai Zhao, Yang Shu, Zhongwen Rao, Lujia Pan, Bin Yang, and Chenjuan Guo\. Towards a general time series anomaly detector with adaptive bottlenecks and dual adversarial decoders\. In *International Conference on Learning Representations*, volume 2025, pages 81358–81381, 2025\.
- Shen \(2025\) Ke\-Yuan Shen\. Learn hybrid prototypes for multivariate time series anomaly detection\. In *The Thirteenth International Conference on Learning Representations*, 2025\.
- Darban et al\. \(2025\) Zahra Zamanzadeh Darban, Geoffrey I Webb, Shirui Pan, Charu C Aggarwal, and Mahsa Salehi\. Carla: Self\-supervised contrastive representation learning for time series anomaly detection\. *Pattern Recognition*, 157:110874, 2025\.
- Zhou et al\. \(2024b\) Quan Zhou, Changhua Pei, Fei Sun, Jing Han, Zhengwei Gao, Dan Pei, Haiming Zhang, Gaogang Xie, and Jianhui Li\. Kan\-ad: Time series anomaly detection with kolmogorov\-arnold networks\. *arXiv preprint arXiv:2411\.00278*, 2024b\.
- Liu et al\. \(2025b\) Xinhong Liu, Xiaoliang Li, Yangfan Li, Fengxiao Tang, and Ming Zhao\. Rtdetector: Deep transformer networks for time series anomaly detection based on reconstruction trend\. In *Proceedings of the Thirty\-Fourth International Joint Conference on Artificial Intelligence, IJCAI\-25, J\. Kwok, Ed\. International Joint Conferences on Artificial Intelligence Organization*, volume 8, pages 5788–5796, 2025b\.
- Case Western Reserve University Bearing Data Center \(2024\) Case Western Reserve University Bearing Data Center\. Bearing data center: Apparatus & procedures\. [https://engineering\.case\.edu/bearingdatacenter/apparatus\-and\-procedures](https://engineering.case.edu/bearingdatacenter/apparatus-and-procedures), 2024\. Accessed: 2026\-05\-30\.
- PHM Society \(2009\) PHM Society\. 2009 phm challenge competition data set\. [https://phmsociety\.org/public\-data\-sets/](https://phmsociety.org/public-data-sets/), 2009\. Accessed: 2026\-05\-30\.
- Wang \(2021\) Biao Wang\. Xjtu\-sy bearing datasets\. [https://biaowang\.tech/xjtu\-sy\-bearing\-datasets/](https://biaowang.tech/xjtu-sy-bearing-datasets/), 2021\. Accessed: 2026\-05\-30\.
- NASA Open Data Portal \(2023\) NASA Open Data Portal\. Ims bearings\. [https://data\.nasa\.gov/dataset/ims\-bearings](https://data.nasa.gov/dataset/ims-bearings), 2023\. Experiments on bearings provided by the Center for Intelligent Maintenance Systems \(IMS\), University of Cincinnati\. Accessed: 2026\-05\-30\.
- Boniol et al\. \(2025\)Paul Boniol, Ashwin K\. Krishna, Marine Bruel, Qinghua Liu, Mingyi Huang, Themis Palpanas, Ruey S\. Tsay, Aaron Elmore, Michael J\. Franklin, and John Paparrizos\.Vus: effective and efficient accuracy measures for time\-series anomaly detection\.*The VLDB Journal*, 34:32, 2025\.doi:10\.1007/s00778\-025\-00907\-x\.
- Paparrizos et al\. \(2022\)John Paparrizos, Yuhao Kang, Paul Boniol, Ruey S\. Tsay, Themis Palpanas, and Michael J\. Franklin\.Tsb\-uad: An end\-to\-end benchmark suite for univariate time\-series anomaly detection\.*Proceedings of the VLDB Endowment*, 15\(8\):1697–1711, 2022\.doi:10\.14778/3529337\.3529354\.
