Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection
Summary
This paper proposes CoAD, a novel framework that unifies Outlier Exposure (classification) and Masked Autoencoder (reconstruction) paradigms for time series anomaly detection, addressing their respective limitations. Extensive experiments show that CoAD significantly outperforms state-of-the-art methods while being lightweight and fast.
# Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

Source: [https://arxiv.org/html/2605.26193](https://arxiv.org/html/2605.26193)

Qideng Tang (tqd18907@nudt.edu.cn), Zhejiang Key Laboratory of Space Information Sensing and Transmission, Hangzhou Dianzi University, Hangzhou, China; Chaofan Dai (cfdai@nudt.edu.cn), National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha, Hunan, China; Wubin Ma (wb˙ma@nudt.edu.cn), Yahui Wu (wuyahui@nudt.edu.cn), National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha, Hunan, China; Haohao Zhou (haohaozhou@nudt.edu.cn), National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha, Hunan, China; Tao Zhang (zhangtao@nudt.edu.cn), College of Systems Engineering, National University of Defense Technology, Changsha, Hunan, China; Huan Li (lihuan.cs@zju.edu.cn), Zhejiang University, Hangzhou, China; and Dalin Zhang (zhangdalin@hdu.edu.cn), Zhejiang Key Laboratory of Space Information Sensing and Transmission, Hangzhou Dianzi University, Hangzhou, China

###### Abstract.

Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications. Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure in detecting subtle and prolonged anomalies.
Outlier Exposure (OE) and Masked Autoencoder (MAE) emerge as two promising paradigms (classification and reconstruction) for addressing these problems. However, OE-based methods are constrained by poor generalization, while MAE-based methods are limited by masking misalignment issues. To address these limitations, this paper proposes a novel framework, CoAD, which unifies the two paradigms to leverage their complementary strengths while mitigating their respective weaknesses. In this framework, the classification module generates probability-informed soft masks for the reconstruction module, which in turn alleviates the generalization problem of the classification module. This cooperative design enables CoAD to effectively detect subtle and complex anomalies that are often overlooked by existing methods. Additionally, the classification module is carefully designed to resolve issues related to improper classification granularity and the neglect of frequency information. Extensive experiments on high-quality benchmark datasets, conducted under rigorous evaluation protocols, demonstrate that CoAD significantly outperforms both state-of-the-art deep learning and traditional data mining methods, highlighting the potential of deep learning in TSAD. Moreover, CoAD is lightweight and substantially faster than existing SOTA methods, demonstrating its practical value for large-scale, real-time applications.
Journal year: 2026. Copyright: cc. Conference: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’26), August 09–13, 2026, Jeju Island, Republic of Korea. DOI: 10.1145/3770855.3818108. ISBN: 979-8-4007-2259-2/2026/08. CCS concepts: Computing methodologies → Anomaly detection; Mathematics of computing → Time series analysis; Information systems → Data stream mining.

## 1. Introduction

Figure 1. Comparison between our masking strategy (CoAD) and existing masking strategies (random masking and grating masking), along with the corresponding reconstruction errors. The upper parts visualize the masking. Random and grating masking are binarized strategies with shaded areas serving as masks, whereas CoAD is a probabilistic strategy with darker colors indicating stronger masking.

Time series anomaly detection aims to identify patterns that deviate from expected behaviors within temporally sequential data and is crucial across numerous applications (Blázquez-García et al., [2022](https://arxiv.org/html/2605.26193#bib.bib227); Kim et al., [2022](https://arxiv.org/html/2605.26193#bib.bib506); Gupta et al., [2014](https://arxiv.org/html/2605.26193#bib.bib426); Schmidl et al., [2022](https://arxiv.org/html/2605.26193#bib.bib758)).
In recent years, deep learning has enabled numerous methods for time series anomaly detection (TSAD) (Li and Jung, [2023](https://arxiv.org/html/2605.26193#bib.bib554); Zamanzadeh Darban et al., [2025](https://arxiv.org/html/2605.26193#bib.bib970); Choi et al., [2021](https://arxiv.org/html/2605.26193#bib.bib293)). However, recent studies (Schmidl et al., [2022](https://arxiv.org/html/2605.26193#bib.bib758); Lai et al., [2021](https://arxiv.org/html/2605.26193#bib.bib524); Rewicki et al., [2023](https://arxiv.org/html/2605.26193#bib.bib741); Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050); Sarfraz et al., [2024](https://arxiv.org/html/2605.26193#bib.bib1079); Mejri et al., [2024](https://arxiv.org/html/2605.26193#bib.bib655); Garg et al., [2022](https://arxiv.org/html/2605.26193#bib.bib399)) indicate that deep learning-based methods may underperform classical data mining-based methods, especially in detecting subtle and prolonged anomalies (Lee et al., [2024](https://arxiv.org/html/2605.26193#bib.bib529); Sun et al., [2024](https://arxiv.org/html/2605.26193#bib.bib812)). In response, Outlier Exposure (OE) (Hendrycks et al., [2019](https://arxiv.org/html/2605.26193#bib.bib1064)) and Masked Autoencoders (MAE) (He et al., [2022](https://arxiv.org/html/2605.26193#bib.bib444)) have emerged as prominent paradigms for addressing these problems (Jeong et al., [2023](https://arxiv.org/html/2605.26193#bib.bib479); Wang et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib849); Xu et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib913); Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051); Shentu et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1052); Goswami et al., [2024](https://arxiv.org/html/2605.26193#bib.bib1053); Fang et al., [2024](https://arxiv.org/html/2605.26193#bib.bib1057)). Nevertheless, both paradigms have inherent limitations that can hinder their effectiveness in complex, real-world time series.

Limitations of OE-based (classification) methods:

L1. Heavy reliance on prior knowledge: OE-based approaches assume the existence of common anomalous patterns and use prior abnormal knowledge to generate pseudo-anomalous samples for classifier training. While effective when real anomalies match the predefined types, this strategy fails to generalize to unseen or unexpected anomalies.

L2. Improper classification granularity: Current OE-based methods operate at either the “step level” or the “window level”, each with drawbacks. Step-level classification (Jeong et al., [2023](https://arxiv.org/html/2605.26193#bib.bib479)) assigns anomaly scores to individual time steps by embedding the entire input window in a single forward pass; however, it struggles with the longer windows needed for sufficient context (Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051)). In contrast, window-level classification (Wang et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib849); Xu et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib913)) predicts anomalies for the whole window and slides it to generate stepwise scores. Although this captures longer contexts, subtle anomalies can be obscured by predominantly normal patterns, leading to missed detections.

L3. Neglect of frequency-domain information: Most OE-based methods operate solely in the time domain, ignoring the frequency domain where some anomalies may be more pronounced. Consequently, frequency-sensitive anomalies that are subtle or ambiguous in the time domain may go undetected.

Limitations of MAE-based (reconstruction) methods:

L4. Masking misalignment with anomaly locations: MAE-based methods model normal patterns by reconstructing masked patches from unmasked ones and assign anomaly scores based on reconstruction errors.
Ideally, the masking should target potentially anomalous regions while leaving normal patches unmasked, allowing the model to rely on surrounding normal patterns for reconstruction. However, as shown in Figure [1](https://arxiv.org/html/2605.26193#S1.F1), existing methods typically use random masking (Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051); Shentu et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1052)) or grating masking (Chen et al., [2023](https://arxiv.org/html/2605.26193#bib.bib277)), without considering patch semantics, and indiscriminately mask normal and anomalous regions. Consequently, the model may reproduce anomalous patterns, leading to false alarms in normal regions (Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051)) (large reconstruction errors in Figure [1](https://arxiv.org/html/2605.26193#S1.F1)(a)) or miss anomalies by accurately reconstructing anomalous regions (Yao et al., [2023](https://arxiv.org/html/2605.26193#bib.bib941)) (Figure [1](https://arxiv.org/html/2605.26193#S1.F1)(b)).

To overcome these limitations, we propose CoAD, a cooperative anomaly detection framework that unifies the strengths of classification- and reconstruction-based methods. At the core of CoAD is a guided soft masking mechanism, which leverages OE-based classification to inform the masking for MAE-based reconstruction. Unlike conventional random or uniform masking, CoAD applies probability-informed soft masking, where all patches are masked, but those more likely to be anomalous are masked more heavily (see Figure [1](https://arxiv.org/html/2605.26193#S1.F1)). This suppresses anomaly-related cues during reconstruction, yielding more accurate anomaly scores and effectively addressing L4. Since MAE-based reconstruction can restore normal patterns within anomalous regions, it in turn provides a “quasi-normal reference” for OE-based classification.
This insight motivates the design of a residual-based classification module that leverages the discrepancy between original patches and their reconstructed counterparts in the feature space as generalizable representations for anomaly discrimination, thereby facilitating the detection of novel anomalies. Furthermore, the MAE-based reconstruction is constrained by OE to learn only the normal data distribution and therefore fails to reproduce unseen anomalies. During inference, the classification and reconstruction modules operate collaboratively to detect anomalies. The overall cooperative mechanism enables CoAD to identify previously unseen anomalies, thereby addressing L1.

To support this cooperative strategy, CoAD introduces a patch-level, dual-branch time-frequency classification module. The input sequence is divided into non-overlapping patches, and each patch is classified based on features extracted from both the time and frequency domains. This design offers two advantages: it captures both long- and short-term contextual information via intra- and inter-patch correlations (addressing L2), and it incorporates frequency-domain patterns often overlooked by existing methods (addressing L3).

In light of recent concerns (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050); Wu and Keogh, [2021a](https://arxiv.org/html/2605.26193#bib.bib891); Keogh, [2021](https://arxiv.org/html/2605.26193#bib.bib1073); Wu and Keogh, [2024b](https://arxiv.org/html/2605.26193#bib.bib1072)) regarding the reliability of experiments in many existing studies, we conduct a rigorous evaluation using the highest-quality datasets and the most robust metrics available. Experimental results show that CoAD consistently and significantly outperforms 24 SOTA methods across 314 datasets.
Moreover, qualitative analyses confirm that CoAD can effectively detect subtle and challenging anomalies that are often overlooked by existing methods, including cases that are difficult even for human experts. Furthermore, CoAD runs orders of magnitude faster than existing methods, highlighting its remarkable efficiency and scalability.

## 2. Related Work

### 2.1. Data Mining vs Deep Learning

Recent studies (Rewicki et al., [2023](https://arxiv.org/html/2605.26193#bib.bib741); Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050); Mejri et al., [2024](https://arxiv.org/html/2605.26193#bib.bib655)) challenge the effectiveness of popular deep learning-based methods for TSAD, suggesting that classical data mining methods, such as discord-based methods (e.g., Matrix Profile (Nakamura et al., [2020](https://arxiv.org/html/2605.26193#bib.bib672)) and DAMP (Lu et al., [2023](https://arxiv.org/html/2605.26193#bib.bib631))) and clustering-based methods (e.g., SAND (Boniol et al., [2021](https://arxiv.org/html/2605.26193#bib.bib234)) and KShapeAD (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050))), outperform deep learning methods. These studies (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050); Sun et al., [2024](https://arxiv.org/html/2605.26193#bib.bib812); Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051)) particularly highlight that deep learning methods fail to detect subtle and prolonged anomalies due to their inability to learn a discerning boundary between normal and subtle abnormal samples (Wang et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib849); Xu et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib913)).

### 2.2. OE-based Methods

OE-based methods assume that common anomalous patterns exist in time series and aim to train a binary classifier based on generated pseudo-anomalies.
By explicitly injecting prior abnormal knowledge into the model, these methods enable the learning of a clearer decision boundary between normal and abnormal samples. AnomalyBERT (Jeong et al., [2023](https://arxiv.org/html/2605.26193#bib.bib479)) is a pioneering work that assumes four representative types of anomalies in time series and employs a BERT-like structure to extract features and perform anomaly classification. Although it achieves promising performance, it relies heavily on pre-assumed anomalous patterns and often generalizes poorly to unseen anomaly types. Subsequent research attempts to address this issue by either replacing subsequences with random segments clipped from other positions as pre-assumed anomalies (Wang et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib849)), or by employing a one-class classifier to additionally measure the distance to normal patterns (Xu et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib913)). Although these techniques appear promising, their advances do not lead to substantial improvements (see Table [4.3](https://arxiv.org/html/2605.26193#S4.SS3)). Generalization remains the primary challenge for OE-based methods.

### 2.3. MAE-based Methods

MAE-based methods capture the normal temporal dependencies in time series by learning to reconstruct masked parts from unmasked parts. These methods (Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051); Shentu et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1052); Goswami et al., [2024](https://arxiv.org/html/2605.26193#bib.bib1053); Fang et al., [2024](https://arxiv.org/html/2605.26193#bib.bib1057)) demonstrate strong capability in modeling long sequences and achieve significant performance improvements compared to traditional reconstruction-based methods. The masking strategy is the core technical component of MAE-based methods.
The most intuitive and widely adopted approach is random masking (Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051); Shentu et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1052); Chen et al., [2023](https://arxiv.org/html/2605.26193#bib.bib277)). However, purely random masking can excessively obscure local normal information, making it difficult for the model to reconstruct normal regions accurately. An alternative is grating masking (Chen et al., [2023](https://arxiv.org/html/2605.26193#bib.bib277)), which imposes a regular masking pattern. Nevertheless, this approach may leak substantial anomalous information, so that the model reconstructs abnormal patches well and anomaly detection fails. Ideally, suspicious regions should be masked as much as possible. TFMAE (Fang et al., [2024](https://arxiv.org/html/2605.26193#bib.bib1057)) takes a preliminary step in this direction by masking subsequences with high variance within the input window; however, anomalies do not always correspond to high-variance areas.

## 3. Methodology

### 3.1. Task Description and Notations

Let $\mathbf{S}=\{x_{1},x_{2},\ldots,x_{\mathtt{L}}\}$ denote a time series of length $\mathtt{L}$, where $x_{i}$ represents the observation at time step $i$. The goal of time series anomaly detection is to assign an anomaly score $A(x_{i})\in\mathbb{R}$ to each observation $x_{i}\in\mathbf{S}$, where a higher value of $A(x_{i})$ indicates a greater likelihood that $x_{i}$ is anomalous.
Following conventions (Li and Jung, [2023](https://arxiv.org/html/2605.26193#bib.bib554); Zamanzadeh Darban et al., [2025](https://arxiv.org/html/2605.26193#bib.bib970); Mejri et al., [2024](https://arxiv.org/html/2605.26193#bib.bib655)), we apply a sliding window to segment the raw time series $\mathbf{S}$ into a collection of windows, $\{\mathbf{X}_{1},\mathbf{X}_{2},\ldots,\mathbf{X}_{\mathtt{K}}\}$, where each window $\mathbf{X}_{n}=\{x_{n,1},x_{n,2},\ldots,x_{n,\mathtt{T}}\}$ consists of $\mathtt{T}$ consecutive time steps and serves as a model input. For simplicity, we omit the window index $n$ and denote a generic input window as $\mathbf{X}=\{x_{1},x_{2},\ldots,x_{\mathtt{T}}\}$.

Figure 2. The CoAD framework. (a) Overall framework. (b) Time-Frequency ensemble classification. (c) Residual classification.

### 3.2. Cooperative Anomaly Detection Framework

We propose CoAD, a cooperative framework that seamlessly integrates classification and reconstruction to leverage their complementary strengths and overcome their individual limitations. As illustrated in Figure [2](https://arxiv.org/html/2605.26193#S3.F2)(a), CoAD is built on the insight that the classification module guides the reconstruction process to enhance detection accuracy, while the reconstruction module provides normal references that facilitate the identification of unseen anomalies. Specifically, the framework employs an anomaly classifier trained on predefined anomaly patterns to produce patch-wise anomaly probabilities. These probabilities are then used to guide the masking process of the MAE module, enabling the maximal suppression of anomalous information (Section [3.4](https://arxiv.org/html/2605.26193#S3.SS4)).
The reconstructed patches provided by the MAE module serve as references to extract generalizable residual features for a second stage of anomaly classification (Section [3.5](https://arxiv.org/html/2605.26193#S3.SS5)). All components are jointly trained with a weighted loss combining Binary Cross Entropy (for classification) and Mean Squared Error (for reconstruction), ensuring mutual optimization and synergy (Section [3.6](https://arxiv.org/html/2605.26193#S3.SS6)).

To further enhance performance, the classification module incorporates two key components. One is patch-level classification, which is made feasible by the cooperative framework. Since the reconstruction module generates fine-grained anomaly scores at the timestamp level, the classification module can operate on non-overlapping patches. This substantially reduces the input sequence length, reducing computational burden and enabling the model to more effectively capture intra- and inter-patch contextual dependencies to find anomalies (see Section [3.3.1](https://arxiv.org/html/2605.26193#S3.SS3.SSS1)). The other is a time-frequency dual-branch ensemble that aggregates complementary temporal and frequency features, boosting the model’s capacity to detect complex and subtle anomalies (see Section [3.3.2](https://arxiv.org/html/2605.26193#S3.SS3.SSS2)).

### 3.3. Mask Generation via Patch-level Time-frequency Classification

#### 3.3.1. Distortion and Patching

The classification module is trained on synthetic anomalies to incorporate generalized prior knowledge regarding anomalous patterns. Specifically, the input window is stochastically distorted to generate simulated anomalies and is then partitioned into non-overlapping patches.

Distortion: Following established methodologies (Jeong et al., [2023](https://arxiv.org/html/2605.26193#bib.bib479); Shentu et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1052)), we simulate four types of time series anomalies: Uniform Replacement, Mirror Flip, Length Scale, and Jittering. These distortion methods are inspired by the most prevalent types of anomalies (Blázquez-García et al., [2022](https://arxiv.org/html/2605.26193#bib.bib227); Mejri et al., [2024](https://arxiv.org/html/2605.26193#bib.bib655); Zamanzadeh Darban et al., [2025](https://arxiv.org/html/2605.26193#bib.bib970)). Formally, we randomly select a segment of the input window $\mathbf{X}$, with a length ranging from 0 to one dominant period, and inject one of the four types of anomalies, yielding a distorted series denoted as $\mathbf{\tilde{X}}$. Details of these distortion strategies are available in Appendix [A](https://arxiv.org/html/2605.26193#A1).

Figure 3. Different levels of classification granularity.

Patching: As illustrated in Figure [3](https://arxiv.org/html/2605.26193#S3.F3)(a,b), existing classification-based methods typically adopt either “step-level” or “window-level” classification granularity, both of which have inherent limitations. Step-level methods encode the entire input window into a single latent representation, then use a decoder to produce anomaly scores for each time step (Jeong et al., [2023](https://arxiv.org/html/2605.26193#bib.bib479)). However, producing accurate fine-grained scores from a single embedding is especially difficult for long sequences, which are necessary to capture sufficient contextual information (Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051); Nie et al., [2023](https://arxiv.org/html/2605.26193#bib.bib1054)).
Window-level methods also use a single embedding for the entire input window but output only a binary label indicating whether any anomaly exists in the window (Wang et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib849); Xu et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib913)). This coarse prediction can easily overlook short-duration or subtle anomalies that are masked by dominant normal patterns.

To address the above issues, we adopt a “patch-level” classification strategy. Specifically, an input window is segmented into non-overlapping patches $\mathbf{\tilde{X}}^{p}=[\tilde{x}^{p}_{1},\tilde{x}^{p}_{2},\ldots,\tilde{x}^{p}_{\mathtt{N}}]\in\mathbb{R}^{\mathtt{P}\times\mathtt{N}}$, where $\mathtt{N}$ represents the number of patches and $\tilde{x}^{p}_{i}$ denotes a patch containing $\mathtt{P}$ time points. As shown in Figure [3](https://arxiv.org/html/2605.26193#S3.F3)(c), each patch is first embedded via a linear layer to model short-term dependencies; the embeddings then interact through a Bi-GRU module to extract long-term correlations, followed by a shared linear layer that classifies whether a patch contains anomalies (detailed in Section [3.3.2](https://arxiv.org/html/2605.26193#S3.SS3.SSS2)). This patch-level design strikes a balance between the granularity of step-level and window-level methods. Crucially, it is made feasible by our cooperative framework: since the reconstruction module provides fine-grained anomaly scores at the step level, the classification module does not need to produce scores for each individual time step. This decoupling enables non-overlapping patches, which substantially reduces the sequence length and enables the model to more effectively learn both intra- and inter-patch context with reduced computational overhead.
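
The patching step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `to_patches` and `patch_labels`, the toy jitter-style distortion, and the rule that a patch is labelled anomalous when any of its time steps was distorted are all assumptions made for the example.

```python
import numpy as np

def to_patches(window: np.ndarray, patch_len: int) -> np.ndarray:
    """Split a 1-D window of length T into non-overlapping patches, shape (P, N)."""
    T = window.shape[0]
    if T % patch_len != 0:
        raise ValueError("window length must be a multiple of patch_len")
    n_patches = T // patch_len
    # Row-major reshape gives (N, P); transpose to the P x N layout used in the text.
    return window.reshape(n_patches, patch_len).T

def patch_labels(step_labels: np.ndarray, patch_len: int) -> np.ndarray:
    """Assumed labelling rule: a patch is anomalous if any of its steps is distorted."""
    return to_patches(step_labels, patch_len).max(axis=0)

rng = np.random.default_rng(0)
T, P = 16, 4
x = np.sin(np.linspace(0.0, 4.0 * np.pi, T))

# Hypothetical jitter-style distortion on time steps 5..7 (illustrative only).
x_tilde = x.copy()
x_tilde[5:8] += rng.normal(scale=0.5, size=3)
step_labels = np.zeros(T, dtype=int)
step_labels[5:8] = 1

patches = to_patches(x_tilde, P)        # shape (P, N) = (4, 4)
labels = patch_labels(step_labels, P)   # one label per patch
```

With `T = 16` and `P = 4`, the classifier sees a sequence of 4 patch embeddings instead of 16 time steps, which is the sequence-length reduction the text refers to.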

Figure 4. Comparison of frequency domain features between normal and anomalous regions. The upper panel displays the STFT spectrogram, with darker colors indicating higher amplitudes. Anomalies that are difficult to detect in the time domain exhibit more distinctive and discriminative patterns in the frequency domain.

#### 3.3.2. Time-frequency Ensemble Classification

Time-frequency analysis is widely used in time series research. As illustrated in Figure [4](https://arxiv.org/html/2605.26193#S3.F4), certain anomalies that are difficult to distinguish in the time domain exhibit more salient patterns in the frequency domain. While existing classification-based anomaly detection methods generally overlook frequency features, some reconstruction-based approaches have incorporated time-frequency representations to improve detection performance (Fang et al., [2024](https://arxiv.org/html/2605.26193#bib.bib1057); Wu et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1058); Wang et al., [2024b](https://arxiv.org/html/2605.26193#bib.bib870); Zhang et al., [2022](https://arxiv.org/html/2605.26193#bib.bib1004)). However, these reconstruction-based approaches face two limitations.

i) As observed in Figure [4](https://arxiv.org/html/2605.26193#S3.F4) and corroborated by prior studies (Wu et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1058); Zhang et al., [2022](https://arxiv.org/html/2605.26193#bib.bib1004)), frequency amplitudes vary considerably across different bands, typically higher in low-frequency and lower in high-frequency regions. This uneven distribution complicates accurate reconstruction across all frequency bands, often resulting in large relative errors in high-frequency components. Consequently, normal high-frequency patterns may be misidentified as anomalies due to large reconstruction errors.
To mitigate this issue, we propose using frequency-domain features for classification rather than as reconstruction targets.

ii) Most existing works (Wang et al., [2024b](https://arxiv.org/html/2605.26193#bib.bib870); Zhang et al., [2022](https://arxiv.org/html/2605.26193#bib.bib1004); Wu et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1058)) extract frequency features via the Fast Fourier Transform (FFT), which provides fine-grained frequency resolution but only coarse (window-level) resolution in the time domain. To address this, we adopt the Short-Time Fourier Transform (STFT), which enables finer temporal localization of frequency patterns.

By integrating both time- and frequency-domain representations, our dual-branch classification module leverages complementary information to enhance anomaly detection capability. The structure of the time-frequency ensemble classification is shown in Figure [2](https://arxiv.org/html/2605.26193#S3.F2)(b), and the details are described as follows:

Frequency branch: In the frequency classification branch, we apply the STFT to the entire input window $\mathbf{\tilde{X}}$ to obtain the frequency-domain representation $\mathbf{\tilde{X}}_{f}\in\mathbb{R}^{2\mathtt{K}\times\mathtt{T}}$, where $\mathtt{K}$ denotes the number of frequency bins. The real and imaginary parts of the complex-valued STFT output are concatenated directly. We then segment $\mathbf{\tilde{X}}_{f}$ into non-overlapping patches $\mathbf{\tilde{X}}_{f}^{p}=[\tilde{x}^{p}_{f,1},\tilde{x}^{p}_{f,2},\ldots,\tilde{x}^{p}_{f,\mathtt{N}}]\in\mathbb{R}^{2\mathtt{K}\mathtt{P}\times\mathtt{N}}$, where $\mathtt{P}$ is the patch length and $\mathtt{N}$ is the number of patches.
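
The shapes in the frequency branch can be checked with a small NumPy sketch. It is written under several assumptions that the text does not state: a hop size of 1 with reflect padding (so that the STFT yields exactly one frame per time step, matching the $2\mathtt{K}\times\mathtt{T}$ shape), a Hann analysis window, and an FFT size `n_fft = 8`. The helper names `stft_features` and `freq_patches` are hypothetical.

```python
import numpy as np

def stft_features(window: np.ndarray, n_fft: int = 8) -> np.ndarray:
    """Hop-1 STFT of a 1-D window; returns stacked real/imag parts, shape (2K, T)."""
    half = n_fft // 2
    # Reflect-pad so that exactly one frame is centred near each time step.
    padded = np.pad(window, (half, n_fft - half - 1), mode="reflect")
    frames = np.lib.stride_tricks.sliding_window_view(padded, n_fft)  # (T, n_fft)
    spec = np.fft.rfft(frames * np.hanning(n_fft), axis=1).T          # (K, T), complex
    # Concatenate real and imaginary parts along the frequency axis.
    return np.concatenate([spec.real, spec.imag], axis=0)             # (2K, T)

def freq_patches(feat: np.ndarray, patch_len: int) -> np.ndarray:
    """Group P consecutive frames into one patch: (2K, T) -> (2K*P, N)."""
    two_k, T = feat.shape
    n = T // patch_len
    blocks = feat[:, : n * patch_len].reshape(two_k, n, patch_len)    # (2K, N, P)
    return blocks.transpose(0, 2, 1).reshape(two_k * patch_len, n)    # (2K*P, N)

T, P = 16, 4
x_tilde = np.sin(np.linspace(0.0, 4.0 * np.pi, T))
feat = stft_features(x_tilde, n_fft=8)   # K = 8 // 2 + 1 = 5 bins, so shape (10, 16)
x_f_p = freq_patches(feat, P)            # shape (2 * 5 * 4, 4) = (40, 4)
```

Each column of `x_f_p` holds the real and imaginary coefficients of the `P` frames that fall inside one patch, so the frequency patches line up one-to-one with the time-domain patches.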
To model inter-patch dependencies, each frequency patch is first linearly projected into a hidden space and then processed by a GRU encoder:

$$\mathbf{\tilde{H}}_{f}^{p}=\operatorname{GRU}\left(\mathbf{W}_{f}^{p}\mathbf{\tilde{X}}_{f}^{p}\right), \tag{1}$$

where $\mathbf{W}_{f}^{p}\in\mathbb{R}^{\mathtt{H}\times 2\mathtt{K}\mathtt{P}}$ is a learnable projection matrix and $\mathtt{H}$ is the hidden dimension. The output $\mathbf{\tilde{H}}_{f}^{p}$ is then passed through a linear classifier layer followed by a sigmoid activation to generate the anomaly probability for each patch:

$$\mathbf{A}_{f}^{p}=\sigma\left(\mathbf{W}_{\mathbf{f}}\mathbf{\tilde{H}}_{f}^{p}\right), \tag{2}$$

where $\mathbf{W}_{\mathbf{f}}\in\mathbb{R}^{1\times\mathtt{H}}$ is a learnable weight vector, $\sigma(\cdot)$ represents the sigmoid activation function, and $\mathbf{A}_{f}^{p}=[a^{p}_{f,1},a^{p}_{f,2},\ldots,a^{p}_{f,\mathtt{N}}]$ denotes the predicted anomaly probabilities for the frequency domain patches $\mathbf{\tilde{X}}_{f}^{p}$.

Time branch: The patch set $\tilde{\mathbf{X}}^{p}$ is directly projected via a linear layer and then fed into a GRU encoder.
The obtained features $\mathbf{\tilde{H}}_{t}^{p}$ are subsequently processed through a linear layer followed by a sigmoid activation to produce patch-wise anomaly probabilities:

$$\mathbf{\tilde{H}}_{t}^{p}=\operatorname{GRU}\left(\mathbf{W}_{t}^{p}\mathbf{\tilde{X}}^{p}\right),\quad\mathbf{A}_{t}^{p}=\sigma\left(\mathbf{W}_{\mathbf{t}}\mathbf{\tilde{H}}_{t}^{p}\right), \tag{3}$$

where $\mathbf{A}_{t}^{p}=[a_{t,1}^{p},a_{t,2}^{p},\ldots,a_{t,\mathtt{N}}^{p}]$ denotes the predicted anomaly probabilities for the time domain patches $\tilde{\mathbf{X}}^{p}$, and $\mathbf{W}_{t}^{p}\in\mathbb{R}^{\mathtt{H}\times\mathtt{P}}$ and $\mathbf{W}_{\mathbf{t}}\in\mathbb{R}^{1\times\mathtt{H}}$ represent the weight matrices of the learnable linear layers.

Ensemble strategy: We adopt a maximum fusion strategy to combine anomaly probabilities from the two branches:

$$\mathbf{A}^{p}=\max(\mathbf{A}_{f}^{p},\mathbf{A}_{t}^{p})=[\max(a_{f,1}^{p},a_{t,1}^{p}),\ldots,\max(a_{f,\mathtt{N}}^{p},a_{t,\mathtt{N}}^{p})], \tag{4}$$

where $\mathbf{A}^{p}\in\mathbb{R}^{\mathtt{N}}$ represents the ensemble anomaly probabilities for the patches $\mathbf{\tilde{X}}^{p}$. This strategy ensures that an anomaly detected by either branch is retained. It helps reduce missed detections, since a patch is treated as normal only when both branches agree on normality. Moreover, it avoids forcing either branch to fit anomalies it is less sensitive to, thus improving training stability. We also explore different ensemble strategies in Section [4.4](https://arxiv.org/html/2605.26193#S4.SS4).
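
As a small worked example of the maximum fusion in Eq. (4), the sketch below combines two vectors of per-patch probabilities element-wise. The probability values are made up for illustration, and `max_fusion` is a hypothetical helper name.

```python
import numpy as np

def max_fusion(a_f: np.ndarray, a_t: np.ndarray) -> np.ndarray:
    """Element-wise maximum of frequency- and time-branch probabilities (Eq. 4)."""
    return np.maximum(a_f, a_t)

# Illustrative per-patch probabilities for N = 4 patches.
a_f = np.array([0.10, 0.85, 0.20, 0.05])  # frequency branch
a_t = np.array([0.15, 0.30, 0.70, 0.02])  # time branch

a_p = max_fusion(a_f, a_t)
print(a_p)  # [0.15 0.85 0.7  0.05]
```

Patch 2 is flagged only by the frequency branch and patch 3 only by the time branch, yet both keep a high fused probability; patches 1 and 4 stay low because neither branch reacts to them.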
### 3.4. Probability-informed Soft Masking MAE

The core idea of CoAD is to guide MAE reconstruction using prior classification probabilities. We propose a soft masking strategy, where each patch embedding is blended with a learnable mask embedding, weighted by the anomaly probability generated from the classification module. This enables a more nuanced suppression of potentially anomalous information, especially in borderline cases where a hard mask based on binary classification may be too rigid or error-prone. The resulting soft-masked embeddings are:

$$\mathbf{E}_{\mathbf{m}}=\mathbf{A}^{p}\cdot\mathbf{E}_{\texttt{mask}}+\left(J_{1,\mathtt{N}}-\mathbf{A}^{p}\right)\cdot\mathbf{W}_{\mathbf{m}}\tilde{\mathbf{X}}^{p},\tag{5}$$

where $\mathbf{E}_{\texttt{mask}}\in\mathbb{R}^{\mathtt{H}\times\mathtt{N}}$ represents the learnable mask embedding, $J_{1,\mathtt{N}}$ denotes an all-ones vector of length $\mathtt{N}$, and $\mathbf{W}_{\mathbf{m}}\in\mathbb{R}^{\mathtt{H}\times\mathtt{P}}$ is a learnable projection matrix that maps raw patches into patch embeddings. $\mathbf{E}_{\mathbf{m}}$ is subsequently input into a GRU encoder followed by a linear layer to reconstruct the original time series:

$$\mathbf{X}_{\mathbf{r}}=\texttt{Flat}\left(\mathbf{W}_{\mathbf{o}}\operatorname{GRU}\left(\mathbf{E}_{\mathbf{m}}\right)\right),\quad\text{where}\;\mathbf{W}_{\mathbf{o}}\in\mathbb{R}^{\mathtt{P}\times\mathtt{H}}.\tag{6}$$

The MAE is trained to minimize the Mean Squared Error between the reconstructed sequence $\mathbf{X}_{\mathbf{r}}$ and the original normal time series $\mathbf{X}$, thereby learning to reconstruct normal patterns within masked anomalous regions and to capture only the normal data distribution.
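The blending in Eq. (5) can be made concrete with a small Python sketch. It is a minimal illustration under stated assumptions: embeddings are plain lists, the mask embedding is supplied per patch, the helper names `soft_mask` and `hard_mask` are hypothetical, and the 0.5 threshold of the hard variant is an arbitrary choice used only for contrast.

```python
def soft_mask(patch_embs, mask_embs, probs):
    """Blend each patch embedding with its mask embedding (Eq. 5).

    patch_embs, mask_embs: N lists of length H; probs: N values in [0, 1].
    A probability of 0 keeps the patch embedding, 1 replaces it by the mask.
    """
    blended = []
    for emb, mask, a in zip(patch_embs, mask_embs, probs):
        blended.append([a * m + (1.0 - a) * e for e, m in zip(emb, mask)])
    return blended


def hard_mask(patch_embs, mask_embs, probs, threshold=0.5):
    """Binary alternative for comparison: swap in the mask above a threshold.

    The threshold value is an illustrative assumption.
    """
    return [list(mask) if a > threshold else list(emb)
            for emb, mask, a in zip(patch_embs, mask_embs, probs)]


# Toy example with N = 3 patches and H = 2; all numbers are made up.
patch_embs = [[2.0, 4.0], [2.0, 4.0], [2.0, 4.0]]
mask_embs = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
probs = [0.0, 0.5, 1.0]

print("soft:", soft_mask(patch_embs, mask_embs, probs))
print("hard:", hard_mask(patch_embs, mask_embs, probs))
```

For the borderline patch with probability 0.5, the soft variant yields the midpoint between the patch and mask embeddings, whereas the hard variant leaves the embedding untouched; this is the kind of all-or-nothing behavior the soft strategy is meant to avoid.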
The probability-informed soft masking strategy offers fine-grained control over the masking strength, enabling adaptive suppression of anomalous information based on classification confidence. Unlike rigid hard masking, it handles uncertainty more gracefully, enhancing robustness in ambiguous cases. Moreover, soft masking ensures smoother gradient flow during training and fosters better synergy between the classification and reconstruction branches by aligning their objectives. A comparative analysis of soft- and hard-masking strategies is presented in Section [4.4](https://arxiv.org/html/2605.26193#S4.SS4).

### 3.5. Reconstruction-informed Residual Classification

Recent advances (Yao et al., [2024](https://arxiv.org/html/2605.26193#bib.bib1090), [2025](https://arxiv.org/html/2605.26193#bib.bib1091)) in image anomaly detection indicate that the discrepancy between the features of anomalous regions and their nearest normal references, referred to as *residual features*, serves as an intrinsic indicator for anomaly discrimination, exhibiting strong generalization capability in cross-domain settings. The underlying rationale is that, irrespective of the anomaly type, the features of anomalous samples differ substantially from those of the corresponding normal references. Consequently, residual features can be regarded as class-invariant and generalizable representations for anomaly detection. However, extracting residual features necessitates an extensive reference pool that stores all possible normal features for comparison. Unlike static images, the inherent temporal dynamics of time series data make maintaining a reference pool with sufficient coverage memory-intensive. Furthermore, performing nearest-neighbor retrieval for each input imposes substantial computational overhead in large-scale time series settings.
To mitigate these constraints, we propose a simple yet effective approach that directly leverages the MAE reconstruction output as the reference for classification. Since the MAE is optimized to reconstruct normal patterns even within anomalous regions, it provides a “quasi-normal reference” for the corresponding input (as visualized in Figure [1](https://arxiv.org/html/2605.26193#S1.F1)). The structure of this reconstruction-informed residual classification module is illustrated in Figure [2](https://arxiv.org/html/2605.26193#S3.F2)(c), with technical details presented below.

The reconstructed sequence $\mathbf{X}_{\mathbf{r}}$ is first partitioned into patches $\mathbf{X}_{\mathbf{r}}^{p}$, which are subsequently processed by frequency-branch and time-branch encoders to extract features $\mathbf{H}_{f}^{p}$ and $\mathbf{H}_{t}^{p}$, respectively. Residual features $\mathbf{H}_{f}^{p,r}$ and $\mathbf{H}_{t}^{p,r}$ are then computed as the differences between the reconstructed patch features and the original input patch features:

$$\mathbf{H}_{f}^{p,r}=\mathbf{\tilde{H}}_{f}^{p}-\mathbf{H}_{f}^{p},\quad\mathbf{H}_{t}^{p,r}=\mathbf{\tilde{H}}_{t}^{p}-\mathbf{H}_{t}^{p}.\tag{7}$$

The obtained dual-domain residual features are projected through a linear layer followed by a sigmoid activation function to calculate patch-wise anomaly probabilities.
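As a rough illustration of Eq. (7) and the sigmoid projection just described, the following Python sketch computes residual features for one branch and maps them to probabilities. The feature values, the weight vector, and in particular the `bias` term are assumptions made for the toy example (the paper's formulation shows only a weight vector); the encoders that would produce the features are not modeled.

```python
import math


def residual_features(input_feats, recon_feats):
    """Eq. (7): per-patch difference between the features of the input
    patches and the features of the reconstructed patches."""
    return [[x - r for x, r in zip(xi, ri)]
            for xi, ri in zip(input_feats, recon_feats)]


def residual_probability(residual, weights, bias=0.0):
    """Linear projection followed by a sigmoid for one residual vector.

    `bias` is an illustrative assumption; set it to 0.0 for a pure
    weight-vector projection.
    """
    logit = sum(w * h for w, h in zip(weights, residual)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def l2_norm(vec):
    """Euclidean length, used here only to show the size of a residual."""
    return math.sqrt(sum(c * c for c in vec))


# Toy features for N = 2 patches with H = 2 (made-up numbers).
input_feats = [[0.5, 1.0], [3.5, -3.0]]  # features of the input patches
recon_feats = [[0.5, 1.0], [0.5, 1.0]]   # features of the reconstructed patches

residuals = residual_features(input_feats, recon_feats)
norms = [l2_norm(r) for r in residuals]
probs = [residual_probability(r, [1.0, -1.0], bias=-3.0) for r in residuals]

for i, (n, p) in enumerate(zip(norms, probs)):
    print(f"patch {i}: residual norm={n:.1f} probability={p:.3f}")
```

Patch 0 is reconstructed faithfully, so its residual vanishes and its probability stays low; patch 1 differs strongly from its quasi-normal reference and receives a high probability regardless of what kind of anomaly caused the difference.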
These domain-specific probabilities are subsequently aggregated using an element-wise maximum fusion strategy:

$$\begin{split}\mathbf{A}_{f}^{p,r}&=\sigma\left(\mathbf{W}_{\mathbf{r}}\mathbf{H}_{f}^{p,r}\right),\quad\mathbf{A}_{t}^{p,r}=\sigma\left(\mathbf{W}_{\mathbf{r}}\mathbf{H}_{t}^{p,r}\right),\\ \mathbf{A}^{p,r}&=\max(\mathbf{A}_{f}^{p,r},\mathbf{A}_{t}^{p,r}),\end{split}\tag{8}$$

where $\mathbf{W}_{\mathbf{r}}\in\mathbb{R}^{1\times\mathtt{H}}$ denotes a learnable weight vector. The final patch-level anomaly probabilities $\mathbf{A}_{c}^{p}$ are obtained by averaging the predictions of the time–frequency ensemble classification and residual classification modules:

$$\mathbf{A}_{c}^{p}=\frac{\mathbf{A}^{p}\oplus\mathbf{A}^{p,r}}{2},\quad\text{where}\;\mathbf{A}_{c}^{p}=[a_{c,1}^{p},a_{c,2}^{p},\ldots,a_{c,\mathtt{N}}^{p}].\tag{9}$$

In Eq. ([9](https://arxiv.org/html/2605.26193#S3.E9)), $\mathbf{A}^{p}$ confirms known anomaly patterns, while $\mathbf{A}^{p,r}$ enhances generalization to unseen anomalies. The efficacy of the residual classification module in detecting novel anomalies is experimentally validated in Appendix [I](https://arxiv.org/html/2605.26193#A9).

### 3.6. Cooperative Training and Joint Anomaly Inference

#### 3.6.1. Cooperative Training

The classification and reconstruction modules are jointly trained in an end-to-end manner.
The overall training objective integrates the binary cross-entropy (BCE) loss for classification and the mean squared error (MSE) loss for reconstruction:

$$\begin{split}\mathcal{L}_{BCE}&=-\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}\cdot\log(a_{c,i}^{p})+(1-y_{i})\cdot\log(1-a_{c,i}^{p})\right),\\ \mathcal{L}_{MSE}&=\frac{1}{N}\sum_{i=1}^{N}\left\|x_{r,i}^{p}-x_{i}^{p}\right\|_{2}^{2},\quad\mathcal{L}_{Final}=\mathcal{L}_{BCE}+\lambda\cdot\mathcal{L}_{MSE},\end{split}\tag{10}$$

where $a_{c,i}^{p}$ denotes the anomaly probability of the patch $\tilde{x}^{p}_{i}$, and $y_{i}$ is its corresponding label, with $y_{i}=1$ if the patch contains anomalies and $y_{i}=0$ otherwise. $x_{r,i}^{p}$ is the reconstructed patch and $x_{i}^{p}$ is the initial normal time series patch. $\lambda$ is the weight that balances the two losses.

#### 3.6.2. Joint Anomaly Inference

During inference, the classification and reconstruction modules operate collaboratively to enable comprehensive anomaly detection. The final anomaly score for each patch is computed as the sum of the anomaly probability and the reconstruction error:

$$a(x_{i}^{p})=a_{c,i}^{p}\cdot J_{1,\mathtt{P}}+\left|\tilde{x}^{p}_{i}-x_{r,i}^{p}\right|,\tag{11}$$

where $J_{1,\mathtt{P}}$ denotes an all-ones vector of length $\mathtt{P}$, ensuring proper dimensional alignment for element-wise addition. Since the input time series is normalized using the training set, both the classification probabilities and reconstruction errors are on comparable scales, allowing them to be directly added into a unified anomaly score.
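The training objective and the joint score of Eq. (11) can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' code: BCE is written in its standard negative log-likelihood form, the clipping constant `eps` and the default `lam=1.0` are assumed values, and patches are represented as lists of floats.

```python
import math


def bce_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over N patches (negative log-likelihood)."""
    total = 0.0
    for a, y in zip(probs, labels):
        a = min(max(a, eps), 1.0 - eps)  # clip to keep log() finite
        total += y * math.log(a) + (1 - y) * math.log(1.0 - a)
    return -total / len(probs)


def mse_loss(recon_patches, target_patches):
    """Mean over patches of the squared L2 distance to the normal patch."""
    total = 0.0
    for r, x in zip(recon_patches, target_patches):
        total += sum((rj - xj) ** 2 for rj, xj in zip(r, x))
    return total / len(recon_patches)


def final_loss(probs, labels, recon_patches, target_patches, lam=1.0):
    """BCE plus lambda-weighted MSE; lam=1.0 is an assumed default."""
    return bce_loss(probs, labels) + lam * mse_loss(recon_patches, target_patches)


def joint_scores(probs, input_patches, recon_patches):
    """Eq. (11): broadcast each patch probability over its P points and add
    the point-wise absolute reconstruction error."""
    scores = []
    for a, x, r in zip(probs, input_patches, recon_patches):
        scores.extend(a + abs(xj - rj) for xj, rj in zip(x, r))
    return scores


# Toy data: N = 2 patches of length P = 2 (made-up numbers).
probs = [0.1, 0.9]
labels = [0, 1]
inputs = [[0.0, 0.1], [0.0, 2.0]]
recons = [[0.0, 0.1], [0.0, 0.5]]
normal = [[0.0, 0.1], [0.0, 0.4]]

print("loss:", final_loss(probs, labels, recons, normal, lam=0.5))
print("scores:", joint_scores(probs, inputs, recons))
```

In the toy data, the second patch is both classified as anomalous and poorly reconstructed at its second point, so that point receives the highest combined score.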
Following prior studies (Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051); Qin et al., [2023](https://arxiv.org/html/2605.26193#bib.bib726); Sun et al., [2024](https://arxiv.org/html/2605.26193#bib.bib812)), a moving average is applied to smooth the unified anomaly scores.

## 4. Experiments

### 4.1. Datasets and Evaluation Metrics

#### 4.1.1. Current Issues

Unreliable datasets and biased evaluation metrics have long plagued the field of time series anomaly detection (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050); Wu and Keogh, [2021a](https://arxiv.org/html/2605.26193#bib.bib891)). Many studies reported impressive performance based on flawed datasets and metrics, thereby contributing to the problem of *“Creating the Illusion of Progress”* (Wu and Keogh, [2021a](https://arxiv.org/html/2605.26193#bib.bib891); Keogh, [2021](https://arxiv.org/html/2605.26193#bib.bib1073)). Commonly used datasets such as SMD (Su et al., [2019](https://arxiv.org/html/2605.26193#bib.bib813)), PSM (Abdulaal et al., [2021](https://arxiv.org/html/2605.26193#bib.bib184)), SWaT (Goh et al., [2017](https://arxiv.org/html/2605.26193#bib.bib1060)), SMAP (Hundman et al., [2018](https://arxiv.org/html/2605.26193#bib.bib464)), MSL (Hundman et al., [2018](https://arxiv.org/html/2605.26193#bib.bib464)), and NAB (Ahmad et al., [2017](https://arxiv.org/html/2605.26193#bib.bib1061)) suffer from several critical issues, including mislabeled ground truth, trivial anomalies, unrealistic anomaly densities, and the run-to-failure bias (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050); Wu and Keogh, [2021a](https://arxiv.org/html/2605.26193#bib.bib891); Keogh, [2021](https://arxiv.org/html/2605.26193#bib.bib1073); Wu and Keogh, [2024b](https://arxiv.org/html/2605.26193#bib.bib1072)).
Moreover, some evaluation metrics, such as point-adjusted F1 (F1-PA) (Su et al., [2019](https://arxiv.org/html/2605.26193#bib.bib813)) and F1-Affiliation (Huet et al., [2022](https://arxiv.org/html/2605.26193#bib.bib462)), tend to overestimate model performance, often awarding the highest scores to random predictions (Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051); Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050); Kim et al., [2022](https://arxiv.org/html/2605.26193#bib.bib506)).

#### 4.1.2. Our Settings

To ensure the reliability of our evaluation, we adopt recently proposed, largest-scale and highest-quality datasets (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050); Wu and Keogh, [2021a](https://arxiv.org/html/2605.26193#bib.bib891)), along with the most rigorous evaluation metrics (Paparrizos et al., [2022a](https://arxiv.org/html/2605.26193#bib.bib702)). For datasets, we use the KDD21 (Wu and Keogh, [2021a](https://arxiv.org/html/2605.26193#bib.bib891)) and TSB-AD (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050)) benchmarks. i) KDD21 is widely acknowledged for its high data quality and domain diversity, comprising 250 datasets across various fields, including healthcare, sports, industry, and robotics. ii) TSB-AD also covers various domains and partially overlaps with KDD21. However, certain datasets within TSB-AD, such as NAB, WSD, YAHOO, and Stock, still suffer from the aforementioned data quality issues (Wu and Keogh, [2024b](https://arxiv.org/html/2605.26193#bib.bib1072)). Therefore, following established dataset quality criteria (Schmidl et al., [2022](https://arxiv.org/html/2605.26193#bib.bib758); Wu and Keogh, [2021a](https://arxiv.org/html/2605.26193#bib.bib891); Lu et al., [2023](https://arxiv.org/html/2605.26193#bib.bib631)), we select only the high-quality, non-overlapping datasets from TSB-AD for our experiments.
In total, our evaluation encompasses 314 datasets from diverse domains drawn from these two benchmarks, ensuring a comprehensive assessment. Detailed dataset descriptions are provided in Appendix [B.1](https://arxiv.org/html/2605.26193#A2.SS1). For metrics, we adopt the evaluation metrics recommended by recent benchmarking studies (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050); Paparrizos et al., [2022a](https://arxiv.org/html/2605.26193#bib.bib702); Boniol et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1089)), including Standard-F1, AUC-PR, Range-AUC-PR, and VUS-PR, which are recognized as the most reliable and precise measures in current research (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050)). Details of these metrics are available in Appendix [B.2](https://arxiv.org/html/2605.26193#A2.SS2).

### 4.2. Baselines and Implementation Details

#### 4.2.1. Baselines

To demonstrate the superiority of our method, we compare it against 24 SOTA baseline methods, including 17 deep learning-based and 7 data mining-based methods. As shown in Table [1](https://arxiv.org/html/2605.26193#S4.SS3), the deep learning-based methods can be categorized into four groups: i) Pure MAE-based methods and ii) Pure OE-based methods are standalone MAE and OE approaches; iii) Time-Frequency Reconstruction methods perform reconstruction in both time and frequency domains; and iv) Other Deep Learning-based methods rely on non-masking reconstruction (TranAD, MAUT and FITS), forecasting (M2N2), or feature discrepancy (DCdetector and AnomalyTransformer). Notably, DADA and MOMENT are built on large foundation models.

#### 4.2.2. Implementation Details

Both our model and all baselines follow identical data preprocessing procedures within an integrated, unified pipeline. More implementation details of CoAD and baselines are available in Appendix [C](https://arxiv.org/html/2605.26193#A3).
### 4.3. Comparison results

Table 1. Average results (%) on KDD21 and TSB-AD. The best results are in **bold**, and the second-best results are <u>underlined</u>.

| Model Class | Model (Venue) | KDD21 F1 | KDD21 AUC-PR | KDD21 R-AUC-PR | KDD21 VUS-PR | TSB-AD F1 | TSB-AD AUC-PR | TSB-AD R-AUC-PR | TSB-AD VUS-PR |
|---|---|---|---|---|---|---|---|---|---|
| Pure MAE | DADA (ICLR-25 (Shentu et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1052))) | 3.49 | 1.39 | 2.04 | 2.05 | 17.36 | 12.30 | 14.92 | 14.53 |
| Pure MAE | MOMENT (ICML-24 (Goswami et al., [2024](https://arxiv.org/html/2605.26193#bib.bib1053))) | 11.06 | 7.71 | 9.20 | 9.13 | 25.29 | 18.80 | 25.90 | 24.67 |
| Pure MAE | MMA (VLDB-25 (Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051))) | <u>44.47</u> | 39.24 | 37.97 | 37.38 | <u>43.20</u> | <u>38.65</u> | 37.33 | <u>36.94</u> |
| Pure MAE | TFMAE (ICDE-24 (Fang et al., [2024](https://arxiv.org/html/2605.26193#bib.bib1057))) | 2.53 | 0.99 | 1.75 | 1.75 | 7.29 | 3.45 | 5.83 | 5.77 |
| Pure OE | AnomalyBERT (ICLR-23 (Jeong et al., [2023](https://arxiv.org/html/2605.26193#bib.bib479))) | 23.42 | 15.92 | 14.16 | 13.79 | 19.14 | 9.85 | 13.21 | 13.85 |
| Pure OE | CutAddPaste (KDD-24 (Wang et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib849))) | 21.75 | 15.51 | 19.00 | 18.20 | 26.22 | 21.34 | 25.45 | 25.08 |
| Pure OE | TriAD (ICDE-24 (Sun et al., [2024](https://arxiv.org/html/2605.26193#bib.bib812))) | 16.78 | 22.37 | 28.09 | 27.56 | N/A | N/A | N/A | N/A |
| Pure OE | COUTA (TKDE-24 (Xu et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib913))) | 6.58 | 3.65 | 3.79 | 3.82 | 18.78 | 13.12 | 11.07 | 11.01 |
| Time-Frequency Reconstruction | FCVAE (WWW-24 (Wang et al., [2024b](https://arxiv.org/html/2605.26193#bib.bib870))) | 11.23 | 7.18 | 6.25 | 6.38 | 28.47 | 22.12 | 20.61 | 20.58 |
| Time-Frequency Reconstruction | TFAD (CIKM-22 (Zhang et al., [2022](https://arxiv.org/html/2605.26193#bib.bib1004))) | 1.85 | 0.86 | 1.59 | 1.58 | 6.17 | 3.34 | 6.21 | 5.99 |
| Time-Frequency Reconstruction | CATCH (ICLR-25 (Wu et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1058))) | 13.29 | 9.08 | 9.26 | 9.19 | 24.46 | 20.02 | 22.32 | 21.67 |
| Other Deep Learning | TranAD (VLDB-22 (Tuli et al., [2022](https://arxiv.org/html/2605.26193#bib.bib839))) | 11.23 | 7.78 | 7.94 | 7.89 | 18.07 | 13.06 | 12.23 | 12.05 |
| Other Deep Learning | MAUT (ICASSP-23 (Qin et al., [2023](https://arxiv.org/html/2605.26193#bib.bib726))) | 30.20 | 23.84 | 23.94 | 23.65 | 19.82 | 14.67 | 14.91 | 14.75 |
| Other Deep Learning | M2N2 (AAAI-24 (Kim et al., [2024](https://arxiv.org/html/2605.26193#bib.bib510))) | 5.57 | 2.70 | 3.15 | 3.18 | 16.98 | 10.38 | 9.41 | 9.25 |
| Other Deep Learning | FITS (ICLR-24 (Xu et al., [2024b](https://arxiv.org/html/2605.26193#bib.bib1080))) | 5.34 | 2.63 | 3.46 | 3.45 | 13.23 | 7.14 | 10.63 | 10.37 |
| Other Deep Learning | DCdetector (KDD-23 (Yang et al., [2023](https://arxiv.org/html/2605.26193#bib.bib1081))) | 2.66 | 1.09 | 1.88 | 1.85 | 6.47 | 3.12 | 5.79 | 5.69 |
| Other Deep Learning | AnomalyTrans (ICLR-22 (Xu et al., [2022](https://arxiv.org/html/2605.26193#bib.bib1082))) | 2.37 | 0.97 | 1.87 | 1.80 | 7.01 | 3.36 | 6.27 | 6.13 |
| Data Mining | KShapeAD (NeurIPS-24 (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050))) | 43.27 | <u>39.37</u> | <u>39.48</u> | <u>39.02</u> | 35.51 | 32.67 | 31.87 | 31.27 |
| Data Mining | SAND (VLDB-21 (Boniol et al., [2021](https://arxiv.org/html/2605.26193#bib.bib234))) | 39.43 | 34.72 | 34.23 | 33.78 | 34.74 | 31.52 | 31.85 | 31.05 |
| Data Mining | Sub-PCA (NeurIPS-24 (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050))) | 15.45 | 11.32 | 14.06 | 13.17 | 32.49 | 27.59 | 23.88 | 23.85 |
| Data Mining | Series2Graph (VLDB-20 (Boniol and Palpanas, [2020](https://arxiv.org/html/2605.26193#bib.bib235))) | 28.11 | 22.61 | 25.63 | 24.81 | 33.48 | 30.11 | 30.13 | 29.57 |
| Data Mining | KMeansAD (VLDB-22 (Yairi et al., [2001](https://arxiv.org/html/2605.26193#bib.bib1063))) | 37.97 | 34.33 | 33.49 | 33.20 | 41.42 | 37.26 | <u>37.64</u> | 36.81 |
| Data Mining | Matrix Profile (CIKM-16 (Nakamura et al., [2020](https://arxiv.org/html/2605.26193#bib.bib672))) | 28.00 | 18.50 | 25.37 | 24.13 | 35.10 | 27.93 | 29.88 | 28.94 |
| Data Mining | DAMP (KDD-22 (Lu et al., [2023](https://arxiv.org/html/2605.26193#bib.bib631))) | 29.31 | 18.99 | 25.09 | 24.18 | 17.44 | 11.59 | 12.98 | 12.60 |
| Cooperative | CoAD (ours) | **52.82** | **48.10** | **46.35** | **45.67** | **49.13** | **43.66** | **40.50** | **39.83** |

#### 4.3.1. Effectiveness

The comparison results are summarized in Table [1](https://arxiv.org/html/2605.26193#S4.SS3) (results with KDD21 cup metrics are in
Appendix [D](https://arxiv.org/html/2605.26193#A4)). The following key observations can be made: 1) CoAD consistently surpasses all baseline models on both the KDD21 and TSB-AD benchmarks across all evaluation metrics. 2) Compared with methods that rely exclusively on either MAE or OE, CoAD achieves substantial performance improvements, demonstrating the effectiveness of the proposed cooperative framework in leveraging the complementary strengths of classification and reconstruction. 3) CoAD also outperforms other OE-based methods that incorporate a broader range of simulated anomaly types during training. This highlights the superior generalization capabilities of CoAD, delivering stronger detection performance while requiring far less prior knowledge. 4) Notably, data mining-based methods outperform many deep learning counterparts, consistent with prior findings (Schmidl et al., [2022](https://arxiv.org/html/2605.26193#bib.bib758); Lai et al., [2021](https://arxiv.org/html/2605.26193#bib.bib524); Rewicki et al., [2023](https://arxiv.org/html/2605.26193#bib.bib741); Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050); Sarfraz et al., [2024](https://arxiv.org/html/2605.26193#bib.bib1079); Mejri et al., [2024](https://arxiv.org/html/2605.26193#bib.bib655); Garg et al., [2022](https://arxiv.org/html/2605.26193#bib.bib399)), highlighting the ongoing challenges of deep learning in time series anomaly detection. Nevertheless, CoAD surpasses even the strongest data mining baselines, illustrating that with principled architectural design and rigorous evaluation, deep learning can achieve SOTA performance in TSAD. Appendix [J](https://arxiv.org/html/2605.26193#A10) further investigates the influence of various hyperparameters, including the window size $\mathtt{T}$, patch size $\mathtt{P}$, loss weight $\lambda$, and different encoder backbone choices (GRU or Transformer), on model effectiveness.
Figure 5. Model efficiency comparison.

#### 4.3.2. Efficiency

Figure [5](https://arxiv.org/html/2605.26193#S4.F5) presents a comparison of model efficiency in terms of overall inference time and parameter count on the KDD21 benchmark (the settings and results on the TSB-AD benchmark are available in Appendix [E](https://arxiv.org/html/2605.26193#A5)). The results highlight that CoAD not only achieves superior detection performance but also delivers remarkable computational efficiency. Specifically, CoAD completes inference over the entire KDD21 benchmark (with dataset lengths ranging from 6K to 650K data points), comprising more than 6 million data points in total, in only 6.89 seconds, outperforming most baselines by orders of magnitude in speed. This underscores the strong potential of CoAD for real-time anomaly detection in high-throughput data streams. We also evaluate training efficiency on entity 241, the largest dataset in the KDD21 benchmark, which contains 250K training samples. CoAD achieves a training speed of 0.4066 seconds per epoch, confirming its high training efficiency. Furthermore, the model maintains a compact size of only 2.04M parameters, highlighting its practicality for deployment in resource-constrained environments. The theoretical time complexity analysis is provided in Appendix [F](https://arxiv.org/html/2605.26193#A6).

### 4.4. Ablation and Design Choice Study

Table 2. Quantitative ablation results.
Best results are in **bold**.

| Variants | OE: Time | OE: Frequency | MAE: Random | MAE: Grating |  |  | KDD21 F1 | KDD21 AUC-PR | KDD21 R-AUC-PR | KDD21 VUS-PR | TSB-AD F1 | TSB-AD AUC-PR | TSB-AD R-AUC-PR | TSB-AD VUS-PR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OE alone | ✓ | – | – | – | – | – | 27.84 | 24.12 | 24.81 | 24.38 | 29.79 | 22.36 | 27.14 | 26.28 |
| OE alone | ✓ | ✓ | – | – | – | – | 47.95 | 42.75 | 41.06 | 40.41 | 37.68 | 31.21 | 30.48 | 29.90 |
| MAE alone | – | – | ✓ | – | – | – | 35.65 | 30.23 | 30.96 | 30.24 | 32.30 | 25.74 | 23.55 | 23.35 |
| MAE alone | – | – | – | ✓ | – | – | 38.99 | 32.31 | 32.76 | 31.99 | 35.01 | 29.08 | 28.51 | 28.03 |
| Cooperative | ✓ | ✓ | ✓ | – | – | – | 44.88 | 41.26 | 39.55 | 39.05 | 45.01 | 39.31 | 35.79 | 35.20 |
| Cooperative | ✓ | ✓ | – | ✓ | – | – | 46.16 | 41.95 | 40.86 | 40.19 | 47.07 | 41.48 | 37.39 | 37.08 |
| Cooperative | ✓ | ✓ | – | – | ✓ | – | 49.03 | 45.27 | 43.87 | 43.49 | 44.06 | 38.57 | 35.31 | 34.76 |
| Cooperative | ✓ | – | – | – | – | ✓ | 48.45 | 43.21 | 42.87 | 42.35 | 43.10 | 36.87 | 37.91 | 36.92 |
| CoAD | ✓ | ✓ | – | – | – | ✓ | **52.82** | **48.10** | **46.35** | **45.67** | **49.13** | **43.66** | **40.50** | **39.83** |

| Design Choice Variants | Classification Granularity: Step Level | Classification Granularity: Window Level | Fusion Strategy: Feature_Add | Fusion Strategy: Feature_Gate | Fusion Strategy: Decision_Mean | Masking Method | KDD21 F1 | KDD21 AUC-PR | KDD21 R-AUC-PR | KDD21 VUS-PR | TSB-AD F1 | TSB-AD AUC-PR | TSB-AD R-AUC-PR | TSB-AD VUS-PR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cooperative | ✓ | – | – | – | – | – | 43.09 | 36.62 | 35.43 | 35.00 | 42.52 | 36.06 | 34.02 | 33.37 |
| Cooperative | – | ✓ | – | – | – | – | 15.26 | 10.21 | 11.29 | 11.02 | 22.43 | 17.08 | 15.71 | 15.53 |
| Cooperative | – | – | ✓ | – | – | – | 50.76 | 45.93 | 44.76 | 44.25 | 40.11 | 35.15 | 33.40 | 32.94 |
| Cooperative | – | – | – | ✓ | – | – | 50.52 | 45.58 | 45.53 | 44.71 | 42.23 | 37.20 | 34.87 | 34.40 |
| Cooperative | – | – | – | – | ✓ | – | 51.50 | 45.98 | 45.53 | 44.81 | 42.24 | 36.23 | 33.71 | 33.18 |
| Cooperative | – | – | – | – | – | ✓ | 48.13 | 42.92 | 42.93 | 42.09 | 37.07 | 30.54 | 28.04 | 27.73 |
| CoAD | Patch Level |  | Decision_Max |  |  | Guide w/ Score | **52.82** | **48.10** | **46.35** | **45.67** | **49.13** | **43.66** | **40.50** | **39.83** |

Figure 6. Visualization of detection results of CoAD on challenging cases.

To evaluate the effectiveness of each component and design choice in CoAD, we conduct a comprehensive study with the model variants listed in Table [2](https://arxiv.org/html/2605.26193#S4.SS4).
Specifically, we examine standalone models (OE or MAE alone), different cooperative strategies, classification granularities, time-frequency ensemble strategies, and anomaly scoring methods. For clarification: OE+MAE (Random/Grating) directly combines detection results from OE and random/grating masking MAE without guidance; OE+MAE (Guide w/ Hard Mask) uses OE guidance with discrete hard masks; OE+MAE (Guide w/o Score) guides MAE using OE soft masking, but only uses MAE’s reconstruction error for final anomaly scoring; Feature_Gate fuses time and frequency features using a gated mechanism (Arevalo et al., [2017](https://arxiv.org/html/2605.26193#bib.bib1074)); and Decision_Mean averages the classification scores from both branches. Implementation details and descriptions of all variants are available in Appendix [G](https://arxiv.org/html/2605.26193#A7).

Based on the results, we draw the following key conclusions: i) CoAD achieves the best overall performance, validating the effectiveness of our design. ii) Cooperative models generally outperform standalone ones, with our guided soft masking framework outperforming hard-masked or naive cooperative baselines. iii) Patch-level classification granularity yields markedly better results than step- or window-level approaches. iv) Frequency-domain information significantly enhances anomaly detection performance. v) Decision-level fusion via the maximum operation is more effective than feature-level fusion or averaging strategies. Qualitative ablation study results are available in Appendix [H](https://arxiv.org/html/2605.26193#A8).
### 4.5. Visualization on Challenging Anomalies

Figure [6](https://arxiv.org/html/2605.26193#S4.F6) visualizes the detection results of CoAD on several challenging cases:

1) **Subtle anomalies:** KDD21 003, KDD21 024, and KDD21 031 contain subtle anomalies whose amplitudes are similar to those of normal values, making them easily overlooked (Lee et al., [2024](https://arxiv.org/html/2605.26193#bib.bib529); Sun et al., [2024](https://arxiv.org/html/2605.26193#bib.bib812)). In contrast, these anomalies can be effectively detected by CoAD.

2) **Prolonged anomalies:** KDD21 044 and TSB-AD 238 include anomalies that persist over multiple periods, posing difficulties for many existing methods (Mejri et al., [2024](https://arxiv.org/html/2605.26193#bib.bib655)). Nevertheless, CoAD successfully captures and localizes these long-duration anomalies.

3) **Anomalies challenging for human experts:** As emphasized by Wu and Keogh ([2024b](https://arxiv.org/html/2605.26193#bib.bib1072)), a desirable anomaly detection model should be able to identify anomalies that are difficult even for human experts. KDD21 209, TSB-AD 230, and TSB-AD 234 exemplify such cases. CoAD consistently assigns high anomaly scores to these regions, highlighting its robustness in handling complex and ambiguous patterns.

4) **Anomalies in non-stationary time series:** As highlighted by Wu and Keogh ([2024a](https://arxiv.org/html/2605.26193#bib.bib1078)), existing deep learning-based methods generally fail to detect anomalies in non-stationary series. In contrast, CoAD effectively handles such dynamic cases.

## 5. Conclusion

This paper proposes CoAD, a cooperative framework that seamlessly integrates classification and reconstruction to leverage their complementary strengths and overcome their individual limitations.
Extensive experiments on reliable datasets using rigorous evaluation metrics validate that our proposed framework significantly outperforms baselines in both detection performance and computational efficiency.

###### Acknowledgements.

This work was supported by the National Natural Science Foundation of China (72571279) and the Science and Technology Innovation Program of Hunan Province (2023RC1002).

## References

- A. Abdulaal, Z. Liu, and T. Lancewicki (2021). Practical approach to asynchronous multivariate time series anomaly detection and localization. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2485–2494. DOI: [10.1145/3447548.3467174](https://dx.doi.org/10.1145/3447548.3467174). ISBN 978-1-4503-8332-5.
- S. Ahmad, A. Lavin, S. Purdy, and Z. Agha (2017). Unsupervised real-time anomaly detection for streaming data. Neurocomputing 262, pp. 134–147. ISSN 0925-2312. DOI: [10.1016/j.neucom.2017.04.070](https://doi.org/10.1016/j.neucom.2017.04.070).
- J. Arevalo, T. Solorio, M. Montes-y-Gómez, and F. A. González (2017). Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992.
- A. Blázquez-García, A. Conde, U. Mori, and J. A. Lozano (2022). A review on outlier/anomaly detection in time series data. ACM Computing Surveys 54(3), pp. 1–33. ISSN 0360-0300. DOI: [10.1145/3444690](https://dx.doi.org/10.1145/3444690).
- P. Boniol, A. K. Krishna, M. Bruel, Q. Liu, M. Huang, T. Palpanas, R. S. Tsay, A. Elmore, M. J. Franklin, and J.
Paparrizos (2025). VUS: effective and efficient accuracy measures for time-series anomaly detection. The VLDB Journal 34(3). ISSN 1066-8888. DOI: [10.1007/s00778-025-00907-x](https://dx.doi.org/10.1007/s00778-025-00907-x).
- P. Boniol and T. Palpanas (2020). Series2Graph: graph-based subsequence anomaly detection for time series. Proceedings of the VLDB Endowment 13(12), pp. 1821–1834. ISSN 2150-8097. DOI: [10.14778/3407790.3407792](https://dx.doi.org/10.14778/3407790.3407792).
- P. Boniol, J. Paparrizos, T. Palpanas, and M. J. Franklin (2021). SAND: streaming subsequence anomaly detection. Proceedings of the VLDB Endowment 14(10), pp. 1717–1729. ISSN 2150-8097. DOI: [10.14778/3467861.3467863](https://dx.doi.org/10.14778/3467861.3467863).
- Y. Chen, C. Zhang, M. Ma, Y. Liu, R. Ding, B. Li, S. He, S. Rajmohan, Q. Lin, and D. Zhang (2023). ImDiffusion: imputed diffusion models for multivariate time series anomaly detection. Proceedings of the VLDB Endowment 17(3), pp. 359–372. ISSN 2150-8097. DOI: [10.14778/3632093.3632101](https://dx.doi.org/10.14778/3632093.3632101).
- K. Choi, J. Yi, C. Park, and S. Yoon (2021). Deep learning for anomaly detection in time-series data: review, analysis, and guidelines. IEEE Access 9, pp. 120043–120065. ISSN 2169-3536. DOI: [10.1109/ACCESS.2021.3107975](https://dx.doi.org/10.1109/ACCESS.2021.3107975).
- J. Chung, C. Gulcehre, K. H. Cho, and Y.
Bengio (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- J. Davis and M. Goadrich (2006). The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240. DOI: [10.1145/1143844.1143874](https://dx.doi.org/10.1145/1143844.1143874). ISBN 978-1-59593-383-6.
- Y. Fang, J. Xie, Y. Zhao, L. Chen, Y. Gao, and K. Zheng (2024). Temporal-frequency masked autoencoders for time series anomaly detection. In IEEE 40th International Conference on Data Engineering, pp. 1228–1241. DOI: [10.1109/ICDE60146.2024.00099](https://dx.doi.org/10.1109/ICDE60146.2024.00099). ISBN 9798350317152.
- A. Garg, W. Zhang, J. Samaran, R. Savitha, and C. Foo (2022). An evaluation of anomaly detection and diagnosis in multivariate time series. IEEE Transactions on Neural Networks and Learning Systems 33(6), pp. 2508–2517. ISSN 2162-237X. DOI: [10.1109/TNNLS.2021.3105827](https://dx.doi.org/10.1109/TNNLS.2021.3105827).
- J. Goh, S. Adepu, K. N. Junejo, and A. Mathur (2017). A dataset to support research in the design of secure water treatment systems. In Critical Information Infrastructures Security, pp. 88–99.
- M.
Goswami, K\. Szafer, A\. Choudhry, Y\. Cai, S\. Li, and A\. Dubrawski \(2024\)MOMENT: a family of open time\-series foundation models\.In41st International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§2\.3](https://arxiv.org/html/2605.26193#S2.SS3.p1.1),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.4.4.2)\. - M\. Gupta, J\. Gao, C\. C\. Aggarwal, and J\. Han \(2014\)Outlier detection for temporal data: a survey\.IEEE Transactions on Knowledge and Data Engineering26\(9\),pp\. 2250–2267\.External Links:ISSN 1041\-4347,[Document](https://dx.doi.org/10.1109/TKDE.2013.184),LCCN 2Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1)\. - K\. He, X\. Chen, S\. Xie, Y\. Li, P\. Dollar, and R\. Girshick \(2022\)Masked autoencoders are scalable vision learners\.In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 15979–15988\.External Links:[Document](https://dx.doi.org/10.1109/CVPR52688.2022.01553),ISBN 978\-1\-66546\-946\-3Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1)\. - D\. Hendrycks, M\. Mazeika, and T\. Dietterich \(2019\)Deep anomaly detection with outlier exposure\.InThe Seventh International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1)\. - A\. Huet, J\. M\. Navarro, and D\. Rossi \(2022\)Local evaluation of time series anomaly detection algorithms\.InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 635–645\.External Links:ISBN 978\-1\-4503\-9385\-0Cited by:[§4\.1\.1](https://arxiv.org/html/2605.26193#S4.SS1.SSS1.p1.1)\. - K\. Hundman, V\. Constantinou, C\. Laporte, I\. Colwell, and T\. Soderstrom \(2018\)Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding\.InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,pp\. 
387–395\.External Links:[Document](https://dx.doi.org/10.1145/3219819.3219845),ISBN 978\-1\-4503\-5552\-0Cited by:[§4\.1\.1](https://arxiv.org/html/2605.26193#S4.SS1.SSS1.p1.1)\. - Y\. Jeong, E\. Yang, J\. H\. Ryu, I\. Park, and M\. Kang \(2023\)AnomalyBERT: self\-supervised transformer for time series anomaly detection using data degradation scheme\.InThe 11th International Conference on Learning Representations,Cited by:[1st item](https://arxiv.org/html/2605.26193#A7.I1.i3.I1.i1.p1.1),[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§1](https://arxiv.org/html/2605.26193#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.26193#S2.SS2.p1.1),[§3\.3\.1](https://arxiv.org/html/2605.26193#S3.SS3.SSS1.p2.2),[§3\.3\.1](https://arxiv.org/html/2605.26193#S3.SS3.SSS1.p3.1),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.7.7.2)\. - E\. Keogh \(2021\)Irrational exuberance: why we should not believe 95% of papers on time series anomaly detection\.Note:[https://kdd\-milets\.github\.io/milets2021/slides/Irrational%20Exuberance\_Eammon\_Keogh\.pdf](https://kdd-milets.github.io/milets2021/slides/Irrational%20Exuberance_Eammon_Keogh.pdf)Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p5.1),[§4\.1\.1](https://arxiv.org/html/2605.26193#S4.SS1.SSS1.p1.1)\. - D\. Kim, S\. Park, and J\. Choo \(2024\)When model meets new normals: test\-time adaptation for unsupervised time\-series anomaly detection\.Proceedings of the AAAI Conference on Artificial Intelligence38\(12\),pp\. 13113–13121\.External Links:ISSN 2374\-3468,[Document](https://dx.doi.org/10.1609/aaai.v38i12.29210)Cited by:[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.16.16.2)\. - S\. Kim, K\. Choi, H\. Choi, B\. Lee, and S\. Yoon \(2022\)Towards a rigorous evaluation of time\-series anomaly detection\.Proceedings of the AAAI Conference on Artificial Intelligence36\(7\),pp\. 
7194–7201\.External Links:ISSN 2374\-3468,[Document](https://dx.doi.org/10.1609/aaai.v36i7.20680)Cited by:[1st item](https://arxiv.org/html/2605.26193#A2.I1.i1.p1.1),[§B\.2](https://arxiv.org/html/2605.26193#A2.SS2.p1.1),[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§4\.1\.1](https://arxiv.org/html/2605.26193#S4.SS1.SSS1.p1.1)\. - Y\. Kong, Z\. Wang, Y\. Nie, T\. Zhou, S\. Zohren, Y\. Liang, P\. Sun, and Q\. Wen \(2025\)Unlocking the power of lstm for long term time series forecasting\.Proceedings of the AAAI Conference on Artificial Intelligence39\(11\),pp\. 11968–11976\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/33303),[Document](https://dx.doi.org/10.1609/aaai.v39i11.33303)Cited by:[Appendix J](https://arxiv.org/html/2605.26193#A10.p2.1)\. - K\. Lai, D\. Zha, Y\. Zhao, G\. Wang, J\. Xu, and X\. Hu \(2021\)Revisiting time series outlier detection: definitions and benchmarks\.InThe 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track,Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§4\.3\.1](https://arxiv.org/html/2605.26193#S4.SS3.SSS1.p1.3)\. - D\. Lee, S\. Malacarne, and E\. Aune \(2024\)Explainable time series anomaly detection using masked latent generative modeling\.Pattern Recognition156,pp\. 110826\.External Links:ISSN 0031\-3203,[Document](https://dx.doi.org/10.1016/j.patcog.2024.110826),LCCN 1Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§4\.5](https://arxiv.org/html/2605.26193#S4.SS5.p1.1)\. - D\. Li, S\. Zhang, Y\. Sun, Y\. Guo, Z\. Che, S\. Chen, Z\. Zhong, M\. Liang, M\. Shao, M\. Li, S\. Liu, Y\. Zhang, and D\. Pei \(2023\)An empirical analysis of anomaly detection methods for multivariate time series\.In2023 IEEE 34th International Symposium on Software Reliability Engineering,pp\. 57–68\.External Links:[Document](https://dx.doi.org/10.1109/ISSRE59848.2023.00014),ISBN 9798350315943Cited by:[1st item](https://arxiv.org/html/2605.26193#A2.I1.i1.p1.1)\. - G\. 
Li and J\. J\. Jung \(2023\)Deep learning for anomaly detection in multivariate time series: approaches, applications, and challenges\.Information Fusion91,pp\. 93–102\.External Links:ISSN 1566\-2535,[Document](https://dx.doi.org/10.1016/j.inffus.2022.10.008),LCCN 1Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§3\.1](https://arxiv.org/html/2605.26193#S3.SS1.p1.14)\. - S\. Lin, W\. Lin, W\. Wu, F\. Zhao, R\. Mo, and H\. Zhang \(2023\)SegRNN: segment recurrent neural network for long\-term time series forecasting\.External Links:2308\.11200,[Link](https://arxiv.org/abs/2308.11200)Cited by:[Appendix J](https://arxiv.org/html/2605.26193#A10.p2.1)\. - Q\. Liu and J\. Paparrizos \(2024\)The elephant in the room: towards a reliable time\-series anomaly detection benchmark\.InThe 38th Conference on Neural Information Processing Systems Datasets and Benchmarks Track,Cited by:[1st item](https://arxiv.org/html/2605.26193#A2.I1.i1.p1.1),[§B\.1](https://arxiv.org/html/2605.26193#A2.SS1.p2.1),[§B\.2](https://arxiv.org/html/2605.26193#A2.SS2.p1.1),[§C\.2](https://arxiv.org/html/2605.26193#A3.SS2.p2.1),[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§1](https://arxiv.org/html/2605.26193#S1.p5.1),[§2\.1](https://arxiv.org/html/2605.26193#S2.SS1.p1.1),[§4\.1\.1](https://arxiv.org/html/2605.26193#S4.SS1.SSS1.p1.1),[§4\.1\.2](https://arxiv.org/html/2605.26193#S4.SS1.SSS2.p1.1),[§4\.1\.2](https://arxiv.org/html/2605.26193#S4.SS1.SSS2.p2.1),[§4\.1\.2](https://arxiv.org/html/2605.26193#S4.SS1.SSS2.p3.1),[§4\.3\.1](https://arxiv.org/html/2605.26193#S4.SS3.SSS1.p1.3),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.20.20.2),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.22.22.2)\. - J\. M\. Lobo \(2010\)AUC: a misleading measure of the performance of predictive distribution models\.Global Ecology & Biogeography17\(2\),pp\. 145–151\.Cited by:[§B\.2](https://arxiv.org/html/2605.26193#A2.SS2.p1.1)\. - Y\. Lu, R\. Wu, A\. Mueen, M\. A\. Zuluaga, and E\. 
Keogh \(2022\)Matrix profile xxiv: scaling time series anomaly detection to trillions of datapoints and ultra\-fast arriving data streams\.InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 1173–1182\.External Links:[Document](https://dx.doi.org/10.1145/3534678.3539271),ISBN 978\-1\-4503\-9385\-0Cited by:[§B\.1](https://arxiv.org/html/2605.26193#A2.SS1.p1.1),[§E\.1](https://arxiv.org/html/2605.26193#A5.SS1.p1.1)\. - Y\. Lu, R\. Wu, A\. Mueen, M\. A\. Zuluaga, and E\. Keogh \(2023\)DAMP: accurate time series anomaly detection on trillions of datapoints and ultra\-fast arriving data streams\.Data Mining and Knowledge Discovery37\(2\),pp\. 627–669\.External Links:ISSN 1384\-5810,[Document](https://dx.doi.org/10.1007/s10618-022-00911-7)Cited by:[§B\.1](https://arxiv.org/html/2605.26193#A2.SS1.p2.1),[§2\.1](https://arxiv.org/html/2605.26193#S2.SS1.p1.1),[§4\.1\.2](https://arxiv.org/html/2605.26193#S4.SS1.SSS2.p2.1),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.26.26.2)\. - N\. Mejri, L\. Lopez\-Fuentes, K\. Roy, P\. Chernakov, E\. Ghorbel, and D\. Aouada \(2024\)Unsupervised anomaly detection in time\-series: an extensive evaluation and analysis of state\-of\-the\-art methods\.Expert Systems with Applications256,pp\. 124922\.External Links:ISSN 0957\-4174,[Document](https://dx.doi.org/10.1016/j.eswa.2024.124922),LCCN 1Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.26193#S2.SS1.p1.1),[§3\.1](https://arxiv.org/html/2605.26193#S3.SS1.p1.14),[§3\.3\.1](https://arxiv.org/html/2605.26193#S3.SS3.SSS1.p2.2),[§4\.3\.1](https://arxiv.org/html/2605.26193#S4.SS3.SSS1.p1.3),[§4\.5](https://arxiv.org/html/2605.26193#S4.SS5.p1.1)\. - T\. Nakamura, M\. Imamura, R\. Mercer, and E\. Keogh \(2020\)MERLIN: parameter\-free discovery of arbitrary length anomalies in massive time series archives\.In2020 IEEE International Conference on Data Mining,pp\. 
1190–1195\.External Links:ISBN 978\-1\-72818\-316\-9Cited by:[§2\.1](https://arxiv.org/html/2605.26193#S2.SS1.p1.1),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.25.25.2)\. - Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam \(2023\)A time series is worth 64 words: long\-term forecasting with transformers\.InThe 11th International Conference on Learning Representations,Cited by:[§3\.3\.1](https://arxiv.org/html/2605.26193#S3.SS3.SSS1.p3.1)\. - J\. Paparrizos, P\. Boniol, T\. Palpanas, R\. S\. Tsay, A\. Elmore, and M\. J\. Franklin \(2022a\)Volume under the surface: a new accuracy evaluation measure for time\-series anomaly detection\.Proceedings of the VLDB Endowment15\(11\),pp\. 2774–2787\.External Links:ISSN 2150\-8097,[Document](https://dx.doi.org/10.14778/3551793.3551830),LCCN 3Cited by:[3rd item](https://arxiv.org/html/2605.26193#A2.I1.i3.p1.1),[4th item](https://arxiv.org/html/2605.26193#A2.I1.i4.p1.1),[§B\.2](https://arxiv.org/html/2605.26193#A2.SS2.p1.1),[§4\.1\.2](https://arxiv.org/html/2605.26193#S4.SS1.SSS2.p1.1),[§4\.1\.2](https://arxiv.org/html/2605.26193#S4.SS1.SSS2.p3.1)\. - J\. Paparrizos, Y\. Kang, P\. Boniol, R\. S\. Tsay, T\. Palpanas, and M\. J\. Franklin \(2022b\)TSB\-UAD: an end\-to\-end benchmark suite for univariate time\-series anomaly detection\.Proceedings of the VLDB Endowment15\(8\),pp\. 1697–1711\.External Links:ISSN 2150\-8097,[Document](https://dx.doi.org/10.14778/3529337.3529354),LCCN 3Cited by:[§C\.2](https://arxiv.org/html/2605.26193#A3.SS2.p2.1)\. - S\. Qin, Y\. Luo, and G\. Tao \(2023\)Memory\-augmented u\-transformer for multivariate time series anomaly detection\.InICASSP 2023 \- 2023 IEEE International Conference on Acoustics, Speech and Signal Processing,pp\. 
1–5\.External Links:[Document](https://dx.doi.org/10.1109/ICASSP49357.2023.10096179),ISBN 978\-1\-72816\-327\-7Cited by:[§3\.6\.2](https://arxiv.org/html/2605.26193#S3.SS6.SSS2.p1.2),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.15.15.2)\. - F\. Rewicki, J\. Denzler, and J\. Niebling \(2023\)Is it worth it? comparing six deep and classical methods for unsupervised anomaly detection in time series\.Applied Sciences13\(3\),pp\. 1778\.External Links:ISSN 2076\-3417,[Document](https://dx.doi.org/10.3390/app13031778),LCCN 4Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.26193#S2.SS1.p1.1),[§4\.3\.1](https://arxiv.org/html/2605.26193#S4.SS3.SSS1.p1.3)\. - M\. S\. Sarfraz, M\. Chen, L\. Layer, K\. Peng, and M\. Koulakis \(2024\)Position: quo vadis, unsupervised time series anomaly detection?\.InProceedings of the 41st International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§4\.3\.1](https://arxiv.org/html/2605.26193#S4.SS3.SSS1.p1.3)\. - S\. Schmidl, P\. Wenig, and T\. Papenbrock \(2022\)Anomaly detection in time series: a comprehensive evaluation\.Proceedings of the VLDB Endowment15\(9\),pp\. 1779–1797\.External Links:ISSN 2150\-8097,[Document](https://dx.doi.org/10.14778/3538598.3538602),LCCN 3Cited by:[§B\.1](https://arxiv.org/html/2605.26193#A2.SS1.p2.1),[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§4\.1\.2](https://arxiv.org/html/2605.26193#S4.SS1.SSS2.p2.1),[§4\.3\.1](https://arxiv.org/html/2605.26193#S4.SS3.SSS1.p1.3)\. - Q\. Shentu, B\. Li, K\. Zhao, Y\. Shu, Z\. Rao, L\. Pan, B\. Yang, and C\. 
Guo \(2025\)Towards a general time series anomaly detector with adaptive bottlenecks and dual adversarial decoders\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§1](https://arxiv.org/html/2605.26193#S1.p3.1),[§2\.3](https://arxiv.org/html/2605.26193#S2.SS3.p1.1),[§3\.3\.1](https://arxiv.org/html/2605.26193#S3.SS3.SSS1.p2.2),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.3.3.2)\. - H\. Si, J\. Li, C\. Pei, H\. Cui, J\. Yang, Y\. Sun, S\. Zhang, J\. Li, H\. Zhang, J\. Han, D\. Pei, and G\. Xie \(2024\)TimeSeriesBench: an industrial\-grade benchmark for time series anomaly detection models\.InIEEE 35th International Symposium on Software Reliability Engineering,Vol\.,pp\. 61–72\.External Links:[Document](https://dx.doi.org/10.1109/ISSRE62328.2024.00017)Cited by:[§B\.1](https://arxiv.org/html/2605.26193#A2.SS1.p1.1)\. - Y\. Su, Y\. Zhao, C\. Niu, R\. Liu, W\. Sun, and D\. Pei \(2019\)Robust anomaly detection for multivariate time series through stochastic recurrent neural network\.Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining\.External Links:[Document](https://dx.doi.org/10.1145/3292500.3330672)Cited by:[§4\.1\.1](https://arxiv.org/html/2605.26193#S4.SS1.SSS1.p1.1)\. - Y\. Sun, G\. Pang, G\. Ye, T\. Chen, X\. Hu, and H\. Yin \(2024\)Unraveling the ‘anomaly’in time series anomaly detection: a self\-supervised tri\-domain solution\.Cited by:[§B\.1](https://arxiv.org/html/2605.26193#A2.SS1.p1.1),[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.26193#S2.SS1.p1.1),[§3\.6\.2](https://arxiv.org/html/2605.26193#S3.SS6.SSS2.p1.2),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.9.9.2),[§4\.5](https://arxiv.org/html/2605.26193#S4.SS5.p1.1)\. - P\. Tang and W\. 
Zhang \(2025a\)Unlocking the power of patch: patch\-based mlp for long\-term time series forecasting\.Proceedings of the AAAI Conference on Artificial Intelligence39\(12\),pp\. 12640–12648\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/33378),[Document](https://dx.doi.org/10.1609/aaai.v39i12.33378)Cited by:[Appendix J](https://arxiv.org/html/2605.26193#A10.p2.1)\. - P\. Tang and W\. Zhang \(2025b\)Unlocking the power of patch: patch\-based mlp for long\-term time series forecasting\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 12640–12648\.External Links:ISSN 2374\-3468,[Document](https://dx.doi.org/10.1609/aaai.v39i12.33378)Cited by:[Appendix J](https://arxiv.org/html/2605.26193#A10.p1.5)\. - Q\. Tang, C\. Dai, Y\. Wu, and H\. Zhou \(2025\)MLP\-mixer based masked autoencoders are effective, explainable and robust for time series anomaly detection\.InProceedings of the VLDB Endowment,Vol\.18,pp\. 798–811\.External Links:ISSN 2150\-8097,[Document](https://dx.doi.org/10.14778/3712221.3712243)Cited by:[§B\.1](https://arxiv.org/html/2605.26193#A2.SS1.p1.1),[§B\.2](https://arxiv.org/html/2605.26193#A2.SS2.p1.1),[§E\.1](https://arxiv.org/html/2605.26193#A5.SS1.p1.1),[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§1](https://arxiv.org/html/2605.26193#S1.p2.1),[§1](https://arxiv.org/html/2605.26193#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.26193#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2605.26193#S2.SS3.p1.1),[§3\.3\.1](https://arxiv.org/html/2605.26193#S3.SS3.SSS1.p3.1),[§3\.6\.2](https://arxiv.org/html/2605.26193#S3.SS6.SSS2.p1.2),[§4\.1\.1](https://arxiv.org/html/2605.26193#S4.SS1.SSS1.p1.1),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.5.5.2)\. - S\. Tuli, G\. Casale, and N\. R\. Jennings \(2022\)TranAD: deep transformer networks for anomaly detection in multivariate time series data\.Proceedings of the VLDB Endowment15\(6\),pp\. 
1201–1214\.External Links:ISSN 2150\-8097,[Document](https://dx.doi.org/10.14778/3514061.3514067),LCCN 3Cited by:[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.14.14.2)\. - A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.InProceedings of the 31st International Conference on Neural Information Processing Systems,NIPS’17,Red Hook, NY, USA,pp\. 6000–6010\.External Links:ISBN 9781510860964Cited by:[Appendix J](https://arxiv.org/html/2605.26193#A10.p2.1)\. - R\. Wang, X\. Mou, R\. Yang, K\. Gao, P\. Liu, C\. Liu, T\. Wo, and X\. Liu \(2024a\)CutAddPaste: time series anomaly detection by exploiting abnormal knowledge\.InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 3176–3187\.External Links:[Document](https://dx.doi.org/10.1145/3637528.3671739),ISBN 9798400704901Cited by:[2nd item](https://arxiv.org/html/2605.26193#A7.I1.i3.I1.i2.p1.1),[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§1](https://arxiv.org/html/2605.26193#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.26193#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2605.26193#S2.SS2.p1.1),[§3\.3\.1](https://arxiv.org/html/2605.26193#S3.SS3.SSS1.p3.1),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.8.8.2)\. - Z\. Wang, C\. Pei, M\. Ma, X\. Wang, Z\. Li, D\. Pei, S\. Rajmohan, D\. Zhang, Q\. Lin, H\. Zhang, J\. Li, and G\. Xie \(2024b\)Revisiting vae for unsupervised time series anomaly detection : a frequency perspective\.InProceedings of the ACM on Web Conference 2024,pp\. 3096–3105\.External Links:[Document](https://dx.doi.org/10.1145/3589334.3645710),ISBN 9798400701719Cited by:[§3\.3\.2](https://arxiv.org/html/2605.26193#S3.SS3.SSS2.p1.1),[§3\.3\.2](https://arxiv.org/html/2605.26193#S3.SS3.SSS2.p2.1),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.11.11.2)\. - R\. Wu and E\. 
Keogh \(2021a\)Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress\.IEEE Transactions on Knowledge and Data Engineering,pp\. 1–1\.External Links:ISSN 1041\-4347,[Document](https://dx.doi.org/10.1109/TKDE.2021.3112126),LCCN 2Cited by:[§B\.1](https://arxiv.org/html/2605.26193#A2.SS1.p1.1),[§B\.1](https://arxiv.org/html/2605.26193#A2.SS1.p2.1),[§1](https://arxiv.org/html/2605.26193#S1.p5.1),[§4\.1\.1](https://arxiv.org/html/2605.26193#S4.SS1.SSS1.p1.1),[§4\.1\.2](https://arxiv.org/html/2605.26193#S4.SS1.SSS2.p1.1),[§4\.1\.2](https://arxiv.org/html/2605.26193#S4.SS1.SSS2.p2.1)\. - R\. Wu and E\. Keogh \(2021b\)Multi\-dataset time series anomaly detection competition\.Note:[https://compete\.hexagon\-ml\.com/practice/competition/39/\#evaluation](https://compete.hexagon-ml.com/practice/competition/39/#evaluation)Cited by:[Appendix D](https://arxiv.org/html/2605.26193#A4.p1.5)\. - R\. Wu and E\. Keogh \(2024a\)Deep learning time series anomaly detection algorithms are brutally sensitive to concept drift\.Note:[https://lnkd\.in/g7qWVTpS](https://lnkd.in/g7qWVTpS)Cited by:[§4\.5](https://arxiv.org/html/2605.26193#S4.SS5.p1.1)\. - R\. Wu and E\. Keogh \(2024b\)The fundamental problem in tsad research\.Note:[https://lnkd\.in/gP\-H8w4i](https://lnkd.in/gP-H8w4i)Cited by:[§B\.1](https://arxiv.org/html/2605.26193#A2.SS1.p2.1),[§1](https://arxiv.org/html/2605.26193#S1.p5.1),[§4\.1\.1](https://arxiv.org/html/2605.26193#S4.SS1.SSS1.p1.1),[§4\.1\.2](https://arxiv.org/html/2605.26193#S4.SS1.SSS2.p2.1),[§4\.5](https://arxiv.org/html/2605.26193#S4.SS5.p1.1)\. - X\. Wu, X\. Qiu, Z\. Li, Y\. Wang, J\. Hu, C\. Guo, H\. Xiong, and B\. 
Yang \(2025\)CATCH: channel\-aware multivariate time series anomaly detection via frequency patching\.InThe 13th International Conference on Learning Representations,Cited by:[§3\.3\.2](https://arxiv.org/html/2605.26193#S3.SS3.SSS2.p1.1),[§3\.3\.2](https://arxiv.org/html/2605.26193#S3.SS3.SSS2.p2.1),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.13.13.2)\. - H\. Xu, Y\. Wang, S\. Jian, Q\. Liao, Y\. Wang, and G\. Pang \(2024a\)Calibrated one\-class classification for unsupervised time series anomaly detection\.IEEE Transactions on Knowledge and Data Engineering,pp\. 1–14\.External Links:ISSN 1041\-4347,[Document](https://dx.doi.org/10.1109/TKDE.2024.3393996),LCCN 2Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§1](https://arxiv.org/html/2605.26193#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.26193#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2605.26193#S2.SS2.p1.1),[§3\.3\.1](https://arxiv.org/html/2605.26193#S3.SS3.SSS1.p3.1),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.10.10.2)\. - J\. Xu, H\. Wu, J\. Wang, and M\. Long \(2022\)Anomaly transformer: time series anomaly detection with association discrepancy\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=LzQQ89U1qm_)Cited by:[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.19.19.2)\. - Z\. Xu, A\. Zeng, and Q\. Xu \(2024b\)FITS: modeling time series with $10k$ parameters\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=bWcnvZ3qMb)Cited by:[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.17.17.2)\. - T\. Yairi, Y\. Kato, and K\. Hori \(2001\)Fault detection by mining association rules from house\-keeping data\.Inproceedings of the 6th International Symposium on Artificial Intelligence, Robotics and Automation in Space,Vol\.18,pp\. 21\.Cited by:[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.24.24.2)\. - Y\. Yang, C\. 
Zhang, T\. Zhou, Q\. Wen, and L\. Sun \(2023\)DCdetector: dual attention contrastive representation learning for time series anomaly detection\.InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,KDD ’23,New York, NY, USA,pp\. 3033–3045\.External Links:ISBN 9798400701030,[Link](https://doi.org/10.1145/3580305.3599295),[Document](https://dx.doi.org/10.1145/3580305.3599295)Cited by:[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.18.18.2)\. - X\. Yao, Z\. Chen, C\. Gao, G\. Zhai, and C\. Zhang \(2024\)ResAD: a simple framework for class generalizable anomaly detection\.InProceedings of the 38th International Conference on Neural Information Processing Systems,NIPS ’24,Red Hook, NY, USA\.External Links:ISBN 9798331314385Cited by:[§3\.5](https://arxiv.org/html/2605.26193#S3.SS5.p1.1)\. - X\. Yao, Y\. Luo, Z\. Qian, and C\. Zhang \(2025\)ADPretrain: advancing industrial anomaly detection via anomaly representation pretraining\.InProceedings of the 39th International Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=mHfpziOtTW)Cited by:[§3\.5](https://arxiv.org/html/2605.26193#S3.SS5.p1.1)\. - X\. Yao, C\. Zhang, R\. Li, J\. Sun, and Z\. Liu \(2023\)One\-for\-all: proposal masked cross\-class anomaly detection\.Proceedings of the AAAI Conference on Artificial Intelligence37\(4\),pp\. 4792–4800\.External Links:ISSN 2374\-3468,[Document](https://dx.doi.org/10.1609/aaai.v37i4.25604)Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p3.1)\. - Z\. Zamanzadeh Darban, G\. I\. Webb, S\. Pan, C\. Aggarwal, and M\. Salehi \(2025\)Deep learning for time series anomaly detection: a survey\.ACM Computing Surveys57\(1\),pp\. 
1–42\.External Links:ISSN 0360\-0300,[Document](https://dx.doi.org/10.1145/3691338),LCCN 1Cited by:[§1](https://arxiv.org/html/2605.26193#S1.p1.1),[§3\.1](https://arxiv.org/html/2605.26193#S3.SS1.p1.14),[§3\.3\.1](https://arxiv.org/html/2605.26193#S3.SS3.SSS1.p2.2)\. - A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu \(2023\)Are transformers effective for time series forecasting?\.Proceedings of the AAAI Conference on Artificial Intelligence37\(9\),pp\. 11121–11128\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/26317),[Document](https://dx.doi.org/10.1609/aaai.v37i9.26317)Cited by:[Appendix J](https://arxiv.org/html/2605.26193#A10.p2.1)\. - C\. Zhang, T\. Zhou, Q\. Wen, and L\. Sun \(2022\)TFAD: a decomposition time series anomaly detection architecture with time\-frequency analysis\.InProceedings of the 31st ACM International Conference on Information & Knowledge Management,pp\. 2497–2507\.External Links:[Document](https://dx.doi.org/10.1145/3511808.3557470),ISBN 978\-1\-4503\-9236\-5Cited by:[§3\.3\.2](https://arxiv.org/html/2605.26193#S3.SS3.SSS2.p1.1),[§3\.3\.2](https://arxiv.org/html/2605.26193#S3.SS3.SSS2.p2.1),[§4\.3](https://arxiv.org/html/2605.26193#S4.SS3.tab1.10.1.12.12.2)\.

Figure 7. Illustration of the four types of distortions. The red segments represent the distorted parts.

## Appendix A Simulated Anomalies

We simulate anomalies by distorting a portion of the input time series window $\mathbf{X}$.
In detail, we randomly select an interval $[t'_1, t'_2] \subset [t_1, t_2]$ in the input window $\mathbf{X} = \{x_{t_1} \ldots x_{t_2}\}$, and replace the values $\mathbf{X}'_{[t'_1, t'_2]} = \{x_{t'_1} \ldots x_{t'_2}\}$ with one of the following anomalies (see Figure [7](https://arxiv.org/html/2605.26193#A0.F7)):

- Uniform Replacement: The original values are replaced with a constant sequence whose value lies in the range $\{\min(\mathbf{X}'), \max(\mathbf{X}')\}$.
- Mirror Flip: The original values are flipped across the x-axis or the y-axis.
- Length Scale: The original sequence is substituted with a lengthened or shortened version of itself.
- Jittering: Random noise sampled from a normal distribution $\mathcal{N}(0, 0.1\boldsymbol{I})$ is added to the original values.

## Appendix B Datasets and Metrics

### B.1. Datasets

The KDD21 benchmark (Wu and Keogh, [2021a](https://arxiv.org/html/2605.26193#bib.bib891)), also known as the UCR Anomaly Archive, is widely recognized as the highest-quality benchmark for time series anomaly detection (Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051); Sun et al., [2024](https://arxiv.org/html/2605.26193#bib.bib812); Lu et al., [2022](https://arxiv.org/html/2605.26193#bib.bib632); Si et al., [2024](https://arxiv.org/html/2605.26193#bib.bib1067)). It comprises 250 datasets drawn from diverse domains such as healthcare, sports, industry, and robotics. Notably, the anomalies in these datasets are challenging to detect and cannot be easily found by a "one-liner" approach (Wu and Keogh, [2021a](https://arxiv.org/html/2605.26193#bib.bib891)).
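Returning briefly to Appendix A, the four distortions can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. All function names are hypothetical, and several details are assumptions: how the interval is sampled, drawing the constant uniformly between the segment's minimum and maximum, reading the x-axis flip as a reflection about the segment mean, the scaling factors (0.5 or 2) together with the tiling or cropping that keeps the segment length, and reading $\mathcal{N}(0, 0.1\boldsymbol{I})$ as a variance of 0.1.

```python
import math
import random


def uniform_replacement(seg, rng):
    # Constant segment; drawing the constant uniformly between the segment's
    # min and max is an assumption (the text only gives the range).
    value = rng.uniform(min(seg), max(seg))
    return [value] * len(seg)


def mirror_flip(seg, rng):
    # y-axis flip: reverse the segment in time.
    # x-axis flip: reflect the values, here about the segment mean (assumption).
    if rng.random() < 0.5:
        return seg[::-1]
    mean = sum(seg) / len(seg)
    return [2 * mean - v for v in seg]


def _resample(seg, m):
    # Linear interpolation of seg onto m evenly spaced positions (m >= 2).
    n = len(seg)
    out = []
    for i in range(m):
        pos = i * (n - 1) / (m - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(seg[lo] * (1 - frac) + seg[hi] * frac)
    return out


def length_scale(seg, rng):
    # Stretch or compress the segment, then crop or tile it back to its
    # original length (the factors and the crop/tile rule are assumptions).
    n = len(seg)
    factor = rng.choice([0.5, 2.0])
    m = max(2, round(n * factor))
    scaled = _resample(seg, m)
    repeats = -(-n // m)  # ceiling division
    return (scaled * repeats)[:n]


def jitter(seg, rng):
    # Additive Gaussian noise with variance 0.1, i.e. std = sqrt(0.1).
    std = math.sqrt(0.1)
    return [v + rng.gauss(0.0, std) for v in seg]


DISTORTIONS = [uniform_replacement, mirror_flip, length_scale, jitter]


def simulate_anomaly(window, rng=None):
    # Pick a random interval of at least two points, distort it, and return
    # the distorted window together with point-wise labels (1 = distorted).
    rng = rng or random.Random()
    n = len(window)
    start = rng.randrange(0, n - 1)
    end = rng.randrange(start + 1, n)  # inclusive end index
    distort = rng.choice(DISTORTIONS)
    out = list(window)
    out[start:end + 1] = distort(window[start:end + 1], rng)
    labels = [1 if start <= i <= end else 0 for i in range(n)]
    return out, labels
```

The point-wise labels returned by `simulate_anomaly` are the kind of supervision a classification branch could train on, while every point outside the interval is left untouched.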
In addition, the KDD21 benchmark provides a document explaining why certain regions are labeled as anomalies, which further enhances the credibility and transparency of the dataset.

The TSB-AD benchmark (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050)) is currently the largest available collection of time series anomaly detection datasets. It curates and manually cleans datasets from various sources. However, it overlaps significantly with the KDD21 benchmark, and several subsets, such as NAB, WSD, YAHOO, and Stock, still suffer from issues including mislabeled ground truth, trivial anomalies, and unrealistic anomaly densities (Wu and Keogh, [2024b](https://arxiv.org/html/2605.26193#bib.bib1072)). Therefore, we exclude the overlapping parts and follow the dataset quality criteria outlined in prior research (Wu and Keogh, [2021a](https://arxiv.org/html/2605.26193#bib.bib891); Lu et al., [2023](https://arxiv.org/html/2605.26193#bib.bib631); Schmidl et al., [2022](https://arxiv.org/html/2605.26193#bib.bib758)) to select several high-quality subsets from TSB-AD: MGAB, SED, SVDB, IOPS, and TODS.

The statistics of the datasets are summarized in Table [3](https://arxiv.org/html/2605.26193#A2.T3). The anomaly ratio is the total number of anomalous points divided by the total number of test points.

Table 3. Statistics of benchmarks.

### B.2. Metrics

Recent studies (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050); Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051); Kim et al., [2022](https://arxiv.org/html/2605.26193#bib.bib506)) indicate that widely used evaluation metrics, such as point-adjusted F1 (F1-PA) and F1-Affiliation, tend to significantly overestimate model performance, even assigning high scores to randomly generated predictions.
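To make the overestimation concrete, the sketch below implements the commonly used point-adjustment protocol (if any point inside a labeled anomaly segment is flagged, the whole segment is counted as detected) and compares standard F1 with F1-PA on a toy example. The function names and the toy data are illustrative and do not come from the paper.

```python
def f1_score(labels, preds):
    # Point-wise F1 from binary labels and binary predictions.
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def point_adjust(labels, preds):
    # If any point of a contiguous anomaly segment is predicted as anomalous,
    # mark every point of that segment as predicted.
    adjusted = list(preds)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] == 1:
            j = i
            while j < n and labels[j] == 1:
                j += 1
            if any(preds[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted


# Toy example: one 100-point anomaly, and a detector that flags a single point.
labels = [0] * 1000
for i in range(400, 500):
    labels[i] = 1
preds = [0] * 1000
preds[450] = 1

standard_f1 = f1_score(labels, preds)                       # 2/101, about 0.02
adjusted_f1 = f1_score(labels, point_adjust(labels, preds))  # 1.0
```

A single lucky hit inside a long segment lifts the score from about 0.02 to 1.0, which is why a detector that fires at random can still look strong under F1-PA when anomaly segments are long.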
In addition, anomaly detection datasets are highly class-imbalanced, with normal points far outnumbering anomalous points. The AUC-ROC (Area Under the Receiver Operating Characteristic Curve) metric is biased towards the majority class, leading to inflated evaluation scores (Davis and Goadrich, [2006](https://arxiv.org/html/2605.26193#bib.bib312)). AUC-PR (Area Under the Precision-Recall Curve) has been advocated as a more informative alternative for imbalanced datasets (Lobo, [2010](https://arxiv.org/html/2605.26193#bib.bib1069)). Consistent with these observations, a recent benchmark evaluation paper has demonstrated that Standard-F1, AUC-PR (Davis and Goadrich, [2006](https://arxiv.org/html/2605.26193#bib.bib312)), Range-AUC-PR (Paparrizos et al., [2022a](https://arxiv.org/html/2605.26193#bib.bib702)) and VUS-PR (Paparrizos et al., [2022a](https://arxiv.org/html/2605.26193#bib.bib702)) are the most reliable and accurate metrics for assessing model performance. The details of the evaluation metrics are as follows:

- Standard-F1 is computed directly from the obtained anomaly scores without any post-processing. To reduce the influence of threshold selection on evaluation results and maintain consistency with prior works (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050); Garg et al., [2022](https://arxiv.org/html/2605.26193#bib.bib399); Kim et al., [2022](https://arxiv.org/html/2605.26193#bib.bib506); Li et al., [2023](https://arxiv.org/html/2605.26193#bib.bib558)), we report the maximum F1-score across all possible thresholds.
- AUC-PR (Davis and Goadrich, [2006](https://arxiv.org/html/2605.26193#bib.bib312)) measures the area under the precision-recall curve, providing a threshold-independent evaluation of model performance.
- Range-AUC-PR (Paparrizos et al., [2022a](https://arxiv.org/html/2605.26193#bib.bib702)) addresses labeling uncertainty and anomaly scoring delays by introducing a buffer zone around the boundaries of labeled anomalies when calculating AUC-PR, thereby giving some credit to high anomaly scores in the vicinity of the anomaly boundaries. Following the original study, we set the buffer length to the average anomaly length for each dataset.
- VUS-PR (Paparrizos et al., [2022a](https://arxiv.org/html/2605.26193#bib.bib702)) further resolves the buffer length selection issue in Range-AUC-PR by calculating the volume under the surface formed by varying buffer lengths and thresholds, thus providing a more robust evaluation metric.

## Appendix C Implementation Details

Our model and all baselines follow identical data preprocessing procedures within a unified pipeline. All experiments are conducted on an Intel i9-12900K CPU and a single NVIDIA RTX 3090 GPU.

### C.1. Implementation Details of CoAD

CoAD uses a Gated Recurrent Unit (GRU) (Chung et al., [2014](https://arxiv.org/html/2605.26193#bib.bib1071)) as the encoder, with a hidden size of 24 and 3 recurrent layers. The input window size $\mathtt{T}$ is set to 4 times the dominant period of the time series, while the patch size $\mathtt{P}$ is fixed to 8. The dominant period is automatically determined using the autocorrelation function (ACF). The loss weight $\lambda$ is set to 10, and the number of frequency bands $\mathtt{K}$ for STFT is set to 4. We use the Adam optimizer with a learning rate of 0.002. We train our model for 300 epochs on all datasets. A detailed analysis of the hyperparameters is presented in Appendix [J](https://arxiv.org/html/2605.26193#A10).

### C.2. Implementation Details of Baselines

For deep learning-based methods, we reproduce all models using their official open-source repositories.
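As an aside on the window-size rule in C.1 (window size equal to 4 times the dominant period, with the period found from the ACF), the following sketch shows one way such a rule could be realized. The paper does not spell out how the period is read off the ACF, so picking the highest local maximum at a non-zero lag is an assumption, as are the function names and the default `max_lag`.

```python
import math


def autocorrelation(series, max_lag):
    # Biased sample autocorrelation for lags 0..max_lag.
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return [0.0] * (max_lag + 1)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]


def dominant_period(series, max_lag=None):
    # Lag of the highest local ACF peak, ignoring lag 0 (assumed rule).
    # Returns None when the ACF has no local peak, e.g. for a constant series.
    if max_lag is None:
        max_lag = len(series) // 2
    acf = autocorrelation(series, max_lag)
    peaks = [
        lag for lag in range(1, max_lag)
        if acf[lag - 1] < acf[lag] >= acf[lag + 1]
    ]
    if not peaks:
        return None
    return max(peaks, key=lambda lag: acf[lag])


# A sine wave with a period of 25 samples.
series = [math.sin(2 * math.pi * t / 25) for t in range(500)]
period = dominant_period(series)
window_size = 4 * period  # window-size rule from C.1
```

For real data, a fallback window size would be needed whenever `dominant_period` returns `None`; the paper does not say how such cases are handled.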
We strictly follow the original training and test splits provided by the KDD21 and TSB-AD benchmarks and employ early stopping for model selection. Each deep learning-based method is trained 5 times with different random seeds, and the averaged performance is reported. For data mining-based methods, we include the 7 best-performing approaches identified in recent evaluation studies (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050)). All implementations are based on the TSB-UAD (Paparrizos et al., [2022b](https://arxiv.org/html/2605.26193#bib.bib701)) and TSB-AD (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050)) libraries. The hyperparameters for these methods are set according to the TSB-AD (Liu and Paparrizos, [2024](https://arxiv.org/html/2605.26193#bib.bib1050)) paper, where they are tuned on a large-scale validation set.

Table 4. The evaluation results (%) on the KDD21 dataset using Top-$k$ accuracies (Acc.@$k$). The best results are in bold, and the second-best results are underlined.

## Appendix D Evaluation Results on the KDD21 Benchmark Using Top-$k$ Accuracy Scores

Settings. The KDD21 benchmark originates from the SIGKDD Cup 2021 competition, and we also adopt the official evaluation metric provided by the competition organizer (Wu and Keogh, [2021b](https://arxiv.org/html/2605.26193#bib.bib1077)). Each of the 250 datasets in the KDD21 benchmark contains only a single anomaly. For each algorithm, the point with the highest anomaly score (Top-1) is selected as the predicted anomaly. A prediction is considered correct if it lies within $\pm 100$ data points of the true anomaly range; otherwise, it is marked incorrect. The overall accuracy is computed as the mean accuracy across all datasets.
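The Top-1 protocol described above can be illustrated with a short sketch. This is only a minimal reading of the text, not the competition's official scorer: the function names, the inclusive treatment of the anomaly range boundaries, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def top1_hit(scores, anomaly_start, anomaly_end, tolerance=100):
    """True if the highest-scoring index lies within `tolerance` points
    of the labeled range [anomaly_start, anomaly_end] (inclusive; assumed)."""
    predicted = int(np.argmax(scores))
    return anomaly_start - tolerance <= predicted <= anomaly_end + tolerance

def top1_accuracy(results):
    """Mean Top-1 accuracy over (scores, start, end) triples, one per dataset."""
    hits = [top1_hit(s, a, b) for s, a, b in results]
    return float(np.mean(hits))

# Toy example with two synthetic score series (hypothetical data).
scores_a = np.zeros(1000)
scores_a[520] = 1.0   # peak 20 points after the labeled range: counted as a hit
scores_b = np.zeros(1000)
scores_b[900] = 1.0   # peak far from the labeled range: counted as a miss
acc = top1_accuracy([(scores_a, 400, 500), (scores_b, 100, 150)])
```

With one hit and one miss, the toy accuracy is 0.5.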
In addition to Top-1 accuracy, we also report Top-3 and Top-5 accuracies to account for scenarios where multiple plausible anomaly locations may exist in real-world time series.

Results. Table [4](https://arxiv.org/html/2605.26193#A3.SS2) summarizes the evaluation results on the KDD21 benchmark in terms of Top-$k$ accuracy (Acc.@$k$). The empirical data consistently demonstrate the superior performance of the proposed CoAD framework.

## Appendix E Efficiency Experiments on the TSB-AD Benchmark

### E.1. Experiment Settings

We comprehensively compare the detection performance, inference speed, and parameter count of CoAD against baseline methods. Following existing works (Tang et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1051); Lu et al., [2022](https://arxiv.org/html/2605.26193#bib.bib632)), we report the total inference time across all subsets as the overall inference time. To ensure a fair comparison, all deep learning-based models are evaluated with the same batch size of 128 and the window size that yields the highest VUS-PR score, and are executed on the same NVIDIA RTX 3090 GPU. All data mining-based methods are tested on the same Intel i9-12900K CPU, as they do not support GPU parallel processing.

Figure 8. Model efficiency comparison.

Figure 9. Qualitative ablation results. The lower parts of (c) and (d) represent the probability-informed soft masks.

### E.2. Efficiency Results

The model efficiency comparison on the TSB-AD dataset is shown in Figure [8](https://arxiv.org/html/2605.26193#A5.F8). Our proposed framework CoAD outperforms the state-of-the-art methods in terms of both efficiency and performance. While MMA and KmeansAD demonstrate performance levels comparable to CoAD, CoAD achieves inference speeds that are several orders of magnitude faster.
Specifically, CoAD completes the inference on all subsets of the TSB-AD benchmark, comprising more than 5.69 million data points, in just 37.55 seconds.

## Appendix F Time Complexity Analysis

The computational complexity of CoAD primarily arises from its classification and reconstruction modules. In the classification module, the frequency branch involves STFT computation with time complexity $O(\mathtt{T})$. The GRU encoders in both the frequency and time branches have a time complexity of $O\left(\frac{\mathtt{T}}{\mathtt{P}} \cdot \mathtt{H}^{2}\right)$. The reconstruction module's GRU encoder also has a time complexity of $O\left(\frac{\mathtt{T}}{\mathtt{P}} \cdot \mathtt{H}^{2}\right)$. Therefore, the overall time complexity of CoAD is $O\left(\mathtt{T} + \frac{\mathtt{T}}{\mathtt{P}} \cdot \mathtt{H}^{2}\right)$. This indicates that the computational cost scales linearly with the input sequence length $\mathtt{T}$, while the patching design reduces the number of recurrent steps by a factor of the patch length $\mathtt{P}$, thereby substantially reducing computational overhead.

## Appendix G Detailed Ablation Settings

The implementation details of ablation variants are outlined below:

1. i) OE and MAE variants working alone:
   - OE (Time) takes only the time domain features for classification.
   - OE (Time + Frequency) employs the maximum fusion strategy to combine classification results from both the time and frequency domains.
   - MAE (Random/Grating) applies either the random or grating masking strategy.
2. ii) Different cooperative strategies:
   - OE+MAE (Random/Grating) calculates the anomaly score by summing the anomaly probability produced by OE and the reconstruction error generated by MAE (Random/Grating).
   - OE+MAE (Guide w/ Hard Mask) integrates OE's guidance using discrete hard masks, where patches with anomaly probabilities exceeding a certain threshold are masked.
The threshold is determined as the mean plus three standard deviations of the anomaly probabilities across all patches in the training set.
   - OE (Time) + MAE (Guide w/ Soft Mask) is softly guided by the OE module; however, the OE module performs classification based solely on time-domain features.
3. iii) Cooperation under different design choices:
   - OE (Step Level)+MAE adopts step-level classification granularity similar to AnomalyBERT (Jeong et al., [2023](https://arxiv.org/html/2605.26193#bib.bib479)). The mean anomaly probability across all points within a patch is used as the patch's anomaly probability for soft masking.
   - OE (Window Level)+MAE utilizes window-level classification granularity similar to CutAddPaste (Wang et al., [2024a](https://arxiv.org/html/2605.26193#bib.bib849)). A sliding window with a stride of 1 is applied to obtain the anomaly probability for each point. The soft masking process is then performed as in OE (Step Level)+MAE.
   - OE (Feature Add)+MAE employs a feature-level fusion strategy, where features from both domains are directly added and then fed into the classifier.
   - OE (Feature Gated)+MAE also uses a feature-level fusion strategy, combining features from both domains through a learnable gating mechanism (Arevalo et al., [2017](https://arxiv.org/html/2605.26193#bib.bib1074)).
   - OE (Decision Mean)+MAE applies a decision-level fusion strategy, averaging the classification scores from both branches.
   - OE+MAE (Guide w/o Score) takes the same guided soft masking strategy as CoAD, but relies solely on MAE's reconstruction error for final anomaly scoring.

Figure 10. Comparison of CoAD and Pure OE in detecting unseen anomalies.
The experiments are conducted under cross-type settings (e.g., trained without the Uniform Replacement distortion type and tested on Uniform Replacement anomalies).

(a) Parameter analysis on the KDD21 benchmark with respect to the window size, patch size, loss weight $\lambda$, and backbone choice.

(b) Parameter analysis on the TSB-AD benchmark with respect to the window size, patch size, loss weight $\lambda$, and backbone choice.

Figure 11. Hyperparameter study on (a) KDD21 and (b) TSB-AD datasets.

## Appendix H Qualitative Ablation Results

We further present qualitative ablation results in Figure [9](https://arxiv.org/html/2605.26193#A5.F9), providing an intuitive understanding of the contribution and effectiveness of each component in CoAD. Figure [9](https://arxiv.org/html/2605.26193#A5.F9)(a) clearly shows that incorporating frequency-branch classification results helps identify anomalies that are difficult to detect in the time domain alone. Figure [9](https://arxiv.org/html/2605.26193#A5.F9)(b) confirms that joint anomaly inference from both the classification and reconstruction modules delivers more comprehensive detection results than relying solely on either module individually. Figure [9](https://arxiv.org/html/2605.26193#A5.F9)(c) and (d) highlight the superiority of our proposed guided soft masking strategy over existing random and grating masking methods. The random masking approach generates high reconstruction errors even in normal regions, leading to severe false alarms, while the grating masking method tends to overfit anomalies, resulting in false negatives. Our proposed probability-informed soft masking suppresses anomaly-related cues and retains normal information, enabling accurate reconstruction in normal regions and higher reconstruction errors in anomalous regions, thereby maximizing the reconstruction module's performance.
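For contrast with the soft masks discussed above, the hard-mask threshold rule from Appendix G (mean plus three standard deviations of the training-set patch probabilities) can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the use of the population standard deviation, and the toy probabilities are assumptions.

```python
import numpy as np

def hard_mask_threshold(train_probs):
    """Mean plus three standard deviations of training-set patch probabilities.
    Uses the population standard deviation (ddof=0), which is an assumption."""
    train_probs = np.asarray(train_probs, dtype=float)
    return train_probs.mean() + 3.0 * train_probs.std()

def hard_mask(patch_probs, threshold):
    """Boolean mask; True marks patches whose probability exceeds the threshold."""
    return np.asarray(patch_probs, dtype=float) > threshold

# Hypothetical patch-level anomaly probabilities from an OE classifier.
train_probs = np.array([0.02, 0.03, 0.01, 0.04, 0.02, 0.03, 0.02, 0.03])
tau = hard_mask_threshold(train_probs)
mask = hard_mask([0.02, 0.90, 0.05, 0.03], tau)
```

In this toy case the threshold is roughly 0.051, so only the second test patch is masked. A hard mask of this kind is binary, whereas the soft mask weights each patch by its anomaly probability.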
## Appendix I Generalizability Verification

We conduct experiments to evaluate the generalizability of CoAD in comparison with the pure OE-based method. CoAD incorporates anomaly scores computed from both the Time-Frequency ensemble classification and the Residual classification modules, whereas the pure OE-based method computes anomaly scores solely from the Time-Frequency ensemble classification module. Specifically, one of the four simulated anomaly types in Figure [7](https://arxiv.org/html/2605.26193#A0.F7) is excluded during training and then reintroduced in the testing set to replace the original anomalies, enabling the assessment of each model's ability to detect entirely novel anomaly types. We report results using the Standard-F1 and AUC-PR metrics, as the remaining metrics exhibit consistent trends. The evaluation results are presented in Figure [10](https://arxiv.org/html/2605.26193#A7.F10). The pure OE-based method shows unstable performance when encountering unseen anomaly types, with particularly poor results on Mirror Flip and Length Scale in the TSB-AD benchmark. In contrast, CoAD demonstrates consistently robust performance across all test scenarios, indicating that its cooperative design effectively enhances generalization and enables reliable detection of previously unseen anomaly types.

## Appendix J Hyperparameter Study

We conduct a comprehensive parameter study to investigate the sensitivity of CoAD to key hyperparameters, including the input window size $\mathtt{T}$, patch size $\mathtt{P}$, and loss weight $\lambda$. The results are presented in Figure [11](https://arxiv.org/html/2605.26193#A7.F11). The following observations can be made: 1) Increasing the input window size generally improves model performance. This is because a larger window enables the model to leverage richer contextual information to facilitate anomaly detection.
However, when the window size exceeds 4 times the dominant period, the performance improvement becomes marginal while introducing additional computational overhead. 2) The performance of CoAD initially improves with increasing patch size, as larger patches provide more comprehensive local context for both classification and reconstruction. However, excessively large patches may cause over-smoothing and the loss of fine-grained details (Tang and Zhang, [2025b](https://arxiv.org/html/2605.26193#bib.bib1075)), leading to a slight decline in performance. 3) Since the classification loss and reconstruction loss are on different scales, the loss weight $\lambda$ is critical to balance the learning process. In practice, setting $\lambda$ between 5 and 10 generally yields optimal performance.

We further investigate different backbone choices with results shown on the right side of Figure [11](https://arxiv.org/html/2605.26193#A7.F11). CoAD with GRU (Chung et al., [2014](https://arxiv.org/html/2605.26193#bib.bib1071)) as the backbone achieves significantly better performance than with Transformer (Vaswani et al., [2017](https://arxiv.org/html/2605.26193#bib.bib1084)). This is consistent with prior findings (Zeng et al., [2023](https://arxiv.org/html/2605.26193#bib.bib1085); Tang and Zhang, [2025a](https://arxiv.org/html/2605.26193#bib.bib1086)) that the self-attention mechanism inevitably leads to a loss of temporal information in time series data (i.e., the permutation-invariant and anti-order characteristics of self-attention), thereby impairing anomaly detection performance.
Meanwhile, recent studies also demonstrate that the Patch + RNN architecture is more effective in time series modeling (Kong et al., [2025](https://arxiv.org/html/2605.26193#bib.bib1087); Lin et al., [2023](https://arxiv.org/html/2605.26193#bib.bib1088)), as the inherent sequential structure of RNN models enables them to effectively capture the crucial temporal dependencies, and the patching design significantly improves the long-term temporal modeling ability and computational efficiency for RNNs.
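The window-size rule used in this study (four times the dominant period, with the period taken from the ACF) can be sketched as below. The paper does not spell out how the period is read off the ACF, so picking the lag of the highest local ACF peak is an assumption here, as are the function name `dominant_period`, the `min_lag` parameter, and the synthetic sine input.

```python
import numpy as np

def dominant_period(x, min_lag=2):
    """Estimate the dominant period as the lag of the highest local ACF peak.
    Returns None when no local peak is found."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Autocorrelation at non-negative lags, normalized by the lag-0 value.
    acf = np.correlate(x, x, mode="full")[n - 1:]
    acf = acf / acf[0]
    # Local maxima of the ACF at lags 1 .. n-2, restricted to lags >= min_lag.
    lags = np.arange(1, n - 1)
    is_peak = (acf[1:-1] > acf[:-2]) & (acf[1:-1] > acf[2:])
    peaks = lags[is_peak & (lags >= min_lag)]
    if len(peaks) == 0:
        return None
    return int(peaks[np.argmax(acf[peaks])])

# Synthetic series with a known period of 50 samples (hypothetical data).
t = np.arange(1000)
series = np.sin(2 * np.pi * t / 50)
period = dominant_period(series)
window_size = 4 * period   # window size T = 4 x dominant period
patch_size = 8             # patch size P as fixed in Appendix C.1
num_patches = window_size // patch_size
```

For the synthetic sine, the estimated period is 50, giving a window of 200 points split into 25 patches. On noisy real-world series, a more robust peak selection would likely be needed.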