TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

arXiv cs.LG 06/18/26, 04:00 AM Papers
time-series-forecasting benchmark robustness fault-injection evaluation structural-faults machine-learning
Summary
This paper introduces TS-Fault, a benchmark for evaluating time series forecasting models under structured fault scenarios like broken dependencies and regime changes, finding that clean-data accuracy often anti-correlates with robustness and that foundation models are especially fragile.
arXiv:2606.18539v1 Announce Type: new
# Benchmarking Time Series Forecasters Against Structural Faults
Source: [https://arxiv.org/html/2606.18539](https://arxiv.org/html/2606.18539)
###### Abstract

Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under the implicit assumption that it predicts deployed reliability. However, real faults are not i.i.d. noise but structured events with temporal shape, broken cross-variable dependencies, regime change coupled with missingness, and causal propagation across a sensing pipeline. Treating TSF robustness as a data-quality problem, we present TS-Fault, a benchmark that evaluates forecasting models under explicit, parameterized fault scenarios with controllable semantic difficulty. TS-Fault organizes recurring failures into four modes along two orthogonal axes (observation- vs. mechanism-level; univariate vs. multivariate) and injects each fault into the most prediction-critical window via a unified importance score. This design enables robustness to be tested against the structures models actually rely on, rather than reduced to generic noise sensitivity. We evaluate 21 models across 6 datasets, 4 modes, and 5 difficulty levels under a paired clean/corrupt protocol. The results reveal three findings that contradict common leaderboard intuition: *(i)* clean-data accuracy anti-correlates with robustness; *(ii)* clean rankings are preserved under observation-level faults but reshuffled under mechanism-level faults; and *(iii)* all catastrophic failures occur under mechanism-level faults, with foundation models achieving the highest clean-data accuracy yet exhibiting the greatest fragility. The code is publicly available at [https://github.com/Ray-zyy/TS-Fault](https://github.com/Ray-zyy/TS-Fault).

## I Introduction

The proliferation of edge computing and mobile sensing generates a massive volume of time series data, which is collected and stored in time series database systems [[34](https://arxiv.org/html/2606.18539#bib.bib111), [80](https://arxiv.org/html/2606.18539#bib.bib112)], motivating various real-world applications [[71](https://arxiv.org/html/2606.18539#bib.bib6), [85](https://arxiv.org/html/2606.18539#bib.bib97)], e.g., time series forecasting [[59](https://arxiv.org/html/2606.18539#bib.bib19)]. Forecasting is rarely an end in itself. In energy dispatch [[30](https://arxiv.org/html/2606.18539#bib.bib1), [56](https://arxiv.org/html/2606.18539#bib.bib2)], clinical monitoring [[69](https://arxiv.org/html/2606.18539#bib.bib4), [54](https://arxiv.org/html/2606.18539#bib.bib5)], financial risk control [[66](https://arxiv.org/html/2606.18539#bib.bib3)], and traffic management [[71](https://arxiv.org/html/2606.18539#bib.bib6)], a forecast is an input to a consequential downstream action, whose reliability rests on an assumption that current evaluation rarely tests: that the error a model exhibits during evaluation is representative of the error it will produce after deployment. Mainstream TSF evaluation has crystallized this assumption. For over two decades, progress in long-term TSF has been measured by a single family of quantities, i.e., average MSE/MAE on clean, complete, evenly sampled held-out series, and successive benchmark generations (M3/M4 [[49](https://arxiv.org/html/2606.18539#bib.bib12), [50](https://arxiv.org/html/2606.18539#bib.bib13)], Monash [[24](https://arxiv.org/html/2606.18539#bib.bib14)], GIFT-Eval [[1](https://arxiv.org/html/2606.18539#bib.bib18)], TFB [[59](https://arxiv.org/html/2606.18539#bib.bib19)]) have largely refined how this quantity is measured rather than questioning whether it reflects deployment risk.
In short, existing leaderboards answer “which model is most accurate on clean data?” whereas deployment asks a different question: “under what conditions, and how severely, will a model fail?”

These two questions are not equivalent, and the gap is large enough to invert model\-selection decisions\. In our study, the model with the second\-lowest error on clean data collapses to the worst\-performing model under structured faults, while several models with only middling performance on clean data emerge as the most robust\. Selecting a forecaster by MSE on clean data, the prevailing practice, can therefore systematically favor the model that is most brittle in deployment, and the current evaluation paradigm is unable to reveal this in advance\.

The gap persists in part because existing robustness studies model a fault as a value-level deviation from otherwise nominal observations: additive Gaussian noise, random masking [[13](https://arxiv.org/html/2606.18539#bib.bib26), [19](https://arxiv.org/html/2606.18539#bib.bib28)], or $\epsilon$-bounded adversarial perturbations [[26](https://arxiv.org/html/2606.18539#bib.bib104), [48](https://arxiv.org/html/2606.18539#bib.bib67)]. These deviations typically assume that corruptions are independent across time and across variables, and approximately homogeneous in distribution. As a result, they cannot express the faults that dominate real deployments, which often violate these assumptions. A frozen sensor does not emit white noise; it emits an event with onset, peak, and decay, possibly displaced in time by a buffering pipeline. A market shock rewrites the lead–lag and gain structure among assets rather than perturbing each one independently [[20](https://arxiv.org/html/2606.18539#bib.bib71)], even while each series, viewed alone, still looks plausible. When a grid enters emergency operation, its monitoring drops samples in state-dependent, block-structured runs concentrated around the critical transition, not at random [[9](https://arxiv.org/html/2606.18539#bib.bib7)]. And upstream sensor drift propagates along causal dependencies instead of staying local, dragging downstream true states with it and inducing secondary observation failures. None of these mechanisms can be faithfully expressed by i.i.d. perturbations of a clean signal, no matter how the variance or masking rate is changed. Evaluating TSF reliability should therefore be viewed as a data-quality problem involving structured dirty data, rather than as a question of sensitivity to additive noise.

We argue that closing this gap requires changing the object of evaluation, not merely adding harder samples\. We present TS\-Fault, a benchmark that evaluates time series forecasters under explicit, parameterized fault scenarios rather than clean test inputs alone\. Each scenario specifies what type of fault occurs, where it occurs, how severe it is, and which temporal or cross\-variable structures it disrupts\. This makes the fault itself a first\-class object of evaluation, enabling model degradation to be interpreted in terms of named failure mechanisms rather than reported only as an aggregate error increase\. We organize recurring real\-world failures into four fault modes along two orthogonal axes: whether a fault corrupts observations \(a transient event\) or alters the data\-generating mechanism \(a persistent regime\), and whether it acts on a single series or on the cross\-variable structure\. Crucially, every fault is injected not at a random position but into the most prediction\-critical window, selected by a unified importance score, so that the benchmark stresses the regions a model actually relies on rather than degenerating into a random perturbation\. Our main contributions can be summarized as follows:

- *A fault-operator framework* (Section [III](https://arxiv.org/html/2606.18539#S3)) that reformulates TSF robustness evaluation around explicit fault scenarios, semantic fault parameters, controllable difficulty levels, worst-case and average risk measures, and a unified strategy for identifying prediction-critical windows.
- *Four parameterized fault modes* (Section [IV](https://arxiv.org/html/2606.18539#S4)) spanning the $2\times 2$ taxonomy, each with a concrete, causally grounded construction and an explicit difficulty decomposition: Time-Warped Shock (Mode I), Dependency-Fracture Shock (Mode II), Regime-Transition Missingness (Mode III), and Cascading Sensor-to-System Failure (Mode IV).
- *A reproducible benchmark and large-scale empirical study* covering 21 models, 6 datasets, 4 modes, and 5 difficulty levels under a paired clean/corrupt protocol. The results reveal three findings that challenge standard leaderboard intuition: (i) clean-data performance anti-correlates with robustness; (ii) a sharp stratification of failure impact across modes; (iii) a concentration of all catastrophic failures in mechanism-level modes, with pretrained foundation models being the most accurate yet the most fragile.
- *A diagnostic view of forecasting robustness* that attributes model degradation to specific fault mechanisms and difficulty levels. Rather than treating robustness as a pass/fail noise test, TS-Fault enables ablation-style analysis of when, how, and why forecasters fail under structured faults.

## II Related Work

### II-A TSF Benchmarks and Evaluation Paradigms

Time series forecasting (TSF) has been evaluated for decades through benchmark protocols that rank models by aggregate forecasting error on held-out data, as shown in Table [I](https://arxiv.org/html/2606.18539#S2.T1). The first generation (M3, M4 [[49](https://arxiv.org/html/2606.18539#bib.bib12), [50](https://arxiv.org/html/2606.18539#bib.bib13)]) established accuracy comparison across heterogeneous domains; the second (Monash [[24](https://arxiv.org/html/2606.18539#bib.bib14)] and the LTSF suite [[84](https://arxiv.org/html/2606.18539#bib.bib15)]) standardized cross-domain and long-horizon multivariate protocols; the third (GIFT-Eval [[1](https://arxiv.org/html/2606.18539#bib.bib18)], TFB [[59](https://arxiv.org/html/2606.18539#bib.bib19)], BasicTS+ [[67](https://arxiv.org/html/2606.18539#bib.bib65)]) tightened held-out discipline and fairness of comparison; and the fourth (fev-bench [[68](https://arxiv.org/html/2606.18539#bib.bib20)], TSFM-Bench [[42](https://arxiv.org/html/2606.18539#bib.bib21)], ProbTS [[82](https://arxiv.org/html/2606.18539#bib.bib22)]), responding to pretrained forecasters, acknowledged that pretraining leakage can no longer be reliably excluded.

TABLE I: Generational evolution of TSF benchmarks.

Despite these improvements, evaluation remains centered on average MSE or MAE on clean held-out data. This scalar ranking reveals which model is accurate, but not when, how severely, or why it fails. In contrast, TS-Fault evaluates models under explicit fault conditions, reporting degradation by failure mode and difficulty level and attributing each degradation to a named, parameterized mechanism.

### II-B Time Series Foundation Models and Their Evaluation

The evaluation problem has become more urgent with the rise of the time series foundation model (TSFM). Trained on large, heterogeneous corpora [[35](https://arxiv.org/html/2606.18539#bib.bib60), [29](https://arxiv.org/html/2606.18539#bib.bib61)] and applied zero-shot, models such as Chronos [[4](https://arxiv.org/html/2606.18539#bib.bib35)], TimesFM [[17](https://arxiv.org/html/2606.18539#bib.bib36)], and Moirai [[74](https://arxiv.org/html/2606.18539#bib.bib37)] attain strong clean-data accuracy without task-specific training [[43](https://arxiv.org/html/2606.18539#bib.bib23)]. However, foundation models also complicate evaluation. Because their pretraining corpora are large, only partially auditable, and often overlap the public repositories that later supply benchmark test sets, strong held-out performance becomes hard to distinguish from memorization or implicit exposure to benchmark data [[62](https://arxiv.org/html/2606.18539#bib.bib42), [25](https://arxiv.org/html/2606.18539#bib.bib44), [77](https://arxiv.org/html/2606.18539#bib.bib41)]. Existing benchmark responses mainly tighten held-out protocols, require disclosure of pretraining corpora, or introduce more careful dataset filtering. These measures protect the interpretation of a clean score but leave a prior question unanswered: how such models behave when their input is structurally faulted. TS-Fault addresses this question directly with a complementary evaluation axis: it synthesizes faulted instances at evaluation time from controlled scenario parameters, so that even a memorized clean sequence does not directly provide the structured failure pattern used for testing (Sec. [VI-D](https://arxiv.org/html/2606.18539#S6.SS4)).

![Refer to caption](https://arxiv.org/html/2606.18539v1/x1.png)Figure 1: The TS-Fault pipeline.

### II-C Robustness Evaluation in Time Series

Existing robustness research in time series can be organized along two complementary axes: whether a perturbation is sampled from a distribution or constructed by an explicit operator, and whether it is unstructured (i.i.d. across time and variables) or structured (carrying temporal shape, cross-variable coupling, or causal propagation). Table [II](https://arxiv.org/html/2606.18539#S2.T2) summarizes this landscape. Noise and masking assume i.i.d. timesteps and cannot express cross-temporal failure structure [[13](https://arxiv.org/html/2606.18539#bib.bib26), [19](https://arxiv.org/html/2606.18539#bib.bib28), [10](https://arxiv.org/html/2606.18539#bib.bib30)]; adversarial perturbations are mathematical objects within $\epsilon$-balls that lack semantic grounding [[26](https://arxiv.org/html/2606.18539#bib.bib104), [48](https://arxiv.org/html/2606.18539#bib.bib67)]; and distribution-shift benchmarks, though closest to our concern, remain observational: they measure degradation across time gaps without enabling counterfactual control over how the same shift would behave if made more severe [[79](https://arxiv.org/html/2606.18539#bib.bib31), [21](https://arxiv.org/html/2606.18539#bib.bib32), [39](https://arxiv.org/html/2606.18539#bib.bib34)]. The remaining quadrant (constructed and structured) is largely unexplored in TSF robustness evaluation, yet it is the one most relevant to deployment faults, where failures often have temporal morphology, cross-variable dependence, state dependence, and propagation effects. TS-Fault fills this gap by introducing explicit, scenario-grounded fault operators with controllable difficulty. This follows the general philosophy of corruption-robustness benchmarks such as ImageNet-C [[27](https://arxiv.org/html/2606.18539#bib.bib105)], but adapts it to the distinctive structure of time series, where robustness depends on how faults unfold across time and variables, not just on corruption intensity.

TABLE II: Four quadrants of TSF robustness research.

## III The Fault-Operator Framework

This section develops an alternative to clean-accuracy evaluation in four steps. Sec. [III-A](https://arxiv.org/html/2606.18539#S3.SS1) defines fault operators and the structured test instance they produce, together with two risk measures over them. Sec. [III-B](https://arxiv.org/html/2606.18539#S3.SS2) constrains what may serve as a scenario parameter and grades it with a difficulty map. Sec. [III-C](https://arxiv.org/html/2606.18539#S3.SS3) places each fault in the window a model actually relies on. Sec. [III-D](https://arxiv.org/html/2606.18539#S3.SS4) ties these components into a reproducible generator. Figure [1](https://arxiv.org/html/2606.18539#S2.F1) summarizes the resulting pipeline. The main idea of this pipeline is to introduce a transformation that injects a specific fault mechanism into clean data based on explicit and interpretable parameters.

### III-A Structured Instances and Risk Measures

Definition 1 (Fault operator). A *fault operator* is a map $\mathcal{T}_{\Theta}:\mathbb{R}^{L\times C}\to\mathbb{R}^{L\times C}$ parameterized by a scenario-parameter vector $\Theta$. Applied to a clean context $X$, it produces a faulted context $\tilde{X}=\mathcal{T}_{\Theta}(X)$, where $\Theta$ encodes interpretable quantities such as fault onset, duration, affected channels, magnitude, and propagation range.

Definition 2 (Fault family). A *fault family* is a set of operators sharing one failure mechanism, $\mathcal{F}=\{\,\mathcal{T}_{\Theta}:\Theta\in\Phi\,\}$, where $\Phi$ is the parameter space of that mechanism. Distinct $\Theta\in\Phi$ instantiate the same mechanism under different conditions and difficulties.

We replace the clean pair $(X,Y)$ with a structured instance

$$(X,Y)\;\xrightarrow{\;\mathcal{T}_{\Theta}\;}\;\big(\tilde{X},\,\tilde{Y},\,\Theta,\,\delta\big),\qquad \tilde{X}=\mathcal{T}_{\Theta}(X), \tag{1}$$

where the fault operator acts only on the input context, so the target is the clean, unperturbed future $\tilde{Y}=Y$ for all four modes. This design isolates *input* robustness. A model is always scored on recovering the true future from a corrupted history, never on predicting the corruption itself. The scalar $\delta=\kappa(\Theta)$ grades how difficult the instance is (Sec. [III-B](https://arxiv.org/html/2606.18539#S3.SS2)). Evaluation then becomes a function of $\Theta$:

$$\mathrm{eval}(f,\Theta)\;=\;\mathbb{E}\big[\,\ell\big(f(\tilde{X}),\tilde{Y}\big)\,\big|\,\Theta\,\big]. \tag{2}$$

Hence, performance is reported per fault condition, rather than as a single aggregate over an unspecified mixture of conditions.

Risk measures. Over a fault family $\mathcal{F}$, we define a worst-case and an average risk:

$$\mathrm{WC}(f,\mathcal{F})=\sup_{\Theta\in\Phi}\;\mathbb{E}_{(X,Y)\sim P}\big[\ell\big(f(\mathcal{T}_{\Theta}(X)),Y\big)\big], \tag{3}$$

$$\mathrm{AVG}(f,\mathcal{F})=\mathbb{E}_{\Theta\sim\pi_{\Phi}}\;\mathbb{E}_{(X,Y)\sim P}\big[\ell\big(f(\mathcal{T}_{\Theta}(X)),Y\big)\big]. \tag{4}$$

$\mathrm{WC}$ bounds how bad it can get under the most adverse parameter. $\mathrm{AVG}$, taken under a prior $\pi_{\Phi}$ over the parameter space, is less sensitive to the exact parameterization and serves as a scalar summary for cross-model comparison. Reporting both separates "how bad can it get" from "how bad is it on average." In this paper, all tabulated summaries report $\mathrm{AVG}$, whereas $\mathrm{WC}$ is reflected by the per-dataset maxima of Sec. [VI-F](https://arxiv.org/html/2606.18539#S6.SS6).
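The two risk measures can be illustrated with a small empirical sketch. Everything named here is an assumption made for illustration: `offset_operator` is a toy stand-in for $\mathcal{T}_{\Theta}$, the supremum in Eq. (3) is replaced by a maximum over a finite grid of $\Theta$ values, and the prior $\pi_{\Phi}$ in Eq. (4) is taken to be uniform over that grid.

```python
from statistics import mean


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return mean((p - y) ** 2 for p, y in zip(pred, target))


def offset_operator(theta):
    """Toy fault operator T_Theta: add a constant offset on [onset, onset + duration)."""
    onset, duration, magnitude = theta["onset"], theta["duration"], theta["magnitude"]

    def apply(x):
        x_tilde = list(x)
        for t in range(onset, min(onset + duration, len(x))):
            x_tilde[t] += magnitude
        return x_tilde

    return apply


def eval_at(f, theta, data, loss=mse):
    """Empirical Eq. (2): mean loss on faulted inputs, scored against the clean target."""
    op = offset_operator(theta)
    return mean(loss(f(op(x)), y) for x, y in data)


def worst_case(f, thetas, data):
    """Empirical Eq. (3), with the sup replaced by a max over a finite grid."""
    return max(eval_at(f, th, data) for th in thetas)


def average_risk(f, thetas, data):
    """Empirical Eq. (4), assuming a uniform prior over the grid."""
    return mean(eval_at(f, th, data) for th in thetas)


def last_value(x, horizon=2):
    """Naive forecaster: repeat the last observed value."""
    return [x[-1]] * horizon


data = [([1.0, 1.0, 1.0, 1.0], [1.0, 1.0]), ([2.0, 2.0, 2.0, 2.0], [2.0, 2.0])]
thetas = [
    {"onset": 0, "duration": 2, "magnitude": 3.0},  # fault misses the last step
    {"onset": 3, "duration": 1, "magnitude": 3.0},  # fault hits the last step
]
wc = worst_case(last_value, thetas, data)
avg = average_risk(last_value, thetas, data)
print(wc, avg)
```

In this toy setting the naive forecaster is untouched by the early fault and fully exposed to the late one, so the worst-case and average risks differ by a factor of two, which is the separation the two measures are meant to expose.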

### III-B Scenario Parameters under a Difficulty Map

Not every parameterized perturbation qualifies as a scenario parameter. We require $\Theta$ to satisfy four properties. *(i) Semantic interpretability:* every component corresponds to a quantity a domain expert recognizes (shock magnitude, event duration, coupling gain, sensor failure rate), rather than a coordinate in an abstract space such as a gradient step size. *(ii) Compositionality:* components must combine to express compound conditions (e.g., a demand surge co-occurring with sensor dropout) without a fresh parameter vector per combination, enabling systematic coverage of the joint failure space. *(iii) Causal grounding:* $\Theta$ must encode a mechanism [[64](https://arxiv.org/html/2606.18539#bib.bib77), [58](https://arxiv.org/html/2606.18539#bib.bib85)], i.e., the trigger, propagation path, and decay dynamics of a shift, rather than a statistical summary. *(iv) Difficulty controllability:* the difficulty $\delta$ varies monotonically and predictably with $\Theta$, so a graded test can attribute a performance difference to a specific parameter.

Difficulty score. The difficulty $\delta=\kappa(\Theta)$ is a domain-interpretable mapping from scenario parameters to a scalar. Each mode decomposes $\delta$ into interpretable terms (event magnitude, structural distortion, information loss, coupling strength; see Sec. [IV-E](https://arxiv.org/html/2606.18539#S4.SS5)), so an instance reports not only that it is hard but why. Sweeping $\Theta$ along $\kappa$ yields graded difficulty levels, and with them robustness curves and threshold analyses instead of single-point comparisons.
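A difficulty map of this kind might look like the sketch below, which assumes a weighted-sum form $\kappa(\Theta)=\sum_k \beta_k D_k(\Theta)$ with two illustrative terms. The normalizers (5.0 and 48.0), the weights `betas`, and the five target levels are hypothetical choices, not values from the benchmark. Because each term is monotone in its parameter, a target difficulty can be reached by bisection on a single component of $\Theta$.

```python
def kappa(theta, betas):
    """Difficulty as a weighted sum of interpretable terms, each clipped to [0, 1].

    The normalizers 5.0 and 48.0 are hypothetical scales chosen for illustration.
    """
    terms = {
        "magnitude": min(theta["magnitude"] / 5.0, 1.0),
        "duration": min(theta["duration"] / 48.0, 1.0),
    }
    delta = sum(betas[k] * terms[k] for k in terms)
    return delta, terms  # the terms explain *why* an instance is hard


def solve_magnitude(target, duration, betas, lo=0.0, hi=5.0, iters=60):
    """Bisection on magnitude so that kappa reaches the target difficulty.

    Relies on kappa being monotone in magnitude (difficulty controllability).
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        delta, _ = kappa({"magnitude": mid, "duration": duration}, betas)
        if delta < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


betas = {"magnitude": 0.7, "duration": 0.3}
levels = [0.2, 0.35, 0.5, 0.65, 0.8]  # hypothetical targets for five levels
magnitudes = [solve_magnitude(d, duration=24, betas=betas) for d in levels]
for d, m in zip(levels, magnitudes):
    delta, terms = kappa({"magnitude": m, "duration": 24}, betas)
    print(f"target={d:.2f} magnitude={m:.3f} delta={delta:.3f} terms={terms}")
```

Returning the per-term breakdown alongside the scalar keeps the decomposition available for analysis, so a robustness curve can be read against a named parameter.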

### III-C Localizing Faults: a Window-Importance Score

Injecting a fault at a random position would let the benchmark degenerate into a noise test. Hence, we further introduce a mechanism that selects the most prediction-critical window for all four modes. For a candidate window $W=[s,e]\subseteq[1,L]$ we compute:

$$S(W)=\lambda_{1}S_{\mathrm{cp}}(W)+\lambda_{2}S_{\mathrm{per}}(W)+\lambda_{3}S_{\mathrm{var}}(W)+\lambda_{4}S_{\mathrm{pred}}(W), \tag{5}$$

and sample $W^{\star}$ from the top-$K$ candidates. The four sub-scores capture complementary reasons a model may rely on $W$, so that their sum favors windows that are critical for more than one reason (Fig. [2](https://arxiv.org/html/2606.18539#S3.F2)).
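Treating the four sub-scores as given, Eq. (5) and the top-$K$ sampling step can be sketched as follows. The candidate windows, sub-score values, and uniform weights are made up for illustration, and the sketch assumes the sub-scores have already been brought to comparable scales, which the text does not specify.

```python
import random


def combine_scores(sub_scores, lambdas):
    """Eq. (5): S(W) = sum_k lambda_k * S_k(W) for every candidate window W."""
    return {
        window: sum(lambdas[k] * scores[k] for k in lambdas)
        for window, scores in sub_scores.items()
    }


def sample_top_k(scores, k, rng):
    """Sample W* uniformly from the k highest-scoring candidate windows."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return rng.choice(top), top


# Hypothetical candidate windows (start, end) with pre-computed sub-scores.
sub_scores = {
    (0, 8): {"cp": 0.1, "per": 0.2, "var": 0.1, "pred": 0.0},
    (8, 16): {"cp": 0.9, "per": 0.8, "var": 0.7, "pred": 0.9},
    (16, 24): {"cp": 0.5, "per": 0.1, "var": 0.6, "pred": 0.4},
}
lambdas = {"cp": 0.25, "per": 0.25, "var": 0.25, "pred": 0.25}

scores = combine_scores(sub_scores, lambdas)
w_star, top = sample_top_k(scores, k=2, rng=random.Random(0))
print(top, w_star)
```

Sampling from the top-$K$ rather than always taking the maximum keeps some diversity in where faults land while still excluding uninformative windows.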

A forecaster makes predictions based on the most recent behavior of the time series. When that behavior shifts, the window covering the shift is what tells the model which pattern currently holds, so a fault placed there would be especially challenging. We use the change-point score $S_{\mathrm{cp}}$ to find such windows. We split $W$ at its center $c$ into halves $W_{L}=[s,c]$ and $W_{R}=[c+1,e]$, fit a light local predictor on each half, and take the bidirectional cross-prediction error

$$S_{\mathrm{cp}}(W)=\frac{1}{|W_{R}|}\sum_{t\in W_{R}}\big\|x_{t}-f_{L}(x_{<t})\big\|_{2}^{2}+\frac{1}{|W_{L}|}\sum_{t\in W_{L}}\big\|x_{t}-f_{R}(x_{>t})\big\|_{2}^{2}. \tag{6}$$

If $W$ does not cross a change, both halves follow the same local law, each predictor explains the other half, and the score stays low. The score is high when $W$ covers a switch in trend, variance, or period [[2](https://arxiv.org/html/2606.18539#bib.bib45), [36](https://arxiv.org/html/2606.18539#bib.bib68)].
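The cross-prediction idea behind Eq. (6) can be sketched for a univariate series. The text does not specify the "light local predictor", so this sketch assumes a least-squares line fitted on each half and extrapolated onto the other half; an autoregressive predictor would be an equally plausible choice.

```python
def fit_line(ts, xs):
    """Least-squares line x = slope * t + intercept (an assumed 'light' predictor)."""
    n = len(ts)
    t_mean, x_mean = sum(ts) / n, sum(xs) / n
    var = sum((t - t_mean) ** 2 for t in ts)
    slope = sum((t - t_mean) * (x - x_mean) for t, x in zip(ts, xs)) / var if var > 0 else 0.0
    return slope, x_mean - slope * t_mean


def change_point_score(x, s, e):
    """Bidirectional cross-prediction error on the window [s, e] (inclusive, 0-based)."""
    c = (s + e) // 2
    left_t, right_t = list(range(s, c + 1)), list(range(c + 1, e + 1))
    a_l, b_l = fit_line(left_t, [x[t] for t in left_t])
    a_r, b_r = fit_line(right_t, [x[t] for t in right_t])
    # Left-half model predicts the right half, and vice versa.
    err_right = sum((x[t] - (a_l * t + b_l)) ** 2 for t in right_t) / len(right_t)
    err_left = sum((x[t] - (a_r * t + b_r)) ** 2 for t in left_t) / len(left_t)
    return err_right + err_left


# A level shift at t = 10: windows straddling it score high, flat windows score zero.
x = [0.0] * 10 + [5.0] * 10
crossing = change_point_score(x, 5, 14)
flat = change_point_score(x, 0, 9)
print(crossing, flat)
```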

Most forecasters rely heavily on periodicity, and a cycle is easiest to pin down at its peaks and valleys. The period score $S_{\mathrm{per}}$ is therefore introduced to target windows near these points. We estimate the dominant period $P$ from the autocorrelation,

$$\mathrm{ACF}(k)=\frac{\sum_{t}(x_{t}-\bar{x})(x_{t+k}-\bar{x})}{\sum_{t}(x_{t}-\bar{x})^{2}},\qquad P=\arg\max_{2\leq k\leq L/2}\mathrm{ACF}(k). \tag{7}$$

We then locate the peak/valley anchors of the nearest cycle and score the Gaussian-decayed closeness of the window center to the nearest anchor (bandwidth $\sigma_{p}$),

$$S_{\mathrm{per}}(W)=\max\Big\{e^{-d_{\mathrm{peak}}^{2}/\sigma_{p}^{2}},\,e^{-d_{\mathrm{valley}}^{2}/\sigma_{p}^{2}}\Big\}, \tag{8}$$

so that windows aligned with a peak or a valley score highest.
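Eqs. (7) and (8) can be sketched as below. How the "nearest cycle" is delimited is not spelled out in the text, so this sketch assumes the anchors are the argmax and argmin of the series over one period-length stretch around the window center; the bandwidth `sigma_p=2.0` is likewise a hypothetical value.

```python
import math


def acf(x, k):
    """Sample autocorrelation at lag k, as in Eq. (7)."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t + k] - m) for t in range(n - k))
    den = sum((v - m) ** 2 for v in x)
    return num / den if den > 0 else 0.0


def dominant_period(x):
    """P = argmax of ACF(k) over 2 <= k <= L/2."""
    return max(range(2, len(x) // 2 + 1), key=lambda k: acf(x, k))


def period_score(x, s, e, sigma_p=2.0):
    """Eq. (8): Gaussian-decayed closeness of the window center to a peak or valley."""
    p = dominant_period(x)
    center = (s + e) / 2
    # Assumed definition of the "nearest cycle": one period around the center.
    start = max(0, min(int(center) - p // 2, len(x) - p))
    cycle = range(start, start + p)
    peak = max(cycle, key=lambda t: x[t])
    valley = min(cycle, key=lambda t: x[t])
    d_peak, d_valley = center - peak, center - valley
    return max(math.exp(-d_peak**2 / sigma_p**2), math.exp(-d_valley**2 / sigma_p**2))


x = [math.sin(2 * math.pi * t / 12) for t in range(48)]
on_peak = period_score(x, 1, 5)   # window centered on the peak at t = 3
off_peak = period_score(x, 4, 8)  # window centered on the zero crossing at t = 6
print(dominant_period(x), on_peak, off_peak)
```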

A fault in a flat, uninformative stretch barely changes the forecast, so the volatility score $S_{\mathrm{var}}$ is also designed to measure how much information a window carries. With a wavelet decomposition of $W$ and per-level detail energy $E_{j}(W)=\sum_{t\in W}|d_{j,t}|^{2}$, we take the robustly normalized weighted sum (level weights $\omega_{j}$)

$$S_{\mathrm{var}}(W)=\frac{\sum_{j}\omega_{j}E_{j}(W)}{\mathrm{MAD}(W)+\varepsilon}, \tag{9}$$

favoring windows with rich internal dynamics, which tend to carry more predictive information.
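A minimal sketch of Eq. (9) follows. The wavelet family is not named in the text, so a hand-rolled orthonormal Haar transform is assumed here; the uniform level weights, the reading of $\mathrm{MAD}$ as the median absolute deviation of the window values, and the value of `eps` are also assumptions.

```python
from statistics import median


def haar_details(x, levels):
    """Detail coefficients of an orthonormal Haar transform, one list per level."""
    approx, details = list(x), []
    for _ in range(levels):
        if len(approx) < 2:
            break
        n = len(approx) - len(approx) % 2  # drop a trailing odd sample
        details.append([(approx[i] - approx[i + 1]) / 2**0.5 for i in range(0, n, 2)])
        approx = [(approx[i] + approx[i + 1]) / 2**0.5 for i in range(0, n, 2)]
    return details


def volatility_score(x, s, e, weights=(1.0, 1.0, 1.0), eps=1e-8):
    """Eq. (9): weighted detail energy, normalized by the window's MAD."""
    window = x[s : e + 1]
    details = haar_details(window, len(weights))
    energy = sum(w * sum(d * d for d in level) for w, level in zip(weights, details))
    med = median(window)
    mad = median(abs(v - med) for v in window)
    return energy / (mad + eps)


x = [1.0] * 8 + [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
flat = volatility_score(x, 0, 7)      # constant stretch: no detail energy
dynamic = volatility_score(x, 8, 15)  # alternating stretch: energy at level 1
print(flat, dynamic)
```

The `eps` term matters in practice: a perfectly flat window has zero MAD, and without it the ratio would be undefined.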

The above three scores are computed from the data alone, so they only approximate what a model actually uses. The occlusion score $S_{\mathrm{pred}}$ measures this directly. With a reference predictor $f$, we compare the full-input forecast $\hat{y}$ to the forecast $\hat{y}^{(\backslash W)}$ obtained after occluding $W$,

$$S_{\mathrm{pred}}(W)=\sum_{h=1}^{H}\alpha_{h}\,\big\|\hat{y}_{L+h}-\hat{y}^{(\backslash W)}_{L+h}\big\|_{2}^{2}, \tag{10}$$

directly identifying the windows the model actually depends on. The weights $\lambda_{i}$ are per-mode hyperparameters (e.g., Mode I raises $\lambda_{3}$ and $\lambda_{4}$ to favor internally dynamic windows with high predictive leverage, whereas Mode III raises $\lambda_{1}$ to favor windows near a change point; details are in Sec. [IV](https://arxiv.org/html/2606.18539#S4)). This ensures that the benchmark does not collapse into a random perturbation.
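The occlusion score of Eq. (10) can be sketched with a toy reference predictor. The occlusion rule is not specified in the text; this sketch assumes the window is replaced by the mean of the remaining context, and `mean_of_last4` and the uniform horizon weights `alphas` are hypothetical stand-ins.

```python
def occlude(x, s, e):
    """Replace x[s..e] with the mean of the remaining context (an assumed occlusion rule)."""
    rest = x[:s] + x[e + 1 :]
    fill = sum(rest) / len(rest)
    return x[:s] + [fill] * (e - s + 1) + x[e + 1 :]


def occlusion_score(f, x, s, e, alphas):
    """Eq. (10): horizon-weighted squared change in the forecast after occluding W."""
    y_full = f(x)
    y_occluded = f(occlude(x, s, e))
    return sum(a * (yf - yo) ** 2 for a, yf, yo in zip(alphas, y_full, y_occluded))


def mean_of_last4(x, horizon=3):
    """Toy reference predictor: repeat the mean of the last four observations."""
    return [sum(x[-4:]) / 4] * horizon


x = [0.0] * 8 + [4.0] * 4
alphas = [1 / 3] * 3
early = occlusion_score(mean_of_last4, x, 0, 3, alphas)  # predictor never looks here
late = occlusion_score(mean_of_last4, x, 8, 11, alphas)  # predictor depends on this
print(early, late)
```

Unlike the three data-side scores, this one depends on the choice of reference predictor, so two predictors can rank the same windows differently.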

![Refer to caption](https://arxiv.org/html/2606.18539v1/x2.png)Figure 2: Illustrative window-importance scoring. Among candidate windows (grey), $S(W)$ selects $W^{\star}$ (blue) because it is simultaneously close to a change point ($S_{\mathrm{cp}}$), near identifiable period anchors ($S_{\mathrm{per}}$), internally volatile ($S_{\mathrm{var}}$), and highly influential on the forecast under occlusion ($S_{\mathrm{pred}}$). Faults are injected here, not at random positions.

### III-D Generating a TS-Fault Instance

Algorithm [1](https://arxiv.org/html/2606.18539#alg1) ties the components together. The same windowing mechanism (lines 2–3) is shared by all four modes, which is what makes their results directly comparable. Only the mode-specific structure selection (line 4) and the operator (line 5) differ.

Generation is cheap. With light local predictors, $S_{\mathrm{cp}}$ and $S_{\mathrm{pred}}$ are linear in the number of candidate windows, the remaining sub-scores are linear in window length, and all per-channel computations are independent, so generation scales linearly with the number of channels and parallelizes trivially across instances. Because instances are produced by an explicit operator rather than sampled, the benchmark can also be regenerated at any difficulty by re-sweeping $\kappa$, and previously unexposed $\Theta$ combinations can be held out at release time (Sec. [VII-B](https://arxiv.org/html/2606.18539#S7.SS2)).

Algorithm 1: Generating a TS-Fault instance

Require: clean window $X$, target $Y$; mode $i$; difficulty $s$

1. sample parameters $\Theta\sim\Phi_{i}$ at difficulty $s$ (i.e., subject to $\kappa_{i}(\Theta)=\delta_{s}$, the target difficulty of level $s$)
2. score candidates $S(W)\leftarrow\sum_{k}\lambda^{(i)}_{k}S_{k}(W)$
3. sample the critical window $W^{\star}$ from the top-$K$ by $S(\cdot)$
4. select mode-specific structure on $W^{\star}$ (variable subset $S$; or root set $R$ and downstream $D$; or switch center $\tau$)
5. apply the operator, localized to $W^{\star}$: $\tilde{X}\leftarrow\mathcal{T}_{\Theta}(X)$
6. $\tilde{Y}\leftarrow Y$ // target unchanged; only the input is faulted
7. $\delta\leftarrow\kappa_{i}(\Theta)$
8. return structured instance $(\tilde{X},\tilde{Y},\Theta,\delta)$
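The control flow of Algorithm 1 can be sketched as a generator that takes the mode-specific pieces as callables. This is an illustrative skeleton, not the released implementation: every function name, the variance-based window score, and the constant-offset operator are hypothetical stand-ins for the mode-specific components.

```python
import random


def generate_instance(x, y, sample_theta, kappa, score_window, select_structure,
                      apply_operator, candidates, top_k, rng):
    """Skeleton of Algorithm 1; comments give the corresponding algorithm line."""
    theta = sample_theta(rng)                                   # line 1
    ranked = sorted(candidates, key=lambda w: score_window(x, w), reverse=True)  # line 2
    w_star = rng.choice(ranked[:top_k])                         # line 3
    structure = select_structure(x, w_star, theta)              # line 4
    x_tilde = apply_operator(x, w_star, structure, theta)       # line 5
    y_tilde = list(y)                                           # line 6: target unchanged
    delta = kappa(theta)                                        # line 7
    return x_tilde, y_tilde, theta, delta, w_star               # line 8 (plus W* for inspection)


# Hypothetical stand-ins for the mode-specific components.
def window_variance(x, w):
    seg = x[w[0] : w[1] + 1]
    m = sum(seg) / len(seg)
    return sum((v - m) ** 2 for v in seg) / len(seg)


def add_offset(x, w, structure, theta):
    out = list(x)
    for t in range(w[0], w[1] + 1):
        out[t] += theta["magnitude"]
    return out


x = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0]
x_tilde, y_tilde, theta, delta, w_star = generate_instance(
    x, y,
    sample_theta=lambda rng: {"magnitude": 2.0},
    kappa=lambda th: th["magnitude"] / 5.0,
    score_window=window_variance,
    select_structure=lambda x, w, th: None,
    apply_operator=add_offset,
    candidates=[(0, 3), (4, 7), (8, 11)],
    top_k=1,
    rng=random.Random(0),
)
print(w_star, x_tilde, delta)
```

Passing the components as callables mirrors the point made in the text: only structure selection and the operator change between modes, while window scoring and sampling are shared.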

![Refer to caption](https://arxiv.org/html/2606.18539v1/x3.png)Figure 3: The $2\times 2$ fault taxonomy. Scope (observation- vs. mechanism-level) $\times$ variate scope (uni- vs. multivariate) yields four modes.

![Refer to caption](https://arxiv.org/html/2606.18539v1/x4.png)Figure 4: Visual signatures of the four fault modes at three difficulties (clean in grey, faulted in color, untouched future target in green).

## IV The Four Fault Modes

Drawing on documented failures in energy [[9](https://arxiv.org/html/2606.18539#bib.bib7), [3](https://arxiv.org/html/2606.18539#bib.bib48)], financial [[20](https://arxiv.org/html/2606.18539#bib.bib71)], and clinical [[53](https://arxiv.org/html/2606.18539#bib.bib50), [65](https://arxiv.org/html/2606.18539#bib.bib51)] systems, we organize meaningful TSF faults along two orthogonal axes (Figure [3](https://arxiv.org/html/2606.18539#S3.F3)). The first axis is scope: a fault either corrupts observations while the data-generating mechanism stays intact (observation-level, typically a transient event), or it alters the mechanism itself (mechanism-level, typically a persistent regime). The second axis is variate scope: a fault either acts primarily on a single series (univariate) or on the cross-variable structure (multivariate). The four resulting modes (illustrated in Figure [4](https://arxiv.org/html/2606.18539#S3.F4)) form a comprehensive evaluation.

### IV-A Mode I: Time-Warped Shock (Observation $\times$ Univariate)

Mode I is the observation-level, univariate case: a localized, shaped event whose timing is also distorted, as when a sensor briefly spikes or saturates and its reading is then displaced on the time axis by a buffering pipeline. The fault therefore acts jointly in the value domain (an event with onset, peak, and decay) and on the temporal alignment. I.i.d. noise, which admits neither temporal shape nor a coherent time shift, cannot produce this combination.

We select the critical window $W^{\star}$ with the importance score $S(W)$ (Eq. ([5](https://arxiv.org/html/2606.18539#S3.E5))), raising $\lambda_{3},\lambda_{4}$ so the event lands on an internally dynamic window the model relies on, and inside it add a local event prototype at magnitude $\alpha$:

$$b_{t}=a_{t}\,m_{t},\qquad \tilde{x}_{t}=x_{t}+\alpha\,b_{t},\quad t\in W^{\star},\tag{11}$$

where the support mask $m_{t}$ fixes which timesteps the event touches and the shape $a_{t}$ its local amplitude profile. The prototype is a narrow impulse (a transient spike), a decaying burst (a short overload), or a flat transient shift (a brief baseline drift). We then distort the event's position in time by re-sampling the shocked window through a temporal map $\phi(t)$, $x^{\prime}_{t}=\tilde{x}_{\phi(t)}$ (by interpolation):

$$\phi(t)=\begin{cases}t+\Delta,&\text{shift}\\ s+\alpha_{w}(t-s),&\text{scale}\\ t+\Delta\sin\big(2\pi(t-s)/|W^{\star}|\big)+\xi_{t},&\text{nonlinear.}\end{cases}\tag{12}$$

A shift offsets the timestamp, a scale stretches or compresses the event ($\alpha_{w}>1$ stretches, $\alpha_{w}<1$ compresses), and a nonlinear warp distorts its leading, middle, and trailing phases unevenly; only the corresponding segment is replaced, so the fault stays local. The scenario parameters $\Theta_{\mathrm{I}}=(\Theta_{\mathrm{evt}},\Theta_{\mathrm{wrp}},\Theta_{\mathrm{ctx}},\Theta_{\mathrm{cpl}})$ collect the event, the warp, the window context, and the event–warp coupling (the warp acting independently, centered on the event peak, or scaling with shock strength), and the difficulty $\delta_{\mathrm{I}}=\sum_{k}\beta_{k}D_{k}$ sums the matching event, warp, context, and coupling terms (Table [III](https://arxiv.org/html/2606.18539#S4.SS4)). *Mode I probes* whether a model over-relies on local peaks, short-term patterns, or precise temporal alignment.
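The two steps of Mode I (additive event, then temporal re-sampling) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released generator: the function names `inject_event` and `warp_window`, the burst decay rate, the centred impulse position, and the omission of the noise term $\xi_{t}$ are all choices made here for brevity.

```python
import numpy as np

def inject_event(x, start, length, alpha, kind="impulse"):
    """Eq. (11): add alpha * a_t * m_t on the window [start, start + length)."""
    x_tilde = np.asarray(x, dtype=float).copy()
    k = np.arange(length)
    if kind == "impulse":      # narrow spike, placed at the window centre (assumption)
        a = (k == length // 2).astype(float)
    elif kind == "burst":      # decaying burst; the decay rate is an assumption
        a = np.exp(-k / max(length / 4.0, 1.0))
    else:                      # flat transient shift
        a = np.ones(length)
    # The support mask m_t is 1 inside the window and 0 elsewhere.
    x_tilde[start:start + length] += alpha * a
    return x_tilde

def warp_window(x_tilde, start, length, mode="shift", delta=2.0, alpha_w=1.5):
    """Eq. (12): x'_t = x_tilde[phi(t)] via linear interpolation, window only."""
    t = np.arange(start, start + length, dtype=float)
    if mode == "shift":
        phi = t + delta
    elif mode == "scale":
        phi = start + alpha_w * (t - start)
    else:                      # nonlinear warp; the xi_t noise term is omitted here
        phi = t + delta * np.sin(2 * np.pi * (t - start) / length)
    phi = np.clip(phi, 0, len(x_tilde) - 1)
    out = x_tilde.copy()
    # Only the window segment is replaced, so the fault stays local.
    out[start:start + length] = np.interp(phi, np.arange(len(x_tilde)), x_tilde)
    return out
```

With a shift of $\Delta=2$, a spike injected at index 25 is read at index 23 after warping, while samples outside the window are untouched.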

### IV-B Mode II: Dependency-Fracture Shock (Observation $\times$ Multivariate)

Mode II is the observation-level, multivariate case: a group of variables that should move together has its lead–lag and gain structure covertly broken, as when assets react to a common shock but with timing or sign that no longer matches their normal relationship. Each variate, viewed alone, still looks plausible; only the cross-variable structure is abnormal, a configuration that per-channel i.i.d. noise cannot represent.

Besides $W^{\star}$, we select the variable subset $S$ most embedded in the local dependency structure. Over the window neighborhood $N(W;r)$, we score each variable by its total lead–lag coupling to the rest,

$$\begin{gathered}R_{ij}(W)=\max_{|\tau|\leq\tau_{\max}}\big|\mathrm{Corr}\big(x^{(i)}_{t},x^{(j)}_{t-\tau}\big)\big|,\\ G_{i}(W)=\sum_{j\neq i}R_{ij}(W),\end{gathered}\tag{13}$$

the maximum lagged cross-correlation summed over partners, and take $S=\mathrm{TopM}\{G_{i}(W)\}$. We inject a shared event prototype $u_{t}$ with heterogeneous per-variable gains ($\tilde{x}^{(i)}_{t}=x^{(i)}_{t}+g_{i}u_{t}$, $i\in S$), so the group appears to have undergone one common event but not identically; we then pick a root $r$ and, for each follower $j$, estimate its normal response template by least squares and deliberately falsify it:

$$\begin{gathered}(\hat{\tau}_{j},\hat{g}_{j})=\arg\min_{\tau,g}\sum_{t\in N(W;r)}\big(x^{(j)}_{t}-g\,x^{(r)}_{t-\tau}\big)^{2},\\ \tau^{\prime}_{j}=\hat{\tau}_{j}+\Delta\tau_{j},\quad g^{\prime}_{j}=\hat{g}_{j}+\Delta g_{j},\quad {x^{\prime}}^{(j)}_{t}=x^{(j)}_{t}+g^{\prime}_{j}\,u_{t-\tau^{\prime}_{j}},\end{gathered}\tag{14}$$

where sign flips on $g^{\prime}_{j}$ are permitted (turning a co-movement into an anti-movement) while the root keeps its normal response. The result looks like one shared event per channel, yet the inter-variable timing, strength, and direction are falsified. The parameters $\Theta_{\mathrm{II}}=(\Theta_{\mathrm{shk}},\Theta_{\mathrm{frc}},\Theta_{\mathrm{ctx}},\Theta_{\mathrm{scl}})$ collect the shock, the fracture offsets $(\Delta\tau,\Delta g)$, the window/variable context, and the participation scale, and $\delta_{\mathrm{II}}=\sum_{k}\beta_{k}D_{k}$ sums the shock, fracture, group, and position terms (Table [III](https://arxiv.org/html/2606.18539#S4.SS4)). *Mode II probes* whether a model that gains accuracy from cross-channel correlation is misled when that correlation is covertly falsified.
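The variable-selection step of Eq. (13) can be sketched as below. It is an illustrative reading only: the helper names `lagged_abs_corr`, `coupling_scores`, and `top_m` are hypothetical, the window is assumed to be an array of shape `(T, C)`, and constant segments are simply skipped.

```python
import numpy as np

def lagged_abs_corr(xi, xj, tau_max):
    """R_ij: max over |tau| <= tau_max of |Corr(x_t^(i), x_{t-tau}^(j))|."""
    n = len(xi)
    best = 0.0
    for tau in range(-tau_max, tau_max + 1):
        if tau >= 0:
            a, b = xi[tau:], xj[:n - tau]      # pairs (x_t^(i), x_{t-tau}^(j))
        else:
            a, b = xi[:n + tau], xj[-tau:]     # negative lag: j leads by |tau|
        if a.std() == 0 or b.std() == 0:       # correlation undefined; skip
            continue
        best = max(best, abs(np.corrcoef(a, b)[0, 1]))
    return best

def coupling_scores(X, tau_max=5):
    """G_i = sum over j != i of R_ij, for a window X of shape (T, C)."""
    C = X.shape[1]
    G = np.zeros(C)
    for i in range(C):
        for j in range(C):
            if i != j:
                G[i] += lagged_abs_corr(X[:, i], X[:, j], tau_max)
    return G

def top_m(G, m):
    """Indices of the m most strongly coupled variables (the subset S)."""
    return np.argsort(G)[::-1][:m]
```

On a toy window with two lagged copies of the same sinusoid and one noise channel, the two coupled channels receive the highest $G_{i}$ and are the ones selected.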

### IV-C Mode III: Regime-Transition Missingness (Mechanism $\times$ Univariate)

Mode III moves to the mechanism\-level, univariate corner: the data\-generating process itself changes, and samples go missing in a state\-dependent, non\-random way\[[44](https://arxiv.org/html/2606.18539#bib.bib46)\], such as when a grid enters emergency operation \(new trend, period, and volatility\)\[[9](https://arxiv.org/html/2606.18539#bib.bib7),[61](https://arxiv.org/html/2606.18539#bib.bib9)\]while its monitoring drops blocks of samples around the very transition\. The model therefore observes a biased, block\-missing mixture of the old and new regimes, something neither value\-level corruption nor random masking can produce\.

We choose a switch center $\tau$ by a quality score that rewards a salient change in slope, period, or volatility near the forecast origin. For $t\geq\tau$ we rewrite the trend, season, and residual of $x_{t}=T_{t}+S_{t}+R_{t}$ into a new regime,

$$\tilde{T}_{t}=T_{t}+\Delta_{\beta}(t-\tau)+\Delta_{b},\quad\tilde{S}_{t}=a_{s}\,S_{\phi_{s}(t)},\quad\tilde{R}_{t}=c_{r}R_{t},\tag{15}$$

where $(\Delta_{\beta},\Delta_{b})$ shift the trend slope and level, $a_{s}$ rescales the seasonal amplitude, $\phi_{s}(t)=\tau+\tfrac{P}{P^{\prime}}(t-\tau)+\psi$ retimes its period and phase, and $c_{r}$ inflates the residual. We form the ideal new-regime trajectory $\bar{z}_{t}=\tilde{T}_{t}+\tilde{S}_{t}+\tilde{R}_{t}$ and blend it with the old one through a smooth sigmoid gate $\omega_{t}=\sigma((t-\tau)/w_{\tau})$, $z_{t}=(1-\omega_{t})x_{t}+\omega_{t}\bar{z}_{t}$, so the switch is a transition rather than a hard cut. The observation process then drops blocks where the system is most stressed, with missing-run start probability

$$p^{(i)}_{\mathrm{start}}(t)=\sigma\big(a_{0}+a_{1}h_{t}+a_{2}v^{(i)}_{t}+a_{3}\varrho^{(i)}_{t}+a_{4}c_{i}\big),\tag{16}$$

where $h_{t}=4\omega_{t}(1-\omega_{t})$ peaks at the switch core, $v^{(i)}_{t}$ is local volatility, $\varrho^{(i)}_{t}$ the standardized residual magnitude, and $c_{i}$ a per-channel fragility; run lengths are geometric, giving the observed series $x^{\mathrm{obs},(i)}_{t}=m^{(i)}_{t}z^{(i)}_{t}+(1-m^{(i)}_{t})\,v^{\mathrm{fill},(i)}_{t}$. The model sees only this masked, regime-mixed history $\tilde{X}$ while the target stays clean ($\tilde{Y}=Y$); the parameters $\Theta_{\mathrm{III}}=(\Theta_{\mathrm{rgm}},\Theta_{\mathrm{mis}},\Theta_{\mathrm{ctx}},\Theta_{\mathrm{cpl}})$ and difficulty $\delta_{\mathrm{III}}=\sum_{k}\beta_{k}D_{k}$ (regime, missingness, proximity, coupling; Table [III](https://arxiv.org/html/2606.18539#S4.SS4)) follow accordingly. *Mode III probes* whether a model recovers the true future or instead extrapolates the spurious new regime and the gaps. It is the most structurally destructive mode, as it rewrites the statistics a model sees in its input while hiding exactly the samples that would reveal the change.
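The sigmoid blend and the state-dependent missingness of Eq. (16) can be sketched for a single channel as follows. The function names, the coefficient values $a_{0},\dots,a_{4}$, and the mean run length are placeholders chosen here, not values from the benchmark.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def regime_blend(x, z_bar, tau, w_tau):
    """z_t = (1 - omega_t) x_t + omega_t z_bar_t, omega_t = sigmoid((t - tau) / w_tau)."""
    t = np.arange(len(x))
    omega = sigmoid((t - tau) / w_tau)
    return (1.0 - omega) * x + omega * z_bar, omega

def missing_start_prob(omega, vol, resid, c, a=(-4.0, 3.0, 0.5, 0.5, 0.5)):
    """Eq. (16); the coefficients in `a` are placeholder values."""
    h = 4.0 * omega * (1.0 - omega)        # peaks at the switch core (omega = 0.5)
    return sigmoid(a[0] + a[1] * h + a[2] * vol + a[3] * resid + a[4] * c)

def sample_mask(p_start, mean_run, rng):
    """Observation mask m_t (1 = observed) with geometric missing-run lengths."""
    m = np.ones(len(p_start), dtype=int)
    t = 0
    while t < len(m):
        if rng.random() < p_start[t]:
            run = int(rng.geometric(1.0 / mean_run))
            m[t:t + run] = 0               # drop a contiguous block of samples
            t += run
        else:
            t += 1
    return m
```

Because $h_{t}$ is largest where $\omega_{t}=0.5$, missing runs are most likely to start at the switch center, which is the property that hides the transition from the model.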

### IV-D Mode IV: Cascading Sensor-to-System Failure (Mechanism $\times$ Multivariate)

Mode IV is the remaining mechanism\-level, multivariate case: a fault that originates at an upstream sensor, propagates along causal dependencies into downstream states, and finally degrades downstream observations\[[18](https://arxiv.org/html/2606.18539#bib.bib72),[8](https://arxiv.org/html/2606.18539#bib.bib73),[15](https://arxiv.org/html/2606.18539#bib.bib53),[41](https://arxiv.org/html/2606.18539#bib.bib49)\]\. The fault itself propagates through the input window across three layers \(root reading, downstream state, downstream observation\)\.

Besides $W^{\star}$, we separate upstream drivers from downstream victims with a local directed-influence score

$$\Delta_{i\to j}(W)=\frac{\mathrm{Err}^{(j)}_{\mathrm{self}}-\mathrm{Err}^{(i\to j)}_{\mathrm{aug}}}{\mathrm{Err}^{(j)}_{\mathrm{self}}+\varepsilon},\tag{17}$$

that is, the relative drop in $j$'s self-prediction error once $i$'s history is added as a predictor. The roots are the strongest drivers $R=\mathrm{Top}_{K_{r}}\{\sum_{j}\Delta_{i\to j}\}$ and the downstream set the strongest victims $D=\mathrm{Top}_{K_{d}}\{\sum_{r\in R}\Delta_{r\to j}\}$, taken disjoint from $R$, with trigger time $\tau_{1}$ where the roots deviate most from their window median. From $\tau_{1}$ on, each root is replaced by a faulted reading under one of four sensor-fault operators (*bias drift*, *saturation*, coarse *quantization*, or *stuck-at*), producing a fault error $e^{(r)}_{t}$ that seeds the cascade. The error reaches each downstream $j$ with gain $\gamma_{rj}$, delay $\Delta_{rj}$, and a decaying kernel $k_{rj}(\ell)=e^{-\ell/h_{rj}}/Z_{rj}$,

$$\zeta^{(j)}_{t}=\sum_{r\in R}\gamma_{rj}\sum_{\ell=0}^{n_{k}}k_{rj}(\ell)\,e^{(r)}_{t-\Delta_{rj}-\ell},\tag{18}$$

dragging the downstream state to $z^{\prime(j)}_{t}=z^{(j)}_{t}+\zeta^{(j)}_{t}$, which in turn makes the downstream observation fragile and induces secondary dropouts whose rate grows with the local cascade magnitude $|\zeta^{(j)}_{t}|$. The observed input is therefore piecewise by channel role: faulted readings on $R$, displaced-and-masked values on $D$, and clean values elsewhere, while the target stays clean ($\tilde{Y}=Y$). The parameters $\Theta_{\mathrm{IV}}=(\Theta_{\mathrm{rt}},\Theta_{\mathrm{dwn}},\Theta_{\mathrm{flt}},\Theta_{\mathrm{prp}},\Theta_{\mathrm{sec}})$ and difficulty $\delta_{\mathrm{IV}}=\sum_{k}\beta_{k}D_{k}$ (root, cascade, delay, secondary; Table [III](https://arxiv.org/html/2606.18539#S4.SS4)) follow accordingly. *Mode IV probes* whether a model that aggregates across channels contains an upstream fault or instead spreads it into its own forecasts of the true future.
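A compact sketch of Eqs. (17) and (18) for one root and one downstream channel is given below. Several pieces are assumptions made for illustration: the self- and augmented-prediction errors are taken to be in-sample errors of a least-squares autoregression with `p` lags, the normalizer $Z_{rj}$ is taken to make the kernel sum to one, and all function names are hypothetical.

```python
import numpy as np

def _ar_error(target, predictors, p):
    """In-sample MSE of a least-squares fit of target_t on p lags of each predictor."""
    T = len(target)
    cols = [np.ones(T - p)]
    for s in predictors:
        for lag in range(1, p + 1):
            cols.append(s[p - lag:T - lag])    # value of s at time t - lag
    A = np.column_stack(cols)
    y = target[p:]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((y - A @ coef) ** 2))

def directed_influence(x_i, x_j, p=3, eps=1e-8):
    """Eq. (17): relative drop in j's error once i's history is added."""
    err_self = _ar_error(x_j, [x_j], p)
    err_aug = _ar_error(x_j, [x_j, x_i], p)
    return (err_self - err_aug) / (err_self + eps)

def cascade(e_root, gain, delay, h, n_k):
    """Eq. (18) for a single root: delayed, exponentially decaying propagation."""
    k = np.exp(-np.arange(n_k + 1) / h)
    k /= k.sum()                               # assumed role of the normalizer Z
    zeta = np.zeros(len(e_root))
    for t in range(len(e_root)):
        for l in range(n_k + 1):
            src = t - delay - l
            if src >= 0:
                zeta[t] += gain * k[l] * e_root[src]
    return zeta
```

A unit fault error at the root appears downstream only after the delay, peaks immediately, and then decays; with a normalized kernel its total mass equals the gain.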

TABLE III: Difficulty decomposition for the four fault modes. Each $\delta$ is the weighted sum $\sum_{k}\beta_{k}D_{k}$ of the terms in its block.

### IV-E Difficulty Control and Fault Composition

Each mode's difficulty $\delta=\kappa(\Theta)$ is a weighted sum of four interpretable terms, $\delta=\sum_{k}\beta_{k}D_{k}$, so a hard instance records not only that it is hard but why. Table [III](https://arxiv.org/html/2606.18539#S4.SS4) states the terms explicitly. Difficulty also structures the evaluation: the five levels $s_{1}\to s_{5}$ sweep $\Theta$ along $\kappa$ so that every term increases monotonically. Because $\kappa$ is monotone and each term is interpretable, the resulting robustness curves (Sec. [VI-E](https://arxiv.org/html/2606.18539#S6.SS5)) measure sensitivity to a named parameter.

The four modes are operators and therefore composable\. A compound condition is the composition

$$\mathcal{T}_{\Theta}\;=\;\mathcal{T}_{\Theta_{k}}\circ\cdots\circ\mathcal{T}_{\Theta_{1}},\tag{19}$$

which is generally non-commutative: a drift followed by an impulse yields a different faulted window than the reverse. Thus, the order is itself a scenario parameter. Because each operator exposes its own $\Theta$, the contribution of every fault in a composite instance remains attributable, turning "which failure degraded this model" from a guess into a decomposition. We treat single-mode evaluation as the core protocol and leave systematic compositional sweeps to future work.
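The order dependence in Eq. (19) is easy to see with two toy operators. The sketch below is not taken from the benchmark: `gain_drift` (a multiplicative drift) and `impulse` are hypothetical stand-ins chosen because a multiplicative and an additive fault do not commute.

```python
import numpy as np

def gain_drift(rate):
    """Toy multiplicative drift: the sensor gain grows linearly over the window."""
    def op(x):
        return x * (1.0 + rate * np.arange(len(x)))
    return op

def impulse(pos, alpha):
    """Toy additive spike of size alpha at index pos."""
    def op(x):
        y = x.copy()
        y[pos] += alpha
        return y
    return op

def compose(*ops):
    """compose(T1, ..., Tk) applies T1 first, i.e. it builds T_k o ... o T_1."""
    def op(x):
        for f in ops:
            x = f(x)
        return x
    return op

x = np.ones(10)
drift_then_impulse = compose(gain_drift(0.1), impulse(5, 2.0))(x)
impulse_then_drift = compose(impulse(5, 2.0), gain_drift(0.1))(x)
```

At index 5 the first ordering gives $1.5+2=3.5$, while the second gives $(1+2)\times 1.5=4.5$, so the composition order has to be recorded as part of the scenario.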

## V TS-Fault Benchmark and Experimental Setup

With the framework \(Sec\.[III](https://arxiv.org/html/2606.18539#S3)\) and the four modes \(Sec\.[IV](https://arxiv.org/html/2606.18539#S4)\) in place, we now instantiate TS\-Fault as a concrete benchmark and describe the protocol of our large\-scale evaluation study, including the datasets, the evaluated models, the paired evaluation procedure, the metrics, and the reproducibility details\.

TABLE IV: The six multivariate datasets in TS-Fault.

### V-A Datasets

We build TS-Fault on six widely used multivariate long-term forecasting datasets spanning the energy, load, and climate domains (Table [IV](https://arxiv.org/html/2606.18539#S5.T4)) [[84](https://arxiv.org/html/2606.18539#bib.bib15), [76](https://arxiv.org/html/2606.18539#bib.bib16), [40](https://arxiv.org/html/2606.18539#bib.bib69)]. We chose them to vary widely in the two properties that interact with our fault modes: dimensionality (from $7$ channels in ETT to $321$ in Electricity) and sampling granularity (from $10$-minute to hourly). Dimensionality matters because Mode II and Mode IV act on cross-variable structure. For example, the $321$ densely correlated channels of Electricity make it the most demanding test of cross-channel fracture and propagation. Following the standard forecasting protocol in [[12](https://arxiv.org/html/2606.18539#bib.bib109), [55](https://arxiv.org/html/2606.18539#bib.bib17), [83](https://arxiv.org/html/2606.18539#bib.bib110)], each clean test window comprises a length-$336$ history and a length-$96$ target.

![Refer to caption](https://arxiv.org/html/2606.18539v1/x5.png)
Figure 5: Each line links a model's clean-accuracy rank (left) to its robustness rank (right).

TABLE V: Aggregate results of the 21 models, grouped by model category, over the metrics defined in Sec. [V-D](https://arxiv.org/html/2606.18539#S5.SS4). Bold marks the best value per column ($k=10^{3}$).

| Model | Clean MSE | Clean MAE | Faulted MSE | Faulted MAE | $\Delta$MSE | $r$ | RD I (%) | RD II (%) | RD III (%) | RD IV (%) | $d_{10}/d_{02}$ | $R_{\mathrm{cln}}$ | $R_{\mathrm{rob}}$ | $\Delta$rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Statistical* | | | | | | | | | | | | | | |
| Naive [[31](https://arxiv.org/html/2606.18539#bib.bib103)] | 1.238 | 0.780 | 129.5 | 4.798 | 128.3 | 105.7 | 0 | 0 | 27.4k | 2.1k | 23.2 | 21 | 4 | -17 |
| SeasonalNaive [[31](https://arxiv.org/html/2606.18539#bib.bib103)] | 0.905 | 0.620 | 118.5 | 4.469 | 117.6 | 165.5 | 11 | 0 | 26.8k | 2.6k | 24.1 | 17 | 13 | -4 |
| ARIMA [[7](https://arxiv.org/html/2606.18539#bib.bib101)] | 0.999 | 0.694 | 121.7 | 4.531 | 120.7 | 120.7 | -1 | 0 | 27.8k | 4.8k | 25.7 | 18 | 15 | -3 |
| ETS [[33](https://arxiv.org/html/2606.18539#bib.bib102)] | 1.044 | 0.730 | 126.7 | 4.708 | 125.7 | 122.9 | 0 | 0 | 28.2k | 2.4k | 23.5 | 19 | 6 | -13 |
| *Linear / lightweight* | | | | | | | | | | | | | | |
| DLinear [[81](https://arxiv.org/html/2606.18539#bib.bib92)] | 0.562 | 0.506 | 66.85 | 3.338 | 66.29 | 197.3 | 7 | 1 | 25.4k | 1.8k | 23.5 | 7 | 20 | +13 |
| NLinear [[81](https://arxiv.org/html/2606.18539#bib.bib92)] | 0.540 | 0.521 | 72.23 | 3.397 | 71.69 | 142.1 | 15 | 1 | 27.3k | 2.5k | 23.7 | 5 | 9 | +4 |
| N-BEATS [[57](https://arxiv.org/html/2606.18539#bib.bib93)] | 0.449 | 0.458 | 21.59 | 1.670 | 21.14 | 54.6 | 6 | 0 | 15.3k | 465 | 20.1 | 1 | 7 | +6 |
| *Recurrent / conv.* | | | | | | | | | | | | | | |
| LSTM [[28](https://arxiv.org/html/2606.18539#bib.bib94)] | 0.733 | 0.638 | 0.780 | 0.661 | 0.047 | 1.07 | 0 | 0 | 9 | 8 | 1.0 | 14 | 2 | -12 |
| GRU [[14](https://arxiv.org/html/2606.18539#bib.bib95)] | 0.680 | 0.607 | 0.775 | 0.651 | 0.095 | 1.14 | 0 | 0 | 26 | 23 | 1.0 | 12 | 3 | -9 |
| TCN [[6](https://arxiv.org/html/2606.18539#bib.bib96)] | 0.874 | 0.710 | 7.389 | 1.300 | 6.515 | 7.90 | 0 | 0 | 900 | 57 | 8.1 | 16 | 1 | -15 |
| *Decomposition Transf.* | | | | | | | | | | | | | | |
| Autoformer [[76](https://arxiv.org/html/2606.18539#bib.bib16)] | 0.822 | 0.658 | 24.19 | 2.225 | 23.37 | 48.0 | 4 | 0 | 6.6k | 388 | 16.4 | 15 | 8 | -7 |
| FEDformer [[85](https://arxiv.org/html/2606.18539#bib.bib97)] | 1.072 | 0.793 | 24.29 | 2.311 | 23.22 | 22.6 | 3 | 0 | 5.1k | 270 | 15.6 | 20 | 5 | -15 |
| *Attention / SOTA* | | | | | | | | | | | | | | |
| PatchTST [[55](https://arxiv.org/html/2606.18539#bib.bib17)] | 0.551 | 0.500 | 86.63 | 3.755 | 86.08 | 262.6 | 13 | 1 | 37.3k | 2.9k | 24.2 | 6 | 14 | +8 |
| iTransformer [[46](https://arxiv.org/html/2606.18539#bib.bib98)] | 0.529 | 0.495 | 83.16 | 3.668 | 82.63 | 272.3 | 11 | 2 | 38.0k | 3.4k | 24.4 | 3 | 21 | +18 |
| TimeXer [[73](https://arxiv.org/html/2606.18539#bib.bib108)] | 0.537 | 0.495 | 87.98 | 3.774 | 87.44 | 301.2 | 9 | 0 | 39.7k | 3.1k | 24.0 | 4 | 16 | +12 |
| TimeMixer [[72](https://arxiv.org/html/2606.18539#bib.bib99)] | 0.590 | 0.518 | 90.08 | 3.842 | 89.49 | 264.2 | 7 | 1 | 38.9k | 2.8k | 24.6 | 9 | 19 | +10 |
| TimesNet [[75](https://arxiv.org/html/2606.18539#bib.bib70)] | 0.571 | 0.528 | 53.34 | 2.882 | 52.77 | 141.7 | 8 | 1 | 19.9k | 1.6k | 22.5 | 8 | 12 | +4 |
| Nonstationary Transf. [[47](https://arxiv.org/html/2606.18539#bib.bib100)] | 0.608 | 0.550 | 55.16 | 2.928 | 54.55 | 101.1 | 18 | 3 | 20.0k | 1.9k | 24.4 | 10 | 11 | +1 |
| *Foundation, zero-shot* | | | | | | | | | | | | | | |
| TimesFM [[17](https://arxiv.org/html/2606.18539#bib.bib36)] | 0.516 | 0.436 | 162.7 | 4.887 | 162.2 | 555.2 | 8 | 1 | 87.2k | 10.2k | 27.9 | 2 | 18 | +16 |
| Chronos [[4](https://arxiv.org/html/2606.18539#bib.bib35)] | 0.613 | 0.499 | 165.6 | 5.068 | 165.0 | 512.5 | 3 | 0 | 69.0k | 5.8k | 25.5 | 11 | 17 | +6 |
| Moirai [[74](https://arxiv.org/html/2606.18539#bib.bib37)] | 0.682 | 0.538 | 153.1 | 4.954 | 152.4 | 365.6 | 2 | 0 | 45.8k | 4.9k | 25.7 | 13 | 10 | -3 |

### V-B Evaluated Models

We evaluate $21$ forecasting models grouped into six methodological categories (Table [V](https://arxiv.org/html/2606.18539#S5.SS1)), from classical statistics to pretrained foundation models. These categories cover fundamentally different mechanisms: autoregressive and exponential-smoothing statistics; linear and basis-expansion maps; recurrent and convolutional sequence models; decomposition-based and modern attention Transformers; and the more recent trend of large time series foundation models. The first five categories are fit to each dataset under a common configuration (Sec. [V-C](https://arxiv.org/html/2606.18539#S5.SS3)): the deep models are trained with a shared optimizer recipe and the statistical models are fit per window. The time series foundation models are evaluated strictly zero-shot with their released pretrained weights and are never fine-tuned on our data, so their performance on both clean and faulted inputs reflects the general capability acquired during pretraining.

### V-C Protocol

Training. Trained models follow one shared recipe to keep comparisons fair: at most $10$ epochs with early stopping (patience $3$), the Adam optimizer, learning rate $10^{-4}$, MSE loss, and batch size $32$ (reduced to $16$ on the $321$-channel Electricity). Statistical models are fit directly on each window and foundation models perform zero-shot inference.
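The shared recipe can be written down as a small configuration object. This is a sketch only: the class and function names are hypothetical, the batch-size rule is keyed on the channel count purely for illustration, and the early-stopping helper assumes that "patience 3" means stopping after three consecutive epochs without a new best validation loss.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainRecipe:
    max_epochs: int = 10
    patience: int = 3
    optimizer: str = "adam"
    lr: float = 1e-4
    loss: str = "mse"
    batch_size: int = 32

def recipe_for(n_channels: int) -> TrainRecipe:
    # Batch size drops to 16 on the 321-channel Electricity dataset.
    return TrainRecipe(batch_size=16) if n_channels == 321 else TrainRecipe()

def epochs_run(val_losses, recipe: TrainRecipe) -> int:
    """Number of epochs actually trained under early stopping."""
    best, bad = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:recipe.max_epochs], start=1):
        if loss < best:
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= recipe.patience:
                return epoch
    return min(len(val_losses), recipe.max_epochs)
```

For example, a validation curve that stops improving after the second epoch ends training at epoch five, and a curve that keeps improving is cut off at the ten-epoch cap.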

Difficulty. Each mode is instantiated at five increasing difficulty levels, from $s_{1}$ (mildest) to $s_{5}$ (most severe), obtained by sweeping the mode's scenario parameters $\Theta$ along the difficulty map $\kappa$ (Sec. [III-B](https://arxiv.org/html/2606.18539#S3.SS2)). These correspond to the released configurations $d_{02},d_{04},d_{06},d_{08},d_{10}$.

Paired clean/corrupt evaluation. The benchmark is organized into $6$ datasets $\times$ $4$ modes $\times$ $5$ difficulties $=120$ configurations, each including $20$ paired windows. A pair is a clean context $X$ and its faulted counterpart $\tilde{X}=\mathcal{T}_{\Theta}(X)$, both forecast by the same model and compared against the same target. This pairing separates the two quantities we focus on: the clean error measures a model's intrinsic accuracy, while the increase in error from clean to faulted input measures its sensitivity to faults.
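The grid and the paired comparison can be sketched as follows. The dataset identifiers are placeholders (the six datasets are listed in Table IV), and `paired_errors` is a hypothetical helper that scores one clean/faulted pair with MSE against the shared target.

```python
import numpy as np
from itertools import product

DATASETS = [f"dataset_{i}" for i in range(1, 7)]   # placeholder names
MODES = ["I", "II", "III", "IV"]
LEVELS = ["d02", "d04", "d06", "d08", "d10"]
PAIRS_PER_CONFIG = 20

CONFIGS = list(product(DATASETS, MODES, LEVELS))    # 6 x 4 x 5 = 120 configurations

def paired_errors(model, X, X_faulted, Y):
    """MSE of the same model on the clean and the faulted context, same target."""
    clean = float(np.mean((model(X) - Y) ** 2))
    faulted = float(np.mean((model(X_faulted) - Y) ** 2))
    return clean, faulted

# Toy usage with a last-value forecaster on a length-336 history, length-96 target.
naive = lambda X: np.full(96, X[-1])
X = np.zeros(336)
Y = np.zeros(96)
X_faulted = X.copy()
X_faulted[-1] = 1.0                                 # a fault on the last observation
clean_err, faulted_err = paired_errors(naive, X, X_faulted, Y)
```

In this toy case the clean error is zero and the entire faulted error is attributable to the injected fault, which is the separation the paired protocol is designed to give.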

### V-D Metrics

For a model $f$ and a fault mode $\mathcal{F}_{i}$, let $L(f)=\mathbb{E}[\ell(f(X),Y)]$ be its clean error on the non-corrupted data and $\mathrm{AVG}(f,\mathcal{F}_{i})$ the average faulted error (Eq. ([4](https://arxiv.org/html/2606.18539#S3.E4))). We use the standard squared-error loss to evaluate forecasting accuracy [[32](https://arxiv.org/html/2606.18539#bib.bib54), [23](https://arxiv.org/html/2606.18539#bib.bib56)]. We also report the absolute degradation $\Delta\mathrm{MSE}_{i}(f)=\mathrm{AVG}(f,\mathcal{F}_{i})-L(f)$, the robustness ratio $r_{i}(f)=\mathrm{AVG}(f,\mathcal{F}_{i})/L(f)$, and the relative degradation $\mathrm{RD}_{i}(f)=\big(r_{i}(f)-1\big)\times 100\%$. A ratio $r_{i}=1$ denotes perfect robustness. We treat any cell with $r_{i}\geq 10$ as a catastrophic failure: a full order-of-magnitude error inflation, large enough to flip a downstream operational decision. A model's aggregated ratio averages its within-dataset ratio across the six datasets, so that high-magnitude datasets do not dominate the summary. Because $r$ is a ratio, it is sensitive to the scale of the clean error. A model with a very small $L(f)$ can post a large $r$ from only a modest absolute change, so two models with nearly identical $\Delta\mathrm{MSE}$ can differ widely in $r$ (e.g., N-BEATS vs. FEDformer in Table [V](https://arxiv.org/html/2606.18539#S5.SS1)). We therefore report $\Delta\mathrm{MSE}$, $r$, and $\mathrm{RD}$ side by side.
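The three degradation metrics and the catastrophic-failure flag follow directly from the definitions above. The sketch below uses hypothetical function names and illustrative error values rather than numbers from the benchmark.

```python
def robustness_metrics(clean_err, faulted_err, catastrophic_at=10.0):
    """Absolute degradation, robustness ratio, relative degradation (%), and flag."""
    delta_mse = faulted_err - clean_err
    r = faulted_err / clean_err
    rd_percent = (r - 1.0) * 100.0
    return {
        "delta_mse": delta_mse,
        "r": r,
        "rd_percent": rd_percent,
        "catastrophic": r >= catastrophic_at,
    }

def aggregated_ratio(clean_by_dataset, faulted_by_dataset):
    """Mean of within-dataset ratios, so large-magnitude datasets do not dominate."""
    ratios = [f / c for c, f in zip(clean_by_dataset, faulted_by_dataset)]
    return sum(ratios) / len(ratios)

# Illustrative values only: the same absolute degradation gives very different ratios.
small_clean = robustness_metrics(clean_err=0.5, faulted_err=6.0)
large_clean = robustness_metrics(clean_err=5.0, faulted_err=10.5)
```

Both toy models degrade by $5.5$ in absolute terms, yet the first has $r=12$ (catastrophic) and the second $r=2.1$, which is why the three quantities are reported together.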

The central metric is the Spearman rank correlation $\rho$ between a model's clean-accuracy ranking and its faulted (robustness) ranking, computed globally and per mode. The accompanying two-sided $p$-value tests the null hypothesis of no rank association ($\rho=0$); if $p<0.05$, the correlation is regarded as statistically significant, meaning that the observed preservation or reversal of rankings is unlikely to arise from random ordering alone. Here $\rho\to 1$ means the clean leaderboard predicts deployed robustness, whereas $\rho\to 0$ means it does not, which directly quantifies the model-selection risk that motivates this work. To reduce sensitivity to a few diverging configurations, clean rankings are formed from the mean clean error and robustness rankings from the median relative degradation $\mathrm{RD}_{i}$. The mean faulted error is dominated by a handful of extreme configurations, so we rank by the median, which is the more faithful summary of typical robustness. Unlike $r$, the median over configurations is also robust to the few inflated ratios a small clean error can produce.

### V-E Reproducibility

TS-Fault is designed to be fully reproducible. We release: (i) the parameterized fault generators for all four modes, together with their scenario-parameter schemas $\Theta$ and the unified window-importance mechanism; (ii) the generated benchmark instances, i.e., paired clean/corrupt windows with their targets and exposed $\Theta$ and difficulty $\delta$; (iii) the evaluation and per-model configurations; and (iv) trained checkpoints. The code is available at [github.com/Ray-zyy/TS-Fault](https://github.com/Ray-zyy/TS-Fault). Because faulted instances are produced by an explicit operator at evaluation time, the benchmark can be regenerated at any difficulty, and previously unexposed $\Theta$ combinations can be held out to guard against benchmark overfitting in the future.

## VI Results and Analysis

We organize our findings around six research questions, moving from the headline model\-selection result \(RQ1\) to the validity checks on difficulty and data dependence \(RQ5–RQ6\)\.

Overall Performance Summary. Table [V](https://arxiv.org/html/2606.18539#S5.SS1) summarizes results for all 21 models, reporting clean and faulted forecasting performance alongside key robustness metrics, including degradation, robustness ratio, per-mode degradation, difficulty slope, and rank shifts. It highlights three findings. First, the $R_{\mathrm{cln}}$ and $R_{\mathrm{rob}}$ columns are near mirror images: the most robust models, GRU, LSTM, and TCN, are only mid-pack on clean-data accuracy, whereas clean-accuracy leaders such as TimesFM, iTransformer, and TimeXer sink toward the bottom of the robustness order, and N-BEATS drops from 1st to 7th. Second, the per-mode columns stay small under the observation-level Modes I/II ($\leq 18\%$) yet explode to tens of thousands of percent under the mechanism-level Modes III/IV. Third, the three foundation models (the bottom three rows) pair competitive clean-data accuracy (TimesFM ranks second in MSE and first in MAE) with the worst faulted error, $\Delta\mathrm{MSE}$, robustness ratio, and Mode III/IV degradation in the table.

### VI-A RQ1: Does Clean-Data Accuracy Predict Robustness?

No. The Spearman correlation between a model's clean-accuracy rank and its robustness rank is negative and significant, $\rho=-0.544$ ($p=0.011$) across all $21$ models; restricted to the $18$ non-foundation models it is $-0.509$ ($p=0.031$), so adding pretrained models strengthens rather than dilutes the effect. The dislocations are large: iTransformer, the third-most-accurate model on clean data, ranks dead last ($21$st) in robustness, a shift of $+18$ ranks, the largest in the study, and the foundation model TimesFM, second on clean-data accuracy, falls to $18$th ($+16$). Conversely, TCN, LSTM, and GRU (clean-accuracy ranks $16$/$14$/$12$) occupy the top three robustness ranks.
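As a worked check, the global correlation can be recomputed from the two rank columns of the aggregate results table (the lists below are those columns as read from the table, in row order from Naive to Moirai). Since both columns are permutations of $1,\dots,21$ with no ties, the closed form $\rho=1-6\sum d^{2}/\big(n(n^{2}-1)\big)$ applies; the function name is hypothetical.

```python
# Clean-accuracy and robustness ranks, in the row order of the results table.
r_cln = [21, 17, 18, 19, 7, 5, 1, 14, 12, 16, 15, 20, 6, 3, 4, 9, 8, 10, 2, 11, 13]
r_rob = [4, 13, 15, 6, 20, 9, 7, 2, 3, 1, 8, 5, 14, 21, 16, 19, 12, 11, 18, 17, 10]

def spearman_from_ranks(a, b):
    """Spearman rho for two tie-free rankings of the same items."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

rho = spearman_from_ranks(r_cln, r_rob)
```

The squared rank differences sum to $2378$, giving $\rho\approx-0.544$, in agreement with the reported value.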

Figure [5](https://arxiv.org/html/2606.18539#S5.F5) traces these crossings, and the $R_{\mathrm{cln}}$, $R_{\mathrm{rob}}$, and $\Delta\mathrm{rank}$ columns of Table [V](https://arxiv.org/html/2606.18539#S5.SS1) list them numerically. These robustness ranks use the median relative degradation (Sec. [V-D](https://arxiv.org/html/2606.18539#S5.SS4)); this differs from a ranking by mean faulted MSE, under which the foundation models, whose few extreme configurations inflate the mean, rank worst instead. Both orderings deliver the same verdict: clean-data accuracy does not predict robustness. The practical consequence is the model-selection risk that motivates this work: choosing a forecaster by clean MSE, today's default, systematically favors the architecture most likely to fail once structured faults appear in deployment.

### VI-B RQ2: Is Failure Impact Uniform or Stratified?

Sharply stratified, and this stratification is the methodological payoff of TS-Fault. Table [VII](https://arxiv.org/html/2606.18539#S6.T7) reports the clean-vs-faulted rank correlation per mode. Under the two observation-level modes, the clean ranking is almost perfectly preserved (Mode I $\rho=0.925$, Mode II $\rho=0.952$, both $p<0.001$); under the two mechanism-level modes, it is essentially obliterated (Mode III $\rho=0.032$, $p=0.889$; Mode IV $\rho=0.055$, $p=0.814$). Figure [6](https://arxiv.org/html/2606.18539#S6.F6) shows the same split visually: the Mode I/II panels hug the identity diagonal, while the Mode III/IV panels scatter at random. The same stratification is already visible in Table [V](https://arxiv.org/html/2606.18539#S5.SS1): the Mode I and Mode II columns stay at or below $18\%$ for every model, whereas the Mode III and Mode IV columns run from hundreds to tens of thousands of percent for most models. Adding the foundation models again pushes the mechanism-level correlations further toward zero (from $0.21$/$0.19$ at $18$ models to $0.03$/$0.06$). The reading is direct and actionable. Under observation-level faults (a localized event or a covertly fractured dependency), clean MSE remains a sound selection proxy. Under mechanism-level faults (a regime switch coupled with missingness, or a fault cascading along a sensing chain), clean MSE carries essentially no information about which model will survive. Because TS-Fault exposes which mode produced a degradation, it can attribute fragility to a named mechanism rather than reporting an undifferentiated aggregate. Table [VI](https://arxiv.org/html/2606.18539#S6.T6) lists the per-mode median degradation for all $21$ models, sorted by Mode III: Modes I/II remain $\leq 18\%$ for every architecture, whereas Mode III ranges from $9\%$ (LSTM) to $87{,}000\%$ (TimesFM) and Mode IV from $8\%$ to $10{,}000\%$, a three-to-four-order-of-magnitude gap between the two halves of the taxonomy.

TABLE VI: Median relative degradation (%) by fault mode for all $21$ models, sorted by Mode III ($k=10^{3}$). Observation-level Modes I/II stay $\leq 18\%$ for every architecture; mechanism-level Modes III/IV explode by three to four orders of magnitude. Foundation rows shaded.

![Refer to caption](https://arxiv.org/html/2606.18539v1/x6.png)
Figure 6: Clean rank vs. faulted rank, per mode. The dashed diagonal denotes a perfectly preserved ranking. Modes I/II hug the diagonal ($\rho>0.92$); Modes III/IV scatter almost completely ($\rho\approx 0.03$/$0.06$). Gold points are foundation models, concentrated in the high-fragility region under Modes III/IV.

TABLE VII: Spearman $\rho$ between clean and faulted model rankings, per mode. Observation-level faults preserve the ranking; mechanism-level faults destroy it.

### VI-C RQ3: Where Do Catastrophic Failures Concentrate?

Entirely in the mechanism-level modes. Counting every cell whose error inflates by at least an order of magnitude ($r\geq 10$), we find $884$ catastrophic failures across the grid, distributed with striking asymmetry: $0$ in Mode I, $0$ in Mode II, $537$ in Mode III ($85.9\%$ of its $625$ configurations), and $347$ in Mode IV ($55.5\%$), so Modes III/IV account for $100\%$ of all catastrophic failures. Figure [7](https://arxiv.org/html/2606.18539#S6.F7) explains why: the observation-level modes degrade by a median of $\approx 3.7\%$ (Mode I) and $\approx 0.1\%$ (Mode II) with tight spread, whereas the mechanism-level modes degrade by medians of $\approx 1.8\times 10^{4}\%$ (Mode III) and $\approx 1.4\times 10^{3}\%$ (Mode IV) and are enormously dispersed: a single model can degrade by $100\%$ or by $100{,}000\%$ depending on the configuration. That dispersion is itself a risk: a model's behavior under mechanism-level faults cannot be estimated in advance. By model, the catastrophic count is led by TimesFM ($53$) and the attention/foundation cluster (TimeXer, iTransformer, TimeMixer, Chronos, Moirai, $52$ each), against only $17$ for TCN.

### VI-D RQ4: Are Pretrained Foundation Models More Robust?

The opposite: they are strong but fragile. On clean data, the three foundation models are competitive: TimesFM attains MSE $0.516$ (second overall) and MAE $0.436$ (first), with Chronos ($0.613$) and Moirai ($0.682$) ahead of many specialists. Yet their robustness ratios, $555$/$512$/$366$, are the three worst of all $21$ models, and their mean faulted errors ($163$/$166$/$153$) are the three highest. The verdict does not hinge on the ratio: in *absolute* terms their degradation $\Delta\mathrm{MSE}$ ($162$/$165$/$152$) is also the largest in the table, so the fragility is not an artifact of dividing by a small clean error. The gap is widest exactly where it matters: under Mode III the median degradation reaches $87{,}000\%$ for TimesFM, $69{,}000\%$ for Chronos, and $46{,}000\%$ for Moirai, against $40{,}000\%$ for the worst specialist (TimeXer); on the $321$-channel Electricity dataset TimesFM's ratio peaks at $2090$. Figure [8](https://arxiv.org/html/2606.18539#S6.F8) portrays the two faces side by side. A plausible mechanism is that zero-shot models rely on strong learned priors [[22](https://arxiv.org/html/2606.18539#bib.bib86), [16](https://arxiv.org/html/2606.18539#bib.bib87)] (typical periodicity, smoothness, cross-channel co-movement) and, lacking any online adaptation [[38](https://arxiv.org/html/2606.18539#bib.bib33)] to the current series, cannot recover when Modes III/IV rewrite those very structures; the prior that buys clean-data accuracy then amplifies the error. Pretraining thus delivers accuracy, not fault robustness, and including these models is precisely what strengthens the global No-Free-Lunch correlation from $-0.509$ to $-0.544$.

![Refer to caption](https://arxiv.org/html/2606.18539v1/x7.png) Figure 7: Per-mode degradation distributions (log $y$; box = IQR, points = single configuration). Modes I/II are low and tightly concentrated; Modes III/IV are higher by several orders of magnitude and far more dispersed.

![Refer to caption](https://arxiv.org/html/2606.18539v1/x8.png) Figure 8: Clean MSE and robustness ratio across the 21 models. Gold stars mark foundation models.

### VI-E RQ5: Is Difficulty Controllable and Monotonic?

Yes. Degradation rises monotonically from difficulty $s_{1}$ to $s_{5}$ for every mode (Figure [9](https://arxiv.org/html/2606.18539#S6.F9)), confirming that the difficulty map $\kappa$ behaves as designed and supports graded robustness curves rather than single-point comparisons. The slope $d_{10}/d_{02}$ also separates three clear sensitivity tiers: low ($\approx 1.0$) for GRU and LSTM; mid ($\approx 8$–$20$) for TCN, FEDformer, Autoformer, and N-BEATS; and high ($\approx 22$–$28$) for the attention SOTA, the linear models (DLinear, NLinear), the statistical models, and all three foundation models, with TimesFM the steepest at 27.9. The tiers track architecture: recurrent hidden-state smoothing dilutes an injected fault and trend–season decomposition is largely immune to aggregate statistic change, whereas per-position attention and frozen zero-shot priors transmit difficulty into error almost linearly. Table [VIII](https://arxiv.org/html/2606.18539#S6.T8) reports the per-level mean error for all 21 models (averaged over the six datasets and four modes): error grows with severity across essentially every row, and the same three tiers are visible there.
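The slope and tier assignment can be sketched as follows. The sketch reads $d_{02}$ and $d_{10}$ as the mean faulted error at the lowest and highest difficulty level, which is an interpretation of the notation rather than something this section defines. The tier cut points (2 and 21) are illustrative values placed between the reported bands, and the error curves are made up.

```python
def severity_slope(errors_by_level):
    """Ratio of the mean faulted error at the hardest vs. easiest level.

    `errors_by_level` is a list ordered from s1 (easiest) to s5 (hardest).
    """
    return errors_by_level[-1] / errors_by_level[0]


def sensitivity_tier(slope):
    # Illustrative cut points; the text reports ~1.0, ~8-20 and ~22-28.
    if slope < 2.0:
        return "low"
    if slope < 21.0:
        return "mid"
    return "high"


def is_monotone(errors_by_level):
    """True if the error never decreases as difficulty increases."""
    return all(a <= b for a, b in zip(errors_by_level, errors_by_level[1:]))


# Made-up error curves, one per hypothetical model.
toy_curves = {
    "flat_model": [2.0, 2.0, 2.0, 2.0, 2.0],
    "mid_model": [1.0, 2.0, 4.0, 8.0, 12.0],
    "steep_model": [1.0, 4.0, 8.0, 16.0, 25.0],
}
tiers = {
    name: sensitivity_tier(severity_slope(curve))
    for name, curve in toy_curves.items()
}
all_monotone = all(is_monotone(curve) for curve in toy_curves.values())
```

A slope near 1 corresponds to a flat curve, so a model in the low tier is one whose error barely responds to severity at all.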

TABLE VIII: Mean faulted error $\mathrm{MSE_{cor}}$ at each difficulty level (averaged over 6 datasets $\times$ 4 modes), sorted by $d_{02}$. The slope $d_{10}/d_{02}$ summarizes sensitivity to severity and separates three tiers. Foundation rows shaded.

![Refer to caption](https://arxiv.org/html/2606.18539v1/x9.png) Figure 9: Degradation grows monotonically with difficulty for all four modes (median over 21 models; log $y$, shaded IQR). Mechanism-level Modes III/IV dominate observation-level Modes I/II by orders of magnitude at every level.

### VI-F RQ6: Do Data Characteristics Modulate Fragility?

Robustness is primarily a property of the architecture, but dimensionality sets the amplitude. Across the six datasets, the model ordering is highly consistent, so fragility is structural rather than dataset-specific. What dimensionality controls is difficulty: on the 321-channel Electricity dataset, whose dense cross-channel correlation Modes II/IV disrupt most, ratios reach their extremes (TimesFM 2090, TimeXer 1112), whereas the low-dimensional, milder Weather keeps most ratios in the 1–55 range; even on Electricity, GRU and LSTM remain near 1.2. Sampling granularity (hourly ETTh vs. 15-minute ETTm) barely changes the ordering: recurrent and convolutional models stay robust, and foundation models stay fragile across both. Table [IX](https://arxiv.org/html/2606.18539#S6.SS6) reports the full per-dataset breakdown for all 21 models (ARIMA omitted on the 321-channel Electricity). Every dataset preserves the same broad ordering, while the robustness ratio scales with dimensionality. Full per-dataset instances are additionally released with the benchmark artifacts.

TABLE IX: Per-dataset faulted error and robustness ratio for all 21 models, grouped by category. For each dataset we report the mean faulted error $\mathrm{MSE_{cor}}$ (averaged over 4 modes $\times$ 5 difficulties) and the robustness ratio $r=\mathrm{MSE_{cor}}/\mathrm{MSE_{clean}}$ ($r=1$ is perfect robustness). ARIMA is omitted on the 321-channel Electricity, where per-series fitting is impractical (“–”). Bold marks the best (lowest) $r$ per dataset.

Taken together, RQ1–RQ6 show that no model occupies the accurate-and-robust regime: the strongest clean forecasters are the most fragile, the most robust are only average on clean data, and the entire failure budget concentrates in the mechanism-level modes that clean-data leaderboards cannot see.

## VII Discussion

### VII-A Model-Selection Risk

Sections [VI-A](https://arxiv.org/html/2606.18539#S6.SS1) and [VI-B](https://arxiv.org/html/2606.18539#S6.SS2) together expose a concrete hazard we call model-selection risk. Clean MSE is a trustworthy selection criterion only under observation-level faults, where the clean and faulted rankings agree ($\rho>0.92$); under the mechanism-level faults that any real deployment eventually meets, the two rankings are uncorrelated (Modes III/IV, $\rho<0.06$) and, taken globally, anti-correlated ($\rho=-0.544$). Choosing a forecaster by its position on a clean leaderboard, the prevailing practice, can therefore be worse than choosing at random in a deployment dominated by regime change or cascading sensor faults, because the clean ranking actively points toward the most fragile architectures (Table [V-A](https://arxiv.org/html/2606.18539#S5.SS1)). The corollary is methodological: a model's value cannot be summarized by a single clean-accuracy number but should be reported as a joint claim over clean-data accuracy and structural robustness, stratified by mode and difficulty. This matters beyond any one deployment, because evaluation metrics steer research \[[60](https://arxiv.org/html/2606.18539#bib.bib83),[37](https://arxiv.org/html/2606.18539#bib.bib84)\]: a community that rewards only clean held-out error will keep producing architectures optimized for it, leaving the fault robustness that deployment requires unmeasured and unimproved.
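The ranking comparison can be illustrated with a rank correlation. The sketch below assumes $\rho$ denotes Spearman's rank correlation, which this section does not state explicitly, and implements it in plain Python (Pearson correlation of average ranks). All error values are made up to show the two extreme cases.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up errors for five hypothetical models (lower is better).
clean = [0.4, 0.5, 0.6, 0.7, 0.8]
faulted_observation = [0.41, 0.52, 0.61, 0.73, 0.80]  # ranking preserved
faulted_mechanism = [90.0, 60.0, 30.0, 5.0, 1.0]      # ranking reversed

rho_observation = spearman(clean, faulted_observation)
rho_mechanism = spearman(clean, faulted_mechanism)
```

In the reversed case the clean leaderboard is not merely uninformative: picking its top entry selects the model with the largest faulted error, which is the hazard the paragraph describes.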

### VII-B Guarding Against Goodhart

A benchmark that exposes its scenario parameters invites an obvious objection \[[51](https://arxiv.org/html/2606.18539#bib.bib59)\]: once $\Theta$ is public, models will be tuned to the values TS-Fault uses, and reported robustness rises without real improvement. The objection is valid but overlooks a defense unavailable to accuracy-centric benchmarks. Because $\Theta$ is compositional (Sec. [IV](https://arxiv.org/html/2606.18539#S4), Eq. ([19](https://arxiv.org/html/2606.18539#S4.E19))), a release can *hold out* parameter combinations: the parameterization is published while evaluation runs on combinations no participant has seen, and the space of held-out combinations grows with the dimensionality of $\Theta$. This is the time series analogue of the held-out corruptions used to keep ImageNet-C honest \[[52](https://arxiv.org/html/2606.18539#bib.bib106)\]. Fundamentally, the alternative to exposed parameters is not a benchmark immune to gaming but one whose gaming is invisible: an opaque test set does not prevent overfitting; it only hides it. Transparency about $\Theta$ is a feature, not a vulnerability.
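A held-out split over a compositional parameter space could look like the sketch below. It is only an illustration of the idea: the parameter names, their values, and the `split_combinations` helper are hypothetical, and the actual components of $\Theta$ are those defined in Sec. IV.

```python
import itertools
import random

# Hypothetical parameter grid; the real components of Theta are defined
# in Sec. IV of the paper and are not reproduced here.
theta_grid = {
    "fault_length": [8, 16, 32],
    "magnitude": [0.5, 1.0, 2.0],
    "onset": ["early", "late"],
}


def split_combinations(grid, held_out_fraction, seed):
    """Enumerate all parameter combinations and reserve a random subset.

    Returns (public, held_out); the two lists are disjoint and together
    cover the full Cartesian product of the grid.
    """
    names = sorted(grid)
    combos = [
        dict(zip(names, values))
        for values in itertools.product(*(grid[n] for n in names))
    ]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(combos)
    n_held_out = int(len(combos) * held_out_fraction)
    return combos[n_held_out:], combos[:n_held_out]


public, held_out = split_combinations(theta_grid, held_out_fraction=0.25, seed=0)
```

Because the number of combinations is the product of the per-parameter value counts, adding a parameter multiplies the pool from which held-out combinations can be drawn, which is the growth the paragraph refers to.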

### VII-C Relation to Adjacent Problems

TS-Fault borders several established problems but answers a different question in each case. *(i) Anomaly detection* \[[11](https://arxiv.org/html/2606.18539#bib.bib47),[63](https://arxiv.org/html/2606.18539#bib.bib63),[78](https://arxiv.org/html/2606.18539#bib.bib90),[5](https://arxiv.org/html/2606.18539#bib.bib64)\] asks whether a sample is anomalous; TS-Fault asks whether a forecaster still produces a trustworthy forecast when an anomaly is its input. The former treats the anomaly as a label, the latter as an operator acting on the input. *(ii) Change-point detection* \[[2](https://arxiv.org/html/2606.18539#bib.bib45),[36](https://arxiv.org/html/2606.18539#bib.bib68)\] locates where processes change; Modes III/IV invert this: we inject a change point at a known location and test whether it misleads the forecaster. *(iii) Concept drift* \[[79](https://arxiv.org/html/2606.18539#bib.bib31),[21](https://arxiv.org/html/2606.18539#bib.bib32)\] studies online adaptation as a distribution slowly shifts; TS-Fault is an offline protocol that measures robustness without retraining. The two are complementary. *(iv) Adversarial robustness* \[[48](https://arxiv.org/html/2606.18539#bib.bib67),[26](https://arxiv.org/html/2606.18539#bib.bib104),[45](https://arxiv.org/html/2606.18539#bib.bib27)\] seeks the worst-case perturbation inside an $\epsilon$-ball: a tight bound, but a physically meaningless one. TS-Fault replaces the worst-case search with semantically interpretable operators, trading “how much adversarial noise can a model absorb?” for “under which named failure mechanism does it break?” *(v) Distribution-shift benchmarks* \[[39](https://arxiv.org/html/2606.18539#bib.bib34),[79](https://arxiv.org/html/2606.18539#bib.bib31),[21](https://arxiv.org/html/2606.18539#bib.bib32)\] treat the train–test gap as a single static object; TS-Fault separates its structural subtypes (regime transition, dependency fracture, cascading propagation) and lets a single evaluation traverse them at controlled difficulty.

### VII-D Threats to Validity

*(i) Ecological validity:* our faults are constructed, not harvested from incident logs; we mitigate this by grounding each mode in documented failures \[[9](https://arxiv.org/html/2606.18539#bib.bib7),[20](https://arxiv.org/html/2606.18539#bib.bib71),[53](https://arxiv.org/html/2606.18539#bib.bib50)\] and, as in ImageNet-C \[[27](https://arxiv.org/html/2606.18539#bib.bib105)\], by locating validity in structural fidelity rather than historical origin (Sec. [II-C](https://arxiv.org/html/2606.18539#S2.SS3)), leaving large-scale field validation to future work. *(ii) Hyperparameters:* the weights $\lambda$ and $\beta$ are fixed per mode, yet degradation is monotone in difficulty (Sec. [VI-E](https://arxiv.org/html/2606.18539#S6.SS5)), and the mechanism-level collapse is so complete ($\rho\approx 0.03$) that no reweighting plausibly restores the ranking. *(iii) Scope:* the ordering is consistent across our six energy, load, and climate datasets (Sec. [VI-F](https://arxiv.org/html/2606.18539#S6.SS6)), suggesting that the conclusions are structural, though finance, traffic, and clinical domains remain untested. *(iv) Training budget:* a shared 10-epoch recipe keeps comparisons fair, and heavier tuning is unlikely to reverse a stratification that tracks architecture rather than fit.

## VIII Conclusion

Time series forecasting now informs decisions where errors cost money, time, and safety \[[70](https://arxiv.org/html/2606.18539#bib.bib52)\], yet its models are still ranked by a single number, the average error on clean held-out data, which can be misleading. To evaluate TSF methods more robustly, we introduce TS-Fault, a benchmark that replaces the clean test pair with a structured instance generated by an explicit fault operator and carrying semantic parameters $\Theta$ and a controllable difficulty $\delta$. We evaluated 21 forecasting models across 6 datasets, 4 fault modes, and 5 difficulties under a paired clean/corrupt protocol, and reached five findings. *(i)* Clean-data accuracy and structural robustness are anti-correlated ($\rho=-0.544$, $p=0.011$), and the strongest clean models are among the most fragile. *(ii)* Basic recurrent and convolutional models, namely LSTM and GRU (ratios $\approx 1$) and TCN (7.9), are the most robust by a wide margin, against ratios from the tens to several hundred for every other family, despite only average clean-data accuracy. *(iii)* Failure impact is stratified: observation-level faults preserve the clean ranking ($\rho>0.92$) while mechanism-level faults destroy it ($\rho<0.06$). *(iv)* All 884 catastrophic failures fall in the two mechanism-level modes. *(v)* Pretrained foundation models are the extreme case of strong-but-fragile: top-tier on clean data, but worst of all on robustness. Overall, TS-Fault offers a new perspective on evaluating time series forecasting methods, in particular the robustness of the prediction models. It may also open research directions toward TSF methods that generalize better in real-world applications.

## References

- \[1\]\(2024\)Gift\-eval: a benchmark for general time series forecasting model evaluation\.NeurIPS Workshop\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p1.1),[§II\-A](https://arxiv.org/html/2606.18539#S2.SS1.p1.1)\.
- \[2\]S\. Aminikhanghahi and D\. J\. Cook\(2017\)A survey of methods for time series change point detection\.KAIS51\(2\),pp\. 339–367\.Cited by:[§III\-C](https://arxiv.org/html/2606.18539#S3.SS3.p2.7),[§VII\-C](https://arxiv.org/html/2606.18539#S7.SS3.p1.1)\.
- \[3\]G\. Andersson, P\. Donalek, R\. Farmer, N\. Hatziargyriou, I\. Kamwa, P\. Kundur, N\. Martins, J\. Paserba, P\. Pourbeik, J\. Sanchez\-Gasca,et al\.\(2005\)Causes of the 2003 major grid blackouts in north america and europe, and recommended means to improve system dynamic performance\.T\-PWRS20\(4\),pp\. 1922–1928\.Cited by:[§IV](https://arxiv.org/html/2606.18539#S4.p1.1)\.
- \[4\]A\. F\. Ansari, L\. Stella, C\. Turkmen, X\. Zhang, P\. Mercado, H\. Shen, O\. Shchur, S\. S\. Rangapuram, S\. P\. Arango, S\. Kapoor,et al\.\(2024\)Chronos: learning the language of time series\.TMLR\.External Links:ISSN 2835\-8856Cited by:[§II\-B](https://arxiv.org/html/2606.18539#S2.SS2.p1.1),[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.33.27.1)\.
- \[5\]J\. Audibert, P\. Michiardi, F\. Guyard, S\. Marti, and M\. A\. Zuluaga\(2020\)Usad: unsupervised anomaly detection on multivariate time series\.InSIGKDD,pp\. 3395–3404\.Cited by:[§VII\-C](https://arxiv.org/html/2606.18539#S7.SS3.p1.1)\.
- \[6\]S\. Bai, J\. Z\. Kolter, and V\. Koltun\(2018\)An empirical evaluation of generic convolutional and recurrent networks for sequence modeling\.arXiv preprint arXiv:1803\.01271\.Cited by:[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.20.14.1)\.
- \[7\]G\. E\. Box, G\. M\. Jenkins, G\. C\. Reinsel, and G\. M\. Ljung\(2015\)Time series analysis: forecasting and control\.John Wiley & Sons\.Cited by:[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.11.5.1)\.
- \[8\]S\. V\. Buldyrev, R\. Parshani, G\. Paul, H\. E\. Stanley, and S\. Havlin\(2010\)Catastrophic cascade of failures in interdependent networks\.Nature464\(7291\),pp\. 1025–1028\.Cited by:[§IV\-D](https://arxiv.org/html/2606.18539#S4.SS4.p1.1)\.
- \[9\]J\. W\. Busby, K\. Baker, M\. D\. Bazilian, A\. Q\. Gilbert, E\. Grubert, V\. Rai, J\. D\. Rhodes, S\. Shidore, C\. A\. Smith, and M\. E\. Webber\(2021\)Cascading risks: understanding the 2021 winter blackout in texas\.Energy Res\. Social Sci\.77\(1\),pp\. 102106\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p3.1),[§IV\-C](https://arxiv.org/html/2606.18539#S4.SS3.p1.1),[§IV](https://arxiv.org/html/2606.18539#S4.p1.1),[§VII\-D](https://arxiv.org/html/2606.18539#S7.SS4.p1.4)\.
- \[10\]W\. Cao, D\. Wang, J\. Li, H\. Zhou, L\. Li, and Y\. Li\(2018\)Brits: bidirectional recurrent imputation for time series\.NeurIPS31,pp\. 6776–6786\.Cited by:[§II\-C](https://arxiv.org/html/2606.18539#S2.SS3.p1.1),[TABLE II](https://arxiv.org/html/2606.18539#S2.T2.1.1.3.1.1)\.
- \[11\]V\. Chandola, A\. Banerjee, and V\. Kumar\(2009\)Anomaly detection: a survey\.CSUR41\(3\),pp\. 1–58\.Cited by:[§VII\-C](https://arxiv.org/html/2606.18539#S7.SS3.p1.1)\.
- \[12\]S\. Chen, C\. Li, S\. O\. Arik, N\. C\. Yoder, and T\. Pfister\(2023\)TSMixer: an all\-MLP architecture for time series forecast\-ing\.TMLR\.External Links:ISSN 2835\-8856Cited by:[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.p1.6)\.
- \[13\]H\. Cheng, Q\. Wen, Y\. Liu, and L\. Sun\(2024\)RobustTSF: towards theory and design of robust time series forecasting with anomalies\.InICLR,pp\. 5787–5813\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p3.1),[§II\-C](https://arxiv.org/html/2606.18539#S2.SS3.p1.1),[TABLE II](https://arxiv.org/html/2606.18539#S2.T2.1.1.3.1.1)\.
- \[14\]K\. Cho, B\. Van Merriënboer, Ç\. Gulçehre, D\. Bahdanau, F\. Bougares, H\. Schwenk, and Y\. Bengio\(2014\)Learning phrase representations using rnn encoder–decoder for statistical machine translation\.InEMNLP,pp\. 1724–1734\.Cited by:[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.19.13.1)\.
- \[15\]N\. A\. E\. R\. Corporation\(2018\)Reliability standards for the bulk electric systems of north america\.US Atlanta\.Cited by:[§IV\-D](https://arxiv.org/html/2606.18539#S4.SS4.p1.1)\.
- \[16\]A\. D’Amour, K\. Heller, D\. Moldovan, B\. Adlam, B\. Alipanahi, A\. Beutel, C\. Chen, J\. Deaton, J\. Eisenstein, M\. D\. Hoffman,et al\.\(2022\)Underspecification presents challenges for credibility in modern machine learning\.JMLR23\(226\),pp\. 1–61\.Cited by:[§VI\-D](https://arxiv.org/html/2606.18539#S6.SS4.p1.23)\.
- \[17\]A\. Das, W\. Kong, R\. Sen, and Y\. Zhou\(2024\)A decoder\-only foundation model for time\-series forecasting\.ICML\.Cited by:[§II\-B](https://arxiv.org/html/2606.18539#S2.SS2.p1.1),[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.32.26.1)\.
- \[18\]I\. Dobson, B\. A\. Carreras, V\. E\. Lynch, and D\. E\. Newman\(2007\)Complex systems analysis of series of blackouts: cascading failure, critical points, and self\-organization\.Chaos17\(2\)\.Cited by:[§IV\-D](https://arxiv.org/html/2606.18539#S4.SS4.p1.1)\.
- \[19\]W\. Du, D\. Côté, and Y\. Liu\(2023\)Saits: self\-attention\-based imputation for time series\.ESWA219,pp\. 119619\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p3.1),[§II\-C](https://arxiv.org/html/2606.18539#S2.SS3.p1.1),[TABLE II](https://arxiv.org/html/2606.18539#S2.T2.1.1.3.1.1)\.
- \[20\]K\. J\. Forbes and R\. Rigobon\(2002\)No contagion, only interdependence: measuring stock market comovements\.J Finance57\(5\),pp\. 2223–2261\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p3.1),[§IV](https://arxiv.org/html/2606.18539#S4.p1.1),[§VII\-D](https://arxiv.org/html/2606.18539#S7.SS4.p1.4)\.
- \[21\]J\. Gagnon\-Audet, K\. Ahuja, M\. Darvishi\-Bayazi, P\. Mousavi, G\. Dumas, and I\. Rish\(2023\)Woods: benchmarks for out\-of\-distribution generalization in time series\.TMLR\.External Links:ISSN 2835\-8856Cited by:[§II\-C](https://arxiv.org/html/2606.18539#S2.SS3.p1.1),[TABLE II](https://arxiv.org/html/2606.18539#S2.T2.1.3.1.2.1.1),[§VII\-C](https://arxiv.org/html/2606.18539#S7.SS3.p1.1)\.
- \[22\]R\. Geirhos, J\. Jacobsen, C\. Michaelis, R\. Zemel, W\. Brendel, M\. Bethge, and F\. A\. Wichmann\(2020\)Shortcut learning in deep neural networks\.Nat\. Mach\. Intell\.2\(11\),pp\. 665–673\.Cited by:[§VI\-D](https://arxiv.org/html/2606.18539#S6.SS4.p1.23)\.
- \[23\]T\. Gneiting and A\. E\. Raftery\(2007\)Strictly proper scoring rules, prediction, and estimation\.J Am Stat Assoc\.102\(477\),pp\. 359–378\.Cited by:[§V\-D](https://arxiv.org/html/2606.18539#S5.SS4.p1.17)\.
- \[24\]R\. W\. Godahewa, C\. Bergmeir, G\. I\. Webb, R\. Hyndman, and P\. Montero\-Manso\(2021\)Monash time series forecasting archive\.InNeurIPS,Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p1.1),[§II\-A](https://arxiv.org/html/2606.18539#S2.SS1.p1.1)\.
- \[25\]S\. Golchin and M\. Surdeanu\(2024\)Time travel in llms: tracing data contamination in large language models\.ICLR\.Cited by:[§II\-B](https://arxiv.org/html/2606.18539#S2.SS2.p1.1)\.
- \[26\]I\. J\. Goodfellow, J\. Shlens, and C\. Szegedy\(2015\)Explaining and harnessing adversarial examples\.ICLR\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p3.1),[§II\-C](https://arxiv.org/html/2606.18539#S2.SS3.p1.1),[TABLE II](https://arxiv.org/html/2606.18539#S2.T2.1.1.1.1.1),[§VII\-C](https://arxiv.org/html/2606.18539#S7.SS3.p1.1)\.
- \[27\]D\. Hendrycks and T\. Dietterich\(2019\)Benchmarking neural network robustness to common corruptions and perturbations\.ICLR\.Cited by:[§II\-C](https://arxiv.org/html/2606.18539#S2.SS3.p1.1),[§VII\-D](https://arxiv.org/html/2606.18539#S7.SS4.p1.4)\.
- \[28\]S\. Hochreiter and J\. Schmidhuber\(1997\)Long short\-term memory\.Neural Comput\.9\(8\),pp\. 1735–1780\.Cited by:[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.18.12.1)\.
- \[29\]J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark,et al\.\(2022\)Training compute\-optimal large language models\.NeurIPS\.Cited by:[§II\-B](https://arxiv.org/html/2606.18539#S2.SS2.p1.1)\.
- \[30\]T\. Hong and S\. Fan\(2016\)Probabilistic electric load forecasting: a tutorial review\.Int\. J\. Forecast\.32\(3\),pp\. 914–938\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p1.1)\.
- \[31\]R\. J\. Hyndman and G\. Athanasopoulos\(2018\)Forecasting: principles and practice\.OTexts\.Cited by:[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.10.4.1),[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.9.3.1)\.
- \[32\]R\. J\. Hyndman and A\. B\. Koehler\(2006\)Another look at measures of forecast accuracy\.Int\. J\. Forecast\.22\(4\),pp\. 679–688\.Cited by:[§V\-D](https://arxiv.org/html/2606.18539#S5.SS4.p1.17)\.
- \[33\]R\. Hyndman, A\. Koehler, K\. Ord, and R\. Snyder\(2008\)Forecasting with exponential smoothing: the state space approach\.Springer\.Cited by:[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.12.6.1)\.
- \[34\]S\. K\. Jensen, T\. B\. Pedersen, and C\. Thomsen\(2018\)Modelardb: modular model\-based time series management with spark and cassandra\.PVLDB11\(11\),pp\. 1688–1701\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p1.1)\.
- \[35\]J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei\(2020\)Scaling laws for neural language models\.arXiv preprint arXiv:2001\.08361\.Cited by:[§II\-B](https://arxiv.org/html/2606.18539#S2.SS2.p1.1)\.
- \[36\]E\. Keogh, J\. Lin, and A\. Fu\(2005\)Hot sax: efficiently finding the most unusual time series subsequence\.InICDM,pp\. 8–pp\.Cited by:[§III\-C](https://arxiv.org/html/2606.18539#S3.SS3.p2.7),[§VII\-C](https://arxiv.org/html/2606.18539#S7.SS3.p1.1)\.
- \[37\]D\. Kiela, M\. Bartolo, Y\. Nie, D\. Kaushik, A\. Geiger, Z\. Wu, B\. Vidgen, G\. Prasad, A\. Singh, P\. Ringshia,et al\.\(2021\)Dynabench: rethinking benchmarking in nlp\.InNAACL,pp\. 4110–4124\.Cited by:[§VII\-A](https://arxiv.org/html/2606.18539#S7.SS1.p1.3)\.
- \[38\]T\. Kim, J\. Kim, Y\. Tae, C\. Park, J\. Choi, and J\. Choo\(2021\)Reversible instance normalization for accurate time\-series forecasting against distribution shift\.InICLR,Cited by:[§VI\-D](https://arxiv.org/html/2606.18539#S6.SS4.p1.23)\.
- \[39\]P\. W\. Koh, S\. Sagawa, H\. Marklund, S\. M\. Xie, M\. Zhang, A\. Balsubramani, W\. Hu, M\. Yasunaga, R\. L\. Phillips, I\. Gao,et al\.\(2021\)Wilds: a benchmark of in\-the\-wild distribution shifts\.InICML,pp\. 5637–5664\.Cited by:[§II\-C](https://arxiv.org/html/2606.18539#S2.SS3.p1.1),[TABLE II](https://arxiv.org/html/2606.18539#S2.T2.1.3.1.2.1.1),[§VII\-C](https://arxiv.org/html/2606.18539#S7.SS3.p1.1)\.
- \[40\]G\. Lai, W\. Chang, Y\. Yang, and H\. Liu\(2018\)Modeling long\-and short\-term temporal patterns with deep neural networks\.InSIGIR,pp\. 95–104\.Cited by:[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.p1.6)\.
- \[41\]C\. Li, Y\. Sun, and X\. Chen\(2007\)Analysis of the blackout in europe on november 4, 2006\.IPEC,pp\. 939–944\.Cited by:[§IV\-D](https://arxiv.org/html/2606.18539#S4.SS4.p1.1)\.
- \[42\]Z\. Li, X\. Qiu, P\. Chen, Y\. Wang, H\. Cheng, Y\. Shu, J\. Hu, C\. Guo, A\. Zhou, C\. S\. Jensen,et al\.\(2025\)Tsfm\-bench: a comprehensive and unified benchmark of foundation models for time series forecasting\.InSIGKDD,pp\. 5595–5606\.Cited by:[§II\-A](https://arxiv.org/html/2606.18539#S2.SS1.p1.1)\.
- \[43\]Y\. Liang, H\. Wen, Y\. Nie, Y\. Jiang, M\. Jin, D\. Song, S\. Pan, and Q\. Wen\(2024\)Foundation models for time series analysis: a tutorial and survey\.InSIGKDD,pp\. 6555–6565\.Cited by:[§II\-B](https://arxiv.org/html/2606.18539#S2.SS2.p1.1)\.
- \[44\]R\. J\. Little and D\. B\. Rubin\(2019\)Statistical analysis with missing data\.John Wiley & Sons\.Cited by:[§IV\-C](https://arxiv.org/html/2606.18539#S4.SS3.p1.1)\.
- \[45\]F\. Liu, H\. Liu, and W\. Jiang\(2022\)Practical adversarial attacks on spatiotemporal traffic forecasting models\.InNeurIPS,pp\. 19035–19047\.Cited by:[§VII\-C](https://arxiv.org/html/2606.18539#S7.SS3.p1.1)\.
- \[46\]Y\. Liu, T\. Hu, H\. Zhang, H\. Wu, S\. Wang, L\. Ma, and M\. Long\(2024\)Itransformer: inverted transformers are effective for time series forecasting\.InICLR,Vol\.2024,pp\. 11116–11140\.Cited by:[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.26.20.1)\.
- \[47\]Y\. Liu, H\. Wu, J\. Wang, and M\. Long\(2022\)Non\-stationary transformers: exploring the stationarity in time series forecasting\.InNeurIPS,Vol\.35,pp\. 9881–9893\.Cited by:[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.30.24.1)\.
- \[48\]A\. Madry, A\. Makelov, L\. Schmidt, D\. Tsipras, and A\. Vladu\(2017\)Towards deep learning models resistant to adversarial attacks\.arXiv preprint arXiv:1706\.06083\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p3.1),[§II\-C](https://arxiv.org/html/2606.18539#S2.SS3.p1.1),[TABLE II](https://arxiv.org/html/2606.18539#S2.T2.1.1.1.1.1),[§VII\-C](https://arxiv.org/html/2606.18539#S7.SS3.p1.1)\.
- \[49\]S\. Makridakis and M\. Hibon\(2000\)The m3\-competition: results, conclusions and implications\.Int\. J\. Forecast\.16\(4\),pp\. 451–476\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p1.1),[§II\-A](https://arxiv.org/html/2606.18539#S2.SS1.p1.1)\.
- \[50\]S\. Makridakis, E\. Spiliotis, and V\. Assimakopoulos\(2020\)The m4 competition: 100,000 time series and 61 forecasting methods\.Int\. J\. Forecast\.36\(1\),pp\. 54–74\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p1.1),[§II\-A](https://arxiv.org/html/2606.18539#S2.SS1.p1.1)\.
- \[51\]D\. Manheim and S\. Garrabrant\(2018\)Categorizing variants of goodhart’s law\.arXiv preprint arXiv:1803\.04585\.Cited by:[§VII\-B](https://arxiv.org/html/2606.18539#S7.SS2.p1.4)\.
- \[52\]E\. Mintun, A\. Kirillov, and S\. Xie\(2021\)On interaction between augmentations and corruptions in natural corruption robustness\.InNeurIPS,Vol\.34,pp\. 3571–3583\.Cited by:[§VII\-B](https://arxiv.org/html/2606.18539#S7.SS2.p1.4)\.
- \[53\]M\. Moor, B\. Rieck, M\. Horn, C\. R\. Jutzeler, and K\. Borgwardt\(2021\)Early prediction of sepsis in the icu using machine learning: a systematic review\.Front\. Med\.8,pp\. 607952\.Cited by:[§IV](https://arxiv.org/html/2606.18539#S4.p1.1),[§VII\-D](https://arxiv.org/html/2606.18539#S7.SS4.p1.4)\.
- \[54\]M\. A\. Morid, O\. R\. L\. Sheng, and J\. Dunbar\(2023\)Time series prediction using deep learning methods in healthcare\.TMIS14\(1\),pp\. 1–29\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p1.1)\.
- \[55\]Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam\(2023\)A time series is worth 64 words: long\-term forecasting with transformers\.InICLR,Cited by:[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.25.19.1),[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.p1.6)\.
- \[56\]J\. Nowotarski and R\. Weron\(2018\)Recent advances in electricity price forecasting: a review of probabilistic forecasting\.RSER81\(1\),pp\. 1548–1568\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p1.1)\.
- \[57\]B\. N\. Oreshkin, D\. Carpov, N\. Chapados, and Y\. Bengio\(2020\)N\-beats: neural basis expansion analysis for interpretable time series forecasting\.InICLR,Cited by:[§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.16.10.1)\.
- \[58\]J\. Peters, D\. Janzing, and B\. Scholkopf\(2017\)Elements of causal inference: foundations and learning algorithms\.MIT press\.Cited by:[§III\-B](https://arxiv.org/html/2606.18539#S3.SS2.p1.4)\.
- \[59\]X\. Qiu, J\. Hu, L\. Zhou, X\. Wu, J\. Du, B\. Zhang, C\. Guo, A\. Zhou, C\. S\. Jensen, Z\. Sheng, and B\. Yang\(2024\)TFB: towards comprehensive and fair benchmarking of time series forecasting methods\.PVLDB17\(9\),pp\. 2363–2377\.Cited by:[§I](https://arxiv.org/html/2606.18539#S1.p1.1),[§II\-A](https://arxiv.org/html/2606.18539#S2.SS1.p1.1)\.
- \[60\]M\. T\. Ribeiro, T\. Wu, C\. Guestrin, and S\. Singh\(2020\)Beyond accuracy: behavioral testing of nlp models with checklist\.InACL,pp\. 4902–4912\.Cited by:[§VII\-A](https://arxiv.org/html/2606.18539#S7.SS1.p1.3)\.
- \[61\]G\. Ruan, D\. Wu, X\. Zheng, H\. Zhong, C\. Kang, M\. A\. Dahleh, S\. Sivaranjani, and L\. Xie\(2020\)A cross\-domain approach to analyzing the short\-run impact of covid\-19 on the us electricity sector\.Joule4\(11\),pp\. 2322–2337\.Cited by:[§IV\-C](https://arxiv.org/html/2606.18539#S4.SS3.p1.1)\.
- \[62\]O\. Sainz, J\. Campos, I\. García\-Ferrero, J\. Etxaniz, O\. L\. de Lacalle, and E\. Agirre\(2023\)NLP evaluation in trouble: on the need to measure llm data contamination for each benchmark\.InEMNLP Findings,pp\. 10776–10787\.Cited by:[§II\-B](https://arxiv.org/html/2606.18539#S2.SS2.p1.1)\.
- \[63\]S\. Schmidl, P\. Wenig, and T\. Papenbrock\(2022\)Anomaly detection in time series: a comprehensive evaluation\.PVLDB15\(9\),pp\. 1779–1797\.Cited by:[§VII\-C](https://arxiv.org/html/2606.18539#S7.SS3.p1.1)\.
- \[64\]B\. Schölkopf, F\. Locatello, S\. Bauer, N\. R\. Ke, N\. Kalchbrenner, A\. Goyal, and Y\. Bengio\(2021\)Toward causal representation learning\.Proc\. IEEE109\(5\),pp\. 612–634\.Cited by:[§III\-B](https://arxiv.org/html/2606.18539#S3.SS2.p1.4)\.
- \[65\] C\. W\. Seymour, F\. Gesten, H\. C\. Prescott, M\. E\. Friedrich, T\. J\. Iwashyna, G\. S\. Phillips, S\. Lemeshow, T\. Osborn, K\. M\. Terry, and M\. M\. Levy \(2017\) Time to treatment and mortality during mandated emergency care for sepsis\. NEJM 376\(23\), pp\. 2235–2244\. Cited by: [§IV](https://arxiv.org/html/2606.18539#S4.p1.1)\.
- \[66\] O\. B\. Sezer, M\. U\. Gudelek, and A\. M\. Ozbayoglu \(2020\) Financial time series forecasting with deep learning: a systematic literature review: 2005–2019\. ASOC 90, pp\. 106181\. Cited by: [§I](https://arxiv.org/html/2606.18539#S1.p1.1)\.
- \[67\] Z\. Shao, F\. Wang, Y\. Xu, W\. Wei, C\. Yu, Z\. Zhang, D\. Yao, T\. Sun, G\. Jin, X\. Cao, et al\. \(2024\) Exploring progress in multivariate time series forecasting: comprehensive benchmarking and heterogeneity analysis\. TKDE 37\(1\), pp\. 291–305\. Cited by: [§II\-A](https://arxiv.org/html/2606.18539#S2.SS1.p1.1)\.
- \[68\] O\. Shchur, A\. F\. Ansari, C\. Turkmen, L\. Stella, N\. Erickson, P\. Guerron, M\. Bohlke\-Schneider, and Y\. Wang \(2025\) Fev\-bench: a realistic benchmark for time series forecasting\. arXiv preprint arXiv:2509\.26468\. Cited by: [§II\-A](https://arxiv.org/html/2606.18539#S2.SS1.p1.1)\.
- \[69\] B\. Shickel, P\. J\. Tighe, A\. Bihorac, and P\. Rashidi \(2017\) Deep ehr: a survey of recent advances in deep learning techniques for electronic health record \(ehr\) analysis\. IEEE J\-BHI 22\(5\), pp\. 1589–1604\. Cited by: [§I](https://arxiv.org/html/2606.18539#S1.p1.1)\.
- \[70\] N\. A\. Smuha \(2025\) Regulation 2024/1689 of the eur\. parl\. & council of june 13, 2024 \(eu artificial intelligence act\)\. Int\. Leg\. Mater\. 64\(5\), pp\. 1234–1381\. Cited by: [§VIII](https://arxiv.org/html/2606.18539#S8.p1.13)\.
- \[71\] E\. I\. Vlahogianni, M\. G\. Karlaftis, and J\. C\. Golias \(2014\) Short\-term traffic forecasting: where we are and where we’re going\. TRANSPORT RES C\-EMER 43\(1\), pp\. 3–19\. Cited by: [§I](https://arxiv.org/html/2606.18539#S1.p1.1)\.
- \[72\] S\. Wang, H\. Wu, X\. Shi, T\. Hu, H\. Luo, L\. Ma, J\. Zhang, and J\. Zhou \(2024\) Timemixer: decomposable multiscale mixing for time series forecasting\. In ICLR, Vol\. 2024, pp\. 38626–38652\. Cited by: [§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.28.22.1)\.
- \[73\] Y\. Wang, H\. Wu, J\. Dong, Y\. Liu, Y\. Qiu, H\. Zhang, J\. Wang, and M\. Long \(2024\) Timexer: empowering transformers for time series forecasting with exogenous variables\. In NeurIPS, Vol\. 37, pp\. 469–498\. Cited by: [§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.27.21.1)\.
- \[74\] G\. Woo, C\. Liu, A\. Kumar, C\. Xiong, S\. Savarese, and D\. Sahoo \(2024\) Unified training of universal time series forecasting transformers\. In ICML\. Cited by: [§II\-B](https://arxiv.org/html/2606.18539#S2.SS2.p1.1), [§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.34.28.1)\.
- \[75\] H\. Wu, T\. Hu, Y\. Liu, H\. Zhou, J\. Wang, and M\. Long \(2023\) Timesnet: temporal 2d\-variation modeling for general time series analysis\. In ICLR\. Cited by: [§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.29.23.1)\.
- \[76\] H\. Wu, J\. Xu, J\. Wang, and M\. Long \(2021\) Autoformer: decomposition transformers with auto\-correlation for long\-term series forecasting\. In NeurIPS, Vol\. 34, pp\. 22419–22430\. Cited by: [§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.22.16.1), [§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.p1.6)\.
- \[77\] C\. Xu, S\. Guan, D\. Greene, M\. Kechadi, et al\. \(2024\) Benchmark data contamination of large language models: a survey\. arXiv preprint arXiv:2406\.04244\. Cited by: [§II\-B](https://arxiv.org/html/2606.18539#S2.SS2.p1.1)\.
- \[78\] J\. Xu, H\. Wu, J\. Wang, and M\. Long \(2022\) Anomaly transformer: time series anomaly detection with association discrepancy\. In ICLR\. Cited by: [§VII\-C](https://arxiv.org/html/2606.18539#S7.SS3.p1.1)\.
- \[79\] H\. Yao, C\. Choi, B\. Cao, Y\. Lee, P\. W\. W\. Koh, and C\. Finn \(2022\) Wild\-time: a benchmark of in\-the\-wild distribution shift over time\. In NeurIPS, Vol\. 35, pp\. 10309–10324\. Cited by: [§II\-C](https://arxiv.org/html/2606.18539#S2.SS3.p1.1), [TABLE II](https://arxiv.org/html/2606.18539#S2.T2.1.3.1.2.1.1), [§VII\-C](https://arxiv.org/html/2606.18539#S7.SS3.p1.1)\.
- \[80\] Y\. Yao, L\. Chen, Z\. Fang, Y\. Gao, C\. S\. Jensen, and T\. Li \(2025\) Camel: efficient compression of floating\-point time series\. SIGMOD 2\(6\), pp\. 1–26\. Cited by: [§I](https://arxiv.org/html/2606.18539#S1.p1.1)\.
- \[81\] A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu \(2023\) Are transformers effective for time series forecasting?\. In AAAI, Vol\. 37, pp\. 11121–11128\. Cited by: [§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.14.8.1), [§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.15.9.1)\.
- \[82\] J\. Zhang, X\. Wen, Z\. Zhang, S\. Zheng, J\. Li, and J\. Bian \(2024\) ProbTS: benchmarking point and distributional forecasting across diverse prediction horizons\. In NeurIPS, Vol\. 37, pp\. 48045–48082\. Cited by: [§II\-A](https://arxiv.org/html/2606.18539#S2.SS1.p1.1)\.
- \[83\] Y\. Zhang, L\. Ma, S\. Pal, Y\. Zhang, and M\. Coates \(2024\) Multi\-resolution time\-series transformer for long\-term forecasting\. In AISTATS, Vol\. 238, pp\. 4222–4230\. Cited by: [§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.p1.6)\.
- \[84\] H\. Zhou, S\. Zhang, J\. Peng, S\. Zhang, J\. Li, H\. Xiong, and W\. Zhang \(2021\) Informer: beyond efficient transformer for long sequence time\-series forecasting\. In AAAI, Vol\. 35, pp\. 11106–11115\. Cited by: [§II\-A](https://arxiv.org/html/2606.18539#S2.SS1.p1.1), [§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.p1.6)\.
- \[85\] T\. Zhou, Z\. Ma, Q\. Wen, X\. Wang, L\. Sun, and R\. Jin \(2022\) Fedformer: frequency enhanced decomposed transformer for long\-term series forecasting\. In ICML, pp\. 27268–27286\. Cited by: [§I](https://arxiv.org/html/2606.18539#S1.p1.1), [§V\-A](https://arxiv.org/html/2606.18539#S5.SS1.8.8.6.23.17.1)\.