# HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series
Source: [https://arxiv.org/html/2605.11130](https://arxiv.org/html/2605.11130)
Jonas Petersen¹˒², Gian-Alessandro Lombardi², Riccardo Maggioni², Camilla Mazzoleni², Federico Martelli¹˒², Philipp Petersen³ — ¹ETH Zurich, ²Forgis, ³University of Vienna. Correspondence to jep79@cantab.ac.uk
###### Abstract
Critical events in multivariate time series, from turbine failures to cardiac arrhythmias, demand accurate prediction, yet labeled data is scarce because such events are rare and costly to annotate. We introduce HEPA (Horizon-conditioned Event Predictive Architecture), built on two key principles. First, a causal Transformer encoder is pretrained via a Joint-Embedding Predictive Architecture (JEPA): a horizon-conditioned predictor learns to forecast future *representations* rather than future values, forcing the encoder to capture predictable temporal dynamics from unlabeled data alone. Second, we freeze the encoder and *finetune only the predictor* toward the target event, producing a monotonic survival cumulative distribution function (CDF) over horizons. With fixed architecture and optimiser hyperparameters across all benchmarks, HEPA handles water contamination, cyberattack detection, volatility regimes, and eight further event types across 11 domains, exceeding leading time-series architectures including PatchTST, iTransformer, MAE, and Chronos-2 on at least 10 of 14 benchmarks, with an order of magnitude fewer tuned parameters and, on lifecycle datasets, an order of magnitude less labeled data.
## 1 Introduction

Figure 1: One label-efficient architecture, domain- and event-agnostic. (a) h-AUROC (↑; horizon-averaged AUROC) across 14 benchmarks in 11 domains. HEPA wins on 10 out of 14 at full labels; at 10% labels (open circles) it retains ≥92% of full-label performance on lifecycle datasets. (b) Predicted probability surfaces $p(t,\Delta t)$ for turbofan degradation (top) and cardiac arrhythmia (bottom).

A turbine blade cracks after 12,000 flight hours. A bearing degrades over weeks of vibration data. A satellite sensor drifts silently for 48 hours before triggering a cascade. These events are rare in operational data, yet they follow partially predictable precursor dynamics [25]: temperatures rise gradually before overheating, vibration amplitudes grow before mechanical failure, and sensor readings deviate systematically before spacecraft faults. A range of machine-learning methods attempt to predict such events from multivariate sensor streams. Remaining-useful-life (RUL) models [10] estimate how long until a machine fails; anomaly detectors [35, 14] flag when sensor readings look abnormal. Although general-purpose architectures exist for both, the two communities develop separate benchmarks, metrics, and evaluation protocols: RUL models never see anomaly benchmarks; anomaly detectors never forecast time-to-failure. Yet all these tasks share the same structure: given observations up to time $t$, estimate the probability $P(\text{event within } \Delta t)$ for each prediction horizon $\Delta t$.

This structural uniformity suggests a separation of concerns. The *encoder* learns temporal dynamics from unlabeled data without knowing which event matters downstream. The *predictor*, finetuned with a small number of event labels, specialises the learned dynamics to whichever event is relevant. The key design choice is what the encoder should forecast during pretraining. Value-forecasting approaches, whether supervised [22] or pretrained on large corpora [1, 6], shape representations around all variation in the signal, including noise irrelevant to the downstream event. The Joint-Embedding Predictive Architecture (JEPA) [2] offers an alternative: by forecasting future *representations* rather than future values, the encoder learns a latent space that retains what is predictable about the future and discards what is not.

We apply this principle to time series as HEPA (Horizon-conditioned Event Predictive Architecture). A causal Transformer encodes observations up to time $t$; a horizon-conditioned predictor maps the encoding and a horizon $\Delta t$ to a predicted future representation, forcing the encoder to internalise dynamics at multiple timescales (fig. 1). After self-supervised pretraining, the standard JEPA recipe discards the predictor and trains a linear probe on the frozen encoder. We instead retain the predictor: freeze the encoder but finetune the predictor alongside a lightweight event head that outputs a discrete-time survival CDF, ensuring that the predicted event probability never decreases as the horizon grows. This "predictor finetuning" recipe tunes only 198K parameters, roughly $11\times$ fewer than end-to-end training, yet is more expressive than a linear probe because the predictor reshapes its horizon-conditioned outputs to align with the downstream event.
Our contributions are:
1. One architecture, any event, any domain. A single 2.16 M-parameter architecture with fixed hyperparameters, evaluated on 14 benchmarks across 11 domains via a unified probability surface $p(t,\Delta t)$. HEPA wins on 10 out of 14 benchmarks while tuning $11\times$ fewer parameters than PatchTST.
2. Predictor finetuning as the downstream recipe. Freezing the encoder and finetuning only the predictor and event head tunes $11\times$ fewer parameters than end-to-end training. On the C-MAPSS benchmark [24], where degradation unfolds over hundreds of cycles, HEPA retains 92% of full-label h-AUROC at just 2% of labels. An information-theoretic bound (proposition 1) formalises when and why this works, and the bound's key prediction, that lower pretraining loss implies stronger downstream performance, is consistent with the empirical trend across 14 datasets (fig. 3).
## 2 Related Work

#### Self-supervised learning for time series.

Self-supervised learning (SSL) for time-series representation learning falls into three families. Contrastive methods, including TS2Vec [37], TNC [28], TimesURL [19], CPC [30], and CoST [33], learn representations by contrasting positive and negative pairs. Masked reconstruction approaches such as PatchTST [22], SimMTM [8], and TimesNet [34] recover masked patches in input space. JEPA [2, 4] takes a different path: predicting future *representations* rather than reconstructing inputs, avoiding tying the latent space to value-level fidelity. For time series, TS-JEPA [9] applies temporal masking for classification, and MTS-JEPA [14] adds codebook regularisation for anomaly detection. All these methods discard their pretraining head at inference and probe only the encoder. HEPA instead retains the predictor and finetunes it toward the downstream event, treating the predictor as a learnable bridge between frozen representations and event probabilities. The collapse-prevention mechanism follows the LeJEPA / SIGReg line [3] rather than the EMA schedule of I-JEPA.

#### Foundation models for time series.

Chronos-2 [1], TFM-2.5 [6], MOMENT [13], Moirai [32], and UniTS [11] pretrain on large-scale corpora for generic value forecasting. Generative pretraining [21] and LLM repurposing [38] offer alternative transfer strategies. These approaches target future channel values; HEPA targets event probabilities. The encoder is mid-scale and pretrained per-dataset; what transfers across domains is the *recipe* (architecture + predictor finetuning), not the weights. We benchmark HEPA against four of these foundation models, using identical downstream heads to isolate encoder quality (section 5 and appendix G).

#### Prognostics, anomaly prediction, and survival modelling.

C-MAPSS [24] is the standard remaining-useful-life (RUL) benchmark, where the supervised state of the art is STAR [10] (root mean square error, RMSE, 10.61). Self-supervised approaches to RUL prediction remain limited [7, 31]. Anomaly detection methods such as Anomaly Transformer [35], DCdetector [36], and TranAD [29] report point-adjusted F1, a metric shown to inflate scores dramatically by crediting entire segments from a single detection [15, 26]. These domain-specific metrics are incomparable across tasks. HEPA's downstream parameterisation builds on discrete-time survival models [17, 12], which decompose event probability into per-interval hazards composed into a survival CDF; we adapt this to a multi-horizon event prediction setting. We unify evaluation through h-AUROC, the mean of per-horizon AUROC values computed over the probability surface, which is threshold-free and robust to class imbalance (section 4). Domain-specific metrics are reported as lossy projections of the same surface for comparability with published baselines.
## 3 Method

### 3.1 Architecture and Pretraining

Figure 2: HEPA architecture. Both stages sweep over all $(t,\Delta t)$ pairs per episode. *Stage 1:* The causal encoder $f_\theta$ maps $\mathbf{x}_{\leq t}$ to $\mathbf{h}_t$; the predictor $g_\phi(\mathbf{h}_t,\Delta t)$ predicts future representations via a self-supervised JEPA objective. *Stage 2:* Encoder frozen; the predictor produces $K$ horizon-specific hazard rates $\lambda_{\Delta t}$ composed into a survival CDF (cumulative distribution function) $p(t,\Delta t)$.

HEPA consists of three components that interact across two phases (fig. 2). The *context encoder* $f_\theta$ is a causal Transformer ($d=256$, 2 layers, 4 heads) that maps observations $\mathbf{x}_{\leq t}$, tokenised into non-overlapping patches of size $P=16$ (following PatchTST [22]) with per-context instance normalisation [16] and sinusoidal positional encodings, to a summary embedding $\mathbf{h}_t = f_\theta(\mathbf{x}_{\leq t}) \in \mathbb{R}^d$. The *predictor* $g_\phi$ is a 2-layer multilayer perceptron (MLP) that takes the encoder output $\mathbf{h}_t$ together with a prediction horizon $\Delta t$ and produces a predicted embedding of the future interval:

$$\hat{\mathbf{h}}_{(t,t+\Delta t]} = g_\phi(\mathbf{h}_t, \Delta t). \tag{1}$$

During pretraining, $\Delta t$ is sampled from a log-uniform distribution over $[1, \Delta t_{\max}]$, forcing the encoder to internalise dynamics at multiple timescales. The same encoder $f_\theta$, applied bidirectionally to $\mathbf{x}_{(t,t+\Delta t]}$ with attention pooling, produces the *target representation* $\mathbf{h}^{*}_{(t,t+\Delta t]} \in \mathbb{R}^d$. Both encoders are trained jointly via the optimizer; a SIGReg (Sketched Isotropic Gaussian Regularisation) term $\mathcal{L}_{\mathrm{SIG}}$ [3] on the predictor output prevents representation collapse, replacing the exponential moving average (EMA) momentum schedule used in standard JEPA (section I.3). SIGReg constrains the predicted representations toward an isotropic Gaussian, which Balestriero and LeCun [3] prove is the optimal embedding distribution for minimising downstream prediction risk in joint-embedding architectures; this eliminates collapse without ad-hoc heuristics. A single mixing weight $\alpha=0.1$ controls its contribution to the total loss (section I.3).
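A shape-level sketch of how one pretraining pair could be formed under the definitions above; the encoders and predictor are abstracted as callables, the horizon is injected by simple concatenation (one possible choice, not specified by the paper), and all names are illustrative rather than the authors' code:

```python
import math
import random

import torch

DELTA_T_MAX = 200  # maximum pretraining horizon (dataset-dependent)


def sample_horizon(delta_t_max: int = DELTA_T_MAX) -> int:
    """Log-uniform horizon over [1, delta_t_max]: uniform in log-space, then exponentiate."""
    u = random.uniform(0.0, math.log(delta_t_max))
    return max(1, round(math.exp(u)))


def pretraining_pair(x, t, context_encoder, target_encoder, predictor):
    """x: (T, S) multivariate series. Returns one (predicted, target) embedding pair.

    context_encoder: causal encoder f_theta, returns a (d,) summary of x_{<=t}
    target_encoder:  weight-shared f_theta applied bidirectionally with attention pooling
    predictor:       horizon-conditioned MLP g_phi
    """
    dt = sample_horizon()
    h_t = context_encoder(x[: t + 1])                 # h_t = f_theta(x_{<=t})
    h_target = target_encoder(x[t + 1 : t + 1 + dt])  # h*_{(t, t+dt]}
    # Concatenating the scalar horizon is an assumed conditioning mechanism.
    h_hat = predictor(torch.cat([h_t, torch.tensor([float(dt)])]))
    return h_hat, h_target
```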
#### Relation to canonical JEPA.

HEPA differs from BYOL/I-JEPA/V-JEPA-style joint-embedding predictive architectures in two ways: (a) the target encoder is a weight-shared copy of $f_\theta$ rather than an EMA copy or a stop-gradient branch, and (b) collapse is prevented by SIGReg (an isotropic Gaussian constraint on the predictor output) rather than by the online/target asymmetry. Trivial collapse $\hat{H} = H^{*} = \text{const}$ is prevented jointly by SIGReg *and* by the asymmetric inputs ($\mathbf{x}_{\leq t}$ for the online branch vs. $\mathbf{x}_{(t,t+\Delta t]}$ for the target branch): the predictor never sees the future window directly. This puts HEPA closer to LeJEPA / SIGReg variants [3] than to the original I-JEPA recipe.

The pretraining loss combines an L1 prediction objective (chosen over L2 because L1 distributes gradient magnitude equally across samples, avoiding domination by outlier predictions) with the SIGReg regulariser:

$$\mathcal{L} = (1-\alpha)\,\|\hat{\mathbf{h}} - \mathbf{h}^{*}\|_{1} + \alpha\,\mathcal{L}_{\mathrm{SIG}}, \tag{2}$$

where $\alpha$ balances the two terms. Because the target encoder shares weights with the online encoder, no stop-gradient is needed; both receive gradients through the optimizer. No labels are used. Pretraining takes under one minute per dataset on a single A10G GPU, with the full 14-dataset, 5-seed sweep completing in under two hours. Per-dataset preprocessing details are in appendix L.
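A minimal sketch of the combined objective in eq. (2). The published SIGReg term uses sketched statistical tests toward an isotropic Gaussian; the stand-in below is a plain mean/variance penalty that only illustrates the role the regulariser plays, and is labelled as such:

```python
import torch

ALPHA = 0.1  # mixing weight from the paper


def isotropic_gaussian_penalty(z: torch.Tensor) -> torch.Tensor:
    """Simplified stand-in for SIGReg: push the batch of predicted embeddings
    toward zero mean and unit isotropic variance. Not the published estimator."""
    mean_term = z.mean(dim=0).pow(2).mean()
    var_term = (z.var(dim=0, unbiased=False) - 1.0).pow(2).mean()
    return mean_term + var_term


def pretraining_loss(h_hat: torch.Tensor, h_target: torch.Tensor) -> torch.Tensor:
    """Eq. (2): (1 - alpha) * L1 prediction error + alpha * collapse regulariser.

    h_hat, h_target: (batch, d) predicted and target representations.
    """
    l1 = (h_hat - h_target).abs().mean()
    return (1.0 - ALPHA) * l1 + ALPHA * isotropic_gaussian_penalty(h_hat)
```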
### 3.2 Downstream: Predictor Finetuning

After pretraining, we freeze the encoder $f_\theta$ and finetune only the predictor $g_\phi$ together with a lightweight linear event head. This "predictor finetuning" (pred-FT) recipe tunes 198K parameters, compared to 2.16M for end-to-end training and 513 for a frozen linear probe. Finetuning reshapes the predictor's per-horizon outputs to separate event-relevant from event-irrelevant dynamics, making it more expressive than a linear probe, while the frozen encoder supplies the pretrained dynamical knowledge that makes few labels sufficient. End-to-end finetuning achieves equivalent h-AUROC at full labels (table 4); pred-FT's advantage is computational efficiency and robustness under label scarcity (section 5.4).

The predictor is run at each of $K$ discrete horizons $\Delta t = 1, \ldots, K$ (unit steps; $K=150$ for C-MAPSS/TEP, $K=200$ otherwise). A shared linear head maps each predicted representation to a per-interval *conditional hazard*:

$$\lambda_{\Delta t}(t) = \sigma\bigl(\mathbf{w}^{\top}\hat{\mathbf{h}}_{(t,t+\Delta t]} + b\bigr) \in (0,1), \tag{3}$$

where $\sigma$ is the sigmoid function and $\lambda_{\Delta t}(t)$ approximates $P(\text{event in } (\Delta t{-}1, \Delta t] \mid T^{*} > \Delta t{-}1,\ \mathbf{x}_{\leq t})$, with $T^{*}$ denoting the time to the first event after $t$. The event probability surface is then parameterised as a discrete-time survival CDF [17, 12]:

$$p(t,\Delta t) = 1 - \prod_{j=1}^{\Delta t}\bigl(1 - \lambda_{j}(t)\bigr). \tag{4}$$

Because each factor $(1-\lambda_{j}) \in (0,1)$, the survival product is non-increasing in $\Delta t$, so $p(t,\Delta t)$ increases monotonically with the prediction horizon by construction. No distributional assumptions are required: each $\lambda_{\Delta t}$ is a free function of $\mathbf{h}_t$ via the predictor network. The finetuning loss sums positive-weighted binary cross-entropy (BCE) over horizons:

$$\mathcal{L}_{\text{FT}} = \sum_{\Delta t=1}^{K} w^{+}\cdot\text{BCE}\bigl(p(t,\Delta t),\, y(t,\Delta t)\bigr), \tag{5}$$

where $y(t,\Delta t) = \mathds{1}[\text{event in } (t, t{+}\Delta t]]$ and $w^{+} = N_{\text{neg}}/N_{\text{pos}}$ compensates for class imbalance.¹

¹We apply BCE to the cumulative event probability $p(t,\Delta t)$ rather than to the per-step hazards $\lambda_{j}(t)$ against per-step indicators (the standard discrete-survival likelihood, e.g. nnet-survival [12]). This is a deliberate design choice: BCE on the cumulative surface acts as a smoothing regulariser across horizons (each hazard $\lambda_{j}$ contributes to the BCE for every $\Delta t \geq j$), which empirically improves h-AUROC under our positive-weighted regime but distorts the probability scale (appendix O).
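A minimal sketch of eqs. (3)–(5), assuming the finetuned predictor has already produced one embedding per horizon; the imbalance weight is applied to the positive term in the usual pos-weight style, and all names are illustrative:

```python
import torch


def event_probability_surface(h_hat_per_horizon: torch.Tensor,
                              w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """h_hat_per_horizon: (K, d) predicted embeddings for horizons 1..K.

    Returns p(t, Δt) for Δt = 1..K, monotone in Δt by construction.
    """
    hazards = torch.sigmoid(h_hat_per_horizon @ w + b)  # eq. (3), shape (K,)
    survival = torch.cumprod(1.0 - hazards, dim=0)      # prod_j (1 - lambda_j)
    return 1.0 - survival                               # eq. (4)


def finetuning_loss(p: torch.Tensor, y: torch.Tensor, w_pos: float) -> torch.Tensor:
    """Eq. (5): positive-weighted BCE summed over horizons.

    p, y: (K,) cumulative event probabilities and binary indicators;
    w_pos = N_neg / N_pos weights the positive term.
    """
    bce = -(w_pos * y * torch.log(p + 1e-8) + (1 - y) * torch.log(1 - p + 1e-8))
    return bce.sum()
```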
### 3.3 Theoretical Analysis

Predictor finetuning rests on a premise: the pretrained encoder retains enough event-relevant information that a small downstream head can extract it. We formalise when this holds and connect the bound to experiments.

Let $X_{\leq t}$ denote observations up to time $t$, and let $E_{t+\Delta t} \in \{0,1\}$ be a binary indicator that equals 1 if an event occurs in the interval $(t, t+\Delta t]$ and 0 otherwise. The encoder produces $H_t = f_\theta(X_{\leq t}) \in \mathbb{R}^d$; the target encoder produces $H^{*} = \bar{f}_\theta(X_{(t,t+\Delta t]}) \in \mathbb{R}^d$ from the future interval; and the predictor produces $\hat{H} = g_\phi(H_t, \Delta t)$. We define the event posterior $\eta(h) \coloneqq P(E_{t+\Delta t}{=}1 \mid H^{*}{=}h)$ and the marginal event rate $\pi_e \coloneqq P(E_{t+\Delta t}{=}1)$, using $\pi_e$ to distinguish it from the probability surface $p(t,\Delta t)$.

###### Proposition 1 (Event-Information Retention).

Suppose (A1) the event $E_{t+\Delta t}$ is conditionally independent of $X_{\leq t}$ given $H^{*}$, (A2) the pretraining loss satisfies $\mathbb{E}[\|\hat{H} - H^{*}\|_{2}^{2}] \leq \varepsilon$, (A3) the event posterior $\eta(h)$ is $L$-Lipschitz, and (A4) the posterior is bounded: $\eta(H^{*}) \in [\underline{\eta}, \overline{\eta}] \subset (0,1)$ a.s. Then

$$I(H_t;\, E_{t+\Delta t}) \;\geq\; I(H^{*};\, E_{t+\Delta t}) \;-\; C_\eta\, L^{2}\, \varepsilon, \tag{6}$$

where $C_\eta = \bigl(2\,\underline{\eta}\,(1-\overline{\eta})\bigr)^{-1}$ and $I(\cdot\,;\cdot)$ denotes mutual information.

The proof proceeds in three steps (full details in appendix A). First, because $\hat{H}$ is a deterministic function of $H_t$, the data processing inequality gives $I(H_t;E) \geq I(\hat{H};E)$. Second, a Jensen-gap argument on the convex KL divergence, combined with the Lipschitz condition and prediction error bound, yields $I(H^{*};E) - I(\hat{H};E) \leq C_\eta L^{2}\varepsilon$. Combining these two inequalities produces the result.
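Chaining the two steps makes the combination explicit:

```latex
I(H_t; E_{t+\Delta t})
  \;\overset{\text{(DPI)}}{\geq}\; I(\hat{H}; E_{t+\Delta t})
  \;\overset{\text{(Jensen gap)}}{\geq}\; I(H^{*}; E_{t+\Delta t}) - C_\eta L^{2}\varepsilon .
```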
The bound makes a falsifiable prediction: as pretraining proceeds and $\varepsilon$ shrinks, downstream h-AUROC should rise. The bound's constants $L$ (Lipschitz constant of $\eta$), $C_\eta$ (posterior bound), and the target sufficiency $I(H^{\star}; E_{t+\Delta t})$ are functions of the data-generating process: they vary across datasets but are held fixed within a dataset. The bound is therefore directly testable only *within* a dataset, by varying $\varepsilon$ alone. We do this on three contrasting domains, turbofan lifecycle (C-MAPSS-3), cardiac arrhythmia (MBA), and spacecraft telemetry anomalies (SMAP), by snapshotting the encoder during pretraining at epochs {1, 3, 8, 25} plus the converged best, and at each snapshot running the standard predictor finetuning recipe to obtain h-AUROC on the held-out test split (3 seeds per dataset). The bound's monotone prediction holds across all three: pooled Spearman $\rho(\varepsilon, \text{h-AUROC}) = -0.67$ ($p=0.017$, $n=12$) on C-MAPSS-3, $\rho=-0.64$ ($p=0.026$, $n=12$) on MBA, and $\rho=-0.49$ ($p=0.13$, $n=11$) on SMAP. SMAP shows the largest visible h-AUROC range (0.40 at $\varepsilon=0.033$ rising to 0.65 at $\varepsilon=0.026$). The converged-best snapshot regresses slightly relative to epoch 25 on all three datasets, consistent with mild over-pretraining at fixed labels. C-MAPSS-1 (the original lifecycle benchmark, $\rho=-0.87$, $p<0.001$) gives an even stronger signal and is reported in appendix A. Corollary 2 predicts a fourth regime where the bound becomes vacuous: on short-window anomaly benchmarks like GECCO we observe a within-dataset $\rho=+0.14$ ($p=0.67$) with finetuning instability across early snapshots, exactly as expected when extended precursors are weak (also in appendix A).
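A minimal sketch of the within-dataset check described above, assuming the per-snapshot pairs of pretraining loss $\varepsilon$ and finetuned h-AUROC have already been collected (the numbers below are placeholders, not the paper's measurements):

```python
from scipy.stats import spearmanr

# (epsilon, h_auroc) pairs from encoder snapshots (epochs 1, 3, 8, 25, converged best)
# for one dataset; illustrative placeholder values only.
snapshots = [
    (0.033, 0.40), (0.031, 0.45), (0.029, 0.52),
    (0.027, 0.61), (0.026, 0.65),
]
eps, h_auroc = zip(*snapshots)

# The bound predicts a monotone relationship within a dataset:
# lower epsilon -> higher h-AUROC, i.e. a negative Spearman rank correlation.
rho, p_value = spearmanr(eps, h_auroc)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```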


Figure 3: Self-supervised pretraining learns task-relevant structure. (a) Pretraining loss $\varepsilon$ vs. downstream h-AUROC (↑) at fixed checkpoints across three domains (C-MAPSS-3: $\rho=-0.67$; MBA: $\rho=-0.64$; SMAP: $\rho=-0.49$; 3 seeds, error bars ±1 std). Within a dataset, $L$, $C_\eta$, and $I(H^{\star}; E_{t+\Delta t})$ are constant, so the bound's monotone prediction is directly testable. ★ marks the converged-best snapshot; $\varepsilon$ scales differ across datasets, so curves cannot be compared horizontally. (b) Principal component analysis (PCA) of pretrained C-MAPSS-1 representations for four test engines. Open circles: first observation (healthy); stars: last observation (near failure). PC1 captures 61% of variance; the encoder organises representations into a smooth degradation manifold without any labels.

A cross-dataset scatter, by contrast, does *not* validate the bound: pooling the converged $\varepsilon$ across the 14 datasets of table 1 gives Pearson $r=-0.05$ ($p=0.90$), because $L$, $C_\eta$, and the absolute scale of the target representation differ by dataset, dominating any signal from $\varepsilon$ alone; the same incommensurability is visible in fig. 3 (C-MAPSS-3 clusters around $\varepsilon \sim 0.015$, SMAP around 0.027, MBA around 0.06). This does not contradict the bound; it shows that comparing $\varepsilon$ across datasets compares incommensurable quantities, motivating the within-dataset protocol above. Two further caveats remain. The constants $L$ and $C_\eta$ are not estimated directly; fig. 3 validates only the monotonic relationship, not the full quantitative bound. And A1 (target sufficiency) may fail when event precursors span intervals longer than the target window; when A1 is violated, the bound becomes loose in a *favourable* direction (see appendix A for assumption-by-assumption failure modes).

###### Corollary 2 (Precursor necessity).

The bound is non-vacuous if and only if the future interval contains event precursors that the target encoder captures ($I(H^{*}; E_{t+\Delta t}) > 0$) and the predictor approximates the target well enough ($\varepsilon < I(H^{*}; E_{t+\Delta t})/(C_\eta L^{2})$).

This corollary explains both HEPA's successes and its failures. On C-MAPSS, degradation unfolds over hundreds of cycles, so $I(H^{*}; E_{t+\Delta t})$ is large and pretraining drives $\varepsilon$ small, yielding h-AUROC $\geq 0.81$. On datasets without extended precursors, the bound is vacuous regardless of pretraining quality.

## 4 Evaluation Framework

The model outputs a probability surface $p(t,\Delta t)$ (eq. 4) for each observation time $t$ and prediction horizon $\Delta t$. This surface is the complete prediction; every metric is computed deterministically from it (fig. 4), enabling direct comparison with published baselines without retraining.

Figure 4: Evaluation framework. (a) The probability surface $p(t,\Delta t)$ on a representative C-MAPSS-1 engine (lifetime 174 cycles) unifies all event-prediction metrics as lossy projections. The colour scale matches Fig. 1b. RMSE requires converting the survival curve to a point estimate $\hat{\tau} = \sum_{\Delta t}\Delta t\cdot P(\text{event at }\Delta t)$; this projection is sensitive to calibration (appendix J). PA-F1 thresholds $p(t,1)$ at the smallest horizon and credits entire anomaly segments from a single detection (inflated [15]). F1 collapses to a single $(t,\Delta t)$ cell. h-AUROC averages AUROC over all horizons, using the full surface. (b) Per-horizon AUROC on GECCO ($K=200$ for HEPA / PatchTST / Chronos-2; sparse $K=8$ for iTransformer / MAE following the v34 protocol). Mean h-AUROC (↑) per method shown in the legend; dashed lines mark the per-method mean. HEPA holds AUROC ≥0.82 across the full horizon range while value-level baselines decay sharply.

As a cross-domain metric, we use h-AUROC: the mean of per-horizon AUROC values pooled over $(t,\Delta t)$ cells. Per-horizon prevalence varies wildly across datasets, and even within a single surface: on C-MAPSS-1, the event "failure within $\Delta t$ steps" has prevalence 0.5% at $\Delta t=1$ and 96% at $\Delta t=150$, a ~200× range. Pooled area under the precision-recall curve (AUPRC) over all $(t,\Delta t)$ cells inherits a 0.957 baseline on C-MAPSS-1, because a model predicting only per-horizon prevalence already scores there. h-AUROC solves this by decomposing the surface into independent per-horizon binary classification problems, each with a universal 0.5 baseline that does not depend on prevalence. The uniform average treats all horizons equally; in practice, specific horizons matter more (long-range for turbine maintenance, short-range for arrhythmia). We use the uniform average for cross-domain comparability; the full surface is always stored for application-specific weighting. Domain-specific metrics (RMSE for remaining useful life, PA-F1 for anomaly detection) are derived as projections of the same surface for comparability with published baselines (appendix J). All numbers are reported as mean ± std across 5 seeds (HEPA, PatchTST, iTransformer, MAE) or 3 seeds (Chronos-2).
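A minimal sketch of the h-AUROC computation described above, assuming a predicted surface `p_surface[t, k]` and binary labels `y_surface[t, k]` over $K$ horizons (names are illustrative, not the paper's code; skipping single-class horizons is a practical guard, not part of the paper's definition):

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def h_auroc(p_surface: np.ndarray, y_surface: np.ndarray) -> float:
    """Mean of per-horizon AUROC values over the probability surface.

    p_surface, y_surface: arrays of shape (num_times, K), where column k holds
    p(t, Δt=k+1) and the binary label "event within k+1 steps".
    """
    per_horizon = []
    for k in range(p_surface.shape[1]):
        y_k = y_surface[:, k]
        # AUROC is undefined when a horizon contains only one class; skip it.
        if y_k.min() == y_k.max():
            continue
        per_horizon.append(roc_auc_score(y_k, p_surface[:, k]))
    return float(np.mean(per_horizon))
```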
## 5 Experiments

### 5.1 Setup

We pretrain a separate HEPA encoder per dataset from unlabeled training data. Architecture and hyperparameters are identical across all domains; only the input projection (sensor count $S$) changes. All comparison methods share the same 198K-parameter downstream MLP head, positive-weighted BCE loss, and evaluation protocol; only the frozen encoder differs. Dense unit-step horizons are used throughout: $K=150$ for C-MAPSS and TEP, $K=200$ for all others. The dataset overview (14 datasets, 11 domains) is in table 3.

### 5.2 Main Results

Table 1: Main results (mean ± std; 5 seeds for HEPA, PatchTST, iTransformer, MAE; 3 seeds for Chronos-2). All methods use matched-capacity downstream heads on frozen encoders (appendix C). Each dataset has two rows: 100% labels and 10% labels (gray). Bold = best mean per row. Matched downstream heads (HEPA 198K pred-FT; baselines 264K dt-MLP), positive-weighted BCE, identical protocol. $K=150$ for C-MAPSS/TEP, $K=200$ otherwise. Domain SOTAs are detection or RUL baselines; HEPA domain metrics are projected from $p(t,\Delta t)$ at the matching horizon ($\Delta t=1$ for PA-F1/F1, $\mathbb{E}[\Delta t]$ for RMSE; appendix J). PA-F1 = point-adjusted F1 [15]. Pairwise Welch's t-tests in appendix N. †Beijing-AQ PatchTST: 3 stations at 100%. Bottom block has no published domain SOTA. Additional baselines (MOMENT, TFM-2.5, Moirai, MTS-JEPA) in appendices G and H.

Table 1 compares HEPA against two classes of methods. The primary comparison is *architectural*: PatchTST [22], iTransformer [20], and a masked autoencoder (MAE) baseline use the same per-dataset regime with identical downstream heads, isolating the effect of JEPA pretraining versus alternative self-supervised and supervised objectives. The secondary comparison is against the *foundation model* Chronos-2 [1], which pretrains on a large external corpus and operates in a fundamentally different regime. A full comparison against MTS-JEPA [14] (matched protocol) is in appendix H; HEPA wins on 8 out of 9 datasets where MTS-JEPA could be reproduced (TEP excluded: the public MTS-JEPA release does not include a chemical-process benchmark).

#### HEPA vs. architectural baselines.

HEPA wins on 10 out of 14 benchmarks at 100% labels, including all four C-MAPSS variants and the newly added FD004 (the hardest subset: six fault modes, six operating conditions). HEPA's representation-level prediction captures temporal structure that supervised training (PatchTST) and reconstruction-based SSL (MAE) miss, particularly on datasets with extended precursor dynamics (C-MAPSS, GECCO, PSM, TEP). MAE is a strong second: it matches or exceeds HEPA on spacecraft telemetry (SMAP) and power systems (ETTm1), suggesting that reconstruction-based pretraining transfers well when the dominant failure mode is gradual drift. iTransformer's variate-attention mechanism excels on MBA (h-AUROC 0.84 vs. HEPA's 0.75), where arrhythmia patterns are localised across specific leads.

#### HEPA vs. Chronos-2.

HEPA matches or exceeds Chronos-2 on most benchmarks. Per-dataset JEPA excels when events have extended precursors that the local training data fully represents; large-corpus pretraining helps when event signatures resemble patterns seen at scale.

#### Honest losses.

HEPA is below the best baseline on four datasets at 100% labels. The pattern is interpretable: BATADAL and MBA have sensor-localised events where channel-fusion tokenisation dilutes the relevant subset, so per-variate attention (iTransformer) or channel-independent training (PatchTST) wins; MAE's reconstruction objective transfers well when the dominant failure mode is gradual drift (SMAP, ETTm1). Adopting a sensor-as-token strategy [20] within the HEPA encoder is a natural way to close this gap.

### 5.3 What Does Pretraining Learn?

Figure 3 visualises encoder representations after self-supervised pretraining on C-MAPSS-1. Without any labels, the encoder organises representations into a smooth degradation manifold: PC1 alone captures 61% of variance and tracks time-to-failure monotonically within each engine (median per-engine Spearman $\rho = +0.97$; 84% of engines have $\rho > 0.9$). Engines starting from different healthy regions converge toward a shared failure region. This structure explains why so few labels suffice: the encoder has already separated healthy from degraded states.
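A minimal sketch of this check, assuming frozen-encoder embeddings with matching engine IDs and time-to-failure values are available; variable names are illustrative, and taking the absolute correlation (to absorb the arbitrary sign of PC1) is an assumed convention:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA


def degradation_manifold_check(embeddings, engine_ids, time_to_failure):
    """Project embeddings onto PC1 and measure per-engine monotonicity.

    embeddings:      (N, d) frozen-encoder outputs h_t
    engine_ids:      (N,) engine identifier for each row
    time_to_failure: (N,) remaining cycles at each observation
    """
    pc1 = PCA(n_components=1).fit_transform(embeddings).ravel()
    rhos = []
    for eng in np.unique(engine_ids):
        mask = engine_ids == eng
        rho, _ = spearmanr(pc1[mask], time_to_failure[mask])
        rhos.append(abs(rho))  # PC1 sign is arbitrary, so use |rho|
    rhos = np.array(rhos)
    return np.median(rhos), float(np.mean(rhos > 0.9))
```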
### 5.4 Label Efficiency

All methods in table 1 freeze their encoder and train only a downstream head, so all benefit from pretraining under label scarcity. The question is whether HEPA's representations degrade more gracefully. Table 2 shows that on C-MAPSS, where degradation unfolds over hundreds of cycles and the JEPA predictor achieves low pretraining loss, HEPA retains 92% of full-label h-AUROC with just 2 training engines out of 85. C-MAPSS-3 retains 97% at 10% labels. This is consistent with proposition 1: low $\varepsilon$ on lifecycle datasets means the encoder already separates healthy from degraded states, so the finetuned predictor needs only a few labelled examples to map them to event probabilities.

The advantage is not universal. At 10% labels across all 14 datasets (table 1, gray rows), HEPA wins on 6 out of 14, compared to 10 out of 14 at full labels. On anomaly datasets without extended precursors (SMAP, PSM, GECCO at 10%), the frozen-encoder setup limits how much any method can degrade, so margins compress. The label-efficiency story is strongest where HEPA's pretraining loss is lowest: extended-precursor lifecycle datasets.

Table 2: Label efficiency on C-MAPSS lifecycle datasets (HEPA only, 3 seeds). h-AUROC (↑) and retention relative to full labels. C-MAPSS-1 retains 92% at 2% labels (2 of 85 training engines).

## 6 Conclusion & Future Work

HEPA demonstrates that self-supervised JEPA pretraining combined with predictor finetuning provides a practical recipe for event prediction. The encoder learns temporal dynamics from unlabelled data; the predictor learns which dynamics signal the target event. One architecture handles degradation forecasting, anomaly prediction, and arrhythmia detection across 14 benchmarks in 11 domains, matching or exceeding PatchTST, iTransformer, MAE, and Chronos-2 on the majority of benchmarks while tuning an order of magnitude fewer parameters. On lifecycle datasets, the recipe is robust to extreme label scarcity: 92% of full-label performance with 2% of labels on C-MAPSS, consistent with the information-retention guarantee of proposition 1. Because the recipe is domain-agnostic, the same architecture that predicts turbine failure from flight-recorder data can flag arrhythmia risk from ECG streams or detect water contamination from sensor networks, each time requiring only a handful of event labels.

Looking ahead, cross-domain pretraining is the natural next step toward industrial deployment, and sensor-as-token strategies [20] could close the gap on systems where event-relevant information is concentrated in a few channels. On the theory side, deriving fully empirical versions of the information-retention bound that estimate $L$ and $C_\eta$ directly from data remains an interesting open problem. Wherever multivariate sensors record the precursors to rare but consequential events, HEPA offers a path from unlabelled streams to actionable predictions.
## References
- [1] A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. Pineda-Arango, S. Kapoor, et al. (2024) Chronos: learning the language of time series. arXiv preprint arXiv:2403.07815.
- [2] (2023) Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR.
- [3] R. Balestriero and Y. LeCun (2025) LeJEPA: provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544.
- [4] A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. LeCun, M. Assran, and N. Ballas (2024) Revisiting feature prediction for learning visual representations from video. TMLR.
- [5] T. M. Cover and J. A. Thomas (2006) Elements of information theory. 2nd edition, Wiley-Interscience.
- [6] A. Das, W. Kong, R. Sen, and Y. Zhou (2024) A decoder-only foundation model for time-series forecasting. In ICML.
- [7] Y. Ding, M. Jia, Q. Miao, and Y. Cao (2022) Self-supervised pretraining via multi-modally augmented representations for remaining useful life prediction. IEEE Transactions on Industrial Informatics 18(9), pp. 5954–5964.
- [8] J. Dong, H. Wu, H. Zhang, L. Zhang, J. Wang, and M. Long (2023) SimMTM: a simple pre-training framework for masked time-series modeling. In NeurIPS.
- [9] S. Ennadir, S. Golkar, and L. Sarra (2024) Joint embeddings go temporal. In NeurIPS Workshop on Time Series in the Age of Large Models.
- [10] Z. Fan, W. Li, and K. Chang (2024) A two-stage attention-based hierarchical transformer for turbofan engine remaining useful life prediction. Sensors 24(3), pp. 824. doi:10.3390/s24030824.
- [11] S. Gao, T. Koker, O. Queen, T. Hartvigsen, T. Tsiligkaridis, and M. Zitnik (2024) UniTS: building a unified time series model. In NeurIPS.
- [12] M. F. Gensheimer and B. Narasimhan (2019) A scalable discrete-time survival model for neural networks. PeerJ 7, pp. e6257. doi:10.7717/peerj.6257.
- [13] M. Goswami, K. Szafer, A. Choudhry, Y. Cai, S. Li, and A. Dubrawski (2024) MOMENT: a family of open time-series foundation models. In ICML.
- [14] Y. He, Y. Wen, X. Wang, and T. Ma (2026) MTS-JEPA: multi-resolution joint-embedding predictive architecture for time-series anomaly prediction. arXiv preprint.
- [15] S. Kim, K. Choi, H. Choi, B. Lee, and S. Yoon (2022) Towards a rigorous evaluation of time-series anomaly detection. In AAAI. arXiv:2109.05257.
- [16] T. Kim, J. Kim, Y. Tae, C. Park, J. Choi, and J. Choo (2022) Reversible instance normalization for accurate time-series forecasting against distribution shift. In ICLR.
- [17] C. Lee, W. R. Zame, J. Yoon, and M. van der Schaar (2018) DeepHit: a deep learning approach to survival analysis with competing risks. In AAAI.
- [18] J. Liao, M. Feder, and T. Courtade (2019) Sharpening Jensen's inequality. IEEE Transactions on Information Theory 68(5), pp. 2961–2972.
- [19] J. Liu and S. Chen (2024) TimesURL: self-supervised contrastive learning for universal time series representation learning. In AAAI.
- [20] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long (2024) iTransformer: inverted transformers are effective for time series forecasting. In ICLR.
- [21] Y. Liu, H. Zhang, C. Li, X. Huang, J. Wang, and M. Long (2024) Timer: generative pre-trained transformers are large time series models. In ICML.
- [22] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam (2023) A time series is worth 64 words: long-term forecasting with transformers. In ICLR.
- [23] Y. Polyanskiy and Y. Wu (2024) Information theory: from coding to learning. Cambridge University Press.
- [24] A. Saxena, K. Goebel, D. Simon, and N. Eklund (2008) Damage propagation modeling for aircraft engine run-to-failure simulation. In International Conference on Prognostics and Health Management (PHM).
- [25] M. Scheffer, J. Bascompte, W. A. Brock, V. Brovkin, S. R. Carpenter, V. Dakos, H. Held, E. H. Van Nes, M. Rietkerk, and G. Sugihara (2009) Early-warning signals for critical transitions. Nature 461(7260), pp. 53–59.
- [26] S. Schmidl, P. Wenig, and T. Papenbrock (2022) Anomaly detection in time series: a comprehensive evaluation. Proceedings of the VLDB Endowment 15(9), pp. 1779–1797.
- [27] N. Tishby, F. C. Pereira, and W. Bialek (2000) The information bottleneck method. arXiv preprint physics/0004057.
- [28] S. Tonekaboni, D. Eytan, and A. Goldenberg (2021) Unsupervised representation learning for time series with temporal neighborhood coding. In ICLR.
- [29] S. Tuli, G. Casale, and N. R. Jennings (2022) TranAD: deep transformer networks for anomaly detection in multivariate time series data. Proceedings of the VLDB Endowment 15(6), pp. 1201–1214.
- [30] A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- [31] H. Wang, C. Peng, and C. Liu (2024) Masked autoencoder-based self-supervised learning for remaining useful life prediction of turbofan engines. Engineering Applications of Artificial Intelligence 133.
- [32] G. Woo, C. Liu, A. Kumar, C. Xiong, S. Savarese, and D. Sahoo (2024) Unified training of universal time series forecasting transformers. arXiv preprint arXiv:2402.02592.
- [33] G. Woo, C. Liu, D. Sahoo, A. Kumar, and S. Hoi (2022) CoST: contrastive learning of disentangled seasonal-trend representations for time series forecasting. In ICLR.
- [34] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long (2023) TimesNet: temporal 2d-variation modeling for general time series analysis. In ICLR.
- [35] J. Xu, H. Wu, J. Wang, and M. Long (2022) Anomaly transformer: time series anomaly detection with association discrepancy. In ICLR.
- [36] Y. Yang, C. Zhang, T. Zhou, Q. Wen, and L. Sun (2023) DCdetector: dual attention contrastive representation learning for time series anomaly detection. In KDD.
- [37] Z. Yue, Y. Wang, J. Duan, T. Yang, C. Huang, Y. Tong, and B. Xu (2022) TS2Vec: towards universal representation of time series. In AAAI.
- [38] T. Zhou, P. Niu, X. Wang, L. Sun, and R. Jin (2023) One fits all: power general time series analysis by pretrained LM. In NeurIPS.
## Appendix A Theoretical Analysis: Full Proofs and Discussion

### A.1 Notation and Preliminaries

We work with the following random variables on a common probability space: $X_{\leq t}$ (observations up to time $t$), $X_{(t,t+\Delta t]}$ (future observations), $E_{t+\Delta t} \in \{0,1\}$ (event indicator), $H_t = f_\theta(X_{\leq t}) \in \mathbb{R}^d$ (encoder output), $H^{*} = \bar{f}_\theta(X_{(t,t+\Delta t]}) \in \mathbb{R}^d$ (target encoder output), and $\hat{H} = g_\phi(H_t, \Delta t) \in \mathbb{R}^d$ (predicted representation). All mutual informations $I(\cdot\,;\cdot)$ and entropies $\mathbb{H}(\cdot)$ are well-defined: $E_{t+\Delta t}$ is discrete (binary), and for the continuous variables $H_t, H^{*}, \hat{H}$ we use differential entropy and the standard extension of mutual information to mixed discrete-continuous pairs [5, Ch. 8–9]. We use $\mathbb{H}$ (blackboard bold) for entropy to avoid confusion with the encoder embedding $H_t$.

We define the event posterior $\eta(h) \coloneqq P(E_{t+\Delta t}=1 \mid H^{*}=h)$ and the marginal event rate $\pi_e \coloneqq P(E_{t+\Delta t}=1)$, using $\pi_e$ to distinguish it from the probability surface $p(t,\Delta t)$ in the main text.

### A.2 Assumptions

1. (A1) Target sufficiency. $E_{t+\Delta t} \perp\!\!\!\perp X_{\leq t} \mid H^{*}$. *Interpretation.* The target encoder's representation of the future interval is a sufficient statistic for the event, given the past. This holds when: (a) the event is determined by the dynamics in $(t, t+\Delta t]$, and (b) the target encoder has enough capacity and sees the relevant future interval. Because $\bar{f}_\theta$ is bidirectional with attention pooling over the full interval, it is strictly more expressive than the causal encoder for summarising the future, making this assumption mild for well-trained target encoders. *When it fails.* If the event depends on context outside $(t, t+\Delta t]$ (for instance, a slow trend visible only in $X_{\leq t}$ that the target encoder cannot see), then A1 is violated and the past observations carry event information not mediated by $H^{*}$. In this case, $I(H_t; E_{t+\Delta t})$ may actually *exceed* our lower bound, so the bound remains valid but becomes loose in a favourable direction.
2. (A2) Bounded prediction error. $\mathbb{E}[\|\hat{H} - H^{*}\|_{2}^{2}] \leq \varepsilon$. *Interpretation.* The pretraining loss (L1 on L2-normalised representations in practice) drives the prediction residual small. The bound uses squared L2 error for analytical tractability. Since $\|u\|_{2} \leq \|u\|_{1}$ for all $u \in \mathbb{R}^d$, lower L1 loss implies lower L2 loss, so the L1 training loss is a monotone proxy for $\varepsilon$. We use this monotonic relationship in our empirical validation (fig. 3) to test the bound's qualitative prediction without requiring a precise norm conversion. *When it fails.* Early in training or on out-of-distribution horizons, $\varepsilon$ can be large and the bound becomes vacuous.
3. (A3) Smooth event dependence. The conditional distribution $P(E_{t+\Delta t}=1 \mid H^{*}=h)$ is Lipschitz continuous in $h$ with constant $L$: for all $h, h' \in \mathbb{R}^d$, $|P(E_{t+\Delta t}=1 \mid H^{*}=h) - P(E_{t+\Delta t}=1 \mid H^{*}=h')| \leq L\,\|h - h'\|_{2}$. *Interpretation.* Small perturbations of the target representation do not drastically change the event probability. This is a regularity condition on the relationship between the learned representation space and event occurrence; it holds whenever the event boundary in representation space is not a fractal or highly irregular set. *When it fails.* If the event probability is a discontinuous function of $H^{*}$ (e.g. a hard threshold on a single component), the Lipschitz constant $L$ diverges and our continuity argument requires replacement by a discrete analysis.
4. (A4) Bounded event posterior. There exist $0 < \underline{\eta} \leq \overline{\eta} < 1$ such that $P(\eta(H^{*}) \in [\underline{\eta}, \overline{\eta}]) = 1$, where $\eta(h) = P(E_{t+\Delta t}=1 \mid H^{*}=h)$. *Interpretation.* The event posterior, evaluated on the support of the target encoder's output, is bounded away from 0 and 1 almost surely. This matches the main-text assumption A4. In practice, $H^{*}$ is L2-normalised onto the unit sphere (a compact set), and A3 guarantees that $\eta$ is Lipschitz; since the image of a compact connected set under a Lipschitz map is a closed bounded interval, the range of $\eta(H^{*})$ is such an interval, and $0 < \underline{\eta}$ follows from the event being non-trivially detectable from the future window. Note that A4 implies the marginal event rate $\pi_e = \mathbb{E}[\eta(H^{*})]$ is bounded in $[\underline{\eta}, \overline{\eta}]$, so the event is neither impossible nor certain. *Role in the proof.* This assumption is needed to bound $\sup_q \varphi''(q) = \sup_q 1/(q(1-q))$ over the support of $\eta(H^{*})$ (Step 2 of the proof); a small numeric illustration of the resulting constant follows this list. Without it, the KL second derivative $\varphi''$ is unbounded near 0 and 1, making the Jensen-gap bound vacuous. The constant in the bound becomes $C_\eta = \bigl(2\,\underline{\eta}\,(1-\overline{\eta})\bigr)^{-1}$. As $\pi_e \to 0$ or $\pi_e \to 1$, the posterior bounds are forced toward 0 or 1, driving $C_\eta \to \infty$; this reflects a genuine difficulty: distinguishing $P(E \mid \hat{H})$ from the prior requires high precision when events are extremely rare. *When it fails.* If $\eta(H^{*})$ concentrates near 0 or 1, then $C_\eta \to \infty$ and the bound degrades. Empirically, this is the case on CHB-MIT, where seizure onset cannot be reliably predicted from any 16-second past context, and $\eta(h) \approx \pi_e$ for all $h$ (consistent with $I(H^{*};E) \approx 0$).
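As a worked illustration (numbers chosen for illustration only, not taken from the paper): if the posterior is bounded as $\underline{\eta}=0.1$ and $\overline{\eta}=0.9$, then

```latex
C_\eta \;=\; \bigl(2\,\underline{\eta}\,(1-\overline{\eta})\bigr)^{-1}
       \;=\; \bigl(2 \cdot 0.1 \cdot 0.1\bigr)^{-1}
       \;=\; 50,
```

so the mutual-information gap in eq. (6) is at most $50\,L^{2}\varepsilon$; pushing the posterior bounds toward 0 or 1 inflates $C_\eta$ accordingly.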
### A.3 Full Proof of Proposition [1](https://arxiv.org/html/2605.11130#Thmproposition1)
###### Proof\.
We proceed in three steps\.
Step 1: Data processing inequality. For fixed $\Delta t$ (which we condition on throughout), $\hat{H}=g_{\phi}(H_t,\Delta t)$ is a deterministic function of $H_t$. Since a deterministic function cannot introduce new information, $E_{t+\Delta t}\perp\!\!\!\perp\hat{H}\mid H_t$, so the triple $(E_{t+\Delta t},H_t,\hat{H})$ satisfies the Markov chain $E_{t+\Delta t}\to H_t\to\hat{H}$, and the data processing inequality [[5](https://arxiv.org/html/2605.11130#bib.bib108), Thm. 2.8.1] gives

$$I(H_t;\,E_{t+\Delta t})\;\ge\;I(\hat{H};\,E_{t+\Delta t}).\qquad(7)$$

This step uses only the functional relationship between $H_t$ and $\hat{H}$; no assumptions on the data-generating process are needed.
Step 2: Jensen gap bound on mutual information loss. We bound $I(H^{*};E_{t+\Delta t})-I(\hat{H};E_{t+\Delta t})$.

*Expressing MI as expected KL divergence.* For any representation $R$ jointly distributed with the binary label $E=E_{t+\Delta t}$:

$$I(R;\,E)=\mathbb{E}_{R}\bigl[D_{\mathrm{KL}}\bigl(\mathrm{Ber}(\eta_R(R))\,\big\|\,\mathrm{Ber}(\pi_e)\bigr)\bigr],\qquad(8)$$

where $\eta_R(r)\coloneqq P(E=1\mid R=r)$ and $\mathrm{Ber}(q)$ denotes the Bernoulli distribution with parameter $q$. Equation ([8](https://arxiv.org/html/2605.11130#A1.E8)) is the standard expression of mutual information as the expected KL divergence between the conditional and marginal label distributions; see Cover and Thomas [[5](https://arxiv.org/html/2605.11130#bib.bib108), Eq. 2.30] or Polyanskiy and Wu [[23](https://arxiv.org/html/2605.11130#bib.bib109), Ch. 3]. Applying ([8](https://arxiv.org/html/2605.11130#A1.E8)) to $R=H^{*}$ and $R=\hat{H}$:

$$I(H^{*};E)=\mathbb{E}_{H^{*}}\bigl[D_{\mathrm{KL}}\bigl(\mathrm{Ber}(\eta(H^{*}))\,\big\|\,\mathrm{Ber}(\pi_e)\bigr)\bigr],\qquad(9)$$

$$I(\hat{H};E)=\mathbb{E}_{\hat{H}}\bigl[D_{\mathrm{KL}}\bigl(\mathrm{Ber}(\eta_{\hat{H}}(\hat{H}))\,\big\|\,\mathrm{Ber}(\pi_e)\bigr)\bigr],\qquad(10)$$

where $\eta_{\hat{H}}(\hat{h})\coloneqq P(E=1\mid\hat{H}=\hat{h})$.

*Relating $\eta_{\hat{H}}$ to $\eta$ via A1.* Under [(A1)](https://arxiv.org/html/2605.11130#A1.I1.i1), $E\perp\!\!\!\perp X_{\le t}\mid H^{*}$. Since $\hat{H}=g_{\phi}(f_{\theta}(X_{\le t}),\Delta t)$ is a composition of measurable functions, it is $\sigma(X_{\le t})$-measurable, and therefore $E\perp\!\!\!\perp\hat{H}\mid H^{*}$. It follows that $P(E{=}1\mid H^{*},\hat{H})=P(E{=}1\mid H^{*})=\eta(H^{*})$. By the tower property of conditional expectation, for any value $\hat{h}$:

$$\eta_{\hat{H}}(\hat{h})=P(E=1\mid\hat{H}=\hat{h})=\mathbb{E}\bigl[\eta(H^{*})\mid\hat{H}=\hat{h}\bigr].\qquad(11)$$
*Applying Jensen's inequality.* The function $q\mapsto D_{\mathrm{KL}}(\mathrm{Ber}(q)\,\|\,\mathrm{Ber}(\pi_e))$ is convex on $(0,1)$ (its second derivative is $1/(q(1-q))>0$). *Bounding the Jensen gap.* For a twice-differentiable convex function $\varphi$, the Jensen gap satisfies [[18](https://arxiv.org/html/2605.11130#bib.bib112), Prop. 1]:

$$\mathbb{E}[\varphi(Y)]-\varphi(\mathbb{E}[Y])\;\le\;\tfrac{1}{2}\,\sup_{y}\varphi''(y)\;\mathrm{Var}(Y).\qquad(12)$$

Here $\varphi(q)=D_{\mathrm{KL}}(\mathrm{Ber}(q)\,\|\,\mathrm{Ber}(\pi_e))=q\ln(q/\pi_e)+(1{-}q)\ln((1{-}q)/(1{-}\pi_e))$ and $Y=\eta(H^{*})$ conditioned on $\hat{H}$. The second derivative is $\varphi''(q)=1/(q(1{-}q))$. Under [(A4)](https://arxiv.org/html/2605.11130#A1.I1.i4), $\eta(H^{*})\in[\underline{\eta},\overline{\eta}]$ almost surely. The supremum in ([12](https://arxiv.org/html/2605.11130#A1.E12)) is over the support of the conditional distribution $Y\mid\hat{H}=\hat{h}$, which is a subset of $[\underline{\eta},\overline{\eta}]$; we relax it to the marginal support, giving $\sup_q\varphi''(q)\le 1/(\underline{\eta}(1{-}\overline{\eta}))$. Applying ([12](https://arxiv.org/html/2605.11130#A1.E12)) conditionally on $\hat{H}=\hat{h}$:

$$\mathbb{E}\bigl[D_{\mathrm{KL}}\bigl(\mathrm{Ber}(\eta(H^{*}))\,\big\|\,\mathrm{Ber}(\pi_e)\bigr)\,\big|\,\hat{H}=\hat{h}\bigr]-D_{\mathrm{KL}}\bigl(\mathrm{Ber}(\eta_{\hat{H}}(\hat{h}))\,\big\|\,\mathrm{Ber}(\pi_e)\bigr)\;\le\;\frac{\mathrm{Var}(\eta(H^{*})\mid\hat{H}=\hat{h})}{2\,\underline{\eta}\,(1-\overline{\eta})},\qquad(13)$$

where we have used $q(1-q)\ge\underline{\eta}(1-\overline{\eta})$ for $q\in[\underline{\eta},\overline{\eta}]$ (from A4). Note that $q(1-q)$ is concave, so its minimum over $[\underline{\eta},\overline{\eta}]$ is attained at an endpoint; since $\underline{\eta}(1-\overline{\eta})\le\min\bigl(\underline{\eta}(1-\underline{\eta}),\overline{\eta}(1-\overline{\eta})\bigr)$, the bound on $\varphi''$ is valid (though not tight when the interval is asymmetric around $1/2$).
*Bounding the conditional variance via A2 and A3.* By [(A3)](https://arxiv.org/html/2605.11130#A1.I1.i3): $|\eta(H^{*})-\eta(\hat{H})|\le L\|H^{*}-\hat{H}\|_2$ (where we evaluate $\eta$ at $\hat{H}$, using that $\eta$ is defined on all of $\mathbb{R}^d$). The random variable $\eta(H^{*})$ is bounded in $[\underline{\eta},\overline{\eta}]\subset(0,1)$ by [(A4)](https://arxiv.org/html/2605.11130#A1.I1.i4), so all conditional second moments below are finite. (Note that $\eta(\hat{H})$ need not lie in $[\underline{\eta},\overline{\eta}]$ since $\hat{H}$ may fall outside the support of $H^{*}$, but this does not affect the bound: the variance inequality below requires only that the right-hand side is finite, which follows from the Lipschitz condition and bounded L2 error.) For any square-integrable random variable $Y$, $\mathrm{Var}(Y\mid Z)\le\mathbb{E}[(Y-c)^2\mid Z]$ for any $Z$-measurable $c$; taking $c=\eta(\hat{H})$:

$$\mathrm{Var}\bigl(\eta(H^{*})\mid\hat{H}\bigr)\;\le\;\mathbb{E}\bigl[(\eta(H^{*})-\eta(\hat{H}))^2\mid\hat{H}\bigr]\;\le\;L^2\,\mathbb{E}\bigl[\|H^{*}-\hat{H}\|_2^2\mid\hat{H}\bigr].\qquad(14)$$
Taking expectations over $\hat{H}$ in ([13](https://arxiv.org/html/2605.11130#A1.E13)) and substituting ([14](https://arxiv.org/html/2605.11130#A1.E14)):

$$I(H^{*};E)-I(\hat{H};E)\;\le\;\frac{L^2}{2\,\underline{\eta}\,(1-\overline{\eta})}\,\mathbb{E}\bigl[\|H^{*}-\hat{H}\|_2^2\bigr]\;\le\;\frac{L^2\,\varepsilon}{2\,\underline{\eta}\,(1-\overline{\eta})}\;=\;C_{\eta}\,L^2\,\varepsilon,\qquad(15)$$

where the last step uses [(A2)](https://arxiv.org/html/2605.11130#A1.I1.i2) and defines $C_{\eta}\coloneqq(2\,\underline{\eta}\,(1-\overline{\eta}))^{-1}$ using the posterior bounds from [(A4)](https://arxiv.org/html/2605.11130#A1.I1.i4). The constant $C_{\eta}$ depends on the posterior bounds rather than the marginal event rate, making the bound valid without assumptions on the concentration of $\eta(H^{*})$ around $\pi_e$.
Step 3: Assembling the bound. Combining ([7](https://arxiv.org/html/2605.11130#A1.E7)) and ([15](https://arxiv.org/html/2605.11130#A1.E15)):

$$I(H_t;\,E_{t+\Delta t})\;\ge\;I(\hat{H};\,E_{t+\Delta t})\;\ge\;I(H^{*};\,E_{t+\Delta t})\;-\;C_{\eta}\,L^2\,\varepsilon.\qquad\qed\qquad(16)$$
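Both ingredients of Step 2, the mutual-information identity (8) and the Jensen-gap bound (12), are easy to check numerically. The following sketch is not part of the paper's released code: the state probabilities, posteriors, and the coarsening map that stands in for $\hat{H}$ are illustrative assumptions only.

```python
# Numerical sanity check (illustrative, not from the paper's code): verifies eq. (8)
# -- mutual information equals the expected Bernoulli KL to the marginal -- and the
# Jensen-gap bound (12) on a made-up three-state representation.
import numpy as np

def bern_kl(q, p):
    """D_KL( Ber(q) || Ber(p) ) in nats."""
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

# Toy "representation" R with 3 states and event posterior eta(r) = P(E=1 | R=r).
p_r  = np.array([0.5, 0.3, 0.2])      # marginal of R
eta  = np.array([0.05, 0.20, 0.60])   # posterior per state, bounded away from 0 and 1 (A4)
pi_e = float(p_r @ eta)               # marginal event rate

# (i) MI from the joint distribution, (ii) MI as expected KL (eq. 8).
joint = np.stack([p_r * (1 - eta), p_r * eta], axis=1)      # P(R=r, E=e)
marg_e = joint.sum(axis=0)
mi_joint = sum(joint[r, e] * np.log(joint[r, e] / (p_r[r] * marg_e[e]))
               for r in range(3) for e in range(2))
mi_kl = float(p_r @ bern_kl(eta, pi_e))
assert np.isclose(mi_joint, mi_kl)

# Coarsen R into "Rhat" by merging states 1 and 2 (a deterministic map, as in Step 1);
# the DPI (eq. 7) says the mutual information can only drop.
p_merge   = p_r[1] + p_r[2]
eta_merge = (p_r[1] * eta[1] + p_r[2] * eta[2]) / p_merge    # tower property, eq. (11)
mi_hat = p_r[0] * bern_kl(eta[0], pi_e) + p_merge * bern_kl(eta_merge, pi_e)
assert mi_hat <= mi_joint + 1e-12

# Jensen-gap bound (12) on the merged cell, with sup phi'' relaxed exactly as in (13).
phi = lambda q: bern_kl(q, pi_e)
w = np.array([p_r[1], p_r[2]]) / p_merge
gap = float(w @ phi(eta[1:])) - phi(eta_merge)               # E[phi(Y)|merged] - phi(E[Y|merged])
var = float(w @ (eta[1:] - eta_merge) ** 2)
eta_lo, eta_hi = eta.min(), eta.max()
bound = 0.5 * var / (eta_lo * (1 - eta_hi))
assert gap <= bound + 1e-12
print(f"I(R;E)={mi_joint:.4f}  I(Rhat;E)={mi_hat:.4f}  Jensen gap={gap:.4f} <= bound={bound:.4f}")
```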
### A.4 Tightness Analysis
#### When is the bound tight?
The bound is approximately tight when: (a) the conditional variance $\mathrm{Var}(\eta(H^{*})\mid\hat{H})$ is close to $L^2\,\mathbb{E}[\|H^{*}-\hat{H}\|_2^2\mid\hat{H}]$ (the Lipschitz bound on variance is tight, which occurs when the prediction error aligns with the direction of steepest change in $\eta$), and (b) the Jensen gap bound ([12](https://arxiv.org/html/2605.11130#A1.E12)) is close to equality (which occurs when $\eta(H^{*})$ is approximately symmetrically distributed around its conditional mean given $\hat{H}$). In the high-SNR regime ($\varepsilon\ll I(H^{*};E_{t+\Delta t})/(C_{\eta}L^2)$), the penalty term is small relative to the mutual information and the bound approaches $I(H_t;E_{t+\Delta t})\gtrsim I(H^{*};E_{t+\Delta t})$: nearly all event information is retained.
#### When is the bound vacuous?
Setting the right-hand side of ([6](https://arxiv.org/html/2605.11130#S3.E6)) to zero gives the vacuity threshold:

$$\varepsilon_{\mathrm{vac}}=\frac{I(H^{*};E_{t+\Delta t})}{C_{\eta}\,L^2}.\qquad(17)$$

When $\varepsilon>\varepsilon_{\mathrm{vac}}$, the bound provides no guarantee. This occurs when (i) $I(H^{*};E_{t+\Delta t})\approx 0$ (no precursors), (ii) $\varepsilon$ is large (poor pretraining), or (iii) $L$ is large (irregular event boundary). Importantly, a vacuous bound does not imply that $I(H_t;E_{t+\Delta t})=0$; the encoder may retain information through paths not captured by our analysis (e.g. the encoder directly encodes precursor patterns without going through the predictor).
### A.5 Why Predictor Weights Do Not Transfer
Our empirical finding ([appendix F](https://arxiv.org/html/2605.11130#A6)) that predictor *architecture* matters but predictor *pretrained weights* do not is explained by a codomain mismatch.

During pretraining, $g_{\phi}\colon\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}^d$ maps encoder states to predicted *representations*; its codomain is the full $d$-dimensional representation space. During finetuning, the predictor (with the same architecture but a different final layer) maps to *event logits*: $g_{\phi}'\colon\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}^K$, where $K$ is the number of horizon intervals and $K\ll d$.

Formally, pretraining minimises $\mathbb{E}[\|g_{\phi}(H_t,\Delta t)-H^{*}\|_1]$, while finetuning minimises $\mathbb{E}[\ell_{\mathrm{BCE}}(\sigma(g_{\phi}'(H_t,\Delta t)),\,E)]$. The pretraining objective rewards the predictor for reconstructing *all* $d$ components of $H^{*}$, most of which encode event-irrelevant dynamics (channel-level forecasting, trend, seasonality). The finetuning objective rewards the predictor for extracting only the $\le K$ bits relevant to event prediction. The optimal weight matrices for these two objectives need not be correlated, and indeed our experiments confirm they are not: at 100% labels, pred-FT with pretrained predictor weights achieves h-AUROC within $\pm 0.003$ of pred-FT with randomly initialised predictor weights (the legacy AUPRC reading was within the same margin).
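A minimal sketch of this codomain mismatch, assuming the layer widths described in [appendix C](https://arxiv.org/html/2605.11130#A3) (a 3-layer trunk on $[\mathbf{h}_t;\Delta t]$ with hidden width 256 and a LayerNorm + linear event head). The exact layer composition and activations are our assumption, although the layout reproduces the 197.6K / 769 parameter counts quoted there; the same trunk is trained against a $d$-dimensional representation target during pretraining and against an event logit during finetuning.

```python
# Sketch only: assumed layer layout, dummy tensors; not the released HEPA code.
import torch
import torch.nn as nn

d = 256  # representation width (paper value)

trunk = nn.Sequential(             # horizon-conditioned predictor g_phi([h_t; dt])
    nn.Linear(d + 1, d), nn.GELU(),
    nn.Linear(d, d), nn.GELU(),
    nn.Linear(d, d),               # pretraining codomain: the full R^d representation space
)
event_head = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 1))   # finetuning codomain: a logit

print(sum(p.numel() for p in trunk.parameters()),        # 197,632 (~197.6K, as in appendix C)
      sum(p.numel() for p in event_head.parameters()))   # 769

h_t    = torch.randn(32, d)                      # frozen encoder output (dummy batch)
dt     = torch.rand(32, 1)                       # normalised horizon query
h_star = torch.randn(32, d)                      # target-encoder representation (dummy)
event  = torch.randint(0, 2, (32, 1)).float()    # "event within dt" label (dummy)

# Pretraining objective: L1 regression of the target representation.
loss_pre = nn.functional.l1_loss(trunk(torch.cat([h_t, dt], dim=-1)), h_star)

# Finetuning objective: positive-weighted BCE on the event logit produced by the same
# trunk routed through the small event head (the paper uses pos_weight = N_neg / N_pos;
# 20.0 here is a placeholder).
logit = event_head(trunk(torch.cat([h_t, dt], dim=-1)))
loss_fine = nn.functional.binary_cross_entropy_with_logits(
    logit, event, pos_weight=torch.tensor(20.0))
```

Because the pretraining trunk is optimised to reconstruct all $d$ coordinates of $H^{*}$ while the finetuned route only needs what feeds the event logit, there is no reason for the two optima to share weights, which is what the $\pm 0.003$ result above reflects.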
This is analogous to the observation in vision that pretrained *classifier heads* do not transfer across tasks while pretrained *backbones* do: the backbone compresses the input into a general-purpose representation, while the head specialises to a specific output space.
### A.6 Connection to Empirical Results

#### C-MAPSS (turbofan degradation).

Degradation in turbofan engines develops over hundreds of cycles, with sensor readings progressively deviating from healthy baselines. The target encoder, seeing the future interval bidirectionally, captures degradation state with high fidelity: $I(H^{*};E_{t+\Delta t})$ is large (the future interval is highly informative about whether failure occurs in that interval). Pretraining achieves low prediction error $\varepsilon$ because the dynamics are smooth and near-deterministic. [Proposition 1](https://arxiv.org/html/2605.11130#Thmproposition1) predicts strong event-information retention, consistent with h-AUROC $=0.806$ on C-MAPSS-1 and $=0.568$ on C-MAPSS-2 (5 seeds, dense $K{=}150$).
#### CHB-MIT (seizure prediction).

We evaluated CHB-MIT as a pilot to test the bound's predictions on a dataset expected to fail; it is excluded from the formal benchmark in [table 1](https://arxiv.org/html/2605.11130#S5.T1) because scalp EEG seizure prediction is a fundamentally different signal regime: seizure onset has minimal electrographic precursors in the pre-ictal period, and none reliably observable within the 16-second context window used by all other datasets in our benchmark. The target encoder's representation $H^{*}$ therefore carries little event information: $I(H^{*};E_{t+\Delta t})\approx 0$. Condition (i) of [Corollary 2](https://arxiv.org/html/2605.11130#Thmproposition2) fails regardless of pretraining quality. This is consistent with the empirical result of h-AUROC $=0.497$ (chance level), and turns CHB-MIT into a clean negative control for the theory rather than an inappropriate target.
#### C-MAPSS-2 (multi-operating-condition).

C-MAPSS-2 adds operating-condition variability to C-MAPSS-1. This increases $\varepsilon$ (a harder prediction task) without proportionally increasing $I(H^{*};E_{t+\Delta t})$ (the event information is the same; only the noise increases). The bound predicts that C-MAPSS-2 should be harder than C-MAPSS-1 for the same architecture, consistent with the observed drop from $0.806$ to $0.568$ h-AUROC.
#### Label efficiency.

At 5% labels, pred-FT achieves F1 $=0.261$ vs. scratch $=0.035$. The bound supports this observation: the encoder's $I(H_t;E_{t+\Delta t})$ is established during pretraining and does not depend on labels. Reducing labels degrades the finetuning optimisation (finding the right head weights) but cannot reduce the information available in the frozen encoder. Note that the proposition addresses information retention in the encoder, not the sample complexity of the finetuning step; the label-efficiency advantage is consistent with, but not directly formalised by, the bound.
### A.7 Per-Horizon Generalisation

[Proposition 1](https://arxiv.org/html/2605.11130#Thmproposition1) is stated for a fixed horizon $\Delta t$. We now make the horizon dependence explicit. All quantities on the right-hand side of the bound can vary with $\Delta t$: the event posterior $\eta_{\Delta t}(h)=P(E_{t+\Delta t}=1\mid H^{*}_{\Delta t}=h)$, the Lipschitz constant $L(\Delta t)$, the prediction error $\varepsilon(\Delta t)=\mathbb{E}[\|H^{*}_{\Delta t}-\hat{H}_{\Delta t}\|_2^2]$, and the posterior bounds $\underline{\eta}_{\Delta t},\overline{\eta}_{\Delta t}$ from [(A4)](https://arxiv.org/html/2605.11130#A1.I1.i4).
###### Proposition 3 (Per-horizon information retention).

Under the assumptions of [proposition 1](https://arxiv.org/html/2605.11130#Thmproposition1) applied at each horizon $\Delta t>0$ separately, with horizon-dependent quantities $L(\Delta t)$, $\varepsilon(\Delta t)$, and $C_{\eta}(\Delta t)=(2\underline{\eta}_{\Delta t}(1-\overline{\eta}_{\Delta t}))^{-1}$:

$$I\bigl(H_{t};\,E_{t+\Delta t}\bigr)\;\ge\;I\bigl(H^{*}_{\Delta t};\,E_{t+\Delta t}\bigr)\;-\;C_{\eta}(\Delta t)\,L(\Delta t)^{2}\,\varepsilon(\Delta t).\qquad(18)$$
###### Proof\.
Apply the proof of [proposition 1](https://arxiv.org/html/2605.11130#Thmproposition1) verbatim at each fixed $\Delta t$, with all quantities carrying a subscript $\Delta t$. ∎
#### Predicted shape of the per\-horizon AUROC curve\.
Equation ([18](https://arxiv.org/html/2605.11130#A1.E18)) predicts how the encoder's event information varies with horizon. Three effects combine:

- $I(H^{*}_{\Delta t};E_{t+\Delta t})$ is typically non-monotone: very short horizons may not yet encompass precursors; very long horizons dilute the event signal with unrelated dynamics.
- $\varepsilon(\Delta t)$ is monotonically increasing (representations further into the future are harder to predict from $H_t$).
- $L(\Delta t)$ can increase with $\Delta t$ if the event boundary sharpens as the event approaches.

The net effect is that $I(H_t;E_{t+\Delta t})$ peaks at intermediate $\Delta t$ and decays at large $\Delta t$. While MI and AUROC are not monotonically related in general, this qualitative shape is compatible with the per-horizon AUROC curves observed in our experiments: AUROC tends to be highest at $\Delta t\in[10,50]$ cycles for C-MAPSS and at $\Delta t\in[1,20]$ steps for the anomaly datasets, then declines. The decay is more pronounced for the anomaly datasets because $\varepsilon(\Delta t)$ grows faster when the dynamics are irregular.
### A.8 Relationship to the Information Bottleneck

The Information Bottleneck (IB) framework [[27](https://arxiv.org/html/2605.11130#bib.bib110)] seeks a representation $T$ of input $X$ that maximises $I(T;Y)$ (relevance) while minimising $I(T;X)$ (compression):

$$\min_{P(T\mid X)}\;I(T;X)-\beta\,I(T;Y).\qquad(19)$$
JEPA pretraining differs from IB in three ways:
1. No explicit compression. JEPA does not penalise $I(H_t;X_{\le t})$. The encoder is free to retain all information about the past. Implicit compression arises only from the architectural bottleneck ($d=256$).
2. Predictive, not discriminative. The IB target $Y$ is a label; the JEPA target is a future representation $H^{*}$. JEPA implicitly encourages high $I(H_t;H^{*})$ by minimising prediction error (through the predictor), and $I(H_t;H^{*})$ is an upper bound on $I(H_t;Y)$ for any $Y$ that is a function of the future interval (by the DPI). This makes JEPA a *looser* but *more general* objective: it retains information about all downstream tasks, not just one.
3. Asymmetric architecture. The IB is typically symmetric in $X$ and $Y$. JEPA imposes causal structure: the encoder sees only the past, the target encoder sees only the future. This causal asymmetry is essential for event prediction, where we cannot condition on the future at test time.
The connection to our result is direct. Let $Y=E_{t+\Delta t}$. By [proposition 1](https://arxiv.org/html/2605.11130#Thmproposition1):

$$I(H_t;Y)\;\ge\;I(H^{*};Y)\;-\;C_{\eta}\,L^{2}\,\varepsilon.\qquad(20)$$

As the JEPA pretraining loss drives $\varepsilon\to 0$, the encoder's mutual information with $Y$ approaches $I(H^{*};Y)$, the maximum achievable by the target representation. This shows that minimising the JEPA pretraining loss implicitly maximises the IB relevance term $I(H_t;Y)$ for *any* event $Y$ satisfying A1–A4, without knowing $Y$ at pretraining time. The price paid relative to IB is that JEPA also retains event-irrelevant information (it does not compress), which manifests as higher representation dimensionality but not as reduced downstream performance.
## Appendix B Horizon Interval Design

All methods (HEPA, PatchTST, iTransformer, MAE, Chronos-2) use dense unit-step horizons: $K{=}150$ for C-MAPSS and TEP ($\Delta t\in\{1,2,\ldots,150\}$) and $K{=}200$ for all other datasets ($\Delta t\in\{1,2,\ldots,200\}$). All methods share the same horizon set per dataset, ensuring a fair comparison. The h-AUROC evaluation skips degenerate horizons (event prevalence $<0.001$ or $>0.999$).
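A minimal sketch of this evaluation convention, assuming labels stored as an $n_{\text{windows}}\times K$ binary matrix of "event within $\Delta t$" indicators; the helper name and data layout are ours, not the released evaluation code.

```python
# Sketch only: dense unit-step horizon grid and the degenerate-horizon filter
# described above (prevalence outside [0.001, 0.999] is skipped).
import numpy as np
from sklearn.metrics import roc_auc_score

def h_auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """scores, labels: shape (n_windows, K); labels[i, k] = 1 iff the event occurs
    within horizon k+1 of window i. Returns AUROC averaged over non-degenerate horizons."""
    per_horizon = []
    for k in range(labels.shape[1]):
        prevalence = labels[:, k].mean()
        if prevalence < 0.001 or prevalence > 0.999:   # degenerate horizon: skip
            continue
        per_horizon.append(roc_auc_score(labels[:, k], scores[:, k]))
    return float(np.mean(per_horizon))

# Dummy usage with K=150 (the C-MAPSS / TEP grid; K=200 for the other datasets).
rng = np.random.default_rng(0)
K = 150
event_time = rng.integers(1, 400, size=(500, 1))            # dummy time-to-event per window
labels = (event_time <= np.arange(1, K + 1)[None, :]).astype(int)   # cumulative labels
scores = labels * 0.6 + rng.random((500, K)) * 0.4           # noisy but informative scores
print(round(h_auroc(scores, labels), 3))
```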
## Appendix C Baseline Comparison Protocol

[Table 1](https://arxiv.org/html/2605.11130#S5.T1) compares HEPA against four baselines under a matched protocol designed to isolate the effect of the pretraining objective. All five methods share:

- Matched downstream capacity: a horizon-conditioned MLP that maps a frozen representation $\mathbf{h}_t$ and horizon $\Delta t$ to per-horizon event logits (details below), trained with positive-weighted BCE ([section 3.2](https://arxiv.org/html/2605.11130#S3.SS2)).
- Identical evaluation: h-AUROC averaged over non-degenerate horizons, the same train/val/test splits, and the same horizon sets ($K{=}150$ or $K{=}200$).
- Identical label budgets: 100% and 10% label fractions with the same subsampling procedure.

#### Downstream head architecture.

For HEPA, the finetuned component is the pretrained predictor MLP (a 3-layer MLP mapping $[\mathbf{h}_t;\Delta t]\to\hat{\mathbf{h}}\in\mathbb{R}^{256}$; 197.6K params) plus a shared linear event head (LayerNorm + linear $\to$ logit; 769 params), totalling 198K finetuned parameters. The predictor is initialised from pretraining; however, [section I.2](https://arxiv.org/html/2605.11130#A9.SS2) shows that this initialisation carries at most $+0.003$ h-AUROC over random initialisation, confirming that the benefit comes from the frozen encoder, not the predictor's starting weights.

For all baselines (PatchTST, iTransformer, MAE, Chronos-2), we replace the predictor and event head with a single dt-conditioned MLP head: a linear projection from $d_{\text{input}}$ to 256, a learned horizon embedding (256 entries $\times$ 256 dims), followed by LayerNorm and a 3-layer MLP ($256\to 256\to 256\to 1$). With $d_{\text{input}}{=}256$ this totals 264K parameters, giving the baselines slightly *more* downstream capacity than HEPA's 198K.
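A sketch of this baseline head under one plausible wiring (projected features and horizon embedding fused by summation, which is our assumption); the layer sizes follow the text and reproduce the quoted ~264K parameter count.

```python
# Sketch only: assumed fusion and activations; not the released baseline-head code.
import torch
import torch.nn as nn

class DtConditionedHead(nn.Module):
    def __init__(self, d_input: int = 256, d_model: int = 256, n_horizons: int = 256):
        super().__init__()
        self.proj = nn.Linear(d_input, d_model)                # frozen-feature projection
        self.horizon_emb = nn.Embedding(n_horizons, d_model)   # learned horizon embedding
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, 1),                             # one event logit per (t, dt) query
        )

    def forward(self, features: torch.Tensor, horizon_idx: torch.Tensor) -> torch.Tensor:
        z = self.proj(features) + self.horizon_emb(horizon_idx)   # assumed fusion: sum
        return self.mlp(self.norm(z)).squeeze(-1)

head = DtConditionedHead()
print(sum(p.numel() for p in head.parameters()))   # 263,681 -- matches the ~264K figure above

logits = head(torch.randn(8, 256), torch.randint(0, 256, (8,)))   # dummy batch
```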
#### Encoder training\.
The methods differ only in how the frozen encoder is obtained:
PatchTST [[22](https://arxiv.org/html/2605.11130#bib.bib14)]: channel-independent patching with bidirectional self-attention, trained end-to-end (supervised, no self-supervised pretraining) on each dataset. Same patch size ($P{=}16$) and context length (512 steps) as HEPA. 2.26M trained parameters; at evaluation time the encoder is frozen and the baseline MLP head is trained from scratch. 5 seeds.

iTransformer [[20](https://arxiv.org/html/2605.11130#bib.bib74)]: inverted Transformer that treats each variate (sensor channel) as a token and applies self-attention across variates rather than across time steps. Trained end-to-end (supervised) on each dataset with the same context window and horizon set. At evaluation the encoder is frozen and the baseline MLP head is trained. 5 seeds.

MAE (masked autoencoder): uses the same causal Transformer encoder architecture as HEPA but is pretrained with a masked-reconstruction objective: random patches are masked and the decoder reconstructs the masked input values. This isolates the effect of predicting future *representations* (JEPA) versus reconstructing masked *values* (MAE) under an otherwise identical architecture. After pretraining, the decoder is discarded, the encoder is frozen, and the baseline MLP head is trained. 5 seeds.

Chronos-2 [[1](https://arxiv.org/html/2605.11130#bib.bib19)]: a foundation model (119M parameters) pretrained on a large external time-series corpus. Representations are extracted per channel (univariate) and mean-pooled across channels for multivariate datasets. The baseline MLP head is trained on the frozen features. 3 seeds. This is the only method that uses external pretraining data; all others are trained exclusively on the target dataset.
#### Why this protocol is fair\.
By freezing all encoders and training only the downstream head, we ensure that differences in h-AUROC reflect the quality of the learned representations, not differences in head capacity, loss function, or optimisation. The only free variable is the encoder and how it was trained. HEPA finetunes 198K params (predictor + head); the baselines train 264K params (dt-MLP head), giving them a slight capacity advantage. HEPA's predictor initialisation carries negligible benefit ([section I.2](https://arxiv.org/html/2605.11130#A9.SS2)). PatchTST and iTransformer serve as supervised baselines (no pretraining benefit); MAE serves as a reconstruction-based SSL baseline (same architecture as HEPA, different objective); Chronos-2 serves as a large-corpus foundation-model baseline.
## Appendix D Dataset Overview

Table 3: Dataset overview (14 datasets, 11 domains). Default: patch size $P{=}16$, sliding window of 512 steps (32 tokens), except C-MAPSS, which uses the full engine history (8–23 tokens per cycle).
## Appendix E Architectural Decisions
## Appendix F Finetuning-Mode Ablation

[Table 4](https://arxiv.org/html/2605.11130#A6.T4) compares five finetuning strategies on C-MAPSS under an earlier architecture variant (EMA target, 790K pred-FT params). Predictor finetuning is the only mode that remains competitive at both full and low label budgets; scratch training collapses at 5% labels.

Table 4: Finetuning-mode ablation on C-MAPSS (5 seeds, F1w; earlier architecture variant with EMA target). This variant uses 790K pred-FT params and 2.37M E2E. Pred-FT outperforms scratch by $+0.226$ at 5% labels ($p{=}0.039$, $d{=}1.35$). Scratch training collapses at low label budgets (RMSE 33), confirming the value of pretraining. Under the final architecture (198K pred-FT, SIGReg), pred-FT and E2E achieve equivalent performance at all label budgets ($\Delta\le 0.003$).
## Appendix G Foundation Model Comparisons

[Table 5](https://arxiv.org/html/2605.11130#A7.T5) consolidates the matched-head comparison between HEPA (5 seeds) and four time-series foundation models: Chronos-2, MOMENT-1-large [[13](https://arxiv.org/html/2605.11130#bib.bib20)] (341.2M parameters), TFM-2.5 [[6](https://arxiv.org/html/2605.11130#bib.bib18)] (203.6M parameters), and Moirai-1.1-R-base [[32](https://arxiv.org/html/2605.11130#bib.bib11)] (91.4M parameters). All five encoders are frozen and feed an identical 198K-param dt-conditioned MLP head trained with positive-weighted BCE under the same labels, splits, and evaluation protocol; only the frozen encoder differs.

Table 5: HEPA vs. four foundation models (matched 198K MLP head, 100% labels). HEPA: 5 seeds. Chronos-2, MOMENT, TFM-2.5, Moirai: 3 seeds each. Bold = best per row. "—": model not run on that dataset. Coverage: Chronos-2: all datasets. MOMENT: 5 datasets (the remainder omitted due to extraction cost). TFM-2.5: 9 datasets. Moirai: 7 datasets (SMAP infeasible). MOMENT on C-MAPSS-3 (0.47) is below chance, indicating that per-channel embeddings anti-predict turbofan degradation. Moirai on BATADAL (0.36) is below chance: univariate patching discards cross-sensor correlations. Foundation-model wins cluster on benchmarks favoured by their pretraining corpus; HEPA wins on lifecycle benchmarks (C-MAPSS-1, C-MAPSS-3, ETTm1).
## Appendix H Full MTS-JEPA Comparison

[Table 6](https://arxiv.org/html/2605.11130#A8.T6) compares HEPA against MTS-JEPA [[14](https://arxiv.org/html/2605.11130#bib.bib6)] on event-prediction benchmarks from [table 1](https://arxiv.org/html/2605.11130#S5.T1). Both methods use the same context length, patch size, sequence batching, and matched-capacity downstream head; the only difference is the self-supervised objective (HEPA: predictive JEPA + SIGReg; MTS-JEPA: the published contrastive + codebook loss). HEPA wins on 8 of the 9 datasets where MTS-JEPA could be run, with the largest margins on lifecycle and ICS benchmarks (C-MAPSS-1, ETTm1, BATADAL); MTS-JEPA wins on cardiac arrhythmia (MBA). TEP is excluded because the public MTS-JEPA release does not include a chemical-process benchmark and the encoder fails to converge on its 52-dimensional input.

Table 6: HEPA vs. MTS-JEPA [[14](https://arxiv.org/html/2605.11130#bib.bib6)]. Mean ($\pm$ std) over available seeds. HEPA: 5 seeds. MTS-JEPA reproduction: 1–3 seeds. Both methods use identical patch size, sequence length, and matched-capacity downstream head; only the SSL objective differs.
## Appendix I Additional Ablations

### I.1 Does the Predictor Help at Inference?

With all weights frozen, a probe on $\mathbf{h}_{\text{past}}$ alone achieves 0.299 F1w at 100% labels, while a frozen multi-horizon probe on $[\hat{\mathbf{h}}_1;\ldots;\hat{\mathbf{h}}_{16}]$ achieves only 0.148. The pretrained predictor's outputs have high cross-horizon cosine similarity ($>0.98$), so the concatenation is nearly redundant. Finetuning the predictor pushes the per-horizon outputs apart: pred-FT's value lies in reshaping the predictor, not in extracting fixed rollouts.
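The redundancy check behind this result can be reproduced on any stack of frozen rollouts; the sketch below uses dummy tensors of the assumed shape (windows $\times$ horizons $\times$ $d$) purely to show the computation.

```python
# Sketch only: cross-horizon cosine similarity of frozen predictor rollouts.
# Values near 1 mean the concatenated multi-horizon probe adds little beyond a single rollout.
import torch
import torch.nn.functional as F

n_windows, n_horizons, d = 128, 16, 256
# Stand-in for frozen-predictor rollouts h_hat_k, k = 1..16: a shared component plus small
# per-horizon perturbations, mimicking the high-similarity regime described above.
rollouts = torch.randn(n_windows, 1, d) + 0.05 * torch.randn(n_windows, n_horizons, d)

z = F.normalize(rollouts, dim=-1)                    # unit-norm per representation
sim = torch.einsum('nkd,nld->nkl', z, z)             # pairwise cosine similarity per window
off_diag = sim.mean(0)[~torch.eye(n_horizons, dtype=torch.bool)]
print(f"mean cross-horizon cosine similarity: {float(off_diag.mean()):.3f}")   # close to 1 here
```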
### I.2 Pretrained vs. Random-Init Predictor

To isolate whether the predictor's pretrained weights carry useful information, we compared pred-FT with pretrained vs. randomly initialised predictor weights (encoder frozen, 3 seeds each). On C-MAPSS-1, pretrained weights yield h-AUROC $0.9257\pm 0.0008$ vs. random init $0.9235\pm 0.0027$ ($p=0.38$). On SMAP, the difference is $0.3874\pm 0.0205$ vs. $0.3950\pm 0.0286$ ($p=0.34$). On MBA, $0.9465\pm 0.0004$ vs. $0.9435\pm 0.0016$ ($p=0.089$). Pretrained predictor weights carry at most $+0.003$ h-AUROC, confirming that the value of pred-FT lies in the pretrained encoder and the finetuning recipe, not in the predictor's initial weights.
### I.3 Target Encoder Update Rule: Joint Training vs. EMA

Our default target-encoder strategy is joint training: both encoders share weights and are updated by the same optimizer, with a SIGReg regulariser [[3](https://arxiv.org/html/2605.11130#bib.bib44)] ($\alpha{=}0.1$) preventing collapse. We compare this against EMA ($\tau{=}0.99$) across all 12 datasets (3 seeds, pred-FT downstream). Joint training wins on 6 of 12 datasets (largest gain: MBA, $+0.108$), loses on 4 (largest loss: SMAP, $-0.048$), and ties on 2. All deltas are within $\pm 0.11$ h-AUROC. Predictor finetuning is robust to the target-encoder update rule; we use joint training for its simplicity (no momentum schedule, no separate sync interval).
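For concreteness, the two update rules differ only in how the target encoder's weights track the online encoder's. A minimal sketch with assumed module shapes; the SIGReg term itself lives in the loss and is omitted here.

```python
# Sketch only: weight sharing ("joint training", the default) versus an EMA copy (tau = 0.99).
import copy
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(14, 256), nn.GELU(), nn.Linear(256, 256))  # dummy encoder

# Option A: joint training -- the target encoder *is* the online encoder.
target_joint = encoder

# Option B: EMA target -- a detached copy updated after every optimiser step.
target_ema = copy.deepcopy(encoder)
for p in target_ema.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(online: nn.Module, target: nn.Module, tau: float = 0.99) -> None:
    """target <- tau * target + (1 - tau) * online, applied parameter-wise."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1.0 - tau)

# After each optimiser.step() during pretraining one would call:
ema_update(encoder, target_ema, tau=0.99)
```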
### I.4 Predictor Dynamics Visualisation

[Figure 5](https://arxiv.org/html/2605.11130#A9.F5) shows how the finetuned predictor transforms the encoder's representations across prediction horizons. The encoder outputs (blue) cluster by degradation state; as the horizon $k$ increases, the predicted representations shift progressively away from the encoder manifold, indicating that the predictor learns horizon-dependent mappings rather than simply copying the encoder output. This separation confirms that predictor finetuning reshapes the latent space to distinguish event-relevant from event-irrelevant dynamics at each timescale.

Figure 5: Predictor outputs in latent space (C-MAPSS-1). t-SNE of 256-dimensional representations; axes are the two t-SNE components (arbitrary units). Blue: encoder output $\mathbf{h}_t$. Light to dark red: predicted representations at horizons $k=10,50,100$. Outputs shift progressively with $k$.
## Appendix J Legacy Metrics

C-MAPSS RMSE is derived from $\mathbb{E}[\Delta t]$ under the stored probability surface; this CDF-based estimator differs from the point-prediction protocol of STAR [[10](https://arxiv.org/html/2605.11130#bib.bib70)] (replicated RMSE 12.2), so the gap reflects both protocol and SSL-vs-supervised differences. PA-F1 matches the cited baselines' protocol. HEPA domain-metric values: SMAP PA-F1 $0.88\pm 0.04$ (vs. MTS-JEPA [[14](https://arxiv.org/html/2605.11130#bib.bib6)] 0.34), PSM PA-F1 $0.95\pm 0.01$ (vs. MTS-JEPA 0.62).
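A sketch of the CDF-based estimator, under the assumption that the stored surface gives, per window, a monotone vector $F(\Delta t)=P(\text{event within }\Delta t)$ on the unit-step grid; how leftover mass beyond the last horizon is handled (assigned to the cap below) is our choice, not a documented detail.

```python
# Sketch only: RUL point estimate as E[dt] under the monotone survival CDF, then RMSE.
import numpy as np

def expected_dt(cdf: np.ndarray) -> np.ndarray:
    """cdf: shape (n_windows, K); cdf[i, k] = P(event within horizon k+1), non-decreasing in k.
    Returns E[dt] per window; probability mass beyond K is assigned to the cap K (assumption)."""
    n, K = cdf.shape
    pmf = np.diff(np.concatenate([np.zeros((n, 1)), cdf], axis=1), axis=1)
    return pmf @ np.arange(1, K + 1) + (1.0 - cdf[:, -1]) * K

# Dummy usage: 3 windows with true RUL 20, 60, 110 and sharp sigmoid-shaped predicted CDFs.
true_rul = np.array([20.0, 60.0, 110.0])
grid = np.arange(1, 151)
cdf = 1.0 / (1.0 + np.exp(-(grid[None, :] - true_rul[:, None]) / 5.0))
rul_hat = expected_dt(cdf)
rmse = float(np.sqrt(np.mean((rul_hat - true_rul) ** 2)))
print(rul_hat.round(1), round(rmse, 2))
```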
## Appendix K PA-F1 Inflation

PA-F1 (point-adjusted F1) credits an entire anomaly segment as correctly predicted if any single timestep within it exceeds the threshold [[35](https://arxiv.org/html/2605.11130#bib.bib75)]. This protocol inflates reported F1 dramatically when anomaly segments are long and prevalence is non-trivial: the TSAD-Eval study [[26](https://arxiv.org/html/2605.11130#bib.bib79)] shows that a random detector can exceed $F_1=0.9$ under PA on some datasets. We demonstrate this concretely by contrasting PA-F1 and non-PA F1 for HEPA: SMAP 0.862 PA vs. 0.474 non-PA, PSM 0.950 PA vs. 0.575 non-PA. The gap in PA-F1 between HEPA and MTS-JEPA remains large even under a random-baseline sanity check (a random-init encoder achieves 0.604 PA-F1 on SMAP; HEPA reaches 0.79), so the margin is not purely inflation, but any PA-F1 number should be read alongside the corresponding non-PA number.
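The inflation is mechanical and easy to reproduce. The sketch below applies the standard point-adjustment recipe to a purely random detector on a synthetic label sequence with one long anomaly segment; it is illustrative, not the paper's evaluation code.

```python
# Sketch only: point adjustment credits a whole ground-truth segment if any timestep in it fires.
import numpy as np
from sklearn.metrics import f1_score

def point_adjust(pred: np.ndarray, label: np.ndarray) -> np.ndarray:
    """Expand predictions over each contiguous label==1 segment containing a predicted positive."""
    adjusted = pred.copy()
    boundaries = np.flatnonzero(np.diff(np.concatenate([[0], label, [0]])))
    for start, end in zip(boundaries[::2], boundaries[1::2]):   # [start, end) segments
        if pred[start:end].any():
            adjusted[start:end] = 1
    return adjusted

rng = np.random.default_rng(0)
label = np.zeros(2000, dtype=int)
label[300:500] = 1                              # one long anomaly segment (10% prevalence)
pred = (rng.random(2000) < 0.05).astype(int)    # a random detector firing 5% of the time

print("plain F1:", round(f1_score(label, pred), 3))                        # near chance
print("PA-F1   :", round(f1_score(label, point_adjust(pred, label)), 3))   # strongly inflated
```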
## Appendix L Preprocessing Details

The normalisation, channel-drop thresholds, and context lengths follow the conventions in the cited SSL and anomaly-detection benchmarks. For C-MAPSS we use the SELECTED_SENSORS subset (14 sensors after removing near-constant channels). The RUL cap is 125 cycles, following STAR [[10](https://arxiv.org/html/2605.11130#bib.bib70)]. Specific numeric thresholds are in the code repository.
## Appendix M Hyperparameters

All hyperparameters are fixed across datasets unless noted. Seeds are $\{0,1,2,3,4\}$ for 5-seed runs and $\{0,1,2\}$ for 3-seed runs. Pretraining trains the encoder and predictor jointly; finetuning freezes the encoder and trains only the predictor and event head.

Table 7: Hyperparameters. Fixed across all datasets. Pretraining trains encoder + predictor; finetuning trains predictor + event head (encoder frozen).

| Stage | Hyperparameter | Value |
| --- | --- | --- |
| Pretraining (encoder + predictor) | Optimizer | AdamW |
| | Learning rate | $3\times 10^{-4}$ |
| | Weight decay | $1\times 10^{-2}$ |
| | Batch size | 64 |
| | Epochs | 100 (patience 10) |
| | SIGReg weight $\alpha$ | 0.1 |
| | Horizon sampling | LogUniform $[1,K]$ |
| Finetuning (predictor + event head) | Optimizer | AdamW |
| | Learning rate | $1\times 10^{-3}$ |
| | Weight decay | $1\times 10^{-2}$ |
| | Batch size | 64 |
| | Epochs | 50 (patience 10) |
| | Pos-weight $w^{+}$ | $N_{\text{neg}}/N_{\text{pos}}$ |
| Architecture | Encoder layers | 2 |
| | $d_{\text{model}}$ | 256 |
| | Attention heads | 4 |
| | Patch size $P$ | 16 |
| | Dropout | 0.1 |
## Appendix N Pairwise Significance Tests

[Table 8](https://arxiv.org/html/2605.11130#A14.T8) reports Welch's two-sided t-test (unpaired, unequal variances) between HEPA and each architectural baseline at each (dataset, label-fraction) cell with at least 3 seeds per arm. Out of 42 cells (14 datasets $\times$ 3 baselines, 100% labels), HEPA wins 21 at $p<0.05$ and loses 7; the remaining 14 are not significantly different.

Table 8: Pairwise Welch's t-test, HEPA vs. each baseline. Markers: * $p<0.05$, ** $p<0.01$ in HEPA's favour; ↓ denotes HEPA losing at $p<0.05$; blank = not significant. The dominant wins are on the C-MAPSS family (all four datasets significant against at least one baseline) and against iTransformer (HEPA wins on 11 of 13). HEPA's seven losses ($p<0.05$ in the baseline's favour) are concentrated on SMAP, MBA, BATADAL, ETTm1, and Beijing-AQ, where event signatures are sensor-localised or where MAE's reconstruction objective transfers well.
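Each cell reduces to a standard Welch test on the per-seed h-AUROC values; a minimal sketch with dummy seed scores:

```python
# Sketch only: Welch's two-sided t-test (unpaired, unequal variances) for one
# (dataset, label-fraction) cell; the seed-level scores below are placeholders.
import numpy as np
from scipy import stats

hepa_seeds     = np.array([0.806, 0.801, 0.810, 0.804, 0.808])   # dummy per-seed h-AUROC
baseline_seeds = np.array([0.771, 0.780, 0.765, 0.774, 0.769])   # dummy per-seed h-AUROC

t_stat, p_value = stats.ttest_ind(hepa_seeds, baseline_seeds, equal_var=False)
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.4f}")
```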
## Appendix O Calibration Analysis

[Table 9](https://arxiv.org/html/2605.11130#A15.T9) reports Expected Calibration Error (ECE) and Brier score on five datasets for which we hold verified HEPA-SIGReg probability surfaces (single seed, $s{=}42$; 10 equal-width bins; Murphy decomposition: Brier = Reliability $-$ Resolution $+$ Uncertainty).

Table 9: Calibration of HEPA probability surfaces on five datasets. Calibration varies substantially: HEPA is well calibrated on FD004 (ECE 0.030) but miscalibrated on FD001 (ECE 0.272) and VIX (ECE 0.310). The pattern is consistent with the loss design: positive-weighted BCE applied to the cumulative survival CDF up-weights the minority class to preserve recall, distorting the probability scale. Resolution remains positive on every dataset, confirming that the surface still discriminates events even when miscalibrated. For applications requiring calibrated probabilities, post-hoc Platt scaling or isotonic regression on a held-out fold is recommended; we verified offline that this reduces ECE on FD001 to below 0.05.
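A sketch of the two calibration metrics under the stated binning (10 equal-width bins). The Murphy decomposition computed on binned confidences matches the Brier score only approximately when probabilities vary within a bin, which the final print makes visible; the data below are synthetic.

```python
# Sketch only: ECE with 10 equal-width bins and the binned Murphy decomposition of the Brier score.
import numpy as np

def calibration_metrics(prob: np.ndarray, label: np.ndarray, n_bins: int = 10):
    """prob: predicted event probabilities in [0, 1]; label: binary outcomes."""
    bins = np.digitize(prob, np.linspace(0, 1, n_bins + 1)[1:-1])   # bin index 0..n_bins-1
    base_rate = label.mean()
    ece = reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.mean()                                  # fraction of samples in bin b
        conf, acc = prob[mask].mean(), label[mask].mean()
        ece         += w * abs(conf - acc)
        reliability += w * (conf - acc) ** 2
        resolution  += w * (acc - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    brier = float(np.mean((prob - label) ** 2))
    return ece, brier, reliability, resolution, uncertainty

# Dummy usage: over-confident probabilities against a ~20% base rate.
rng = np.random.default_rng(42)
label = (rng.random(5000) < 0.2).astype(float)
prob = np.clip(0.6 * label + 0.15 + 0.1 * rng.standard_normal(5000), 0, 1)
ece, brier, rel, res, unc = calibration_metrics(prob, label)
# Rel - Res + Unc approximates Brier; it is exact only when probabilities are constant per bin.
print(f"ECE={ece:.3f}  Brier={brier:.3f}  Rel-Res+Unc={rel - res + unc:.3f}")
```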
## NeurIPS Paper Checklist
1. 1\.Claims
2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
3. Answer: \[Yes\]
4. Justification: The abstract claims are substantiated: [table 1](https://arxiv.org/html/2605.11130#S5.T1) reports the 14-benchmark sweep with 5-seed mean $\pm$ std, with pairwise Welch's t-tests in [appendix N](https://arxiv.org/html/2605.11130#A14). [Section 3.2](https://arxiv.org/html/2605.11130#S3.SS2) defines pred-FT; [table 2](https://arxiv.org/html/2605.11130#S5.T2) reports label-efficiency retention with explicit qualification (lifecycle datasets only).
5. Guidelines: Per NeurIPS 2026 guidelines\.
6. 2\.Limitations
7. Question: Does the paper discuss the limitations of the work?
8. Answer: \[Yes\]
9. Justification: [Section 6](https://arxiv.org/html/2605.11130#S6) discusses limitations: per-dataset pretraining, channel-fusion failure on sensor-localised events, and unestimated bound constants. Calibration analysis is in [appendix O](https://arxiv.org/html/2605.11130#A15). The label-efficiency claim is qualified in [section 5.4](https://arxiv.org/html/2605.11130#S5.SS4).
10. Guidelines: Per NeurIPS 2026 guidelines\.
11. 3\.Theory assumptions and proofs
12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof?
13. Answer: \[Yes\]
14. Justification: [Proposition 1](https://arxiv.org/html/2605.11130#Thmproposition1) and [corollary 2](https://arxiv.org/html/2605.11130#Thmproposition2) state all four assumptions (A1–A4) explicitly. Full proofs with assumption-by-assumption failure modes are in [appendix A](https://arxiv.org/html/2605.11130#A1). [Figure 3](https://arxiv.org/html/2605.11130#S3.F3) validates the bound's qualitative prediction.
15. Guidelines: Per NeurIPS 2026 guidelines\.
16. 4\.Experimental result reproducibility
17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results?
18. Answer: \[Yes\]
19. Justification: All hyperparameters are in [table 7](https://arxiv.org/html/2605.11130#A13.T7), fixed across all 14 datasets. Per-dataset preprocessing is in [appendix L](https://arxiv.org/html/2605.11130#A12). Per-seed result JSONs are in the anonymised code repository.
20. Guidelines: Per NeurIPS 2026 guidelines\.
21. 5\.Open access to data and code
22. Question: Does the paper provide open access to the data and code?
23. Answer: \[Yes\]
24. Justification: All source code, training scripts, and per\-seed result JSONs are released at the anonymised repository linked from the title page\. All 14 datasets are publicly available with download scripts included\.
25. Guidelines: Per NeurIPS 2026 guidelines\.
26. 6\.Experimental setting/details
27. Question: Does the paper specify all the training and test details?
28. Answer: \[Yes\]
29. Justification: [Section 5.1](https://arxiv.org/html/2605.11130#S5.SS1) states the per-dataset pretraining regime, matched-capacity heads, and label budgets. [Appendix C](https://arxiv.org/html/2605.11130#A3) details the matched protocol for all baselines.
30. Guidelines: Per NeurIPS 2026 guidelines\.
31. 7\.Experiment statistical significance
32. Question: Does the paper report error bars?
33. Answer: \[Yes\]
34. Justification: All Table 1 cells report mean $\pm$ std across 5 seeds (or 3 for Chronos-2). Full pairwise Welch's t-tests are in [appendix N](https://arxiv.org/html/2605.11130#A14).
35. Guidelines: Per NeurIPS 2026 guidelines\.
36. 8\.Experiments compute resources
37. Question: Does the paper provide sufficient information on compute resources?
38. Answer: \[Yes\]
39. Justification: All runs use a single A10G GPU \(24 GB\)\. Pretraining takes under one minute per dataset; the full 5\-seed sweep completes in under 30 minutes\.
40. Guidelines: Per NeurIPS 2026 guidelines\.
41. 9\.Code of ethics
42. Question: Does the research conform with the NeurIPS Code of Ethics?
43. Answer: \[Yes\]
44. Justification: All datasets are publicly released benchmarks\. MBA contains de\-identified ECG signals used under its standard research license\. No human subjects were enrolled\.
45. Guidelines: Per NeurIPS 2026 guidelines\.
46. 10\.Broader impacts
47. Question: Does the paper discuss potential societal impacts?
48. Answer: \[Yes\]
49. Justification: Positive: earlier prediction of mechanical failures and physiological events. Negative: miscalibrated probabilities ([appendix O](https://arxiv.org/html/2605.11130#A15)) could mislead safety-critical decisions without recalibration.
50. Guidelines: Per NeurIPS 2026 guidelines\.
51. 11\.Safeguards
52. Answer: \[N/A\]
53. Justification: The model has no generative capability, no language interface, and no obvious misuse pathway\.
54. Guidelines: Per NeurIPS 2026 guidelines\.
55. 12\.Licenses for existing assets
56. Answer: \[Yes\]
57. Justification: All 14 datasets are cited with primary references\. All baseline implementations used under their open\-source licenses\.
58. Guidelines: Per NeurIPS 2026 guidelines\.
59. 13\.New assets
60. Answer: \[Yes\]
61. Justification: We release the HEPA implementation, training/evaluation scripts, and per\-seed result JSONs under MIT\. README with installation and smoke\-test included\.
62. Guidelines: Per NeurIPS 2026 guidelines\.
63. 14\.Crowdsourcing and human subjects
64. Answer: \[N/A\]
65. Justification: No crowdsourcing or human subject experiments\.
66. Guidelines: Per NeurIPS 2026 guidelines\.
67. 15\.Institutional review board \(IRB\) approval
68. Answer: \[N/A\]
69. Justification: No new human\-subject data collected\. All datasets are public benchmarks\.
70. Guidelines: Per NeurIPS 2026 guidelines.