The Simulacrum: Decision-Theoretic Pretraining for Near-Optimal Time-Series Forecasting and Inference

arXiv cs.LG Papers

Summary

This paper introduces a unified decision-theoretic pretraining framework for neural network-based time series estimators, trained on stratified simulations to approximate near-optimal decision rules. Experiments show that the resulting estimators outperform traditional methods like maximum likelihood estimation on both synthetic and real-world benchmarks.

arXiv:2606.27711v1 Announce Type: new Abstract: We introduce a neural network-based framework for learning time series estimators through a process we term decision-theoretic pretraining. Analysts specify a generative world, a distribution over data-generating processes, and a target decision objective. A neural network trained on stratified simulations from this world approximates the corresponding optimal decision rule, yielding a neural estimator that provides forecasts, parameter estimates, predictive intervals, or model-selection for zero-shot inference on previously unseen time series. The joint specification of the generative world and objective enables the estimators to directly approximate process-level, finite-sample properties: near-optimal risk, bias control, minimax performance, and uniform calibration. Our experiments demonstrate that these neural estimators can outperform traditional baselines such as maximum likelihood estimation and model selection via AICc, for the same model structural model classes. Furthermore, even when trained purely on simulations of structural models, they achieve competitive or state-of-the-art forecasting accuracy on major real-world benchmarks, compared with statistical, neural or large pre-trained models. We illustrate the framework by addressing two longstanding challenges: finite-sample bias and miscalibration in AR(p) models, and the forecast combination puzzle. These applications highlight the approach's main advantage: its ability to approximate solutions to analytically intractable or computationally prohibitive time series problems, including complex structural equations or optimality criteria. Ultimately, by enabling explicit control over decision-theoretic trade-offs, the framework equips analysts with highly efficient estimation tools tailored to their specific analytical needs.
Original Article
View Cached Full Text

Cached at: 06/29/26, 05:25 AM

# Decision‑Theoretic Pretraining for Near‑Optimal Time‑Series Forecasting and Inference Preliminary versions of the core ideas in this paper were presented in earlier conference and invited talks, including CMStatistics 2022 [45], the International Symposium on Forecasting 2023[47] , and the M6 Competition Conference 2023 [46]. The present paper provides the first unified decision-theoretic formulation of this framework, together with comprehensive empirical validation.
Source: [https://arxiv.org/html/2606.27711](https://arxiv.org/html/2606.27711)
Pablo Montero\-MansoPart of this work was developed while the author was a Visiting Researcher at Google\. The views expressed are those of the authors and do not necessarily reflect those of Google\.

\(Work in Progress: \)

###### Abstract

We introduce a neural network\-based framework for learning time series estimators through a process we termdecision\-theoretic pretraining\. Analysts specify agenerative world, a distribution over data\-generating processes, and a targetdecision objective\. A neural network trained on stratified simulations from this world approximates the corresponding optimal decision rule, yielding aneural estimatorthat provides forecasts, parameter estimates, predictive intervals, or model\-selection for zero\-shot inference on previously unseen time series\.

The joint specification of the generative world and objective enables the estimators to directly approximate process\-level, finite\-sample properties: near\-optimal risk, bias control, minimax performance, and uniform calibration\. Our experiments demonstrate that these neural estimators can outperform traditional baselines such as maximum likelihood estimation and model selection via AICc, for the same model structural model classes\. Furthermore, even when trained purely on simulations of structural models, they achieve competitive or state\-of\-the\-art forecasting accuracy on major real\-world benchmarks, compared with statistical, neural or large pre\-trained models\.

We illustrate the framework by addressing two longstanding challenges: finite\-sample bias and miscalibration in AR\(pp\) models, and the forecast combination puzzle\. These applications highlight the approach’s main advantage: its ability to approximate solutions to analytically intractable or computationally prohibitive time series problems, including complex structural equations or optimality criteria\. Ultimately, by enabling explicit control over decision\-theoretic trade\-offs, the framework equips analysts with highly efficient estimation tools tailored to their specific analytical needs\.

Keywords:Neural Networks; Estimation; Time Series; Large Models; Scaling Laws; Simulation\-Based Inference; AI

## 1Introduction

Time series models are a central analytical tool across science and industry, used to infer the properties of evolving processes and forecast their trajectories\. Traditionally, researchers rely on structural models: frameworks that impose specific mathematical assumptions to capture a system’s dynamics or decompose it into interpretable components such as trends, cycles, persistence, and regime changes\. Although parsimonious and intentionally stylized, structural models are often sufficiently expressive for practical use\. Their usefulness, however, depends not only on whether the assumed structure is appropriate, but on whether its parameters, states, and predictive implications can be estimated reliably\. As we discuss in Section[8\.1](https://arxiv.org/html/2606.27711#S8.SS1), the main limitation of structural models is frequently not the structure itself, but how the parameters are estimated from data\. Well\-known estimators for popular models still suffer from data inefficiency, finite\-sample bias, numerical instability, and miscalibration\. These flaws propagate into the model’s predictions; consequently, improvements in estimation directly yield better forecast accuracy and more reliable inference\.

Recently, deep learning forecasting methods, including Time Series Foundation Models \(TSFMs\)\[[3](https://arxiv.org/html/2606.27711#bib.bib48),[12](https://arxiv.org/html/2606.27711#bib.bib47)\]and Global Forecasting Models \(GFMs\)\[[54](https://arxiv.org/html/2606.27711#bib.bib63),[51](https://arxiv.org/html/2606.27711#bib.bib39),[39](https://arxiv.org/html/2606.27711#bib.bib80)\], have achieved strong empirical accuracy by training flexible neural architectures on large, diverse collections of time series\. Their standard training objectives, however, are not designed to deliver process\-level, finite\-sample guarantees or parsimonious structural interpretation\. Parsimonious assumptions are not merely a route to better forecasts; they are often the primary object of interest, allowing researchers to test the validity of a mechanism, maintain interpretability, and ensure auditable results\. Furthermore, statistical guarantees at the individual\-series level \(such as risk coverage for a specific asset within a portfolio\) are difficult to achieve with TSFMs and GFMs\. Because these models optimize for average performance across heterogeneous datasets, strong aggregate metrics can hide systematic errors in particular series or parameter regimes\. This issue cannot be solved by pure empirical data scaling: even within massive datasets, the history of any single series remains a “small sample” lacking sufficient information to offer formal guarantees\.

To bridge this gap, this paper introduces the*Simulacrum*, a computational framework that recasts time series estimation as a statistical decision problem\[[65](https://arxiv.org/html/2606.27711#bib.bib68),[6](https://arxiv.org/html/2606.27711#bib.bib69)\]and solves it by simulation\. The analyst specifies two elements: a*generative world*, a distribution over data\-generating processes on which the estimator should perform well, and a*decision objective*that encodes the desired notion of performance\. Together these define an optimal decision rule\. Such a rule is rarely available by classical means: maximum likelihood is optimal only asymptotically and biased in finite samples\[[42](https://arxiv.org/html/2606.27711#bib.bib5)\], while minimax estimators are well defined but have no closed form\. Through a process we term*decision\-theoretic pretraining*, we utilize massive\-scale simulations to train neural networks to approximate this optimal rule directly\. By doing so, we enable the derivation of near\-optimalneural estimatorsfor both existing and novel model classes\.

The framework’s core innovation is a two\-stage, stratified simulation scheme\. An outer loop samples model parameters within a general model class \(the*generative world*\), while an inner loop generates replicate trajectories from that specific set of parameters\. A world can include multiple models, contamination operators, and data augmentations\. With enough replicates, a loss function can measure the neural estimator’s deviations from process\-level optimality \(e\.g\., conditional bias or miscalibration\)\. The network is trained to minimize this loss, converging to an approximation of the optimal decision rule for the entire model class\. For example, consider estimating a short AR\(55\) series\. Standard estimators such as maximum likelihood are biased in finite samples, and their resulting intervals are rarely uniformly calibrated\. Our framework samples a world of stationary AR\(55\) coefficient vectors, generates multiple independent trajectories from each, and trains a neural network to recover coefficients or predictive quantiles from a single observed trajectory\. The outer loop supplies the coefficient vectors, while the inner loop provides the replicates that make conditional bias or miscoverage measurable and penalizable in training, process by process\. The trained network becomes a fast, finite\-sample estimator whose statistical properties are deliberately fixed by the chosen world and objective\. See Figure[1](https://arxiv.org/html/2606.27711#S3.F1)for an overview of the framework\.

The world perspective situates the framework relative to recent work on neural forecasting\. Global forecasting models, time series foundation models, and prior\-data fitted networks\[[49](https://arxiv.org/html/2606.27711#bib.bib83)\]can each be viewed as training an estimator over a generative world, such as an empirical corpus, a large pretraining mixture, or a synthetic prior\. In these approaches, however, the objective is typically fixed: minimizing risk over the world\. Our contribution is to make the estimator itself an object of design\. The analyst specifies the world together with the properties the resulting estimator should satisfy\. When the world is simulated, we can target and verify these properties directly\. In the special case of a single Bayesian model under average risk, the framework coincides with neural Bayes estimation\[[53](https://arxiv.org/html/2606.27711#bib.bib55)\]; its reach is the generalization beyond that case, to objectives and model classes whose optimal estimators were previously unavailable\.

Our experiments showcase the world design principles: trained for accuracy, neural estimators outperform maximum likelihood under the exponential smoothing and AR\(pp\) worlds; trained for bias control, they achieve near\-unbiased finite\-sample estimation where maximum likelihood is systematically biased; trained for worst\-case risk, they flatten the risk profile and reduce the largest forecast errors\. Two further results address long standing difficulties in classical estimation\. For the AR\(pp\) class, the framework yields estimation intervals that come much closer to uniform calibration \(coverage holding conditional on each parameter value\) than classical plug\-in procedures or standard empirical risk minimization, which exhibit systematic conditional miscalibration\. For forecast combination, we recast the heuristic of averaging forecasts as estimation within a generative class: observations are modeled as a convex combination of ARIMA and exponential smoothing components driven by common shocks, a class for which the optimal estimator is analytically intractable\. Learning this combination rule directly converts a long\-standing heuristic into a principled method and yields a generative account of the Forecast Combination Puzzle\[[10](https://arxiv.org/html/2606.27711#bib.bib8),[57](https://arxiv.org/html/2606.27711#bib.bib13)\]\.

These design choices also translate into competitive accuracy on real data\. Although the neural estimators are trained entirely on simulated structural worlds and never see the test series, they are competitive with, and often surpass, both classical and modern alternatives on standard benchmarks\. On the M1 competition data, a neural estimator for the additive exponential smoothing class attains the lowest error among all 21 competition entrants on the competition’s median error metric, including several alternative estimators for the exponential smoothing class\. On the Monash archive, the combined ARIMA–ETS estimator improves on the strongest reported baselines for several datasets\.

### Contributions and Paper Organization

The paper makes three main contributions\.

- •A unified simulation\-based framework for structural time\-series estimation \(Sections[2](https://arxiv.org/html/2606.27711#S2)–[4](https://arxiv.org/html/2606.27711#S4)\):We introduce*The Simulacrum*, a decision\-theoretic framework that trains neural networks as estimators over user\-defined generative worlds\. The framework extends standard simulation\-based training beyond average\-risk minimization and makes it possible to target finite\-sample properties such as minimax risk, breakdown\-point robustness, and uniform calibration\. In §[2](https://arxiv.org/html/2606.27711#S2), we define generative worlds as a generalization of time\-series data\-generating processes \(DGPs\), together with loss functions that encode statistical objectives\. Section[3](https://arxiv.org/html/2606.27711#S3)develops the two\-stage simulation mechanism used to train neural estimators, and Section[4](https://arxiv.org/html/2606.27711#S4)discusses neural architectural considerations\. Connections to neighboring paradigms such as amortized inference, Prior\-Data Fitted Networks \(PFNs\), and global or time series foundation models are discussed later in §[8](https://arxiv.org/html/2606.27711#S8)\.
- •Methodological results on finite\-sample estimation and uncertainty quantification \(Sections[5](https://arxiv.org/html/2606.27711#S5)–[6](https://arxiv.org/html/2606.27711#S6)\):The framework recovers estimators with statistical properties that are difficult to obtain with classical procedures\. In the Exponential Smoothing \(ETS\) class, neural estimators trained for forecasting accuracy achieve superior performance relative to standard alternatives such as Maximum Likelihood, both under the true DGP \(§[5\.1](https://arxiv.org/html/2606.27711#S5.SS1)\) and in real data \(§[5\.3](https://arxiv.org/html/2606.27711#S5.SS3), compared with five ETS estimators and 21 competing models on the M1 dataset\)\. In the AR\(pp\) class \(Section[6](https://arxiv.org/html/2606.27711#S6)\), the framework yields estimators with near\-unbiased finite\-sample behavior anduniform calibration, a stronger property than the marginal calibration typically achieved by standard deep learning approaches\. These results show that simulation\-based neural estimators can be trained to satisfy process\-level criteria, not only aggregate predictive accuracy\.
- •Empirical demonstrations in model selection, forecast combination, and comparative risk analysis \(Section[7](https://arxiv.org/html/2606.27711#S7)and related results\):We further show that the same framework applies naturally to multi\-model worlds\. In Section[7](https://arxiv.org/html/2606.27711#S7), we propose a generative treatment of the “Forecast Combination Puzzle” by modeling forecast combination itself as a structural process\. This yields direct neural estimators of combination rules without relying on the standard two\-step procedure of first fitting component models and then estimating weights\. In real data, the resulting estimator is competitive with, and in some cases improves upon, simple equal\-weight combinations, while also clarifying why equal weights remain a strong benchmark\. More broadly, the paper includes comparative risk and scaling analyses that characterize when neural estimators outperform classical alternatives and when the gap narrows\. In §[5\.2](https://arxiv.org/html/2606.27711#S5.SS2), we study how approximation rates vary with series length and model complexity\. We also compare neuralmodel selectionfor ETS subclasses with AIC\-based selection as series length increases \(§[7\.1](https://arxiv.org/html/2606.27711#S7.SS1)\), and provide detailed error curves against established baselines, including Maximum Likelihood and Bayesian MCMC for ETS \(§[5\.1](https://arxiv.org/html/2606.27711#S5.SS1)\), as well as OLS, Yule–Walker, and Burg for the AR\(pp\) class \(§[6\.1](https://arxiv.org/html/2606.27711#S6.SS1)\)\.

##### Reading Guide\.

Readers primarily interested in practical forecasting performance and applied tools may wish to begin with Section[5\.3](https://arxiv.org/html/2606.27711#S5.SS3)and Section[7\.2\.1](https://arxiv.org/html/2606.27711#S7.SS2.SSS1), which present real\-data results for exponential smoothing and forecast combination on benchmark datasets\. Sections[2](https://arxiv.org/html/2606.27711#S2)–[3](https://arxiv.org/html/2606.27711#S3)can then be read as providing a unifying explanation of how these applications arise from a common training principle\.

Readers interested in statistical foundations, decision theory, or simulation\-based inference may prefer to begin with Sections[2](https://arxiv.org/html/2606.27711#S2)–[3](https://arxiv.org/html/2606.27711#S3)\. These sections introduce generative worlds, population risk, and conditional performance training, which form the conceptual backbone of the paper\.

Readers from a machine learning perspective may find it useful to view Sections[2](https://arxiv.org/html/2606.27711#S2)–[3](https://arxiv.org/html/2606.27711#S3)as defining a task distribution, training objectives, and a learning procedure, and Sections[5](https://arxiv.org/html/2606.27711#S5)–[7](https://arxiv.org/html/2606.27711#S7)as applications of this procedure under different world designs and losses\.

## 2A Decision Framework for Time Series Estimation

This section formulates time series estimation and forecasting as a statistical decision problem\[[65](https://arxiv.org/html/2606.27711#bib.bib68),[6](https://arxiv.org/html/2606.27711#bib.bib69)\]\. The problem is defined by two design choices: a training environment and an objective\. We represent the training environment by a*generative world*, a distribution over data\-generating processes on which an estimator is trained and evaluated\. This plays a role analogous to a task distribution in meta\-learning or an environment in reinforcement learning\[[4](https://arxiv.org/html/2606.27711#bib.bib60),[15](https://arxiv.org/html/2606.27711#bib.bib61)\]\. A world specifies which time series processes are included, how their configurations vary, and which forms of noise or contamination are treated as relevant\.

Given a world, the analyst defines a loss function that encodes the desired notion of performance for forecasts, estimates, and inferential objects, such as accuracy, calibration, bias control, or worst\-case behavior\. Together, the world and the loss define a population\-level learning objective with a well\-defined optimal decision rule\. By training neural networks on simulated data drawn from the world, we approximate this optimal rule directly, amortizing the solution to the underlying decision problem; see Section[3](https://arxiv.org/html/2606.27711#S3)\. The remainder of this section formalizes these components and shows how they combine into a unified simulation\-based framework for time series forecasting and inference\.

### 2\.1Generative Time Series Worlds

We define a generative worldΠ\\Pias a probability distribution over fully instantiated data generating processes \(DGPs\),

ω=\(m,θ\)∼Π,\\omega=\(m,\\theta\)\\sim\\Pi,wheremmdenotes a mechanism, a complete procedure for simulating time series data, andθ∈Θm\\theta\\in\\Theta\_\{m\}is its realized parameter vector\.

A mechanism may correspond to a classical parametric model class such as ARIMA, ETS, or GARCH, but it may also represent model combinations or an empirical resampling scheme\. Eachω\\omegainduces a distributionPωP\_\{\\omega\}over observable trajectories and, when present, latent states:

\(Y1:T,S1:T\)∼Pω\.\(Y\_\{1:T\},S\_\{1:T\}\)\\sim P\_\{\\omega\}\.In practice, this framework can accommodate diverse time series lengths within the world; we suppress this detail in the notation for simplicity\.

It is often useful to viewΠ\\Pias inducing a factorization

Π=Πm​Πθ∣m,\\Pi=\\Pi\_\{m\}\\,\\Pi\_\{\\theta\\mid m\},which makes explicit that a world specifies two components: a distributionΠm\\Pi\_\{m\}over mechanisms and an instantiation ruleΠθ∣m\\Pi\_\{\\theta\\mid m\}for each mechanism\.

##### Single\-mechanism worlds\.

If the world selects a fixed mechanismm⋆m\_\{\\star\}\(e\.g\., a simple exponential smoothing model\),

ω=\(m⋆,θ\),θ∼Πθ,\\omega=\(m\_\{\\star\},\\theta\),\\qquad\\theta\\sim\\Pi\_\{\\theta\},then the variability acrossω∼Π\\omega\\sim\\Piderives solely from parameter variation\. We recover a Bayesian model as the special case wherem⋆m\_\{\\star\}is a likelihood family andΠθ\\Pi\_\{\\theta\}coincides with a Bayesian prior\.

##### Multi\-mechanism worlds\.

When the world allows heterogeneous mechanisms, it first samples

m∼Πm,θ∼Πθ∣m\.m\\sim\\Pi\_\{m\},\\qquad\\theta\\sim\\Pi\_\{\\theta\\mid m\}\.The support ofΠ\\Piis then the disjoint union

Ω=⨆m\(\{m\}×Θm\),\\Omega=\\bigsqcup\_\{m\}\\bigl\(\\\{m\\\}\\times\\Theta\_\{m\}\\bigr\),allowing for worlds that include diverse model classes \(e\.g\., ARIMA, exponential smoothing, and GARCH processes\) and other types of mechanisms\. Bayesian model averaging\[[28](https://arxiv.org/html/2606.27711#bib.bib71)\]is a special case in which eachmmis a Bayesian model with a well\-defined likelihood and marginal likelihood,Πm\\Pi\_\{m\}is the model prior, andΠθ∣m\\Pi\_\{\\theta\\mid m\}is the corresponding parameter prior\.

### 2\.2World Design

The construction ofΠ\\Piis an instance of environment design\[[13](https://arxiv.org/html/2606.27711#bib.bib70)\]: specifying which mechanisms, parameter ranges, structural variations, perturbations, and data sources the estimator should be exposed to during training\. Training a predictor on a deliberately broad distribution of simulated environments so that it transfers to reality is known as domain randomization\[[62](https://arxiv.org/html/2606.27711#bib.bib73)\]; world design applies the same principle to time series estimation\.

##### Synthetic worlds\.

Analytically defined mechanism families, such as ARIMA, ETS, or GARCH, paired with parameter rulesΠθ∣m\\Pi\_\{\\theta\\mid m\}\. These worlds offer full control and broad structural coverage\.

##### Empirical worlds\.

An empirical world is formed from an observed collection of time series, using resampling or rolling windows to generate trajectories\.

##### Hybrid worlds\.

Hybrid worlds combine synthetic and empirical components to balance mechanism coverage with realism\. They are closely related to the pretraining distributions used in modern time series foundation models\.

This paper focuses on synthetic worlds, which enable a richer class of estimators by making the relevant supervisory quantities observable\.

### 2\.3Estimation Targets and Supervisory Signals

For each processω\\omegaand trajectoryy1:Ty\_\{1:T\}, the analyst specifies an*estimand*

Z​\(ω,y1:T\)∈𝒵,Z\(\\omega,y\_\{1:T\}\)\\in\\mathcal\{Z\},representing the object to be estimated\. Depending on the task, this may be a parameter, a latent state functional, a future value, a predictive functional, or an entire conditional distribution\. The estimator is a map

F:𝒴T→𝒵,Z^=F​\(y1:T\),F:\\mathcal\{Y\}\_\{T\}\\to\\mathcal\{Z\},\\qquad\\widehat\{Z\}=F\(y\_\{1:T\}\),where𝒴T\\mathcal\{Y\}\_\{T\}denotes the space of observed histories and𝒵\\mathcal\{Z\}denotes the output space appropriate to the estimand\.

Examples include:

- •Parameter estimation:a fixed functional of the DGP parameter, Z​\(ω,y1:T\)=g​\(θ\),Z\(\\omega,y\_\{1:T\}\)=g\(\\theta\),often simplyZ​\(ω\)=θZ\(\\omega\)=\\theta\.
- •Point forecasting:a functional of the predictive distribution, such as the conditional mean Z​\(ω,y1:T\)=𝔼ω​\(YT\+h∣y1:T\)\.Z\(\\omega,y\_\{1:T\}\)=\\mathbb\{E\}\_\{\\omega\}\(Y\_\{T\+h\}\\mid y\_\{1:T\}\)\.
- •Quantile forecasting:a conditional predictive quantile, Z​\(ω,y1:T\)=qαω​\(YT\+h∣y1:T\),Z\(\\omega,y\_\{1:T\}\)=q^\{\\omega\}\_\{\\alpha\}\(Y\_\{T\+h\}\\mid y\_\{1:T\}\),whereqαωq^\{\\omega\}\_\{\\alpha\}denotes the conditionalα\\alpha\-quantile\.
- •State estimation:a conditional functional of latent states, such as Z​\(ω,y1:T\)=𝔼ω​\(S1:T∣y1:T\)\.Z\(\\omega,y\_\{1:T\}\)=\\mathbb\{E\}\_\{\\omega\}\(S\_\{1:T\}\\mid y\_\{1:T\}\)\.
- •Distributional outputs:an entire conditional object, such as a predictive distribution, Z\(ω,y1:T\)=Pω\(⋅∣y1:T\)\.Z\(\\omega,y\_\{1:T\}\)=P\_\{\\omega\}\(\\cdot\\mid y\_\{1:T\}\)\.

In synthetic worlds, the simulator often supplies a realized quantity that can be used to train an estimator forZZ\. We denote this realized supervisory signal byZ~\\widetilde\{Z\}\. For some targets, the estimand is directly observed by construction\. For example, in parameter estimation the simulator drawsθ\\theta, so for a targetZ​\(ω,y1:T\)=g​\(θ\)Z\(\\omega,y\_\{1:T\}\)=g\(\\theta\)we can set

Z~=g​\(θ\)\\widetilde\{Z\}=g\(\\theta\)For other targets, the estimand is a conditional functional and is not usually available in closed form\. In this case, we chooseZ~\\widetilde\{Z\}so that minimizing the expected loss againstZ~\\widetilde\{Z\}yieldsZZas the population target\[[21](https://arxiv.org/html/2606.27711#bib.bib72)\]:

Z​\(ω,y1:T\)∈arg⁡minz∈𝒵⁡𝔼ω​\[L​\(z,Z~\)∣Y1:T=y1:T\]\.Z\(\\omega,y\_\{1:T\}\)\\in\\arg\\min\_\{z\\in\\mathcal\{Z\}\}\\mathbb\{E\}\_\{\\omega\}\\left\[L\(z,\\widetilde\{Z\}\)\\mid Y\_\{1:T\}=y\_\{1:T\}\\right\]\.

### 2\.4Loss Functions and Decision Rules

We perform evaluation through a user\-specified loss function

L​\(F​\(y1:T\),Z~\),L\\bigl\(F\(y\_\{1:T\}\),\\,\\widetilde\{Z\}\\bigr\),which may represent squared error, absolute error, pinball loss, or another objective\.

Given a worldΠ\\Pi, we define the process\-level risk of a decision ruleFFunder a DGPω\\omegaas

r​\(ω,F\)=𝔼Y1:T,Z~∼Pω​\[L​\(F​\(Y1:T\),Z~\)\],r\(\\omega,F\)=\\mathbb\{E\}\_\{Y\_\{1:T\},\\,\\widetilde\{Z\}\\sim P\_\{\\omega\}\}\\left\[L\\bigl\(F\(Y\_\{1:T\}\),\\,\\widetilde\{Z\}\\bigr\)\\right\],where the expectation is over the observed history and, where applicable, the supervisory draw conditional on that history\. The minimizerF⋆F^\{\\star\}targets the loss\-implied estimandZZdefined in Section[2\.3](https://arxiv.org/html/2606.27711#S2.SS3)\.

The*population risk*of a decision ruleFFis

ℛ​\(F;Π\)=𝔼ω∼Π​\[r​\(ω,F\)\]\.\\mathcal\{R\}\(F;\\Pi\)=\\mathbb\{E\}\_\{\\omega\\sim\\Pi\}\\bigl\[r\(\\omega,F\)\\bigr\]\.\(1\)The minimizer

F⋆∈arg⁡minF⁡ℛ​\(F;Π\)F^\{\\star\}\\in\\arg\\min\_\{F\}\\mathcal\{R\}\(F;\\Pi\)is the optimal decision rule under that world\. The functionF⋆F^\{\\star\}represents the ideal estimator or forecasting rule for the design choices encoded by\(Π,L\)\(\\Pi,L\)\. WhenΠ\\Picorresponds to a Bayesian model and the loss is separable, the population risk \([1](https://arxiv.org/html/2606.27711#S2.E1)\) becomes the prior predictive risk\. Its pointwise minimizer is the usual Bayes action under the posterior predictive distribution\.

### 2\.5Process\-Level Properties

In many applications, minimizing the population riskℛ​\(F;Π\)\\mathcal\{R\}\(F;\\Pi\)is insufficient\. The analyst often requires the decision ruleFFto satisfy statistical properties such as bias control or calibration across the support of the world\.

We focus on targeting uniform properties, defined as constraints on the distribution ofF​\(Y1:T\)F\(Y\_\{1:T\}\)at each fixed DGPω\\omega\. In contrast, marginal properties constrain only the unconditional distribution induced by the mixture overΠ\\Pi\.

Formally, the uniform problem is

minF⁡ℛ​\(F;Π\)s\.t\.Ck​\(ω,F\)≤δk∀ω∈supp⁡\(Π\),k=1,…,K,\\min\_\{F\}\\;\\mathcal\{R\}\(F;\\Pi\)\\qquad\\text\{s\.t\.\}\\qquad C\_\{k\}\(\\omega,F\)\\leq\\delta\_\{k\}\\quad\\forall\\,\\omega\\in\\operatorname\{supp\}\(\\Pi\),\\quad k=1,\\dots,K,\(2\)where eachCk​\(ω,F\)C\_\{k\}\(\\omega,F\)is a process\-level property of interest\.

Two important examples are:

- •*Bias control\.*For process\-level estimandsZ​\(ω\)Z\(\\omega\), such as fixed parameters or other DGP\-level functionals, Cbias​\(ω,F\)=‖𝔼ω​\[F​\(Y1:T\)\]−Z​\(ω\)‖\.C\_\{\\mathrm\{bias\}\}\(\\omega,F\)=\\left\\\|\\mathbb\{E\}\_\{\\omega\}\[F\(Y\_\{1:T\}\)\]\-Z\(\\omega\)\\right\\\|\.
- •*Process\-level calibration\.*For a predictive quantile estimatorqα​\(y1:T\)q\_\{\\alpha\}\(y\_\{1:T\}\)at nominal coverage levelα∈\(0,1\)\\alpha\\in\(0,1\), Ccal​\(ω,F\)=\|ℙω​\{YT\+h≤qα​\(Y1:T\)\}−α\|\.C\_\{\\mathrm\{cal\}\}\(\\omega,F\)=\\left\|\\mathbb\{P\}\_\{\\omega\}\\left\\\{Y\_\{T\+h\}\\leq q\_\{\\alpha\}\(Y\_\{1:T\}\)\\right\\\}\-\\alpha\\right\|\.

Direct optimization of \([2](https://arxiv.org/html/2606.27711#S2.E2)\) is intractable since eachCk​\(ω,F\)C\_\{k\}\(\\omega,F\)is itself a population\-level functional of the distribution underω\\omega\. The training procedures of Section[3](https://arxiv.org/html/2606.27711#S3)approximate these constraints through differentiable penalties computed from replicated simulations\. Depending on the application, these penalties can target average violations underΠ\\Pi, stratified violations over regions of the world, or smooth approximations to worst\-case violations\.

### 2\.6Worlds vs\. Models

A worldΠ\\Pispecifies the range of processes over which a decision rule should perform well\. Although any world can often be embedded in a sufficiently rich hierarchical probability model, this embedding obscures the role thatΠ\\Piplays in the present framework\.

A traditional model\-based workflow typically fixes a probability familyPθP\_\{\\theta\}and uses observed data to inferθ\\thetaor functions thereof\. In our framework, the primary object is instead the estimator itself:

F:𝒴→𝒵\.F:\\mathcal\{Y\}\\to\\mathcal\{Z\}\.The worldΠ\\Pi, lossLL, and constraint functionalsCkC\_\{k\}completely specify the desired properties of this estimator\. Simulation\-based training then approximates an optimal decision rule satisfying this specification\. In this sense, the procedure amortizes an inference machine\[[18](https://arxiv.org/html/2606.27711#bib.bib74)\]\.

This shift is why we refer toΠ\\Pias a world rather than a model\. ModifyingΠ\\Piengineers the competence profile of the resulting inference machine: dictating which mechanisms it recognizes, which regimes it handles gracefully, which data pathologies it is robust against, and which statistical trade\-offs it internalizes\.

### 2\.7Robust Design

Real\-world time series frequently exhibit irregularities such as outliers, missing values, level shifts, censoring, and measurement noise\. To encode robustness at the level of the world, we introduce contamination variables and corruption operators, following the contamination\-neighborhood formulation of robust statistics\[[32](https://arxiv.org/html/2606.27711#bib.bib79)\]\.

Letωclean=\(m,θ\)\\omega\_\{\\mathrm\{clean\}\}=\(m,\\theta\)denote a clean process and letξ\\xidenote a contamination specification drawn from a distributionΠξ\\Pi\_\{\\xi\}\. A robust world draws

ωclean∼Πclean,ξ∼Πξ∣ωclean,\\omega\_\{\\mathrm\{clean\}\}\\sim\\Pi\_\{\\mathrm\{clean\}\},\\qquad\\xi\\sim\\Pi\_\{\\xi\\mid\\omega\_\{\\mathrm\{clean\}\}\},and generates data by applying a corruption operatorCξC\_\{\\xi\}to the clean process:

\(Y1:T,S1:T\)∼Pω,Pω=Cξ​\#​Pωclean,\(Y\_\{1:T\},S\_\{1:T\}\)\\sim P\_\{\\omega\},\\qquad P\_\{\\omega\}=C\_\{\\xi\}\\\#P\_\{\\omega\_\{\\mathrm\{clean\}\}\},whereCξ​\#​PωcleanC\_\{\\xi\}\\\#P\_\{\\omega\_\{\\mathrm\{clean\}\}\}denotes the distribution induced by applyingCξC\_\{\\xi\}to draws from the clean process\. HereCξC\_\{\\xi\}may encode additive outliers, missingness masks, censoring rules, level shifts, measurement noise, or other perturbations\.

The robust world can therefore be regarded as a distribution over contaminated processesω=\(ωclean,ξ\)\\omega=\(\\omega\_\{\\mathrm\{clean\}\},\\xi\), even though the clean generative world is written in the simpler formω=\(m,θ\)\\omega=\(m,\\theta\)\. Specifying contamination as part of the world design allows the estimator to learn structural responses to irregularities rather than treating them only as post hoc exceptions\.

This completes the definition of the unified decision framework\. For the remainder of the paper, we treatΠ\\Pias fixed and study the decision rules that arise from minimizing \([1](https://arxiv.org/html/2606.27711#S2.E1)\) or its generalizations\.

## 3Simulation\-Based Training

Section[2](https://arxiv.org/html/2606.27711#S2)defined the target decision rule as the optimizer of a population objective over a generative world\. We now describe how we can approximate this objective in practice by simulation\.

### 3\.1Training algorithm

We train the neural estimatorFFby stochastic gradient descent \(SGD\) on a Monte Carlo approximation of the population risk \([1](https://arxiv.org/html/2606.27711#S2.E1)\)\. We construct each mini\-batch in two stages: an outer loop samplesJJdata generating processes fromΠ\\Pi, and an inner loop generatesRRindependent trajectories from each sampled process\. The total batch size isB=J​RB=JR\.

Formally, at each SGD update we:

1. 1\.SampleJJprocesses fromΠ\\Pi: ω\(j\)=\(m\(j\),θ\(j\)\)∼Π,j=1,…,J\.\\omega^\{\(j\)\}=\(m^\{\(j\)\},\\theta^\{\(j\)\}\)\\sim\\Pi,\\qquad j=1,\\dots,J\.
2. 2\.For eachω\(j\)\\omega^\{\(j\)\}, sampleRRi\.i\.d\. trajectories: \(Y1:T\(j,r\),S1:T\(j,r\)\)​∼i\.i\.d\.​Pω\(j\),r=1,…,R\.\(Y^\{\(j,r\)\}\_\{1:T\},S^\{\(j,r\)\}\_\{1:T\}\)\\overset\{\\mathrm\{i\.i\.d\.\}\}\{\\sim\}P\_\{\\omega^\{\(j\)\}\},\\qquad r=1,\\dots,R\.
3. 3\.For each replicate, compute the network output F\(j,r\)=F​\(y1:T\(j,r\)\)F^\{\(j,r\)\}=F\(y^\{\(j,r\)\}\_\{1:T\}\)and the supervisory signalZ~\(j,r\)\\widetilde\{Z\}^\{\(j,r\)\}\. For parameter targets,Z~\(j,r\)=g​\(θ\(j\)\)\\widetilde\{Z\}^\{\(j,r\)\}=g\(\\theta^\{\(j\)\}\), often simplyθ\(j\)\\theta^\{\(j\)\}itself\. For latent\-state targets,Z~\(j,r\)=S1:T\(j,r\)\\widetilde\{Z\}^\{\(j,r\)\}=S^\{\(j,r\)\}\_\{1:T\}\. For predictive targets,Z~\(j,r\)\\widetilde\{Z\}^\{\(j,r\)\}is the realized future value or path, such asYT\+h\(j,r\)Y^\{\(j,r\)\}\_\{T\+h\}orYT\+1:T\+H\(j,r\)Y^\{\(j,r\)\}\_\{T\+1:T\+H\}\.
4. 4\.Form the empirical batch lossL^\\widehat\{L\}from these quantities\.
5. 5\.UpdateFFby stochastic gradient descent onL^\\widehat\{L\}\.

The simplest case is empirical risk minimization,

L^ERM​\(F\)=1J​R​∑j=1J∑r=1RL​\(F​\(y1:T\(j,r\)\),Z~\(j,r\)\)\.\\widehat\{L\}\_\{\\mathrm\{ERM\}\}\(F\)=\\frac\{1\}\{JR\}\\sum\_\{j=1\}^\{J\}\\sum\_\{r=1\}^\{R\}L\\bigl\(F\(y^\{\(j,r\)\}\_\{1:T\}\),\\widetilde\{Z\}^\{\(j,r\)\}\\bigr\)\.For ordinary risk minimization,R=1R=1is sufficient\. HavingR\>1R\>1replicates from a fixed DGP enables Monte Carlo estimation of process\-level functionals ofFFat that DGP\. This is what allows the framework to target the properties introduced in Section[2\.5](https://arxiv.org/html/2606.27711#S2.SS5)\. The next subsections construct losses for bias control, calibration, and minimax estimation\.

![Refer to caption](https://arxiv.org/html/2606.27711v1/x1.png)Figure 1:Overview of the Simulacrum framework\.The analyst specifies a generative worldΠ\\Pi, which defines the mechanisms, parameter ranges, and possible contaminations on which the estimator should perform well\. Training uses a two\-stage simulation scheme: an outer loop samples data\-generating processes from the world, and an inner loop generates replicated trajectories from each process\. These simulations provide supervised targets and process\-level Monte Carlo estimates, allowing a neural network to learn near\-optimal forecasting or inference rules\.
### 3\.2Targeting Process\-Level Properties

The constrained problem in \([2](https://arxiv.org/html/2606.27711#S2.E2)\) is an ideal population formulation\. In practice, we approximate it through differentiable penalties computed from replicated simulations\. The key idea is that the outer loop samples processesω\(j\)\\omega^\{\(j\)\}, while the inner loop provides repeated histories from each fixed process\. These repeated histories allow us to estimate properties ofFFconditional on the DGP, such as process\-level bias or coverage\.

### 3\.3Bias control

Classical time series estimators are often biased in finite samples, and bias\-correction procedures, when available, are usually model\-specific: analytic bias expansions must be derived separately for each model class\[[42](https://arxiv.org/html/2606.27711#bib.bib5),[55](https://arxiv.org/html/2606.27711#bib.bib7)\], and bootstrap\-based corrections are constructed in a similarly bespoke way\[[36](https://arxiv.org/html/2606.27711#bib.bib65)\]\. The simulation framework provides a more general route: by generating replicated histories at fixedω\\omega, we can estimate process\-level properties of the decision ruleFFdirectly\.

We begin with bias control for estimands of the formZ​\(ω\)Z\(\\omega\), such as parameters or other fixed functionals of the DGP, that do not depend on the realized trajectoryy1:Ty\_\{1:T\}\. For these targets, the supervisory signal is deterministic conditional onω\\omega, soZ~=Z​\(ω\)\\widetilde\{Z\}=Z\(\\omega\)\. The bias ofFFat processω\\omegais

b​\(ω\)=𝔼ω​\[F​\(Y1:T\)\]−Z​\(ω\)\.b\(\\omega\)=\\mathbb\{E\}\_\{\\omega\}\\left\[F\(Y\_\{1:T\}\)\\right\]\-Z\(\\omega\)\.GivenRRreplicated trajectories fromω\(j\)\\omega^\{\(j\)\}, we estimate this bias by

b^​\(ω\(j\)\)=1R​∑r=1RF​\(y1:T\(j,r\)\)−Z​\(ω\(j\)\)\.\\widehat\{b\}\(\\omega^\{\(j\)\}\)=\\frac\{1\}\{R\}\\sum\_\{r=1\}^\{R\}F\\bigl\(y^\{\(j,r\)\}\_\{1:T\}\\bigr\)\-Z\(\\omega^\{\(j\)\}\)\.
There are several ways to aggregate these process\-level estimates across sampled DGPs\. The ideal uniform constraint controls the worst process in the support ofΠ\\Pi, but in finite training we observe only a sampled set of processes\. In our experiments, we use the average squared bias over sampled processes as a relaxation\. A base estimatorF0F\_\{0\}is first trained to minimizeL^ERM​\(F\)\\widehat\{L\}\_\{\\mathrm\{ERM\}\}\(F\)\. A second training phase then optimizes

L^bias​\(F\)=L^ERM​\(F\)\+λb​1J​∑j=1J‖b^​\(ω\(j\)\)‖2\.\\widehat\{L\}\_\{\\mathrm\{bias\}\}\(F\)=\\widehat\{L\}\_\{\\mathrm\{ERM\}\}\(F\)\+\\lambda\_\{b\}\\frac\{1\}\{J\}\\sum\_\{j=1\}^\{J\}\\\|\\widehat\{b\}\(\\omega^\{\(j\)\}\)\\\|^\{2\}\.Training proceeds along a short path of penalty weightsλb\\lambda\_\{b\}, with each stage warm\-started from the previous one\. The final estimator is selected to balance empirical risk against squared bias\. This objective is a Lagrangian relaxation of

minF⁡ℛ​\(F;Π\)subject to𝔼ω∼Π​\[‖b​\(ω\)‖2\]≤δ2,\\min\_\{F\}\\mathcal\{R\}\(F;\\Pi\)\\qquad\\text\{subject to\}\\qquad\\mathbb\{E\}\_\{\\omega\\sim\\Pi\}\\left\[\\\|b\(\\omega\)\\\|^\{2\}\\right\]\\leq\\delta^\{2\},withλb\\lambda\_\{b\}acting as the dual variable\.

This approach encourages low bias across the sampled world, but is not a uniform constraint\. Stronger sampled\-uniform variants can be obtained by replacing the average overjjwith log\-sum\-exp aggregation as in Section[3\.5](https://arxiv.org/html/2606.27711#S3.SS5)\. When the outer sampling scheme gives adequate coverage of the relevant regions ofΠ\\Pi, the average penalty can nevertheless provide effective empirical control across the world while remaining more stable to optimize than a worst\-case objective\.

More broadly, process\-level Monte Carlo estimates make other forms of bias reduction possible\. One possible extension is to train a*bias\-correction head*: a sub\-module that learns a correction𝒞​\(y\)\\mathcal\{C\}\(y\)on top of a frozen base estimatorF0F\_\{0\}, yielding

F​\(y\)=F0​\(y\)\+𝒞​\(y\)\.F\(y\)=F\_\{0\}\(y\)\+\\mathcal\{C\}\(y\)\.The correction can be trained using the same replicate\-based penalty, so that systematic deviations ofF0F\_\{0\}fromZ​\(ω\)Z\(\\omega\)are reduced across regions of the process space\.

### 3\.4Calibration and Coverage

Calibration is, together with sharpness, the standard basis for evaluating probabilistic forecasts\[[20](https://arxiv.org/html/2606.27711#bib.bib62)\]\. Whereas that paradigm treats calibration as a diagnostic applied after estimation, the simulation framework makes it an explicit training target\.

SupposeFFdelivers a predictive quantile

qα​\(y1:T\)q\_\{\\alpha\}\(y\_\{1:T\}\)for a scalar realized targetZ~\\widetilde\{Z\}, such asZ~=YT\+h\\widetilde\{Z\}=Y\_\{T\+h\}, at nominal coverage levelα∈\(0,1\)\\alpha\\in\(0,1\)\. The desired coverage property is

ℙω​\{Z~≤qα​\(Y1:T\)\}≈α\.\\mathbb\{P\}\_\{\\omega\}\\left\\\{\\widetilde\{Z\}\\leq q\_\{\\alpha\}\(Y\_\{1:T\}\)\\right\\\}\\approx\\alpha\.This is a process\-level coverage target, required to hold at each fixedω\\omega\. It is therefore stronger, but more model\-dependent, than the marginal, distribution\-free coverage of conformal prediction\[[64](https://arxiv.org/html/2606.27711#bib.bib75)\]and its variants for non\-exchangeable time series\[[19](https://arxiv.org/html/2606.27711#bib.bib76)\]\.

UsingRRreplicates forω\(j\)\\omega^\{\(j\)\}, we approximate process\-level coverage by

cov^α​\(ω\(j\)\)=1R​∑r=1R𝟏​\{Z~\(j,r\)≤qα​\(y1:T\(j,r\)\)\}\.\\widehat\{\\mathrm\{cov\}\}\_\{\\alpha\}\\bigl\(\\omega^\{\(j\)\}\\bigr\)=\\frac\{1\}\{R\}\\sum\_\{r=1\}^\{R\}\\mathbf\{1\}\\left\\\{\\widetilde\{Z\}^\{\(j,r\)\}\\leq q\_\{\\alpha\}\\bigl\(y^\{\(j,r\)\}\_\{1:T\}\\bigr\)\\right\\\}\.\(3\)A simple calibration loss is

Lcalib​\(F\)=1J​∑j=1J\(cov^α​\(ω\(j\)\)−α\)2\.L\_\{\\mathrm\{calib\}\}\(F\)=\\frac\{1\}\{J\}\\sum\_\{j=1\}^\{J\}\\left\(\\widehat\{\\mathrm\{cov\}\}\_\{\\alpha\}\\bigl\(\\omega^\{\(j\)\}\\bigr\)\-\\alpha\\right\)^\{2\}\.\(4\)
To obtain gradients, we replace the indicator in \([3](https://arxiv.org/html/2606.27711#S3.E3)\) by a smooth surrogate such as

σ​\(κ​\{qα​\(y1:T\)−Z~\}\),\\sigma\\left\(\\kappa\\\{q\_\{\\alpha\}\(y\_\{1:T\}\)\-\\widetilde\{Z\}\\\}\\right\),with slope parameterκ\>0\\kappa\>0, leaving the overall loss structure unchanged\. The slopeκ\\kappacontrols the bias–variance tradeoff of the surrogate: large values better approximate the indicator but can yield unstable or sparse gradients\.

In practice, we combine calibration with a standard quantile, or pinball, loss\[[37](https://arxiv.org/html/2606.27711#bib.bib77)\]:

Lquantile\+calib​\(F\)=Lpinball​\(F\)\+λc​Lcalib​\(F\),L\_\{\\mathrm\{quantile\+calib\}\}\(F\)=L\_\{\\mathrm\{pinball\}\}\(F\)\+\\lambda\_\{\\mathrm\{c\}\}L\_\{\\mathrm\{calib\}\}\(F\),and adjustλc\\lambda\_\{\\mathrm\{c\}\}so that empirical coverage is close to the nominal level\. The pinball loss anchors the location of the quantile, while the calibration penalty adjusts process\-level coverage\.

Extensions to simultaneous prediction bands over a forecast path up to horizonHHrequire specifying whether coverage is marginal at each horizon or joint over the entire path\. For joint bands, the coverage event in \([3](https://arxiv.org/html/2606.27711#S3.E3)\) is replaced by an event involving the full realized pathZ~=YT\+1:T\+H\\widetilde\{Z\}=Y\_\{T\+1:T\+H\}, such as inclusion of all future observations within the predicted band; see Appendix[A](https://arxiv.org/html/2606.27711#A1)\. Training may then use a differentiable surrogate for this joint event, for example a smooth approximation to the maximum violation across horizons\.

### 3\.5Minimax Training

For some applications, we care about worst\-case expected loss across the support ofΠ\\Pi:

minF​supω∈supp⁡\(Π\)r​\(ω,F\),\\min\_\{F\}\\sup\_\{\\omega\\in\\operatorname\{supp\}\(\\Pi\)\}r\(\\omega,F\),the minimax decision criterion\[[65](https://arxiv.org/html/2606.27711#bib.bib68)\]\.

Within each training batch, we estimate the process\-level risk at each sampled DGP by

r^j​\(F\)=1R​∑r=1RL​\(F​\(y1:T\(j,r\)\),Z~\(j,r\)\)\.\\widehat\{r\}\_\{j\}\(F\)=\\frac\{1\}\{R\}\\sum\_\{r=1\}^\{R\}L\\bigl\(F\(y^\{\(j,r\)\}\_\{1:T\}\),\\widetilde\{Z\}^\{\(j,r\)\}\\bigr\)\.Direct optimization ofmaxj⁡r^j​\(F\)\\max\_\{j\}\\widehat\{r\}\_\{j\}\(F\)is unstable for SGD because gradients flow through only one process per update\. We replace the maximum by its log\-sum\-exp smoothing:

Lminimax​\(F\)=1ψ​log⁡\(∑j=1Jexp⁡\{ψ​r^j​\(F\)\}\),L\_\{\\mathrm\{minimax\}\}\(F\)=\\frac\{1\}\{\\psi\}\\log\\left\(\\sum\_\{j=1\}^\{J\}\\exp\\left\\\{\\psi\\,\\widehat\{r\}\_\{j\}\(F\)\\right\\\}\\right\),\(5\)whereψ\>0\\psi\>0is a temperature parameter\. This gives the standard smooth approximation

maxj⁡r^j​\(F\)≤Lminimax​\(F\)≤maxj⁡r^j​\(F\)\+log⁡Jψ\.\\max\_\{j\}\\widehat\{r\}\_\{j\}\(F\)\\leq L\_\{\\mathrm\{minimax\}\}\(F\)\\leq\\max\_\{j\}\\widehat\{r\}\_\{j\}\(F\)\+\\frac\{\\log J\}\{\\psi\}\.The relaxation approximates the maximum sampled process\-level risk at finiteψ\\psiand recovers it asψ→∞\\psi\\to\\infty\. Gradients flow through all sampled processes with weights proportional to

exp⁡\{ψ​r^j​\(F\)\},\\exp\\\{\\psi\\,\\widehat\{r\}\_\{j\}\(F\)\\\},so the hardest sampled processes dominate asψ\\psigrows\. In practice, we choose a moderateψ\\psiand, if necessary, increase it over training to gradually focus on the worst processes in the batch\.

This objective is a sampled minimax approximation: it targets the worst process represented in the current batch\. When the support ofΠ\\Piis large or contains low\-probability but high\-risk regions, the quality of the approximation depends on the outer sampling scheme\. Stratified sampling or adversarial sampling overω\\omegacan be used when more uniform coverage of the world is required\.

The same construction extends to any non\-negative per\-process functional\. Replacingr^j​\(F\)\\widehat\{r\}\_\{j\}\(F\)in \([5](https://arxiv.org/html/2606.27711#S3.E5)\) with

‖b^​\(ω\(j\)\)‖2\\\|\\widehat\{b\}\(\\omega^\{\(j\)\}\)\\\|^\{2\}gives a stronger variant of the bias\-control method of Section[3\.3](https://arxiv.org/html/2606.27711#S3.SS3): the penalty targets the largest sampled squared bias rather than the average squared bias underΠ\\Pi\. Similarly, replacingr^j​\(F\)\\widehat\{r\}\_\{j\}\(F\)with the per\-process calibration error

\(cov^α​\(ω\(j\)\)−α\)2\\left\(\\widehat\{\\mathrm\{cov\}\}\_\{\\alpha\}\(\\omega^\{\(j\)\}\)\-\\alpha\\right\)^\{2\}targets the largest sampled calibration violation\.

## 4Neural Architectures and Experimental Setup

The framework described in Sections[2](https://arxiv.org/html/2606.27711#S2)–[3](https://arxiv.org/html/2606.27711#S3)is agnostic about the particular neural architecture used to approximate the target decision rule\. In principle, any sufficiently flexible function approximator could be employed\. In practice, however, architecture matters for two reasons\. First, it can improve optimization and generalization by introducing inductive biases that match the structure of the estimation problem\. Second, it can enforce properties that are known to hold, or are desirable to impose, in a given setting, such as invariances or equivariances with respect to location, scale, trend, or finite temporal context\. These considerations are particularly useful when the statistical objective is itself structured, for example when one seeks affine\-equivariant forecasts, stable estimation under rescaling, or estimators that depend only on recent observations\. A fuller discussion of these architectural considerations is given in Appendix[B](https://arxiv.org/html/2606.27711#A2)\.

In this paper, the architectural choices are deliberately kept simple so that the empirical comparisons focus on the proposed simulation\-based training framework rather than on architecture engineering\. Unless otherwise stated, the experiments use a feedforward DenseNet backbone\[[31](https://arxiv.org/html/2606.27711#bib.bib32)\]as a general\-purpose function approximator\. We explicitly enforce location and scale equivariance through reversible per\-series normalization, which reduces the need to simulate over arbitrary ranges of levels and variances and helps align the network with known properties of models such as ARIMA and exponential smoothing\. More specialized constraints, such as additive, transformation, or time\-translation equivariances, can also be incorporated within the same framework when they are appropriate for the model class or the estimation target; these extensions are described in Appendix[B](https://arxiv.org/html/2606.27711#A2)\.

To accommodate variable\-length time series within a common architecture, inputs are padded to a fixed maximum window and the padded values are encoded separately from the normalized data range\. This allows the same network to process series of different lengths while keeping the implementation simple\. Throughout the applications that follow, we therefore use a shared architectural template and vary the training world, loss function, and target quantity according to the statistical objective of interest\. The next sections show that, even with this relatively generic architecture, the proposed training principle is sufficient to deliver substantial gains in forecasting accuracy, finite\-sample bias, calibration, and worst\-case performance across several model classes\.

## 5Application I: Exponential Smoothing

In this section, we illustrate neural estimators using the popular Exponential Smoothing model class as the underlying process\. Exponential Smoothing is a simple model \(1 to 3 parameters\) that is remarkably accurate in real data\. It is ideal to highlight the impact of neural estimators vs alternative estimators without obfuscating via complex model classes or sophisticated sampling mechanisms\. The three parameters in Exponential smoothing are defined in the range\(0,1\)\(0,1\)so a simulation mechanism that uses the Uniform distribution is parsimonious\. The experiments illustrate four uses of neural estimators:

- •Finite\-sample estimation\.Section[5\.1](https://arxiv.org/html/2606.27711#S5.SS1)compares neural estimators with maximum likelihood and Bayesian alternatives under the true exponential\-smoothing world, focusing on risk, bias, and minimax behavior\.
- •Forecasting and scaling\.Section[5\.2](https://arxiv.org/html/2606.27711#S5.SS2)studies how forecast accuracy changes with horizon, series length, model complexity, and simulation budget\.
- •Real\-data transfer\.Section[5\.3](https://arxiv.org/html/2606.27711#S5.SS3)evaluates whether estimators trained only on synthetic exponential\-smoothing worlds transfer to real benchmark data from the M1 competition\.
- •Tail\-risk behavior\.Section[5\.4](https://arxiv.org/html/2606.27711#S5.SS4)examines whether minimax\-oriented training reduces large forecast errors in real data, using the M3 yearly series\.

Together, these experiments show that the same structural model class can give rise to different estimators depending on the decision objective used in training\.

### 5\.1Additive Exponential Smoothing Class: Inference Results

We consider the full class of Additive Exponential Smoothing Class as well as the Simple Exponential Smoothing for clarity\.

The formal definition of the time series process with all components:

𝗅t\\displaystyle\\mathsf\{l\}\_\{t\}=α​\(yt−𝗌t−m\)\+\(1−α\)​\(𝗅t−1\+𝖻t−1\)\\displaystyle=\\alpha\(y\_\{t\}\-\\mathsf\{s\}\_\{t\-m\}\)\+\(1\-\\alpha\)\(\\mathsf\{l\}\_\{t\-1\}\+\\mathsf\{b\}\_\{t\-1\}\)\(6\)𝖻t\\displaystyle\\mathsf\{b\}\_\{t\}=β​\(𝗅t−𝗅t−1\)\+\(1−β\)​𝖻t−1\\displaystyle=\\beta\(\\mathsf\{l\}\_\{t\}\-\\mathsf\{l\}\_\{t\-1\}\)\+\(1\-\\beta\)\\mathsf\{b\}\_\{t\-1\}\(7\)𝗌t\\displaystyle\\mathsf\{s\}\_\{t\}=γ​\(yt−𝗅t−1−𝖻t−1\)\+\(1−γ\)​𝗌t−m\\displaystyle=\\gamma\(y\_\{t\}\-\\mathsf\{l\}\_\{t\-1\}\-\\mathsf\{b\}\_\{t\-1\}\)\+\(1\-\\gamma\)\\mathsf\{s\}\_\{t\-m\}\(8\)yt\+1\\displaystyle y\_\{t\+1\}=𝗅t\+𝖻t\+𝗌t\+1−m\+εt\\displaystyle=\\mathsf\{l\}\_\{t\}\+\\mathsf\{b\}\_\{t\}\+\\mathsf\{s\}\_\{t\+1\-m\}\+\\varepsilon\_\{t\}\(9\)
The states𝗅t\\mathsf\{l\}\_\{t\}represents the instantaneous level of the series \(intercept\),𝖻t\\mathsf\{b\}\_\{t\}an instantaneous linear trend and𝗌t\\mathsf\{s\}\_\{t\}the seasonal pattern that cycles with periodicitymm\. Parametersα,β,γ\\alpha,\\beta,\\gammacontrol the rate of change over time the states𝗅t,𝖻t,𝗌t\\mathsf\{l\}\_\{t\},\\mathsf\{b\}\_\{t\},\\mathsf\{s\}\_\{t\}respectively, belonging to the range\(0,1\)\(0,1\)\. Values closer to 0 imply no change over time \(historical average\) and closer to 1 total change over time\. The full model is known as Holt\-Winters \(HW\), submodels such as no seasonality \(a\.k\.a\. Holt\) or level\-only \(a\.k\.a Simple Exponential Smoothing\) are obtained by setting their respective parameters and starting states to 0\.

The experimental setup sets starting states𝗅0,𝖻0\\mathsf\{l\}\_\{0\},\\mathsf\{b\}\_\{0\}and𝗌−m\+1:0\\mathsf\{s\}\_\{\-m\+1:0\}from the uniformU​\(−10,10\)\\mathrm\{U\}\(\-10,10\)and parametersα,β,γ\\alpha,\\beta,\\gammafromU​\(0,1\)\\mathrm\{U\}\(0,1\)\. Seasonal periodmmis set to 4 and minimum length of the series is set to twice the seasonal period \+ number of parameters, \(8\+3=11\) so that competitor estimators have enough degrees of freedom to fit\. Maximum length of the series is set to 64, with lengths sampled uniformly\. Innovationsεt\\varepsilon\_\{t\}come from a standard i\.i\.d\. normal𝒩​\(0,1\)\\mathcal\{N\}\(0,1\)\.

We start with a smaller version of the experiment for the subclass of Simple Exponential Smoothing, a special case of Equation[9](https://arxiv.org/html/2606.27711#S5.E9)settingβ,γ,𝖻0,𝗌−m\+1:0=0\\beta,\\gamma,\\mathsf\{b\}\_\{0\},\\mathsf\{s\}\_\{\-m\+1:0\}=0, then𝗅0\\mathsf\{l\}\_\{0\}to the standard normal, and fixed length=16\. This simplification is done to isolate the effect of the estimators as much as possible, as well as to compare to the more computationally expensive Bayesian estimator\.

Three neural estimators are trained,Neural\-Uniftrained to minimize expected squared error under the uniform sampling ofα\\alpha\(Section[3\.1](https://arxiv.org/html/2606.27711#S3.SS1)\);Neural\-Debiastrained with weights 0\.05 for expected error and 0\.95 for the squared bias penalty \(Section[3\.2](https://arxiv.org/html/2606.27711#S3.SS2)\);Neural\-Minimax, trained to approximate a minimax risk estimator \(Section[3\.5](https://arxiv.org/html/2606.27711#S3.SS5)\)\. The reference estimators areSES\-MLEmaximum likelihood implemented in python’s packagestatsmodelsandBayes\-Unif, a Bayesian estimator of the posterior mean of SES implemented inbayesforecastin R, with the same priors and parameters as in the experiment\.

Left panel in Figure[2](https://arxiv.org/html/2606.27711#S5.F2)shows MSE perα\\alphafor each of the estimators, on 1 Million time series, except the Bayes estimator that is run on 70000\. The Cramer\-Rao lower bound is shown in magenta for reference: the lower theoretical limit for error for unbiased estimators\. Maximum Likelihood \(SES\-MLE\) has good performance for values ofα\\alphaclose to 0 \(when the process resembles i\.i\.d\. data\) but has worse results outside of this range\. Both Neural\-Unif and Bayes\-Unif are comparable, they aim to minimize expected error \(Bayes Risk\) over the uniform distribution ofα\\alpha\(i\.e\. a Uniform prior\) and we should expect to have the same performance\. They show similar U\-shaped risk curves, but the neural estimator is overall superior, everywhere but on the central values ofα\\alphanear 0\.5\. We attribute this difference to limitations ofbayesforecastMarkov Chain Monte Carlo under the uniform prior \(such problems as the extremes of the range of values\)\. The Neural\-Debias estimator has much better performance at the extremes ofα\\alphabut trades off with a worse performance overall compared to Neural\-Unif\. The Neural\-Minimax estimator succeeds in reducing the maximum error of any of the estimators, and the risk curve has become flatter, but it has three peaks at the extremes and center of the range ofα\\alpha\. Importantly, no single estimator is dominated, and no estimator is universally better than the rest\. Even SES\-MLE outperforms the others in a limited range ofα\\alpha\. Numeric comparison is shown in the Appendix Table[10](https://arxiv.org/html/2606.27711#A3.T10)\.

Right panel in Figure[2](https://arxiv.org/html/2606.27711#S5.F2)shows the results for estimation bias, centered for better visual appreciation \(closer to 0 means better, less bias\)\. Maximum Likelihood shows a strong negative bias, consistently estimating lowerα\\alphathan the population value\. Neural\-Unif and Bayes\-Unif show bias towardsα=0\.5\\alpha=0\.5: populationα<0\.5\\alpha<0\.5are overestimated,α\>0\.5\\alpha\>0\.5are underestimated\. The bias of Bayes\-Unif is stronger than the Neural\-Unif, coherent with a better MSE of Bayes\-Unif aroundα=0\.5\\alpha=0\.5indicating stronger bias\. Neural\-Minimax estimator has substantially reduced the bias compared to Neural\-Unif\. The Neural\-Debias has vastly reduced the overall bias of all the estimators though it is not completely unbiased\. Neural\-Debias has lower squared bias than SES\-MLE for all values ofα\\alphaabove 0\.1, and lower squared bias than the other estimators overall\. The MSE\-bias loss tradeoff for this Neural\-Debias estimator was set during training toλ=0\.05\\lambda=0\.05\(bias2\\text\{bias\}^\{2\}loss receives 19 times the weight of MSE\)\. Better bias can be achieved with more penalty, but it cannot be completely reduced to 0 becauseα\\alphalies in a bounded range\.

![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ses_alpha_mse.png)

![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ses_alpha_bias.png)

Figure 2:Comparison of risk \(left\) and bias \(right\) for estimators ofα\\alphain Simple Exponential Smoothing\. Neural\-Unif has better overall performance than Maximum Likelihood and a Bayesian procedure\. Neural\-Debias pays a price in risk for reduced bias\. Neural\-Minimax has lower maximum risk than all other estimators\. Maximum Likelihood has strong negative bias \(towards 0\)\. Other estimators bias toward 0\.5, while Neural\-Debias strongly reduces bias\.The inference results for the full experiments of the Holt\-Winters modelα,β,γ\\alpha,\\beta,\\gammais summarized in Table[1](https://arxiv.org/html/2606.27711#S5.T1)\. The Neural\-Debias estimator has successfully reduced the bias of Neural\-Unif and shows a better tradeoff than Maximum Likelihood\. In conclusion, neural estimators achieved their objectives of optimizing for risk, bias reduction and lower worst\-case errors\.

Table 1:Comparison of Parameter Estimation Errors \(MSE and Bias Squared\) for Holt\-Winters Exponential Smoothing\. The neural estimators show better bias and squared error than Maximum Likelihood \(HW\-MLE\)
### 5\.2Forecasting Scaling Laws in Exponential Smoothing

We now move into forecast performance, training a neural estimator per forecast horizon, for horizonsh=1,…,8h=1,\\ldots,8\(H=8H=8\)\. The neural estimators optimize for forecast risk \(Mean Squared Error\)\. We consider Simple Exponential Smoothing and Holt\-Winters separately to gauge the effect of the increase in model complexity\. The neural estimators are set to have scale and location equivariance for the forecasts via min\-max normalization, to reduce the informativeness in the sampling\. The reference estimators are Maximum Likelihood, SES\-MLE and HW\-MLE \(instatsmodels\)\.

Figure[3](https://arxiv.org/html/2606.27711#S5.F3)shows forecast MSE per horizon, for Simple Exponential Smoothing \(left\) and Holt\-Winters \(right\), across 1M samples following the sampling mechanism in Section[5\.1](https://arxiv.org/html/2606.27711#S5.SS1)\. The neural estimator outperforms Maximum likelihood, and the evolution of performance with horizon is consistent with the theoretical variance of each process, linear with horizon for SES and quadratic for HW\.

![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ses_mse_horizon.png)

![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/hw_mse_horizon.png)

Figure 3:Squared forecast error of neural estimators \(orange\) versus maximum likelihood \(blue\) as a function of the forecasting horizon\. The left panel corresponds to Simple Exponential Smoothing, while the right panel shows Holt–Winters\.We consider two ‘scaling laws’:

1. 1\.Asymptotic performance based on series length\. Once the neural network estimator is trained, how it compares to MLE as the length of the series increases \(sample size\)\.
2. 2\.Pre\-training Dataset size scaling\. How many simulated time series are needed to train a neural estimator that outperforms Maximum Likelihood at different lengths for the series\.

Figure[4](https://arxiv.org/html/2606.27711#S5.F4)shows how many training examples are required for the neural estimator until it outperforms Maximum Likelihood\. The series length are separated into 16, 32 and 64 to illustrate asymptotic properties\. We use the Percentage Wins metric, the proportion of series for which the neural estimator outperforms the reference\. This metric is more efficient when measuring small differences in MSE and more robust, accounts for numerical optimizer issues in the MLE estimators\. Experiment size is 20K time series for each length\. Left panel in Figure[4](https://arxiv.org/html/2606.27711#S5.F4)show the scaling for SES and right panel for HW\. The horizontal line in each plot is set to 0\.5, the proportion required to achieve superior performance\. Lines have been smoothed so comparisons are intended to be qualitative\. It can be seen on both Figures the increase in performance with more training and that shorter series require less training to outperform Maximum Likelihood\. Comparing SES vs HW, it can be seen that SES is a simpler problem than HW for the neural networks, they require less training to achieve superior performance\. We attribute this effect to the increase in complexity of the process \(more parameters\) as well as the increased Signal to Noise ratio in HW, an artifact of the simulation mechanism\.

It can be inferred from the plots that for very long series, Maximum Likelihood would be difficult to beat, confirming that at the asymptotic regime, the classic estimators are enough\. In this regime, differences between estimators become a numeric issue\. In the Appendix, Figure[9](https://arxiv.org/html/2606.27711#A3.F9)we measure forecast MSE, it can be confirmed that both neural and MLE have ‘converged’ to the same square error performance long before the maximum length series in the experiment\.

![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ses_scaling_tourn.png)

![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/hw_scaling_tourn.png)

Figure 4:Scaling laws for the forecast error performance of the Neural\-Unif estimator relative to maximum likelihood\. The horizontal axis shows the training set size used for the neural estimator \(simulated time series\), while the vertical axis reports the percentage of wins, defined as the proportion of test cases in which the neural estimator attains lower forecast error than MLE\. Results for Simple Exponential Smoothing are shown on the left, and Holt–Winters on the right\. Each panel reports three series lengths \(16, 32, and 64\)\. Neural estimators can outperform MLE with sufficient training data, but require increasingly larger datasets as the series length grows and the underlying process becomes more complex \(one parameter versus three parameters\)\.
### 5\.3Real Data: Forecasting Results of Neural Exponential Smoothing

We compare a neural estimator trained by the simulation scheme in Section[5\.1](https://arxiv.org/html/2606.27711#S5.SS1)and we test it on the M1 competition real dataset\[[40](https://arxiv.org/html/2606.27711#bib.bib4)\]with 1001 time series for Yearly, Quarterly and Monthly frequencies\. We adapt the simulation scheme for the M1: we add seasonalitym=12m=12to cover Monthly data, sampling Yearly, Quarterly and Yearly with equal probability\. The estimator Neural\-ETS, is trained to optimize for direct forecast MSE on horizonsh=1,…,18h=1,\\ldots,18\( MonthlyH=18H=18, QuarterlyH=8H=8, YearlyH=6H=6\)\. We choose the M1 because in their seminal paper\[[34](https://arxiv.org/html/2606.27711#bib.bib2)\], Hyndman et al\. compare between a range of estimators for the same ETS class, so it is a good way of isolating the effect of estimation and highlight the benefits of the Simulacrum approach\. Specifically, they implement 5 estimators: the Maximum Likelihood estimator \(MLE\), Least Squares estimator \(MSE\), the estimator that minimizes in\-sample Mean Absolute Percentage Errors \(MAPE\), the estimator that minimizes the residual variance \(Sigma\) and the estimator that minimizes the average of 1 to 3 step\-ahead in\-sample forecast errors \(AMSE\)\.

Results in Table[2](https://arxiv.org/html/2606.27711#S5.T2)show that the neural estimator achieves the best accuracy\. Ranking the estimators, a pattern emerges: top estimators are Neural, AMSE and MSE all optimize for the criteria of forecast error across horizons \(unlike MLE\)\. AMSE gets better performance than MSE because its estimation objective is closer to the true objective in the M1 competition of average error across horizons 1 to 18\. The neural estimator can target this objective easily, even considering a set of models and does not need to rely on model selection that does not optimize directly for forecast error\. A limitation in the comparison is that Neural\-ETS restricts to the additive class and two observations were removed due to the MAPE metric being unstable \(division by 0\)\.

Table 2:Comparison of Estimation Methods by MAPE for several ETS estimators in the M1 real data benchmark dataset\. Neural\-ETS Estimator considering all additive subclasses is the top performing method, followed by the estimator that optimizes for multi\-horizon forecast error \(AMSE\)\. Table is an extension from Table 3 in\[[34](https://arxiv.org/html/2606.27711#bib.bib2)\]\.Extending the comparison beyond ETS estimators to the complete set of models that participated in the M1 competition and the recent TSFM Chronos\-2\[[2](https://arxiv.org/html/2606.27711#bib.bib54)\], we see that the neural estimator is very competitive\. For the robust median APE metric used in the M1 paper, the neural estimator outperforms all other 21 models not only on average, but for most of the horizons, shown in Table[3](https://arxiv.org/html/2606.27711#S5.T3)\. This is remarkable, many of the models that participated in the competition are variants of Exponential Smoothing, including more sophisticated alternatives, while the neural estimator is just an alternative estimator of the classic additive class\. Results on average APE are also superior \(though not for all horizons\), seen in the Appendix Table[11](https://arxiv.org/html/2606.27711#A4.T11)\.

Table 3:Median APE: all M1 Competition data \(1001\)\. The Neural\-ETS estimator outperforms all other methods that participated in the M1 competition\.
### 5\.4Real Data: Performance of Neural\-ETS Minimax Estimator

Assessing performance of a minimax estimator in real data is more difficult than in simulation, because the true parameters are unobserved and forecast errors contain substantial irreducible noise\. However, if the dataset is sufficiently heterogeneous and includes challenging series, a minimax estimator’s properties could be appreciated in the*distribution*of forecast errors: relative to an estimator optimized for average risk, it should produce fewer extremely large errors, at the cost of somewhat worse performance on easier cases\.

We examine this trade\-off using the 675 yearly series in the M3 dataset\. Yearly series provide a useful test bed because they are volatile enough to generate extreme forecast failures\. The left panel of Figure[5](https://arxiv.org/html/2606.27711#S5.F5)plots the empirical quantile functions \(EQFs\) of forecast MSE for HW\-MLE, Neural\-Unif, and Neural\-Minimax\. For a given probability levelpp, the EQFQ^​\(p\)\\hat\{Q\}\(p\)reports the error threshold below which a proportionppof the observations fall; equivalently, larger values ofppcorrespond to increasingly extreme errors\.

The left panel suggests that Neural\-Minimax and Neural\-Unif have similar overall profiles relative to HW\-MLE, as expected from the risk profiles studied in Section[5\.1](https://arxiv.org/html/2606.27711#S5.SS1)\. To make the tail comparison clearer, the right panel of Figure[5](https://arxiv.org/html/2606.27711#S5.F5)reports the ratios

Q^MSE​\(Neural​\-​Minimax\)​\(p\)Q^MSE​\(HW​\-​MLE\)​\(p\)andQ^MSE​\(Neural​\-​Minimax\)​\(p\)Q^MSE​\(Neural​\-​Unif\)​\(p\)\.\\frac\{\\hat\{Q\}\_\{\\mathrm\{MSE\}\(\\mathrm\{Neural\\text\{\-\}Minimax\}\)\}\(p\)\}\{\\hat\{Q\}\_\{\\mathrm\{MSE\}\(\\mathrm\{HW\\text\{\-\}MLE\}\)\}\(p\)\}\\quad\\text\{and\}\\quad\\frac\{\\hat\{Q\}\_\{\\mathrm\{MSE\}\(\\mathrm\{Neural\\text\{\-\}Minimax\}\)\}\(p\)\}\{\\hat\{Q\}\_\{\\mathrm\{MSE\}\(\\mathrm\{Neural\\text\{\-\}Unif\}\)\}\(p\)\}\.Values above 1 indicate worse performance for Neural\-Minimax, while values below 1 indicate an advantage\. The plot has been smoothed for clarity, the results should be taken qualitatively\.

The pattern is consistent with the minimax objective\. At lower quantiles, Neural\-Minimax has slightly larger errors than HW\-MLE, indicating some sacrifice in best\-case performance\. However, from approximatelyp≥0\.5p\\geq 0\.5onward, its relative performance improves, and in the upper tail it produces smaller errors\. At the extreme, the largest forecast error of Neural\-Minimax is about 85% of the largest error of HW\-MLE\. The comparison with Neural\-Unif is qualitatively similar but less pronounced\. Overall, these results suggest that Neural\-Minimax behaves as intended: it trades off some performance on easier cases in exchange for better protection against extreme forecast failures\.

![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/quantile_error_dist_Mcomp.png)

![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/quantile_error_dist_ratio_Mcomp.png)

Figure 5:Forecast\-error distribution for HW\-MLE, Neural\-Unif, and Neural\-Minimax on the M3 Yearly dataset\. The left panel shows the empirical quantile functions \(EQFs\) of log forecast MSE\. The right panel shows smoothed ratios of the EQF of Neural\-Minimax relative to HW\-MLE and Neural\-Unif\. Values below 1 indicate smaller errors for the minimax estimator\. Neural\-Minimax is slightly worse at low quantiles but improves in the upper tail, consistent with its goal of reducing extreme forecast errors\.

## 6Application II: Solving the Bias and Calibration Problem in AR\(p\) Models

The autoregressive process of orderpp, denoted AR\(pp\), is a fundamental model in time series analysis\. Beyond forecasting applications, AR structures are critical in regression modeling to characterize correlated errors \(e\.g\.,\[[38](https://arxiv.org/html/2606.27711#bib.bib12)\]\)\. In such settings, neglecting autocorrelation compromises inference, yielding incorrect parameter estimates, biased standard errors, and invalid hypothesis tests\[[7](https://arxiv.org/html/2606.27711#bib.bib11)\]\.

Formally, an AR\(pp\) process generates observations recursively as a linear combination of past values, with coefficientsϕ1,…,ϕp\\phi\_\{1\},\\dots,\\phi\_\{p\}, plus an interceptccand an innovation termεt\\varepsilon\_\{t\}\(typically i\.i\.d\. Gaussian\):

yt\+1\\displaystyle y\_\{t\+1\}=c\+ϕ1​yt\+⋯\+ϕp​yt−p\+1\+εt\\displaystyle=c\+\\phi\_\{1\}y\_\{t\}\+\\cdots\+\\phi\_\{p\}y\_\{t\-p\+1\}\+\\varepsilon\_\{t\}\(10\)
With sufficient orderpp, these models can approximate highly complex dynamics, effectively capturing persistence and dependence\. However, despite their linearity, statistical estimation is non\-trivial due to the temporal dependency of observations\. Standard approaches face inherent limitations: Ordinary Least Squares \(OLS\) is inefficient, while the “gold standard” Exact Maximum Likelihood \(MLE\) suffers from well\-documented finite\-sample bias\.

This estimation bias severely limits applicability in scientific domains where rigorous unbiasedness is critical\. Furthermore, point estimation bias is compounded by the challenge of quantifyingparameter uncertainty\. Standard estimators fail to achieveuniform calibration: validity across the entire parameter space\. Instead, they often yield confidence intervals that are valid on average but systematically invalid for specific parameter configurations\. While considerable research has focused on bias correction \(dating back to 1954\[[42](https://arxiv.org/html/2606.27711#bib.bib5)\], with recent advances for low\-order processes\[[60](https://arxiv.org/html/2606.27711#bib.bib16)\]and machine learning\-assisted expansions\[[48](https://arxiv.org/html/2606.27711#bib.bib17)\]\), achieving uniform calibration remains more challenging\.

This section has two goals:

- •compare point\-estimation risk and bias for neural and classical estimators in AR\(5\), including maximum likelihood, OLS, Yule–Walker, and Burg estimators \(Section[6\.1](https://arxiv.org/html/2606.27711#S6.SS1)\);
- •evaluate distributional forecasting and calibration, including marginal calibration, uniform calibration across the parameter space, and simultaneous forecast\-path coverage \(Section[6\.2](https://arxiv.org/html/2606.27711#S6.SS2)\)\.

### 6\.1Estimation error and bias of Neural Estimators for the AR\(p\) class

We compare performance of two neural estimators: optimizing for minimum expected squared error \(Neural\-Unif\) and reducing the bias \(Neural\-Debias\)\. Our reference estimators are Burg \(AR\-BURG\), Yule\-Walker \(AR\-YW\), Conditional Least Squares \(AR\-OLS\) and Exact Maximum Likelihood \(AR\-MLE\), implemented instatscore package inR\.

For the experimental setup to train and evaluate the neural estimators, we restrict the order of AR top=5p=5\. To sample from the parameter space of stationary AR processes we first sampleppPartial Autocorrelation coefficients from thepp\-dimensional uniform hypercubeU​\(−1,1\)p\\mathrm\{U\}\(\-1,1\)^\{p\}\. The sampled partial autocorrelations are then transformed into AR coefficients via the Durbin\-Levinson, enforcing coefficients to the stationary subspace\. Innovation process is set toN​\(0,1\)\\mathrm\{N\}\(0,1\)innovations, constantccis set to 0\. Series lengths vary uniformly from length 12 \( 2 timespp\+ 2, so estimation is feasible\) to 64\. The reference AR models are correctly specified according to the simulation mechanism, fixing the order to 5 and disabling the estimation of the additive constant\. The neural estimators are made location and scale invariant so that the variance of the innovation process is not informative\.

The comparison of the estimators averaging across the parameter space is shown in Table[4](https://arxiv.org/html/2606.27711#S6.T4)over a test set of 1M sample size\. Neural estimators are superior to all other estimators in terms of expected squared error \(MSE\) and bias\. The result holds both when averaging across the sampling mechanism \(the inversions of the PACF hypercube\) but also for each single coefficient\. Neural\-Debias has succeeded in reducing the bias of Neural\-Unif at a tradeoff in MSE\. Among the reference estimators, Maximum Likelihood has the best performance, while Yule\-Walker and Least Squares have relatively poor performance\. This might be attributed to numeric issues that affect the squared error metric\. AR\-MLE was also prone to numeric issues with convergence of the optimization method used in the estimation, in about 10% of the time series, when errors were raised, the fallback method was set to AR\-BURG\.

Table 4:Inference errors MSE andbias2\\text\{bias\}^\{2\}for estimation of coefficients in an AR\(5\) for neural and classic estimators, averaging across each of the parameters\. The Neural\-Unif estimator achieves best overall performance for all dimensions, the Neural\-Debias effectively reduces the bias\. The best classic estimator is Exact Maximum Likelihood \(AR\-MLE\) followed by Burg \(AR\-BURG\)To showcase performance per value of a parameter, we focus onϕ4\\phi\_\{4\}, because smaller\-orderϕ\\phiare more studied elsewhere\. Results for other coefficients are analogous\. MSE for the range ofϕ4\\phi\_\{4\}is depicted in Figure[6\(a\)](https://arxiv.org/html/2606.27711#S6.F6.sf1)\. We see that no estimator completely dominates all others, AR\-MLE is superior to Neural\-Unif for values ofϕ4≤0\\phi\_\{4\}\\leq 0, so there is a range of the parameter space for which AR\-MLE is better than the Neural\-Unif, though Neural\-Unif is clearly superior\. AR\-MLE dominates AR\-OLS\. AR\-YW, has very poor performance on average, but has the best performance in a neighbourhood aroundϕ4=0\\phi\_\{4\}=0\. The Neural\-Debias estimator has good performance but it is virtually dominated by Neural\-Unif, paying the price of reducing the bias\. The bias comparison is shown in Figure[6\(b\)](https://arxiv.org/html/2606.27711#S6.F6.sf2)\. The Neural\-Debias estimator virtually dominates all other methods in terms of bias across the range of values\. All estimators show a bias towards 0 \(AR\-YW very strongly\), but AR\-MLE has a smaller bias that Neural\-Unif in the range of negative values ofϕ4\\phi\_\{4\}and larger bias for greater values\. We see that the relationship between estimators is complex, it cannot be summarized by simple rules\.

![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ar5_phi4_mse.png)\(\(a\)\)Estimation MSE vsϕ4\\phi\_\{4\}in AR\(5\)\.
![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ar5_phi4_bias.png)\(\(b\)\)Estimation bias vsϕ4\\phi\_\{4\}in AR\(5\)\.

Figure 6:Detailed performance comparison for parameterϕ4\\phi\_\{4\}in the AR\(5\) class\. Left panel: Estimation MSE for each value ofϕ4\\phi\_\{4\}parameter in an AR\(5\) model\. Classic and Neural estimators for the AR\(5\) are compared\. Neural\-Unif achieves best overall performance but Maximum Likelihood is better forϕ4≤−0\.5\\phi\_\{4\}\\leq\-0\.5\. The Neural\-Debias estimator has overall worse MSE, trading off error for bias\. Right panel: Estimation bias for each value ofϕ4\\phi\_\{4\}parameter in an AR\(5\) model, centered \(closer to 0 is better\)\. Classic and Neural estimators for the AR\(5\) are compared\. The Neural\-Debias estimator strongly reduces bias over all other alternatives, which exhibit bias towards 0 \(or around 0\)\.In sum, the neural estimators have succeeded in reaching superior expected error and bias reduction for the main class of time series models in Time Series Analysis\. Distributional forecast accuracy in the AR\(p\) is covered in the following Section[6\.2](https://arxiv.org/html/2606.27711#S6.SS2)\.

### 6\.2Uniform Calibration and Distributional Accuracy of Neural Estimators for the AR\(p\) class

Having addressed parameter estimation, we now turn to uncertainty quantification \(which subsumes point estimation and is more challenging to solve\)\. The main result of this section is that our framework achieves strong distributional performance in the AR\(p\)\(p\)class, including accurate quantile forecasts, marginal calibration, simultaneous prediction paths, and, most importantly,uniform calibration\. Uniform calibration is particularly important in high\-stakes settings, where risk must be controlled for each process rather than only on average across a population:

- •Financial Regulation:The Basel Committee’s MAR32 framework requires backtesting calibration not just bank\-wide, but at the granular trading\-desk level\[[5](https://arxiv.org/html/2606.27711#bib.bib50)\]\.
- •Cloud Reliability:Providers like Google set Service Level Objectives \(SLOs\) \(e\.g\., 99\.9% uptime\) that must hold for individual availability zones, not just the global average\[[23](https://arxiv.org/html/2606.27711#bib.bib49)\]\.
- •Healthcare:Clinical decision\-making requires calibration valid for the specific patient profile, not just the average patient\[[63](https://arxiv.org/html/2606.27711#bib.bib15)\]\. This parallels the distinction between ‘strong calibration’ \(uniform\) and ‘mean calibration’ \(marginal\)\.

Standard versions of the main forecasting paradigms do not generally target this property directly:

- •Classical Frequentist:Estimators like Maximum Likelihood or OLS are biased in finite samples\. Consequently, prediction intervals are centered incorrectly, leading to a failure in uniform calibration\.
- •Bayesian Methods:Bayesian credible intervals guaranteeaveragecalibration with respect to the prior distribution \(Bayes Risk\), not uniform calibration for every fixed parameterθ\\theta\.
- •Global & Foundation Models:Models like N\-BEATS\[[51](https://arxiv.org/html/2606.27711#bib.bib39)\]or Chronos\[[3](https://arxiv.org/html/2606.27711#bib.bib48)\]are trained via Empirical Risk Minimization over massive corpora\. They optimize an aggregate objective, learning a “shared” uncertainty model that works best on average \(Marginal Calibration\)\. Achieving uniform calibration in end\-to\-end training is difficult because it imposes constraints on individual series, effectively negating the benefits of shared data learning\. Furthermore, individual series often lack sufficient data to train such complex models in isolation\.

Our framework addresses this by simulating the parameter space directly, allowing us to penalize miscalibration at the individual process level\. We compare three neural variants:Neural\-Unif\(standard Pinball loss\),Neural\-Calibr\(calibration\-aware loss\), andNeural\-Simult\(simultaneous bands\), against classical baselines\. Note thatNeural\-Unifserves as a proxy for a standard Global/Foundation/PFN model trained on synthetic data\.

The results in Figure[7](https://arxiv.org/html/2606.27711#S6.F7)demonstrate how our framework solves the hierarchy of calibration issues:

1. 1\.Forecast Accuracy:Neural estimators consistently achieve lower Pinball loss than classical estimators across all horizons \(Figure[7\(a\)](https://arxiv.org/html/2606.27711#S6.F7.sf1)\)\.
2. 2\.Marginal Forecast Calibration:Classical methods suffer from systematic undercoverage because they ignore parameter uncertainty \(Figure[7\(b\)](https://arxiv.org/html/2606.27711#S6.F7.sf2)\)\. In contrast, neural methods achieve nominal marginal coverage\.
3. 3\.Uniform Calibration:This result parallels the parameter bias result\.Neural\-Unif, a representative of standard average\-risk training \(like most TSFMs, GFMs\), achieves nominal coverage on average but exhibits systematic coverage bias conditional onϕ1\\phi\_\{1\}\(a visible “hump” in Figure[7\(c\)](https://arxiv.org/html/2606.27711#S6.F7.sf3)\)\.Neural\-Calibrcorrects this, flattening the coverage curve to the nominal level across the parameter space with negligible loss in accuracy\.
4. 4\.Simultaneous Coverage:A more challenging test of learning the DGP is bounding the entire forecast trajectoryyT\+1:T\+Hy\_\{T\+1:T\+H\}jointly\. Risk management often requires simultaneous bands \(e\.g\., “the path will stay within these bounds 95% of the time”\)\. Pointwise methods fail here, achieving only∼\\sim75\.5% coverage for a 95% target due to the compounding of errors over time\.Neural\-Simultsuccessfully learns the joint distribution, achieving the nominal 95% simultaneous coverage\. Figure[7\(d\)](https://arxiv.org/html/2606.27711#S6.F7.sf4)shows the reverse to validate the results: how a simultaneous coverage interval overcovers the pointwise interval as horizon increases\.

![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ar5_q95_pbl.png)\(\(a\)\)Pinball loss
![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ar5_q95_cvr.png)\(\(b\)\)Marginal coverage
![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ar5_phi_cvr_bias.png)\(\(c\)\)Uniform calibration
![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ar5_simult_cvr.png)\(\(d\)\)Simultaneous coverage

Figure 7:Distributional forecast performance for the 0\.95 quantile in AR\(5\)\. Panel \(c\) is the main result: Neural\-Unif achieves nominal coverage only on average, while Neural\-Calibr removes the systematic coverage bias acrossϕ1\\phi\_\{1\}and attains near\-uniform calibration\. Panel \(a\) shows that neural estimators also reduce pinball loss; panel \(b\) shows that classical intervals undercover because they ignore parameter uncertainty; and panel \(d\) shows that Neural\-Simult attains nominal simultaneous coverage of the forecast path, unlike pointwise methods\.Taken together, these results show that the proposed framework resolves a central limitation of existing forecasting paradigms: it can achieve process\-level, finite\-sample calibration without sacrificing the benefits of shared\-data learning\. In the AR\(p\)\(p\)class, this yields intervals that are not only accurate on average, but reliable across the parameter space\. Table[5](https://arxiv.org/html/2606.27711#S6.T5)summarizes how different approaches target finite\-sample distributional objectives\.

Table 5:Comparison of forecasting approaches for finite\-sample distributional objectivesNotes:✓\\checkmark= directly targeted by the standard formulation;≈\\approx= approximate, asymptotic, or average\-case; \+ = possible with additional modelling, resampling, or calibration; – = not generally targeted\. Marginal calibration refers to average coverage over the relevant world, prior, empirical distribution, or resampling scheme\. Uniform calibration refers to coverage conditional on each process or parameter configuration\. Simultaneous coverage refers to joint coverage of a multi\-horizon forecast path\.

## 7Application III: Model Selection & the Forecast Combination Puzzle \(Multi\-model Worlds\)

Sections[5](https://arxiv.org/html/2606.27711#S5)and[6](https://arxiv.org/html/2606.27711#S6)considered worlds built around a single structural model class\. We now move tomulti\-model worlds, where the world contains several candidate mechanisms and the estimator must learn a higher\-level statistical decision\. This setting reflects many practical forecasting pipelines, which select among models for parsimony and interpretation, combine model families to improve accuracy, and hedge against misspecification\.

This section considers two such decisions:

- •Model selection\.Section[7\.1](https://arxiv.org/html/2606.27711#S7.SS1)treats selection among additive ETS subclasses as the target decision, rather than optimizing directly for forecast or parameter\-estimation accuracy\. A neural selector is trained to recover the data\-generating subclass and is compared with the standard pipeline of maximum\-likelihood estimation followed by AICc\-based selection\.
- •Forecast combination\.Section[7\.2](https://arxiv.org/html/2606.27711#S7.SS2)turns the popular forecast combination heuristic from a post\-hoc averaging rule into a direct estimation problem under a combined generative world\. Rather than fitting ARIMA and ETS separately and then estimating weights for their forecasts, we assume the observed series is generated by a convex combination of ARIMA and ETS components driven by common shocks, defining a novel model class that emerges from combining the two classic models\. The resulting neural estimator learns the combined\-class decision rule directly and achieves state\-of\-the\-art results on many real benchmark datasets\. It also provides insight into the Forecast Combination Puzzle: even when the estimator is trained to target predictive accuracy directly, its gains over equal weights are modest, explaining why equal\-weight combinations remain such a robust benchmark\.

### 7\.1Model Selection for Additive Exponential Smoothing

Accurate model selection is essential for parsimony and interpretability\. Our goal is to train aNeural\-Selectionestimator that maximizes the probability of choosing the right subclass from the set of additive ETS models, comparing its accuracy to the standard approach of Maximum Likelihood followed by corrected Akaike Information Criterion \(AICc\) implemented instatsmodels\(ETS\-AIC\)\.

Using the simulation scheme from Section[5\.1](https://arxiv.org/html/2606.27711#S5.SS1), we generate synthetic time series across 8 Additive ETS submodels, covering all combinations of Trend \(Yes/No\), Seasonality \(Yes/No\), and Seasonal Period \(4 or 12\)\. This ensures coverage of typical Yearly, Quarterly, and Monthly frequencies\. To prevent ’leakage’, subclasses have equal probability and series length is controlled so it is not informative of the subclass\. We treat model selection as a supervised classification problem: the neural network is trained to minimize cross\-entropy loss, mapping input series to their true model class probabilities\.

We evaluate performance on a test set of 128,000 series\. Table[6](https://arxiv.org/html/2606.27711#S7.T6)presents the confusion matrix for Trend detection\.Neural\-Selectionachieves higher accuracy than ETS\-AIC\. Notably, while ETS\-AIC exhibits a strong bias toward selecting complex trended models when none exist \(Type I error\), the Neural estimator is more robust, showing only a slight conservative bias toward non\-trended models\.

Table 6:Confusion Matrix for Trend Selection \(Additive ETS\)\.Neural\-Selection \(bold\)outperformsETS\-AIC \(italics\), which exhibits a strong bias toward false positives \(detecting trend when none exists\)\.Regarding Seasonality \(Table[7](https://arxiv.org/html/2606.27711#S7.T7)\), both methods show a slight bias toward Period=4 \(the intermediate complexity between 0 and 12\)\. However, Neural\-Selection outperforms ETS\-AIC across all categories, achieving an overall accuracy of 97\.8% compared to 94\.5% for AIC\.

Table 7:Confusion Matrix for Seasonality Period Selection\.Neural\-Selection \(bold\)achieves 97\.8% overall accuracy, surpassingETS\-AIC \(italics\)at 94\.5%\.Figure[8](https://arxiv.org/html/2606.27711#S7.F8)illustrates the scaling of selection accuracy with series length\. While both methods improve as the time series length increases, Neural\-Selection has better accuracy for lower samples and converges faster toward perfect selection\.

![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ets-aaa-modelsel-trend.png)\(\(a\)\)Trend Selection Accuracy
![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ets-aaa-modelsel-season.png)\(\(b\)\)Seasonality Selection Accuracy

Figure 8:Model selection accuracy vs\. Series Length\. The Neural estimator \(Orange\) consistently outperforms ETS\-AIC \(Blue\) for both \(a\) Trend detection and \(b\) Seasonality period detection\.
### 7\.2A Generative Resolution to the Forecast Combination Puzzle

Combining forecasts via weighted averages is a staple in forecasting applications, dominating competitions\[[41](https://arxiv.org/html/2606.27711#bib.bib33)\]and key to success in high\-stakes scenarios such as COVID\-19 forecasting\[[56](https://arxiv.org/html/2606.27711#bib.bib46)\]\. However, the field faces the “Forecast Combination Puzzle”: simple equal\-weighted combinations often outperform theoretically optimal estimated weights based on historical performance\[[10](https://arxiv.org/html/2606.27711#bib.bib8),[57](https://arxiv.org/html/2606.27711#bib.bib13)\]\. This paradoxical effect is attributed to the fact that the error incurred in estimating weights often negates the theoretical gain in predictive power vs a simple ‘mean of forecasts’ baseline\[[9](https://arxiv.org/html/2606.27711#bib.bib14)\]\.

We propose a resolution to the puzzle by reframing combination not as a post\-hoc averaging step, but as agenerative process\. We define a “Combined World” where observations result from a convex combination of two distinct DGPs \(e\.g\., ARIMA and ETS\)\. Importantly, this differs from synthetic data augmentation methods like TSMixup \(in Chronos foundation model\)\[[3](https://arxiv.org/html/2606.27711#bib.bib48)\], which combineindependentseries\. In the forecast combination problem, candidate models are estimated on the same time series and are thus subject to the same shocks\. We capture this dependence by exposing the ARIMA and ETS components toidentical innovations\(εt\\varepsilon\_\{t\}\)\. This creates a “convex combination with common shocks” generative model that formalizes the heuristic two\-step process\. In this combined world, the analytical solution to optimal weight estimation is intractable, but since the ground truth is known, it can be approximated by a neural estimator,Neural\-Puzzle\. Even though estimation faces identifiability issues \(as ARIMA and ETS classes overlap\), the Simulacrum successfully recovers an estimator for both optimal weights and optimal forecasts in the sense of squared error directly, without having to estimate underlying component models\.

To validate this approach, we trained estimators for the convex combination of the ARIMA and Exponential Smoothing classes\. We compare three neural variants:

1. 1\.Pure Baselines:Neural\-ARIMAandNeural\-ETS, trained on their respective single\-model worlds\.
2. 2\.Convex Combination \(Neural\-Puzzle\):Trained on series generated byyt=w⋅ytETS\+\(1−w\)⋅ytARIMAy\_\{t\}=w\\cdot y\_\{t\}^\{\\text\{ETS\}\}\+\(1\-w\)\\cdot y\_\{t\}^\{\\text\{ARIMA\}\}, wherew∼U​\(0,1\)w\\sim\\mathrm\{U\}\(0,1\)\. Uniquely, the set of innovationsεt\\varepsilon\_\{t\}is identical for both models\.

Simulations for ETS and ARIMA cover Yearly and Quarterly frequencies \(Seasonality 0 and 4\)\. ETS follows the sampling described in Section[5\.1](https://arxiv.org/html/2606.27711#S5.SS1)\. ARIMA simulations follow an ARIMA\(p,d,q\)×\(P,D,Q\)4\(p,d,q\)\\times\(P,D,Q\)\_\{4\}withp,q≤3p,q\\leq 3,P,Q≤2P,Q\\leq 2,d\+D≤2d\+D\\leq 2and drift set to 0 ifd\+D=2d\+D=2\. This mechanism emulates the heuristically restricted model space in the auto\.arima from theforecastR package\[[33](https://arxiv.org/html/2606.27711#bib.bib31)\]\. AR and MA orders and differencing levels are sampled from the discrete uniform\. AR and MA coefficients follow the uniform PACF inversion \(see Section[6](https://arxiv.org/html/2606.27711#S6)\), both for generating stationary AR and invertible MA coefficients\. Series lengths are sampled uniformly subject to the constraint that each component model can be estimated given its number of parameters\.

We define the generative process for the ‘Combined World’yty\_\{t\}as follows:

1. 1\.Draw structure and parametersθETS\\theta\_\{\\text\{ETS\}\}andθARIMA\\theta\_\{\\text\{ARIMA\}\}from their respective definitions\.
2. 2\.Generate a single sequence of i\.i\.d\. innovationsε1:T∼𝒩​\(0,1\)\\varepsilon\_\{1:T\}\\sim\\mathcal\{N\}\(0,1\)\.TTis sampled uniformly with maximum of 64\. The minimum length is chosen so that both component models can be estimated\.
3. 3\.Generate the component seriesy1:TETSy^\{\\text\{ETS\}\}\_\{1:T\}andy1:TARIMAy^\{\\text\{ARIMA\}\}\_\{1:T\}using their respective recursive equations driven by thesameε1:T\\varepsilon\_\{1:T\}sequence\.
4. 4\.Normalize both series via min\-max normalization\.
5. 5\.Construct the final observed seriesyty\_\{t\}as a weighted sum of the standardized components: yt=w⋅ytETS\+\(1−w\)⋅ytARIMAy\_\{t\}=w\\cdot y^\{\\text\{ETS\}\}\_\{t\}\+\(1\-w\)\\cdot y^\{\\text\{ARIMA\}\}\_\{t\}wherew∼U​\(0,1\)w\\sim\\mathrm\{U\}\(0,1\)represents the mixing weight\.

#### 7\.2\.1Real Data: Forecast Accuracy of Puzzle\-based Combination

Table[8](https://arxiv.org/html/2606.27711#S7.T8)reports results on the Monash Time Series Archive\[[22](https://arxiv.org/html/2606.27711#bib.bib42)\]\. Theforecast combinationestimator Neural\-Puzzle achieves state\-of\-the\-art accuracy, outperforming the pure ‘Neural\-ARIMA’ and ‘Neural\-ETS’ baselines in almost all datasets and metrics\. Critically, it also outperforms the best models in the Monash benchmark, indicated by ‘\*’ in the table\.111The benchmark reference values are taken from the Monash Time Series Forecasting Archive comparison available at the time of evaluation, November 2025\. The comparison includes classical ARIMA/ETS methods and machine\-learning models such as Transformer\-, CNN\-, N\-BEATS\-, and CatBoost\-based methods\.

Table 8:Forecasting performance across neural estimators trained on ARIMA, ETS, and combined ARIMA–ETS worlds for yearly and quarterly benchmark datasets\. The “Convex Win %” row reports the proportion of individual series for whichNeural\-Puzzleachieves lower error than the corresponding single\-model estimator\. An asterisk indicates that the reported value improves on the best model in the Monash benchmark comparison snapshot described in the text\.
#### 7\.2\.2Solving the Puzzle: Comparison with Equal Weights

To assess the forecast\-combination puzzle in this setting, Table[9](https://arxiv.org/html/2606.27711#S7.T9)comparesNeural\-Puzzlewith a post\-hoc equal\-weight average ofNeural\-ETSandNeural\-ARIMA\. The performance gap is narrow, withNeural\-Puzzlewinning in approximately 49–59% of cases depending on the dataset\.

These results are consistent with the standard explanation of the puzzle: equal weights are a robust heuristic because the gains from unequal weighting are often small relative to the difficulty of estimating weights from finite samples\. Unlike the two\-step approaches analyzed in prior literature, the neural estimator directly targets the combination as the underlying DGP, providing a more complete resolution\.

Table 9:Convex Combination vs\. Post\-hoc Equal Weights\. Values represent the % of series where the trained Convex estimator yielded lower error than simple averaging\. Values near 50% indicate that the learned combination performs similarly to equal weights, consistent with the robustness of equal weighting in forecast combination\.

## 8Related Work

We now discuss how these results fit within the broader forecasting and inference literature\.

### 8\.1Motivating Evidence for Better Estimation of Structural Models as a Way Forward

Empirical studies on massive datasets suggest that the superior performance of Deep Learning Global Forecasting Models is driven by better generalization rather than the ability to model complex, non\-linear patterns\[[44](https://arxiv.org/html/2606.27711#bib.bib34)\]\. This is evidenced by the observation on many benchmark datasets that while neural networks outperform classical models \(like ARIMA or ETS\) out\-of\-sample, they often exhibit higher in\-sample error, implying that neural networks succeed by avoiding the overfitting common in classical estimation, effectively acting as robust estimators for simple signals\. Winner methods in the M4 Competition\[[59](https://arxiv.org/html/2606.27711#bib.bib38)\]used Exponential Smoothing with a recurrent neural network for the portion of the signal that could not be captured by the structural model, and similar performance was achieved just by combinations of structural models\[[43](https://arxiv.org/html/2606.27711#bib.bib37)\]\.

Recent literature reinforces the value of simple, structural models when paired with rigorous estimation or preprocessing\.\[[8](https://arxiv.org/html/2606.27711#bib.bib51)\]demonstrate that a mature preprocessing pipeline allows simple ARIMA models to achieve state\-of\-the\-art performance, matching or outperforming foundation models on the Monash Forecasting Archive\. Similarly,\[[58](https://arxiv.org/html/2606.27711#bib.bib53)\]provides evidence that structural Bayesian exponential smoothing methods can achieve top\-tier accuracy\. Collectively, these works indicate that the bottleneck in forecasting is often not theexpressivenessof the model class, but the quality of theestimator\.

### 8\.2Simulation\-Based Inference

Simulation has long been used in forecasting and inference\.\[[24](https://arxiv.org/html/2606.27711#bib.bib66)\]introduced*indirect inference*, in which the parameters of a model with an intractable likelihood are recovered by matching an auxiliary statistic computed on observed data to the same statistic computed on simulated data; the efficient method of moments of\[[16](https://arxiv.org/html/2606.27711#bib.bib67)\]is a closely related score\-based variant\. The crucial difference between these approaches and the present framework is amortization\. Classical indirect inference solves a separate optimization problem for every dataset, searching over parameters whose simulated auxiliary statistic best matches the observed one\. The Simulacrum instead solves the estimation problem once, over an entire generative world, by learning a direct mappingFFfrom observed series to estimands, so that each new series is handled by a single forward pass\.

This amortization is shared with modern simulation\-based inference \(SBI\), which trains neural networks to approximate inference mappings for models with intractable likelihoods\[[11](https://arxiv.org/html/2606.27711#bib.bib21)\]\. Neural posterior estimation\[[25](https://arxiv.org/html/2606.27711#bib.bib64)\], amortized Bayesian inference frameworks such as BayesFlow\[[52](https://arxiv.org/html/2606.27711#bib.bib58)\], and Neural Bayes Estimators\[[53](https://arxiv.org/html/2606.27711#bib.bib55)\]all learn estimators by simulation\. These methods target the Bayesian posterior; in our notation they correspond to the special case in whichΠ\\Piis a Bayesian modelΠBayesian\\Pi\_\{\\mathrm\{Bayesian\}\},

F^NBE≈arg⁡minF⁡ℛ​\(F;ΠBayesian\)\.\\widehat\{F\}\_\{\\mathrm\{NBE\}\}\\;\\approx\\;\\arg\\min\_\{F\}\\mathcal\{R\}\(F;\\Pi\_\{\\mathrm\{Bayesian\}\}\)\.Our formulation generalizes this in two respects: the worldΠ\\Pineed not be a Bayesian model, and the objective need not be posterior recovery but may target conditional, decision\-theoretic criteria \(Section[2\.5](https://arxiv.org/html/2606.27711#S2.SS5)\)\. The same amortized, task\-distribution structure underlies neural processes\[[17](https://arxiv.org/html/2606.27711#bib.bib78)\]and meta\-learning more broadly\[[4](https://arxiv.org/html/2606.27711#bib.bib60),[15](https://arxiv.org/html/2606.27711#bib.bib61)\], with the generative worldΠ\\Piplaying the role of a task distribution\. What distinguishes our setting is the object being optimized: meta\-learning typically targets fast adaptation or average accuracy across tasks, whereas we target process\-level statistical properties that must hold at each individual task\.

Another line of research uses simulation to enlarge the data on which a conventional forecaster is trained\. Synthetic series have been used to train global models where real data are insufficient\[[27](https://arxiv.org/html/2606.27711#bib.bib23)\], and synthetic augmentation has been used to improve feature\-based forecast model selection\[[61](https://arxiv.org/html/2606.27711#bib.bib24)\], often drawing on the GRATIS scheme for generating diverse, realistic series with controllable characteristics\[[35](https://arxiv.org/html/2606.27711#bib.bib22)\]\. In this literature, the simulation distribution is chosen for diversity or coverage\. Our framework generalizes these approaches by proposing deliberate world design paired with a decision\-theoretic objective, so that the simulation distribution is no longer a heuristic source of variety but the precise specification of the problem the estimator is required to solve\.

### 8\.3Neural Forecasting Methods

Many recent neural forecasting methods can be written as approximate minimizers of a population riskℛ​\(F;Π\)\\mathcal\{R\}\(F;\\Pi\)for a particular choice of generative worldΠ\\Pi\. Making this choice explicit reveals a common structure across methods that otherwise look distinct\.

#### Global forecasting models

Global models such as DeepAR\[[54](https://arxiv.org/html/2606.27711#bib.bib63)\], N\-BEATS\[[51](https://arxiv.org/html/2606.27711#bib.bib39)\], and the Temporal Fusion Transformer\[[39](https://arxiv.org/html/2606.27711#bib.bib80)\]train a shared predictor over the empirical distribution of available series,

Π=Πempirical\.\\Pi=\\Pi\_\{\\mathrm\{empirical\}\}\.This delivers cross\-series generalization, but the empirical world is often narrow, so the learner is forced to extrapolate beyond whatΠempirical\\Pi\_\{\\mathrm\{empirical\}\}supports\. The difficulty is most acute for distributional forecasting, where rare events and tail behavior are scarcely represented in finite corpora\. Hybrid worlds allowΠempirical\\Pi\_\{\\mathrm\{empirical\}\}to be expanded explicitly and in a controlled way\.

#### Time series foundation models

Time series foundation models \(TSFMs\) such as Chronos\[[3](https://arxiv.org/html/2606.27711#bib.bib48)\]and TimesFM\[[12](https://arxiv.org/html/2606.27711#bib.bib47)\]scale training to massive, heterogeneous pretraining worldsΠpretrain\\Pi\_\{\\mathrm\{pretrain\}\}assembled from diverse empirical and synthetic series,

F^TSFM≈arg⁡minF⁡ℛ​\(F;Πpretrain\),\\widehat\{F\}\_\{\\mathrm\{TSFM\}\}\\approx\\arg\\min\_\{F\}\\mathcal\{R\}\(F;\\Pi\_\{\\mathrm\{pretrain\}\}\),with zero\-shot generalization arising whenΠpretrain\\Pi\_\{\\mathrm\{pretrain\}\}has sufficient support over the test domain\. In our framework, TSFMs occupy one end of a spectrum that runs from classical single\-mechanism worlds to vast pretraining worlds\.

#### Synthetically trained forecasters

A distinct line of work trains inference machines on synthetic data alone\. In our notation,

Π=Πsynthetic,F^synth≈arg⁡minF⁡ℛ​\(F;Πsynthetic\)\.\\Pi=\\Pi\_\{\\mathrm\{synthetic\}\},\\qquad\\widehat\{F\}\_\{\\mathrm\{synth\}\}\\approx\\arg\\min\_\{F\}\\mathcal\{R\}\(F;\\Pi\_\{\\mathrm\{synthetic\}\}\)\.Prior\-data fitted networks\[[49](https://arxiv.org/html/2606.27711#bib.bib83)\]learn an inference rule by repeatedly sampling tasks from a synthetic prior and training a network to predict held\-out points; the trained network thereby approximates the Bayesian posterior predictive for that prior\. TabPFN\[[29](https://arxiv.org/html/2606.27711#bib.bib59)\]is the tabular instantiation, and ForecastPFN\[[14](https://arxiv.org/html/2606.27711#bib.bib84)\]a variant whose synthetic prior is purpose\-built for forecasting\. TabPFN\-TS\[[30](https://arxiv.org/html/2606.27711#bib.bib82)\]shows the premise is general enough that the tabular model itself, pretrained only on synthetic tables, forecasts competitively once timestamps are encoded as tabular features\.

SarSim0\[[50](https://arxiv.org/html/2606.27711#bib.bib81)\]trains standard backbones such as N\-BEATS by ordinary risk minimization on an engineered synthetic world of SARIMA processes\. In the vocabulary used here, this restricts the world to well\-behaved processes and applies designed contamination operators \(Section[2\.7](https://arxiv.org/html/2606.27711#S2.SS7)\)\. On Gift\-Eval\[[1](https://arxiv.org/html/2606.27711#bib.bib85)\], its estimators outperform the AutoARIMA procedure built on the same SARIMA family, an effect its authors describe as an emergent “student\-beats\-teacher” generalization\. The framework developed here clarifies this result\. AutoARIMA is a plug\-in procedure, and therefore a suboptimal approximation to the Bayes forecaster for the SARIMA world, the rule that integrates over order and parameter uncertainty\. A network trained by risk minimization over that world approximates the Bayes forecaster directly\.

### 8\.4Discussion

Recent progress in neural forecasting has come along two fronts: increasingly expressive architectures, and the enlargement of the world a model is trained on, from a single dataset to large pretraining corpora\. Both are, in part, inspired by foundation models for language and vision, where scaling the model and the data has been the dominant lever\. One premise of that program does not carry over to time series applications: language and vision are supplied with vast real data, whereas real time series corpora are comparatively scarce, domain\-specific, and constrained by licensing\. The synthetically trained forecasters of the previous subsection are themselves a response to this, and once that step is taken the generative world becomes a central object of design\.

TabPFN\[[29](https://arxiv.org/html/2606.27711#bib.bib59)\], a tabular foundation model, shows that a deliberately constructed synthetic prior can equip an estimator with inductive biases that transfer strongly to real tasks\. Yet across all of the methods surveyed here, the worldΠ\\Piis varied while the objective is held fixed: each is an approximate minimizer of average risk over its world\.

The contribution of this paper is to make the estimator itself a first\-class object of design\. Because the world is simulated, repeated trajectories from any fixed processω\\omegaare available at training time, and the estimator can be trained to target specific statistical properties \(Section[2\.5](https://arxiv.org/html/2606.27711#S2.SS5)\)\. Because the data\-generating process is known and can be resampled at will, these properties are verifiable: unbiasedness or coverage at a givenω\\omegacan be checked directly by Monte Carlo\. A candidate world can then be judged by whether the estimator trained on it meets a stated, in\-world property, not only by benchmark accuracy\.

This also sharpens a question the framework is equipped to study: whether a general, domain\-agnostic prior of the kind TabPFN employs, or a world deliberately matched to a deployment domain, is the better foundation for a given time series task\.

## 9Conclusions

This paper introduced the*Simulacrum*, a simulation\-based framework for training neural networks as estimators of structural time\-series models\. Rather than treating neural networks as generic black\-box predictors, the framework combines a “generative world” \(i\.e\. a generalization of data\-generating processes\), with a decision objective to learn estimators that approximate optimal finite\-sample decision rules\. This makes it possible to learn estimators for objectives that are central to statistical practice but often difficult to achieve with classical procedures, including low forecast risk, bias reduction, minimax behavior, and process\-level calibration\. The approach can be used not only to improve estimation within existing model classes, but also to derive estimators for novel classes where even suitable procedures are analytically unavailable or computationally intractable\.

Across the applications studied here, the same underlying principle leads to strong and coherent results in both simulated and real\-data settings\. For exponential smoothing models, neural estimators improve forecast accuracy and can be trained toward lower bias or lower worst\-case risk\. For the AR\(p\)\(p\)class, the framework yields estimators with substantially improved finite\-sample behavior, including near\-unbiased parameter estimation and near\-uniform calibration\. For forecast combination, it provides a direct way to learn combination rules in settings where the corresponding optimal estimator is analytically unavailable or impractical to compute, achieving state\-of\-the\-art forecasting accuracy in major benchmarking datasets\.

The results also suggest a broader methodological point\. Much of the recent progress in forecasting has been framed as a choice between classical statistical models and highly flexible machine\-learning systems, but our findings indicate that this contrast is incomplete\. When estimation is improved sufficiently, simple structural models can remain highly competitive\.

More broadly, the paper argues for an additional role for neural methods in time\-series analysis\. The goal need not be to replace structural models with increasingly flexible forecasting architectures\. An alternative is to use neural networks to*solve*estimation problems for structural models that would otherwise be analytically intractable, numerically unstable, or computationally expensive\. In this view, simulation\-based pretraining extends rather than abandons the structural modeling tradition\. It preserves parsimony, interpretability, and auditability, while enlarging the set of statistical guarantees that can be pursued in practice\. Some of these guarantees, such as uniform calibration, coverage control, and robustness to worst\-case errors, are not only statistically desirable, but operational or regulatory requirements in high\-stakes settings\.

These capabilities also expand the scientific uses of structural time\-series models\. When finite\-sample bias and miscalibration are reduced, such models become more reliable tools for inference under the umbrella of analysis under temporally correlated error, including applications involving hypothesis testing and uncertainty quantification\. More generally, the framework suggests that pretrained estimators could serve as specialized components in larger analytical or agentic workflows, where structural hypotheses are proposed, solved, and evaluated through simulation\.

Several directions remain open\. One is to expand the class of generative worlds to include richer dynamics, contamination mechanisms, hybrid synthetic–empirical designs, and the study of performance under misspecification of those worlds\. Another direction is to broaden the class of decision objectives that can be targeted with the framework\. For example, recent work studies estimation problems in which individuals may strategically misreport their data, introducing game\-theoretic requirements such as incentive compatibility and robustness to adversarial reporting\[[26](https://arxiv.org/html/2606.27711#bib.bib52)\]\.

## References

- \[1\]T\. Aksu, G\. Woo, J\. Liu, X\. Liu, C\. Liu, S\. Savarese, C\. Xiong, and D\. Sahoo\(2024\)GIFT\-eval: a benchmark for general time series forecasting model evaluation\.InNeurIPS Workshop on Time Series in the Age of Large Models,External Links:[Link](https://openreview.net/forum?id=Z2cMOOANFX)Cited by:[§8\.3](https://arxiv.org/html/2606.27711#S8.SS3.SSSx3.p2.1)\.
- \[2\]A\. F\. Ansari, O\. Shchur, J\. Küken, A\. Auer, B\. Han, P\. Mercado, S\. S\. Rangapuram, H\. Shen, L\. Stella, X\. Zhang, M\. Goswami, S\. Kapoor, D\. C\. Maddix, P\. Guerron, T\. Hu, J\. Yin, N\. Erickson, P\. M\. Desai, H\. Wang, H\. Rangwala, G\. Karypis, Y\. Wang, and M\. Bohlke\-Schneider\(2025\)Chronos\-2: from univariate to universal forecasting\.External Links:2510\.15821,[Link](https://arxiv.org/abs/2510.15821)Cited by:[§5\.3](https://arxiv.org/html/2606.27711#S5.SS3.p3.1)\.
- \[3\]A\. F\. Ansari, L\. Stella, C\. Turkmen, X\. Zhang, P\. Mercado, H\. Shen, O\. Shchur, S\. S\. Rangapuram, S\. P\. Arango, S\. Kapoor,et al\.\(2024\)Chronos: learning the language of time series\.arXiv preprint arXiv:2403\.07815\.Cited by:[§1](https://arxiv.org/html/2606.27711#S1.p2.1),[3rd item](https://arxiv.org/html/2606.27711#S6.I3.i3.p1.1),[§7\.2](https://arxiv.org/html/2606.27711#S7.SS2.p2.1),[§8\.3](https://arxiv.org/html/2606.27711#S8.SS3.SSSx2.p1.1)\.
- \[4\]J\. Baxter\(2000\)A model of inductive bias learning\.Journal of artificial intelligence research12,pp\. 149–198\.Cited by:[§2](https://arxiv.org/html/2606.27711#S2.p1.1),[§8\.2](https://arxiv.org/html/2606.27711#S8.SS2.p2.4)\.
- \[5\]BCBS\(2023\)Basel framework: mar32 — internal models approach: backtesting and p&l attribution test requirements\.Note:[https://archive\.is/XFzo6](https://archive.is/XFzo6)Archived snapshot via archive\.is\. Original URL:[https://www\.bis\.org/basel\_framework/chapter/MAR/32\.htm?inforce=20230101&published=20200327](https://www.bis.org/basel_framework/chapter/MAR/32.htm?inforce=20230101&published=20200327)\. Accessed 2025\-11\-21Cited by:[1st item](https://arxiv.org/html/2606.27711#S6.I2.i1.p1.1)\.
- \[6\]J\. O\. Berger\(2013\)Statistical decision theory and bayesian analysis\.Springer Science & Business Media\.Cited by:[§1](https://arxiv.org/html/2606.27711#S1.p3.1),[§2](https://arxiv.org/html/2606.27711#S2.p1.1)\.
- \[7\]W\. Cheang and G\. C\. Reinsel\(2000\)Bias reduction of autoregressive estimates in time series regression model through restricted maximum likelihood\.Journal of the American Statistical Association95\(452\),pp\. 1173–1184\.Cited by:[§6](https://arxiv.org/html/2606.27711#S6.p1.2)\.
- \[8\]X\. Cheng, W\. Shen, H\. Chen, C\. Shen, J\. Ortega, J\. Liu, S\. Thomas, H\. Zheng, H\. Wu, Y\. Li, C\. Lichtendahl, J\. Ortiz, G\. Liu, H\. Qi, O\. Fatemieh, C\. Fry, and J\. J\. Long\(2025\)ARIMA\-plus: large\-scale, accurate, automatic and interpretable in\-database time series forecasting and anomaly detection in google bigquery\.External Links:2510\.24452,[Link](https://arxiv.org/abs/2510.24452)Cited by:[§8\.1](https://arxiv.org/html/2606.27711#S8.SS1.p2.1)\.
- \[9\]G\. Claeskens, J\. R\. Magnus, A\. L\. Vasnev, and W\. Wang\(2016\)The forecast combination puzzle: a simple theoretical explanation\.International Journal of Forecasting32\(3\),pp\. 754–762\.Cited by:[§7\.2](https://arxiv.org/html/2606.27711#S7.SS2.p1.1)\.
- \[10\]R\. T\. Clemen\(1989\)Combining forecasts: a review and annotated bibliography\.International journal of forecasting5\(4\),pp\. 559–583\.Cited by:[§1](https://arxiv.org/html/2606.27711#S1.p6.2),[§7\.2](https://arxiv.org/html/2606.27711#S7.SS2.p1.1)\.
- \[11\]K\. Cranmer, J\. Brehmer, and G\. Louppe\(2020\)The frontier of simulation\-based inference\.Proceedings of the National Academy of Sciences117\(48\),pp\. 30055–30062\.External Links:[Document](https://dx.doi.org/10.1073/pnas.1912789117)Cited by:[§8\.2](https://arxiv.org/html/2606.27711#S8.SS2.p2.2)\.
- \[12\]A\. Das, W\. Kong, R\. Sen, and Y\. Zhou\(2024\)A decoder\-only foundation model for time\-series forecasting\.InForty\-first International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2606.27711#S1.p2.1),[§8\.3](https://arxiv.org/html/2606.27711#S8.SS3.SSSx2.p1.1)\.
- \[13\]M\. Dennis, N\. Jaques, E\. Vinitsky, A\. Bayen, S\. Russell, A\. Critch, and S\. Levine\(2020\)Emergent complexity and zero\-shot transfer via unsupervised environment design\.Advances in neural information processing systems33,pp\. 13049–13061\.Cited by:[§2\.2](https://arxiv.org/html/2606.27711#S2.SS2.p1.1)\.
- \[14\]S\. Dooley, G\. S\. Khurana, C\. Mohapatra, S\. V\. Naidu, and C\. White\(2023\)Forecastpfn: synthetically\-trained zero\-shot forecasting\.Advances in Neural Information Processing Systems36,pp\. 2403–2426\.Cited by:[§8\.3](https://arxiv.org/html/2606.27711#S8.SS3.SSSx3.p1.2)\.
- \[15\]C\. Finn, P\. Abbeel, and S\. Levine\(2017\)Model\-agnostic meta\-learning for fast adaptation of deep networks\.InInternational conference on machine learning,pp\. 1126–1135\.Cited by:[§2](https://arxiv.org/html/2606.27711#S2.p1.1),[§8\.2](https://arxiv.org/html/2606.27711#S8.SS2.p2.4)\.
- \[16\]A\. R\. Gallant and G\. Tauchen\(1996\)Which moments to match?\.Econometric theory12\(4\),pp\. 657–681\.Cited by:[§8\.2](https://arxiv.org/html/2606.27711#S8.SS2.p1.1)\.
- \[17\]M\. Garnelo, D\. Rosenbaum, C\. Maddison, T\. Ramalho, D\. Saxton, M\. Shanahan, Y\. W\. Teh, D\. Rezende, and S\. A\. Eslami\(2018\)Conditional neural processes\.InInternational conference on machine learning,pp\. 1704–1713\.Cited by:[§8\.2](https://arxiv.org/html/2606.27711#S8.SS2.p2.4)\.
- \[18\]S\. Gershman and N\. Goodman\(2014\)Amortized inference in probabilistic reasoning\.InProceedings of the annual meeting of the cognitive science society,Vol\.36\.Cited by:[§2\.6](https://arxiv.org/html/2606.27711#S2.SS6.p2.5)\.
- \[19\]I\. Gibbs and E\. Candes\(2021\)Adaptive conformal inference under distribution shift\.Advances in Neural Information Processing Systems34,pp\. 1660–1672\.Cited by:[§3\.4](https://arxiv.org/html/2606.27711#S3.SS4.p2.5)\.
- \[20\]T\. Gneiting, F\. Balabdaoui, and A\. E\. Raftery\(2007\-04\)Probabilistic forecasts, calibration and sharpness\.Journal of the Royal Statistical Society Series B: Statistical Methodology69\(2\),pp\. 243–268\.External Links:ISSN 1369\-7412,[Document](https://dx.doi.org/10.1111/j.1467-9868.2007.00587.x),[Link](https://doi.org/10.1111/j.1467-9868.2007.00587.x),https://academic\.oup\.com/jrsssb/article\-pdf/69/2/243/49794500/jrsssb\_69\_2\_243\.pdfCited by:[§3\.4](https://arxiv.org/html/2606.27711#S3.SS4.p1.1)\.
- \[21\]T\. Gneiting\(2011\)Making and evaluating point forecasts\.Journal of the American Statistical Association106\(494\),pp\. 746–762\.Cited by:[§2\.3](https://arxiv.org/html/2606.27711#S2.SS3.p3.7)\.
- \[22\]R\. Godahewa, C\. Bergmeir, G\. I\. Webb, R\. J\. Hyndman, and P\. Montero\-Manso\(2021\)Monash time series forecasting archive\.InNeural Information Processing Systems Track on Datasets and Benchmarks,Cited by:[§7\.2\.1](https://arxiv.org/html/2606.27711#S7.SS2.SSS1.p1.1)\.
- \[23\]Google\(2025\)The art of SLOs handbook\.Google\.Note:Original URL:[https://static\.googleusercontent\.com/media/sre\.google/en//static/pdf/art\-of\-slos\-handbook\-a4\.pdf](https://static.googleusercontent.com/media/sre.google/en//static/pdf/art-of-slos-handbook-a4.pdf)External Links:[Link](https://web.archive.org/web/20250426162007/https://static.googleusercontent.com/media/sre.google/en//static/pdf/art-of-slos-handbook-a4.pdf)Cited by:[2nd item](https://arxiv.org/html/2606.27711#S6.I2.i2.p1.1)\.
- \[24\]C\. Gourieroux, A\. Monfort, and E\. Renault\(1993\)Indirect inference\.Journal of applied econometrics8\(S1\),pp\. S85–S118\.Cited by:[§8\.2](https://arxiv.org/html/2606.27711#S8.SS2.p1.1)\.
- \[25\]D\. Greenberg, M\. Nonnenmacher, and J\. Macke\(2019\)Automatic posterior transformation for likelihood\-free inference\.InInternational conference on machine learning,pp\. 2404–2414\.Cited by:[§8\.2](https://arxiv.org/html/2606.27711#S8.SS2.p2.2)\.
- \[26\]M\. Haviv\(2025\)Individual versus social optimization in statistical estimation\.Operations Research Letters61,pp\. 107284\.Cited by:[§9](https://arxiv.org/html/2606.27711#S9.p6.1)\.
- \[27\]H\. Hewamalage, C\. Bergmeir, and K\. Bandara\(2022\)Global models for time series forecasting: a simulation study\.Pattern Recognition124,pp\. 108441\.Cited by:[§8\.2](https://arxiv.org/html/2606.27711#S8.SS2.p3.1)\.
- \[28\]J\. A\. Hoeting, D\. Madigan, A\. E\. Raftery, and C\. T\. Volinsky\(1999\)Bayesian model averaging: a tutorial \(with comments by m\. clyde, david draper and ei george, and a rejoinder by the authors\.Statistical science14\(4\),pp\. 382–417\.Cited by:[§2\.1](https://arxiv.org/html/2606.27711#S2.SS1.SSS0.Px2.p1.4)\.
- \[29\]N\. Hollmann, S\. Müller, L\. Purucker, A\. Krishnakumar, M\. Körfer, S\. B\. Hoo, R\. T\. Schirrmeister, and F\. Hutter\(2025\)Accurate predictions on small data with a tabular foundation model\.Nature637\(8045\),pp\. 319–326\.Cited by:[§8\.3](https://arxiv.org/html/2606.27711#S8.SS3.SSSx3.p1.2),[§8\.4](https://arxiv.org/html/2606.27711#S8.SS4.p2.1)\.
- \[30\]S\. B\. Hoo, S\. Müller, D\. Salinas, and F\. Hutter\(2026\)From tables to time: extending tabpfn\-v2 to time series forecasting\.External Links:2501\.02945,[Link](https://arxiv.org/abs/2501.02945)Cited by:[§8\.3](https://arxiv.org/html/2606.27711#S8.SS3.SSSx3.p1.2)\.
- \[31\]G\. Huang, Z\. Liu, L\. Van Der Maaten, and K\. Q\. Weinberger\(2017\)Densely connected convolutional networks\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 4700–4708\.Cited by:[Appendix B](https://arxiv.org/html/2606.27711#A2.SSx2.p1.1),[§4](https://arxiv.org/html/2606.27711#S4.p2.1)\.
- \[32\]P\. J\. Huber\(1992\)Robust estimation of a location parameter\.InBreakthroughs in statistics: Methodology and distribution,pp\. 492–518\.Cited by:[§2\.7](https://arxiv.org/html/2606.27711#S2.SS7.p1.1)\.
- \[33\]R\. J\. Hyndman and Y\. Khandakar\(2008\)Automatic time series forecasting: the forecast package for r\.Journal of statistical software27,pp\. 1–22\.Cited by:[§7\.2](https://arxiv.org/html/2606.27711#S7.SS2.p4.5)\.
- \[34\]R\. J\. Hyndman, A\. B\. Koehler, R\. D\. Snyder, and S\. Grose\(2002\)A state space framework for automatic forecasting using exponential smoothing methods\.International Journal of forecasting18\(3\),pp\. 439–454\.Cited by:[§5\.3](https://arxiv.org/html/2606.27711#S5.SS3.p1.5),[Table 2](https://arxiv.org/html/2606.27711#S5.T2),[Table 2](https://arxiv.org/html/2606.27711#S5.T2.3.2)\.
- \[35\]Y\. Kang, R\. J\. Hyndman, and F\. Li\(2020\)GRATIS: generating time series with diverse and controllable characteristics\.Statistical Analysis and Data Mining: The ASA Data Science Journal13\(4\),pp\. 354–376\.Cited by:[§8\.2](https://arxiv.org/html/2606.27711#S8.SS2.p3.1)\.
- \[36\]L\. Kilian\(1998\)Small\-sample confidence intervals for impulse response functions\.Review of economics and statistics80\(2\),pp\. 218–230\.Cited by:[§3\.3](https://arxiv.org/html/2606.27711#S3.SS3.p1.2)\.
- \[37\]R\. Koenker and G\. Bassett Jr\(1978\)Regression quantiles\.Econometrica: journal of the Econometric Society,pp\. 33–50\.Cited by:[§3\.4](https://arxiv.org/html/2606.27711#S3.SS4.p5.2)\.
- \[38\]J\. W\. Lichstein, T\. R\. Simons, S\. A\. Shriner, and K\. E\. Franzreb\(2002\)Spatial autocorrelation and autoregressive models in ecology\.Ecological monographs72\(3\),pp\. 445–463\.Cited by:[§6](https://arxiv.org/html/2606.27711#S6.p1.2)\.
- \[39\]B\. Lim, S\. Ö\. Arık, N\. Loeff, and T\. Pfister\(2021\)Temporal fusion transformers for interpretable multi\-horizon time series forecasting\.International journal of forecasting37\(4\),pp\. 1748–1764\.Cited by:[§1](https://arxiv.org/html/2606.27711#S1.p2.1),[§8\.3](https://arxiv.org/html/2606.27711#S8.SS3.SSSx1.p1.3)\.
- \[40\]S\. Makridakis, A\. Andersen, R\. Carbone, R\. Fildes, M\. Hibon, R\. Lewandowski, J\. Newton, E\. Parzen, and R\. Winkler\(1982\)The accuracy of extrapolation \(time series\) methods: results of a forecasting competition\.Journal of forecasting1\(2\),pp\. 111–153\.Cited by:[§5\.3](https://arxiv.org/html/2606.27711#S5.SS3.p1.5)\.
- \[41\]S\. Makridakis, E\. Spiliotis, and V\. Assimakopoulos\(2018\)The m4 competition: results, findings, conclusion and way forward\.International Journal of forecasting34\(4\),pp\. 802–808\.Cited by:[§7\.2](https://arxiv.org/html/2606.27711#S7.SS2.p1.1)\.
- \[42\]F\. Marriott and J\. Pope\(1954\)Bias in the estimation of autocorrelations\.Biometrika41\(3/4\),pp\. 390–402\.Cited by:[§1](https://arxiv.org/html/2606.27711#S1.p3.1),[§3\.3](https://arxiv.org/html/2606.27711#S3.SS3.p1.2),[§6](https://arxiv.org/html/2606.27711#S6.p4.1)\.
- \[43\]P\. Montero\-Manso, G\. Athanasopoulos, R\. J\. Hyndman, and T\. S\. Talagala\(2020\)FFORMA: feature\-based forecast model averaging\.International Journal of Forecasting36\(1\),pp\. 86–92\.Cited by:[§8\.1](https://arxiv.org/html/2606.27711#S8.SS1.p1.1)\.
- \[44\]P\. Montero\-Manso and R\. J\. Hyndman\(2021\)Principles and algorithms for forecasting groups of time series: locality and globality\.International Journal of Forecasting37\(4\),pp\. 1632–1653\.Cited by:[§8\.1](https://arxiv.org/html/2606.27711#S8.SS1.p1.1)\.
- \[45\]P\. Montero\-Manso\(2022\-12\)Big machine learning models that learn to estimate and forecast a class of autoregressive processes\.InBook of Abstracts of the 15th International Conference of the ERCIM WG on Computational and Methodological Statistics \(CMStatistics 2022\),London, UK\.Note:Accessed via Internet ArchiveExternal Links:[Link](https://web.archive.org/web/20231007230543/https://www.cmstatistics.org/CMStatistics2022/docs/BoACFECMStatistics2022.pdf)Cited by:[The Simulacrum: Decision‑Theoretic Pretraining for Near‑Optimal Time‑Series Forecasting and Inference††thanks:Preliminary versions of the core ideas in this paper were presented in earlier conference and invited talks, including CMStatistics 2022\[45\], the International Symposium on Forecasting 2023\[47\], and the M6 Competition Conference 2023\[46\]\. The present paper provides the first unified decision\-theoretic formulation of this framework, together with comprehensive empirical validation\.](https://arxiv.org/html/2606.27711#id1.id1)\.
- \[46\]P\. Montero\-Manso\(2023\-11\)Large pre\-trained models achieve near\-optimal time series forecasting\.New York, NY, USA\.Note:Presented at The Future of Forecasting and the M6 Competition ConferenceCited by:[The Simulacrum: Decision‑Theoretic Pretraining for Near‑Optimal Time‑Series Forecasting and Inference††thanks:Preliminary versions of the core ideas in this paper were presented in earlier conference and invited talks, including CMStatistics 2022\[45\], the International Symposium on Forecasting 2023\[47\], and the M6 Competition Conference 2023\[46\]\. The present paper provides the first unified decision\-theoretic formulation of this framework, together with comprehensive empirical validation\.](https://arxiv.org/html/2606.27711#id1.id1)\.
- \[47\]P\. Montero\-Manso\(2023\-06\)Pre\-trained deep networks outperform true models when predicting time series processes\.InBook of Abstracts of the 43rd International Symposium on Forecasting \(ISF 2023\),Charlottesville, VA, USA\.Note:Accessed via Internet ArchiveExternal Links:[Link](https://web.archive.org/web/20240226091729/https://isf.forecasters.org/wp-content/uploads/BookOfAbstractsISF2023.pdf)Cited by:[The Simulacrum: Decision‑Theoretic Pretraining for Near‑Optimal Time‑Series Forecasting and Inference††thanks:Preliminary versions of the core ideas in this paper were presented in earlier conference and invited talks, including CMStatistics 2022\[45\], the International Symposium on Forecasting 2023\[47\], and the M6 Competition Conference 2023\[46\]\. The present paper provides the first unified decision\-theoretic formulation of this framework, together with comprehensive empirical validation\.](https://arxiv.org/html/2606.27711#id1.id1)\.
- \[48\]M\. M\. Müller, G\. Specht, L\. Kleinheinz, and J\. Walde\(2024\)Bias correction and machine learning in ar \(1\) estimation: bridging traditional and modern techniques\.InGvDB,Cited by:[§6](https://arxiv.org/html/2606.27711#S6.p4.1)\.
- \[49\]S\. Muller, N\. Hollmann, S\. Arango, J\. Grabocka, and F\. Hutter\(2022\)Transformers can do Bayesian inference\.InThe Tenth International Conference on Learning Representations \(ICLR’22\),Cited by:[§1](https://arxiv.org/html/2606.27711#S1.p5.1),[§8\.3](https://arxiv.org/html/2606.27711#S8.SS3.SSSx3.p1.2)\.
- \[50\]B\. N\. Oreshkin, M\. Jauhari, R\. K\. Selvam, M\. Wolff, W\. Pan, S\. Ramasubramanian, K\. G\. Olivares, T\. Konstantinova, A\. Potapczynski, M\. Cao,et al\.\(2026\)Zero\-shot forecasting by simulation alone\.arXiv preprint arXiv:2601\.00970\.Cited by:[§8\.3](https://arxiv.org/html/2606.27711#S8.SS3.SSSx3.p2.1)\.
- \[51\]B\. N\. Oreshkin, D\. Carpov, N\. Chapados, and Y\. Bengio\(2020\)N\-beats: neural basis expansion analysis for interpretable time series forecasting\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=r1ecqn4YwB)Cited by:[§1](https://arxiv.org/html/2606.27711#S1.p2.1),[3rd item](https://arxiv.org/html/2606.27711#S6.I3.i3.p1.1),[§8\.3](https://arxiv.org/html/2606.27711#S8.SS3.SSSx1.p1.3)\.
- \[52\]S\. T\. Radev, U\. K\. Mertens, A\. Voss, L\. Ardizzone, and U\. Köthe\(2020\)BayesFlow: learning complex stochastic models with invertible neural networks\.IEEE transactions on neural networks and learning systems33\(4\),pp\. 1452–1466\.Cited by:[§8\.2](https://arxiv.org/html/2606.27711#S8.SS2.p2.2)\.
- \[53\]M\. Sainsbury\-Dale, A\. Zammit\-Mangion, and R\. Huser\(2024\)Likelihood\-free parameter estimation with neural bayes estimators\.The American Statistician78\(1\),pp\. 1–14\.Cited by:[§1](https://arxiv.org/html/2606.27711#S1.p5.1),[§8\.2](https://arxiv.org/html/2606.27711#S8.SS2.p2.2)\.
- \[54\]D\. Salinas, V\. Flunkert, J\. Gasthaus, and T\. Januschowski\(2020\)DeepAR: probabilistic forecasting with autoregressive recurrent networks\.International Journal of Forecasting36\(3\),pp\. 1181–1191\.Cited by:[§1](https://arxiv.org/html/2606.27711#S1.p2.1),[§8\.3](https://arxiv.org/html/2606.27711#S8.SS3.SSSx1.p1.3)\.
- \[55\]P\. Shaman and R\. A\. Stine\(1988\)The bias of autoregressive coefficient estimators\.Journal of the American Statistical Association83\(403\),pp\. 842–848\.Cited by:[§3\.3](https://arxiv.org/html/2606.27711#S3.SS3.p1.2)\.
- \[56\]K\. Sherratt, H\. Gruson, H\. Johnson, R\. Niehus, B\. Prasse, F\. Sandmann, J\. Deuschel, D\. Wolffram, S\. Abbott, A\. Ullrich,et al\.\(2023\)Predictive performance of multi\-model ensemble forecasts of covid\-19 across european nations\.Elife12,pp\. e81916\.Cited by:[§7\.2](https://arxiv.org/html/2606.27711#S7.SS2.p1.1)\.
- \[57\]J\. Smith and K\. F\. Wallis\(2009\)A simple explanation of the forecast combination puzzle\.Oxford bulletin of economics and statistics71\(3\),pp\. 331–355\.Cited by:[§1](https://arxiv.org/html/2606.27711#S1.p6.2),[§7\.2](https://arxiv.org/html/2606.27711#S7.SS2.p1.1)\.
- \[58\]S\. Smyl, C\. Bergmeir, A\. Dokumentov, X\. Long, E\. Wibowo, and D\. Schmidt\(2025\)Local and global trend bayesian exponential smoothing models\.International Journal of Forecasting41\(1\),pp\. 111–127\.Cited by:[§8\.1](https://arxiv.org/html/2606.27711#S8.SS1.p2.1)\.
- \[59\]S\. Smyl\(2020\)A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting\.International journal of forecasting36\(1\),pp\. 75–85\.Cited by:[§8\.1](https://arxiv.org/html/2606.27711#S8.SS1.p1.1)\.
- \[60\]S\. H\. Sørbye, P\. G\. Nicolau, and H\. Rue\(2022\)Finite\-sample properties of estimators for first and second order autoregressive processes\.Statistical Inference for Stochastic Processes25\(3\),pp\. 577–598\.Cited by:[§6](https://arxiv.org/html/2606.27711#S6.p4.1)\.
- \[61\]T\. S\. Talagala, F\. Li, and Y\. Kang\(2022\)FFORMPP: feature\-based forecast model performance prediction\.International Journal of Forecasting38\(3\),pp\. 920–943\.Cited by:[§8\.2](https://arxiv.org/html/2606.27711#S8.SS2.p3.1)\.
- \[62\]J\. Tobin, R\. Fong, A\. Ray, J\. Schneider, W\. Zaremba, and P\. Abbeel\(2017\)Domain randomization for transferring deep neural networks from simulation to the real world\.In2017 IEEE/RSJ international conference on intelligent robots and systems \(IROS\),pp\. 23–30\.Cited by:[§2\.2](https://arxiv.org/html/2606.27711#S2.SS2.p1.1)\.
- \[63\]B\. Van Calster, D\. J\. McLernon, M\. Van Smeden, L\. Wynants, E\. W\. Steyerberg, T\. G\. ‘\. diagnostic tests, and prediction models’ of the STRATOS initiative Bossuyt Patrick Collins Gary S\. Macaskill Petra McLernon David J\. Moons Karel GM Steyerberg Ewout W\. Van Calster Ben van Smeden Maarten Vickers Andrew J\.\(2019\)Calibration: the achilles heel of predictive analytics\.BMC medicine17\(1\),pp\. 230\.Cited by:[3rd item](https://arxiv.org/html/2606.27711#S6.I2.i3.p1.1)\.
- \[64\]V\. Vovk, A\. Gammerman, and G\. Shafer\(2005\)Algorithmic learning in a random world\.Springer\.Cited by:[§3\.4](https://arxiv.org/html/2606.27711#S3.SS4.p2.5)\.
- \[65\]A\. Wald\(1950\)Statistical decision functions\.InBreakthroughs in Statistics: Foundations and Basic Theory,pp\. 342–357\.Cited by:[§1](https://arxiv.org/html/2606.27711#S1.p3.1),[§2](https://arxiv.org/html/2606.27711#S2.p1.1),[§3\.5](https://arxiv.org/html/2606.27711#S3.SS5.p1.2)\.

## Appendix ASimultaneous Confidence and Prediction Bands

Pointwise coverage for neural forecasts and estimates can be targeted with the calibration\-adjusted quantile loss\. There are cases when we are interested in coverage along the entire forecast path, representing the chance for “at least one” time\-step along the forecast horizon to be outside of the band, for which pointwise intervals can substantially undercover the full forecast path\. Prediction bands are more complex than pointwise, they rely on the joint forecast distribution, which tends to be intractable to model explicitly from data\. In time series, it is very common that samples from the joint distribution of the forecast path are computed by recursively sampling from the more tractable one\-step ahead forecast distribution\. This raises an empirical tradeoff: the accumulation of errors in the one\-step ahead recursive forecasts vs tractability of direct multi\-horizon joint distributions\. The neural network can be trained specifically for this problem\. The forecast path until the desired horizon can be set as the target of the neural network in the training scheme and then bands can be induced by a special loss function\. The loss function for the bands is computed by minimizing the maximum pinball loss across the forecast horizons \(or across the parameter set for estimations\)\.

ℒbandτ,ℋ=maxh∈ℋ⁡ℒquantileτ​\(h\)\\mathcal\{L\}\_\{\\text\{band\}\}^\{\\tau,\\mathcal\{H\}\}=\\max\_\{h\\in\\mathcal\{H\}\}\\mathcal\{L\}\_\{\\text\{quantile\}\}^\{\\tau\}\(h\)
Minimizing this loss leads to bands with good simultaneous coverage for the prediction setℋ\\mathcal\{H\}, on average across the distribution of the parameter spaceπ​\(θ\)\\pi\(\\theta\)\. Confidence bands can be obtained analogously\.

## Appendix BNeural Architectures: Invariances and Equivariances

We use neural networks as function approximators; because simulations allow arbitrarily large synthetic datasets, other universal function approximators may work\. However, neural architecture matters because it can: \(i\) enforces desired estimation properties \(invariances/equivariances\) that are hard to learn by simulation alone; \(ii\) improves generalization under misspecification via structural bias; and \(iii\) reduces training time by shrinking the effective hypothesis space\. We focus on four properties and their practical implementations\.

### Properties

For the sake of brevity, we present equivariance properties, usually desirable for estimation targets such as forecasts\. Invariances, usually required for parameter estimation, can be induced analogously\.

#### Affine Equivariance

For location/scale transformations\(c,s\)\(c,s\)we target

F​\(s⋅y1:T\+c\)=s⋅F​\(y1:T\)\+c,F\(s\\cdot y\_\{1:T\}\+c\)=s\\cdot F\(y\_\{1:T\}\)\+c,\(11\)which is a known property of many models such as ARIMA or ETS\. In simulations, it removes the need to simulate for a vast range of parameters such as the variance of the innovation distribution or initial states\.Implementation:reversible input normalization \(e\.g\., per\-series affine rescaling\) and output de\-normalization, such as min\-max normalization\.

#### Additive Equivariance

Useful when there is a known deterministic effect in the model, such as a linear trend, or fixed seasonality, so the neural estimator may learn all possible trends and seasonality without needing to simulate all such effects explicitly\.

F​\(y1:T\+B​β\)=F​\(y1:T\)\+B​β,F\(y\_\{1:T\}\+B\\beta\)=F\(y\_\{1:T\}\)\+B\\beta,\(12\)Implementation:a small regression head estimatesβ^\\hat\{\\beta\}; the network processes residualsy−B​β^y\-B\\hat\{\\beta\}and re\-addsB​β^B\\hat\{\\beta\}at the output\.

#### Time\-Translation Equivariance

For the shift operator\(𝒯Δ​y\)t=yt−Δ\(\\mathcal\{T\}\_\{\\Delta\}y\)\_\{t\}=y\_\{t\-\\Delta\}, useful in finite order models such as AR\(pp\) or more general models with finite context\. Under misspecification, it can impose the guarantee that the output is a function of limited recent observations\.

F​\(𝒯Δ​y1:T\)=𝒯Δ​F​\(y1:T\),F\(\\mathcal\{T\}\_\{\\Delta\}y\_\{1:T\}\)=\\mathcal\{T\}\_\{\\Delta\}F\(y\_\{1:T\}\),\(13\)Implementation:causal 1D CNNs with kernel size matching the desired context length in the output layer\.

#### Transformation Equivariance

Useful for processes that become approximately additive/Gaussian after a transformation \(e\.g\. Box\-Cox or Yeo–Johnson transformations\)\. We target

F​\(gλ​\(y\)\)=gλ​\(F​\(y\)\),F\\\!\\left\(g\_\{\\lambda\}\(y\)\\right\)=g\_\{\\lambda\}\\\!\\left\(F\(y\)\\right\),\(14\)wheregλg\_\{\\lambda\}is a parametric, invertible transform\. Implementation:a learnable, invertible head appliesgλg\_\{\\lambda\}to inputs andgλ−1g\_\{\\lambda\}^\{\-1\}to outputs\.

### Architecture used in the experiments

All experiments in this work use a*feedforward*DenseNet\[[31](https://arxiv.org/html/2606.27711#bib.bib32)\]\(no convolutions\)\. While the framework supports the various inductive biases described above, in these experiments we strictly enforce onlylocation and scale equivariance\(unless stated otherwise\)\. This simplifies the performance comparison; the implementation of other architectural constraints \(such as those specific to certain processes\) is left for future work\.

We apply reversible per\-series min\-max normalization for location/scale handling, and we support variable\-length inputs by:

- •padding shorter series to a fixed window \(e\.g\.,Tmax=64T\_\{\\max\}=64\);
- •encoding the padded values with the special value \-1, distinct from the \[0,1\] range induced by min\-max normalization\.

## Appendix CAdditional Experiments Exponential Smoothing

### C\.1Inference for Simple Exponential Smoothing

Comparison on inference performance, MSE andBias2\\text\{Bias\}^\{2\}, averaging acrossα\\alphais seen in Table[10](https://arxiv.org/html/2606.27711#A3.T10)\. The top performing method for MSE is the Neural\-Unif that precisely tries to optimize this quantity\. The Neural\-Debias estimator has strongly reduced the bias at a MSE cost compared to Neural\-Unif, as expected from Figure[2](https://arxiv.org/html/2606.27711#S5.F2)\(right panel\)\. Maximum Likelihood is dominated overall for MSE andbias2\\text\{bias\}^\{2\}by all neural estimators at this sample size\. The Neural\-Minimax estimator sits in an interesting position in the bias\-variance tradeoff compared to the top performer Neural\-Unif: in terms of Root MSE, it is only 5% worse but it is 45% better in terms of Rootbias2\\text\{bias\}^\{2\}\.

Table 10:Inference errors forα\\alphain the SES model\. Neural\-Unif outperforms the equivalent Bayesian method\. Neural\-Debias largely reduced bias\. Neural Minimax sits at a good bias\-variance tradeoff\.![Refer to caption](https://arxiv.org/html/2606.27711v1/fig/ses_length_mse.png)Figure 9:Squared Prediction Error of SES\-MLE \(blue\) and Neural estimator \(orange\) vs Time Series Length\.

## Appendix DEmpirical accuracy of Neural\-ETS in M1 competition dataset

Table 11:Average MAPE: all M1 Competition data \(1001\)\. The Neural\-ETS estimator gets a strong performance compared to other methods that participated in the M1 Competition\.

Similar Articles

Mind the Sim-to-Real Gap & Think Like a Scientist

arXiv cs.AI

This paper studies when and how a planner should supplement a pre-trained simulator with real experiments in sequential decision problems, proposing Fisher-SEP to minimize posterior variance of a target policy's value.

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

arXiv cs.AI

Introduces ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, designed to provide controlled, immediately resolvable tasks for evaluating probabilistic reasoning in AI systems.

How Good Can Linear Models Be for Time-Series Forecasting?

Hugging Face Daily Papers

This paper demonstrates that careful preprocessing—especially context length selection, normalization, and regularization—can make simple linear models like Ridge regression competitive with or superior to large Transformer, MLP, and CNN models on time-series forecasting benchmarks.

A decoder-only foundation model for time-series forecasting

Papers with Code Trending

This article presents a research paper on Time-Series Foundation Model (TimeFM), a decoder-only model that achieves near-optimal zero-shot performance across diverse time-series datasets by adapting large language model techniques.

Forecasting Future Behavior as a Learning Task

Hugging Face Daily Papers

This paper proposes training Behavior Forecasters to predict large reasoning model outputs from single trajectories, outperforming large language models like GPT-5.4 and Claude Opus-4.6 at lower computational cost, bypassing traditional explainability methods.