Assessing the Operational Viability of Foundation Models for Time Series Forecasting

arXiv cs.LG 05/26/26, 04:00 AM Papers
Summary
This paper presents an applied evaluation of foundation models for time series forecasting compared to supervised approaches across four operational domains, and proposes a Complexity Router to selectively assign series to the optimal model class for balancing accuracy and inference cost.
arXiv:2605.24381v1 Announce Type: new Abstract: Time series forecasting drives operational decisions in areas like finance, transportation, and energy. While supervised learning approaches achieve strong performance, they require domain-specific training, feature engineering, and ongoing maintenance. Large-scale foundation models have recently emerged as a zero-shot alternative, avoiding task-specific training much like LLMs. In this work, we evaluate foundation models against standard supervised approaches. Rather than focusing solely on aggregate accuracy, we analyze performance across four operational regimes: periodic human-centric systems, physically constrained processes, stochastic financial markets, and heterogeneous demand forecasting. Our results characterize optimal deployment areas. Foundation models perform well in domains with transferable periodic structures and are efficient for cold-start or long-tail scenarios. Conversely, supervised specialists maintain higher precision in systems governed by strict physical constraints. In financial domains, newer foundation models are rapidly closing the performance gap with supervised specialists. We further quantify trade-offs in inference latency, data drift adaptability, and deployment constraints. Finally, we propose a Complexity Router that assigns each series to the optimal model class using empirical features. We demonstrate that this selective routing achieves higher accuracy and significantly lower inference costs compared to deploying a universal foundation model, providing a practical framework for balancing generalization and efficiency.
Original Article
View Cached Full Text
Cached at: 05/26/26, 09:05 AM
# Assessing the Operational Viability of Foundation Models for Time Series Forecasting
Source: [https://arxiv.org/html/2605.24381](https://arxiv.org/html/2605.24381)
Debanshu DasGoogle, USA \{kavinsoni, debanshu, vamshigud\}@google\.comVamshi GuduguntlaGoogle, USA \{kavinsoni, debanshu, vamshigud\}@google\.com

###### Abstract

Time series forecasting remains central to operational decision\-making across domains such as financial planning, transportation, and energy systems\. While supervised learning approaches including gradient\-boosted trees and recurrent neural networks achieve strong empirical performance, they require domain\-specific training pipelines, feature engineering, and ongoing maintenance\. Recently, large\-scale foundation models have emerged as a potential alternative, enabling zero\-shot forecasting without task\-specific training, conceptually similar to Large Language Models \(LLMs\)\. In this work, we present an applied evaluation of foundation models compared against industry\-standard supervised approaches\. Rather than focusing solely on aggregate accuracy, we analyze performance across four operational regimes: periodic human\-centric systems \(traffic\), physically constrained processes \(energy\), stochastic financial markets, and heterogeneous demand forecasting \(M4\)\. Our results characterize optimal deployment regimes: Foundational Models demonstrate competitive performance in domains with transferable periodic structure and offer operational efficiencies in cold\-start or long\-tail scenarios\. Conversely, supervised specialists maintain higher precision in systems governed by strict physical constraints\. In high\-entropy financial domains, newer foundation model architectures are rapidly closing the gap with supervised specialists\. Furthermore, we quantify the trade\-offs in inference latency, adaptability to data drift, and deployment constraints\. These findings provide a framework for practitioners to determine the optimal balance between foundation models and task\-specific supervised pipelines\. Building on these findings, we propose a Complexity Router that assigns each series to the optimal model class using empirically\-derived features\. We demonstrate that selective routing achieves higher accuracy than universal foundation model deployment at significantly lower inference cost, providing practitioners with a principled framework for balancing generalization and efficiency\.

## 1Introduction

The ability to accurately forecast future values based on historical temporal data is central to modern operational and strategic decision\-making\. From managing grid stability in energy systems and optimizing traffic flow in urban networks to hedging risk in volatile financial markets, robust forecasting underpins efficiency, planning, and risk management\. As high\-frequency sensor data and economic indicators become ubiquitous, the demand for scalable and automated forecasting solutions continues to grow\.

Historically, this challenge has been addressed through two primary paradigms\. First, classical statistical methods, such as Seasonal Naive \(SNAIVE\)\[[8](https://arxiv.org/html/2605.24381#bib.bib6)\]and Auto\-Regressive Integrated Moving Average \(ARIMA\)\[[3](https://arxiv.org/html/2605.24381#bib.bib15)\], provide robust and interpretable baselines, but often struggle to capture complex, non\-linear dependencies in high\-dimensional data\. Second, supervised machine learning approaches, particularly gradient\-boosted trees \(e\.g\., XGBoost\) and recurrent neural networks \(e\.g\., LSTMs\), have demonstrated strong empirical performance by learning intricate temporal patterns\. However, these specialist models are inherently resource\-intensive and require bespoke data pipelines, extensive feature engineering, and repeated retraining for each task or domain\.

More recently, foundation models have emerged as a third paradigm for time series forecasting\. Inspired by advances in Natural Language Processing, these models aim to enable inference by leveraging large\-scale pre\-training on diverse time series data\. In principle, such models can generate forecasts for previously unseen series without task\-specific training, offering the potential to reduce engineering overhead and time\-to\-deployment compared to traditional forecasting pipelines\.

This paper presents a focused, independent evaluation of this generalist versus specialist trade\-off\. We isolate the performance of foundation models in an inference\-only setting \(no gradient updates\) and compare it against commonly deployed supervised approaches \(XGBoost, LSTM, PatchTST, DLinear\)\. Rather than conducting a broad survey, we analyze model behavior across four distinct regimes: traffic forecasting with strong periodicity, energy systems governed by physical constraints, exchange rates characterized by high volatility, and heterogeneous demand forecasting from the M4 dataset\.

From an applied perspective, we present a regime\-based analysis across four distinct operational environments: periodic traffic, thermodynamic energy systems, stochastic financial markets, and heterogeneous demand\. We demonstrate that model dominance is not random but governed by the underlying data generating process\. Building on these empirical findings, we propose the Complexity Router, a deployment architecture that assigns forecast tasks to the optimal model class using features derived from an analysis of 5,089 series\. By treating Foundation Models and Supervised Baselines as complementary components within a single system, rather than competing alternatives, we identify a Pareto\-dominant configuration that optimizes both accuracy and operational cost in large\-scale forecasting systems\.

## 2Related Work

### 2\.1Statistical and Supervised Baselines

For decades, Makridakis Competitions \(M3, M4, M5\)\[[12](https://arxiv.org/html/2605.24381#bib.bib17),[13](https://arxiv.org/html/2605.24381#bib.bib7),[14](https://arxiv.org/html/2605.24381#bib.bib8)\]have served as a standard benchmark for forecasting progress\. Purely statistical methods, such as ARIMA and Exponential Smoothing \(ETS\), held dominance until the M4 competition marked a turning point where hybrid approaches \(e\.g\., ES\-RNN\) began to outperform pure statistics\[[17](https://arxiv.org/html/2605.24381#bib.bib18)\]\. The M5 competition \(focused on retail sales\) further cemented the superiority of machine learning, particularly Gradient Boosted Trees \(LightGBM\) and Deep Learning, when sufficient data is available\. However, these “Specialist” models are fundamentally limited by their need for task\-specific training and extensive feature engineering\.

### 2\.2Deep Learning

Prior to Foundation Models, deep learning architectures evolved to address the scaling limitations of statistical models\. Architectures like N\-BEATS\[[16](https://arxiv.org/html/2605.24381#bib.bib13)\]and PatchTST\[[15](https://arxiv.org/html/2605.24381#bib.bib11)\]introduced mechanisms to capture long\-term dependencies effectively\. However, the field has seen robust debate regarding the efficacy of Transformers for time series\. Zeng et al\. \(2023\) famously argued that simple linear models \(DLinear\) could outperform complex Transformers\[[20](https://arxiv.org/html/2605.24381#bib.bib19)\], questioning the utility of attention mechanisms in low\-semantic density data\. Our work re\-evaluates this hypothesis in the era of Pre\-Trained Transformers, assessing whether massive scale overcomes the architectural skepticism raised by earlier studies\.

### 2\.3Foundation Models and Recent Benchmarks

The shift to inference\-only setting capabilities represents the current frontier\. TimesFM\[[6](https://arxiv.org/html/2605.24381#bib.bib1)\]leverages a decoder\-only architecture trained on over 100 billion real\-world time points, contrasting with Chronos\[[2](https://arxiv.org/html/2605.24381#bib.bib2)\], which quantizes series into discrete tokens to leverage T5\-based LLM architectures, and Moirai\[[18](https://arxiv.org/html/2605.24381#bib.bib3)\], which focuses on any\-variate attention for diverse frequencies\.

Recent benchmarking efforts have attempted to evaluate this landscape\. GIFT\-Eval \(2024\) and FoundTS \(2025\)\[[1](https://arxiv.org/html/2605.24381#bib.bib20),[22](https://arxiv.org/html/2605.24381#bib.bib21)\]provided the first large\-scale comparisons of these foundation models\. However, these benchmarks largely focus on ranking the foundation models against each other \(e\.g\., “Does Chronos beat TimesFM?”\)\. Our work differentiates itself by focusing on theStructural Trade\-offbetween the Generalist class and the Specialist class\. Rather than simply declaring a winner, we analyze the regimes \(Physics vs\. Periodicity\) and operational constraints \(Latency, Privacy\) that dictate when a practitioner should abandon the Foundational model paradigm in favor of classical Feature Engineering\.

## 3Methodology

### 3\.1Problem Formulation

Given a historical time seriesXhist=\[x1,x2,…,xT\]X\_\{hist\}=\[x\_\{1\},x\_\{2\},\.\.\.,x\_\{T\}\], the goal is to predict the future horizonHH, denoted asXpred=\[xT\+1,…,xT\+H\]X\_\{pred\}=\[x\_\{T\+1\},\.\.\.,x\_\{T\+H\}\]\. For standard supervised baselines \(e\.g\., XGBoost\), a model is trained on a training split ofXXto minimize a loss functionL\(Xtrue,fθ\(Xhist\)\)L\(X\_\{true\},f\_\{\\theta\}\(X\_\{hist\}\)\)\. In the foundation model paradigm, we employ a pre\-trained modelFpreF\_\{pre\}in aninference\-only mode\. The forecast is generated strictly asFpre\(Xhist\)F\_\{pre\}\(X\_\{hist\}\), without performing any task\-specific gradient updates or fine\-tuning on the target datasetXX\.

### 3\.2Datasets and Domain Diversity

To evaluate the “Generalist vs\. Specialist” trade\-off rigorously, we selected four benchmark datasets representing distinct statistical regimes, ranging from strictly periodic physical systems to high\-entropy financial markets \(see Table[1](https://arxiv.org/html/2605.24381#S3.T1)\)\.

Table 1:Dataset Characteristics\. We cover diverse regimes from periodic \(Traffic\) to stochastic \(Finance\)\. MASE is computed with m=24 for Traffic and Energy, m=7 for Exchange and M4\.1. 1\.Traffic Flow \(Transportation / Hourly\):We utilized the Traffic dataset from the California Department of Transportation \(PeMS\), which records hourly lane occupancy rates across 862 sensors in the San Francisco Bay Area\. This dataset is characterized by strong temporal regularity, exhibiting sharp daily rush\-hour peaks and clear weekly periodicity \(S=168S=168\)\. It serves as a benchmark for a model’s ability to capture complex but deterministic seasonality\.\[[4](https://arxiv.org/html/2605.24381#bib.bib24),[11](https://arxiv.org/html/2605.24381#bib.bib14)\]\.
2. 2\.Electricity Transformer Temperature \(Energy / Hourly\):The ETTh1 dataset tracks the oil temperature of electricity transformers at an hourly resolution\. Unlike traffic flow, this data represents a continuous physical process governed by thermodynamic constraints and load factors\. It exhibits smooth local fluctuations rather than bursty spikes, testing a model’s ability to model continuous dynamics and short\-term mean reversion\.\[[23](https://arxiv.org/html/2605.24381#bib.bib9)\]\.
3. 3\.Exchange Rates \(Finance / Daily\):We included a dataset of 8 daily exchange rates \(e\.g\., USD, EUR, JPY, GBP\) spanning 26 years\. Financial time series are notoriously difficult to forecast due to their Random Walk behavior and high volatility\. This dataset tests the models’ robustness against low signal\-to\-noise ratios and shifting trends, where traditional seasonality is weak or non\-existent\. Although the Exchange Rate dataset contains a limited number of series, it is widely used in forecasting benchmarks and is specifically designed to capture high\-entropy financial dynamics\.\[[11](https://arxiv.org/html/2605.24381#bib.bib14)\]\.
4. 4\.M4 Competition \(General / Daily\):To test broad generalization capabilities, we utilized the Daily subset of the M4 competition, comprising 4,227 diverse series from macroeconomics, micro\-industry, and demographics\. This heterogeneous collection requires models to handle varying scales, trends, and noise levels without over\-fitting to a single domain’s physics\.\[[13](https://arxiv.org/html/2605.24381#bib.bib7)\]\.

##### Pre\-Training Overlap\.

We explicitly acknowledge that standard benchmarks such as M4, Traffic, and ETTh1 are frequently included in the pre\-training corpora of large time series models\[[6](https://arxiv.org/html/2605.24381#bib.bib1)\]\. Therefore, our evaluation on these datasets assesses the model’s ability to recall and adapt to canonical operational patterns via in\-context learning, rather than pure out\-of\-distribution generalization\.

### 3\.3Foundation Models Under Evaluation

We evaluate two architecturally distinct foundation model families\. TimesFM\[[6](https://arxiv.org/html/2605.24381#bib.bib1)\]is a decoder\-only architecture employing patch\-based continuous representation, available as TimesFM 2\.0 \(500M parameters\) and TimesFM 2\.5 \(200M parameters, 16K context length\)\. Chronos\[[2](https://arxiv.org/html/2605.24381#bib.bib2)\]adopts a T5\-based encoder\-decoder architecture that quantizes continuous values into discrete tokens\. These two families represent fundamentally different approaches to the generalist forecasting problem and are evaluated in zero\-shot inference\-only mode with no fine\-tuning\.

#### 3\.3\.1Theoretical Framework: Foundational Model via In\-Context Learning

Unlike traditional time series Transformers \(e\.g\., Informer\[[23](https://arxiv.org/html/2605.24381#bib.bib9)\], Autoformer\[[19](https://arxiv.org/html/2605.24381#bib.bib10)\]\) which rely on encoder\-decoder structures and require supervised training on target data, TimesFM employs a decoder\-only architecture conceptually similar to Large Language Models \(LLMs\)\. The core innovation of TimesFM is the shift from gradient\-based adaptation to In\-Context Learning\. Unlike supervised models where knowledge is stored statically in dataset\-specific weights, TimesFM keeps parameters frozen\. This facilitates true Foundational Model inference, where the model adapts to local seasonality and trends \(e\.g\., a specific Traffic sensor\) dynamically by attending to the historical context window \(XhistX\_\{hist\}\), without a single gradient update\.

#### 3\.3\.2Contextualization: The Foundation Model Landscape

To properly assess the viability of TimesFM, we must contextualize it within the broader landscape of foundation model architectures, each of which tackles the continuous data problem through distinct mechanisms:

![Refer to caption](https://arxiv.org/html/2605.24381v1/model.png)Figure 1:Architectural Divergence in Time Series Foundation Models\.\(A\) TimesFM aggregates continuous time points into dense vectors \(‘Patches’\), preserving local numerical semantics\. \(B\) Chronos quantizes values into discrete bins\. \(C\) Moirai utilizes ‘Any\-variate’ attention to handle variable numbers of input series\.Chronos \(Quantization via LLMs\):\[[2](https://arxiv.org/html/2605.24381#bib.bib2)\]Chronos adopts a vocabulary\-based approach, quantizing continuous time series values into fixed discrete tokens \(bins\) to train standard T5 language models\. While this allows it to treat forecasting as a classification task and leverage existing LLM infrastructure, it introduces resolution loss, where fine\-grained variations, crucial in scientific or energy domains, may be obscured if they fall within the same quantization bin\.

Moirai \(Any\-Variate Attention\):\[[18](https://arxiv.org/html/2605.24381#bib.bib3)\]Moirai targets the heterogeneity of time series \(varying frequencies and variate counts\) using Any\-variate attention\. Unlike univariate\-centric models, Moirai flattens multivariate series into a single sequence, employing multiple patch\-size projection layers to model interactions between variables across diverse frequencies\.

Time\-MoE \(Sparse Mixture of Experts\):\[[21](https://arxiv.org/html/2605.24381#bib.bib22)\]The most recent evolution, Time\-MoE, introduces a sparse Mixture of Experts \(MoE\) architecture\. Rather than activating all parameters for every token, Time\-MoE routes different temporal patterns to specialized “experts” \(e\.g\., distinguishing high\-frequency volatility from long\-term trend\)\. This enables the scaling of models to billions of parameters \(e\.g\., 2\.4B\) while maintaining inference costs comparable to smaller dense models\.

#### 3\.3\.3Justification for Model Selection

While architectural innovations differ, recent large\-scale benchmarking on the GIFT\-Eval framework\[[1](https://arxiv.org/html/2605.24381#bib.bib20)\]indicates that while individual foundation models exhibit varying domain\-level strengths, the structural gap between the Foundation Model class and Specialist Baselines is consistently larger than the within\-class variation\. The GIFT\-Eval study reveals that no single foundation model dominates across all regimes; rather, they share a common regime\-dependent pattern: stronger performance in periodic and trend\-driven domains, weaker performance in physically constrained and high\-entropy domains\.

In this study we evaluate TimesFM 2\.0, TimesFM 2\.5, and Chronos as representative foundation models spanning two architecturally distinct families\. The operational constraints we identify, the Throughput Gap and Inference Rigidity, are attributable to the FM paradigm broadly rather than any specific implementation, as confirmed by our latency benchmarks across all three models in Operational Efficiency section\.

### 3\.4Baseline Models

We selected XGBoost and LSTM as primary baselines to reflect the prevailing deployment reality in industry\. To further validate our findings, we additionally benchmark against PatchTST\[[15](https://arxiv.org/html/2605.24381#bib.bib11)\]and DLinear\[[20](https://arxiv.org/html/2605.24381#bib.bib19)\], representing state\-of\-the\-art supervised deep learning architectures\.

#### 3\.4\.1Neural Network Baseline: LSTM

We employed a standard LSTM network\[[7](https://arxiv.org/html/2605.24381#bib.bib5)\]to represent sequential deep learning architectures\. Unlike standard Recurrent Neural Networks \(RNNs\) which suffer from vanishing gradients, LSTMs introduce a memory cellctc\_\{t\}modulated by non\-linear gating mechanisms, allowing the model to learn long\-term dependencies essential for time series forecasting\.

##### Experimental Configuration\.

The model consists of a single LSTM layer with 30–64 units \(tuned per dataset\), followed by a fully connected dense layer to project the forecast horizon\. We adjusted the input/output structure to match the standard benchmarks for each dataset: a lookback window of 96 days and forecast horizon of 96 days for Exchange Rate; a lookback of 30 days and horizon of 14 days for M4 \(Daily\); a lookback of 96 hours and horizon of 24 hours for the Energy dataset; and a lookback of 168 hours with a horizon of 168 hours for the Traffic dataset\. The model was trained using the Adam optimizer\[[10](https://arxiv.org/html/2605.24381#bib.bib23)\]and Mean Squared Error \(MSE\) loss\. Batch sizes \(64–256\) and training duration \(3–10 epochs\) were tuned per dataset to ensure convergence without overfitting\.

##### Data Preprocessing and Normalization\.

To ensure robust training across diverse domains, we applied domain\-specific normalization techniques prior to training\.

- •M4, Exchange Rate, & Energy:Given the roughly normal distribution of these series, we applied Z\-score normalization \(zero mean, unit variance\) per series\. Statistics were calculated strictly on the training set and applied to the validation/test sets to prevent data leakage\.
- •Traffic:This dataset exhibits strictly positive values with heavy variance and bursty spikes\. To stabilize gradient descent, we applied log\-normalization \(log⁡\(1\+x\)\\log\(1\+x\)\) to compress the range of extreme values\. All predictions were inverse\-transformed \(exp⁡\(x\)−1\\exp\(x\)\-1\) for evaluation\.

#### 3\.4\.2Gradient Boosting Baseline: XGBoost

XGBoost\[[5](https://arxiv.org/html/2605.24381#bib.bib4)\]is an ensemble learning method that constructs a predictive model by aggregating the outputs ofKKClassification and Regression Trees \(CART\)\. Unlike the LSTM which processes raw sequences, XGBoost treats the forecasting task as a tabular regression problem\. We utilized the histogram\-based tree method \(tree\_method=’hist’\) for computational efficiency, optimizing the Squared Error objective with a learning rate \(η\\eta\) of 0\.05–0\.1 and a maximum tree depth of 6–8 \(tuned per dataset\) with early stopping\.

##### Feature Engineering \(Tabularization\)\.

To capture temporal dependencies without recurrent architectures, we constructed a rich set of domain\-specific features for each series:

- •Temporal Covariates:Explicit calendar features including hour\-of\-day \(for Traffic/Energy\), day\-of\-week, month, and week\-of\-year\.
- •Autoregressive Lags:For hourly data, lags att−1t\-1,t−24t\-24\(daily\), andt−168t\-168\(weekly\), with additional short\-term lags \(t−6t\-6,t−12t\-12\) for Energy\. For daily Exchange Rate data, lags att−1t\-1,t−7t\-7,t−14t\-14,t−30t\-30, andt−96t\-96\.
- •Rolling Statistics:Rolling means and standard deviations over short windows \(7 days for Exchange, 24 hours for Traffic/Energy\) and long windows \(30 days for Exchange, 168 hours for Energy\)\.
- •Series Identity:Label encoding of the Series ID to allow the tree ensemble to learn series\-specific baselines\.

Unlike the LSTM baseline, we did not apply global normalization to the target variables, leveraging the decision trees’ inherent robustness to varying scales via invariant split points\.

#### 3\.4\.3Modern Deep Learning Baselines: PatchTST and DLinear

To address the scope of specialist baselines, we additionally evaluate PatchTST\[[15](https://arxiv.org/html/2605.24381#bib.bib11)\]and DLinear\[[20](https://arxiv.org/html/2605.24381#bib.bib19)\]\. PatchTST segments the input into patches and applies a standard Transformer encoder in channel\-independent mode \(patch length 16, stride 8, d\_model 128, 3 encoder layers\)\. DLinear decomposes the input via a moving average kernel and applies independent linear projections to trend and seasonal components\. Both models use the same normalization, train/test splits, and evaluation protocol as the LSTM baseline\. Neither model receives temporal covariates \(e\.g\., hour\-of\-day, day\-of\-week\), operating on raw value windows only\.

### 3\.5Evaluation Metrics

We employ RMSE \(scale\-dependent\), sMAPE \(scale\-independent percentage error\), and Mean Absolute Scaled Error \(MASE\)\[[9](https://arxiv.org/html/2605.24381#bib.bib16)\]\.

RMSE:

RMSE=1H∑t=T\+1T\+H\(yt−y^t\)2RMSE=\\sqrt\{\\frac\{1\}\{H\}\\sum\_\{t=T\+1\}^\{T\+H\}\(y\_\{t\}\-\\hat\{y\}\_\{t\}\)^\{2\}\}\(1\)
sMAPE:

sMAPE=100%H∑t=T\+1T\+H\|yt−y^t\|\(\|yt\|\+\|y^t\|\)/2sMAPE=\\frac\{100\\%\}\{H\}\\sum\_\{t=T\+1\}^\{T\+H\}\\frac\{\|y\_\{t\}\-\\hat\{y\}\_\{t\}\|\}\{\(\|y\_\{t\}\|\+\|\\hat\{y\}\_\{t\}\|\)/2\}\(2\)
MASE:

MASE=1H∑t=T\+1T\+H\|yt−y^t\|1T−m∑i=m\+1T\|yi−yi−m\|MASE=\\frac\{1\}\{H\}\\sum\_\{t=T\+1\}^\{T\+H\}\\frac\{\|y\_\{t\}\-\\hat\{y\}\_\{t\}\|\}\{\\frac\{1\}\{T\-m\}\\sum\_\{i=m\+1\}^\{T\}\|y\_\{i\}\-y\_\{i\-m\}\|\}\(3\)wheremmis the seasonal period used as the naïve scaling denominator\. MASE scales absolute error against the Seasonal Naïve forecast, which repeats the last observed seasonal cycle\. We usem=24m=24for Traffic and Energy \(daily cycle\) andm=7m=7for Exchange and M4 \(weekly cycle\)\.

### 3\.6Implementation Details

Experiments utilized Python 3\.10, TensorFlow \(v2\.x\), and XGBoost \(v2\.0\+\)\. TimesFM 2\.0 used the 500M\-parameter checkpoint \(context length 2048\)\. TimesFM 2\.5 used the 200M\-parameter checkpoint \(context length matched to dataset lookback; compiled withnormalize\_inputs=True\)\. Chronos used the T5\-Base 200M checkpoint \(amazon/chronos\-t5\-base\)\. All foundation models operate in zero\-shot inference mode with no fine\-tuning\. XGBoost used histogram\-based trees\. To ensure reproducibility, all source code, data preprocessing scripts, and simulation logic are publicly available at:[https://github\.com/kavin\-soni/timeseries\-zeroshot\-eval](https://github.com/kavin-soni/timeseries-zeroshot-eval)\.

Table 2:Quantitative Evaluation across seven models and four operational regimes\. Foundation models \(TFM 2\.0, TFM 2\.5, Chronos\) are evaluated zero\-shot; supervised models are trained per dataset\. TimesFM 2\.5 shows substantial improvement over 2\.0 on Exchange and Energy\. The regime\-dependent pattern persists across both FM architectures\. PatchTST and DLinear receive only raw value windows without temporal covariates, consistent with their standard evaluation protocol\. Energy metrics are reported on unscaled data \(∘C\)\.

## 4Experiments and Results

### 4\.1Quantitative Results

We evaluated TimesFM 2\.0, TimesFM 2\.5, and Chronos against supervised baselines \(see Table[2](https://arxiv.org/html/2605.24381#S3.T2)\)\.

### 4\.2Operational Efficiency

While Foundation Models offer zero\-shot capabilities, our benchmarks reveal a significant Inference Tax\. Table[3](https://arxiv.org/html/2605.24381#S4.T3)compares the throughput across models on dedicated hardware\.

Table 3:The Throughput Gap\.The supervised XGBoost baseline delivers over three orders of magnitude higher throughput on cheaper CPU hardware compared to the Foundation Model on GPU \(CFM≈1000×CXGBC\_\{FM\}\\approx 1000\\times C\_\{XGB\}\)\.Analysis:The results reveal a consistent and architecture\-independent throughput gap\. All three foundation models require GPU hardware and produce latencies in the 200–1,764ms range, while every supervised baseline runs on CPU in under 120ms\. Notably, DLinear \(0\.13ms\) and PatchTST \(9\.03ms\) are among the fastest models overall despite being state\-of\-the\-art deep learning architectures, demonstrating that specialist efficiency is not contingent on architectural simplicity\. The T5\-based Chronos is the slowest foundation model due to its autoregressive tokenization pipeline, while TimesFM 2\.5 achieves lower latency than 2\.0 despite running on the slower T4 GPU, reflecting efficiency gains from halving the parameter count \(500M→\\rightarrow200M\)\. This pattern confirms that the Throughput Gap is a structural property of the FM paradigm\.

### 4\.3Qualitative Analysis

Figure[2](https://arxiv.org/html/2605.24381#S4.F2): Capturing Seasonality \(Traffic\)\.In the Traffic domain \(Sensor ID: 231\), TimesFM 2\.0 \(Blue\) demonstrates remarkable alignment\. The model’s pre\-trained understanding oftemporal periodicityallows it to anticipate the amplitude and timing of peaks\. XGBoost \(Red\) captures the general rhythm but consistentlysmooths over sharp transitions\.

![Refer to caption](https://arxiv.org/html/2605.24381v1/fig1.png)Figure 2:Traffic Flow Forecast\.The fine\-grained double\-peak structure is captured by TimesFM 2\.0, whereas XGBoost underestimates the extremes\.Figure[3](https://arxiv.org/html/2605.24381#S4.F3): Hypothesis: The Global Memory Gap \(Energy Dataset\)\.The Energy dataset \(Series: HULL\) highlights the structural divergence between the two paradigms\.XGBoost \(Red\) – Global Memory:The supervised model exhibits strongmean reversion\. Having been trained on the sensor’s long\-term history \(≈\\approx2 years\), it effectively “remembers” the global stationary mean \(≈3\.0\\approx 3\.0\) and ignores the recent local dip, prioritizing historical consistency over local volatility\.TimesFM 2\.0 \(Blue\) – Local Adaptability Hypothesis:In contrast, the Foundation Model anchors on Local Context\. Despite a 2048\-step context window, it lacks in\-weights memory of this specific sensor’s multi\-year behavior\. Consequently, it over\-weights the sharp drop atT=0T=0\(Recency Bias\), interpreting the local anomaly as a regime shift rather than a temporary fluctuation\. This supports our hypothesis that Foundation Models trade global stationarity awareness for local pattern matching\.

![Refer to caption](https://arxiv.org/html/2605.24381v1/fig2.png)Figure 3:Energy Forecast\.Highlighting the mean reversion trade\-off\. XGBoost collapses to the mean, while TimesFM 2\.0 observes a scale shift, exhibiting a visible discontinuity from the historical context \(Grey\) at T=0\. \(evaluated without test\-time scaling\)\.

## 5Discussion

### 5\.1Hypothesis: The Regime\-Dependent Trade\-off

Our empirical results reveal a distinct dichotomy in performance\. We propose aRegime\-Dependent Hypothesis\(Figure[4](https://arxiv.org/html/2605.24381#S5.F4)\) to explain this divergence\. We posit that model dominance is not random, but a function of the underlying data generating process \(DGP\): Generalist models excel where patterns are universal \(periodicity\), while Specialist models dominate where dynamics rely on site\-specific global memory \(thermodynamics\)\. It is also worth noting that widely available benchmarks like Traffic and M4 are frequently included in the pre\-training corpora of large foundation models\. Consequently, the superior performance on these datasets likely reflects the model’s ability to recall canonical operational patterns\.

![Refer to caption](https://arxiv.org/html/2605.24381v1/matrix.png)Figure 4:The Generalist vs\. Specialist Regime Map\.The X\-axis represents the strength of periodicity and human\-centric patterns\. The Y\-axis represents the intensity of physical constraints \(Thermodynamics\) or stochastic entropy \(Finance\)\. Foundation Models \(Generalist\) perform strongest in the lower\-right quadrant \(High Periodicity\) where patterns are universal\. XGBoost \(Specialist\) outperforms in the upper\-left quadrant, where domain\-specific physics or noise require feature\-engineered inductive biases\.
### 5\.2The Foundation Model Advantage: Traffic and M4

On the Traffic dataset, the TimesFM 2\.0 model achieved a MASE of 0\.482 through episodic memory, surpassing the Gradient Boosted specialist \(XGBoost 0\.514\) and reducing the error of the LSTM baseline \(0\.861\) by 44%\. It demonstrates that for canonical data with strong periodicity, a pre\-trained Foundation Model can outperform a fully tuned supervised model without requiring any task\-specific training\. This advantage persists even against modern supervised architectures: PatchTST \(MASE 1\.251\) and DLinear \(MASE 1\.165\) both achieve MASE above 1\.0, as neither receives temporal covariates and must infer periodicity from raw value windows alone\. Notably, Chronos \(MASE 0\.729\) and TimesFM 2\.5 \(MASE 0\.621\) also outperform all supervised deep learning baselines on Traffic despite being zero\-shot models, confirming that the FM advantage in periodic domains is not specific to TimesFM 2\.0 but shared across architecturally distinct foundation models\.

On the heterogeneous M4 \(Daily\) dataset, all foundation models achieve competitive MASE: TimesFM 2\.0 \(1\.426\), TimesFM 2\.5 \(1\.234\), and Chronos \(1\.212\), all surpassing XGBoost \(1\.763\)\. DLinear \(1\.211\) achieves the lowest MASE overall, indicating that on heterogeneous data no single paradigm holds a decisive advantage\.

### 5\.3The Specialist Advantage: Energy and Exchange Rates

The Energy \(ETTh1\) dataset strongly favored the specialized, supervised approach\. The Exchange Rate dataset presents a more nuanced picture: while TimesFM 2\.0 failed, newer foundation models now lead this regime, suggesting the regime boundary is shifting\.

On the Energy dataset, XGBoost achieved a dominant MASE of 0\.573\. Transformer temperature \(ETTh1\) is a physical process governed by thermodynamics and load constraints, exhibiting smooth, continuous fluctuations rather than the bursty behavior of traffic\. The decision tree ensemble successfully learned these specific physical constraints from the training split\. TimesFM 2\.0 \(MASE 2\.576\) and Chronos \(MASE 1\.102\) both underperform specialists\. We hypothesize that TimesFM 2\.0’s underperformance stems from a structural trade\-off between Local Adaptability and Global Memory\. With a context window of 2048 steps, the model captures local dynamics but lacks the in\-weights memory of the sensor’s multi\-year history, interpreting the drop atT=0T=0as a regime shift rather than a temporary fluctuation\. Notably, TimesFM 2\.5 \(MASE 1\.046\) substantially improves over 2\.0 \(MASE 2\.576\) on this dataset, suggesting that architectural refinements including a 16K context window and improved normalization partially address this limitation\. However, XGBoost \(MASE 0\.573\) still leads by a considerable margin, confirming that trained site\-specific knowledge remains difficult to replicate through pre\-training alone\.

If our Local\-Global hypothesis holds, then Retrieval\-Augmented Generation \(RAG\) for time series, injecting historical snippets of global behavior into the local context window, should theoretically resolve the challenges observed in the Energy dataset\.

On the Exchange Rate dataset, TimesFM 2\.5 achieves the lowest MASE overall \(2\.167\), followed by PatchTST \(2\.460\), DLinear \(2\.546\), and Chronos \(2\.569\)\. This represents a dramatic improvement over TimesFM 2\.0 \(MASE 12\.374\), suggesting that architectural improvements in the 2\.5 generation substantially close the FM\-specialist gap in high\-entropy domains\. XGBoost \(3\.942\) and LSTM \(9\.677\) are outperformed by newer architectures\.

Financial time series are characterized by high volatility and random walk behavior, where the signal\-to\-noise ratio is low\. TimesFM 2\.0 \(MASE 12\.374\) failed severely in this regime, consistent with our hypothesis that early foundation models over\-fit to complex temporal patterns in high\-entropy data\. However, TimesFM 2\.5 \(MASE 2\.167\) leads all models on this dataset, including supervised specialists, and Chronos \(MASE 2\.569\) is similarly competitive\. This represents a striking reversal: newer FM architectures, with improved normalization and larger context windows, appear to have learned to prioritize recent lags effectively\. This finding suggests that the FM\-specialist gap in high\-entropy domains is narrowing rapidly with architectural improvements, and that the Exchange regime may transition from a specialist\-dominated to a contested regime as FM architectures mature\.

### 5\.4The Operational Case for Foundational Model: Cold Starts and The Long Tail

While supervised baselines dominate in high\-stakes, data\-rich environments \(Energy\), Foundation Models offer a decisive operational advantage in “Cold Start” and “Long Tail” scenarios\.

##### The Cold\-Start Problem\.

In scenarios such as new product launches or rapidly deploying sensors to new geographies, historical training data is non\-existent\. Supervised models like XGBoost typically require a minimum historical horizon \(e\.g\., 2\+ seasonal cycles\) to learn valid splits\.

To quantify this, we simulated a cold start scenario on the Traffic dataset \(Figure[5](https://arxiv.org/html/2605.24381#S5.F5)\)\.XGBoost \(Red\) – Feature Starvation:Restricted to a short 48\-hour history, the supervised pipeline effectively breaks\. Lacking sufficient data to compute essential long\-term seasonal features \(e\.g\., weekly seasonality att−168t\-168\), the model is forced to rely on immediate lags\. This results in a degraded, reactive forecast that fails to anticipate the amplitude of the daily peaks\.TimesFM 2\.0 \(Blue\) – Semantic Transfer:In contrast, the Foundation Model leverages its pre\-trained inductive bias\. Despite having no access to long\-term history for this specific sensor, it recognizes the daily traffic archetype from the 48\-hour context\. Its episodic memory transfers this structural knowledge to generate a robust forecast, correctly predicting the double\-peak structure\. This supports our hypothesis that Foundation Models utilizesemantic memory\(general patterns\) to compensate when local history is unavailable\.

![Refer to caption](https://arxiv.org/html/2605.24381v1/fig3.png)Figure 5:The Cold Start Problem\.Visualizing the Feature Starvation trade\-off\. With only 48 hours of history \(Grey\), XGBoost \(Red\) fails to capture seasonality due to undefined lag features\. TimesFM 2\.0 \(Blue\), leveraging pre\-trained knowledge, instantly recognizes the daily cycle and produces a robust forecast\.
##### Pipeline Consolidation for the Long Tail\.

Industrial forecasting often involves a Long Tail distribution, e\.g\., a retailer with 100,000 SKUs where the top 1% drive the majority of revenue\. Maintaining, tuning, and retraining 100,000 distinct XGBoost models is often prohibitively expensive in terms of engineering hours\. A single Zero\-Shot inference pass can handle this entire long tail with zero training overhead\. This suggests a hybrid deployment pattern: use Specialist models for the “Head” \(critical, high\-volume series\) and efficient Generalist models to automate the “Tail” as long as computing resources are not constrained\.

### 5\.5Operational Viability: The Constraints of Production

While predictive accuracy remains the primary academic benchmark, the operational viability of a forecasting system in an industrial setting is determined by a broader set of constraints: latency, computational cost, and adaptability to drift\.

##### The Inference Rigidity Trade\-off

A critical challenge in deploying foundation models is the rigidity of the inference mechanism\. In a production environment, data distribution shift is inevitable\. A supervised architecture like XGBoost offers a direct mechanism to address this: the model can be retrained daily on the most recent data to capture structural breaks \(e\.g\., a shift in energy pricing dynamics\)\. In contrast, a zero\-shot model is effectively “frozen\.” Because it relies entirely on its pre\-training and context window, practitioners lack the ability to perform gradient updates to correct for local drift\. This necessitates expensive shadow deployment strategies to validate stability, whereas lightweight supervised models can be monitored and retrained dynamically\.

##### Latency and The Throughput Gap\.

While TimesFM excels in accuracy on Traffic data, it incurs a substantial computational cost\. XGBoost inference typically operates in the microsecond range, making it viable for high\-frequency trading or real\-time bidding systems\. In contrast, Foundation Models involve passing tokens through massive Transformer stacks, resulting in inference latencies in the hundreds of milliseconds to seconds\. For edge deployment \(e\.g\., IoT sensors on grid infrastructure\), the hardware footprint of a Foundation Model is often prohibitive compared to decision trees, which run efficiently on low\-power CPUs\.

##### Data Sovereignty and Privacy\.

Beyond computational costs, the choice of architecture introduces distinct privacy implications\. While certain Foundation Models can technically execute on CPUs, their computational density creates a bottleneck for high\-throughput applications\. Running a 500M\+ parameter model on secure, resource\-constrained database servers often introduces unacceptable latency or resource contention, practically forcing practitioners to migrate sensitive data to centralized, high\-compute environments \(e\.g\., GPU clusters or Cloud VPCs\) to meet SLAs\. In contrast, traditional supervised artifacts are computationally lightweight and can be deployed directly alongside data in secure, air\-gapped environments without destabilizing critical infrastructure\. This bring\-compute\-to\-data capability offers a robust mechanism for data residency compliance in Finance and Healthcare that Foundation Models struggle to match at scale\.

### 5\.6Feature Engineering as Inductive Bias

The performance divergence observed for TimesFM 2\.0 in high\-entropy financial domains suggests that specialist advantages often arise from feature engineering rather than model architecture alone\. In finance and energy, rolling statistics and domain\-specific covariates act as an explicit inductive bias, filtering stochastic noise and stabilizing learning\. When this structure is removed, zero\-shot models must infer these statistics implicitly from raw sequences, which in high\-entropy regimes can lead to overfitting\. In such settings, feature engineering functions not as a limitation, but as a necessary guardrail for model stability\. Our extended results with PatchTST and DLinear provide direct evidence for this claim\. Both achieve MASE above 1\.0 on Traffic, indicating performance below the seasonal naive level, while XGBoost \(with explicit calendar and lag features\) and TimesFM 2\.0 \(with pre\-trained temporal knowledge\) succeed\. This confirms that in periodic domains, inductive bias, whether learned through pre\-training or engineered through features, is essential\.

## 6Proposed System Architecture

Based on the regime\-dependent characteristics identified in Section 5, we propose a practical deployment architecture for industrial\-scale forecasting: The Complexity Router\. Rather than a heuristic rule, the router is grounded in an empirical analysis of 5,089 time series across two benchmark datasets, deriving data\-driven thresholds that determine when Foundation Model inference cost is justified by accuracy gains\.

![Refer to caption](https://arxiv.org/html/2605.24381v1/fig4.png)Figure 6:Proposed Hybrid Deployment Architecture\.The Complexity Router computes four series\-level features — spectral entropy, coefficient of variation, seasonal autocorrelation, and trend strength — to assign each incoming series to the optimal model class\. Stable series are routed to CPU\-based supervised models for low\-latency inference; high\-entropy or strongly periodic series are routed to the Foundation Model service\.### 6\.1The Complexity Router Logic

Instead of a monolithic deployment, we introduce a routing layer that assigns each series to the optimal model class based on four empirically\-derived features computed from training history\. To derive routing thresholds, we evaluated TimesFM 2\.5 and Chronos against PatchTST and DLinear across 5,089 series \(862 Traffic sensors, 4,227 M4 daily series\) and computed per\-series MASE for each model\. A series is labeled FM\-preferred if the best FM MASE is lower than the best specialist MASE\. Feature thresholds were set at the decile where FM win rate first exceeds 60%\.

Routing Features and Empirical Thresholds:

1. 1\.Spectral Entropy≥0\.24\\geq 0\.24:High entropy indicates a richly structured but non\-trivial periodic signal \(e\.g\., Traffic rush\-hour patterns\)\. FM win rate rises monotonically from decile 5 onward, reaching\>\>90% at the highest decile\.
2. 2\.Coefficient of Variation≥0\.22\\geq 0\.22:Higher volatility favors Foundation Models, which adapt via in\-context attention rather than fixed trained weights\.
3. 3\.Seasonal Autocorrelation≥0\.72\\geq 0\.72or<0\.50<0\.50:FM win rate is elevated at both extremes — strongly periodic series \(Traffic,r≈0\.81r\\approx 0\.81\) and near\-stationary series with weak seasonality\. Supervised specialists hold an advantage in the mid\-range\.
4. 4\.Trend Strength \(R2R^\{2\}\)<0\.05<0\.05:FM win rate exceeds 90% for near\-trend\-free series\. Specialists recover at moderate and high trend strength\.

Figure[7](https://arxiv.org/html/2605.24381#S6.F7)shows the FM win rate across deciles for each feature\. The routing rule is:route to FM if any two or more thresholds are satisfied; otherwise route to specialist\.

![Refer to caption](https://arxiv.org/html/2605.24381v1/x1.png)Figure 7:Empirical Routing Feature Analysis\.FM win rate by decile for four series\-level features across 5,089 series \(Traffic \+ M4\)\. Dashed red line marks 60% threshold\. FM models: TimesFM 2\.5, Chronos\. Specialist models: PatchTST, DLinear\. FM win rate rises with spectral entropy and coefficient of variation, peaks at both extremes of seasonal autocorrelation, and is highest at very low trend strength\.
### 6\.2Resource Efficiency Analysis

Deploying Foundation Models universally is resource\-intensive for high\-throughput systems\. We formalize the expected compute cost per series as:

E\[C\]=α⋅CFM\+\(1−α\)⋅CspecE\[C\]=\\alpha\\cdot C\_\{FM\}\+\(1\-\\alpha\)\\cdot C\_\{spec\}\(4\)
whereα\\alphais the fraction of series routed to the Foundation Model, andCspec≈CXGBC\_\{spec\}\\approx C\_\{XGB\}given the comparable CPU latency of DLinear and XGBoost \(Table[3](https://arxiv.org/html/2605.24381#S4.T3)\)\.

![Refer to caption](https://arxiv.org/html/2605.24381v1/x2.png)Figure 8:Cost\-Accuracy Pareto Frontier\.Complexity router MASE vs\. relative inference cost asα\\alphavaries from 0 \(pure specialist\) to 1 \(pure FM\)\. FM cost = 1,000×\\timesspecialist\. Orange dot marks the Pareto knee atα=0\.30\\alpha=0\.30, cost = 301×\\times, MASE = 0\.970\. The Complexity router outperforms both pure deployments\.Key finding:The Pareto knee occurs atα=0\.30\\alpha=0\.30, routing the 30% most FM\-favorable series to the Foundation Model and the remaining 70% to supervised specialists\. This configuration achieves:

- •MASE 0\.970: better than both pure FM \(0\.989\) and pure specialist \(1\.178\)
- •70% cost reductionrelative to universal FM deployment \(301×\\timesvs 1,000×\\times\)
- •17\.6% MASE improvementover pure specialist deployment

The hybrid system is strictly Pareto\-dominant: it achieves higher accuracy than pure FM at 30% of the cost\. The routing signal \(FM advantage score derived from series features\) successfully identifies the minority of series where FM inference cost is justified by measurable accuracy gains\.

## 7Conclusion and Future Work

This study contributes to a rigorous empirical evaluation of the Generalist vs\. Specialist paradigm in time series forecasting\. Contrary to the assumption that massive pre\-training is universally superior, our results reveal a nuanced landscape where the optimal modeling choice depends heavily on the underlying data generating process\.

We demonstrate that Foundation Models have reached a critical maturity point for human\-centric and periodic systems\. On the Traffic dataset, TimesFM 2\.0 outperformed all baselines \(MASE 0\.482\), while TimesFM 2\.5 and Chronos outperformed all supervised deep learning baselines despite running zero\-shot\. This suggests that the massive pre\-training of Foundation Models provides a robust inductive bias for capturing universal temporal structures like weekly seasonality and trend through Episodic Memory\.

However, our results on Energy and Exchange Rates reveal a nuanced picture\. Specialists still lead on Energy \(XGBoost MASE 0\.573 vs best FM 1\.046\), confirming the enduring value of trained domain\-specific models for physically constrained processes\. On Exchange, newer foundation models \(TimesFM 2\.5 MASE 2\.167\) now match or outperform legacy supervised baselines, suggesting the FM\-specialist gap is narrowing rapidly in stochastic domains\.

Operational trade\-offs further complicate this decision\. While Foundation Models offer a dramatic reduction in time\-to\-result by eliminating training pipelines, they suffer from a Throughput Gap in inference latency, data sovereignty risks, and rigidity against data drift\. In contrast, purpose\-built supervised architectures \(including Gradient Boosting and efficient Neural Networks\) remain the indispensable choice for high\-stakes physical and financial applications\. Their ability to deliver microsecond latency, interpretability, and privacy\-preserving on\-premise deployment provides an operational robustness that current Foundation Models cannot yet match\.

Finally, we operationalize these findings through the Complexity Router, a deployment architecture that assigns each series to the optimal model class based on four empirically\-derived features\. Validated across 5,089 series, the router identifies a Pareto\-dominant operating point: routing the 30% most foundation\-model\-favorable series to FM inference while serving the remainder with lightweight specialists achieves better accuracy than universal FM deployment \(MASE 0\.970 vs 0\.989\) at a 70% reduction in inference cost\. This demonstrates that the generalist\-specialist trade\-off is not a binary choice but a routing problem with a quantifiable optimum\.

##### Future Work\.

Future research should focus on bridging the gap between Generalist structural knowledge and Specialist precision\. Two promising directions include:

- •Episodic Retrieval \(RAG\):Investigating if in\-context learning can be improved by retrieving specific, similar historicalpatterns\(RAG for Time Series\) to guide the forecast, rather than relying solely on the immediate lookback window\.
- •Global Distributional Anchoring:Developing mechanisms to inject aggregate statistics \(e\.g\., long\-term mean and variance\) directly into the token sequence\. This anchors local predictions against a global summary of the series, mitigating recency bias without the cost of full\-history attention\.

## 8Broader Impact Statement

### 8\.1Environmental and Economic Implications

This work highlights a critical operational trade\-off in the industrial adoption of Foundation Models for time series forecasting\. While Foundational Model architectures democratize access to high\-quality forecasting by removing the barrier of specialized training expertise, they introduce a distinct inference overhead\.

##### Computational Cost of Inference\.

Replacing lightweight models \(which run efficiently on low\-power CPUs\) with massive Transformer architectures \(requiring GPU clusters\) for routine, high\-frequency tasks, such as daily inventory planning for millions of SKUs, significantly alters the energy profile of forecasting pipelines\. While Foundation Models offer generalization, this comes at the cost of increased floating\-point operations \(FLOPs\) per prediction compared to decision tree ensembles\.

##### The Green AI Perspective\.

Practitioners must weigh this computational cost against the marginal gains in accuracy\. Our results suggest that for high\-frequency or edge\-constrained applications, standard supervised architectures often remain theresource\-efficientchoice\. Aligning with “Green AI” principles involves not just model optimization, but also appropriate model selection: deploying heavy\-weight Foundation Models where generalization is strictly necessary, and utilizing lightweight specialists where structural efficiency is paramount\.

## References

- \[1\]Aksuet al\.\(2024\)GIFT\-eval: general time series forecasting model evaluation\.arXiv preprint arXiv:2410\.10393\.Cited by:[§2\.3](https://arxiv.org/html/2605.24381#S2.SS3.p2.1),[§3\.3\.3](https://arxiv.org/html/2605.24381#S3.SS3.SSS3.p1.1)\.
- \[2\]A\. F\. Ansari, L\. Stella, C\. Turkmen, X\. Zhang, P\. Mercado, H\. Shen, O\. Shchur, S\. S\. Rangapuram, S\. P\. Arango, S\. Kapoor,et al\.\(2024\)Chronos: learning the language of time series\.arXiv preprint arXiv:2403\.07815\.Cited by:[§2\.3](https://arxiv.org/html/2605.24381#S2.SS3.p1.1),[§3\.3\.2](https://arxiv.org/html/2605.24381#S3.SS3.SSS2.p2.1),[§3\.3](https://arxiv.org/html/2605.24381#S3.SS3.p1.1)\.
- \[3\]G\. E\. P\. Box and G\. M\. Jenkins\(1970\)Time series analysis: forecasting and control\.Holden\-Day\.Cited by:[§1](https://arxiv.org/html/2605.24381#S1.p2.1)\.
- \[4\]California Department of TransportationPerformance measurement system \(pems\) data source\.Note:WebsiteAccessed 2026\-01\-14Cited by:[item 1](https://arxiv.org/html/2605.24381#S3.I1.i1.p1.1)\.
- \[5\]T\. Chen and C\. Guestrin\(2016\)XGBoost: a scalable tree boosting system\.InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,pp\. 785–794\.Cited by:[§3\.4\.2](https://arxiv.org/html/2605.24381#S3.SS4.SSS2.p1.2)\.
- \[6\]A\. Das, W\. Kong, A\. Leach, S\. Mathur, R\. Sen, and R\. Yu\(2023\)TimesFM: a decoder\-only foundation model for time series forecasting\.arXiv preprint arXiv:2310\.10688\.Cited by:[§2\.3](https://arxiv.org/html/2605.24381#S2.SS3.p1.1),[§3\.2](https://arxiv.org/html/2605.24381#S3.SS2.SSS0.Px1.p1.1),[§3\.3](https://arxiv.org/html/2605.24381#S3.SS3.p1.1)\.
- \[7\]S\. Hochreiter and J\. Schmidhuber\(1997\)Long short\-term memory\.Neural Computation9\(8\),pp\. 1735–1780\.Cited by:[§3\.4\.1](https://arxiv.org/html/2605.24381#S3.SS4.SSS1.p1.1)\.
- \[8\]R\. J\. Hyndman and G\. Athanasopoulos\(2018\)Forecasting: principles and practice\.OTexts\.Cited by:[§1](https://arxiv.org/html/2605.24381#S1.p2.1)\.
- \[9\]R\. J\. Hyndman and A\. B\. Koehler\(2006\)Another look at measures of forecast accuracy\.International Journal of Forecasting22\(4\),pp\. 679–688\.Cited by:[§3\.5](https://arxiv.org/html/2605.24381#S3.SS5.p1.1)\.
- \[10\]D\. P\. Kingma and J\. Ba\(2015\)Adam: a method for stochastic optimization\.International Conference on Learning Representations \(ICLR\)\.Cited by:[§3\.4\.1](https://arxiv.org/html/2605.24381#S3.SS4.SSS1.Px1.p1.1)\.
- \[11\]G\. Lai, W\. Chang, Y\. Yang, and H\. Liu\(2018\)Modeling long\-and short\-term temporal patterns with deep neural networks\.InThe 41st International ACM SIGIR Conference on Research & Development in Information Retrieval,pp\. 95–104\.Cited by:[item 1](https://arxiv.org/html/2605.24381#S3.I1.i1.p1.1),[item 3](https://arxiv.org/html/2605.24381#S3.I1.i3.p1.1)\.
- \[12\]S\. Makridakis and M\. Hibon\(2000\)The m3\-competition: results, conclusions and implications\.International Journal of Forecasting16\(4\),pp\. 451–476\.Cited by:[§2\.1](https://arxiv.org/html/2605.24381#S2.SS1.p1.1)\.
- \[13\]S\. Makridakis, E\. Spiliotis, and V\. Assimakopoulos\(2018\)The m4 competition: results, findings, conclusions and way forward\.International Journal of Forecasting34\(4\),pp\. 802–808\.Cited by:[§2\.1](https://arxiv.org/html/2605.24381#S2.SS1.p1.1),[item 4](https://arxiv.org/html/2605.24381#S3.I1.i4.p1.1)\.
- \[14\]S\. Makridakis, E\. Spiliotis, and V\. Assimakopoulos\(2022\)The m5 accuracy competition: results, findings and conclusions\.International Journal of Forecasting38\(4\),pp\. 1346–1364\.Cited by:[§2\.1](https://arxiv.org/html/2605.24381#S2.SS1.p1.1)\.
- \[15\]Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam\(2023\)A time series is worth 64 words: long\-term forecasting with transformers\.InInternational Conference on Learning Representations,Cited by:[§2\.2](https://arxiv.org/html/2605.24381#S2.SS2.p1.1),[§3\.4\.3](https://arxiv.org/html/2605.24381#S3.SS4.SSS3.p1.1),[§3\.4](https://arxiv.org/html/2605.24381#S3.SS4.p1.1)\.
- \[16\]B\. N\. Oreshkin, D\. Carpov, N\. Chapados, and Y\. Bengio\(2020\)N\-beats: neural basis expansion analysis for interpretable time series forecasting\.InInternational Conference on Learning Representations,Cited by:[§2\.2](https://arxiv.org/html/2605.24381#S2.SS2.p1.1)\.
- \[17\]S\. Smyl\(2020\)A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting\.International Journal of Forecasting\.Cited by:[§2\.1](https://arxiv.org/html/2605.24381#S2.SS1.p1.1)\.
- \[18\]G\. Woo, C\. Liu, A\. Kumar, C\. Xiong, S\. Savarese, and D\. Sahoo\(2024\)Unified training of universal time series forecasting transformers\.arXiv preprint arXiv:2402\.02592\.Cited by:[§2\.3](https://arxiv.org/html/2605.24381#S2.SS3.p1.1),[§3\.3\.2](https://arxiv.org/html/2605.24381#S3.SS3.SSS2.p3.1)\.
- \[19\]H\. Wu, J\. Xu, J\. Wang, and M\. Long\(2021\)Autoformer: decomposition transformers with auto\-correlation for long\-term series forecasting\.InAdvances in Neural Information Processing Systems,Vol\.34,pp\. 22419–22430\.Cited by:[§3\.3\.1](https://arxiv.org/html/2605.24381#S3.SS3.SSS1.p1.1)\.
- \[20\]A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu\(2023\)Are transformers effective for time series forecasting?\.InProceedings of the AAAI Conference on Artificial Intelligence,Cited by:[§2\.2](https://arxiv.org/html/2605.24381#S2.SS2.p1.1),[§3\.4\.3](https://arxiv.org/html/2605.24381#S3.SS4.SSS3.p1.1),[§3\.4](https://arxiv.org/html/2605.24381#S3.SS4.p1.1)\.
- \[21\]Q\. Zhang, Q\. Wen, Y\. Wang, P\. Chen, A\. Zhou, and B\. Yang\(2024\)Time\-moe: billion\-scale time series foundation models with mixture of experts\.arXiv preprint arXiv:2409\.16040\.Cited by:[§3\.3\.2](https://arxiv.org/html/2605.24381#S3.SS3.SSS2.p4.1)\.
- \[22\]P\. C\. Zhe Liet al\.\(2024\)FoundTS: comprehensive and unified benchmarking of foundation models for time series forecasting\.arXiv preprint arXiv:2410\.11802\.Cited by:[§2\.3](https://arxiv.org/html/2605.24381#S2.SS3.p2.1)\.
- \[23\]H\. Zhou, S\. Zhang, J\. Peng, S\. Zhang, J\. Li, H\. Xiong, and W\. Zhang\(2021\)Informer: beyond efficient transformer for long sequence time\-series forecasting\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.35,pp\. 11106–11115\.Cited by:[item 2](https://arxiv.org/html/2605.24381#S3.I1.i2.p1.1),[§3\.3\.1](https://arxiv.org/html/2605.24381#S3.SS3.SSS1.p1.1)\.

## Appendix AAppendix: Baseline Model Equations

### A\.1LSTM Transition Equations and Forecast Projection

Given an input sequenceX=\[x1,…,xT\]X=\[x\_\{1\},\\ldots,x\_\{T\}\], the hidden statehth\_\{t\}and cell statectc\_\{t\}at time stepttare updated as:

ft\\displaystyle f\_\{t\}=σ\(Wf\[ht−1,xt\]\+bf\)\\displaystyle=\\sigma\(W\_\{f\}\[h\_\{t\-1\},x\_\{t\}\]\+b\_\{f\}\)\(5\)it\\displaystyle i\_\{t\}=σ\(Wi\[ht−1,xt\]\+bi\)\\displaystyle=\\sigma\(W\_\{i\}\[h\_\{t\-1\},x\_\{t\}\]\+b\_\{i\}\)C~t\\displaystyle\\tilde\{C\}\_\{t\}=tanh⁡\(WC\[ht−1,xt\]\+bC\)\\displaystyle=\\tanh\(W\_\{C\}\[h\_\{t\-1\},x\_\{t\}\]\+b\_\{C\}\)Ct\\displaystyle C\_\{t\}=ft⊙Ct−1\+it⊙C~t\\displaystyle=f\_\{t\}\\odot C\_\{t\-1\}\+i\_\{t\}\\odot\\tilde\{C\}\_\{t\}ot\\displaystyle o\_\{t\}=σ\(Wo\[ht−1,xt\]\+bo\)\\displaystyle=\\sigma\(W\_\{o\}\[h\_\{t\-1\},x\_\{t\}\]\+b\_\{o\}\)ht\\displaystyle h\_\{t\}=ot⊙tanh⁡\(Ct\)\\displaystyle=o\_\{t\}\\odot\\tanh\(C\_\{t\}\)The final forecast is generated by projecting the last hidden statehTh\_\{T\}through a fully connected linear layer:

y^T\+H=WouthT\+bout\.\\hat\{y\}\_\{T\+H\}=W\_\{out\}h\_\{T\}\+b\_\{out\}\.\(6\)The model is optimized by minimizing the Mean Squared Error \(MSE\) loss:

ℒ=1N∑\(y^i−yi\)2\.\\mathcal\{L\}=\\frac\{1\}\{N\}\\sum\(\\hat\{y\}\_\{i\}\-y\_\{i\}\)^\{2\}\.\(7\)

### A\.2XGBoost Regularized Objective

XGBoost is trained in an additive manner, minimizing the regularized objective:

ℒ\(t\)=∑i=1nl\(yi,y^i\(t−1\)\+ft\(xi\)\)\+Ω\(ft\),\\mathcal\{L\}^\{\(t\)\}=\\sum\_\{i=1\}^\{n\}l\(y\_\{i\},\\hat\{y\}\_\{i\}^\{\(t\-1\)\}\+f\_\{t\}\(x\_\{i\}\)\)\+\\Omega\(f\_\{t\}\),\(8\)where the regularization term is:

Ω\(f\)=γT\+12λ∥w∥2\.\\Omega\(f\)=\\gamma T\+\\frac\{1\}\{2\}\\lambda\\lVert w\\rVert^\{2\}\.\(9\)
Assessing the Operational Viability of Foundation Models for Time Series Forecasting

Similar Articles

Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting

A decoder-only foundation model for time-series forecasting

Retrieval Mechanisms Surpass Long-Context Scaling in Time Series Forecasting

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models

Submit Feedback

Similar Articles

Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting
A decoder-only foundation model for time-series forecasting
Retrieval Mechanisms Surpass Long-Context Scaling in Time Series Forecasting
TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models
Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models