@ClementDelangue: Are scaling laws finally working for time series foundation models? Today, @datadoghq is releasing Toto 2.0 weights in …
Summary
Datadog releases Toto 2.0, a family of open-weights time series foundation models from 4M to 2.5B parameters, achieving state-of-the-art results on three benchmarks. The models demonstrate scaling laws for time series, improving predictably with parameter count.
Are scaling laws finally working for time series foundation models? Today, @datadoghq is releasing Toto 2.0 weights in Apache 2.0 on @huggingface. It’s a family of open-weights TSFMs from 4M to 2.5B parameters, where every size beats the last from a single hyperparameter config. First across the leading benchmarks: BOOM, GIFT-Eval, and TIME. Most TSFM families ship multiple sizes that all perform roughly the same. This one doesn’t.

Why it matters: scaling laws gave language and vision a predictable relationship between compute, data, parameters, and downstream performance. Time series hasn’t had that curve until now. Once you have it, you can scale data and compute with confidence, and start asking which new capabilities emerge at the next order of magnitude.

2.5B open-source weights: https://huggingface.co/Datadog/Toto-2.0-2.5B…
4M open-source weights: https://huggingface.co/Datadog/Toto-2.0-4m…
Blogpost: https://datadoghq.com/blog/ai/toto-2/?utm_content=blog&utm_medium=organicsocial…
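As a purely illustrative sketch of what such a curve means in practice: the parameter counts below match the released family, but the error values are placeholders, not Toto 2.0 benchmark results. A scaling trend is typically checked by fitting a power law (error ≈ a · N^slope) in log-log space:

import numpy as np

# Placeholder numbers only: sizes match the released family, errors are made up.
params = np.array([4e6, 22e6, 313e6, 1e9, 2.5e9])
error = np.array([0.80, 0.66, 0.50, 0.44, 0.39])

# Fit error ≈ a * params**slope by linear regression in log-log space.
slope, log_a = np.polyfit(np.log(params), np.log(error), 1)
print(f"fitted exponent {slope:.3f}; a clean negative slope is the 'predictable curve'")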
Datadog/Toto-2.0-2.5B · Hugging Face
Source: https://huggingface.co/Datadog/Toto-2.0-2.5B Toto (Time Series Optimized Transformer for Observability) is a family of time series foundation models for multivariate forecasting developed by Datadog. Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4m to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family.
The family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark.
## 📊 Performance
Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models on GIFT-Eval CRPS rank. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every other external foundation model evaluated.

## ⚡ Quick Start
Inference code is available on GitHub.
### Installation
pip install "toto-2 @ git+https://github.com/DataDog/toto.git#subdirectory=toto2"
### Inference Example
import torch
from toto2 import Toto2Model
model = Toto2Model.from_pretrained("Datadog/Toto-2.0-2.5B")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()
# (batch, n_variates, time_steps)
target = torch.randn(1, 1, 512, device=device)
target_mask = torch.ones_like(target, dtype=torch.bool)
series_ids = torch.zeros(1, 1, dtype=torch.long, device=device)
# Returns quantiles of shape (9, batch, n_variates, horizon)
# Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=96,
    decode_block_size=768,
    has_missing_values=False,
)
For more examples, see the Quick Start notebook and GluonTS integration notebook.
## 💾 Available Checkpoints
All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.
| Model | Params | Weights (fp32) | Latency | Recommended for |
|---|---|---|---|---|
| Toto‑2.0‑4m | 4m | 16 MB | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
| Toto‑2.0‑22m | 22m | 84 MB | ~5.0 ms | Efficient default; matches or beats Toto 1.0 quality with ~7× fewer parameters. |
| Toto‑2.0‑313m | 313m | 1.2 GB | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
| Toto‑2.0‑1B | 1B | 3.9 GB | ~20.9 ms | Best quality / cost tradeoff for production workloads. |
| Toto‑2.0‑2.5B | 2.5B | 9.1 GB | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
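All sizes expose the same interface, so switching checkpoints should only require changing the repository id. For example, continuing the quick-start snippet above (and assuming the same `Toto2Model` API), the 4M edge checkpoint:

small = Toto2Model.from_pretrained("Datadog/Toto-2.0-4m").to(device).eval()
# same forecast(...) call as above; only the accuracy/latency tradeoff changes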
## ✨ Key Features
- **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
- **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
- **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head (a loss sketch follows this list).
- **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4m → 2.5B).
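The quantile head is trained with a pinball loss (see Architecture below). As a minimal, generic sketch of that loss, not Datadog's training code, the per-level penalty and its average over the nine forecast levels look like this:

import torch

def pinball_loss(pred_q: torch.Tensor, target: torch.Tensor, q: float) -> torch.Tensor:
    # Pinball (quantile) loss: under-prediction is weighted by q and over-prediction
    # by (1 - q), so the minimizer of the expected loss is the q-th quantile.
    diff = target - pred_q
    return torch.maximum(q * diff, (q - 1) * diff).mean()

# Illustrative shapes only: one prediction tensor per quantile level.
levels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
target = torch.randn(1, 1, 96)              # (batch, n_variates, horizon)
preds = torch.randn(len(levels), 1, 1, 96)  # (n_levels, batch, n_variates, horizon)
loss = torch.stack([pinball_loss(preds[i], target, q) for i, q in enumerate(levels)]).mean()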
## 🏗️ Architecture
A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds contiguous patch masking (CPM) for single-pass parallel decoding, a quantile output head trained with pinball loss, a robust arcsinh input scaler, and residual MLP patch projections; the family is trained with NorMuon. See the technical report for details.
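As a rough structural sketch of alternating attention, not the released implementation (layer sizes and module names here are illustrative), a block can attend causally along the time/patch axis for each variate and then fully across variates at each time step by reshaping the (batch, variates, patches, dim) tensor:

import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Illustrative only: causal time-axis attention followed by full variate-axis attention."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.var_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_variates, n_patches, dim)
        b, v, p, d = x.shape

        # Time-axis attention: each variate attends causally over its own patches.
        xt = x.reshape(b * v, p, d)
        causal = torch.triu(torch.ones(p, p, dtype=torch.bool, device=x.device), diagonal=1)
        xt = xt + self.time_attn(xt, xt, xt, attn_mask=causal, need_weights=False)[0]

        # Variate-axis attention: each time step attends fully across all variates.
        xv = xt.reshape(b, v, p, d).transpose(1, 2).reshape(b * p, v, d)
        xv = xv + self.var_attn(xv, xv, xv, need_weights=False)[0]

        return xv.reshape(b, p, v, d).transpose(1, 2)

# Example: batch of 1, 2 variates, 8 patches, 64-dim embeddings.
out = AlternatingAttentionBlock()(torch.randn(1, 2, 8, 64))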
## 🔗 Additional Resources
- Technical Report (coming soon)
- Blog Post
- GitHub Repository
- Toto 2.0 Collection — all five base checkpoints
- BOOM Dataset — Datadog’s observability time-series benchmark
- Toto 1.0 Weights
## 📖 Citation
(citation coming soon)