@ClementDelangue: Are scaling laws finally working for time series foundation models? Today, @datadoghq is releasing Toto 2.0 weights in …

Summary

Datadog releases Toto 2.0, a family of open-weights time series foundation models from 4M to 2.5B parameters, achieving state-of-the-art results on three benchmarks. The models demonstrate scaling laws for time series, improving predictably with parameter count.

Original Article

Are scaling laws finally working for time series foundation models? Today, @datadoghq is releasing Toto 2.0 weights in Apache 2.0 on @huggingface. It’s a family of open-weights TSFMs from 4M to 2.5B parameters, where every size beats the last from a single hyperparameter config. First across the leading benchmarks: BOOM, GIFT-Eval, and TIME.

Most TSFM families ship multiple sizes that all perform roughly the same. This one doesn’t.

Why it matters: scaling laws gave language and vision a predictable relationship between compute, data, parameters, and downstream performance. Time series hasn’t had that curve until now. Once you have it, you can scale data and compute with confidence, and start asking which new capabilities emerge at the next order of magnitude.

2.5B open-source weights: https://huggingface.co/Datadog/Toto-2.0-2.5B…
4M open-source weights: https://huggingface.co/Datadog/Toto-2.0-4m…
Blogpost: https://datadoghq.com/blog/ai/toto-2/?utm_content=blog&utm_medium=organicsocial…


Datadog/Toto-2.0-2.5B · Hugging Face

Source: https://huggingface.co/Datadog/Toto-2.0-2.5B

Toto (Time Series Optimized Transformer for Observability) is a family of time series foundation models for multivariate forecasting developed by Datadog. Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4m to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family.

The family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark.

📊 Performance

Figure: Pareto frontier on BOOM and GIFT-Eval.

Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models on GIFT-Eval CRPS rank. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every other external foundation model evaluated.

⚡ Quick Start

Inference code is available on GitHub.

Installation

pip install "toto-2 @ git+https://github.com/DataDog/toto.git#subdirectory=toto2"

Inference Example

import torch
from toto2 import Toto2Model

model = Toto2Model.from_pretrained("Datadog/Toto-2.0-2.5B")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()

# (batch, n_variates, time_steps)
target = torch.randn(1, 1, 512, device=device)
target_mask = torch.ones_like(target, dtype=torch.bool)
series_ids = torch.zeros(1, 1, dtype=torch.long, device=device)

# Returns quantiles of shape (9, batch, n_variates, horizon)
# Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=96,
    decode_block_size=768,
    has_missing_values=False,
)
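
Continuing the snippet above, and assuming the call returns a single tensor stacked along the quantile dimension as the shape comment indicates, a point forecast and a central 80% interval can be read off by indexing the first dimension. This is a usage sketch, not part of the official example.

median_forecast = quantiles[4]             # 0.5 quantile -> point forecast, shape (batch, n_variates, horizon)
lower, upper = quantiles[0], quantiles[8]  # 0.1 and 0.9 quantiles -> central 80% prediction interval
print(median_forecast.shape)               # torch.Size([1, 1, 96])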

For more examples, see the Quick Start notebook and GluonTS integration notebook.

💾 Available Checkpoints

All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.

| Model | Params | Weights (fp32) | Latency | Recommended for |
|---|---|---|---|---|
| Toto‑2.0‑4m | 4m | 16 MB | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
| Toto‑2.0‑22m | 22m | 84 MB | ~5.0 ms | Efficient default; matches or beats Toto 1.0 quality with ~7× fewer parameters. |
| Toto‑2.0‑313m | 313m | 1.2 GB | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
| Toto‑2.0‑1B | 1B | 3.9 GB | ~20.9 ms | Best quality / cost tradeoff for production workloads. |
| Toto‑2.0‑2.5B | 2.5B | 9.1 GB | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
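
As a rough illustration of choosing a size against a latency budget, the sketch below encodes the A100 latencies from the table and loads the largest checkpoint that fits. The intermediate repository IDs (22m, 313m, 1B) are assumed to follow the same naming pattern as the two linked above, and the pick_checkpoint helper is hypothetical; only the 4m and 2.5B repositories are confirmed by this post.

from toto2 import Toto2Model

# Approximate A100 latencies from the table above (ms for a 1,024-step single-pass
# forecast at batch size 8). Treat them as rough guides on other hardware.
# Note: intermediate repo IDs below are assumed, not confirmed by the post.
CHECKPOINTS = [
    ("Datadog/Toto-2.0-4m", 3.8),
    ("Datadog/Toto-2.0-22m", 5.0),
    ("Datadog/Toto-2.0-313m", 15.4),
    ("Datadog/Toto-2.0-1B", 20.9),
    ("Datadog/Toto-2.0-2.5B", 36.2),
]

def pick_checkpoint(latency_budget_ms: float) -> str:
    """Return the largest checkpoint whose tabulated A100 latency fits the budget."""
    eligible = [name for name, ms in CHECKPOINTS if ms <= latency_budget_ms]
    return eligible[-1] if eligible else CHECKPOINTS[0][0]

model = Toto2Model.from_pretrained(pick_checkpoint(10.0))  # selects Toto-2.0-22m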

✨ Key Features

  • **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
  • **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention (a multivariate sketch follows this list).
  • **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
  • **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
  • **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4m → 2.5B).
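
Multi-variate forecasting uses the same forecast call as the quick start. The sketch below assumes the inputs simply gain extra variates along the second dimension and that series_ids keeps the (batch, n_variates) shape seen in the univariate example; both shapes are extrapolations, not documented here.

import torch
from toto2 import Toto2Model

model = Toto2Model.from_pretrained("Datadog/Toto-2.0-2.5B").eval()

# Three related variates (e.g. request rate, latency, error rate) over 512 steps.
target = torch.randn(1, 3, 512)                    # (batch, n_variates, time_steps)
target_mask = torch.ones_like(target, dtype=torch.bool)
series_ids = torch.zeros(1, 3, dtype=torch.long)   # assumed (batch, n_variates), as in the quick start

quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=96,
    decode_block_size=768,
    has_missing_values=False,
)
print(quantiles.shape)  # expected: (9, 1, 3, 96)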

🏗️ Architecture

Figure: Overview of the Toto 2.0 architecture.

A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds contiguous patch masking (CPM) for single-pass parallel decoding, a quantile output head trained with pinball loss, a robust arcsinh input scaler, and residual MLP patch projections, and is trained with NorMuon. See the technical report for details.
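
For orientation, the pinball loss and arcsinh scaling named above have standard textbook forms; the sketch below shows those generic forms only, not Datadog's exact implementation, which may, for example, estimate location and scale robustly and average the loss over the nine quantile levels.

import torch

def pinball_loss(pred: torch.Tensor, target: torch.Tensor, q: float) -> torch.Tensor:
    """Textbook pinball (quantile) loss for a single quantile level q in (0, 1)."""
    err = target - pred
    return torch.mean(torch.maximum(q * err, (q - 1.0) * err))

def arcsinh_scale(x: torch.Tensor, loc: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Arcsinh input scaling: roughly linear near zero, logarithmic for large spikes."""
    return torch.asinh((x - loc) / scale)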

🔗 Additional Resources

📖 Citation

(citation coming soon)

Similar Articles

Introducing gpt-oss

OpenAI Blog

OpenAI releases gpt-oss-120b and gpt-oss-20b, two state-of-the-art open-weight language models under Apache 2.0 license that achieve near-parity with proprietary models while being optimizable for consumer hardware and edge devices. Both models demonstrate strong reasoning and tool-use capabilities with comprehensive safety evaluations.