@ClementDelangue: Are scaling laws finally working for time series foundation models? Today, @datadoghq is releasing Toto 2.0 weights in …
Summary
Datadog releases Toto 2.0, a family of open-weights time series foundation models from 4M to 2.5B parameters, achieving state-of-the-art results on three benchmarks. The models demonstrate scaling laws for time series, improving predictably with parameter count.
Are scaling laws finally working for time series foundation models? Today, @datadoghq is releasing Toto 2.0 weights in Apache 2.0 on @huggingface. It’s a family of open-weights TSFMs from 4M to 2.5B parameters, where every size beats the last from a single hyperparameter config. First across the leading benchmarks: BOOM, GIFT-Eval, and TIME. Most TSFM families ship multiple sizes that all perform roughly the same. This one doesn’t.

Why it matters: scaling laws gave language and vision a predictable relationship between compute, data, parameters, and downstream performance. Time series hasn’t had that curve until now. Once you have it, you can scale data and compute with confidence, and start asking which new capabilities emerge at the next order of magnitude.

2.5B open-source weights: https://huggingface.co/Datadog/Toto-2.0-2.5B…
4M open-source weights: https://huggingface.co/Datadog/Toto-2.0-4m…
Blogpost: https://datadoghq.com/blog/ai/toto-2/?utm_content=blog&utm_medium=organicsocial…
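As a purely illustrative sketch of what such a curve means in practice: the parameter counts below match the released family, but the error values are placeholders, not Toto 2.0 benchmark results. A scaling trend is typically checked by fitting a power law (error ≈ a · N^slope) in log-log space:

import numpy as np

# Placeholder numbers only: sizes match the released family, errors are made up.
params = np.array([4e6, 22e6, 313e6, 1e9, 2.5e9])
error = np.array([0.80, 0.66, 0.50, 0.44, 0.39])

# Fit error ≈ a * params**slope by linear regression in log-log space.
slope, log_a = np.polyfit(np.log(params), np.log(error), 1)
print(f"fitted exponent {slope:.3f}; a clean negative slope is the 'predictable curve'")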
Datadog/Toto-2.0-2.5B · Hugging Face
Source: https://huggingface.co/Datadog/Toto-2.0-2.5B Toto (Time Series Optimized Transformer for Observability) is a family of time series foundation models for multivariate forecasting developed by Datadog. Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4m to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family.
The family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark.
## 📊 Performance
Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models on GIFT-Eval CRPS rank. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every other external foundation model evaluated.

## ⚡ Quick Start
Inference code is available on GitHub.
### Installation
pip install "toto-2 @ git+https://github.com/DataDog/toto.git#subdirectory=toto2"
### Inference Example
import torch
from toto2 import Toto2Model
model = Toto2Model.from_pretrained("Datadog/Toto-2.0-2.5B")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()
# (batch, n_variates, time_steps)
target = torch.randn(1, 1, 512, device=device)
target_mask = torch.ones_like(target, dtype=torch.bool)
series_ids = torch.zeros(1, 1, dtype=torch.long, device=device)
# Returns quantiles of shape (9, batch, n_variates, horizon)
# Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=96,
    decode_block_size=768,
    has_missing_values=False,
)
For more examples, see the Quick Start notebook and GluonTS integration notebook.
## 💾 Available Checkpoints
All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.
| Model | Params | Weights (fp32) | Latency | Recommended for |
|---|---|---|---|---|
| Toto‑2.0‑4m | 4m | 16 MB | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
| Toto‑2.0‑22m | 22m | 84 MB | ~5.0 ms | Efficient default; matches or beats Toto 1.0 quality with ~7× fewer parameters. |
| Toto‑2.0‑313m | 313m | 1.2 GB | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
| Toto‑2.0‑1B | 1B | 3.9 GB | ~20.9 ms | Best quality / cost tradeoff for production workloads. |
| Toto‑2.0‑2.5B | 2.5B | 9.1 GB | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
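All sizes expose the same interface, so switching checkpoints should only require changing the repository id. For example, continuing the quick-start snippet above (and assuming the same `Toto2Model` API), the 4M edge checkpoint:

small = Toto2Model.from_pretrained("Datadog/Toto-2.0-4m").to(device).eval()
# same forecast(...) call as above; only the accuracy/latency tradeoff changes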
## ✨ Key Features
- **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
- **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
- **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head (a loss sketch follows this list).
- **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4m → 2.5B).
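The quantile head is trained with a pinball loss (see Architecture below). As a minimal, generic sketch of that loss, not Datadog's training code, the per-level penalty and its average over the nine forecast levels look like this:

import torch

def pinball_loss(pred_q: torch.Tensor, target: torch.Tensor, q: float) -> torch.Tensor:
    # Pinball (quantile) loss: under-prediction is weighted by q and over-prediction
    # by (1 - q), so the minimizer of the expected loss is the q-th quantile.
    diff = target - pred_q
    return torch.maximum(q * diff, (q - 1) * diff).mean()

# Illustrative shapes only: one prediction tensor per quantile level.
levels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
target = torch.randn(1, 1, 96)              # (batch, n_variates, horizon)
preds = torch.randn(len(levels), 1, 1, 96)  # (n_levels, batch, n_variates, horizon)
loss = torch.stack([pinball_loss(preds[i], target, q) for i, q in enumerate(levels)]).mean()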
## 🏗️ Architecture
A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds contiguous patch masking (CPM) for single-pass parallel decoding, a quantile output head trained with pinball loss, a robust arcsinh input scaler, and residual MLP patch projections; the family is trained with NorMuon. See the technical report for details.
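As a rough structural sketch of alternating attention, not the released implementation (layer sizes and module names here are illustrative), a block can attend causally along the time/patch axis for each variate and then fully across variates at each time step by reshaping the (batch, variates, patches, dim) tensor:

import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Illustrative only: causal time-axis attention followed by full variate-axis attention."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.var_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_variates, n_patches, dim)
        b, v, p, d = x.shape

        # Time-axis attention: each variate attends causally over its own patches.
        xt = x.reshape(b * v, p, d)
        causal = torch.triu(torch.ones(p, p, dtype=torch.bool, device=x.device), diagonal=1)
        xt = xt + self.time_attn(xt, xt, xt, attn_mask=causal, need_weights=False)[0]

        # Variate-axis attention: each time step attends fully across all variates.
        xv = xt.reshape(b, v, p, d).transpose(1, 2).reshape(b * p, v, d)
        xv = xv + self.var_attn(xv, xv, xv, need_weights=False)[0]

        return xv.reshape(b, p, v, d).transpose(1, 2)

# Example: batch of 1, 2 variates, 8 patches, 64-dim embeddings.
out = AlternatingAttentionBlock()(torch.randn(1, 2, 8, 64))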
## 🔗 Additional Resources
- Technical Report (coming soon)
- Blog Post
- GitHub Repository
- Toto 2.0 Collection — all five base checkpoints
- BOOM Dataset — Datadog’s observability time-series benchmark
- Toto 1.0 Weights
## 📖 Citation
(citation coming soon)