# Toto 2.0: Time series forecasting enters the scaling era

Source URL: <https://www.datadoghq.com/blog/ai/toto-2/>

Publisher: Datadog

Published: 2026-05-14

Authors: Emaad Khwaja, Gerald Woo, Chris Lettieri, Ameet Talwalkar, David Asker

Retrieved: 2026-05-15

## Article Snapshot

This source is a Datadog engineering article announcing Toto 2.0, an
open-weights family of time-series forecasting foundation models released on
Hugging Face. The article frames Toto 2.0 as a scaling study for forecasting
foundation models: model sizes span 4M, 22M, 313M, 1B, and 2.5B parameters, and
the reported benchmark curves improve monotonically with scale through 2.5B
parameters.

The article contrasts Toto 2.0 with Toto 1.0. Toto 1.0 established Datadog's
observability-oriented forecasting direction with a 151M-parameter open model.
Toto 2.0 keeps the observability framing but expands the release into a model
family and emphasizes parameter scaling, inference speed, and benchmark coverage
across BOOM, GIFT-Eval, and TIME.

## Official Artifacts

- Article: <https://www.datadoghq.com/blog/ai/toto-2/>
- Source repository: <https://github.com/DataDog/toto>
- Hugging Face collection: <https://huggingface.co/collections/Datadog/toto-20>
- Datadog unit-scaling wrapper: <https://github.com/DataDog/toto/tree/main/dd_unit_scaling>

## Model Family

The article describes the following Toto 2.0 family members:

- Toto-2.0-4m
- Toto-2.0-22m
- Toto-2.0-313m
- Toto-2.0-1B
- Toto-2.0-2.5B
- Toto-2.0-2.5B-FT, a version fine-tuned on forecasting data
- Toto-2.0-FnF, a "Family and Friends" ensemble over Toto 2.0 models and other
  foundation models

The quick-start example in the article installs the package from the `toto2`
branch of the official repository and loads `Datadog/Toto-2.0-22m` with
`Toto2Model.from_pretrained`.
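
A minimal sketch of that quick-start, assuming the class name and checkpoint ID
quoted above; the import path and install command are our guesses at how the
article's instructions translate to code, not confirmed API:

```python
# Install from the article's `toto2` branch (shell command, shown as a comment):
#   pip install "git+https://github.com/DataDog/toto.git@toto2"

# Illustrative import path; the article only names the class and method.
from toto.model import Toto2Model

# Load the 22M checkpoint, as in the article's quick-start.
model = Toto2Model.from_pretrained("Datadog/Toto-2.0-22m")
```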

## Main Claims

- Time-series forecasting foundation models can improve predictably with model
  scale; the article reports no saturation at the 2.5B-parameter endpoint.
- Toto 2.0 sets new Datadog-reported best results on BOOM, GIFT-Eval, and TIME.
- Toto 2.0 is more parameter-efficient than Toto 1.0; the article states that a
  much smaller Toto 2.0 variant can match or exceed Toto 1.0 quality.
- The model family was trained on observability and synthetic time-series data,
  with no public forecasting data used in pretraining.
- Contiguous patch masking lets Toto 2.0 forecast an entire horizon in one
  parallel pass, while block decoding remains available for long-horizon
  stability.

## Training And Inference Notes

Toto 2.0 is trained as a forecasting model for time-series dynamics, with a
particular focus on observability metrics. The article says the pretraining data
mix contains observability metrics and synthetic time series, while deliberately
excluding public forecasting datasets to reduce overlap with evaluation
benchmarks.

The model uses contiguous patch masking during pretraining. Instead of
autoregressively revealing one patch at a time, the pretraining task masks a
contiguous block and asks the model to predict all masked patches in parallel.
At inference time this supports single-pass forecasting for the full prediction
horizon.
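
As a rough illustration of that objective (a sketch, not Datadog's
implementation; the patch length, mask width, and trailing-block placement are
assumptions), a contiguous-masking step could look like this:

```python
import torch

def contiguous_patch_mask(series: torch.Tensor, patch_len: int, n_masked: int):
    """Split each series into patches and mask a contiguous trailing block.
    The model is asked to predict every masked patch in one parallel pass,
    rather than revealing one patch at a time autoregressively.

    series: (batch, time), with time divisible by patch_len.
    """
    batch, time = series.shape
    patches = series.view(batch, time // patch_len, patch_len)
    context = patches[:, :-n_masked]   # visible history
    target = patches[:, -n_masked:]    # contiguous masked block to predict
    return context, target

# Toy usage: 8 series of 16 patches x 32 steps; mask the last 4 patches.
x = torch.randn(8, 16 * 32)
context, target = contiguous_patch_mask(x, patch_len=32, n_masked=4)
# loss = criterion(model(context), target)  # one forward pass covers the block
```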

For very long horizons, the article describes block decoding: the model predicts
segments sequentially and conditions each segment on the previous segment's
median forecast, while using a key-value cache to reduce repeated computation.
The article frames this as a speed-versus-stability choice: single-pass
prediction is fastest, while block decoding can help keep long-horizon
trajectories coherent.
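
A schematic of that loop as we read the description (a sketch; the
`model.sample` call, segment length, and cache handling are assumptions, and a
real implementation would reuse a key-value cache across segments rather than
re-encode the growing history):

```python
import torch

def block_decode(model, context: torch.Tensor, horizon: int,
                 segment: int = 512, n_samples: int = 64) -> torch.Tensor:
    """Forecast a long horizon in fixed-size segments, conditioning each
    segment on the median of the previous segment's sampled forecasts."""
    chunks = []
    history = context
    for _ in range(-(-horizon // segment)):  # ceil(horizon / segment) passes
        # Hypothetical sampling API: n_samples forecast paths for one segment.
        samples = model.sample(history, steps=segment, n_samples=n_samples)
        median = samples.median(dim=0).values       # (segment,) point path
        chunks.append(median)
        history = torch.cat([history, median])      # condition on the median
    return torch.cat(chunks)[:horizon]
```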

## Reported Benchmark Results

On BOOM, the article reports that every Toto 2.0 size outranks every other
foundation model by CRPS rank, CRPS, and MASE; by the figures below, the one
exception is Toto 1.0, which edges out the 4M variant. Reported CRPS ranks:

| Model         | CRPS rank |
|---------------|-----------|
| Toto-2.0-2.5B | 3.88      |
| Toto-2.0-1B   | 3.96      |
| Toto-2.0-313m | 4.25      |
| Toto-2.0-22m  | 5.52      |
| Toto 1.0      | 6.94      |
| Toto-2.0-4m   | 7.17      |
| Chronos-2     | 7.39      |

On GIFT-Eval's foundation-model leaderboard, the article reports the following
CRPS ranks:

| Model          | CRPS rank |
|----------------|-----------|
| Toto-2.0-2.5B  | 19.5      |
| Toto-2.0-1B    | 20.3      |
| Toto-2.0-313m  | 20.5      |
| PatchTST-FM r1 | 22.1      |
| Chronos-2      | 22.4      |
| Toto-2.0-22m   | 25.7      |
| Toto 1.0       | 33.9      |

On the full GIFT-Eval leaderboard, the Toto 2.0 Family and Friends ensemble is
reported in first place, with a fine-tuned Toto-2.0-2.5B in second place.

On TIME, the article reports that Toto-2.0-2.5B, Toto-2.0-313m, and
Toto-2.0-1B take the top three positions, while Toto-2.0-22m ranks fifth. The
article says every Toto 2.0 size at 22M parameters and above outperforms Toto
1.0 on this benchmark.

## Latency And Long-Horizon Claims

The article compares Toto 2.0's single-pass forecasting with the autoregressive
style of Toto 1.0 and Chronos-2. It states that a 1024-step forecast can require
up to 16 autoregressive steps for Toto 1.0, while in single-pass mode Toto 2.0
can produce the same horizon in one forward pass. It also reports that
Toto-2.0-313m has roughly the same latency as Chronos-2 despite being larger,
and that for horizons of 2048 or longer even Toto-2.0-2.5B can be faster than
Chronos-2 in single-pass mode.
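
The implied pass-count arithmetic, assuming fixed-size decode chunks (the
article's "up to 16 steps" for a 1024-step horizon is consistent with 64-step
chunks, though the chunk size is our inference):

```python
horizon = 1024
chunk = 64                          # assumed per-pass decode length for Toto 1.0
ar_passes = -(-horizon // chunk)    # ceil division -> 16 autoregressive passes
single_pass = 1                     # Toto 2.0's single-pass mode
print(ar_passes, single_pass)       # -> 16 1
```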

For long-horizon forecasting, the article shows synthetic multi-scale examples
at 2048, 4096, and 8192 prediction steps. The article's interpretation is that
larger Toto 2.0 models preserve temporal structure better than smaller models
and prior baselines, but that scale alone does not fully solve very long-horizon
forecasting.

## Future Direction

The article argues that time-series forecasting still has a gap to close against
well-tuned classical methods. It identifies data curation, model scale,
benchmark breadth, and better handling of metric-specific structure as future
work.

The article also sketches a broader observability world-model direction:
learning from metrics, traces, logs, service topology, code changes, events,
alerts, and text so systems can support proactive incident detection, root cause
analysis, counterfactual analysis, simulation, and agent training. In the
terminology of this repository, Toto 2.0 is still presented as a passive
forecasting model rather than an action-conditioned world model, but the article
explicitly points toward future models that could include actions, interventions,
and operational context.
