Toto 2.0: Time series forecasting enters the scaling era

Source

Core Claim

Datadog presents Toto 2.0 as an open-weights family of time-series forecasting foundation models that scales from 4M to 2.5B parameters and improves monotonically across the released sizes. It reports state-of-the-art results on BOOM, GIFT-Eval, and TIME, despite pretraining only on observability and synthetic data, with no public forecasting datasets in the mix.

Key Contributions

  • Releases a Toto 2.0 model family rather than a single checkpoint, spanning 4M, 22M, 313M, 1B, and 2.5B parameters.
  • Treats time-series forecasting as a scaling question and reports no saturation through the largest 2.5B-parameter model.
  • Extends the observability-metrics direction from Toto 1.0 with broader benchmark coverage, stronger parameter efficiency, and faster long-horizon inference.
  • Uses contiguous patch masking so the model can forecast all patches in a horizon in parallel.
  • Frames metrics as a distinct modality and sketches a future observability world-model direction over metrics, traces, logs, topology, code changes, events, alerts, and text.

Benchmarked Model Entries

| Model | Role in article | Notes | Official artifact |
|---|---|---|---|
| Toto-2.0-4m | Smallest released Toto 2.0 model | Establishes the low-size endpoint for the scaling study. | Datadog Toto 2.0 collection |
| Toto-2.0-22m | Compact Toto 2.0 checkpoint | Reported to beat Toto 1.0 on BOOM and to rank fifth on TIME. | Datadog Toto 2.0 collection |
| Toto-2.0-313m | Mid-scale Toto 2.0 checkpoint | Reported to rank among the top Toto 2.0 models on BOOM, GIFT-Eval, and TIME. | Datadog Toto 2.0 collection |
| Toto-2.0-1B | Large Toto 2.0 checkpoint | Reported to continue the monotonic scaling curve without saturating. | Datadog Toto 2.0 collection |
| Toto-2.0-2.5B | Largest released Toto 2.0 checkpoint | Reported as the strongest base model in the family. | Datadog Toto 2.0 collection |
| Toto-2.0-2.5B-FT | Fine-tuned 2.5B variant | Reported in second place on the full GIFT-Eval leaderboard. | Datadog Toto 2.0 collection |
| Toto-2.0-FnF | Family-and-friends ensemble | Reported in first place on the full GIFT-Eval leaderboard. | Datadog Toto 2.0 collection |

Method Notes

Toto 2.0 is a passive forecasting model for time-series dynamics. It does not yet model actions, control inputs, interventions, rollbacks, or autoscaling choices as first-class conditioning channels, but the article explicitly points toward future observability world models that would need richer operational context.

The article says Toto 2.0 was trained on observability and synthetic time-series data and deliberately excluded public forecasting datasets during pretraining. That makes the article relevant to synthetic-data pretraining and benchmark-leakage questions, even though it is an announcement article rather than a full technical report.

Contiguous patch masking is the central inference-relevant change: Toto 2.0 learns to reconstruct contiguous masked horizons, so inference can predict a whole forecast window in one parallel forward pass. For longer horizons, block decoding conditions each predicted segment on the previous segment’s median forecast and uses key-value caching to reduce repeated compute.
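The block-decoding loop described above can be sketched as a toy. Everything here is an assumption for illustration: the `PATCH_LEN` of 8, the `forecast_block` and `block_decode` names, and the persistence-plus-noise stand-in for the network are ours, not Datadog's API. Only the control flow mirrors the article: each block is filled in one parallel pass, and the next block conditions on the previous block's per-step median.

```python
import random
import statistics

PATCH_LEN = 8  # assumed patch size; the article does not state the real value

def forecast_block(context, n_patches, n_samples, rng):
    """Stand-in for one parallel forward pass over a contiguous masked block.

    Returns n_samples sample paths, each n_patches * PATCH_LEN steps long.
    This toy uses persistence plus noise, NOT the real network.
    """
    last = context[-1]
    horizon = n_patches * PATCH_LEN
    return [[last + rng.gauss(0.0, 0.1) for _ in range(horizon)]
            for _ in range(n_samples)]

def block_decode(context, horizon, block_patches=4, n_samples=50, seed=0):
    """Chain parallel blocks until `horizon` steps are produced.

    Each block is predicted in one pass; the next block conditions on the
    per-step median of the previous block's samples. In the real model,
    key-value caching would avoid re-encoding the growing context.
    """
    rng = random.Random(seed)
    out, ctx = [], list(context)
    while len(out) < horizon:
        # ceil division: how many patches still fit in the remaining horizon
        n_patches = min(block_patches, -(-(horizon - len(out)) // PATCH_LEN))
        samples = forecast_block(ctx, n_patches, n_samples, rng)
        median = [statistics.median(col) for col in zip(*samples)]
        out.extend(median)
        ctx.extend(median)
    return out[:horizon]

print(len(block_decode([1.0] * 64, horizon=96)))  # 96
```

The payoff over token-by-token autoregression is the step count: a 96-step horizon here takes 3 forward passes (blocks of 32 steps) instead of 96 sequential predictions.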

Evidence And Results

  • BOOM: the article reports that all Toto 2.0 sizes outrank other foundation models by CRPS rank, CRPS, and MASE; the reported CRPS ranks include 3.88 for Toto-2.0-2.5B, 3.96 for Toto-2.0-1B, and 4.25 for Toto-2.0-313m.
  • GIFT-Eval foundation-model leaderboard: the article reports CRPS ranks of 19.5 for Toto-2.0-2.5B, 20.3 for Toto-2.0-1B, and 20.5 for Toto-2.0-313m, ahead of the listed PatchTST-FM r1 and Chronos-2 ranks.
  • GIFT-Eval full leaderboard: the article reports Toto-2.0-FnF in first place and Toto-2.0-2.5B-FT in second place.
  • TIME: the article reports that Toto-2.0-2.5B, Toto-2.0-313m, and Toto-2.0-1B take the top three positions, while Toto-2.0-22m ranks fifth.
  • Latency: the article reports that single-pass forecasting lets Toto 2.0 produce long horizons with fewer sequential steps than Toto 1.0 or Chronos-2-style autoregressive inference.
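BOOM and GIFT-Eval rank models by CRPS, which scores an entire predictive distribution against the realized value rather than a single point forecast. A minimal sketch of the standard sample-based CRPS estimator (this is the textbook identity, not the benchmark harness; `empirical_crps` is our name):

```python
import random

def empirical_crps(samples, y):
    """Empirical CRPS estimate from forecast samples.

    Uses the identity CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|,
    where X and X' are independent draws from the forecast F.
    Lower is better; CRPS reduces to absolute error for a point forecast.
    """
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return term1 - term2

# A sharp, well-centered forecast scores lower (better) than a wide one.
rng = random.Random(0)
sharp = [rng.gauss(10.0, 0.5) for _ in range(500)]
wide = [rng.gauss(10.0, 5.0) for _ in range(500)]
print(empirical_crps(sharp, 10.0) < empirical_crps(wide, 10.0))  # True
```

This is why CRPS rewards both calibration and sharpness: a forecast distribution concentrated near the realized value beats an equally well-centered but diffuse one.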

Limitations

  • This source is a Datadog announcement article, not the full technical report.
  • Several central details, including the full pretraining mixture, benchmark harness, and training recipe, are only summarized in the article.
  • Toto 2.0 remains forecasting-centered; it is not yet an action-conditioned world model for intervention, deployment, rollback, or autoscaling reasoning.
  • The article itself says long-horizon forecasting is not fully solved, even though larger models preserve structure better in the shown examples.
  • Fine-tuned and ensemble leaderboard entries should be separated from base-model zero-shot results when comparing model entries.

Open Questions

  • Where does Toto 2.0 scaling saturate beyond 2.5B parameters?
  • How much of the gain comes from scale, synthetic data, observability data quality, contiguous patch masking, or decoding strategy?
  • Can the observability world-model direction incorporate actions, control inputs, interventions, and counterfactual incident-response reasoning?
  • How should GIFT-Eval and TIME comparisons separate base zero-shot models, fine-tuned models, and ensembles?