Toto 2.0: Time series forecasting enters the scaling era

Source

Core Claim

Datadog presents Toto 2.0 as an open-weights family of time-series forecasting foundation models that scales from 4M to 2.5B parameters and improves monotonically across the released sizes. It reports state-of-the-art results on BOOM, GIFT-Eval, and TIME, despite pretraining only on observability and synthetic data, with no public forecasting datasets in the mix.

Key Contributions

  • Releases a Toto 2.0 model family rather than a single checkpoint, spanning 4M, 22M, 313M, 1B, and 2.5B parameters.
  • Treats time-series forecasting as a scaling question and reports no saturation through the largest 2.5B-parameter model.
  • Extends the observability-metrics direction from Toto 1.0 with broader benchmark coverage, stronger parameter efficiency, and faster long-horizon inference.
  • Uses contiguous patch masking so the model can forecast all patches in a horizon in parallel.
  • Frames metrics as a distinct modality and sketches a future observability world-model direction over metrics, traces, logs, topology, code changes, events, alerts, and text.

Benchmarked Model Entries

| Model | Role in article | Notes | Official artifact |
|---|---|---|---|
| Toto-2.0-4m | Smallest released Toto 2.0 model | Establishes the low-size endpoint for the scaling study. | Datadog Toto 2.0 collection |
| Toto-2.0-22m | Compact Toto 2.0 checkpoint | Reported to beat Toto 1.0 on BOOM and to rank fifth on TIME. | Datadog Toto 2.0 collection |
| Toto-2.0-313m | Mid-scale Toto 2.0 checkpoint | Reported to rank among the top Toto 2.0 models on BOOM, GIFT-Eval, and TIME. | Datadog Toto 2.0 collection |
| Toto-2.0-1B | Large Toto 2.0 checkpoint | Reported to continue the monotonic scaling curve without saturating. | Datadog Toto 2.0 collection |
| Toto-2.0-2.5B | Largest released Toto 2.0 checkpoint | Reported as the strongest base model in the family. | Datadog Toto 2.0 collection |
| Toto-2.0-2.5B-FT | Fine-tuned 2.5B variant | Reported in second place on the full GIFT-Eval leaderboard. | Datadog Toto 2.0 collection |
| Toto-2.0-FnF | Family-and-friends ensemble | Reported in first place on the full GIFT-Eval leaderboard. | Datadog Toto 2.0 collection |

Method Notes

Toto 2.0 is a passive forecasting model for time-series dynamics. It does not yet model actions, control inputs, interventions, rollbacks, or autoscaling choices as first-class conditioning channels, but the article explicitly points toward future observability world models that would need richer operational context.

The article says Toto 2.0 was trained on observability and synthetic time-series data and deliberately excluded public forecasting datasets during pretraining. That makes the article relevant to synthetic-data pretraining and benchmark-leakage questions, even though it is an announcement article rather than a full technical report.

Contiguous patch masking is the central inference-relevant change: Toto 2.0 learns to reconstruct contiguous masked horizons, so inference can predict a whole forecast window in one parallel forward pass. For longer horizons, block decoding conditions each predicted segment on the previous segment’s median forecast and uses key-value caching to reduce repeated compute.
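The block-decoding loop described above can be sketched as a toy. Everything here is an assumption for illustration: the `PATCH_LEN` of 8, the `forecast_block` and `block_decode` names, and the persistence-plus-noise stand-in for the network are ours, not Datadog's API. Only the control flow mirrors the article: each block is filled in one parallel pass, and the next block conditions on the previous block's per-step median.

```python
import random
import statistics

PATCH_LEN = 8  # assumed patch size; the article does not state the real value

def forecast_block(context, n_patches, n_samples, rng):
    """Stand-in for one parallel forward pass over a contiguous masked block.

    Returns n_samples sample paths, each n_patches * PATCH_LEN steps long.
    This toy uses persistence plus noise, NOT the real network.
    """
    last = context[-1]
    horizon = n_patches * PATCH_LEN
    return [[last + rng.gauss(0.0, 0.1) for _ in range(horizon)]
            for _ in range(n_samples)]

def block_decode(context, horizon, block_patches=4, n_samples=50, seed=0):
    """Chain parallel blocks until `horizon` steps are produced.

    Each block is predicted in one pass; the next block conditions on the
    per-step median of the previous block's samples. In the real model,
    key-value caching would avoid re-encoding the growing context.
    """
    rng = random.Random(seed)
    out, ctx = [], list(context)
    while len(out) < horizon:
        # ceil division: how many patches still fit in the remaining horizon
        n_patches = min(block_patches, -(-(horizon - len(out)) // PATCH_LEN))
        samples = forecast_block(ctx, n_patches, n_samples, rng)
        median = [statistics.median(col) for col in zip(*samples)]
        out.extend(median)
        ctx.extend(median)
    return out[:horizon]

print(len(block_decode([1.0] * 64, horizon=96)))  # 96
```

The payoff over token-by-token autoregression is the step count: a 96-step horizon here takes 3 forward passes (blocks of 32 steps) instead of 96 sequential predictions.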

Evidence And Results

  • BOOM: the article reports that all Toto 2.0 sizes outrank other foundation models by CRPS rank, CRPS, and MASE; the reported CRPS ranks include 3.88 for Toto-2.0-2.5B, 3.96 for Toto-2.0-1B, and 4.25 for Toto-2.0-313m.
  • GIFT-Eval foundation-model leaderboard: the article reports CRPS ranks of 19.5 for Toto-2.0-2.5B, 20.3 for Toto-2.0-1B, and 20.5 for Toto-2.0-313m, ahead of the listed PatchTST-FM r1 and Chronos-2 ranks.
  • GIFT-Eval full leaderboard: the article reports Toto-2.0-FnF in first place and Toto-2.0-2.5B-FT in second place.
  • TIME: the article reports that Toto-2.0-2.5B, Toto-2.0-313m, and Toto-2.0-1B take the top three positions, while Toto-2.0-22m ranks fifth.
  • Latency: the article reports that single-pass forecasting lets Toto 2.0 produce long horizons with fewer sequential steps than Toto 1.0 or Chronos-2-style autoregressive inference.
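BOOM and GIFT-Eval rank models by CRPS, which scores an entire predictive distribution against the realized value rather than a single point forecast. A minimal sketch of the standard sample-based CRPS estimator (this is the textbook identity, not the benchmark harness; `empirical_crps` is our name):

```python
import random

def empirical_crps(samples, y):
    """Empirical CRPS estimate from forecast samples.

    Uses the identity CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|,
    where X and X' are independent draws from the forecast F.
    Lower is better; CRPS reduces to absolute error for a point forecast.
    """
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return term1 - term2

# A sharp, well-centered forecast scores lower (better) than a wide one.
rng = random.Random(0)
sharp = [rng.gauss(10.0, 0.5) for _ in range(500)]
wide = [rng.gauss(10.0, 5.0) for _ in range(500)]
print(empirical_crps(sharp, 10.0) < empirical_crps(wide, 10.0))  # True
```

This is why CRPS rewards both calibration and sharpness: a forecast distribution concentrated near the realized value beats an equally well-centered but diffuse one.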

Limitations

  • This source is a Datadog announcement article, not the full technical report.
  • Several central details, including the full pretraining mixture, benchmark harness, and training recipe, are only summarized in the article.
  • Toto 2.0 remains forecasting-centered; it is not yet an action-conditioned world model for intervention, deployment, rollback, or autoscaling reasoning.
  • The article itself says long-horizon forecasting is not fully solved, even though larger models preserve structure better in the shown examples.
  • Fine-tuned and ensemble leaderboard entries should be separated from base-model zero-shot results when comparing model entries.

Open Questions

  • Where does Toto 2.0 scaling saturate beyond 2.5B parameters?
  • How much of the gain comes from scale, synthetic data, observability data quality, contiguous patch masking, or decoding strategy?
  • Can the observability world-model direction incorporate actions, control inputs, interventions, and counterfactual incident-response reasoning?
  • How should GIFT-Eval and TIME comparisons separate base zero-shot models, fine-tuned models, and ensembles?