Toto 2.0: Time series forecasting enters the scaling era
Source
- Raw Markdown: paper_toto-2-2026.md
- Article: Datadog blog
- Official source: DataDog/toto
- Official model collection: Datadog Toto 2.0
- Official scaling wrapper: dd_unit_scaling
Core Claim
Datadog presents Toto 2.0 as an open-weights family of time-series forecasting foundation models that scales from 4M to 2.5B parameters, improves monotonically across the released sizes, and reports state-of-the-art results on BOOM, GIFT-Eval, and TIME, despite pretraining only on observability and synthetic data with no public forecasting datasets in the mixture.
Key Contributions
- Releases a Toto 2.0 model family rather than a single checkpoint, spanning 4M, 22M, 313M, 1B, and 2.5B parameters.
- Treats time-series forecasting as a scaling question and reports no saturation through the largest 2.5B-parameter model.
- Extends the observability-metrics direction from Toto 1.0 with broader benchmark coverage, stronger parameter efficiency, and faster long-horizon inference.
- Uses contiguous patch masking so the model can forecast all patches in a horizon in parallel (see the masking sketch after this list).
- Frames metrics as a distinct modality and sketches a future observability world-model direction over metrics, traces, logs, topology, code changes, events, alerts, and text.
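The article does not publish training code, but the contiguous-patch-masking mechanism named above is simple enough to sketch. The Python snippet below is a minimal, hypothetical illustration under assumed shapes: a batch of series is cut into fixed-length patches and a contiguous suffix of patches (the forecast horizon) is masked, so a masked-reconstruction model can predict the entire horizon in one parallel forward pass. The function name, patch length, and mask-the-suffix choice are illustrative assumptions, not Toto 2.0's actual interface.

```python
import torch

def contiguous_patch_mask(series: torch.Tensor, patch_len: int, horizon_patches: int):
    """Cut (batch, length) series into fixed-length patches and mask a
    contiguous suffix of `horizon_patches` patches as the forecast horizon.

    Hypothetical sketch: returns (patches, mask), where mask is True on
    the patches a masked-reconstruction model must predict in parallel.
    """
    batch, length = series.shape
    assert length % patch_len == 0, "length must be a multiple of patch_len"
    patches = series.reshape(batch, length // patch_len, patch_len)
    mask = torch.zeros(batch, patches.shape[1], dtype=torch.bool)
    mask[:, -horizon_patches:] = True  # contiguous block at the end
    return patches, mask

# Toy usage: 512 context steps + 128 horizon steps in 32-step patches.
x = torch.randn(8, 640)
patches, mask = contiguous_patch_mask(x, patch_len=32, horizon_patches=4)
# A model would replace patches[mask] with mask tokens and reconstruct
# all four horizon patches in a single forward pass.
```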
Benchmarked Model Entries
| Model | Role In Article | Notes | Official Artifact |
|---|---|---|---|
| Toto-2.0-4m | Smallest released Toto 2.0 model | Anchors the small end of the scaling study. | Datadog Toto 2.0 collection |
| Toto-2.0-22m | Compact Toto 2.0 checkpoint | Reported to beat Toto 1.0 on BOOM and to rank fifth on TIME. | Datadog Toto 2.0 collection |
| Toto-2.0-313m | Mid-scale Toto 2.0 checkpoint | Reported to rank among the top Toto 2.0 models on BOOM, GIFT-Eval, and TIME. | Datadog Toto 2.0 collection |
| Toto-2.0-1B | Large Toto 2.0 checkpoint | Reported to continue the monotonic scaling curve without saturating. | Datadog Toto 2.0 collection |
| Toto-2.0-2.5B | Largest released Toto 2.0 checkpoint | Reported as the strongest base model in the family. | Datadog Toto 2.0 collection |
| Toto-2.0-2.5B-FT | Fine-tuned 2.5B variant | Reported in second place on the full GIFT-Eval leaderboard. | Datadog Toto 2.0 collection |
| Toto-2.0-FnF | Family-and-friends ensemble | Reported in first place on the full GIFT-Eval leaderboard. | Datadog Toto 2.0 collection |
Method Notes
Toto 2.0 is a passive forecasting model for time-series dynamics. It does not yet model actions, control inputs, interventions, rollbacks, or autoscaling choices as first-class conditioning channels, but the article explicitly points toward future observability world models that would need richer operational context.
The article says Toto 2.0 was trained on observability and synthetic time-series data and deliberately excluded public forecasting datasets during pretraining. That makes the article relevant to synthetic-data pretraining and benchmark-leakage questions, even though it is an announcement article rather than a full technical report.
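The article does not describe the synthetic generator, so the sketch below is a generic stand-in only: a common recipe for synthesizing forecasting series from trend, seasonalities, level shifts, and noise. None of the specific components, periods, or magnitudes are taken from Datadog's pipeline.

```python
import numpy as np

def synthetic_series(length: int = 2048, seed: int = 0) -> np.ndarray:
    """Generic synthetic series: linear trend + two random seasonalities
    + occasional level shifts + Gaussian noise. Purely illustrative; the
    article does not disclose Toto 2.0's actual synthetic recipe."""
    rng = np.random.default_rng(seed)
    t = np.arange(length)
    trend = rng.normal(0.0, 0.01) * t
    periods = rng.choice([24, 168, 720], size=2, replace=False)  # hourly-style periods
    season = sum(rng.normal(0.0, 1.0) * np.sin(2 * np.pi * t / p + rng.uniform(0, 2 * np.pi))
                 for p in periods)
    shifts = np.cumsum(rng.random(length) < 0.002) * rng.normal(0.0, 2.0)  # rare level shifts
    noise = rng.normal(0.0, 0.3, size=length)
    return trend + season + shifts + noise
```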
Contiguous patch masking is the central inference-relevant change: Toto 2.0 learns to reconstruct contiguous masked horizons, so inference can predict a whole forecast window in one parallel forward pass. For longer horizons, block decoding conditions each predicted segment on the previous segment’s median forecast and uses key-value caching to reduce repeated compute.
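The article describes block decoding only at a high level, so the following is a hedged Python rendering of that loop: forecast one block per pass, append the block's median to the context to condition the next block, and repeat. The `model(context, steps=..., num_samples=...)` call, returning samples of shape `(num_samples, batch, steps)`, is an assumed interface for illustration, not Toto 2.0's API; a real implementation would also carry the key-value cache mentioned above so the growing context is not re-encoded on every block.

```python
import torch

@torch.no_grad()
def block_decode(model, context: torch.Tensor, block_len: int, horizon: int,
                 num_samples: int = 64) -> torch.Tensor:
    """Hypothetical block decoding: each block of the horizon is forecast
    in one parallel pass; the next block is conditioned on the previous
    block's median forecast, per the article's description."""
    forecasts, done = [], 0
    while done < horizon:
        steps = min(block_len, horizon - done)
        samples = model(context, steps=steps, num_samples=num_samples)  # assumed interface
        median = samples.median(dim=0).values            # (batch, steps)
        forecasts.append(samples)
        context = torch.cat([context, median], dim=-1)   # condition the next block
        done += steps
    return torch.cat(forecasts, dim=-1)                  # (num_samples, batch, horizon)
```

The latency claim below follows from the step count: a horizon inside the training window needs one forward pass, and a longer horizon needs only ceil(horizon / block length) sequential passes, e.g. 8 passes for a 512-step horizon in 64-step blocks versus 512 passes for step-by-step autoregression (illustrative numbers, not from the article).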
Evidence And Results
- BOOM: the article reports that every Toto 2.0 size places ahead of all other foundation models on CRPS rank, CRPS, and MASE; the reported CRPS ranks include 3.88 for Toto-2.0-2.5B, 3.96 for Toto-2.0-1B, and 4.25 for Toto-2.0-313m (CRPS itself is sketched after this list).
- GIFT-Eval foundation-model leaderboard: the article reports CRPS ranks of 19.5 for Toto-2.0-2.5B, 20.3 for Toto-2.0-1B, and 20.5 for Toto-2.0-313m, ahead of the listed PatchTST-FM r1 and Chronos-2 ranks.
- GIFT-Eval full leaderboard: the article reports Toto-2.0-FnF in first place and Toto-2.0-2.5B-FT in second place.
- TIME: the article reports that Toto-2.0-2.5B, Toto-2.0-313m, and Toto-2.0-1B take the top three positions, while Toto-2.0-22m ranks fifth.
- Latency: the article reports that single-pass forecasting lets Toto 2.0 produce long horizons with fewer sequential steps than Toto 1.0 or Chronos-2-style autoregressive inference.
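CRPS, the headline metric above, rewards forecasts that are both sharp and well calibrated, and fractional "CRPS rank" values such as 3.88 presumably come from averaging per-task ranks across the benchmark. As a reading aid for the numbers above, here is a minimal NumPy Monte Carlo estimator using the standard energy form of CRPS; it is the generic formula, not the benchmarks' exact harness code.

```python
import numpy as np

def sample_crps(samples, target: float) -> float:
    """Monte Carlo CRPS via the energy form:
    CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, with X, X' ~ F.
    Lower is better."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.abs(samples - target).mean()
    term2 = np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - 0.5 * term2

rng = np.random.default_rng(0)
# A sharp, well-centered forecast scores far lower than a diffuse one.
print(sample_crps(rng.normal(0.0, 0.1, 512), target=0.0))  # ~0.023
print(sample_crps(rng.normal(0.0, 1.0, 512), target=0.0))  # ~0.23
```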
Limitations
- This source is a Datadog announcement article, not the full technical report.
- Several central details, including the full pretraining mixture, benchmark harness, and training recipe, are only summarized in the article.
- Toto 2.0 remains forecasting-centered; it is not yet an action-conditioned world model for intervention, deployment, rollback, or autoscaling reasoning.
- The article itself says long-horizon forecasting is not fully solved, even though larger models preserve structure better in the shown examples.
- Fine-tuned and ensemble leaderboard entries should be separated from base-model zero-shot results when comparing model entries.
Links Into The Wiki
- Toto
- Time-Series Foundation Models
- Synthetic Data For Time Series
- Observability Time Series
- Time-Series Scaling And Efficiency
- Time-Series Benchmark Hygiene
- Toto 1.0
- Chronos-2
- TimesFM
Open Questions
- Where does Toto 2.0 scaling saturate beyond 2.5B parameters?
- How much of the gain comes from scale, synthetic data, observability data quality, contiguous patch masking, or decoding strategy?
- Can the observability world-model direction incorporate actions, control inputs, interventions, and counterfactual incident-response reasoning?
- How should GIFT-Eval and TIME comparisons separate base zero-shot models, fine-tuned models, and ensembles?