This Time is Different: An Observability Perspective on Time Series Foundation Models
Source
- Raw Markdown: paper_toto-2025.md
- PDF: paper_toto-2025.pdf
- Preprint: arXiv 2505.14766
- Official source: DataDog/toto
- Official checkpoint: Datadog/Toto-Open-Base-1.0
Core Claim
Toto argues that observability metrics are a distinct and demanding multivariate time-series forecasting domain, and that a decoder-only foundation model trained on observability, public, and synthetic time series can outperform general-purpose time-series foundation models on both observability and standard forecasting benchmarks.
Key Contributions
- Introduces Toto, a 151M-parameter open-weights time-series foundation model for zero-shot probabilistic forecasting of observability metrics.
- Adds architecture choices aimed at high-cardinality, nonstationary multivariate time series: patch-based causal instance normalization, proportional factorized time-variate attention, a Student-T mixture prediction head, and a robust composite training loss.
- Builds a pretraining mixture of about 2.36T time-series points, including Datadog internal observability metrics, public datasets, and synthetic data.
- Introduces BOOM, an observability forecasting benchmark with about 350M observations across 2,807 real-world multivariate time series.
- Reports strong results on BOOM, GIFT-Eval, and LSF, positioning observability data as both a target domain and a stress test for general time-series foundation models.
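The patch-based causal instance normalization named above is not specified in detail in this note; a minimal NumPy sketch of the general idea — standardizing each patch with statistics accumulated only from current-and-past patches, so no future values leak into the normalization — could look like the following (an illustration of the concept, not Toto's exact procedure):

```python
import numpy as np

def causal_patch_normalize(x, patch_size, eps=1e-6):
    """Standardize each patch of a 1-D series using mean/std computed
    from all patches up to and including the current one.

    Illustrative sketch only; Toto's actual normalization may differ.
    """
    n_patches = len(x) // patch_size
    x = np.asarray(x, dtype=float)[: n_patches * patch_size]
    x = x.reshape(n_patches, patch_size)
    out = np.empty_like(x)
    for i in range(n_patches):
        past = x[: i + 1].ravel()           # causal context: patches 0..i
        mu, sigma = past.mean(), past.std()
        out[i] = (x[i] - mu) / (sigma + eps)
    return out.ravel()
```

The key property is causality: the normalized value of patch `i` never depends on patches after `i`, which matters for nonstationary, bursty metrics where future level shifts would otherwise contaminate past scaling.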
Benchmarked Model Entry
| Model | Role In Paper | Notes | Official Artifact |
|---|---|---|---|
| Toto-Open-Base-1.0 | Main released and benchmarked Toto checkpoint | 151M-parameter decoder-only probabilistic forecaster with patch size 64, native context length 4096, proportional factorized attention, and a Student-T mixture output head. The paper evaluates it zero-shot on BOOM, GIFT-Eval, and LSF, with fine-tuning experiments on LSF. | Datadog/Toto-Open-Base-1.0 |
Method Notes
Toto is a passive dynamics model for multivariate time series. It forecasts future observations from historical observations and does not model actions, control inputs, interventions, or counterfactual policy choices as first-class channels in the paper.
Datadog later extends this line with Toto 2.0, which keeps the observability forecasting framing while scaling the release into a model family with contiguous patch masking and broader claims on BOOM, GIFT-Eval, and TIME.
The model splits each series into non-overlapping patches over time, embeds the patches of each variate, and feeds them through a decoder-only Transformer stack that alternates mostly time-wise attention with a smaller number of variate-wise attention layers. The paper frames this proportional factorized attention as a way to preserve cross-variate structure while keeping inference practical for high-cardinality observability metrics.
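The alternation described above can be sketched as follows, assuming (since this note does not give the exact ratio or block internals) a toy 3:1 time-to-variate proportion and single-head attention without learned projections:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal):
    """Single-head attention without projections, for illustration only."""
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    if causal:
        L = x.shape[-2]
        mask = np.triu(np.ones((L, L), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores, axis=-1) @ x

def proportional_factorized_stack(x, n_blocks=8, variate_every=4):
    """Alternate causal time-wise attention with occasional variate-wise
    attention (the 3:1 proportion here is an assumed illustration).

    x has shape (variates, time_patches, dim)."""
    for i in range(n_blocks):
        if (i + 1) % variate_every == 0:
            # variate-wise: attend across variates at each time step
            x = self_attention(x.swapaxes(0, 1), causal=False).swapaxes(0, 1)
        else:
            # time-wise: causal attention along the patch axis per variate
            x = self_attention(x, causal=True)
    return x
```

Note that temporal causality survives the variate-wise layers, since those only mix channels within a single time step; the sketch's output at patch t is unaffected by inputs after t.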
The Student-T mixture output head makes Toto a probabilistic forecaster rather than only a point forecaster. The composite robust loss combines negative log likelihood with a robust point-prediction term to stabilize training on sparse, bursty, heavy-tailed observability data.
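A Student-T mixture likelihood and a composite loss of the kind described can be sketched in plain Python; the Huber point term and the `lam` weighting below are assumptions for illustration, since this note does not give the paper's exact robust loss:

```python
import math

def student_t_logpdf(x, df, loc, scale):
    """Log density of a Student-T with df degrees of freedom."""
    z = (x - loc) / scale
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1) / 2 * math.log1p(z * z / df))

def mixture_nll(x, weights, dfs, locs, scales):
    """Negative log likelihood of x under a Student-T mixture."""
    log_terms = [math.log(w) + student_t_logpdf(x, df, m, s)
                 for w, df, m, s in zip(weights, dfs, locs, scales)]
    top = max(log_terms)                    # log-sum-exp for stability
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))

def composite_loss(x, point_pred, weights, dfs, locs, scales,
                   lam=0.1, delta=1.0):
    """NLL plus a Huber point-prediction term (assumed form, not the
    paper's exact robust loss)."""
    err = abs(x - point_pred)
    huber = 0.5 * err * err if err <= delta else delta * (err - 0.5 * delta)
    return mixture_nll(x, weights, dfs, locs, scales) + lam * huber
```

Heavy tails are the point of the Student-T choice: small `df` keeps the likelihood from exploding on bursty outliers, while the robust point term keeps median-like behavior even when the density fit is poor.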
Evidence And Results
- BOOM: the paper reports Toto ahead of Moirai, TimesFM, Chronos-Bolt, Timer, Time-MoE, VisionTS, and naive baselines by normalized MASE, normalized CRPS, and average rank.
- GIFT-Eval: Toto is evaluated with the same inference settings as BOOM, using its native 4096-point context length.
- LSF: the paper reports both zero-shot and fine-tuned Toto results on ETTh1, ETTh2, ETTm1, ETTm2, Electricity, and Weather.
- Ablations identify causal scaling (the patch-based causal instance normalization) and the Student-T mixture head as especially important, with large NLL degradation when either is removed.
Limitations
- The paper is forecasting-centered; Toto is not presented as an action-conditioned world model for intervention or control reasoning.
- Observability training and benchmark data come from Datadog internal systems, so the paper provides scale and operational realism but only partial external visibility into the private corpus construction.
- BOOM uses Datadog observability metrics and preprocessing choices, so transfer to other monitoring stacks should be checked empirically.
- The benchmark claims depend on the chosen probabilistic metrics, context lengths, and official inference procedures for comparison models.
Links Into The Wiki
- Toto
- Time-Series Foundation Models
- Observability Time Series
- Time-Series Benchmark Hygiene
- TimesFM
- Moirai
- Chronos-2
- Time-MoE
- Toto 2.0
Open Questions
- How much of Toto’s advantage comes from observability-specific data scale versus the architecture changes?
- Can proportional factorized attention transfer cleanly to other high-cardinality multivariate time-series domains such as finance, telemetry, or physiology?
- What benchmark would test observability forecasting under operator actions, deployments, rollbacks, or autoscaling control inputs rather than passive metric prediction?