This Time is Different: An Observability Perspective on Time Series Foundation Models

Core Claim

Toto argues that observability metrics are a distinct and demanding multivariate time-series forecasting domain, and that a decoder-only foundation model trained on observability, public, and synthetic time series can outperform general-purpose time-series foundation models on both observability and standard forecasting benchmarks.

Key Contributions

  • Introduces Toto, a 151M-parameter open-weights time-series foundation model for zero-shot probabilistic forecasting of observability metrics.
  • Adds architecture choices aimed at high-cardinality, nonstationary multivariate time series: patch-based causal instance normalization, proportional factorized time-variate attention, a Student-T mixture prediction head, and a robust composite training loss.
  • Builds a pretraining mixture of about 2.36T time-series points, including Datadog internal observability metrics, public datasets, and synthetic data.
  • Introduces BOOM, an observability forecasting benchmark with about 350M observations across 2,807 real-world multivariate time series.
  • Reports strong results on BOOM, GIFT-Eval, and LSF, positioning observability data as both a target domain and a stress test for general time-series foundation models.

Benchmarked Model Entry

  • Model: Toto-Open-Base-1.0
  • Role in paper: Main released and benchmarked Toto checkpoint
  • Notes: 151M-parameter decoder-only probabilistic forecaster with patch size 64, native context length 4096, proportional factorized attention, and a Student-T mixture output head. The paper evaluates it zero-shot on BOOM, GIFT-Eval, and LSF, with fine-tuning experiments on LSF.
  • Official artifact: Datadog/Toto-Open-Base-1.0

Method Notes

Toto is a passive dynamics model for multivariate time series. It forecasts future observations from historical observations and does not model actions, control inputs, interventions, or counterfactual policy choices as first-class channels in the paper.

Datadog later extends this line with Toto 2.0, which keeps the observability forecasting framing while turning the release into a scaled model family with contiguous patch masking and broader BOOM, GIFT-Eval, and TIME claims.

The model segments each variate into non-overlapping patches over time, embeds the patches as tokens, and feeds them through a decoder-only Transformer stack that alternates mostly time-wise attention layers with a smaller number of variate-wise attention layers. The paper frames this proportional factorized attention as a way to preserve cross-variate structure while keeping inference practical for high-cardinality observability metrics.
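The patching and layer-alternation ideas can be sketched in a few lines of numpy. This is an illustrative reduction, not the paper's implementation: the patch reshaping and the `variate_every` ratio are assumptions chosen for clarity, and real attention layers are omitted.

```python
import numpy as np

def patchify(series: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a (variates, time) array into non-overlapping patches.

    Returns shape (variates, num_patches, patch_size); any trailing
    remainder that does not fill a patch is dropped.
    """
    v, t = series.shape
    num_patches = t // patch_size
    return series[:, : num_patches * patch_size].reshape(v, num_patches, patch_size)

def attention_schedule(num_layers: int, variate_every: int) -> list:
    """Assign each Transformer layer an attention axis.

    Mostly time-wise layers, with one variate-wise layer every
    `variate_every` layers; the exact proportion here is illustrative,
    not taken from the paper.
    """
    return [
        "variate" if (i + 1) % variate_every == 0 else "time"
        for i in range(num_layers)
    ]

x = np.arange(2 * 10, dtype=float).reshape(2, 10)  # 2 variates, 10 steps
patches = patchify(x, patch_size=4)                # shape (2, 2, 4)
schedule = attention_schedule(num_layers=6, variate_every=3)
```

Under this sketch, most layers mix information along the time axis of each variate, and the occasional variate-wise layer mixes information across series at a fixed time position.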

The Student-T mixture output head makes Toto a probabilistic forecaster rather than only a point forecaster. The composite robust loss combines negative log likelihood with a robust point-prediction term to stabilize training on sparse, bursty, heavy-tailed observability data.
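A minimal numpy sketch of the two pieces named above: a Student-T mixture negative log-likelihood, plus a composite loss that adds a robust point-error term. The Huber penalty and the `lam`/`delta` constants are illustrative stand-ins for the paper's robust point-prediction term, not its exact form.

```python
import math
import numpy as np

def student_t_logpdf(x, df, loc, scale):
    """Log-density of a location-scale Student-T distribution."""
    z = (x - loc) / scale
    return (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
        - math.log(scale)
        - (df + 1) / 2 * math.log1p(z * z / df)
    )

def mixture_nll(x, weights, dfs, locs, scales):
    """Negative log-likelihood of scalar x under a Student-T mixture."""
    comp = np.log(weights) + np.array(
        [student_t_logpdf(x, d, m, s) for d, m, s in zip(dfs, locs, scales)]
    )
    m = comp.max()
    return -(m + math.log(np.exp(comp - m).sum()))  # stable log-sum-exp

def composite_loss(x, weights, dfs, locs, scales, lam=0.1, delta=1.0):
    """NLL plus a Huber penalty on the mixture-mean point error (assumed form)."""
    mean = float(np.dot(weights, locs))  # mixture mean as point forecast
    err = abs(x - mean)
    huber = 0.5 * err**2 if err <= delta else delta * (err - 0.5 * delta)
    return mixture_nll(x, weights, dfs, locs, scales) + lam * huber

loss = composite_loss(
    2.5,
    weights=np.array([0.6, 0.4]),
    dfs=np.array([3.0, 5.0]),
    locs=np.array([0.0, 1.0]),
    scales=np.array([1.0, 2.0]),
)
```

The heavy tails of the Student-T components keep the NLL from exploding on bursty outliers, while the point term anchors the forecast mean; this is the intuition behind combining the two, regardless of the exact robust term used.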

Evidence And Results

  • BOOM: the paper reports Toto ahead of Moirai, TimesFM, Chronos-Bolt, Timer, Time-MoE, VisionTS, and naive baselines by normalized MASE, normalized CRPS, and average rank.
  • GIFT-Eval: Toto is evaluated with the same inference settings as BOOM, using its native 4096-point context length.
  • LSF: the paper reports both zero-shot and fine-tuned Toto results on ETTh1, ETTh2, ETTm1, ETTm2, Electricity, and Weather.
  • Ablations identify causal scaling and the Student-T mixture head as especially important, with large NLL degradation when either is removed.
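The causal-scaling idea from the ablations can be illustrated with a simplified stand-in: normalize each time step using only past-and-current statistics, so no future information leaks into the normalization. The paper applies this per patch; the per-step variant below is an assumption made for brevity.

```python
import numpy as np

def causal_instance_norm(series, eps=1e-6):
    """Normalize each step of a (variates, time) array causally.

    Step t is scaled by the running mean and std of values up to and
    including t, computed independently per variate. A simplified
    stand-in for the paper's patch-based causal scaling.
    """
    series = np.asarray(series, dtype=float)
    t = np.arange(1, series.shape[-1] + 1)
    csum = np.cumsum(series, axis=-1)
    csq = np.cumsum(series**2, axis=-1)
    mean = csum / t
    var = np.maximum(csq / t - mean**2, 0.0)  # clamp tiny negatives
    return (series - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])  # one variate, four steps
normed = causal_instance_norm(x)
```

Because each step sees only its own history, the same normalization can be applied at training and autoregressive inference time without look-ahead bias, which is the property the ablation is stressing.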

Limitations

  • The paper is forecasting-centered; Toto is not presented as an action-conditioned world model for intervention or control reasoning.
  • Observability training and benchmark data come from Datadog internal systems, so the paper provides scale and operational realism but only partial external visibility into the private corpus construction.
  • BOOM uses Datadog observability metrics and preprocessing choices, so transfer to other monitoring stacks should be checked empirically.
  • The benchmark claims depend on the chosen probabilistic metrics, context lengths, and official inference procedures for comparison models.

Open Questions

  • How much of Toto’s advantage comes from observability-specific data scale versus the architecture changes?
  • Can proportional factorized attention transfer cleanly to other high-cardinality multivariate time-series domains such as finance, telemetry, or physiology?
  • What benchmark would test observability forecasting under operator actions, deployments, rollbacks, or autoscaling control inputs rather than passive metric prediction?