This Time is Different: An Observability Perspective on Time Series Foundation Models
Source
- Raw Markdown: paper_toto-2025.md
- PDF: paper_toto-2025.pdf
- Preprint: arXiv 2505.14766
- Official source: DataDog/toto
- Official checkpoint: Datadog/Toto-Open-Base-1.0
Core Claim
Toto argues that observability metrics are a distinct and demanding multivariate time-series forecasting domain, and that a decoder-only foundation model trained on observability, public, and synthetic time series can outperform general-purpose time-series foundation models on both observability and standard forecasting benchmarks.
Key Contributions
- Introduces Toto, a 151M-parameter open-weights time-series foundation model for zero-shot probabilistic forecasting of observability metrics.
- Adds architecture choices aimed at high-cardinality, nonstationary multivariate time series: patch-based causal instance normalization, proportional factorized time-variate attention, a Student-T mixture prediction head, and a robust composite training loss.
- Builds a pretraining mixture of about 2.36T time-series points, including Datadog internal observability metrics, public datasets, and synthetic data.
- Introduces BOOM, an observability forecasting benchmark with about 350M observations across 2,807 real-world multivariate time series.
- Reports strong results on BOOM, GIFT-Eval, and LSF, positioning observability data as both a target domain and a stress test for general time-series foundation models.
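The patch-based causal instance normalization named above is not specified in detail in this note; a minimal NumPy sketch of the general idea — standardizing each patch with statistics accumulated only from current-and-past patches, so no future values leak into the normalization — could look like the following (an illustration of the concept, not Toto's exact procedure):

```python
import numpy as np

def causal_patch_normalize(x, patch_size, eps=1e-6):
    """Standardize each patch of a 1-D series using mean/std computed
    from all patches up to and including the current one.

    Illustrative sketch only; Toto's actual normalization may differ.
    """
    n_patches = len(x) // patch_size
    x = np.asarray(x, dtype=float)[: n_patches * patch_size]
    x = x.reshape(n_patches, patch_size)
    out = np.empty_like(x)
    for i in range(n_patches):
        past = x[: i + 1].ravel()           # causal context: patches 0..i
        mu, sigma = past.mean(), past.std()
        out[i] = (x[i] - mu) / (sigma + eps)
    return out.ravel()
```

The key property is causality: the normalized value of patch `i` never depends on patches after `i`, which matters for nonstationary, bursty metrics where future level shifts would otherwise contaminate past scaling.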
Benchmarked Model Entry
| Model | Role In Paper | Notes | Official Artifact |
|---|---|---|---|
| Toto-Open-Base-1.0 | Main released and benchmarked Toto checkpoint | 151M-parameter decoder-only probabilistic forecaster with patch size 64, native context length 4096, proportional factorized attention, and a Student-T mixture output head. The paper evaluates it zero-shot on BOOM, GIFT-Eval, and LSF, with fine-tuning experiments on LSF. | Datadog/Toto-Open-Base-1.0 |
Method Notes
Toto is a passive dynamics model for multivariate time series. It forecasts future observations from historical observations and does not model actions, control inputs, interventions, or counterfactual policy choices as first-class channels in the paper.
Datadog later extends this line with Toto 2.0, which keeps the observability forecasting framing while scaling the release into a model family with contiguous patch masking and broader claims on BOOM, GIFT-Eval, and TIME.
The model splits each series into non-overlapping patches over time, embeds the patches of each variate, and feeds them through a decoder-only Transformer stack that alternates mostly time-wise attention with a smaller number of variate-wise attention layers. The paper frames this proportional factorized attention as a way to preserve cross-variate structure while keeping inference practical for high-cardinality observability metrics.
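The alternation described above can be sketched as follows, assuming (since this note does not give the exact ratio or block internals) a toy 3:1 time-to-variate proportion and single-head attention without learned projections:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal):
    """Single-head attention without projections, for illustration only."""
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    if causal:
        L = x.shape[-2]
        mask = np.triu(np.ones((L, L), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores, axis=-1) @ x

def proportional_factorized_stack(x, n_blocks=8, variate_every=4):
    """Alternate causal time-wise attention with occasional variate-wise
    attention (the 3:1 proportion here is an assumed illustration).

    x has shape (variates, time_patches, dim)."""
    for i in range(n_blocks):
        if (i + 1) % variate_every == 0:
            # variate-wise: attend across variates at each time step
            x = self_attention(x.swapaxes(0, 1), causal=False).swapaxes(0, 1)
        else:
            # time-wise: causal attention along the patch axis per variate
            x = self_attention(x, causal=True)
    return x
```

Note that temporal causality survives the variate-wise layers, since those only mix channels within a single time step; the sketch's output at patch t is unaffected by inputs after t.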
The Student-T mixture output head makes Toto a probabilistic forecaster rather than only a point forecaster. The composite robust loss combines negative log likelihood with a robust point-prediction term to stabilize training on sparse, bursty, heavy-tailed observability data.
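A Student-T mixture likelihood and a composite loss of the kind described can be sketched in plain Python; the Huber point term and the `lam` weighting below are assumptions for illustration, since this note does not give the paper's exact robust loss:

```python
import math

def student_t_logpdf(x, df, loc, scale):
    """Log density of a Student-T with df degrees of freedom."""
    z = (x - loc) / scale
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1) / 2 * math.log1p(z * z / df))

def mixture_nll(x, weights, dfs, locs, scales):
    """Negative log likelihood of x under a Student-T mixture."""
    log_terms = [math.log(w) + student_t_logpdf(x, df, m, s)
                 for w, df, m, s in zip(weights, dfs, locs, scales)]
    top = max(log_terms)                    # log-sum-exp for stability
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))

def composite_loss(x, point_pred, weights, dfs, locs, scales,
                   lam=0.1, delta=1.0):
    """NLL plus a Huber point-prediction term (assumed form, not the
    paper's exact robust loss)."""
    err = abs(x - point_pred)
    huber = 0.5 * err * err if err <= delta else delta * (err - 0.5 * delta)
    return mixture_nll(x, weights, dfs, locs, scales) + lam * huber
```

Heavy tails are the point of the Student-T choice: small `df` keeps the likelihood from exploding on bursty outliers, while the robust point term keeps median-like behavior even when the density fit is poor.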
Evidence And Results
- BOOM: the paper reports Toto ahead of Moirai, TimesFM, Chronos-Bolt, Timer, Time-MoE, VisionTS, and naive baselines by normalized MASE, normalized CRPS, and average rank.
- GIFT-Eval: Toto is evaluated with the same inference settings as BOOM, using its native 4096-point context length.
- LSF: the paper reports both zero-shot and fine-tuned Toto results on ETTh1, ETTh2, ETTm1, ETTm2, Electricity, and Weather.
- Ablations identify causal scaling (the patch-based causal instance normalization) and the Student-T mixture head as especially important, with large NLL degradation when either is removed.
Limitations
- The paper is forecasting-centered; Toto is not presented as an action-conditioned world model for intervention or control reasoning.
- Observability training and benchmark data come from Datadog internal systems, so the paper provides scale and operational realism but only partial external visibility into the private corpus construction.
- BOOM uses Datadog observability metrics and preprocessing choices, so transfer to other monitoring stacks should be checked empirically.
- The benchmark claims depend on the chosen probabilistic metrics, context lengths, and official inference procedures for comparison models.
Links Into The Wiki
- Toto
- Time-Series Foundation Models
- Observability Time Series
- Time-Series Benchmark Hygiene
- TimesFM
- Moirai
- Chronos-2
- Time-MoE
- Toto 2.0
Open Questions
- How much of Toto’s advantage comes from observability-specific data scale versus the architecture changes?
- Can proportional factorized attention transfer cleanly to other high-cardinality multivariate time-series domains such as finance, telemetry, or physiology?
- What benchmark would test observability forecasting under operator actions, deployments, rollbacks, or autoscaling control inputs rather than passive metric prediction?