Toto 2.0 TSALM Workshop Presentation
Source
- Transcript: paper_toto-2-tsalm-2026.md
- Slides PDF: paper_toto-2-tsalm-2026.pdf
- Speaker: Othmane Abou-Amal, Datadog
- Original recording: /home/ipse/work/iclr2026/downloads/tsalm_workshop_39063681/tsalm_datadog_073212-080537_precise.mp4
- Related announcement article: Toto 2.0
Core Claim
This TSALM @ ICLR 2026 presentation is a primary talk source for Toto 2.0. It complements the Datadog announcement article with the verbal framing, training-recipe details, data-mix notes, and the roadmap toward a multimodal observability world model around the Toto 2.0 model family.
Key Contributions
- Frames Toto 1.0 as a BERT-like moment for time-series foundation models and Toto 2.0 as a GPT-2-like scaling moment for time-series forecasting.
- Presents Toto 2.0 as an open-weight model family spanning 4M to 2.5B parameters, with each larger size improving on the previous one along the presented scaling curve.
- Describes the architecture as a decoder-only patch Transformer that alternates time-axis causal attention and variate-axis full attention.
- Replaces the Toto 1.x output head with a quantile head trained by pinball loss, uses contiguous patch masking adapted from TiRex, adds residual MLP patch projections, and switches to arcsinh normalization.
- Uses NormMuon for matrix-shaped parameters and AdamW for the remaining parameters, plus UMuP transfer from a cheap proxy model so a single hyperparameter configuration trains all released sizes.
- Gives the training mix as 42% Datadog internal observability metrics and 58% TempoPFN-generated synthetic data, with no customer data and no public forecasting datasets in the final pretraining mix.
- Moves beyond passive forecasting by sketching a multimodal observability world model over metrics, logs, traces, topology, events, alerts, source code, and operational text.
- Introduces ARFBench as an incident-response time-series question-answering benchmark built from 63 internal Datadog incidents, 142 time series, 750 QA pairs, up to 2,000 variables, and up to 40,000 timesteps per series.
- Reports Toto-1.0-QA-Experimental, a hybrid model that wires Toto 1.0 time-series representations into a VLM for incident-response QA, as a first bridge from metric forecasting toward multimodal observability understanding.
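The arcsinh normalization mentioned in the architecture bullet can be illustrated with a minimal numpy sketch. This is not Toto 2.0's actual implementation; the function names and the `scale` parameter are illustrative. The point is that arcsinh behaves like a signed log for large magnitudes but stays smooth and defined through zero and negative values, which suits spiky, heavy-tailed observability metrics:

```python
import numpy as np

def arcsinh_norm(x, scale=1.0):
    """Compress heavy-tailed values: ~linear near 0, ~signed log far from 0."""
    return np.arcsinh(x / scale)

def arcsinh_denorm(z, scale=1.0):
    """Exact inverse of arcsinh_norm."""
    return np.sinh(z) * scale

x = np.array([-1e6, 0.0, 3.0, 1e6])  # a huge negative spike, zero, a normal value, a huge spike
z = arcsinh_norm(x)
x_back = arcsinh_denorm(z)
# the million-scale spikes map to a magnitude of roughly 14.5, and the
# transform round-trips, unlike log1p on negative inputs
```

Unlike a log transform, no epsilon or sign-splitting trick is needed for zeros or negatives, which is presumably part of the appeal for raw metric streams.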
Technical Notes
The talk is especially useful for the recipe around scaling. The speaker emphasizes that pinball-loss gradients are sign-valued, making the optimizer choice unusually important, and identifies NormMuon plus AdamW as the winning split in Toto 2.0 experiments.
The UMuP and REX discussion is also a durable implementation clue. The team sweeps architecture, data-frequency weighting, optimizer, and decay schedule on proxy models, then transfers the selected configuration across sizes without per-size tuning.
The data note matters for benchmark hygiene. The speaker says public forecasting data hurt performance in the data-selection sweep, while the final recipe uses Datadog-observing-Datadog observability metrics plus TempoPFN synthetic data. In Q&A, the speaker also says the sweep could have selected TempoPFN-only synthetic data, but did not; specialized observability metrics still helped. That separates Toto 2.0 from TSFMs that pretrain on broad public forecasting corpora, but it also makes the private-data component hard to audit externally.
The future section is one of the clearest local sources for Datadog’s world-model ambition. The speaker names time series plus logs as the first multimodal combination, then describes expansion toward topology and traces, with learned simulators for SRE agents, proactive alerting, and counterfactual analysis.
The ARFBench section should be treated as a different evaluation surface from forecasting. It tests incident-response questions such as anomaly timing, anomalous related series, and metric-failure relationships. The talk reports that frontier VLMs remain below domain experts, while an oracle combining human and model answers scores substantially higher, implying complementary error modes.
The Toto-1.0-QA-Experimental result should not be read as evidence that Toto 2.0 is already an action-conditioned world model. It is a hybrid QA bridge: time-series representations help a VLM answer incident questions, but the setup still does not model deployments, rollbacks, autoscaling changes, remediations, or other operator actions as controllable interventions.
Relation To The Article
Use Toto 2.0 as the canonical public announcement source for benchmark tables and released-artifact links. Use this presentation source for the spoken context around why the family was built, how the training recipe was chosen, and how Datadog connects Toto 2.0 to multimodal observability world models.
Limitations
- The transcript is machine-generated from workshop audio and may contain transcription errors in names, model names, and benchmark names.
- The slides PDF appears to be mostly visual; extracted text is sparse, so the rendered PDF should be treated as the presentation artifact.
- Some details were presented orally before or alongside the final blog article; when exact benchmark numbers differ, prefer the public article or leaderboard source.
- The recording is a local workshop artifact, not a polished technical report.
- The speaker explicitly says he is using “world model” loosely; keep Toto 2.0 itself classified as a passive forecasting model unless an action-conditioned successor source changes that.
- The talk refers to MWQL rank on BOOM, while the current article snapshot records CRPS rank, CRPS, and MASE. Keep the metric names source-specific until the BOOM leaderboard or final technical report resolves the naming.
- A final audience question challenges the comparability of an OpenTSLM-SP 1B entry in the ARFBench overall-accuracy chart; treat that chart as a mixed diagnostic comparison rather than a clean forecasting leaderboard.
Links Into The Wiki
- Toto
- Time-Series Foundation Models
- Time-Series Scaling And Efficiency
- Observability Time Series
- High-Dimensional Time Series Forecasting
- Synthetic Data For Time Series
- Time-Series Benchmark Hygiene
- World Models
Open Questions
- Which parts of the Toto 2.0 gain come from scale, observability data, TempoPFN synthetic data, optimizer choice, UMuP transfer, or contiguous patch masking?
- Can the time-series-plus-logs stage become a genuinely action-conditioned observability world model, or will it remain a richer passive context model until operator actions are added?
- How should public evaluations separate base zero-shot checkpoints, fine-tuned variants, ensembles, and hybrid Toto-representation-plus-VLM systems?