BOOM: Benchmark of Observability Metrics

Source

Dataset metadata snapshot: source.md
Metadata JSON: metadata.json
Official Hugging Face: https://huggingface.co/datasets/Datadog/BOOM
Official leaderboard: https://huggingface.co/spaces/Datadog/BOOM
Official code: https://github.com/DataDog/toto/tree/main/boom
Introducing paper: Toto

Core Claim

BOOM is Datadog’s observability forecasting benchmark for evaluating models on high-cardinality operational metrics. It is the main dataset-side reason that Toto belongs in high-dimensional forecasting discussions, even though its dimensionality regime is smaller than Time-HD’s.

Dataset Notes

BOOM contains about 350 million time-series points across 2,807 metric queries.
The Hugging Face card reports 32,887 variates, with each dataset entry containing one metric query and up to 100 variates.
Metric-query groups become related variates in one multivariate time series.
Domain labels include application usage, infrastructure, database, networking, and security.
Metric types include gauge, rate, distribution, and count.
The Toto paper also defines BOOMlet as a smaller representative subset with 32 metric queries, 1,627 variates, and about 23 million observation points.
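
The counts above imply some rough averages that help place BOOM's dimensionality. The Python sketch below only does arithmetic on the figures quoted in these notes. It assumes the point totals count observations summed over all variates, and since those totals are rounded ("about 350 million", "about 23 million"), the per-variate lengths are approximate. The names `BOOM`, `BOOMLET`, and `summarize` are illustrative and not part of any BOOM tooling.

```python
# Figures quoted in the notes above; point counts are rounded approximations.
BOOM = {"queries": 2_807, "variates": 32_887, "points": 350e6}
BOOMLET = {"queries": 32, "variates": 1_627, "points": 23e6}


def summarize(stats):
    """Return average variates per query and average points per variate."""
    return {
        "variates_per_query": stats["variates"] / stats["queries"],
        "points_per_variate": stats["points"] / stats["variates"],
    }


boom_summary = summarize(BOOM)
boomlet_summary = summarize(BOOMLET)

# Share of the full benchmark covered by BOOMlet, per count type.
boomlet_share = {key: BOOMLET[key] / BOOM[key] for key in BOOM}

for name, summary in (("BOOM", boom_summary), ("BOOMlet", boomlet_summary)):
    print(f"{name}: {summary['variates_per_query']:.1f} variates/query, "
          f"{summary['points_per_variate']:,.0f} points/variate")
```

Under these figures BOOM averages roughly 12 variates per query, well below the 100-variate cap, while BOOMlet averages about 51, so the subset skews toward wider queries.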

Why It Matters

BOOM is the strongest current dataset anchor for observability-style high-dimensional forecasting in this repository. It captures high cardinality, nonstationarity, missing intervals, sparse spikes, heavy tails, and scale changes in grouped metric series.
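
To make two of these properties concrete, here is a minimal sketch of scoring a forecast on series with missing intervals and very different scales. It uses a masked, scale-normalized absolute error in the spirit of MASE. This is an illustrative assumption, not BOOM's official evaluation code; `masked_scaled_mae` is a hypothetical helper and the toy values are invented for the example.

```python
import math


def masked_scaled_mae(history, actual, forecast):
    """Mean absolute error over observed targets, scaled by the naive
    one-step error on the observed part of the history.

    Missing observations are represented as float('nan') and skipped.
    Returns nan when no scale or no observed target is available.
    """
    # Naive one-step differences, using only adjacent pairs that are both observed.
    diffs = [abs(b - a) for a, b in zip(history, history[1:])
             if not (math.isnan(a) or math.isnan(b))]
    # Forecast errors, skipping target steps that were never observed.
    errors = [abs(f - y) for y, f in zip(actual, forecast) if not math.isnan(y)]
    if not diffs or not errors:
        return math.nan
    scale = sum(diffs) / len(diffs)
    if scale == 0:
        return math.nan
    return (sum(errors) / len(errors)) / scale


nan = math.nan
# Two hypothetical variates from one metric query, at very different scales.
small = masked_scaled_mae([1.0, 2.0, nan, 4.0, 5.0], [6.0, nan, 8.0], [6.5, 7.0, 7.5])
large = masked_scaled_mae([1000.0, 2000.0, nan, 4000.0, 5000.0],
                          [6000.0, nan, 8000.0], [6500.0, 7000.0, 7500.0])
print(small, large)
```

Both toy variates receive the same score despite a 1000x scale gap, and the missing target step contributes nothing to the error, which is the behavior a grouped, gap-prone metric benchmark needs from its metric.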

Limitations

BOOM is passive forecasting data. It does not include deployments, rollbacks, autoscaling changes, traffic-control commands, remediations, or other operator actions as forecast-conditioning channels.
It comes from Datadog’s internal pre-production monitoring, so transfer to other observability stacks should be checked empirically.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Native multivariate encoding and high-channel scaling	partially closes	Groups metric-query results into related variates and covers 32,887 variates across 2,807 queries.	Entries are capped at query groups rather than whole-system telemetry graphs.
Benchmarks: what level of modeling is tested?	partially closes	Tests observability-style passive forecasting with missing intervals, heavy tails, sparse spikes, and scale changes.	Does not evaluate RCA, control utility, counterfactuals, or intervention-aware forecasts.
Causal structure, counterfactuals, and control	insufficient evidence	Dataset metadata explicitly records no operator actions or interventions.	Needs deployments, rollbacks, autoscaling, remediation, and traffic-control channels.
