Graph Observability Dataset Benchmarks

Short Answer

The closest public dataset to “one whole service graph with time-varying node and edge features” is ChronoGraph. The closest compact RCA set with explicit per-case causal graph ground truth is ops-lite. The strongest ready-to-run RCA benchmark framework is RCAEval. The richest multimodal AD/RCA dataset is AnoMod. OpenRCA is primarily an LLM-agent telemetry investigation benchmark, and LEMMA-RCA is the largest multi-domain RCA collection. MicroSS provides useful AIOps system data, but its graph must be reconstructed from traces rather than being supplied as a single graph object.

Comparison Matrix

| Benchmark | Best Fit | System / Graph Structure | Main Inputs | Target Outputs | Reported Scale | Action Channel |
| --- | --- | --- | --- | --- | --- | --- |
| ChronoGraph | Graph multivariate time-series forecasting and anomaly detection | Explicit directed service dependency graph with temporal node and edge features | Node metric histories, edge metric histories, topology | Future service metrics; anomaly/disruption detections | 708 services, 1,529 edges, 5 node features, 8 edge features, 8,005 time steps, 17 incident segments | None; incidents are labels/exogenous events |
| MicroSS | AIOps system telemetry with traces/logs | Whole MicroSS scenario; service calls reconstructable from trace IDs and parent IDs | Metrics, traces, business logs, run/anomaly logs | Anomaly/fault localization labels; forecast/log-task targets | MicroSS reports >6,500 metrics and >7,000,000 logs; Companion Data has 406 KPI series | Anomaly injections are benchmark events |
| RCAEval | Reproducible RCA benchmark framework | Online Boutique, Sock Shop, Train Ticket; graph implicit in systems/traces | Metrics, logs, traces, anomaly timestamp | Ranked root-cause services and indicators | 735 cases, 11 fault types, 9 datasets under RE1/RE2/RE3, ~5.16 GB compressed | Fault injections are benchmark events |
| LEMMA-RCA | Large multi-domain RCA and transfer testing | Entity-level RCA over IT microservices and OT water systems | Entity metrics, logs, optional raw traces, normal history | Ranked root-cause entities and fault timestamps | Product Review 765 GB / 4 faults / 216 avg entities; Cloud Computing 540 GB / 6 faults / 168 avg entities | Fault scenarios are diagnostic events |
| OpenRCA | LLM/agent RCA over large telemetry | Telecom, Bank, Market; dependency trace graphs as a modality | Natural-language query, KPI time series, trace graphs, logs | Root-cause datetime, component, and reason | 335 failures, 3 enterprise systems, >68 GB telemetry | None; diagnostic only |
| AnoMod | Multimodal AD/RCA with behavior and code evidence | SocialNetwork and TrainTicket; traces expose service dependencies | Logs, metrics, distributed traces, API responses, code coverage | Anomaly detection; service/code-region RCA | 24 anomaly cases; TrainTicket 63,975 traces and 98,073 API requests; SocialNetwork 3,958.5K log lines | Controlled anomaly injections |
| ops-lite | Compact causal-graph RCA evaluation | Per-case causal service graph from fault contract | Normal/abnormal metric windows, injection metadata, environment snapshots, causal graph | Root-cause services and propagation path/graph scores | 500 cases across Train-Ticket, Hotel Reservation, OpenTelemetry Demo; mean path length 3.18 | Chaos injections are benchmark events |

Key Differences

ChronoGraph is graph-native and time-series-native. It is the best match for models that consume G=(V,E) plus temporal node and edge features and then forecast future numeric features.
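Concretely, such a model's input can be laid out as dense arrays. The shapes below follow ChronoGraph's reported dimensions (708 services, 1,529 edges, 5 node features, 8 edge features, 8,005 time steps), but the array layout and the placeholder forecaster are illustrative assumptions, not the dataset's actual file format or API:

```python
import numpy as np

# Sketch of a graph-temporal input layout; shapes follow ChronoGraph's
# reported dimensions, but the in-memory layout is an assumption.
T, N, E = 8005, 708, 1529   # time steps, services (nodes), dependency edges
F_V, F_E = 5, 8             # node features, edge features

edge_index = np.zeros((2, E), dtype=np.int64)   # (source, target) service ids
node_feats = np.zeros((T, N, F_V))              # per-service metric histories
edge_feats = np.zeros((T, E, F_E))              # per-edge metric histories

def forecast_stub(node_feats, edge_feats, edge_index, horizon=12):
    """Placeholder forecaster: repeat the last observed node features."""
    last = node_feats[-1]                                # (N, F_V)
    return np.repeat(last[None, :, :], horizon, axis=0)  # (horizon, N, F_V)

pred = forecast_stub(node_feats, edge_feats, edge_index)
```

Any model exposing this interface (temporal node tensor, temporal edge tensor, static edge index in, future node tensor out) can consume the dataset directly.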

RCAEval is benchmark-native. It is the best match for comparing many RCA algorithms under one framework, but a model must reconstruct or import graph context from traces/system knowledge if it wants graph-aware reasoning.
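Graph reconstruction from traces typically means walking span parent links and emitting caller-to-callee edges. A minimal sketch (the span field names `span_id`, `parent_id`, and `service` are illustrative assumptions, not RCAEval's schema):

```python
# Build a service dependency graph from trace spans by following parent links.
# Field names ("span_id", "parent_id", "service") are illustrative assumptions.
def service_edges(spans):
    by_id = {s["span_id"]: s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get(s["parent_id"])
        if parent and parent["service"] != s["service"]:
            edges.add((parent["service"], s["service"]))  # caller -> callee
    return edges

spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "cart"},
    {"span_id": "c", "parent_id": "b", "service": "redis"},
]
edges = service_edges(spans)  # {("frontend", "cart"), ("cart", "redis")}
```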

ops-lite is causal-graph-native. It is smaller than RCAEval, but every case has a causal_graph.json, which makes it attractive for graph/path scoring and propagation evaluation.
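Path scoring against per-case ground truth can be as simple as edge overlap between the predicted and labeled propagation paths. The JSON layout below is an assumed sketch for illustration, not ops-lite's documented `causal_graph.json` schema:

```python
import json

# Assumed per-case layout: {"root_cause": ..., "path": [svc, svc, ...]}.
# This is an illustrative schema, not ops-lite's documented format.
case = json.loads('{"root_cause": "db", "path": ["db", "cart", "frontend"]}')

def path_edge_f1(predicted_path, true_path):
    """F1 over the directed propagation edges of two paths."""
    pred = set(zip(predicted_path, predicted_path[1:]))
    true = set(zip(true_path, true_path[1:]))
    if not pred or not true:
        return 0.0
    precision = len(pred & true) / len(pred)
    recall = len(pred & true) / len(true)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = path_edge_f1(["db", "cart"], case["path"])  # 2/3: one of two true edges
```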

AnoMod is modality-native. It is less about continuous graph forecasting and more about testing whether logs, metrics, traces, API responses, and code coverage can be fused for AD/RCA.

OpenRCA is agent-native. Its input is not just a tensor or table; the benchmark asks an LLM/tool-using agent to inspect telemetry and emit a structured RCA answer.
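The required output is a structured record rather than a ranked list. A minimal sketch of that answer shape and an exact-match check, where the field names and matching rule are assumptions inferred from the three reported targets (datetime, component, reason):

```python
from dataclasses import dataclass

# The three target fields mirror OpenRCA's reported outputs; the exact
# field names and the matching rule here are illustrative assumptions.
@dataclass
class RCAAnswer:
    datetime: str   # when the failure occurred
    component: str  # faulty component/service
    reason: str     # fault category, e.g. "CPU exhaustion"

def exact_match(pred: RCAAnswer, gold: RCAAnswer) -> bool:
    return (pred.datetime == gold.datetime
            and pred.component == gold.component
            and pred.reason.lower() == gold.reason.lower())

gold = RCAAnswer("2024-03-01 14:05", "order-service", "CPU exhaustion")
pred = RCAAnswer("2024-03-01 14:05", "order-service", "cpu exhaustion")
```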

LEMMA-RCA is scale/domain-native. It is valuable for RCA transfer across IT and OT domains, but its microservice subsets are entity-centric rather than one canonical service graph tensor.

GAIA/MicroSS is historical AIOps corpus data. It is useful because it has metrics, traces, logs, and anomaly-injection records, but it mixes whole-system MicroSS with Companion Data single-series/log tasks.

Practical Selection

Use ChronoGraph for graph-temporal forecasting experiments, especially if the model API expects node features, edge features, and a dependency graph.

Use RCAEval when the goal is to compare RCA algorithms against established baselines and metrics.

Use ops-lite when the target is root-cause path or causal-graph scoring and a compact benchmark is more useful than a large telemetry corpus.

Use AnoMod when the experiment needs multimodal fusion or code-aware RCA.

Use OpenRCA when the target model is an LLM agent that can retrieve, inspect, and reason over telemetry rather than a pure time-series model.

Use LEMMA-RCA when cross-domain RCA transfer matters more than a single microservice graph schema.

Use MicroSS as AIOps context or as a trace/log/metric source for custom graph reconstruction.

World-Model Caveat

All seven benchmarks are still passive or diagnostic from an action-conditioned world-model perspective. They contain incidents, faults, anomaly injections, chaos injections, or diagnostic queries, but they do not expose logged operator decisions such as deploy, rollback, autoscale, reroute traffic, change feature flags, or apply remediations with downstream outcomes.
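For contrast, an action-conditioned dataset would need records like the hypothetical sketch below, pairing a logged operator decision with its downstream outcome. No benchmark in the matrix ships this; every field name here is illustrative:

```python
from dataclasses import dataclass, field

# Hypothetical record an action-conditioned world model would need.
# None of the seven benchmarks provide this; all fields are illustrative.
@dataclass
class OperatorActionRecord:
    timestamp: str
    action: str                 # "deploy", "rollback", "autoscale", ...
    target: str                 # service or component acted on
    parameters: dict = field(default_factory=dict)
    outcome_metrics: dict = field(default_factory=dict)  # post-action telemetry

rec = OperatorActionRecord(
    timestamp="2024-05-10T09:30:00Z",
    action="autoscale",
    target="checkout",
    parameters={"replicas": 6},
    outcome_metrics={"p99_latency_ms": 180.0},
)
```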