# Graph Observability Dataset Benchmarks
## Short Answer
The closest public dataset to “one whole service graph with time-varying node and edge features” is ChronoGraph. The closest compact RCA set with explicit per-case causal-graph ground truth is ops-lite. The strongest ready-to-run RCA benchmark framework is RCAEval. The richest multimodal AD/RCA dataset is AnoMod. OpenRCA is primarily an LLM-agent telemetry-investigation benchmark, and LEMMA-RCA is the largest multi-domain RCA collection. MicroSS offers useful AIOps system data, but its service graph must be reconstructed from traces rather than being supplied as a single graph object.
## Comparison Matrix
| Benchmark | Best Fit | System / Graph Structure | Main Inputs | Target Outputs | Reported Scale | Action Channel |
|---|---|---|---|---|---|---|
| ChronoGraph | Graph multivariate time-series forecasting and anomaly detection | Explicit directed service dependency graph with temporal node and edge features | Node metric histories, edge metric histories, topology | Future service metrics; anomaly/disruption detections | 708 services, 1529 edges, 5 node features, 8 edge features, 8005 time steps, 17 incident segments | None; incidents are labels/exogenous events |
| MicroSS | AIOps system telemetry with traces/logs | Whole MicroSS scenario; service calls reconstructable from trace IDs and parent IDs | Metrics, traces, business logs, run/anomaly logs | Anomaly/fault localization labels; forecast/log-task targets | MicroSS reports >6500 metrics and >7000000 logs; Companion Data has 406 KPI series | Anomaly injections are benchmark events |
| RCAEval | Reproducible RCA benchmark framework | Online Boutique, Sock Shop, Train Ticket; graph implicit in systems/traces | Metrics, logs, traces, anomaly timestamp | Ranked root-cause services and indicators | 735 cases, 11 fault types, 9 datasets under RE1/RE2/RE3, about 5.16 GB compressed | Fault injections are benchmark events |
| LEMMA-RCA | Large multi-domain RCA and transfer testing | Entity-level RCA over IT microservices and OT water systems | Entity metrics, logs, optional raw traces, normal history | Ranked root-cause entities and fault timestamps | Product Review: 765 GB, 4 faults, 216 avg entities; Cloud Computing: 540 GB, 6 faults, 168 avg entities | Fault scenarios are diagnostic events |
| OpenRCA | LLM/agent RCA over large telemetry | Telecom, Bank, Market; dependency trace graphs as a modality | Natural-language query, KPI time series, trace graphs, logs | Root-cause datetime, component, and reason | 335 failures, 3 enterprise systems, >68 GB telemetry | None; diagnostic only |
| AnoMod | Multimodal AD/RCA with behavior and code evidence | SocialNetwork and TrainTicket; traces expose service dependencies | Logs, metrics, distributed traces, API responses, code coverage | Anomaly detection; service/code-region RCA | 24 anomaly cases; TrainTicket 63975 traces and 98073 API requests; SocialNetwork 3958.5K log lines | Controlled anomaly injections |
| ops-lite | Compact causal-graph RCA evaluation | Per-case causal service graph from fault contract | Normal/abnormal metric windows, injection metadata, environment snapshots, causal graph | Root-cause services and propagation path/graph scores | 500 cases across Train-Ticket, Hotel Reservation, OpenTelemetry Demo; mean path 3.18 | Chaos injections are benchmark events |
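To make the ChronoGraph row concrete, a graph-temporal sample of this kind can be held as dense arrays plus an edge index. This is a sketch only: the shapes follow the reported scale (708 services, 1529 edges, 5 node features, 8 edge features), but the dataset's actual on-disk layout is an assumption, and the forecast shown is just a naive last-value baseline.

```python
import numpy as np

# Shapes follow the reported ChronoGraph scale; the time window is
# shortened here, and the real on-disk format may differ.
T, N, E, F_V, F_E = 32, 708, 1529, 5, 8
rng = np.random.default_rng(0)

node_feats = rng.normal(size=(T, N, F_V))     # per-service metric histories
edge_feats = rng.normal(size=(T, E, F_E))     # per-dependency metric histories
edge_index = rng.integers(0, N, size=(2, E))  # directed (src, dst) node ids

def persistence_forecast(x, horizon):
    """Naive last-value baseline: repeat the final time step `horizon` times."""
    return np.repeat(x[-1:], horizon, axis=0)

pred = persistence_forecast(node_feats, horizon=4)
assert pred.shape == (4, N, F_V)
```

Any graph-temporal model with this input contract (node tensor, edge tensor, topology) can be benchmarked against the persistence baseline above.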
## Key Differences
- **ChronoGraph** is graph-native and time-series-native. It is the best match for models that consume G=(V,E) plus temporal node and edge features and then forecast future numeric features.
- **RCAEval** is benchmark-native. It is the best match for comparing many RCA algorithms under one framework, but a model must reconstruct or import graph context from traces or system knowledge if it wants graph-aware reasoning.
- **ops-lite** is causal-graph-native. It is smaller than RCAEval, but every case ships a causal_graph.json, which makes it attractive for graph/path scoring and propagation evaluation.
- **AnoMod** is modality-native. It is less about continuous graph forecasting and more about testing whether logs, metrics, traces, API responses, and code coverage can be fused for AD/RCA.
- **OpenRCA** is agent-native. Its input is not just a tensor or table; the benchmark asks an LLM or tool-using agent to inspect telemetry and emit a structured RCA answer.
- **LEMMA-RCA** is scale- and domain-native. It is valuable for RCA transfer across IT and OT domains, but its microservice subsets are entity-centric rather than one canonical service-graph tensor.
- **GAIA/MicroSS** is a historical AIOps corpus. It is useful because it has metrics, traces, logs, and anomaly-injection records, but it mixes whole-system MicroSS data with Companion Data single-series and log tasks.
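Several of these benchmarks (RCAEval, MicroSS) leave the service graph implicit in traces, so graph-aware methods must first derive caller-to-callee edges from span parentage. A minimal sketch of that reconstruction, with hypothetical field names (real trace schemas name and nest these fields differently):

```python
from collections import defaultdict

# Hypothetical flat span records; field names are illustrative only.
spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "cart"},
    {"span_id": "c", "parent_id": "a", "service": "checkout"},
    {"span_id": "d", "parent_id": "c", "service": "payment"},
]

def service_graph(spans):
    """Directed caller -> callee edge counts, derived by joining each
    span to its parent span and reading off the two service names."""
    by_id = {s["span_id"]: s for s in spans}
    edges = defaultdict(int)
    for s in spans:
        parent = by_id.get(s["parent_id"])
        if parent is not None and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return dict(edges)

print(service_graph(spans))
# -> {('frontend', 'cart'): 1, ('frontend', 'checkout'): 1, ('checkout', 'payment'): 1}
```

Edge counts double as a cheap edge weight (call volume) when the reconstructed graph feeds a downstream RCA or forecasting model.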
## Practical Selection
- Use **ChronoGraph** for graph-temporal forecasting experiments, especially if the model API expects node features, edge features, and a dependency graph.
- Use **RCAEval** when the goal is to compare RCA algorithms against established baselines and metrics.
- Use **ops-lite** when the target is root-cause path or causal-graph scoring and a compact benchmark is more useful than a large telemetry corpus.
- Use **AnoMod** when the experiment needs multimodal fusion or code-aware RCA.
- Use **OpenRCA** when the target model is an LLM agent that can retrieve, inspect, and reason over telemetry rather than a pure time-series model.
- Use **LEMMA-RCA** when cross-domain RCA transfer matters more than a single microservice graph schema.
- Use **MicroSS** as AIOps context or as a trace/log/metric source for custom graph reconstruction.
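For the ops-lite-style path-and-graph scoring mentioned above, one simple metric is edge-set Jaccard overlap between a predicted propagation path and the per-case ground-truth causal graph. The JSON layout below is an assumption for illustration; the actual causal_graph.json schema may name its fields differently.

```python
import json

# Assumed minimal layout for a per-case causal_graph.json.
doc = json.loads(
    '{"nodes": ["ts-ui", "ts-order", "ts-payment"],'
    ' "edges": [["ts-payment", "ts-order"], ["ts-order", "ts-ui"]]}'
)
truth_edges = {tuple(e) for e in doc["edges"]}

def path_edge_jaccard(pred_path, truth_edges):
    """Score a predicted propagation path (list of services) by
    edge-set Jaccard overlap with the ground-truth causal graph."""
    pred_edges = set(zip(pred_path, pred_path[1:]))
    union = pred_edges | truth_edges
    return len(pred_edges & truth_edges) / len(union) if union else 1.0

assert path_edge_jaccard(["ts-payment", "ts-order", "ts-ui"], truth_edges) == 1.0
assert path_edge_jaccard(["ts-payment", "ts-ui"], truth_edges) == 0.0
```

Edge-level overlap penalizes skipped hops (the second assertion), which a node-set metric would miss.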
## World-Model Caveat
All seven benchmarks are still passive or diagnostic from an action-conditioned world-model perspective. They contain incidents, faults, anomaly injections, chaos injections, or diagnostic queries, but they do not expose logged operator decisions such as deploy, rollback, autoscale, reroute traffic, change feature flags, or apply remediations with downstream outcomes.
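For contrast, the record type these benchmarks lack might look like the sketch below: a logged operator action paired with pre- and post-action system state, which an action-conditioned world model would need for training. All field names are illustrative, not drawn from any of the seven benchmarks.

```python
from dataclasses import dataclass

@dataclass
class ActionRecord:
    """Hypothetical action-conditioned log entry; no benchmark above ships these."""
    timestamp: str
    action: str      # e.g. "rollback", "autoscale", "reroute"
    target: str      # affected service or component
    pre_state: dict  # telemetry snapshot before the action
    post_state: dict # downstream outcome after the action

rec = ActionRecord("2024-05-01T12:31:00Z", "rollback", "checkout",
                   {"error_rate": 0.21}, {"error_rate": 0.02})
assert rec.post_state["error_rate"] < rec.pre_state["error_rate"]
```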