# Graph Observability Dataset Benchmarks
## Short Answer
The closest public dataset to “one whole service graph with time-varying node and edge features” is ChronoGraph. The closest compact RCA set with explicit per-case causal-graph ground truth is ops-lite. The strongest ready-to-run RCA benchmark framework is RCAEval. The richest multimodal AD/RCA dataset is AnoMod. OpenRCA is primarily an LLM-agent telemetry-investigation benchmark, and LEMMA-RCA is the largest multi-domain RCA collection. MicroSS offers useful AIOps system data, but its service graph must be reconstructed from traces rather than being supplied as a single graph object.
## Comparison Matrix
| Benchmark | Best Fit | System / Graph Structure | Main Inputs | Target Outputs | Reported Scale | Action Channel |
|---|---|---|---|---|---|---|
| ChronoGraph | Graph multivariate time-series forecasting and anomaly detection | Explicit directed service dependency graph with temporal node and edge features | Node metric histories, edge metric histories, topology | Future service metrics; anomaly/disruption detections | 708 services, 1529 edges, 5 node features, 8 edge features, 8005 time steps, 17 incident segments | None; incidents are labels/exogenous events |
| MicroSS | AIOps system telemetry with traces/logs | Whole MicroSS scenario; service calls reconstructable from trace IDs and parent IDs | Metrics, traces, business logs, run/anomaly logs | Anomaly/fault localization labels; forecast/log-task targets | MicroSS reports >6500 metrics and >7000000 logs; Companion Data has 406 KPI series | Anomaly injections are benchmark events |
| RCAEval | Reproducible RCA benchmark framework | Online Boutique, Sock Shop, Train Ticket; graph implicit in systems/traces | Metrics, logs, traces, anomaly timestamp | Ranked root-cause services and indicators | 735 cases, 11 fault types, 9 datasets under RE1/RE2/RE3, about 5.16 GB compressed | Fault injections are benchmark events |
| LEMMA-RCA | Large multi-domain RCA and transfer testing | Entity-level RCA over IT microservices and OT water systems | Entity metrics, logs, optional raw traces, normal history | Ranked root-cause entities and fault timestamps | Product Review: 765 GB, 4 faults, 216 avg entities; Cloud Computing: 540 GB, 6 faults, 168 avg entities | Fault scenarios are diagnostic events |
| OpenRCA | LLM/agent RCA over large telemetry | Telecom, Bank, Market; dependency trace graphs as a modality | Natural-language query, KPI time series, trace graphs, logs | Root-cause datetime, component, and reason | 335 failures, 3 enterprise systems, >68 GB telemetry | None; diagnostic only |
| AnoMod | Multimodal AD/RCA with behavior and code evidence | SocialNetwork and TrainTicket; traces expose service dependencies | Logs, metrics, distributed traces, API responses, code coverage | Anomaly detection; service/code-region RCA | 24 anomaly cases; TrainTicket 63975 traces and 98073 API requests; SocialNetwork 3958.5K log lines | Controlled anomaly injections |
| ops-lite | Compact causal-graph RCA evaluation | Per-case causal service graph from fault contract | Normal/abnormal metric windows, injection metadata, environment snapshots, causal graph | Root-cause services and propagation path/graph scores | 500 cases across Train-Ticket, Hotel Reservation, OpenTelemetry Demo; mean path 3.18 | Chaos injections are benchmark events |
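To make the ChronoGraph row concrete, a graph-temporal sample of this kind can be held as dense arrays plus an edge index. This is a sketch only: the shapes follow the reported scale (708 services, 1529 edges, 5 node features, 8 edge features), but the dataset's actual on-disk layout is an assumption, and the forecast shown is just a naive last-value baseline.

```python
import numpy as np

# Shapes follow the reported ChronoGraph scale; the time window is
# shortened here, and the real on-disk format may differ.
T, N, E, F_V, F_E = 32, 708, 1529, 5, 8
rng = np.random.default_rng(0)

node_feats = rng.normal(size=(T, N, F_V))     # per-service metric histories
edge_feats = rng.normal(size=(T, E, F_E))     # per-dependency metric histories
edge_index = rng.integers(0, N, size=(2, E))  # directed (src, dst) node ids

def persistence_forecast(x, horizon):
    """Naive last-value baseline: repeat the final time step `horizon` times."""
    return np.repeat(x[-1:], horizon, axis=0)

pred = persistence_forecast(node_feats, horizon=4)
assert pred.shape == (4, N, F_V)
```

Any graph-temporal model with this input contract (node tensor, edge tensor, topology) can be benchmarked against the persistence baseline above.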
## Key Differences
- **ChronoGraph** is graph-native and time-series-native. It is the best match for models that consume G=(V,E) plus temporal node and edge features and then forecast future numeric features.
- **RCAEval** is benchmark-native. It is the best match for comparing many RCA algorithms under one framework, but a model must reconstruct or import graph context from traces or system knowledge if it wants graph-aware reasoning.
- **ops-lite** is causal-graph-native. It is smaller than RCAEval, but every case ships a causal_graph.json, which makes it attractive for graph/path scoring and propagation evaluation.
- **AnoMod** is modality-native. It is less about continuous graph forecasting and more about testing whether logs, metrics, traces, API responses, and code coverage can be fused for AD/RCA.
- **OpenRCA** is agent-native. Its input is not just a tensor or table; the benchmark asks an LLM or tool-using agent to inspect telemetry and emit a structured RCA answer.
- **LEMMA-RCA** is scale- and domain-native. It is valuable for RCA transfer across IT and OT domains, but its microservice subsets are entity-centric rather than one canonical service-graph tensor.
- **GAIA/MicroSS** is a historical AIOps corpus. It is useful because it has metrics, traces, logs, and anomaly-injection records, but it mixes whole-system MicroSS data with Companion Data single-series and log tasks.
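Several of these benchmarks (RCAEval, MicroSS) leave the service graph implicit in traces, so graph-aware methods must first derive caller-to-callee edges from span parentage. A minimal sketch of that reconstruction, with hypothetical field names (real trace schemas name and nest these fields differently):

```python
from collections import defaultdict

# Hypothetical flat span records; field names are illustrative only.
spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "cart"},
    {"span_id": "c", "parent_id": "a", "service": "checkout"},
    {"span_id": "d", "parent_id": "c", "service": "payment"},
]

def service_graph(spans):
    """Directed caller -> callee edge counts, derived by joining each
    span to its parent span and reading off the two service names."""
    by_id = {s["span_id"]: s for s in spans}
    edges = defaultdict(int)
    for s in spans:
        parent = by_id.get(s["parent_id"])
        if parent is not None and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return dict(edges)

print(service_graph(spans))
# -> {('frontend', 'cart'): 1, ('frontend', 'checkout'): 1, ('checkout', 'payment'): 1}
```

Edge counts double as a cheap edge weight (call volume) when the reconstructed graph feeds a downstream RCA or forecasting model.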
## Practical Selection
- Use **ChronoGraph** for graph-temporal forecasting experiments, especially if the model API expects node features, edge features, and a dependency graph.
- Use **RCAEval** when the goal is to compare RCA algorithms against established baselines and metrics.
- Use **ops-lite** when the target is root-cause path or causal-graph scoring and a compact benchmark is more useful than a large telemetry corpus.
- Use **AnoMod** when the experiment needs multimodal fusion or code-aware RCA.
- Use **OpenRCA** when the target model is an LLM agent that can retrieve, inspect, and reason over telemetry rather than a pure time-series model.
- Use **LEMMA-RCA** when cross-domain RCA transfer matters more than a single microservice graph schema.
- Use **MicroSS** as AIOps context or as a trace/log/metric source for custom graph reconstruction.
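For the ops-lite-style path-and-graph scoring mentioned above, one simple metric is edge-set Jaccard overlap between a predicted propagation path and the per-case ground-truth causal graph. The JSON layout below is an assumption for illustration; the actual causal_graph.json schema may name its fields differently.

```python
import json

# Assumed minimal layout for a per-case causal_graph.json.
doc = json.loads(
    '{"nodes": ["ts-ui", "ts-order", "ts-payment"],'
    ' "edges": [["ts-payment", "ts-order"], ["ts-order", "ts-ui"]]}'
)
truth_edges = {tuple(e) for e in doc["edges"]}

def path_edge_jaccard(pred_path, truth_edges):
    """Score a predicted propagation path (list of services) by
    edge-set Jaccard overlap with the ground-truth causal graph."""
    pred_edges = set(zip(pred_path, pred_path[1:]))
    union = pred_edges | truth_edges
    return len(pred_edges & truth_edges) / len(union) if union else 1.0

assert path_edge_jaccard(["ts-payment", "ts-order", "ts-ui"], truth_edges) == 1.0
assert path_edge_jaccard(["ts-payment", "ts-ui"], truth_edges) == 0.0
```

Edge-level overlap penalizes skipped hops (the second assertion), which a node-set metric would miss.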
## World-Model Caveat
All seven benchmarks are still passive or diagnostic from an action-conditioned world-model perspective. They contain incidents, faults, anomaly injections, chaos injections, or diagnostic queries, but they do not expose logged operator decisions such as deploy, rollback, autoscale, reroute traffic, change feature flags, or apply remediations with downstream outcomes.
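For contrast, the record type these benchmarks lack might look like the sketch below: a logged operator action paired with pre- and post-action system state, which an action-conditioned world model would need for training. All field names are illustrative, not drawn from any of the seven benchmarks.

```python
from dataclasses import dataclass

@dataclass
class ActionRecord:
    """Hypothetical action-conditioned log entry; no benchmark above ships these."""
    timestamp: str
    action: str      # e.g. "rollback", "autoscale", "reroute"
    target: str      # affected service or component
    pre_state: dict  # telemetry snapshot before the action
    post_state: dict # downstream outcome after the action

rec = ActionRecord("2024-05-01T12:31:00Z", "rollback", "checkout",
                   {"error_rate": 0.21}, {"error_rate": 0.02})
assert rec.post_state["error_rate"] < rec.pre_state["error_rate"]
```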