# LEMMA-RCA

Canonical source: <https://lemma-rca.github.io/>
Official code: <https://github.com/lemma-rca/rca_baselines>
Official datasets: <https://huggingface.co/Lemma-RCA-NEC>
Introducing source: [LEMMA-RCA](../../wiki/sources/lemma-rca-2024.md)

## Dataset Type

LEMMA-RCA is a large multi-modal, multi-domain dataset collection for root cause analysis. It spans both IT operations and OT (operational technology) operations. For the graph/microservice comparison, the relevant subsets are Product Review and Cloud Computing; the collection also includes the SWaT and WADI water-system datasets.

## System Structure

The IT subsets are whole microservice platforms with hundreds of system entities. The paper frames RCA as identifying the top system entities most relevant to system KPIs when a fault occurs. The benchmark is entity-centric and causal-graph-oriented, but it does not expose a single universal ChronoGraph-style service topology file across all datasets.

## Reported Scale

The website reports:

- Product Review: 765G original size, 4 faults, average 216 entities per fault.
- Cloud Computing: 540G original size, 6 faults, average 168 entities per fault.
- SWaT: 236M original size, 16 faults, average 51 entities per fault.
- WADI: 848M original size, 9 faults, average 123 entities per fault.

The paper also reports more than 100,000 metric timestamps and millions of log-event records across the dataset collection.
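The website figures listed above can be kept as a small structured table so that totals are easy to recompute. The sketch below is illustrative only: the `SubsetScale` class and helper names are hypothetical, the sizes are stored as the strings printed on the website (no unit conversion is attempted), and the IT/OT assignment follows the Dataset Type section.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SubsetScale:
    """One row of the website's scale table (hypothetical container)."""
    name: str
    domain: str                  # "IT" or "OT", per the Dataset Type section
    original_size: str           # kept verbatim as printed on the website
    faults: int
    avg_entities_per_fault: int


REPORTED = [
    SubsetScale("Product Review", "IT", "765G", 4, 216),
    SubsetScale("Cloud Computing", "IT", "540G", 6, 168),
    SubsetScale("SWaT", "OT", "236M", 16, 51),
    SubsetScale("WADI", "OT", "848M", 9, 123),
]


def total_faults(subsets: Iterable[SubsetScale], domain: Optional[str] = None) -> int:
    """Sum fault counts, optionally restricted to one domain."""
    return sum(s.faults for s in subsets if domain is None or s.domain == domain)


def fault_weighted_mean_entities(subsets: Iterable[SubsetScale]) -> float:
    """Average entities per fault across subsets, weighted by fault count."""
    rows = list(subsets)
    return sum(s.faults * s.avg_entities_per_fault for s in rows) / total_faults(rows)
```

With these figures, `total_faults(REPORTED)` gives 35 faults overall, of which 10 are in the two IT subsets.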

## Data Structure

For IT operations, the raw data consists of JSON files for metrics, logs, and traces. The preprocessed data contains metric data and unstructured log data extracted per pod. The benchmark code includes preprocessing for both the IT and OT domains and evaluates methods under single-modal, multi-modal, offline, and online settings.
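As a rough illustration of what "per pod" extraction means, the sketch below groups flat JSON metric records into one time-ordered series per pod. It is not the benchmark's preprocessing code: the record layout and the field names `pod`, `timestamp`, and `value` are assumptions made for the example, and the real raw files should be inspected before adapting it.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple


def group_metrics_by_pod(
    records: Iterable[Mapping],
    pod_key: str = "pod",            # hypothetical field name
    time_key: str = "timestamp",     # hypothetical field name
    value_key: str = "value",        # hypothetical field name
) -> Dict[str, List[Tuple[float, float]]]:
    """Group flat metric records into one time-sorted series per pod."""
    series = defaultdict(list)
    for rec in records:
        series[rec[pod_key]].append((rec[time_key], rec[value_key]))
    # Sort each pod's points by timestamp so downstream code sees ordered series.
    return {pod: sorted(points) for pod, points in series.items()}


# Tiny made-up payload standing in for one raw metrics JSON file.
raw_text = """
[
  {"pod": "frontend-0", "timestamp": 20, "value": 0.7},
  {"pod": "cart-1",     "timestamp": 10, "value": 0.2},
  {"pod": "frontend-0", "timestamp": 10, "value": 0.4}
]
"""
per_pod = group_metrics_by_pod(json.loads(raw_text))
```

Keeping one ordered series per pod matches the entity-centric framing of the benchmark, where each pod is a candidate root-cause entity.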

## Inputs And Outputs

Inputs are entity metrics, logs, optional raw traces in the IT releases, and historical normal data for online settings. Outputs are ranked root-cause entities and fault timestamps, evaluated with precision-at-k, MRR, and MAP-at-k style metrics.
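The ranking metrics named above can be sketched as follows. The definitions are an assumption: they follow a formulation common in the RCA literature (precision-at-k normalised by `min(k, |ground truth|)`, MAP-at-k as the mean of precision-at-j for j from 1 to k, and reciprocal rank of the first correct entity). The benchmark's own evaluation code remains the authority on exact definitions, and all function names here are hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], truth: Set[str], k: int) -> float:
    """Fraction of true root causes found in the top k, capped at min(k, |truth|)."""
    if k <= 0 or not truth:
        raise ValueError("k must be positive and truth must be non-empty")
    hits = sum(1 for entity in ranked[:k] if entity in truth)
    return hits / min(k, len(truth))


def map_at_k(ranked: Sequence[str], truth: Set[str], k: int) -> float:
    """Mean of precision_at_k over cutoffs 1..k for a single fault case."""
    return sum(precision_at_k(ranked, truth, j) for j in range(1, k + 1)) / k


def reciprocal_rank(ranked: Sequence[str], truth: Set[str]) -> float:
    """1 / rank of the first true root cause, or 0.0 if none is ranked."""
    for rank, entity in enumerate(ranked, start=1):
        if entity in truth:
            return 1.0 / rank
    return 0.0


def mean(values: Sequence[float]) -> float:
    """Average a per-case metric over fault cases (yields MRR from reciprocal ranks)."""
    return sum(values) / len(values)


# Made-up example: one fault case with a single true root-cause entity.
ranking = ["pod-b", "pod-a", "pod-c"]
root_causes = {"pod-a"}
```

For this made-up case the true entity sits at rank 2, so precision-at-1 is 0, precision-at-2 is 1, and the reciprocal rank is 0.5. Averaging each per-case value over all fault cases gives the dataset-level score.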

## Reported Baselines

The paper reports eight causal-discovery or causal-graph-based RCA baselines: PC, Dynotears, C-LSTM, GOLEM, REASON, Nezha, MULAN, and CORAL. Some repository text still says six baseline methods, so prefer the paper when counting reported baselines.

## Actions Or Interventions

Fault scenarios serve as diagnosis labels for injected or observed fault events. The datasets do not provide a logged channel of operator actions or remediation policies.

## Access And License Notes

The official website and the README license section say CC BY-ND 4.0, while the Hugging Face metadata and one README paragraph say CC BY-NC 4.0. Treat the dataset license as ambiguous until it is confirmed against a pinned release or with the maintainers.
