LEMMA-RCA
Source
- Dataset metadata snapshot: lemma-rca-2024
- Official website: https://lemma-rca.github.io/
- Official code: https://github.com/lemma-rca/rca_baselines
- Official Hugging Face organization: https://huggingface.co/Lemma-RCA-NEC
- arXiv: https://arxiv.org/abs/2406.05375
- OpenReview: https://openreview.net/forum?id=0R8JUzjSdq
Core Claim
LEMMA-RCA is a large, multi-modal, multi-domain dataset collection for root cause analysis (RCA). It spans two domains: IT operations (microservice systems) and OT (water treatment and distribution systems).
Dataset Notes
- The four public dataset families are Product Review, Cloud Computing, SWaT, and WADI.
- Product Review and Cloud Computing are the microservice-relevant subsets.
- The website reports Product Review at 765G, with 4 faults and an average of 216 entities per fault.
- The website reports Cloud Computing at 540G, with 6 faults and an average of 168 entities per fault.
- The paper reports more than 100,000 timestamps, millions of log-event records, fault timestamps, and root-cause entity labels.
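The figures above can be kept in a small lookup table when choosing subsets for an experiment. The sketch below is a minimal illustration, not part of the official tooling: the `Subset` class, its field names, and the `microservice_subsets` helper are hypothetical. SWaT and WADI are left unfilled because these notes give no figures for them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Subset:
    """One dataset family, with the statistics quoted in these notes."""
    name: str
    domain: str                            # "IT" (microservices) or "OT" (water systems)
    size_g: Optional[int]                  # size as reported on the website, in "G"
    faults: Optional[int]                  # number of fault scenarios
    avg_entities_per_fault: Optional[int]  # average entities per fault


# Only values stated in the notes are filled in; unknown values stay None.
SUBSETS = [
    Subset("Product Review", "IT", 765, 4, 216),
    Subset("Cloud Computing", "IT", 540, 6, 168),
    Subset("SWaT", "OT", None, None, None),
    Subset("WADI", "OT", None, None, None),
]


def microservice_subsets(subsets: List[Subset]) -> List[Subset]:
    """Return the subsets that belong to the IT (microservice) domain."""
    return [s for s in subsets if s.domain == "IT"]


it_subsets = microservice_subsets(SUBSETS)
print([s.name for s in it_subsets])
print(sum(s.faults for s in it_subsets))
```

Keeping unknown values as `None` rather than guessing makes it obvious which statistics still need to be looked up in the paper or on the website.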
Reported Baselines
The paper reports eight baselines: PC, Dynotears, C-LSTM, GOLEM, REASON, Nezha, MULAN, and CORAL. The repository text mentions six baseline methods in places, so prefer the paper for the count.
Why It Matters
LEMMA-RCA is the largest multi-domain RCA collection in this comparison. It is especially relevant when testing whether a method transfers across IT and OT operations and across single-modal versus multi-modal RCA settings.
Gotchas
- The benchmark is entity-centric and causal-graph-oriented, but it is not packaged as one ChronoGraph-style topology plus temporal edge-feature tensor.
- License notes conflict: the License text on the website and in the README says CC BY-ND 4.0, while the Hugging Face metadata and one README paragraph say CC BY-NC 4.0.
- Fault scenarios are diagnostic events, not logged operator actions.
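To make the entity-centric framing concrete, the sketch below models one fault case as a fault timestamp plus a set of labeled root-cause entities, then checks whether a method's ranked entity list contains a labeled root cause in its top k. This is an illustration under stated assumptions: the `FaultCase` class, its field names, the entity names, the timestamp value, and the `hit_at_k` helper are all hypothetical and do not reflect the dataset's actual file layout or the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence


@dataclass(frozen=True)
class FaultCase:
    """Hypothetical record for one fault scenario (not the real schema)."""
    fault_timestamp: int          # when the fault occurred (placeholder units)
    root_causes: FrozenSet[str]   # labeled root-cause entities


def hit_at_k(case: FaultCase, ranked_entities: Sequence[str], k: int) -> bool:
    """Return True if any labeled root cause appears in the top-k ranking."""
    return any(entity in case.root_causes for entity in ranked_entities[:k])


# Made-up example: one labeled root cause, ranked second by a method.
case = FaultCase(fault_timestamp=1_700_000_000, root_causes=frozenset({"pod-a"}))
ranking = ["pod-c", "pod-a", "pod-b"]

print(hit_at_k(case, ranking, 1))
print(hit_at_k(case, ranking, 2))
```

Nothing in this representation requires a fixed topology or an edge-feature tensor, which is the contrast the first gotcha draws.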