LEMMA-RCA

Source

Core Claim

LEMMA-RCA is a large, multi-modal, multi-domain dataset collection for root cause analysis (RCA). It spans IT operations (microservices) and OT operations (water treatment and distribution systems).

Dataset Notes

  • The four public dataset families are Product Review, Cloud Computing, SWaT, and WADI.
  • Product Review and Cloud Computing are the microservice-relevant subsets.
  • The website reports Product Review at 765 GB, with 4 faults and an average of 216 entities per fault.
  • The website reports Cloud Computing at 540 GB, with 6 faults and an average of 168 entities per fault.
  • The paper reports more than 100,000 timestamps, millions of log-event records, fault timestamps, and root-cause entity labels.
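
The website figures for the two microservice subsets can be kept as small records for side-by-side comparison. This is a minimal sketch, not part of any LEMMA-RCA tooling: the `SubsetStats` class, its field names, and the `totals` helper are hypothetical, and sizes are treated as gigabytes, which is an assumption about the website's unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsetStats:
    """Website-reported figures for one subset (hypothetical record type)."""
    name: str
    size_gb: int                 # assumption: the reported size is in gigabytes
    faults: int
    avg_entities_per_fault: int


# Values copied from the dataset notes above.
MICROSERVICE_SUBSETS = [
    SubsetStats("Product Review", 765, 4, 216),
    SubsetStats("Cloud Computing", 540, 6, 168),
]


def totals(subsets):
    """Return (total size in GB, total fault count) across the given subsets."""
    return sum(s.size_gb for s in subsets), sum(s.faults for s in subsets)


if __name__ == "__main__":
    size_gb, faults = totals(MICROSERVICE_SUBSETS)
    print(f"{size_gb} GB across {faults} faults")
```

Summing the two rows gives 1305 GB and 10 faults for the microservice-relevant subsets.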

Reported Baselines

The paper reports eight baselines: PC, Dynotears, C-LSTM, GOLEM, REASON, Nezha, MULAN, and CORAL. Repository text mentions six baseline methods in places, so the paper should be preferred for the count.
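
The size of the discrepancy can be checked by counting the names listed from the paper. The sketch below is illustrative only: `REPO_STATED_COUNT` and `count_gap` are hypothetical names, and the repository figure is simply the "six" quoted above, not a value read from the repository itself.

```python
# Baseline names as listed from the paper.
PAPER_BASELINES = [
    "PC", "Dynotears", "C-LSTM", "GOLEM",
    "REASON", "Nezha", "MULAN", "CORAL",
]

# Count quoted from the repository text (hypothetical constant name).
REPO_STATED_COUNT = 6


def count_gap(paper_baselines, repo_count):
    """Return how many more distinct baselines the paper lists than the repo states."""
    return len(set(paper_baselines)) - repo_count


if __name__ == "__main__":
    print(count_gap(PAPER_BASELINES, REPO_STATED_COUNT))
```

The paper's list has two more entries than the repository text states.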

Why It Matters

LEMMA-RCA is the largest multi-domain RCA collection in this comparison. It is especially relevant when testing whether a method transfers across IT and OT operations and across single-modal versus multi-modal RCA settings.

Gotchas

  • The benchmark is entity-centric and causal-graph-oriented, but it is not packaged as one ChronoGraph-style topology plus temporal edge-feature tensor.
  • License notes conflict: the License text on the website and in the README says CC BY-ND 4.0, while the Hugging Face metadata and one README paragraph say CC BY-NC 4.0.
  • Fault scenarios are diagnostic events, not logged operator actions.