TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis

Source

Core Claim

TelecomTS argues that public observability benchmarks are missing a key operational regime: de-anonymized, scale-preserving, multimodal telecom telemetry in which abrupt, noisy, bursty behavior is often normal, and in which the useful tasks are anomaly detection, root-cause analysis, and time-series/text question answering rather than forecasting alone.

Dataset Notes

  • Data comes from a controlled 5G telecommunications testbed, not a private customer production trace.
  • The paper reports 18 KPI channels sampled at 10 Hz, with 1,020,000 normal observations and 120,000 anomalous observations.
  • The Hugging Face dataset exposes 32k chunked samples with 128 time steps per sample.
  • Each sample includes KPI arrays, a natural-language description, anomaly metadata, statistics, contextual labels, and Q&A fields.
  • Labels include zone, application, mobility, congestion, and anomaly presence.
  • The dataset includes real anomalies from controlled jamming plus synthetic anomalies generated from documented network failure modes.
  • Synthetic anomaly samples include GPT-4.1-generated troubleshooting tickets validated through a human-in-the-loop process.
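The per-sample structure above can be sketched as a plain record. This is a minimal mock assuming the field layout described in the notes; the actual Hugging Face column names and dtypes may differ, and the values here are purely illustrative.

```python
import random

# Hypothetical per-sample record mirroring the described TelecomTS schema:
# 18 KPI channels x 128 time steps, plus text, labels, and Q&A fields.
# Field names are assumptions, not the dataset's real column names.
def make_dummy_sample(seed: int = 0) -> dict:
    rng = random.Random(seed)
    return {
        # 18 KPI channels, each a length-128 chunked time series
        "kpi": [[rng.gauss(0.0, 1.0) for _ in range(128)] for _ in range(18)],
        "description": "Baseline traffic under low congestion.",
        "labels": {
            "zone": "A",
            "application": "video",
            "mobility": "static",
            "congestion": "low",
            "anomaly": False,
        },
        "qa": {"question": "Is an anomaly present in this window?", "answer": "No"},
    }

sample = make_dummy_sample()
assert len(sample["kpi"]) == 18 and len(sample["kpi"][0]) == 128
```

A mock like this is handy for wiring up dataloaders and evaluation code before pulling the real 32k-sample release.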

Why It Matters

For Alex’s experiments, TelecomTS is a strong candidate dataset because it combines three things that are usually separated: observability-like time-series dynamics, preserved metric semantics and absolute scale, and language fields for reasoning tasks. It complements BOOM: BOOM covers broader, high-cardinality observability forecasting, while TelecomTS has fewer channels but is richer in labels, natural-language reasoning hooks, and anomaly/root-cause tasks.

It is especially relevant for testing whether time-series foundation models can handle scale-aware telemetry and whether multimodal models can connect natural-language context to numeric operational signals.

Gotchas

  • TelecomTS has only 18 KPI channels, so it is not a high-dimensional time-series forecasting benchmark in the Time-HD sense.
  • The dataset is lab-collected from a 5G testbed. It is cleaner and more reproducible than private production telemetry, but transfer to real operator networks should be tested.
  • Controlled jamming is an exogenous/adversarial event in this dataset, not an operator action chosen by a modeled policy.
  • Synthetic anomalies and GPT-4.1-generated tickets are useful for scale and language grounding, but they can introduce generator artifacts.
  • Forecasting metrics can make models look better than they are: stable intervals dominate MAE/RMSE, so a model can score well on average while missing the abrupt peaks that matter operationally.
  • LLM and reasoning-model evaluations are prompt-sensitive, especially for anomaly detection and Q&A.
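The MAE gotcha above is easy to demonstrate. In this sketch (with illustrative numbers, not TelecomTS values), a constant baseline predictor gets a small average error on a 128-step window that is flat except for one spike, even though it misses the spike entirely.

```python
# Why MAE understates peak misses on bursty telemetry:
# 127 steps are flat, one step spikes; a flat predictor looks fine on average.
series = [1.0] * 128
series[64] = 50.0                 # a single abrupt peak
pred = [1.0] * 128                # constant "predict the baseline" model

mae = sum(abs(p - y) for p, y in zip(pred, series)) / len(series)
peak_err = abs(pred[64] - series[64])

print(round(mae, 3))   # 0.383 -- looks good on average
print(peak_err)        # 49.0  -- the peak is entirely missed
```

Peak-aware metrics (e.g. error conditioned on anomalous intervals) avoid this, which is one reason the anomaly labels in TelecomTS are useful even for forecasting-style evaluation.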

Key Results

  • On anomaly detection, language models tend toward false positives because normal telecom observability data can be abrupt and erratic.
  • Time-series models with trained heads generally beat prompted LLMs on anomaly tasks, but the task remains far from solved.
  • Mantis performs best in the paper’s anomaly-detection table, while Toto is strongest for anomaly duration and root-cause analysis.
  • The paper emphasizes absolute scale: models that retain or encode scale information have an advantage over approaches that normalize away operational magnitude.
  • Q&A results show that current language/reasoning models still struggle to connect engineering context with the underlying time-series data.
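The absolute-scale point can be illustrated with a toy example (my numbers, not the paper's): per-channel z-score normalization, a common preprocessing step, maps two KPI channels at very different operating points onto identical series, discarding exactly the magnitude information the paper argues is informative.

```python
import statistics

# Per-channel z-scoring erases absolute scale: after normalization,
# a ~100 Mbps throughput channel and a ~0.1% loss channel are
# indistinguishable. Values are illustrative, not from TelecomTS.
def zscore(xs: list[float]) -> list[float]:
    mu, sd = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

throughput_mbps = [100.0, 110.0, 90.0, 100.0]   # large-magnitude KPI
loss_pct = [0.10, 0.11, 0.09, 0.10]             # small-magnitude KPI

a, b = zscore(throughput_mbps), zscore(loss_pct)
assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))  # scale information is gone
```

Models that ingest raw magnitudes (or encode scale separately before normalizing) keep the distinction; this is the kind of design choice the paper's scale-awareness results speak to.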

Open Questions

  • Should TelecomTS be part of the next experiment batch as an anomaly/root-cause benchmark, a multimodal Q&A benchmark, or both?
  • How much of the reported difficulty comes from telecom-specific semantics versus generic observability burstiness?
  • Can synthetic anomaly tickets be used for training without overfitting to GPT-4.1 phrasing patterns?