TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis
Source
- Raw Markdown: paper_telecomts-2025.md
- PDF: paper_telecomts-2025.pdf
- Dataset metadata snapshot: telecomts-2025
- arXiv: https://arxiv.org/abs/2510.06063
- Official Hugging Face dataset: https://huggingface.co/datasets/AliMaatouk/TelecomTS
- Official code: https://github.com/Ali-maatouk/TelecomTS
Core Claim
TelecomTS argues that public observability benchmarks are missing a key operational regime: de-anonymized, scale-preserving, multimodal telecom telemetry in which abrupt, noisy, bursty behavior is often normal. In that regime, the useful tasks are anomaly detection, root-cause analysis, and time-series/text question answering rather than forecasting alone.
Dataset Notes
- Data comes from a controlled 5G telecommunications testbed, not a private customer production trace.
- The paper reports 18 KPI channels sampled at 10 Hz, with 1,020,000 normal observations and 120,000 anomalous observations.
- The Hugging Face dataset exposes 32k chunked samples with 128 time steps per sample.
- Each sample includes KPI arrays, a natural-language description, anomaly metadata, statistics, contextual labels, and Q&A fields.
- Labels include zone, application, mobility, congestion, and anomaly presence.
- The dataset includes real anomalies from controlled jamming plus synthetic anomalies generated from documented network failure modes.
- Synthetic anomaly samples include GPT-4.1-generated troubleshooting tickets validated through a human-in-the-loop process.
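The reported shapes (18 KPI channels, 128-step chunks) make the windowing arithmetic easy to check. A minimal sketch of that chunking, using a random placeholder stream rather than the actual TelecomTS data or its real preprocessing pipeline:

```python
import numpy as np

# Hypothetical raw stream: 18 KPI channels sampled at 10 Hz.
# The values are random placeholders, not real telemetry.
n_steps, n_channels = 10_000, 18
stream = np.random.default_rng(0).normal(size=(n_steps, n_channels))

# Chunk into non-overlapping 128-step windows, matching the
# per-sample shape exposed by the Hugging Face dataset.
window = 128
n_windows = n_steps // window
samples = stream[: n_windows * window].reshape(n_windows, window, n_channels)
print(samples.shape)  # (78, 128, 18)
```

Each resulting sample is a (128, 18) array; in the actual dataset these arrays are paired with the language description, anomaly metadata, and Q&A fields listed above.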
Why It Matters
For Alex’s experiments, TelecomTS is a strong candidate dataset because it combines three things that are usually separated: observability-like time-series dynamics, preserved metric semantics/absolute scale, and language fields for reasoning tasks. It complements BOOM: BOOM targets broad, high-cardinality observability forecasting, while TelecomTS has far fewer channels but is richer in labels, natural-language reasoning hooks, and anomaly/root-cause tasks.
It is especially relevant for testing whether time-series foundation models can handle scale-aware telemetry and whether multimodal models can connect natural-language context to numeric operational signals.
Gotchas
- TelecomTS has only 18 KPI channels, so it is not a high-dimensional time-series forecasting benchmark in the Time-HD sense.
- The dataset is lab-collected from a 5G testbed. It is cleaner and more reproducible than private production telemetry, but transfer to real operator networks should be tested.
- Controlled jamming is an exogenous/adversarial event in this dataset, not an operator action chosen by a modeled policy.
- Synthetic anomalies and GPT-4.1-generated tickets are useful for scale and language grounding, but they can introduce generator artifacts.
- Aggregate forecasting metrics can flatter models: stable intervals dominate MAE/RMSE, so a model can score well overall while missing exactly the abrupt peaks that matter operationally.
- LLM and reasoning-model evaluations are prompt-sensitive, especially for anomaly detection and Q&A.
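The metric-flattering gotcha above is easy to demonstrate numerically. A toy sketch (synthetic data, not TelecomTS values): a forecaster that predicts the flat baseline and ignores every peak still achieves a small MAE, because rare spikes contribute little to the average error.

```python
import numpy as np

# Synthetic KPI: a stable baseline of 50 with rare abrupt peaks at 500.
t = 1_000
truth = np.full(t, 50.0)
truth[::100] = 500.0  # 10 peaks out of 1,000 steps

# A degenerate model that always predicts the baseline.
flat_forecast = np.full(t, 50.0)

mae = np.abs(flat_forecast - truth).mean()
print(mae)  # 4.5 — looks small on a series whose peaks reach 500

# Yet the model detects none of the events that matter.
peak_hits = (flat_forecast[truth == 500.0] > 100.0).mean()
print(peak_hits)  # 0.0 — every peak missed
```

Event-aware metrics (peak recall, detection delay, or error restricted to anomalous intervals) are a better complement to MAE/RMSE on this kind of data.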
Key Results
- On anomaly detection, language models tend toward false positives because normal telecom observability data can be abrupt and erratic.
- Time-series models with trained heads generally beat prompted LLMs on anomaly tasks, but absolute performance suggests the task remains far from solved.
- Mantis performs best in the paper’s anomaly-detection table, while Toto is strongest for anomaly duration and root-cause analysis.
- The paper emphasizes absolute scale: models that retain or encode scale information have an advantage over approaches that normalize away operational magnitude.
- Q&A results show that current language/reasoning models still struggle to connect engineering context with the underlying time-series data.
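The absolute-scale point can be made concrete: per-window z-normalization, a common preprocessing step for time-series models, maps windows with very different operational magnitudes onto identical inputs. A small illustrative sketch (hypothetical throughput values, not taken from the dataset):

```python
import numpy as np

# Two hypothetical throughput windows with identical shape but
# a 100x difference in absolute scale (e.g. healthy vs. starved cell).
healthy = np.array([100.0, 102.0, 98.0, 101.0, 99.0])  # Mbps
starved = healthy / 100.0                               # Mbps

def znorm(x: np.ndarray) -> np.ndarray:
    """Per-window z-normalization: subtract mean, divide by std."""
    return (x - x.mean()) / x.std()

# After normalization the two windows are indistinguishable,
# erasing the operationally critical magnitude difference.
same = np.allclose(znorm(healthy), znorm(starved))
print(same)  # True
```

Models that keep raw magnitudes, or encode scale as an extra feature alongside the normalized shape, avoid conflating these two regimes.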
Links Into The Wiki
- TelecomTS
- Observability Time Series
- Time-Series Benchmark Hygiene
- Context-Aided Forecasting
- Unified Multimodal Models
Open Questions
- Should TelecomTS be part of the next experiment batch as an anomaly/root-cause benchmark, a multimodal Q&A benchmark, or both?
- How much of the reported difficulty comes from telecom-specific semantics versus generic observability burstiness?
- Can synthetic anomaly tickets be used for training without overfitting to GPT-4.1 phrasing patterns?