TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis

Source

Core Claim

TelecomTS argues that public observability benchmarks are missing a key operational regime: de-anonymized, scale-preserving, multimodal telecom telemetry in which abrupt, noisy, bursty behavior is often normal, and in which the useful tasks are anomaly detection, root-cause analysis, and time-series/text question answering rather than forecasting alone.

Dataset Notes

  • Data comes from a controlled 5G telecommunications testbed, not a private customer production trace.
  • The paper reports 18 KPI channels sampled at 10 Hz, with 1,020,000 normal observations and 120,000 anomalous observations.
  • The Hugging Face dataset exposes 32k chunked samples with 128 time steps per sample.
  • Each sample includes KPI arrays, a natural-language description, anomaly metadata, statistics, contextual labels, and Q&A fields.
  • Labels include zone, application, mobility, congestion, and anomaly presence.
  • The dataset includes real anomalies from controlled jamming plus synthetic anomalies generated from documented network failure modes.
  • Synthetic anomaly samples include GPT-4.1-generated troubleshooting tickets validated through a human-in-the-loop process.
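The per-sample structure above can be sketched as a plain record. This is a minimal mock assuming the field layout described in the notes; the actual Hugging Face column names and dtypes may differ, and the values here are purely illustrative.

```python
import random

# Hypothetical per-sample record mirroring the described TelecomTS schema:
# 18 KPI channels x 128 time steps, plus text, labels, and Q&A fields.
# Field names are assumptions, not the dataset's real column names.
def make_dummy_sample(seed: int = 0) -> dict:
    rng = random.Random(seed)
    return {
        # 18 KPI channels, each a length-128 chunked time series
        "kpi": [[rng.gauss(0.0, 1.0) for _ in range(128)] for _ in range(18)],
        "description": "Baseline traffic under low congestion.",
        "labels": {
            "zone": "A",
            "application": "video",
            "mobility": "static",
            "congestion": "low",
            "anomaly": False,
        },
        "qa": {"question": "Is an anomaly present in this window?", "answer": "No"},
    }

sample = make_dummy_sample()
assert len(sample["kpi"]) == 18 and len(sample["kpi"][0]) == 128
```

A mock like this is handy for wiring up dataloaders and evaluation code before pulling the real 32k-sample release.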

Why It Matters

For Alex’s experiments, TelecomTS is a strong candidate dataset because it combines three things that are usually separated: observability-like time-series dynamics, preserved metric semantics and absolute scale, and language fields for reasoning tasks. It complements BOOM: BOOM covers broader, high-cardinality observability forecasting, while TelecomTS has fewer channels but is richer in labels, natural-language reasoning hooks, and anomaly/root-cause tasks.

It is especially relevant for testing whether time-series foundation models can handle scale-aware telemetry and whether multimodal models can connect natural-language context to numeric operational signals.

Gotchas

  • TelecomTS has only 18 KPI channels, so it is not a high-dimensional time-series forecasting benchmark in the Time-HD sense.
  • The dataset is lab-collected from a 5G testbed. It is cleaner and more reproducible than private production telemetry, but transfer to real operator networks should be tested.
  • Controlled jamming is an exogenous/adversarial event in this dataset, not an operator action chosen by a modeled policy.
  • Synthetic anomalies and GPT-4.1-generated tickets are useful for scale and language grounding, but they can introduce generator artifacts.
  • Forecasting metrics can make models look better than they are: stable intervals dominate MAE/RMSE, so a model can score well on average while missing the abrupt peaks that matter operationally.
  • LLM and reasoning-model evaluations are prompt-sensitive, especially for anomaly detection and Q&A.
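The MAE gotcha above is easy to demonstrate. In this sketch (with illustrative numbers, not TelecomTS values), a constant baseline predictor gets a small average error on a 128-step window that is flat except for one spike, even though it misses the spike entirely.

```python
# Why MAE understates peak misses on bursty telemetry:
# 127 steps are flat, one step spikes; a flat predictor looks fine on average.
series = [1.0] * 128
series[64] = 50.0                 # a single abrupt peak
pred = [1.0] * 128                # constant "predict the baseline" model

mae = sum(abs(p - y) for p, y in zip(pred, series)) / len(series)
peak_err = abs(pred[64] - series[64])

print(round(mae, 3))   # 0.383 -- looks good on average
print(peak_err)        # 49.0  -- the peak is entirely missed
```

Peak-aware metrics (e.g. error conditioned on anomalous intervals) avoid this, which is one reason the anomaly labels in TelecomTS are useful even for forecasting-style evaluation.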

Key Results

  • On anomaly detection, language models tend toward false positives because normal telecom observability data can be abrupt and erratic.
  • Time-series models with trained heads generally beat prompted LLMs on anomaly tasks, but the task remains far from solved.
  • Mantis performs best in the paper’s anomaly-detection table, while Toto is strongest for anomaly duration and root-cause analysis.
  • The paper emphasizes absolute scale: models that retain or encode scale information have an advantage over approaches that normalize away operational magnitude.
  • Q&A results show that current language/reasoning models still struggle to connect engineering context with the underlying time-series data.
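The absolute-scale point can be illustrated with a toy example (my numbers, not the paper's): per-channel z-score normalization, a common preprocessing step, maps two KPI channels at very different operating points onto identical series, discarding exactly the magnitude information the paper argues is informative.

```python
import statistics

# Per-channel z-scoring erases absolute scale: after normalization,
# a ~100 Mbps throughput channel and a ~0.1% loss channel are
# indistinguishable. Values are illustrative, not from TelecomTS.
def zscore(xs: list[float]) -> list[float]:
    mu, sd = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

throughput_mbps = [100.0, 110.0, 90.0, 100.0]   # large-magnitude KPI
loss_pct = [0.10, 0.11, 0.09, 0.10]             # small-magnitude KPI

a, b = zscore(throughput_mbps), zscore(loss_pct)
assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))  # scale information is gone
```

Models that ingest raw magnitudes (or encode scale separately before normalizing) keep the distinction; this is the kind of design choice the paper's scale-awareness results speak to.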

Open Questions

  • Should TelecomTS be part of the next experiment batch as an anomaly/root-cause benchmark, a multimodal Q&A benchmark, or both?
  • How much of the reported difficulty comes from telecom-specific semantics versus generic observability burstiness?
  • Can synthetic anomaly tickets be used for training without overfitting to GPT-4.1 phrasing patterns?