Unsupervised Scalable Representation Learning for Multivariate Time Series

Source

Franceschi, Dieuleveut, and Jaggi, "Unsupervised Scalable Representation Learning for Multivariate Time Series," NeurIPS 2019.

Core Claim

T-Loss argues that a causal dilated convolutional encoder trained with a fully unsupervised time-based triplet loss can learn transferable fixed-size representations for variable-length univariate and multivariate time series.

Key Contributions

  • Introduces a time-based triplet loss that samples a reference subseries, a positive subseries contained within the reference, and K randomly selected negative subseries, all without using labels.
  • Uses an encoder built from exponentially dilated causal convolutions, residual connections, global max pooling, and a final linear projection so representation size is independent of input length.
  • Evaluates learned representations with simple downstream classifiers on UCR univariate classification and UEA multivariate classification benchmarks (see the protocol sketch after this list).
  • Demonstrates that the same representation-learning setup can scale to a long household-electricity time series and support downstream regression with large inference-time savings over raw-window features.
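
To make the classification protocol in the third bullet concrete, here is a minimal sketch (not the authors' code) of how a frozen encoder's representations feed a simple downstream classifier; it assumes a PyTorch encoder such as the one sketched under Method Notes, arrays X_train/X_test with labels y_train/y_test, and an RBF-kernel SVM with an illustrative penalty grid.

```python
import torch
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def evaluate_representations(encoder, X_train, y_train, X_test, y_test):
    """Freeze the unsupervised encoder, embed both splits, and score an SVM.

    Only the SVM sees the labels; the encoder is never fine-tuned.
    X_*: arrays of shape (N, C, L); y_*: integer class labels.
    """
    encoder.eval()
    with torch.no_grad():
        train_reps = encoder(torch.as_tensor(X_train, dtype=torch.float32)).numpy()
        test_reps = encoder(torch.as_tensor(X_test, dtype=torch.float32)).numpy()

    # Penalty grid is illustrative, not the paper's exact cross-validation setup.
    svm = GridSearchCV(SVC(kernel="rbf"), {"C": [10.0 ** k for k in range(-4, 5)]})
    svm.fit(train_reps, y_train)
    return svm.score(test_reps, y_test)
```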

Benchmarked Models

  • Model: T-Loss-CricketX
    Role In Paper: Repo-hosted benchmark checkpoint for the CricketX UCR dataset
    Notes: Causal CNN encoder trained with the T-Loss recipe; the paper uses CricketX to show classification accuracy improving during unsupervised training with K=10 negative samples.
    Official Artifact: models/CricketX_CausalCNN_encoder.pth

Method Notes

T-Loss is a passive time-series representation model: it learns embeddings from observed time series and does not include an action, control input, intervention, or exogenous-variable channel. The model is still relevant to world-model work because it studies how far a generic latent state for time series can transfer across downstream tasks when trained without labels.

The training objective adapts the negative-sampling intuition of word2vec to time series: a reference subseries should have a representation close to that of a subseries it contains, and far from the representations of random subseries sampled from another time series or from another part of the same long series.
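
A minimal sketch of that objective, assuming a PyTorch encoder mapping (batch, channels, length) inputs to (batch, dim) embeddings; names and shapes are illustrative, not taken from the official implementation.

```python
import torch
import torch.nn.functional as F

def time_based_triplet_loss(encoder, x_ref, x_pos, x_negs):
    """Word2vec-style triplet loss over subseries (illustrative sketch).

    x_ref:  (B, C, L_ref)  reference subseries
    x_pos:  (B, C, L_pos)  subseries contained within each reference
    x_negs: list of K tensors holding randomly sampled negative subseries
    """
    z_ref = encoder(x_ref)                      # (B, D)
    z_pos = encoder(x_pos)                      # (B, D)

    # Pull the contained positive towards its reference.
    loss = -F.logsigmoid((z_ref * z_pos).sum(dim=1)).mean()

    # Push each of the K random negatives away from the reference.
    for x_neg in x_negs:
        z_neg = encoder(x_neg)                  # (B, D)
        loss = loss - F.logsigmoid(-(z_ref * z_neg).sum(dim=1)).mean()
    return loss
```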

The encoder choice matters for scalability. The paper favors causal convolutions over recurrent encoders because exponentially dilated convolutions capture long-range dependencies with parallel, hardware-friendly computation, while global max pooling turns variable-length sequences into fixed-size representations.
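
A compact sketch of such an encoder, with placeholder layer sizes rather than the paper's exact configuration; the original block structure (activations, normalization, channel counts, depth) may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    """Residual block of two dilated causal convolutions (left padding only)."""
    def __init__(self, in_ch, out_ch, dilation, kernel_size=3):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # pad the past, never the future
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        y = torch.relu(self.conv1(F.pad(x, (self.pad, 0))))
        y = torch.relu(self.conv2(F.pad(y, (self.pad, 0))))
        return y + self.skip(x)

class CausalCNNEncoder(nn.Module):
    """Exponentially dilated causal blocks, global max pooling, linear projection."""
    def __init__(self, in_channels, hidden=40, depth=4, out_dim=160):
        super().__init__()
        blocks, ch = [], in_channels
        for i in range(depth):
            blocks.append(CausalConvBlock(ch, hidden, dilation=2 ** i))
            ch = hidden
        self.network = nn.Sequential(*blocks)
        self.linear = nn.Linear(hidden, out_dim)

    def forward(self, x):                                # x: (B, C, L) with any length L
        h = self.network(x)                              # (B, hidden, L)
        h = h.max(dim=2).values                          # global max pool over time
        return self.linear(h)                            # fixed-size representation
```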

Evidence And Results

  • On UCR univariate classification, the combined T-Loss representation outperforms the concurrent unsupervised baselines TimeNet and RWS on most datasets where comparisons are available.
  • Against supervised non-neural classifiers on the first 85 UCR datasets, the paper reports average rank 2.92 for T-Loss, behind HIVE-COTE and close to ST.
  • On CricketX, the appendix reports combined T-Loss accuracy 0.777; the learning-curve figure tracks the CricketX encoder with K=10 during training.
  • On UEA multivariate classification, T-Loss matches or outperforms dimension-dependent DTW on 69% of the datasets.
  • On the Individual Household Electric Power Consumption series, learned day- and quarter-window representations greatly reduce downstream regression wall time while keeping prediction error comparable or only slightly worse.

Limitations

  • The paper is a representation-learning result rather than a forecasting or action-conditioned world-model result; downstream prediction still depends on task-specific SVMs or linear regressors.
  • The main classification protocol trains an encoder per dataset, so it is not a single broad foundation model in the later time-series sense.
  • The UEA multivariate benchmark was new at the time, and the paper compares against DTW-D rather than a broad set of later multivariate baselines.
  • The method fixes hyperparameters per archive, yet results still depend on choices such as the number of negative samples K and the SVM regularization grid.

Open Questions

  • How much of T-Loss transfer comes from the triplet objective versus the causal CNN architecture?
  • Would a single encoder trained over many heterogeneous datasets retain the per-dataset performance reported here?
  • Can time-based negative sampling be adapted to action-conditioned trajectories without confusing passive temporal proximity with intervention effects?