Time-Series Classification Foundation Models

Summary

Time-series classification foundation models learn reusable embeddings for labeled time-series tasks rather than directly forecasting future observations. They are usually passive representation models: they encode observed time-series samples, then a downstream classifier uses the embeddings. They do not expose action, control input, intervention, or counterfactual rollout channels.
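This passive frozen-feature pattern can be sketched end to end. Everything below is illustrative: the `encode` function is a hypothetical stand-in for a pretrained encoder (here just summary features under a fixed random projection), and the nearest-centroid head stands in for the lightweight downstream classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 16))  # frozen "pretrained" projection, never updated

def encode(batch):
    """Hypothetical frozen encoder: maps (n, length) series to (n, 16) embeddings."""
    feats = np.stack([batch.mean(1), batch.std(1),
                      np.abs(np.diff(batch, axis=1)).mean(1)], axis=1)
    return feats @ W

def nearest_centroid_fit(Z, y):
    classes = np.unique(y)
    return classes, np.stack([Z[y == c].mean(0) for c in classes])

def nearest_centroid_predict(model, Z):
    classes, centroids = model
    d = ((Z[:, None, :] - centroids[None]) ** 2).sum(-1)
    return classes[d.argmin(1)]

# Toy data: class 0 = low-variance noise, class 1 = high-variance noise.
X = np.concatenate([rng.standard_normal((20, 64)) * 0.2,
                    rng.standard_normal((20, 64)) * 2.0])
y = np.array([0] * 20 + [1] * 20)

Z = encode(X)                        # frozen features, no encoder update
model = nearest_centroid_fit(Z, y)   # lightweight downstream head
acc = (nearest_centroid_predict(model, Z) == y).mean()
print(acc)  # frozen features + a light head separate the two classes
```

The key point the sketch encodes: only the head sees labels; the encoder is a fixed feature map, which is exactly what the frozen-feature evaluations below measure.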

Mantis Lineage

Mantis is the base model of the lineage: a lightweight classification encoder with calibration support. Its token generator operates over normalized values, differentials, and patch statistics, and its evaluation covers frozen features, fine-tuning, multivariate adapters, and calibration.
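The three token ingredients can be made concrete. The exact layout below is an assumption for illustration, not the published Mantis tokenizer: z-normalized values and first differences are reshaped into patches, and per-patch mean/std statistics are appended.

```python
import numpy as np

def patch_tokens(x, patch=16):
    """Sketch of a Mantis-style token generator (layout is an assumption):
    normalized values, differentials, and patch statistics per token."""
    x = (x - x.mean()) / (x.std() + 1e-8)        # normalized values
    dx = np.diff(x, prepend=x[0])                # differentials
    n = len(x) // patch
    vals = dx[:0]  # placeholder removed below
    vals = x[:n * patch].reshape(n, patch)
    diffs = dx[:n * patch].reshape(n, patch)
    stats = np.stack([vals.mean(1), vals.std(1)], axis=1)  # patch statistics
    return np.concatenate([vals, diffs, stats], axis=1)    # (n, 2*patch + 2)

tokens = patch_tokens(np.sin(np.linspace(0, 8 * np.pi, 128)))
print(tokens.shape)  # (8, 34): 8 patches, 16 values + 16 diffs + 2 stats
```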

MantisV2 extends the same line with synthetic CauKer-style pretraining and test-time representation strategies. It introduces MantisPlus, the original Mantis architecture retrained on synthetic data, and MantisV2 proper, a refined smaller encoder with stronger zero-shot frozen-feature behavior.

UTICA is the self-distillation branch of the same family. It keeps the Mantis tokenizer and backbone, but replaces contrastive pretraining with DINO/iBOT-style multi-objective self-distillation, using global/local crop alignment, masked patch prediction, and a KoLeo regularizer.
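Of UTICA's objectives, the KoLeo regularizer is the most self-contained to illustrate. The sketch below follows the standard formulation (negative mean log nearest-neighbor distance on l2-normalized embeddings); whether UTICA applies it exactly this way is an assumption.

```python
import numpy as np

def koleo(z, eps=1e-8):
    """KoLeo regularizer sketch: penalizes embeddings that collapse together.
    Loss = -mean log of each point's nearest-neighbor distance on the
    unit sphere, so near-duplicate embeddings incur a large penalty."""
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    d = np.linalg.norm(z[:, None] - z[None], axis=-1)
    np.fill_diagonal(d, np.inf)              # ignore self-distance
    return -np.log(d.min(axis=1) + eps).mean()

rng = np.random.default_rng(0)
spread = rng.standard_normal((32, 8))        # well-spread embeddings
collapsed = spread.copy()
collapsed[16:] = collapsed[:16] + 1e-3 * rng.standard_normal((16, 8))
print(koleo(spread) < koleo(collapsed))  # True: collapse is penalized more
```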

The useful lineage distinction is:

  • Mantis: contrastive classification pretraining plus calibration and multivariate adapters.
  • MantisV2: synthetic data plus layer, token, scale, and feature-fusion choices at test time.
  • UTICA: non-contrastive self-distillation on the Mantis-style architecture.
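MantisV2's test-time choices in the middle bullet are mechanically simple, which is worth seeing: given frozen per-layer features, "layer selection" picks one intermediate layer and "feature fusion" concatenates several. The per-layer feature dict below is hypothetical random data standing in for a real encoder's intermediate outputs.

```python
import numpy as np

# Hypothetical per-layer embeddings from a frozen encoder for 4 samples:
# layer index -> (n_samples, dim) array.
layer_feats = {l: np.random.default_rng(l).standard_normal((4, 8))
               for l in range(3)}

def select_layer(feats, layer):
    """Test-time layer selection: use one intermediate layer's features."""
    return feats[layer]

def fuse_layers(feats, layers):
    """Test-time feature fusion: concatenate chosen layers' features."""
    return np.concatenate([feats[l] for l in layers], axis=1)

print(select_layer(layer_feats, 1).shape)      # (4, 8)
print(fuse_layers(layer_feats, [0, 2]).shape)  # (4, 16)
```

Because these choices happen after pretraining, they change the downstream classifier's input geometry without touching the backbone, which is why the comparison section below asks that they be reported separately.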

Other Classification And Representation Routes

UniShape is classification-specific in a different way: it uses a multiscale shape-aware adapter and prototype objectives to preserve class-discriminative local shape. Its benchmarked entries separate fine-tuned classification from frozen-feature zero-shot extraction.

NuTime centers numerical scale. It separates local normalized shape from window mean and standard deviation through numerically multi-scaled embedding, then transfers to classification, few-shot learning, clustering, and anomaly detection.
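The core decomposition is easy to state: a window is split into its z-normalized local shape plus the scalar mean and standard deviation that carry absolute scale. How NuTime then embeds those scalars at multiple numeric scales is not shown here; the split itself is the point.

```python
import numpy as np

def decompose_window(w, eps=1e-8):
    """NuTime-style decomposition sketch (embedding details are an
    assumption): separate normalized local shape from the window's
    scale statistics, to be encoded separately."""
    mu, sigma = w.mean(), w.std()
    shape = (w - mu) / (sigma + eps)
    return shape, mu, sigma

w = 100.0 + 5.0 * np.sin(np.linspace(0, 2 * np.pi, 64))
shape, mu, sigma = decompose_window(w)
print(mu, sigma)  # scale lives here; `shape` is scale-free with unit variance
```

Two windows with identical shape but very different offsets or amplitudes now differ only in `(mu, sigma)`, which is the information plain per-window normalization would discard.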

T-Loss is an older unsupervised representation baseline. It trains causal convolutional encoders with a time-based triplet loss, showing that temporal proximity and subseries containment can produce useful embeddings before the foundation-model era.
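The triplet construction is the heart of T-Loss and can be sketched directly: the positive is a subseries contained in the anchor window, the negative comes from a different series. Window lengths and offsets below are illustrative choices, not the paper's exact sampling scheme.

```python
import numpy as np

def sample_triplet(X, rng, min_len=8):
    """Triplet sampling in the spirit of T-Loss: positive is contained
    inside the anchor window; negative is drawn from another series."""
    n, T = X.shape
    i = rng.integers(n)
    a_len = rng.integers(min_len * 2, T + 1)           # anchor window
    a_start = rng.integers(T - a_len + 1)
    anchor = X[i, a_start:a_start + a_len]
    p_len = rng.integers(min_len, a_len + 1)           # positive: inside anchor
    p_start = a_start + rng.integers(a_len - p_len + 1)
    positive = X[i, p_start:p_start + p_len]
    j = (i + 1 + rng.integers(n - 1)) % n              # any series except i
    n_start = rng.integers(T - p_len + 1)
    negative = X[j, n_start:n_start + p_len]
    return anchor, positive, negative

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))
a, p, neg = sample_triplet(X, rng)
print(len(p) <= len(a), len(neg) == len(p))  # True True
```

No labels appear anywhere in the sampler: temporal containment supplies the positives, which is why this counts as unsupervised representation learning.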

TiViT tests a transfer route from frozen vision encoders. It renders numeric time series as images, extracts intermediate hidden-layer vision features, and trains a classifier. Its main lesson is that intermediate representation geometry can be useful even when the backbone was not trained on time series.
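The rendering step can be shown with a minimal rasterizer. This is a simple stand-in for TiViT's actual rendering pipeline (which may differ): time maps to columns, amplitude maps to rows, and the result is an image a vision encoder can consume.

```python
import numpy as np

def series_to_image(x, height=32, width=32):
    """Render a series as a binary line image (illustrative stand-in for
    TiViT's rendering step). Columns index time, rows index amplitude."""
    x = np.interp(np.linspace(0, len(x) - 1, width),
                  np.arange(len(x)), x)                  # resample to width
    lo, hi = x.min(), x.max()
    rows = ((x - lo) / (hi - lo + 1e-8) * (height - 1)).astype(int)
    img = np.zeros((height, width))
    img[height - 1 - rows, np.arange(width)] = 1.0       # flip so up = larger
    return img

img = series_to_image(np.sin(np.linspace(0, 4 * np.pi, 200)))
print(img.shape, int(img.sum()))  # (32, 32) 32 — one pixel set per column
```

From here the frozen vision backbone's intermediate features would be extracted from `img` and fed to a classifier, exactly as in the frozen-feature evaluations above.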

MOMENT is a broader time-series foundation model, but its classification evidence belongs here: its masked-reconstruction representations can support downstream SVM classification even though the model was not designed specifically for labeled tasks.

What To Compare

Classification papers should be compared by evaluation mode, not only by average rank:

  • Frozen feature extraction tests whether the pretrained representation transfers with a lightweight downstream classifier.
  • Fine-tuning tests whether the pretrained backbone is a good initialization for a target dataset.
  • Zero-shot claims may still train a Random Forest, SVM, or logistic-regression head on target labels, so they are not label-free prediction.
  • Fusion results, such as MantisV2 plus TiViT features, should be separated from single-model entries.

UCR and UEA classification results should not be merged with forecasting benchmarks such as GIFT-Eval, BOOM, TIME, fev-bench, Monash, or LSF. They test different task surfaces.

Evidence

The classification branch repeatedly argues that shape, scale, calibration, and feature geometry matter more than direct future-value prediction. Mantis and UTICA test contrastive versus self-distilled objectives on a shared family; MantisV2 and CauKer-style synthetic data test label coverage; UniShape tests shape-aware class tokens; NuTime tests numerical-scale preservation; TiViT tests representation transfer from vision models.

Open Questions

  • Does the Mantis lineage scale cleanly beyond the 4M to 8M parameter regime without losing the deployment advantage?
  • Which gains come from synthetic data diversity, objective choice, architecture, test-time layer selection, or downstream classifier choice?
  • Can a native multivariate classification model preserve cross-channel dynamics better than channel-wise encoding and concatenation?
  • Which classification representations transfer to forecasting, anomaly detection, or action-conditioned world models rather than only UCR/UEA labels?