Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers
Source
- Raw Markdown: paper_tivit-2025.md
- PDF: paper_tivit-2025.pdf
- Preprint: arXiv 2506.08641
- Official code: ExplainableML/TiViT
- Official checkpoint: laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- Official checkpoint: laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg
Core Claim
TiViT argues that strong time-series classification representations are already available inside large frozen vision and vision-language encoders: by converting numeric time series into images and probing intermediate hidden layers, pretrained ViTs can match or exceed dedicated time-series foundation models on standard classification benchmarks.
Key Contributions
- Introduces TiViT, a time-series-to-image feature extraction framework that uses frozen pretrained ViTs and trains only a downstream classifier.
- Provides a theoretical argument that 2D patching can spread label-relevant temporal patterns across more Transformer tokens than 1D patching, reducing sample complexity under the analyzed data model (a toy counting sketch follows this list).
- Shows that intermediate hidden layers, especially layers with high intrinsic dimension, are better time-series classification features than final vision-model outputs.
- Evaluates on 128 UCR univariate and 27 UEA multivariate classification datasets against MOMENT and Mantis.
- Shows that concatenating TiViT features with time-series foundation model features improves classification, suggesting complementary representation spaces.
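The intuition behind the 1D-versus-2D patching argument can be illustrated with a toy token count. The sketch below is not the paper's formal data model or sample-complexity analysis; the function names, overlap parameters, and pattern position are illustrative assumptions only.

```python
# Toy illustration (not the paper's formal analysis): count how many Transformer
# tokens contain any part of a label-relevant pattern occupying time steps
# [start, start + length) under 1D patching versus 2D patching with overlapping rows.

def tokens_hit_1d(start, length, patch_len):
    """1D patching: disjoint patches of `patch_len` consecutive time steps."""
    first = start // patch_len
    last = (start + length - 1) // patch_len
    return last - first + 1

def tokens_hit_2d(start, length, row_len, stride, patch_px):
    """2D patching: overlapping segments of `row_len` steps, taken every `stride`
    steps, are stacked into a matrix; each image patch covers `patch_px` rows and
    `patch_px` columns of that matrix. Rows are enumerated as long as they can
    still overlap the pattern."""
    hit_cells = set()
    row = 0
    while row * stride < start + length:
        row_start = row * stride
        row_end = row_start + row_len
        lo = max(start, row_start)          # first pattern step inside this row
        hi = min(start + length, row_end)   # one past the last such step
        for t in range(lo, hi):
            col = t - row_start
            hit_cells.add((row // patch_px, col // patch_px))
        row += 1
    return len(hit_cells)

# A length-12 pattern: with 1D patches of 16 steps it lands in 2 tokens, while
# overlapping rows (stride < row_len) repeat it across several image patches.
print(tokens_hit_1d(start=40, length=12, patch_len=16))                       # -> 2
print(tokens_hit_2d(start=40, length=12, row_len=32, stride=8, patch_px=4))   # -> 9
```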
Benchmarked Models
| Model | Role In Paper | Notes | Official Artifact |
|---|---|---|---|
| TiViT-H-14-B79K | Main TiViT CLIP backbone | Converts each channel to a grayscale image, extracts mean-pooled hidden states from frozen OpenCLIP ViT-H/14, and uses layer 14 as the best UCR validation layer. The paper reports 81.3 UCR accuracy, 72.0 UEA accuracy, and 83.0 UCR / 73.7 UEA when fused with Mantis features. | laion/CLIP-ViT-H-14-laion2B-s32B-b79K |
| TiConvNext-XXLarge-AugReg | Official ConvNeXt-family TiViT artifact | Records the official LAION OpenCLIP ConvNeXt-XXLarge AugReg checkpoint for TiViT-style image feature extraction and downstream classification comparisons. | laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg |
Method Notes
TiViT treats each time series channel as a univariate numeric signal, robust-scales it, pads it, segments it into patches, stacks those patches into a 2D matrix, and resizes the matrix into a square grayscale image. A frozen vision encoder then processes the image, and TiViT averages token representations from a selected hidden layer before fitting a logistic-regression classifier.
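Below is a minimal sketch of that pipeline. It assumes the checkpoint loads through Hugging Face transformers' CLIPVisionModel, uses non-overlapping rows for simplicity, and fits a scikit-learn probe; the official repo's exact scaling, padding, and resizing details may differ, and the helper names are not taken from it.

```python
# Sketch of the TiViT-style pipeline: series -> grayscale image -> frozen ViT
# hidden layer -> mean-pooled features -> logistic-regression probe.
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPImageProcessor, CLIPVisionModel

def series_to_image(x, image_size=224):
    """Robust-scale a 1D series, pad it, cut it into rows of length ~sqrt(T),
    and render the stacked rows as a square grayscale image."""
    x = np.asarray(x, dtype=np.float32)
    q75, q25 = np.percentile(x, [75, 25])
    x = (x - np.median(x)) / (q75 - q25 + 1e-8)        # robust scaling
    patch = max(int(round(np.sqrt(len(x)))), 1)        # default patch length sqrt(T)
    x = np.pad(x, (0, (-len(x)) % patch))              # pad so patches tile the series
    grid = x.reshape(-1, patch)                        # stack patches as matrix rows
    grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8) * 255.0
    img = Image.fromarray(grid.astype(np.uint8), mode="L")
    return img.resize((image_size, image_size)).convert("RGB")

ckpt = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
processor = CLIPImageProcessor.from_pretrained(ckpt)
encoder = CLIPVisionModel.from_pretrained(ckpt).eval()

@torch.no_grad()
def embed(series_batch, layer=14):
    """Mean-pool token representations from one hidden layer of the frozen ViT."""
    imgs = [series_to_image(s) for s in series_batch]
    pixels = processor(images=imgs, return_tensors="pt").pixel_values
    hidden = encoder(pixel_values=pixels, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=1).numpy()

# Probe on frozen features; X_train is (n_series, T), y_train is (n_series,).
# clf = LogisticRegression(max_iter=1000).fit(embed(X_train), y_train)
```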
This is a passive representation-learning and classification method, not an action-conditioned world model. It has no action, control-input, intervention, or next-state rollout channel. For multivariate time series, it uses channel independence and concatenates per-channel embeddings rather than modeling cross-channel temporal dynamics directly.
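For multivariate inputs, channel independence amounts to embedding each channel separately and concatenating the vectors, as in this short sketch reusing the embed() helper above (the array shape convention is an assumption):

```python
# Channel-independent multivariate handling: per-channel embeddings are
# concatenated with no learned cross-channel fusion.
def embed_multivariate(series_batch, layer=14):
    # series_batch: array of shape (n_series, n_channels, T)
    per_channel = [embed(series_batch[:, c, :], layer)
                   for c in range(series_batch.shape[1])]
    return np.concatenate(per_channel, axis=1)
```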
The most transferable lesson for time-series foundation models is that the final-layer semantic image features are not the main source of value. The useful signal comes from intermediate layers whose geometry remains rich enough to represent low-level shape, periodicity, and local temporal variation after the time-series-to-image transform.
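One plausible way to pick that layer, sketched here as an assumption rather than the repo's exact protocol, is a validation sweep over hidden layers with the logistic-regression probe (layer indices assume the 32-block ViT-H/14):

```python
# Layer-selection sketch: fit the probe per layer, keep the best validation layer.
def pick_layer(X_tr, y_tr, X_va, y_va, layers=range(1, 33)):
    best_layer, best_acc = None, -1.0
    for layer in layers:
        clf = LogisticRegression(max_iter=1000).fit(embed(X_tr, layer), y_tr)
        acc = clf.score(embed(X_va, layer), y_va)
        if acc > best_acc:
            best_layer, best_acc = layer, acc
    return best_layer, best_acc
```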
Evidence And Results
On the UCR benchmark, the paper reports TiViT-CLIP with ViT-H/14 at layer 14 as the best TiViT configuration, with 81.3 average accuracy versus 80.1 for Mantis and 79.0 for MOMENT.
On UEA, TiViT reports 72.0 average accuracy, roughly on par with Mantis at 72.4. The authors emphasize that this result comes from simple per-channel feature concatenation without learned multivariate fusion.
Combining representations gives the strongest reported results: TiViT + Mantis reaches 83.0 UCR accuracy and 73.7 UEA accuracy, and the paper’s alignment analysis suggests that vision-derived and time-series-derived embeddings capture complementary neighborhoods.
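The fusion itself is plain feature concatenation before the probe. In the sketch below, mantis_embed is a hypothetical placeholder for whatever embedding API the time-series foundation model exposes; it is not defined here or claimed to match the paper's code.

```python
# Representation-fusion sketch: concatenate vision-derived and TSFM-derived
# features, then fit the same logistic-regression probe on the joint vector.
def fused_features(series_batch, layer=14):
    z_vit = embed(series_batch, layer)     # TiViT features (sketch above)
    z_tsm = mantis_embed(series_batch)     # hypothetical TSFM embedding call
    return np.concatenate([z_vit, z_tsm], axis=1)
```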
The ablations support the 2D image-conversion choice: overlapping 2D patching improves over 1D patching in the controlled Transformer comparison, the default patch size sqrt(T) avoids per-dataset patch search, and mean-pooled hidden tokens outperform CLS-token aggregation for the main contrastively pretrained ViT backbones.
Limitations
TiViT depends on rendering numeric time series into images and then reusing image-model features, so the representation is indirect and may discard structure that native time-series encoders could exploit. The multivariate treatment is channel-independent, the classifier still uses labeled target data, and the paper does not study forecasting, imputation, causal intervention modeling, or action-conditioned dynamics.
Links Into The Wiki
- Time-Series Foundation Models
- Time-Series Classification Foundation Models
- Time-Series Benchmark Hygiene
- Vision Foundation Models
- Mantis
- MantisV2
- MOMENT
Open Questions
- How much of TiViT’s strength comes from image pretraining scale versus the 2D patching transform itself?
- Can native multivariate tokenization preserve TiViT’s gains while modeling cross-channel dependencies directly?
- Are intermediate vision features similarly useful for forecasting, anomaly detection, or state abstraction, or is the effect specific to classification?