---
abstract: |
  Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose **Ti**me **Vi**sion **T**ransformer (**TiViT**), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal a new direction for reusing vision representations in a non-visual domain. Code is available at <https://github.com/ExplainableML/TiViT>.
author:
- |
  Simon Roschmann$^{1,2,3,4}$ `\quad `{=latex}Quentin Bouniot$^{1,2,3,4}$ `\quad `{=latex}Vasilii Feofanov$^{5}$\
  **Ievgen Redko**$^5$ `\quad `{=latex}**Zeynep Akata**$^{1,2,3,4}$\
  \
  $^1$Helmholtz Munich `\quad`{=latex} $^2$Technical University of Munich\
  $^3$Munich Center for Machine Learning `\quad`{=latex} $^4$MDSI `\quad`{=latex} $^5$Paris Noah's Ark Lab\
  \
  `{simon.roschmann,quentin.bouniot,zeynep.akata}@tum.de`\
  `{vasilii.feofanov,ievgen.redko}@huawei.com`\
bibliography:
- references.bib
title: |
  Time Series Representations for Classification\
  Lie Hidden in Pretrained Vision Transformers
---

```{=latex}
\newcommand{\myparagraph}[1]{\textbf{#1}}
```
```{=latex}
\maketitle
```
```{=latex}
\allowdisplaybreaks
```
Introduction
============

Foundation models have disrupted the field of machine learning. Typically built upon the Transformer [@vaswani2017attention] architecture, they are trained on large-scale datasets to learn generalizable representations for a wide range of downstream tasks. In the vision domain, foundation models like DINOv2 [@oquab2024dinov2], trained with a self-supervised objective of contrastive learning and masked modeling, yield representations that can be applied in image classification or segmentation with minimal supervision. Vision language models (VLMs) such as CLIP [@radford2021clip] or SigLIP [@tschannen2025siglip2; @zhai2023siglip] can even be transferred to new tasks without any supervision since they have learned from large-scale image-text datasets to leverage natural language as a flexible anchor for semantic concepts. VLMs have been increasingly applied in modalities beyond computer vision, such as audio [@dixit2024visionlanguagemodelsfewshot; @xie2024sonicvisionlmplayingsoundvision] and medicine [@zhang2023biomedgpt].

Time series data is critically important across a wide range of domains, including healthcare, transportation, and manufacturing. Inspired by the success of foundation models in natural language processing (NLP) and computer vision, similar models have recently been developed for the analysis of time series following two different approaches. The first one is to pretrain time series foundation models (TSFMs) in a self-supervised way [@ansari2024chronos; @das2023decoder; @feofanov2025mantis; @goswami2024moment; @lin2023nutime] using a large-scale real-world time series dataset. The second one is to repurpose powerful foundation models from other domains, such as NLP [@jin2023time; @zhou2023one] and vision [@chen2024visionts; @li2023time], for time series tasks. The idea behind this approach is to benefit from the vast number of training samples available to large vision and language models, a scale of data often unavailable in the time series domain. Despite the encouraging results of these latter approaches, they often remain inferior to TSFMs trained on large-scale collections of datasets, require costly fine-tuning, or are restricted to a specific task (e.g., univariate forecasting [@chen2024visionts] or classification of irregularly sampled time series [@li2023time]).

In this paper, we present the first comprehensive study over a large collection of real-world time series datasets showing that pretrained vision models, such as DINOv2, CLIP, or SigLIP 2, can be on par with or even superior to frontier TSFMs in time series classification. Not only do our results significantly extend the scope of prior contributions using pretrained vision models in time series analysis, but they also provide theoretical and qualitative analysis showing the benefits of image-based modeling of time series and its complementarity to existing TSFMs trained on common time series datasets.

```{=latex}
\centering
```
![While TSFMs such as Mantis operate directly on the 1D time series signal, TiViT transforms time series into images to leverage pretrained ViTs for feature extraction. We display the average time series signal of two classes from ECG200 [@ecg200] and their corresponding 2D representations. Utilizing the hidden representations of OpenCLIP, TiViT significantly outperforms Mantis in linear classification on the UCR benchmark. Combining both models further improves accuracy.](figures/teaser.png){#fig: teaser width="98%"}

Our main contributions can be summarized as follows: (1) We show that pretrained vision foundation models can be superior to classification TSFMs without any fine-tuning. We achieve this by transforming time series into 2D images and by further using hidden layer representations of vision models for time series classification. (2) We provide a theoretical insight showing that image-based time series modeling can be efficient when used with Transformers since it reduces the sample complexity during training. (3) We show that representations from TSFMs and ViTs can be concatenated to provide a further impressive average improvement of +3% on 128 UCR time series datasets, highlighting the complementarity of models extracting different information from the same data. We further study the alignment of TSFM and ViT representations, confirming their complementarity.

Related work {#sec:related}
============

`\myparagraph{Time series foundation models}`{=latex} Recently, the research community has witnessed an impressive surge in the number and variety of TSFMs. At first, such models were based on repurposing large language models (LLMs) for time series tasks [@cao2023tempo; @chang2023llm4ts; @gruver2024large; @jin2023time; @xue2023promptcast; @zhou2023one] by leveraging the ability of LLMs to efficiently handle text sequences. A different approach that gained in popularity later was to train TSFMs from the ground up on extensive and diverse datasets [@ansari2024chronos; @bhethanabhotla2024mamba4cast; @das2023decoder; @feofanov2025mantis; @gao2024units; @goswami2024moment; @lin2023nutime; @liu2024moirai; @liu2024timer; @rasul2023lag; @wang2024rose]. While most of the models were designed for time series forecasting, several of them also specifically tackled time series classification [@feofanov2025mantis; @gao2024units; @goswami2024moment; @lin2023nutime; @zhou2023one]. These models are on par with or exceed the performance of other popular deep learning models proposed for time series classification, such as the famous TimesNet [@WuHLZ0L23] architecture.

`\myparagraph{Transforming time series into images}`{=latex} Time series can be transformed into images in many ways, either based on the 1D representation of the time series in the original (line plot) or transformed (frequency) space, or by using a 2D modeling (heatmap, Gramian angular field, recurrence plot) that stacks segments of the input time series based on a chosen periodicity. Vision models, often based on CNNs and their variations, have been used on such image-based representations of time series since as early as 2013 (see @ni2025harnessingvisionmodelstime for a recent survey). Most of them, however, are trained in a supervised way to fit a dataset at hand. This work explores how pretrained vision models can be used as powerful feature extractors without training or fine-tuning. @li2023time showed that pretrained ViTs can be efficient in the classification of irregular time series from their line plot representations after full fine-tuning. In a similar vein, @chen2024visionts applied a masked auto-encoder with a pretrained frozen ViT to 2D transformed time series to perform univariate time series forecasting. In contrast to these works, we explain why vision models can be more efficient for time series analysis than vanilla Transformers. Moreover, our TiViT model surpasses the performance of frontier TSFMs across a broad set of common classification benchmarks.

TiViT: Time series classification using pretrained Vision Transformers {#sec:tivit}
======================================================================

We introduce TiViT (Time Vision Transformer), a framework that leverages frozen ViTs pretrained in the vision or vision-language domain for time series classification. First, we theoretically motivate the 2D modeling of time series. Second, we detail how we transform time series into images and how pretrained ViTs can effectively extract features from these images.

Theoretical motivation for 2D time series modeling {#subsection: theoretical motivation}
--------------------------------------------------

Although previous studies [@chen2024visionts; @lin2024sparsetsf; @WuHLZ0L23] have modeled time series as 2D matrices, there is no theoretical understanding of why such an approach may be beneficial in practice. Below, we develop a theoretical insight showing exactly how the 2D modeling of time series can improve the classification performance of Transformer-based models compared to conventional 1D patching.

```{=latex}
\myparagraph{Problem setup}
```
We consider a binary time series classification problem with $N$ univariate training samples $\{({\bm{t}}^n, y^n), y^n \in \{+1, -1\}\}_{n=1}^N$. Each time series $\bm{t}^n \in \mathbb{R}^T$ is characterized by a fixed periodicity $p$. Without loss of generality, we assume $T=k^2$ for some $k \in \mathbb{N}$ and $p=k$. For the ease of derivations, we further assume that $k=g^2$ for some $g \in \mathbb{N}$, ensuring that $\sqrt{k}$ is an integer. Each time series can be patched in one of the two following ways:

```{=latex}
\centering
```
![Benefits of 2D patching for time series. We consider a binary classification problem with two distinct patterns: a sine function over $[0, \pi]$, either positive or negative. Here, the negative sine function represents the label-relevant pattern. Tokens should cover at least $1 / \sqrt{k}$ of the label-relevant pattern to be considered label-relevant, i.e., all tokens in 2D (red), only one third of tokens in 1D.](figures/illustration_prop_all.png){#fig:prop width="0.95\\linewidth"}

-   **1D patching**: For a time series $\bm{t}^n \in \mathbb{R}^T$, define $\bm{X}^n \in \mathbb{R}^{1 \times T}$ as: $$\bm{X}^n = \begin{bmatrix}
        \bm{x}_1^n & \bm{x}_2^n & \cdots & \bm{x}_k^n
        \end{bmatrix}, \quad \text{where } \bm{x}_l^n = \bm{t}^n_{(l-1)k + 1 : lk} \in \mathbb{R}^k \ \forall l \in [k].$$ Each token $\bm{x}_l^n$ corresponds to a contiguous 1D segment of length $k$.

-   **2D patching**: Reshape $\bm{t}^n$ into a $k \times k$ matrix $\bm{X}'^n$, then split it into $k$ non-overlapping square patches of size $\sqrt{k} \times \sqrt{k}$: $$\bm{X}'^n = 
        \begin{bmatrix}
        \bm{x}_1^n \\
        \bm{x}_2^n \\
        \vdots \\
        \bm{x}_k^n
        \end{bmatrix}
        = \begin{bmatrix}
        \bm{P}_{1,1} & \bm{P}_{1,2} & \cdots \\
        \bm{P}_{2,1} & \bm{P}_{2,2} & \cdots \\
        \vdots & \vdots & \ddots
        \end{bmatrix}, \quad \text{where } \bm{P}_{i,j} \in \mathbb{R}^{\sqrt{k} \times \sqrt{k}} \ \forall i,j \in [\sqrt{k}].$$ Each 2D token $\bm{P}_{i,j}$ is vectorized (flattened) into $\bm{x}_{(i,j)}'^n \in \mathbb{R}^k$.
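The two patching schemes above can be sketched in a few lines of NumPy. This is a toy illustration of the token layout only (with $T=k^2$ and $\sqrt{k}$ an integer, as assumed above); function names are ours:

```python
import numpy as np

def patch_1d(t, k):
    # k contiguous tokens of length k: token l is t[(l-1)k : lk].
    return t.reshape(k, k)

def patch_2d(t, k):
    # Reshape t into a k x k matrix, cut non-overlapping sqrt(k) x sqrt(k)
    # patches, and flatten each patch into a length-k token.
    g = int(round(np.sqrt(k)))
    assert g * g == k, "k must be a perfect square"
    X = t.reshape(k, k)
    return np.stack([X[i * g:(i + 1) * g, j * g:(j + 1) * g].ravel()
                     for i in range(g) for j in range(g)])

t = np.arange(81.0)            # T = k^2 with k = 9
print(patch_1d(t, 9).shape)    # (9, 9): k tokens of length k
print(patch_2d(t, 9).shape)    # (9, 9): same token count, different content
```

Both schemes produce $k$ tokens of dimension $k$; they differ only in which entries of $\bm{t}$ each token collects.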

`\myparagraph{Benefits of 2D patching}`{=latex} Our key idea is to leverage the notion of label-relevant tokens and their impact on the sample complexity of training Transformers, as introduced by @li2023a. Following their data model, we consider each token to be a noisy version of one of several distinct patterns. In binary classification, there exist two such patterns $\{\bm{\mu}_1, \bm{\mu}_2\}$ with $\bm{\mu}_i \in \mathbb{R}^{k}$ for all $i$. For a time series $\bm{t}^n$ with label $y^n = 1$, tokens $\bm{x}$ that are noisy versions of $\bm{\mu}_1$, i.e., $||\bm{x}-\bm{\mu}_1||\leq ||\bm{x}-\bm{\mu}_2||$, are referred to as label-relevant tokens. Similarly, for a time series $\bm{t}^n$ with label $y^n = -1$, the label-relevant tokens are noisy versions of $\bm{\mu}_2$. The class prediction for $y^n$ depends on a majority vote over the tokens closest to $\bm{\mu}_1$ and $\bm{\mu}_2$, respectively.

@li2023a showed that the sample complexity of a shallow multi-head Transformer scales as $\mathcal{O}(1/\alpha_*^2)$ where $\alpha_*$ denotes the number of label-relevant tokens in the training samples (more details on the setup are in Appendix `\ref{app:sec_theory_background}`{=latex}). Our objective is to show that, under certain conditions, the fraction of label-relevant tokens is greater when the time series is transformed into a 2D representation ($\alpha_*^{\text{2D}}$) compared to the conventional 1D representation ($\alpha_*^{\text{1D}}$). This, in turn, can explain why 2D patching leads to more efficient learning than 1D patching (the proof is postponed to Appendix `\ref{app:sec_theory_proof}`{=latex}).

```{=latex}
\begin{proposition}
For arbitrary $\bm{\mu}_1, \bm{\mu}_2 \in \mathbb{R}^k$, let $\bm{t} = [\bm{x}_1 \ \ \bm{x}_2 \ \ \cdots \ \ \bm{x}_k ]^\top \in \mathbb{R}^T$, $\text{where } \forall i \in [k], \bm{x}_i \in \mathbb{R}^k$ and either $\bm{x}_i = \bm{\mu}_1$ or $\bm{x}_i = \bm{\mu}_2$ with $\bm{\mu}_2$ being a label-relevant pattern. Let $\left| \left\{ i : \bm{x}_i = \bm{\mu}_2 \right\} \right| = n'$ and assume that $2\bm{x}'\cdot (\bm{\mu}_1 - \bm{\mu}_2) \leq ||\bm{\mu}_1 ||^2 - ||\bm{\mu}_2||^2$ whenever $\left| \left\{ i : x'_i \in \bm{\mu}_2 \right\} \right| \geq \sqrt{k}$. Then, it holds: $$\alpha_*^{\text{2D}} \geq \alpha_*^{\text{1D}}=\frac{n'}{k},$$ and the inequality is strict if $n' \text{ mod } \sqrt{k} > 0.$
\end{proposition}
```
To better illustrate this proposition, we visualize it using a concrete example. We define $\bm{\mu}_1=\text{sin}(x)$ for $x \in [0, \pi]$ and let $\bm{\mu}_2=-\bm{\mu}_1$. Figure `\ref{fig:prop}`{=latex} (more examples are provided in Appendix `\ref{app:sec_theory_examples}`{=latex}) displays the input time series $\bm{t}$ with $k=9$ and $n'=3$. In this case, the assumption $2\bm{x}'\cdot (\bm{\mu}_1 - \bm{\mu}_2) \leq ||\bm{\mu}_1 ||^2 - ||\bm{\mu}_2||^2$ simplifies to $\bm{x}'\cdot \bm{\mu}_1\leq 0$ and is verified for all tokens in the 2D case but only for $n'$ tokens in the 1D case. On a higher level, this proposition formalizes the idea that spreading the discriminative signal across more tokens (each $\bm{\mu}_2$ contributes to $\sqrt{k}$ tokens in the 2D case) makes it easier for a Transformer model to pick up this signal and learn the classification task. In the case of 1D patching, this signal is less spread out, forcing the model to "work" harder to attend to the important tokens during training.
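The counting argument can also be checked numerically. The snippet below is our own construction of the sine example ($k=9$, $n'=3$ negative-sine rows, placed one per group of $\sqrt{k}$ rows) and computes the fraction of label-relevant tokens under both patchings via the simplified condition $\bm{x}' \cdot \bm{\mu}_1 \leq 0$:

```python
import numpy as np

k, g = 9, 3
mu1 = np.sin(np.linspace(0.0, np.pi, k))   # pattern for class +1
mu2 = -mu1                                  # label-relevant pattern (class -1)

# Our construction: n' = 3 rows equal to mu2, one per group of sqrt(k) rows.
X = np.stack([mu2 if i in (1, 4, 7) else mu1 for i in range(k)])  # k x k

def relevant(x):
    # With mu2 = -mu1, "closer to mu2" reduces to the condition x . mu1 <= 0.
    return x @ mu1 <= 1e-12

# 1D tokens are the k rows; 2D tokens are flattened sqrt(k) x sqrt(k) patches.
frac_1d = np.mean([relevant(X[i]) for i in range(k)])
frac_2d = np.mean([relevant(X[i * g:(i + 1) * g, j * g:(j + 1) * g].ravel())
                   for i in range(g) for j in range(g)])
print(frac_1d, frac_2d)  # 1/3 of the 1D tokens vs. all of the 2D tokens
```

Only the $n'=3$ negative-sine rows are label-relevant under 1D patching, whereas every 2D patch mixes in a negative-sine segment and satisfies the condition.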

Time series classification with ViT representations
---------------------------------------------------

In our real-world setting, we consider a multivariate time series dataset $\mathcal{T} = \{\bm{t}^n | \bm{t}^n \in \mathbb{R}^{T \times D}\}_{n=1}^N$ containing $N$ samples, each of length $T$ and dimensionality $D$. The corresponding targets $\mathcal{Y}=\{y^n\}_{n=1}^N$ are labels $y^n \in \{1, ..., C\}$ from $C$ different classes. In line with our theoretical framework, we transform the time series into images and apply ViTs on these images to extract representations for linear classification. Figure `\ref{fig: methodology}`{=latex} illustrates our approach.

```{=latex}
\centering
```
![Illustration of TiViT on a time series sample from ECG200 [@ecg200]. We split the time series into segments and stack them to form a grayscale image. Then, we patch the image in 2D and feed it into a frozen ViT pretrained on large-scale image datasets. We average the hidden representations from a specific layer and pass them to a learnable classification head. Combining the representations of TiViT and TSFMs such as Mantis further improves classification accuracy.](figures/methodology.png){#fig: methodology width="98%"}

```{=latex}
\myparagraph{Time series-to-image transformation}
```
`\label{subsection: time series to image transformation}`{=latex} Following the channel independence assumption, proposed by @nie2022patchtst and widely adopted in most recent time series models [@feofanov2025mantis; @goswami2024moment], we first split a multivariate time series $\bm{t}^n \in \mathbb{R}^{T \times D}$ into $D$ univariate time series $\{\bm{t}^n_d \in \mathbb{R}^{T}\}_{d=1}^{D}$. We then normalize each univariate time series $\bm{t}^n_d$ using robust scaling, defined as: $$\frac{\bm{t}^n_d - Q_2}{Q_3 - Q_1},$$ where $Q_1, Q_2, Q_3$ are the first, second (median), and third quartiles, respectively. We apply padding at the beginning of each time series by replicating its first value and subsequently segment it into $M$ patches $\{\bm{x}_m\}_{m=1}^{M}$ of size $P$. Given a patch length $P$ and stride $S$, the total number of patches is: $$M = \left\lfloor{\frac{T - P}{S}} \right\rfloor + 1.$$ We stack the patches to generate a 2D representation $\bm{X}' \in \mathbb{R}^{M \times P}$, which we then render into a grayscale image $\bm{X}' \in \mathbb{R}^{M \times P \times 3}$ by replicating its values across three channels. To align with the square input resolution $(R, R)$ expected by the ViT, we resize the image.
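A minimal sketch of this transformation in NumPy, using the defaults adopted later ($P=\sqrt{T}$, stride $S=P/10$, nearest interpolation). The padding length (just enough for the strided segmentation to tile the series) is our assumption, as is the hand-rolled resize:

```python
import numpy as np

def series_to_image(t, R=224):
    """Transform a univariate series into an R x R x 3 grayscale image."""
    t = np.asarray(t, dtype=float)
    T = len(t)
    # Robust scaling with the quartiles of the series.
    q1, q2, q3 = np.percentile(t, [25, 50, 75])
    t = (t - q2) / max(q3 - q1, 1e-8)
    # Patch size P = sqrt(T) and stride S = P/10 (the defaults used later).
    P = max(int(round(np.sqrt(T))), 2)
    S = max(P // 10, 1)
    # Front-pad by replicating the first value (pad length is our assumption:
    # just enough so the strided patching covers the whole series).
    pad = (-(T - P)) % S
    t = np.concatenate([np.full(pad, t[0]), t])
    M = (len(t) - P) // S + 1
    X = np.stack([t[m * S: m * S + P] for m in range(M)])    # M x P matrix
    # Nearest-neighbour resize to the square ViT input resolution (R, R).
    rows = np.minimum((np.arange(R) * M) // R, M - 1)
    cols = np.minimum((np.arange(R) * P) // R, P - 1)
    img = X[np.ix_(rows, cols)]
    # Grayscale image: replicate the single channel three times.
    return np.repeat(img[:, :, None], 3, axis=2)             # R x R x 3

img = series_to_image(np.sin(np.linspace(0, 8 * np.pi, 96)))
print(img.shape)  # (224, 224, 3)
```

In practice, an image library would perform the resize; the indexing above only mimics nearest interpolation for self-containment.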

```{=latex}
\myparagraph{Time series classification}
```
We feed each grayscale image $\bm{X}'$ representing a univariate time series into a pretrained and frozen ViT $v$ with $L$ hidden layers. The ViT's inherent 2D patching yields a sequence $\{{\bm{x}_k'} \in \mathbb{R}^{U^2}\}_{k=1}^K$ of flattened patches, where $(U, U)$ is the resolution per patch and $K = R^2/U^2$ is the resulting number of patches. ViTs generally prepend a classification token to this sequence. The ViT consumes all input tokens and produces a sequence of features at every layer: $$v(\bm{X}')=\left\{[\bm{h}_0^{(l)}, \bm{h}_1^{(l)}, ..., \bm{h}_K^{(l)}]\right\}_{l=0}^{L}.$$ To obtain a single embedding vector $\bm{e}$ per image, we select a specific layer $l$ and average its $K+1$ representations: $$\bm{e} = \bm{h}_{avg}^{(l)} = \frac{1}{K + 1} \sum_{k=0}^{K} \bm{h}_k^{(l)}.$$ For multivariate time series, we feed the per-channel image representations $\{\bm{X}'_d\}_{d=1}^D$ separately into the ViT and concatenate the resulting embeddings for the specified layer: $\text{Concat}(\bm{e}_1, ..., \bm{e}_D)$. We train only a linear classifier on the ViT representations and their corresponding class labels. To enhance performance, the embeddings of frozen TSFMs and ViTs can be concatenated prior to classification.
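Given the per-layer token features, the extraction step reduces to a token average per channel followed by channel concatenation. A toy sketch with random stand-in features (the array shapes and function names are ours, not a ViT API):

```python
import numpy as np

def layer_embedding(h, l):
    # h: (L+1, K+1, d) array of token features per layer;
    # the embedding e is the mean over the K+1 tokens of layer l.
    return h[l].mean(axis=0)

def multivariate_embedding(per_channel, l):
    # Concat(e_1, ..., e_D): concatenate the per-channel embeddings.
    return np.concatenate([layer_embedding(h, l) for h in per_channel])

rng = np.random.default_rng(0)
# Stand-in features: D = 2 channels, L+1 = 4 layers, K+1 = 5 tokens, dim 8.
feats = [rng.normal(size=(4, 5, 8)) for _ in range(2)]
e = multivariate_embedding(feats, l=2)
print(e.shape)  # (16,): D concatenated embeddings of dimension d
```

A linear classifier is then trained on these fixed embeddings; the ViT itself receives no gradient updates.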

Experimental evaluation {#sec:exps}
=======================

We evaluate TiViT with different ViT backbones on two time series classification benchmarks.

```{=latex}
\myparagraph{Datasets}
```
UCR [@dau2019ucr] comprises 128 univariate time series datasets of varying sample size ($16 \leq N_\text{train} \leq 8926$) and series length ($15 \leq T \leq 2844$). UEA [@bagnall2018uea] consists of 30 multivariate time series datasets. Following @feofanov2025mantis, we exclude three datasets (AtrialFibrillation, StandWalkJump, PenDigits) from UEA due to their short sequence length or small test size.

```{=latex}
\myparagraph{Vision Transformers}
```
Our study examines three differently pretrained ViTs. CLIP [@radford2021clip] performs contrastive learning of image and text encoders on image-text pairs. We reuse the ViT image encoders of OpenCLIP [@Cherti_2023_CVPR; @ilharco2021openclip] models trained with the LAION-2B English subset of LAION-5B [@schuhmann2022laion]. SigLIP 2 [@tschannen2025siglip2] adopts contrastive learning on image-text pairs, but with a Sigmoid loss, complemented by captioning-based pretraining, self-distillation, and masked prediction. In contrast, DINOv2 [@oquab2024dinov2] is solely pretrained on images through self-distillation with a student-teacher architecture and masked modeling. For each pretraining approach, we consider multiple vision model sizes (ViT-B, ViT-L, ViT-H) with varying layer depth (12, 24, and 32 layers).

```{=latex}
\myparagraph{Baselines}
```
We compare TiViT to two state-of-the-art TSFMs exclusively pretrained on time series. Mantis [@feofanov2025mantis] is a Transformer model (8 M parameters) comprising 6 layers and 8 heads per layer, pretrained on 2 million time series with contrastive learning. Moment [@goswami2024moment] is a family of Transformers pretrained on 13 million time series with masked modeling. In our study, we consider Moment-base with 12 layers and 125 M parameters.

```{=latex}
\myparagraph{Implementation}
```
To assess the effectiveness of TiViT and TSFM representations in time series classification, we train a logistic regression classifier with the LBFGS solver per dataset. Our evaluation adheres to the standard train-test splits provided by the UCR and UEA archives and reserves 20% of the train split for validation. For the time series-to-image transformation, we resize the grayscale images to the resolution expected by the ViT with nearest interpolation and adjust the contrast by a factor of $0.8$. All experiments can be performed on a single NVIDIA V100 GPU with 16 GB memory. Our results are averaged over three random seeds.
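The linear probe itself is standard; a sketch with scikit-learn's LBFGS-based logistic regression and a 20% validation split, on synthetic stand-in embeddings rather than UCR data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical frozen embeddings standing in for TiViT features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
w = rng.normal(size=32)
y = (X @ w > 0).astype(int)          # linearly separable synthetic labels

# Reserve 20% of the train split for validation (e.g. for layer selection).
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            random_state=0)

clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(X_tr, y_tr)
print(round(clf.score(X_val, y_val), 2))  # high accuracy on separable data
```

Per dataset, only this classifier is trained; all feature extractors stay frozen.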

```{=latex}
\centering
```
`\label{subfigure: ecg 200 sample}`{=latex} `\centering`{=latex} ![ECG200 [@ecg200] sample](figures/ecg_200_for_grayscale.png "fig:"){#fig:patch1 width="95%"}

```{=latex}
\hfill
```
`\label{subfigure: patch size small}`{=latex} `\centering`{=latex} ![$P = 1$](figures/ecg_200_grayscale_1.png "fig:"){#fig:patch2 width="95%"}

```{=latex}
\hfill
```
`\label{subfigure: patch size sqrt}`{=latex} `\centering`{=latex} ![$P = \sqrt{T}$](figures/ecg_200_grayscale_9.png "fig:"){#fig:patch3 width="95%"}

```{=latex}
\hfill
```
`\label{subfigure: patch size large}`{=latex} `\centering`{=latex} ![$P = \frac{T}{2}$](figures/ecg_200_grayscale_48.png "fig:"){#fig:patch4 width="95%"}

```{=latex}
\setlength
```
```{=latex}
\intextsep{2pt}
```
```{=latex}
\begin{wraptable}[8]{r}{0.42\linewidth}
\centering
\caption{Comparison of patching strategies on the UCR benchmark.}
\label{tab:matrix-patching-exp-res}
\begin{tabular}{@{}lcccc@{}}
\toprule
\multirow{2}{*}{Patching} & \multicolumn{2}{c}{Non-overlap} & \multicolumn{2}{c}{Overlap} \\
\cmidrule(lr){2-3}
\cmidrule(lr){4-5}
& 1D & 2D & 1D & 2D \\
\midrule
Accuracy & 76.4 & 76.5 & 76.6 & \textbf{77.4} \\
\bottomrule
\end{tabular}
\end{wraptable}
```
Comparison of 1D and 2D patching with Transformers
--------------------------------------------------

The way we transform a time series into an image can be seen as a special patching strategy for time series. We study how the 2D patching proposed in Section `\ref{subsection: theoretical motivation}`{=latex} improves the quality of representations learned by Transformers. We fix the Transformer architecture and pretraining method, and only vary the patching strategy. We then evaluate the representations learned by the Transformer model on the UCR benchmark.

Following @feofanov2025mantis, we pretrain a Transformer model with 6 layers and 8 heads per layer using contrastive learning. More implementation details can be found in Appendix `\ref{app: 1d 2d comparison}`{=latex}. We compare 1D and 2D patching with both non-overlapping and overlapping patches. As summarized in Table `\ref{tab:matrix-patching-exp-res}`{=latex}, 2D patching outperforms 1D patching, with overlapping 2D patches yielding the highest classification accuracy. This finding confirms our theoretical analysis in Section `\ref{subsection: theoretical motivation}`{=latex}, showing that the transformation of time series to images can be beneficial for time series classification. Subsequently, we build on this idea of modeling time series as images and further leverage state-of-the-art pretrained vision models for feature extraction.

```{=latex}
\centering
```
```{=latex}
\centering
```
![Validation accuracy](figures/layer_accuracy_v4-crop.png){#fig: layer vs accuracy width="95%"}

```{=latex}
\hfill
```
```{=latex}
\centering
```
![Intrinsic dimension](figures/intrinsic_dimensionality_v4-crop.png){#fig: intrinsic dimensionality width="95%"}

```{=latex}
\centering
```
```{=latex}
\begin{subtable}[t]{0.34\textwidth}
\centering
\caption{Selecting patch size $P$}
\label{tab: periodicity search}
\begin{tabular}{@{}lcc@{}}
\toprule
Patch size & $\sqrt{T}$ & $P^*$ \\
\midrule
Val accuracy & 78.2 & \textbf{79.5} \\
\bottomrule
\end{tabular}
\end{subtable}
```
```{=latex}
\hfill
```
```{=latex}
\begin{subtable}[t]{0.65\textwidth}
\centering
\caption{Effect of patch overlap on validation accuracy}
\label{tab: overlap accuracy}
\begin{tabular}{@{}lcccccc@{}}
\toprule
Overlap & 0.0 & 0.25 & 0.5 & 0.75 & 0.9 & 0.95\\
\midrule
Val accuracy & 78.2 & 79.3 & 80.2 & 80.0 & \textbf{80.4} & 80.0 \\
\bottomrule
\end{tabular}
\end{subtable}
```
Transforming time series into images for ViT feature extraction {#subsection: transforming time series into images for vit feature extraction}
---------------------------------------------------------------

The time series-to-image transformation described in Section `\ref{subsection: time series to image transformation}`{=latex} is sensitive to the patch size $P$. This parameter affects the visual appearance of the image representation provided to the ViT for feature extraction. Figure `\ref{fig:patch_size}`{=latex} displays a time series sample from the ECG200 [@ecg200] dataset along with its corresponding image representations for three different patch sizes. After patching and stacking, the 2D matrix is resized to the square input resolution required by ViTs. Using very small (Figure `\ref{fig:patch2}`{=latex}) or very large (Figure `\ref{fig:patch4}`{=latex}) patch sizes results in redundant tokens representing the same input signal. To avoid a computationally expensive hyperparameter search for the best patch size $P^*$ per dataset, we propose to select $P = \sqrt{T}$ for any dataset of length $T$. A patch size of $\sqrt{T}$ yields a square image prior to resizing and thus the most diverse set of patches without any horizontal or vertical distortion (Figure `\ref{fig:patch3}`{=latex}). Moreover, this setting is in line with our theoretical motivation in Section `\ref{subsection: theoretical motivation}`{=latex}.

Table `\ref{tab: periodicity search}`{=latex} presents the classification accuracy for TiViT with a CLIP backbone (TiViT-CLIP) and non-overlapping patches. To provide an upper bound on the classification performance, we perform a hyperparameter search for the best patch size $P^*$. Specifically, for each dataset of length $T$, we consider 20 equally spaced values in $[1, \frac{T}{2}]$ and identify the patch size that maximizes classification accuracy on the validation set. Note that while consistently applying $P = \sqrt{T}$ incurs a small decline in accuracy on the test set, it reduces the computational cost by a factor of 20. We further investigate the impact of overlapping patches on the classification accuracy while maintaining the patch size at $P = \sqrt{T}$. Table `\ref{tab: overlap accuracy}`{=latex} reports the effect of patch overlap, defined as a fraction of the patch length, on the classification performance of TiViT-CLIP. Our findings reveal that overlapping patches enhance accuracy in downstream classification. Consequently, all subsequent experiments employ a patch size of $P = \sqrt{T}$ and a stride of $S = \frac{P}{10}$.

Hidden representations are most effective in time series classification {#subsection: hidden representatoins}
-----------------------------------------------------------------------

We repurpose frozen ViTs as feature extractors for time series data. While the final representations of ViTs typically capture high-level semantics, intermediate layers encode lower-level information [@dorszewski2025colors]. Our study reveals that the intermediate representations of ViTs are the most effective for downstream classification. In Figure `\ref{fig: layer vs accuracy}`{=latex}, we report the classification performance of TiViT with pretrained ViTs from DINOv2, CLIP, and SigLIP 2 on the validation split of the UCR benchmark. For each dataset, we extract representations from the hidden layers of the ViTs, average them, and train a linear classifier. The intermediate representations of ViTs, between 40% and 70% of the layer depth, achieve the highest classification accuracy. Table `\ref{tab: hidden layers}`{=latex} summarizes the performance on the UCR test split.

```{=latex}
\renewcommand{\arraystretch}{1.}
```
```{=latex}
\centering
```
```{=latex}
\resizebox{.9\linewidth}{!}{
\begin{tabular}{@{}llcclc@{}}
\toprule
Model & Architecture & Layer & Parameters & Data & Accuracy\\
\midrule
TiViT-DINOv2 & ViT-L/14 & 15 & 178 M & LVD-142M & 80.0 \\
TiViT-SigLIP 2 & SoViT-400m/14 & 10 & 138 M & WebLI (10B) & 80.6 \\
TiViT-CLIP & ViT-H/14 & 14 & 257 M & LAION-2B  & \textbf{81.3} \\
\bottomrule
\end{tabular}
}
```
```{=latex}
\centering
```
```{=latex}
\centering
```
![image](figures/ucr+uea_performance.png){width="0.9\\linewidth"} `\captionof{figure}{Evaluation on 155 datasets from the UEA and UCR archive.}`{=latex} `\label{fig: ucr uea joint benchmark}`{=latex}

```{=latex}
\hfill
```
```{=latex}
\centering
```
```{=latex}
\vspace{0.2cm}
```
```{=latex}
\resizebox{.9\linewidth}{!}{
\begin{tabular}{@{}lcc@{}}
\toprule
Model & UCR & UEA \\
\midrule
Moment & 79.0 & 69.9\\
Mantis & 80.1 & 72.4\\
\midrule
TiViT \emph{(Ours)} & 81.3 & 72.0\\
\midrule
TiViT + Moment \emph{(Ours)} & 82.5 & 72.6\\
TiViT + Mantis \emph{(Ours)} & \textbf{83.0} & \textbf{73.7} \\
\bottomrule
\end{tabular}
}
```
```{=latex}
\vspace{0.55cm}
```
```{=latex}
\captionof{table}{Classification accuracy of TSFMs and TiViT per benchmark.}
```
`\label{tab: ucr uea joint benchmark}`{=latex}

```{=latex}
\vspace{0.5cm}
```
CLIP and SigLIP 2, both optimized with a contrastive loss on image-text pairs, reach their best performance in earlier layers: layer 14 of 33 for CLIP (ViT-H) and layer 10 of 28 for SigLIP 2 (SoViT-400m). In contrast, DINOv2 (ViT-L), trained with contrastive learning and masked modeling on images only, reaches the highest classification accuracy with representations from a later layer (15 of 25). Our selection of architectures per pretraining paradigm ensures that TiViT exhibits a similar number of layers and parameters up to the best-performing hidden layer. For each ViT, we determine the optimal hidden layer based on its highest validation accuracy across the 128 datasets of the UCR benchmark. This best-performing layer per ViT is used consistently in all subsequent experiments.

```{=latex}
\myparagraph{Intrinsic dimension}
```
To better understand the hidden representations of ViTs, we analyze their intrinsic dimension (see Figure `\ref{fig: intrinsic dimensionality}`{=latex}) and principal components (see Appendix `\ref{app: intrinsic dim and pca}`{=latex}). @valeriani2023geometry have previously investigated the geometry of hidden representations of Transformers for in-domain vision and language applications. We measure the intrinsic dimension of ViT representations of time series from the UCR archive using the DADApy [@dadapy] implementation of the TWO-NN estimator [@facco2017estimating]. Figure `\ref{fig: intrinsic dimensionality}`{=latex} displays the intrinsic dimensionality of the representations at varying layer depth for three different ViT backbones. The best-performing layers, located between 40% and 70% of the layer depth according to Figure `\ref{fig: layer vs accuracy}`{=latex}, exhibit the highest intrinsic dimensionality.
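For intuition, the TWO-NN estimator can be sketched in its maximum-likelihood form, which estimates the intrinsic dimension from the ratio of each point's second to first nearest-neighbor distance. This is a simplified sketch, not the DADApy routine, whose implementation includes refinements we omit here:

```python
import numpy as np

def two_nn_id(X):
    """TWO-NN intrinsic dimension, MLE form: id = N / sum_i log(r2_i / r1_i),
    with r1_i, r2_i the two nearest-neighbour distances of point i."""
    # Squared pairwise distances via |a-b|^2 = |a|^2 + |b|^2 - 2 a.b.
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d2, np.inf)           # exclude self-distances
    d2.sort(axis=1)
    mu = np.sqrt(d2[:, 1] / d2[:, 0])      # ratio of 2nd to 1st NN distance
    return len(X) / np.log(mu).sum()

# Sanity check: a 2D plane embedded in 64 dimensions should give an ID near 2.
rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 64))
print(two_nn_id(Z))  # close to 2
```

The estimator depends only on local neighbor ratios, which is why it recovers the dimension of the underlying manifold rather than that of the embedding space.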

`\myparagraph{Benchmark}`{=latex} Unless stated otherwise, we refer to our best-performing model, TiViT-CLIP, as TiViT. A full comparison of TiViT and TSFMs on the UCR and UEA test sets is illustrated in Figure `\ref{fig: ucr uea joint benchmark}`{=latex}. Across all 155 datasets, TiViT reaches a mean accuracy of 79.7%, surpassing Mantis with a mean accuracy of 78.8%. Results per benchmark can be found in Table `\ref{tab: ucr uea joint benchmark}`{=latex}. The state-of-the-art TSFM Mantis achieves a linear classification accuracy of 80.1% on the UCR benchmark, while TiViT achieves 81.3%. A paired t-test at a significance level of 0.05 confirms that this improvement over Mantis across the 128 datasets of the UCR benchmark is statistically significant ($p=0.03$). We further extend our analysis to the classification of multivariate time series. As depicted in Table `\ref{tab: ucr uea joint benchmark}`{=latex}, TiViT reaches a classification accuracy of 72.0% on the UEA benchmark, which is statistically on par with Mantis (72.4%). Remarkably, for multivariate datasets, a simple concatenation of per-channel representations, without learning any explicit fusion or cross-channel interactions, matches the state-of-the-art performance of Mantis.

Alignment and fusion of TiViT and TSFM representations {#subsection: alignment and fusion}
------------------------------------------------------

```{=latex}
\centering
```
```{=latex}
\resizebox{.6\linewidth}{!}{
\begin{tabular}{@{}cc|cc|cc@{}}
\toprule
\multicolumn{2}{c|}{TiViT} & \multicolumn{2}{c|}{TSFM} & Joint & Alignment \\
CLIP & DINOv2 & Mantis & Moment & Accuracy & Score \\
\midrule
-- & -- & 80.1 & 79.0 & 81.5 & 0.319\\
-- & 80.0 & -- & 79.0 & 81.8 & 0.296\\
81.3 & 80.0 & -- & -- & 82.0 & \textbf{0.484}\\
-- & 80.0 & 80.1 & --  & 82.2 & 0.323\\
81.3 & -- & -- & 79.0 & 82.5 & 0.321 \\
81.3 & -- & 80.1 & -- & \textbf{83.0} & 0.338\\
\bottomrule
\end{tabular}
}
```
We not only compare the effectiveness of TiViT and TSFM representations against each other, but also explore their complementarity by concatenating their features for joint classification. As depicted in Table `\ref{fig: joint classification and alignment}`{=latex}, the combination of TiViT and TSFM representations consistently improves the classification performance over any standalone model. While the combination of two TSFMs yields 81.5% accuracy, fusing TiViT-CLIP with Moment or Mantis leads to even higher accuracies of 82.5% and 83.0%, respectively. These results underscore the potential of multimodal time series analysis.
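The fusion step can be sketched in a few lines of numpy. The embeddings below are illustrative stand-ins for frozen TiViT and TSFM features, and the closed-form least-squares probe is a lightweight substitute for the logistic regression trained in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n) * 2 - 1  # binary labels in {-1, +1}

# Illustrative stand-ins for frozen TiViT and TSFM embeddings of the same
# time series: each view carries its own noisy copy of the label signal.
z_vit = np.c_[y + rng.normal(size=n), rng.normal(size=(n, 15))]
z_tsfm = np.c_[y + rng.normal(size=n), rng.normal(size=(n, 7))]

def probe_accuracy(Z, y):
    """Least-squares linear probe (a lightweight stand-in for the
    logistic regression used in the experiments)."""
    w, *_ = np.linalg.lstsq(Z, y.astype(float), rcond=None)
    return float((np.sign(Z @ w) == y).mean())

# Joint classification: simply concatenate the two feature views.
acc_single = probe_accuracy(z_vit, y)
acc_joint = probe_accuracy(np.hstack([z_vit, z_tsfm]), y)
print(acc_single, acc_joint)
```

Because each view contributes complementary (here: independently noisy) label information, the concatenated features support a more accurate linear probe than either view alone.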

To uncover the differences between representations learned by ViTs and TSFMs, we additionally assess the alignment of their representation spaces using the mutual k-nearest neighbor metric [@huh2024prh]. For each sample in the dataset, we find the $k=10$ nearest neighbors in the embedding space of two different models and measure the intersection between the two neighbor sets. The final alignment score between two models is an average across all samples from the UCR benchmark. Table `\ref{fig: joint classification and alignment}`{=latex} presents the alignment scores for CLIP, DINOv2, Mantis, and Moment. Interestingly, the alignment score of the two TSFMs is relatively low. We hypothesize that this discrepancy arises from their different pretraining paradigms: Mantis is trained contrastively while Moment is trained with masked modeling. A similarly low alignment score is observed between any TiViT and TSFM, which we attribute to their domain gap. TiViT and Mantis extract different representations for the same time series, which is beneficial for joint classification. The highest alignment is measured between TiViT-CLIP and TiViT-DINOv2, both of which are pretrained contrastively on image datasets.
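The mutual k-nearest neighbor metric admits a direct implementation. The numpy sketch below follows the description above with $k=10$; the function and variable names are ours.

```python
import numpy as np

def mutual_knn_alignment(Z1, Z2, k=10):
    """Average overlap of the k-nearest-neighbor sets of the same samples
    computed in two embedding spaces (Huh et al., 2024)."""
    def knn_sets(Z):
        dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
        return np.argsort(dist, axis=1)[:, :k]
    nn1, nn2 = knn_sets(Z1), knn_sets(Z2)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nn1, nn2)]))

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 32))
score_same = mutual_knn_alignment(Z, Z)   # identical spaces -> 1.0
score_rand = mutual_knn_alignment(Z, rng.normal(size=(200, 32)))
print(score_same, score_rand)
```

Two identical embedding spaces reach the maximum score of 1.0, while unrelated embeddings of the same samples score close to the chance level of roughly $k/(n-1)$.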

Ablation studies {#subsection: ablation studies}
----------------

```{=latex}
\renewcommand{\arraystretch}{1.1}
```
```{=latex}
\centering
```
```{=latex}
\resizebox{.95\linewidth}{!}{
    \begin{tabular}{@{}lcccccc@{}}
      \toprule
      \multicolumn{1}{c}{} & \multicolumn{4}{c}{\textbf{UCR subsets}} & \multicolumn{2}{c}{\textbf{Classifier comparison}} \\
      \cmidrule(lr){2-5} \cmidrule(lr){6-7}
      \textbf{Model} & Smallest & Largest & Shortest & Longest & Logistic Reg. & Nearest Centroid \\
      \midrule
      Moment & 85.7 & 85.5 & 86.9 & 65.8 & 79.0 & 68.6 \\
      Mantis & 86.6 & 82.8 & 88.1 & 70.5 & 80.1 & 71.2 \\
      \midrule
      TiViT \emph{(Ours)} & 89.8 & 85.3 & 87.5 & 75.0 & 81.3 & 71.6 \\
      \midrule
      TiViT + Moment \emph{(Ours)} & 89.9 & \textbf{87.1} & \textbf{88.8} & 74.9 & 82.5 & 73.3 \\
      TiViT + Mantis \emph{(Ours)} & \textbf{90.9} & 86.2 & \textbf{88.8} & \textbf{77.7} & \textbf{83.0} & \textbf{73.4} \\
      \bottomrule
    \end{tabular}
  }
```
In Section `\ref{subsection: hidden representatoins}`{=latex}, we report the performance of TiViT across all 128 UCR datasets. To further explore its capabilities, we now select four UCR subsets of 10 datasets each: those with the fewest training samples ($16 \leq N_{train} \leq 20$), the most training samples ($1000 \leq N_{train} \leq 8926$), the shortest time series ($15 \leq T \leq 80$), and the longest time series ($1500 \leq T \leq 2844$). The results are displayed in Table `\ref{tab:ucr_combined}`{=latex}. TiViT significantly outperforms Mantis on the subsets with small training sets (89.8% vs. 86.6%) and long time series (75.0% vs. 70.5%). These findings demonstrate that TiViT excels at generalizing from limited training data and at modeling long-range dependencies. On the remaining two subsets, TiViT is on par with TSFMs. Combining the representations of TiViT and TSFMs achieves the highest classification accuracy across all subsets, once again underscoring their complementarity.

While the previous experiments require training a logistic regression classifier, we finally investigate the effectiveness of TiViT in zero-shot classification. Here, we employ a nearest centroid classifier, where each class is represented by the centroid of its representations, and samples are assigned to the class of their nearest centroid. On the UCR benchmark, TiViT achieves a zero-shot classification accuracy of 71.6%. Our approach is on par with Mantis (71.2%) and outperforms Moment (68.6%), highlighting the ability of TiViT to extract generalizable representations. Further merging the representations of TiViT and Mantis reaches a zero-shot accuracy of 73.4%.
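The nearest centroid readout described above requires no training beyond averaging. A minimal numpy sketch, on hypothetical embeddings with a class-dependent shift (the data and names are illustrative only):

```python
import numpy as np

def nearest_centroid_predict(train_z, train_y, test_z):
    """Each class is represented by the centroid of its training
    representations; test samples get the label of the nearest centroid."""
    classes = np.unique(train_y)
    centroids = np.stack([train_z[train_y == c].mean(axis=0) for c in classes])
    dist = np.linalg.norm(test_z[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[dist.argmin(axis=1)]

# Toy frozen embeddings: unit Gaussian noise plus one
# class-dependent coordinate shift.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)
z = rng.normal(size=(150, 16))
z[np.arange(150), y] += 4.0
acc = float((nearest_centroid_predict(z, y, z) == y).mean())
print(acc)
```

Since the readout involves no learned weights, its accuracy directly reflects how well classes are already separated in the frozen representation space.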

Conclusion {#sec:ccl}
==========

In this paper, we showed both theoretically and empirically that modeling time series in 2D rather than 1D benefits time series classification with Transformers. Building on this insight, we introduced TiViT, leveraging large pretrained ViTs for feature extraction on images generated from time series. Our analysis revealed that the hidden representations of ViTs characterized by high intrinsic dimensionality are most effective in time series classification. Utilizing the hidden representations of OpenCLIP, TiViT significantly outperformed state-of-the-art TSFMs in time series classification on UCR, and reached comparable performance on UEA. Furthermore, we investigated multimodal time series analysis by merging the representations of TiViT and TSFMs, and achieved state-of-the-art results for foundation models in zero-shot and linear classification on both benchmarks.

```{=latex}
\myparagraph{Limitations and future work}
```
For multivariate datasets, TiViT processes the signal from each channel independently and simply concatenates the representations prior to classification. Future research could explore more sophisticated techniques for capturing inter-channel dependencies. Similarly, the joint classification of TiViT and TSFMs is currently limited to the concatenation of their representations and could be further enhanced by learning a fusion strategy. While our study focused on evaluating the representations learned by TSFMs and ViTs through linear classification, subsequent work could unfreeze the ViT backbone and investigate the potential of finetuning TiViT.

```{=latex}
\begin{ack}
This work was partially funded by the ERC (853489 - DEXIM) and the Alfried Krupp von Bohlen und Halbach Foundation, which we thank for their generous support. The authors gratefully acknowledge the scientific support and resources of the AI service infrastructure \textit{LRZ AI Systems} provided by the Leibniz Supercomputing Centre (LRZ) of the Bavarian Academy of Sciences and Humanities (BAdW), funded by Bayerisches Staatsministerium für Wissenschaft und Kunst (StMWK).



\end{ack}
```
```{=latex}
\small
```
```{=latex}
\bibliographystyle{plainnat}
```
```{=latex}
\newpage
```
```{=latex}
\setcounter{table}{0}
```
```{=latex}
\renewcommand{\thetable}{A\arabic{table}}
```
```{=latex}
\setcounter{figure}{0}
```
```{=latex}
\renewcommand{\thefigure}{A\arabic{figure}}
```
```{=latex}
\appendix
```
Appendix {#appendix .unnumbered}
========

In Section `\ref{app:sec_theory}`{=latex}, we summarize the theoretical analysis of @li2023a on learning and generalization for Vision Transformers and detail our proof of label relevance for 2D patching. In Section `\ref{app: 1d 2d comparison}`{=latex}, we describe the model and pretraining setup used in our comparison of 1D and 2D patching for Transformers. In Section `\ref{app: tivit additional analysis}`{=latex}, we further analyze the size and type of TiViT backbones. In Section `\ref{app: ucr uea full benchmark}`{=latex}, we provide the benchmark results for each dataset from the UCR and UEA archive. Finally, we discuss the broader impacts of our work in Section `\ref{app: broader impacts}`{=latex}.

Details on the theoretical analysis {#app:sec_theory}
===================================

We first review the shallow ViT and data model introduced by @li2023a in their theoretical analysis of training a ViT. Their Theorem `\ref{trm:iclr2023}`{=latex} shows that the sample complexity for ViTs to achieve a zero generalization error is inversely correlated with the fraction of label-relevant tokens. Building on this insight, we provide a detailed proof of our Proposition 1 from the main paper, showing that 2D patching can increase the number of label-relevant tokens compared to 1D patching. We further illustrate our Proposition 1 with various examples of time series and their corresponding 2D representations.

Background {#app:sec_theory_background}
----------

#### Model and setup

Following the setup of @li2023a, we study a binary classification problem with $N$ training samples $\{({\bm{X}}^n, y^n)\}_{n=1}^N$. Each input ${\bm{X}}^n \in \mathbb{R}^{d \times L}$ contains $L$ tokens $\{{\bm{x}}_1^n, \dots, {\bm{x}}_L^n\}$. Labels $y^n \in \{\pm 1\}$ are determined by majority vote over discriminative tokens. A simplified Vision Transformer (ViT) [@dosovitskiy2021an] model is defined as: $$F({\bm{X}}^n) = \frac{1}{|\mathcal{S}^n|} \sum_{l \in \mathcal{S}^n} \bm{a}_{(l)}^{\top} \text{ReLU}\left({\bm{W}}_O {\bm{W}}_V {\bm{X}}^n \text{softmax}\left({\bm{X}}^{n^\top} {\bm{W}}_K^\top {\bm{W}}_Q {\bm{x}}_l^n\right)\right),$$ where $\psi = ({\bm{A}}\!=\!\{\bm{a}_{(l)}\}_l, {\bm{W}}_O, {\bm{W}}_V, {\bm{W}}_K, {\bm{W}}_Q)$ are trainable parameters. The empirical risk minimization problem is: $$\min_{\psi} f_N(\psi) = \frac{1}{N} \sum_{n=1}^N \max\left\{1 - y^n \cdot F({\bm{X}}^n), 0\right\}.$$ Training uses mini-batch SGD with fixed output layer weights ${\bm{A}}$, following standard NTK initialization practices.
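For concreteness, the forward pass of this simplified model can be transcribed directly into numpy. The sketch below uses arbitrary toy dimensions and random (untrained) weights purely to make the tensor shapes explicit; it is not the training setup of @li2023a.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def shallow_vit(X, A, W_O, W_V, W_K, W_Q, S):
    """F(X) = 1/|S| * sum_{l in S} a_(l)^T ReLU(W_O W_V X softmax(X^T W_K^T W_Q x_l))."""
    out = 0.0
    for l in S:
        # Attention of query token l over all L tokens.
        attn = softmax(X.T @ W_K.T @ W_Q @ X[:, l])
        out += A[:, l] @ relu(W_O @ W_V @ X @ attn)
    return out / len(S)

rng = np.random.default_rng(0)
d, L, m = 8, 6, 16  # toy feature dimension, token count, model width
X = rng.normal(size=(d, L))
A = rng.normal(size=(m, L))
W_O, W_V = rng.normal(size=(m, d)), rng.normal(size=(d, d))
W_K, W_Q = rng.normal(size=(d, d)), rng.normal(size=(d, d))
out = shallow_vit(X, A, W_O, W_V, W_K, W_Q, S=range(L))
print(out)
```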

#### Data model

Tokens ${\bm{x}}_l^n$ are noisy versions of $M$ patterns $\{\bm{\mu}_1, \dots, \bm{\mu}_M\}$, where $\bm{\mu}_1, \bm{\mu}_2$ are discriminative. Label $y^n$ depends on majority vote over tokens closest to $\bm{\mu}_1/\bm{\mu}_2$. Noise level $\tau$ satisfies $\tau < \kappa/4$, with $\kappa - 4\tau = \Theta(1)$.

#### Generalization of ViT

We now recap the main results from @li2023a from which we derive our result, along with the main notations in Table `\ref{tab:notations}`{=latex}.

```{=latex}
\begin{assumption*}[Initial Model Conditions, \cite{li2023a}]Initial weights \({\bm{W}}_V^{(0)}, {\bm{W}}_K^{(0)}, {\bm{W}}_Q^{(0)}\) satisfy:
\[
\|{\bm{W}}_V^{(0)}\bm{\mu}_j - {\bm{p}}_j\| \leq \sigma, \quad \|{\bm{W}}_K^{(0)}\bm{\mu}_j - {\bm{q}}_j\| \leq \delta, \quad \|{\bm{W}}_Q^{(0)}\bm{\mu}_j - {\bm{r}}_j\| \leq \delta,
\]
for orthonormal bases \(\mathcal{P}, \mathcal{Q}, \mathcal{R}\) and \(\sigma = O(1/M), \delta < 1/2\).
\end{assumption*}
```
```{=latex}
\begin{theorem*}[Generalization of ViT, \cite{li2023a}]\label{trm:iclr2023}
Under Assumption 1, with sufficient model width \(m \gtrsim \epsilon^{-2}M^2\log N\), fraction 
$$\alpha_* \geq \alpha_\#/(\epsilon_S e^{-(\delta+\tau)}(1 - (\sigma + \tau))),$$ 
and sample size 
$$N \geq \Omega\left((\alpha_* - c'(1 - \zeta) - c''(\sigma + \tau))^{-2}\right),$$ 
SGD achieves zero generalization error after 
$$T = \Theta\left(\frac{1}{(1 - \epsilon - (\sigma + \tau)M/\pi)\eta \alpha_*}\right)$$ iterations.
\end{theorem*}
```
```{=latex}
\begin{proposition*}[Generalization without Self-Attention, \cite{li2023a}]Without self-attention, achieving zero error requires \(N \geq \Omega\left((\alpha_*(\alpha_* - \sigma - \tau))^{-2}\right)\), demonstrating ViT's sample complexity reduction by \(1/\alpha_*^2\).
\end{proposition*}
```
```{=latex}
\centering
```
::: {#tab:notations}
  Notation                 Description
  ------------------------ ---------------------------------------
  $\alpha_*$               Fraction of label-relevant tokens
  $\sigma, \delta, \tau$   Initialization/token noise parameters
  $\kappa$                 Minimum pattern distance
  $M$                      Total number of patterns

  : Key Notations
:::

Proof of label relevance in 2D patches
--------------------------------------

We restate Proposition 1 from the main paper and provide a detailed proof.

```{=latex}
\setcounter{proposition}{0}
```
`\label{app:sec_theory_proof}`{=latex}

```{=latex}
\begin{proposition}
For arbitrary $\bm{\mu}_1, \bm{\mu}_2 \in \mathbb{R}^k$, let $\bm{t} = [\bm{x}_1 \ \ \bm{x}_2 \ \ \cdots \ \ \bm{x}_k ]^\top \in \mathbb{R}^T$, where $T=k^2$ and, for all $i \in [k]$, $\bm{x}_i \in \mathbb{R}^k$ with either $\bm{x}_i = \bm{\mu}_1$ or $\bm{x}_i = \bm{\mu}_2$, $\bm{\mu}_2$ being a label-relevant pattern. Let $\left| \left\{ i : \bm{x}_i = \boldsymbol{\mu}_2 \right\} \right| = n'$ and assume that $2\bm{x}'\cdot (\bm{\mu}_1 - \bm{\mu}_2) \leq ||\bm{\mu}_1 ||^2 - ||\bm{\mu}_2||^2$ whenever a token $\bm{x}'$ contains at least $\sqrt{k}$ elements of $\boldsymbol{\mu}_2$. Then, it holds that  $$\alpha_*^{\text{2D}} \geq \alpha_*^{\text{1D}}=\frac{n'}{k},$$ and the inequality is strict if $n' \text{ mod } \sqrt{k} > 0.$
\end{proposition}
```
```{=latex}
\begin{proof}
For a token \(\bm{x}'^n\) to be label-relevant (aligned with \(\bm{\mu}_2\)), it must satisfy:
\[
\|\bm{x}'^n - \bm{\mu}_2\| \leq \|\bm{x}'^n - \bm{\mu}_1\|.
\]

Expanding both sides, we have that:
\[
\|\bm{x}'^n\|^2 - 2\bm{x}'^n \cdot \bm{\mu}_2 + \|\bm{\mu}_2\|^2 \leq \|\bm{x}'^n\|^2 - 2\bm{x}'^n \cdot \bm{\mu}_1 + \|\bm{\mu}_1\|^2.
\]
Regrouping the terms gives us the desired condition:
\[
2\bm{x}'^n \cdot (\bm{\mu}_1 - \bm{\mu}_2) \leq ||\bm{\mu}_1||^2 - ||\bm{\mu}_2||^2. \tag{1}
\]

Recall that $n'$ denotes the number of segments of $\bm{\mu}_2$ in the time series $\bm{t}$. Each such segment spans $\sqrt{k}$ tokens, contributing at least $\sqrt{k}$ elements to each of them. Under the assumption of the proposition, this implies (1) and makes each of these $\sqrt{k}$ tokens label-relevant. 

We now need to carefully consider how the $\bm{\mu}_2$ segments can be placed within $\bm{t}$ to understand how many tokens become label-relevant thanks to each $\bm{\mu}_2$. We consider two cases: 1) $n' = c\sqrt{k}$ for some $c \in \mathbb{N}$ satisfying $n' \in (0,k]$, and 2) $n' = c\sqrt{k} + b$ for some $c,b \in \mathbb{N}$, $\sqrt{k}>b>0$, such that $n' \in (0,k]$. In the first case, $\alpha_*^{\text{1D}}=c\sqrt{k}/k$. With 2D patching, even in the worst-case placement, the $\bm{\mu}_2$ segments contribute to $c\sqrt{k}$ tokens, so $\alpha_*^{\text{2D}}\geq c\sqrt{k}/k$ and thus $\alpha_*^{\text{1D}} \leq \alpha_*^{\text{2D}}$. If $n'$ is not a multiple of $\sqrt{k}$, the same analysis applies to the $c\sqrt{k}$ full segments of $\bm{\mu}_2$. To account for the remainder $b$, we note that any $b>0$ adds $\sqrt{k}$ label-relevant tokens in the 2D case, so that $\alpha_*^{\text{2D}}\geq \frac{c\sqrt{k}+\sqrt{k}}{k}$, whereas for 1D patching $\alpha_*^{\text{1D}} = \frac{c\sqrt{k}+b}{k}$. Given that $b<\sqrt{k}$, this concludes the proof. 
\end{proof}
```
Additional illustrations of Proposition 1 {#app:sec_theory_examples}
-----------------------------------------

To illustrate the benefits of 2D modeling and patching, we present several examples of time series in Figure `\ref{fig:ex_prop1}`{=latex}. We define $\bm{\mu}_1$ using functions such as log, cosine, and sine. We then set $\bm{\mu}_2=\bm{1}_k$, $n'=3$ and randomly shuffle $\bm{\mu}_1$ and $\bm{\mu}_2$ segments within the generated input time series.

```{=latex}
\centering
```
![Illustration of Proposition 1 on more generated time series. In each example considered, 2D patching is more beneficial due to the higher number of label-relevant tokens.](figures/theory_ex/figs/const_log_ex/illustration_prop_input.png "fig:"){#fig:ex_prop1 width="\\linewidth"} ![Illustration of Proposition 1 on more generated time series. In each example considered, 2D patching is more beneficial due to the higher number of label-relevant tokens.](figures/theory_ex/figs/const_log_ex/illustration_prop_patch1D.png "fig:"){#fig:ex_prop1 width="\\linewidth"}

![Illustration of Proposition 1 on more generated time series. In each example considered, 2D patching is more beneficial due to the higher number of label-relevant tokens.](figures/theory_ex/figs/const_log_ex/illustration_prop_patch2D.png){#fig:ex_prop1 width=".9\\linewidth"}

```{=latex}
\vspace{.5cm}
```
![Illustration of Proposition 1 on more generated time series. In each example considered, 2D patching is more beneficial due to the higher number of label-relevant tokens.](figures/theory_ex/figs/cos_const_ex/illustration_prop_input.png "fig:"){#fig:ex_prop1 width="\\linewidth"} ![Illustration of Proposition 1 on more generated time series. In each example considered, 2D patching is more beneficial due to the higher number of label-relevant tokens.](figures/theory_ex/figs/cos_const_ex/illustration_prop_patch1D.png "fig:"){#fig:ex_prop1 width="\\linewidth"}

![Illustration of Proposition 1 on more generated time series. In each example considered, 2D patching is more beneficial due to the higher number of label-relevant tokens.](figures/theory_ex/figs/cos_const_ex/illustration_prop_patch2D.png){#fig:ex_prop1 width=".9\\linewidth"}

```{=latex}
\vspace{.5cm}
```
![Illustration of Proposition 1 on more generated time series. In each example considered, 2D patching is more beneficial due to the higher number of label-relevant tokens.](figures/theory_ex/figs/cos_const_ex2/illustration_prop_input.png "fig:"){#fig:ex_prop1 width="\\linewidth"} ![Illustration of Proposition 1 on more generated time series. In each example considered, 2D patching is more beneficial due to the higher number of label-relevant tokens.](figures/theory_ex/figs/cos_const_ex2/illustration_prop_patch1D.png "fig:"){#fig:ex_prop1 width="\\linewidth"}

![Illustration of Proposition 1 on more generated time series. In each example considered, 2D patching is more beneficial due to the higher number of label-relevant tokens.](figures/theory_ex/figs/cos_const_ex2/illustration_prop_patch2D_fix.png){#fig:ex_prop1 width=".9\\linewidth"}

```{=latex}
\vspace{.5cm}
```
![Illustration of Proposition 1 on more generated time series. In each example considered, 2D patching is more beneficial due to the higher number of label-relevant tokens.](figures/theory_ex/figs/sin_const_ex/illustration_prop_input.png "fig:"){#fig:ex_prop1 width="\\linewidth"} ![Illustration of Proposition 1 on more generated time series. In each example considered, 2D patching is more beneficial due to the higher number of label-relevant tokens.](figures/theory_ex/figs/sin_const_ex/illustration_prop_patch1D.png "fig:"){#fig:ex_prop1 width="\\linewidth"}

![Illustration of Proposition 1 on more generated time series. In each example considered, 2D patching is more beneficial due to the higher number of label-relevant tokens.](figures/theory_ex/figs/sin_const_ex/illustration_prop_patch2D.png){#fig:ex_prop1 width=".95\\linewidth"}

```{=latex}
\newpage
```
Details on the comparison of 1D and 2D patching for Transformers {#app: 1d 2d comparison}
================================================================

Architecture and pretraining
----------------------------

To evaluate the effect of 1D versus 2D patching on representations learned by Transformers, we fix the Transformer architecture and pretraining strategy, and only change the patching approach for generating input tokens. We adopt the setup of @feofanov2025mantis since their Transformer block implementation (ViTUnit class [here](https://github.com/vfeofanov/mantis/blob/main/src/mantis/architecture/architecture.py)) for time series classification is similar to the classical ViT. Specifically, the model comprises 6 Transformer layers, each with 8 attention heads and an embedding dimension of 256.

For pretraining, we employ contrastive learning following [@feofanov2025mantis; @he2020momentum]. The augmentation technique to generate positive pairs is RandomCropResize with a crop rate varying within $[0\%,20\%]$. All time series are resized to a fixed length $T = 512$ using interpolation.

We examine both non-overlapping and overlapping patches following [@goswami2024moment; @nie2022patchtst]. For non-overlapping 1D patching, we generate 32 patches of size 16. For non-overlapping 2D patching, we first arrange the 1D patches in a matrix of size $32 \times 16$ and then extract 32 patches of size $2 \times 8$. After flattening, we obtain 32 patches of size 16, similar to the 1D setting, but semantically different. For overlapping 1D patching, we apply a stride of 8, which yields 64 patches of size 16. For overlapping 2D patching, we rearrange these 1D patches again in a matrix of size $64 \times 16$ and then extract 32 patches of size $4 \times 8$. Flattening yields 32 patches of size 32.
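The patch constructions described above reduce to simple array reshaping. The numpy sketch below reproduces the stated patch counts and sizes; the end-padding used to obtain exactly 64 overlapping 1D patches is one plausible convention, not necessarily the exact implementation detail.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

T = 512
x = np.arange(T, dtype=float)  # toy time series of length T

# Non-overlapping 1D patching: 32 patches of size 16.
patches_1d = x.reshape(32, 16)

# Non-overlapping 2D patching: arrange the 1D patches as a 32 x 16 matrix,
# cut it into 2 x 8 blocks, and flatten each block into a token of size 16.
patches_2d = patches_1d.reshape(16, 2, 2, 8).swapaxes(1, 2).reshape(32, 16)

# Overlapping 1D patching: patch size 16, stride 8 -> 64 patches
# (padding the end of the series is one convention that yields exactly 64).
patches_1d_ov = sliding_window_view(np.pad(x, (0, 8)), 16)[::8]

# Overlapping 2D patching: rearrange the 64 x 16 matrix into 32 blocks
# of size 4 x 8 and flatten each block into a token of size 32.
patches_2d_ov = patches_1d_ov.reshape(16, 4, 2, 8).swapaxes(1, 2).reshape(32, 32)

print(patches_1d.shape, patches_2d.shape, patches_1d_ov.shape, patches_2d_ov.shape)
```

The 1D and 2D tokens in the non-overlapping case contain the same 512 values and have identical shapes; only the grouping of time steps into tokens differs, which is exactly the property the comparison isolates.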

Dataset
-------

To pretrain the different models, we first generate a pretraining dataset from publicly available datasets that are not part of the evaluation benchmark. In detail, we consider a concatenation of the following datasets: ECG [@clifford2017ecgdataset], EMG [@goldberger2000emgdataset], Epilepsy [@andrzejak2001epilepsydataset], FD-A and FD-B [@lessmeier2016fdafdbdataset], Gesture [@liu2009gesturedataset], HAR [@anguita2013hardataset], SleepEEG [@kemp2000sleepeegdataset]. To reduce computation time, we construct a subset of the full dataset containing 100 000 samples with a sufficiently balanced distribution across the individual source datasets. Table `\ref{tab:pretrainig-data-tsfm}`{=latex} details how many samples were taken from each dataset to form the pretraining corpus.

```{=latex}
\centering
```
  Dataset    Examples taken   Prop. of source dataset
  ---------- ---------------- -------------------------
  ECG        20835            $45.7\%$
  EMG        163              $100\%$
  Epilepsy   11480            $100\%$
  FD-A       10912            $100\%$
  FD-B       13619            $100\%$
  Gesture    1320             $100\%$
  HAR        20835            $78.7\%$
  SleepEEG   20836            $4.5\%$

  : Data used to pretrain Transformers for comparison of 1D and 2D patching.

`\label{tab:pretrainig-data-tsfm}`{=latex}

```{=latex}
\newpage 
```
Additional analysis on TiViT {#app: tivit additional analysis}
============================

Patch size and overlap
----------------------

In Section `\ref{subsection: transforming time series into images for vit feature extraction}`{=latex}, we analyze the time series-to-image transformation for TiViT-CLIP and show that a patch size $P=\sqrt{T}$ combined with a stride $S = \frac{P}{10}$ yields high classification accuracy for any time series of length $T$. Figure `\ref{fig: stride vs accuracy}`{=latex} displays the effect of patch overlap for TiViT with CLIP, DINOv2, and SigLIP 2 backbones while fixing the patch size at $P = \sqrt{T}$. All versions of TiViT achieve high classification accuracy when utilizing an overlap of 0.9 (corresponding to stride $S = \frac{P}{10}$).
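A minimal sketch of the segment-and-stack step with these parameters is given below; it is one plausible realization consistent with the description above (the subsequent resizing to the ViT input resolution is omitted, and the function name is ours).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def series_to_image(x: np.ndarray) -> np.ndarray:
    """Cut a univariate series of length T into overlapping segments of
    length P = sqrt(T) with stride S = P / 10 and stack them as image rows."""
    T = len(x)
    P = int(round(np.sqrt(T)))
    S = max(P // 10, 1)
    return sliding_window_view(x, P)[::S]

x = np.random.default_rng(0).normal(size=400)  # T = 400 -> P = 20, S = 2
img = series_to_image(x)
print(img.shape)  # one row per overlapping segment
```

With an overlap of 0.9, consecutive rows share 90% of their values, producing a smooth 2D structure that the frozen ViT can patch in both directions.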

```{=latex}
\centering
```
![Effect of patch overlap on the classification accuracy of TiViT with different backbones.](figures/overlap_accuracy.png){#fig: stride vs accuracy width="60%"}

```{=latex}
\centering
```
::: {#tab: aggregation ablation}
  ------------------------------------------------------- -------------------------------------- ------------------------------------------------- ----------------------------------------- ------- ------
  `\multirow{2}{*}{Model}`{=latex}                         `\multirow{2}{*}{\# Layers}`{=latex}   `\multicolumn{2}{c}{Average of tokens}`{=latex}   `\multicolumn{2}{c}{CLS token}`{=latex}          
  `\cmidrule`{=latex}(lr)3-4 `\cmidrule`{=latex}(lr)5-6                                                                Layer                                          Acc                     Layer   Acc
  TiViT-DINOv2                                                              25                                          15                                           80.0                      17     79.1
  TiViT-SigLIP 2                                                            28                                          10                                           80.6                      14     71.7
  TiViT-CLIP                                                                33                                          14                                         **81.3**                    18     78.6
  ------------------------------------------------------- -------------------------------------- ------------------------------------------------- ----------------------------------------- ------- ------

  : Linear classification accuracy of TiViT on the UCR dataset with different ways of aggregating the hidden representations per layer. We report the total number of layers including the output layer and the index of the best performing layer starting from 0.
:::

Aggregation of hidden token representations
-------------------------------------------

As described in Section `\ref{subsection: time series to image transformation}`{=latex}, we obtain a single embedding for each time series by averaging the ViT hidden representations in a particular layer. We now evaluate the performance of TiViT when using the CLS token from each layer instead. Table `\ref{tab: aggregation ablation}`{=latex} compares the linear classification performance on the UCR dataset using either the CLS token or the mean of all tokens. To ensure a fair comparison, we determine the best performing layer for each approach based on the validation accuracy. Across all backbones, the CLS token consistently results in lower test accuracy, confirming our choice to use the mean hidden representation in TiViT. Interestingly, the best performing CLS tokens appear in later layers than the best performing mean representations. Utilizing the mean representations therefore not only enhances classification accuracy but also reduces computational cost.

```{=latex}
\centering
```
::: {#tab: vit size ablation}
  Architecture    Layer (total number)   Parameters   Accuracy
  -------------- ---------------------- ------------ ----------
  ViT-B/32               8 (13)             52 M        79.8
  ViT-B/16               6 (13)             36 M        80.8
  ViT-L/14              10 (24)            178 M        80.3
  ViT-H/14              14 (32)            257 M      **81.3**

  : Linear classification of TiViT-CLIP with varying size of the ViT backbone. For each model, we report the test accuracy on the UCR dataset achieved with the best performing hidden layer representation and the number of parameters up to this layer.
:::

```{=latex}
\centering
```
```{=latex}
\centering
```
![TiViT-CLIP](figures/clip_intrinsic_dimension.png){#fig: clip id width="95%"}

```{=latex}
\hfill
```
```{=latex}
\centering
```
![TiViT-DINOv2](figures/dinov2_intrinsic_dimension.png){#fig: dinov2 id width="95%"}

```{=latex}
\hfill
```
```{=latex}
\centering
```
![TiViT-SigLIP 2](figures/siglip_intrinsic_dimension.png){#fig: siglip id width="95%"}

```{=latex}
\hfill
```
```{=latex}
\centering
```
![Number of principal components necessary to cover 95% of variance in the ViT representations per layer averaged across UCR datasets.](figures/tivit_principal_components.png){#fig: principal components width="60%"}

Intrinsic dimension and principal components of hidden representations {#app: intrinsic dim and pca}
----------------------------------------------------------------------

The intrinsic dimension quantifies the minimum number of variables required to represent a local neighborhood of samples in the representation space. To estimate the intrinsic dimension, the TWO-NN estimator introduced by @facco2017estimating leverages the distance of each data point to its first and second nearest neighbor. As noted by the authors, a larger number of data points reduces the average distance to the second neighbor, and thus increases the intrinsic dimension. To mitigate this effect, they propose to subsample the dataset. Given a dataset of size $N$, we report the intrinsic dimension for $\frac{N}{4}$ subsamples in the main paper, which is in line with @valeriani2023geometry. In Figure `\ref{fig:intrinsic dimension subsampling}`{=latex}, we compare the intrinsic dimension of average representations from hidden layers using $N$, $\frac{N}{2}$, $\frac{N}{4}$, and $\frac{N}{8}$ samples for estimation. The layer with the highest intrinsic dimension, which is central to our analysis, remains the same regardless of the subsampling ratio.

Since the intrinsic dimension only characterizes the local geometry of the representation space, we further provide a global analysis based on principal components. Specifically, in Figure `\ref{fig: principal components}`{=latex}, we determine the number of principal components necessary to cover 95% of the variance in the data. For DINOv2, we observe a peak in the number of principal components in the middle layers, which coincides with the layers achieving the best classification accuracy. Interestingly, CLIP and SigLIP 2 exhibit two peaks in the number of principal components across their layers. The middle layers corresponding to the first peak yield the highest time series classification accuracy.
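The component count underlying Figure `\ref{fig: principal components}`{=latex} can be computed from the singular values of the centered representations; the following numpy sketch (with synthetic low-rank data standing in for ViT features) illustrates the procedure.

```python
import numpy as np

def n_components_for_variance(Z: np.ndarray, threshold: float = 0.95) -> int:
    """Number of principal components needed to explain `threshold`
    of the variance of the representations Z (n_samples x n_features)."""
    s = np.linalg.svd(Z - Z.mean(axis=0), compute_uv=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(ratio, threshold) + 1)

# Representations concentrated on 5 directions plus a little noise
# should need roughly 5 components.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 64))
Z += 0.01 * rng.normal(size=Z.shape)
n_comp = n_components_for_variance(Z)
print(n_comp)
```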

Size of ViT backbone
--------------------

We report the performance of TiViT with a CLIP ViT-H backbone in Section `\ref{subsection: hidden representatoins}`{=latex} of the main paper. Table `\ref{tab: vit size ablation}`{=latex} provides a detailed analysis of how the performance of TiViT varies with the size of the ViT backbone, including ViT-B (with two patch sizes), ViT-L, and ViT-H. Remarkably, with only 6 Transformer layers of ViT-B/16, thereby matching the number of Transformer layers in Mantis, TiViT achieves an accuracy of 80.8% and surpasses Mantis (80.1%), although the hidden dimension of the ViT-B backbone is larger than that of Mantis. By utilizing a larger backbone, specifically 14 hidden layers of ViT-H/14, we achieve the highest accuracy of 81.3%, significantly outperforming conventional TSFMs.

```{=latex}
\centering
```
::: {#tab: mae ablation}
  ------------------------------------------------------- -------------------------------------- ------------------------------------------------- ----------------------------------------- ------- ----------
  `\multirow{2}{*}{Architecture}`{=latex}                  `\multirow{2}{*}{\# Layers}`{=latex}   `\multicolumn{2}{c}{Average of tokens}`{=latex}   `\multicolumn{2}{c}{CLS token}`{=latex}          
  `\cmidrule`{=latex}(lr)3-4 `\cmidrule`{=latex}(lr)5-6                                                                Layer                                          Acc                     Layer     Acc
  MAE Base                                                                  13                                           8                                           72.7                       9       73.8
  MAE Large                                                                 25                                          14                                           74.3                      18       75.6
  MAE Huge                                                                  33                                          20                                           75.9                      20     **76.7**
  ------------------------------------------------------- -------------------------------------- ------------------------------------------------- ----------------------------------------- ------- ----------

  : Linear classification accuracy of TiViT with varying MAE backbone size and two ways of aggregating hidden representations per layer. We report the total number of layers, including the output layer, and the index of the best-performing layer (indexing from 0).
:::

```{=latex}
\centering
```
![Intrinsic dimensionality of CLS tokens per MAE layer averaged across UCR datasets.](figures/mae_intrinsic_dimensionality.png){#fig: mae intrinsic dim width="60%"}

Masked autoencoder backbone
---------------------------

In the main paper, we analyze the reusability of ViT backbones from CLIP [@radford2021clip; @schuhmann2022laion], DINOv2 [@oquab2024dinov2], and SigLIP 2 [@tschannen2025siglip2] in time series classification. In contrast, @chen2024visionts repurpose Masked Autoencoders (MAEs) [@he2022masked] for time series forecasting. To enable a direct comparison, we now utilize the hidden representations of MAE Base, Large, and Huge in time series classification.

Our analysis in Table `\ref{tab: mae ablation}`{=latex} shows that for MAEs, using the CLS token yields better performance in time series classification than averaging token representations. Moreover, Table `\ref{tab: mae ablation}`{=latex} compares MAEs of different sizes, showing that larger backbones consistently achieve higher accuracy. In contrast to the contrastively pretrained models summarized in Table `\ref{tab: hidden layers}`{=latex} of the main paper, the best representations for time series classification with MAE lie in later layers. We further observe that the hidden representations of the later MAE layers, up to the output layer, perform similarly in time series classification, whereas there is a significant gap between hidden and output representations for TiViT-CLIP (see Figure `\ref{fig: layer vs accuracy}`{=latex} in the main paper). Figure `\ref{fig: mae intrinsic dim}`{=latex} illustrates the intrinsic dimension of the CLS tokens per layer, averaged across the UCR datasets. The intrinsic dimension increases up to roughly 60% of the layer depth, while the later layers exhibit a similar intrinsic dimension, explaining their similar classification performance.

It is worth noting that MAE has only been pretrained on ImageNet-1k [@deng2009imagenet] with approximately 1.3 million samples, whereas CLIP has been pretrained on the significantly larger LAION-2B [@schuhmann2022laion] dataset with 2 billion samples. We hypothesize that exposure to a larger set of images during pretraining enhances the capacity of a vision model to extract discriminative patterns from 2D time series representations.

Classifiers
-----------

In most of our experiments, we employ a linear classifier (logistic regression) to evaluate the representations of TiViT and Mantis. In Section `\ref{subsection: ablation studies}`{=latex}, we additionally assess zero-shot performance using a nearest centroid classifier. Here, we further adopt a random forest classifier following @feofanov2025mantis. Table `\ref{tab: random forest}`{=latex} presents the classification accuracy of TiViT and TSFMs under this evaluation protocol on the UCR benchmark. We observe that TiViT performs on par with Mantis, and that combining the representations of both models achieves state-of-the-art classification performance. @feofanov2025mantis have demonstrated that Mantis surpasses other TSFMs such as NuTime [@lin2023nutime] when evaluated with a random forest classifier; this conclusion now extends to TiViT.
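For concreteness, the fusion rows in Table `\ref{tab: random forest}`{=latex} amount to concatenating the frozen features of both encoders before fitting the classifier. A minimal sketch with `scikit-learn` (function and argument names are illustrative, and `RandomForestClassifier` with default-style settings stands in for the exact configuration of @feofanov2025mantis):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fused_accuracy(feats_a_train, feats_b_train, y_train,
                   feats_a_test, feats_b_test, y_test, seed=0):
    # concatenate per-sample features from the two frozen encoders
    X_train = np.concatenate([feats_a_train, feats_b_train], axis=1)
    X_test = np.concatenate([feats_a_test, feats_b_test], axis=1)
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)
```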

```{=latex}
\centering
```
![Pairwise joint classification accuracy.](figures/accuracy_heatmap.png){#fig: accuracy heatmap width="\\textwidth"}

```{=latex}
\hfill
```
```{=latex}
\centering
```
![Pairwise alignment score (mutual kNN).](figures/alignment_heatmap.png){#fig: alignment heatmap width="\\textwidth"}

```{=latex}
\centering
```
::: {#tab: random forest}
  Model             Accuracy     
  ---------------- ---------- -- --
  Moment              75.6       
  Mantis              77.5       
  TiViT               77.4       
  TiViT + Moment      79.4       
  TiViT + Mantis    **79.8**     

  : Classification with random forest on the UCR dataset.
:::

Alignment and fusion of TiViT and TSFM representations {#alignment-and-fusion-of-tivit-and-tsfm-representations}
------------------------------------------------------

In Figure `\ref{fig: joint classification and alignment}`{=latex} of our main paper, we report the alignment and joint classification accuracy for TiViT and TSFMs. Figures `\ref{fig: accuracy heatmap}`{=latex} and `\ref{fig: alignment heatmap}`{=latex} additionally visualize the pairwise scores as heatmaps.
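The mutual kNN alignment score underlying these heatmaps measures, for each sample, the fraction of its $k$ nearest neighbors that coincide across the two representation spaces. A minimal sketch (the function name is ours, and the choice of $k$ and the Euclidean metric are assumptions of this illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mutual_knn_alignment(feats_a, feats_b, k=10):
    # indices of the k nearest neighbors of each sample (self excluded)
    def knn(X):
        _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        return idx[:, 1:]
    nn_a, nn_b = knn(feats_a), knn(feats_b)
    # average fraction of neighbors shared between the two spaces
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap))
```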

```{=latex}
\newpage
```
Detailed results on UCR and UEA benchmarks {#app: ucr uea full benchmark}
==========================================

In the main paper, we report the average accuracy of TiViT and TSFMs across 128 univariate datasets from the UCR archive and 27 multivariate datasets from the UEA archive. Here, we report the full linear classification benchmark with accuracy scores for Mantis, Moment, TiViT, and their combinations on each dataset. Table `\ref{tab:detailed-ucr-benchmark}`{=latex} presents the performance on the UCR benchmark, while Table `\ref{tab: detailed uea benchmark}`{=latex} reports the results on the UEA benchmark. Additionally, Table `\ref{tab: mean rank}`{=latex} provides the mean rank of all five methods on both benchmarks. If multiple methods share the same accuracy on a dataset, we assign all of them the lowest rank in the group.
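The tie-handling rule can be made precise with a small sketch (names are illustrative; a tied group receives the lowest, i.e. best, rank of that group):

```python
import numpy as np

def mean_ranks(acc):
    # acc: (n_datasets, n_methods) accuracy matrix; rank 1 is best per
    # dataset, and tied methods all receive the lowest rank in their group
    ranks = np.zeros_like(acc, dtype=float)
    for i, row in enumerate(acc):
        for j, v in enumerate(row):
            ranks[i, j] = 1 + np.sum(row > v)  # 1 + number of strictly better
    return ranks.mean(axis=0)
```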

```{=latex}
\tiny
```
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Dataset                                                                              | Moment                | Mantis                | TiViT                 | TiViT + Moment        | TiViT + Mantis        |
+:=====================================================================================+:=====================:+:=====================:+:=====================:+:=====================:+:=====================:+
| ACSF1                                                                                | 0.673 $\pm$ 0.012     | 0.667 $\pm$ 0.021     | **0.777** $\pm$ 0.015 | **0.777** $\pm$ 0.012 | 0.763 $\pm$ 0.021     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Adiac                                                                                | 0.731 $\pm$ 0.003     | 0.728 $\pm$ 0.010     | 0.695 $\pm$ 0.017     | **0.740** $\pm$ 0.005 | 0.714 $\pm$ 0.003     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| AllGestureWiimoteX                                                                   | 0.680 $\pm$ 0.004     | 0.666 $\pm$ 0.007     | 0.653 $\pm$ 0.016     | **0.702** $\pm$ 0.002 | 0.673 $\pm$ 0.019     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| AllGestureWiimoteY                                                                   | 0.711 $\pm$ 0.024     | 0.699 $\pm$ 0.007     | 0.715 $\pm$ 0.010     | 0.733 $\pm$ 0.013     | **0.740** $\pm$ 0.010 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| AllGestureWiimoteZ                                                                   | 0.583 $\pm$ 0.013     | 0.650 $\pm$ 0.004     | 0.649 $\pm$ 0.017     | 0.664 $\pm$ 0.011     | **0.667** $\pm$ 0.019 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ArrowHead                                                                            | 0.804 $\pm$ 0.012     | 0.745 $\pm$ 0.007     | 0.806 $\pm$ 0.045     | **0.840** $\pm$ 0.023 | 0.825 $\pm$ 0.035     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| BME                                                                                  | 0.900 $\pm$ 0.075     | 0.987 $\pm$ 0.012     | **0.998** $\pm$ 0.004 | 0.987 $\pm$ 0.018     | 0.996 $\pm$ 0.008     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Beef                                                                                 | **0.756** $\pm$ 0.038 | 0.700 $\pm$ 0.033     | 0.733 $\pm$ 0.033     | **0.756** $\pm$ 0.038 | 0.733 $\pm$ 0.033     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| BeetleFly                                                                            | 0.833 $\pm$ 0.029     | 0.900 $\pm$ 0.000     | 0.900 $\pm$ 0.050     | 0.883 $\pm$ 0.029     | **0.933** $\pm$ 0.029 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| BirdChicken                                                                          | 0.850 $\pm$ 0.087     | **0.933** $\pm$ 0.076 | 0.850 $\pm$ 0.087     | 0.850 $\pm$ 0.087     | 0.850 $\pm$ 0.087     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| CBF                                                                                  | 0.943 $\pm$ 0.012     | 0.994 $\pm$ 0.010     | **0.999** $\pm$ 0.001 | 0.998 $\pm$ 0.003     | **0.999** $\pm$ 0.001 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Car                                                                                  | 0.817 $\pm$ 0.000     | 0.794 $\pm$ 0.051     | 0.794 $\pm$ 0.010     | 0.806 $\pm$ 0.025     | **0.822** $\pm$ 0.025 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Chinatown                                                                            | 0.966 $\pm$ 0.009     | 0.962 $\pm$ 0.003     | 0.965 $\pm$ 0.009     | **0.976** $\pm$ 0.012 | 0.970 $\pm$ 0.007     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ChlorineConcentration                                                                | 0.723 $\pm$ 0.001     | 0.643 $\pm$ 0.004     | 0.721 $\pm$ 0.011     | **0.739** $\pm$ 0.016 | 0.737 $\pm$ 0.009     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| CinCECGTorso                                                                         | 0.733 $\pm$ 0.031     | 0.737 $\pm$ 0.004     | **0.895** $\pm$ 0.013 | 0.863 $\pm$ 0.019     | **0.895** $\pm$ 0.012 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Coffee                                                                               | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Computers                                                                            | 0.712 $\pm$ 0.036     | 0.735 $\pm$ 0.021     | 0.748 $\pm$ 0.016     | **0.772** $\pm$ 0.024 | 0.767 $\pm$ 0.012     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| CricketX                                                                             | 0.706 $\pm$ 0.020     | 0.726 $\pm$ 0.015     | 0.763 $\pm$ 0.010     | 0.755 $\pm$ 0.005     | **0.766** $\pm$ 0.011 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| CricketY                                                                             | 0.693 $\pm$ 0.018     | 0.732 $\pm$ 0.017     | 0.767 $\pm$ 0.011     | **0.779** $\pm$ 0.007 | 0.777 $\pm$ 0.011     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| CricketZ                                                                             | 0.740 $\pm$ 0.016     | 0.721 $\pm$ 0.009     | 0.773 $\pm$ 0.015     | 0.779 $\pm$ 0.012     | **0.797** $\pm$ 0.017 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Crop                                                                                 | 0.709 $\pm$ 0.003     | 0.695 $\pm$ 0.001     | 0.673 $\pm$ 0.003     | **0.712** $\pm$ 0.002 | 0.707 $\pm$ 0.003     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| DiatomSizeReduction                                                                  | 0.900 $\pm$ 0.030     | 0.881 $\pm$ 0.032     | **0.938** $\pm$ 0.048 | 0.932 $\pm$ 0.049     | **0.938** $\pm$ 0.048 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| DistalPhalanxOutlineAgeGroup                                                         | 0.743 $\pm$ 0.011     | **0.746** $\pm$ 0.017 | 0.715 $\pm$ 0.004     | 0.724 $\pm$ 0.011     | 0.700 $\pm$ 0.011     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| DistalPhalanxOutlineCorrect                                                          | **0.762** $\pm$ 0.017 | 0.728 $\pm$ 0.007     | 0.755 $\pm$ 0.006     | 0.756 $\pm$ 0.014     | 0.743 $\pm$ 0.007     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| DistalPhalanxTW                                                                      | 0.643 $\pm$ 0.004     | **0.698** $\pm$ 0.007 | 0.652 $\pm$ 0.027     | 0.688 $\pm$ 0.011     | 0.640 $\pm$ 0.007     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| DodgerLoopDay                                                                        | 0.467 $\pm$ 0.031     | **0.504** $\pm$ 0.014 | 0.475 $\pm$ 0.022     | 0.500 $\pm$ 0.033     | 0.496 $\pm$ 0.031     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| DodgerLoopGame                                                                       | 0.720 $\pm$ 0.051     | 0.749 $\pm$ 0.008     | 0.768 $\pm$ 0.045     | 0.756 $\pm$ 0.053     | **0.783** $\pm$ 0.040 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| DodgerLoopWeekend                                                                    | **0.971** $\pm$ 0.000 | 0.964 $\pm$ 0.000     | 0.957 $\pm$ 0.000     | 0.969 $\pm$ 0.004     | **0.971** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ECG200                                                                               | 0.843 $\pm$ 0.006     | **0.853** $\pm$ 0.012 | 0.837 $\pm$ 0.012     | **0.853** $\pm$ 0.015 | 0.847 $\pm$ 0.012     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ECG5000                                                                              | 0.933 $\pm$ 0.005     | 0.924 $\pm$ 0.003     | 0.936 $\pm$ 0.002     | 0.937 $\pm$ 0.002     | **0.939** $\pm$ 0.002 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ECGFiveDays                                                                          | 0.957 $\pm$ 0.007     | 0.977 $\pm$ 0.004     | 0.983 $\pm$ 0.001     | **0.995** $\pm$ 0.001 | 0.986 $\pm$ 0.001     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| EOGHorizontalSignal                                                                  | 0.561 $\pm$ 0.008     | 0.562 $\pm$ 0.018     | 0.603 $\pm$ 0.014     | 0.644 $\pm$ 0.015     | **0.649** $\pm$ 0.006 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| EOGVerticalSignal                                                                    | 0.463 $\pm$ 0.012     | **0.507** $\pm$ 0.007 | 0.465 $\pm$ 0.009     | 0.493 $\pm$ 0.014     | 0.491 $\pm$ 0.008     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Earthquakes                                                                          | **0.722** $\pm$ 0.034 | 0.719 $\pm$ 0.007     | 0.707 $\pm$ 0.015     | 0.717 $\pm$ 0.032     | **0.722** $\pm$ 0.029 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ElectricDevices                                                                      | 0.631 $\pm$ 0.008     | 0.701 $\pm$ 0.003     | **0.762** $\pm$ 0.002 | 0.744 $\pm$ 0.005     | 0.751 $\pm$ 0.002     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| EthanolLevel                                                                         | **0.631** $\pm$ 0.010 | 0.439 $\pm$ 0.010     | 0.579 $\pm$ 0.023     | 0.614 $\pm$ 0.007     | 0.583 $\pm$ 0.012     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| FaceAll                                                                              | 0.733 $\pm$ 0.014     | **0.794** $\pm$ 0.010 | 0.745 $\pm$ 0.007     | 0.747 $\pm$ 0.004     | 0.766 $\pm$ 0.006     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| FaceFour                                                                             | 0.784 $\pm$ 0.041     | **0.958** $\pm$ 0.007 | 0.777 $\pm$ 0.093     | 0.811 $\pm$ 0.046     | 0.879 $\pm$ 0.046     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| FacesUCR                                                                             | 0.791 $\pm$ 0.009     | 0.886 $\pm$ 0.005     | 0.863 $\pm$ 0.011     | 0.870 $\pm$ 0.011     | **0.902** $\pm$ 0.009 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| FiftyWords                                                                           | 0.727 $\pm$ 0.021     | 0.740 $\pm$ 0.013     | 0.747 $\pm$ 0.011     | 0.767 $\pm$ 0.006     | **0.777** $\pm$ 0.012 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Fish                                                                                 | 0.947 $\pm$ 0.003     | 0.958 $\pm$ 0.007     | 0.949 $\pm$ 0.006     | 0.958 $\pm$ 0.012     | **0.970** $\pm$ 0.009 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| FordA                                                                                | 0.914 $\pm$ 0.003     | 0.911 $\pm$ 0.002     | 0.909 $\pm$ 0.004     | **0.928** $\pm$ 0.005 | 0.914 $\pm$ 0.005     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| FordB                                                                                | 0.800 $\pm$ 0.005     | 0.769 $\pm$ 0.002     | **0.801** $\pm$ 0.004 | 0.796 $\pm$ 0.011     | 0.795 $\pm$ 0.005     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| FreezerRegularTrain                                                                  | 0.973 $\pm$ 0.012     | 0.976 $\pm$ 0.012     | **0.995** $\pm$ 0.001 | **0.995** $\pm$ 0.004 | **0.995** $\pm$ 0.002 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| FreezerSmallTrain                                                                    | 0.840 $\pm$ 0.012     | 0.870 $\pm$ 0.020     | **0.981** $\pm$ 0.004 | 0.970 $\pm$ 0.008     | 0.980 $\pm$ 0.005     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Fungi                                                                                | 0.753 $\pm$ 0.033     | 0.810 $\pm$ 0.025     | 0.794 $\pm$ 0.020     | 0.810 $\pm$ 0.020     | **0.815** $\pm$ 0.025 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| GestureMidAirD1                                                                      | 0.656 $\pm$ 0.012     | 0.669 $\pm$ 0.023     | 0.726 $\pm$ 0.025     | 0.721 $\pm$ 0.018     | **0.756** $\pm$ 0.031 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| GestureMidAirD2                                                                      | 0.567 $\pm$ 0.016     | 0.574 $\pm$ 0.032     | 0.646 $\pm$ 0.043     | 0.628 $\pm$ 0.019     | **0.669** $\pm$ 0.028 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| GestureMidAirD3                                                                      | 0.359 $\pm$ 0.019     | 0.385 $\pm$ 0.013     | 0.474 $\pm$ 0.009     | 0.441 $\pm$ 0.018     | **0.479** $\pm$ 0.035 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| GesturePebbleZ1                                                                      | 0.893 $\pm$ 0.015     | 0.911 $\pm$ 0.003     | 0.891 $\pm$ 0.003     | 0.924 $\pm$ 0.010     | **0.932** $\pm$ 0.007 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| GesturePebbleZ2                                                                      | 0.846 $\pm$ 0.018     | **0.905** $\pm$ 0.006 | 0.835 $\pm$ 0.011     | 0.876 $\pm$ 0.032     | 0.892 $\pm$ 0.011     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| GunPoint                                                                             | 0.984 $\pm$ 0.027     | 0.987 $\pm$ 0.007     | 0.991 $\pm$ 0.004     | **0.993** $\pm$ 0.007 | **0.993** $\pm$ 0.007 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| GunPointAgeSpan                                                                      | 0.980 $\pm$ 0.008     | **0.998** $\pm$ 0.002 | 0.997 $\pm$ 0.000     | 0.995 $\pm$ 0.002     | 0.997 $\pm$ 0.000     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| GunPointMaleVersusFemale                                                             | **1.000** $\pm$ 0.000 | 0.999 $\pm$ 0.002     | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| GunPointOldVersusYoung                                                               | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 | 0.989 $\pm$ 0.004     | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Ham                                                                                  | **0.752** $\pm$ 0.025 | 0.667 $\pm$ 0.010     | 0.698 $\pm$ 0.049     | 0.730 $\pm$ 0.048     | 0.740 $\pm$ 0.044     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| HandOutlines                                                                         | 0.930 $\pm$ 0.007     | 0.931 $\pm$ 0.006     | 0.936 $\pm$ 0.004     | **0.942** $\pm$ 0.006 | 0.931 $\pm$ 0.004     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Haptics                                                                              | 0.491 $\pm$ 0.026     | 0.462 $\pm$ 0.002     | 0.487 $\pm$ 0.027     | 0.521 $\pm$ 0.033     | **0.523** $\pm$ 0.022 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Herring                                                                              | **0.698** $\pm$ 0.018 | 0.682 $\pm$ 0.024     | 0.615 $\pm$ 0.018     | 0.620 $\pm$ 0.039     | 0.635 $\pm$ 0.033     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| HouseTwenty                                                                          | 0.947 $\pm$ 0.010     | 0.961 $\pm$ 0.010     | **0.980** $\pm$ 0.005 | 0.975 $\pm$ 0.008     | **0.980** $\pm$ 0.005 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| InlineSkate                                                                          | 0.364 $\pm$ 0.019     | 0.334 $\pm$ 0.021     | 0.393 $\pm$ 0.008     | **0.403** $\pm$ 0.005 | 0.396 $\pm$ 0.008     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| InsectEPGRegularTrain                                                                | 0.987 $\pm$ 0.014     | **1.000** $\pm$ 0.000 | 0.997 $\pm$ 0.005     | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| InsectEPGSmallTrain                                                                  | 0.953 $\pm$ 0.008     | **1.000** $\pm$ 0.000 | 0.985 $\pm$ 0.008     | 0.981 $\pm$ 0.014     | **1.000** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| InsectWingbeatSound                                                                  | 0.539 $\pm$ 0.003     | 0.469 $\pm$ 0.019     | 0.524 $\pm$ 0.016     | **0.553** $\pm$ 0.010 | 0.531 $\pm$ 0.013     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ItalyPowerDemand                                                                     | **0.938** $\pm$ 0.005 | 0.911 $\pm$ 0.007     | 0.928 $\pm$ 0.015     | 0.937 $\pm$ 0.013     | 0.928 $\pm$ 0.014     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| LargeKitchenAppliances                                                               | 0.859 $\pm$ 0.005     | 0.820 $\pm$ 0.010     | 0.880 $\pm$ 0.012     | **0.884** $\pm$ 0.014 | 0.874 $\pm$ 0.009     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Lightning2                                                                           | 0.760 $\pm$ 0.041     | 0.781 $\pm$ 0.025     | 0.820 $\pm$ 0.000     | **0.836** $\pm$ 0.016 | **0.836** $\pm$ 0.033 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Lightning7                                                                           | 0.836 $\pm$ 0.036     | 0.749 $\pm$ 0.021     | 0.836 $\pm$ 0.014     | **0.868** $\pm$ 0.008 | 0.845 $\pm$ 0.008     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Mallat                                                                               | 0.915 $\pm$ 0.010     | 0.868 $\pm$ 0.028     | 0.930 $\pm$ 0.033     | **0.957** $\pm$ 0.017 | 0.939 $\pm$ 0.023     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Meat                                                                                 | 0.911 $\pm$ 0.038     | **0.939** $\pm$ 0.019 | 0.806 $\pm$ 0.019     | 0.900 $\pm$ 0.029     | 0.872 $\pm$ 0.051     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| MedicalImages                                                                        | 0.731 $\pm$ 0.003     | 0.705 $\pm$ 0.024     | 0.741 $\pm$ 0.011     | **0.778** $\pm$ 0.009 | 0.762 $\pm$ 0.013     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| MelbournePedestrian                                                                  | **0.933** $\pm$ 0.004 | 0.908 $\pm$ 0.006     | 0.860 $\pm$ 0.005     | 0.930 $\pm$ 0.005     | 0.920 $\pm$ 0.006     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| MiddlePhalanxOutlineAgeGroup                                                         | 0.481 $\pm$ 0.028     | **0.563** $\pm$ 0.042 | 0.552 $\pm$ 0.023     | 0.530 $\pm$ 0.023     | 0.550 $\pm$ 0.014     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| MiddlePhalanxOutlineCorrect                                                          | 0.813 $\pm$ 0.028     | **0.844** $\pm$ 0.007 | 0.784 $\pm$ 0.019     | 0.795 $\pm$ 0.019     | 0.818 $\pm$ 0.019     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| MiddlePhalanxTW                                                                      | 0.515 $\pm$ 0.019     | 0.455 $\pm$ 0.019     | **0.517** $\pm$ 0.004 | 0.498 $\pm$ 0.004     | 0.509 $\pm$ 0.014     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| MixedShapesRegularTrain                                                              | 0.947 $\pm$ 0.002     | 0.956 $\pm$ 0.003     | 0.975 $\pm$ 0.001     | 0.974 $\pm$ 0.001     | **0.978** $\pm$ 0.001 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| MixedShapesSmallTrain                                                                | 0.876 $\pm$ 0.011     | 0.897 $\pm$ 0.010     | 0.944 $\pm$ 0.006     | 0.935 $\pm$ 0.006     | **0.947** $\pm$ 0.009 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| MoteStrain                                                                           | 0.879 $\pm$ 0.011     | 0.887 $\pm$ 0.015     | 0.899 $\pm$ 0.004     | **0.922** $\pm$ 0.012 | 0.918 $\pm$ 0.013     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| NonInvasiveFetalECGThorax1                                                           | 0.918 $\pm$ 0.001     | 0.799 $\pm$ 0.004     | 0.890 $\pm$ 0.008     | **0.921** $\pm$ 0.005 | 0.887 $\pm$ 0.002     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| NonInvasiveFetalECGThorax2                                                           | 0.927 $\pm$ 0.002     | 0.817 $\pm$ 0.004     | 0.915 $\pm$ 0.003     | **0.933** $\pm$ 0.002 | 0.918 $\pm$ 0.003     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| OSULeaf                                                                              | 0.920 $\pm$ 0.009     | 0.902 $\pm$ 0.006     | **0.988** $\pm$ 0.007 | 0.986 $\pm$ 0.005     | 0.985 $\pm$ 0.002     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| OliveOil                                                                             | 0.889 $\pm$ 0.019     | **0.944** $\pm$ 0.019 | 0.700 $\pm$ 0.033     | 0.856 $\pm$ 0.019     | 0.789 $\pm$ 0.051     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| PLAID                                                                                | 0.741 $\pm$ 0.005     | 0.819 $\pm$ 0.005     | 0.911 $\pm$ 0.005     | 0.901 $\pm$ 0.007     | **0.929** $\pm$ 0.007 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| PhalangesOutlinesCorrect                                                             | **0.800** $\pm$ 0.004 | 0.796 $\pm$ 0.006     | 0.789 $\pm$ 0.005     | **0.800** $\pm$ 0.012 | 0.794 $\pm$ 0.005     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Phoneme                                                                              | 0.276 $\pm$ 0.014     | 0.294 $\pm$ 0.013     | 0.377 $\pm$ 0.008     | 0.377 $\pm$ 0.009     | **0.386** $\pm$ 0.011 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| PickupGestureWiimoteZ                                                                | 0.760 $\pm$ 0.040     | 0.807 $\pm$ 0.012     | 0.853 $\pm$ 0.042     | 0.840 $\pm$ 0.060     | **0.887** $\pm$ 0.042 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| PigAirwayPressure                                                                    | 0.117 $\pm$ 0.017     | 0.579 $\pm$ 0.012     | 0.535 $\pm$ 0.011     | 0.474 $\pm$ 0.007     | **0.612** $\pm$ 0.032 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| PigArtPressure                                                                       | 0.750 $\pm$ 0.019     | 0.811 $\pm$ 0.015     | 0.798 $\pm$ 0.024     | 0.808 $\pm$ 0.021     | **0.845** $\pm$ 0.024 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| PigCVP                                                                               | 0.723 $\pm$ 0.018     | **0.777** $\pm$ 0.012 | 0.670 $\pm$ 0.028     | 0.734 $\pm$ 0.012     | **0.777** $\pm$ 0.007 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Plane                                                                                | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| PowerCons                                                                            | 0.930 $\pm$ 0.012     | 0.941 $\pm$ 0.017     | 0.898 $\pm$ 0.006     | **0.952** $\pm$ 0.014 | 0.915 $\pm$ 0.003     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ProximalPhalanxOutlineAgeGroup                                                       | 0.800 $\pm$ 0.015     | **0.850** $\pm$ 0.014 | 0.837 $\pm$ 0.007     | 0.833 $\pm$ 0.010     | 0.837 $\pm$ 0.012     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ProximalPhalanxOutlineCorrect                                                        | 0.875 $\pm$ 0.010     | **0.885** $\pm$ 0.005 | 0.861 $\pm$ 0.008     | 0.877 $\pm$ 0.002     | 0.875 $\pm$ 0.005     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ProximalPhalanxTW                                                                    | **0.751** $\pm$ 0.013 | 0.727 $\pm$ 0.013     | 0.740 $\pm$ 0.007     | 0.738 $\pm$ 0.010     | 0.740 $\pm$ 0.010     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| RefrigerationDevices                                                                 | 0.520 $\pm$ 0.023     | 0.517 $\pm$ 0.014     | **0.568** $\pm$ 0.019 | 0.552 $\pm$ 0.023     | 0.564 $\pm$ 0.029     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Rock                                                                                 | 0.640 $\pm$ 0.087     | 0.607 $\pm$ 0.110     | 0.833 $\pm$ 0.099     | 0.807 $\pm$ 0.095     | **0.840** $\pm$ 0.106 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ScreenType                                                                           | 0.477 $\pm$ 0.018     | 0.465 $\pm$ 0.013     | 0.523 $\pm$ 0.012     | 0.542 $\pm$ 0.019     | **0.548** $\pm$ 0.006 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| SemgHandGenderCh2                                                                    | 0.742 $\pm$ 0.010     | 0.877 $\pm$ 0.010     | 0.877 $\pm$ 0.008     | 0.866 $\pm$ 0.013     | **0.916** $\pm$ 0.010 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| SemgHandMovementCh2                                                                  | 0.414 $\pm$ 0.019     | 0.657 $\pm$ 0.012     | 0.547 $\pm$ 0.005     | 0.533 $\pm$ 0.007     | **0.692** $\pm$ 0.009 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| SemgHandSubjectCh2                                                                   | 0.662 $\pm$ 0.002     | 0.834 $\pm$ 0.013     | 0.840 $\pm$ 0.002     | 0.819 $\pm$ 0.006     | **0.884** $\pm$ 0.008 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ShakeGestureWiimoteZ                                                                 | 0.907 $\pm$ 0.031     | 0.907 $\pm$ 0.012     | 0.840 $\pm$ 0.035     | **0.913** $\pm$ 0.012 | 0.867 $\pm$ 0.012     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ShapeletSim                                                                          | 0.963 $\pm$ 0.006     | 0.924 $\pm$ 0.008     | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ShapesAll                                                                            | 0.893 $\pm$ 0.008     | 0.851 $\pm$ 0.007     | 0.899 $\pm$ 0.003     | **0.915** $\pm$ 0.002 | 0.909 $\pm$ 0.002     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| SmallKitchenAppliances                                                               | 0.720 $\pm$ 0.012     | 0.784 $\pm$ 0.012     | **0.815** $\pm$ 0.015 | **0.815** $\pm$ 0.019 | 0.808 $\pm$ 0.017     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| SmoothSubspace                                                                       | 0.891 $\pm$ 0.020     | **0.976** $\pm$ 0.004 | **0.976** $\pm$ 0.014 | 0.967 $\pm$ 0.007     | **0.976** $\pm$ 0.010 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| SonyAIBORobotSurface1                                                                | 0.829 $\pm$ 0.015     | **0.881** $\pm$ 0.027 | 0.845 $\pm$ 0.021     | 0.840 $\pm$ 0.020     | 0.854 $\pm$ 0.019     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| SonyAIBORobotSurface2                                                                | 0.829 $\pm$ 0.032     | 0.876 $\pm$ 0.032     | 0.901 $\pm$ 0.028     | 0.904 $\pm$ 0.044     | **0.910** $\pm$ 0.024 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| StarLightCurves                                                                      | 0.969 $\pm$ 0.001     | 0.969 $\pm$ 0.000     | 0.974 $\pm$ 0.001     | **0.976** $\pm$ 0.001 | **0.976** $\pm$ 0.001 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Strawberry                                                                           | **0.972** $\pm$ 0.002 | 0.959 $\pm$ 0.003     | 0.958 $\pm$ 0.002     | 0.968 $\pm$ 0.010     | 0.964 $\pm$ 0.004     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| SwedishLeaf                                                                          | 0.919 $\pm$ 0.011     | 0.939 $\pm$ 0.004     | 0.953 $\pm$ 0.001     | **0.960** $\pm$ 0.002 | 0.958 $\pm$ 0.001     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Symbols                                                                              | 0.965 $\pm$ 0.006     | 0.984 $\pm$ 0.002     | **0.987** $\pm$ 0.000 | 0.986 $\pm$ 0.000     | 0.986 $\pm$ 0.001     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| SyntheticControl                                                                     | 0.967 $\pm$ 0.006     | 0.989 $\pm$ 0.004     | 0.999 $\pm$ 0.002     | 0.996 $\pm$ 0.004     | **1.000** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ToeSegmentation1                                                                     | 0.953 $\pm$ 0.022     | **0.968** $\pm$ 0.013 | 0.923 $\pm$ 0.009     | 0.950 $\pm$ 0.015     | 0.952 $\pm$ 0.008     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| ToeSegmentation2                                                                     | 0.897 $\pm$ 0.016     | **0.962** $\pm$ 0.008 | 0.913 $\pm$ 0.016     | 0.913 $\pm$ 0.009     | 0.923 $\pm$ 0.008     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Trace                                                                                | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 | **1.000** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| TwoLeadECG                                                                           | 0.916 $\pm$ 0.020     | 0.997 $\pm$ 0.001     | **1.000** $\pm$ 0.000 | 0.999 $\pm$ 0.001     | **1.000** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| TwoPatterns                                                                          | 0.989 $\pm$ 0.001     | 0.949 $\pm$ 0.003     | **0.998** $\pm$ 0.001 | **0.998** $\pm$ 0.000 | 0.997 $\pm$ 0.001     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| UMD                                                                                  | **0.993** $\pm$ 0.000 | 0.988 $\pm$ 0.008     | **0.993** $\pm$ 0.000 | **0.993** $\pm$ 0.000 | **0.993** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| UWaveGestureLibraryAll                                                               | 0.924 $\pm$ 0.001     | 0.872 $\pm$ 0.004     | 0.937 $\pm$ 0.002     | **0.948** $\pm$ 0.003 | 0.944 $\pm$ 0.001     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| UWaveGestureLibraryX                                                                 | 0.793 $\pm$ 0.003     | 0.778 $\pm$ 0.009     | 0.825 $\pm$ 0.002     | 0.836 $\pm$ 0.005     | **0.838** $\pm$ 0.003 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| UWaveGestureLibraryY                                                                 | 0.708 $\pm$ 0.010     | 0.677 $\pm$ 0.009     | 0.755 $\pm$ 0.002     | **0.765** $\pm$ 0.002 | 0.764 $\pm$ 0.002     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| UWaveGestureLibraryZ                                                                 | 0.729 $\pm$ 0.005     | 0.737 $\pm$ 0.005     | 0.761 $\pm$ 0.006     | 0.773 $\pm$ 0.010     | **0.788** $\pm$ 0.005 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Wafer                                                                                | 0.992 $\pm$ 0.002     | 0.996 $\pm$ 0.000     | **0.999** $\pm$ 0.000 | **0.999** $\pm$ 0.000 | **0.999** $\pm$ 0.000 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Wine                                                                                 | **0.901** $\pm$ 0.028 | 0.833 $\pm$ 0.037     | 0.673 $\pm$ 0.057     | 0.759 $\pm$ 0.037     | 0.759 $\pm$ 0.032     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| WordSynonyms                                                                         | 0.644 $\pm$ 0.017     | 0.623 $\pm$ 0.016     | 0.643 $\pm$ 0.017     | **0.677** $\pm$ 0.020 | 0.675 $\pm$ 0.028     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Worms                                                                                | 0.749 $\pm$ 0.033     | 0.697 $\pm$ 0.037     | 0.753 $\pm$ 0.047     | **0.805** $\pm$ 0.022 | 0.784 $\pm$ 0.067     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| WormsTwoClass                                                                        | 0.775 $\pm$ 0.037     | 0.740 $\pm$ 0.000     | 0.775 $\pm$ 0.033     | 0.784 $\pm$ 0.040     | **0.805** $\pm$ 0.022 |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Yoga                                                                                 | 0.833 $\pm$ 0.008     | 0.771 $\pm$ 0.014     | 0.819 $\pm$ 0.005     | **0.841** $\pm$ 0.006 | 0.838 $\pm$ 0.006     |
+--------------------------------------------------------------------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+

: Classification accuracy for 128 univariate datasets from the UCR benchmark. We report the mean and standard deviation across three random seeds.`\label{tab:detailed-ucr-benchmark}`{=latex}

```{=latex}
\scriptsize
```
```{=latex}
\centering
```
::: {#tab: detailed uea benchmark}
  Dataset                             Moment                  Mantis                   TiViT              TiViT + Moment          TiViT + Mantis
  --------------------------- ----------------------- ----------------------- ----------------------- ----------------------- -----------------------
  ArticularyWordRecognition      0.988 $\pm$ 0.002     **0.991** $\pm$ 0.002     0.977 $\pm$ 0.003       0.977 $\pm$ 0.003       0.974 $\pm$ 0.005
  BasicMotions                 **1.000** $\pm$ 0.000   **1.000** $\pm$ 0.000   **1.000** $\pm$ 0.000   **1.000** $\pm$ 0.000   **1.000** $\pm$ 0.000
  CharacterTrajectories        **0.982** $\pm$ 0.001     0.973 $\pm$ 0.001       0.964 $\pm$ 0.005     **0.982** $\pm$ 0.001     0.978 $\pm$ 0.005
  Cricket                      **1.000** $\pm$ 0.000     0.986 $\pm$ 0.000     **1.000** $\pm$ 0.000   **1.000** $\pm$ 0.000   **1.000** $\pm$ 0.000
  DuckDuckGeese                **0.467** $\pm$ 0.081     0.433 $\pm$ 0.023       0.393 $\pm$ 0.081       0.413 $\pm$ 0.064       0.433 $\pm$ 0.050
  ERing                          0.895 $\pm$ 0.022       0.905 $\pm$ 0.025       0.975 $\pm$ 0.014       0.977 $\pm$ 0.006     **0.981** $\pm$ 0.007
  EigenWorms                     0.746 $\pm$ 0.022       0.746 $\pm$ 0.016     **0.911** $\pm$ 0.016     0.880 $\pm$ 0.009     **0.911** $\pm$ 0.012
  Epilepsy                     **1.000** $\pm$ 0.000     0.990 $\pm$ 0.004     **1.000** $\pm$ 0.000   **1.000** $\pm$ 0.000   **1.000** $\pm$ 0.000
  EthanolConcentration           0.445 $\pm$ 0.013       0.269 $\pm$ 0.044     **0.485** $\pm$ 0.012     0.473 $\pm$ 0.030       0.465 $\pm$ 0.019
  FaceDetection                  0.584 $\pm$ 0.007       0.592 $\pm$ 0.006       0.598 $\pm$ 0.004       0.584 $\pm$ 0.007     **0.607** $\pm$ 0.005
  FingerMovements              **0.633** $\pm$ 0.045     0.593 $\pm$ 0.025       0.517 $\pm$ 0.040       0.620 $\pm$ 0.036       0.553 $\pm$ 0.050
  HandMovementDirection        **0.279** $\pm$ 0.051     0.212 $\pm$ 0.021       0.275 $\pm$ 0.016       0.257 $\pm$ 0.036       0.257 $\pm$ 0.027
  Handwriting                    0.296 $\pm$ 0.018     **0.425** $\pm$ 0.013     0.307 $\pm$ 0.034       0.340 $\pm$ 0.002       0.385 $\pm$ 0.021
  Heartbeat                      0.735 $\pm$ 0.007     **0.800** $\pm$ 0.017     0.732 $\pm$ 0.008       0.717 $\pm$ 0.022       0.769 $\pm$ 0.003
  InsectWingbeat                 0.231 $\pm$ 0.012     **0.573** $\pm$ 0.017     0.355 $\pm$ 0.008       0.332 $\pm$ 0.018       0.443 $\pm$ 0.020
  JapaneseVowels                 0.918 $\pm$ 0.006     **0.978** $\pm$ 0.003     0.940 $\pm$ 0.002       0.938 $\pm$ 0.012       0.933 $\pm$ 0.008
  LSST                           0.571 $\pm$ 0.005       0.607 $\pm$ 0.009       0.604 $\pm$ 0.005       0.610 $\pm$ 0.009     **0.652** $\pm$ 0.003
  Libras                         0.861 $\pm$ 0.017       0.887 $\pm$ 0.026       0.907 $\pm$ 0.006     **0.922** $\pm$ 0.022     0.920 $\pm$ 0.018
  MotorImagery                   0.530 $\pm$ 0.026     **0.563** $\pm$ 0.012   **0.563** $\pm$ 0.049     0.560 $\pm$ 0.044       0.553 $\pm$ 0.042
  NATOPS                         0.900 $\pm$ 0.029     **0.931** $\pm$ 0.014     0.869 $\pm$ 0.006       0.889 $\pm$ 0.006       0.878 $\pm$ 0.006
  PEMS-SF                        0.705 $\pm$ 0.029     **0.788** $\pm$ 0.029     0.709 $\pm$ 0.084       0.763 $\pm$ 0.044       0.742 $\pm$ 0.087
  PhonemeSpectra                 0.186 $\pm$ 0.004       0.272 $\pm$ 0.006       0.245 $\pm$ 0.007       0.265 $\pm$ 0.007     **0.286** $\pm$ 0.008
  RacketSports                   0.829 $\pm$ 0.007     **0.919** $\pm$ 0.004     0.846 $\pm$ 0.010       0.871 $\pm$ 0.008       0.879 $\pm$ 0.027
  SelfRegulationSCP1             0.762 $\pm$ 0.010       0.825 $\pm$ 0.022       0.858 $\pm$ 0.008       0.840 $\pm$ 0.003     **0.891** $\pm$ 0.010
  SelfRegulationSCP2             0.509 $\pm$ 0.031       0.491 $\pm$ 0.018     **0.526** $\pm$ 0.038     0.506 $\pm$ 0.017       0.517 $\pm$ 0.020
  SpokenArabicDigits           **0.981** $\pm$ 0.003     0.907 $\pm$ 0.006       0.969 $\pm$ 0.001       0.979 $\pm$ 0.003       0.972 $\pm$ 0.002
  UWaveGestureLibrary            0.846 $\pm$ 0.010       0.879 $\pm$ 0.015       0.910 $\pm$ 0.005       0.902 $\pm$ 0.004     **0.919** $\pm$ 0.009

  : Classification accuracy for 27 multivariate datasets from the UEA benchmark. We report the mean and standard deviation across three random seeds.
:::

```{=latex}
\normalsize
```
```{=latex}
\centering
```
::: {#tab: mean rank}
  Model                        UCR        UEA
  ------------------------- ---------- ----------
  Moment                       3.66       3.33
  Mantis                       3.44       2.85
  TiViT *(Ours)*               2.97       2.85
  TiViT + Moment *(Ours)*      2.16       2.63
  TiViT + Mantis *(Ours)*    **1.92**   **2.22**

  : Mean rank of TiViT and TSFMs across datasets from the UCR and UEA archives (lower is better).
:::

```{=latex}
\newpage
```
Broader impacts {#app: broader impacts}
===============

Since this paper presents foundational machine learning research, we do not foresee any direct societal risks. The broader impact of our work will depend on its specific applications.

We demonstrate that our method, TiViT, significantly improves classification accuracy. This advancement can be beneficial in healthcare, where the analysis of physiological signals is crucial for early diagnosis and treatment, and in industry, where the accurate monitoring of sensor data enables predictive maintenance and reduces downtime.

However, deep learning models, including TiViT, operate as black boxes with limited interpretability. In safety-critical domains or applications directly impacting humans, such models necessitate careful deployment and oversight. Further research into interpretability and human-in-the-loop frameworks is essential to make deep learning models trustworthy in real-world settings.
