---
abstract: |
  Multivariate time series present challenges to standard machine learning techniques, as they are often unlabeled, high dimensional, noisy, and contain missing data. To address this, we propose T-Rep, a self-supervised method to learn time series representations at a timestep granularity. T-Rep learns vector embeddings of time alongside its feature extractor, to extract temporal features such as trend, periodicity, or distribution shifts from the signal. These time-embeddings are leveraged in pretext tasks, to incorporate smooth and fine-grained temporal dependencies in the representations, as well as reinforce robustness to missing data. We evaluate T-Rep on downstream classification, forecasting, and anomaly detection tasks. It is compared to existing self-supervised algorithms for time series, which it outperforms in all three tasks. We test T-Rep in missing data regimes, where it proves more resilient than its counterparts. Finally, we provide latent space visualisation experiments, highlighting the interpretability of the learned representations.
author:
- |
  Archibald Fraikin\
  Let it Care\
  PariSanté Campus, Paris, France\
  `archibald.fraikin@inria.fr`\
  `\And`{=latex} Adrien Bennetot\
  Let it Care\
  PariSanté Campus, Paris, France\
  `adrien.bennetot@letitcare.com`\
  `\AND`{=latex} Stéphanie Allassonnière\
  Université Paris Cité, INRIA, Inserm, SU\
  Centre de Recherche des Cordeliers, Paris\
  `stephanie.allassonniere@inria.fr`
bibliography:
- refs.bib
title: 'T-Rep: Representation Learning for Time Series using Time-Embeddings'
---

```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\blue}[1]{\textcolor{black}{#1}}
```
```{=latex}
\newcommand{\fix}{\marginpar{FIX}}
```
```{=latex}
\newcommand{\new}{\marginpar{NEW}}
```
```{=latex}
\maketitle
```
```{=latex}
\newcommand{\mbf}{\mathbf}
```
```{=latex}
\newcommand{\N}{\mathcal{N}}
```
```{=latex}
\newcommand{\T}{\mathcal{T}}
```
```{=latex}
\newcommand{\X}{\mathbf{X}}
```
```{=latex}
\newcommand{\x}{\mathbf{x}}
```
```{=latex}
\newcommand{\tb}{\mathbf{t}}
```
```{=latex}
\newcommand{\ub}{\mathbf{u}}
```
```{=latex}
\newcommand{\z}{\mathbf{z}}
```
```{=latex}
\newcommand{\xit}{\mathbf{x}_{i, t}}
```
```{=latex}
\newcommand{\uit}{\mathbf{u}_{i, t}}
```
```{=latex}
\newcommand{\zit}{\mathbf{z}_{i, t}}
```
Introduction
============

Multivariate time series have become ubiquitous in domains such as medicine, climate science, and finance. Unfortunately, they are high-dimensional, complex objects with little labeled data [@ts_issues], as labeling is an expensive and time-consuming process. Leveraging unlabeled data to build unsupervised representations of multivariate time series has thus become a challenge of great interest, as these embeddings can significantly improve performance in tasks like forecasting, classification, or anomaly detection [@cpc_cpd; @omnianomaly]. This has motivated the development of self-supervised learning (SSL) models for time series, first focused on constructing instance-level representations for classification and clustering [@tnc; @t_loss; @random_features]. More fine-grained representations were then developed to model time series at the timestep level [@ts2vec], which is key in domains such as healthcare or sensor systems. With fine-grained embeddings, one can capture the subtle changes, periodic patterns, and irregularities that are essential for anomaly detection [@ad_med_ts], as well as for understanding and forecasting disease progression. These representations can also be more resilient than raw data in the face of inter-sample variability or missing data [@ts2vec], common issues in Human Activity Recognition (HAR) and medicine.

A central issue when learning representations of time series is the incorporation of time in the latent space, especially for timestep-level embeddings. In SSL, the temporal structure is learned through the pretext tasks. In current state-of-the-art (SOTA) models, these tasks are contrastive [@tnc; @ts2vec; @rp_ts], which poses important limitations [@ssl_ts_review]. In contrastive techniques, the learning signal is binary: positive pairs should be similar, while negative pairs should be very different [@simclr]. This makes obtaining a continuous or fine-grained notion of time in the embeddings infeasible, as these tasks only express *whether* two points should be similar, not *how* close or similar they should be. Embedded trajectories are thus unlikely to accurately reflect the data's temporal structure.

Further, temporal contrastive tasks are ill-suited to finite-state systems, where the signal transitions between $S$ states through time, either regularly (periodic signals) or irregularly. Such tasks define positive pairs by proximity in time and negative pairs by distance in time [@rp_ts; @t_loss], which can incur *sampling bias*: the points of a negative pair might be far apart in time but almost exactly a period apart (i.e. very similar), while the points of a positive pair might be close in time but very different (consider a pulsatile signal, for example). This incoherent supervision hinders learning and may result in a poor embedding structure. Finite-state systems are extremely common in real-world scenarios such as sensor systems, medical monitoring, or weather systems, making the treatment of these cycles crucial.

To address the above issues, we propose T-Rep, a self-supervised method for learning fine-grained representations of univariate and multivariate time series. T-Rep improves the treatment of time in SSL through time-embeddings, which are integrated into the feature-extracting encoder and leveraged in the pretext tasks, helping the model learn detailed time-related features. We define a time-embedding as a vector embedding of time, obtained as the output of a learned function $h_\psi$, which encodes temporal signal features such as trend, periodicity, or distribution shifts. Time-embeddings thus enhance our model's resilience to missing data, and improve its performance on finite-state systems and non-stationary signals. We evaluate T-Rep on a wide variety of datasets in classification, forecasting and anomaly detection (see section `\ref{experiments}`{=latex}), `\blue{notably on Sepsis \citep{Sepsis1, Sepsis2}, a real-world dataset containing multivariate time series from $40,336$ patients in intensive care units (ICU), featuring noisy and missing data}`{=latex}. Our major contributions are summarised as follows:

-   To the best of our knowledge, we propose the first self-supervised framework for time series to leverage time-embeddings in its pretext tasks. This helps the model learn fine-grained temporal dependencies, giving the latent space a more coherent temporal structure than existing methods. The use of time-embeddings also encourages resilience to missing data, and produces more information-dense and interpretable embeddings.

-   We compare T-Rep to SOTA self-supervised models for time series in classification, forecasting and anomaly detection. It consistently outperforms all baselines whilst using a lower-dimensional latent space, and also shows stronger resilience to missing data than existing methods. Further, our latent space visualisation experiments show that the learned embeddings are highly interpretable.

Related Work
============

#### Representation Learning for time series

The first techniques in the field were encoder-decoder models, trained to reconstruct the original time series. These include Autowarp [@autowarp], TimeNet [@timenet] and LSTM-SAE [@lstm_sae], which all feature an RNN-based architecture. Models inspired by Variational Auto-Encoders [@vae] have also been used, notably Interfusion [@interfusion], SOM-VAE [@som_vae], and OmniAnomaly [@omnianomaly], which combines a VAE with normalising flows. Encoder-only methods have been more popular recently, often based on contrastive approaches [@ssl_ts_review]. The Contrastive Predictive Coding (CPC) framework [@cpc] maximises the mutual information between future latent states and a context vector, using the InfoNCE loss. This approach has been adapted for anomaly detection [@cpc_cpd] and general representations [@ts_tcc]. TS-TCC [@ts_tcc] augments CPC by applying weak and strong transformations to the raw signal. Augmentation-based contrastive methods from computer vision [@simclr] have also been adapted to time series by changing the augmentations [@clocs]. Domain-specific transformations were proposed for wearable sensors [@apple_biosignals] and ECGs [@clocs], such as noise injection, cropping, warping, and jittering [@trans_matching]. The issue with these methods is that they make transformation-invariance assumptions which may not be satisfied by the signal [@ssl_ts_review; @ts2vec]. TS2Vec [@ts2vec] addresses this with contextual consistency.\
**Time-Embeddings in time series representations** have only been used in transformer-based architectures, which require a *positional encoding* module [@transformer]. While some use the original fixed sinusoidal positional encoding [@ssl_tr1; @ssl_tr2], @tst and @multivar_tst chose to learn a time-embedding using a linear layer and a fully-connected layer respectively. Using or learning more sophisticated time-embeddings is a largely unexplored avenue that seems promising for dealing with long-term trends and seasonality (periodicity) in sequential data [@ssl_ts_review; @tsts_survey]. The most elaborate time-embedding for time series is Time2Vec [@time2vec], which was developed for supervised learning tasks and has not yet been exploited in a self-supervised setting. In existing self-supervised models, the time-embedding is used by the encoder to provide positional information [@tst; @multivar_tst; @ssl_tr1; @ssl_tr2], but is never exploited in pretext tasks. The best performing models use contrastive techniques to learn temporal dependencies, which only provide a binary signal [@ssl_ts_review]. T-Loss [@t_loss] follows the assumption that neighboring windows should be similar, and builds a triplet loss around this idea. TNC [@tnc] extends this framework, dividing the signal into stationary windows to construct its positive and negative pairs. In @rp_ts [@ts2vec; @t_loss], positive and negative pairs are delimited by a threshold on the number of differing timesteps, which makes these methods unsuited to capturing periodic or irregularly recurring patterns in the data. Further, all these contrastive methods are quite coarse, making it hard to learn fine-grained temporal dependencies.\
To summarise, existing methods have made tremendous progress in extracting spatial features from time series, but temporal feature learning is still limited. In particular, they are not suited to handling recurring patterns (periodic or irregular), and struggle to learn fine-grained temporal dependencies, because of the binary signal and sampling bias of contrastive tasks.

Background
==========

The aim of this work is to improve the treatment of time in representation learning for temporal data. These innovations are combined with state-of-the-art methods for spatial feature-learning and model training, the *contextual consistency* and *hierarchical loss* frameworks [@ts2vec].

Problem Definition
------------------

Given a dataset $X = \{\x_1, ..., \x_N\} \in \R^{N \times T \times C}$ of $N$ time series of length $T$ with $C$ channels, the objective of self-supervised learning is to learn a function $f_\theta$ such that $\forall i \in \{1, \dots, N\}, \; \z_i = f_\theta (\x_i)$. Each $\z_i \in \R^{T \times F}$ is a representation of $\x_i$ of length $T$ with $F$ channels, which should preserve as many features of the original data as possible. $f_\theta (\cdot)$ is learned by designing artificial supervised signals, called *pretext tasks*, from the unlabeled data $X$.
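To fix notation, the shapes involved can be sketched in a few lines of numpy; the encoder below is a hypothetical per-timestep linear stand-in for $f_\theta$, purely to illustrate the $(N, T, C) \to (N, T, F)$ mapping, not T-Rep's architecture:

```python
import numpy as np

# Hypothetical sizes: N series, T timesteps, C input channels, F latent channels.
N, T, C, F = 8, 100, 3, 16

X = np.random.randn(N, T, C)  # unlabeled dataset

def f_theta(x):
    """Stand-in for the learned encoder: a fixed random linear map applied
    per timestep, only to illustrate the input/output shapes."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((C, F))
    return x @ W  # (T, C) -> (T, F)

Z = np.stack([f_theta(x_i) for x_i in X])
print(Z.shape)  # (8, 100, 16): one F-dimensional vector per timestep
```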

Contextual Consistency {#contextual_consistency}
----------------------

The objective of *contextual consistency* is to learn context-invariant representations of time series. The idea is to sample two overlapping segments $\x_{1}$ and $\x_{2}$ of the time series $\x$, to which random timestamp masking is applied, thus creating two different *contexts* (random choice of window and masked timesteps) for the overlapping window. Representations in the overlapping timesteps are then encouraged to be similar, leading to context-invariant representations. Two contrastive tasks were introduced by @ts2vec alongside the contextual consistency framework to extract spatial and temporal features.\
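The sampling scheme above can be sketched as follows; segment boundaries and the masking ratio (15%) are illustrative choices, and masking is applied to raw values here for readability, whereas in practice it is applied after the linear projection:

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 100, 3
x = rng.standard_normal((T, C))  # one time series

# Sample two overlapping segments [a1, b1) and [a2, b2), with a1 <= a2 < b1 <= b2.
a1, a2 = sorted(rng.integers(0, T // 2, size=2))
b1 = rng.integers(a2 + 1, T)      # guarantees a non-empty overlap [a2, b1)
b2 = rng.integers(b1, T + 1)
x1, x2 = x[a1:b1].copy(), x[a2:b2].copy()

# Random timestamp masking: zero out timesteps independently in each view,
# creating two different contexts for the shared window.
for view in (x1, x2):
    mask = rng.random(len(view)) < 0.15
    view[mask] = 0.0

# Representations of the overlap [a2, b1) under the two contexts are the
# ones encouraged to agree by the contrastive tasks.
overlap = np.arange(a2, b1)
```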
**Instance-wise contrasting** encourages representations of the same time series under different contexts to be encoded similarly, and representations of different instances to be dissimilar [@ts2vec]. Let $B$ be the batch size, $i$ the time series index and $t$ a timestep. $\zit$ and $\z'_{i, t}$ denote the corresponding representation vectors under two different contexts. The loss function is given by: $$\label{eq:ts2vec_task1}
    \mathcal{L}^{(i, t)}_{inst} = - \log \frac{ \exp \left( \z_{i, t} \cdot \z'_{i, t} \right)}{\sum_{j = 1}^B \left( \exp \left( \z_{i, t} \cdot \z'_{j, t} \right) + \mathbbm{1}_{i \neq j} \exp \left( \z_{i, t} \cdot \z_{j, t} \right) \right)}\,.$$
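A direct numpy transcription of this loss (an illustrative sketch, not the authors' implementation) makes the denominator's structure explicit — all other instances at timestep $t$, under both contexts, act as negatives:

```python
import numpy as np

def instance_loss(z, zp, i, t):
    """Instance-wise contrastive loss for instance i at timestep t.
    z, zp: (B, T, F) batch representations under two contexts."""
    B = z.shape[0]
    num = np.exp(z[i, t] @ zp[i, t])          # positive pair: same (i, t)
    den = 0.0
    for j in range(B):
        den += np.exp(z[i, t] @ zp[j, t])     # other instances, other context
        if j != i:
            den += np.exp(z[i, t] @ z[j, t])  # other instances, same context
    return -np.log(num / den)

rng = np.random.default_rng(0)
z, zp = rng.standard_normal((2, 4, 10, 8))    # B=4, T=10, F=8
loss = instance_loss(z, zp, i=0, t=3)
```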

**Temporal contrasting** encourages representations of time series under different contexts to be encoded similarly when their respective timesteps match, and far apart when the timesteps differ [@ts2vec]. The loss function is given in Eq. `\ref{eq:ts2vec_task2}`{=latex}, where $\Omega$ is the set of timesteps in the overlap between the two subseries: $$\label{eq:ts2vec_task2}
    \mathcal{L}^{(i, t)}_{temp} = - \log \frac{ \exp \left( \z_{i, t} \cdot \z'_{i, t} \right)}{\sum_{t' \in \Omega} \left( \exp (\z_{i, t} \cdot \z'_{i, t'}) + \mathbbm{1}_{t \neq t'} \exp \left( \z_{i, t} \cdot \z_{i, t'} \right)  \right)}\,.$$
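This loss mirrors the instance-wise one, but negatives are now drawn along the time axis within a single instance $i$; a sketch under the same illustrative conventions:

```python
import numpy as np

def temporal_loss(z, zp, i, t, omega):
    """Temporal contrastive loss for instance i at timestep t.
    omega: iterable of overlapping timesteps between the two views."""
    num = np.exp(z[i, t] @ zp[i, t])           # positive pair: matching timestep
    den = 0.0
    for tp in omega:
        den += np.exp(z[i, t] @ zp[i, tp])     # other timesteps, other context
        if tp != t:
            den += np.exp(z[i, t] @ z[i, tp])  # other timesteps, same context
    return -np.log(num / den)

rng = np.random.default_rng(1)
z, zp = rng.standard_normal((2, 4, 10, 8))     # B=4, T=10, F=8
loss = temporal_loss(z, zp, i=0, t=3, omega=range(10))
```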

Hierarchical Loss {#hierarchical_loss}
-----------------

The hierarchical loss framework applies and sums the model's loss function at different scales, starting from a per-timestep representation and applying `maxpool` operations to reduce the time-dimension between scales [@ts2vec]. This gives users control over the granularity of the representation used for downstream tasks, without sacrificing performance. It also makes the model more robust to missing data, as it makes use of long-range information in the surrounding representations to reconstruct missing timesteps [@ts2vec].
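The scheme can be sketched as follows, assuming maxpooling with kernel size 2 along time and a generic per-scale loss (a simple MSE stands in for the contrastive losses here; any odd trailing timestep is dropped for simplicity):

```python
import numpy as np

def hierarchical_loss(z, zp, per_scale_loss):
    """Average a per-timestep loss over successively max-pooled time scales."""
    total, scales = 0.0, 0
    while z.shape[1] > 1:
        total += per_scale_loss(z, zp)
        scales += 1
        # Maxpool over time with kernel/stride 2 (drop a trailing odd timestep).
        T2 = (z.shape[1] // 2) * 2
        z = z[:, :T2].reshape(z.shape[0], -1, 2, z.shape[2]).max(axis=2)
        zp = zp[:, :T2].reshape(zp.shape[0], -1, 2, zp.shape[2]).max(axis=2)
    total += per_scale_loss(z, zp)  # coarsest scale: one vector per series
    return total / (scales + 1)

rng = np.random.default_rng(0)
z, zp = rng.standard_normal((2, 4, 16, 8))  # B=4, T=16, F=8
mse = lambda a, b: float(((a - b) ** 2).mean())
loss = hierarchical_loss(z, zp, mse)
```

Because the same loss is applied at every scale, representations pooled to any granularity remain meaningful for downstream tasks.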

Method
======

Encoder Architecture
--------------------

We present below our convolutional encoder, which consists of three modules. The overall model structure is illustrated in Figure `\ref{fig:overall_diag}`{=latex}.

```{=latex}
\centering
```
![`\blue{T-Rep architecture and workflow. The left part shows how the different modules interact (linear projection, time-embedding module, and encoder). The middle part shows the 4 pretext tasks used to train the model, and the kind of features they extract. The right hand side shows the loss computation: a linear combination of individual pretext task losses is passed to the \textit{hierarchical loss} algorithm (see Appendix \ref{pretext_task_weights}).}`{=latex}](images/model_overview.png){#fig:overall_diag width=".8\\linewidth"}

**Linear Projection Layer** The first layer projects individual points $\xit \in \R^C$ to vectors $\uit \in \R^{F}$ with a fixed number of channels $F$. Random timestamp masking is applied to each $\ub_i$ independently after the linear projection (only during training), as part of the *contextual consistency* framework [@ts2vec].

**Time-Embedding Module** The time-embedding module $h_\psi$ is responsible for learning time-related features $\bm{\tau}_{t}$ (trend, periodicity, distribution shifts etc.) directly from the time series sample indices $t$. Unlike a transformer's fixed positional encoding, the time-embedding function is learned jointly with the rest of the encoder. This makes the time-embeddings flexible: they adapt to the data at hand. The choice of architecture for the time-embedding module can impact performance in downstream tasks, and is discussed in Appendix `\ref{te_choice}`{=latex}. For general applications, we recommend using Time2Vec [@time2vec], which captures trend and periodicity. To the best of our knowledge, T-Rep is the first model to combine a time-embedding module and a convolutional encoder in self-supervised learning for time series.

The time-embedding module must return vectors which define a probability distribution (positive components that sum to 1). This is due to the use of statistical divergence measures in a pretext task, which is detailed in section `\ref{task1}`{=latex}. We find experimentally that the optimal way to satisfy this constraint is by applying a sigmoid activation to the final layer of the module, and then dividing each element by the vector sum: $$(\bm{\tau}_{t})_k = \frac{\sigma(h_\psi(t))_k}{\sum_{j=1}^K \sigma(h_\psi(t))_j}\,,$$ where $\bm{\tau}_{t}$ contains $K$ elements, $\sigma(\cdot)$ is the sigmoid function and $h_\psi$ is the time-embedding module parameterised by $\psi$. Time-embeddings $\bm{\tau}_{t}$ are concatenated with vectors $\uit$ after the linear projection, and the vectors $[\uit \; \bm{\tau}_{t}]^T$ are fed to the encoder $f_\theta$.
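The normalisation above can be sketched as a stand-alone helper, with the raw module output $h_\psi(t)$ passed in as a vector:

```python
import numpy as np

def normalise_time_embedding(h):
    """Sigmoid then sum-normalise, so tau_t has positive components
    summing to 1 (a valid probability distribution)."""
    s = 1.0 / (1.0 + np.exp(-np.asarray(h, dtype=float)))
    return s / s.sum()
```

The sigmoid guarantees strict positivity (so divergence measures on $\bm{\tau}_t$ are well defined), and the division enforces the sum-to-one constraint.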

#### Temporal Convolution Network (TCN) Encoder

The main body of the encoder, $f_\theta(\cdot)$, is structured as a sequence of residual blocks, each containing two layers of 1D dilated convolutions interleaved with GeLU activations. The convolution dilation parameter increases with the network depth, to first focus on local features and then longer-term dependencies: $d=2^i$, where $i$ is the block index. Dilated convolutions have proven to be very effective in both supervised and unsupervised learning for time series [@informer; @tnc; @cnn_vs_rnn].
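Under these choices (two dilated convolutions per residual block, dilation $d = 2^i$ at block $i$), the receptive field grows exponentially with depth. A small helper makes this concrete; kernel size 3 matches the configuration reported in Appendix `\ref{exp_details}`{=latex}:

```python
def tcn_receptive_field(num_blocks, kernel_size=3, convs_per_block=2):
    """Receptive field (in timesteps) of stacked residual blocks whose
    dilated convolutions use d = 2**i at block i."""
    rf = 1
    for i in range(num_blocks):
        rf += convs_per_block * (kernel_size - 1) * 2 ** i
    return rf
```

With the 10 residual blocks and kernel size 3 used in the experiments, the receptive field spans 4093 timesteps.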

Pretext Tasks
-------------

We present below two novel SSL pretext tasks which leverage time-embeddings, and are designed to complement each other. The first, \`Time-embedding Divergence Prediction', describes *how* the information gained through time-embeddings should structure the latent space and be included in the time series representations. On the other hand, the \`Time-embedding-conditioned Forecasting' task focuses on *what* information the time-embeddings and representations should contain.

### Time-embedding Divergence Prediction {#task1}

The first pretext task developed aims to integrate the notion of time in the latent space structure. It consists in predicting a divergence measure between two time-embeddings $\bm{\tau}$ and $\bm{\tau}'$, given the representations at the corresponding time steps. The purpose of this task is for distances in the latent space to correlate with temporal distances, resulting in smoother latent trajectories than with contrastive learning.

Let us define this regression task formally. Take a batch $X \in \R^{B \times T \times C}$, from which we sample $\x_{i, t}$ and $\x_{j, t'}$ $\forall i, j \in [0, B]$ and $t, t' \in [0, T]$ s.t. $t \neq t'$. The task input is the difference $\z_{i, t} - \z'_{j, t'}$, where $\z_{i, t}$ is the representation of $\x_{i, t}$ under the context $c$ and $\z'_{j, t'}$ is the representation of $\x_{j, t'}$ under the context $c'$. Taking representations under different contexts further encourages context-invariant representations, as detailed in section `\ref{contextual_consistency}`{=latex}. The regression target is $y = \mathcal{D}(\bm{\tau}, \bm{\tau}')$. $\bm{\tau}$ and $\bm{\tau}'$ are the respective time-embeddings of $t$ and $t'$, and $\mathcal{D}$ is a measure of statistical divergence, used to measure the discrepancy between the time-embedding distributions. We use the Jensen-Shannon divergence (JSD), a smoothed and symmetric version of the KL divergence [@js_div]. The task loss is: $$\mathcal{L}_{div} = \frac{1}{M} \sum\nolimits^M_{(i, j, t, t') \in \Omega} \left( \mathcal{G}_1\left( \zit \: - \: \bm{z}'_{j, t'} \right) - JSD(\bm{\tau}_t \, || \, \bm{\tau}_{t'}) \right) ^ 2\,,$$

where $\Omega$ is the set (of size $M$) of time/instance indices for the randomly sampled pairs of representations. Using divergences allows us to capture *how* two distributions differ, whereas a simple norm could only capture by *how much* two vectors differ. This nuance is important: suppose the time-embedding is a 3-dimensional vector that learned a hierarchical representation of time (equivalent to seconds, minutes, hours). A difference of 1.0 on all time scales (`01:01:01`) represents a very different situation to a difference of 3.0 hours and no difference in minutes and seconds (`03:00:00`), but could not be captured by a simple vector norm.\
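A sketch of the regression target $y = JSD(\bm{\tau}_t \, || \, \bm{\tau}_{t'})$, using the natural-log Jensen-Shannon divergence (so values lie in $[0, \ln 2]$):

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence: a smoothed, symmetric KL divergence
    between two probability vectors (natural log, 0 <= JSD <= ln 2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                             # mixture distribution
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike the KL divergence, the JSD is symmetric and bounded, which makes it a well-behaved regression target for the task head $\mathcal{G}_1$.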

### Time-embedding-conditioned Forecasting

Our second pretext task aims to incorporate predictive information in the time-embedding vectors, as well as context-awareness in the representations, to encourage robustness to missing data. The task takes in the representation of a time series at a specific timestep, and tries to predict the representation vector of a nearby point, conditioned on the target's time-embedding.

The input used is the concatenation $\left[ \zit \: \: \bm{\tau}_{t + \Delta} \right]^T$ of the representation $\zit \in \R^F$ at time $t$ and the time-embedding of the target $\bm{\tau}_{t + \Delta} \in \R^K$. $\Delta_{max}$ is a hyperparameter to fix the range in which the prediction target can be sampled. The target is the encoded representation $\z_{i, t + \Delta}$ at a uniformly sampled timestep $t + \Delta, \; \Delta \sim \mathcal{U}[- \Delta_{max}, \Delta_{max}]$. The input is forwarded through the task head $\mathcal{G}_2: \R^{F + K} \mapsto \R^{F}$, a 2-layer MLP with ReLU activations. The loss is a simple MSE given by: $$\mathcal{L}_{pred} = \frac{1}{M T} \sum\nolimits^M_{j \in \Omega_M} \sum\nolimits^T_{t \in \Omega_{T}} \left( \mathcal{G}_2\left( \left[ \z_{j, t}^{(c_1)} \: \: \bm{\tau}_{t + \Delta_j} \right]^T \right) - \z_{j, t + \Delta_j}^{(c_2)} \right) ^ 2\,,$$

where $\Delta_j \sim \mathcal{U}[- \Delta_{max}, \Delta_{max}]$, $\Omega_M$ and $\Omega_T$ are the sets of randomly sampled instances and timesteps for each batch, whose respective cardinalities are controlled by hyperparameters $M$ and $T$. The contexts $c_1$ and $c_2$ are chosen randomly from $\{c, c'\}$, so they may be identical or different, further encouraging contextual consistency. Conditioning this prediction task on the time-embedding of the target forces the model to extract as much information about the signal as possible from its position in time. This results in more information-dense time-embeddings, which can be leveraged when working with individual trajectories for forecasting and anomaly detection.

In practice, we choose a short prediction range $\Delta_{max} \leq 20$, as the focus is not to build representations tailored to forecasting but rather \`context-aware' representations. This context-awareness is enforced by making predictions backwards as well as forwards, encouraging representations to contain information about their surroundings, making them robust to missing timesteps. Longer prediction horizons would push representations to contain more predictive features than spatial features, biasing the model away from use-cases around classification, clustering and other \`comparative' or instance-level downstream tasks.

A key objective of this pretext task is to build resilience to missing data. This is done by (1) learning information-dense time-embeddings, which are available even when data is missing, and (2) by learning context-aware representations, which can predict missing timesteps in their close vicinity.
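The sampling and loss computation of this task can be sketched as follows. This is a simplified stand-in: `head` plays the role of $\mathcal{G}_2$, out-of-range targets are clipped rather than re-sampled, and a single instance is used.

```python
import numpy as np

rng = np.random.default_rng(0)

def te_forecasting_loss(z, tau, head, delta_max=20, n_samples=8):
    """Sketch of the TE-conditioned forecasting loss for one instance:
    predict z[t + delta] from [z[t] ; tau[t + delta]] with `head`
    (a stand-in for the MLP G_2), scored with an MSE."""
    T = z.shape[0]
    losses = []
    for _ in range(n_samples):
        t = int(rng.integers(0, T))
        delta = int(rng.integers(-delta_max, delta_max + 1))
        tgt = min(max(t + delta, 0), T - 1)      # clip instead of re-sampling
        inp = np.concatenate([z[t], tau[tgt]])   # [z_t ; tau_{t+delta}]
        losses.append(np.mean((head(inp) - z[tgt]) ** 2))
    return float(np.mean(losses))
```

Note that `delta` may be negative, so the head is asked to "predict" backwards in time as well, which is what enforces context-awareness.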

Experiments
===========

This section presents the experiments conducted to evaluate T-Rep's learned representations. Because of the variety of downstream tasks, we perform no hyperparameter tuning, and use the same hyperparameters across tasks. Further, the same architectures and hyperparameters are used across all evaluated models where possible, to ensure a fair comparison. Experimental details and guidelines for reproduction are included in Appendix `\ref{exp_details}`{=latex} and `\ref{rep_details}`{=latex}. The code written to produce these experiments has been made publicly available[^1].

Time Series Anomaly Detection
-----------------------------

```{=latex}
\begin{wraptable}{r}{7cm}
  \centering
  \scalebox{0.95}{
  \color{myblue}
  \begin{tabular}{lcccccccc}
  \toprule
    & Yahoo (F$1$) & Sepsis (F$1$) \\
    % & F$_1$ & F$_1$ \\
    \midrule
    Baseline & 0.110 & 0.241 \\
    TS2Vec & 0.733 & 0.619 \\
    \textbf{T-Rep} (Ours) & \textbf{0.757} & \textbf{0.666} \\
    \bottomrule
  \end{tabular}
 }
  \caption{\blue{Time series anomaly detection F1 scores, on the Yahoo dataset for point-based anomalies and Sepsis datasets for segment-based anomalies. Anomalies include outliers as well as changepoints. TS2Vec results are reproduced using official source code \citep{ts2vec_repo}.}}
  \label{table:anomaly}
\end{wraptable}
```
`\blue{We perform two experiments, point-based anomaly detection on the Yahoo dataset \citep{yahoo}, and segment-based anomaly detection on the 2019 PhysioNet Challenge's Sepsis dataset \citep{Sepsis1, Sepsis2}. We chose to include both types of tasks as segment-based anomaly detection tasks help avoid the bias associated with point-adjusted anomaly detection \citep{kim}. Yahoo contains 367 synthetic and real univariate time series, featuring outlier and change point anomalies. Sepsis is a real-world dataset containing multivariate time series from $40,336$ patients in intensive care units, featuring noisy and missing data. The task consists in detecting sepsis, a medical anomaly present in just 2.2\% of patients. On both datasets, we compare T-Rep to the self-supervised model TS2Vec \citep{ts2vec}, as well as a baseline following the same anomaly detection protocol as TS2Vec and T-Rep on each dataset, but using the raw data. We follow a streaming anomaly detection procedure \citep{ren2019time} on both datasets. Details on the procedures can be found in Appendix \ref{ad_method}, which also details the pre-processing applied to Sepsis.}`{=latex}\
`\blue{Table \ref{table:anomaly} shows the evaluation results (more detailed results are presented in Appendix \ref{ad_results}). T-Rep achieves the strongest performance on both datasets, with an F1 score of 75.7\% on Yahoo and 66.6\% on Sepsis. This respectively represents a 2.4\% and 4.8\% increase on the previous SOTA TS2Vec \citep{ts2vec}. T-Rep's performance can be attributed to its detailed understanding of temporal features, which help it better detect out-of-distribution or anomalous behaviour. It achieves this performance with a latent space dimension of 200 on Yahoo, which is smaller than the 320 dimensions used by TS2Vec \citep{ts2vec}, further showing that T-Rep learns more information-dense representations than its predecessors.}`{=latex}

Time Series Classification {#classif_exp}
--------------------------

```{=latex}
\begin{wraptable}{r}{7cm}
  \centering
  \scalebox{0.9}{
  \begin{tabular}{lcccccc}

  \toprule
     & \multicolumn{2}{c}{30 UEA datasets} \\
     \cmidrule(r){2-3}
    Method & Avg. Acc. & Avg. Difference (\%) \\
    \midrule
    T-Loss & 0.657 & 33.6 \\
    TS2Vec & 0.699 & 2.1 \\
    TNC & 0.671 & 12.3 \\
    TS-TCC & 0.667 & 10.0 \\
    DTW &  0.650 & 43.6 \\
    Minirocket & \textbf{0.707} & 4.8 \\
    \textbf{T-Rep} (Ours) & 0.706 & -- \\
    \bottomrule
  \end{tabular}
  }
  \caption{\blue{Multivariate time series classification results on the UEA archive. DTW, TNC, and TS-TCC results are taken directly from \citet{ts2vec}, while TS2Vec and Minirocket results are reproduced using the official code. `Avg. Difference' measures the relative difference in accuracy brought by T-Rep compared to a given model.}}
  % The latent space dimensions of T-Rep and TS2Vec are 128 while T-Loss, TS-TCC, TST and TNC are all set to 320
  \label{table:uea_summary_res}
\end{wraptable}
```
```{=latex}
\blue{The classification procedure is similar to that introduced by \citet{t_loss}: a representation $z$ is produced by the self-supervised model, and an SVM classifier with RBF kernel is then trained to classify the representations (see Appendix \ref{exp_details} and \ref{rep_details} for procedure and reproduction details). We compare T-Rep to SOTA self-supervised models for time series: TS2Vec \citep{ts2vec}, T-Loss \citep{t_loss}, TS-TCC \citep{ts_tcc}, TNC \citep{tnc}, Minirocket \cite{mini_rocket} and a DTW-based classifier \citep{dtw}. These models are evaluated on the UEA classification archive's 30 multivariate time series, coming from diverse domains such as medicine, sensor systems, speech, and activity recognition \cite{ucr_archive}.}
```
```{=latex}
\blue{Evaluation results are summarised in Table \ref{table:uea_summary_res}, and full results are shown in Appendix \ref{uea_full_results}, along with more details on the chosen evaluation metrics. Table \ref{table:uea_summary_res} shows that T-Rep's accuracy is on average $+2.1\%$ higher than TS2Vec's and $+4.8\%$ higher than Minirocket's. In terms of average accuracy, T-Rep outperforms all competitors except Minirocket, which has a 0.001 lead. TS2Vec's performance is very close to T-Rep's, only 0.007 lower. T-Rep has a positive 'Avg. difference' to Minirocket despite having a lower 'Avg. Acc.' because Minirocket often performs slightly better than T-Rep, but when T-Rep is more accurate, it is so by a larger margin. It is important to note that Minirocket was developed specifically for time series classification (and is the SOTA as of this date), while T-Rep is a general representation learning model, highlighting its strengths across applications. The latent space dimensionality is set to 320 for all baselines, except Minirocket which uses 9996 dimensions (as per the official code). T-Rep uses only 128 dimensions, but leverages the temporal dimension of its representations (see Appendix \ref{classif_procedure} for details).}
```
Time Series Forecasting
-----------------------

We perform a multivariate forecasting task on the four public ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2), which contain electricity and power load data, as well as oil temperature [@informer]. Forecasting is performed over multiple horizons, using a procedure described in Appendix `\ref{forecast_procedure}`{=latex}. We evaluate T-Rep against TS2Vec [@ts2vec], a SOTA self-supervised model for time series, but also Informer [@informer], the SOTA in time series forecasting, as well as TCN [@cnn_vs_rnn], a supervised model with the same backbone architecture as T-Rep, and a linear model trained on raw data. Aggregate results, averaged over all datasets and prediction horizons, are presented in Table `\ref{table:forecast_short}`{=latex} (full results are in Appendix `\ref{ett_full_results}`{=latex}).

`\blue{Table \ref{table:forecast_short} shows that T-Rep achieves the best average scores, in terms of both MSE and MAE, with a 24.2\% decrease in MSE on the supervised SOTA Informer \citep{informer}, and a slight improvement of 1.80\% on the self-supervised SOTA TS2Vec \citep{ts2vec}. It also achieves a better average rank, ranking first more than any other model. Furthermore, the linear baseline is the second model that most frequently ranks first, beating TS2Vec in this metric. However, this model does not perform well in all datasets and therefore ranks 3rd in terms of MAE and MSE.}`{=latex} Interestingly, most existing self-supervised methods for time series use high-dimensional latent spaces ($F=320$ dimensions per timestep) [@t_loss; @tnc], which is thus used to produce baseline results in Table `\ref{table:forecast_short}`{=latex}. This can be an issue for downstream applications, which might face the *curse of dimensionality* [@curse_of_dim]. T-Rep, however, outperforms all baselines with a latent space that is almost 3 times smaller, using only $F=128$ dimensions. T-Rep's superior performance can be attributed to its comprehensive treatment of time, capturing trend, seasonality, or distribution shifts more easily with its time-embedding module.

```{=latex}
\centering
```
```{=latex}
\scalebox{0.9}{
  \begin{threeparttable}
    \color{myblue}
  \scalebox{0.95}{
      \begin{tabular}{lccccccccccc}
      \toprule
             &  & \multicolumn{2}{c}{\textbf{T-Rep} (Ours)} &
             \multicolumn{2}{c}{TS2Vec} &
             \multicolumn{2}{c}{Informer} &
             \multicolumn{2}{c}{TCN} &
             \multicolumn{2}{c}{Linear Baseline}\\
        \cmidrule(r){3-4} \cmidrule(r){5-6} \cmidrule(r){7-8} \cmidrule(r){9-10} \cmidrule(r){11-12}
         & & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE\\
        \midrule
        
        \multicolumn{2}{l}{Avg. Rank} & \textbf{1.90} & \textbf{1.85} & 2.40 & 2.45 & 3.3 & 3.55 & 4.35 & 4.1 & 3.05 & 3.05\\
        \multicolumn{2}{l}{Ranks $1^{st}$} & \textbf{8} & \textbf{8} & 2 & 1 & 3 & 3 & 1 & 1 & 6 & 7\\
        \multicolumn{2}{l}{Avg.} & \textbf{0.986} & \textbf{0.702} & 1.004 & 0.712 & 1.300 & 0.820 & 2.012 & 1.205 & 1.017 & 0.727 \\
        
        \bottomrule
      \end{tabular}
  }
  \end{threeparttable}}
```
Robustness to Missing Data {#md_robustness}
--------------------------

```{=latex}
\centering
```
![Illustration of T-Rep's robustness to missing data on UCR archive datasets. (a) shows heatmap representations of T-Rep and TS2Vec when faced with missing data, and (b) shows accuracy against percentage of missing data in a classification task for T-Rep, TS2Vec [@ts2vec] and T-Loss [@t_loss]. Error bars denote the standard deviation over 6 train-test runs.](images/md_ls_exp_fig.png){#fig:md_robust width="0.7\\linewidth"}

T-Rep was developed with medical and HAR applications in mind, requiring strong resilience to missing data, which we evaluate in two different experiments. In the first *qualitative* experiment, we visualize representations of incomplete time series using the DodgerLoopGame dataset from the UCR archive [@ucr_archive], which features a continuous segment of 25 missing timesteps (see top row of Figure `\ref{fig:md_robust}`{=latex}.a). We then visualize a heatmap of T-Rep and TS2Vec's time series representations (bottom row), showing the 15 dimensions with highest variance. T-Loss [@t_loss] is not included as it does not produce timestep-level representations. For TS2Vec, the representations of missing timesteps very much stand out (they are brighter, reflecting higher values), illustrating that the model is struggling to interpolate and instead produces out-of-distribution representations. On the other hand, T-Rep produces much more plausible representations for these missing timesteps with smoother transitions in and out of the area with missing points, as well as realistic interpolations, matching the data distribution of the surrounding area.

Secondly, we perform a more *quantitative* experiment, examining classification accuracy for different amounts of missing data, on the ArticularyWordRecognition dataset of the UCR archive [@ucr_archive]. We compare T-Rep's performance to TS2Vec [@ts2vec] and T-Loss, a self-supervised representation learning model for time series, specifically designed for downstream classification and clustering tasks [@t_loss]. The results are unequivocal: T-Rep's performance is extremely resilient to missing data, starting at 98% accuracy with the complete dataset, dropping by only 1.3% when faced with 75% missing data, and finally reaching 86.5% with only 10% of the available data (green curve of Figure `\ref{fig:md_robust}`{=latex}.b). The performance of TS2Vec is also very strong (orange curve), following a similar trend to T-Rep with $2\%$ less accuracy on average, and a more pronounced dip in performance when 90% of the data is missing, dropping to 82.7%. On the other hand, T-Loss is much more sensitive to any missing data. Its performance decreases exponentially to reach 2.6% when presented with 90% missing data (red curve).

Ablation Study {#ablation}
--------------

```{=latex}
\blue{To empirically validate T-Rep's components and pretext tasks, we conduct ablation studies on forecasting and anomaly detection tasks, using the ETT datasets for forecasting and the PhysioNet Challenge's Sepsis dataset for anomaly detection. Unless specified, Time2Vec is the chosen time-embedding method. \textbf{w/o TE-conditioned forecasting} assigns a weight of 0.0 to the 'time-embedding-conditioned' forecasting task, redistributing weights evenly. \textbf{w/o TE divergence prediction} behaves similarly, but for the 'time-embedding divergence prediction' task. \textbf{w/o New pretext tasks} retains only the time-embedding module and the two TS2Vec pretext tasks, isolating the contribution of the time-embedding module itself. To isolate the impact of the time-embedding architecture, we explore a fully-connected two-layer MLP with ReLU activations (\textbf{w/ MLP TE module}) and a vector of RBF features (\textbf{w/ Radial Basis Features TE module}). The original TS2Vec model (\textbf{w/o TE module}) is also included in the ablation study, lacking all of T-Rep's elements.}
```
```{=latex}
\centering
```
```{=latex}
\color{myblue}
```
```{=latex}
\scalebox{0.7}{
  \begin{tabular}{p{5.5cm}c c}
  \toprule
    & Forecasting & Anomaly Detection \\
    % & Avg. MSE & Avg. F1 \\
    \midrule
    \textbf{T-Rep} & \textbf{0.986} & \textbf{0.665} \\ 
    \midrule
    \multicolumn{3}{c}{\textit{Pretext tasks}} \\
    w/o TE-conditioned forecasting & 1.022 (+3.7\%) & 0.392 (-41\%) \\
    w/o TE divergence prediction & 1.003 (+1.7\%) & 0.634 (-4.7\%) \\
    w/o New pretext tasks & 0.999 (+1.3\%) & 0.513 (-22.8\%) \\
    \midrule
    \multicolumn{3}{c}{\textit{Architecture}}\\
    w/ MLP TE module & 1.008 (+2.2\%) & 0.443 (-33.3\%) \\
    w/ Radial Basis Features TE module & 1.007 (+2.1\%) & 0.401 (-39.7\%) \\
    w/o TE module (=TS2Vec) & 1.004 (+1.8\%) & 0.610 (-8.2\%) \\
    \bottomrule
  \end{tabular}
  }
```
```{=latex}
\blue{Results in Table \ref{table:ablation-study} confirm that the proposed pretext tasks and the addition of a time-embedding module to the encoder contribute to T-Rep's performance: removing any of these decreases the scores in both tasks. These results also illustrate the interdependency of the two tasks: in forecasting, keeping only one of them yields worse results than removing both pretext tasks. Finally, they justify our preferred choice of time-embedding, since Time2Vec \citep{time2vec} outperforms the other two architectures in both tasks.}
```
Conclusion
==========

We present T-Rep, a self-supervised method for learning representations of time series at a timestep granularity. T-Rep learns vector embeddings of time alongside its encoder, to extract temporal features such as trend, periodicity, distribution shifts etc. This, alongside pretext tasks which leverage the time-embeddings, allows our model to learn detailed temporal dependencies and capture any periodic or irregularly recurring patterns in the data. We evaluate T-Rep on classification, forecasting, and anomaly detection tasks, where it outperforms existing methods. This highlights the ability of time-embeddings to capture temporal dependencies within time series. Further, we demonstrate T-Rep's efficiency in missing data regimes, and provide visualisation experiments of the learned embedding space, to highlight the interpretability of our method.

```{=latex}
\newpage 
```
```{=latex}
\bibliographystyle{iclr2024_conference}
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
Appendix
========

Latent Space Structure
----------------------

This section aims to assess the representations produced by T-Rep in different settings, and show their interpretability. We perform two experiments: the first examines the intra-instance latent space structure, the second the inter-instance structure.\
For the first experiment, we choose 2 datasets from the UCR archive [@ucr_archive] and create one synthetic dataset. We then plot a time series sample for each dataset and show a heatmap of the corresponding T-Rep representation evolving through time (only the 15 dimensions with highest variance are shown for each representation vector).

```{=latex}
\centering
```
![Time series representation trajectories in the latent space. The top row shows the input signal, and the bottom row shows a heatmap of the representations through time. Only the 15 most varying dimensions are shown for each representation. Data for the two left-most figures comes from UCR archive datasets, and the right-most figure's data is synthetic.](images/ls_vis.png){#fig:ls_vis_small width="1.0\\linewidth"}

Figure `\ref{fig:ls_vis_small}`{=latex} showcases a variety of input signal properties captured by T-Rep. The leftmost plot shows a sample from the Phoneme dataset [@ucr_archive], and we see that periodicity is captured by T-Rep, in particular the change in frequency and decrease in amplitude. Plot (b) is an extract of the Pig Central Venous Pressure medical dataset [@ucr_archive], where T-Rep accurately detects an anomaly in the input signal, represented by a sharp change in value (illustrated by a color shift from green to blue) at the corresponding spike. Finally, the rightmost plot shows a synthetic dataset with an abrupt distribution shift. T-Rep's representation captures this with lighter colors and a change in frequency, as in the original signal. We can also observe how T-Rep produces coherent representations for missing data in Figure `\ref{fig:md_robust}`{=latex}.a (see section `\ref{md_robustness}`{=latex} for details). This experiment highlights the interpretability of the representations learned by T-Rep, accurately reflecting properties of the original signal.\
`\blue{To study the global latent space structure (inter-instance), we look at UMAP plots \citep{umap} of representations learned by T-Rep on five UCR datasets \cite{ucr_archive}. Each instance is represented by a point, colored by the class it belongs to, to show whether the latent space structure discriminates the datasets' different classes. We include representations learned by TS2Vec to see how the new components of T-Rep affect the learned latent space. Results are presented in Figure \ref{fig:clustering-exp}}`{=latex}.

```{=latex}
\centering
```
![`\blue{UMAP visualisations of the learned representations of T-Rep and TS2Vec on 5 UCR datasets. Each color represents a different class. TS2Vec results are produced using the official code \citep{ts2vec_repo}.}`{=latex}](images/clustering_1.png){#fig:clustering-exp width="1.0\\linewidth"}

`\blue{Looking at the UMAPs across the 128 UCR datasets, we noticed that the latent space structure of T-Rep and TS2Vec is often similar, and/or of similar quality. This is shown in the two rightmost plots of Figure \ref{fig:clustering-exp}, and can be explained by the fact that T-Rep focuses on learning temporal features (and uses TS2Vec's tasks for spatial features), resulting in stronger differences in the evolution of representations through time (see Figure \ref{fig:ls_vis_small}).}`{=latex}\
`\blue{To focus on learning detailed temporal features, T-Rep puts a lower weight on TS2Vec's spatial-features pretext tasks. While one might intuitively think this results in lower quality spatial features, this is not what our experiments show. Indeed, for none of the tested datasets did we observe a significant downgrade in latent space structure, which is further supported by the classification results presented in section \ref{classif_exp}, where T-Rep outperforms TS2Vec by a small margin. These results show that learning a detailed temporal structure within an instance constitutes a useful feature to compare various instances. This is illustrated by the two leftmost plots of Figure \ref{fig:clustering-exp}, where we see much more cohesive and separated clusters for T-Rep than TS2Vec. This experiment truly highlights the versatile nature of T-Rep, improving the quality of representations at both the dataset scale  (inter-instance discrimination) and individual time series (timestep granularity).}`{=latex}

Experimental Details {#exp_details}
--------------------

The implementation of the models is done in Python, using Pytorch 1.13 [@pytorch] for deep learning and scikit-learn [@sklearn] for SVMs, linear regressions, pre-processing etc. We use the Adam optimizer [@adam] throughout, and train all models on a single NVIDIA GeForce RTX 3060 GPU with Cuda 11.7.\
As mentioned in the Experiments section, no hyperparameter tuning was performed. We use a batch size of 16, a learning rate of 0.001, and set the maximum number of epochs to 200 across all datasets and tasks. We use 10 residual blocks for the encoder network, and set hidden channel widths to 128. The kernel size of all convolution layers is set to 3. The weighting of pretext tasks is detailed in section `\ref{pretext_task_weights}`{=latex} below. As for the choice of time-embedding, we use a fully-connected two-layer MLP with ReLU activations for the classification tasks, and Time2Vec for the forecasting and anomaly detection tasks (see Appendix `\ref{te_choice}`{=latex} for additional information). For all experiments (unless stated otherwise), models are trained and tested 10 times per dataset to mitigate stochasticity. No random seed is used in any experiments.

### Pretext Task Weights {#pretext_task_weights}

As shown in Figure `\ref{fig:overall_diag}`{=latex}, the loss passed to the hierarchical loss framework is a linear combination of the pretext tasks' losses: $$\mathcal{L} = \sum^4_{i=1} \alpha_i \cdot \ell_i \quad \quad \text{s.t.} \quad \sum^4_{i=1} \alpha_i = 1 \,,$$

where $\alpha_i$ designates task $i$'s weight. Choosing the task weights can be seen as a semantic as well as an optimisation problem: the extracted features will of course depend on the weights attributed to each task, but the choice of weights will also impact the optimisation landscape of the final loss. Balancing these two factors is a complex and interesting research direction, which we leave as future work. For all pretext tasks (instance-wise contrasting, temporal contrasting, time-embedding divergence prediction, time-embedding-conditioned forecasting), we use the same weights: $\alpha_i = 0.25 \; \forall i \in [1, 4]$. This is the case across all downstream tasks, no task-specific tuning is performed.

### Forecasting {#forecast_procedure}

The multivariate forecasting task consists in taking the $L$ previous points $X = \{\x_{t - L + 1}, ..., \x_{t - 1}, \x_t\} \in \R^{L \times C}$ and using them to predict the next $H$ points $y = \{\x_{t+1}, ..., \x_{t + H}\} \in \R^{H \times C}$, where $H$ is the prediction horizon. Such $X$ and $y$ are used to train the supervised forecasting baselines. For self-supervised models, the forecasting model is a simple ridge regression, taking as input $\z_t$ (the representation of $\x_t$), and trying to predict $y$. The input time series are normalised using the z-score, each channel being treated independently. We use no covariates such as minute, hour, day-of-the-week etc., to avoid interference with the temporal features learned by the time-embeddings. To accurately evaluate the benefits of time-embeddings, we wanted to ensure the model would not rely on other temporal features (the covariates).\
The forecasting model is a ridge regression (linear regression with an $L_2$ regularisation term, weighted by a parameter $\alpha$) trained directly on the extracted representations. $\alpha$ is chosen using grid search on a validation set, amongst the following values: $\alpha \in \{0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 500, 1000\}$.
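
A minimal numpy sketch of this selection procedure, using the closed-form ridge solution (array names hypothetical; the real pipeline feeds in encoder representations):

```python
import numpy as np

ALPHAS = [0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 500, 1000]

def ridge_fit(Z, Y, alpha):
    # Closed-form ridge solution: W = (Z'Z + alpha * I)^{-1} Z'Y.
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + alpha * np.eye(d), Z.T @ Y)

def select_alpha(Z_tr, Y_tr, Z_val, Y_val, alphas=ALPHAS):
    # Grid search: keep the alpha with the lowest validation MSE.
    def val_mse(a):
        W = ridge_fit(Z_tr, Y_tr, a)
        return np.mean((Z_val @ W - Y_val) ** 2)
    return min(alphas, key=val_mse)
```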

The train/validation/test split for the ETT datasets is 12/4/4 months. The prediction horizon $H$ ranges from 1 to 30 days for the hourly data (ETTh$_1$ and ETTh$_2$), and from 6 hours to 7 days for minute data (ETTm$_1$ and ETTm$_2$). The evaluation metrics, MSE and MAE, are respectively given by: $$\begin{aligned}
    \text{MSE} &= \frac{1}{HB} \sum^H_{i=1} \sum^{B}_{j=1} (x^j_{t+i} - \hat{x}^j_{t+i})^2 \,,\\
    \text{MAE} &= \frac{1}{HB} \sum^H_{i=1} \sum^{B}_{j=1} |x^j_{t+i} - \hat{x}^j_{t+i}| \,,\end{aligned}$$

where $x^j_{t+i}$ is the true value, $\hat{x}^j_{t+i}$ is the prediction, and $B$ is the number of time series instances, indexed by $j$. The reported MSE and MAE are thus averaged over all prediction horizons and instances. To mitigate the effects of stochasticity during training, each model is trained and tested 10 times on each dataset and prediction horizon, and the reported results are the mean.
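
Both metrics reduce to means over an $H \times B$ error array; a minimal sketch:

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """MSE and MAE averaged over prediction horizons and instances.

    y_true, y_pred: arrays of shape [H, B] (horizon x instances)."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(err ** 2)), float(np.mean(np.abs(err)))
```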

### Anomaly Detection {#ad_method}

```{=latex}
\subsubsubsection{\textbf{Yahoo}}
```
The Yahoo anomaly detection task follows a streaming evaluation procedure [@ren2019time], where given a time series segment $\{x_1, ..., x_t\}$, the goal is to infer if the last point $x_t$ is anomalous. To avoid drifting, the input data is differenced $d$ times, where $d$ is the number of unit roots found by an ADF test [@adf]. For fair comparison, we follow the anomaly detection procedure of @ts2vec, which computes an *anomaly score* $\alpha_t$ for $x_t$. $\alpha_t$ is given by the dissimilarity between the representations obtained when the observation of interest $x_t$ is masked vs. unmasked. Since the model was trained with random masking, it should output a coherent representation even with a masked input, resulting in a small $\alpha_t$. However, if the point $x_t$ presents an anomaly, the masked representation will be very different from the unmasked one, as it anticipated \`normal' behaviour rather than an anomaly, resulting in a high value of $\alpha_t$. More precisely, $\alpha_t$ is given by: $$\alpha_t = || r^{u}_t - r^{m}_t ||_1\,,$$

where $r^{u}_t$ is the representation of the unmasked $x_t$ and $r^{m}_t$ is the representation of the masked observation. The anomaly score is adjusted by the average of the preceding $Z$ points: $$\alpha_t^{adj} = \frac{\alpha_t - \Bar{\alpha}_t}{\Bar{\alpha}_t}\,,$$ where $\Bar{\alpha}_t = \frac{1}{Z}\sum^{t-1}_{i=t-Z} \alpha_i$. At inference time, anomalies are detected if the adjusted anomaly score $\alpha^{adj}_t$ is above a certain threshold: $\alpha^{adj}_t > \mu + \beta \cdot \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the historical scores, and $\beta$ is a hyperparameter. If an anomaly is detected within a delay of 7 timesteps from when it occurred, it is considered correct, following the procedure introduced in @ren2019time. To mitigate the effects of stochasticity during training, each model is trained and tested 10 times, and the reported results are the mean.
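
The scoring and thresholding steps can be sketched as follows (helper names hypothetical; the representations $r^u_t$, $r^m_t$ come from the frozen encoder):

```python
import numpy as np

def anomaly_score(r_unmasked, r_masked):
    # alpha_t = L1 distance between unmasked and masked representations.
    return np.abs(r_unmasked - r_masked).sum()

def adjusted_scores(alpha, Z):
    # alpha_t^adj = (alpha_t - mean of the Z preceding scores) / that mean.
    adj = np.full(len(alpha), np.nan)
    for t in range(Z, len(alpha)):
        local_mean = alpha[t - Z:t].mean()
        adj[t] = (alpha[t] - local_mean) / local_mean
    return adj

def detect(adj, beta):
    # Flag timestep t when adj[t] > mu + beta * sigma (historical statistics).
    valid = adj[np.isfinite(adj)]
    return adj > valid.mean() + beta * valid.std()
```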

```{=latex}
\subsubsubsection{\textbf{Sepsis}}
```
`\blue{The anomaly detection procedure followed for Sepsis is different from Yahoo's. We use a supervised method: since there is only one type of anomaly (sepsis) to detect, we can fit a model to it, obtaining higher scores than a purely unsupervised method (as is used for Yahoo). This would not be possible on a dataset like Yahoo, where the various anomalies cannot be explicitly classified. Despite being solved in a supervised manner, we believe sepsis detection remains an \textit{anomaly detection} task and differs from \textit{classification} because we are classifying individual timesteps rather than entire time series, thus evaluating different characteristics of T-Rep. To perform well, T-Rep must produce representations with accurate \textit{intra-instance} structure, where healthy timesteps are very different from abnormal (sepsis) timesteps, which is learned in a \textit{self-supervised} manner. This is further supported by the very low proportion of positive samples (2.2\%), which makes training a classification model on raw data inefficient. For these reasons, although we use a supervised model on top of our representations, we consider this task to be an \textit{anomaly detection} task.}`{=latex}\
`\blue{The Sepsis dataset is first pre-processed by forward-filling all missing values, including features which have a lower granularity than the physiological data, to ensure we have data at all timesteps for all features. The length of time series is then standardised to 45 timesteps (patient ICU stay length varies in the raw data), corresponding to the last 45 hours of each patient in the ICU. We then subselect ten features based on medical relevance ('HR', 'Temp', 'Resp', 'MAP', 'Creatinine', 'Bilirubin direct', 'Glucose', 'Lactate', 'Age', 'ICULOS'). This is rather 'light' pre-processing compared to what the top performers in the 2019 PhysioNet Challenge did. This is done on purpose, to further highlight gaps between model performances.}`{=latex}\
`\blue{We obtain a dataset $\X \in \R^{N \times 45 \times 10}$. We train the self-supervised models TS2Vec and T-Rep on this dataset, and then encode individual time series in sliding windows of length 6 (this window length seemed like a medically sensible lookback period, but we performed no parameter tuning). We thus obtain latent representations }`{=latex} $$\z_{t-6:t} = f_\theta (\x_{t-6:t})$$ `\blue{where $f_\theta$ is the self-supervised model (T-Rep or TS2Vec) and $\z_{t-6:t} \in \R^{6 \times F}$, with latent dimensionality $F=32$. A small latent dimensionality is used since only ten input features were pre-selected. Given $\z_{t-6:t}$, we try to predict whether timestep $t$ showcases sepsis, which is framed as a binary classification problem. This classification task is solved using an SVM classifier with RBF kernel, trained on the representations. The penalty hyperparameter $C$ is chosen through cross-validation, in the range $\{10^k | k \in [\![-4, 4]\!] \}$. The linear baseline uses the same method, but with the raw data $\x_{t-6:t}$ instead of the learned representations $\z_{t-6:t}$.}`{=latex}
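
A sketch of this pipeline, assuming scikit-learn for the SVM and cross-validation (function names hypothetical; the windowed inputs would come from the frozen encoder $f_\theta$):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def window_features(z, w=6):
    """One flattened feature vector per timestep t, built from z_{t-w:t}.

    z: [T, F] latent trajectory of one patient -> array [T - w, w * F]."""
    return np.stack([z[t - w:t].ravel() for t in range(w, z.shape[0])])

def fit_sepsis_classifier(X, y):
    # RBF-kernel SVM; penalty C cross-validated over {10^k, k in [-4, 4]}.
    grid = {"C": [10.0 ** k for k in range(-4, 5)]}
    return GridSearchCV(SVC(kernel="rbf"), grid, cv=3).fit(X, y)
```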

### Classification {#classif_procedure}

The classification procedure for TS-TCC, TNC, and T-Loss is the same as in @t_loss. Instance-level representations are produced by the self-supervised model, and an SVM classifier with RBF kernel is trained using the representations. The penalty hyperparameter $C$ is chosen through cross-validation, in the range $\{10^k | k \in [\![-4, 4]\!] \}$.\
For T-Rep and TS2Vec, the procedure is slightly different, leveraging the fact that the temporal granularity of representations is controllable by the user (from timestep-level to instance-wide), thanks to the *hierarchical loss* framework [@ts2vec]. We use this to produce representations which maintain a temporal dimension, whose length is controlled by hyperparameter $W$. This is achieved by applying a maxpool operation with kernel size $k=\lfloor \frac{T}{W} \rfloor$ and stride $s = \lfloor \frac{T}{W} \rfloor$ to the timestep-level representation $z \in \R^{T \times F}$, yielding a pooled representation $z \in \R^{W' \times F}$ with $W' = \lfloor T / k \rfloor \approx W$. This representation is then flattened to obtain $z_{flat} \in \R^{W' \cdot F}$, which is fed to an SVM with RBF kernel for classification, following the exact same protocol as in [@t_loss]. We refer to this new procedure as *TimeDim* in Table `\ref{table:full_uea_results}`{=latex}.\
Maintaining the temporal dimension in the representations is advantageous, as it gives a better snapshot of a time series' latent trajectory, compared to simply using one instance-wide vector. This is especially important for T-Rep, whose innovation lies in the treatment of time and its incorporation in latent trajectories. Because keeping some temporality increases the dimensionality of the vector $z_{flat}$ passed to the SVM, we use a smaller latent dimension $F$ to compensate. In practice we use $W = 10$ and $F = 128$, but no hyperparameter tuning was performed, so these may not be optimal parameters. As in all other experiments, the reported results are the mean obtained after training and testing each model 10 times per dataset.\
Interestingly, for TS2Vec, the *TimeDim* procedure does not result in a performance gain, so the scores obtained with the original procedure from [@t_loss] (using instance-wide representations) are reported in Table `\ref{table:uea_summary_res}`{=latex}. To summarise, for the results reported in Table `\ref{table:uea_summary_res}`{=latex} (in the paper's main body), T-Rep is the only model to use the *TimeDim* procedure. TS2Vec, TNC, T-Loss, TS-TCC all use the procedure introduced by @t_loss. TS2Vec results for both procedures are reported in Table `\ref{table:full_uea_results}`{=latex}, which contains scores on individual datasets, as well as aggregate metrics.\
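
The *TimeDim* construction described above (maxpool with kernel and stride $\lfloor T/W \rfloor$, then flattening) can be sketched in a few lines of numpy:

```python
import numpy as np

def timedim_features(z, W=10):
    """Maxpool z over time with kernel = stride = floor(T / W), then flatten.

    z: [T, F] timestep-level representation -> vector of length ~W * F."""
    T, F = z.shape
    k = T // W                        # pooling kernel and stride
    n = T // k                        # number of pooled timesteps (~W)
    pooled = z[: n * k].reshape(n, k, F).max(axis=1)
    return pooled.ravel()
```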

Reproduction Details for Baselines {#rep_details}
----------------------------------

**TS2Vec:** The results for TS2Vec [@ts2vec] have been reproduced using the official code (found [here](https://github.com/yuezhihan/ts2vec)). The hyperparameters used for TS2Vec are exactly those of the original paper (except for the batch size, set to 16 in our reproduction), and the same as T-Rep where possible. The only notable difference in hyperparameters is that T-Rep uses slightly wider hidden channels to accommodate the additional use of a time-embedding vector. The results published in the TS2Vec paper were not used for the forecasting task because of slight changes we made to the experimental setup, specifically removing the use of covariates in forecasting. For anomaly detection and classification, we used our own reproduction because the results slightly differed from the originally published ones, which we attribute to the following reasons. First, we use a batch size of 16 rather than 8, a parameter the original paper reports to be important [@ts2vec]. Second, we introduced a new classification procedure, so TS2Vec had to be evaluated in this new setup. Further, we train and test all models 10 times to mitigate stochasticity and obtain more stable results, which is not the case for the results of the original TS2Vec paper [@ts2vec]. Finally, we do not use any random seed, while the TS2Vec codebase [@ts2vec_repo] shows that a single run with a fixed random seed was used to produce their results.\
**Informer:** For the ETTh$_1$, ETTh$_2$ and ETTm$_1$ datasets, we use the results published in the Informer paper [@informer]. However, the paper doesn't include results on ETTm$_2$, so we obtain these results using the official code (found [here](https://github.com/zhouhaoyi/Informer2020)) and the default hyperparameters (which are also used in the original paper).\
**TCN:** T-Rep's encoder architecture is a TCN network [@cnn_vs_rnn], so we reused our code to implement the forecasting TCN model, changing only the training part to use a supervised MSE loss. The hyperparameters (channel widths, learning rate, optimiser) are the same as for T-Rep.\
For the classification tasks, the results of TS-TCC [@ts_tcc], TNC [@tnc], and DTW [@dtw] are taken from the TS2Vec paper [@ts2vec], based on their own reproduction. The results for T-Loss [@t_loss] are taken from the T-Loss paper directly [@t_loss].\
For the anomaly detection tasks, the baseline results of SPOT, DSPOT [@spot], and SR [@ren2019time] are taken directly from the TS2Vec paper [@ts2vec].

Time-Embedding Choice {#te_choice}
---------------------

The choice of architecture for the time-embedding varies depending on the temporal features one wishes to extract. For users concerned with capturing periodicity and trend, a Time2Vec-like module is ideal [@time2vec]. This focus on trend and periodicity makes it the best-performing time-embedding for anomaly detection and forecasting in our experiments. If unsure about which features to extract, a fully-connected MLP may also be used, leading to less interpretable but very flexible and expressive embeddings. We found this to be the best time-embedding for classification tasks. We have also implemented a time-embedding model based on RBF features, which performs well in anomaly detection. It is not well understood when and why different time-embeddings perform better than others, beyond the simple case of Time2Vec [@time2vec]. We leave this as an area for future research.
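
For reference, a minimal sketch of a Time2Vec-style embedding [@time2vec]: one linear coordinate captures trend, the remaining sinusoidal coordinates capture periodicity. The weights here are random stand-ins; in T-Rep they are learned jointly with the encoder.

```python
import numpy as np

class Time2Vec:
    """Time2Vec-style embedding: t2v(t)[0] = w0*t + b0 (trend),
    t2v(t)[i] = sin(wi*t + bi) for i >= 1 (periodicity)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=dim)  # frequencies (learned in practice)
        self.b = rng.normal(size=dim)  # phase shifts (learned in practice)

    def __call__(self, t):
        out = self.w * t + self.b
        out[1:] = np.sin(out[1:])      # all but the first coordinate are periodic
        return out
```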

Contextual Consistency {#contextual-consistency}
----------------------

`\blue{TS2Vec introduces a new pair-sampling method for contrastive learning: \textit{contextual consistency}. The idea is to sample two overlapping segments of the time-series and encourage the representations in the overlapping timesteps to be similar. Random transformations are applied to each sampled segment, leading to different \textit{contexts} for the overlapping window. Ideally, the encoder should extract a representation of the window which is invariant to the surrounding context.}`{=latex}\
`\blue{Given an input time-series $\x \in \R^{T \times F}$, random cropping is applied to obtain two overlapping segments of $\x$: $\x_{a_1:b_1}$ and $\x_{a_2:b_2}$ such that $0 \leq a_1 \leq a_2 \leq b_1 \leq b_2 \leq T$. $a_1, a_2, b_1, b_2$ are sampled uniformly in the allowed ranges at each training iteration. Timestamp masking is then randomly applied within each segment by the encoder, which outputs the respective representations $\z_{a_1:b_1} = \{z_{a_1}, ..., z_{b_1}\}$ and  $\z_{a_2:b_2} = \{z_{a_2}, ..., z_{b_2}\}$. These representations in the overlapping interval $[a_2, b_1]$ correspond to the same timesteps, but were obtained using different contexts. Timestep-wise and instance-wise contrasting tasks are then applied to them \cite{ts2vec}.}`{=latex}\
`\blue{The contextual consistency method is very powerful as it helps learn invariances to different sub-sequence distributions (which might differ in the case of non-stationary data), without making any underlying assumptions about the time-series distribution. This contrasts other SSL methods for temporal data, which often feature inductive biases in the form of invariance assumptions to noise, cropping, flipping etc. \citep{ssl_ts_review}.}`{=latex}
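
The sampling constraint $0 \leq a_1 \leq a_2 \leq b_1 \leq b_2 \leq T$ can be realised by sorting four uniform draws; a minimal sketch (degenerate zero-length crops are possible here and would be resampled in practice):

```python
import numpy as np

def sample_overlapping_crops(T, rng=None):
    """Sample 0 <= a1 <= a2 <= b1 <= b2 <= T; the two crops share [a2, b1)."""
    rng = rng if rng is not None else np.random.default_rng()
    a1, a2, b1, b2 = np.sort(rng.integers(0, T + 1, size=4))
    return (a1, b1), (a2, b2)
```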

Hierarchical Loss {#hierarchical-loss}
-----------------

```{=latex}
\blue{The hierarchical loss framework applies and sums the model's loss function at different scales, starting from a per-timestep representation and applying \texttt{maxpool} operations to reduce the time-dimension between scales \citep{ts2vec}. This gives users control over the granularity of the representation used for downstream tasks, without sacrificing performance. It also makes the model more robust to missing data, as it makes use of long-range information in the surrounding representations to reconstruct missing timesteps \citep{ts2vec}. Its precise algorithm is given in Figure \ref{fig:hierarchical_loss} below, taken from \cite{ts2vec}:}
```
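
A simplified numpy sketch of this scheme, assuming a generic `loss_fn` applied to two views `z1, z2` of shape `[B, T, F]`, with a kernel-2 maxpool halving the time axis between scales (the actual implementation differs in details such as the contrastive losses used at each scale):

```python
import numpy as np

def maxpool_time(z, k=2):
    """Maxpool along the time axis of z: [B, T, F] -> [B, T // k, F]."""
    B, T, F = z.shape
    n = T // k
    return z[:, : n * k].reshape(B, n, k, F).max(axis=2)

def hierarchical_loss(z1, z2, loss_fn):
    """Average loss_fn over all temporal scales, from per-timestep to instance-wide."""
    total, n_scales = 0.0, 0
    while True:
        total, n_scales = total + loss_fn(z1, z2), n_scales + 1
        if z1.shape[1] == 1:           # instance-wide scale reached
            return total / n_scales
        z1, z2 = maxpool_time(z1), maxpool_time(z2)
```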
```{=latex}
\centering
```
![`\blue{Hierarchical loss algorithm \citep{ts2vec}}`{=latex}](images/hierarchical_loss.jpg){#fig:hierarchical_loss width="0.5\\linewidth"}

UEA Classification Full Results {#uea_full_results}
-------------------------------

Table `\ref{table:full_uea_results}`{=latex} presents the full results for the classification task on the 30 UEA multivariate time series datasets [@ucr_archive]. The evaluated models are T-Rep, following the *TimeDim* procedure detailed in Appendix `\ref{classif_procedure}`{=latex}, TS2Vec [@ts2vec], following both the *TimeDim* procedure and the original procedure introduced in [@t_loss], T-Loss [@t_loss], TNC [@tnc], TS-TCC [@ts_tcc] and DTW [@dtw].

```{=latex}
\color{myblue}
```
```{=latex}
\centering
```
```{=latex}
\resizebox{\columnwidth}{!}{
    \begin{tabular}{ l  cccccccc  }
        \toprule
        Dataset & \textbf{T-Rep} (Ours) & TS2Vec & TS2Vec (\textit{TimeDim}) & T-Loss & TNC & TS-TCC & DTW & Minirocket \\
        \midrule
        ArticularyWordRecognition & 0.968 & 0.974 & 0.969 & 0.943 & 0.973 & 0.953 & 0.987 & 0.980 \\
        AtrialFibrillation & 0.354 & 0.287 & 0.313 & 0.133 & 0.133 & 0.267 & 0.2 & 0.2 \\
        BasicMotions & 1.0 & 1.0 & 1.0 & 1.0 & 1.0 & 1.0 & 1.0 & 1.0 \\
        CharacterTrajectories & 0.989 & 0.988 & 0.988 & 0.993 & 0.967 & 0.985 & 0.989 & 0.988 \\
        Cricket & 0.958 & 0.953 & 0.938 & 0.972 & 0.958 & 0.917 & 1.0 & 0.972\\
        DuckDuckGeese & 0.457 & 0.434 & 0.368 & 0.65 & 0.46 & 0.38 & 0.6 & 0.540 \\
        ERing & 0.943 & 0.899 & 0.936 & 0.133 & 0.852 & 0.904 & 0.133 & 0.933 \\
        EigenWorms & 0.884 & 0.916 & 0.88 & 0.84 & 0.84 & 0.779 & 0.618 & 0.954 \\
        Epilepsy & 0.97 & 0.963 & 0.962 & 0.971 & 0.957 & 0.957 & 0.964 & 1.0 \\
        EthanolConcentration & 0.333 & 0.303 & 0.298 & 0.205 & 0.297 & 0.285 & 0.322 & 0.357 \\
        FaceDetection & 0.581 & 0.541 & 0.577 & 0.513 & 0.536 & 0.544 & 0.529 & 0.569 \\
        FingerMovements & 0.495 & 0.495 & 0.493 & 0.58 & 0.47 & 0.46 & 0.53 & 0.420 \\
        HandMovementDirection & 0.536 & 0.418 & 0.527 & 0.351 & 0.324 & 0.243 & 0.231 & 0.405 \\
        Handwriting & 0.414 & 0.463 & 0.424 & 0.451 & 0.249 & 0.498 & 0.286 & 0.241 \\
        Heartbeat & 0.725 & 0.734 & 0.724 & 0.741 & 0.746 & 0.750 & 0.717 & 0.722 \\
        InsectWingbeat & 0.328 & 0.323 & 0.327 & 0.156 & 0.469 & 0.264 & 0.1 & 0.319 \\
        JapaneseVowels & 0.962 & 0.97 & 0.961 & 0.989 & 0.978 & 0.93 & 0.949 & 0.921 \\
        LSST & 0.526 & 0.558 & 0.546 & 0.509 & 0.595 & 0.474 & 0.551 & 0.668 \\
        Libras & 0.829 & 0.859 & 0.833 & 0.883 & 0.817 & 0.822 & 0.87 & 0.944 \\
        MotorImagery & 0.495 & 0.5 & 0.497 & 0.58 & 0.5 & 0.61 & 0.5 & 0.470 \\
        NATOPS & 0.804 & 0.897 & 0.824 & 0.917 & 0.911 & 0.822 & 0.883 & 0.916 \\
        PEMS-SF & 0.8 & 0.772 & 0.794 & 0.675 & 0.699 & 0.734 & 0.711 & 0.896 \\
        PenDigits & 0.971 & 0.977 & 0.975 & 0.981 & 0.979 & 0.974 & 0.977 & 0.974 \\
        PhonemeSpectra & 0.232 & 0.243 & 0.228 & 0.222 & 0.207 & 0.252 & 0.151 & 0.280 \\
        RacketSports & 0.883 & 0.893 & 0.865 & 0.855 & 0.775 & 0.816 & 0.802 & 0.875 \\
        SelfRegulationSCP1 & 0.819 & 0.79 & 0.819 & 0.843 & 0.799 & 0.823 & 0.775 & 0.884 \\
        SelfRegulationSCP2 & 0.591 & 0.554 & 0.563 & 0.539 & 0.55 & 0.532 & 0.539 & 0.494 \\
        SpokenArabicDigits & 0.994 & 0.992 & 0.993 & 0.905 & 0.934 & 0.97 & 0.963 & 0.989 \\
        StandWalkJump & 0.441 & 0.407 & 0.3 & 0.332 & 0.4 & 0.332 & 0.2 & 0.333 \\
        UWaveGestureLibrary & 0.885 & 0.875 & 0.89 & 0.875 & 0.759 & 0.753 & 0.903 & 0.913 \\
        \midrule
        % On the first 29 datasets: \\
        Avg. Accuracy & 0.706 & 0.699 & 0.693 & 0.657 & 0.671 & 0.667 & 0.650 & \textbf{0.707} \\
        Ranks $1^{st}$ & 6 & 2 & 1 & 6 & 1 & 3 & 2 & \textbf{9} \\
        Avg. Acc. Difference (\%) & -- & 2.1 & 3.4 & 33.6 & 12.3 & 10.0 & 43.6 & 4.8 \\
        % Avg. Rank & \textbf{2.96} & 3.17  & 3.90 & 3.77 & 4.57 & 4.94 & 4.70 \\
        \bottomrule
    \end{tabular}
}
```
The metrics used for evaluation are the following:

-   **Avg. Accuracy:** The measured test set accuracy, averaged over all datasets.

-   **Ranks $1^{st}$:** This metric measures the number of datasets for which the given model is the best amongst all compared models. For example, T-Rep is the best-performing model on 6 out of the 30 datasets.

-   **Avg. Acc. Difference:** This measures the relative accuracy difference between T-Rep and a given model $M$, averaged over all 30 datasets. It is calculated as: $$\text{Avg. difference} = \frac{100}{30} \cdot \left( \sum^{30}_{i=1} \frac{Acc(\text{T-Rep})_i - Acc(M)_i}{Acc(M)_i} \right) \,,$$ where $Acc(M)_i$ is the accuracy of model $M$ on dataset $i$.

-   **Avg. Rank:** For each dataset, we compute the rank of a model compared to other models based on accuracy. The average rank over all 30 datasets is then reported.
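
These aggregate metrics can be computed directly from an accuracy table; a minimal sketch (`acc` is a hypothetical `[n_datasets, n_models]` array; ties in rank are broken arbitrarily here, unlike shared ranks in more careful schemes):

```python
import numpy as np

def summarise(acc):
    """Avg. accuracy, number of rank-1 finishes, and avg. rank per model (column)."""
    order = (-acc).argsort(axis=1)       # best model first on each dataset
    ranks = order.argsort(axis=1) + 1    # 1 = best; ties broken arbitrarily
    return acc.mean(axis=0), (ranks == 1).sum(axis=0), ranks.mean(axis=0)
```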

ETT Forecasting Full Results {#ett_full_results}
----------------------------

Table `\ref{forecast-multivar}`{=latex} presents the full results for the forecasting task on the 4 ETT datasets [@informer]. We evaluate T-Rep against TS2Vec [@ts2vec], a SOTA self-supervised model for time series, but also Informer [@informer], the SOTA in time series forecasting, as well as TCN [@cnn_vs_rnn], a simpler model with the same backbone architecture as T-Rep, but trained with a supervised MSE loss.

```{=latex}
\centering
```
```{=latex}
\begin{threeparttable}
 \color{myblue} 
 \scalebox{0.9}{
      \begin{tabular}{lccccccccccc}
      \toprule
             &  & \multicolumn{2}{c}{\textbf{T-Rep }(Ours)} &
             \multicolumn{2}{c}{TS2Vec} &
             \multicolumn{2}{c}{Informer} &
             \multicolumn{2}{c}{TCN} &
             \multicolumn{2}{c}{Linear Baseline} \\
        \cmidrule(r){3-4} \cmidrule(r){5-6} \cmidrule(r){7-8} \cmidrule(r){9-10} \cmidrule(r){11-12}
        Dataset & H & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE \\
        \midrule
        \multirow{5}*{ETTh$_1$}
        & 24 & \textbf{0.511} & \textbf{0.496} & 0.575 & 0.529 & 0.577 & 0.549 & 0.767 & 0.612 & 0.873 & 0.664 \\
        & 48 & \textbf{0.546} & \textbf{0.524} & 0.608 & 0.553 & 0.685 & 0.625 & 0.713 & 0.617 & 0.912 & 0.689 \\
        & 168 & \textbf{0.759} & \textbf{0.649} & 0.782 & 0.659 & 0.931 & 0.752 & 0.995 & 0.738 & 0.993 & 0.749 \\
        & 336 & \textbf{0.936} & \textbf{0.742} & 0.956 & 0.753 & 1.128 & 0.873 & 1.175 & 0.800 & 1.085 & 0.804 \\
        & 720 & \textbf{1.061} & \textbf{0.813} & 1.092 & 0.831 & 1.215 & 0.896 & 1.453 & 1.311 & 1.172 & 0.863 \\
        
        \midrule
    
        \multirow{5}*{ETTh$_2$}
        & 24 & 0.560 & 0.565 & \textbf{0.448} & 0.506 & 0.720 & 0.665 & 1.365 & 0.888 & 0.463 & \textbf{0.498} \\
        & 48 & 0.847 & 0.711 & 0.685 & 0.642 & 1.457 & 1.001 & 1.395 & 0.960 & \textbf{0.614} & \textbf{0.588} \\
        & 168 & 2.327 & 1.206 & 2.227 & 1.164 & 3.489 & 1.515 & 3.166 & 1.407 & \textbf{1.738} & \textbf{1.016} \\
        & 336 & 2.665 & 1.324 & 2.803 & 1.360 & 2.723 & 1.340 & 3.256 & 1.481 & \textbf{2.198} & \textbf{1.173} \\
        & 720 & 2.690 & 1.365 & 2.849 & 1.436 & 3.467 & 1.473 & 3.690 & 1.588 & \textbf{2.454} & \textbf{1.290} \\
        
        \midrule
        
        \multirow{5}*{ETTm$_1$}
        & 24 & 0.417 & 0.420 & 0.438 & 0.435 & \textbf{0.323} & \textbf{0.369} & 0.324 & 0.374 & 0.590 & 0.505 \\
        & 48 & 0.526 & 0.484 & 0.582 & 0.553 & 0.494 & 0.505 & \textbf{0.477} & \textbf{0.450} & 0.813 & 0.637 \\
        & 96 & \textbf{0.573} & \textbf{0.516} & 0.602 & 0.537 & 0.678 & 0.614 & 0.636 & 0.602 & 0.866 & 0.654 \\
        & 288 & \textbf{0.648} & \textbf{0.577} & 0.709 & 0.610 & 1.056 & 0.786 & 1.270 & 1.351 & 0.929 & 0.697 \\
        & 672 & \textbf{0.758} & \textbf{0.649} & 0.826 & 0.687 & 1.192 & 0.926 & 1.381 & 1.467 & 1.008 & 0.746 \\
    
        \midrule
        
        \multirow{5}*{ETTm$_2$}
        & 24 & 0.172 & 0.293 & 0.189 & 0.310 & \textbf{0.147} & \textbf{0.277} & 1.452 & 1.938 & 0.275 & 0.364 \\
        & 48 & 0.263 & 0.377 & \textbf{0.256} & \textbf{0.369} & 0.267 & 0.389 & 2.181 & 0.839 & 0.363 & 0.434 \\
        & 96 & 0.397 & 0.470 & 0.402 & 0.471 & \textbf{0.317} & \textbf{0.411} & 3.921 & 1.714 & 0.441 & 0.484 \\
        & 288 & 0.897 & 0.733 & 0.879 & 0.724 & 1.147 & 0.834 & 3.649 & 3.245 & \textbf{0.754} & \textbf{0.664} \\
        & 672 & 2.185 & 1.144 & 2.193 & 1.159 & 3.989 & 1.598 & 6.973 & 1.719 & \textbf{1.796} & \textbf{1.027} \\
        
        \midrule
        
        \multicolumn{2}{l}{Avg. Rank} & \textbf{1.90} & \textbf{1.85} & 2.40 & 2.45 & 3.3 & 3.55 & 4.35 & 4.1 & 3.05 & 3.05 \\
        \multicolumn{2}{l}{Ranks $1^{\text{st}}$} & \textbf{8} & \textbf{8} & 2 & 1 & 3 & 3 & 1 & 1 & 6 & 7\\
        \multicolumn{2}{l}{Avg. MSE / MAE} & \textbf{0.986} & \textbf{0.702} & 1.004 & 0.712 & 1.300 & 0.820 & 2.012 & 1.205 &  1.017 & 0.727\\
        
        \bottomrule
      \end{tabular}
  }
  \end{threeparttable}
```
```{=latex}
\newpage
```
Time Series Anomaly Detection Full Results {#ad_results}
------------------------------------------

```{=latex}
\centering
```
```{=latex}
\scalebox{1.0}{
  \begin{tabular}{lcccccccc}
  \toprule
    & F$_1$ & Prec. & Rec. \\
    \midrule
    SPOT & 0.338 & 0.269 & 0.454 \\
    DSPOT & 0.316 & 0.241 & 0.458 \\
    % DONUT & 0.026 & 0.013 & \textbf{0.825} \\
    SR & 0.563 & 0.451 & 0.747 \\
    TS2Vec & 0.733 & 0.706 & 0.762 \\
    \textbf{T-Rep} (Ours) & \textbf{0.757} & \textbf{0.796} & 0.723 \\
    \bottomrule
  \end{tabular}
 }
```
```{=latex}
\centering
```
```{=latex}
\color{myblue}
```
```{=latex}
\scalebox{1.0}{
  \begin{tabular}{lcccccccc}
  \toprule
    & F$_1$ & Acc. \\
    \midrule
    Raw data & 0.241 & 0.676 \\
    TS2Vec & 0.619 & 0.771 \\
    \textbf{T-Rep} (Ours) & \textbf{0.665} & \textbf{0.790} \\
    \bottomrule
  \end{tabular}
 }
```

[^1]: <https://github.com/Let-it-Care/T-Rep>
