---
abstract: |
  Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require computationally costly pretraining on large-scale, carefully curated collections of real-world sequences. To allow for sample-efficient pretraining of TSFMs, we propose [CauKer]{.smallcaps}, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. [CauKer]{.smallcaps} combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCMs) to produce data for sample-efficient pretraining of state-of-the-art classification TSFMs with different architectures and pretraining approaches. Additionally, our experiments reveal that [CauKer]{.smallcaps}-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior.
author:
- |
  Shifeng Xie\
  Paris Noah's Ark Lab, Huawei$^1$\
  `shifeng.xie@telecom-paris.fr`\
  Vasilii Feofanov\
  Paris Noah's Ark Lab, Huawei$^1$\
  Marius Alonso\
  Paris Noah's Ark Lab, Huawei$^1$\
  Ambroise Odonnat\
  Paris Noah's Ark Lab, Huawei$^1$\
  Inria$^2$\
  Jianfeng Zhang\
  Paris Noah's Ark Lab, Huawei$^1$\
  Themis Palpanas\
  Université Paris Cité, LIPADE / diiP$^3$\
  Ievgen Redko\
  Paris Noah's Ark Lab, Huawei$^1$\
  `ievgen.redko@huawei.com`\
bibliography:
- reference.bib
title: '[CauKer]{.smallcaps}: classification time series foundation models can be pretrained on synthetic data only'
---

Introduction
============

Time series data are ubiquitous in applications ranging from healthcare [@gnassounou2025psdnorm] and human activity recognition [@chen2025comodo] to industrial monitoring [@susto2018time]. Recently, the time series community has devoted significant effort to developing large-scale pretrained time series foundation models (TSFMs). Inspired by advances in natural language processing and computer vision, these models aim to achieve strong zero-shot performance in out-of-distribution (OOD) settings. TSFMs have been proposed for both forecasting [@Chronos; @MOIRAIS; @bhethanabhotla2024mamba4cast] and classification tasks [@MOMENT; @Nutime; @Mantis], showing promising results. TSFMs are usually trained on large-scale pretraining dataset collections gathered from different application domains; recent work used as many as 1.23 billion timepoints from 13M unique time series for model pretraining [@MOMENT].

Despite the prevalence of large-scale pretraining in the development of TSFMs, several works [@hoo2024tabular; @ForecastPFN; @timePFN] showed that comparable performance can be achieved by training purely on synthetic data. The latter approach has several important advantages. First, it removes the need for time-consuming data collection and curation, which is especially important for time series classification, a task that lacks diverse and rich pretraining corpora. Second, it allows for generating arbitrarily large datasets for model scaling. Finally, it makes OOD evaluation more meaningful by mitigating the risk of data leakage. Inspired by the recent success of foundation models in tabular classification [@TabPFN], our paper proposes a novel sample-efficient pretraining framework for classification TSFMs based purely on synthetic data. Contrary to tabular and forecasting synthetic data generation pipelines, our proposal seeks to generate sequences with meaningful correlations between samples and realistic temporal dependencies within them. We provide an in-depth, large-scale study of its benefits compared to pretraining on commonly used time series classification corpora.

#### Findings

Overall, our findings can be summarized as follows:

1.  A carefully designed synthetic data generation pipeline can be efficiently used in training classification TSFMs. We propose such a pipeline and show that it requires rethinking synthetic data generators proposed previously for tabular data and time series forecasting.

2.  Pretraining on synthetic data reveals clear scaling laws both in terms of dataset size and model size. We illustrate this finding by showing that such scaling laws are broken when using common classification benchmarks for pretraining, likely due to the lack of diversity in existing classification datasets.

3.  Distinct from forecasting [@TSScalingLaw], where the leaderboard (with the exception of [@TabPFN]) is still dominated by models pretrained on large-scale real-world datasets, we show that pretraining solely on synthetic data can lead to state-of-the-art performance in classification.

The rest of this paper is organized as follows. In Section [2](#sec:related){reference-type="ref" reference="sec:related"}, we present recent advances in TSFMs and describe commonly used pretraining datasets. In Section [3](#sec:contributions){reference-type="ref" reference="sec:contributions"}, we present the problem setup considered in our work and the proposed synthetic data generation pipeline. In Section [4](#sec:experiments){reference-type="ref" reference="sec:experiments"}, we empirically validate the effectiveness of [CauKer]{.smallcaps}-generated synthetic data through extensive experiments, demonstrating its strong generalization, scalability, and superiority over existing synthetic generation methods. Finally, we conclude and discuss the limitations of our work in Section [5](#sec:conclusions){reference-type="ref" reference="sec:conclusions"}.

Related work {#sec:related}
============

#### Time series foundation models

Recent advances in TSFMs have followed two primary directions: (1) training models from scratch on large-scale, diverse time series datasets [@Chronos; @MOMENT; @timeFM; @UniTS; @rasul2023lag; @wang2024rose; @MOIRAIS; @bhethanabhotla2024mamba4cast; @Mantis; @Nutime; @Timer], and (2) leveraging large language models (LLMs) as backbones for time series tasks [@chang2023llm4ts; @gruver2024large; @zhou2023one; @xue2023promptcast; @cao2023tempo; @jin2023time]. The first approach focuses on developing architectures specifically tailored for time series, while the second explores encoding time series data into textual formats or extending the model's input mechanisms to natively handle sequential numeric data. Among the TSFMs mentioned above, the vast majority were proposed for time series forecasting, with only [@Mantis; @UniTS; @MOMENT; @GPT4TS; @Nutime; @timesbert] natively supporting time series classification. In particular, [@Mantis; @Nutime] specifically target classification by contrastively pretraining encoder-only models over time series gathered from popular classification benchmarks, achieving state-of-the-art results in this task. [@MOMENT] is an encoder-decoder model used for classification and other popular time series tasks, such as forecasting, imputation, and anomaly detection. [@UniTS] relies on a custom architecture and is used in generative and prediction tasks by leveraging task-specific tokens. Finally, [@GPT4TS] fine-tunes an LLM by adding an appropriate encoder for input data and a classification head to generate predictions.

#### Pretraining datasets

The training data for TSFMs generally fall into three categories: real-world, synthetic, or hybrid datasets combining the two. Models trained (or fine-tuned, in the case of LLM-based TSFMs) exclusively on real data [@timeFM; @UniTS; @rasul2023lag; @wang2024rose; @Mantis; @Nutime; @chang2023llm4ts; @gruver2024large; @zhou2023one; @xue2023promptcast; @cao2023tempo; @jin2023time] typically leverage extensive collections (ranging from 300K to 50M distinct time series) drawn from diverse domains such as traffic, finance, and environmental monitoring. Training on these datasets, however, may be suboptimal scaling-wise: [@quanreimagining] obtained comparable performance using less than 1% of the original 27B-timepoint pretraining dataset of [@MOIRAIS], while [@TSScalingLaw] showed that popular forecasting TSFMs have very flat scaling laws in the multivariate setting. Meanwhile, forecasting models such as Chronos [@Chronos] and TimesFM [@timeFM] enhance their training corpus by incorporating synthetic time series data alongside real-world data. Finally, methods such as TimePFN [@timePFN] and ForecastPFN [@ForecastPFN] are pretrained solely on synthetic data. In all these forecasting models, synthetic data is commonly generated through structured statistical procedures, including Gaussian process (kernel-based) methods or piecewise linear and seasonal pattern constructions with additive noise (for more details, we refer the interested reader to Appendix [6](#Related work extension){reference-type="ref" reference="Related work extension"}). To the best of our knowledge, no prior work has proposed classification-oriented synthetic data generation methods for training time series foundation models.

Our contributions {#sec:contributions}
=================

We now introduce the task of zero-shot time series classification using TSFMs. We then formally present the common pretraining strategies and introduce our synthetic data generation pipeline.

Problem setup
-------------

#### Zero-shot classification

As done in prior work on unsupervised representation learning [@MultivariateUnsupervised; @TS2Vec], we see a TSFM as an encoder $F:\mathbb{R}^{t}\!\to\!\mathbb{R}^{q}$ that is kept frozen during the evaluation. For a downstream classification dataset $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{n}$ with labels $y_{i}\in\{1,\dots,C\}$, we use a TSFM to obtain embeddings $z_{i}=F(x_{i})$ and train a lightweight classifier $h:\mathbb{R}^{q}\!\to\!\{1,\dots,C\}$ solely on $\{(z_{i},y_{i})\}$. At test time, an unseen series $x^{\ast}$ is classified by $\hat{y}=h\!\bigl(F(x^{\ast})\bigr)$. As $F$ is kept frozen, the resulting accuracy measures the quality of its learned representations.
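A minimal sketch of this protocol is given below, assuming the frozen TSFM is exposed as a callable `encoder` mapping a batch of series to embeddings; the Random Forest head is one possible choice of lightweight classifier $h$ (the choice used for Mantis in Section 4).

```python
# Sketch of the zero-shot evaluation protocol. `encoder` stands for any frozen
# TSFM F: R^t -> R^q exposed as a callable; only the head h is ever trained.
from sklearn.ensemble import RandomForestClassifier

def zero_shot_accuracy(encoder, X_train, y_train, X_test, y_test):
    Z_train = encoder(X_train)         # (n_train, q); encoder weights stay frozen
    Z_test = encoder(X_test)           # (n_test, q)
    head = RandomForestClassifier(n_estimators=200, random_state=0)
    head.fit(Z_train, y_train)         # fit the lightweight classifier h only
    return head.score(Z_test, y_test)  # accuracy measures representation quality
```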

To quantify OOD generalization ability, we follow [@TSScalingLaw] and evaluate the studied TSFMs only on samples not seen during their pretraining. In practice, if we evaluate a given TSFM on a test set from a UCR [@UCR] dataset, we ensure that the TSFM was not pretrained on it, but we allow the train set of this same dataset to be used for pretraining. We note that [@Mantis; @MOMENT; @Nutime] all used train sets from the datasets on which they reported zero-shot OOD generalization. Next, a lightweight classifier $h$ is fitted on the UCR train set embeddings and evaluated on the disjoint UCR test set embeddings, as explained above.

#### Self-supervised pretraining

Self-supervised learning (SSL) has emerged as a powerful training paradigm for foundation models, allowing them to effectively learn discriminative representations from large-scale unlabeled datasets, significantly reducing dependency on costly data labeling [@SSLsurvey]. SSL methods are categorized into two principal types: contrastive learning and masked (reconstruction) learning [@SSL2]. Contrastive learning focuses on distinguishing between similar (positive) and dissimilar (negative) data pairs to learn meaningful representations. Conversely, masked learning leverages reconstruction objectives by training models to predict masked parts of the input, thereby gaining robust contextual understanding [@MaskedSurvey].

In our work, we cover both pretraining regimes. To this end, we consider Mantis [@Mantis], an open-source FM pretrained contrastively, and MOMENT [@MOMENT], a mask-based pretrained model. Detailed formulations of the loss functions and architecture specifics for these models are provided in Appendix [7](#Details of Mantis and MOMENT){reference-type="ref" reference="Details of Mantis and MOMENT"}.

[CauKer]{.smallcaps}: synthetic data generation for time series classification
------------------------------------------------------------------------------

We now present our proposed synthetic data generation pipeline, termed [CauKer]{.smallcaps} for **Cau**sal-**Ker**nel generation. To develop our intuition about it, we note that synthetic data for the time series classification task needs to combine two key ingredients. On the one hand, the generated sequences should exhibit common time series patterns such as seasonality, periodicity, and trend. On the other hand, successful classification requires that individual time series form a meaningful clustering structure, allowing the trained model to learn to disentangle the underlying clusters. Below, we present a generation pipeline that satisfies these desiderata.

![An illustration of the proposed [CauKer]{.smallcaps} pipeline. Kernels sampled from the kernel bank $\mathcal{K}$ are randomly combined and used together with sampled mean functions to form GP priors. Time series sampled from these GP priors act as root nodes in a directed acyclic graph that encodes causal dependencies between nodes. Each edge of this graph applies an activation function from a predefined activation function bank and aggregates over incoming edges using a random linear transformation to propagate transformed time series through the graph. Intermediate node outputs are optionally interpolated to fixed length, forming the final synthetic dataset. This procedure yields rich, diverse, and causally consistent time series for self-supervised pretraining. ](intro_generation_v2.png){#fig:CauK_pipeline width=".97\\textwidth"}

#### Proposed approach

To proceed, we now define three banks of functions, namely the kernel, mean, and activation banks, denoted as $\mathcal{K} = \{\kappa_i(t,t')\}_{i=1}^{n_\mathcal{K}}$, $\mathcal{M} = \{\mu_i(t)\}_{i=1}^{n_\mathcal{M}}$, and $\mathcal{A} = \{\sigma_i(t)\}_{i=1}^{n_\mathcal{A}}$, respectively. For the kernel bank, we use the same kernel functions as [@Chronos]. For mean functions, we consider a linear function $at+b$, an exponential function $ae^{bt}$, and an anomaly mean function that inserts random values from $\mathcal{U}(-5,5)$ at random indexes. Finally, the activation functions we use for $\mathcal{A}$ are a linear function $ax+b$ with $a \sim \mathcal{U}(0.5, 2)$, $b \sim \mathcal{U}(-1, 1)$, the ReLU activation, the sigmoid, the sine function, the element-wise modulo operation $x \text{ mod } c$ for $c \sim \mathcal{U}(1,5)$, and Leaky ReLU with a random negative slope drawn from $\mathcal{U}(0.01, 0.3)$. For simplicity, in what follows we let $\{s_i\}_{i=1}^{n}\sim \mathcal{S}$ denote a uniform random sampling of $n$ elements from a set $\mathcal{S}$.
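For illustration, the sketch below instantiates the three banks with scikit-learn kernels and NumPy functions. The specific kernels and hyperparameter values are assumptions made for this sketch (the actual bank, described in Appendix 8, contains 36 kernels), and random parameters are redrawn at every call.

```python
import numpy as np
from sklearn.gaussian_process.kernels import (
    RBF, ConstantKernel, DotProduct, ExpSineSquared, RationalQuadratic, WhiteKernel)

# Kernel bank K: a few representative covariance functions (assumed values).
KERNEL_BANK = [
    RBF(length_scale=0.1), RBF(length_scale=1.0),
    ExpSineSquared(periodicity=0.05), ExpSineSquared(periodicity=0.5),
    DotProduct(sigma_0=1.0), RationalQuadratic(alpha=1.0),
    WhiteKernel(noise_level=0.1), ConstantKernel(constant_value=1.0),
]

# Mean bank M: linear, exponential, and sparse-anomaly mean functions.
def linear_mean(t):
    a, b = np.random.uniform(-1, 1, size=2)
    return a * t + b

def exp_mean(t):
    a, b = np.random.uniform(-1, 1, size=2)
    return a * np.exp(b * t)

def anomaly_mean(t):
    m = np.zeros_like(t)
    idx = np.random.choice(len(t), size=np.random.randint(1, 6), replace=False)
    m[idx] = np.random.uniform(-5, 5, size=len(idx))  # random spikes from U(-5, 5)
    return m

MEAN_BANK = [linear_mean, exp_mean, anomaly_mean]

# Activation bank A: parameters are resampled at every call in this sketch.
ACTIVATION_BANK = [
    lambda x: np.random.uniform(0.5, 2) * x + np.random.uniform(-1, 1),  # a*x + b
    lambda x: np.maximum(x, 0.0),                                        # ReLU
    lambda x: 1.0 / (1.0 + np.exp(-x)),                                  # sigmoid
    np.sin,                                                              # sine
    lambda x: np.mod(x, np.random.uniform(1, 5)),                        # x mod c
    lambda x: np.where(x > 0, x, np.random.uniform(0.01, 0.3) * x),      # Leaky ReLU
]
```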

Our generative pipeline, illustrated in Figure [1](#fig:CauK_pipeline){reference-type="ref" reference="fig:CauK_pipeline"}, then proceeds in five steps as follows:

-   We start by sampling candidate kernels from the kernel bank, *i.e.*, $\{\kappa_i(t,t')\}_{i=1}^{K} \overset{\text{i.i.d.}}{\sim} \mathcal{K}$ for some random number of candidate kernels $K \sim \mathcal{U}(1,n_\mathcal{K})$.

-   We define a composite kernel based on $K-1$ randomly sampled binary operations ($+$ and $\times$). More formally, for a random sequence $\{\star_i\}_{i=1}^{K-1} \sim \{+, \times\}$, we let $\kappa^* = \kappa_1(t,t') \star_1 \kappa_2(t,t') \star_2 \cdots \star_{K-1} \kappa_K(t,t')$.

-   We draw $M$ mean functions $\{\mu_i(t)\}_{i=1}^{M} \overset{\text{i.i.d.}}{\sim} \mathcal{M}$, $M \sim \mathcal{U}(1,n_\mathcal{M})$, and repeat Steps 1 and 2 $M$ times to obtain composite kernels $\{\kappa_i^*\}_{i=1}^M$. We further define $M$ GP priors $\{\mathcal{GP}(\mu_i, \kappa_i^*)\}_{i=1}^M$ to sample from.

-   We sample a set of $E$ activation functions from the activation bank, *i.e.*, $\{\sigma_i\}_{i=1}^E \sim \mathcal{A}$, $E \sim \mathcal{U}(1,n_\mathcal{A})$.

-   We randomly generate a directed acyclic graph (DAG) $(\mathcal{V}, \mathcal{E})$ with $|\mathcal{E}|=E$, $|\mathcal{V}|=V$, and $M$ *root nodes* (nodes with in-degree zero), where $M < V$. We then define a bijection $\phi : \mathcal{E} \to \{\sigma_1, \sigma_2, \dots, \sigma_E\}$ such that each directed edge $e_{ij} = (u_i, v_j)$ is uniquely associated with an activation function, i.e., $\phi(e_{ij}) = \sigma_l$. We associate a time series $t_i \in \mathbb{R}^L$ sampled from $\mathcal{GP}(\mu_i, \kappa_i^*)$ with each of the $M$ root nodes. The value $t_{v_j}$ associated with a given non-root vertex $v_j$ is then calculated as follows. First, for each incoming edge $e_{ij}$, we apply the activation function $\phi(e_{ij})$ to $t_{u_i}$. Then, we aggregate the transformed series $\{\phi(e_{ij})(t_{u_i})\}_{i:\,e_{ij} \in \mathcal{E}}$ using a randomly initialized linear layer with weights and biases $W, b \sim \mathcal{N}(0,1)$, *i.e.*, $t_{v_j} = W \left[\phi(e_{ij})(t_{u_i})\right]_{i:\,e_{ij} \in \mathcal{E}} + b$, with $[\cdot]$ denoting the concatenation operation.

A complete pseudocode of this procedure, as well as the composition and visualizations of the kernel, mean, and activation banks, are provided in Appendix [8](#Details of CauK){reference-type="ref" reference="Details of CauK"}.
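For intuition, the condensed sketch below re-creates Steps 1-5 in NumPy and scikit-learn, reusing the `KERNEL_BANK`, `MEAN_BANK`, and `ACTIVATION_BANK` from the previous sketch; node counts, the jitter term, and the choice to return a single observed node are simplifying assumptions, not the exact implementation.

```python
import numpy as np

def sample_composite_kernel(rng):
    K = rng.integers(1, len(KERNEL_BANK) + 1)              # Step 1: number of kernels
    idx = rng.choice(len(KERNEL_BANK), size=K, replace=False)
    composite = KERNEL_BANK[idx[0]]
    for i in idx[1:]:                                      # Step 2: random + / * composition
        if rng.random() < 0.5:
            composite = composite + KERNEL_BANK[i]
        else:
            composite = composite * KERNEL_BANK[i]
    return composite

def sample_root_series(rng, length=512):
    t = np.linspace(0.0, 1.0, length)
    mean = MEAN_BANK[rng.integers(len(MEAN_BANK))](t)      # Step 3: non-zero mean function
    cov = sample_composite_kernel(rng)(t.reshape(-1, 1))   # sklearn kernels are callable
    return rng.multivariate_normal(mean, cov + 1e-6 * np.eye(length))

def cauker_sample(rng, n_nodes=6, n_roots=2, length=512):
    nodes = [sample_root_series(rng, length) for _ in range(n_roots)]
    for j in range(n_roots, n_nodes):                      # Steps 4-5: propagate through a DAG
        parents = rng.choice(j, size=rng.integers(1, j + 1), replace=False)
        transformed = np.stack(
            [ACTIVATION_BANK[rng.integers(len(ACTIVATION_BANK))](nodes[p]) for p in parents])
        W, b = rng.normal(size=(1, len(parents))), rng.normal()  # random linear aggregation
        nodes.append((W @ transformed).ravel() + b)
    return nodes[rng.integers(n_nodes)]                    # one observed node as the sample

rng = np.random.default_rng(0)
series = cauker_sample(rng)                                # one synthetic series of length 512
```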

#### Design choices

The synthetic datasets generated using our [CauKer]{.smallcaps} approach effectively encode diverse, realistic patterns and causal dynamics characteristic of real-world classification problems. Unlike the kernel-only generator of @Chronos (Steps 1 and 2), which was designed for forecasting and therefore draws zero-mean Gaussian-process samples that emphasize smooth trend extrapolation, our task calls for retaining the mean level itself (Step 3) as a discriminative cue -- a choice that is empirically confirmed in Section [4.6](#sec:Train on CauK Synthetic Data){reference-type="ref" reference="sec:Train on CauK Synthetic Data"}. Conversely, the structural causal model (SCM) generator (Steps 4 and 5) originally proposed for tabular classification [@TabPFN] produces rich non-linear dependencies but lacks hallmark time series motifs such as seasonality or linear trends. By unifying kernel composition with an SCM backbone, [CauKer]{.smallcaps} inherits the local smoothness and periodic structure of Gaussian processes while simultaneously injecting causal semantics through directed edges, yielding synthetic series that are explicitly classification-oriented and more faithful to real-world temporal dynamics.

Our experiments in Section [4](#sec:experiments){reference-type="ref" reference="sec:experiments"} demonstrate that foundation models pretrained on such data exhibit improved out-of-distribution generalization and meaningful scaling behavior, outperforming models trained solely on traditional synthetic benchmarks and performing on par with those trained on much larger real-world time series corpora.

Experimental results {#sec:experiments}
====================

We now empirically evaluate the effectiveness of our proposed [CauKer]{.smallcaps} framework for pretraining classification TSFMs. Our experiments aim to answer the following key questions:

-   How does [CauKer]{.smallcaps} compare to alternative synthetic data generation methods?

-   Do TSFMs trained on [CauKer]{.smallcaps} data exhibit meaningful data and model scaling laws?

-   Can [CauKer]{.smallcaps}-generated synthetic data be a competitive replacement for real-world benchmarks in training TSFMs?

In all our experiments, we consider two recent TSFMs, namely Mantis and MOMENT. Mantis is an 8M-parameter encoder-only model pretrained using contrastive learning. For MOMENT, we use the 77M-parameter version, an encoder-decoder model pretrained via masked reconstruction. Considering these two models allows us to compare two different pretraining paradigms, as previously done in [@TSScalingLaw] for forecasting. Finally, we follow [@Mantis] and evaluate Mantis in a zero-shot regime by learning a Random Forest classifier on the embeddings of the training examples of a given dataset. For MOMENT, we follow [@MOMENT] and use a Support Vector Machine classifier. For both models, we report the test accuracy averaged over the 128 UCR datasets, where each dataset has train and test sets following [@UCR].

Q1: [CauKer]{.smallcaps} against alternative synthetic generators {#sec:CauK_vs_baselines}
-----------------------------------------------------------------

#### Experimental setup

To better understand the exact contribution of the proposed [CauKer]{.smallcaps}, we start by establishing the virtues of our synthetic data generation pipeline compared to prior work. For this, we generate four different baseline synthetic corpora, namely: 1) FPFN [@timePFN], which uses a linear model of coregionalization to sample multivariate time series; 2) KernelSynth [@Chronos], which randomly composes covariance kernels to define a Gaussian process with zero mean; 3) Mean+KernelSynth, our re-implementation of the KernelSynth baseline in which we additionally add non-zero mean functions to the GP; and 4) SCM, a reconstruction of the structural causal model proposed by @TabPFN for tabular classification [^1]. We generate univariate time series of length $T=512$, as both Mantis and MOMENT were trained on time series of this length. For a fair comparison, we fix the number of synthetic samples to 100K.

::: {#tab:CauK_comparison}
  Method     SCM     FPFN    KernelSynth   Mean+KernelSynth   [CauKer]{.smallcaps}
  --------- ------- ------- ------------- ------------------ ----------------------
  Mantis     73.49   77.52   77.70         78.20              **78.31**
  MOMENT     59.23   70.85   69.31         72.56              **74.24**

  : Average zero-shot accuracy (%) on the UCR benchmark after pretraining on synthetic corpora generated by different methods.
:::

#### Results

Table [1](#tab:CauK_comparison){reference-type="ref" reference="tab:CauK_comparison"} compares our proposal with the alternative generators. Our first observation is that SCM, the classification-tailored tabular data generation pipeline, significantly underperforms all other methods. This suggests that temporal dependencies are important for time series classification, differently from the forecasting setup, where TabPFN trained on SCM-generated data is among the strongest foundation models. We further note that the forecasting-tailored FPFN and KernelSynth also provide suboptimal results, even more so for MOMENT. In the case of Mantis, pretraining on these two datasets yields results closer to the reported performance of the original Mantis model, which can likely be explained by the architecture of Mantis incorporating strong time series classification priors (mean, standard deviation, and difference encoding in the token generator unit). On the contrary, MOMENT is a generic encoder-decoder model. We further note a distinct positive effect of including non-zero mean functions in the GP used to generate time series in our pipeline. Finally, [CauKer]{.smallcaps} improves upon this stronger baseline in both cases, highlighting the additional benefit of the causal structure. The last two observations are particularly pronounced for MOMENT, indicating that these two ingredients compensate for the lack of useful inductive biases for the task of time series classification.

#### Qualitative analysis

![Pairwise DTW distance matrix of 200 [CauKer]{.smallcaps} samples, with rows and columns sorted by hierarchical clustering membership.](dtw_synthetic_data.png){#dtw_plot width=".35\\textwidth"}

We now try to better understand why [CauKer]{.smallcaps} is particularly suitable for classification. Intuitively, we expect that having a discriminative signal in the generated data -- a clustering structure defining meaningful groups of time series -- should enable efficient classification of previously unseen samples. To verify this, we generate 200 samples using [CauKer]{.smallcaps} and calculate the matrix of pairwise Dynamic Time Warping (DTW) distances [@sakoe1978dynamic] between them. We perform hierarchical clustering on the precomputed DTW distance matrix and sort the rows and columns according to the obtained cluster memberships. We plot the resulting matrix in Figure [\[dtw\_plot\]](#dtw_plot){reference-type="ref" reference="dtw_plot"}. From it, we can observe the emergence of clear clusters (large blocks of time series having similar intra-cluster distances), as well as anomalies introduced by the anomaly mean function in the generating GP. This leads us to believe that [CauKer]{.smallcaps} generates data tailored specifically to classification, which may explain its superiority when pretraining TSFMs.
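A small re-creation of this analysis is sketched below: pairwise DTW distances, average-linkage hierarchical clustering, and reordering of the distance matrix. The $O(L^2)$ DTW here is a textbook implementation, not necessarily the one used for the figure, and the cluster count is an assumed parameter.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    """Classic dynamic-programming DTW distance between two univariate series."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

def ordered_dtw_matrix(series: list, n_clusters: int = 8) -> np.ndarray:
    n = len(series)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw(series[i], series[j])
    Z = linkage(squareform(dist), method="average")   # cluster on precomputed DTW
    order = np.argsort(fcluster(Z, n_clusters, criterion="maxclust"))
    return dist[np.ix_(order, order)]                 # block structure = clusters
```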

Q2: Scaling laws for zero-shot classification with TSFMs
--------------------------------------------------------

Scaling laws are fundamental to improving foundation models, underpinning their ability to generalize and demonstrate emergent capabilities with increased data and model scale. While scaling laws are widely studied in language and vision, their systematic exploration in the context of zero-shot time series classification is currently absent. To the best of our knowledge, our work is the first to thoroughly investigate scaling laws in this setup, which is of independent interest.


### Data scaling laws {#sec:Data Scaling Laws}

#### Experimental setup

To investigate data scaling laws, we systematically vary the pretraining dataset size using two distinct sources: (1) randomly selected subsets of the real-world UEA benchmark [@UEA] at increments of $0.1\%, 1\%, \dots, 100\%$, and (2) synthetic data generated by our proposed [CauKer]{.smallcaps} method, at scales varying from 10K up to 10M samples. We recall that both Mantis and MOMENT take univariate time series as input. This means that each channel of the multivariate UEA datasets becomes a training sample, for a total of 12M channels (train and test sets combined) from 30 different datasets; in code, this convention corresponds to the reshape shown below. Additional details are provided in Appendix [\[appendix:experimental\_details\_2\]](#appendix:experimental_details_2){reference-type="ref" reference="appendix:experimental_details_2"}.
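A sketch of the channel-flattening step, assuming arrays of shape `(series, channels, length)`:

```python
import numpy as np

def flatten_channels(X: np.ndarray) -> np.ndarray:
    """Treat each channel of a multivariate dataset as one univariate sample."""
    n_series, n_channels, length = X.shape
    return X.reshape(n_series * n_channels, length)

# e.g., 100 series with 3 channels of length 512 become 300 univariate samples
assert flatten_channels(np.zeros((100, 3, 512))).shape == (300, 512)
```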

#### Results

As illustrated in Figure [3](#fig:scaling_laws){reference-type="ref" reference="fig:scaling_laws"}, our experiments indicate that the classification accuracy on the UCR datasets does not monotonically increase with the size of the training data when models are trained on subsets of the UEA dataset (left for Mantis, middle left for MOMENT). We hypothesize that this behavior may be a result of a domain mismatch between UEA and UCR, further exacerbated by the lack of diversity within the real-world time series of UEA.

In contrast, [CauKer]{.smallcaps}-generated datasets exhibit clear and consistent scaling laws. The accuracy steadily improves with increasing data size, demonstrating the [CauKer]{.smallcaps}-generated data's effectiveness in capturing diverse patterns essential for generalizing to the UCR target set. Additionally, these results also suggest an interesting contrast between model capacities: the lightweight Mantis model achieves competitive performance even with smaller training sets, likely due to the strong time series classification priors incorporated in its architecture that we have mentioned above. In contrast, the larger and more generic MOMENT model exhibits more significant accuracy gains as the training data increases, highlighting its greater capacity to leverage large-scale data for improved representation learning. This distinction underscores the importance of jointly considering model capacity and data availability when designing scalable TSFMs.

![Scaling laws of Mantis (left) and MOMENT (middle left) with respect to dataset size when trained on different subsets of the UEA and [CauKer]{.smallcaps} datasets, and of Mantis (middle right) and MOMENT (right) with respect to model size.](dataScalingAll.png){#fig:scaling_laws width="\\textwidth"}

![Scaling laws of Mantis (left) and MOMENT (middle left) with respect to dataset size when trained on different subsets of the UEA and [CauKer]{.smallcaps} datasets, and of Mantis (middle right) and MOMENT (right) with respect to model size.](modelSizeAll.png){#fig:scaling_laws width="\\textwidth"}

### Model scaling laws {#sec:Model Scaling Laws}

#### Experimental setup

We further assess model scaling laws by varying the size of the MOMENT model (Small, Base, and Large versions with 77M, 248M, and 783M parameters, respectively) and of the Mantis model (0.75M, 2.59M, and 8.10M parameters), using both UEA and [CauKer]{.smallcaps}-generated datasets. More details on the experiments can be found in Appendix [10](#appendix:model_scaling_details){reference-type="ref" reference="appendix:model_scaling_details"}.

#### Results

The results, shown in Figure [3](#fig:scaling_laws){reference-type="ref" reference="fig:scaling_laws"} (middle right for Mantis, right for MOMENT), indicate that models trained on real-world UEA data do not exhibit consistent performance gains with increasing model size, reinforcing the notion of limited data diversity or domain mismatch. Conversely, models trained on [CauKer]{.smallcaps}-generated datasets consistently demonstrate increased accuracy as model size grows, clearly validating the presence of model scaling laws enabled by the synthetic [CauKer]{.smallcaps}-generated pretraining data. We further notice that, apart from the single outlier of MOMENT trained on the 10M-sample [CauKer]{.smallcaps} corpus, every model pretrained on [CauKer]{.smallcaps} exhibits a strictly increasing UCR accuracy as its capacity grows. The small increase for MOMENT at 10M indicates that this particular encoder has reached (or is close to) saturation; a similar saturation point can be observed for Mantis once the parameter count exceeds approximately 28M (see Appendix [10](#appendix:model_scaling_details){reference-type="ref" reference="appendix:model_scaling_details"} for a larger-scale experiment). Conversely, the unstable -- or even degrading -- trend of models pretrained on larger UEA subsets is most plausibly explained by two factors: (i) the UEA collection lacks a clean, easily learnable generative structure, and (ii) its underlying distribution is mismatched with that of UCR, making additional capacity harder to exploit.

#### Qualitative analysis

A recent work by [@bouniot2025alexnettransformersmeasuringnonlinearity] showed that the expressive power of pretrained vision models can be characterized by measuring their non-linearity. The latter depends not only on the size of the model and its architecture, but also on the pretraining dataset. To verify how the expressive power of TSFMs changes depending on the pretraining dataset, we calculate the non-linearity scores of the activation functions inside Mantis, as done in the original paper for vision transformers. We then plot the obtained values for the Mantis models pretrained on [CauKer]{.smallcaps} synthetic datasets of varying sizes and compare them to UEA in Figure [\[fig:rho\_aff\]](#fig:rho_aff){reference-type="ref" reference="fig:rho_aff"} (top row). We note that Mantis pretrained on bigger [CauKer]{.smallcaps} synthetic datasets exhibits a clear trend, while the scores barely change when increasing the size of the UEA pretraining sample. Additionally, we validate this finding using the CKA score, which compares the similarity of internal representations of neural networks [@kornblithCKA].

![Non-linearity scores (top row) and CKA scores of Mantis pretrained on [CauKer]{.smallcaps} and UEA datasets of varying sizes.](rho_aff_cka_cauker_uea.png){#fig:rho_aff width=".5\\textwidth"}

Lower values of CKA indicate that the hidden layers transform the inputs in a more drastic, non-linear way. We see that pretraining on [CauKer]{.smallcaps} induces a structural change in the model's inner workings once the dataset size exceeds 100K. In the case of real-world UEA data, the CKA scores of the Mantis hidden layers barely change even when the pretraining sample size grows from 600K (5%) to 12M (100%). This, once again, hints at the fact that the model does not exploit the increasing sample size in this case.
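For reference, a minimal implementation of the linear variant of CKA from @kornblithCKA between two activation matrices (rows index the same inputs, columns the features of two layers or models):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activation matrices X (n, d1) and Y (n, d2)."""
    X = X - X.mean(axis=0)                       # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2   # HSIC with linear kernels
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```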

Training time scaling laws {#sec:TrainTimeScaling}
--------------------------

We now study the training time scaling law, which aims at identifying the gains in test accuracy brought by additional compute in the form of longer model optimization.

#### Experimental setup

We track the evolution of zero-shot accuracy with training epochs for Mantis and MOMENT pretrained on two corpora, namely a 10% subset of the real-world UEA benchmark and a synthetic set of 1M series generated by [CauKer]{.smallcaps}. During training, we save model checkpoints every five epochs for Mantis and every two epochs for MOMENT, and evaluate zero-shot classification performance on the full UCR test suite at each checkpoint.

#### Results

![Zero-shot UCR accuracy of Mantis and MOMENT over pretraining epochs on the UEA (10%) and [CauKer]{.smallcaps} (1M) corpora.](trainTimeScalingAll.png){#fig:train_time_scaling width=".45\\textwidth"}

As illustrated in Figure [\[fig:train\_time\_scaling\]](#fig:train_time_scaling){reference-type="ref" reference="fig:train_time_scaling"}, accuracy rises steadily when the models are trained on [CauKer]{.smallcaps}; additional epochs translate into consistent gains for both architectures. When pretrained on UEA, however, the accuracy curves remain flat or fluctuate, especially for MOMENT, indicating that prolonged optimization yields little benefit on this dataset. These findings echo the data- and model-scaling observations reported earlier: causally structured, diverse [CauKer]{.smallcaps} data sustains learning over long horizons.

Comparison with forecasting scaling laws
----------------------------------------

We conclude this section by relating our results to those reported for the time series forecasting task. To this end, we note that our empirical insights differ from prior work [@TSscalingLaw2; @TSScalingLaw; @TSscalingLaw3] in several ways. First, while we observed clear data- and model-scaling trends when pretraining on [CauKer]{.smallcaps} data, we also found signs of saturation at high data volumes or model capacities. Although [@TSscalingLaw2; @TSScalingLaw] reported rather flat scaling laws for real-world multivariate TSFMs, their curves were still monotonic. Second, our observed accuracy improvements follow sub-exponential rather than clean exponential growth. This suggests that the scaling dynamics in time series classification may follow different patterns compared to other modalities like language or vision, and that a more systematic, theory-driven study of such behavior is needed to fully understand its implications.


In summary, our experiments confirm that [CauKer]{.smallcaps}-generated data effectively enables both data and model scaling laws, crucial for the future development of high-performing, generalizable, and robust time series foundation models.

Q3: Sample-efficient pretraining of TSFMs using [CauKer]{.smallcaps} synthetic data {#sec:Train on CauK Synthetic Data}
-----------------------------------------------------------------------------------

#### Experimental setup

We study the performance and sample efficiency of pretraining the Mantis and MOMENT foundation models on different datasets. Our main goal is to show that the performance of both models, originally pretrained on 1.89M (Mantis) and 13M (MOMENT) unique time series, can be almost matched by pretraining on a smaller synthetic dataset generated using [CauKer]{.smallcaps}. For the latter, we generate as few as 100K samples for Mantis and 10M for MOMENT to account for the model size difference (8M vs. 77M parameters). As before, we include in our study a baseline given by pretraining Mantis and MOMENT on 100K samples of the real-world UEA time series classification collection. Additionally, we also experiment with a subset of 100K time series randomly drawn from standard forecasting datasets (ETTh1, ETTh2, ETTm1, ETTm2, Electricity, ExchangeRate, Illness, Traffic, Weather) [@Etth; @Electricity; @ExchangeRate; @Illness; @Traffic; @Weather]. Although no prior work trained a classification model on such data, we include it to verify whether forecasting benchmarks can be a good alternative for classification TSFM pretraining.

#### Results

From the results presented in Table [4](#tab:real_comparison){reference-type="ref" reference="tab:real_comparison"}, we note that the performance of Mantis and MOMENT can be almost matched by pretraining them on synthetic datasets that are $\sim\!20\times$ and $\sim\!1.3\times$ smaller than the original pretraining datasets used by the respective papers. The accuracy drop in the case of Mantis is less than 0.1%, while for MOMENT it barely exceeds 1%. This suggests that the synthetic data generated by [CauKer]{.smallcaps} makes model pretraining more sample-efficient. We also note that the training loss and test accuracy of Mantis pretrained on 100K and 1.89M time series exhibit very different behavior. For the synthetic dataset, the training loss remains higher, indicating that it is harder to fit, likely due to the high diversity of the generated time series. Yet, the test accuracy in this case steadily improves and surpasses the accuracy of the original model, which quickly fits the real-world pretraining dataset. This is reminiscent of the MOMENT pretraining, which required only 2 epochs to converge [@MOMENT] (even for the largest 783M model).

In addition to this, the reported UCR classification accuracies of the original Mantis and MOMENT models represent *in-distribution* performance, since their respective training corpora include UCR train samples. In this sense, these scores may serve as a practical upper bound for zero-shot accuracy, beyond which out-of-distribution generalization is unlikely without direct exposure to test distributions. Finally, we note that the comparison with two other pretraining dataset candidates leads to strictly worse results, despite their comparable size.

![Performance comparison of Mantis and MOMENT models pretrained on different datasets. [CauKer]{.smallcaps}-generated pretraining data nearly matches the performance of the original TSFMs while being more sample-efficient. The training loss and test accuracy corresponding to the first two rows, illustrated in the right figure, show that synthetic data is harder to fit but leads to a smoother increase of the test accuracy across epochs.](loss-and-perf-over-epochs-version1.png){#tab:real_comparison width=".9\\linewidth"}

Conclusion {#sec:conclusions}
==========

In this work, we introduced [CauKer]{.smallcaps}, a novel synthetic data generation framework tailored for time series classification. By integrating Gaussian Process kernel composition with Structural Causal Models, [CauKer]{.smallcaps} generates synthetic datasets that are both temporally realistic and causally coherent. We demonstrated that TSFMs pretrained solely on [CauKer]{.smallcaps}-generated data can match the performance of models trained on larger real-world datasets. Furthermore, our study provides the first in-depth analysis of data and model scaling laws in zero-shot time series classification, establishing that such scaling effects emerge clearly when using synthetic data, but are irregular or absent when training on commonly used real-world datasets.

Our findings underscore a key insight already known in vision and natural language processing: the quality and structure of pretraining data have a profound impact on the generalization performance of TSFMs. While much recent progress in the time series community has focused on architectural innovations, our results suggest that equivalent gains can be achieved through the principled design of synthetic training data. We hope this work encourages the community to direct greater attention to the design, analysis, and benchmarking of time series training datasets as a complementary path toward building scalable, general-purpose time series foundation models.

#### Limitations

Similar to prior work on scaling laws in time series forecasting [@TSScalingLaw], we considered only two models, each following a different pretraining paradigm. As our study was already quite compute-intensive, we believe that this choice is justified, yet adding more models (such as [@UniTS]) would be a valuable extension. Along the same lines, we did not consider large-scale forecasting benchmarks such as LOTSA [@MOIRAIS] and Time-300B [@shi2025timemoebillionscaletimeseries], as we have observed that forecasting benchmarks are of limited utility for classification, especially for MOMENT.

Overview of pretraining datasets for time series foundation models {#Related work extension}
==================================================================

Table [2](#tab:pretraining-datasets){reference-type="ref" reference="tab:pretraining-datasets"} summarizes the pretraining datasets used by representative time series foundation models. For each model, we report whether synthetic data was used, the total number of time points and time series samples, and whether the datasets are publicly available. The table is organized alphabetically by model name.

```{=latex}
\renewcommand{\arraystretch}{1.15}
```
::: {#tab:pretraining-datasets}
  **Model**                     **Synthetic**   **Real**   **Time Points**   **Series Count**   **Open**
  ---------------------------- --------------- ---------- ----------------- ------------------ ----------
  Chronos [@Chronos]                 Yes          Yes            84B               890K           Yes
  ForecastPFN [@ForecastPFN]         Yes           No            60M               300K           Yes
  Mantis [@Mantis]                   No           Yes            N/A           $\sim$1.89M        Yes
  MOMENT [@MOMENT]                   No           Yes           1.23B              13M            Yes
  NuTime [@Nutime]                   No           Yes            60M              1.89M           Yes
  TabPFN [@TabPFN]                   Yes           No            N/A              9.216M           No
  TimePFN [@timePFN]                 Yes           No         $\sim$200M         $\sim$3M          Yes
  UniTS [@UniTS]                     No           Yes            35M                6K            Yes

  : Overview of pretraining datasets for Time Series Foundation Models (TSFMs).
:::

Loss and architecture of Mantis and MOMENT {#Details of Mantis and MOMENT}
==========================================

#### Contrastive learning loss of Mantis.

Given an encoder $F: \mathbb{R}^{t} \rightarrow \mathbb{R}^{q}$, we consider random augmentations $\phi, \psi \sim \mathcal{U}(\mathcal{T})$. The similarity between two augmented samples is measured after projecting their embeddings to a new dimension $q'$ via $g: \mathbb{R}^{q} \rightarrow \mathbb{R}^{q'}$. Specifically, the cosine similarity is defined as: $$s_{\text{cos}}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}^{\top}\mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}, \quad \forall (\mathbf{a}, \mathbf{b}) \in \mathbb{R}^{q'} \times \mathbb{R}^{q'}.$$

Given a batch $B = \{\mathbf{x}_i\}_{i=1}^{b}$, we compute pairwise similarities: $$\mathbf{s}_i(\phi, \psi) = \left[s_{\text{cos}}\left(g \circ F \circ \phi(\mathbf{x}_i), g \circ F \circ \psi(\mathbf{x}_j)\right)\right]_{j=1}^{b} \in \mathbb{R}^{b}.$$ The Mantis encoder $F$ and projector $g$ are optimized by minimizing the contrastive loss: $$\mathcal{L}_{\text{contrastive}} = \sum_{i=1}^{b} l_{\text{ce}}\left(\frac{\mathbf{s}_i(\phi, \psi)}{T}, i\right),$$ where $l_{\text{ce}}$ is the cross-entropy loss and $T$ is a temperature parameter set to 0.1.
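A sketch of this objective in PyTorch, assuming `z1[i]` and `z2[i]` hold the projected embeddings $g \circ F \circ \phi(\mathbf{x}_i)$ and $g \circ F \circ \psi(\mathbf{x}_i)$ for one batch of size $b$:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, T: float = 0.1):
    z1 = F.normalize(z1, dim=1)             # unit norm => dot product = cosine sim
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.T / T                  # (b, b) matrix of s_i(phi, psi) / T
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on diagonal
    return F.cross_entropy(logits, targets)
```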

#### Masked learning loss of MOMENT.

Given a univariate time series $\mathcal{T} \in \mathbb{R}^{1 \times T}$, it is segmented into $N$ disjoint patches of length $P$. Each patch is mapped into a $D$-dimensional embedding, and each masked patch is replaced with a learnable mask embedding $\text{[MASK]} \in \mathbb{R}^{1 \times D}$. The resulting embeddings are fed into a transformer encoder, producing transformed embeddings that are then decoded by a lightweight reconstruction head $h_{\text{rec}}$. The masked reconstruction loss is the mean squared error (MSE): $$\mathcal{L}_{\text{masked}} = \frac{1}{|\Omega|}\sum_{n \in \Omega} \left\|\mathcal{T}_n - h_{\text{rec}}\bigl(F(\tilde{\mathcal{T}})\bigr)_n\right\|^2,$$ where $\tilde{\mathcal{T}}$ denotes the masked input sequence and $\Omega$ the set of indices corresponding to masked patches.
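A sketch of this loss, assuming `recon` and `target` hold per-patch reconstructions and ground truth of shape `(batch, N, P)` and `mask` is a boolean `(batch, N)` tensor flagging the patches in $\Omega$:

```python
import torch

def masked_mse(recon: torch.Tensor, target: torch.Tensor, mask: torch.Tensor):
    per_patch = ((recon - target) ** 2).mean(dim=-1)  # squared error per patch
    return (per_patch * mask).sum() / mask.sum()      # average over masked only
```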

#### Model architectures.

For the masked learning approach, MOMENT leverages a Transformer-based architecture derived from the T5 family [@T5]. Specifically, MOMENT employs an 8-, 12-, or 24-layer Transformer encoder with hidden dimensions $D=512$, $768$, and $1024$, and 8, 12, and 16 attention heads for the "Small", "Base", and "Large" models, respectively. The model processes input time series by segmenting them into $N=64$ patches of length $P=8$, applying positional embeddings, and then reconstructing masked patches.

Conversely, Mantis utilizes a Vision Transformer (ViT) [@VIT] architecture. Initially, the input time series is divided into tokens, to which a learnable class token is appended. Positional embeddings are added to encode temporal information explicitly. The ViT unit consists of 6 transformer layers, each comprising multi-head attention with 8 heads. The final output is derived from the class token's embedding after aggregation by the transformer layers. It is worth noting that Mantis employs a customized tokenizer; for detailed information, please refer to the original paper[^2].

Details of [CauKer]{.smallcaps} {#Details of CauK}
===============================

Pseudocode of [CauKer]{.smallcaps}
--------------------------------------

The pipeline combines the temporal structure modeled by Gaussian processes with the flexible dependency modeling of structural causal models. Specifically, the algorithm first samples a number of root signals from GP priors constructed via randomly composed kernels and mean functions. It then propagates these signals through a randomly generated DAG, where each edge applies a nonlinear transformation drawn from an activation function bank. Finally, a fixed number of node outputs are selected as observed time series variables, each interpolated to a target length. This modular and stochastic design ensures rich diversity and causal consistency in the generated synthetic data.
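As one example of the final step, interpolating a node's output to a fixed target length can be done with simple linear interpolation; this is a sketch, and [CauKer]{.smallcaps} may use a different resampling scheme.

```python
import numpy as np

def to_fixed_length(x: np.ndarray, target_len: int = 512) -> np.ndarray:
    """Linearly resample a series to `target_len` evenly spaced points."""
    src = np.linspace(0.0, 1.0, num=len(x))
    dst = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(dst, src, x)
```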

![Kernel 1](kernel_visuals/kernel_1_cov.png){width="\\linewidth"}

![Kernel 2](kernel_visuals/kernel_2_cov.png){width="\\linewidth"}

![Kernel 3](kernel_visuals/kernel_3_cov.png){width="\\linewidth"}

![Kernel 4](kernel_visuals/kernel_4_cov.png){width="\\linewidth"}

![Kernel 5](kernel_visuals/kernel_5_cov.png){width="\\linewidth"}

![Kernel 6](kernel_visuals/kernel_6_cov.png){width="\\linewidth"}

![](kernel_visuals/kernel_1_ts.png){width="\\linewidth"}

![](kernel_visuals/kernel_2_ts.png){width="\\linewidth"}

![](kernel_visuals/kernel_3_ts.png){width="\\linewidth"}

![](kernel_visuals/kernel_4_ts.png){width="\\linewidth"}

![](kernel_visuals/kernel_5_ts.png){width="\\linewidth"}

![Visualizations of covariance matrices (top) and corresponding sampled time series (bottom) from each base kernel in the kernel bank.](kernel_visuals/kernel_6_ts.png){#fig:kernel_visualization_grid width="\\linewidth"}

Details of banks
----------------

Figure [10](#fig:kernel_visualization_grid){reference-type="ref" reference="fig:kernel_visualization_grid"} provides illustrative examples of the six representative kernels selected from our base kernel bank. The top row of the figure displays the covariance matrices induced by each kernel over 1024 evenly spaced time points, while the bottom row shows corresponding sample paths drawn from the Gaussian Process (GP) prior using these kernels.

Specifically, the illustrated kernels include:

-   **ExpSineSquared** --- captures periodic patterns with a fixed wavelength; produces strongly oscillatory samples with global smoothness.

-   **DotProduct** --- induces linear trend behavior; sample paths grow or decay steadily over time.

-   **RBF (Radial Basis Function)** --- generates smooth, localized fluctuations around zero with short-range correlations.

-   **RationalQuadratic** --- a scale mixture of RBF kernels, allowing for multiscale smooth variations in the signal.

-   **WhiteKernel** --- models uncorrelated noise; sample paths resemble pure Gaussian noise with no temporal structure.

-   **ConstantKernel** --- generates flat constant signals; serves as a component for additive models with nonzero mean.

These six kernels represent only a small subset of our full kernel bank. In practice, we construct a much larger kernel bank comprising 36 distinct kernels. This is achieved by varying the hyperparameters of each kernel (e.g., length-scale, periodicity, noise level, amplitude) across a range of scales to capture diverse temporal dynamics. For instance, we use multiple versions of the ExpSineSquared kernel with different periodicities to simulate both high- and low-frequency periodic patterns. Similarly, we vary the length-scale of RBF and RationalQuadratic kernels to control smoothness and correlation range.

The images presented in Figure [10](#fig:kernel_visualization_grid){reference-type="ref" reference="fig:kernel_visualization_grid"} serve as illustrative examples only. During synthetic data generation, kernels are sampled from the full kernel bank, which offers significantly richer diversity than what is shown here. These base kernels are subsequently composed using random additive and multiplicative operations to define flexible Gaussian process priors for root node generation in the [CauKer]{.smallcaps} pipeline.
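
As one way to assemble such a bank, the sketch below enumerates 36 kernels by sweeping the hyperparameters of the six base families and composes two of them at random; the specific grids shown are illustrative assumptions, not the exact values used in [CauKer]{.smallcaps}.

```python
import numpy as np
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, RationalQuadratic, DotProduct, WhiteKernel,
    ConstantKernel)

rng = np.random.default_rng(0)

# 6 families x 6 hyperparameter settings = 36 kernels (illustrative grids).
kernel_bank = (
    [ExpSineSquared(periodicity=p) for p in (0.05, 0.1, 0.2, 0.5, 1.0, 2.0)]
  + [RBF(length_scale=l) for l in (0.02, 0.05, 0.1, 0.2, 0.5, 1.0)]
  + [RationalQuadratic(length_scale=l, alpha=1.0)
     for l in (0.02, 0.05, 0.1, 0.2, 0.5, 1.0)]
  + [DotProduct(sigma_0=s) for s in (0.0, 0.2, 0.5, 1.0, 2.0, 5.0)]
  + [WhiteKernel(noise_level=n) for n in (0.01, 0.05, 0.1, 0.5, 1.0, 2.0)]
  + [ConstantKernel(constant_value=c) for c in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0)]
)
assert len(kernel_bank) == 36

def compose(rng, bank):
    """Randomly combine two bank kernels additively or multiplicatively."""
    i, j = rng.choice(len(bank), size=2, replace=False)
    return bank[i] + bank[j] if rng.random() < 0.5 else bank[i] * bank[j]
```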

Figure [11](#fig:mean_function_examples){reference-type="ref" reference="fig:mean_function_examples"} presents the four representative mean functions used in our synthetic data generation pipeline. Each subplot illustrates a randomly sampled instance from the corresponding function class. These functions can be combined multiplicatively or additively during Gaussian process sampling to enrich the diversity of generated signals.

-   **Zero Mean:** A baseline function returning a constant zero across the time axis, corresponding to the standard GP assumption of zero-centered priors.

-   **Linear Mean:** A simple affine transformation $a \cdot t + b$, enabling trends such as monotonic increases or decreases over time.

-   **Exponential Mean:** A parametric form $a \cdot \exp(b t)$ that introduces strong, nonlinear growth or decay patterns into the signal.

-   **Sparse Anomalies:** A piecewise-constant mean vector with a few randomly placed spikes, simulating rare disruptive events (e.g., faults, attacks, regime shifts).

These mean functions serve as building blocks for composing realistic non-stationary temporal structures in synthetic time series. In the generation process, two functions are randomly selected and combined (either by summation or elementwise multiplication), forming the final mean vector used in GP sampling. The images shown in Figure [11](#fig:mean_function_examples){reference-type="ref" reference="fig:mean_function_examples"} are illustrative samples; in practice, stochastic variation over parameters (slopes, amplitudes, etc.) ensures that each generated series presents unique mean behavior.
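
A minimal sketch of this mean-function bank and of the random pairwise combination is given below; the parameter ranges and spike counts are illustrative assumptions, not the exact values used in our pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_mean(t):
    return np.zeros_like(t)

def linear_mean(t):
    a, b = rng.normal(), rng.normal()            # random slope and intercept
    return a * t + b

def exponential_mean(t):
    a, b = rng.normal(), rng.uniform(-3.0, 3.0)  # growth (b > 0) or decay (b < 0)
    return a * np.exp(b * t)

def sparse_anomalies(t):
    m = np.zeros_like(t)
    idx = rng.choice(len(t), size=rng.integers(1, 6), replace=False)
    m[idx] = rng.normal(0.0, 5.0, size=len(idx))  # a few random spikes
    return m

MEAN_BANK = [zero_mean, linear_mean, exponential_mean, sparse_anomalies]

def sample_mean(t):
    """Combine two randomly chosen mean functions by sum or elementwise product."""
    i, j = rng.choice(len(MEAN_BANK), size=2, replace=False)
    f, g = MEAN_BANK[i], MEAN_BANK[j]
    return f(t) + g(t) if rng.random() < 0.5 else f(t) * g(t)
```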

![Examples of four mean function types used in the synthetic data pipeline. Each function introduces distinct temporal structure, contributing to the diversity and realism of generated sequences.](kernel_visuals/mean_functions.png){#fig:mean_function_examples width="95%"}

#### Activation function bank.

In addition to kernel and mean banks, [CauKer]{.smallcaps} employs a diverse *activation function bank* $\mathcal{A}$ to propagate nonlinear transformations through the structural causal graph. Each edge in the DAG is randomly assigned an activation from this bank, which governs how parent node values influence their children. The activation bank comprises both classical and domain-specific transformations:

-   **Linear:** Identity or affine mappings $a x + b$, preserving proportional signal propagation.

-   **ReLU:** Rectified linear units $\max(0, x)$, introducing sparsity and piecewise linearity.

-   **Sigmoid:** Smooth squashing function $\sigma(x) = 1 / (1 + e^{-x})$, modeling saturation effects.

-   **Sinusoidal:** Periodic modulations $\sin(x)$, inducing wave-like behaviors.

-   **Modulo:** Modular transformations $x \bmod c$, yielding abrupt nonlinearities or periodic clipping.

-   **Leaky ReLU:** Slope-preserving variant of ReLU, ensuring non-zero gradients for negative inputs.

These nonlinearities enhance the diversity of functional relationships within the generated synthetic time series and allow the resulting signals to exhibit complex, structured dependencies. As illustrated in the SCM pipeline, these functions are applied edge-wise to linear combinations of parent signals before assigning values to child nodes.
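
The sketch below mirrors this bank; the affine coefficients, the modulo constant $c$, the leaky-ReLU slope, and the exact interleaving of activation and weighting on each edge are illustrative assumptions rather than the implemented values.

```python
import numpy as np

ACTIVATION_BANK = [
    lambda x: 0.8 * x + 0.1,                  # linear / affine
    lambda x: np.maximum(0.0, x),             # ReLU
    lambda x: 1.0 / (1.0 + np.exp(-x)),       # sigmoid
    np.sin,                                   # sinusoidal
    lambda x: np.mod(x, 1.5),                 # modulo with c = 1.5 (assumed)
    lambda x: np.where(x > 0.0, x, 0.01 * x)  # leaky ReLU, slope 0.01 (assumed)
]

def propagate_to_child(parent_signals, rng):
    """Each incoming edge activates its weighted parent; the child sums them."""
    child = np.zeros_like(parent_signals[0])
    for s in parent_signals:
        f = ACTIVATION_BANK[rng.integers(len(ACTIVATION_BANK))]
        child += f(rng.normal() * s)          # one random activation per edge
    return child
```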

Experimental details of Section [4.2.1](#sec:Data Scaling Laws){reference-type="ref" reference="sec:Data Scaling Laws"} {#experimental-details-of-section-secdata-scaling-laws}
=======================================================================================================================

In our scaling-law experiments, we systematically evaluated the performance of two distinct models, Mantis and MOMENT, across varying dataset sizes from both real-world and synthetic sources. We adopted the official 8M-parameter configuration of Mantis as released in its open-source repository, which comprises a 6-layer ViT encoder with 8 attention heads and a hidden dimension of 256. The classification head was a Random Forest classifier trained on frozen embeddings.

For MOMENT, we used the officially supported \"google/flan-t5-small\" variant, containing 77M parameters, as the encoder backbone. This configuration is one of the pretrained variants endorsed in the original MOMENT framework. During training, we froze the encoder and trained only the classification head, implemented as a Support Vector Machine (SVM). This setup mirrors the zero-shot classification evaluation protocol used in the prior TSFM literature.

For both models, we varied the training data size as follows: for the real-world UEA dataset, subsets ranging from 0.1% to 100% (12.7K to 12.67M samples) were randomly sampled; for synthetic data, we generated samples with our [CauKer]{.smallcaps} method at the 10K, 50K, 100K, 500K, 1M, 5M, and 10M scales. All series were univariate with length 512. The full list of data sizes and the corresponding classification accuracies on the UCR benchmark is reported in Table [3](#tab:scaling_law_data){reference-type="ref" reference="tab:scaling_law_data"}.

[\[appendix:experimental\_details\_2\]]{#appendix:experimental_details_2 label="appendix:experimental_details_2"}

::: {#tab:scaling_law_data}
   **Model**      **Train Set**       **Data Size**   **UCR Accuracy (%)**
  ----------- ---------------------- --------------- ----------------------
                       UEA                127K               72.42
                       UEA                633K               71.09
                       UEA                1.27M              70.49
                       UEA                6.33M              72.09
                       UEA               12.67M              72.10
               [CauKer]{.smallcaps}       100K               74.24
               [CauKer]{.smallcaps}       500K               74.35
               [CauKer]{.smallcaps}        1M                75.21
               [CauKer]{.smallcaps}        5M                77.01
               [CauKer]{.smallcaps}        10M               77.49
                       UEA                12.7K              75.67
                       UEA                127K               76.21
                       UEA                633K               75.83
                       UEA                1.27M              75.39
                       UEA                3.68M              76.33
                       UEA               12.67M              71.93
               [CauKer]{.smallcaps}        10K               76.91
               [CauKer]{.smallcaps}        50K               78.08
               [CauKer]{.smallcaps}       100K               78.55
               [CauKer]{.smallcaps}        1M                78.91
               [CauKer]{.smallcaps}        10M               79.09

  : Exact accuracy values used in the scaling law plots (Figure [3](#fig:scaling_laws){reference-type="ref" reference="fig:scaling_laws"}).
:::

Experimental details of Section [4.2.2](#sec:Model Scaling Laws){reference-type="ref" reference="sec:Model Scaling Laws"} {#appendix:model_scaling_details}
=========================================================================================================================

To investigate model scaling laws, we evaluated a range of model capacities for both MOMENT and Mantis using synthetic datasets generated by [CauKer]{.smallcaps}. For MOMENT, we adopted the official series of models given by:

-   **flan-t5-small** (77M parameters),

-   **flan-t5-base** (248M parameters),

-   **flan-t5-large** (783M parameters).

For the Mantis encoder, we varied the transformer depth and width while keeping the sequence length fixed at 512 and using the same patching configuration. The model variants are as follows:

-   **0.75M**: `hidden_dim=256`, `transf_depth=1`, `transf_num_heads=2`, `transf_mlp_dim=512`, `transf_dim_head=128`.

-   **2.59M**: same as above, with `transf_depth=3`, `transf_num_heads=4`.

-   **8.10M**: same as above, with `transf_depth=6`, `transf_num_heads=8`.

-   **28.56M**: same as above, with `transf_depth=12`, `transf_num_heads=16`.

-   **114.14M**: `hidden_dim=512`, `transf_depth=12`, `transf_num_heads=16`, `transf_mlp_dim=1024`, `transf_dim_head=256`.

All Mantis variants used the following fixed parameters: `seq_len=512`, `num_patches=32`, `scalar_scales=None`, `hidden_dim_scalar_enc=32`, and `epsilon_scalar_enc=1.1`. The model output embeddings were classified using a Random Forest classifier trained on frozen features.
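
For reference, the variant grid above can be written as plain configuration dictionaries, as in the sketch below; `build_mantis` is a placeholder standing in for the actual Mantis constructor, whose exact signature we do not reproduce here.

```python
COMMON = dict(seq_len=512, num_patches=32, scalar_scales=None,
              hidden_dim_scalar_enc=32, epsilon_scalar_enc=1.1)

VARIANTS = {
    "0.75M":   dict(hidden_dim=256, transf_depth=1,  transf_num_heads=2,
                    transf_mlp_dim=512,  transf_dim_head=128),
    "2.59M":   dict(hidden_dim=256, transf_depth=3,  transf_num_heads=4,
                    transf_mlp_dim=512,  transf_dim_head=128),
    "8.10M":   dict(hidden_dim=256, transf_depth=6,  transf_num_heads=8,
                    transf_mlp_dim=512,  transf_dim_head=128),
    "28.56M":  dict(hidden_dim=256, transf_depth=12, transf_num_heads=16,
                    transf_mlp_dim=512,  transf_dim_head=128),
    "114.14M": dict(hidden_dim=512, transf_depth=12, transf_num_heads=16,
                    transf_mlp_dim=1024, transf_dim_head=256),
}

def build_mantis(**kwargs):
    """Placeholder: substitute the actual Mantis model constructor here."""
    return kwargs

models = {name: build_mantis(**COMMON, **cfg) for name, cfg in VARIANTS.items()}
```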

This design allows us to jointly assess the impact of model depth, width, and hidden dimensionality on zero-shot classification performance under a consistent synthetic data regime.

Table [\[tab:model\_scaling\_values\]](#tab:model_scaling_values){reference-type="ref" reference="tab:model_scaling_values"} reports the exact accuracy values corresponding to the model scaling plots shown in Figure [12](#fig:model_scaling_Mantis2){reference-type="ref" reference="fig:model_scaling_Mantis2"}. For both MOMENT and Mantis, we list results under varying model sizes and dataset configurations.

```{=latex}
\renewcommand{\arraystretch}{1.25}
```
![Accuracy on the UCR benchmark for Mantis models of varying sizes trained on UEA subsets and on synthetic [CauKer]{.smallcaps} data.](mantis_model_size_accuracy.png){#fig:model_scaling_Mantis2 width=".5\\linewidth"}

Experimental details of Section [4.6](#sec:Train on CauK Synthetic Data){reference-type="ref" reference="sec:Train on CauK Synthetic Data"} {#appendix:experimental_details_1}
===========================================================================================================================================

For all compared models, we used the epoch with the lowest training loss as the checkpoint for final evaluation. The official setting for Mantis involves training for 100 epochs, while MOMENT is typically trained for 2 epochs. For our experiments, we trained Mantis for 100 epochs and MOMENT for 10 epochs to allow sufficient convergence, consistent with our goal of achieving the best performance on the [CauKer]{.smallcaps} and UEA datasets. For the MOMENT model, we utilized the \"google/flan-t5-small\" base model with 77M parameters, trained on both the [CauKer]{.smallcaps} and UEA datasets. The official MOMENT checkpoint used in our experiments, pretrained on the Time Series Pile and based on \"google-t5/t5-small\", has 60M parameters.

Visualization of embeddings {#appendix:scm_kernel_visualization}
===========================

![Frequency Analysis](Umap/Umap/frequency_analysis_cauker_embeddings.png){#fig:freq_scm_kernel width="\\linewidth"}

![Slope Analysis](Umap/Umap/slope_analysis_cauker_embeddings.png){#fig:slope_scm_kernel width="\\linewidth"}

![Bias Analysis](Umap/Umap/bias_analysis_cauker_embeddings.png){#fig:bias_scm_kernel width="\\linewidth"}

![Combined Analysis](Umap/Umap/combined_analysis_cauker_embeddings.png){#fig:combined_scm_kernel width="\\linewidth"}

We generated univariate time series of length $L = 512$ using the [CauKer]{.smallcaps} pipeline. For the frequency class, 20 periodic kernels with periods evenly spaced in $[50, 500]$ were used. For the slope class, we sampled slopes in $[0.1, 10.0]$, and for the bias class, biases were drawn from $[-5, 5]$. Each parameter setting was instantiated 30 times to ensure balanced coverage across the range. We used Mantis 8M, trained on 10M [CauKer]{.smallcaps} samples, to encode the time series.
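
The sketch below reconstructs this probe set under simplifying assumptions: pure sinusoids, noisy ramps, and noisy constant levels stand in for the actual periodic-kernel, slope, and bias draws from the [CauKer]{.smallcaps} pipeline, and the frozen-encoder and UMAP steps are only indicated in comments.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N_REP = 512, 30
t = np.arange(L, dtype=float)

series, labels = [], []
for period in np.linspace(50, 500, 20):        # frequency class
    for _ in range(N_REP):
        phase = rng.uniform(0.0, 2.0 * np.pi)
        series.append(np.sin(2.0 * np.pi * t / period + phase))
        labels.append(("frequency", period))
for slope in np.linspace(0.1, 10.0, 20):       # slope class
    for _ in range(N_REP):
        series.append(slope * (t / L) + rng.normal(0.0, 0.1, L))
        labels.append(("slope", slope))
for bias in np.linspace(-5.0, 5.0, 20):        # bias class
    for _ in range(N_REP):
        series.append(np.full(L, bias) + rng.normal(0.0, 0.1, L))
        labels.append(("bias", bias))

X = np.stack(series)                           # (1800, 512) probe set
# Z = mantis_encoder.transform(X)              # frozen Mantis 8M embeddings
# import umap; coords = umap.UMAP().fit_transform(Z)  # 2-D projection
```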

The UMAP projections reveal that the encoder learned structured and disentangled representations:

-   In the frequency, slope, and bias views (Figures [13](#fig:freq_scm_kernel){reference-type="ref" reference="fig:freq_scm_kernel"}--[15](#fig:bias_scm_kernel){reference-type="ref" reference="fig:bias_scm_kernel"}), we observe continuous color gradients along one principal direction of the embedding, confirming that the encoder preserves the underlying generative factor in a smooth and ordered fashion.

-   In the combined view (Figure [16](#fig:combined_scm_kernel){reference-type="ref" reference="fig:combined_scm_kernel"}), embeddings from the three generation processes form distinct clusters with minimal overlap, indicating that the encoder effectively disentangles the semantic attributes of each synthetic category.

-   The alignment of UMAP geometry with the known generative parameters supports the conclusion that the model did not merely memorize waveform patterns, but instead internalized semantically meaningful features of the data.

These results confirm that synthetic pretraining on [CauKer]{.smallcaps} enables the encoder to learn robust, interpretable, and transferable representations even in the absence of real data.

[^1]: As the original generator of [@TabPFN] is not open-sourced, we followed the algorithmic description in the paper and validated our implementation on the illustrative examples provided therein.

[^2]: <https://github.com/vfeofanov/Mantis>
