---
abstract: |
  Learning time series foundation models has been shown to be a promising approach for zero-shot time series forecasting across diverse time series domains. Insofar as scaling has been a critical driver of performance of foundation models in other modalities such as language and vision, much recent work on time series foundation modeling has focused on scaling. This has resulted in time series foundation models with hundreds of millions of parameters that are, while performant, inefficient and expensive to use in practice. This paper describes a simple recipe for learning efficient foundation models for zero-shot time series forecasting that are orders of magnitude smaller. We show that large-scale transformers are not necessary: small hybrid models that interleave long convolution and linear RNN layers (in particular DeltaNet layers) can match the performance of larger transformer-based models while being more than a hundred times smaller. We also describe several data augmentation and inference strategies that further improve performance. This recipe results in *Reverso*, a family of efficient time series foundation models for zero-shot forecasting that significantly push the performance-efficiency Pareto frontier.
bibliography:
- references.bib
---

```{=latex}
\newcommand{\theHalgorithm}{\arabic{algorithm}}
```
```{=latex}
\newcommand{\shin}[1]{{\color{blue}Shin: #1}}
```
```{=latex}
\twocolumn[
  \icmltitle{Reverso: Efficient Time Series Foundation Models for Zero-shot Forecasting}

  % \icmlsetsymbol{equal}{*}


  \begin{icmlauthorlist}
    \icmlauthor{Xinghong Fu}{mit}
\icmlauthor{Yanhong Li}{ai2}
\icmlauthor{Georgios Papaioannou}{qube}
    \icmlauthor{Yoon Kim}{mit}
  \end{icmlauthorlist}
  \vspace{2mm} \centerline{ \url{https://github.com/shinfxh/reverso}}
  \icmlaffiliation{qube}{Qube Research \& Technologies}
  \icmlaffiliation{ai2}{Allen Institute for AI}
  \icmlaffiliation{mit}{Massachusetts Institute of Technology}
  \icmlcorrespondingauthor{Xinghong Fu}{fxh@mit.edu}

  % \icmlkeywords{Time series, Pretraining}

  \vskip 0.3in
]
```
```{=latex}
\printAffiliationsAndNotice{}
```
```{=latex}
\begin{figure*}[t]\centering
    \includegraphics[width=0.7\linewidth]{figures/gift_eval_pareto_overall.png}
    \caption{Zero-shot performance on the full Gift-Eval test set \citep{aksu2024giftevalbenchmarkgeneraltime}. Reverso  sets a new performance-efficiency Pareto frontier compared to existing time series foundation models.}
    \label{fig:gift_overall_pareto}
    % \vspace{-3mm}
\end{figure*}
```
Introduction {#intro}
============

Time series forecasting is a core problem in machine learning with widespread applications including in weather forecasting, energy grid analysis, supply chain logistics, financial predictions, and more. Traditionally, statistical models [@arima; @arch; @garch; @kalmanfilters; @ets] as well as deep learning approaches based on RNNs [@elman1990findingrnn; @lstm; @gatedrnn] have enjoyed great success in time series forecasting [@goel2017r2n2; @qin2017dual; @petnehazi2019recurrent; @hewamalage2021recurrent *i.a.*]. More recently, models based on the transformer architecture [@vaswani2017attention] have led to further improvements [@patchtst; @informer; @autoformer; @fedformer; @liu2022pyraformer *i.a.*].

These initial deep learning-based approaches to time series forecasting were dataset-specific, and thus trained models for particular domains/tasks of interest. While such models can attain high accuracy when sufficient in-distribution data are available, they incur substantial costs in data collection and model training/maintenance. This approach moreover stands in stark contrast to recent progress in domains such as language, vision, and biology, where *foundation models* pretrained on broad datasets have been found to be useful across many tasks with little or no task-specific training [@bommasani2021opportunities].

The successes of foundation models in other modalities have motivated the recent line of work on *time series foundation models* [TSFM; @garza2023timegpt; @chronos; @timesfm; @liu2024timer; @moirai; @liu2025timerxl; @sundial; @flowstate; @tirex; @tempopfn *i.a.*]. TSFMs are large-scale neural networks trained on heterogeneous time series data taken from broad domains (see @liang2024foundation and @kottapalli2025foundation for surveys). A particularly useful capability of decoder-based TSFMs is their ability to perform *zero-shot forecasting* via in-context learning, i.e., predicting future values conditioned on arbitrary historical time series provided as context. TSFMs can thus serve as a domain-general tool for time series forecasting, enabling the deployment of models in domains where task-specific training data may be scarce.

However, insofar as scaling has been a critical driver of progress of foundation models in other domains, much existing work has focused on scaling TSFMs, i.e., training ever-larger models on ever-larger datasets. For example, @xihe train a series of models up to 1.5B parameters and observe continuous improvements with scaling model size. While such large models can be performant, their sheer size can make them prohibitively expensive to train and deploy.

In this work, we revisit the core assumption that large-scale models are necessary for TSFMs. We show that small models that interleave long convolution layers [@flashfft] and modern linear RNN layers (in particular DeltaNet layers [@schlag2021linear; @yang2024parallelizing]) can match or outperform TSFMs that are orders of magnitude larger. We also study and ablate myriad data augmentation and inference-time strategies to arrive at a simple recipe that works well in practice. With our recipe, we train a family of TSFMs (dubbed *Reverso*) from 0.2M to 2.6M parameters that significantly push the performance-efficiency frontier, as shown in Figure `\ref{fig:gift_overall_pareto}`{=latex}.

Related Work {#sec:related_works}
============

#### Time series foundation models.

Our work is related to the existing research program around time series foundation models (TSFMs), which aim to train domain-general models for time series analysis and forecasting. TimeGPT [@garza2023timegpt], TimesFM [@timesfm], and Lag-LLaMA [@rasul2023lag] were some of the first works to show that decoder-only transformers can be used to train TSFMs with strong zero-shot forecasting performance. Timer [@liu2024timer] and Timer-XL [@liu2025timerxl] scaled such generative pretraining along dataset size, model size, and context length. Moirai [@moirai] incorporates a masked encoder to handle multivariate forecasting across various distributions. Chronos [@chronos] quantizes time series values into a fixed token vocabulary, while Chronos-2 [@chronos2] introduces a group attention mechanism for multivariate forecasting. Xihe [@xihe] scales up TSFMs to over a billion parameters with a hierarchical block attention mechanism. PatchTST-FM-r1 [@patchtstfm] showed that a generic patched transformer can also achieve competitive results.

A complementary line of work reuses large language models directly for time series by reprogramming or aligning them to TS tasks [@zhou2023one; @jin2023time; @chang2025llm4ts]. However, recent studies suggest that the LLM backbone often provides little benefit over simpler LLM-free baselines [@tan2024language], motivating dedicated TSFMs such as those discussed above.

```{=latex}
\begin{figure*}[t]\centering
      \includegraphics[width=0.75\linewidth]{figures/new_arch.png}
          \vspace{-3mm}
    \caption{Reverso architecture. An input sequence $t \in \mathbb{R}^{L}$ of length $L$ is first passed through a single projection layer to obtain embedding representations $x \in \mathbb{R}^{L \times d}$. Then, $n_{layers}$ sequence-mixing and channel-mixing blocks operate on $x$, where we alternate between long convolutions and DeltaNet for sequence mixing across length $L$, and use MLP layers for channel mixing across dimension $d$. The final output head (based on an attention-based transformation) obtains the predictions $\hat{y} \in \mathbb{R}^p$.}
    \label{fig:architecture}
    \vspace{-3mm}
\end{figure*}
```
#### Transformer alternatives for time series modeling.

While transformers have proven to be performant in the time series domain, there have also been works that employ modern \`\`sequence-mixing" primitives---which have been shown to be effective in language modeling---for time series modeling. These works generally make use of linear attention layers [@linearattention; @peng2021random; @schlag2021linear; @gla; @yang2024parallelizing], state-space models [@gu2022efficiently; @smith2023simplified_s5; @Gu2023MambaLS; @dao2024transformers], or convolution layers [@fu2023hungry; @flashfft; @poli2023hyena; @massaroli2023laughing].

TSMamba [@ma2024mamba] and Mamba4Cast [@bhethanabhotla2024mamba4cast] show that Mamba layers can be effective for time series. TiRex [@tirex] utilizes the xLSTM [@beck2024xlstm] architecture for zero-shot forecasting, while FlowState [@flowstate] uses the S5 module [@smith2023simplified_s5] and operates in the coefficient space of the transformed sequence. TempoPFN [@tempopfn] makes use of the GatedDeltaProduct layer [@siems2025deltaproduct] and trains on fully synthetic data. Convolution modules have been comparatively less popular in time series modeling. SCINet [@liu2022scinet] introduces a downsample-convolve-interact framework for modeling complex time series. ModernTCN [@donghao2024moderntcn] makes use of grouped convolutions of varying kernel sizes across multiple dimensions, while TVNet [@li2025tvnet] utilizes reshaping techniques to operate on time series in three dimensions. There have also been works showing that even simpler sequence mixing primitives such as linear/MLP layers work well in practice [@ekambaram2023tsmixer; @wang2024timemixer; @superlinear].

```{=latex}
\vspace{-1mm}
```
Methods
=======

```{=latex}
\vspace{-1mm}
```
`\label{sec:methods}`{=latex}

Here we describe our recipe for learning efficient TSFMs, which includes the architecture (§`\ref{ref:architecture}`{=latex}), dataset (§`\ref{ref:dataset}`{=latex}), and inference strategy (§`\ref{sec:inference}`{=latex}). We emphasize that the individual components in our recipe are not novel: the sequence mixing primitives we use (long convolutions and DeltaNet layers) are not new; similarly, our data augmentation, synthetic data generation, and inference strategies have been proposed before in the literature. The core contribution of this work is to show that these existing ingredients can be combined to produce a TSFM that significantly pushes the performance-efficiency frontier.

```{=latex}
\vspace{-1mm}
```
Architecture
------------

```{=latex}
\vspace{-1mm}
```
`\label{ref:architecture}`{=latex} We are given an input time series $t \in \mathbb{R}^L$ of length $L$ and must predict an output $y \in \mathbb{R}^T$ of length $T$. Following standard practice [@patchtst], we train by predicting a patch of $p$ points at a time (in parallel) through learning a function $f_\theta : \mathbb{R}^L \to \mathbb{R}^{p}$ parameterized with $\theta$. During inference, we autoregressively predict chunks of $p$ data points until we have forecasted $T$ points. We use $L = 2048, p = 48$.
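
As a minimal sketch of this rollout (in NumPy, with `f` standing in for the trained $f_\theta$; the function name and the padding of short contexts are illustrative):

```python
import numpy as np

def autoregressive_forecast(f, context, T, L=2048):
    """Autoregressively predict patches of p points until T points are forecast.

    f is any callable mapping a length-L context to the next p points
    (a stand-in for the trained model f_theta).
    """
    history = np.asarray(context, dtype=float)
    forecast = []
    while len(forecast) < T:
        window = history[-L:]
        if len(window) < L:  # back-fill short contexts with the leftmost value
            window = np.concatenate([np.full(L - len(window), window[0]), window])
        patch = f(window)                           # predict p points in parallel
        forecast.extend(patch)
        history = np.concatenate([history, patch])
    return np.asarray(forecast[:T])                 # truncate to the horizon
```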

Our model architecture, shown in Figure `\ref{fig:architecture}`{=latex}, is extremely simple and consists of stacked neural network blocks where each block consists of a sequence mixing module followed by an MLP channel mixing module. Finally, we have an output decoder based on attention that uses the contextualized representation of the input to predict $p$ data points at once.

```{=latex}
\vspace{-2mm}
```
#### Embedding layer.

The sequence $t \in \mathbb{R}^{L \times 1}$ is first normalized within the range $[0,1]$ with $$\begin{aligned}
    t \leftarrow \frac{t - \min(t)}{\max(t) - \min(t)}.\end{aligned}$$ We found this $[0,1]$-normalization to work better than $z$-score normalization, which subtracts the mean and divides by the standard deviation.[^1] In cases where there are missing values, these are imputed using linear interpolation. For sequences shorter than the model context length $L$, the remaining values are back-filled with the leftmost available data point.
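
A minimal sketch of this preprocessing (helper name and return convention are our own; the normalization statistics are returned so the forecast can be unnormalized later):

```python
import numpy as np

def prepare_context(t, L=2048):
    """[0,1]-normalize a context window; NaNs are linearly interpolated and
    short sequences back-filled with the leftmost available point (a sketch)."""
    t = np.array(t, dtype=float)          # copy so the caller's array is untouched
    nan = np.isnan(t)
    if nan.any():                         # impute missing values
        idx = np.arange(len(t))
        t[nan] = np.interp(idx[nan], idx[~nan], t[~nan])
    if len(t) < L:                        # back-fill with the leftmost point
        t = np.concatenate([np.full(L - len(t), t[0]), t])
    lo, hi = t.min(), t.max()
    scale = (hi - lo) if hi > lo else 1.0  # guard against constant series
    return (t - lo) / scale, (lo, scale)   # keep stats to unnormalize later
```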

The normalized sequence $t$ is then up-projected pointwise using a single linear layer into $d$ dimensions, yielding a transformed sequence ${x} \in \mathbb{R}^{L \times d}$. Unlike existing works that use special time embeddings [@tempopfn; @gluonts] to encode seasonality and frequency features (relying on metadata that might not be present at inference time), we adopt a minimalistic approach that handles any time series as a simple numeric sequence.

#### Sequence mixing.

We adopt a hybrid sequence mixing strategy wherein we switch between (gated) long convolution [@flashfft] and DeltaNet layers [@schlag2021linear; @yang2024parallelizing].

The long convolution layer uses depthwise separable convolutions [@chollet2017xceptiondeeplearningdepthwise], where the number of groups is equal to $d$. This obtains the output $z \in \mathbb{R}^{L \times d}$ from an input sequence $x \in \mathbb{R}^{L \times d}$ given convolution kernel weight $w \in \mathbb{R}^{k \times d}$ via $$\begin{aligned}
z_{i, j} &= \sum_{m=0}^{k-1} w_{m, j} \cdot x_{i-m, j}
 % &= (x_{:, j} * w_{:, j})_i \\\end{aligned}$$ where $0 \leq i \leq L-1$ indexes the sequence position, and $0 \leq j \leq d-1$ indexes the dimensions. The long convolution is the special case where the kernel length $k = L$. Long convolutions have demonstrated strong recall and reasoning performance while maintaining a sub-quadratic compute cost [@poli2023hyena]. We also make use of a gating layer, where the gate comes from a depthwise separable (short) convolution layer. Taken together, our convolutional sequence mixing primitive is given by $$\begin{aligned}
     x_{conv} &\leftarrow \operatorname{SiLU}(\operatorname{short-conv}(x) \odot \operatorname{long-conv}(x)) \\
     x &\leftarrow x + \operatorname{LayerNorm}(x_{conv}). \end{aligned}$$ With the FFT, the overall complexity of the convolution layer is $O(dL \log L)$, enabling faster training than standard attention. While FFT-based convolutions were previously not well-optimized for GPUs, recent work has enabled significant wallclock speed-ups [@flashfft], which we make use of in practice.
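
A simplified NumPy stand-in for the depthwise long convolution (zero-padded FFT to obtain a causal linear convolution; the optimized FlashFFTConv kernels [@flashfft] are what we use in practice):

```python
import numpy as np

def long_conv(x, w):
    """Depthwise causal convolution via FFT in O(d L log L).
    x: (L, d) input; w: (L, d) per-channel kernel (k = L)."""
    L = x.shape[0]
    n = 2 * L                            # zero-pad to avoid circular wrap-around
    y = np.fft.irfft(np.fft.rfft(x, n=n, axis=0) *
                     np.fft.rfft(w, n=n, axis=0), n=n, axis=0)
    return y[:L]                         # keep the causal prefix

def silu(z):
    return z / (1.0 + np.exp(-z))

# gated combination, with short_conv any depthwise short convolution:
#   x_conv = silu(short_conv(x) * long_conv(x, w))
#   x = x + layer_norm(x_conv)
```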

We also make use of linear RNN layers every other layer. The particular instance used in Reverso is DeltaNet [@schlag2021linear]. DeltaNet learns the following state transition using query, key and value vectors $q_i, k_i, v_i \in \mathbb{R}^{d_h}$ (with head dimension $d_h$) $$\begin{aligned}
    {S}_i & = {S}_{i-1}({I} - \beta_i {k}_i {k}_i^T) + \beta_i {v}_i {k}_i^T \\
    {x}_i & \leftarrow {x}_i + \operatorname{LayerNorm}({S}_i {q}_i)\end{aligned}$$ where the query, key, value vectors are obtained from linear projections followed by short convolutions of the input $x$, and $\beta_i \in (0,1)$ is obtained by a linear projection of the input $x_i$ followed by a sigmoid. We use 4 heads (i.e., $d_h = \frac{d}{4}$). To better model bidirectional context over the entire length-$L$ sequence, we add the last time step of the previous layer to the current layer's first hidden state (i.e., $x^{(l)}_0 \gets x^{(l)}_0 + x^{(l-1)}_{L-1}$) before the DeltaNet layer. We found this type of vector-based \`\`state-weaving" strategy to work well in practice. Similar state-weaving strategies have been explored in @tempopfn.
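
For intuition, a single-head DeltaNet recurrence step can be written directly from the update above (a sketch; practical implementations use the parallel chunked form of @yang2024parallelizing):

```python
import numpy as np

def deltanet_step(S, q, k, v, beta):
    """One DeltaNet state update. S: (d_h, d_h); q, k, v: (d_h,); beta in (0, 1)."""
    d_h = len(k)
    S = S @ (np.eye(d_h) - beta * np.outer(k, k)) + beta * np.outer(v, k)
    return S, S @ q   # new state and the output S_i q_i (pre-residual/LayerNorm)
```

With $\beta_i = 1$ and a unit-norm key, one step writes $v_i$ into the state so that querying with $q = k_i$ retrieves $v_i$ exactly, which is the delta-rule associative-memory view of the layer.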

In our ablation studies we also compare against other DeltaNet variants such as Gated DeltaNet [GDN; @yang2025gated] and Gated Delta Product [GDP; @siems2025deltaproduct], as well as linear attention variants such as Gated Linear Attention [GLA; @gla] (which generalizes state-space models such as Mamba-2 [@dao2024transformers]). Our findings show that DeltaNet performs well despite having fewer parameters.

```{=latex}
\begin{figure*}[t]\centering
    \includegraphics[width=0.85\linewidth]{figures/data_pipeline.png}
    \caption{Our data augmentation (left) and synthetic data generation (right) pipeline. For data augmentation we apply a series of standard augmentations: downsampling, amplitude modulation, vertical flip, horizontal flip, censor, and mixup. For synthetic data generation, we sample from Gaussian processes with randomly selected kernels from a kernel bank, and combine this with spike/trapezoidal patterns as well as processes sampled with trend, seasonality, and irregularity.}
    \label{fig:data_pipeline}
\end{figure*}
```
#### Channel mixing.

Each sequence mixing layer is followed by a channel mixing MLP layer. The MLP layer works as in the standard transformer architecture [@vaswani2017attention], with a dimension expansion factor of 4, with ReLU activations: $$\begin{aligned}
x_{mlp} \leftarrow & \operatorname{ReLU}(xW_{up}) W_{down} \\
x \leftarrow & x + \operatorname{LayerNorm}(x_{mlp})\end{aligned}$$ We found this simple MLP to work better than GLU-based alternatives [@shazeer2020glu].

#### Decoder head.

The above blocks transform a sequence of inputs ${x}^{(0)} \in \mathbb{R}^{L\times d}$ into ${x}^{(n)} \in \mathbb{R}^{L \times d}$ after $n$ layers. To obtain the final prediction, we first project the final representation ${x}^{(n)}$ along the sequence dimension to obtain a set of decoder \`\`query" vectors $q_{dec}$: $$\begin{aligned}
    &z = W_Lx^{(n)}, && W_L \in \mathbb{R}^{p \times L}, z \in \mathbb{R}^{p \times d} \\
    &q_{dec} = z W_q,  && W_q \in \mathbb{R}^{d \times d}, q_{dec} \in \mathbb{R}^{p \times d} \\\end{aligned}$$ The decoder query vectors are then used to attend over the keys and values obtained from a transformation over $x^{(n)}$, $$\begin{aligned}
    {k}_{{dec}} &  = {x}^{(n)} W_k,  &&W_k \in \mathbb{R}^{d \times d }, \\
    {v}_{{dec}} &  = {x}^{(n)} W_v,  &&W_v \in \mathbb{R}^{d \times d }, \\
        {o} & = \operatorname{attention}({q}_{{dec}} , {k}_{{dec}} , {v}_{{dec}} ),  && {o} \in \mathbb{R}^{p \times d }. \end{aligned}$$ We find that smaller models train well without any positional embedding, whereas `sin-cos` positional embedding improves performance for Reverso-2.6M. Finally, we apply a linear layer to obtain the final output $\hat{y} \in \mathbb{R}^{p}$: $$\begin{aligned}
    \hat{y} & =  {o} \, w_o ,  && w_o  \in \mathbb{R}^{d  \times 1}.\end{aligned}$$ We found this type of attention-based decoder \`\`head" to be more performant and parameter-efficient than a simple linear layer that directly predicts a $p$-sized vector from $x^{(n)}$.
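
Putting the decoder equations together as a sketch (we read the final projection as acting on the attention output $o$; standard scaled dot-product attention with a $1/\sqrt{d}$ factor is assumed here):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def decoder_head(x, W_L, W_q, W_k, W_v, w_o):
    """x: (L, d) final hidden states; W_L: (p, L); W_q/W_k/W_v: (d, d);
    w_o: (d, 1). Returns the (p,) patch prediction."""
    z = W_L @ x                                            # (p, d) pooled states
    q_dec = z @ W_q                                        # (p, d) queries
    k_dec = x @ W_k                                        # (L, d) keys
    v_dec = x @ W_v                                        # (L, d) values
    att = softmax(q_dec @ k_dec.T / np.sqrt(x.shape[1]))   # (p, L) attention
    o = att @ v_dec                                        # (p, d)
    return (o @ w_o).ravel()                               # (p,) prediction
```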

#### Training objective.

Given the model prediction $\hat{y}$, we unnormalize the output and train against the ground truth $y$ using the mean absolute error (MAE) loss, masking out NaN values in $y$ during loss computation.
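
The masked objective is simple to express; a sketch:

```python
import numpy as np

def masked_mae(y_hat, y):
    """Mean absolute error that ignores NaN entries in the ground truth y."""
    mask = ~np.isnan(y)
    return float(np.abs(y_hat[mask] - y[mask]).mean())
```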

Dataset {#ref:dataset}
-------

Here we describe the pretraining dataset in addition to strategies for data augmentation and synthetic data generation.

#### Pretraining dataset.

The time series community has developed a series of commonly-used datasets [@monash; @gluonts; @aksu2024giftevalbenchmarkgeneraltime; @moirai], consisting of data from various sources such as weather, traffic, and other domains. We train our models on the Gift-Eval [@aksu2024giftevalbenchmarkgeneraltime] pretraining dataset, which has become the de facto standard for training TSFMs in recent years. The Gift-Eval pretraining dataset has around 4.5 million time series with 230 billion time points in total. The dataset is, however, significantly imbalanced toward constituent datasets such as Buildings900k [@buildings900k] and Era5 [@era5].

To resolve this dataset imbalance, we precompute the strides on each dataset necessary to achieve a target (roughly uniform) fraction of time series sampled. For each dataset, we target a maximum of $N_{max}=100000$ samples per epoch, and recompute the strides such that we have at most $N_{max}$ samples from each dataset. Explicitly, for each dataset $\mathcal{D}$ with time series samples $t \in \mathcal{D}$ each of length $l_t$, we compute the total sum of lengths $\sum_{t \in \mathcal{D}} l_t$, and compute the stride for this dataset as $s_\mathcal{D} = \bigg\lceil \frac{\sum_{t \in \mathcal{D}} l_t}{N_{max}} \bigg\rceil$. We also cap sampling at 48 samples per time series $t$ to avoid oversampling short datasets. A random start point in each sequence $t$ is chosen at each epoch to ensure sampling across the full pretraining set.
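
The stride computation can be sketched as follows (the dict-based interface and helper names are our own):

```python
import math

def dataset_strides(datasets, n_max=100_000):
    """Per-dataset stride s_D = ceil(sum of series lengths / N_max).
    datasets: {name: list of series lengths} -> {name: stride}."""
    return {name: max(1, math.ceil(sum(lengths) / n_max))
            for name, lengths in datasets.items()}

def samples_per_series(length, stride, cap=48):
    """Number of windows drawn from one series, capped to avoid oversampling."""
    return min(max(1, length // stride), cap)
```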

#### Data augmentation.

Several data augmentation techniques have previously been reported to increase data diversity when pretraining TSFMs. We explored these techniques and found the following to be useful, which we incorporated into our pretraining recipe: downsampling, amplitude modulation, flips along the $x$- and $y$-axes (i.e., sign inversion and temporal reversal as in @tempopfn), censor augmentation, and mixup [@chronos], applied in this order. See Figure `\ref{fig:data_pipeline}`{=latex} (left). Downsampling and amplitude modulation are applied at the level of the full sequence. Flip and censor augmentations are applied to each subsampled sequence of context length $L$, and mixup is applied to the full batch. The full data augmentation pipeline is given in Algorithm `\ref{alg:augmentation}`{=latex} of the appendix.

#### Synthetic data.

We use synthetic data similar to established baselines [@tirex; @chronos], using methods such as KernelSynth [@chronos], which uses Gaussian processes to generate synthetic data. In particular, we define a kernel bank $\mathcal{K}$ (see Table `\ref{tab:kernel_bank}`{=latex} of the appendix), sample $j \sim U\{1, 5\}$ kernels from $\mathcal{K}$, and compose them using random binary additive or multiplicative operations. This forms a composite kernel $\tilde{\kappa}$. We also sample a mean $\mu$ which, with probability $1/2$, follows a linear trend with slope $m \sim U[-0.01, 0.01]$ and intercept $c \sim U[-1,1]$, and is constant otherwise. We then use $\tilde{\kappa}$ and $\mu$ in a Gaussian process to sample the synthetic time series $t_{syn}$. See Figure `\ref{fig:data_pipeline}`{=latex} (right).
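
A KernelSynth-style sketch in NumPy; the three-kernel bank and hyperparameters below are illustrative stand-ins, not the exact contents of our kernel bank $\mathcal{K}$:

```python
import numpy as np

def kernel_synth(length=256, max_kernels=5, seed=0):
    """Draw one synthetic series from a GP with a randomly composed kernel.
    Kernel bank and hyperparameters here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, length)
    diff = x[:, None] - x[None, :]
    bank = [
        lambda: np.exp(-0.5 * (diff / rng.uniform(0.05, 0.5)) ** 2),   # RBF
        lambda: np.exp(-2.0 * np.sin(np.pi * np.abs(diff) /
                                     rng.uniform(0.1, 0.5)) ** 2),     # periodic
        lambda: rng.uniform(0.1, 1.0) * np.outer(x, x),                # linear
    ]
    K = bank[rng.integers(len(bank))]()
    for _ in range(rng.integers(0, max_kernels)):   # compose j ~ U{1, 5} kernels
        op = rng.choice(["+", "*"])
        K2 = bank[rng.integers(len(bank))]()
        K = K + K2 if op == "+" else K * K2
    # mean: linear trend with probability 1/2, else constant
    if rng.random() < 0.5:
        mu = rng.uniform(-0.01, 0.01) * np.arange(length) + rng.uniform(-1, 1)
    else:
        mu = np.full(length, rng.uniform(-1, 1))
    return rng.multivariate_normal(mu, K + 1e-6 * np.eye(length))
```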

We also include spike processes [@tirex; @tempopfn; @kairos] and TSI [@tsi] as used in Chronos-2 to help in learning simple trends and periodic patterns, as further described in Algorithms `\ref{alg:spike_gen}`{=latex} and `\ref{alg:tsi_gen}`{=latex} of the appendix. We generate a total of 1 million synthetic time series sequences with the above algorithm. The maximum sequence length is set to 4096.

Inference {#sec:inference}
---------

We apply several techniques at inference time that help improve performance.

#### Flip equivariance.

Following prior works [@timesfm], we found it helpful to ensure flip equivariance by passing both the original and flipped context to the model, and then averaging the results: $$\begin{aligned}
    % \hat{y}_+ & = f(x) \\
    % \hat{y}_- & = f(-x) \\
    \hat{y} =&  \frac{f(x) - f(-x)}{2}\end{aligned}$$ While this requires two forward passes of the model, we observe that this reduces forecasting error consistently across multiple benchmarks.
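
This inference-time averaging is a two-line wrapper around the model (note the sign: the forecast for the flipped series must be flipped back):

```python
import numpy as np

def flip_equivariant_forecast(f, x):
    """Average the forecast with the sign-flipped forecast, flipped back."""
    return 0.5 * (f(x) - f(-x))
```

For a model with odd symmetry $f(-x) = -f(x)$ this is a no-op; in general it projects the model onto its sign-equivariant component.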

#### Downsampling.

Given a pretrained TSFM with fixed context length $L$, we generally want to ensure that patterns we wish to capture in the time series have seasonality period $S < L$; the case of $S > L$ potentially leaves insufficient information for an effective forecast. Works such as FlowState [@flowstate] determine the downsampling factor by rescaling the series with a ratio between the seasonality of the data and a base seasonality of the model. However, such an approach relies on input metadata that might not be available at inference time, and requires handling several edge cases where multiple frequency scales are present.

We instead use a simple algorithm based on the Fast Fourier Transform (FFT) to determine the downsampling factor. We first compute the FFT of the input sequence $t \in \mathbb{R}^{L}$ to obtain the amplitude spectrum $A(f)$, and then identify the peaks in the spectrum. To distinguish the dominant peak from noise, we enforce a set of criteria, as described in Algorithm `\ref{alg:downsampling}`{=latex} and Appendix `\ref{sec:appendix_downsampling}`{=latex}.

Sequences $t$ whose seasonality exceeds the context length $L$ of Reverso are downsampled by a factor of $k$ to $t'$, which is passed as input to the model. Given an original forecast horizon $T$, the model now predicts $\lceil T/k \rceil$ timesteps, which are then upsampled to $T$ by linear interpolation. An intuitive illustration is shown in Figure `\ref{fig:downsampling_comparison}`{=latex} of the appendix.
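
A simplified version of this inference strategy (the peak criteria are reduced here to taking the largest non-DC bin; our full criteria live in the appendix algorithm):

```python
import numpy as np

def dominant_period(t):
    """Estimate the dominant seasonality from the FFT amplitude spectrum
    (simplified: largest non-DC bin, no peak-validity criteria)."""
    t = np.asarray(t, dtype=float)
    amp = np.abs(np.fft.rfft(t - t.mean()))
    freqs = np.fft.rfftfreq(len(t))
    peak = amp[1:].argmax() + 1          # skip the DC bin
    return 1.0 / freqs[peak]             # period in time steps

def forecast_with_downsampling(f, t, T, k):
    """Downsample by k, forecast ceil(T/k) steps with f(context, horizon),
    then upsample back to T by linear interpolation."""
    t_ds = t[::k]
    y_ds = f(t_ds, int(np.ceil(T / k)))
    coarse = np.arange(len(y_ds)) * k    # positions of the coarse forecast
    return np.interp(np.arange(T), coarse, y_ds)
```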

```{=latex}
\centering
```
::: {#tab:hyperparam}
  **Model**        **Parameters**   **Layers**   **Dim ($d$)**  
  --------------- ---------------- ------------ --------------- --
  Reverso-Nano          200K            2             32        
  Reverso-Small         550K            4             64        
  Reverso               2.6M            8             128       

  : Architecture configurations for Reverso models of different sizes.
:::

```{=latex}
\vspace{-3mm}
```
Empirical Study
===============

```{=latex}
\vspace{-1mm}
```
Experimental Setup
------------------

We pretrain three versions of Reverso with 200K, 550K and 2.6M parameters. See Table `\ref{tab:hyperparam}`{=latex} for the model configurations. We train with AdamW [@adamw] with maximum learning rate $5 \times 10^{-4}$ using a WSD scheduler [@wsd], $\beta_1 = 0.9, \beta_2=0.999, \epsilon=1\times10^{-8}$ and weight decay of $0.1$. We sample roughly 1 million time points per training step with a batch size of 512. Our models take approximately 10, 20, and 40 H100-hours, respectively, for a full training run.

#### Baselines.

Our baselines include state-of-the-art TSFMs across varying architectures and sizes: Chronos and Chronos-2 [@chronos; @chronos2], TimesFM-2 and TimesFM-2.5 [@timesfm], PatchTST-FM-r1 [@patchtstfm], TiRex [@tirex], FlowState [@flowstate], Xihe [@xihe], Kairos [@kairos], Moirai and Moirai-2 [@moirai; @moirai2], Sundial [@sundial], Toto [@toto], YingLong [@wang2025outputscalingyinglongdelayedchain] and Tiny-time Mixers [@ttm]. The sizes of these baseline models are given in Figure `\ref{fig:gift_overall_pareto}`{=latex} and Figure `\ref{fig:ltsf_pareto}`{=latex}.

`\label{sec:results}`{=latex}

Main Results: Zero-Shot Forecasting Performance
-----------------------------------------------

#### Gift-Eval. {#sec:point_forecasting}

The Gift-Eval benchmark [@aksu2024giftevalbenchmarkgeneraltime] contains 23 different datasets with 97 different forecasting tasks. We train with the provided Gift-Eval Pretrain dataset.[^2] On this benchmark, Reverso achieves a competitive MASE value of 0.711 at a modest model size of 2.6M parameters. In particular, we outperform similarly small TSFMs such as Super-Linear (2.5M), FlowState (2.6M) and Tiny-Time Mixers (1M).[^3] Reverso-Small, at just 550K parameters, outperforms all the above models with an MASE of 0.726. Table `\ref{tab:full_gift}`{=latex} of the appendix gives the full numeric results broken down by dataset/domain, while Figure `\ref{fig:qualitative_results}`{=latex} shows some qualitative results on various datasets. We visualize the results for the full benchmark for all models in Figure `\ref{fig:gift_overall_pareto}`{=latex} (see Table `\ref{tab:all_baselines_gift}`{=latex} of the appendix for the full table of baselines and their results).

```{=latex}
\centering
```
```{=latex}
\small
```
::: {#tab:horizon_results}
  -------------------------------------- ------------ --------------------------------------------- ------------ ---------- --
                                                       `\multicolumn{3}{c}{\textbf{MASE}}`{=latex}                          
  `\cmidrule`{=latex}(lr)3-5 **Model**   **Params**                     **Short**                    **Medium**   **Long**  
  Xihe-Max                               1.5B                             0.623                        0.718       0.763    
  TimesFM-2.5                            200M                             0.626                        0.724       0.751    
  PatchTST-FM                            260M                             0.616                        0.722       0.745    
  TiRex                                  30M                              0.638                        0.750       0.767    
  Reverso                                2.6M                             0.633                        0.705       0.749    
  Reverso-Small                          550K                             0.648                        0.728       0.754    
  -------------------------------------- ------------ --------------------------------------------- ------------ ---------- --

  : Model MASE scores across forecast horizons, averaged across the 21 datasets in Gift-Eval with all three horizons available. Our Reverso models demonstrate strong long-horizon forecasting abilities, despite requiring multiple autoregressive rollouts with our prediction length of 48, compared to models like Xihe-Max with a maximum prediction length of 720.
:::

We also observe that our model does particularly well in long sequence forecasting. Table `\ref{tab:horizon_results}`{=latex} shows the performance of the top TSFMs on each of the short/medium/long horizon splits within Gift-Eval, where we see that Reverso achieves strong medium and long horizon point forecasting results, despite being the smallest family of models evaluated on this benchmark.

```{=latex}
\begin{table*}[htbp]\centering
  \caption{Zero-shot forecasting performance (MAE) on LTSF datasets: ETTm1, ETTm2, ETTh1, ETTh2,
  Electricity and Weather, comparing between Reverso variants against state-of-the-art foundation
  models. Results represent the averaged MAE across prediction lengths $\{96, 192, 336, 720\}$. Best
  results are in \textbf{bold}, and second-best are \underline{underlined}. A full set of results are
   shown in Table~\ref{tab:ltsf_full}}
  \label{tab:zero_shot_comparison_ltsf}
  \resizebox{\textwidth}{!}{%
  \begin{tabular}{lcccccccc}
  \toprule
  \textbf{Model} & \textbf{Reverso} & \textbf{Reverso-Small} & \textbf{Reverso-Nano} &
  \textbf{Sundial-L} & \textbf{Super-Linear} & \textbf{Timer-XL} & \textbf{TiRex} &
  \textbf{Chronos-2} \\
  \midrule
  \textbf{Params} & 2.6M & 550K & 200K & 444M & 2.6M & 85M & 30M & 120M \\
  \midrule
  ETTm1 & 0.367 & 0.376 & 0.382 & 0.369 & 0.389 & 0.392 & \underline{0.365} & \textbf{0.359} \\
  ETTm2 & 0.304 & 0.309 & 0.311 & 0.315 & 0.325 & 0.336 & \underline{0.302} & \textbf{0.291} \\
  ETTh1 & \textbf{0.404} & \textbf{0.404} & 0.416 & 0.420 & 0.416 & 0.417 & 0.417 & 0.405 \\
  ETTh2 & \underline{0.365} & 0.370 & 0.384 & 0.387 & 0.386 & 0.388 & \textbf{0.362} & 0.367 \\
  Electricity & \underline{0.238} & 0.241 & 0.249 & 0.262 & 0.267 & 0.268 & 0.240 & \textbf{0.237} \\
  Weather & 0.253 & 0.252 & 0.257 & 0.275 & 0.275 & 0.294 & \underline{0.247} & \textbf{0.245} \\
  \midrule
  \textbf{Avg} & \underline{0.322} & 0.325 & 0.333 & 0.338 & 0.343 & 0.349 & \underline{0.322} & \textbf{0.317}
  \\
  \midrule
  \textbf{Avg Rank} & \underline{2.50} & 3.83 & 5.17 & 6.33 & 6.17 & 7.83 & 2.83 & \textbf{1.67} \\
  \bottomrule
  \end{tabular}%
  }
\end{table*}
```
#### LTSF/TSLib.

We next explore zero-shot transfer results to the LTSF [@dlinear] test set. On this dataset we outperform Sundial [@sundial], Super-Linear [@superlinear], Timer-XL [@liu2025timerxl] and several other models at a much smaller parameter count, as shown in Figure `\ref{fig:ltsf_pareto}`{=latex}. We report more granular performance numbers in Table `\ref{tab:zero_shot_comparison_ltsf}`{=latex}, where we follow Sundial [@sundial] and report the mean MAE achieved across various prediction horizons for the datasets of ETTh1, ETTh2, ETTm1, ETTm2, Electricity and Weather.

These results are especially strong given that some of the baselines are quite advantaged compared to Reverso. For example, in-domain datasets such as Electricity are included in the pretraining data of TiRex and Chronos-2. Moreover, for models which do not report results on the full benchmark, we impute their scores with the best existing model on each missing dataset.[^4] Despite these advantages given to other models, we observe that Reverso remains one of the best-performing model families on LTSF.

```{=latex}
\centering
```
![LTSF performance vs. Parameter Count. MAE is averaged over the horizons of $\{96, 192, 336, 720\}$ for the datasets ETTh1, ETTh2, ETTm1, ETTm2, Electricity and Weather. For models that are not evaluated on all datasets (e.g., YingLong did not report results for Electricity), we impute with the best existing model on that dataset.](figures/ltsf_pareto_mae.png){#fig:ltsf_pareto width="\\linewidth"}

Ablations
---------

We perform ablations across various architectural, dataset, and inference-strategy choices.

#### Architecture. {#subsec:architecture}

How much do our hybrid sequence-mixing layers help for time series? In Table `\ref{tab:sequence ablations}`{=latex}, we report the MASE achieved by instances of our model using different sequence-mixing layers. For each model, we keep the number of layers fixed at 8 and the sequence-mixing dimension at 128. Here we train on a smaller portion of the full training set for efficiency.

We find that among non-hybrid models, DeltaNet [@schlag2021linear] and Gated DeltaNet [@yang2025gated] achieve the best MASE at lower parameter counts compared to layers such as Gated Linear Attention [@gla] and DeltaProduct [@siems2025deltaproduct]. Overall, linear attention and convolution methods consistently outperform full attention. Hybrid models that combine long convolutions with linear RNN layers ultimately perform best.

Table `\ref{tab:decoder_ablations}`{=latex} shows ablation studies on our attention decoder head, where we replace the attention mechanism with a simple (bi)linear layer. For the simple linear layer, the hidden states $x^{(n)} \in \mathbb{R}^{L \times d}$ after the last Reverso block are projected to the output with two linear projections $W_1 \in \mathbb{R}^{d \times 1}$ and $W_2 \in \mathbb{R}^{p \times L}$ via the transformation $$\begin{aligned}
    \hat{y} = W_2 x^{(n)} W_1.\end{aligned}$$ We observe that the attention mechanism at the decoder boosts overall performance and, in particular, helps capture long-range dependencies.
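As a minimal NumPy sketch (with illustrative dimensions; the actual model dimensions differ), the linear decoder head reduces to two matrix multiplications:

```python
import numpy as np

L, d, p = 512, 128, 64            # context length, hidden dim, output size (illustrative)
rng = np.random.default_rng(0)

x = rng.standard_normal((L, d))   # hidden states after the last Reverso block
W1 = rng.standard_normal((d, 1))  # channel projection: R^{L x d} -> R^{L x 1}
W2 = rng.standard_normal((p, L))  # temporal projection: R^{L x 1} -> R^{p x 1}

y_hat = W2 @ (x @ W1)             # forecast, shape (p, 1)
```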

```{=latex}
\centering
```
```{=latex}
\resizebox{0.98\columnwidth}{!}{%
\begin{tabular}{lcccc}
\toprule
\textbf{\begin{tabular}[c]{@{}l@{}}Sequence Module\end{tabular}}  & \textbf{Params} & \textbf{\begin{tabular}[c]{@{}c@{}}Long\\ MASE\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}Short\\ MASE\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}Overall\\ MASE\end{tabular}} \\ \midrule
DeltaNet & 2.0M & 0.706 & 0.792 & 0.732 \\
Gated DeltaProduct & 3.0M & 0.711 & 0.793 & 0.735 \\
Gated DeltaNet & 2.6M & 0.708 & 0.782 & 0.730 \\
Long Convolution & 3.1M & 0.708 & 0.799 & 0.735 \\
Gated Linear Attention & 2.1M & 0.726 & 0.817 & 0.753 \\ 
Attention (\texttt{sin-cos}) & 2.0M & 0.729 & 0.840 & 0.762 \\ 
Attention (\texttt{RoPE}) & 2.0M & 0.719 & 0.824 & 0.750 \\ \midrule
Conv + Gated DeltaNet & 3.1M & 0.704 & 0.784 & 0.728 \\ 
Conv + DeltaNet (Reverso) & 2.6M & 0.700 & 0.786 & 0.725 \\ \bottomrule
\end{tabular}%
}
```
```{=latex}
\centering
```
```{=latex}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{lccccc}
\toprule
\textbf{\begin{tabular}[c]{@{}c@{}}Decoder\\ Architecture\end{tabular}} & \textbf{\begin{tabular}[c]{@{}l@{}}Reverso \\ Layers\end{tabular}} & \textbf{Dimension} & \textbf{\begin{tabular}[c]{@{}c@{}}Long\\ MASE\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}Short\\ MASE\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}Overall\\ MASE\end{tabular}} \\ \midrule
% Attention   & 2 & 32 & 0.741 & 0.856 & 0.775 \\
Linear      & 4 & 64 & 0.751 & 0.830 & 0.774 \\
Attention   & 4 & 64 & 0.728 & 0.811 & 0.753 \\
Linear      & 8 & 128 & 0.719 & 0.789 & 0.740 \\
Attention   & 8 & 128 & 0.700 & 0.786 & 0.725 \\ \bottomrule
\end{tabular}%
}
```
#### Data augmentation and synthetic data.

Data augmentation strategies and synthetic data generation processes have been shown to improve data diversity. Table `\ref{tab:augmentation_ablation}`{=latex} shows a leave-one-out experiment, where we train Reverso (again on a smaller training set) while removing one data augmentation at a time from the pipeline {`mixup, downsample`, `temporal reversal(flip-x)`, `vertical flip(flip-y)`, `censor`, `amplitude modulation`}. We find that our training recipe is robust to individual data augmentation choices: removing a single augmentation does not significantly hurt pretraining. At the same time, augmentations as a whole remain necessary, and ablating them altogether is detrimental. Synthetic data also provides significant benefit, even at small ratios. This finding corroborates recent works [@chronos2; @tempopfn] which also highlight the importance of data augmentations and synthetic data in training TSFMs.

```{=latex}
\centering
```
```{=latex}
\footnotesize
```
::: {#tab:augmentation_ablation}
  **Method**                   **MASE**
  --------------------------- ----------
  Baseline                      0.738
  w/o mixup                     0.740
  w/o downsample                0.740
  w/o temp rev                  0.740
  w/o flip                      0.739
  w/o censor                    0.738
  w/o amp mod                   0.737
  w/o any data augmentation     0.755
  w/o synthetic data            0.786

  : Ablations on dataset augmentation and synthetic data.
:::

#### Inference.

Finally, we analyze the effects of the downsampling and flip-invariance inference methods described in Section `\ref{sec:inference}`{=latex}. We find that downsampling helps bring long-range dependencies into the context window of our model, improving medium- and long-term forecast performance.

Flip invariance helps more on short sequences. We compare two ways of performing the autoregressive rollout: `flip-once`, where the original and flipped predictions across the whole forecast horizon are obtained separately and averaged once after the rollout is completed, versus `flip-every`, where the original and flipped predictions are averaged at the end of each intermediate autoregressive step. The latter yields marginally larger improvements than the former.
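The two rollout strategies can be sketched as follows, assuming sign-flip (flip-y) invariance; `forecast` is a hypothetical toy stand-in for a Reverso forward pass that predicts the next `p` steps, not the actual model API:

```python
import numpy as np

def forecast(context, p=8):
    """Toy stand-in for a model forward pass: predict the next p steps."""
    return np.full(p, context[-p:].mean())

def flip_once(context, horizon, p=8):
    """Roll out original and flipped series separately; average at the end."""
    steps = -(-horizon // p)  # ceil(horizon / p) autoregressive steps
    a, b = context.copy(), -context.copy()
    for _ in range(steps):
        a = np.concatenate([a, forecast(a, p)])
        b = np.concatenate([b, forecast(b, p)])
    orig = a[len(context):len(context) + horizon]
    flip = -b[len(context):len(context) + horizon]  # un-flip before averaging
    return 0.5 * (orig + flip)

def flip_every(context, horizon, p=8):
    """Average the original and un-flipped predictions at every step."""
    x = context.copy()
    for _ in range(-(-horizon // p)):
        step = 0.5 * (forecast(x, p) - forecast(-x, p))
        x = np.concatenate([x, step])
    return x[len(context):len(context) + horizon]
```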

```{=latex}
\centering
```
```{=latex}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{@{}l c c c c c c@{}}
\toprule
\textbf{Method} & 
\textbf{\begin{tabular}[c]{@{}c@{}}Short\\ Seq\end{tabular}} & 
\textbf{\begin{tabular}[c]{@{}c@{}}Long\\ Seq\end{tabular}} & 
\textbf{\begin{tabular}[c]{@{}c@{}}Short\\ Term\end{tabular}} & 
\textbf{\begin{tabular}[c]{@{}c@{}}Med\\ Term\end{tabular}} & 
\textbf{\begin{tabular}[c]{@{}c@{}}Long\\ Term\end{tabular}} & 
\textbf{Overall} \\ \midrule
Baseline    & 0.781 & 0.697 & 0.710 & 0.730 & 0.746 & 0.722 \\
w/o downsampling & 0.781 & 0.717 & 0.710 & 0.755 & 0.789 & 0.736 \\ \midrule
No flip    & 0.788 & 0.700 & 0.715 & 0.730 & 0.748 & 0.726 \\
Flip once  & 0.781 & 0.698 & 0.710 & 0.730 & 0.747 & 0.722 \\
Flip every & 0.781 & 0.697 & 0.710 & 0.730 & 0.746 & 0.722 \\ \bottomrule
\end{tabular}%
}
```
Discussion {#sec:discussion}
==========

As stated in §`\ref{sec:methods}`{=latex}, Reverso is built from established architectural components; our main contribution lies in how we combine them. We show that recent hybrid models that have been successful in language modeling can result in simple TSFMs that yield a strong performance--efficiency trade-off.

More generally, we view Reverso as an optimized recipe for TSFMs. Similar patterns have appeared in foundation models in other domains: RoBERTa [@liu2019roberta] systematically refines the original BERT recipe [@devlin2019bert], while DINOv2 [@oquab2023dinov2] streamlines the original DINO formulation. Likewise, many influential works in foundation modeling arise from scaling up existing designs (e.g., GPT-2 to GPT-3), rather than introducing entirely new building blocks. Our results show that, in the TSFM setting, carefully designed architectures allow us instead to *scale down* existing recipes while maintaining competitive performance, effectively pushing the Pareto frontier towards smaller and cheaper models. Our findings also resonate with recent work on hybrid LLM architectures [@lieber2024jamba; @glorioso2024zamba; @waleffe2024empirical], which demonstrates that mixing established primitives can outperform either component alone.

Reverso still has several limitations. First, Reverso is trained primarily as a univariate forecasting model. Chronos-2 has shown that attention can be cleverly utilized to learn cross-channel dependencies in multivariate time series. Future work could investigate the potential of the various sequence-mixing layers in multivariate domains. Second, while Reverso's performance on long sequences is near state-of-the-art, its performance on shorter sequences still lags behind larger TSFMs. Finally, we focus primarily on point prediction, although some applications of interest would benefit from distributional predictions; insofar as conformal methods [@conformal2021; @conformal2025] have been adopted as a lightweight adaptation to obtain uncertainty bounds for any point time series forecast, we anticipate such techniques being applicable to obtain uncertainty estimates from Reverso models.

Conclusion {#sec:conclusion}
==========

This paper presents Reverso, a family of models that significantly push the efficiency-performance frontier of TSFMs. We show that large-scale models are not necessary, and that simple architectures based on convolutions and linear RNN layers can achieve competitive zero-shot forecasting performance. Reverso demonstrates strong capability as a highly accurate model for long-sequence, long-horizon forecasting.

Impact Statement {#impact-statement .unnumbered}
================

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgments {#acknowledgments .unnumbered}
===============

This work was supported by the National Science Foundation under CAREER Award No. 2441872 and a gift from Qube RT.

```{=latex}
\bibliographystyle{icml2026/icml2026}
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
```{=latex}
\onecolumn
```
Data generation and augmentation details
========================================

Synthetic data composition {#app:synthetic_data}
--------------------------

```{=latex}
\begin{algorithm}
\caption{KernelSynth Data Generation}
\label{alg:kernelsynth}
\begin{algorithmic}[1]
\REQUIRE Length $L$, Kernel bank $\mathcal{K}$ (e.g., RBF, Periodic, Linear, Rational Quadratic), Max kernels $J_{max}=5$.
\STATE \textbf{Compose Kernel $\tilde{\kappa}$:}
\STATE Sample number of kernels $N \sim \text{Uniform}\{1, J_{max}\}$
\STATE Sample base kernel $\tilde{\kappa} \sim \mathcal{K}$
\FOR{$i = 2$ to $N$}
    \STATE Sample next kernel $k' \sim \mathcal{K}$
    \STATE Sample operation $\oplus \sim \{\text{Add}, \text{Multiply}\}$
    \STATE $\tilde{\kappa} \leftarrow \tilde{\kappa} \oplus k'$
\ENDFOR
\STATE \textbf{Define Mean Function $\mu(t)$:}
\IF{$u \sim U(0,1) < 0.5$}
    \STATE \textit{Linear trend:} Sample $m \sim U[-0.01, 0.01]$, $c \sim U[-0.1, 0.1]$
    \STATE $\mu(t) = m \cdot t + c$
\ELSE
    \STATE \textit{Constant:} $\mu(t) = 0$ (or sample constant $c$)
\ENDIF
\STATE \textbf{Sample Time Series from Gaussian Process:}
\STATE Compute covariance matrix $\Sigma \in \mathbb{R}^{L \times L}$ where $\Sigma_{uv} = \tilde{\kappa}(u, v)$
\STATE Compute mean vector $\mathbf{m} \in \mathbb{R}^{L}$ where $\mathbf{m}_t = \mu(t)$
\STATE Sample $t_{syn} \sim \mathcal{N}(\mathbf{m}, \Sigma)$ \COMMENT{Multivariate Gaussian}
\RETURN $t_{syn}$
\end{algorithmic}
\end{algorithm}
```
We make use of standard synthetic data generation practices that have been developed in the community.

KernelSynth [@chronos] introduced the use of Gaussian processes (GPs) for synthetic data generation. In particular, we define a kernel bank $\mathcal{K}$, sample $j \sim U\{1, 5\}$ kernels from $\mathcal{K}$, and compose them using random binary additive or multiplicative operations. This forms a composite kernel $\tilde{\kappa}$. We also sample a mean function $\mu$, which with probability $1/2$ follows a linear trend with slope $m \sim U[-0.01, 0.01]$ and intercept $c \sim U[-0.1,0.1]$, and is constant otherwise. We then use $\tilde{\kappa}$ and $\mu$ in a Gaussian process to sample the synthetic time series $t_{syn}$ according to $$\begin{aligned}
    t_{syn} \sim \mathcal{GP}(\mu, \tilde{\kappa}).\end{aligned}$$ We use the kernels in our kernel bank $\mathcal{K}$ shown in Table `\ref{tab:kernel_bank}`{=latex}, applied to $L_{syn}$ evenly spaced points $x, x' \in [0, 1]$. To enable efficient sampling, we use batched Cholesky decomposition. The constant, linear, RBF, and Rational Quadratic kernels were introduced in KernelSynth [@chronos]. The Matérn kernel was used in TempoPFN [@tempopfn] as a more robust and accurate representation of how GPs can model real-world data. We use the following set of periods $\mathcal{P} = \{24, 48, 96, 168, 336, 672, 7, 14, 30, 60, 365, 730, 4, 26, 52, 6, 12, 40, 10\}$ (normalized by the time series length $L_{syn}$) to capture patterns at various timescales.

```{=latex}
\centering
```
::: {#tab:kernel_bank}
  **Kernel**           **Formula** $\kappa(x, x')$                                                                                                             **Hyperparameters**
  -------------------- --------------------------------------------------------------------------------------------------------------------------------------- ---------------------------------------------------
  Constant             $C$                                                                                                                                     $C = 1$
  Linear               $\sigma^2 + x \cdot x'$                                                                                                                 $\sigma \in \{0, 1, 10\}$
  RBF                  $\exp\left( -\frac{\|x-x'\|^2}{2l^2} \right)$                                                                                           $l \in \{0.1, 1, 10\}$
  Rational Quadratic   $\left( 1 + \frac{\|x-x'\|^2}{2\alpha} \right)^{-\alpha}$                                                                               $\alpha \in \{0.1, 1, 10\}$
  Matérn               $\frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu} \frac{\|x-x'\|}{l} \right)^\nu K_\nu \left( \sqrt{2\nu} \frac{\|x-x'\|}{l} \right)$   $\nu \in \{0.5, 1.5, 2.5\}, l \in \{0.1, 1, 10\}$
  Periodic             $\exp\left( -2 \sin^2(\pi \|x-x'\| / p) \right)$                                                                                        $p \in \mathcal{P}$

  : Kernel Bank $\mathcal{K}$ used for Synthetic Data Generation
:::
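As a minimal single-series sketch of this sampling step (using only an RBF and a periodic kernel with fixed hyperparameters; the full pipeline composes up to five kernels from the bank above and uses batched Cholesky decomposition):

```python
import numpy as np

def rbf(x, l):
    """RBF kernel on a 1-D grid."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * l ** 2))

def periodic(x, p):
    """Periodic kernel with period p."""
    d = np.abs(x[:, None] - x[None, :])
    return np.exp(-2 * np.sin(np.pi * d / p) ** 2)

def sample_gp(L_syn=256, seed=0):
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, L_syn)
    # compose two kernels with a random add/multiply operation
    k = rbf(x, l=0.1)
    k = k * periodic(x, p=0.25) if rng.random() < 0.5 else k + periodic(x, p=0.25)
    mu = np.zeros(L_syn)                                  # constant-mean branch
    chol = np.linalg.cholesky(k + 1e-6 * np.eye(L_syn))   # jitter for stability
    return mu + chol @ rng.standard_normal(L_syn)
```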

```{=latex}
\vspace{5pt}
```
Beyond GP data, we also include spike processes [@tirex; @tempopfn; @kairos] and TSI [@tsi], as used in Chronos-2, to help the model learn simple trends and periodic patterns.

```{=latex}
\begin{algorithm}
\caption{Spike Process Generation, adapted from Kairos~\cite{kairos}}
\label{alg:spike_gen}
\begin{algorithmic}[1]
\REQUIRE Length $L$, Pattern types $\mathcal{T} = \{\text{"inverted\_u", "spikes"}\}$, Ranges for baseline $[b_{min}, b_{max}]$, period $[p_{min}, p_{max}]$, amplitude $[a_{min}, a_{max}]$, width $[w_{min}, w_{max}]$, and noise $[\sigma_{min}, \sigma_{max}]$.
\STATE \textbf{Sample parameters:}
\STATE $type \sim \text{Uniform}(\mathcal{T})$, $b \sim \text{Uniform}(b_{min}, b_{max})$, $p \sim \text{Uniform}\{p_{min}, p_{max}\}$
\STATE $a \sim \text{Uniform}(a_{min}, a_{max})$, $w \sim \text{Uniform}\{w_{min}, w_{max}\}$, $\sigma_{\epsilon} \sim \text{Uniform}(\sigma_{min}, \sigma_{max})$
\STATE \textbf{Construct trapezoid shape $e$ of length $w$:}
\STATE Define $u = \lfloor w/4 \rfloor$, $f = \lfloor w/2 \rfloor$, $d = w - u - f$
\STATE $e_{up} = \text{linspace}(0, a, u)$, $e_{flat} = \text{constant}(a, f)$, $e_{down} = \text{linspace}(a, 0, d)$
\STATE $e = [e_{up}; e_{flat}; e_{down}]$
\STATE \textbf{Initialize series:} $x_t = b$ for $t = 1, \dots, L$
\STATE $s = -1$ if $type = \text{"inverted\_u"}$ else $1$
\STATE \textbf{Add periodic patterns:}
\FOR{$i = 0, p, 2p, \dots < L$}
    \STATE $len = \min(w, L - i)$
    \STATE $x_{i:i+len} \leftarrow x_{i:i+len} + s \cdot e_{1:len}$
\ENDFOR
\STATE \textbf{Add white noise:} $x \leftarrow x + \epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, \sigma_{\epsilon}^2)$
\RETURN $x$
\end{algorithmic}
\end{algorithm}
```
```{=latex}
\begin{algorithm}
\caption{TSI (Trend, Seasonality, Irregularity) Generation, following Chronos-2\cite{chronos2}}
\label{alg:tsi_gen}
\begin{algorithmic}[1]
\REQUIRE Length $L$, Component probabilities $P_{trend}, P_{seas}, P_{noise}, P_{out}, P_{shift}$, Trend types $\mathcal{T}$, Seasonality periods $\mathcal{P}$, Wave shapes $\mathcal{W}$, Noise distributions $\mathcal{D}$.
\STATE \textbf{Initialize:} $x_t \leftarrow 0$ for $t=1, \dots, L$
\STATE \textbf{Add Trend:}
\IF{$u \sim U(0,1) < P_{trend}$}
    \STATE Sample trend type $\tau \in \mathcal{T}$ (e.g., linear, exp, poly, piecewise)
    \STATE Sample parameters $\theta_\tau$ (slope, intercept, degree, etc.)
    \STATE $x \leftarrow x + f_\tau(t; \theta_\tau)$
\ENDIF
\STATE \textbf{Add Seasonality:}
\IF{$u \sim U(0,1) < P_{seas}$}
    \STATE Sample number of components $K \sim U\{1, 3\}$
    \STATE Sample distinct periods $\{p_1, \dots, p_K\} \subset \mathcal{P}$
    \FOR{$k=1$ to $K$}
        \STATE Sample wave form $w \in \mathcal{W}$ (e.g., sine, sawtooth, square)
        \STATE Sample amplitude $A$ and phase $\phi$
        \STATE $x_t \leftarrow x_t + A \cdot w\bigg(\frac{2\pi}{p_k} t + \phi\bigg)$
    \ENDFOR
\ENDIF
\STATE \textbf{Add Irregularity (Noise):}
\IF{$u \sim U(0,1) < P_{noise}$}
    \STATE Sample distribution $\mathcal{N} \in \mathcal{D}$ and scale $\sigma$
    \STATE $x \leftarrow x + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma)$
\ENDIF
\STATE \textbf{Add Anomalies:}
\IF{$u \sim U(0,1) < P_{out}$}
    \STATE Add random sparse outliers to $x$
\ENDIF
\IF{$u \sim U(0,1) < P_{shift}$}
    \STATE Add random level shifts (step functions) to $x$
\ENDIF
\RETURN $x$
\end{algorithmic}
\end{algorithm}
```
Data augmentation specifics
---------------------------

Here we present the data augmentation strategies that we found helpful for improving data diversity during training. Our pipeline applies transformations at both the instance level (sequentially) and the batch level (Mixup).

1.  **Downsampling:** To allow the model to learn features across varying temporal resolutions, we downsample the raw time series $t$ by a factor $k$. This effectively compresses long-term dependencies into the context window $L$.

2.  **Amplitude Modulation:** We multiply $t$ by a piecewise linear function. We follow the implementation from TiRex but sample just a single intermediate changepoint.

3.  **Flips:** We apply random sign flips (inverting the y-axis) and temporal reversals (flipping the x-axis). This follows the implementation of TempoPFN [@tempopfn].

4.  **Censoring:** The series is clipped from both the top and the bottom. This effectively applies a per-sample thresholding which reduces the effect of anomalies on training.

5.  **Batch Mixup:** We apply Mixup [@chronos] at the batch level, creating a convex interpolation between samples.

The formal procedure for generating a single training batch is detailed in Algorithm `\ref{alg:augmentation}`{=latex}.
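As a minimal NumPy sketch of two of these steps, censoring and batch Mixup (function names and signatures are illustrative, not the training code):

```python
import numpy as np

def censor(x, rng):
    """Clip a series at a random quantile, from a randomly chosen direction."""
    q = rng.uniform(0, 1)
    c = np.quantile(x, q)
    direction = rng.choice(["top", "bottom", "none"])
    if direction == "top":
        return np.minimum(x, c)
    if direction == "bottom":
        return np.maximum(x, c)
    return x

def batch_mixup(batch, alpha, rng):
    """Convex interpolation between each sample and a permuted partner,
    with a per-sample lambda ~ Beta(alpha, alpha)."""
    perm = rng.permutation(len(batch))
    lam = rng.beta(alpha, alpha, size=(len(batch), 1))
    return lam * batch + (1 - lam) * batch[perm]
```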

```{=latex}
\begin{algorithm}[tb]\caption{Pretraining Data Augmentation Pipeline}
   \label{alg:augmentation}
\begin{algorithmic}
   \STATE {\bfseries Input:} Dataset $\mathcal{D}$, Batch size $B$, Context length $L$
   \STATE {\bfseries Hyperparameters:} 
   \STATE \hspace{1em} $p(\texttt{downsample}), [k_{min}, k_{max}]$ (Downsample probability, range of downsample ratios)
   \STATE \hspace{1em} $p(\texttt{modulate}), p(\texttt{flip-x}), p(\texttt{flip-y})$ (Amp. Mod., Flip-$x$, Flip-$y$ probabilities)
   \STATE \hspace{1em} $p(\texttt{censor})$ (Censor prob), $\alpha$ (Mixup beta param)
   
   \STATE Initialize batch $\mathcal{B} \leftarrow \emptyset$
   
   \WHILE{$|\mathcal{B}| < B$}
       \STATE Sample raw time series $X$ from $\mathcal{D}$
       
       \STATE \COMMENT{1. Multi-scale Downsampling}
       \IF{sample $u \sim U(0,1) < p_d$}
           \STATE Sample stride $k \sim U_{\text{int}}(k_{min}, k_{max})$
           \STATE $X \leftarrow \text{Downsample}(X, k)$
       \ENDIF
       
       \STATE \COMMENT{2. Amplitude Modulation}
       \IF{sample $u \sim U(0,1) < p(\texttt{modulate})$}
             \STATE Sample changepoint $x_2 \sim U_{\text{int}}(1, \texttt{len}(X) - 2)$. Set $x_1 = 0$, $x_3 = \texttt{len}(X) - 1$
             \STATE Sample scalars $y_1, y_2, y_3 \sim \mathcal{N}(1, 0.5)$
            \STATE Piecewise linear $f(x)$ connecting $(x_1, y_1), (x_2, y_2), (x_3, y_3)$
           \STATE $X \leftarrow X \cdot f(x)$
       \ENDIF
       
       \STATE \COMMENT{3. Slicing to context length}
       \STATE $T_{len} \leftarrow \texttt{len}(X)$
       \IF{$T_{len} > L$}
           \STATE Sample $t_{start} \sim U_{\text{int}}(0, T_{len} - L)$
           \STATE $x_{seq} \leftarrow X[t_{start} : t_{start} + L]$ (on the next iteration we start at $t_{start} + L$).
       \ELSE
           \STATE $x_{seq} \leftarrow \text{Pad}(X, L)$
       \ENDIF
       
       \STATE \COMMENT{4. Flip Augmentations}
       \IF{sample $u \sim U(0,1) < p(\texttt{flip-y})$}
           \STATE $x_{seq} \leftarrow -x_{seq}$ \COMMENT{Sign Inversion}
       \ENDIF
       \IF{sample $u \sim U(0,1) < p(\texttt{flip-x})$}
           \STATE $x_{seq}[i] \leftarrow x_{seq}[L-i-1]$ \COMMENT{Temporal Reversal}
       \ENDIF
       
       \STATE \COMMENT{5. Censor Augmentation}
       % \STATE $x_{seq} \leftarrow \text{FillNaNs}(x_{seq})$
       \IF{sample $u \sim U(0,1) < p(\text{censor})$}
            \STATE Sample $q \sim U(0, 1)$ 
            \STATE Compute threshold $c \leftarrow \text{Quantile}(x_{seq}, q)$ 
            \STATE $\texttt{direction} \sim \{\texttt{top, bottom, none}\}$
            \IF{$\texttt{direction} = \texttt{top}$}
                \STATE $x_{seq} \leftarrow \min(x_{seq}, c)$
            \ELSIF{$\texttt{direction} = \texttt{bottom}$}
                \STATE $x_{seq} \leftarrow \max(x_{seq}, c)$
            \ENDIF
        \ENDIF
       
       \STATE Add $x_{seq}$ to $\mathcal{B}$
   \ENDWHILE
   
   \STATE \COMMENT{6. Mixup}
   \STATE Construct permutation $\pi$ of $\{1, \dots, B\}$, $\tilde{\mathcal{B}} = \mathcal{B}[\pi]$
   \FOR{$i=1$ {\bfseries to} $B$}
        \STATE Sample $\lambda \sim \text{Beta}(\alpha, \alpha)$
        \STATE $\mathcal{B}[i] \leftarrow \lambda \mathcal{B}[i] + (1-\lambda) \mathcal{\tilde{B}}[i]$
   \ENDFOR
   
   \STATE \textbf{return} $\mathcal{B}$
\end{algorithmic}
\end{algorithm}
```
```{=latex}
\clearpage
```
Extended results on Gift-Eval
=============================

```{=latex}
\centering
```
::: {#tab:full_gift}
  Dataset                              Domain           Var MASE      Dataset                       Domain       Var   MASE
  ------------------------------------ -------------- ----- --------- ----------------------------- ------------ ----- --------
  loop\_seattle/5T/short               Transport          1 0.5531    ett1/H/short                  Energy       7     0.8112
  loop\_seattle/5T/medium              Transport          1 0.8008    ett1/H/medium                 Energy       7     1.2476
  loop\_seattle/5T/long                Transport          1 0.8880    ett1/H/long                   Energy       7     1.3336
  loop\_seattle/D/short                Transport          1 0.8818    ett1/W/short                  Energy       7     1.4936
  loop\_seattle/H/short                Transport          1 0.8113    ett2/15T/short                Energy       7     0.7479
  loop\_seattle/H/medium               Transport          1 0.9009    ett2/15T/medium               Energy       7     0.8903
  loop\_seattle/H/long                 Transport          1 0.8668    ett2/15T/long                 Energy       7     0.9261
  m\_dense/D/short                     Transport          1 0.7038    ett2/D/short                  Energy       7     1.3449
  m\_dense/H/short                     Transport          1 0.8146    ett2/H/short                  Energy       7     0.7140
  m\_dense/H/medium                    Transport          1 0.6703    ett2/H/medium                 Energy       7     0.9900
  m\_dense/H/long                      Transport          1 0.6777    ett2/H/long                   Energy       7     0.9873
  sz\_taxi/15T/short                   Transport          1 0.5441    ett2/W/short                  Energy       7     0.7538
  sz\_taxi/15T/medium                  Transport          1 0.5420    hierarchical\_sales/D/short   Sales        1     0.7456
  sz\_taxi/15T/long                    Transport          1 0.5148    hierarchical\_sales/W/short   Sales        1     0.7469
  sz\_taxi/H/short                     Transport          1 0.5698    hospital/M/short              Healthcare   1     0.7863
  bitbrains\_fast\_storage/5T/short    Web/CloudOps       2 0.6703    jena\_weather/10T/short       Nature       21    0.2907
  bitbrains\_fast\_storage/5T/medium   Web/CloudOps       2 0.9883    jena\_weather/10T/medium      Nature       21    0.6179
  bitbrains\_fast\_storage/5T/long     Web/CloudOps       2 0.8828    jena\_weather/10T/long        Nature       21    0.6427
  bitbrains\_fast\_storage/H/short     Web/CloudOps       2 1.0029    jena\_weather/D/short         Nature       21    1.3003
  bitbrains\_rnd/5T/short              Web/CloudOps       2 1.6286    jena\_weather/H/short         Nature       21    0.5126
  bitbrains\_rnd/5T/medium             Web/CloudOps       2 4.3766    jena\_weather/H/medium        Nature       21    0.7812
  bitbrains\_rnd/5T/long               Web/CloudOps       2 3.3422    jena\_weather/H/long          Nature       21    1.0075
  bitbrains\_rnd/H/short               Web/CloudOps       2 5.8132    kdd\_cup\_2018/D/short        Nature       1     1.2126
  bizitobs\_application/10S/short      Web/CloudOps       2 1.0614    kdd\_cup\_2018/H/short        Nature       1     0.9513
  bizitobs\_application/10S/medium     Web/CloudOps       2 1.5193    kdd\_cup\_2018/H/medium       Nature       1     1.0454
  bizitobs\_application/10S/long       Web/CloudOps       2 3.2373    kdd\_cup\_2018/H/long         Nature       1     1.0415
  bizitobs\_l2c/5T/short               Web/CloudOps       7 0.2825    m4\_daily/D/short             Econ/Fin     1     3.3280
  bizitobs\_l2c/5T/medium              Web/CloudOps       7 0.4683    m4\_hourly/H/short            Econ/Fin     1     0.7820
  bizitobs\_l2c/5T/long                Web/CloudOps       7 0.4891    m4\_monthly/M/short           Econ/Fin     1     0.9472
  bizitobs\_l2c/H/short                Web/CloudOps       7 0.4383    m4\_quarterly/Q/short         Econ/Fin     1     1.2159
  bizitobs\_l2c/H/medium               Web/CloudOps       7 0.4881    m4\_weekly/W/short            Econ/Fin     1     2.0522
  bizitobs\_l2c/H/long                 Web/CloudOps       7 0.5473    m4\_yearly/A/short            Econ/Fin     1     3.4251
  bizitobs\_service/10S/short          Web/CloudOps       2 0.7725    restaurant/D/short            Sales        1     0.6936
  bizitobs\_service/10S/medium         Web/CloudOps       2 0.9605    saugeen/D/short               Nature       1     2.8601
  bizitobs\_service/10S/long           Web/CloudOps       2 1.3776    saugeen/M/short               Nature       1     0.7681
  car\_parts/M/short                   Sales              1 0.8652    saugeen/W/short               Nature       1     1.2569
  covid\_deaths/D/short                Healthcare         1 34.3613   solar/10T/short               Energy       1     1.1669
  electricity/15T/short                Energy             1 1.0297    solar/10T/medium              Energy       1     0.8224
  electricity/15T/medium               Energy             1 0.8463    solar/10T/long                Energy       1     0.8593
  electricity/15T/long                 Energy             1 0.8917    solar/D/short                 Energy       1     1.0168
  electricity/D/short                  Energy             1 1.4781    solar/H/short                 Energy       1     0.8386
  electricity/H/short                  Energy             1 0.9649    solar/H/medium                Energy       1     0.8806
  electricity/H/medium                 Energy             1 1.0626    solar/H/long                  Energy       1     0.9416
  electricity/H/long                   Energy             1 1.1940    solar/W/short                 Energy       1     1.4687
  electricity/W/short                  Energy             1 1.6038    temperature\_rain/D/short     Nature       1     1.3853
  ett1/15T/short                       Energy             7 0.6924    us\_births/D/short            Healthcare   1     0.3505
  ett1/15T/medium                      Energy             7 1.0419    us\_births/M/short            Healthcare   1     0.7934
  ett1/15T/long                        Energy             7 1.0498    us\_births/W/short            Healthcare   1     1.0961
  ett1/D/short                         Energy             7 1.6066                                                     

  : Full results on Gift-Eval achieved by Reverso, with an overall MASE of 0.711.
:::

```{=latex}
\centering
```
::: {#tab:all_baselines_gift}
  **Family**       **Model**                  **Params**   **MASE**
  ---------------- ------------------------ ------------ ----------
  TimesFM          TimesFM-2.5                      200M      0.705
                   TimesFM-2.0                      500M      0.758
  PatchTST-FM-r1   PatchTST-FM                      260M      0.707
  Xihe             Xihe-Max                         1.5B      0.711
                   Xihe-Base                        700M      0.718
                   Xihe-Flash                       300M      0.726
                   Xihe-Lite                         94M      0.729
                   Xihe-Tiny                        9.5M      0.766
  Reverso          Reverso                          2.6M      0.711
                   Reverso-Small                    0.6M      0.726
                   Reverso-Nano                     0.2M      0.760
  TiRex            TiRex                             30M      0.716
  Chronos          Chronos2 (data leakage)          120M      0.698
                   Chronos2                         120M      0.720
                   Chronos-Bolt-S                    48M      0.822
                   Chronos-B                        200M      0.876
  FlowState        FlowState-9.1M                   9.1M      0.726
                   FlowState-2.6M                   2.6M      0.735
  Kairos           Kairos-50M                        50M      0.742
                   Kairos-23M                        23M      0.748
                   Kairos-10M                        10M      0.753
  Toto             Toto-Base                        150M      0.750
  Sundial          Sundial-B                        128M      0.750
  TTM              TTM-Finetuned                    1.0M      0.756
  TabPFN           TabPFN-TS                         11M      0.771
  YingLong         YingLong-300M                    300M      0.798
                   YingLong-110M                    100M      0.809
                   YingLong-50M                      50M      0.822
                   YingLong-6M                      6.0M      0.880
  SuperLinear      SuperLinear                      2.5M      0.857
  Moirai           Moirai-L                         311M      0.875
                   Moirai-B                          91M      0.901
                   Moirai-S                          14M      0.946
                   Moirai2                           11M      0.728

  : MASE scores and parameter counts for the models compared in Figure `\ref{fig:gift_overall_pareto}`{=latex}. Note that the data-leaked version of Chronos-2 was trained on datasets contained in Gift-Eval; it is therefore excluded from Figure `\ref{fig:gift_overall_pareto}`{=latex} to keep the zero-shot comparison fair, but it remains a strong frame of reference for state-of-the-art foundation models.
:::

```{=latex}
\centering
```
![Long sequences](figures/gift_eval_pareto_long.png){#fig:gift_long_pareto width="\\linewidth"}

`\hfill `{=latex}

```{=latex}
\centering
```
![Short sequences](figures/gift_eval_pareto_short.png){#fig:gift_short_pareto width="\\linewidth"}

```{=latex}
\centering
```
::: {#tab:long_short_split}
  **Dataset**                **Frequency**
  -------------------------- ---------------
  bitbrains\_fast\_storage   5T
  bitbrains\_rnd             5T
  bizitobs\_application      10S
  bizitobs\_l2c              5T
  bizitobs\_l2c              H
  bizitobs\_service          10S
  electricity                15T
  electricity                H
  ett1                       15T
  ett1                       H
  ett2                       15T
  ett2                       H
  jena\_weather              10T
  jena\_weather              H
  kdd\_cup\_2018             H
  loop\_seattle              5T
  loop\_seattle              H
  m\_dense                   H
  solar                      10T
  solar                      H
  sz\_taxi                   15T

  : Datasets used for Table `\ref{tab:horizon_results}`{=latex}: the subset of Gift-Eval datasets that include all three (short/medium/long) forecasting horizons.
:::

```{=latex}
\centering
```
![`bitbrains_rnd_5T/long`](figures/qualitative/bitbrains_rnd_5T_long_sample1554.png){#fig:qual_a width="\\linewidth"}

```{=latex}
\hfill
```
```{=latex}
\centering
```
![`bizitobs_l2c_H/long`](figures/qualitative/bizitobs_l2c_H_long_sample0.png){#fig:qual_b width="\\linewidth"}

```{=latex}
\vspace{1em}
```
```{=latex}
\centering
```
![`bizitobs_service_10S/long`](figures/qualitative/bizitobs_service_10S_long_sample27.png){#fig:qual_c width="\\linewidth"}

```{=latex}
\hfill
```
```{=latex}
\centering
```
![`electricity_15T/long`](figures/qualitative/electricity_15T_long_sample6576.png){#fig:qual_d width="\\linewidth"}

```{=latex}
\vspace{1em}
```
```{=latex}
\centering
```
![`loop_seattle_5T/long`](figures/qualitative/loop_seattle_5T_long_sample2152.png){#fig:qual_e width="\\linewidth"}

```{=latex}
\hfill
```
```{=latex}
\centering
```
![`m_dense_H/long`](figures/qualitative/m_dense_H_long_sample49.png){#fig:qual_f width="\\linewidth"}

```{=latex}
\vspace{1em}
```
```{=latex}
\centering
```
![`solar_10T/long`](figures/qualitative/solar_10T_long_sample0.png){#fig:qual_g width="\\linewidth"}

```{=latex}
\hfill
```
```{=latex}
\centering
```
![`sz_taxi_15T/long`](figures/qualitative/sz_taxi_15T_long_sample17.png){#fig:qual_h width="\\linewidth"}

```{=latex}
\clearpage
```
Detailed results for LTSF/TSLib
===============================

```{=latex}
\centering
```
::: {#tab:ltsf_full}
  **Family**    **Model**         **Params**   **ETTh1**   **ETTh2**   **ETTm1**   **ETTm2**                     **Elec.**                   **Weather**   **Avg**
  ------------- --------------- ------------ ----------- ----------- ----------- ----------- ----------------------------- ----------------------------- ---------
  Chronos       Chronos2-120M           120M       0.405       0.367       0.359       0.291                         0.237                         0.245     0.317
                Chronos-B               200M       0.468       0.410       0.500       0.350                         0.279                         0.315     0.387
  YingLong      YingLong-300M           300M       0.407       0.370       0.358       0.296   [0.237]{style="color: red"}                         0.245     0.319
                YingLong-110M           110M       0.408       0.366       0.356       0.296   [0.237]{style="color: red"}                         0.255     0.320
                YingLong-50M             50M       0.408       0.370       0.366       0.298   [0.237]{style="color: red"}                         0.257     0.323
                YingLong-6M             6.0M       0.412       0.382       0.368       0.302   [0.237]{style="color: red"}                         0.268     0.328
  Reverso       Reverso                 2.6M       0.404       0.365       0.367       0.304                         0.238                         0.253     0.322
                Reverso-Small           0.6M       0.404       0.370       0.376       0.309                         0.241                         0.252     0.325
                Reverso-Nano            0.3M       0.416       0.384       0.382       0.311                         0.249                         0.257     0.333
  TiRex         TiRex-30M                30M       0.417       0.362       0.365       0.302                         0.240                         0.247     0.322
  PatchTST      PatchTST                5.0M       0.431       0.379       0.381       0.315   [0.237]{style="color: red"}                         0.264     0.334
  VisionTS      VisionTS                112M       0.414       0.375       0.372       0.321   [0.237]{style="color: red"}                         0.292     0.335
  Sundial       Sundial-L               444M       0.419       0.387       0.369       0.315                         0.262                         0.275     0.338
                Sundial-S                32M       0.418       0.387       0.388       0.324                         0.265                         0.271     0.342
                Sundial-B               128M       0.434       0.387       0.377       0.320                         0.265                         0.270     0.342
  SuperLinear   Super-Linear            2.5M       0.415       0.386       0.388       0.325                         0.267                         0.275     0.343
  TimesFM       TimesFM                 200M       0.444       0.406       0.419       0.347   [0.237]{style="color: red"}   [0.222]{style="color: red"}     0.346
  Moirai        Moirai-B                 91M       0.419       0.382       0.385       0.337                         0.275                         0.282     0.347
  Timer         Timer-XL                 84M       0.417       0.388       0.392       0.336                         0.268                         0.294     0.349
  Time-MoE      Time-MoE-L              453M       0.420       0.415       0.406       0.361   [0.237]{style="color: red"}                         0.300     0.356
                Time-MoE-B              113M       0.424       0.404       0.415       0.365   [0.237]{style="color: red"}                         0.297     0.357

  : Per-dataset results for the models in Figure `\ref{fig:ltsf_pareto}`{=latex}, averaged over the horizons $\{96, 192, 336, 720\}$. Red values indicate missing results, which we impute for each horizon using the best available model.
:::

```{=latex}
\clearpage
```
Downsampling Algorithm
======================

```{=latex}
\begin{algorithm}
\caption{Downsampling}
\label{alg:downsampling}
\begin{algorithmic}[1]
\REQUIRE Time series $x$, Context length $L$.
\REQUIRE Hyperparameters: Dominance ratio $\alpha$, Significance threshold $\beta$, Min periods in window $M$.
\STATE Compute amplitude spectrum $A(f) = |\text{FFT}(x)|$
\STATE Identify peaks: $p_1 \leftarrow \max_{f>0} A(f)$ at frequency $f_1$, \quad $p_2 \leftarrow \max_{f>0, f \ne f_1} A(f)$
\STATE Compute stats: $p_{DC} \leftarrow A(0)$, \quad $\mu_A \leftarrow \text{mean}(A)$, \quad $\sigma_A \leftarrow \text{std}(A)$
\STATE \textbf{Check Seasonality Significance:}
\IF{$p_1 \ge \alpha \cdot p_2$ \AND $p_1 \ge p_{DC}$ \AND $p_1 \ge \mu_A + \beta \cdot \sigma_A$}
    \STATE Calculate primary period $S \leftarrow 1/f_1$
    \STATE Compute stride $k \leftarrow \lfloor \frac{M \cdot S}{L} \rfloor$
    \IF{$k > 1$}
        \RETURN Downsampled series $x' = [x_0, x_k, x_{2k}, \dots]$
    \ENDIF
\ENDIF
\RETURN Original series $x$
\end{algorithmic}
\end{algorithm}
```
`\label{sec:appendix_downsampling}`{=latex}

```{=latex}
\centering
```
![Downsampling Comparison. In this example, consider an input with period 4000 to a model with context length 2048, tasked with forecasting the next 720 points. In (b), the model does not have enough information to forecast the rising part of the trapezoid. However, through downsampling the input, multiple full periods now fit into the context window and the model can forecast more accurately.](figures/downsampling_comparison.png){#fig:downsampling_comparison width="0.7\\linewidth"}

Let $p_1$ be the amplitude of the highest peak at frequency $f_1$, and $p_2$ be the amplitude of the second highest peak. Let $p_{DC}$ denote the DC component (amplitude at $f=0$), and let $\mu_A, \sigma_A$ be the mean and standard deviation of the spectral amplitudes, respectively. We consider the seasonality at $f_1$ significant if and only if all the following conditions are met:

$$\begin{aligned}
    p_1 &\geq \alpha \cdot p_2 \label{eq:peak_dominance} \\
    p_1 &\geq p_{DC} \label{eq:dc_dominance} \\
    p_1 &\geq \mu_A + \beta \cdot \sigma_A \label{eq:stat_sig}\end{aligned}$$

Equation `\ref{eq:peak_dominance}`{=latex} ensures a single dominant frequency exists, mitigating ambiguity from multi-scale seasonality which we leave for future work. Equation `\ref{eq:dc_dominance}`{=latex} ensures the seasonality is stronger than the trend component, and Equation `\ref{eq:stat_sig}`{=latex} provides statistical confidence that the signal is not merely noise.

If these conditions are satisfied, we calculate the primary period $S = 1/f_1$. To give the model sufficient temporal context, we compute a downsampling stride $k$ such that approximately $M$ full periods fit within the fixed context window $L$: $$k = \left\lfloor \frac{MS}{L} \right\rfloor$$ The input sequence is then downsampled by taking every $k$-th point, expanding the model's receptive field to cover $k \cdot L$ time steps while maintaining the fixed input dimension $L$. If the spectral peaks do not meet these criteria, we do not downsample. We find that applying this algorithm to individual time series samples yields high-variance estimates, producing a different downsampling ratio for each sequence; in practice we therefore average the downsampling ratios across all series of the same frequency within the same dataset. We use $\alpha = 2$, $\beta = 4$, $M = 8$. Downsampling is not applied to short-term forecasts whose horizon is significantly shorter than the seasonal period, since it would reduce the resolution of the predictions without capturing additional seasonality information.
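The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration of Algorithm `\ref{alg:downsampling}`{=latex}, not the released implementation; the function name `maybe_downsample` and its argument names are our own, and the per-dataset averaging of ratios described above is omitted.

```python
import numpy as np

def maybe_downsample(x, L, alpha=2.0, beta=4.0, M=8):
    """Spectral downsampling heuristic (sketch of Algorithm 1).

    x: 1-D time series; L: model context length.
    Returns a strided copy of x if a dominant seasonality is detected,
    otherwise returns x unchanged.
    """
    A = np.abs(np.fft.rfft(x))      # amplitude spectrum; bin 0 is the DC term
    p_dc = A[0]
    body = A[1:]                    # exclude DC when searching for peaks
    i1 = int(np.argmax(body))       # highest non-DC peak
    p1 = body[i1]
    rest = body.copy()
    rest[i1] = -np.inf
    p2 = np.max(rest)               # second-highest peak
    mu, sigma = A.mean(), A.std()

    # Seasonality significance checks (peak dominance, DC dominance,
    # statistical significance)
    if p1 >= alpha * p2 and p1 >= p_dc and p1 >= mu + beta * sigma:
        f1 = (i1 + 1) / len(x)      # peak frequency in cycles per step
        S = 1.0 / f1                # primary period in time steps
        k = int(M * S // L)         # stride so ~M periods span the window
        if k > 1:
            return x[::k]
    return x
```

For the example in Figure `\ref{fig:downsampling_comparison}`{=latex} (period $4000$, context $L = 2048$), the stride is $k = \lfloor 8 \cdot 4000 / 2048 \rfloor = 15$, so the downsampled context covers $15 \cdot 2048 = 30720$ original time steps.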

[^1]: We unnormalize the model's output for prediction.

[^2]: https://huggingface.co/datasets/Salesforce/GiftEvalPretrain

[^3]: The official zero-shot performance of pretrained TTM-R2 lags significantly behind other baselines (MASE=1.02), so we compare against a stronger finetuned model (TTM-R2-Finetuned) which achieves MASE of 0.756.

[^4]: For instance, the values for Electricity for YingLong were imputed using the MAE values obtained by Chronos-2.
