---
abstract: |
  Scaling laws for large language models (LLMs) have provided useful guidance in training ever larger models for predictable performance gains. Time series forecasting shares a similar sequential structure to language, and is amenable to large-scale transformer architectures. Here we show that foundational decoder-only time series transformer models exhibit analogous scaling behavior to LLMs, with architectural details (aspect ratio and number of heads) having a minimal effect over broad ranges. We assemble a large corpus of heterogeneous time series data on which to train, and establish, for the first time, power-law scaling with parameter count, dataset size, and training compute, spanning five orders of magnitude.
author:
- |
  Thomas D. P. Edwards\
  Johns Hopkins University\
  `tedwar42@jhu.edu`\
  `\And`{=latex} James Alvey\
  University of Cambridge\
  University of Amsterdam\
  `j.b.g.alvey@uva.nl`\
  `\AND`{=latex} Justin Alsing\
  Stockholm University\
  Calda AI\
  `justin@calda.ai`\
  `\And`{=latex} Nam H. Nguyen\
  Capital One\
  `nam.nguyen@capitalone.com`\
  `\And`{=latex} Benjamin D. Wandelt\
  Institut d'Astrophysique de Paris\
  CCA, Flatiron Institute\
  `bwandelt@iap.fr`\
bibliography:
- main.bib
title: 'Scaling-laws for Large Time-series Models'
---

```{=latex}
\maketitle
```
Introduction {#sec:intro}
============

Time-series forecasting is fundamental to decision-making and scientific inference across all domains involving time-ordered observations. In fact, making probabilistic forecasts given past data (whether explicitly or implicitly) arguably underpins every human decision [@kording2004bayesian; @doya2007bayesian; @doya2008modulators; @funamizu2016neural; @lindig2022bayes]. In industrial and scientific settings, time-series forecasting has traditionally involved supervised training of either statistical models (e.g., ARIMA, GARCH, state-space models, and others; see [@west2013bayesian; @hyndman2018forecasting] for reviews), bespoke dynamical models based on domain-specific knowledge, or more recently deep-learning based approaches trained for a specific forecasting task (see [@torres2021deep] for a review). While these approaches have formed the bedrock of time-series analysis up until now, key challenges and limitations remain: statistical models often fail to describe and capture the latent processes underlying the data, hampering their predictive utility; developing specialized problem-specific models requires considerable investment in human time and resources; and supervised deep-learning approaches trained on a single dataset are typically only useful in the data-rich regime, and generalize poorly to other problems.

The emergence of large language models (LLMs; [@devlin2018bert; @brown2020language; @touvron2023llama; @chung2024scaling]) and computer vision models [@dosovitskiy2020image; @radford2021learning; @ramesh2021zero; @yan2021videogpt; @arnab2021vivit; @he2022masked; @li2023blip] with zero-shot prediction capabilities has sparked an interest in developing foundation models for time-series --- general-purpose forecasting models, pre-trained on a large and diverse corpus of time-series data, aimed at achieving state-of-the-art (SOTA) zero-shot forecasting performance across many domain areas [@2023arXiv231010688D; @2024arXiv240203885G; @2023arXiv231008278R; @2023arXiv231003589G; @2022arXiv221114730N; @2024arXiv240202592W; @2023arXiv231005063W; @2023arXiv230512095X; @2024arXiv240210198I; @salinas2020deepar; @oreshkin2019n; @oreshkin2021meta; @gruver2024large; @ma2023survey; @2024arXiv240307815F; @2023arXiv230514406K]. Large time-series models (LTMs) already achieve zero-shot prediction capability similar to or better than baseline statistical or domain-specific models in many areas [@2023arXiv231010688D; @2024arXiv240203885G; @2023arXiv231008278R; @2023arXiv231003589G; @2022arXiv221114730N; @2024arXiv240202592W; @2023arXiv231005063W; @2023arXiv230512095X; @2024arXiv240210198I; @2024arXiv240307815F].

Underpinning the investment in, and subsequent success of, LLMs and large-scale computer vision models was the demonstration of neural scaling laws [@originalscaling; @tan2019efficientnet; @tan2021efficientnetv2; @raghu2021vision; @riquelme2021scaling; @he2022masked; @henighan2020scaling]. The observed power-law scaling of test loss with model size, compute resources, and training set size has provided a basis for predicting the expected gains from different efforts, helping the community allocate resources appropriately to achieve performance breakthroughs. Given the qualitative differences in data and modeling challenges, the existence of neural scaling laws for time-series does not follow automatically from the language and computer vision results. Establishing similarly favourable scaling laws for LTMs would serve as both motivation and guide in the pursuit of foundation models for time-series forecasting.

**Contributions:** We establish for the first time that LTMs enjoy power-law scaling akin to that observed for language and computer vision models. We train decoder-only transformer models (with architectures tailored to time-series forecasting; §`\ref{sec:data_methods}`{=latex}) on a large, diverse, and well-balanced dataset comprising around 8 billion data points across 30,211,687 individual time-series, drawn from $38$ qualitatively distinct data sources from varied areas (see §`\ref{sec:data_methods}`{=latex}). We demonstrate power-law-like scaling of model performance with model size, compute, and dataset size (Fig. `\ref{fig:scaling}`{=latex}), finding similar scaling behavior in three key measures of model performance: the mean-square error (MSE), characterizing the accuracy of point (posterior-mean) forecasts; the Continuous Ranked Probability Score (CRPS [@hersbach2000decomposition]), characterizing the fidelity of the probabilistic predictions (i.e., coverage of the forecast posterior density); and the log-likelihood loss, characterizing the Kullback-Leibler (KL) divergence between the model and data generative distributions.

Data and Methods {#sec:data_methods}
================

**Data:** The development of a foundation model for time-series forecasting is predicated on the availability of a sufficiently large, diverse, and well-balanced dataset to train on. We constructed a corpus of time-series data comprising around $8$ billion data points drawn from $38$ varied data sources (see Tab. `\ref{tab:dataset_stats}`{=latex}). For the purpose of this study, our focus was to ensure that our dataset is: large enough that for our largest models ($\sim 100$M parameters) we are still operating in the $\sim$infinite data limit (c.f. [@originalscaling]); as diverse as possible given the practical limitations on publicly available data; and well-balanced, such that no individual dataset comprises more than roughly $15\%$ of the total number of data points. Our resulting dataset is competitive with the SOTA in terms of both diversity and size[^1], while covering a wide variety of sampling frequencies, record lengths, dynamic ranges, and underlying latent-process phenomenology. We focus exclusively on univariate time-series forecasting, and leave the study of scaling laws for multivariate LTMs to future work. Data sources, the balancing procedure, and pre-processing steps are detailed in Appendix `\ref{app:data}`{=latex}.

**Model Architecture:** We use decoder-only transformer models with self-attention as the primary architecture throughout, with a context length of $256$ data points and ReLU activation functions. Following the performance gains shown in Ref. [@2023arXiv231005063W], we use a learnable encoding rather than the sinusoidal positional encoding used in the original transformer model [@vaswani2017attention]. Both the learned positional encoding and embedding are simple linear layers going from one input to $d_\mathrm{m}$ outputs.
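
As a concrete sketch, both maps can be written as affine transforms from a scalar input to $d_\mathrm{m}$ outputs. The use of the raw time index as the positional input, and the specific shapes below, are our illustrative assumptions (the paper only specifies simple linear layers from one input to $d_\mathrm{m}$ outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
d_m, context = 64, 256

# Value embedding and learned positional encoding: each a linear map 1 -> d_m.
W_val, b_val = rng.normal(size=(d_m, 1)), np.zeros(d_m)
W_pos, b_pos = rng.normal(size=(d_m, 1)), np.zeros(d_m)

x = rng.normal(size=context)                 # one univariate context window
positions = np.arange(context, dtype=float)  # time indices 0..255

# One d_m-dimensional token per time step: embedded value + learned position.
tokens = (x[:, None] @ W_val.T + b_val) + (positions[:, None] @ W_pos.T + b_pos)
```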

**Distribution Head:** We use a Student's-$t$ distribution head, where the mean, variance, and degrees of freedom are each modelled by a separate dense network with four hidden layers of dimension $d_\mathrm{m}$. The Student's-$t$ distribution allows us to model heavy-tailed data, which in our experiments enables significantly more stable training than a Gaussian head or MSE loss. We note that in reality, time-series data and processes exhibit diverse distributional characteristics, and a more expressive distribution head (e.g., a mixture model, normalizing flow, or diffusion model) is well-motivated. We leave the exploration of scaling laws under more expressive distribution heads to future work. We use a negative log-likelihood (KL) loss for training throughout.
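
A minimal sketch of the per-observation negative log-likelihood under a location-scale Student's-$t$, and of why heavy tails help: an outlier incurs a far smaller loss (and hence a smaller gradient) than under a Gaussian. The specific numbers below are illustrative, not the paper's:

```python
import math

def student_t_nll(x, mu, sigma, nu):
    """Negative log-likelihood of one observation under a location-scale Student's-t."""
    z = (x - mu) / sigma
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return -(log_norm - (nu + 1) / 2 * math.log1p(z * z / nu))

def gaussian_nll(x, mu, sigma):
    """Gaussian negative log-likelihood, for comparison."""
    z = (x - mu) / sigma
    return 0.5 * z * z + 0.5 * math.log(2 * math.pi) + math.log(sigma)

# An 8-sigma outlier is penalized far less by the heavy-tailed head,
# one intuition for the more stable training noted above:
nll_t = student_t_nll(8.0, 0.0, 1.0, nu=3.0)  # ~7.2 nats
nll_g = gaussian_nll(8.0, 0.0, 1.0)           # ~32.9 nats
```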

**Parameter Counting:** With this setup, the model architecture is defined by the following parameters: the number of output dimensions $\theta_{\mathrm{out}}$, the input/output size of the linear layers in the self-attention $d_\mathrm{m}$, the number of heads $N_{\mathrm{heads}}$, the hidden layer size of the linear layers directly after the self-attention $d_{\mathrm{ff}}$, and the number of decoder layers $N_\mathrm{l}$. Throughout this work we fix $d_\mathrm{m} = d_{\mathrm{ff}}$ and treat all trainable parameters (including weights and biases of all layers) equally in the parameter counting. As shown in Fig. `\ref{fig:scaling}`{=latex}, we explore models with $\sim 10^3$ to $\sim 10^8$ trainable parameters.
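
The dominant term in such a count scales as $N_\mathrm{l} d_\mathrm{m}^2$. A rough sketch, under our own assumption of a standard decoder block (four $d_\mathrm{m}\times d_\mathrm{m}$ attention projections plus a two-layer feed-forward block, biases included; the embedding, positional encoding, and distribution head are omitted here):

```python
def approx_param_count(d_m, n_layers, d_ff=None):
    """Rough trainable-parameter count for a decoder-only transformer trunk.

    Assumes four d_m x d_m projections (Q, K, V, output) per attention block
    and a two-layer feed-forward block; weights and biases both counted.
    """
    d_ff = d_ff if d_ff is not None else d_m  # the paper fixes d_ff = d_m
    attention = 4 * (d_m * d_m + d_m)
    feed_forward = (d_m * d_ff + d_ff) + (d_ff * d_m + d_m)
    return n_layers * (attention + feed_forward)

# Doubling d_m roughly quadruples the count (the d_m^2 terms dominate):
small, big = approx_param_count(128, 6), approx_param_count(256, 6)
```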

**Learning rate and architecture sensitivity:** To extract reliable scaling laws, we need to determine sensitivity and robustness to the learning rate (LR) schedule and architecture choices. We use a linear warm up followed by sinusoidal decay for the learning rate scheduling, and find that the model performance clearly depends on the maximum LR reached at the end of the warm-up. To ensure robustness to the maximum LR, we fit a power-law to the best model at each parameter size to estimate the optimal maximum LR as a function of parameter count, shown in Fig. `\ref{fig:LR_scaling}`{=latex} (Appendix `\ref{app:lr_dep}`{=latex}).
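
The schedule described above can be sketched as follows; the warm-up length, total steps, and decay floor in the example are illustrative assumptions:

```python
import math

def learning_rate(step, lr_max, warmup_steps, total_steps, lr_min=0.0):
    """Linear warm-up to lr_max, then a sinusoidal (cosine-shaped) decay to lr_min."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# The LR peaks at the end of warm-up -- the "maximum LR" the scaling fit targets:
peak = learning_rate(1_000, lr_max=1e-3, warmup_steps=1_000, total_steps=10_000)
```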

Figure `\ref{fig:model_shape_scaling}`{=latex} (Appendix `\ref{app:lr_dep}`{=latex}) shows how the minimum CRPS varies as a function of the aspect ratio $d_\mathrm{m}/N_\mathrm{l}$ (left panel) and the number of attention heads, $N_{\mathrm{heads}}$ (right panel). Performance is $\sim$insensitive to the number of heads, and only weakly sensitive to the aspect ratio for aspect ratios $\lesssim 100$ (after which performance drops steeply). We note that this is analogous to the weak architecture sensitivity observed for LLMs [@originalscaling]. For the main parameter-, compute-, and data-scaling runs, we fix the number of heads to four, and keep the aspect ratio $<70$. See Appendix `\ref{app:training}`{=latex} for further training details.

Results {#sec:results}
=======

```{=latex}
\centering
```
![**Test Loss Scaling Laws:** Minimum MSE (left), CRPS (middle) and log-likelihood (right) in-sequence test metrics as a function of the number of parameters (top), compute (middle), and dataset size (bottom).](parameters_vs_loss_studentT.png "fig:"){#fig:scaling width="0.96\\linewidth"} ![**Test Loss Scaling Laws:** Minimum MSE (left), CRPS (middle) and log-likelihood (right) in-sequence test metrics as a function of the number of parameters (top), compute (middle), and dataset size (bottom).](compute_scaling_plot.png "fig:"){#fig:scaling width="0.96\\linewidth"} ![**Test Loss Scaling Laws:** Minimum MSE (left), CRPS (middle) and log-likelihood (right) in-sequence test metrics as a function of the number of parameters (top), compute (middle), and dataset size (bottom).](data_vs_loss_studentT.png "fig:"){#fig:scaling width="0.96\\linewidth"}

Scaling as a function of parameter count $N_\mathrm{p}$, dataset size $\mathcal{D}$, and compute $\mathcal{C}$ is summarized in Fig. `\ref{fig:scaling}`{=latex}. For each scaling relation, we fit a power law of the form $\ln L(A) = -B_0\ln A + B_0 \ln A_0$, where $L$ is the objective function (MSE, CRPS, or log-likelihood) and $A$ is the scaled quantity (i.e., parameter count, dataset size, or compute). The fitted parameter values are given in Tab. `\ref{tab:fits}`{=latex} (Appendix `\ref{app:pl_fits}`{=latex}). Where broken power-law-like scaling is observed, we report the power-law fit after the break only, since this is the relevant quantity to motivate extrapolation to larger models / datasets / compute resources.
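
The fit itself is a straight line in log-log space. A minimal sketch (the use of `np.polyfit` as the fitting routine is our choice, not the paper's):

```python
import numpy as np

def fit_power_law(A, L):
    """Fit ln L = -B0 ln A + B0 ln A0 and return (B0, A0)."""
    slope, intercept = np.polyfit(np.log(A), np.log(L), 1)
    B0 = -slope
    A0 = np.exp(intercept / B0)
    return B0, A0

# Recover the parameters from an exact synthetic power law L = (A / A0)^{-B0}:
A = np.logspace(3, 8, 50)      # e.g. parameter counts 10^3 .. 10^8
L = (A / 2.0) ** -0.3
B0, A0 = fit_power_law(A, L)   # -> B0 ~ 0.3, A0 ~ 2.0
```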

**Parameter scaling:** Fig. `\ref{fig:scaling}`{=latex} (top row) shows the minimum in-sequence test loss (MSE, CRPS, and log-likelihood[^2]) as a function of parameter count, showing $\sim$power-law behavior over nearly five orders of magnitude in model size. A mild break is observed in the power-law behavior in both the MSE and CRPS test losses, indicating qualitatively different behavior for smaller models. In contrast, little or no break is seen in the log-likelihood scaling; this qualitative difference relative to MSE and CRPS is likely due to the log-likelihood being more sensitive to variations in the tails of the forecast distribution (see e.g.,  [@BJERREGARD2021100058]). The observed scaling over many orders of magnitude demonstrates that LTMs are likely to reach SOTA performance given enough data and model size.

**Data Scaling:** Extracting reliable scaling behavior with dataset size requires keeping the data diversity fixed, i.e., each dataset's relative contribution to the total data count should remain the same under scaling (see Tab. `\ref{tab:dataset_stats}`{=latex}). For time-series that are significantly longer than our context length, we use a randomly chosen portion, $f_d$, of each time-series, while for series that would become shorter than our context length once cut, we instead randomly drop the entire series with probability equal to $1-f_d$. We compute the test loss over the full test set to allow direct comparison between runs with different values of $f_d$, and to reduce the noise on the test loss in the small (scaled) dataset limit.
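
The per-series subsampling rule above can be sketched as follows (the contiguous-window placement and the RNG handling are our illustrative assumptions):

```python
import random

CONTEXT_LEN = 256

def subsample_series(series, f_d, rng):
    """Scale one time-series by factor f_d while preserving dataset balance.

    Long series keep a random contiguous fraction f_d; series that would fall
    below the context length once cut are instead dropped entirely with
    probability 1 - f_d (returning None), and otherwise kept whole.
    """
    keep = int(f_d * len(series))
    if keep >= CONTEXT_LEN:
        start = rng.randrange(len(series) - keep + 1)
        return series[start:start + keep]
    return series if rng.random() < f_d else None

rng = random.Random(0)
cut = subsample_series(list(range(10_000)), 0.5, rng)  # a 5,000-point window
```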

Results are shown in the bottom row of Fig. `\ref{fig:scaling}`{=latex}, where we train a $\sim 20$M parameter model using the optimal maximum learning rate found during the parameter-scaling exploration, with early stopping. We find power-law scaling across four orders of magnitude in all three performance measures.

**Compute Scaling:** The compute at any given stage in the training process is given by $\mathcal{C} = 6BN_\mathrm{p}L_\mathrm{seq}$, where $B$ is the batch size, $N_\mathrm{p}$ is the number of parameters in the model, and $L_\mathrm{seq}$ is the context length [@originalscaling]. Test losses as a function of compute are shown in Fig. `\ref{fig:scaling}`{=latex} (middle row), where the scaling law is obtained from the minimum test loss attained at any given value of $\mathcal{C}$. Although we see a significant amount of noise in the loss functions during training, there is a clear overall trend towards lower test losses for higher compute, which is well-described by a power law. Similarly to the parameter scaling we see a mild break at low compute values for both the MSE and CRPS test losses. Note that while the MSE and CRPS metrics appear to be approximately converged over the compute range considered, the log-likelihood may not be fully converged; additional training may be needed to obtain accurate compute scaling-law fits for the log-likelihood in the large model limit.
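
For concreteness, the compute bookkeeping per optimizer step (the $\sim 20$M-parameter / batch-size-256 example below is our own illustration):

```python
def flops_per_step(batch_size, n_params, seq_len):
    """C = 6 * B * N_p * L_seq: forward + backward FLOPs for one training step."""
    return 6 * batch_size * n_params * seq_len

# e.g. a ~20M-parameter model, batch size 256, context length 256:
c_step = flops_per_step(256, 20_000_000, 256)  # ~7.9e12 FLOPs per step
```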

Discussion {#sec:discussion}
==========

We have focused on evaluating models by their in-sequence (next-step) test loss, rather than explicitly assessing a model's ability to forecast further into the future. The implicit assumption that a model with good in-sequence predictions should naturally be able to forecast into the future is theoretically and empirically well-motivated: as the modelled posterior predictive distribution for the next value improves, the errors accumulated when auto-regressively rolling those predictions out into the future should also shrink. In App. `\ref{app:forecast}`{=latex}, we show some clear examples of how forecast roll-out becomes increasingly coherent with increasing model size. We leave the study of scaling laws based on forecasting ability over different time horizons to future work.

We have detailed the specific scaling laws for a decoder-only transformer with self-attention. However, it would be interesting to explore how modifications to this architecture might improve model scaling. In particular, much of the recent progress in using LTMs [@2023arXiv231010688D; @2024arXiv240203885G; @2023arXiv231008278R; @2023arXiv231003589G; @2022arXiv221114730N; @2024arXiv240202592W; @2023arXiv231005063W; @2023arXiv230512095X; @2024arXiv240210198I; @2024arXiv240307815F] has involved various changes to transformer architectures to make them more suited to time-series data. We advocate for comparative scaling law studies as new architectures are introduced, to allow the community to evaluate which model architectures will eventually reach SOTA zero-shot prediction capabilities.

When experimenting with data scaling, we found it was critical to scale the training data in such a way as to preserve the data diversity; approaches to data scaling that did not preserve data diversity failed to reveal any clear scaling behaviour. Given the importance of data diversity in establishing data scaling laws, and in training SOTA pre-trained foundation models in general, developing a robust framework for quantifying data diversity would be of great utility to the field.

One scaling law that we have not explored in this work (due to computational limitations) is performance as a function of increasing context length. Multiple studies (e.g., [@2024arXiv240307815F; @2024arXiv240305530G]), both for LLMs and LTMs, have shown that increasing the context length significantly improves both in-sequence prediction and forecasting, and a recent study [@2024arXiv240515124S] finds interesting scaling behaviour of LTMs with context length. We will explore context-length scaling in future work.

We have focused on univariate time-series data. However, a general-purpose foundation model for time-series forecasting should be able to cope with the more general setting of multivariate time-series prediction, with multiple exogenous covariates. Establishing scaling laws for multivariate time-series forecasting will be an important extension of this work; it demands the assembly of a large and diverse training set of multivariate data, each with their own exogenous factors.

```{=latex}
\bibliographystyle{ieeetr}
```
```{=latex}
\appendix
```
```{=latex}
\newpage
```
Dataset Details {#app:data}
===============

::: {#tab:dataset_stats}
|                        | **Monash** | **Climate** | **Energy** | **Traffic** | **Finance** | **Audio** | **Total** |
|------------------------|-----------:|------------:|-----------:|------------:|------------:|----------:|----------:|
| **Datasets**           | 23         | 15          | 2          | 5           | 2           | 3         | 38        |
| **\# of data points**  | 503M       | 1.56B       | 2.5B       | 1.5B        | 42.6M       | 1.98B     | 8.13B     |
| **% of data**          | 6.18%      | 19.19%      | 30.75%     | 18.45%      | 0.52%       | 24.35%    | 100%      |

  : **Dataset summary**. M indicates million and B indicates billion.
:::

In this section we detail the various sources that form the basis of our dataset and the choices made during its construction and re-balancing. Constructing a training set for establishing foundational scaling relations requires three key considerations. Firstly, the dataset should be large enough so that for the largest models trained, we are still operating in the $\sim$infinite data limit (see e.g., [@originalscaling]). Secondly, the dataset needs to be sufficiently diverse so that any results are representative of the foundation-model regime, covering a large volume of the space of time-series phenomenology. Thirdly, the dataset needs to be balanced, so that any scaling results are representative of foundation model behaviour and not tied to performance gains for a single or handful of dominant dataset(s).

Taking inspiration from large language models [@originalscaling], we therefore aimed to gather $\mathcal{O}(10^{10})$ data points from a variety of domains. We note that treating a single floating-point number on a similar footing to a language token is not necessarily a good comparison; a language token can carry significantly more semantic meaning than a floating-point number. The continual growth of open-source time-series datasets in both size and diversity will enable increasingly robust neural scaling studies.

Before detailing our particular sources, we would like to emphasize that a large corpus of time-series data is publicly available but is not currently formatted for easy downloading and processing. Ref. [@2024arXiv240202592W] was the first paper to open-source a large dataset[^3], setting a trend for improved training and benchmarking of foundational time-series models. However, significant work is still required to expand available datasets in size and diversity to reach the same maturity as LLMs (large-scale, SOTA language models are trained on well over a trillion tokens).

We now discuss each dataset presented in Tab. `\ref{tab:dataset_stats}`{=latex}.

All data used throughout this work has been labelled/licensed as free to use for non-commercial purposes with the appropriate citations. We have included the appropriate citations where necessary below.

Monash
------

The Monash dataset has been the default repository of open-source time-series data used by the academic community for some time [@godahewa2021monash]. It contains data from a huge variety of sources, covering a wide range of characteristics. For this work we exclude series that are either too short or particularly noisy.[^4] We are then left with a total of 23 different sources, which add up to $\sim 500\mathrm{M}$ data points; details are given in Tab. `\ref{tab:monash}`{=latex}.

::: {#tab:monash}
| **Dataset**            | **Frequency**       | **Number of Series** | **Number of Data Points** |
|------------------------|---------------------|---------------------:|--------------------------:|
| London Smart Meters    | Half Hourly         | 5,560                | 166.5M                    |
| Wind Farms             | Every Minute        | 339                  | 172.1M                    |
| Wind Power             | 4 Second Intervals  | 1                    | 7.4M                      |
| Solar Power            | 4 Second Intervals  | 1                    | 7.4M                      |
| Oikolab Weather        | Hourly              | 8                    | 0.8M                      |
| Elecdemand             | Half Hourly         | 1                    | 17.5k                     |
| Kaggle Web Traffic     | Daily               | 145,063              | 116.5M                    |
| Tourism Quarterly      | Quarterly           | 427                  | 42.5k                     |
| Tourism Monthly        | Monthly             | 366                  | 109.3k                    |
| CIF 2016               | Monthly             | 72                   | 7.1k                      |
| Traffic Weekly         | Weekly              | 862                  | 89.6k                     |
| Traffic Hourly         | Hourly              | 862                  | 15.1M                     |
| Australian Electricity | Half Hourly         | 5                    | 1.2M                      |
| Sunspot                | Daily               | 1                    | 73.9k                     |
| Hospital               | Monthly             | 767                  | 64.4k                     |
| NN5 Daily              | Daily               | 111                  | 87.8k                     |
| NN5 Weekly             | Weekly              | 111                  | 12.5k                     |
| M4 Hourly              | Hourly              | 414                  | 373.4k                    |
| Fred MD                | Monthly             | 107                  | 77.9k                     |
| Solar Weekly           | Weekly              | 137                  | 7.1k                      |
| Solar 10 Minutes       | 10 Minute Intervals | 137                  | 7.2M                      |
| Electricity Weekly     | Weekly              | 321                  | 50.1k                     |
| Electricity Hourly     | Hourly              | 321                  | 8.4M                      |

  : **Monash Data:** For each dataset we list the sampling frequency, the total number of series, and the total number of data points.
:::

Climate
-------

Our climate dataset, made up of around 1.5B data points, has two primary sources: the National Oceanic and Atmospheric Administration (NOAA) and the fifth generation European Centre for Medium-Range Weather Forecasts atmospheric reanalysis of the global climate (ERA5). Each source provides approximately 750M data points split across a variety of observables and time frames.

We note here that since the global climate is a correlated system, forecasting a single variable into the future whilst ignoring the evolution of the rest of the system is intrinsically difficult (maybe impossible in some cases). Nevertheless, each time series can provide important information from which the foundation model can learn correlations. Moreover, some seasonal trends are very stable and predictable from a single time series. Future work should carefully consider how to include climate data in a way that allows the model to exploit correlations inherent in the data [@graphcast; @fourcastnet].

::: {#tab:NOAA}
  **Dataset**                   `\centering`{=latex}`\arraybackslash`{=latex} **Frequency**   `\centering`{=latex}`\arraybackslash`{=latex} **Number of Series**   `\centering`{=latex}`\arraybackslash`{=latex} **Length**   `\centering`{=latex}`\arraybackslash`{=latex} **Number of Data Points**
  ----------------------------- ------------------------------------------------------------- -------------------------------------------------------------------- ---------------------------------------------------------- -------------------------------------------------------------------------
  SST Mean                      `\centering`{=latex}`\arraybackslash`{=latex} Daily           `\centering`{=latex}`\arraybackslash 582241`{=latex}                 `\centering`{=latex}`\arraybackslash 365`{=latex}          `\centering`{=latex}`\arraybackslash`{=latex} 212.5M
  SST Anomalies                 `\centering`{=latex}`\arraybackslash`{=latex} Daily           `\centering`{=latex}`\arraybackslash 581249`{=latex}                 `\centering`{=latex}`\arraybackslash 365`{=latex}          `\centering`{=latex}`\arraybackslash`{=latex} 212.1M
  SST Long Term Average         `\centering`{=latex}`\arraybackslash`{=latex} Daily           `\centering`{=latex}`\arraybackslash 218211`{=latex}                 `\centering`{=latex}`\arraybackslash 365`{=latex}          `\centering`{=latex}`\arraybackslash`{=latex} 79.6M
  SST Monthly Average           `\centering`{=latex}`\arraybackslash`{=latex} Monthly         `\centering`{=latex}`\arraybackslash 72730`{=latex}                  `\centering`{=latex}`\arraybackslash 509`{=latex}          `\centering`{=latex}`\arraybackslash`{=latex} 37M
  SST Weekly Average            `\centering`{=latex}`\arraybackslash`{=latex} Weekly          `\centering`{=latex}`\arraybackslash 72689`{=latex}                  `\centering`{=latex}`\arraybackslash 2214`{=latex}         `\centering`{=latex}`\arraybackslash`{=latex} 161M
  Ice Mean                      `\centering`{=latex}`\arraybackslash`{=latex} Daily           `\centering`{=latex}`\arraybackslash 63971`{=latex}                  `\centering`{=latex}`\arraybackslash 365`{=latex}          `\centering`{=latex}`\arraybackslash`{=latex} 23M
  Ice Long Term Average         `\centering`{=latex}`\arraybackslash`{=latex} Daily           `\centering`{=latex}`\arraybackslash 12451`{=latex}                  `\centering`{=latex}`\arraybackslash 365`{=latex}          `\centering`{=latex}`\arraybackslash`{=latex} 4.5M
  Ice Monthly Average           `\centering`{=latex}`\arraybackslash`{=latex} Monthly         `\centering`{=latex}`\arraybackslash 5363`{=latex}                   `\centering`{=latex}`\arraybackslash 509`{=latex}          `\centering`{=latex}`\arraybackslash`{=latex} 2.7M
  Radiation Long Term Average   `\centering`{=latex}`\arraybackslash`{=latex} Daily           `\centering`{=latex}`\arraybackslash 6622`{=latex}                   `\centering`{=latex}`\arraybackslash 365`{=latex}          `\centering`{=latex}`\arraybackslash`{=latex} 2.4M

  : **NOAA Data:** For each dataset we list the sampling frequency, the total number of series, the length of each series, and the total number of data points.
:::

```{=latex}
\vskip 6pt
```
**NOAA:** We primarily gather data from the NOAA high-resolution blended analysis of daily sea surface temperature (SST), which includes both temperature and ice-level measurements on a $0.25^{\circ}$ grid worldwide.[^5] Weather at different points of the grid is intrinsically correlated, especially at such small grid spacings. We therefore downsample the data by a factor of three by randomly choosing grid points without replacement (we do this independently for each dataset).
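The grid subsampling can be sketched as follows; `subsample_grid` and its arguments are illustrative names of our own, not from any released pipeline:

```python
import numpy as np

def subsample_grid(series_ids, factor=3, seed=0):
    """Keep a random 1/factor of the grid-point series, chosen
    without replacement (done independently for each dataset)."""
    rng = np.random.default_rng(seed)
    n_keep = len(series_ids) // factor
    keep = rng.choice(len(series_ids), size=n_keep, replace=False)
    return [series_ids[i] for i in sorted(keep)]
```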

To ensure our data covers a wide range of time scales and variability, we pick the variety of observables shown in Tab. `\ref{tab:NOAA}`{=latex}. For the daily data we pick 8 years of data, each separated by approximately five years (spread out to maximize data diversity, i.e., to minimize year-to-year correlations), but skip leap years for easier data processing (so all arrays are 365 elements long). The final year selection is 1985, 1990, 1995, 2001, 2005, 2010, 2015, and 2021. This dataset could easily be expanded by drawing on more of the 40 years of available data.

For additional diversity we use the same method to extract outgoing long-wave radiation time series from <https://downloads.psl.noaa.gov/Datasets/uninterp_OLR/>. This is shown in the final row of Tab. `\ref{tab:NOAA}`{=latex}.

```{=latex}
\renewcommand{\arraystretch}{1.2}
```
```{=latex}
\centering
```
::: {#tab:ERA5}
  **Dataset**          `\centering`{=latex}`\arraybackslash`{=latex} **Frequency**      `\centering`{=latex}`\arraybackslash`{=latex} **Number of Series**   `\centering`{=latex}`\arraybackslash`{=latex} **Length**   `\centering`{=latex}`\arraybackslash`{=latex} **Number of Data Points**
  -------------------- ---------------------------------------------------------------- -------------------------------------------------------------------- ---------------------------------------------------------- -------------------------------------------------------------------------
  Sea Level Pressure   `\centering`{=latex}`\arraybackslash 4`{=latex} Hour Intervals   `\centering`{=latex}`\arraybackslash 63094`{=latex}                  `\centering`{=latex}`\arraybackslash 2190`{=latex}         `\centering`{=latex}`\arraybackslash`{=latex} 138M
  2m Temp.             `\centering`{=latex}`\arraybackslash 4`{=latex} Hour Intervals   `\centering`{=latex}`\arraybackslash 63190`{=latex}                  `\centering`{=latex}`\arraybackslash 2190`{=latex}         `\centering`{=latex}`\arraybackslash`{=latex} 138M
  2m Dewpoint Temp.    `\centering`{=latex}`\arraybackslash 4`{=latex} Hour Intervals   `\centering`{=latex}`\arraybackslash 63123`{=latex}                  `\centering`{=latex}`\arraybackslash 2190`{=latex}         `\centering`{=latex}`\arraybackslash`{=latex} 138M
  Surface Pressure     `\centering`{=latex}`\arraybackslash 4`{=latex} Hour Intervals   `\centering`{=latex}`\arraybackslash 63263`{=latex}                  `\centering`{=latex}`\arraybackslash 2190`{=latex}         `\centering`{=latex}`\arraybackslash`{=latex} 139M
  10m V Wind Comp.     `\centering`{=latex}`\arraybackslash 4`{=latex} Hour Intervals   `\centering`{=latex}`\arraybackslash 63263`{=latex}                  `\centering`{=latex}`\arraybackslash 2190`{=latex}         `\centering`{=latex}`\arraybackslash`{=latex} 139M
  10m U Wind Comp.     `\centering`{=latex}`\arraybackslash 4`{=latex} Hour Intervals   `\centering`{=latex}`\arraybackslash 63220`{=latex}                  `\centering`{=latex}`\arraybackslash 2190`{=latex}         `\centering`{=latex}`\arraybackslash`{=latex} 138M

  : **ERA5 Data:** Columns as in Tab. `\ref{tab:NOAA}`{=latex}. The number of series differs between variables because of the randomness in the subsampling.
:::

**ERA5:** We take a similar approach when gathering and processing the ERA5 data. Here, though, we focus on higher frequencies by using a single year of data (2001) sampled every four hours. We additionally use different data variables (the six most popular) to ensure that the data features differ from those present in the NOAA data. The ERA5 data is also originally on a $0.25^{\circ}$ global grid, which we randomly downsample by a factor of four. Details are given in Tab. `\ref{tab:ERA5}`{=latex}.

Energy
------

For the energy dataset, we use the benchmark dataset prepared in the `BuildingsBench` data release [@emami2023buildingsbench]. In particular, we sample 2.5B data points from the full dataset (which totals over 15B individual data points). These 2.5B data points, which constitute approximately 30% of our full dataset, are all taken from the Buildings-900K database. These time series represent a large-scale sample of simulated US building energy demand and are designed to be broadly representative of US commercial and residential building stock. As described in [@emami2023buildingsbench], the dataset is originally sourced from the NREL EULP database [@osti_1854582], which provides 15-minute resolution, appliance-level consumption for 550K residential and 350K commercial buildings spread across all climate regions in the U.S. For finer-grained details, see App. B.3 in Ref. [@emami2023buildingsbench].

Traffic
-------

We consider the public LargeST [@liu2023largest] dataset, a collection of 8600 time series recorded from traffic sensors in the California area. The data spans five years, from 2017 to 2021, and is sampled at 15-minute resolution. To reduce the data size, we downsample the data to hourly resolution and remove series that contain over $50\%$ missing entries. This gives us a total of 8520 series, all of length 175296, which translates to 1.46B data points.
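A minimal sketch of the two reduction steps described above; block-averaging to hourly resolution is our assumption (strided selection would also be consistent with the text), and the helper names are ours:

```python
import numpy as np

def to_hourly(x_15min):
    """Aggregate blocks of four 15-minute readings into hourly values."""
    x = np.asarray(x_15min, dtype=float)
    n = (len(x) // 4) * 4          # drop any trailing partial hour
    return x[:n].reshape(-1, 4).mean(axis=1)

def too_sparse(x, max_missing=0.5):
    """Flag series with more than 50% missing (NaN) entries for removal."""
    x = np.asarray(x, dtype=float)
    return float(np.isnan(x).mean()) > max_missing
```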

Finance
-------

We include daily stock returns and volume data, treated as separate one-dimensional time series, for $5038$ stocks listed across the Nasdaq, NYSE, and AMEX stock exchanges. Daily stock returns and volume tickers are obtained for $7230$ stocks from `yahoo finance`, from the beginning of each listing up to 1st January 2024. We discard any stocks that have fewer than $512$ ticks (recorded trading days), and any series containing `NaN` or `inf`. This results in time series for $5038$ stocks, with both returns and volume data, and a total of $42.6$M data points (Tab. `\ref{tab:finance}`{=latex}).

```{=latex}
\renewcommand{\arraystretch}{1.2}
```
```{=latex}
\centering
```
::: {#tab:finance}
  **Dataset**     `\centering`{=latex}`\arraybackslash`{=latex} **Frequency**   `\centering`{=latex}`\arraybackslash`{=latex} **Number of Series**   `\centering`{=latex}`\arraybackslash`{=latex} **Number of Data Points**
  --------------- ------------------------------------------------------------- -------------------------------------------------------------------- -------------------------------------------------------------------------
  Stock Returns   `\centering`{=latex}`\arraybackslash`{=latex} Daily           `\centering`{=latex}`\arraybackslash 5038`{=latex}                   `\centering`{=latex}`\arraybackslash`{=latex} 26.3M
  Stock Volume    `\centering`{=latex}`\arraybackslash`{=latex} Daily           `\centering`{=latex}`\arraybackslash 5038`{=latex}                   `\centering`{=latex}`\arraybackslash`{=latex} 26.3M

  : **Finance Data:** Daily stock returns and volume data for $5038$ stocks listed across the Nasdaq, NYSE and AMEX exchanges, obtained from `yahoo finance`.
:::
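The filtering criteria above can be sketched as follows; `keep_series` is a hypothetical helper, not part of any released code:

```python
import numpy as np

def keep_series(x, min_len=512):
    """Apply the cuts described above: keep a returns/volume series only
    if it has at least `min_len` ticks and contains no NaN or inf."""
    x = np.asarray(x, dtype=float)
    return len(x) >= min_len and bool(np.isfinite(x).all())
```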

Audio
-----

Audio data is intrinsically a one-dimensional time series rich with structure and features; it is therefore well suited to our study. We have three primary sources of audio data, all from the DagsHub Open-Source Audio Datasets repository (<https://github.com/DagsHub/audio-datasets>). Again, the total volume of data available here is extremely large and can be used to supplement future datasets for larger models. We use three particular sources, each from a different domain, to enhance diversity. As presented in Tab. `\ref{tab:dataset_stats}`{=latex}, these three sources add up to approximately 2B data points and $\sim 25\%$ of our overall dataset. A summary of the three sources can be found in Tab. `\ref{tab:audio}`{=latex}.

```{=latex}
\renewcommand{\arraystretch}{1.2}
```
```{=latex}
\centering
```
::: {#tab:audio}
  **Dataset**     `\centering`{=latex}`\arraybackslash`{=latex} **Frequency**       `\centering`{=latex}`\arraybackslash`{=latex} **Number of Series**   `\centering`{=latex}`\arraybackslash`{=latex} **Length**   `\centering`{=latex}`\arraybackslash`{=latex} **Number of Data Points**
  --------------- ----------------------------------------------------------------- -------------------------------------------------------------------- ---------------------------------------------------------- -------------------------------------------------------------------------
  Commands        `\centering`{=latex}`\arraybackslash 16`{=latex} $\mathrm{kHz}$   `\centering`{=latex}`\arraybackslash 47650`{=latex}                  `\centering`{=latex}`\arraybackslash 16`{=latex},000       `\centering`{=latex}`\arraybackslash`{=latex} 762.4M
  Arabic Speech   `\centering`{=latex}`\arraybackslash 24`{=latex} $\mathrm{kHz}$   `\centering`{=latex}`\arraybackslash 1813`{=latex}                   `\centering`{=latex}`\arraybackslash`{=latex} Varied       `\centering`{=latex}`\arraybackslash`{=latex} 329.9M
  Bird Audio      `\centering`{=latex}`\arraybackslash 22`{=latex} $\mathrm{kHz}$   `\centering`{=latex}`\arraybackslash 4000`{=latex}                   `\centering`{=latex}`\arraybackslash`{=latex} Varied       `\centering`{=latex}`\arraybackslash`{=latex} 888.3M

  : **Audio Data:** Columns as in Tab. `\ref{tab:NOAA}`{=latex}.
:::

**Commands:** The speech commands dataset [@2018arXiv180403209W] is made up of a series of short audio files of different voices saying a collection of common English words (e.g., "happy" and "five"). From the full dataset provided at <https://github.com/DagsHub/audio-datasets/blob/main/Speech_Commands_Dataset/README.md> we take a random half and exclude any clips that are not exactly 16,000 samples long (again for ease of storage). We are then left with 47650 series, making a total of $\sim 750\mathrm{M}$ data points.

```{=latex}
\vskip 3pt
```
**Arabic Speech:** This dataset contains 1813 time series of high-quality (studio-recorded) spoken Arabic utterances sampled at $48\mathrm{kHz}$ -- <https://github.com/DagsHub/audio-datasets/tree/main/Arabic-Speech-Corpus>. To reduce the data size without dramatically affecting its quality, we downsample the data by a factor of two (the spectral content of human speech lies well below the resulting Nyquist frequency). This gives us a total of $\sim 300\mathrm{M}$ data points.

```{=latex}
\vskip 3pt
```
**Birds:** Finally, we use the bird detection dataset from <https://github.com/DagsHub/audio-datasets/blob/main/Bird-Audio-Detection-challenge/README.md> [@https://doi.org/10.1111/2041-210X.13103]. This dataset contains a combination of bird and other sounds designed for training machine learning algorithms to detect bird noises. Here we ignore the labels and use the entire dataset in training. Again, to reduce data volumes we downsample by a factor of two and use only a randomly chosen half of the data. This leaves us with 4000 time series sampled at 22 $\mathrm{kHz}$, for a total of $\sim 900\mathrm{M}$ data points.

```{=latex}
\centering
```
![**From in-sequence test loss to forecasting:** Here we show the connection between in-sequence test loss and forecasting performance as a function of model size. We plot the true data in black, with $1\sigma$ ranges for both in-sequence and forecasting predictions. As the in-sequence test loss decreases, forecasting becomes substantially more predictive.](forecasts.png){#fig:forecast width="\\linewidth"}

Dataset balancing and pre-processing
------------------------------------

Each dataset is made up of a large number of individual time series of varying lengths. We use 95% of the set of time series for training and the remaining 5% for testing. Since the majority of the series are significantly longer than our context window, during training and testing we visit each series with probability $p_i = t_i/T$, where $t_i$ is the number of data points in that series and $T$ is the total number of data points in the training set. Additionally, each time we visit a series we choose a random starting index. This strategy ensures that the model sees each section of the data once (on average) in a given epoch. We normalize each time-series in the training set to have zero mean and unit standard deviation.[^6]
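The balancing and normalization scheme above can be sketched as follows; the function names and interface are ours, chosen for illustration:

```python
import numpy as np

def sample_batch_indices(lengths, context, rng):
    """Visit series i with probability p_i = t_i / T, then pick a random
    valid start index for a window of `context` points."""
    lengths = np.asarray(lengths)
    p = lengths / lengths.sum()            # p_i = t_i / T
    i = rng.choice(len(lengths), p=p)      # longer series are visited more often
    start = rng.integers(0, lengths[i] - context + 1)
    return i, start

def standardize(x):
    """Zero-mean, unit-variance normalization; constant series map to zero."""
    x = np.asarray(x, dtype=float)
    s = x.std()
    return np.zeros_like(x) if s == 0 else (x - x.mean()) / s
```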

Training details and compute requirements {#app:training}
=========================================

We use the `AdamW` optimizer with a batch size of 512, a cosine learning rate scheduler with a linear warm up of 3000 training steps, and train for a total of $10^5$ steps. When training on the entire dataset ($\sim8$B data points), this equates to roughly two epochs. To reduce computational costs we compute the test loss every $\mathcal{O}(200)$ steps and average over a random 10% of the test data each time.
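A minimal sketch of the learning-rate schedule as we read it: linear warm-up over 3000 steps, then cosine decay over the remaining steps (the final minimum learning rate is our assumption):

```python
import math

def lr_at_step(step, max_lr, warmup=3000, total=100_000, min_lr=0.0):
    """Linear warm-up to max_lr, then cosine decay to min_lr at `total` steps."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```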

Producing the results in this paper required $\mathcal{O}(50 - 70)$ individual production runs. Apart from the 100M-parameter run, these were all carried out on single NVIDIA A100 GPU instances, each taking between one and three days to complete. Overall, the work presented here required $\mathcal{O}(150)$ GPU-days of compute. Hosting the full dataset also required a CPU RAM allocation of approximately 250 GB.

Learning-rate and architecture dependence {#app:lr_dep}
=========================================

In Fig. `\ref{fig:LR_scaling}`{=latex} we show the effect of changing the maximum learning rate reached at the end of the warm-up. The performance of the model (CRPS) clearly depends on the maximum learning rate, and that dependence is itself a function of parameter count. The dependence on maximum learning rate is strong enough that a smaller model can outperform a larger one if the maximum learning rate is too small (or too large) for the larger model. Moreover, for a fixed model size we see a clear optimum learning rate, above which models diverge (shown as crosses in Fig. `\ref{fig:LR_scaling}`{=latex}). To ensure that we used an optimal maximum learning rate as a function of model size, we fit a power law with a constant offset to the best models (at each parameter size) shown in Fig. `\ref{fig:LR_scaling}`{=latex}. In the few cases for the largest models where our power-law fit overestimates the optimal maximum learning rate (leading to divergence), we slowly reduce the learning rate until we achieve convergence.
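The fit described above can be sketched as follows. The parameterisation $\mathrm{lr}^*(N) = a N^{-b} + c$ is one natural reading of "a power law with a constant offset"; the exact form and the helper name are our assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_optimal_lr(n_params, best_lr):
    """Fit lr*(N) = a * N**(-b) + c to the per-size optimal learning rates."""
    def model(n, a, b, c):
        return a * n**(-b) + c
    # Initial guesses from the log-log slope, ignoring the offset.
    b0 = -np.polyfit(np.log(n_params), np.log(best_lr), 1)[0]
    a0 = best_lr[0] * n_params[0]**b0
    popt, _ = curve_fit(model, n_params, best_lr, p0=(a0, b0, 0.0), maxfev=10_000)
    return popt  # (a, b, c)
```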

In Fig. `\ref{fig:model_shape_scaling}`{=latex} we show the dependence of model performance on architecture choices: performance is largely invariant to the number of heads, and only weakly sensitive to the aspect ratio (for aspect ratios $\lesssim 100$).

```{=latex}
\centering
```
![**Importance of Learning Rate:** Here we show the minimum CRPS measured on the test data as a function of the maximum learning rate reached at the end of the linear warm up schedule. Crosses indicate that the model diverged before training was complete. There is a clear optimum max learning rate which decreases as a function of model size/number of parameters.](maxLR.png){#fig:LR_scaling width="0.7\\linewidth"}

```{=latex}
\centering
```
![**Importance of Transformer Architecture:** We show the minimum CRPS on the test set as a function of architecture choices and number of parameters. *Left:* Performance on the test data has a weak dependence on aspect ratio below $100$ but degrades significantly above $128$. We therefore keep aspect ratios $<70$ for all scaling runs. *Right:* The number of attention heads has no noticeable effect on performance for either model size tested. We fix the number of heads to four for the scaling runs.](n_heads_AR_combined.png){#fig:model_shape_scaling width="\\linewidth"}

Power-law fits {#app:pl_fits}
==============

In Tab. `\ref{tab:fits}`{=latex} we provide the power-law fits to the scaling relations shown in Fig. `\ref{fig:scaling}`{=latex}.

```{=latex}
\renewcommand{\arraystretch}{1.2}
```
```{=latex}
\centering
```
::: {#tab:fits}
  --------------------------------- ---------------------------------------------------------------------------------- ----------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------- ----------------------------------------------------- ---------------------------------------------------------------- -----------------------------------------------------
                                    `\centering`{=latex}`\arraybackslash`{=latex} `\multicolumn{2}{c|}{MSE}`{=latex}   `\centering`{=latex}`\arraybackslash`{=latex} `\multicolumn{2}{c|}{CRPS}`{=latex}   `\centering`{=latex}`\arraybackslash`{=latex} `\multicolumn{2}{c}{Log-Likelihood}`{=latex}   `\centering`{=latex}`\arraybackslash`{=latex}         `\centering`{=latex}`\arraybackslash`{=latex}                    `\centering`{=latex}`\arraybackslash`{=latex}
                                    `\centering`{=latex}`\arraybackslash`{=latex} $\log_{10}(A_0)$                     `\centering`{=latex}`\arraybackslash`{=latex} $B_0$                                 `\centering`{=latex}`\arraybackslash`{=latex} $\log_{10}(A_0)$                               `\centering`{=latex}`\arraybackslash`{=latex} $B_0$   `\centering`{=latex}`\arraybackslash`{=latex} $\log_{10}(A_0)$   `\centering`{=latex}`\arraybackslash`{=latex} $B_0$
  Number of Parameters, $N_p$       `\centering`{=latex}`\arraybackslash -19.47`{=latex}                               `\centering`{=latex}`\arraybackslash 0.042`{=latex}                                 `\centering`{=latex}`\arraybackslash -22.64`{=latex}                                         `\centering`{=latex}`\arraybackslash 0.036`{=latex}   `\centering`{=latex}`\arraybackslash 4.33`{=latex}               `\centering`{=latex}`\arraybackslash 0.151`{=latex}
  Training Compute, $\mathcal{C}$   `\centering`{=latex}`\arraybackslash -38.88`{=latex}                               `\centering`{=latex}`\arraybackslash 0.031`{=latex}                                 `\centering`{=latex}`\arraybackslash  -43.03`{=latex}                                        `\centering`{=latex}`\arraybackslash 0.028`{=latex}   `\centering`{=latex}`\arraybackslash -6.65`{=latex}              `\centering`{=latex}`\arraybackslash 0.101`{=latex}
  Dataset Size, $\mathcal{D}$       `\centering`{=latex}`\arraybackslash -8.91`{=latex}                                `\centering`{=latex}`\arraybackslash 0.062`{=latex}                                 `\centering`{=latex}`\arraybackslash -30.42`{=latex}                                         `\centering`{=latex}`\arraybackslash 0.027`{=latex}   `\centering`{=latex}`\arraybackslash 7.00`{=latex}               `\centering`{=latex}`\arraybackslash 0.188`{=latex}
  --------------------------------- ---------------------------------------------------------------------------------- ----------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------- ----------------------------------------------------- ---------------------------------------------------------------- -----------------------------------------------------

  : **Power-law fits:** Best-fit amplitudes ($\log_{10}(A_0)$) and slopes ($B_0$) for each loss metric, as a function of parameter count, training compute, and dataset size.
:::

In-Sequence Predictions to Forecasting {#app:forecast}
======================================

Here we show an example of how in-sequence test loss correlates with forecasting predictions obtained from roll-out. In particular, in Fig. `\ref{fig:forecast}`{=latex} we show forecasts for three different datasets as a function of model size. We use the best weights (i.e., the model that achieved the lowest test loss during training) for each model size and show both in-sequence and forecasting predictions along with the true data; for both we show the $1\sigma$ range of predictions. Although not perfect, it is clear that as one scales up model size (and the in-sequence test loss decreases), forecasting performance also improves substantially. Although we only show three examples here, we observe a similar trend in the forecasting power of our models across a variety of datasets. We leave a more detailed exploration to future work.
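The roll-out procedure behind such forecast bands can be sketched as follows; `sample_next` stands in for the model's one-step-ahead predictive draw (a hypothetical interface, not the paper's actual API):

```python
import numpy as np

def rollout(sample_next, context, horizon, n_draws=100, rng=None):
    """Autoregressively sample `horizon` steps, `n_draws` times, and return
    the mean trajectory plus a 1-sigma (16th-84th percentile) band."""
    rng = rng if rng is not None else np.random.default_rng(0)
    paths = np.empty((n_draws, horizon))
    for d in range(n_draws):
        ctx = list(context)
        for h in range(horizon):
            x = sample_next(np.asarray(ctx), rng)
            paths[d, h] = x
            ctx.append(x)               # feed the draw back in
    lo, hi = np.percentile(paths, [16, 84], axis=0)
    return paths.mean(axis=0), lo, hi
```

Because each draw is fed back into the context, predictive uncertainty compounds with horizon, which is why the bands in Fig. `\ref{fig:forecast}`{=latex} widen as the forecast extends.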

[^1]: While some SOTA time-series datasets from recent studies are larger, this discrepancy is mostly accounted for by their use of synthetic data (which we deliberately do not include), or by the reduction in our total data count from re-balancing the data to ensure it is not dominated by a single source.

[^2]: We add a constant factor of two to the log-likelihood to ensure values are always positive, enabling us to examine power-law scaling. A constant additive factor can change the slope of the fitted power-law; to remain as agnostic as possible we choose to add the smallest integer required to make all values of the loss positive.

[^3]: Note that by the time this data became open-source we had already fixed our dataset for the production runs completed for this study.

[^4]: We found through experimentation that removing very noisy datasets significantly improved training stability.

[^5]: The original data can be found here <https://downloads.psl.noaa.gov/Datasets/noaa.oisst.v2.highres/>.

[^6]: In rare instances where input time-series are constant (and hence have zero standard deviation), we set them to a constant value of zero.
