---
  Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at a high cost, hindering the development of larger capable forecasting models in real-world applications. In response, we introduce `\method`{=latex}, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, `\method `{=latex}enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows `\method `{=latex}to scale effectively without a corresponding increase in inference costs. `\method `{=latex}comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support flexible forecasting horizons with varying input context lengths. We pre-trained these models on our newly introduced large-scale data `\dataset`{=latex}, which spans over 9 domains and encompassing over 300 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by large margin. These advancements position `\method `{=latex}as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility.
author:
- |
  `\small`{=latex}Xiaoming Shi$^{1*\spadesuit}$, Shiyu Wang$^{*\spadesuit}$, Yuqi Nie$^{2*}$, Dianqi Li,  Zhou Ye, Qingsong Wen$^{3 \dagger}$, Ming Jin$^{4 \dagger \spadesuit}$ `\vspace{1.5mm}`{=latex}\
  $^{1}$Xiaohongshu Inc   $^{2}$Princeton University   $^{3}$Squirrel Ai Learning   $^{4}$Griffith University `\vspace{1mm}`{=latex}\
  `sxm728@hotmail.com`,  `kwuking@gmail.com`,   `ynie@princeton.edu`\
  `{dianqili77, yezhou199032, qingsongedu, mingjinedu}@gmail.com`
bibliography:
- iclr2025\_conference.bib
title: '`\icon `{=latex}[Time-MoE]{.smallcaps}: Billion-Scale Time Series Foundation Models with Mixture of Experts'
---

```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters~\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\method}{\textsc{Time-MoE}\xspace}
```
```{=latex}
\newcommand{\basemodel}{\textsc{Time-MoE}\textsubscript{base}\xspace}
```
```{=latex}
\newcommand{\largemodel}{\textsc{Time-MoE}\textsubscript{large}\xspace}
```
```{=latex}
\newcommand{\ultramodel}{\textsc{Time-MoE}\textsubscript{ultra}\xspace}
```
```{=latex}
\newcommand{\dataset}{\texttt{Time-300B}\xspace}
```
```{=latex}
\newcommand{\rowc}{\rowcolor{fbApp}}
```
```{=latex}
\newcommand{\cc}{\cellcolor{fbApp}}
```
```{=latex}
\newcommand{\icon}{\raisebox{-2pt}{\includegraphics[width=1.0em]{figures/timemoe-logo.png}}\xspace}
```
```{=latex}
\newcommand{\emoji}[2][1.2em]{\raisebox{-0.2\height}{\includegraphics[height=#1]{#2}}}
```
```{=latex}
\newcommand\blfootnote[1]{
    \begingroup
    \renewcommand\thefootnote{}\footnote{#1}
    \addtocounter{footnote}{-1}
    \endgroup
}
```
```{=latex}
\newcommand{\update}[1]{{\textcolor{black}{#1}}}
```
```{=latex}
\newcommand{\boldres}[1]{{\textbf{\textcolor{red}{#1}}}}
```
```{=latex}
\newcommand{\secondres}[1]{{\underline{\textcolor{blue}{#1}}}}
```
```{=latex}
\newcommand{\tp}[1]{\colorbox{lavender}{#1}}
```
```{=latex}
\newcommand{\fix}{\marginpar{FIX}}
```
```{=latex}
\newcommand{\new}{\marginpar{NEW}}
```
```{=latex}
\blfootnote{$^*$ Equal contribution \ \ $^{\spadesuit}$ Project lead \ \ $^{\dagger}$ Corresponding author}
```
```{=latex}
\vspace{-0.6cm}
```
```{=latex}
\maketitle
```
**Resources**: [<https://github.com/Time-MoE/Time-MoE>]{style="color: pink"}

Introduction
============

Time series data is a major modality in real-world dynamic systems and applications across various domains [@box2015time; @zhang2024self; @liang2024foundation]. Analyzing time series data is challenging due to its inherent complexity and distribution shifts, yet it is crucial for unlocking insights that enhance predictive analytics and decision-making. As a key task in high demand, time series forecasting has long been studied and is vital for driving various use cases in fields such as energy, climate, education, quantitative finance, cloud services, and urban computing [@jin2023large; @nie2024survey; @wang2023flow; @mao2024time]. Traditionally, forecasting has been performed in a task-specific, end-to-end manner using either statistical or deep learning models. Despite their competitive performance, the field had not converged on building unified, general-purpose forecasting models until recently, with the emergence of a few foundation models (FMs) for universal forecasting [@dasdecoder; @woo2024unified; @ansari2024chronos]. Although promising, they are generally small in scale and have limited task-solving capabilities compared to domain-specific models, limiting their real-world impact when balancing forecasting precision against computational budget.

```{=latex}
\centering
```
![Performance overview. (**Left**) Comparison between `\method `{=latex}models and state-of-the-art time series foundation models, reporting the average zero-shot performance across six benchmark datasets. (**Right**) Comparison of few- and zero-shot performance between `\method `{=latex}and dense variants, with similar effective FLOPs per time series token, across the same six benchmarks.](figures/TimesMoE-performance-highlight-v6.png){#fig:performance-highlight width="0.9\\linewidth"}

```{=latex}
\vspace{-5mm}
```
Increasing model size and training tokens typically leads to performance improvements, a phenomenon known as scaling laws, which has been extensively explored in the language and vision domains [@kaplan2020scaling; @alabdulmohsin2022revisiting]. However, such properties have not been thoroughly investigated in the time series domain [@yao2024towards]. Assuming that scaling forecasting models with high-quality training data follows similar principles, several challenges remain: ***Dense versus sparse training.*** Most time series forecasting models are composed of dense layers, meaning each input time series token requires computation with all model parameters. While effective, this is computationally intensive. In contrast, sparse training with mixture-of-experts (MoE) is more FLOP-efficient per parameter and allows for scaling up model size under a fixed inference budget while delivering better performance, as showcased on the right of Figure `\ref{fig:performance-highlight}`{=latex}. However, optimizing a sparse, large-scale time series model faces another challenge of ***stability and convergence.*** Time series are highly heterogeneous [@woo2024unified; @dong2024heterogeneity], and selecting the appropriate model design and routing algorithm often involves a trade-off between performance and computational efficiency. Sparse solutions for time series foundation models have yet to be explored, leaving a significant gap in addressing these two challenges. While time series ***pre-training datasets*** are no longer a major bottleneck, most existing works [@dasdecoder; @woo2024unified; @ansari2024chronos] have not extensively discussed their in-model data processing pipelines or mixing strategies. Answering this is particularly important, given that existing data archives are often noisy and largely imbalanced across domains.
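The dense-versus-sparse distinction above can be made concrete with a minimal numerical sketch of top-$k$ expert routing. Here the experts are plain linear maps and all names (`moe_layer`, `gate`) are illustrative placeholders, not the paper's implementation; the point is that each token multiplies against only $k$ of the expert weight matrices, so per-token compute scales with $k$ rather than the total expert count.

```python
import numpy as np

def moe_layer(x, experts, gate, k=2):
    """Sparse MoE forward pass: each token activates only its top-k experts.

    x:       (tokens, d) token embeddings
    experts: list of (d, d) matrices, one linear "expert" per entry
    gate:    (d, num_experts) routing projection
    """
    logits = x @ gate                          # (tokens, num_experts) routing scores
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top_k = np.argsort(logits[t])[-k:]     # indices of the k highest-scoring experts
        sel = logits[t, top_k]
        probs = np.exp(sel - sel.max())
        probs /= probs.sum()                   # softmax over the selected experts only
        for idx, p in zip(top_k, probs):
            out[t] += p * (x[t] @ experts[idx])  # weighted mix of k expert outputs
    return out
```

With `k=2` and, say, 8 experts, each token touches only a quarter of the expert parameters per forward pass, which is the mechanism that lets total capacity grow without a matching growth in inference FLOPs.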

On the other hand, most time series FMs face limitations in ***flexibility and generalizability***. General-purpose forecasting is a fundamental capability, requiring a model to handle any forecasting problem, regardless of context lengths, forecasting horizons, input variables, and other properties such as frequencies and distributions. Meanwhile, achieving strong generalizability pushes these boundaries further, and existing works often fail to meet both requirements simultaneously. For instance, Timer [@liutimer] has limited native support for arbitrary output lengths, which may lead to truncated outputs, while Moment [@goswamimoment] operates with a fixed input context length. Although Moirai [@woo2024unified] achieves universal forecasting, it depends on hardcoded heuristics in both the input and output layers.

The recognition of the above challenges naturally raises a pivotal question:

```{=latex}
\begin{tcolorbox}[notitle, rounded corners, colframe=darkgrey, colback=white, boxrule=2pt, boxsep=0pt, left=0.15cm, right=0.17cm, enhanced, shadow={2.5pt}{-2.5pt}{0pt}{opacity=5,mygrey},toprule=2pt, before skip=0.65em, after skip=0.75em 
  ]\emph{
  {
    \centering 
  {
    \fontsize{8.5pt}{13.2pt}\selectfont 
    How to scale time series foundation models to achieve universal forecasting while balancing model capability and computational overhead, mirroring the success of foundation models in other domains?
  }
  \\
  }
  }
\end{tcolorbox}
```
```{=latex}
\vspace{2mm}
```
Answering this question drives the design of `\method`{=latex}, a scalable and unified architecture for pre-training larger, more capable forecasting FMs while reducing computational costs. `\method `{=latex}consists of a family of decoder-only transformer models with a mixture-of-experts architecture, operating in an auto-regressive manner to support any forecasting horizon and accommodate context lengths of up to 4096. With its sparsely activated design, `\method `{=latex}enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows `\method `{=latex}to scale effectively without significantly increasing inference costs. Our proposal is built on a minimalist design, where the input time series is point-wise tokenized and encoded before being processed by a sparse transformer decoder, activating only a small subset of parameters. Pre-trained on large-scale time series data across 9 domains and over 300 billion time points, `\method `{=latex}is optimized through multi-task learning to forecast at multiple resolutions. During inference, different forecasting heads produce forecasts at diverse scales, enabling flexible forecast horizons. For the first time, we scale a time series FM up to 2.4 billion parameters, achieving substantial improvements in forecasting precision compared to existing models, as shown on the left of Figure `\ref{fig:performance-highlight}`{=latex}. Compared to dense models with the same number of activated parameters or equivalent computational budgets, our models consistently outperform them by a large margin. Our contributions lie in three aspects:
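The dynamic scheduling of multi-resolution heads described above can be sketched as a greedy cover of the requested horizon: the largest head that still fits is invoked repeatedly, falling back to finer heads for the remainder. The concrete head sizes below are hypothetical placeholders, not the paper's actual configuration.

```python
def schedule_heads(horizon, head_sizes=(64, 32, 8, 1)):
    """Greedily cover `horizon` future steps with multi-resolution heads.

    head_sizes lists each head's output length; the resolutions here are
    illustrative. Returns the sequence of head sizes to invoke
    auto-regressively, preferring coarse heads to minimize decoding steps.
    """
    plan, remaining = [], horizon
    for size in sorted(head_sizes, reverse=True):
        while remaining >= size:
            plan.append(size)
            remaining -= size
    return plan
```

For a horizon of 96 with these placeholder sizes, the scheduler would run the 64-step head once and the 32-step head once, i.e., two decoding steps instead of 96 single-step iterations.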

1.  We present `\method`{=latex}, a universal decoder-only time series forecasting foundation model architecture with mixture-of-experts. To the best of our knowledge, this is the first work to scale time series foundation models up to 2.4 billion parameters. `\method `{=latex}achieves substantial improvements in forecasting accuracy and consistently outperforms dense models with comparable computational resources, while maintaining high efficiency.

2.  We introduce `\dataset`{=latex}, the largest open-access time series data collection, comprising over 300 billion time points spanning nine domains, accompanied by a well-designed data-cleaning pipeline. Our `\method `{=latex}models and `\dataset `{=latex}data collection are open-sourced.

3.  Trained on `\dataset`{=latex}, `\method `{=latex}models outperform other time series foundation models with a similar number of activated parameters across six real-world benchmarks, achieving reductions in forecasting errors by an average of 20% and 24% in zero-shot and in-distribution scenarios, respectively.

Related Work
============

`\noindent`{=latex}**Time Series Forecasting.** Deep learning models have become powerful tools for time series forecasting over the past decade, which can be broadly categorized into two types: (1) *univariate models*, such as DeepState [@rangapuram2018deep], DeepAR [@salinas2020deepar], and N-BEATS [@oreshkinn], which focus on modeling individual time series, and (2) *multivariate models*, which include both transformer-based approaches [@wen2023transformers; @zhou2021informer; @patchtstnietime; @liuitransformer; @wang2024card; @chen2024multi; @wang2022end] and non-transformer models [@sen2019think; @jin2022multivariate; @wangtimemixer; @hu2024attractor; @qi2024pdetime; @wang2024neuralreconciler], designed to handle multiple time series simultaneously. While these models achieve competitive in-domain performance [@wang2024timemixer++], many are task-specific and fall short in generalizability when applied to cross-domain data in few-shot or zero-shot scenarios.

`\noindent`{=latex}**Large Time Series Models.** Self-supervised learning has been extensively developed for time series [@zhang2024self], employing masked reconstruction [@zerveas2021transformer; @patchtstnietime] or contrastive learning [@zhang2022self; @wang2023full; @yue2022ts2vec]. However, these methods are limited in both data and model scale, with many focused on in-domain learning and transfer. Recently, general pre-training of time series models on large-scale data has emerged [@liang2024foundation], though still in its early stages with insufficient exploration into sparse solutions. See Appendix `\ref{sec:morework}`{=latex} for more information. Unlike these dense models, `\method `{=latex}introduces a scalable, unified architecture for pre-training larger, more capable forecasting foundation models while maintaining the same scale of activated parameters and computational budget.

`\noindent`{=latex}**Sparse Deep Learning for Time Series.** Deep learning models are often dense and over-parameterized [@hoefler2021sparsity], leading to increased memory and computational demands during both training and inference. However, sparse networks, such as mixture-of-experts models [@jacobs1991adaptive], which dynamically route inputs to specialized expert networks, have shown comparable or even superior generalization to dense models while being more efficient [@fedus2022switch; @riquelme2021scaling]. In time series research, model sparsification has received relatively less attention, as time series models have traditionally been small in scale, with simple models like DLinear [@zeng2023transformers] and SparseTSF [@linsparsetsf] excelling in specific tasks prior to the advent of large-scale, general pre-training. The most relevant works on this topic include Pathformer [@chen2024multi], MoLE [@ni2024mixture], and IME [@ismail2023interpretable]. However, none of them delve into the scalability of foundation models with sparse structures. Moreover, MoLE and IME are not sparse models, as the input is passed to all heads and the outputs are then combined to make predictions.

Methodology
===========

Our proposed `\method`{=latex}, illustrated in Figure `\ref{fig:framework}`{=latex}, adopts a mixture-of-experts-based, decoder-only transformer architecture, comprising three key components: (1) *input token embedding*, (2) *MoE transformer block*, and (3) *multi-resolution forecasting*. For the first time, we scale a sparsely-activated time series model to 2.4 billion parameters, achieving significantly better zero-shot performance with the same computation. This marks a major step forward in developing large time series models for universal forecasting.

`\noindent`{=latex}**Problem Statement.** We address the problem of predicting future values in a time series: given a sequence of historical observations $\mathbf{X}_{1:T} = \left(x_1,x_2,\ldots,x_T \right) \in \mathbb{R}^{T}$ spanning $T$ time steps, our objective is to forecast the next $H$ time steps, i.e., $\hat{\mathbf{X}}_{T+1:T+H} = f_\theta \left( \mathbf{X}_{1:T} \right) \in \mathbb{R}^{H}$. Here, $f_\theta$ represents a time series model, where $T$ is the context length and $H$ is the forecasting horizon. Notably, both $T$ and $H$ can be *flexible* during `\method `{=latex}inference, distinguishing it from task-specific models with fixed horizons. Additionally, channel independence [@patchtstnietime] is adopted to transform a multivariate input into univariate series, allowing `\method `{=latex}to handle *any-variate* forecasting problems in real-world applications.
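The channel-independence transform mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the helper name `to_univariate` is ours.

```python
import numpy as np

def to_univariate(x_multivariate: np.ndarray) -> list[np.ndarray]:
    """Split a (T, C) multivariate series into C independent univariate series.

    Each channel is then forecast separately by the same model f_theta,
    which is how channel independence enables any-variate forecasting.
    """
    return [x_multivariate[:, c] for c in range(x_multivariate.shape[1])]

x = np.arange(12.0).reshape(6, 2)     # T=6 time steps, C=2 channels
channels = to_univariate(x)
assert len(channels) == 2 and channels[0].shape == (6,)
```

Because each channel is handled independently, the same pre-trained model applies unchanged whether the input has one variate or hundreds.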

```{=latex}
\centering
```
![ The architecture of `\method`{=latex}, which is a decoder-only model. Given an input time series of arbitrary length, `\tikzmarknode[mycircled,red]{a1}{1}`{=latex} we first tokenize it into a sequence of data points, `\tikzmarknode[mycircled,red]{a2}{2}`{=latex} which are then encoded. These tokens are processed through $N$-stacked backbone layers, primarily consisting of causal multi-head self-attention and `\tikzmarknode[mycircled,red]{a3}{3}`{=latex} sparse temporal mixture-of-expert layers. During training, `\tikzmarknode[mycircled,red]{a4}{4}`{=latex} we optimize forecasting heads at multiple resolutions. For model inference, `\method `{=latex}provides forecasts of flexible length by `\tikzmarknode[mycircled,red]{a5}{5}`{=latex} dynamically scheduling these heads. Details about the causal multi-head self-attention are in Appendix `\ref{sec:implements}`{=latex} and illustrated in Figure `\ref{fig:causal_attn}`{=latex}. ](figures/TimesMoE-framework-v2.2.png){width="90%"}

```{=latex}
\vspace{-2mm}
```
`\label{fig:framework}`{=latex}

`\method `{=latex}Overview
--------------------------

`\noindent`{=latex}**Input Token Embedding.** We utilize *point-wise tokenization* for time series embedding to ensure the completeness of temporal information. This enhances our model's flexibility and broad applicability in handling variable-length sequences. Then, we employ SwiGLU [@shazeer2020glu] to embed each time series point: `\vspace{-2mm}`{=latex} $$\mathbf{h}^0_t = \operatorname{SwiGLU}(x_t) = \operatorname{Swish} \left( W x_t \right) \otimes \left( V x_t \right),$$ where $W \in \mathbb{R}^{D \times 1}$ and $V \in \mathbb{R}^{D \times 1}$ are learnable parameters, and $D$ denotes the hidden dimension.
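A minimal NumPy sketch of the point-wise SwiGLU embedding above, with randomly initialized $W$ and $V$ standing in for learned parameters (the function name `embed_point` and dimension $D=8$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                  # hidden dimension (illustrative)
W = rng.normal(size=(D, 1))            # learnable in the real model
V = rng.normal(size=(D, 1))

def swish(z: np.ndarray) -> np.ndarray:
    """Swish (SiLU) activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def embed_point(x_t: float) -> np.ndarray:
    """SwiGLU embedding of one scalar time point: Swish(W x) ⊗ (V x)."""
    x = np.array([[x_t]])              # shape (1, 1)
    return (swish(W @ x) * (V @ x)).ravel()   # shape (D,)

h0 = embed_point(0.5)
assert h0.shape == (D,)
```

Since each scalar point becomes one token, the token sequence length equals the context length $T$, preserving temporal information completely.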

`\noindent`{=latex}**MoE Transformer Block.** Our approach builds upon a decoder-only transformer [@vaswani2017attention] and integrates recent advancements from large language models [@bai2023qwen; @touvron2023llama]. We employ RMSNorm [@zhang2019root] to normalize the input of each transformer sub-layer, thereby enhancing training stability. Instead of using absolute positional encoding, we adopt rotary positional embeddings [@su2024roformer], which provide greater flexibility in sequence length and improved extrapolation capabilities. In line with [@chowdhery2023palm], we remove biases from most layers but retain them in the QKV layer of self-attention to improve extrapolation. To introduce sparsity, we replace a feed-forward network (FFN) with a mixture-of-experts layer, incorporating a shared pool of experts that are sparsely activated. $$\begin{aligned}
\mathbf{u}^l_t & = \operatorname{SA} \left( \operatorname{RMSNorm} \left( \mathbf{h}^{l-1}_t \right) \right) + \mathbf{h}^{l-1}_t, \\
\mathbf{\bar{u}}^l_t & = \operatorname{RMSNorm} \left( \mathbf{u}^l_t \right), \\
\mathbf{h}^l_t & = \operatorname{Mixture} \left( \mathbf{\bar{u}}^l_t \right) + \mathbf{u}^l_t.\end{aligned}$$ Here, $\operatorname{SA}$ denotes self-attention with a causal mask, and $\operatorname{Mixture}$ refers to the mixture-of-experts layer. In practice, $\operatorname{Mixture}$ comprises several expert networks, each mirroring the architecture of a standard FFN. An individual time series point can be routed to either a single expert [@fedus2022switch] or multiple experts [@lepikhin2020gshard]. One expert is designated as a shared expert to capture and consolidate common knowledge across different contexts. $$\begin{aligned}
\operatorname{Mixture} \left( \mathbf{\bar{u}}^l_t \right) & = g_{N+1,t} \operatorname{FFN}_{N+1} \left( \mathbf{\bar{u}}^l_t \right)
+ \sum_{i=1}^{N} \left( {g_{i,t} \operatorname{FFN}_{i} \left( \mathbf{\bar{u}}^l_t \right)} \right), \label{equ:mixture} \\
g_{i,t} & = \begin{cases} 
s_{i,t}, & s_{i,t} \in \operatorname{Topk} (\{ s_{j, t} | 1 \leq j \leq N \}, K), \\
0, & \text{otherwise}, 
\end{cases} \\
g_{N + 1,t} &= \operatorname{Sigmoid} \left( \mathbf{W}^{l}_{N+1} \mathbf{\bar{u}}^l_t \right), \\
s_{i,t} & = \operatorname{Softmax}_i \left( \mathbf{W}_{i}^{l} \mathbf{\bar{u}}_{t}^{l} \right) \label{equ:iso_expert_gate}, \end{aligned}$$ where $\mathbf{W}_{i}^{l} \in \mathbb{R}^{1 \times D}$ denotes the trainable parameters, and $N$ and $K$ respectively denote the numbers of non-shared experts and activated non-shared experts per mixture-of-experts layer.
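The gating scheme in Equations `\eqref{equ:mixture}`{=latex}--`\eqref{equ:iso_expert_gate}`{=latex} can be sketched numerically. This is a hedged illustration, not the production kernel: expert networks are stand-in linear maps, and the names `moe_output`, `W_router`, and `w_shared` are ours.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def moe_output(u, expert_fns, shared_fn, W_router, w_shared, K=2):
    """Sparse mixture: sigmoid-gated shared expert + top-K of N routed experts."""
    s = softmax(W_router @ u)                     # router scores s_{i,t}
    topk = np.argsort(s)[-K:]                     # indices of the top-K experts
    out = sigmoid(w_shared @ u) * shared_fn(u)    # shared expert term g_{N+1,t}
    for i in topk:
        out = out + s[i] * expert_fns[i](u)       # g_{i,t} = s_{i,t} if selected
    return out

rng = np.random.default_rng(1)
D, N = 4, 8
u = rng.normal(size=D)
experts = [lambda x, A=rng.normal(size=(D, D)): A @ x for _ in range(N)]
shared = lambda x: 2.0 * x                        # stand-in shared FFN
W_r = rng.normal(size=(N, D))
w_s = rng.normal(size=D)
y = moe_output(u, experts, shared, W_r, w_s, K=2)
assert y.shape == (D,)
```

Only $K$ of the $N$ routed experts run per token, which is why compute scales with activated rather than total parameters.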

`\noindent`{=latex}**Multi-resolution Forecasting.** We introduce a novel multi-resolution forecasting head, which allows for forecasting at multiple scales simultaneously, in contrast to existing foundation models that are limited to a single fixed scale. This capability enhances `\method`{=latex}'s flexibility by enabling forecasting across various horizons. The model employs multiple output projections from single-layer FFNs, each designed for different prediction horizons. During training, `\method `{=latex}aggregates forecasting errors from different horizons to compute a composite loss (Section `\ref{sec:loss}`{=latex}), thereby improving the model generalization. By incorporating a simple greedy scheduling algorithm (see Appendix `\ref{sec:implements}`{=latex}), `\method `{=latex}efficiently handles predictions across arbitrary horizons. This design also boosts prediction robustness through multi-resolution ensemble learning during inference.
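One plausible form of the greedy scheduling idea is to cover an arbitrary horizon by repeatedly emitting the largest head that does not overshoot, auto-regressively. The sketch below assumes the head set $\{1, 8, 32, 64\}$ used in our training setup; the function `greedy_schedule` is illustrative, and the exact algorithm is given in Appendix `\ref{sec:implements}`{=latex}.

```python
HEADS = [64, 32, 8, 1]   # multi-resolution forecast horizons, largest first

def greedy_schedule(H: int) -> list[int]:
    """Greedily decompose horizon H into a sequence of head-sized steps."""
    plan, remaining = [], H
    while remaining > 0:
        step = next(p for p in HEADS if p <= remaining)  # largest feasible head
        plan.append(step)
        remaining -= step
    return plan

assert greedy_schedule(96) == [64, 32]        # two forward passes, not 96
assert sum(greedy_schedule(720)) == 720       # any horizon is covered exactly
```

Compared with rolling out one step at a time, larger heads sharply reduce the number of auto-regressive passes needed for long horizons.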

Model Training
--------------

### `\dataset `{=latex}Dataset

Training time series foundation models requires extensive, high-quality data. Recent advancements have facilitated the collection of numerous time series datasets from various sources [@godahewa2021monash; @ansari2024chronos; @woo2024unified; @liutimer; @liu2024moirai]. Nonetheless, data quality remains a challenge, with prevalent issues such as *missing values* and *invalid observations* [@wang2024deep] that can significantly impair model performance and destabilize training. To mitigate these issues, we developed a streamlined *data-cleaning pipeline* (Appendix `\ref{sec:pre_data}`{=latex}) to filter and refine raw data, and constructed the largest open-access, high-quality time series data collection named `\dataset `{=latex}for foundation model pre-training. `\dataset `{=latex}comprises a diverse array of publicly available datasets from domains such as energy, retail, healthcare, weather, finance, transportation, and web, augmented with synthetic data to enhance both quantity and diversity. It spans sampling frequencies from seconds to yearly intervals and, after processing through our data-cleaning pipeline, includes over 300 billion time points, as summarized in Table `\ref{tab:dataset_summary}`{=latex}.

```{=latex}
\begin{table*}[t]\caption{Key statistics of the pre-training dataset \dataset from various domains.}
    \label{tab:dataset_summary}
    \centering
    \vskip -0.02in
    \resizebox{\textwidth}{!}{
    %\renewcommand{\tabcolsep}{5pt}
    \begin{tabular}{lcccccccccc}
        \toprule
            &
            \textbf{Energy} &
            \textbf{Finance} &
            \textbf{Healthcare} &
            \textbf{Nature} &
            \textbf{Sales} &
            \textbf{Synthetic} &
            \textbf{Transport} &
            \textbf{Web} &
            \textbf{Other} &
            \textbf{Total} \\
        \midrule
            \textbf{\# Seqs.} & 
            2,875,335 & 
            1,715 & 
            1,752 & 
            31,621,183&
            110,210 & 
            11,968,625 & 
            622,414 & 
            972,158 & 
            40,265 &
            48,220,929\\
            \textbf{\# Obs.} & 
            15.981 \text{B} & 
            413.696 \text{K} & 
            471.040 \text{K} & 
            279.724 \text{B} &
            26.382 \text{M} & 
            9.222 \text{B} & 
            2.130 \text{B} & 
            1.804 \text{B} & 
            20.32 \text{M} &
            309.09 \text{B} \\
            \textbf{Percent\%} & 
            5.17 \% & 
            0.0001\% & 
            0.0001\% & 
            90.50 \% &
            0.008 \% & 
            2.98\% & 
            0.69 \% & 
            0.58 \% & 
            0.006 \% &
            100\% \\
        \bottomrule
    \end{tabular}
    }
    % \vskip -0.1in
\end{table*}
```
### Loss Function {#sec:loss}

Pre-training time series foundation models at scale presents significant challenges in training stability due to the massive datasets and the vast number of parameters involved. To address this, we use the Huber loss [@huber1992robust; @wen2019robusttrend], which provides greater robustness to outliers and improves training stability: $$\mathcal{L}_{\text{ar}} \left( x_t, \hat{x}_t \right) = \begin{cases}
\frac{1}{2} \left( x_t - \hat{x}_t \right)^{2}, & \text{if } \left| x_t - \hat{x}_t \right| \leq \delta, \\
\delta \times \left( \left| x_t - \hat{x}_t \right| - \frac{1}{2} \times \delta \right), & \text{otherwise},
\end{cases}$$ where $\delta$ is a hyperparameter that balances the L1 and L2 loss components.
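As a concrete check of the two branches above, a minimal NumPy version of the Huber loss (with illustrative $\delta = 1$):

```python
import numpy as np

def huber(x, x_hat, delta: float = 1.0):
    """Huber loss: quadratic within delta of the target, linear beyond it."""
    err = np.abs(x - x_hat)
    return np.where(err <= delta,
                    0.5 * err ** 2,                # L2 regime near the target
                    delta * (err - 0.5 * delta))   # L1 regime for outliers

assert huber(np.array(0.0), np.array(0.5)) == 0.125   # 0.5 * 0.5^2
assert huber(np.array(0.0), np.array(3.0)) == 2.5     # 1 * (3 - 0.5)
```

The linear tail is what caps the gradient magnitude on outliers, which is the source of the stability benefit during large-scale pre-training.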

When training the model with a MoE architecture, focusing solely on optimizing prediction error often leads to load imbalance issues among the experts. A common problem is routing collapse [@shazeer2017sparsely], where the model predominantly selects only a few experts, limiting training opportunities for others. To mitigate this, following the approaches of [@dai2024deepseekmoe; @fedus2022switch], we achieve expert-level balancing with an auxiliary loss to reduce routing collapse: $$\begin{aligned}
\mathcal{L}_{\text{aux}} = N \sum_{i=1}^{N}f_i r_i, &
\quad
f_i = \frac{1}{KT} \sum_{t=1}^{T} \mathbb{I} \left(\text{Time point } t \text{ selects Expert } i \right), \ \ r_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}, \label{equ:aux_loss}\end{aligned}$$ where $f_i$ represents the fraction of tokens assigned to expert $i$, and $r_i$ denotes the proportion of router probability allocated to expert $i$. $\mathbb{I}$ is the indicator function. Finally, we combine the auto-regressive losses across all multi-resolution projections with the auxiliary balance loss to form the final loss: $$\mathcal{L} = \frac{1}{P} \sum_{j=1}^{P} \mathcal{L}_{\text{ar}} \left( \mathbf{X}_{t+1:t+p_{j}}, \hat{\mathbf{X}}_{t+1:t+p_{j}} \right) + \alpha \mathcal{L}_{\text{aux}},$$ where $P$ is the number of multi-resolution projections and $p_j$ is the horizon of the $j$-th projection.
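A hedged NumPy sketch of the expert-level balance loss in Equation `\eqref{equ:aux_loss}`{=latex}, computing $f_i$ from top-$K$ selections and $r_i$ from mean router probabilities (the function name `aux_balance_loss` is ours):

```python
import numpy as np

def aux_balance_loss(router_probs: np.ndarray, K: int) -> float:
    """L_aux = N * sum_i f_i * r_i over a batch of T tokens.

    router_probs: (T, N) softmax scores s_{i,t}; each token routes to K experts.
    """
    T, N = router_probs.shape
    topk = np.argsort(router_probs, axis=1)[:, -K:]    # selected experts per token
    counts = np.bincount(topk.ravel(), minlength=N)
    f = counts / (K * T)            # fraction of tokens assigned to each expert
    r = router_probs.mean(axis=0)   # mean router probability per expert
    return float(N * np.sum(f * r))

# Perfectly uniform routing attains the minimum value of 1.0:
uniform = np.full((10, 4), 0.25)
assert np.isclose(aux_balance_loss(uniform, K=2), 1.0)
```

When routing collapses onto a few experts, both $f_i$ and $r_i$ concentrate and the loss rises above its uniform-routing minimum, pushing the router back toward balance.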

### Model Configurations and Training Details

Informed by the scaling laws demonstrated in [@dubey2024llama; @touvron2023llama], which show that a 7- or 8-billion parameter model continues to improve performance even after training on over one trillion tokens, we chose to scale `\method `{=latex}up to 2.4 billion parameters with around 1 billion of them activated. This model, `\ultramodel`{=latex}, supports inference on consumer-grade GPUs with less than 8GB of VRAM. We have also developed two smaller models: `\basemodel`{=latex}, with 50 million activated parameters, and `\largemodel`{=latex}, with 200 million activated parameters, both specifically designed for fast inference on CPU architectures. The detailed model configurations are in Table `\ref{tab:model_size}`{=latex}. Each model undergoes training for $100,000$ steps with a batch size of $1024$, where the maximum sequence length is capped at $4096$. This setup results in the consumption of $4$ million time points per iteration. We choose $\left\{ 1,8,32,64 \right\}$ as different forecast horizons in the output projection and set the factor of the auxiliary loss $\alpha$ to $0.02$. Refer to Appendix `\ref{sec:implements}`{=latex} for optimization details.

```{=latex}
\vskip -0.02in
```
```{=latex}
\centering
```
```{=latex}
\resizebox{0.95\columnwidth}{!}{
    \begin{tabular}{l@{\extracolsep{\fill}}ccccccccc}
        \toprule
            &
            Layers &
            Heads &
            Experts &
            $K$ &
            $\bm{d}$\textsubscript{model} & 
            $\bm{d}$\textsubscript{ff} &
            $\bm{d}$\textsubscript{expert} &
            Activated Params &
            Total Params
            \\
        \midrule
            \basemodel &
            12 &
            12 &
            8 &
            2 &
            384 &
            1536 &
            192 &
            50 $\mathrm{M}$ &  % 113.4
            113 $\mathrm{M}$ \\ % 49.6
            {\largemodel} &
            12 &
            12 &
            8 &
            2 &
            768 &
            3072 &
            384 &
            200 $\mathrm{M}$ & % 453.2
            453 $\mathrm{M}$ \\ % 198.3
            {\ultramodel} &
            36 &
            16 &
            8 &
            2 &
            1024 &
            4096 &
            512 &
            1.1 $\mathrm{B}$ &
            2.4 $\mathrm{B}$ \\
        \bottomrule
    \end{tabular}%
   }
```
Main Results
============

`\method `{=latex}consistently outperforms state-of-the-art models by large margins across 6 well-established benchmarks and settings (Appendix `\ref{sec:implements}`{=latex}). To ensure a fair comparison, we adhered to the configurations from [@woo2024unified] for out-of-distribution forecasting and [@wu2023timesnet] for in-distribution forecasting, using a unified evaluation pipeline we developed. Specifically, we evaluate `\method `{=latex}against 16 different baselines, representing state-of-the-art forecasting models. They are categorized into two groups: (1) the zero-shot forecasting group, which includes pre-trained models such as Moirai [-@woo2024unified], TimesFM [-@dasdecoder], Moment [-@goswamimoment], and Chronos [-@ansari2024chronos]; (2) the in-distribution (full-shot) forecasting group, which consists of up-to-date models such as iTransformer [-@liuitransformer], TimeMixer [-@wangtimemixer], TimesNet [-@wu2023timesnet], PatchTST [-@patchtstnietime], Crossformer [-@zhang2023crossformer], TiDE [-@daslong], DLinear [-@zeng2023transformers], and FEDformer [-@zhou2022fedformer]. We also include additional comparisons with Timer  [-@liutimer], TFT [-@lim2021tft], and N-BEATS [-@oreshkinn] in Appendix `\ref{app:additional_results}`{=latex}.

Zero-shot Forecasting
---------------------

```{=latex}
\vspace{-3mm}
```
```{=latex}
\centering
```
```{=latex}
\resizebox{\columnwidth}{!}{
    \renewcommand{\tabcolsep}{3pt}
    \begin{tabular}{cr|cc|cc|cc|cc|cc|cc|cc|cc|cc|cc|cc}
          \toprule
          \multicolumn{2}{c}{\multirow{3}{*}{\textbf{\scalebox{1.2}{Models}}}} &   \multicolumn{6}{c}{\emoji{figures/timemoe-logo.png} \textbf{\method (Ours)}}& \multicolumn{16}{c}{\emoji{figures/zero-shot.png} \textbf{Zero-shot Time Series Models}} \\
          \cmidrule(lr){3-8} \cmidrule(lr){9-24}
          &       & \multicolumn{2}{c}{\textbf{{\basemodel}}} & \multicolumn{2}{c}{\textbf{{\largemodel}}} & \multicolumn{2}{c}{\textbf{{\ultramodel}}} & \multicolumn{2}{c}{\textbf{Moirai$_{small}$}} & \multicolumn{2}{c}{\textbf{Moirai$_{base}$}} & \multicolumn{2}{c}{\textbf{Moirai$_{large}$}} & \multicolumn{2}{c}{\textbf{TimesFM}} & \multicolumn{2}{c}{\textbf{Moment}} & \multicolumn{2}{c}{\textbf{Chronos$_{small}$}} & \multicolumn{2}{c}{\textbf{Chronos$_{base}$}} & \multicolumn{2}{c}{\textbf{Chronos$_{large}$}} \\
          % \midrule
          \cmidrule(lr){3-4} \cmidrule(lr){5-6}\cmidrule(lr){7-8} \cmidrule(lr){9-10}\cmidrule(lr){11-12}\cmidrule(lr){13-14}\cmidrule(lr){15-16}\cmidrule(lr){17-18}\cmidrule(lr){19-20}\cmidrule(lr){21-22}\cmidrule(lr){23-24}
          
          \multicolumn{2}{c}{\scalebox{1.2}{\textbf{Metrics}}}& \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE}  \\
          \midrule    \multirow{4}[1]{*}{ETTh1} & 96    & 0.357 & \secondres{0.381} & \secondres{0.350} & 0.382 & \boldres{0.349} & \boldres{0.379} & 0.401 & 0.402 & 0.376 & 0.392 & 0.381 & 0.388 & 0.414 & 0.404 & 0.688 & 0.557 & 0.466 & 0.409 & 0.440 & 0.393 & 0.441 & 0.390 \\
          & 192   & \boldres{0.384} & \boldres{0.404} & \secondres{0.388} & \secondres{0.412} & 0.395 & 0.413 & 0.435 & 0.421 & 0.412 & 0.413 & 0.434 & 0.415 & 0.465 & 0.434 & 0.688 & 0.560 & 0.530 & 0.450 & 0.492 & 0.426 & 0.502 & 0.424 \\
          & 336   & \boldres{0.411} & 0.434 & \boldres{0.411} & \secondres{0.430} & 0.447 & 0.453 & 0.438 & 0.434 & \secondres{0.433} & \boldres{0.428} & 0.495 & 0.445 & 0.503 & 0.456 & 0.675 & 0.563 & 0.570 & 0.486 & 0.550 & 0.462 & 0.576 & 0.467 \\
          & 720   & 0.449 & 0.477 & \boldres{0.427} & 0.455 & 0.457 & 0.462 & \secondres{0.439} & \secondres{0.454} & 0.447 & \boldres{0.444} & 0.611 & 0.510 & 0.511 & 0.481 & 0.683 & 0.585 & 0.615 & 0.543 & 0.882 & 0.591 & 0.835 & 0.583 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} & \secondres{0.400} & \secondres{0.424} & \boldres{0.394} & \boldres{0.419} & 0.412 & 0.426 & 0.428 & 0.427 & 0.417 & \boldres{0.419} & 0.480 & 0.439 & 0.473 & 0.443 & 0.683 & 0.566 & 0.545 & 0.472 & 0.591 & 0.468 & 0.588 & 0.466 \\
    \midrule
    \multirow{4}[0]{*}{ETTh2} & 96    & 0.305 & 0.359 & 0.302 & 0.354 & \boldres{0.292} & 0.352 & 0.297 & \secondres{0.336} & \secondres{0.294} & \boldres{0.330} & 0.296 & \boldres{0.330} & 0.315 & 0.349 & 0.342 & 0.396 & 0.307 & 0.356 & 0.308 & 0.343 & 0.320 & 0.345 \\
          & 192   & \secondres{0.351} & 0.386 & 0.364 & 0.385 & \boldres{0.347} & 0.379 & 0.368 & 0.381 & 0.365 & \secondres{0.375} & 0.361 & \boldres{0.371} & 0.388 & 0.395 & 0.354 & 0.402 & 0.376 & 0.401 & 0.384 & 0.392 & 0.406 & 0.399 \\
          & 336   & 0.391 & 0.418 & 0.417 & 0.425 & 0.406 & 0.419 & \secondres{0.370} & 0.393 & 0.376 & \boldres{0.390} & 0.390 & \boldres{0.390} & 0.422 & 0.427 & \boldres{0.356} & \secondres{0.407} & 0.408 & 0.431 & 0.429 & 0.430 & 0.492 & 0.453 \\
          & 720   & 0.419 & 0.454 & 0.537 & 0.496 & 0.439 & 0.447 & \secondres{0.411} & \secondres{0.426} & 0.416 & 0.433 & 0.423 & \boldres{0.418} & 0.443 & 0.454 & \boldres{0.395} & 0.434 & 0.604 & 0.533 & 0.501 & 0.477 & 0.603 & 0.511 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} & 0.366 & 0.404 & 0.405 & 0.415 & 0.371 & 0.399 & \boldres{0.361} & 0.384 & \secondres{0.362} & \secondres{0.382} & 0.367 & \boldres{0.377} & 0.392 & 0.406 & \boldres{0.361} & 0.409 & 0.424 & 0.430 & 0.405 & 0.410 & 0.455 & 0.427 \\
    \midrule
    \multirow{4}[0]{*}{ETTm1} & 96    & 0.338 & 0.368 & \secondres{0.309} & 0.357 & \boldres{0.281} & \boldres{0.341} & 0.418 & 0.392 & 0.363 & \secondres{0.356} & 0.380 & 0.361 & 0.361 & 0.370 & 0.654 & 0.527 & 0.511 & 0.423 & 0.454 & 0.408 & 0.457 & 0.403 \\
          & 192   & 0.353 & 0.388 & \secondres{0.346} & 0.381 & \boldres{0.305} & \boldres{0.358}  & 0.431 & 0.405 & 0.388 & \secondres{0.375} & 0.412 & 0.383 & 0.414 & 0.405 & 0.662 & 0.532 & 0.618 & 0.485 & 0.567 & 0.477 & 0.530 & 0.450 \\
          & 336   & 0.381 & 0.413 & \secondres{0.373} & 0.408 & \boldres{0.369} & \secondres{0.395}  & 0.433 & 0.412 & 0.416 & \boldres{0.392} & 0.436 & 0.400 & 0.445 & 0.429 & 0.672 & 0.537 & 0.683 & 0.524 & 0.662 & 0.525 & 0.577 & 0.481 \\
          & 720   & 0.504 & 0.493 & 0.475 & 0.477 & 0.469 & 0.472  & \secondres{0.462} & 0.432 & \boldres{0.460} & \boldres{0.418} & \secondres{0.462} & \secondres{0.420} & 0.512 & 0.471 & 0.692 & 0.551 & 0.748 & 0.566 & 0.900 & 0.591 & 0.660 & 0.526 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} & 0.394 & 0.415  & \secondres{0.376} & 0.405 & \boldres{0.356} & \secondres{0.391} & 0.436 & 0.410 & 0.406 & \boldres{0.385} & 0.422 & 0.391 & 0.433 & 0.418 & 0.670 & 0.536 & 0.640 & 0.499 & 0.645 & 0.500 & 0.555 & 0.465 \\
    \midrule
    \multirow{4}[0]{*}{ETTm2} & 96    & 0.201 & 0.291 & \boldres{0.197} & 0.286 & \secondres{0.198} & 0.288 & 0.214 & 0.288 & 0.205 & \secondres{0.273} & 0.211 & 0.274 & 0.202 & \boldres{0.270} & 0.260 & 0.335 & 0.209 & 0.291 & 0.199 & 0.274 & \boldres{0.197} & 0.271 \\
          & 192   & 0.258 & 0.334 & \secondres{0.250} & 0.322 & \boldres{0.235} & \boldres{0.312} & 0.284 & 0.332 & 0.275 & 0.316 & 0.281 & 0.318 & 0.289 & 0.321 & 0.289 & 0.350 & 0.280 & 0.341 & 0.261 & 0.322 & 0.254 & \secondres{0.314} \\
          & 336   & 0.324 & 0.373 & 0.337 & 0.375 & \boldres{0.293} & \boldres{0.348} & 0.331 & 0.362 & 0.329 & \secondres{0.350} & 0.341 & 0.355 & 0.360 & 0.366 & 0.324 & 0.369 & 0.354 & 0.390 & 0.326 & 0.366 & \secondres{0.313} & 0.353 \\
          & 720   & 0.488 & 0.464 & 0.480 & 0.461 & 0.427 & 0.428 & \secondres{0.402} & \boldres{0.408} & 0.437 & \secondres{0.411} & 0.485 & 0.428 & 0.462 & 0.430 & \boldres{0.394} & 0.409 & 0.553 & 0.499 & 0.455 & 0.439 & 0.416 & 0.415 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} & 0.317 & 0.365  & 0.316 & 0.361 & \boldres{0.288} & 0.344 & 0.307 & 0.347 & 0.311 & \boldres{0.337} & 0.329 & 0.343 & 0.328 & 0.346 & 0.316 & 0.365 & 0.349 & 0.380 & 0.310 & 0.350 & \secondres{0.295} & \secondres{0.338} \\
    \midrule
    \multirow{4}[0]{*}{Weather} & 96    & 0.160 & 0.214 & \secondres{0.159} & \secondres{0.213} & \boldres{0.157} & \boldres{0.211} & 0.198 & 0.222 & 0.220 & 0.217 & 0.199 & \boldres{0.211} & - & - & 0.243 & 0.255 & 0.211 & 0.243 & 0.203 & 0.238 & 0.194 & 0.235 \\
          & 192   & \secondres{0.210} & 0.260 & 0.215 & 0.266 & \boldres{0.208} & \secondres{0.256} & 0.247 & 0.265 & 0.271 & 0.259 & 0.246 & \boldres{0.251} & - & - & 0.278 & 0.329 & 0.263 & 0.294 & 0.256 & 0.290 & 0.249 & 0.285 \\
          & 336   & \secondres{0.274} & 0.309 & 0.291 & 0.322 & \boldres{0.255} & \boldres{0.290} & 0.283 & 0.303 & 0.286 & 0.297 & \secondres{0.274} & \secondres{0.291} & - & - & 0.306 & 0.346 & 0.321 & 0.339 & 0.314 & 0.336 & 0.302 & 0.327 \\
          & 720   & 0.418 & 0.405 & 0.415 & 0.400 & 0.405 & 0.397 & 0.373 & 0.354 & 0.373 & \secondres{0.354} & \boldres{0.337} & \boldres{0.340} & - & - & \secondres{0.350} & 0.374 & 0.404 & 0.397 & 0.397 & 0.396 & 0.372 & 0.378 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} &  0.265 & 0.297 & 0.270 & 0.300 & \boldres{0.256} & 0.288 & 0.275 & 0.286 & 0.287 & \secondres{0.281} & \secondres{0.264} & \boldres{0.273} & - & - & 0.294 & 0.326 & 0.300 & 0.318 & 0.292 & 0.315 & 0.279 & 0.306 \\
    \midrule
    \multirow{4}[0]{*}{Global Temp} & 96    & \secondres{0.211} & \secondres{0.343} & \boldres{0.210} & \boldres{0.342} & 0.214 & 0.345 & 0.227 & 0.354 & 0.224 & 0.351 & 0.224 & 0.351 & 0.255 & 0.375 & 0.363 & 0.472 & 0.234 & 0.361 & 0.230 & 0.355 & 0.228 & 0.354  \\
        & 192   & 0.257 & 0.386 & \secondres{0.254} & \secondres{0.385} & \boldres{0.246} & \boldres{0.379} & 0.269 & 0.396 & 0.266 & 0.394 & 0.267 & 0.395 & 0.313 & 0.423 & 0.387 & 0.489 & 0.276 & 0.400 & 0.273 & 0.395 & 0.276 & 0.398  \\
          & 336   & 0.281 & 0.405 & \secondres{0.267} & \boldres{0.395} & \boldres{0.266} & \secondres{0.398} & 0.292 & 0.419 & 0.296 & 0.420 & 0.291 & 0.417 & 0.362 & 0.460 & 0.430 & 0.517 & 0.314 & 0.431 & 0.324 & 0.434 & 0.327 & 0.437 \\
          & 720   & 0.354 & 0.465 & \secondres{0.289} & \boldres{0.420} & \boldres{0.288} & \secondres{0.421} & 0.351 & 0.437 & 0.403 & 0.498 & 0.387 & 0.488 & 0.486 & 0.545 & 0.582 & 0.617 & 0.418 & 0.504 & 0.505 & 0.542 & 0.472 & 0.535  \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} & 0.275 & \secondres{0.400} & \secondres{0.255} & \boldres{0.385} & \boldres{0.253} & \boldres{0.385} & 0.285 & 0.409 & 0.297 & 0.416 & 0.292 & 0.413 & 0.354 & 0.451 & 0.440 & 0.524 & 0.311 & 0.424 & 0.333 & 0.431 & 0.326 & 0.431 \\
    \midrule
    \rowcolor{blue!15}
    \multicolumn{2}{c|}{\scalebox{1.1}
    {\textbf{Average}}} & \secondres{0.336} & 0.384 & \secondres{0.336} & 0.380 & \boldres{0.322} & \secondres{0.372} & 0.349 & 0.377 & 0.347 & \boldres{0.370} & 0.359 & 0.373 & 0.396 & 0.413 & 0.461 & 0.454 & 0.428 & 0.420 & 0.429 & 0.412 & 0.416 & 0.405 \\
    \midrule
    \rowc
    \multicolumn{2}{c|}{\scalebox{1.1}{\textbf{{$1^{\text{st}}$ Count}}}} & \multicolumn{2}{c|}{3} & \multicolumn{2}{c|}{10} & \multicolumn{2}{c|}{\boldres{28}} & \multicolumn{2}{c|}{2} & \multicolumn{2}{c|}{\secondres{11}} & \multicolumn{2}{c|}{10} & \multicolumn{2}{c|}{1} & \multicolumn{2}{c|}{4} & \multicolumn{2}{c|}{0} & \multicolumn{2}{c|}{0} & \multicolumn{2}{c}{1} \\
    \end{tabular}%
  }
```
`\label{tab:zero_shot_full}`{=latex} `\vspace{-2mm}`{=latex}

`\noindent`{=latex}**Setup.** Time series foundation models have recently demonstrated impressive zero-shot learning capabilities [@liang2024foundation; @liu2024autotimes]. In this section, we conduct experiments on six well-known long-term forecasting benchmarks whose datasets were *not* included in the pre-training corpora. We use four prediction horizons, $\{96, 192, 336, 720\}$, with corresponding input context lengths $\{512, 1024, 2048, 3072\}$. We adopt mean squared error (MSE) and mean absolute error (MAE) as evaluation metrics.

`\noindent`{=latex}**Results.** Detailed results of zero-shot forecasting are in Table `\ref{tab:zero_shot_full}`{=latex}. **`\method`{=latex}** ***achieves consistently state-of-the-art performance, with an average MSE reduction exceeding **20%** over the most competitive baselines.*** Importantly, as the model size scales (e.g., `\basemodel `{=latex}$\rightarrow$ `\ultramodel`{=latex}), it continuously exhibits enhanced performance across all datasets, affirming the efficacy of scaling laws within our time series foundation models. Furthermore, in comparisons with strong baselines that have a similar number of activated parameters, `\method `{=latex}demonstrates significantly superior performance. The largest models among the state-of-the-art baselines are Chronos~large~, Moment, and Moirai~large~. Compared to those models, `\method `{=latex}achieves average MSE reductions of **23%**, **30%**, and **11%**, respectively.

In-distribution Forecasting
---------------------------

```{=latex}
\centering
```
```{=latex}
\resizebox{\columnwidth}{!}{
    %\renewcommand{\arraystretch}{1.1}
    \renewcommand{\tabcolsep}{3pt}
    \begin{tabular}{cr|cc|cc|cc|cc|cc|cc|cc|cc|cc|cc|cc}
          \toprule
            \multicolumn{2}{c}{\multirow{3}{*}{\scalebox{1.2}{\textbf{Models}}}}       & \multicolumn{6}{c}{\emoji{figures/timemoe-logo.png} \textbf{\method (Ours)}}                 & \multicolumn{16}{c}{\emoji{figures/full-shot.png} \textbf{Full-shot Time Series Models}} \\
          \cmidrule(lr){3-8} \cmidrule(lr){9-24}
          &       & \multicolumn{2}{c}{\textbf{{\basemodel}}} & \multicolumn{2}{c}{\textbf{{\largemodel}}} & \multicolumn{2}{c}{\textbf{{\ultramodel}}} & \multicolumn{2}{c}{\textbf{iTransformer}} & \multicolumn{2}{c}{\textbf{TimeMixer}} &  \multicolumn{2}{c}{\textbf{TimesNet}} & \multicolumn{2}{c}{\textbf{PatchTST}} & \multicolumn{2}{c}{\textbf{Crossformer}} & \multicolumn{2}{c}{\textbf{TiDE}} & \multicolumn{2}{c}{\textbf{DLinear}} & \multicolumn{2}{c}{\textbf{FEDformer}} \\

          \cmidrule(lr){3-4} \cmidrule(lr){5-6}\cmidrule(lr){7-8} \cmidrule(lr){9-10}\cmidrule(lr){11-12}\cmidrule(lr){13-14}\cmidrule(lr){15-16}\cmidrule(lr){17-18}\cmidrule(lr){19-20} \cmidrule(lr){21-22} \cmidrule(lr){23-24}
          
          \multicolumn{2}{c}{\scalebox{1.2}{\textbf{Metrics}}}& \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} \\
          \midrule    \multirow{4}[1]{*}{ETTh1} & 96    & 0.345 & 0.373 & \secondres{0.335} & \secondres{0.371} & \boldres{0.323} & \boldres{0.365} & 0.386 & 0.405 & 0.375 & 0.400 & 0.384 & 0.402 & 0.414 & 0.419 & 0.423 & 0.448 & 0.479 & 0.464 & 0.386 & 0.400 & 0.376 & 0.419 \\
          & 192   & \secondres{0.372} & \secondres{0.396} & 0.374 & 0.400 & \boldres{0.359} & \boldres{0.391} & 0.441 & 0.436 & 0.436 & 0.429 & 0.421 & 0.429 & 0.460 & 0.445 & 0.471 & 0.474 & 0.525 & 0.492 & 0.437 & 0.432 & 0.420 & 0.448 \\
          & 336   & \secondres{0.389} & \boldres{0.412} & 0.390 & \boldres{0.412} & \boldres{0.388} & \secondres{0.418} & 0.487 & 0.458 & 0.484 & 0.458 & 0.491 & 0.469 & 0.501 & 0.466 & 0.570 & 0.546 & 0.565 & 0.515 & 0.481 & 0.459 & 0.459 & 0.465 \\
          & 720   & \secondres{0.410} & \secondres{0.443} & \boldres{0.402} & \boldres{0.433} & 0.425 & 0.450 & 0.503 & 0.491 & 0.498 & 0.482 & 0.521 & 0.500 & 0.500 & 0.488 & 0.653 & 0.621 & 0.594 & 0.558 & 0.519 & 0.516 & 0.506 & 0.507 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} &0.379 & \secondres{0.406} & \secondres{0.375} & \boldres{0.404} & \boldres{0.373} & \secondres{0.406} & 0.454 & 0.447 & 0.448 & 0.442 & 0.454 & 0.450 & 0.468 & 0.454 & 0.529 & 0.522 & 0.540 & 0.507 & 0.455 & 0.451 & 0.440 & 0.459 \\
    \midrule
    \multirow{4}[0]{*}{ETTh2} &96    & \secondres{0.276} & \secondres{0.340} & 0.278 & \boldres{0.335} & \boldres{0.274} & 0.338 & 0.297 & 0.349 & 0.289 & 0.341 & 0.340 & 0.374 & 0.302 & 0.348 & 0.745 & 0.584 & 0.400 & 0.440 & 0.333 & 0.387 & 0.358 & 0.397 \\
          & 192   & \secondres{0.331} & \secondres{0.371} & 0.345 & 0.373 & \boldres{0.330} & \boldres{0.370} & 0.380 & 0.400 & 0.372 & 0.392  & 0.402 & 0.414 & 0.388 & 0.400 & 0.877 & 0.656 & 0.528 & 0.509 & 0.477 & 0.476 & 0.429 & 0.439 \\
          & 336   & \secondres{0.373} & \secondres{0.402} & 0.384 & \secondres{0.402} & \boldres{0.362} & \boldres{0.396} & 0.428 & 0.432 & 0.386 & 0.414 & 0.452 & 0.541 & 0.426 & 0.433 & 1.043 & 0.731 & 0.643 & 0.571 & 0.594 & 0.541 & 0.496 & 0.487 \\
          & 720   & \secondres{0.404} & \secondres{0.431} & 0.437 & 0.437 & \boldres{0.370} & \boldres{0.417} & 0.427 & 0.445  & 0.412 & 0.434 & 0.462 & 0.657 & 0.431 & 0.446 & 1.104 & 0.763 & 0.874 & 0.679 & 0.831 & 0.657 & 0.463 & 0.474 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} & \secondres{0.346} & \secondres{0.386} & 0.361 & \secondres{0.386} & \boldres{0.334} & \boldres{0.380} & 0.383 & 0.406 & 0.364 & 0.395 & 0.414 & 0.496 & 0.386 & 0.406 & 0.942 & 0.683 & 0.611 & 0.549 & 0.558 & 0.515 & 0.436 & 0.449 \\ 
    \midrule
    \multirow{4}[0]{*}{ETTm1} & 96    & 0.286 & 0.334 & \secondres{0.264} & \secondres{0.325} & \boldres{0.256} & \boldres{0.323} & 0.334 & 0.368 & 0.320 & 0.357 & 0.338 & 0.375 & 0.329 & 0.367 & 0.404 & 0.426 & 0.364 & 0.387 & 0.345 & 0.372 & 0.379 & 0.419 \\
          & 192   & 0.307 & 0.358 & \secondres{0.295} & \secondres{0.350} & \boldres{0.281} & \boldres{0.343} & 0.377 & 0.391 & 0.361 & 0.381 & 0.374 & 0.387 & 0.367 & 0.385 & 0.450 & 0.451 & 0.398 & 0.404 & 0.380 & 0.389  & 0.426 & 0.441 \\
          & 336   & 0.354 & 0.390 & \boldres{0.323} & \secondres{0.376} & \secondres{0.326} & \boldres{0.374} & 0.426 & 0.420 & 0.390 & 0.404  & 0.410 & 0.411 & 0.399 & 0.410 & 0.532 & 0.515 & 0.428 & 0.425 & 0.413 & 0.413 & 0.445 & 0.459 \\
          & 720   & \secondres{0.433} & 0.445 & \boldres{0.409} & \boldres{0.435} & 0.454 & 0.452 & 0.491 & 0.459 & 0.454 & \secondres{0.441} & 0.478 & 0.450 & 0.454 & 0.439 & 0.666 & 0.589 & 0.487 & 0.461 & 0.474 & 0.453 & 0.543 & 0.490 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} & 0.345 & 0.381 & \boldres{0.322} & \boldres{0.371} & \secondres{0.329} & \secondres{0.373} & 0.407 & 0.409 & 0.381 & 0.395 & 0.400 & 0.405 & 0.387 & 0.400 & 0.513 & 0.495 & 0.419 & 0.419 & 0.403 & 0.406 & 0.448 & 0.452 \\           
    \midrule
    \multirow{4}[0]{*}{ETTm2} & 96    & \secondres{0.172} & 0.265 & \boldres{0.169} & \secondres{0.259} & 0.183 & 0.273 & 0.180 & 0.264 & 0.175 & \boldres{0.258} & 0.187 & 0.267 & 0.175 & \secondres{0.259} & 0.287 & 0.366 & 0.207 & 0.305 & 0.193 & 0.292 & 0.203 & 0.287 \\
          & 192   & \secondres{0.228} & 0.306 & \boldres{0.223} & \boldres{0.295} & \boldres{0.223} & 0.301 & 0.250 & 0.309   & 0.237 & \secondres{0.299} & 0.249 & 0.309 & 0.241 & 0.302 & 0.414 & 0.492   & 0.290 & 0.364 & 0.284 & 0.362 & 0.269 & 0.328 \\
          & 336   & \secondres{0.281} & 0.345 & 0.293 & 0.341 & \boldres{0.278} & \boldres{0.339} & 0.311 & 0.348   & 0.298 & \secondres{0.340} & 0.321 & 0.351 & 0.305 & 0.343 & 0.597 & 0.542   & 0.377 & 0.422 & 0.369 & 0.427 & 0.325 & 0.366 \\
          & 720   & 0.403 & 0.424 & 0.451 & 0.433 & 0.425 & 0.424 & 0.412 & 0.407   & \boldres{0.391} & \boldres{0.396} & 0.408 & 0.403 & \secondres{0.402} & \secondres{0.400} & 1.730 & 1.042 & 0.558 & 0.524 & 0.554 & 0.522 & 0.421 & 0.415 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} & \boldres{0.271} & 0.335 & 0.284 & 0.332 & 0.277 & 0.334 & 0.288 & 0.332 & \secondres{0.275} & \boldres{0.323} & 0.291 & 0.332 & 0.280 & \secondres{0.326} & 0.757 & 0.610 & 0.358 & 0.403 & 0.350 & 0.400 & 0.304 & 0.349  \\                      
    \midrule
    \multirow{4}[0]{*}{Weather} & 96    & \secondres{0.151} & \secondres{0.203} & \boldres{0.149} & \boldres{0.201} & 0.154 & 0.208 & 0.174 & 0.214 & 0.163 & 0.209 & 0.172 & 0.220 & 0.177 & 0.218 & 0.158 & 0.230 & 0.202 & 0.261 & 0.196 & 0.255 & 0.217 & 0.296 \\
          & 192   & \secondres{0.195} & \secondres{0.246} & \boldres{0.192} & \boldres{0.244} & 0.202 & 0.251 & 0.221 & 0.254 & 0.208 & 0.250 & 0.219 & 0.261 & 0.225 & 0.259 & 0.206 & 0.277 & 0.242 & 0.298 & 0.237 & 0.296  & 0.276 & 0.336 \\
          & 336   & \secondres{0.247} & 0.288 & \boldres{0.245} & \boldres{0.285} & 0.252 & \secondres{0.287} & 0.278 & 0.296 & 0.251 & \secondres{0.287} & 0.280 & 0.306 & 0.278 & 0.297 & 0.272 & 0.335 & 0.287 & 0.335 & 0.283 & 0.335 & 0.339 & 0.380 \\
          & 720   & \secondres{0.352} & 0.366 & \secondres{0.352} & 0.365 & 0.392 & 0.376 & 0.358 & \secondres{0.349} & \boldres{0.339} & \boldres{0.341} & 0.365 & 0.359 & 0.354 & 0.348 & 0.398 & 0.418 & 0.351 & 0.386 & 0.345 & 0.381 & 0.403 & 0.428 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} & \secondres{0.236} & 0.275 & \boldres{0.234} & \secondres{0.273} & 0.250 & 0.280 & 0.257 & 0.278 & 0.240 & \boldres{0.271} & 0.259 & 0.286 & 0.258 & 0.280 & 0.258 & 0.315 & 0.270 & 0.320 & 0.265 & 0.316 & 0.308 & 0.360  \\   
    \midrule
     \multirow{4}[0]{*}{Global Temp} & 96    & \secondres{0.192} & \secondres{0.328} & \secondres{0.192} & 0.329 & \boldres{0.189} & \boldres{0.322}  & 0.223 & 0.351 & 0.215 & 0.346 & 0.250 & 0.381 & 0.219 & 0.349 & 0.272 & 0.406 & 0.223 & 0.352 & 0.221 & 0.354 & 0.261 & 0.392 \\
          & 192   & 0.238 & \boldres{0.375} & \secondres{0.236} &\boldres{ 0.375} & \boldres{0.234} & \secondres{0.376} & 0.282 & 0.404 & 0.266 & 0.393 & 0.298 & 0.418 & 0.269 & 0.395 & 0.305 & 0.435 & 0.278 & 0.401 & 0.257 & 0.388 & 0.299 & 0.423 \\
          & 336   & 0.259 & \boldres{0.397} & \secondres{0.256} & \boldres{0.397} & \boldres{0.253} & \secondres{0.399} & 0.313 & 0.431 & 0.313 & 0.430 & 0.315 & 0.434 & 0.319 & 0.435 & 0.352 & 0.468 & 0.330 & 0.440 & 0.294 & 0.418 & 0.341 & 0.454 \\
          & 720   & 0.345 & 0.465  & \secondres{0.322} & \secondres{0.451} & \boldres{0.292} & \boldres{0.426} & 0.393 & 0.488 & 0.468 & 0.536 & 0.407 & 0.497 & 0.452 & 0.526 & 0.508 & 0.562 & 0.485 & 0.544 & 0.380 & 0.479 & 0.359 & 0.469 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} & 0.258 & 0.391 & \secondres{0.251} & \secondres{0.388} & \boldres{0.242} & \boldres{0.380} & 0.303 & 0.419 & 0.316 & 0.426 & 0.318 & 0.433 & 0.315 & 0.426 & 0.359 & 0.468 & 0.329 & 0.434 & 0.288 & 0.410 & 0.315 & 0.435 \\  
    \midrule
    \rowcolor{blue!15}
    \multicolumn{2}{c|}{\scalebox{1.1}
    {\textbf{Average}}} & 0.306 & 0.362 & \secondres{0.304} & \secondres{0.359} & \boldres{0.301} & \boldres{0.358} & 0.349 & 0.382 & 0.337 & 0.375 & 0.356 & 0.400 & 0.349 & 0.382 & 0.560 & 0.516 & 0.421 & 0.439 & 0.387 & 0.416 & 0.375  & 0.417 \\
    \midrule
    \multicolumn{2}{c}{\textbf{{$1^{\text{st}}$ Count}}} & \multicolumn{2}{c}{4} & \multicolumn{2}{c}{\secondres{21}} & \multicolumn{2}{c}{\boldres{33}} & \multicolumn{2}{c}{0} & \multicolumn{2}{c}{7} & \multicolumn{2}{c}{0} & \multicolumn{2}{c}{0} & \multicolumn{2}{c}{0} &  \multicolumn{2}{c}{0} & \multicolumn{2}{c}{0} & \multicolumn{2}{c}{0} \\
    \bottomrule
    \end{tabular}%
    }
```
`\label{tab:lsf_full}`{=latex}

`\noindent`{=latex}**Setup.** We fine-tune the pre-trained `\method `{=latex}models on the training split of each of the six benchmarks described above, using only a single fine-tuning epoch.

`\noindent`{=latex}**Results.** The full results are in Table `\ref{tab:lsf_full}`{=latex}. **`\method`{=latex}** ***exhibits remarkable capabilities, comprehensively surpassing advanced deep time series models from recent years and achieving an average MSE reduction of 24%.*** Fine-tuning on downstream data for only one epoch significantly improves predictive performance, showcasing the remarkable potential of large time series models built on the MoE architecture. As in zero-shot forecasting, the scaling law continues to hold as model size increases, leading to continuous improvements in the performance of `\method`{=latex}.

Ablation Study
--------------

```{=latex}
\vspace{-1mm}
```
```{=latex}
\centering
```
```{=latex}
\begin{subtable}[b]{0.45\textwidth}
        \centering
        \vskip -0.02in
        \resizebox{0.78\columnwidth}{!}{
        \begin{tabular}{lcc}
            \toprule
                &
                \textbf{Average MSE} \\ 
            \midrule
                \rowcolor{tabhighlight} \basemodel &
                \textbf{0.262} \\
                \hspace{2em} w/o Huber loss &
                0.267 \\
                \hspace{2em} w/o multi-resolution layer &
                0.269 \\
                \hspace{2em} w/o mixture-of-experts &
                0.272 \\
                \hspace{2em} w/o auxiliary loss &
                0.275 \\
            \bottomrule
        \end{tabular}
        }
    \end{subtable}
```
```{=latex}
\begin{subtable}[b]{0.49\textwidth}
        \centering
        % \vskip -0.02in
        \resizebox{\columnwidth}{!}{
        \begin{tabular}{lccc}
            \toprule
                &
                \textbf{Average MSE} &
                \textbf{Inference Speed} \\
            \midrule
                \rowcolor{tabhighlight} \basemodel w/ \{1,8,32,64\} &
                \textbf{0.262} &
                \textbf{0.095 \text{s/iter}} \\
                \basemodel w/ \{1,8,32\} &
                0.273 &
                0.130 \text{s/iter} \\
                \basemodel w/ \{1,8\} &
                0.320 &
                0.411 \text{s/iter} \\
                \basemodel w/ \{1\} &
                1.382 &
                2.834 \text{s/iter} \\
            \bottomrule
        \end{tabular}
        }
    \end{subtable}
```
`\label{tab:ablation_study}`{=latex}

To validate our designs in `\method `{=latex}, we conducted detailed ablation studies on key architectural components and loss functions across all experimental benchmarks, as shown in Table  `\ref{tab:ablation_study}`{=latex}.

#### Model Architecture.

Replacing the MoE layers with standard FFNs (w/o mixture-of-experts) degraded the average MSE from $0.262$ to $0.272$, highlighting the performance boost provided by the sparse architecture. A detailed comparison of dense and sparse models is presented in Section `\ref{sec:scale_analysis}`{=latex}. We also removed all multi-resolution output layers from the `\basemodel`{=latex} except the horizon-32 head, thereby disabling multi-task optimization (w/o multi-resolution layer); the resulting model performed slightly worse than the `\basemodel`{=latex}. Additionally, as shown on the right side of Table `\ref{tab:ablation_study}`{=latex}, our default choice of four multi-resolution output projections with receptive horizons of $\{1, 8, 32, 64\}$ yields both the best predictive performance and the fastest inference. As the number of multi-resolution output projections is reduced, performance consistently declines and inference time increases substantially, confirming the rationality of our multi-resolution output projection design.

#### Training Loss.

Models trained with Huber loss outperformed those using MSE loss (w/o Huber loss), owing to Huber loss's superior robustness to outlier time points. We also removed the auxiliary loss from the objective function, retaining only the auto-regressive loss (w/o auxiliary loss) while still using the MoE architecture. This adjustment caused the expert layers to collapse toward a single smaller FFN during training, as the activation score of the most effective expert grew disproportionately without the load-balancing loss. Consequently, the model's performance was significantly worse than that of the `\basemodel`{=latex}.
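
For reference, the Huber objective behaves quadratically for small residuals and linearly for large ones, which is what limits the influence of outliers. The sketch below is illustrative; the threshold `delta` is an assumption, as its value is not specified here.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic near zero, linear for large residuals.

    `delta` is an assumed threshold; the paper does not specify its value.
    """
    r = np.abs(y_true - y_pred)
    quad = 0.5 * r ** 2                  # quadratic branch (|r| <= delta)
    lin = delta * (r - 0.5 * delta)      # linear branch (|r| > delta)
    return float(np.where(r <= delta, quad, lin).mean())
```

An outlier residual of 10 thus contributes roughly $10\delta$ to the loss rather than $50$, which stabilizes training on noisy series.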

Scalability Analysis {#sec:scale_analysis}
--------------------

#### Dense versus Sparse Models.

To assess the performance and efficiency benefits of sparse architectures in time series forecasting, we replaced each MoE layer with a dense layer whose parameter count equals the number of activated parameters in the MoE layer. Using an identical training setup and data, we trained three dense models corresponding to the sizes of the three `\method `{=latex}models. A zero-shot performance comparison between the dense and sparse models is shown in Figure `\ref{fig:times_moe_trace}`{=latex}. Our approach reduced training costs by an average of **78%** and inference costs by **39%** compared to the dense variants. This clearly demonstrates the advantages of `\method`{=latex}, particularly in maintaining exceptional performance while significantly reducing costs.

#### Model and Data Scaling.

We save model checkpoints every 20 billion training time points, allowing us to plot performance traces for models of different sizes trained on various data scales. The right side of Figure `\ref{fig:times_moe_trace}`{=latex} shows that models trained on larger datasets consistently outperform those trained on smaller datasets, regardless of model size. Our empirical results confirm that as both data volume and model parameters scale, sparse models demonstrate continuous and substantial improvements in performance, and achieve better forecasting accuracy than their dense counterparts at the same scale.

```{=latex}
\centering
```
![Scalability analysis. **(Left)** Comparison of dense and sparse models in terms of training and inference costs. **(Right)** Average MSE for 96-horizon forecasting across six benchmarks, comparing `\method `{=latex}and dense models, both trained from scratch with varying data sizes.](figures/scaling_properties_v3.png){#fig:times_moe_trace width="0.9\\linewidth"}

#### Training Precision.

We trained a new model, `\basemodel`{=latex}(FP32), using identical configurations but with float32 precision instead of bfloat16. As shown in Table `\ref{tab:precision_comp}`{=latex}, the forecasting performance of both models is comparable. However, the bfloat16 model achieves a **12%** improvement in training speed and reduces memory consumption by **20%** compared to the float32 model. Moreover, the bfloat16 model can seamlessly integrate with flash-attention [@dao2023flashattention2], further boosting training and inference speed by **23%** and **19%** respectively.

`\centering`{=latex} `\resizebox{\columnwidth}{!}{
    \begin{tabular}{lccccccc}
        \toprule
            &
            \textbf{Average MSE} &
            \textbf{Training Speed} &
            \textbf{Inference Speed} &
            \textbf{Training Memory} &
            \textbf{Inference Memory} \\ 
        \midrule
            \rowcolor{tabhighlight}
            \basemodel &
            0.262 &
            0.84 s/iter &
            0.095 \text{s/iter} &
            1.77 \text{GB} &
            226.70 \text{MB} \\
            \basemodel w/o FA &
            0.262 &
            1.09 \text{s/iter} &
            0.118 \text{s/iter} &
            1.77 \text{GB} &
            226.70 \text{MB} \\
            \basemodel w/ FP32 &
            0.261 &
            1.24 \text{s/iter} &
            0.133 \text{s/iter} &
            2.21 \text{GB} &
            453.41 \text{MB} \\
        \bottomrule
    \end{tabular}
   }`{=latex}

Sparsification Analysis
-----------------------

#### Activation Visualization.

As shown in Figure `\ref{fig:expert_gate_score}`{=latex}, `\method `{=latex}dynamically activates different experts across datasets, with each expert specializing in distinct knowledge. This leads to diverse activation patterns across datasets from different domains, showcasing `\method`{=latex}'s strong generalization capabilities. The heterogeneous activations indicate that the model adapts its learned representations to the specific characteristics of each dataset, contributing to its strong transferability and generalization as a large-scale time series foundation model.

```{=latex}
\centering
```
![Gating scores for experts across different layers in the six benchmarks.](figures/expert_gate_score_v5.png){#fig:expert_gate_score width="\\linewidth"}

#### Number of Experts.

```{=latex}
\begin{wraptable}{R}{0.45\textwidth}
\vspace{-4mm}
    \caption{Performance and inference speed across different top$_{k}$ setups. Average MSE for horizon-96 forecasting evaluated across six benchmarks. Inference speed is reported in s/iter; lower is faster.}
    \label{tab:sparse_analysis}
    \centering
    \resizebox{0.43\columnwidth}{!}{
    \renewcommand{\tabcolsep}{3.5pt}
    \begin{tabular}{lccc}
        \toprule
           \basemodel &
            \textbf{Average MSE} &
            \textbf{Inference Speed} \\
        \midrule
             w/ \{Top$_{1}$\} &
            0.264 &
            0.082 \text{s/iter} \\
            \rowcolor{tabhighlight}
             w/ \{Top$_{2}$\} &
            \textbf{0.262} &
            \textbf{0.095 \text{s/iter}} \\
             w/ \{Top$_{4}$\} &
            0.262 &
            0.109 \text{s/iter} \\
             w/ \{Top$_{6}$\} &
            0.265 &
            0.120 \text{s/iter} \\
             w/ \{Top$_{8}$\} &
            0.269&
            0.129 \text{s/iter} \\
        \bottomrule
    \end{tabular}
    }
    \vspace{-3mm}
\end{wraptable}
```
We performed a sensitivity analysis on the number of activated experts, top$_{k}$, within the `\method `{=latex}architecture, as shown in Table `\ref{tab:sparse_analysis}`{=latex}. As $k$ increases, the average MSE changes only marginally, while inference time grows noticeably as more experts are activated. This indicates that increasing sparsity within the MoE architecture does not compromise performance yet significantly enhances computational efficiency. This balance is critical for scaling time series foundation models, where jointly optimizing performance and computational cost is essential, and sparse MoE architectures inherently offer advantages in both respects.

Conclusion
==========

In this paper, we introduced `\method`{=latex}, a scalable and unified architecture for time series foundation models that leverages a sparse design with mixture-of-experts to enhance computational efficiency without compromising model capacity. Pre-trained on our newly introduced large-scale time series dataset, `\dataset`{=latex}, `\method `{=latex}was scaled to 2.4 billion parameters, with 1.1 billion activated, demonstrating significant improvements in forecasting capabilities. Our results validate the scaling properties in time series forecasting, showing that `\method `{=latex}consistently outperforms dense models with equivalent computational budgets across multiple widely accepted benchmarks. With its ability to perform universal forecasting and superior performance in both zero-shot and fine-tuned scenarios, `\method `{=latex}establishes itself as a state-of-the-art solution for real-world forecasting challenges. This work lays the groundwork for future advancements in scaling and enhancing the efficiency of time series foundation models, paving the way toward time series general intelligence.

Acknowledgement {#acknowledgement .unnumbered}
===============

Y. Nie acknowledges financial support from Princeton Language and Intelligence at Princeton University. M. Jin was supported in part by the NVIDIA Academic Grant Program and CSIRO -- National Science Foundation (US) AI Research Collaboration Program.

```{=latex}
\bibliographystyle{iclr2025_conference}
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
Further Related Work {#sec:morework}
====================

In this section, we delve deeper into the related work on large time series models. Current research efforts in universal forecasting with time series foundation models can be broadly classified into three categories, as summarized in Table `\ref{tab:ltsm_comp}`{=latex}: (1) *encoder-only models*, such as Moirai [@woo2024unified] and Moment [@goswamimoment], which employ masked reconstruction and have been pre-trained on datasets containing 27B and 1B time points, respectively, with model sizes reaching up to 385M parameters; (2) *encoder-decoder models*, exemplified by Chronos [@ansari2024chronos], which offers pre-trained models at four scales, with up to 710M parameters; and (3) *decoder-only models*, including TimesFM [@dasdecoder], Lag-Llama [@rasul2023lagllama], and Timer [@liutimer], with the largest models containing up to 200M parameters. The concurrent work, Moirai-MoE [@liu2024moirai], includes up to 935M parameters but with a different expert and routing design. In contrast to these models, `\method `{=latex}introduces a scalable, unified architecture with a sparse mixture-of-experts design, optimized for larger time series forecasting models while reducing inference costs. Trained on our `\dataset `{=latex}dataset, comprising over 300B time points, `\method `{=latex}is scaled to 2.4B parameters for the first time. It outperforms existing models with the same number of activated parameters, significantly enhancing both model efficiency and forecasting precision, while avoiding limitations such as fixed context lengths or hardcoded heuristics.

```{=latex}
\begin{table*}[ht]\caption{Comparison between large time series models.}
  \vspace{-2pt}
  \label{tab:ltsm_comp}
  \centering
  \renewcommand{\multirowsetup}{\centering}
  \setlength{\tabcolsep}{2.5pt}
  \renewcommand\arraystretch{1.3}
  \resizebox{\columnwidth}{!}{
  \begin{tabular}{c|c|c|c|c|c|c|c|c}
    \toprule
    Method & Time-MoE & Moirai & TimesFM & Moment & Chronos & Timer & Lag-Llama & TimeGPT   \\ 
    \toprule
     \multirow{2}{*}{Architecture} & Decoder- & Encoder-  & Decoder- & Encoder- & Encoder-  & Decoder- & Decoder- & Encoder-\\
     & Only & Only & Only & Only & Decoder & Only & Only & Decoder \\
    \midrule
    (Max) Model Size & 2.4B & 311M & 200M & 385M & 710M & 67M & 200M & Unknown \\
    \midrule
    Input Token & Point & Patch & Patch & Patch & Point & Patch & Point & Patch \\
    \midrule
    Dataset Scale & 309B & 27B/231B\textsuperscript{*} & 100B & 1.13B & 84B & 28B & 0.36B & 100B \\
    \midrule
    Max Length & 4096 \textsuperscript{\dag} & 5000 & 512 & 512  & 512 & 1440 & 1024 & Unknown \\
    \midrule
    FFN & Sparse & Dense & Dense & Dense & Dense & Dense & Dense & Dense \\
    \midrule
    Open-source Data & \checkmark & \checkmark &  &  & \checkmark & \checkmark &  &   \\
    \midrule
    Source & Ours & \footnotesize\citeauthor{woo2024unified} & \footnotesize\citeauthor{dasdecoder} & \fontsize{8.5pt}{10pt}\selectfont\citeauthor{goswamimoment} & \footnotesize\citeauthor{ansari2024chronos} & \footnotesize\citeauthor{liutimer} & \footnotesize\citeauthor{rasul2023lagllama} & \footnotesize\citeauthor{garza2023timegpt} \\
    \bottomrule
    \multicolumn{9}{l}{\textsuperscript{*} \small{Depends on the calculation method used in the original paper.} \textsuperscript{\dag} \small{Indicates the total of the context and prediction lengths.}}
  \end{tabular}
  }
  \vspace{-2pt}
\end{table*}
```
Implementation Details {#sec:implements}
======================

#### Training Configuration.

Each model is trained for 100,000 steps with a batch size of 1,024, and a maximum sequence length capped at 4,096. This setup processes 4 million time points per iteration. We use forecast horizons of $\left\{ 1,8,32,64 \right\}$ in the output projection and set the auxiliary loss factor $\alpha$ to 0.02. For optimization, we apply the AdamW optimizer with the following hyperparameters: $\text{lr}=1\mathrm{e}\text{-}{3}$, $\text{weight\_decay}=1\mathrm{e}\text{-}{1}$, $\beta_1=0.9$, and $\beta_2=0.95$. A learning rate scheduler with a linear warmup for the first 10,000 steps, followed by cosine annealing, is used. Training is performed on 128 $\times$ NVIDIA A100-80G GPUs with BF16 precision. To improve batch processing efficiency and handle varying sequence lengths, we employ sequence packing [@raffel2020exploring], which reduces padding requirements.
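
The warmup-plus-cosine schedule described above can be sketched as follows. This is a minimal illustration using the stated hyperparameters ($\text{lr}=1\mathrm{e}{-}3$, 10,000 warmup steps, 100,000 total steps); the floor learning rate `min_lr` is an assumption, as the final rate is not specified.

```python
import math

def lr_at_step(step, max_lr=1e-3, warmup_steps=10_000,
               total_steps=100_000, min_lr=0.0):
    """Linear warmup for `warmup_steps`, then cosine annealing to `min_lr`."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, the rate rises linearly to its peak at step 10,000 and decays smoothly to the floor by step 100,000.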

#### Benchmark Details.

We evaluate the performance of various models for long-term forecasting across six well-established datasets: Weather [@wu2021autoformer], Global Temp [@wu2023corrformer], and the ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2) [@zhou2021informer]. A detailed description of each dataset is provided in Table `\ref{tab:dataset}`{=latex}.

```{=latex}
\vskip 0.05in
```
```{=latex}
\centering
```
```{=latex}
\resizebox{1.0\columnwidth}{!}{
  \begin{threeparttable}
  \begin{small}
  \renewcommand{\multirowsetup}{\centering}
  \setlength{\tabcolsep}{3.8pt}
  \begin{tabular}{c|l|c|c|c|c|c|c}
    \toprule
    Tasks & Dataset & Dim & Series Length & Dataset Size &Frequency &Forecastability$\ast$ &\scalebox{0.8}{Information} \\
    \toprule
     & ETTm1 & 7 & \scalebox{0.8}{\{96, 192, 336, 720\}} & (34465, 11521, 11521)  & 15min &0.46 &\scalebox{0.8}{Temperature}\\
    \cmidrule{2-8}
    & ETTm2 & 7 & \scalebox{0.8}{\{96, 192, 336, 720\}} & (34465, 11521, 11521)  & 15min &0.55 &\scalebox{0.8}{Temperature}\\
    \cmidrule{2-8}
     Long-term & ETTh1 & 7 & \scalebox{0.8}{\{96, 192, 336, 720\}} & (8545, 2881, 2881) & Hourly &0.38 &\scalebox{0.8}{Temperature} \\
    \cmidrule{2-8}
     Forecasting &ETTh2 & 7 & \scalebox{0.8}{\{96, 192, 336, 720\}} & (8545, 2881, 2881) & Hourly &0.45 &\scalebox{0.8}{Temperature} \\
    \cmidrule{2-8}
     & Weather & 21 & \scalebox{0.8}{\{96, 192, 336, 720\}} & (36792, 5271, 10540)  &10 min &0.75  &\scalebox{0.8}{Weather} \\
     \cmidrule{2-8}
     & Global Temp & 1000 & \scalebox{0.8}{\{96, 192, 336, 720\}} & (12280, 1755, 3509)  &Hourly &0.78  &\scalebox{0.8}{Temperature} \\
    \bottomrule
    \end{tabular}
     \begin{tablenotes}
        \item $\ast$ The forecastability is calculated by one minus the entropy of Fourier decomposition of time series \citep{goerg2013forecastable}. A larger value indicates better predictability.
    \end{tablenotes}
    \end{small}
  \end{threeparttable}
  }
```
```{=latex}
\vspace{-5pt}
```
#### Metrics.

We use mean square error (MSE) and mean absolute error (MAE) as evaluation metrics for time-series forecasting. These metrics are calculated as follows: $$\begin{aligned}
 \label{equ:metrics}
    \text{MSE} &= \frac{1}{H}\sum_{i=1}^H (x_{i} - \widehat{x}_{i})^2,
    &
    \text{MAE} &= \frac{1}{H}\sum_{i=1}^H|x_{i} - \widehat{x}_{i}|,\end{aligned}$$ where $x_{i},\widehat{x}_{i} \in \mathbb{R}$ are the ground truth and predictions of the $i$-th future time point.
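
The two metrics map directly to code; the following numpy sketch mirrors the equations above.

```python
import numpy as np

def mse(x, x_hat):
    """Mean squared error over the forecast horizon."""
    return float(np.mean((x - x_hat) ** 2))

def mae(x, x_hat):
    """Mean absolute error over the forecast horizon."""
    return float(np.mean(np.abs(x - x_hat)))
```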

#### Technical Details.

Our mixture-of-experts layer consists of one shared expert and several isolated experts, each represented by a feedforward network that is smaller than the standard FFN employed in dense models. In the formulation from Equations `\ref{equ:mixture}`{=latex} to `\ref{equ:iso_expert_gate}`{=latex}, $\text{FFN}_{N+1}$ denotes the shared expert, while $\text{FFN}_{1}$ to $\text{FFN}_{N}$ correspond to the isolated experts. The weight $g_{N+1,t}$ associated with the shared expert for token $t$ is normalized using the $\operatorname{Sigmoid}$ function. In contrast, the weight $g_{i,t}$ for the $i$-th isolated expert of token $t$ is normalized using the $\operatorname{Softmax}$ function. Furthermore, we retain only the top-$k$ largest scores among the isolated experts and set the remaining scores to zero.
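
A minimal sketch of this gating scheme, with illustrative function and variable names: softmax normalization over the isolated experts with top-$k$ sparsification, and a sigmoid weight for the shared expert.

```python
import numpy as np

def moe_gate(logits_iso, logit_shared, k=2):
    """Gating as described in the text.

    `logits_iso`: (N,) router logits for the N isolated experts;
    `logit_shared`: scalar logit for the shared expert.
    """
    probs = np.exp(logits_iso - logits_iso.max())
    probs /= probs.sum()                            # softmax over isolated experts
    g = np.zeros_like(probs)
    top = np.argsort(probs)[-k:]                    # keep the top-k largest scores
    g[top] = probs[top]                             # remaining scores set to zero
    g_shared = 1.0 / (1.0 + np.exp(-logit_shared))  # sigmoid for the shared expert
    return g, g_shared
```

The layer output is then the $g$-weighted sum of the selected isolated experts' FFN outputs plus $g_{N+1,t}$ times the shared expert's output.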

```{=latex}
\begin{wrapfigure}{r}{0.4\textwidth}
  \centering
    \includegraphics[width=0.38\textwidth]{figures/causal_attn.pdf}
    \caption{Causal attention layer.}
    \label{fig:causal_attn}
\end{wrapfigure}
```
To prevent routing collapse among experts, we adopt the strategy proposed by  [@fedus2022switch], incorporating an auxiliary loss to ensure balanced expert load. The key aspect of this method is to penalize experts with high gating scores. This helps prevent a scenario where stronger experts, being exposed to more tokens, become even stronger while weaker experts continue to fall behind. The mathematical formulation is presented in Equation `\ref{equ:aux_loss}`{=latex}, where $f_i$ represents the fraction of tokens assigned to expert $i$, and $r_i$ denotes the proportion of router probability allocated to expert $i$. If one expert is assigned too many tokens and achieves a higher routing score, it will incur a correspondingly higher loss.
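
Assuming the standard Switch-Transformer form $\mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_i f_i \cdot r_i$, this loss can be sketched as below; $\alpha = 0.02$ matches the training configuration, and the function name is illustrative.

```python
import numpy as np

def aux_load_balance_loss(router_probs, expert_assign, num_experts, alpha=0.02):
    """Switch-Transformer-style auxiliary loss: alpha * N * sum_i f_i * r_i.

    `router_probs`: (T, N) softmax router probabilities per token;
    `expert_assign`: (T,) index of the expert each token is routed to.
    """
    T = router_probs.shape[0]
    f = np.bincount(expert_assign, minlength=num_experts) / T  # token fraction per expert
    r = router_probs.mean(axis=0)                              # mean router probability
    return alpha * num_experts * float(np.sum(f * r))
```

An expert that receives both many tokens (large $f_i$) and high router probability (large $r_i$) dominates the product and is penalized, which counteracts the rich-get-richer dynamic described above.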

#### Multi-resolution Forecasting.

To construct the multi-resolution forecasting head, we define $P$ output projections, each corresponding to a distinct forecasting horizon, denoted as $\left(p_1, p_2, \ldots, p_P\right)$. The output projection for horizon $p_j$ is used to forecast the subsequent $p_j$ time steps, as follows: $$\hat{\mathbf{X}}_{t+1:t+p_{j}} = \mathbf{W}_{p_{j}}\mathbf{h}^L_t,$$ where $\mathbf{W}_{p_{j}} \in \mathbb{R}^{p_{j} \times D}$ is the learnable parameter matrix for that horizon, and $\mathbf{h}^L_t$ represents the output hidden state from the last MoE Transformer block. All output projections are optimized simultaneously during model training.
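
As an illustration of these projection heads (the weights here are randomly initialized rather than learned, and the default horizons follow the paper's $\{1, 8, 32, 64\}$ configuration):

```python
import numpy as np

def multi_resolution_heads(h_t, horizons=(1, 8, 32, 64), seed=0):
    """One linear head per horizon p_j: X_hat_{t+1:t+p_j} = W_{p_j} @ h_t,
    with W_{p_j} in R^{p_j x D}. Weights are random here purely for
    illustration; in the model they are optimized jointly during training.
    """
    D = h_t.shape[0]                                # hidden size of h^L_t
    rng = np.random.default_rng(seed)
    return {p: (rng.standard_normal((p, D)) * 0.02) @ h_t for p in horizons}
```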

During inference, we apply a greedy scheduling algorithm for arbitrary target output lengths $H$, as outlined in Algorithm `\ref{alg:scheduling}`{=latex}. For each forecast operation in the auto-regressive process, we select a projection $p_j$ with the closest forecasting horizon that does not exceed the remaining forecast duration. This approach allows `\method `{=latex}to extend predictions beyond the next immediate time step or fixed horizon, significantly improving both the model's utility and overall forecasting accuracy.

```{=latex}
\begin{algorithm}
\caption{Scheduling for the Multi-resolution Forecasting}
\label{alg:scheduling}
\begin{algorithmic}[1]
\REQUIRE Target output length $H$, forecast horizon of each output projection $\{p_1, p_2, \ldots, p_P\}$ in ascending order
\ENSURE Combined output length $\hat{H} = H$, $p_1 = 1$

\STATE $\hat{H} \gets 0$
\STATE $J \gets \{\}$

\WHILE{$\hat{H} < H$}
    \FOR{$j = P$ \textbf{down to} $1$}
        \IF{$\hat{H} + p_j \leq H$}
            \STATE $\hat{H} \gets \hat{H} + p_j$
            \STATE \textbf{add} $p_j$ \textbf{to} $J$
            \STATE \textbf{break}
        \ENDIF
    \ENDFOR
\ENDWHILE

\RETURN $J$
\end{algorithmic}
\end{algorithm}
```
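
The greedy scheduling procedure above translates directly to Python; at each auto-regressive step it picks the largest projection horizon that fits within the remaining forecast length.

```python
def schedule_horizons(H, horizons=(1, 8, 32, 64)):
    """Greedy schedule: repeatedly choose the largest horizon <= remaining length.

    Assumes `horizons` contains 1 (as required by the algorithm), which
    guarantees termination for any target length H.
    """
    ps = sorted(horizons)
    remaining, plan = H, []
    while remaining > 0:
        for p in reversed(ps):          # from largest horizon to smallest
            if p <= remaining:
                plan.append(p)
                remaining -= p
                break
    return plan
```

For instance, a 96-step forecast is served by one 64-step and one 32-step projection, i.e. two auto-regressive steps instead of 96.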
Processed Data Archive {#sec:pre_data}
======================

Going beyond the previous work [@ansari2024chronos; @woo2024unified; @liutimer], we organized a comprehensive large-scale time series dataset from a vast collection of complex raw data. We utilize the missing value ratio and the invalid observation ratio as metrics to assess the quality of the dataset. These two metrics can effectively identify data issues caused by the instability of data collection and artificially imputed values. The missing value ratio is defined as the proportion of \`nan' and \`inf' values present in the time series. Meanwhile, the invalid observation ratio refers to the maximum proportion of zeros in the first- or second-order differences of the time series. To address these issues and drawing inspiration from the data processing techniques of large language models [@penedo2023refinedweb; @together2023redpajama; @jin2024position], we developed a fine-grained *data-cleaning pipeline* specifically designed for time series data:

#### Missing Value Processing.

In time series data, missing values often appear as \`nan' (not a number) or \`inf' (infinity). While previous studies commonly address this by replacing missing values with the mean, this may distort the original time series pattern. Instead, we employ a method that splits the original sequence into multiple sub-sequences at points where missing values occur, effectively removing those segments while preserving the integrity of the original time series pattern.
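
A sketch of this splitting step (illustrative code, not the authors' pipeline implementation): the series is cut at every non-finite point, and only the contiguous finite segments are retained.

```python
import numpy as np

def split_on_missing(x):
    """Split a series into contiguous sub-sequences at nan/inf positions,
    discarding the missing points while preserving each segment's pattern."""
    bad = ~np.isfinite(x)               # marks nan and +/-inf observations
    segments, start = [], None
    for i, b in enumerate(bad):
        if b:
            if start is not None:       # close the current valid segment
                segments.append(x[start:i])
                start = None
        elif start is None:             # open a new valid segment
            start = i
    if start is not None:
        segments.append(x[start:])
    return segments
```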

#### Invalid Observation Processing.

In some data collection systems, missing values are often filled with $0$ or another constant, leading to sequences with constant values that do not represent valid patterns for the model. To address this, we developed a filtering method that uses a fixed-length window to scan the entire sequence. For each window, we calculate the ratio of first-order and second-order differences, discarding the window if this ratio exceeds a pre-specified threshold (set to 0.2 in our case). The remaining valid continuous window sequences are then concatenated into a single sequence. This process transforms the original sequence into multiple sub-sequences, effectively removing segments with invalid patterns.
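
A sketch of the window-based filter; the 0.2 threshold follows the text, while the window length is an illustrative assumption, as the paper does not specify it.

```python
import numpy as np

def filter_invalid_windows(x, window=128, threshold=0.2):
    """Scan with a fixed-length window; drop windows where the proportion of
    zeros in the first- or second-order differences exceeds `threshold`,
    then concatenate the surviving windows into a single sequence."""
    kept = []
    for s in range(0, len(x) - window + 1, window):
        w = x[s:s + window]
        d1, d2 = np.diff(w), np.diff(w, n=2)
        ratio = max(np.mean(d1 == 0), np.mean(d2 == 0))  # invalid observation ratio
        if ratio <= threshold:
            kept.append(w)
    return np.concatenate(kept) if kept else np.array([])
```

Note that the second-order criterion also flags constant-slope (artificially interpolated) segments, not just constant-value runs.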

Following the processing steps described above, we compiled a high-quality time series dataset named `\dataset`{=latex}, which spans a range of sampling frequencies from seconds to yearly intervals, encompassing a total of 309.09 billion time points. To optimize memory efficiency and loading speed, each dataset is split into multiple binary files, with a metafile providing details such as the start and end positions of each sequence. This setup allows us to load the data using a fixed amount of memory during training, preventing memory shortages. Datasets like `Weatherbench`, `CMIP6`, and `ERA5` are particularly large, often leading to data imbalance and homogenization. To mitigate these issues, we apply down-sampling to these datasets. During training, we utilized approximately 117 billion time points in `\dataset`{=latex}, sampling each batch according to fixed proportions of domains and distributions of observation values.

Below, we outline the key properties of the datasets after processing, including their domain, sampling frequency, number of time series, total number of observations, and data source. We also present the source code of the key components of the data-cleaning pipeline in Algorithm `\ref{alg:data-cleaning}`{=latex}.

```{=latex}
\fontsize{7.9pt}{13pt}\selectfont
```
::: {#tab:time_300b_detail}
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| **Dataset**                                                                               | **Domain** | **Freq.** | **\# Time Series** | **\# Obs.**     | **Source**                            |
+:=========================================================================================:+:==========:+:=========:+:==================:+:===============:+:=====================================:+
| Electricity (15 min)                                                                      | Energy     | 15T       | 347                | 39,708,170      | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Electricity (Weekly)                                                                      | Energy     | W         | 318                | 49,608          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| ERCOT Load                                                                                | Energy     | H         | 152                | 1,238,832       | [@ourownstory2023neuralprophet]       |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Australian Electricity                                                                    | Energy     | 30T       | 5                  | 1,153,584       | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Solar Power                                                                               | Energy     | 4S        | 26                 | 5,248           | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Wind Farms                                                                                | Energy     | T         | 43,246             | 39,705,317      | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| BDG-2 Bear                                                                                | Energy     | H         | 215                | 1,422,320       | [@emami2023buildingsbench]            |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| BDG-2 Fox                                                                                 | Energy     | H         | 179                | 2,285,288       | [@emami2023buildingsbench]            |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| BDG-2 Panther                                                                             | Energy     | H         | 136                | 893,840         | [@emami2023buildingsbench]            |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| BDG-2 Rat                                                                                 | Energy     | H         | 455                | 4,596,080       | [@emami2023buildingsbench]            |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Borealis                                                                                  | Energy     | H         | 17                 | 82,757          | [@emami2023buildingsbench]            |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Buildings900K                                                                             | Energy     | H         | 2,464,188          | 15,124,358,211  | [@emami2023buildingsbench]            |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| BDG-2 Bull                                                                                | Energy     | H         | 464                | 501,832         | [@wang2023proenfo]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| BDG-2 Cockatoo                                                                            | Energy     | H         | 4                  | 17,032          | [@wang2023proenfo]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Covid19 Energy                                                                            | Energy     | H         | 1                  | 31,912          | [@wang2023proenfo]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Elecdemand                                                                                | Energy     | 30T       | 1                  | 17,520          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| GEF12                                                                                     | Energy     | H         | 20                 | 788,280         | [@wang2023proenfo]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| GEF17                                                                                     | Energy     | H         | 8                  | 140,352         | [@wang2023proenfo]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| BDG-2 Hog                                                                                 | Energy     | H         | 152                | 365,304         | [@wang2023proenfo]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| IDEAL                                                                                     | Energy     | H         | 225                | 1,253,088       | [@emami2023buildingsbench]            |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| KDD Cup 2018                                                                              | Energy     | H         | 3,054              | 922,746         | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| KDD Cup 2022                                                                              | Energy     | 10T       | 8,554              | 2,332,874       | [@zhou2022kddcup2022]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| London Smart Meters                                                                       | Energy     | 30T       | 24,132             | 160,041,727     | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| PDB                                                                                       | Energy     | H         | 1                  | 17,520          | [@wang2023proenfo]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Residential Load Power                                                                    | Energy     | T         | 79,508             | 404,832,695     | [@bergmeir2023residential]            |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Residential PV Power                                                                      | Energy     | T         | 248,888            | 184,238,228     | [@bergmeir2023residential]            |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Sceaux                                                                                    | Energy     | H         | 1                  | 34,223          | [@emami2023buildingsbench]            |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| SMART                                                                                     | Energy     | H         | 5                  | 95,709          | [@emami2023buildingsbench]            |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Spanish                                                                                   | Energy     | H         | 1                  | 35,064          | [@wang2023proenfo]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Exchange Rate                                                                             | Finance    | B         | 13                 | 56,096          | [@ansari2024chronos]                  |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| CIF 2016                                                                                  | Finance    | M         | 72                 | 7,108           | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Bitcoin                                                                                   | Finance    | D         | 29                 | 68,927          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| FRED MD                                                                                   | Finance    | M         | 104                | 71,624          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| NN5 Daily                                                                                 | Finance    | D         | 220                | 35,303          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Tourism Monthly                                                                           | Finance    | M         | 359                | 98,867          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Tourism Quarterly                                                                         | Finance    | Q         | 427                | 39,128          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Tourism Yearly                                                                            | Finance    | Y         | 419                | 11,198          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| COVID Deaths                                                                              | Healthcare | D         | 2                  | 364             | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Hospital                                                                                  | Healthcare | M         | 727                | 55,224          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| CDC Fluview ILINet                                                                        | Healthcare | W         | 286                | 220,144         | [@cdc]                                |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| CDC Fluview WHO NREVSS                                                                    | Healthcare | W         | 108                | 56,407          | [@cdc]                                |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Project Tycho                                                                             | Healthcare | W         | 588                | 120,183         | [@van2018tycho]                       |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| US Births                                                                                 | Healthcare | D         | 1                  | 7,275           | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Weatherbench (Hourly)                                                                     | Nature     | H         | 3,984,029          | 74,630,250,518  | [@rasp2020weatherbench]               |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Weatherbench (Daily)                                                                      | Nature     | D         | 301,229            | 3,223,513,345   | [@rasp2020weatherbench]               |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Weatherbench (Weekly)                                                                     | Nature     | W         | 226,533            | 462,956,049     | [@rasp2020weatherbench]               |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Beijing Air Quality                                                                       | Nature     | H         | 4,262              | 2,932,657       | [@chen2019beijingair]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| China Air Quality                                                                         | Nature     | H         | 17,686             | 4,217,605       | [@zheng2015chinaair]                  |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| CMIP6                                                                                     | Nature     | 6H        | 14,327,808         | 104,592,998,400 | [@nguyen2023climatelearn]             |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| ERA5                                                                                      | Nature     | H         | 11,940,789         | 93,768,721,472  | [@nguyen2023climatelearn]             |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Oikolab Weather                                                                           | Nature     | H         | 309                | 615,574         | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Saugeen                                                                                   | Nature     | D         | 38                 | 17,311          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Subseasonal                                                                               | Nature     | D         | 17,604             | 51,968,498      | [@mouatadid2023subseasonalclimateusa] |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Subseasonal Precipitation                                                                 | Nature     | D         | 13,467             | 4,830,284       | [@mouatadid2023subseasonalclimateusa] |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Sunspot                                                                                   | Nature     | D         | 19                 | 45,312          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Temperature Rain                                                                          | Nature     | D         | 13,226             | 3,368,098       | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Weather                                                                                   | Nature     | D         | 9,525              | 26,036,234      | [@ansari2024chronos]                  |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Dominick                                                                                  | Sales      | D         | 3,712              | 759,817         | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Car Parts                                                                                 | Sales      | M         | 16                 | 816             | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Favorita Sales                                                                            | Sales      | D         | 91,513             | 20,371,303      | [@woo2024unified]                     |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Favorita Transactions                                                                     | Sales      | D         | 258                | 81,196          | [@woo2024unified]                     |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Hierarchical Sales                                                                        | Sales      | D         | 215                | 114,372         | [@mancuso2021hierarchical]            |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Restaurant                                                                                | Sales      | D         | 155                | 30,289          | [@woo2024unified]                     |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M5                                                                                        | Sales      | D         | 14,341             | 5,011,077       | [@alexander2020gluonts]               |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Mexico City Bikes                                                                         | Transport  | H         | 556                | 78,848          | [@ansari2024chronos]                  |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Traffic                                                                                   | Transport  | H         | 1,371              | 14,993,544      | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Taxi (Hourly)                                                                             | Transport  | H         | 2,433              | 1,762,024       | [@ansari2024chronos]                  |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Beijing Subway                                                                            | Transport  | 30T       | 552                | 19,872          | [@wang2023libcity]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Covid Mobility                                                                            | Transport  | D         | 426                | 120,950         | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| HZMetro                                                                                   | Transport  | 15T       | 160                | 11,680          | [@wang2023libcity]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| LargeST                                                                                   | Transport  | 5T        | 1,208,997          | 4,175,062,621   | [@liu2023largest]                     |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Loop Seattle                                                                              | Transport  | 5T        | 1,809              | 33,700,832      | [@wang2023libcity]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Los-Loop                                                                                  | Transport  | 5T        | 3,381              | 6,231,168       | [@wang2023libcity]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Pedestrian Counts                                                                         | Transport  | H         | 80                 | 3,125,914       | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| PEMS Bay                                                                                  | Transport  | 5T        | 3,980              | 15,975,920      | [@wang2023libcity]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| PEMS03                                                                                    | Transport  | 5T        | 1,651              | 9,210,432       | [@wang2023libcity]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| PEMS04                                                                                    | Transport  | 5T        | 6,634              | 14,638,784      | [@wang2023libcity]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| PEMS07                                                                                    | Transport  | 5T        | 3,828              | 23,789,760      | [@wang2023libcity]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| PEMS08                                                                                    | Transport  | 5T        | 2,612              | 8,684,480       | [@wang2023libcity]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Q-Traffic                                                                                 | Transport  | 15T       | 46,990             | 257,200,384     | [@wang2023libcity]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| SHMetro                                                                                   | Transport  | 15T       | 574                | 41,902          | [@wang2023libcity]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| SZ-Taxi                                                                                   | Transport  | 15T       | 156                | 464,256         | [@wang2023libcity]                    |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Rideshare                                                                                 | Transport  | H         | 1,352              | 192,949         | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Taxi                                                                                      | Transport  | 30T       | 96,758             | 40,584,636      | [@alexander2020gluonts]               |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Traffic Hourly                                                                            | Transport  | H         | 1,363              | 14,858,016      | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Traffic Weekly                                                                            | Transport  | W         | 821                | 78,816          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Uber TLC Daily                                                                            | Transport  | D         | 235                | 42,533          | [@alexander2020gluonts]               |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Uber TLC Hourly                                                                           | Transport  | H         | 344                | 510,284         | [@alexander2020gluonts]               |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Vehicle Trips                                                                             | Transport  | D         | 10                 | 1,626           | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Wiki Daily (100k)                                                                         | Web        | D         | 100,001            | 274,099,872     | [@ansari2024chronos]                  |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Alibaba Cluster Trace 2018                                                                | Web        | 5T        | 48,640             | 83,776,950      | [@woo2023pushing]                     |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Azure VM Traces 2017                                                                      | Web        | 5T        | 263,928            | 880,648,165     | [@woo2023pushing]                     |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Borg Cluster Data 2011                                                                    | Web        | 5T        | 216,636            | 176,650,715     | [@woo2023pushing]                     |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Kaggle Web Traffic Weekly                                                                 | Web        | W         | 133,388            | 15,206,232      | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Extended Web Traffic                                                                      | Web        | D         | 161,890            | 332,586,145     | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| Wiki-Rolling                                                                              | Web        | D         | 47,675             | 40,619,100      | [@alexander2020gluonts]               |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| TSMixup 10M                                                                               | Synthetic  | \-        | 10,968,625         | 8,198,358,952   | [@ansari2024chronos]                  |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| KernelSynth 1M                                                                            | Synthetic  | \-        | 1,000,000          | 1,024,000,000   | [@ansari2024chronos]                  |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M1 Monthly                                                                                | Other      | M         | 8                  | 1,047           | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M1 Quarterly                                                                              | Other      | 3M        | 195                | 9,628           | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M1 Yearly                                                                                 | Other      | Y         | 106                | 3,136           | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M3 Monthly                                                                                | Other      | M         | 799                | 109,538         | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M3 Quarterly                                                                              | Other      | 3M        | 755                | 36,960          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M3 Yearly                                                                                 | Other      | Y         | 645                | 18,319          | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M4 Daily                                                                                  | Other      | D         | 4,134              | 9,903,554       | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M4 Hourly                                                                                 | Other      | H         | 415                | 352,988         | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M4 Monthly                                                                                | Other      | M         | 30,126             | 8,480,953       | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M4 Quarterly                                                                              | Other      | 3M        | 2,623              | 491,632         | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M4 Weekly                                                                                 | Other      | W         | 293                | 348,224         | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+
| M4 Yearly                                                                                 | Other      | Y         | 106                | 3,136           | [@godahewa2021monash]                 |
+-------------------------------------------------------------------------------------------+------------+-----------+--------------------+-----------------+---------------------------------------+

: Datasets and key properties from `\dataset`{=latex}. For frequency: S = second, T = minute, H = hour, D = day, B = business day, W = week, M = month, Q = quarter, Y = year.
:::

```{=latex}
\begin{algorithm}[H]\caption{Code Snippet of Data-cleaning Pipeline}
\label{alg:data-cleaning}

    \definecolor{codeblue}{rgb}{0.25,0.5,0.5}
    \lstset{
        backgroundcolor=\color{white},
        basicstyle=\fontsize{7.2pt}{7.2pt}\ttfamily\selectfont,
        columns=fullflexible,
        breaklines=true,
        captionpos=b,
        commentstyle=\fontsize{7.2pt}{7.2pt}\color{codeblue},
        keywordstyle=\fontsize{7.2pt}{7.2pt},
         frame=none,
    }
  
    \begin{lstlisting}[language=python]
import numpy as np

# Missing Value Processing
def split_seq_by_nan_inf(seq, minimum_seq_length: int = 1):
    output = []
    sublist = []
    for num in seq:
        if num is None or np.isnan(num) or np.isinf(num):
            if len(sublist) >= minimum_seq_length:
                output.append(sublist)
            sublist = []
        else:
            sublist.append(num)
    if len(sublist) >= minimum_seq_length:
        output.append(sublist)
    return output

# Invalid Observation Processing
def split_seq_by_window_quality(seq, zero_threshold: float, window_size: int = 128, minimum_seq_length: int = 256):
    if len(seq) <= window_size:
        flag, info = check_sequence(seq, zero_threshold=zero_threshold)
        if flag:
            return [seq]
        else:
            return []

    i = window_size
    sub_seq = []
    out_list = []
    while True:
        if i + window_size > len(seq):
            window_seq = seq[i - window_size: len(seq)]
            i = len(seq)
        else:
            window_seq = seq[i - window_size: i]
        flag, info = check_sequence(window_seq, zero_threshold=zero_threshold)
        if flag:
            sub_seq.extend(window_seq)
        else:
            if len(sub_seq) >= minimum_seq_length:
                out_list.append(sub_seq)
            sub_seq = []
        if i >= len(seq):
            break
        i += window_size

    if len(sub_seq) >= minimum_seq_length:
        out_list.append(sub_seq)

    return out_list

def check_sequence(seq, zero_threshold: float):
    if not isinstance(seq, np.ndarray):
        seq = np.array(seq)

    if len(seq.shape) > 1:
        raise RuntimeError(f'Dimension of the seq is not equal to 1: {seq.shape}')

    flag = True
    info = {}

    nan_count = np.sum(np.isnan(seq))
    info['nan_count'] = nan_count
    if nan_count > 0:
        flag = False
        return flag, info

    inf_count = np.sum(np.isinf(seq))
    info['inf_count'] = inf_count
    if inf_count > 0:
        flag = False
        return flag, info

    zero_ratio = np.sum(seq == 0) / len(seq)
    info['zero_ratio'] = zero_ratio
    if zero_ratio > zero_threshold:
        flag = False

    first_diff = seq[1:] - seq[:-1]
    first_diff_zero_ratio = np.sum(first_diff == 0) / len(first_diff)

    info['first_diff_zero_ratio'] = first_diff_zero_ratio
    if first_diff_zero_ratio > zero_threshold:
        flag = False

    second_diff = seq[2:] - seq[:-2]
    second_diff_zero_ratio = np.sum(second_diff == 0) / len(second_diff)

    info['second_diff_zero_ratio'] = second_diff_zero_ratio
    if second_diff_zero_ratio > zero_threshold:
        flag = False

    return flag, info
    \end{lstlisting}
\end{algorithm}
```
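
To illustrate the first stage of the pipeline above (splitting a raw series on missing or non-finite values), here is a minimal, self-contained re-implementation; the function name `split_by_invalid` and the example values are ours, for illustration only.

```python
import numpy as np

def split_by_invalid(seq, min_len=1):
    """Stage 1 of the cleaning pipeline: break the series at every
    NaN/inf/None and keep only segments of at least min_len points."""
    out, cur = [], []
    for x in seq:
        if x is None or np.isnan(x) or np.isinf(x):
            if len(cur) >= min_len:
                out.append(cur)
            cur = []
        else:
            cur.append(x)
    if len(cur) >= min_len:
        out.append(cur)
    return out

# A single NaN splits the series into two independent, fully valid segments.
segments = split_by_invalid([1.0, 2.0, float("nan"), 3.0, 4.0, 5.0])
print(segments)  # [[1.0, 2.0], [3.0, 4.0, 5.0]]
```

Each resulting segment can then be fed to the window-quality stage (`split_seq_by_window_quality`) for constancy and zero-ratio filtering.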
```{=latex}
\newpage
```
Additional Results {#sec:analysis_full}
==================

Ablation Study {#sec:ablation_full}
--------------

```{=latex}
\centering
```
```{=latex}
\resizebox{\columnwidth}{!}{
        \begin{tabular}{l|cc|cc|cc|cc|cc|cc}
        \toprule
        & \multicolumn{2}{c}{ETTh1} & \multicolumn{2}{c}{ETTh2} & \multicolumn{2}{c}{ETTm1} & \multicolumn{2}{c}{ETTm2} & \multicolumn{2}{c}{Weather} & \multicolumn{2}{c}{Global Temp} \\
        & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} \\
        \midrule
           \rowcolor{tabhighlight}
            \basemodel &
            \textbf{0.357} & 0.381 &
            \textbf{0.305} & \textbf{0.359} &
            \textbf{0.338} & \textbf{0.368} &
            \textbf{0.201} & \textbf{0.291} &
            \textbf{0.160} & \textbf{0.214} &
            \textbf{0.211} & \textbf{0.343} \\
            \hspace{2em} w/o Huber loss &
            0.365 & 0.383 &
            0.309 & 0.366 &
            0.344 & 0.369 &
            0.205 & 0.295 &
            0.163 & 0.221 &
            0.217 & 0.359 \\
            \hspace{2em} w/o multi-resolution layer &
            0.358 & \textbf{0.379} &
            0.313 & 0.362 &
            0.348 & 0.377 &
            0.212 & 0.301 &
            0.164 & 0.219 &
            0.217 & 0.354 \\
            \hspace{2em} w/o mixture-of-experts &
            0.370 & 0.398 &
            0.317 & 0.372 &
            0.347 & 0.373 &
            0.212 & 0.298 &
            0.163 & 0.218 &
            0.223 & 0.357 \\
            \hspace{2em} w/o auxiliary loss &
            0.368 & 0.394 &
            0.325 & 0.387 &
            0.350 & 0.377 &
            0.219 & 0.304 &
            0.164 & 0.220 &
            0.226 & 0.363 \\
        \bottomrule
    \end{tabular}
   }
```
As shown in Table `\ref{tab:ablation_study_full}`{=latex}, replacing the MoE layers with standard FFNs (denoted as "w/o mixture-of-experts") led to a noticeable performance decline, with the average MSE worsening from $0.262$ to $0.272$. This highlights the significant contribution of the sparse architecture to the model's overall performance, as its dynamic routing enables more specialized processing of diverse input patterns.
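
For intuition about what this ablation removes, the sparse routing can be sketched in a few lines of numpy: each token is dispatched to only its top-$k$ experts, so compute scales with $k$ rather than with the total number of experts. This is a minimal illustrative sketch, not the model's actual implementation; all names and sizes here are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, gate_w, experts, k=2):
    """Route each token to its top-k experts; only those k experts run."""
    logits = x @ gate_w                        # (tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :k]  # indices of top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        weights = softmax(logits[t, top[t]])   # renormalized gate weights
        for w, e in zip(weights, top[t]):
            out[t] += w * experts[e](x[t])     # only k of n_experts execute
    return out

d, n_experts = 4, 8
experts = [
    (lambda W: (lambda v: np.tanh(v @ W)))(rng.normal(size=(d, d)))
    for _ in range(n_experts)
]
gate_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=(3, d))
y = moe_forward(x, gate_w, experts, k=2)
print(y.shape)  # (3, 4)
```

With $k=2$ active out of 8 experts, the per-token FFN cost is roughly a quarter of running all experts, while the full parameter pool remains available for specialization.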

We also conducted experiments by retaining only the horizon-32 forecasting head from the `\basemodel `{=latex}(denoted as "w/o multi-resolution layer"), excluding the multi-task optimization. The performance of this modified model was slightly inferior to the complete `\basemodel`{=latex}.
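
The ablation table also trains a variant without the Huber loss. For reference, Huber loss is quadratic for small residuals and linear for large ones, which damps the influence of outlying time points during pre-training; a standard formulation (not taken from the paper's code) is:

```python
import numpy as np

def huber(pred, target, delta=1.0):
    """Quadratic near zero, linear in the tails: robust to outliers."""
    err = np.abs(pred - target)
    quad = 0.5 * err ** 2
    lin = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quad, lin)

# err = 0.5 -> 0.5 * 0.5^2 = 0.125 (quadratic regime)
# err = 10  -> 1.0 * (10 - 0.5) = 9.5 (linear regime, vs. 50.0 for MSE)
vals = huber(np.array([0.5, 10.0]), np.array([0.0, 0.0]), delta=1.0)
```

The linear tail is what limits the gradient contribution of anomalous observations, consistent with the small but consistent degradation seen when the Huber objective is removed.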

```{=latex}
\centering
```
```{=latex}
\resizebox{\columnwidth}{!}{
    \renewcommand{\tabcolsep}{3pt}
    \begin{tabular}{lcccccccc}
        \toprule
            &
            ETTh1 &
            ETTh2 &
            ETTm1 &
            ETTm2 &
            Weather &
            Global Temp &
            Average MSE &
            Inference Speed \\
        \midrule
            \rowcolor{tabhighlight}
            \basemodel w/ \{1,8,32,64\} &
            0.357 &
            \textbf{0.305} &
            \textbf{0.338} &
            \textbf{0.201} &
            \textbf{0.160} &
            \textbf{0.211} &
            \textbf{0.262} &
            \textbf{0.095 \text{s/iter}} \\
            \basemodel w/ \{1,8,32\} &
            \textbf{0.353} &
            0.316 &
            0.370 &
            0.225 &
            0.161 &
            0.213 &
            0.273 &
            0.130 \text{s/iter} \\
            \basemodel w/ \{1,8\} &
            0.389 &
            0.391 &
            0.441 &
            0.304 &
            0.174 &
            0.222 &
            0.320 &
            0.411 \text{s/iter} \\
            \basemodel w/ \{1\} &
            1.071 &
            0.920 &
            2.098 &
            2.320 &
            1.500 &
            0.383 &
            1.382 &
            2.834 \text{s/iter} \\
        \bottomrule
    \end{tabular}
    }
```
As shown in Table `\ref{tab:ablation_resolution_full}`{=latex}, the default configuration of four multi-resolution forecasting heads with receptive horizons of $\{1, 8, 32, 64\}$ delivers both the best predictive accuracy and the fastest inference. Reducing the number of heads consistently degraded performance and lengthened inference time, since a smaller maximum horizon forces more auto-regressive steps to cover the same forecast. This trade-off highlights the effectiveness of our multi-resolution forecasting design, which balances accuracy and computational efficiency in a decoder-only forecasting foundation model.
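
The link between head count and inference time follows directly from the auto-regressive decoding loop: with a largest head of 64, a 96-step forecast needs only two forward passes, whereas a horizon-1 head needs 96. A hypothetical greedy scheduler (our simplification; the actual decoding policy may differ) makes this concrete:

```python
def decode_steps(horizon, head_sizes):
    """Count auto-regressive forward passes needed to cover `horizon`,
    greedily emitting the largest head that still fits at each step."""
    sizes = sorted(head_sizes, reverse=True)
    steps, remaining = 0, horizon
    while remaining > 0:
        step = next((s for s in sizes if s <= remaining), min(sizes))
        remaining -= step
        steps += 1
    return steps

for heads in [{1, 8, 32, 64}, {1, 8, 32}, {1, 8}, {1}]:
    print(sorted(heads), decode_steps(96, heads))
```

The step counts (2, 3, 12, and 96 passes for a horizon of 96) mirror the monotonic slowdown from 0.095 to 2.834 s/iter observed in the ablation as heads are removed.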

These findings highlight the importance of key architectural components in `\method`{=latex}, such as the mixture-of-experts, multi-task optimization, and multi-resolution forecasting, in delivering state-of-the-art performance in universal time series forecasting.

Training Precision Analysis {#sec:precision_full}
---------------------------

To optimize model performance and efficiency, we conducted a comparative study examining the impact of numerical precision during training. We trained two versions of our model under identical configurations, with the only difference being the precision: one using bfloat16 and the other using float32. The model trained with float32 precision is referred to as `\basemodel `{=latex}w/ FP32.

```{=latex}
\centering
```
```{=latex}
\resizebox{\columnwidth}{!}{
    \renewcommand{\tabcolsep}{2.3pt}
    \begin{tabular}{lccccccccccc}
        \toprule
            &
            ETTh1 &
            ETTh2 &
            ETTm1 &
            ETTm2 &
            Weather &
            Global Temp &
            Average MSE &
            Training Speed &
            Inference Speed &
            Training Memory &
            Inference Memory \\ 
        \midrule
           \rowcolor{tabhighlight}
            \basemodel &
            0.357 &
            0.305 &
            0.338 &
            0.201 &
            0.160 &
            0.211 &
            0.262 &
            0.84 \text{s/iter} &
            0.095 \text{s/iter} &
            1.77 \text{GB} &
            226.70 \text{MB} \\
            \basemodel w/o FA &
            0.357 &
            0.305 &
            0.338 &
            0.201 &
            0.160 &
            0.211 &
            0.262 &
            1.09 \text{s/iter} &
            0.118 \text{s/iter} &
            1.77 \text{GB} &
            226.70 \text{MB} \\
            \basemodel w/ FP32 &
            0.358 &
            0.303 &
            0.342 &
            0.198 &
            0.158 &
            0.208 &
            0.261 &
            1.24 \text{s/iter} &
            0.133 \text{s/iter} &
            2.21 \text{GB} &
            453.41 \text{MB} \\
        \bottomrule
    \end{tabular}
    }
```
As detailed in Table `\ref{tab:precision_comp}`{=latex}, our analysis reveals that the forecasting performances of these two models are remarkably comparable. This finding is significant as it demonstrates that the use of reduced precision (e.g., bfloat16) does not compromise the predictive capabilities of our model.

However, the similarities in performance belie the substantial differences in computational efficiency and resource utilization:

-   **Training Speed:** Notably, the bfloat16 model demonstrates a **12%** improvement in training speed compared to its float32 counterpart. This considerable acceleration in the training process can significantly reduce the time-to-deployment for large-scale models and facilitate more rapid experimentation and iteration.

-   **Memory Consumption:** In terms of memory usage, the bfloat16 model exhibits superior efficiency, consuming substantially less memory than the float32 model. Specifically, we observed a reduction of **20%** in memory usage. This memory optimization is crucial for scaling models to larger sizes or deploying them on memory-constrained hardware.

-   **Compatibility with Advanced Techniques:** A key advantage of the bfloat16 model is its seamless integration with advanced optimization techniques. In particular, it can easily be combined with flash-attention [@dao2023flashattention2], a state-of-the-art attention mechanism designed for better efficiency. This integration results in an additional **23%** increase in training speed and a **19%** boost in inference speed, further enhancing the already significant performance gains.
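
As a back-of-the-envelope check on these memory numbers, parameter storage alone halves when moving from 4-byte float32 to 2-byte bfloat16; the observed end-to-end reduction is smaller (about 20%) because activations, buffers, and optimizer state are not all held in reduced precision. A rough estimator (ours, for illustration):

```python
def param_memory_gb(n_params, bytes_per_param):
    """Raw parameter storage only; gradients, optimizer states, and
    activations add further memory on top of this in practice."""
    return n_params * bytes_per_param / 1024 ** 3

n = 2_400_000_000  # 2.4B parameters, the largest model in the paper
for name, width in [("float32", 4), ("bfloat16", 2)]:
    print(f"{name}: {param_memory_gb(n, width):.2f} GB")
```

At 2.4B parameters, the weights alone drop from roughly 8.9 GB in float32 to about 4.5 GB in bfloat16, which is what makes the larger model sizes reported in this work trainable on the same hardware.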

The implications of these findings are far-reaching:

-   **Resource Efficiency:** The reduced memory footprint and increased training speed of the bfloat16 model translate to more efficient utilization of computational resources, potentially lowering infrastructure costs and energy consumption.

-   **Scalability:** The memory savings offered by bfloat16 precision enable the training of larger, more complex models on the same hardware, potentially leading to improved model capabilities without increasing computational requirements.

-   **Faster Development Cycles:** The substantial improvements in training speed can accelerate the research and development process, allowing for more rapid prototyping and experimentation.

-   **Inference Optimization:** The compatibility with flash-attention not only benefits training but also enhances inference speed, which is crucial for real-time applications and large-scale deployments.

Our experiments show that adopting bfloat16 precision, combined with advanced techniques like flash-attention, provides a compelling balance between model performance, computational efficiency, and resource utilization. These optimizations enable the scalable and efficient deployment of large-scale time series forecasting models without sacrificing predictive accuracy.

Additional Experimental Results {#app:additional_results}
-------------------------------

### TaxiBJ Dataset

We include an additional benchmark dataset, TaxiBJ [@zhang2017deep], for short-term forecasting evaluation. The original dataset comprises taxicab GPS data and meteorological information collected in Beijing over four distinct intervals: July 1, 2013 - October 30, 2013; March 1, 2014 - June 30, 2014; March 1, 2015 - June 30, 2015; and November 1, 2015 - April 10, 2016. We selected the in-flow data from November 1, 2015, to April 10, 2016, as our benchmark, which consists of 1,024 time series derived from $32 \times 32$ grid cells.

We conducted evaluations on all zero-shot models using this benchmark, and set the context length to $512$ for all baselines. The results are summarized in Table `\ref{tab:zero_shot_short}`{=latex}.

```{=latex}
\centering
```
```{=latex}
\resizebox{\columnwidth}{!}{
    \renewcommand{\tabcolsep}{3pt}
    \begin{tabular}{cr|cc|cc|cc|cc|cc|cc|cc|cc|cc|cc|cc}
          \toprule
          \multicolumn{2}{c}{\multirow{3}{*}{\textbf{\scalebox{1.2}{Models}}}} &   \multicolumn{6}{c}{\emoji{figures/timemoe-logo.png} \textbf{\method (Ours)}}& \multicolumn{16}{c}{\emoji{figures/zero-shot.png} \textbf{Zero-shot Time Series Models}} \\
          \cmidrule(lr){3-8} \cmidrule(lr){9-24}
          &       & \multicolumn{2}{c}{\textbf{{\basemodel}}} & \multicolumn{2}{c}{\textbf{{\largemodel}}} & \multicolumn{2}{c}{\textbf{{\ultramodel}}} & \multicolumn{2}{c}{\textbf{Moirai$_{small}$}} & \multicolumn{2}{c}{\textbf{Moirai$_{base}$}} & \multicolumn{2}{c}{\textbf{Moirai$_{large}$}} & \multicolumn{2}{c}{\textbf{TimesFM}} & \multicolumn{2}{c}{\textbf{Moment}} & \multicolumn{2}{c}{\textbf{Chronos$_{small}$}} & \multicolumn{2}{c}{\textbf{Chronos$_{base}$}} & \multicolumn{2}{c}{\textbf{Chronos$_{large}$}} \\
          \cmidrule(lr){3-4} \cmidrule(lr){5-6}\cmidrule(lr){7-8} \cmidrule(lr){9-10}\cmidrule(lr){11-12}\cmidrule(lr){13-14}\cmidrule(lr){15-16}\cmidrule(lr){17-18}\cmidrule(lr){19-20}\cmidrule(lr){21-22}\cmidrule(lr){23-24}
          
          \multicolumn{2}{c}{\scalebox{1.2}{\textbf{Metrics}}}& \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE}  \\
          \midrule    \multirow{4}[1]{*}{TaxiBJ} & 1    & \secondres{0.214} & 0.294 & 0.214 & \boldres{0.292} & \boldres{0.214} & \secondres{0.294} & 0.334 & 0.373 & 0.282 & 0.334 & 0.267 & 0.323 & 0.247 & 0.316 & 0.866 & 0.751 & 0.250 & 0.315 & 0.255 & 0.316 & 0.250 & 0.303 \\
          & 8   & 0.302 & 0.363 & \boldres{0.297} & 0.356 & \secondres{0.302} & 0.362 & 0.487 & 0.470 & 0.427 & 0.422 & 0.431 & 0.425 & 0.393 & 0.430 & 0.883 & 0.759 & 0.341 & 0.380 & 0.311 & \secondres{0.352} & 0.310 & \boldres{0.351} \\
          & 24   & 0.385 & 0.419 & \boldres{0.376} & \boldres{0.410} & \secondres{0.385} & \secondres{0.417} & 0.610 & 0.529 & 0.530 & 0.477 & 0.548 & 0.488 & 0.494 & 0.495 & 0.894 & 0.764 & 0.438 & 0.440 & 0.427 & 0.420 & 0.431 & 0.418 \\
          & 48   & 0.423 & 0.448 & \boldres{0.414} & \boldres{0.440} & \secondres{0.422} & \secondres{0.444} & 0.626 & 0.542 & 0.559 & 0.497 & 0.563 & 0.500 & 0.524 & 0.515 & 0.892 & 0.765 & 0.502 & 0.478 & 0.475 & 0.450 & 0.494 & 0.460 \\
    \midrule
    \end{tabular}%
  }
```
`\label{tab:zero_shot_short}`{=latex} `\vspace{-2mm}`{=latex}

The results indicate that our models consistently outperform other baselines in short-term forecasting on the TaxiBJ dataset.

### Comparison to Timer, TFT, and N-BEATS

In this section, we incorporate additional baseline models for a more comprehensive evaluation. Specifically, Timer [-@liutimer] is included for zero-shot forecasting (Table `\ref{tab:zero_shot_addition}`{=latex}), while TFT [-@lim2021tft] and N-BEATS [-@oreshkinn] are included for in-domain forecasting (Table `\ref{tab:in_addition}`{=latex}). The results indicate that our models consistently demonstrate improved performance relative to these established approaches.

```{=latex}
\centering
```
```{=latex}
\resizebox{0.6\columnwidth}{!}{
    \renewcommand{\tabcolsep}{3pt}
    \begin{tabular}{cr|cc|cc|cc|cc|cc|cc}
          \toprule
          \multicolumn{2}{c}{\multirow{3}{*}{\textbf{\scalebox{1.2}{Models}}}} & \multicolumn{6}{c}{\emoji{figures/timemoe-logo.png} \textbf{\method (Ours)}} & \multicolumn{6}{c}{\textbf{Timer}} \\
          \cmidrule(lr){3-8} \cmidrule(lr){9-14}
          & & \multicolumn{2}{c}{\textbf{\(\basemodel\)}} & \multicolumn{2}{c}{\textbf{\(\largemodel\)}} & \multicolumn{2}{c}{\textbf{\(\ultramodel\)}} & \multicolumn{2}{c}{\textbf{1B}} & \multicolumn{2}{c}{\textbf{16B}} & \multicolumn{2}{c}{\textbf{28B}} \\
          \cmidrule(lr){3-4} \cmidrule(lr){5-6} \cmidrule(lr){7-8} \cmidrule(lr){9-10} \cmidrule(lr){11-12} \cmidrule(lr){13-14}
          \multicolumn{2}{c}{\scalebox{1.2}{\textbf{Metrics}}} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE} & \textbf{MSE} & \textbf{MAE}  \\
          \midrule
          \multirow{4}{*}{ETTh1} & 96  & 0.357 & \secondres{0.381} & \secondres{0.350} & 0.382 & \boldres{0.349} & \boldres{0.379} & 0.438 & 0.425 & 0.364 & 0.388 & 0.393 & 0.421 \\
          & 192 & \boldres{0.384} & \boldres{0.404} & \secondres{0.388} & 0.412 & 0.395 & 0.413 & 0.509 & 0.459 & 0.401 & \secondres{0.410} & 0.434 & 0.447 \\
          & 336 & \boldres{0.411} & 0.434 & \boldres{0.411} & \secondres{0.430} & 0.447 & 0.453 & 0.554 & 0.482 & \secondres{0.423} & \boldres{0.422} & 0.460 & 0.464 \\
          & 720 & 0.449 & 0.477 & \boldres{0.427} & \secondres{0.455} & 0.457 & 0.462 & 0.706 & 0.544 & \secondres{0.436} & \boldres{0.444} & 0.487 & 0.494 \\
          \rowcolor{tabhighlight}
          & \textbf{Avg.} & \secondres{0.400} & 0.424 & \boldres{0.394} & \secondres{0.419} & 0.412 & 0.426 & 0.552 & 0.478 & 0.406 & \boldres{0.416} & 0.444 & 0.456 \\
          \midrule
          \multirow{4}{*}{ETTh2} & 96  & 0.305 & 0.359 & 0.302 & 0.354 & \boldres{0.292} & \secondres{0.352} & 0.315 & 0.351 & \secondres{0.294} & \boldres{0.350} & 0.308 & 0.369 \\
          & 192 & 0.351 & 0.386 & 0.364 & \secondres{0.385} & \boldres{0.347} & \boldres{0.379} & 0.393 & 0.402 & 0.353 & \secondres{0.385} & \secondres{0.348} & 0.398 \\
          & 336 & 0.391 & 0.418 & 0.417 & 0.425 & 0.406 & 0.419 & 0.412 & 0.422 & \secondres{0.376} & \boldres{0.400} & \boldres{0.366} & \secondres{0.414} \\
          & 720 & 0.419 & 0.454 & 0.537 & 0.496 & 0.439 & 0.447 & 0.425 & \secondres{0.440} & \boldres{0.393} & \boldres{0.420} & \secondres{0.409} & 0.446 \\
          \rowcolor{tabhighlight}
          & \textbf{Avg.} & 0.366 & 0.404 & 0.405 & 0.415 & 0.371 & \secondres{0.399} & 0.386 & 0.404 & \boldres{0.354} & \boldres{0.389} & \secondres{0.358} & 0.407 \\
          \midrule
          \multirow{4}{*}{ETTm1} & 96  & 0.338 & 0.368 & \secondres{0.309} & \secondres{0.357} & \boldres{0.281} & \boldres{0.341} & 0.690 & 0.526 & 0.766 & 0.549 & 0.420 & 0.418 \\
          & 192 & 0.353 & 0.388 & \secondres{0.346} & \secondres{0.381} & \boldres{0.305} & \boldres{0.358} & 0.757 & 0.560 & 0.755 & 0.553 & 0.467 & 0.445 \\
          & 336 & 0.381 & 0.413 & \secondres{0.373} & \secondres{0.408} & \boldres{0.369} & \boldres{0.395} & 0.832 & 0.594 & 0.765 & 0.561 & 0.502 & 0.467 \\
          & 720 & 0.504 & 0.493 & \secondres{0.475} & \secondres{0.477} & \boldres{0.469} & \boldres{0.472} & 0.883 & 0.627 & 0.752 & 0.565 & 0.558 & 0.499 \\
          \rowcolor{tabhighlight}
          & \textbf{Avg.} & 0.394 & 0.415 & \secondres{0.376} & \secondres{0.405} & \boldres{0.356} & \boldres{0.391} & 0.791 & 0.577 & 0.760 & 0.557 & 0.487 & 0.457 \\
          \midrule
          \multirow{4}{*}{ETTm2} & 96  & 0.201 & 0.291 & \boldres{0.197} & \boldres{0.286} & \secondres{0.198} & \secondres{0.288} & 0.213 & 0.295 & 0.234 & 0.312 & 0.247 & 0.324 \\
          & 192 & 0.258 & 0.334 & \secondres{0.250} & \secondres{0.322} & \boldres{0.235} & \boldres{0.312} & 0.283 & 0.339 & 0.287 & 0.343 & 0.294 & 0.358 \\
          & 336 & \secondres{0.324} & \secondres{0.373} & 0.337 & 0.375 & \boldres{0.293} & \boldres{0.348} & 0.346 & 0.377 & 0.340 & \secondres{0.373} & 0.335 & 0.385 \\
          & 720 & 0.488 & 0.464 & 0.480 & 0.461 & 0.427 & 0.428 & \secondres{0.424} & \secondres{0.424} & 0.437 & 0.426 & \boldres{0.386} & \boldres{0.418} \\
          \rowcolor{tabhighlight}
          & \textbf{Avg.} & 0.317 & 0.365 & \secondres{0.316} & \secondres{0.361} & \boldres{0.288} & \boldres{0.344} & 0.317 & 0.359 & 0.324 & 0.364 & \secondres{0.316} & 0.371 \\
          \midrule
          \multirow{4}{*}{Weather} & 96  & 0.160 & 0.214 & \secondres{0.159} & \secondres{0.213} & \boldres{0.157} & \boldres{0.211} & 0.181 & 0.232 & 0.203 & 0.255 & 0.243 & 0.283 \\
          & 192 & \secondres{0.210} & \secondres{0.260} & 0.215 & 0.266 & \boldres{0.208} & \boldres{0.256} & 0.234 & 0.284 & 0.254 & 0.296 & 0.288 & 0.320 \\
          & 336 & \secondres{0.274} & \secondres{0.309} & 0.291 & 0.322 & \boldres{0.255} & \boldres{0.290} & 0.297 & 0.332 & 0.313 & 0.336 & 0.323 & 0.345 \\
          & 720 & 0.418 & 0.405 & 0.415 & 0.400 & 0.405 & 0.397 & \secondres{0.364} & \secondres{0.380} & 0.408 & 0.395 & \boldres{0.362} & \boldres{0.374} \\
          \rowcolor{tabhighlight}
          & \textbf{Avg.} & \secondres{0.265} & \secondres{0.297} & 0.270 & 0.300 & \boldres{0.256} & \boldres{0.288} & 0.269 & 0.307 & 0.294 & 0.321 & 0.304 & 0.331 \\
          \midrule
          \multirow{4}{*}{Global Temp} & 96  & \secondres{0.211} & \secondres{0.343} & \boldres{0.210} & \boldres{0.342} & 0.214 & 0.345 & 0.250 & 0.373 & 0.245 & 0.372 & 0.308 & 0.425 \\
          & 192 & 0.257 & 0.386 & \secondres{0.254} & \secondres{0.385} & \boldres{0.246} & \boldres{0.379} & 0.299 & 0.415 & 0.300 & 0.418 & 0.359 & 0.465 \\
          & 336 & 0.281 & 0.405 & \secondres{0.267} & \boldres{0.395} & \boldres{0.266} & \secondres{0.398} & 0.347 & 0.451 & 0.365 & 0.466 & 0.415 & 0.507 \\
          & 720 & 0.354 & 0.465 & \secondres{0.289} & \boldres{0.420} & \boldres{0.288} & \secondres{0.421} & 0.452 & 0.521 & 0.542 & 0.585 & 0.579 & 0.617 \\
          \rowcolor{tabhighlight}
          & \textbf{Avg.} & 0.275 & \secondres{0.400} & \secondres{0.255} & \boldres{0.385} & \boldres{0.253} & \boldres{0.385} & 0.337 & 0.440 & 0.363 & 0.460 & 0.415 & 0.504 \\
          \midrule
          \rowcolor{blue!15}
          \multicolumn{2}{c|}{\scalebox{1.1}{\textbf{Average}}} & \secondres{0.336} & 0.384 & \secondres{0.336} & \secondres{0.380} & \boldres{0.322} & \boldres{0.372} & 0.394 & 0.406 & 0.378 & 0.393 & 0.344 & 0.391 \\
          \bottomrule
    \end{tabular}%
  }
```
`\label{tab:zero_shot_addition}`{=latex}

```{=latex}
\vspace{-2mm}
```
```{=latex}
\centering
```
```{=latex}
\resizebox{0.6\columnwidth}{!}{
    %\renewcommand{\arraystretch}{1.1}
    \renewcommand{\tabcolsep}{3pt}
    \begin{tabular}{cr|cc|cc|cc|cc|cc}
          \toprule
          \multicolumn{2}{c}{\multirow{3}{*}{\scalebox{1.2}{\textbf{Models}}}} 
          & \multicolumn{6}{c}{\emoji{figures/timemoe-logo.png} \textbf{\method (Ours)}}
          & \multicolumn{4}{c}{\emoji{figures/full-shot.png} \textbf{Full-shot Time Series Models}} \\
          \cmidrule(lr){3-8} \cmidrule(lr){9-12}
          & 
          & \multicolumn{2}{c}{\textbf{\(\basemodel\)}}
          & \multicolumn{2}{c}{\textbf{\(\largemodel\)}}
          & \multicolumn{2}{c}{\textbf{\(\ultramodel\)}}
          & \multicolumn{2}{c}{\textbf{TFT}}
          & \multicolumn{2}{c}{\textbf{N-BEATS}} \\
          \cmidrule(lr){3-4} \cmidrule(lr){5-6} \cmidrule(lr){7-8} \cmidrule(lr){9-10} \cmidrule(lr){11-12}
          \multicolumn{2}{c}{\scalebox{1.2}{\textbf{Metrics}}}
          & \textbf{MSE} & \textbf{MAE}
          & \textbf{MSE} & \textbf{MAE}
          & \textbf{MSE} & \textbf{MAE}
          & \textbf{MSE} & \textbf{MAE}
          & \textbf{MSE} & \textbf{MAE} \\
          \midrule
% ========= ETTh1 =========
    \multirow{4}[1]{*}{ETTh1} & 96  
          & 0.345 & 0.373 
          & \secondres{0.335} & \secondres{0.371} 
          & \boldres{0.323} & \boldres{0.365} 
          & 0.478 & 0.476 
          & 0.383 & 0.405 \\
          & 192  
          & \secondres{0.372} & \secondres{0.396} 
          & 0.374 & 0.400 
          & \boldres{0.359} & \boldres{0.391} 
          & 0.510 & 0.486 
          & 0.453 & 0.447 \\
          & 336  
          & \secondres{0.389} & \boldres{0.412} 
          & 0.390 & \boldres{0.412} 
          & \boldres{0.388} & \secondres{0.418} 
          & 0.548 & 0.505 
          & 0.517 & 0.493 \\
          & 720  
          & \secondres{0.410} & \secondres{0.443} 
          & \boldres{0.402} & \boldres{0.433} 
          & 0.425 & 0.450 
          & 0.549 & 0.515 
          & 0.594 & 0.546 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} 
          & 0.379 & \secondres{0.406} 
          & \secondres{0.375} & \boldres{0.404} 
          & \boldres{0.373} & \secondres{0.406} 
          & 0.521 & 0.496 
          & 0.487 & 0.473 \\
\midrule
% ========= ETTh2 =========
    \multirow{4}[0]{*}{ETTh2} & 96  
          & \secondres{0.276} & \secondres{0.340} 
          & 0.278 & \boldres{0.335} 
          & \boldres{0.274} & 0.338 
          & 0.352 & 0.387 
          & 0.362 & 0.384 \\
          & 192  
          & \secondres{0.331} & \secondres{0.371} 
          & 0.345 & 0.373 
          & \boldres{0.330} & \boldres{0.370} 
          & 0.429 & 0.432 
          & 0.413 & 0.430 \\
          & 336  
          & \secondres{0.373} & \secondres{0.402} 
          & 0.384 & \secondres{0.402} 
          & \boldres{0.362} & \boldres{0.396} 
          & 0.461 & 0.460 
          & 0.430 & 0.448 \\
          & 720  
          & \secondres{0.404} & \secondres{0.431} 
          & 0.437 & 0.437 
          & \boldres{0.370} & \boldres{0.417} 
          & 0.475 & 0.473 
          & 0.554 & 0.530 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} 
          & \secondres{0.346} & \secondres{0.386} 
          & 0.361 & \secondres{0.386} 
          & \boldres{0.334} & \boldres{0.380} 
          & 0.429 & 0.438 
          & 0.440 & 0.448 \\
\midrule
% ========= ETTm1 =========
    \multirow{4}[0]{*}{ETTm1} & 96  
          & 0.286 & 0.334 
          & \secondres{0.264} & \secondres{0.325} 
          & \boldres{0.256} & \boldres{0.323} 
          & 0.468 & 0.444 
          & 0.334 & 0.372 \\
          & 192  
          & 0.307 & 0.358 
          & \secondres{0.295} & \secondres{0.350} 
          & \boldres{0.281} & \boldres{0.343} 
          & 0.557 & 0.488 
          & 0.379 & 0.401 \\
          & 336  
          & 0.354 & 0.390 
          & \boldres{0.323} & \secondres{0.376} 
          & \secondres{0.326} & \boldres{0.374} 
          & 0.682 & 0.528 
          & 0.421 & 0.425 \\
          & 720  
          & \secondres{0.433} & \secondres{0.445} 
          & \boldres{0.409} & \boldres{0.435} 
          & 0.454 & 0.452 
          & 0.722 & 0.565 
          & 0.476 & 0.471 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} 
          & 0.345 & 0.381 
          & \boldres{0.322} & \boldres{0.371} 
          & \secondres{0.329} & \secondres{0.373} 
          & 0.607 & 0.506 
          & 0.403 & 0.417 \\
\midrule
% ========= ETTm2 =========
    \multirow{4}[0]{*}{ETTm2} & 96  
          & \secondres{0.172} & \secondres{0.265} 
          & \boldres{0.169} & \boldres{0.259} 
          & 0.183 & 0.273 
          & 0.223 & 0.295 
          & 0.208 & 0.283 \\
          & 192  
          & \secondres{0.228} & 0.306 
          & \boldres{0.223} & \boldres{0.295} 
          & \boldres{0.223} & \secondres{0.301} 
          & 0.281 & 0.329 
          & 0.344 & 0.372 \\
          & 336  
          & \secondres{0.281} & 0.345 
          & 0.293 & \secondres{0.341} 
          & \boldres{0.278} & \boldres{0.339} 
          & 0.364 & 0.373 
          & 0.354 & 0.383 \\
          & 720  
          & \boldres{0.403} & \boldres{0.424} 
          & 0.451 & \secondres{0.433} 
          & \secondres{0.425} & \boldres{0.424} 
          & 0.475 & 0.435 
          & 0.460 & 0.455 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} 
          & \boldres{0.271} & 0.335 
          & 0.284 & \boldres{0.332} 
          & \secondres{0.277} & \secondres{0.334} 
          & 0.336 & 0.358 
          & 0.342 & 0.373 \\
\midrule
% ========= Weather =========
    \multirow{4}[0]{*}{Weather} & 96  
          & \secondres{0.151} & \secondres{0.203} 
          & \boldres{0.149} & \boldres{0.201} 
          & 0.154 & 0.208 
          & 0.186 & 0.231 
          & 0.165 & 0.224 \\
          & 192  
          & \secondres{0.195} & \secondres{0.246} 
          & \boldres{0.192} & \boldres{0.244} 
          & 0.202 & 0.251 
          & 0.240 & 0.275 
          & 0.209 & 0.269 \\
          & 336  
          & \secondres{0.247} & 0.288 
          & \boldres{0.245} & \boldres{0.285} 
          & 0.252 & \secondres{0.287} 
          & 0.302 & 0.317 
          & 0.261 & 0.310 \\
          & 720  
          & \secondres{0.352} & 0.366 
          & \secondres{0.352} & \secondres{0.365} 
          & 0.392 & 0.376 
          & 0.388 & 0.369 
          & \boldres{0.336} & \boldres{0.362} \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} 
          & \secondres{0.236} & \secondres{0.275} 
          & \boldres{0.234} & \secondres{0.273} 
          & 0.250 & 0.280 
          & 0.279 & 0.298 
          & 0.243 & 0.291 \\
\midrule
% ========= Global Temp =========
    \multirow{4}[0]{*}{Global Temp} & 96  
          & \secondres{0.192} & \secondres{0.328} 
          & \secondres{0.192} & 0.329 
          & \boldres{0.189} & \boldres{0.322} 
          & 0.260 & 0.390 
          & 0.210 & 0.344 \\
          & 192  
          & 0.238 & \boldres{0.375} 
          & \secondres{0.236} & \boldres{0.375} 
          & \boldres{0.234} & \secondres{0.376} 
          & 0.301 & 0.423 
          & 0.253 & 0.385 \\
          & 336  
          & 0.259 & \boldres{0.397} 
          & \secondres{0.256} & \boldres{0.397} 
          & \boldres{0.253} & \secondres{0.399} 
          & 0.359 & 0.464 
          & 0.282 & 0.411 \\
          & 720  
          & 0.345 & 0.465 
          & \secondres{0.322} & \secondres{0.451} 
          & \boldres{0.292} & \boldres{0.426} 
          & 0.371 & 0.477 
          & 0.342 & 0.457 \\
          \rowcolor{tabhighlight}
          & {\textbf{Avg.}} 
          & 0.258 & 0.391 
          & \secondres{0.251} & \secondres{0.388} 
          & \boldres{0.242} & \boldres{0.380} 
          & 0.323 & 0.439 
          & 0.272 & 0.399 \\
\midrule
% ========= Overall Average =========
    \rowcolor{blue!15}
    \multicolumn{2}{c|}{\scalebox{1.1}{\textbf{Average}}} 
          & 0.306 & 0.362 
          & \secondres{0.304} & \secondres{0.359} 
          & \boldres{0.301} & \boldres{0.358} 
          & 0.416 & 0.422 
          & 0.364 & 0.400 \\
          \bottomrule
    \end{tabular}%
  }
```
`\label{tab:in_addition}`{=latex} `\vspace{-3mm}`{=latex}

```{=latex}
\newpage
```
Forecast Showcases
==================

To visualize the performance differences among time series foundation models, we present the forecasting results of our model, `\method`{=latex}, against the ground truth on six real-world benchmarks: ETTh1, ETTh2, ETTm1, ETTm2, Weather, and Global Temp. Alongside `\method`{=latex}'s results, we also show the predictions of other foundation models at different scales, providing a comprehensive view of their comparative capabilities (Figures `\ref{fig:etth1_show_case}`{=latex} -- `\ref{fig:electricity_show_case}`{=latex}). In all figures, the context length is 512 and the forecast horizon is 96. For clarity, each figure displays the full forecast output together with a portion of the preceding historical input, making the comparison more intuitive.
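The layout of these showcase figures can be sketched as follows. This is a minimal illustration, not the script used to produce the figures: data loading and model inference are elided, the synthetic series stands in for a real benchmark, and `plot_showcase` and `SHOW_CONTEXT` are our own illustrative names and choices (only the 512-point context and 96-point horizon come from the setup above).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted rendering
import matplotlib.pyplot as plt

CONTEXT_LEN, HORIZON = 512, 96  # setup used in the showcase figures
SHOW_CONTEXT = 128              # how much input history to draw (our choice)

def plot_showcase(history, ground_truth, prediction, ax):
    """Draw the tail of the input window plus the forecast horizon.

    `history` has CONTEXT_LEN points; `ground_truth` and `prediction`
    each have HORIZON points. Only the last SHOW_CONTEXT input points
    are drawn so the horizon remains legible.
    """
    tail = history[-SHOW_CONTEXT:]
    t_hist = np.arange(len(tail))
    t_fut = np.arange(len(tail), len(tail) + HORIZON)
    # Ground truth (history + future) in blue, prediction in red.
    ax.plot(np.r_[t_hist, t_fut], np.r_[tail, ground_truth],
            color="blue", label="ground truth")
    ax.plot(t_fut, prediction, color="red", label="prediction")
    ax.axvline(len(tail) - 1, ls="--", lw=0.8, color="gray")
    ax.legend()

# Synthetic demo series standing in for a real benchmark window.
rng = np.random.default_rng(0)
n = CONTEXT_LEN + HORIZON
series = np.sin(np.arange(n) / 12.0) + 0.1 * rng.standard_normal(n)
fake_prediction = series[CONTEXT_LEN:] + 0.05  # placeholder model output

fig, ax = plt.subplots(figsize=(8, 3))
plot_showcase(series[:CONTEXT_LEN], series[CONTEXT_LEN:], fake_prediction, ax)
fig.savefig("showcase.png", dpi=150)
```

One panel per model is then tiled into the grids shown in the figures below.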

The results clearly demonstrate the superiority of `\method `{=latex}over the other foundation models. Its consistently more accurate forecasts across a range of datasets underscore the effectiveness of its architecture and design. The gains are especially pronounced in long-term prediction scenarios, where `\method`{=latex}'s handling of temporal dependencies proves more robust than that of its counterparts. These visual comparisons highlight the practical advantages of `\method `{=latex}in large-scale time series forecasting, reinforcing its status as a state-of-the-art model.

```{=latex}
\centering
```
![Zero-shot forecasting cases from ETTh1 by different models, with forecast horizon 96. [Blue]{style="color: blue"} lines are the ground truths and [red]{style="color: red"} lines are the model predictions.](figures/show_case_big/etth1.png "fig:"){#fig:etth1_show_case width="1.05\\linewidth"} `\vspace{8pt}`{=latex}

```{=latex}
\centering
```
![Zero-shot forecasting cases from ETTh2 by different models, with forecast horizon 96. [Blue]{style="color: blue"} lines are the ground truths and [red]{style="color: red"} lines are the model predictions.](figures/show_case_big/etth2.png "fig:"){#fig:etth2_show_case width="1.05\\linewidth"} `\vspace{8pt}`{=latex}

```{=latex}
\centering
```
![Zero-shot forecasting cases from ETTm1 by different models, with forecast horizon 96. [Blue]{style="color: blue"} lines are the ground truths and [red]{style="color: red"} lines are the model predictions.](figures/show_case_big/ettm1.png "fig:"){#fig:ettm1_show_case width="1.05\\linewidth"} `\vspace{8pt}`{=latex}

```{=latex}
\centering
```
![Zero-shot forecasting cases from ETTm2 by different models, with forecast horizon 96. [Blue]{style="color: blue"} lines are the ground truths and [red]{style="color: red"} lines are the model predictions.](figures/show_case_big/ettm2.png "fig:"){#fig:ettm2_show_case width="1.05\\linewidth"} `\vspace{8pt}`{=latex}

```{=latex}
\centering
```
![Zero-shot forecasting cases from Weather by different models, with forecast horizon 96. [Blue]{style="color: blue"} lines are the ground truths and [red]{style="color: red"} lines are the model predictions.](figures/show_case_big/weather.png "fig:"){#fig:weather_show_case width="1.05\\linewidth"} `\vspace{8pt}`{=latex}

```{=latex}
\centering
```
![Zero-shot forecasting cases from Global Temp by different models, with forecast horizon 96. [Blue]{style="color: blue"} lines are the ground truths and [red]{style="color: red"} lines are the model predictions.](figures/show_case_big/global_temp.png "fig:"){#fig:electricity_show_case width="1.05\\linewidth"} `\vspace{8pt}`{=latex}
