---
abstract: |
  Real-world time-series datasets are often multivariate with complex dynamics. To capture this complexity, high capacity architectures like recurrent- or attention-based sequential deep learning models have become popular. However, recent work demonstrates that simple univariate linear models can outperform such deep learning models on several commonly used academic benchmarks. Extending them, in this paper, we investigate the capabilities of linear models for time-series forecasting and present Time-Series Mixer (TSMixer), a novel architecture designed by stacking multi-layer perceptrons (MLPs). TSMixer is based on mixing operations along both the time and feature dimensions to extract information efficiently. On popular academic benchmarks, the simple-to-implement TSMixer is comparable to specialized state-of-the-art models that leverage the inductive biases of specific benchmarks. On the challenging and large scale M5 benchmark, a real-world retail dataset, TSMixer demonstrates superior performance compared to the state-of-the-art alternatives. Our results underline the importance of efficiently utilizing cross-variate and auxiliary information for improving the performance of time series forecasting. We present various analyses to shed light into the capabilities of TSMixer. The design paradigms utilized in TSMixer are expected to open new horizons for deep learning-based time series forecasting. The implementation is available at: <https://github.com/google-research/google-research/tree/master/tsmixer>.
author:
- |
  `\name `{=latex}Si-An Chen `\email `{=latex}d09922007\@ntu.edu.tw\
  `\addr `{=latex}National Taiwan University\
  Google Cloud AI Research `\AND`{=latex} `\name `{=latex}Chun-Liang Li `\email `{=latex}chunliang\@google.com\
  `\addr `{=latex}Google Cloud AI Research `\AND`{=latex} `\name `{=latex}Nathanael C. Yoder `\email `{=latex}nyoder\@google.com\
  `\addr `{=latex}Google Cloud AI Research\
  `\AND`{=latex} `\name `{=latex}Sercan Ö. Arık `\email `{=latex}soarik\@google.com\
  `\addr `{=latex}Google Cloud AI Research\
  `\AND`{=latex} `\name `{=latex}Tomas Pfister `\email `{=latex}tpfister\@google.com\
  `\addr `{=latex}Google Cloud AI Research
bibliography:
- main.bib
title: 'TSMixer: An All-MLP Architecture for Time Series Forecasting'
---

```{=latex}
\newcommand{\cl}[1]{{\color{red}{CL: #1}}}
```
```{=latex}
\newcommand{\revision}[1]{#1}
```
```{=latex}
\newcommand{\theHalgorithm}{\arabic{algorithm}}
```
```{=latex}
\let\classAND\AND
```
```{=latex}
\let\AND\relax
```
```{=latex}
\let\algoAND\AND
```
```{=latex}
\let\AND\classAND
```
```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\fix}{\marginpar{FIX}}
```
```{=latex}
\newcommand{\new}{\marginpar{NEW}}
```
```{=latex}
\def\month{09}
```
```{=latex}
\def\year{2023}
```
```{=latex}
\def\openreview{\url{https://openreview.net/forum?id=wbpxTuXgm0}}
```
```{=latex}
\maketitle
```
![TSMixer for multivariate time series forecasting. The columns of the input represent different features/variates and the rows are time steps. The fully-connected operations are row-wise. TSMixer contains interleaving time-mixing and feature-mixing MLPs to aggregate information. `\revision{The number of mixer layers is denoted as $N$.}`{=latex} The time-mixing MLPs are shared across all features and the feature-mixing MLPs are shared across all time steps. This design allows TSMixer to automatically adapt its use of both temporal and cross-variate information with a limited number of parameters for superior generalization. The extension with auxiliary information is also explored in this paper.](tsmixer_simple.png){#fig:tsmixer_simple width="\\columnwidth"}

Introduction
============

Time series forecasting is a prevalent problem in numerous real-world use cases, such as forecasting the demand for products [@BJ17; @CP99], pandemic spread [@ZJ18], and inflation rates [@CC10]. The forecastability of time series data often originates from three major aspects:

-   Persistent temporal patterns: encompassing trends and seasonal patterns, e.g., long-term inflation, day-of-week effects;

-   Cross-variate information: correlations between different variables, e.g., an increase in blood pressure associated with a rise in body weight;

-   Auxiliary features: comprising static features and future information, e.g., product categories and promotional events.

Traditional models, such as ARIMA [@BG70], are designed for univariate time series, where only temporal information is available. Therefore, they face limitations when dealing with challenging real-world data, which often contains complex cross-variate information and auxiliary features. In contrast, numerous deep learning models, particularly Transformer-based models, have been proposed for their capacity to capture both complex temporal patterns and cross-variate dependencies [@GJ17; @SL19; @WR17; @HZ21; @HW21; @LB21b; @SL22; @TZ22; @LY22; @TZ22b].

The natural intuition is that multivariate models, such as those based on Transformer architectures, should be more effective than univariate models due to their ability to leverage cross-variate information. However, @AZ22 revealed that this is not always the case -- Transformer-based models can indeed be significantly worse than simple univariate temporal linear models on many commonly used forecasting benchmarks. Multivariate models seem to suffer from overfitting, especially when the target time series is not correlated with the other covariates. This surprising finding has raised two essential questions:

1.  Does cross-variate information truly provide a benefit for time series forecasting?

2.  When cross-variate information is not beneficial, can multivariate models still perform as well as univariate models?

To address these questions, we begin by analyzing the effectiveness of temporal linear models. Our findings indicate that their time-step-dependent characteristics render temporal linear models great candidates for learning temporal patterns under common assumptions. Consequently, we gradually increase the capacity of linear models by

1.  stacking temporal linear models with non-linearities (TMix-Only),

2.  introducing cross-variate feed-forward layers (TSMixer).

The resulting TSMixer alternately applies MLPs across the time and feature dimensions, conceptually corresponding to *time-mixing* and *feature-mixing* operations, efficiently capturing both temporal patterns and cross-variate information, as illustrated in Fig. `\ref{fig:tsmixer_simple}`{=latex}. The residual designs ensure that TSMixer retains the capacity of temporal linear models while still being able to exploit cross-variate information.
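The interleaving of time-mixing and feature-mixing can be sketched in a few lines. The snippet below is a shape-level illustration with random, untrained weights; it keeps only the residual MLPs and omits the normalization and dropout used in the full architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
L, C, T = 8, 3, 4                     # lookback length, number of variates, horizon

def relu(z):
    return np.maximum(z, 0.0)

def mixer_layer(x, w_time, w_feat):
    """One time-mixing + feature-mixing block with residual connections."""
    x = x + relu(w_time @ x)          # time-mixing MLP: shared across all variates
    x = x + relu(x @ w_feat)          # feature-mixing MLP: shared across all time steps
    return x

x = rng.normal(size=(L, C))           # input window: L time steps x C variates
w_time = 0.1 * rng.normal(size=(L, L))
w_feat = 0.1 * rng.normal(size=(C, C))

h = mixer_layer(x, w_time, w_feat)    # the full model stacks N of these layers
w_out = 0.1 * rng.normal(size=(T, L))
y_hat = w_out @ h                     # temporal projection head: T x C forecast
```

Note that if the feature-mixing weights are zero, each layer reduces to a per-variate temporal map, which is how the design retains the capacity of temporal linear models.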

We evaluate TSMixer on commonly used long-term forecasting datasets [@HW21] where univariate models have outperformed multivariate models. Our ablation study demonstrates the effectiveness of stacking temporal linear models and validates that cross-variate information is less beneficial on these popular datasets, explaining the superior performance of univariate models. Even so, TSMixer is on par with state-of-the-art univariate models and significantly outperforms other multivariate models.

To demonstrate the benefit of multivariate models, we further evaluate TSMixer on the challenging M5 benchmark, a large-scale retail dataset used in the M-competition [@SM22]. M5 contains crucial cross-variate interactions such as sell prices [@SM22]. The results show that cross-variate information indeed brings significant improvement, and TSMixer can effectively leverage this information. Furthermore, we propose a principled design to extend TSMixer to handle auxiliary information such as static features and future time-varying features. It aligns the different types of features into the same shape and then applies mixer layers on the concatenated features to leverage the interactions among them. In this more practical and challenging setting, TSMixer outperforms models that are popular in industrial applications, including DeepAR (@SD20, Amazon SageMaker) and TFT (@LB21, Google Cloud Vertex), demonstrating its strong potential for real-world impact.
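The alignment idea can be sketched at the shape level as follows. All projections here are hypothetical stand-ins for learned layers, and the dimensions are illustrative; the actual extension may differ in details:

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, C = 8, 4, 3                   # lookback, horizon, target variates
C_s, C_f = 2, 1                     # static and future-feature channels
d = 6                               # common feature width after alignment

hist = rng.normal(size=(L, C))      # historical observations
static = rng.normal(size=(1, C_s))  # time-invariant features (e.g. product category)
future = rng.normal(size=(T, C_f))  # known future covariates (e.g. promotions)

# Align every input type to the horizon length T and width d
# (the random matrices stand in for learned projections).
h_hist = (rng.normal(size=(T, L)) @ hist) @ rng.normal(size=(C, d))    # T x d
h_static = np.repeat(static @ rng.normal(size=(C_s, d)), T, axis=0)    # T x d
h_future = future @ rng.normal(size=(C_f, d))                          # T x d

# Concatenate along the feature axis; mixer layers then operate on z,
# so feature-mixing can capture interactions across all three input types.
z = np.concatenate([h_hist, h_static, h_future], axis=1)               # T x 3d
```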

```{=latex}
\revision{
We summarize our contributions as below:
\begin{itemize}
    \item We analyze the effectiveness of state-of-the-art linear models and indicate that their time-step-dependent characteristics make them great candidates for learning temporal patterns under common assumptions.
    \item We propose TSMixer, an innovative architecture which retains the capacity of linear models to capture temporal patterns while still being able to exploit cross-variate information.
    \item We point out the potential risk of evaluating multivariate models on common long-term forecasting benchmarks.
    \item Our empirical studies demonstrate that TSMixer is the first multivariate model which is on par with univariate models on common benchmarks and achieves state-of-the-art on a large-scale industrial application where cross-variate information is crucial.
\end{itemize}
}
```
```{=latex}
\centering
```
```{=latex}
\setlength{\tabcolsep}{2pt}
```
```{=latex}
\resizebox{\textwidth}{!}{%
\begin{tabular}{c|c|c|c|l}
\hline
\multirow{3}{*}{Category} & Extrapolating & Consideration of & Consideration of & \multicolumn{1}{c}{\multirow{3}{*}{Models}} \\
 & temporal patterns & cross-variate information & auxiliary features & \multicolumn{1}{c}{} \\
 &  & (i.e. multivariateness) &  & \multicolumn{1}{c}{} \\ \hline
\multirow{4}{*}{I} & \multirow{4}{*}{\ding{52}} & \multirow{4}{*}{} & \multirow{4}{*}{} & ARIMA~\citep{BG70} \\
 &  &  &  & N-BEATS~\citep{BO20} \\
 &  &  &  & LTSF-Linear~\citep{AZ22} \\
 &  &  &  & PatchTST~\citep{YN23} \\ \hline
\multirow{7}{*}{II} & \multirow{7}{*}{\ding{52}} & \multirow{7}{*}{\ding{52}} & \multirow{7}{*}{} & Informer~\citep{HZ21} \\
 &  &  &  & Autoformer~\citep{HW21} \\
 &  &  &  & Pyraformer~\citep{SL22} \\
 &  &  &  & FEDformer~\citep{TZ22} \\
 &  &  &  & NS-Transformer~\citep{LY22} \\
 &  &  &  & FiLM~\citep{TZ22b} \\
 &  &  &  & \textbf{TSMixer} (this work) \\ \hline
\multirow{5}{*}{III} & \multirow{5}{*}{\ding{52}} & \multirow{5}{*}{\ding{52}} & \multirow{5}{*}{\ding{52}} & MQRNN~\citep{WR17} \\
 &  &  &  & DSSM~\citep{RS18} \\
 &  &  &  & DeepAR~\citep{SD20} \\
 &  &  &  & TFT~\citep{LB21} \\
 &  &  &  & \textbf{TSMixer-Ext} (this work) \\ \hline
\end{tabular}%
}
```
Related Work {#sec:related_work}
============

Broadly, time series forecasting is the task of predicting future values of a variable or multiple related variables, given a set of historical observations. Deep neural networks have been widely investigated for this task [@ZG98; @KN13; @LB21b]. In Table `\ref{table:recent_work}`{=latex} we coarsely split notable works into three categories based on the information considered by the model: (I) univariate forecasting, (II) multivariate forecasting, and (III) multivariate forecasting with auxiliary information.

Multivariate time series forecasting with deep neural networks has become increasingly popular with the motivation that *modeling the complex relationships between covariates should improve the forecasting performance*. Transformer-based models (Category II) are common choices for this scenario because of their superior performance in modeling long and complex sequential data [@VA17]. Various variants of Transformers have been proposed to further improve efficiency and accuracy. Informer [@HZ21] and Autoformer [@HW21] tackle the efficiency bottleneck with different attention designs that reduce memory usage for long-term forecasting. FEDformer [@TZ22] and FiLM [@TZ22b] decompose the sequences using the Fast Fourier Transform for better extraction of long-term information. There are also extensions addressing specific challenges, such as non-stationarity [@TK22; @LY22]. Despite these advances in Transformer-based models for multivariate forecasting, @AZ22 show the counter-intuitive result that a simple univariate linear model (Category I), which treats multivariate data as several univariate sequences, can outperform all of the proposed multivariate Transformer models by a significant margin on commonly-used long-term forecasting benchmarks. Similarly, @YN23 advocate against modeling cross-variate information and propose a univariate patch Transformer for multivariate forecasting tasks, showing state-of-the-art accuracy on multiple datasets. In contrast, as one of our core contributions, we find that this conclusion mainly stems from dataset bias and might not generalize well to some real-world applications.

Other works consider the scenario in which auxiliary information (Category III), such as static features (e.g. location) and future time-varying features (e.g. promotions in coming weeks), is available. Commonly used forecasting models have been extended to handle these auxiliary features, including state-space models [@RS18; @AA19; @AG22], RNN variants [@WR17; @SD20], and attention models [@LB21]. Most real-world time-series datasets are more aligned with this setting, which is why these deep learning models have achieved great success in various applications and are widely used in industry (e.g. DeepAR [@SD20] of AWS SageMaker and TFT [@LB21] of Google Cloud Vertex). One drawback of these models is their complexity, particularly when compared to the aforementioned univariate models.

Our motivations for TSMixer stem from analyzing the performance of linear models for time series forecasting. Similar architectures have been considered for other data types before; for example, the proposed TSMixer in a way resembles the well-known MLP-Mixer architecture from computer vision [@IT21]. Mixer models have also been applied to text [@FF22], speech [@TO22], network traffic [@ZY22], and point clouds [@CJ22]. Yet, to the best of our knowledge, the use of an MLP-Mixer-based architecture for time series forecasting has not been explored in the literature.

Linear Models for Time Series Forecasting {#sec:linear}
=========================================

The superiority of linear models over more complex sequential architectures, like Transformers, has been empirically demonstrated by @AZ22. We first provide theoretical insights on the capacity of linear models, which might have been overlooked due to their simplicity compared to other sequential models. We then compare linear models with other architectures and show that linear models have a characteristic not present in RNNs and Transformers -- they have the appropriate representation capacity to learn the time dependency of a univariate time series. This finding motivates the design of our proposed architecture, presented in Sec. `\ref{sec:arch}`{=latex}.

**Notation:** Let the historical observations be $\boldsymbol{X} \in \mathbb{R}^{L \times C_x}$, where $L$ is the length of the lookback window and $C_x$ is the number of variables. We consider the task of predicting $\boldsymbol{Y} \in \mathbb{R}^{T \times C_y}$, where $T$ is the number of future time steps and $C_y$ is the number of time series we want to predict. In this work, we focus on the case where the past values of the target time series are included in the historical observations ($C_y \leq C_x$). A linear model learns parameters $\boldsymbol{A} \in \mathbb{R}^{T \times L}, \boldsymbol{b} \in \mathbb{R}^{T \times 1}$ to predict the values of the next $T$ steps as: $$\hat{\boldsymbol{Y}} = \boldsymbol{A}\boldsymbol{X} \oplus \boldsymbol{b} \in \mathbb{R}^{T \times C_x},$$ where $\oplus$ denotes column-wise addition. The corresponding $C_y$ columns of $\hat{\boldsymbol{Y}}$ can be used to predict $\boldsymbol{Y}$.
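
In NumPy, this model is a single matrix product plus a broadcast bias. The dimensions and random weights below are illustrative placeholders, not trained parameters:

```python
import numpy as np

L, T, C_x = 8, 4, 3  # lookback length, horizon, number of variables

rng = np.random.default_rng(0)
X = rng.normal(size=(L, C_x))  # historical observations
A = rng.normal(size=(T, L))    # weights shared across variables
b = rng.normal(size=(T, 1))    # one bias per future time step

# Column-wise addition "⊕" is NumPy broadcasting of the (T, 1) bias.
Y_hat = A @ X + b
assert Y_hat.shape == (T, C_x)
```

Note that the same $\boldsymbol{A}$ and $\boldsymbol{b}$ are applied to every column, i.e. the model is univariate per variable.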

#### Theoretical insights:

For time series forecasting, most impactful real-world applications exhibit either smoothness or periodicity, as otherwise the predictability is low and predictive models would not be reliable. First, we consider the common assumption that the time series is periodic [@HC04; @ZG05]. For an arbitrary periodic function $x(t) = x(t-P)$, where $P<L$ is the period, there is a linear model that perfectly predicts the future values: $$\boldsymbol{A}_{ij} =
    \begin{cases}
        1, & \text{if $j = L - P + (i \bmod P)$}\\
        0, & \text{otherwise}
    \end{cases}, \boldsymbol{b}_i = 0.$$ When extending to affine-transformed periodic sequences, $x(t) = a \cdot x(t-P) + c$, where $a, c \in \mathbb{R}$ are constants, the linear model still has a solution for perfect prediction: $$\boldsymbol{A}_{ij} =
    \begin{cases}
        a, & \text{if $j = L - P + (i \bmod P)$}\\
        0, & \text{otherwise}
    \end{cases}, \boldsymbol{b}_i = c.$$ A more general assumption is that the time series can be decomposed into a periodic sequence and a sequence with smooth trend [@HC04; @ZG05; @HW21; @TZ22]. In this case, we show the following property (see the proof in Appendix `\ref{appendix:proof}`{=latex}):

```{=latex}
\begin{restatable}{theorem}{smooth}
\label{thm:smooth}
Let $x(t) = g(t) + f(t)$, where $g(t)$ is a periodic signal with period $P$ and $f(t)$ is Lipschitz smooth with constant $K$ (i.e. $\left| \frac{f(a) - f(b)}{a-b} \right| \leq K$), then there exists a linear model with lookback window size $L \geq P + 1$ such that $|y_i - \hat{y}_i| \leq K(i + \min(i, P)), \forall i = 1, \dots, T$.
\end{restatable}
```
This derivation illustrates that linear models constitute strong candidates to capture temporal relationships. For the non-periodic patterns, as long as they are smooth, which is often the case in practice, the error is still bounded given an adequate lookback window size.
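
As a sanity check of the periodic construction above, the sketch below builds the stated $\boldsymbol{A}$ (written 0-indexed) for a period-$P$ signal and verifies that the prediction is exact; the sizes are illustrative:

```python
import numpy as np

P, L, T = 5, 12, 7  # period, lookback window, horizon (P < L)
x = np.sin(2 * np.pi * np.arange(L + T) / P)  # a period-P signal

# 0-indexed version of the solution in the text:
# A[i, j] = 1 iff j = L - P + (i mod P), b = 0.
A = np.zeros((T, L))
for i in range(T):
    A[i, L - P + (i % P)] = 1.0

y_hat = A @ x[:L]                       # predict from the lookback window
assert np.allclose(y_hat, x[L:L + T])   # perfect prediction
```

Each row of $\boldsymbol{A}$ simply copies the observation exactly one or more periods before the predicted step.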

#### Differences from conventional deep learning models.

Following the discussions in @AZ22 and @YN23, our analysis of linear models offers deeper insights into why previous deep learning models tend to overfit the data. Linear models possess a unique characteristic wherein the weights of the mapping are fixed for each time step in the input sequence. This "time-step-dependent" characteristic is a crucial component of our previous findings and stands in contrast to recurrent or attention-based architectures, where the weights over the input sequence are outputs of a "data-dependent" function, such as the gates in LSTMs or the attention layers in Transformers. Time-step-dependent and data-dependent models are illustrated in Fig. `\ref{fig:dependent}`{=latex}. The time-step-dependent linear model, despite its simplicity, proves to be highly effective in modeling temporal patterns. Conversely, even though recurrent and attention architectures have high representational capacity, learning weights that depend only on time-step position is challenging for them: they usually overfit the data instead of solely considering the positions. This unique property of linear models may help explain the results in @AZ22, where no other method was shown to match the performance of the linear model.

```{=latex}
\revision{
\paragraph{Limitations of the analysis.}
The purpose of the analysis is to understand the effectiveness of temporal linear models in the univariate scenario.
Real-world time series data might have high volatility, making the patterns non-periodic and non-smooth. In such scenarios, relying solely on past-observed temporal patterns might be suboptimal.
Analysis beyond the Lipschitz case could be challenging and is out of the scope of this paper~\citep{ZT23}, so we leave more complex cases for future work.
Nevertheless, the analysis motivates us to develop a more powerful model based on linear models, introduced in Section~\ref{sec:arch}. We also show the importance of effectively utilizing multivariate information, as other covariates might contain information that can be used to model volatility -- indeed, our results in Table 5 underline that.
}
```
```{=latex}
\centering
```
```{=latex}
\begin{tikzpicture}
\Vertex[x=0,label=$x_{t-2}$, fontscale=.8]{x1}
\Vertex[x=1.5,label=$x_{t-1}$, fontscale=.8]{x2}
\Vertex[x=3,label=$x_{t}$, fontscale=.8]{x3}
\Vertex[x=1.5,y=1.8,label=$x_{t+1}$, fontscale=.8]{y1}
\Text[x=1.5,y=2.5]{$x_{t+1} = \sum_{i=1}^t w_i x_i$}
\Edge[Direct, label=$w_{t-2}$, fontscale=.8](x1)(y1)
\Edge[Direct, label=$w_{t-1}$, fontscale=.8](x2)(y1)
\Edge[Direct, label=$w_{t}$, fontscale=.8](x3)(y1)
\Text[x=1.5,y=-0.8]{Time-step-dependent}

\Vertex[x=7,label=$x_{t-2}$, fontscale=.8]{x4}
\Vertex[x=8.5,label=$x_{t-1}$, fontscale=.8]{x5}
\Vertex[x=10,label=$x_{t}$, fontscale=.8]{x6}
\Vertex[x=8.5,y=1.8,label=$x_{t+1}$, fontscale=.8]{y2}
\Text[x=8.5,y=2.5]{$x_{t+1} = \sum_{i=1}^t f_i(\boldsymbol{x})x_i$}
\Edge[Direct, label=$f_{t-2}(\boldsymbol{x})$, fontscale=.8](x4)(y2)
\Edge[Direct, label=$f_{t-1}(\boldsymbol{x})$, fontscale=.8](x5)(y2)
\Edge[Direct, label=$f_{t}(\boldsymbol{x})$, fontscale=.8](x6)(y2)
\Text[x=8.5,y=-0.8]{Data-dependent}
\end{tikzpicture}
```
TSMixer Architecture {#sec:arch}
====================

Expanding upon our finding that linear models can serve as strong candidates for capturing time dependencies, we initially propose a natural enhancement by stacking linear models with non-linearities to form multi-layer perceptrons (MLPs). Common deep learning techniques, such as normalization and residual connections, are applied to facilitate efficient learning. However, this architecture does not take cross-variate information into account.

To better leverage cross-variate information, we propose the application of MLPs in the time domain and the feature domain in an alternating manner. The time-domain MLPs are shared across all of the features, while the feature-domain MLPs are shared across all of the time steps. The resulting model is akin to the MLP-Mixer architecture from computer vision [@IT21], with time-domain and feature-domain operations representing time-mixing and feature-mixing operations, respectively. Consequently, we name our proposed architecture Time-Series Mixer (TSMixer).

The interleaving design between these two operations efficiently utilizes both temporal dependencies and cross-variate information while limiting computational complexity and model size. It allows TSMixer to use a long lookback window (see Sec. `\ref{sec:linear}`{=latex}), while keeping the layer widths at only $O(L+C)$ instead of the $O(LC)$ that fully-connected MLPs over the flattened input would require. To better understand the utility of cross-variate information and feature-mixing, we also consider a simplified variant of TSMixer that only employs time-mixing, referred to as TMix-Only, which consists of a residual MLP shared across variates, as illustrated in Fig. `\ref{fig:tmix_only}`{=latex}. We also present the extension of TSMixer to scenarios where auxiliary information about the time series is available.
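
As a back-of-the-envelope illustration of this scaling, consider one $L \to L$ time-mixing layer and a two-layer $C \to C$ feature-mixing MLP (biases and normalization parameters ignored). Each mixing matrix touches one axis at a time, so the weight count grows with $L^2 + C^2$ rather than with $(LC)^2$ as a dense layer over the flattened $L \times C$ input would; the sizes below are hypothetical:

```python
# Hypothetical sizes, e.g. lookback 512 on the 321-variate Electricity data.
L, C = 512, 321

time_mixing = L * L          # one L -> L layer, shared across variates
feature_mixing = 2 * C * C   # two C -> C layers, shared across time steps
tsmixer_layer = time_mixing + feature_mixing

flattened_dense = (L * C) ** 2  # dense layer over the flattened input

assert tsmixer_layer == 468_226
assert tsmixer_layer < flattened_dense
```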

![The architecture of TMix-Only. It is similar to TSMixer but only applies time-mixing.](tmix_only.png){#fig:tmix_only width="0.5\\columnwidth"}

TSMixer for Multivariate Time Series Forecasting
------------------------------------------------

For multivariate time series forecasting where only historical data are available, TSMixer applies MLPs alternately in the time and feature domains. The architecture is illustrated in Fig. `\ref{fig:tsmixer_simple}`{=latex}. TSMixer comprises the following components:

-   **Time-mixing MLP**: Time-mixing MLPs model temporal patterns in time series. They consist of a fully-connected layer followed by an activation function and dropout. The input is transposed so that the fully-connected layer is applied along the time dimension and shared across features. We employ a single-layer MLP since, as demonstrated in Sec. `\ref{sec:linear}`{=latex}, a simple linear model already proves to be a strong model for learning complex temporal patterns.

-   **Feature-mixing MLP**: Feature-mixing MLPs are shared across time steps and serve to leverage covariate information. Similar to Transformer-based models, we consider two-layer MLPs to learn complex feature transformations.

-   **Temporal Projection**: Temporal projection, identical to the linear models in @AZ22, is a fully-connected layer applied along the time dimension. It not only learns temporal patterns but also maps the time series from the original input length $L$ to the target forecast length $T$.

-   **Residual Connections**: We apply residual connections between each time-mixing and feature-mixing layer. These connections allow the model to learn deeper architectures more efficiently and allow the model to effectively ignore unnecessary time-mixing and feature-mixing operations.

-   **Normalization**: Normalization is a common technique to improve deep learning model training. While the preference between batch normalization and layer normalization is task-dependent, @YN23 demonstrates the advantages of batch normalization on common time series datasets. In contrast to typical normalization applied along the feature dimension, we apply 2D normalization on both time and feature dimensions due to the presence of time-mixing and feature-mixing operations.
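
Putting these components together, one TSMixer mixing layer can be sketched as follows. This is a simplified, single-example sketch: it uses plain global 2D normalization in place of trained batch normalization, omits dropout and batching, and uses random placeholder weights:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def norm2d(x):
    # Simplified 2D normalization over both time and feature dimensions.
    return (x - x.mean()) / (x.std() + 1e-5)

def mixer_layer(x, W_time, W_feat1, W_feat2):
    """One mixing layer on an (L, C) input, with residual connections.

    W_time: (L, L) time-mixing weights, shared across features.
    W_feat1/W_feat2: (C, C_h)/(C_h, C) feature-mixing MLP, shared across steps.
    """
    h = x + relu(W_time @ norm2d(x))                # time-mixing
    return h + relu(norm2d(h) @ W_feat1) @ W_feat2  # feature-mixing

L, C, C_h = 16, 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(L, C))
out = mixer_layer(x,
                  0.1 * rng.normal(size=(L, L)),
                  0.1 * rng.normal(size=(C, C_h)),
                  0.1 * rng.normal(size=(C_h, C)))
assert out.shape == (L, C)  # shape is preserved, so layers can be stacked
```

Stacking several such layers and appending the temporal projection ($L \to T$) yields the full forecasting model.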

Contrary to some recent Transformer advances of increased complexity, the architecture of TSMixer is relatively simple to implement. Despite its simplicity, we demonstrate in Sec. `\ref{sec:exp}`{=latex} that TSMixer remains competitive with state-of-the-art models on representative benchmarks.

Extended TSMixer for Time Series Forecasting with Auxiliary Information {#subsec:tsmixer_aux}
-----------------------------------------------------------------------

In addition to the historical observations, many real-world scenarios allow us access to static features $\boldsymbol{S} \in \mathbb{R}^{1 \times C_s}$ (e.g. location) and future time-varying features $\boldsymbol{Z} \in \mathbb{R}^{T \times C_z}$ (e.g. promotions in subsequent weeks). `\revision{The problem can also be extended to multiple time series, represented by $\{\boldsymbol{X}^{(i)}\}_{i=1}^M$, where $M$ is the number of time series, with each time series associated with its own set of features.}`{=latex} Most recent works, especially those focusing on long-term forecasting, only consider the historical features and targets of all variables (i.e. $C_x = C_y > 1, C_s = C_z = 0$). In this paper, we also consider the case where auxiliary information is available (i.e. $C_s > 0, C_z > 0$).

To leverage the different types of features, we propose a principled design that naturally leverages feature-mixing to capture the interactions between them. We first design the align stage to project features with different shapes into the same shape. Then we can concatenate the features and seamlessly apply feature-mixing on them. We extend TSMixer as illustrated in Fig. `\ref{fig:tsmixer_advanced}`{=latex}. The architecture comprises two parts: align and mixing. In the align stage, TSMixer aligns historical features ($\mathbb{R}^{L \times C_x}$) and future features ($\mathbb{R}^{T \times C_z}$) into the same shape ($\mathbb{R}^{T \times C_h}$) by applying temporal projection and a feature-mixing layer, where $C_h$ denotes the size of the hidden layers. Additionally, it repeats the static features to transform their shape from $\mathbb{R}^{1 \times C_s}$ to $\mathbb{R}^{T \times C_s}$ in order to align the output length.

In the mixing stage, the mixing layer, which includes time-mixing and feature-mixing operations, naturally leverages temporal patterns and cross-variate information from all features collectively. Lastly, we employ a fully-connected layer to generate outputs for each time step. The outputs can either be the real values of the forecasted time series ($\mathbb{R}^{T \times C_y}$), typically optimized by mean absolute error or mean squared error, or, in some tasks, the parameters of a target distribution, such as the negative binomial distribution for retail demand forecasting [@SD20]. We slightly modify the mixing layers to better handle the M5 dataset, as described in Appendix `\ref{appendix:imp_detail}`{=latex}.
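
The align-and-concatenate logic can be sketched as follows; the random matrices are placeholders for the learned temporal projection and feature projections, and activations/normalization are omitted:

```python
import numpy as np

L, T = 35, 28                     # lookback / horizon (M5-like, illustrative)
C_x, C_z, C_s, C_h = 14, 13, 6, 8

rng = np.random.default_rng(0)
X = rng.normal(size=(L, C_x))     # historical features
Z = rng.normal(size=(T, C_z))     # known future features
S = rng.normal(size=(1, C_s))     # static features

# Align stage: project everything to length T, features to width C_h.
X_aligned = (rng.normal(size=(T, L)) @ X) @ rng.normal(size=(C_x, C_h))
Z_aligned = Z @ rng.normal(size=(C_z, C_h))
S_aligned = np.repeat(S, T, axis=0)           # repeat statics along time

# Mixing stage then operates on the concatenated representation.
mix_input = np.concatenate([X_aligned, Z_aligned, S_aligned], axis=1)
assert mix_input.shape == (T, 2 * C_h + C_s)
```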

```{=latex}
\revision{
\subsection{Differences between TSMixer and MLP-Mixer}
While TSMixer shares architectural similarities with MLP-Mixer, the development of TSMixer, motivated by our analysis in Section~\ref{sec:linear}, has led to a distinct normalization approach.
In TSMixer, the two dimensions represent features and time steps, unlike MLP-Mixer's features and patches.
Consequently, we apply 2D normalization to maintain scale across features and time steps, since we have discovered the importance of utilizing temporal patterns in forecasting.
In addition, we propose an extended version of TSMixer to better extract information from heterogeneous inputs, which is essential to achieve state-of-the-art results in real-world scenarios.
}
```
![TSMixer with auxiliary information. The columns of the inputs are features and the rows are time steps. We first align the sequence lengths of different types of inputs to concatenate them. Then we apply mixing layers to model their temporal patterns and cross-variate information jointly.](tsmixer_advanced.png){#fig:tsmixer_advanced width="\\columnwidth"}

```{=latex}
\begin{table*}[!tb]\centering
\small
\caption{Statistics of all datasets. Note that Electricity and Traffic can be considered as multivariate time series or multiple univariate time series since all variates share the same physical meaning in the dataset (e.g. electricity consumption at different locations).}
\label{table:data_stat}
\begin{tabular}{l|cc|ccc|c}
\hline
 & ETTh1/h2 & ETTm1/m2 & Weather & Electricity & Traffic & M5 \\ \hline
\# of time series ($M$) & 1 & 1 & 1 & 1 & 1 & 30,490 \\
\# of variates ($C$) & 7 & 7 & 21 & 321 & 862 & 1 \\
Time steps & 17,420 & 69,680 & 52,696 & 26,304 & 17,544 & 1,942 \\
Granularity & 1 hour & 15 minutes & 10 minutes & 1 hour & 1 hour & 1 day \\
Historical feature ($C_x$) & 0 & 0 & 0 & 0 & 0 & 14 \\
Future feature ($C_z$) & 0 & 0 & 0 & 0 & 0 & 13 \\
Static feature ($C_s$) & 0 & 0 & 0 & 0 & 0 & 6 \\ \hline
Data partition & \multicolumn{2}{c|}{\multirow{2}{*}{12/4/4 (month)}} & \multicolumn{3}{c|}{\multirow{2}{*}{7:2:1}} & \multirow{2}{*}{1886/28/28 (day)} \\
(Train/Validation/Test) & \multicolumn{2}{c|}{} & \multicolumn{3}{c|}{} & \\

\hline
\end{tabular}
\end{table*}
```
```{=latex}
\begin{table*}[!tb]\centering
%\footnotesize
\setlength\tabcolsep{2pt}
\caption{Evaluation results on the long-term forecasting datasets. The numbers of models marked with ``*'' are obtained from~\citet{YN23}. The best numbers in each row are shown in {\color[HTML]{FE0000}\textbf{bold}} and the second best numbers are {\color[HTML]{3531FF}\underline{underlined}}. \revision{We skip TMix-Only in comparisons as it performs similar to TSMixer. The last row shows the average percentage of MSE improvement of TSMixer over other methods.}}
\label{table:ltsf_main}
\resizebox{\textwidth}{!}{%
\begin{tabular}{cc|cccccccccc|cccccc}
\hline
 &  & \multicolumn{10}{c|}{Multivariate Model} & \multicolumn{6}{c}{Univariate Model} \\ \hline
\multicolumn{2}{c|}{Models} & \multicolumn{2}{c|}{\textbf{TSMixer}} & \multicolumn{2}{c|}{TFT} & \multicolumn{2}{c|}{FEDformer*} & \multicolumn{2}{c|}{Autoformer*} & \multicolumn{2}{c|}{Informer*} & \multicolumn{2}{c|}{\textbf{TMix-Only}} & \multicolumn{2}{c|}{Linear} & \multicolumn{2}{c}{PatchTST*} \\ \hline
\multicolumn{2}{c|}{Metric} & MSE & \multicolumn{1}{c|}{MAE} & MSE & \multicolumn{1}{c|}{MAE} & MSE & \multicolumn{1}{c|}{MAE} & MSE & \multicolumn{1}{c|}{MAE} & MSE & MAE & MSE & \multicolumn{1}{c|}{MAE} & MSE & \multicolumn{1}{c|}{MAE} & MSE & MAE \\ \hline
\multicolumn{1}{c|}{ETTh1} & 96 & {\color[HTML]{FF0000} \textbf{0.361}} & \multicolumn{1}{c|}{{\color[HTML]{FF0000} \textbf{0.392}}} & 0.674 & \multicolumn{1}{c|}{0.634} & 0.376 & \multicolumn{1}{c|}{0.415} & 0.435 & \multicolumn{1}{c|}{0.446} & 0.941 & 0.769 & 0.359 & \multicolumn{1}{c|}{0.391} & {\color[HTML]{0000FF} {\ul 0.368}} & \multicolumn{1}{c|}{{\color[HTML]{FE0000} \textbf{0.392}}} & 0.370 & 0.400 \\
\multicolumn{1}{c|}{} & 192 & {\color[HTML]{FE0000} \textbf{0.404}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.418}}} & 0.858 & \multicolumn{1}{c|}{0.704} & 0.423 & \multicolumn{1}{c|}{0.446} & 0.456 & \multicolumn{1}{c|}{0.457} & 1.007 & 0.786 & 0.402 & \multicolumn{1}{c|}{0.415} & {\color[HTML]{FF0000} \textbf{0.404}} & \multicolumn{1}{c|}{{\color[HTML]{FF0000} \textbf{0.415}}} & 0.413 & 0.429 \\
\multicolumn{1}{c|}{} & 336 & {\color[HTML]{FF0000} \textbf{0.420}} & \multicolumn{1}{c|}{{\color[HTML]{FF0000} \textbf{0.431}}} & 0.900 & \multicolumn{1}{c|}{0.731} & 0.444 & \multicolumn{1}{c|}{0.462} & 0.486 & \multicolumn{1}{c|}{0.487} & 1.038 & 0.784 & 0.420 & \multicolumn{1}{c|}{0.434} & 0.436 & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.439}}} & {\color[HTML]{0000FF} {\ul 0.422}} & 0.440 \\
\multicolumn{1}{c|}{} & 720 & {\color[HTML]{0000FF} {\ul 0.463}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.472}}} & 0.745 & \multicolumn{1}{c|}{0.666} & 0.469 & \multicolumn{1}{c|}{0.492} & 0.515 & \multicolumn{1}{c|}{0.517} & 1.144 & 0.857 & 0.453 & \multicolumn{1}{c|}{0.467} & 0.481 & \multicolumn{1}{c|}{0.495} & {\color[HTML]{FF0000} \textbf{0.447}} & {\color[HTML]{FF0000} \textbf{0.468}} \\ \hline
\multicolumn{1}{c|}{ETTh2} & 96 & {\color[HTML]{FE0000} \textbf{0.274}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.341}}} & 0.409 & \multicolumn{1}{c|}{0.505} & 0.332 & \multicolumn{1}{c|}{0.374} & 0.332 & \multicolumn{1}{c|}{0.368} & 1.549 & 0.952 & 0.275 & \multicolumn{1}{c|}{0.342} & 0.297 & \multicolumn{1}{c|}{0.363} & {\color[HTML]{FF0000} \textbf{0.274}} & {\color[HTML]{FF0000} \textbf{0.337}} \\
\multicolumn{1}{c|}{} & 192 & {\color[HTML]{FF0000} \textbf{0.339}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.385}}} & 0.953 & \multicolumn{1}{c|}{0.651} & 0.407 & \multicolumn{1}{c|}{0.446} & 0.426 & \multicolumn{1}{c|}{0.434} & 3.792 & 1.542 & 0.339 & \multicolumn{1}{c|}{0.386} & 0.398 & \multicolumn{1}{c|}{0.429} & {\color[HTML]{0000FF} {\ul 0.341}} & {\color[HTML]{FF0000} \textbf{0.382}} \\
\multicolumn{1}{c|}{} & 336 & {\color[HTML]{0000FF} {\ul 0.361}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.406}}} & 1.006 & \multicolumn{1}{c|}{0.709} & 0.400 & \multicolumn{1}{c|}{0.447} & 0.477 & \multicolumn{1}{c|}{0.479} & 4.215 & 1.642 & 0.366 & \multicolumn{1}{c|}{0.413} & 0.500 & \multicolumn{1}{c|}{0.491} & {\color[HTML]{FF0000} \textbf{0.329}} & {\color[HTML]{FF0000} \textbf{0.384}} \\
\multicolumn{1}{c|}{} & 720 & 0.445 & \multicolumn{1}{c|}{0.470} & 1.187 & \multicolumn{1}{c|}{0.816} & {\color[HTML]{0000FF} {\ul 0.412}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.469}}} & 0.453 & \multicolumn{1}{c|}{0.490} & 3.656 & 1.619 & 0.437 & \multicolumn{1}{c|}{0.465} & 0.795 & \multicolumn{1}{c|}{0.633} & {\color[HTML]{FF0000} \textbf{0.379}} & {\color[HTML]{FF0000} \textbf{0.422}} \\ \hline
\multicolumn{1}{c|}{ETTm1} & 96 & {\color[HTML]{FF0000} \textbf{0.285}} & \multicolumn{1}{c|}{{\color[HTML]{FF0000} \textbf{0.339}}} & 0.752 & \multicolumn{1}{c|}{0.626} & 0.326 & \multicolumn{1}{c|}{0.390} & 0.510 & \multicolumn{1}{c|}{0.492} & 0.626 & 0.560 & 0.284 & \multicolumn{1}{c|}{0.338} & 0.303 & \multicolumn{1}{c|}{0.346} & {\color[HTML]{0000FF} {\ul 0.293}} & {\color[HTML]{0000FF} {\ul 0.346}} \\
\multicolumn{1}{c|}{} & 192 & {\color[HTML]{FF0000} \textbf{0.327}} & \multicolumn{1}{c|}{{\color[HTML]{FF0000} \textbf{0.365}}} & 0.752 & \multicolumn{1}{c|}{0.649} & 0.365 & \multicolumn{1}{c|}{0.415} & 0.514 & \multicolumn{1}{c|}{0.495} & 0.725 & 0.619 & 0.324 & \multicolumn{1}{c|}{0.362} & 0.335 & \multicolumn{1}{c|}{{\color[HTML]{FE0000} \textbf{0.365}}} & {\color[HTML]{0000FF} {\ul 0.333}} & 0.370 \\
\multicolumn{1}{c|}{} & 336 & {\color[HTML]{FF0000} \textbf{0.356}} & \multicolumn{1}{c|}{{\color[HTML]{FF0000} \textbf{0.382}}} & 0.810 & \multicolumn{1}{c|}{0.674} & 0.392 & \multicolumn{1}{c|}{0.425} & 0.510 & \multicolumn{1}{c|}{0.492} & 1.005 & 0.741 & 0.359 & \multicolumn{1}{c|}{0.384} & {\color[HTML]{0000FF} {\ul 0.365}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.384}}} & 0.369 & 0.392 \\
\multicolumn{1}{c|}{} & 720 & {\color[HTML]{0000FF} {\ul 0.419}} & \multicolumn{1}{c|}{{\color[HTML]{FF0000} \textbf{0.414}}} & 0.849 & \multicolumn{1}{c|}{0.695} & 0.446 & \multicolumn{1}{c|}{0.458} & 0.527 & \multicolumn{1}{c|}{0.493} & 1.133 & 0.845 & 0.419 & \multicolumn{1}{c|}{0.414} & {\color[HTML]{0000FF} {\ul 0.419}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.415}}} & {\color[HTML]{FF0000} \textbf{0.416}} & 0.420 \\ \hline
\multicolumn{1}{c|}{ETTm2} & 96 & {\color[HTML]{FF0000} \textbf{0.163}} & \multicolumn{1}{c|}{{\color[HTML]{FF0000} \textbf{0.252}}} & 0.386 & \multicolumn{1}{c|}{0.472} & 0.180 & \multicolumn{1}{c|}{0.271} & 0.205 & \multicolumn{1}{c|}{0.293} & 0.355 & 0.462 & 0.162 & \multicolumn{1}{c|}{0.249} & 0.170 & \multicolumn{1}{c|}{0.266} & {\color[HTML]{0000FF} {\ul 0.166}} & {\color[HTML]{0000FF} {\ul 0.256}} \\
\multicolumn{1}{c|}{} & 192 & {\color[HTML]{FF0000} \textbf{0.216}} & \multicolumn{1}{c|}{{\color[HTML]{FF0000} \textbf{0.290}}} & 0.739 & \multicolumn{1}{c|}{0.626} & 0.252 & \multicolumn{1}{c|}{0.318} & 0.278 & \multicolumn{1}{c|}{0.336} & 0.595 & 0.586 & 0.220 & \multicolumn{1}{c|}{0.293} & 0.236 & \multicolumn{1}{c|}{0.317} & {\color[HTML]{0000FF} {\ul 0.223}} & {\color[HTML]{0000FF} {\ul 0.296}} \\
\multicolumn{1}{c|}{} & 336 & {\color[HTML]{FF0000} \textbf{0.268}} & \multicolumn{1}{c|}{{\color[HTML]{FF0000} \textbf{0.324}}} & 0.477 & \multicolumn{1}{c|}{0.494} & 0.324 & \multicolumn{1}{c|}{0.364} & 0.343 & \multicolumn{1}{c|}{0.379} & 1.270 & 0.871 & 0.269 & \multicolumn{1}{c|}{0.326} & 0.308 & \multicolumn{1}{c|}{0.369} & {\color[HTML]{0000FF} {\ul 0.274}} & {\color[HTML]{0000FF} {\ul 0.329}} \\
\multicolumn{1}{c|}{} & 720 & 0.420 & \multicolumn{1}{c|}{0.422} & 0.523 & \multicolumn{1}{c|}{0.537} & {\color[HTML]{0000FF} {\ul 0.410}} & \multicolumn{1}{c|}{0.420} & 0.414 & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.419}}} & 3.001 & 1.267 & 0.358 & \multicolumn{1}{c|}{0.382} & 0.435 & \multicolumn{1}{c|}{0.449} & {\color[HTML]{FF0000} \textbf{0.362}} & {\color[HTML]{FF0000} \textbf{0.385}} \\ \hline
\multicolumn{1}{c|}{Weather} & 96 & {\color[HTML]{FF0000} \textbf{0.145}} & \multicolumn{1}{c|}{{\color[HTML]{FF0000} \textbf{0.198}}} & 0.441 & \multicolumn{1}{c|}{0.474} & 0.238 & \multicolumn{1}{c|}{0.314} & 0.249 & \multicolumn{1}{c|}{0.329} & 0.354 & 0.405 & 0.145 & \multicolumn{1}{c|}{0.196} & 0.170 & \multicolumn{1}{c|}{0.229} & {\color[HTML]{0000FF} {\ul 0.149}} & {\color[HTML]{FE0000} \textbf{0.198}} \\
\multicolumn{1}{c|}{} & 192 & {\color[HTML]{FF0000} \textbf{0.191}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.242}}} & 0.699 & \multicolumn{1}{c|}{0.599} & 0.275 & \multicolumn{1}{c|}{0.329} & 0.325 & \multicolumn{1}{c|}{0.370} & 0.419 & 0.434 & 0.190 & \multicolumn{1}{c|}{0.240} & 0.213 & \multicolumn{1}{c|}{0.268} & {\color[HTML]{0000FF} {\ul 0.194}} & {\color[HTML]{FF0000} \textbf{0.241}} \\
\multicolumn{1}{c|}{} & 336 & {\color[HTML]{FF0000} \textbf{0.242}} & \multicolumn{1}{c|}{{\color[HTML]{FF0000} \textbf{0.280}}} & 0.693 & \multicolumn{1}{c|}{0.596} & 0.339 & \multicolumn{1}{c|}{0.377} & 0.351 & \multicolumn{1}{c|}{0.391} & 0.583 & 0.543 & 0.240 & \multicolumn{1}{c|}{0.279} & 0.257 & \multicolumn{1}{c|}{0.305} & {\color[HTML]{0000FF} {\ul 0.245}} & {\color[HTML]{0000FF} {\ul 0.282}} \\
\multicolumn{1}{c|}{} & 720 & 0.320 & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.336}}} & 1.038 & \multicolumn{1}{c|}{0.753} & 0.389 & \multicolumn{1}{c|}{0.409} & 0.415 & \multicolumn{1}{c|}{0.426} & 0.916 & 0.705 & 0.325 & \multicolumn{1}{c|}{0.339} & {\color[HTML]{0000FF} {\ul 0.318}} & \multicolumn{1}{c|}{0.356} & {\color[HTML]{FF0000} \textbf{0.314}} & {\color[HTML]{FF0000} \textbf{0.334}} \\ \hline
\multicolumn{1}{c|}{Electricity} & 96 & {\color[HTML]{0000FF} {\ul 0.131}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.229}}} & 0.295 & \multicolumn{1}{c|}{0.376} & 0.186 & \multicolumn{1}{c|}{0.302} & 0.196 & \multicolumn{1}{c|}{0.313} & 0.304 & 0.393 & 0.132 & \multicolumn{1}{c|}{0.225} & 0.135 & \multicolumn{1}{c|}{0.232} & {\color[HTML]{FF0000} \textbf{0.129}} & {\color[HTML]{FF0000} \textbf{0.222}} \\
\multicolumn{1}{c|}{} & 192 & 0.151 & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.246}}} & 0.327 & \multicolumn{1}{c|}{0.397} & 0.197 & \multicolumn{1}{c|}{0.311} & 0.211 & \multicolumn{1}{c|}{0.324} & 0.327 & 0.417 & 0.152 & \multicolumn{1}{c|}{0.243} & {\color[HTML]{0000FF} {\ul 0.149}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.246}}} & {\color[HTML]{FF0000} \textbf{0.147}} & {\color[HTML]{FF0000} \textbf{0.240}} \\
\multicolumn{1}{c|}{} & 336 & {\color[HTML]{FF0000} \textbf{0.161}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.261}}} & 0.298 & \multicolumn{1}{c|}{0.380} & 0.213 & \multicolumn{1}{c|}{0.328} & 0.214 & \multicolumn{1}{c|}{0.327} & 0.333 & 0.422 & 0.166 & \multicolumn{1}{c|}{0.260} & 0.164 & \multicolumn{1}{c|}{0.263} & {\color[HTML]{0000FF} {\ul 0.163}} & {\color[HTML]{FF0000} \textbf{0.259}} \\
\multicolumn{1}{c|}{} & 720 & {\color[HTML]{FE0000} {\ul \textbf{0.197}}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.293}}} & 0.338 & \multicolumn{1}{c|}{0.412} & 0.233 & \multicolumn{1}{c|}{0.344} & 0.236 & \multicolumn{1}{c|}{0.342} & 0.351 & 0.427 & 0.200 & \multicolumn{1}{c|}{0.291} & 0.199 & \multicolumn{1}{c|}{0.297} & {\color[HTML]{FF0000} \textbf{0.197}} & {\color[HTML]{FF0000} \textbf{0.290}} \\ \hline
\multicolumn{1}{c|}{Traffic} & 96 & {\color[HTML]{0000FF} {\ul 0.376}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.264}}} & 0.678 & \multicolumn{1}{c|}{0.362} & 0.576 & \multicolumn{1}{c|}{0.359} & 0.597 & \multicolumn{1}{c|}{0.371} & 0.733 & 0.410 & 0.370 & \multicolumn{1}{c|}{0.258} & 0.395 & \multicolumn{1}{c|}{0.274} & {\color[HTML]{FF0000} \textbf{0.360}} & {\color[HTML]{FF0000} \textbf{0.249}} \\
\multicolumn{1}{c|}{} & 192 & {\color[HTML]{0000FF} {\ul 0.397}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.277}}} & 0.664 & \multicolumn{1}{c|}{0.355} & 0.610 & \multicolumn{1}{c|}{0.380} & 0.607 & \multicolumn{1}{c|}{0.382} & 0.777 & 0.435 & 0.390 & \multicolumn{1}{c|}{0.268} & 0.406 & \multicolumn{1}{c|}{0.279} & {\color[HTML]{FF0000} \textbf{0.379}} & {\color[HTML]{FF0000} \textbf{0.256}} \\
\multicolumn{1}{c|}{} & 336 & {\color[HTML]{0000FF} {\ul 0.413}} & \multicolumn{1}{c|}{0.290} & 0.679 & \multicolumn{1}{c|}{0.354} & 0.608 & \multicolumn{1}{c|}{0.375} & 0.623 & \multicolumn{1}{c|}{0.387} & 0.776 & 0.434 & 0.404 & \multicolumn{1}{c|}{0.276} & 0.416 & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.286}}} & {\color[HTML]{FF0000} \textbf{0.392}} & {\color[HTML]{FF0000} \textbf{0.264}} \\
\multicolumn{1}{c|}{} & 720 & {\color[HTML]{0000FF} {\ul 0.444}} & \multicolumn{1}{c|}{{\color[HTML]{0000FF} {\ul 0.306}}} & 0.610 & \multicolumn{1}{c|}{0.326} & 0.621 & \multicolumn{1}{c|}{0.375} & 0.639 & \multicolumn{1}{c|}{0.395} & 0.827 & 0.466 & 0.443 & \multicolumn{1}{c|}{0.297} & 0.454 & \multicolumn{1}{c|}{0.308} & {\color[HTML]{FF0000} \textbf{0.432}} & {\color[HTML]{FF0000} \textbf{0.286}} \\ \hline
\multicolumn{4}{c|}{\revision{\textbf{TSMixer MSE Imp.}}} & \multicolumn{2}{c|}{\textbf{51.94\%}} & \multicolumn{2}{c|}{\textbf{16.69\%}} & \multicolumn{2}{c|}{\textbf{24.51\%} } & \multicolumn{2}{c|}{\textbf{62.40\%} } & \multicolumn{2}{c|}{-0.66\%} & \multicolumn{2}{c|}{\textbf{6.77\%}} & \multicolumn{2}{c}{-1.53\%}\\ \hline
\end{tabular}%
}
\end{table*}
```
Experiments {#sec:exp}
===========

We evaluate TSMixer on seven popular multivariate long-term forecasting benchmarks and a large-scale real-world retail dataset, M5 [@SM22]. The long-term forecasting datasets cover various applications such as weather, electricity, and traffic, and are comprised of multivariate time series without auxiliary information. The M5 dataset is from the competition task of predicting the sales of various items at Walmart. It is a large-scale dataset containing 30,490 time series with static features such as store locations, as well as time-varying features such as campaign information. This complexity makes M5 a more challenging benchmark for exploring the potential benefits of cross-variate information and auxiliary features. The statistics of these datasets are presented in Table `\ref{table:data_stat}`{=latex}.

For multivariate long-term forecasting datasets, we follow the settings in recent research [@LY22; @TZ22b; @YN23]. We set the input length $L = 512$ as suggested in @YN23 and evaluate the results for prediction lengths of $T = \{96, 192, 336, 720\}$. We use the Adam optimization algorithm [@DK15] to minimize the mean square error (MSE) training objective, and consider MSE and mean absolute error (MAE) as the evaluation metrics. We apply reversible instance normalization [@TK22] to ensure a fair comparison with the state-of-the-art PatchTST [@YN23].
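
Reversible instance normalization can be sketched as below; this minimal version normalizes each variate by its own lookback statistics and inverts them on the outputs, omitting the learnable affine parameters of the original method:

```python
import numpy as np

def revin_normalize(x, eps=1e-5):
    # Per-variate statistics over the lookback window (axis 0 = time).
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True) + eps
    return (x - mu) / sigma, (mu, sigma)

def revin_denormalize(y_hat, stats):
    mu, sigma = stats
    return y_hat * sigma + mu

rng = np.random.default_rng(0)
x = 10.0 + 3.0 * rng.normal(size=(512, 7))   # shifted/scaled toy input
x_norm, stats = revin_normalize(x)

# Round trip recovers the original scale; in practice the model forecasts
# in the normalized space and the statistics are applied to its outputs.
assert np.allclose(revin_denormalize(x_norm, stats), x)
```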

For the M5 dataset, we mostly follow the data processing from @gluonts_jmlr. We consider a prediction length of $T = 28$ (same as the competition) and set the input length to $L = 35$. We optimize the log-likelihood of the negative binomial distribution, as suggested by @SD20. We follow the competition's protocol [@SM22] to aggregate the predictions at different levels and evaluate them using the weighted root mean squared scaled error (WRMSSE). More details about the experimental setup and hyperparameter tuning can be found in Appendices `\ref{appendix:exp_detail}`{=latex} and `\ref{appendix:best_hp}`{=latex}.
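
For reference, this objective can be sketched with a mean/shape parameterization of the negative binomial distribution; this is an illustrative formulation, and the exact parameterization used in implementations may differ:

```python
import numpy as np
from math import lgamma

def neg_binomial_nll(y, mu, alpha):
    """NLL of a negative binomial with mean mu > 0 and shape alpha > 0.

    Smaller alpha -> closer to Poisson; larger alpha -> more overdispersion.
    """
    r = 1.0 / alpha
    log_p = (lgamma(y + r) - lgamma(r) - lgamma(y + 1.0)
             + r * np.log(r / (r + mu)) + y * np.log(mu / (r + mu)))
    return -log_p

# For fixed alpha, the loss is smallest when the predicted mean matches
# the observed count.
y = 5.0
losses = [neg_binomial_nll(y, mu, 0.5) for mu in (2.0, 5.0, 20.0)]
assert losses[1] == min(losses)
```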

Multivariate Long-term Forecasting
----------------------------------

For multivariate long-term forecasting tasks, we compare TSMixer to state-of-the-art multivariate models such as FEDformer [@TZ22], Autoformer [@HW21], Informer [@HZ21], and univariate models like PatchTST [@YN23] and LTSF-Linear [@AZ22]. Additionally, we include TFT [@LB21], a deep learning-based model that considers auxiliary information, as a baseline to understand the limitations of solely relying on historical features. We also evaluate TMix-Only, a variant of TSMixer that only applies time-mixing, to assess the effectiveness of feature-mixing. The results are presented in Table `\ref{table:ltsf_main}`{=latex}. A comparison with other MLP-like alternatives is provided in Appendix `\ref{appendix:alter}`{=latex}.

#### TMix-Only

We first examine the results of univariate models. TMix-Only outperforms the linear model, showing that stacking linear models with non-linearities is beneficial even without considering cross-variate information. Moreover, TMix-Only performs at a level comparable to the state-of-the-art PatchTST, suggesting that the simple time-mixing layer is on par with more complex attention mechanisms.

#### TSMixer

Our results indicate that TSMixer performs similarly to TMix-Only and PatchTST. It significantly outperforms other state-of-the-art multivariate models and achieves competitive performance compared to PatchTST, the state-of-the-art univariate model. TSMixer is the only multivariate model that is competitive with univariate models; all other multivariate models perform significantly worse. The performance of TSMixer is also similar to that of TMix-Only, which implies that feature-mixing is not beneficial on these benchmarks. These observations are consistent with the findings in @AZ22 and @YN23. The results suggest that cross-variate information may be less significant in these datasets, indicating that the commonly used datasets may not be sufficient to evaluate a model's capability of utilizing covariates. However, we will demonstrate that cross-variate information can be useful in other scenarios.

#### Effects of lookback window length

To gain a deeper understanding of TSMixer's capacity to leverage longer sequences, we conduct experiments with varying lookback window sizes, specifically $L = \{96, 336, 512, 720\}$. We also perform similar experiments on linear models to support the findings presented in Section `\ref{sec:linear}`{=latex}. The results of these experiments are depicted in Fig. `\ref{fig:seq_len_sub}`{=latex}; more results and details can be found in Appendix `\ref{appendix:seq_len}`{=latex}. Our empirical analyses reveal that the performance of linear models improves significantly as the lookback window size increases from 96 to 336, and appears to converge at 720. This aligns with our prior finding that the performance of linear models depends on the lookback window size. TSMixer, on the other hand, achieves the best performance with a window size of 336 or 512, and maintains a similar level of performance as the window size is increased to 720. As noted by @YN23, many multivariate Transformer-based models (such as Transformer, Informer, Autoformer, and FEDformer) do not benefit from lookback window sizes greater than 192 and are prone to overfitting when the window size is increased. In comparison, TSMixer demonstrates a superior ability to leverage longer sequences and better generalization than other multivariate models.

![Performance comparison on varying lookback window size $L$ of linear models and TSMixer.](seq_len_sub.png){#fig:seq_len_sub width="\\columnwidth"}

```{=latex}
\centering
```
::: {#table:m5_past}
  Models              Multivariate           Test WRMSSE         Val WRMSSE
  --------------- --------------------- --------------------- -----------------
  Linear                                   0.983$\pm$0.016     1.045$\pm$0.018
  PatchTST                                 0.976$\pm$0.014     0.992$\pm$0.011
  **TMix-Only**                            0.960$\pm$0.041     1.000$\pm$0.027
  Autoformer       `\ding{52}`{=latex}     0.742$\pm$0.029     0.640$\pm$0.023
  FEDformer        `\ding{52}`{=latex}     0.804$\pm$0.039     0.674$\pm$0.014
  **TSMixer**      `\ding{52}`{=latex}   **0.737$\pm$0.033**   0.605$\pm$0.027

  : Evaluation on M5 without auxiliary information. We report the mean and standard deviation of WRMSSE across 5 different random seeds. TMix-Only is a univariate variant of TSMixer where only time-mixing is applied. The multivariate models outperform the univariate models by a significant margin.
:::

```{=latex}
\centering
```
::: {#table:m5_aux}
+:---------------------------------------+:------------------------------------------------:+:--------------------------------------:+:-------------------------------------:+:---------------:+
| `\multirow{2}{*}{Models}`{=latex}      | `\multicolumn{2}{c|}{Auxiliary feature}`{=latex} | `\multirow{2}{*}{Test WRMSSE}`{=latex} | `\multirow{2}{*}{Val WRMSSE}`{=latex} |                 |
+----------------------------------------+--------------------------------------------------+----------------------------------------+---------------------------------------+-----------------+
| ```{=latex}                            | Static                                           | Future                                 |                                       |                 |
| \cline{2-3}                            |                                                  |                                        |                                       |                 |
| ```                                    |                                                  |                                        |                                       |                 |
+----------------------------------------+--------------------------------------------------+----------------------------------------+---------------------------------------+-----------------+
| DeepAR                                 | `\ding{52}`{=latex}                              | `\ding{52}`{=latex}                    | 0.789$\pm$0.025                       | 0.611$\pm$0.007 |
+----------------------------------------+--------------------------------------------------+----------------------------------------+---------------------------------------+-----------------+
| TFT                                    | `\ding{52}`{=latex}                              | `\ding{52}`{=latex}                    | 0.670$\pm$0.020                       | 0.579$\pm$0.011 |
+----------------------------------------+--------------------------------------------------+----------------------------------------+---------------------------------------+-----------------+
| `\multirow{4}{*}{TSMixer-Ext}`{=latex} |                                                  |                                        | 0.737$\pm$0.033                       | 0.000$\pm$0.000 |
+----------------------------------------+--------------------------------------------------+----------------------------------------+---------------------------------------+-----------------+
|                                        | `\ding{52}`{=latex}                              |                                        | 0.657$\pm$0.046                       | 0.000$\pm$0.000 |
+----------------------------------------+--------------------------------------------------+----------------------------------------+---------------------------------------+-----------------+
|                                        |                                                  | `\ding{52}`{=latex}                    | 0.697$\pm$0.028                       | 0.000$\pm$0.000 |
+----------------------------------------+--------------------------------------------------+----------------------------------------+---------------------------------------+-----------------+
|                                        | `\ding{52}`{=latex}                              | `\ding{52}`{=latex}                    | **0.640**$\pm$0.013                   | 0.568$\pm$0.009 |
+----------------------------------------+--------------------------------------------------+----------------------------------------+---------------------------------------+-----------------+

: Evaluation on M5 with auxiliary information.
:::

Large-scale Demand Forecasting
------------------------------

We evaluate TSMixer on the large-scale retail dataset M5 to explore the model's ability to leverage complicated cross-variate information and auxiliary features. M5 comprises thousands of multivariate time series, each with its own historical observations, future time-varying features, and static features, in contrast to the long-term forecasting benchmarks, which typically consist of a single multivariate historical time series. We utilize TSMixer-Ext, the architecture introduced in Sec. `\ref{subsec:tsmixer_aux}`{=latex}, to leverage the auxiliary information. Furthermore, the high proportion of zeros in the target sequence presents an additional challenge for prediction. Therefore, we learn negative binomial distributions, as suggested by @SD20, to better fit the data distribution.
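Concretely, learning a negative binomial output amounts to having the model emit two unconstrained values per step and training with the negative binomial log-likelihood. The snippet below is a minimal, framework-free sketch of the mean/shape parameterization of @SD20; the softplus mapping and function names are our own illustration, not taken from the TSMixer code:

```python
import math

def softplus(x: float) -> float:
    """Map an unconstrained network output to a positive parameter."""
    return math.log1p(math.exp(x))

def neg_binom_nll(z: float, mu_raw: float, alpha_raw: float) -> float:
    """Negative log-likelihood of an observed count z under a negative
    binomial with mean mu and shape alpha (variance = mu + alpha * mu**2),
    the parameterization suggested for demand data by Salinas et al."""
    mu, alpha = softplus(mu_raw), softplus(alpha_raw)
    r = 1.0 / alpha  # "number of failures" parameter
    ll = (math.lgamma(z + r) - math.lgamma(z + 1.0) - math.lgamma(r)
          + r * math.log(r / (r + mu)) + z * math.log(mu / (r + mu)))
    return -ll

# Example: loss contribution of observing zero demand for one step.
nll_zero = neg_binom_nll(0.0, 0.8, 0.3)
```

Because the distribution places discrete mass on zero, it handles zero-heavy demand series more gracefully than a squared-error objective on real-valued outputs.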

#### Forecast with Historical Features Only

First, we compare TSMixer with other baselines using historical features only. As shown in Table `\ref{table:m5_past}`{=latex}, the multivariate models perform much better than the univariate models on this dataset. Notably, PatchTST, which is designed to ignore cross-variate information, performs significantly worse than the multivariate TSMixer and FEDformer. This result underscores the importance of modeling cross-variate information in some forecasting tasks, in contrast to the argument in [@YN23]. Furthermore, TSMixer substantially outperforms FEDformer, a state-of-the-art multivariate model.

TSMixer exhibits a unique value as it is the only model that performs as well as univariate models when cross-variate information is not useful, and it is the best model to leverage cross-variate information when it is useful.

#### Forecast with Auxiliary Information

To understand the extent to which TSMixer can leverage auxiliary information, we compare it against established time series forecasting algorithms, TFT [@LB21] and DeepAR [@SD20]. Table `\ref{table:m5_aux}`{=latex} shows that, with auxiliary features, TSMixer outperforms all other baselines by a significant margin. This result demonstrates the superior capability of TSMixer for modeling complex cross-variate information and effectively leveraging auxiliary features, an impactful capability for real-world time-series data beyond long-term forecasting benchmarks. We also conduct ablation studies by removing the static features and future time-varying features. The results demonstrate that while the impact of static features is more prominent, both static and future time-varying features contribute to the overall performance of TSMixer. This further emphasizes the importance of incorporating auxiliary features in time series forecasting models.

#### Computational Cost

We measure the computational cost of each model with its best hyperparameters on M5. As shown in Table `\ref{table:cost}`{=latex}, TSMixer has a much smaller model size than the RNN- and Transformer-based models. Its training time is similar to that of the other multivariate models; however, its inference is much faster, almost matching that of simple linear models. Note that PatchTST achieves faster inference because it merges the feature dimension into the batch dimension, which yields more parallelism but loses the multivariate information, a key ingredient for high forecasting accuracy on real-world time-series data.
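The reshape behind this trade-off is simple. The toy example below (our illustration, not PatchTST code) shows how the $C$ variates of a batch are folded into the batch dimension, so a single univariate model processes $B \times C$ independent series in parallel:

```python
import numpy as np

B, L, C = 32, 336, 7  # batch size, lookback length, number of variates
x = np.random.randn(B, L, C)

# Fold variates into the batch dimension: each of the B*C rows becomes an
# independent univariate series, enabling more parallelism but discarding
# which variates co-occurred in the same sample.
x_uni = x.transpose(0, 2, 1).reshape(B * C, L, 1)
assert x_uni.shape == (B * C, L, 1)

# The inverse reshape recovers the original multivariate batch.
x_back = x_uni.reshape(B, C, L).transpose(0, 2, 1)
assert np.array_equal(x_back, x)
```

TSMixer keeps the variates together instead, paying a small cost in parallelism to retain the cross-variate interactions.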

```{=latex}
\centering
```
```{=latex}
\small
```
`\label{table:cost}`{=latex}

  Models             `\multicolumn{1}{l}{Multivariate}`{=latex}  Auxiliary feature                         \# of params     `\multicolumn{1}{l}{training time (s)}`{=latex}   `\multicolumn{1}{l}{inference (step/s)}`{=latex}
  ----------------- -------------------------------------------- ----------------------------------------- -------------- ------------------------------------------------- --------------------------------------------------
  Linear                   `\multicolumn{1}{l}{}`{=latex}                                                  1K                                                       2958.18                                                110
  PatchTST                 `\multicolumn{1}{l}{}`{=latex}                                                  26.7K                                                    886.101                                                120
  **TMix-Only**            `\multicolumn{1}{l}{}`{=latex}                                                  6.3K                                                     4073.72                                                110
  Autoformer                    `\ding{52}`{=latex}                                                        471K                                                   119087.64                                                 42
  FEDformer                     `\ding{52}`{=latex}                                                        1.7M                                                    11084.43                                                 56
  **TSMixer**                   `\ding{52}`{=latex}                                                        189K                                                    11077.95                                                 96
  DeepAR                        `\ding{52}`{=latex}              `\multicolumn{1}{c}{\ding{52}}`{=latex}   1M                                                       8743.55                                                105
  TFT                           `\ding{52}`{=latex}              `\multicolumn{1}{c}{\ding{52}}`{=latex}   2.9M                                                    14426.79                                                 22
  **TSMixer-Ext**               `\ding{52}`{=latex}              `\multicolumn{1}{c}{\ding{52}}`{=latex}   244K                                                    11615.87                                                108

  : Computational cost on M5. All models are trained on a single NVIDIA Tesla V100 GPU. All models are implemented in PyTorch, except TFT, which is implemented in MXNet.

Conclusions
===========

We propose TSMixer, a novel architecture for time series forecasting that is designed using MLPs instead of commonly used RNNs and attention mechanisms to obtain superior generalization with a simple architecture. Our results on a wide range of real-world time series forecasting tasks demonstrate that TSMixer is highly effective on both long-term multivariate forecasting benchmarks and real-world large-scale retail demand forecasting tasks. Notably, TSMixer is the only multivariate model that achieves performance similar to univariate models on long-term time series forecasting benchmarks. The TSMixer architecture has significant potential for further improvement, and we believe it will be useful in a wide range of time series forecasting tasks. Potential future work includes further exploring the interpretability of TSMixer, as well as its scalability to even larger datasets. We hope this work will pave the way for more innovative architectures for time series forecasting.

```{=latex}
\bibliographystyle{tmlr}
```
```{=latex}
\clearpage
```
```{=latex}
\appendix
```
Proof of Theorem `\ref{thm:smooth}`{=latex} {#appendix:proof}
===========================================

```{=latex}
\smooth*
```
```{=latex}
\begin{proof}
Without loss of generality, we assume the lookback window starts at $t = 1$ and the historical values are $\boldsymbol{x} \in \mathbb{R}^L$. The ground truth of the future time series is:
\begin{equation*}
    y_i = x(L + i) = x(P + 1 + i) = g(P + 1 + i) + f(P + 1 + i) = g(1 + i) + f(P + 1 + i)
\end{equation*}
Let $\boldsymbol{A} \in \mathbb{R}^{T \times (P+1)}$, and
\begin{equation*}
    \boldsymbol{A}_{ij} =
    \begin{cases}
        1, & \text{if $j = P + 1$ or $j = (i \bmod P) + 1$} \\
        -1, & \text{if $j = 1$} \\
        0, & \text{otherwise}
    \end{cases}, \boldsymbol{b}_i = 0
\end{equation*}
Then
\begin{align*}
    \hat{y}_i &= \boldsymbol{A}_i\boldsymbol{x} + \boldsymbol{b} \\
    &= x_{(i \bmod P) + 1} - x_1 + x_{P+1} \\
    &= x((i \bmod P) + 1) - x(1) + x(P+1)
\end{align*}
So we have:
\begin{align*}
    y_i - \hat{y}_i &= x(P+i+1) - x((i \bmod P) + 1) + x(1) - x(P+1) \\
    &= \left( x(P+i+1) - x((i \bmod P) + 1) \right) + \left( x(1)  - x(P+1) \right) \\
    &= f(P+i+1) - f((i \bmod P) + 1) + g(P+i+1) - g((i \bmod P) + 1) \\
    &+ f(1) - f(P+1) + g(1) - g(P+1) \\
    &= (f(P+i+1) - f(P+1)) - (f((i \bmod P) + 1) - f(1)),
\end{align*}
where the $g$ terms cancel because $g$ is periodic with period $P$. The absolute error between $y_i$ and $\hat{y}_i$ is then bounded by:
\begin{align*}
    |y_i - \hat{y}_i| &= |(f(P+i+1) - f(P+1)) - (f((i \bmod P) + 1) - f(1))| \\
    &\leq |f(P+i+1) - f(P+1)| + |f((i \bmod P) + 1) - f(1)| \\
    &\leq K|(P+i+1) - (P+1)| + K|(i \bmod P + 1) - 1| \\
    &\leq K(i + \min(i, P))
\end{align*}

\end{proof}
```
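As a sanity check, the construction in the proof above can be verified numerically. The snippet below is our own illustration (it takes the lookback length to be $L = P + 1$, as the step $x(L + i) = x(P + 1 + i)$ implies): it builds the linear forecaster from the proof for a periodic-plus-Lipschitz signal and confirms the error bound:

```python
import numpy as np

P, T, K = 7, 12, 0.3
L = P + 1  # lookback length implied by x(L + i) = x(P + 1 + i) in the proof

g = lambda t: np.sin(2 * np.pi * t / P)  # periodic component with period P
f = lambda t: K * t                      # K-Lipschitz trend component
x = lambda t: g(t) + f(t)

i = np.arange(1, T + 1)
# Forecaster from the proof: copy the matching phase of the last period and
# add the trend increment, y_hat_i = x((i mod P) + 1) - x(1) + x(P + 1).
y_hat = x((i % P) + 1) - x(1) + x(P + 1)
y = x(P + 1 + i)  # ground-truth future values

bound = K * (i + np.minimum(i, P))
assert np.all(np.abs(y - y_hat) <= bound + 1e-9)
```

The assertion holds for every horizon step $i$, matching the bound $|y_i - \hat{y}_i| \leq K(i + \min(i, P))$ derived above.
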
Implementation Details {#appendix:imp_detail}
======================

Normalization
-------------

There are three types of normalizations used in the implementation:

1.  Global normalization: Global normalization standardizes all variates of the time series independently as a data pre-processing step. The standardized data is then used for training and evaluation. It is a common setup in long-term time series forecasting experiments to mitigate the effects of different variate scales. For M5, since there is only one target time series (sales), we do not apply global normalization.

2.  Local normalization: In contrast to global normalization, local normalization is applied on each batch as pre-processing or post-processing. For long-term forecasting datasets, we apply reversible instance normalization [@TK22] to ensure a fair comparison with the state-of-the-art results. In M5, we independently scale the sales of all products by their mean for model input and re-scale the model output.

3.  Model-level normalization: We apply batch normalization on long-term forecasting datasets as suggested in [@YN23] and apply layer normalization on M5 as described below.
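The first two levels differ mainly in which statistics are computed where. The following is a minimal sketch with toy data (our illustration; the re-scaling step mirrors reversible instance normalization only in spirit, not the exact method of @TK22):

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.gamma(2.0, 5.0, size=(100, 3))  # (time, variates), toy data

# 1. Global normalization: standardize each variate over the whole dataset
#    once, before any training or evaluation.
mean, std = series.mean(axis=0), series.std(axis=0)
series_global = (series - mean) / std

# 2. Local (instance-level) normalization: normalize each lookback window by
#    its own statistics, then invert the transform on the model output.
window = series_global[:36]                     # one input window
w_mean, w_std = window.mean(axis=0), window.std(axis=0)
model_in = (window - w_mean) / (w_std + 1e-8)
model_out = model_in[-12:]                      # placeholder "forecast"
forecast = model_out * (w_std + 1e-8) + w_mean  # re-scale back
```

Model-level normalization (batch or layer normalization) is, by contrast, part of the network itself rather than a data transform.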

Differences between TSMixer and TSMixer-Ext
-------------------------------------------

Due to the different normalizations between long-term forecasting benchmarks and M5, we slightly modify the mixing layers in TSMixer-Ext to better fit M5. We consider post-normalization rather than pre-normalization [@RX20] because pre-normalization may lead to NaN when the scale of input is too large. Furthermore, we apply layer normalization instead of batch normalization because batch normalization requires much larger batch size to obtain stable statistics of M5. The resulting architecture is shown in Fig. `\ref{fig:m5_arch}`{=latex}.

![Mixing layers in TSMixer-Ext.](m5_arch.png){#fig:m5_arch width="\\textwidth"}

```{=latex}
\revision{
\subsection{Formulae of TSMixer architecture}
\label{appendix:formula}

In this section, we provide the mathematical formulae of each component in TSMixer.

\subsubsection{TSMixer Components}
The TSMixer architecture is composed of several key components, implemented using a combination of linear layers, nonlinear activation functions, dropout, normalization, and residual connections, all standard deep learning operations. 
%This makes TSMixer easy to implement. 
The major components of TSMixer are:
\begin{enumerate}
    \setlength{\itemsep}{1pt}
    \item \textbf{Temporal Projection} and \textbf{Time Mixing}, which are used to model transformations between time steps.
    \item \textbf{Feature Mixing}, which is used to model feature transformations.
    \item \textbf{Conditional Feature Mixing}, which is used to transform hidden features based on the static features $\boldsymbol{S}$.
    \item \textbf{Mixer Layer}, which is the composition of the Time Mixing and the Feature Mixing.
    \item \textbf{Conditional Mixer Layer}, which is the composition of the Time Mixing and the Conditional Feature Mixing.
\end{enumerate}
For the layers involving a change of output size, we use the subscript $A \rightarrow B$ to denote that the size changes from $A$ to $B$.

\paragraph{Temporal Projection}
Given an input matrix $\boldsymbol{X} \in \mathbb{R}^{L \times C}$, the Temporal Projection (TP) is a linear layer that acts on the columns of $\boldsymbol{X}$ (denoted as $\boldsymbol{X}_{*, i}$) and is shared across all columns to project the time series from the input length to the prediction length.
The operation is defined as:
\begin{equation}
\begin{split}
    \text{TP}_{L \rightarrow T}(\boldsymbol{X})_{*, i} &= \boldsymbol{W}_1 \boldsymbol{X}_{*, i} + \boldsymbol{b}_1, \forall i = 1, \dots, C,
\end{split}
\end{equation}
where $\boldsymbol{W}_1 \in \mathbb{R}^{T \times L}$ and $\boldsymbol{b}_1 \in \mathbb{R}^T$ are the weights and biases of the linear layer, respectively.
The subscript $L \rightarrow T$ denotes the mapping between input and output dimensions.

\paragraph{Time Mixing}
Similar to the Temporal Projection, the Time Mixing (TM) acts on all columns of $\boldsymbol{X}$ and applies commonly used deep learning layers to perform temporal feature transformation. The operation is defined as:
\begin{multline}
    \text{TM}(\boldsymbol{X})_{*, i} = \\
    \text{Norm} \left( \boldsymbol{X}_{*, i} + \text{Drop} \left( \sigma \left( \text{TP}_{L \rightarrow L} \left( \boldsymbol{X} \right)_{*, i} \right) \right) \right), \\
    \forall i = 1, \dots, C,
\end{multline}
where $\sigma(\cdot)$ is an activation function, $\text{Drop}(\cdot)$ is dropout and $\text{Norm}(\cdot)$ can be layer normalization or batch normalization.
It is important to note that the normalization is applied on the entire matrix (along both time and feature domain), rather than row-by-row (along the feature domain) as in Transformer-based models.
The TM block allows TSMixer to effectively capture temporal dependencies in the time series data.

\paragraph{Feature Mixing}
The Feature Mixing (FM) is a two-layer residual MLP that acts on the rows of the input matrix $\boldsymbol{X} \in \mathbb{R}^{L \times C}$ and is shared across all rows.
The block is designed to model feature transformations and is applied to each row $\boldsymbol{X}_{j, *}$ of the input matrix.
The operation is defined as:
\begin{gather*}
    \boldsymbol{U}_{j, *} =  \text{Drop} \left( \sigma \left( \boldsymbol{W}_2  \boldsymbol{X}_{j, *} + \boldsymbol{b}_2 \right) \right), \\
    \text{FM}_{C \rightarrow C}(\boldsymbol{X})_{j, *} = \text{Norm} \left(\boldsymbol{X}_{j, *} +\text{Drop} \left(  \boldsymbol{W}_3 \boldsymbol{U}_{j, *} + \boldsymbol{b}_3 \right)\right),\\
    \forall j = 1, \dots, L, \label{eq:fr}
\end{gather*}
where $\boldsymbol{W}_2, \boldsymbol{W}_3 \in \mathbb{R}^{C \times C}$ and $\boldsymbol{b}_2, \boldsymbol{b}_3 \in \mathbb{R}^C$.

When it is necessary to project the features to a different size $H$ ($H \neq C$), TSMixer applies a linear transformation to the residual term:
\begin{gather*}
    \text{FM}_{C \rightarrow H}(\boldsymbol{X})_{j, *} = \text{Norm} \left(\boldsymbol{W}_H\boldsymbol{X}_{j, *} + \boldsymbol{b}_H +\text{Drop} \left(  \boldsymbol{W}_3 \boldsymbol{U}_{j, *} + \boldsymbol{b}_3 \right)\right), \\
    \forall j = 1, \dots, L,
\end{gather*}
where $\boldsymbol{W}_3, \boldsymbol{W}_H \in \mathbb{R}^{H \times C}, \boldsymbol{b}_3, \boldsymbol{b}_H \in \mathbb{R}^{H}$.

\paragraph{Conditional Feature Mixing}
The Conditional Feature Mixing (CFM) is a variation of the FM block that takes into account an associated static feature $\boldsymbol{S} \in \mathbb{R}^{1 \times C_s}$ in addition to the input sequence $\boldsymbol{X} \in \mathbb{R}^{L \times H}$.
The block is designed to transform hidden features depending on the static features.
The operation is defined as:
\begin{align}
    & \boldsymbol{V}_{j, *} = \text{FM}_{C_s \rightarrow H}(\text{Expand}_L(\boldsymbol{S}))_{j, *} \nonumber \\
    &\text{CFM}_{C \rightarrow H}(\boldsymbol{X}, \boldsymbol{S})_{j, *} = \text{FM}_{C+H \rightarrow H}(\boldsymbol{X} \oplus \boldsymbol{V})_{j, *} \\
    &\forall j = 1, \dots, L,
\end{align}
where $\text{Expand}_L(\cdot)$ expands the input along the time dimension by repeating it $L$ times, $\boldsymbol{V} \in \mathbb{R}^{L \times H}$ and $\boldsymbol{X} \oplus \boldsymbol{V} \in \mathbb{R}^{L \times (C + H)}$ is the concatenation of $\boldsymbol{X}$ and $\boldsymbol{V}$ along the feature dimension.

\paragraph{Mixer Layer and Conditional Mixer Layer}
The Mixer Layer (Mix) is a composition of the Time Mixing and Feature Mixing, whereas the Conditional Mixer Layer (CMix) is a composition of the Time Mixing and Conditional Feature Mixing. Both Mix and CMix blocks apply the temporal and feature transformations respectively:
\begin{align}
\text{Mix}_{C \rightarrow H}(\boldsymbol{X}) &= \text{FM}_{C \rightarrow H} \left( \text{TM} (\boldsymbol{X}) \right) \nonumber \\
\text{CMix}_{C \rightarrow H}(\boldsymbol{X}, \boldsymbol{S}) &= \text{CFM}_{C \rightarrow H} \left( \text{TM} (\boldsymbol{X}), \boldsymbol{S} \right). \nonumber
\end{align}

\subsubsection{Basic TSMixer for Multivariate Time Series Forecasting}
For long-term time series forecasting (LTSF) tasks, TSMixer only uses the historical target time series $\boldsymbol{X}$ as input.
A series of mixer blocks are applied to project the input data to a latent representation of size $C$.
The final output is then projected to the prediction length $T$:
\begin{align}
    \boldsymbol{O}_1 &= \text{Mix}_{C \rightarrow C}(\boldsymbol{X}) \nonumber \\
    \boldsymbol{O}_k &= \text{Mix}_{C \rightarrow C}(\boldsymbol{O}_{k-1}), \forall k = 2, \dots, K \nonumber \\
    \hat{\boldsymbol{Y}} &= \text{TP}_{L \rightarrow T}(\boldsymbol{O}_K) \nonumber
\end{align}
where $\boldsymbol{O}_k$ is the latent representation of the $k$-th mixer block and $\hat{\boldsymbol{Y}}$ is the prediction.
We project the sequence to length $T$ after the mixer blocks as $T$ may be quite long in LTSF tasks.
To increase the model capacity, we modify the hidden layers in Feature Mixing by using $\boldsymbol{W}_2 \in \mathbb{R}^{H \times C}, \boldsymbol{W}_3 \in \mathbb{R}^{C \times H}, \boldsymbol{b}_2 \in \mathbb{R}^{H}, \boldsymbol{b}_3 \in \mathbb{R}^{C}$ in Eq.~\eqref{eq:fr}, where $H$ is a hyper-parameter indicating the hidden size.
Another modification is using pre-normalization~\citep{RX20} instead of post-normalization in residual blocks to keep the input scale.

\subsubsection{Extended TSMixer for Time Series Forecasting with Auxiliary Information}
Given input data consisting of a target time series $\boldsymbol{X} \in \mathbb{R}^{L \times C}$, historical features $\hat{\boldsymbol{X}} \in \mathbb{R}^{L \times C_x}$, a priori known future features $\boldsymbol{Z} \in \mathbb{R}^{T \times C_z}$, and static features $\boldsymbol{S} \in \mathbb{R}^{1 \times C_s}$, TSMixer applies a series of conditional feature mixing and conditional mixer layers to project the input data to a latent representation of size $H$. 
The operation of the architecture, consisting of $K$ blocks, is defined as:
\begin{align}
    \boldsymbol{X}' &= \text{CFM}_{C+C_x \rightarrow H}(\text{TP}_{L \rightarrow T}(\boldsymbol{X} \oplus \hat{\boldsymbol{X}}),  \boldsymbol{S}) \nonumber \\
    \boldsymbol{Z}' &= \text{CFM}_{C_z \rightarrow H}(\boldsymbol{Z}, \boldsymbol{S}) \nonumber \\
    \boldsymbol{O}_1 &= \text{CMix}_{2H \rightarrow H}(\boldsymbol{X}' \oplus \boldsymbol{Z}', \boldsymbol{S}) \nonumber \\
    \boldsymbol{O}_k &= \text{CMix}_{H \rightarrow H}(\boldsymbol{O}_{k-1}, \boldsymbol{S}), \forall k = 2, \dots, K \nonumber
\end{align}
where $\boldsymbol{X}' \in \mathbb{R}^{T \times H}$ is the latent representation of all past information projected to the prediction length, $\boldsymbol{Z}' \in \mathbb{R}^{T \times H}$ is the latent representation of future features, and $\boldsymbol{O}_k \in \mathbb{R}^{T \times H}$ is the output of the $k$-th mixer block.
The final output, $\boldsymbol{O}_K$, is then linearly projected to the prediction space, which can be real values or the parameters of a probability distribution (e.g. negative binomial distribution that is commonly used for demand prediction~\citep{SD20}).
}
```
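To make the components formalized above concrete, the following is a minimal NumPy sketch of one mixer layer (Time Mixing followed by Feature Mixing) and the final Temporal Projection. This is our illustration only: random weights, biases and dropout omitted, shapes as in the basic TSMixer; the official implementation linked in the abstract is the reference:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)

def norm(x):
    # Normalize over the whole (time, feature) matrix, as in TSMixer,
    # rather than per row as in Transformer-style layer normalization.
    return (x - x.mean()) / (x.std() + 1e-8)

L_in, T, C = 16, 8, 4
X = rng.standard_normal((L_in, C))

# Time Mixing: a linear map shared across columns, acting on the time axis,
# with a residual connection.
W_t = rng.standard_normal((L_in, L_in)) * 0.1
X_tm = norm(X + relu(W_t @ X))

# Feature Mixing: a two-layer residual MLP shared across rows, acting on
# the feature axis.
W2 = rng.standard_normal((C, C)) * 0.1
W3 = rng.standard_normal((C, C)) * 0.1
U = relu(X_tm @ W2.T)
X_fm = norm(X_tm + U @ W3.T)

# One mixer layer is the composition FM(TM(X)); the final Temporal
# Projection maps the lookback length L to the prediction length T.
W1 = rng.standard_normal((T, L_in))
Y_hat = W1 @ X_fm
assert Y_hat.shape == (T, C)
```

Stacking $K$ such mixer layers before the projection yields the basic TSMixer forward pass described above.
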
Experimental Setup {#appendix:exp_detail}
==================

Long-term time series forecasting datasets
------------------------------------------

For the long-term forecasting datasets (ETTm2, Weather, Electricity, and Traffic), we use publicly available data that have been pre-processed by @HW21, and we follow the experimental settings used in recent papers [@LY22; @TZ22b; @YN23]. Specifically, we standardize each covariate independently and do not re-scale the data when evaluating the performance. We train each model for a maximum of 100 epochs and apply early stopping if the validation loss does not improve for 5 epochs.

M5 dataset
----------

We obtain the M5 dataset from Kaggle[^1]. Please refer to the participants' guide for details about the competition and the dataset. We follow the example script in GluonTS [@gluonts_jmlr][^2] and the repository of the third-place solution[^3] in the competition to implement our basic feature engineering. We list the features used in our experiments in Table `\ref{table:m5_feature}`{=latex}.

```{=latex}
\centering
```
::: {#table:m5_feature}
  Static features   Time-varying features
  ----------------- ------------------------------------------------------
  state\_id         snap\_CA
  store\_id         snap\_TX
  category\_id      snap\_WI
  department\_id    event\_type\_1
  item\_id          event\_type\_2
  mean\_sales       normalized\_price\_per\_item
                    normalized\_price\_per\_group
                    day\_of\_week
                    day\_of\_month
                    day\_of\_year
                    sales (prediction target, only available in history)

  : Static features and time-varying features used in our experiments.
:::

Our implementation is based on GluonTS. We use the TFT and DeepAR implementations provided in GluonTS, and implement PatchTST, FEDformer, and our TSMixer ourselves. We modified these models where necessary to optimize the negative binomial distribution, as suggested in the DeepAR paper [@SD20]. We train each model for a maximum of 300 epochs and employ early stopping if the validation loss does not improve for 30 epochs. We noticed that optimizing other objective functions may yield significantly worse results when evaluated with WRMSSE. To obtain more stable results, for all models, we take the top 8 hyperparameter settings based on validation WRMSSE, train each for an additional 4 trials (5 trials in total), select the best hyperparameters based on their mean validation WRMSSE, and then report the evaluation results on the test set. The hyperparameter settings can be found in Appendix `\ref{appendix:best_hp}`{=latex}.

Effects of Lookback Window Size
===============================

We show the effects of different lookback window sizes $L = \{96, 336, 512, 720\}$ with prediction lengths $T = \{96, 192, 336, 720\}$ on ETTm2, Weather, Electricity, and Traffic. The results are shown in Fig. `\ref{fig:seq_len_full}`{=latex}.

`\label{appendix:seq_len}`{=latex}

![Effects of lookback window size on TSMixer.](seq_len_full.png){#fig:seq_len_full width="\\linewidth"}

Hyperparameters {#appendix:best_hp}
===============

+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{5}{c}{ETTh1}`{=latex}         |                                             |                                                       |                                                 |                                     |
+:===================================+:===============================:+:============================================+:============================================+:======================================================+:================================================+:====================================+
| ```{=latex}                        | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{1}{c}{Learning rate}`{=latex} | `\multicolumn{1}{c}{Blocks}`{=latex}        | `\multicolumn{1}{c}{Dropout}`{=latex}                 | `\multicolumn{1}{c}{Hidden size}`{=latex}       | `\multicolumn{1}{c}{Heads}`{=latex} |
| \endfoot                           |                                 |                                             |                                             |                                                       |                                                 |                                     |
| ```                                |                                 |                                             |                                             |                                                       |                                                 |                                     |
| ```{=latex}                        |                                 |                                             |                                             |                                                       |                                                 |                                     |
| \endlastfoot                       |                                 |                                             |                                             |                                                       |                                                 |                                     |
| ```                                |                                 |                                             |                                             |                                                       |                                                 |                                     |
| Search space                       |                                 |                                             |                                             |                                                       |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Model                              | $T$                             | `\multicolumn{1}{c}{0.001, 0.0001}`{=latex} | `\multicolumn{1}{c}{1, 2, 4, 6, 8}`{=latex} | `\multicolumn{1}{c}{0.1, 0.3, 0.5, 0.7, 0.9}`{=latex} | `\multicolumn{1}{c}{8, 16, 32, 64}`{=latex}     | `\multicolumn{1}{c}{4, 8}`{=latex}  |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TSMixer}`{=latex} | 96                              | 0.0001                                      | 6                                           | 0.9                                                   | 512                                             | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.001                                       | 4                                           | 0.9                                                   | 256                                             | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.001                                       | 4                                           | 0.9                                                   | 256                                             | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.001                                       | 2                                           | 0.9                                                   | 64                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TFT}`{=latex}     | 96                              | 0.001                                       |                                             | 0.3                                                   |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.001                                       |                                             | 0.1                                                   |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.001                                       |                                             | 0.1                                                   |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.001                                       |                                             | 0.1                                                   |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{5}{c}{ETTh2}`{=latex}         |                                             |                                                       |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Search space                       | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{1}{c}{Learning rate}`{=latex} | `\multicolumn{1}{c}{Blocks}`{=latex}        | `\multicolumn{1}{c}{Dropout}`{=latex}                 | `\multicolumn{1}{c}{Hidden size}`{=latex}       | `\multicolumn{1}{c}{Heads}`{=latex} |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Model                              | $T$                             | `\multicolumn{1}{c}{0.001, 0.0001}`{=latex} | `\multicolumn{1}{c}{1, 2, 4, 6, 8}`{=latex} | `\multicolumn{1}{c}{0.1, 0.3, 0.5, 0.7, 0.9}`{=latex} | `\multicolumn{1}{c}{8, 16, 32, 64}`{=latex}     | `\multicolumn{1}{c}{4, 8}`{=latex}  |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TSMixer}`{=latex} | 96                              | 0.0001                                      | 4                                           | 0.9                                                   | 8                                               | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.001                                       | 1                                           | 0.9                                                   | 8                                               | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.0001                                      | 1                                           | 0.9                                                   | 16                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.0001                                      | 2                                           | 0.9                                                   | 64                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TFT}`{=latex}     | 96                              | 0.0001                                      |                                             | 0.9                                                   |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.001                                       |                                             | 0.9                                                   |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.001                                       |                                             | 0.7                                                   |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.001                                       |                                             | 0.7                                                   |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{5}{c}{ETTm1}`{=latex}         |                                             |                                                       |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Search space                       | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{1}{c}{Learning rate}`{=latex} | `\multicolumn{1}{c}{Blocks}`{=latex}        | `\multicolumn{1}{c}{Dropout}`{=latex}                 | `\multicolumn{1}{c}{Hidden size}`{=latex}       | `\multicolumn{1}{c}{Heads}`{=latex} |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Model                              | $T$                             | `\multicolumn{1}{c}{0.001, 0.0001}`{=latex} | `\multicolumn{1}{c}{1, 2, 4, 6, 8}`{=latex} | `\multicolumn{1}{c}{0.1, 0.3, 0.5, 0.7, 0.9}`{=latex} | `\multicolumn{1}{c}{8, 16, 32, 64}`{=latex}     | `\multicolumn{1}{c}{4, 8}`{=latex}  |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TSMixer}`{=latex} | 96                              | 0.0001                                      | 6                                           | 0.9                                                   | 16                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.0001                                      | 4                                           | 0.9                                                   | 32                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.0001                                      | 4                                           | 0.9                                                   | 64                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.0001                                      | 4                                           | 0.9                                                   | 16                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TFT}`{=latex}     | 96                              | 0.001                                       |                                             | 0.5                                                   |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.001                                       |                                             | 0.3                                                   |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.001                                       |                                             | 0.3                                                   |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.001                                       |                                             | 0.9                                                   |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{5}{c}{ETTm2}`{=latex}         |                                             |                                                       |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Search space                       | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{1}{c}{Learning rate}`{=latex} | `\multicolumn{1}{c}{Blocks}`{=latex}        | `\multicolumn{1}{c}{Dropout}`{=latex}                 | `\multicolumn{1}{c}{Hidden size}`{=latex}       | `\multicolumn{1}{c}{Heads}`{=latex} |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Model                              | $T$                             | `\multicolumn{1}{c}{0.001, 0.0001}`{=latex} | `\multicolumn{1}{c}{1, 2, 4, 6, 8}`{=latex} | `\multicolumn{1}{c}{0.1, 0.3, 0.5, 0.7, 0.9}`{=latex} | `\multicolumn{1}{c}{8, 16, 32, 64}`{=latex}     | `\multicolumn{1}{c}{4, 8}`{=latex}  |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TSMixer}`{=latex} | 96                              | 0.001                                       | 8                                           | 0.9                                                   | 256                                             | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.0001                                      | 1                                           | 0.9                                                   | 256                                             | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.0001                                      | 8                                           | 0.9                                                   | 512                                             | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.0001                                      | 8                                           | 0.1                                                   | 256                                             | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TFT}`{=latex}     | 96                              | 0.0001                                      |                                             | 0.7                                                   | 512                                             |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.0001                                      |                                             | 0.3                                                   | 256                                             |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.0001                                      |                                             | 0.3                                                   | 128                                             |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.0001                                      |                                             | 0.1                                                   | 512                                             |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{5}{c}{Weather}`{=latex}       |                                             |                                                       |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Search space                       | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{1}{c}{Learning rate}`{=latex} | `\multicolumn{1}{c}{Blocks}`{=latex}        | `\multicolumn{1}{c}{Dropout}`{=latex}                 | `\multicolumn{1}{c}{Hidden size}`{=latex}       | `\multicolumn{1}{c}{Heads}`{=latex} |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Model                              | $T$                             | `\multicolumn{1}{c}{0.001, 0.0001}`{=latex} | `\multicolumn{1}{c}{1, 2, 4, 6, 8}`{=latex} | `\multicolumn{1}{c}{0.1, 0.3, 0.5, 0.7, 0.9}`{=latex} | `\multicolumn{1}{c}{8, 16, 32, 64}`{=latex}     | `\multicolumn{1}{c}{4, 8}`{=latex}  |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TSMixer}`{=latex} | 96                              | 0.0001                                      | 4                                           | 0.3                                                   | 64                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.0001                                      | 8                                           | 0.7                                                   | 32                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.0001                                      | 2                                           | 0.7                                                   | 8                                               | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.0001                                      | 8                                           | 0.7                                                   | 16                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TFT}`{=latex}     | 96                              | 0.001                                       | 2                                           | 0.9                                                   | 64                                              |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.001                                       | 1                                           | 0.1                                                   | 32                                              |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.001                                       | 1                                           | 0.1                                                   | 32                                              |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.001                                       | 2                                           | 0.7                                                   | 64                                              |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{5}{c}{Electricity}`{=latex}   |                                             |                                                       |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Search space                       | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{1}{c}{Learning rate}`{=latex} | `\multicolumn{1}{c}{Blocks}`{=latex}        | `\multicolumn{1}{c}{Dropout}`{=latex}                 | `\multicolumn{1}{c}{Hidden size}`{=latex}       | `\multicolumn{1}{c}{Heads}`{=latex} |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Model                              | $T$                             | `\multicolumn{1}{c}{0.001, 0.0001}`{=latex} | `\multicolumn{1}{c}{1, 2, 4, 6, 8}`{=latex} | `\multicolumn{1}{c}{0.1, 0.3, 0.5, 0.7, 0.9}`{=latex} | `\multicolumn{1}{c}{64, 128, 256, 512}`{=latex} | `\multicolumn{1}{c}{4, 8}`{=latex}  |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TSMixer}`{=latex} | 96                              | 0.0001                                      | 6                                           | 0.7                                                   | 32                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.0001                                      | 8                                           | 0.7                                                   | 16                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.0001                                      | 6                                           | 0.7                                                   | 64                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.001                                       | 6                                           | 0.7                                                   | 64                                              | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TFT}`{=latex}     | 96                              | 0.0001                                      | 4                                           | 0.5                                                   | 32                                              |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.0001                                      | 6                                           | 0.9                                                   | 8                                               |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.0001                                      | 4                                           | 0.1                                                   | 8                                               |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.001                                       | 4                                           | 0.3                                                   | 64                                              |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{5}{c}{Traffic}`{=latex}       |                                             |                                                       |                                                 |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Search space                       | `\multicolumn{1}{l|}{}`{=latex} | `\multicolumn{1}{c}{Learning rate}`{=latex} | `\multicolumn{1}{c}{Blocks}`{=latex}        | `\multicolumn{1}{c}{Dropout}`{=latex}                 | `\multicolumn{1}{c}{Hidden size}`{=latex}       | `\multicolumn{1}{c}{Heads}`{=latex} |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| Model                              | $T$                             | `\multicolumn{1}{c}{0.001, 0.0001}`{=latex} | `\multicolumn{1}{c}{1, 2, 4, 6, 8}`{=latex} | `\multicolumn{1}{c}{0.1, 0.3, 0.5, 0.7, 0.9}`{=latex} | `\multicolumn{1}{c}{64, 128, 256, 512}`{=latex} | `\multicolumn{1}{c}{4, 8}`{=latex}  |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TSMixer}`{=latex} | 96                              | 0.0001                                      | 8                                           | 0.7                                                   | 256                                             | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.0001                                      | 8                                           | 0.7                                                   | 256                                             | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.0001                                      | 6                                           | 0.7                                                   | 512                                             | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.0001                                      | 2                                           | 0.9                                                   | 256                                             | relu                                |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
| `\multirow{4}{*}{TFT}`{=latex}     | 96                              | 0.001                                       | 4                                           | 0.3                                                   | 64                                              |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 192                             | 0.001                                       | 4                                           | 0.9                                                   | 64                                              |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 336                             | 0.001                                       | 6                                           | 0.7                                                   | 128                                             |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+
|                                    | 720                             | 0.0001                                      | 8                                           | 0.1                                                   | 256                                             |                                     |
+------------------------------------+---------------------------------+---------------------------------------------+---------------------------------------------+-------------------------------------------------------+-------------------------------------------------+-------------------------------------+

: Hyperparameter tuning spaces and best configurations for TSMixer and TFT on long-term forecasting benchmarks.

```{=latex}
\centering
```
                  `\multicolumn{5}{c}{M5}`{=latex}                                                       
  -------------- ---------------------------------- ------------ ------------------- ------------------- -------
  Search space              Learning rate              Blocks          Dropout           Hidden size      Heads
  Model                        0.001                 1, 2, 3, 4   0, 0.05, 0.1, 0.3   64, 128, 256, 512   4, 8
  PatchTST                     0.001                     2                0                  64          
  Autoformer                   0.001                     2                0                  128         
  FEDformer                    0.001                     1                0                  256         
  DeepAR                       0.001                     2              0.05                 256         
  TFT                          0.001                     1              0.05                 64             4
  TSMixer                      0.001                     2                0                  64          

  : Hyperparameter tuning spaces and best configurations for all models on M5.

Alternatives to MLPs {#appendix:alter}
====================

In Section `\ref{sec:linear}`{=latex}, we discuss the advantages of linear models and their time-step-dependent characteristics. Besides linear models and the proposed TSMixer, other architectures also have time-step-dependent weights. In this section, we examine full MLPs and convolutional neural networks (CNNs) as such alternatives. The building block of a full MLP applies a linear operation that maps the vectorized input of $L \times C$ dimensions to the vectorized output of $T \times C$ dimensions. For the CNN, we consider a 1-D convolution layer followed by a linear layer.
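
As a concrete sketch of these two building blocks, the NumPy snippet below shows how a full-MLP block mixes time and features jointly in one linear map, while the CNN block convolves along the time axis before a linear projection. The shapes `L`, `C`, `T` and the kernel size are illustrative choices, not the exact configurations used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
L, C, T = 96, 7, 24  # illustrative lookback, number of variates, horizon

x = rng.standard_normal((L, C))  # one multivariate input window

# Full-MLP block: one linear map from the flattened (L x C) input to the
# flattened (T x C) output, so time and features are mixed jointly.
W_mlp = rng.standard_normal((T * C, L * C)) * 0.01
y_mlp = (W_mlp @ x.reshape(-1)).reshape(T, C)

# CNN block: a 1-D convolution along the time axis (kernel shared across
# variates), followed by a linear layer mapping the conv output to the horizon.
k = 3
kernel = rng.standard_normal(k)
conv = np.stack([np.convolve(x[:, c], kernel, mode="valid") for c in range(C)],
                axis=1)                       # shape (L - k + 1, C)
W_lin = rng.standard_normal((T, L - k + 1)) * 0.01
y_cnn = W_lin @ conv                          # shape (T, C)
```

Both blocks produce a $T \times C$ forecast, but note that the full-MLP weight matrix already has $(L \cdot C)(T \cdot C)$ entries for a single layer.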

The results of this evaluation, conducted on the ETTm2 and Weather datasets, are presented in Table `\ref{table:alter}`{=latex}. They show that full MLPs have the highest computational cost yet perform worse than both TSMixer and CNNs. The CNNs, in turn, perform similarly to TSMixer on the Weather dataset but significantly worse on ETTm2, which is a more non-stationary dataset. The main difference from TSMixer is that both full MLPs and CNNs mix time and feature information simultaneously in each linear operation, whereas TSMixer alternates between time mixing and feature mixing. This alternating design allows TSMixer to use a large lookback window, which is favorable both theoretically (Section `\ref{sec:linear}`{=latex}) and empirically (previous ablation), while keeping the number of parameters reasonable, which leads to better generalization. In contrast, the parameter counts of full MLPs and CNNs grow faster than that of TSMixer as the lookback window $L$ increases, making them more prone to overfitting.
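
To make the parameter-growth argument concrete, the following back-of-the-envelope count (bias terms omitted, a single block, and a hypothetical feature-mixing hidden size) contrasts a full-MLP block with TSMixer's separate time- and feature-mixing MLPs:

```python
def full_mlp_params(L, C, T):
    # One dense map from the flattened L*C input to the flattened T*C output:
    # the count scales as L * T * C^2 (time and features multiply).
    return (L * C) * (T * C)

def tsmixer_block_params(L, C, hidden):
    # Time-mixing MLP (L -> L) plus feature-mixing MLP (C -> hidden -> C):
    # the time and feature terms are additive rather than multiplicative.
    return L * L + C * hidden + hidden * C

C, T, hidden = 7, 96, 64  # illustrative variate count, horizon, hidden size
for L in (96, 336, 720):
    print(L, full_mlp_params(L, C, T), tsmixer_block_params(L, C, hidden))
```

With these illustrative settings, enlarging the lookback from 96 to 720 adds roughly 2.9M parameters to the full-MLP block but only about 0.5M to the TSMixer block, since only the $L \times L$ time-mixing term depends on $L$.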

```{=latex}
\centering
```
```{=latex}
\setlength
```
```{=latex}
\tabcolsep{3.3pt}
```
::: {#table:alter}
            `\multicolumn{2}{c|}{Models}`{=latex}            `\multicolumn{2}{c|}{TSMixer}`{=latex}   `\multicolumn{2}{c|}{Full MLP}`{=latex}   `\multicolumn{2}{c}{CNN}`{=latex}                              
  --------------------------------------------------------- ---------------------------------------- ----------------------------------------- ----------------------------------- ------- ------- ----------- -------
            `\multicolumn{2}{c|}{Metric}`{=latex}                             MSE                                       MAE                                    MSE                   MAE     MSE       MAE     
    `\multicolumn{1}{c|}{\multirow{4}{*}{ETTm2}}`{=latex}                      96                                    **0.163**                              **0.252**               0.441   0.486     0.232     0.334
               `\multicolumn{1}{c|}{}`{=latex}                                192                                    **0.216**                              **0.290**               1.028   0.755     0.323     0.410
               `\multicolumn{1}{c|}{}`{=latex}                                336                                    **0.268**                              **0.324**               1.765   1.049     0.616     0.593
               `\multicolumn{1}{c|}{}`{=latex}                                720                                    **0.420**                              **0.422**               2.724   1.305     2.009     1.214
   `\multicolumn{1}{c|}{\multirow{4}{*}{Weather}}`{=latex}                     96                                    **0.145**                              **0.198**               0.190   0.279     0.149     0.220
               `\multicolumn{1}{c|}{}`{=latex}                                192                                    **0.191**                              **0.242**               0.250   0.338     0.194     0.263
               `\multicolumn{1}{c|}{}`{=latex}                                336                                    **0.242**                              **0.280**               0.298   0.375   **0.242**   0.306
               `\multicolumn{1}{c|}{}`{=latex}                                720                                      0.320                                **0.336**               0.360   0.422   **0.293**   0.355

  : Comparison with other MLP-like alternatives.
:::

[^1]: <https://www.kaggle.com/competitions/m5-forecasting-accuracy/data>

[^2]: <https://github.com/awslabs/gluonts/blob/dev/examples/m5_gluonts_template.ipynb>

[^3]: <https://github.com/devmofl/M5_Accuracy_3rd>
