---
abstract: |
  Recently, channel-independent methods have achieved state-of-the-art performance in multivariate time series (MTS) forecasting. Despite reducing overfitting risks, these methods miss potential opportunities to exploit channel dependence for accurate predictions. We argue that there exist locally stationary lead-lag relationships between variates, i.e., some lagged variates may follow the leading indicators within a short time period. Exploiting such channel dependence is beneficial since leading indicators offer advance information that can be used to reduce the forecasting difficulty of the lagged variates. In this paper, we propose a new method named LIFT that first efficiently estimates leading indicators and their leading steps at each time step and then judiciously allows the lagged variates to utilize the advance information from leading indicators. LIFT works as a plugin that can seamlessly collaborate with arbitrary time series forecasting methods. Extensive experiments on six real-world datasets demonstrate that LIFT improves the state-of-the-art methods by 5.4% in average forecasting performance. Our code is available at <https://github.com/SJTU-DMTai/LIFT>.
author:
- |
  Lifan Zhao\
  Shanghai Jiao Tong University\
  `mogician233@sjtu.edu.cn`\
  `\And `{=latex} Yanyan Shen\
  Shanghai Jiao Tong University\
  `shenyy@sjtu.edu.cn`\
bibliography:
- main.bib
title: 'Rethinking Channel Dependence for Multivariate Time Series Forecasting: Learning from Leading Indicators'
---

```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvu{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\model}{{LIFT}~}
```
```{=latex}
\newcommand{\modelns}{{LIFT}}
```
```{=latex}
\newcommand{\eat}[1]{}
```
```{=latex}
\maketitle
```
Introduction
============

Multivariate time series (MTS) forecasting is a fundamental task in various domains such as weather, traffic, and finance. An MTS consists of multiple channels (*a.k.a.*, variates[^1]), where each channel is a univariate time series. Many MTS forecasting studies argue that each channel depends on the other channels. Accordingly, numerous approaches adopt *channel-dependent* (CD) strategies and *jointly* model multiple variates by advanced neural architectures, including GNNs [@MTGNN; @StemGNN; @CrossGNN; @FourierGNN], MLPs [@TSMixerGoogle; @TSMixerKDD2023; @TimeMixer; @yi2023frequencydomain], CNNs [@TimesNet], Transformers [@Informer; @BasisFormer; @CARD; @iTransformer], and others [@shen2024multiresolution; @jia2023witran; @fan2024mgtsd].

Unexpectedly, CD methods have been outperformed by recently proposed channel-independent (CI) methods [@PatchTST; @PITS; @GPT4TS; @jin2024timellm; @cao2024tempo; @chen2024pathformer; @dai2024periodicity] and even simple linear models [@DLinear; @RLinear; @FITS]. These CI methods *separately* forecast each univariate time series based on its own historical values, instead of referring to other variates. While modeling only cross-time dependence, CI Transformers [@PatchTST; @GPT4TS] surprisingly outperform CD Transformers that jointly model cross-time and cross-variate dependence [@STformer; @Crossformer]. One reason is that existing CD methods lack prior knowledge about channel dependence and may encounter overfitting [@RevisitCI]. This gives rise to an interesting question: *is there any explicit channel dependence that is effective for MTS forecasting?*

In this work, we turn the spotlight on the **locally stationary lead-lag relationship** between variates. An intriguing yet underestimated characteristic of many MTS is that the evolution of some variates may lag behind that of other variates, termed leading indicators. Leading indicators may directly influence the evolution of other variates, but the influence requires a certain time delay to propagate and take effect. For example, an increasing concentration of an anti-fever drug in the blood may cause a decrease in body temperature after an hour but not immediately. Another common case is that both leading indicators and lagged variates depend on some latent factors, while the leading ones are the first to be affected. For example, a typhoon first cools down coastal cities and, after a few days, cools down inland cities.

```{=latex}
\centering
```
```{=latex}
\subcaptionbox{\label{fig:locally:a}}
```
![Illustration of locally stationary lead-lag relationships. (a) On training data, three variates $v_1$, $v_2$, and $v_3$ share similar temporal patterns (see colors) across the lookback window and horizon window, while $v_1$ and $v_2$ run ahead of $v_3$ by four and two steps, respectively. However, the leading indicators and leading steps remain static only for a short period. (b) On test data, $v_1$ is no longer a leading indicator, and $v_2$ also changes its leading step to five.](training_sample.png "fig:"){#fig:locally width=".465\\linewidth"} `\hspace{1.5em}`{=latex} `\subcaptionbox{\label{fig:locally:b}}`{=latex} ![Illustration of locally stationary lead-lag relationships. (a) On training data, three variates $v_1$, $v_2$, and $v_3$ share similar temporal patterns (see colors) across the lookback window and horizon window, while $v_1$ and $v_2$ run ahead of $v_3$ by four and two steps, respectively. However, the leading indicators and leading steps remain static only for a short period. (b) On test data, $v_1$ is no longer a leading indicator, and $v_2$ also changes its leading step to five.](test_sample.png "fig:"){#fig:locally width=".465\\linewidth"}

As such effects typically change little within a certain period, the lead-lag relationships are locally stationary once established. As illustrated in Figure `\ref{fig:locally:a}`{=latex}, the lagged variate and its leading indicators share similar temporal patterns across the lookback window and the horizon window. If a leading indicator evolves $\delta$ steps ahead of the target variate, the latest $\delta$ steps of its lookback window will share similar temporal patterns with the future $\delta$ steps of the lagged variate. Particularly, when the lagged variate completely follows its leading indicator, the difficulty of forecasting $H$ steps for the lagged variate can be reduced to that of forecasting $H-\delta$ steps by previewing the advance information. Despite the benefits of lead-lag relationships, **the dynamic variation in leading indicators and leading steps** poses a challenge to modeling channel dependence. As shown in Figure `\ref{fig:locally}`{=latex}, the specific leading indicators and the corresponding leading steps can vary over time.
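The "advance information" intuition above can be made concrete with a toy numpy example (illustrative only; the series and the shift are synthetic): when a lagged variate follows a leading indicator by $\delta$ steps, the last $\delta$ observed values of the leader already preview the first $\delta$ future values of the lagged variate.

```python
import numpy as np

# Toy illustration: a lagged variate that follows a leading indicator
# with a delay of `delta` steps.
delta = 4
t = np.arange(64)
leader = np.sin(0.3 * t)           # leading indicator
lagged = np.roll(leader, delta)    # lagged variate: leader delayed by delta

# At the end of a lookback window ending at step `end`, the last `delta`
# observed values of the leader match the first `delta` *future* values
# of the lagged variate, i.e., they are advance information.
end = 60
assert np.allclose(leader[end - delta:end], lagged[end:end + delta])
```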

In light of this, we propose a method named `\model `{=latex}(short for *Learning from **L**eading **I**ndicators **F**or M**T**S Forecasting*), involving three key steps. *First*, we develop an efficient cross-correlation computation algorithm to dynamically estimate the leading indicators and the leading steps at each time step. *Second*, as depicted in Figure `\ref{fig:motivation}`{=latex}, we align each variate and its leading indicators via a target-oriented shift trick. *Third*, we employ a backbone to make preliminary predictions and introduce a Lead-aware Refiner to calibrate the rough predictions. It is noteworthy that many MTS are heterogeneous, where the variates are different dimensions of an object (*e.g.*, wind speed, humidity, and air temperature in weather). In these cases, the lagged variates may be correlated with the leading indicators by sharing only a part of temporal patterns. To address this issue, we exploit desirable signals in the frequency domain and realize the Lead-aware Refiner by an Adaptive Frequency Mixer that adaptively filters out undesirable frequency components of leading indicators and absorbs the remaining desirable ones. The main contributions of this paper are summarized as follows.

-   We propose a novel method called `\model `{=latex}that exploits the locally stationary lead-lag relationship between variates for MTS forecasting. `\model `{=latex}works as a plug-and-play module and can seamlessly incorporate arbitrary time series forecasting backbones.

-   We introduce an efficient algorithm to estimate the leading indicators and the corresponding leading steps at any time step. We further devise a Lead-aware Refiner that adaptively leverages the informative signals of leading indicators in the frequency domain to refine the predictions of lagged variates.

-   Extensive experimental results on six real-world datasets demonstrate that `\model `{=latex}significantly improves the state-of-the-art methods in both short-term and long-term MTS forecasting. Specifically, `\model `{=latex}makes an average improvement of **7.9%** over CI models and **3.0%** over CD models. We also introduce a lightweight yet strong method LightMTS, which enjoys high parameter efficiency and achieves the best performance on popular Weather and Electricity datasets.

```{=latex}
\centering
```
```{=latex}
\subcaptionbox{Overfitting training patterns in Figure~\ref{fig:locally:a}\label{fig:misplaced}}
```
![Illustration of our key idea. In one case of test data, $v_1$ no longer leads $v_3$. Instead, the leading indicators of $v_3$ are $v_2$ and $v_4$, which lead by five and three steps, respectively. An intuitive idea is to shift $v_2$ and $v_4$ by the corresponding leading steps to keep them always aligned with $v_3$. `\label{fig:motivation}`{=latex}](misplaced.png "fig:"){width=".43\\linewidth"} `\hspace{1.5em}`{=latex} `\subcaptionbox{Proposed target-oriented shift\label{fig:aligned}}`{=latex} ![Illustration of our key idea. In one case of test data, $v_1$ no longer leads $v_3$. Instead, the leading indicators of $v_3$ are $v_2$ and $v_4$, which lead by five and three steps, respectively. An intuitive idea is to shift $v_2$ and $v_4$ by the corresponding leading steps to keep them always aligned with $v_3$. `\label{fig:motivation}`{=latex}](aligned.png "fig:"){width=".46\\linewidth"}

Preliminaries
=============

A multivariate time series (MTS)[^2] is denoted by $\boldsymbol{\mathcal X}=\left\{\mathcal X^{(1)}, \cdots, \mathcal X^{(C)}\right\}$, where $C$ is the number of variates (*a.k.a.* channels) and $\mathcal X^{(j)}$ is the time series of the $j$-th variate. Given an $L$-length lookback window $\boldsymbol{\mathcal{X}}_{t-L+1: t}=\{\mathcal X^{(j)}_{t-L+1}, \cdots, \mathcal X^{(j)}_{t}\}_{j=1}^{C}\in \mathbb{R}^{C \times L}$, the MTS forecasting task at time $t$ aims to predict $H$ consecutive future time steps in the horizon window, *i.e.*, $\boldsymbol{\mathcal X}_{t+1: t+H} \in \mathbb{R}^{C \times H}$.

We assume $\mathcal X^{(j)}_{t+1:t+H}$ is similar to $\mathcal X^{(i)}_{t+1-\delta:t+H-\delta}$ if variate $i$ leads variate $j$ by $\delta$ steps at time $t$. Through the lens of locally stationary lead-lag relationships, one can use recent observations to estimate the leading indicators and the leading steps. Specifically, the lead-lag relationship can be quantified by the *cross-correlation coefficient* between $\mathcal X^{(i)}_{t-L+1-\delta:t-\delta}$ and $\mathcal X^{(j)}_{t-L+1:t}$, which is defined as follows.

```{=latex}
\begin{definition}[Cross-correlation coefficient]Assuming variate $i$ is $\delta$ steps ahead of variate $j$ over the $L$-length lookback window, the cross-correlation coefficient between the two variates at time $t$ is defined as:
    \begin{equation}\label{eq:cross corr}
        R^{(j)}_{i,t}(\delta) = \frac{\operatorname{Cov}(\mathcal X^{(i)}_{t-L+1-\delta:t-\delta}, \mathcal X^{(j)}_{t-L+1:t})}{\sigma^{(i)}\sigma^{(j)}}  = \frac{1}{L}\sum_{t^\prime=t-L+1}^{t}{\frac{\mathcal X^{(i)}_{t^\prime-\delta}-\mu^{(i)}}{\sigma^{(i)}}\cdot\frac{\mathcal X^{(j)}_{t^\prime}-\mu^{(j)}}{\sigma^{(j)}}},
    \end{equation}
    where $\mu^{(\cdot)}\in \mathbb R$ and $\sigma^{(\cdot)}\in \mathbb R$ represent the mean and standard deviation of the corresponding univariate time series within the lookback window, respectively.
\end{definition}
```
The LIFT Approach
=================

In this section, we propose our `\model `{=latex}method that dynamically identifies leading indicators and adaptively leverages them for MTS forecasting.

Overview {#sec:overview}
--------

Figure `\ref{fig:framework}`{=latex} depicts the overview of `\modelns`{=latex}, which involves six major steps as follows.

```{=latex}
\begin{figure*}
    \centering
    \includegraphics[width=\linewidth]{overview.pdf}
    \caption{Overview of \modelns. All layers in the grey background are non-parametric. We depict the input of the lookback window by solid curves and the predictions of the horizon window by dashed curves. As an illustration, we choose the two most possible leading indicators for each target variate, \textit{e.g.}, the orange and the yellow ones are leading indicators of the red at time $t$.}
    \label{fig:framework}
\end{figure*}
```
1.  **Preliminary forecasting.** Given a lookback window $\boldsymbol{{\mathcal{X}}}_{t-L+1:t}$, we first obtain rough predictions $\boldsymbol{\widehat{\mathcal{X}}}_{t+1:t+H}$ from a black-box backbone, which can be implemented by any existing time series forecasting model.

2.  **Instance normalization.** Given $\boldsymbol{{\mathcal{X}}}_{t-L+1:t}$ and $\boldsymbol{\widehat{\mathcal{X}}}_{t+1:t+H}$, we apply instance normalization [@RevIN] without affine parameters so as to unify the value range across the variates. Specifically, based on the mean and standard deviation of each variate in $\boldsymbol{{\mathcal{X}}}_{t-L+1:t}$, we obtain a normalized lookback window $\boldsymbol{{{X}}}_{t-L+1:t}$ and normalized predictions ${\boldsymbol{\widehat{X}}}_{t+1:t+H}$.

3.  **Lead estimation.** Given $\boldsymbol{{{X}}}_{t-L+1:t}$, the Lead Estimator calculates the cross-correlation coefficients for pair-wise variates. For each variate $j$, we select the $K$ most probable leading indicators $\mathcal I^{(j)}_t\in \mathbb R^{K}$ ($K$$\ll$$C$) along with the corresponding leading steps $\{\delta^{(j)}_{i,t}\mid i\in \mathcal I^{(j)}_t\}$ and cross-correlation coefficients $\boldsymbol{{R}}_t^{(j)}\in \mathbb R^{K}$.

4.  **Target-oriented shifts.** After obtaining $\mathcal I^{(j)}_t$ and $\{\delta^{(j)}_{i,t}\}_{i\in \mathcal I^{(j)}_t}$ for variate $j$, we shift ${\boldsymbol X}^{(i)}_{t-L+1:t}$ and ${\widehat{{\boldsymbol X}}}^{(i)}_{t+1:t+H}$ by $\delta^{(j)}_{i,t}$ steps where $i\in \mathcal I^{(j)}_t$. We thereby obtain a $j$-oriented MTS segment $\boldsymbol S^{(j)}_{t}\in \mathbb R^{K\times H}$, where the $K$ leading indicators get aligned with variate $j$ in the horizon window.

5.  **Lead-aware refinement.** The Lead-aware Refiner extracts signals from $\boldsymbol S^{(j)}_{t}$ and refines the normalized preliminary predictions ${\widehat{\boldsymbol X}}^{(j)}_{t+1:t+H}$ as ${\widetilde{\boldsymbol X}}^{(j)}_{t+1:t+H}$.

6.  **Instance denormalization.** Finally, we denormalize ${\widetilde{\boldsymbol X}}^{(j)}_{t+1:t+H}$ with the original mean and standard deviation, yielding the final predictions ${\widetilde{\boldsymbol{ \mathcal{X}}}}^{(j)}_{t+1:t+H}$.
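Steps 2 and 6 above can be sketched as follows; this is a minimal NumPy illustration (the `eps` guard against constant series is our addition), not the released implementation:

```python
import numpy as np

def instance_norm(x, eps=1e-8):
    """Normalize each variate by its lookback-window statistics (step 2).
    x: (C, L) lookback window; returns the normalized window and the statistics."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True) + eps  # eps guards against constant series
    return (x - mu) / sigma, mu, sigma

def instance_denorm(x, mu, sigma):
    """Map the refined predictions back to the original scale (step 6)."""
    return x * sigma + mu
```

The same `mu` and `sigma` computed from the lookback window are reused for denormalization, so the round trip is exact.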

**Training scheme.** We can jointly train the backbone and the Lead-aware Refiner with the MSE loss between $\boldsymbol{\widetilde{\mathcal X}}_{t+1:t+H}$ and the ground truth $\boldsymbol{{\mathcal X}}_{t+1:t+H}$. Alternatively, given a pretrained and frozen backbone, we can precompute the preliminary predictions once on the training data, reducing hyperparameter-tuning time and GPU memory usage during training.

**Technical challenges.** Notably, it is non-trivial to leverage the lead-lag relationships due to efficiency and noise issues. As computing Eq. (`\ref{eq:cross corr}`{=latex}) for a single $\delta$ takes $\mathcal O(L)$ time, a brute-force estimation that searches all possible $\delta \in \{1, \cdots, L\}$ requires $\mathcal O(L^2)$ computations. Also, $\boldsymbol S^{(j)}_{t}$ can contain patterns of the leaders that are irrelevant, and hence noisy, to the lagged variate.

To tackle these issues, we implement the Lead Estimator with an efficient algorithm of $\mathcal O(L\log L)$ complexity, and we develop the Lead-aware Refiner with an Adaptive Frequency Mixer that adaptively generates frequency-domain filters and mixes the desirable frequency components according to the cross-correlations and variate states.

Lead Estimator
--------------

Given the normalized lookback window $\boldsymbol{{{X}}}_{t-L+1:t}$, the Lead Estimator first computes the cross-correlation coefficients between each pair of variates $i$ and $j$, based on an extension of the Wiener--Khinchin theorem [@Wiener1930] (see details in Appendix `\ref{appendix:math}`{=latex}). Formally, we estimate the coefficients for all possible leading steps in $\{0, \cdots, L-1\}$ at once by the following equation: $$\left\{{R}^{(j)}_{i,t}(\tau)\right\}_{\tau=0}^{L-1}=\frac{1}{L}\mathcal F^{-1}\left(\mathcal F(X^{(j)}_{t-L+1:t})\odot\overline{\mathcal F(X^{(i)}_{t-L+1:t})}\right),$$ where $\mathcal F$ is the Fast Fourier Transform, $\mathcal F^{-1}$ is its inverse, $\odot$ is the element-wise product, and the bar denotes complex conjugation. The complexity is thereby reduced to $\mathcal O(L\log L)$.
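The FFT-based estimation can be sketched in NumPy as follows; `cross_correlation` is a hypothetical helper name, and both inputs are assumed to be already instance-normalized:

```python
import numpy as np

def cross_correlation(x_j: np.ndarray, x_i: np.ndarray) -> np.ndarray:
    """Estimate R^{(j)}_{i,t}(tau) for all lags tau in {0, ..., L-1} at once,
    via ifft(fft(x_j) * conj(fft(x_i))) / L, in O(L log L) time."""
    L = len(x_j)
    return np.fft.ifft(np.fft.fft(x_j) * np.conj(np.fft.fft(x_i))).real / L
```

In practice, the FFT of each of the $C$ variates needs to be computed only once and can then be reused for every variate pair.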

Note that variates can exhibit either positive or negative correlations. The leading step $\delta^{(j)}_{i,t}$ between the target variate $j$ and its leading indicator $i$ is chosen to maximize the absolute cross-correlation coefficient, *i.e.*, $$\label{eq:argmax}
        \delta^{(j)}_{i,t}= \arg\max_{1\le \tau\le L-1} |R^{(j)}_{i,t}(\tau)|.$$ For simplicity, we denote the maximum absolute coefficient $|R^{(j)}_{i,t}(\delta^{(j)}_{i,t})|$ as $|R^{(j)*}_{i,t}|$. Then, we choose the $K$ variates that show the most significant lead-lag relationships as the leading indicators of variate $j$, which are defined as: $$\mathcal I^{(j)}_t=\underset{1\le i\le C}{\arg\operatorname{TopK}}(|R^{(j)*}_{i,t}|).$$ Specifically, the $K$ leading indicators in $\mathcal I^{(j)}_t$ are sorted by cross-correlation in descending order, *i.e.*, the $k$-th indicator in $\mathcal I^{(j)}_t$ has the $k$-th highest $|R^{(j)*}_{i,t}|$ *w.r.t.* variate $j$. Furthermore, we use $\boldsymbol{{R}}^{(j)}_t\in \mathbb R^{K}$ to denote the array $\{|R^{(j)*}_{i, t}|\}_{i\in \mathcal I^{(j)}_t}$.
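The lag and leader selection can be sketched as follows; the helper name is ours, and excluding the target variate itself from its own leader set is our assumption:

```python
import numpy as np

def select_leaders(R: np.ndarray, j: int, K: int):
    """Given R of shape (C, L) with R[i, tau] = R^{(j)}_{i,t}(tau), return the
    top-K leading indicators of variate j, their leading steps, and |R*| values."""
    lags = 1 + np.argmax(np.abs(R[:, 1:]), axis=1)   # best positive lag per candidate
    strength = np.abs(R[np.arange(len(R)), lags])    # |R*_{i,t}| at that lag
    strength = strength.copy()
    strength[j] = -np.inf                            # assumption: exclude the target itself
    leaders = np.argsort(-strength)[:K]              # sorted in descending order
    return leaders, lags[leaders], strength[leaders]
```

The sign of each selected correlation, needed later by the $\mathtt{turn}$ operation, can be recovered as `np.sign(R[leaders, lags[leaders]])`.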

Notably, our Lead Estimator is non-parametric, so we can precompute its estimations once on the training data instead of repeating the computation at every epoch.

Lead-aware Refiner
------------------

For each variate $j$, the Lead-aware Refiner refines the normalized preliminary predictions ${\widehat{\boldsymbol X}}^{(j)}_{t+1:t+H}$ using its leading indicators. We describe the refinement process for variate $j$; the other $C-1$ variates are refined in parallel.

#### Target-oriented shifts

For each leading indicator $i\in \mathcal I^{(j)}_t$, we shift its sequence by the leading step as follows: $${{\boldsymbol X}^{(i\rightarrow j)}_{t+1:t+H}}=
    \begin{cases}
        {\boldsymbol X}^{(i)}_{t+1-\delta^{(j)}_{i,t}:\ t+H-\delta^{(j)}_{i,t}}, & {{\rm if~} \delta^{(j)}_{i,t}\ge H}\\
        {\boldsymbol X}^{(i)}_{t+1-\delta^{(j)}_{i,t}:\ t}{\Huge\parallel}\ {\widehat {\boldsymbol X}}^{(i)}_{t+1:\ t+H-\delta^{(j)}_{i,t}}, & {{\rm otherwise}}
    \end{cases}$$ where $\parallel$ denotes concatenation.

For a leading indicator $i$ that is negatively correlated with variate $j$, we flip the sign of its values at each time step to reflect ${R^{(j)*}_{i,t}}<0$. Formally, for each $i\in \mathcal{I}^{(j)}_t$, we have: $$\mathtt{turn}({\boldsymbol X}^{(i\rightarrow j)}_{t+1:t+H}) =\text{sign}({R^{(j)*}_{i,t}})\cdot {\boldsymbol X}^{(i\rightarrow j)}_{t+1:t+H}.$$ We then collect $\{\mathtt{turn}({\boldsymbol X}^{(i\rightarrow j)}_{t+1:t+H})\mid i\in \mathcal I^{(j)}_t\}$ as a target-oriented MTS segment $\boldsymbol S^{(j)}_{t}\in \mathbb R^{K\times H}$.
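The shift-and-flip construction of $\boldsymbol S^{(j)}_{t}$ can be sketched as follows; the function names are ours, and `X_hist` and `X_pred` hold the normalized lookback windows and preliminary predictions of all variates:

```python
import numpy as np

def shift_leader(x_hist, x_pred, delta, H):
    """Target-oriented shift: align a leader's series with the target's horizon."""
    L = len(x_hist)
    if delta >= H:
        # the shifted horizon window is fully covered by observed values
        return x_hist[L - delta : L - delta + H]
    # last `delta` observations, then the first `H - delta` preliminary predictions
    return np.concatenate([x_hist[L - delta:], x_pred[: H - delta]])

def build_segment(X_hist, X_pred, leaders, lags, R_star, H):
    """S_t^{(j)} in R^{K x H}: stack the shifted leaders, sign-flipped when R* < 0."""
    rows = [np.sign(R_star[k]) * shift_leader(X_hist[i], X_pred[i], lags[k], H)
            for k, i in enumerate(leaders)]
    return np.stack(rows)
```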

#### State estimation

For a comprehensive understanding of leading indicators, it is noteworthy that the lead-lag patterns also depend on variate states. Different variates lie in their specific states with some intrinsic periodicities (or trends); *e.g.*, solar illumination is affected by rain in the short term but retains its daily periodicity. The state of a variate may also change over time, exhibiting different correlation strengths with other variates; *e.g.*, correlations between the traffic speeds of two adjacent roads are strong within peak hours but much weaker within off-peak hours. Therefore, the variate states are informative signals that can guide us to filter out uncorrelated patterns.

Assuming there are $N$ states in total, we estimate the state probabilities of variate $j$ at time $t$ by: $${P}^{(j)}_t = \operatorname{softmax}\left(P_{0}^{(j)} + f_{\text{state}}(\mathcal{X}^{(j)}_{t-L+1:t})\right),$$ where $P_0^{(j)} \in \mathbb R^{N}$ represents the intrinsic state distribution of variate $j$ and is a learnable parameter, $f_{\text{state}}:\mathbb R^{L}\mapsto \mathbb R^{N}$ is implemented by a linear layer, and ${P}^{(j)}_t = \{{p}^{(j)}_{t,n}\}_{n=1}^N \in \mathbb R^{N}$ includes the probabilities of all potential states at time $t$. Our adaptive frequency mixer takes ${P}^{(j)}_t$ to generate filters that suppress noisy channel dependence according to the variate state.
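A minimal NumPy sketch of the state estimation is given below; all names are ours, and sharing one $f_{\text{state}}$ across variates is an assumption of this sketch:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def estimate_states(X, P0, W, b):
    """P_t^{(j)} = softmax(P0^{(j)} + f_state(X^{(j)})) for all variates at once.
    X: (C, L) lookback windows; P0: (C, N) learnable intrinsic distributions;
    W: (N, L) and b: (N,) parameterize the linear layer f_state."""
    return softmax(P0 + X @ W.T + b)
```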

#### Adaptive frequency mixer

To extract valuable information from leading indicators, we propose to model cross-variate dependence in the frequency domain. Given the normalized predictions of variate $j$ and its target-oriented MTS segment $\boldsymbol S^{(j)}_{t}$, we derive their Fourier transforms by: $$V^{(j)} = \mathcal F({\widehat{{\boldsymbol X}}}^{(j)}_{t+1:t+H})\quad \text{and} \quad
    \boldsymbol{U}^{(j)} = \mathcal F(\boldsymbol S^{(j)}_{t}),$$ where $\mathcal F$ is the Fast Fourier Transform, $V^{(j)}\in \mathbb C^{\lfloor H / 2\rfloor+1}$, and $\boldsymbol{U}^{(j)}\in \mathbb C^{K\times (\lfloor H / 2\rfloor+1)}$. Each row of $\boldsymbol{U}^{(j)}$, denoted as ${U}^{(j)}_k$, contains the frequency components of the $k$-th leading indicator. Let ${\Delta}^{(j)}_k={U}^{(j)}_k - V^{(j)}$ denote the difference between variate $j$ and the $k$-th leading indicator in the frequency domain.

```{=latex}
\centering
```
![Architecture of the adaptive frequency mixer. ](mixer2.png){#fig:mixer width="85%"}

Intuitively, the preliminary predictions deserve more refinement from the leading indicators when the estimated correlation $\boldsymbol{{R}}_t^{(j)}$ is large. To filter signals in $V^{(j)}$ and $\boldsymbol{U}^{(j)}$, we employ a filter factory to generate $2K+1$ frequency-domain filters as defined below: $$\label{eq:factory}
        [r^{(j)}_{U,1},\cdots,r^{(j)}_{U,K}, r^{(j)}_{\Delta,1},\cdots,r^{(j)}_{\Delta,K},r^{(j)}_{V}] = \sum\nolimits_{n=1}^{N} {p}^{(j)}_{t,n} \cdot f_{n}(\boldsymbol{{R}}_t^{(j)}),$$ where $f_{n}: \mathbb R^{K} \mapsto \mathbb R^{(2K+1)(\lfloor H / 2\rfloor+1)}$ is a linear layer with parameters specific to the $n$-th state. On the one hand, we use the first $2K$ filters to model two kinds of lead-lag relationships: (1) variate $j$ is directly influenced by the $k$-th leader, and the ground-truth $V^{(j)}_{true}$ contains a degree of $U^{(j)}_k$, *e.g.*, $V^{(j)}_{true} \approx {V}^{(j)} + r^{(j)}_{U,k}\odot U^{(j)}_k$; (2) variate $j$ is similar to the $k$-th leader when they are both influenced by a latent factor, and the ground-truth $V^{(j)}_{true}$ is an interpolation between ${V}^{(j)}$ and $U^{(j)}_k$, *e.g.*, $V ^{(j)}_{true} \approx (1-r^{(j)}_{\Delta,k})\odot {V}^{(j)} + r^{(j)}_{\Delta,k}\odot U^{(j)}_k=V^{(j)}+ r^{(j)}_{\Delta,k}\odot{\Delta}^{(j)}_k$. On the other hand, we use $r^{(j)}_{V}\in \mathbb R^{\lfloor H / 2\rfloor+1}$ to dismiss unreliable frequency components of $V^{(j)}$. Formally, we scale the frequency components by: $$\label{eq:filter}
        {\widetilde V}^{(j)} = {r}^{(j)}_{V} \odot {V}^{(j)}, \quad
    {\widetilde U}^{(j)}_k = {r}^{(j)}_{U,k} \odot {U}^{(j)}_k, \quad
    {\widetilde\Delta}^{(j)}_k = {r}^{(j)}_{\Delta,k} \odot {\Delta}^{(j)}_k.$$ Then, we gather information from $K$ leading indicators and mix the frequency components by: $$\begin{gathered}
    \widetilde{V}^{(j)} = g\left({\widetilde V}^{(j)}\parallel \sum\nolimits_{k=1}^{K} {\widetilde U}^{(j)}_k \parallel \sum\nolimits_{k=1}^{K} {\widetilde \Delta}^{(j)}_k \right), \label{eq:mix}\end{gathered}$$ where $g: \mathbb C^{3(\lfloor H / 2\rfloor+1)}\mapsto \mathbb C^{\lfloor H / 2\rfloor+1}$ is a complex-valued linear layer.
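To make the tensor shapes concrete, the filter factory and mixing steps above can be sketched in NumPy as follows. This is a shape-level illustration under our own assumptions (bias-free per-state layers $f_n$, real-valued filters, and hypothetical variable names), not the released implementation:

```python
import numpy as np

def refine(V, U, R, p, F_n, g_w, g_b):
    """One refinement step of the adaptive frequency mixer for variate j.
    V: (Hf,) complex preliminary spectrum, Hf = H // 2 + 1;
    U: (K, Hf) complex spectra of the K shifted leaders;
    R: (K,) correlation strengths; p: (N,) state probabilities;
    F_n: (N, K, (2K+1)*Hf) weights of the per-state linear layers f_n;
    g_w: (Hf, 3*Hf), g_b: (Hf,) complex parameters of the mixing layer g."""
    K, Hf = U.shape
    # filter factory: state-weighted mixture of per-state linear maps of R
    filters = np.einsum('n,nko,k->o', p, F_n, R).reshape(2 * K + 1, Hf)
    r_U, r_D, r_V = filters[:K], filters[K:2 * K], filters[2 * K]
    D = U - V                                         # Delta_k = U_k - V
    mixed = np.concatenate([r_V * V,                  # filtered own spectrum
                            (r_U * U).sum(axis=0),    # aggregated influence terms
                            (r_D * D).sum(axis=0)])   # aggregated difference terms
    return g_w @ mixed + g_b                          # complex-valued linear layer g
```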

Finally, we apply inverse Fast Fourier Transform and denormalization in order to derive the final refined predictions, which are formulated as: $$\mathcal{\tilde{X}}^{(j)}_{t+1:t+H}=\operatorname{denorm}(\mathcal F^{-1}(\widetilde{V}^{(j)})),$$ where we use the mean and standard deviation of $\mathcal{X}^{(j)}_{t-L+1:t}$ for denormalization.

Discussion
----------

**Why CD models show inferior performance.** Many variates are unaligned with each other, while traditional CD models (e.g., Informer [@Informer]) simply mix multivariate information at the same time step. Consequently, they introduce outdated information from lagged variates, which acts as noise and disturbs the prediction of the leaders. Though other models (e.g., Vector Auto-Regression [@VAR]) memorize channel dependence across different time steps with static weights, they can suffer from overfitting since the leading indicators and leading steps vary over time.

**`\model `{=latex}can cooperate with arbitrary time series forecasting backbones.** When combining LIFT with a CI backbone, we decompose MTS forecasting into two stages which focus on modeling time dependence and channel dependence, respectively. This scheme avoids introducing noisy channel dependence during the first stage and may reduce optimization difficulty compared with traditional CD methods. When combining LIFT with a CD backbone, we expect LIFT to refine the rough predictions with the actual observations of leading indicators in $\boldsymbol{S}_t^{(j)}$.

**`\model `{=latex}alleviates distribution shifts by dynamically selecting and shifting indicators.** Existing normalization-based methods [@RevIN; @Dish-TS; @SAN] handle distribution shifts of the statistical properties (*e.g.*, mean and variance) in the lookback window and the horizon window. Our work is orthogonal to them as we take a novel investigation into a different kind of distribution shifts in channel dependence (see visualization in Appendix `\ref{appendix:visualization}`{=latex}).

Lightweight MTS Forecasting with LIFT
=====================================

Thanks to the flexibility of `\modelns`{=latex}, we introduce a lightweight MTS forecasting method named LightMTS, where a simple linear layer serves as a CI backbone. Following [@RLinear], we conduct instance normalization before preliminary forecasting to alleviate distribution shifts.

As we do not learn representations in a high-dimensional latent space, LightMTS is more lightweight than popular CD models, including Transformers [@Crossformer; @iTransformer] and CNNs [@TimesNet]. Empirical evidence is provided in Appendix `\ref{appendix:parameter}`{=latex}, where LightMTS keeps parameter efficiency similar to that of DLinear [@DLinear].

Experiments
===========

Experimental Settings
---------------------

**Datasets.** We conduct extensive experiments on six widely-used MTS datasets, including Weather [@DLinear], Electricity [@MTGNN], Traffic [@LSTNet], Solar [@iTransformer], Wind [@Pyraformer], and PeMSD8 [@Song2020SpatialTemporalSG]. We provide the dataset details in Appendix `\ref{appendix:dataset}`{=latex} and conduct experiments on more datasets in Appendix `\ref{appendix:performance}`{=latex}.

**Comparison Methods.** As LIFT can incorporate arbitrary time series forecasting backbones, we verify the effectiveness of LIFT with (i) *two state-of-the-art CI models*: PatchTST [@PatchTST] and DLinear [@DLinear]; (ii) *the state-of-the-art CD model*: Crossformer [@Crossformer]; (iii) *a classic CD model*: MTGNN [@MTGNN]. We use them to instantiate the backbone of LIFT, while we keep the same model hyperparameters for fair comparison. We also include the baselines of PatchTST, such as FEDformer [@FEDformer] and Autoformer [@Autoformer].

**Setups.** All of the methods follow the same experimental setup with the forecast horizon $H\in\{24, 48, 96, 192, 336, 720\}$ for both short-term and long-term forecasting. We collect some baseline results reported by PatchTST to compare performance with LightMTS, where PatchTST has tuned the lookback length $L$ of FEDformer and Autoformer. For other methods, we set $L$ to $336$. We use Mean Squared Error (MSE) and Mean Absolute Error (MAE) as evaluation metrics.

Performance Evaluation
----------------------

Table `\ref{tab:main}`{=latex} compares the forecasting performance between the four state-of-the-art methods and LIFT on the six MTS datasets, showing that LIFT can outperform the SOTA methods in most cases. Specifically, LIFT improves the corresponding backbone by 5.4% on average.

```{=latex}
\begin{table*}[t]\centering
\caption{Performance comparison in terms of forecasting errors. We highlight the better results between each pair of backbones and LIFT in \textbf{bold} and the best results among all methods on each dataset with \underline{underlines}. We show the relative improvement of LIFT over the corresponding backbone in the rightmost column.}
    \resizebox{\linewidth}{!}{\begin{tabular}{cc|cccc|cccc|cccc|cccc|c}
\toprule
\multicolumn{2}{c|}{\multirow{2}{*}{Method}} & \multicolumn{2}{c}{PatchTST} & \multicolumn{2}{c|}{\textbf{+ LIFT}} & \multicolumn{2}{c}{DLinear} & \multicolumn{2}{c|}{\textbf{+ LIFT}} & \multicolumn{2}{c}{Crossformer} & \multicolumn{2}{c|}{\textbf{+ LIFT}} & \multicolumn{2}{c}{MTGNN} & \multicolumn{2}{c|}{\textbf{+ LIFT}} & \multirow{2}{*}{Impr.} \\
\multicolumn{2}{c|}{} & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE &  \\ \midrule
\multirow{6}{*}{\begin{sideways}Weather\end{sideways}} & 24 & 0.091 & 0.122 & \textbf{0.089} & \textbf{\underline{0.119}} & 0.104 & 0.152 & \textbf{0.090} & \textbf{0.125} & \underline{0.086} & 0.126 & \underline{0.086} & 0.126 & 0.090 & 0.128 & 0.090 & \textbf{0.126} & 4.7\% \\
 & 48 & 0.119 & 0.164 & \textbf{0.115} & \textbf{\underline{0.158}} & 0.137 & 0.194 & \textbf{0.114} & \textbf{0.163} & \underline{0.112} & 0.166 & \underline{0.112} & \textbf{0.165} & 0.117 & 0.170 & \textbf{0.115} & \textbf{0.167} & 5.3\% \\
 & 96 & 0.152 & 0.199 & \textbf{0.146} & \textbf{\underline{0.196}} & 0.176 & 0.237 & \textbf{\underline{0.145}} & \textbf{0.203} & \textbf{\underline{0.145}} & \textbf{0.209} & 0.146 & 0.210 & 0.157 & 0.216 & \textbf{0.154} & \textbf{0.212} & 5.0\% \\
 & 192 & 0.197 & 0.243 & \textbf{0.190} & \textbf{\underline{0.238}} & 0.220 & 0.282 & \textbf{\underline{0.189}} & \textbf{0.249} & 0.197 & 0.264 & \textbf{0.196} & \textbf{0.262} & 0.205 & 0.269 & \textbf{0.203} & \textbf{0.266} & 4.3\% \\
 & 336 & 0.249 & 0.283 & \textbf{\underline{0.243}} & \textbf{\underline{0.281}} & 0.265 & 0.319 & \textbf{\underline{0.243}} & \textbf{0.292} & 0.246 & 0.309 & \textbf{0.245} & \textbf{0.305} & 0.258 & 0.312 & \textbf{0.256} & \textbf{0.308} & 3.0\% \\
 & 720 & 0.320 & 0.335 & \textbf{\underline{0.315}} & \textbf{\underline{0.333}} & 0.323 & 0.362 & \textbf{0.317} & \textbf{0.349} & 0.323 & 0.364 & \textbf{0.321} & \textbf{0.360} & 0.335 & 0.369 & \textbf{0.333} & \textbf{0.365} & 1.4\% \\ \midrule
\multirow{6}{*}{\begin{sideways}Electricity\end{sideways}} & 24 & 0.099 & 0.196 & \textbf{0.094} & \textbf{\underline{0.190}} & 0.110 & 0.209 & \textbf{0.099} & \textbf{0.197} & 0.095 & 0.195 & \textbf{\underline{0.093}} & \textbf{0.193} & 0.097 & 0.195 & \textbf{0.094} & \textbf{0.193} & 3.6\% \\
 & 48 & 0.115 & 0.210 & \textbf{\underline{0.110}} & \textbf{\underline{0.205}} & 0.125 & 0.223 & \textbf{0.113} & \textbf{0.209} & 0.116 & 0.216 & \textbf{0.113} & \textbf{0.211} & 0.116 & 0.215 & \textbf{0.112} & \textbf{0.211} & 4.0\% \\
 & 96 & 0.130 & \underline{0.222} & \textbf{\underline{0.128}} & \underline{0.222} & 0.140 & 0.237 & \textbf{0.130} & \textbf{0.225} & 0.142 & 0.243 & \textbf{0.138} & \textbf{0.238} & 0.138 & 0.238 & \textbf{0.133} & \textbf{0.233} & 2.9\% \\
 & 192 & 0.148 & 0.240 & \textbf{\underline{0.147}} & \textbf{\underline{0.239}} & 0.153 & 0.249 & \textbf{0.148} & \textbf{0.242} & 0.159 & 0.259 & \textbf{0.154} & \textbf{0.251} & 0.160 & 0.261 & \textbf{0.153} & \textbf{0.252} & 2.7\% \\
 & 336 & 0.167 & 0.261 & \textbf{\underline{0.163}} & \textbf{\underline{0.257}} & 0.169 & 0.267 & \textbf{\underline{0.163}} & \textbf{0.261} & 0.192 & 0.293 & \textbf{0.176} & \textbf{0.276} & 0.193 & 0.284 & \textbf{0.187} & \textbf{0.275} & 3.8\% \\
 & 720 & 0.202 & 0.291 & \textbf{\underline{0.195}} & \textbf{\underline{0.289}} & 0.203 & 0.301 & \textbf{0.198} & \textbf{0.295} & 0.264 & 0.353 & \textbf{0.224} & \textbf{0.312} & 0.242 & 0.327 & \textbf{0.216} & \textbf{0.305} & 6.6\% \\ \midrule
\multirow{6}{*}{\begin{sideways}Traffic\end{sideways}} & 24 & 0.323 & 0.235 & \textbf{\underline{0.300}} & \textbf{\underline{0.214}} & 0.371 & 0.267 & \textbf{0.347} & \textbf{0.255} & 0.483 & 0.273 & \textbf{0.392} & \textbf{0.246} & 0.402 & 0.260 & \textbf{0.392} & \textbf{0.259} & 7.3\% \\
 & 48 & 0.342 & 0.240 & \textbf{\underline{0.329}} & \textbf{\underline{0.236}} & 0.393 & 0.276 & \textbf{0.367} & \textbf{0.260} & 0.513 & 0.290 & \textbf{0.428} & \textbf{0.289} & 0.450 & \textbf{0.274} & \textbf{0.436} & 0.281 & 4.4\% \\
 & 96 & 0.367 & 0.251 & \textbf{\underline{0.352}} & \textbf{\underline{0.242}} & 0.410 & 0.282 & \textbf{0.394} & \textbf{0.273} & 0.519 & 0.293 & \textbf{0.462} & \textbf{0.284} & 0.479 & 0.289 & \textbf{0.464} & \textbf{0.286} & 4.2\% \\
 & 192 & 0.385 & 0.259 & \textbf{\underline{0.373}} & \textbf{\underline{0.251}} & 0.423 & 0.287 & \textbf{0.413} & \textbf{0.281} & 0.522 & 0.296 & \textbf{0.490} & \textbf{0.283} & 0.507 & 0.307 & \textbf{0.491} & \textbf{0.301} & 3.3\% \\
 & 336 & 0.398 & 0.265 & \textbf{\underline{0.389}} & \textbf{\underline{0.262}} & 0.436 & 0.296 & \textbf{0.426} & \textbf{0.288} & 0.530 & \textbf{0.300} & \textbf{0.517} & 0.303 & 0.539 & 0.314 & \textbf{0.519} & \textbf{0.309} & 1.9\% \\
 & 720 & 0.434 & 0.287 & \textbf{\underline{0.429}} & \textbf{\underline{0.286}} & 0.466 & 0.315 & \textbf{0.454} & \textbf{0.307} & 0.584 & 0.369 & \textbf{0.543} & \textbf{0.322} & 0.616 & 0.352 & \textbf{0.532} & \textbf{0.340} & 5.4\% \\ \midrule
\multirow{6}{*}{\begin{sideways}Solar\end{sideways}} & 24 & 0.095 & 0.160 & \textbf{0.087} & \textbf{0.147} & 0.133 & 0.219 & \textbf{0.093} & \textbf{0.149} & 0.082 & 0.134 & \textbf{0.079} & \textbf{0.129} & 0.070 & 0.125 & \textbf{\underline{0.069}} & \textbf{\underline{0.122}} & 11.0\% \\
 & 48 & 0.153 & 0.227 & \textbf{0.143} & \textbf{0.200} & 0.190 & 0.267 & \textbf{0.145} & \textbf{0.197} & 0.146 & 0.203 & \textbf{0.140} & \textbf{0.178} & 0.131 & 0.180 & \textbf{\underline{0.130}} & \textbf{\underline{0.177}} & 11.0\% \\
 & 96 & 0.176 & 0.227 & \textbf{0.174} & \textbf{0.224} & 0.222 & 0.291 & \textbf{0.185} & \textbf{0.238} & 0.179 & 0.245 & \textbf{0.174} & \textbf{0.224} & 0.167 & 0.224 & \textbf{\underline{0.166}} & \textbf{\underline{0.223}} & 6.2\% \\
 & 192 & 0.205 & 0.260 & \textbf{0.190} & \textbf{0.245} & 0.249 & 0.309 & \textbf{0.194} & \textbf{0.253} & 0.204 & 0.254 & \textbf{0.197} & \textbf{0.250} & 0.180 & 0.243 & \textbf{\underline{0.179}} & \textbf{\underline{0.239}} & 7.6\% \\
 & 336 & 0.200 & 0.252 & \textbf{0.194} & \textbf{0.249} & 0.269 & 0.324 & \textbf{0.198} & \textbf{0.260} & 0.216 & 0.257 & \textbf{0.204} & \textbf{0.254} & 0.191 & 0.251 & \textbf{\underline{0.190}} & \textbf{\underline{0.245}} & 7.5\% \\
 & 720 & 0.229 & 0.282 & \textbf{0.203} & \textbf{0.261} & 0.271 & 0.327 & \textbf{0.207} & \textbf{\underline{0.260}} & 0.211 & \textbf{0.250} & \textbf{0.202} & 0.255 & 0.197 & 0.256 & \textbf{\underline{0.195}} & \textbf{\underline{0.251}} & 8.5\% \\ \midrule
\multirow{6}{*}{\begin{sideways}Wind\end{sideways}} & 24 & 0.137 & 0.179 & \textbf{0.131} & \textbf{0.175} & 0.151 & 0.198 & \textbf{0.136} & \textbf{0.182} & 0.122 & 0.173 & \textbf{\underline{0.121}} & \textbf{0.168} & 0.124 & 0.172 & 0.124 & \textbf{\underline{0.170}} & 3.8\% \\
 & 48 & 0.163 & 0.200 & \textbf{0.155} & \textbf{0.196} & 0.175 & 0.214 & \textbf{0.159} & \textbf{0.200} & \underline{0.147} & 0.194 & \underline{0.147} & \textbf{0.189} & 0.149 & 0.192 & \textbf{0.148} & \textbf{\underline{0.191}} & 3.4\% \\
 & 96 & 0.186 & 0.216 & \textbf{0.175} & \textbf{0.213} & 0.197 & 0.230 & \textbf{0.177} & \textbf{0.214} & 0.172 & 0.218 & \textbf{0.169} & \textbf{0.208} & 0.170 & 0.211 & \textbf{\underline{0.169}} & \textbf{\underline{0.208}} & 4.1\% \\
 & 192 & 0.204 & 0.229 & \textbf{0.191} & \textbf{0.224} & 0.218 & 0.245 & \textbf{0.193} & \textbf{0.226} & 0.189 & 0.230 & \textbf{0.187} & \textbf{\underline{0.222}} & 0.186 & 0.223 & \textbf{\underline{0.184}} & \textbf{\underline{0.220}} & 4.4\% \\
 & 336 & 0.216 & 0.239 & \textbf{0.202} & \textbf{0.234} & 0.233 & 0.258 & \textbf{0.205} & \textbf{0.238} & 0.201 & 0.240 & \textbf{0.199} & \textbf{0.232} & 0.195 & 0.233 & \textbf{\underline{0.192}} & \textbf{\underline{0.227}} & 4.6\% \\
 & 720 & 0.231 & 0.253 & \textbf{0.215} & \textbf{0.247} & 0.254 & 0.278 & \textbf{0.225} & \textbf{0.256} & 0.237 & 0.286 & \textbf{0.224} & \textbf{0.254} & \underline{0.200} & 0.236 & \underline{0.200} & \textbf{\underline{0.232}} & 6.0\% \\ \midrule
\multirow{6}{*}{\begin{sideways}PeMSD8\end{sideways}} & 24 & 0.289 & 0.247 & \textbf{\underline{0.285}} & \textbf{\underline{0.246}} & 0.361 & 0.318 & \textbf{0.306} & \textbf{0.265} & 0.303 & 0.253 & \textbf{0.299} & \textbf{0.252} & 0.314 & 0.257 & \textbf{0.306} & \textbf{0.256} & 5.2\% \\
 & 48 & 0.367 & 0.281 & \textbf{0.356} & \textbf{0.277} & 0.475 & 0.378 & \textbf{0.386} & \textbf{0.303} & 0.342 & 0.271 & \textbf{\underline{0.340}} & \textbf{\underline{0.270}} & 0.357 & 0.281 & \textbf{0.356} & \textbf{0.279} & 5.8\% \\
 & 96 & 0.445 & 0.316 & \textbf{0.410} & \textbf{0.305} & 0.562 & 0.421 & \textbf{0.449} & \textbf{0.336} & 0.373 & 0.290 & \textbf{\underline{0.368}} & \textbf{\underline{0.286}} & 0.393 & 0.304 & \textbf{0.386} & \textbf{0.297} & 7.6\% \\
 & 192 & 0.519 & 0.354 & \textbf{0.471} & \textbf{0.337} & 0.611 & 0.443 & \textbf{0.502} & \textbf{0.364} & 0.409 & 0.312 & \textbf{\underline{0.399}} & \textbf{\underline{0.303}} & 0.440 & 0.333 & \textbf{0.429} & \textbf{0.324} & 7.6\% \\
 & 336 & 0.562 & 0.366 & \textbf{0.511} & \textbf{0.353} & 0.648 & 0.462 & \textbf{0.532} & \textbf{0.379} & 0.439 & 0.318 & \textbf{\underline{0.430}} & \textbf{\underline{0.310}} & 0.468 & 0.350 & \textbf{0.441} & \textbf{0.333} & 9.1\% \\
 & 720 & 0.653 & 0.403 & \textbf{0.563} & \textbf{0.378} & 0.748 & 0.519 & \textbf{0.597} & \textbf{0.414} & 0.488 & 0.356 & \textbf{\underline{0.468}} & \textbf{\underline{0.338}} & 0.511 & 0.379 & \textbf{0.484} & \textbf{0.342} & 12.3\% \\  \bottomrule
\end{tabular}
}
\label{tab:main}
\end{table*}
```
#### Improvement over CI Backbones.

LIFT makes an average improvement of 7.9% over PatchTST and DLinear on the six datasets. Notably, PatchTST and DLinear surpass Crossformer and MTGNN by a large margin on Weather, Electricity, and Traffic datasets, indicating the challenge of modeling channel dependence. Intriguingly, LIFT significantly improves CI backbones by an average margin of 4.7% on these challenging datasets, achieving the best performance in most cases. This confirms that LIFT can reduce overfitting risks by introducing prior knowledge about channel dependence.

#### Improvement over CD Backbones.

LIFT makes an average improvement of 3.0% over Crossformer and MTGNN on the six datasets. As CD backbones outperform CI ones on Solar, Wind, and PeMSD8, we conjecture that these datasets have fewer distribution shifts in channel dependence, leading to fewer overfitting risks. Even though the CD backbones have benefited from channel dependence, LIFT can still refine their predictions, *e.g.*, improving Crossformer by 4.1% on Solar. This indicates that existing CD approaches cannot fully exploit the lead-lag relationships without prior knowledge about the dynamic variation of leading indicators and leading steps. Moreover, Crossformer mixes information from the variates that show similarity at the same time step but pays insufficient attention to the different yet informative signals of leading indicators. MTGNN learns a static graph structure among variates on the training data and aggregates information within a fixed subset of variates. MTGNN may well suffer from distribution shifts in channel dependence, while LIFT dynamically selects leading indicators and reduces overfitting risks.

#### LightMTS as a Strong Baseline.

Moreover, we compare the performance of LightMTS and all baselines on the Weather, Electricity, and Traffic datasets. We borrow the baseline results from the paper of PatchTST with $H \in \{96, 192, 336, 720\}$. As shown in Figure `\ref{fig:lightmts:a}`{=latex}, LightMTS, with a simple linear layer as its backbone, remains competitive with the state-of-the-art models. In particular, LightMTS surpasses PatchTST, a complex Transformer model, by 3.2% on Weather and 0.7% on Electricity. However, PatchTST significantly outperforms LightMTS on the Traffic dataset. As Traffic contains the greatest number of variates with complex temporal patterns, it requires a strong backbone to model the intricate cross-time dependence. Nevertheless, LightMTS is still the most competitive baseline on Traffic.

Ablation Study
--------------

To verify the effectiveness of our designs, we introduce three variants of LightMTS by removing the influence term $\sum\nolimits_{k=1}^{K} {\widetilde U}^{(j)}_k$ in Eq. (`\ref{eq:mix}`{=latex}), removing the difference term $\sum\nolimits_{k=1}^{K} {\widetilde \Delta}^{(j)}_k$ in Eq. (`\ref{eq:mix}`{=latex}), and directly using $V^{(j)}$, $\sum\nolimits_{k=1}^{K} {U}^{(j)}_k$ and $\sum\nolimits_{k=1}^{K} {\Delta}^{(j)}_k$ in Eq. (`\ref{eq:mix}`{=latex}), respectively.

As shown in Figure `\ref{fig:lightmts:b}`{=latex}, we evaluate these variants with $H$ set to 96, reporting the relative MSE *w.r.t.* LightMTS on the Weather, Electricity, and Traffic datasets. With both the influence and the difference terms involved, LightMTS considers two kinds of lead-lag relationships and keeps the best performance across the datasets. In contrast, LightMTS w/o influence and LightMTS w/o difference only consider one-sided information of leading indicators and thus show inferior performance, especially on the Electricity dataset. Furthermore, LightMTS w/o filter, which fails to adaptively filter out the noise in leading indicators, achieves the worst results in all cases.

```{=latex}
\subcaptionbox{\label{fig:lightmts:a}}
```
![(a) Performance comparison between LightMTS and all baselines; (b) Performance comparison between variants of LightMTS; (c) Performance of DLinear+LIFT under different numbers of the selected leading indicators (*i.e.*, $K$) and the states (*i.e.*, $N$).](LightMTS2.png "fig:"){#fig:LightMTS width=".31\\linewidth"} `\hspace{-1.5mm}`{=latex} `\subcaptionbox{\label{fig:lightmts:b}}`{=latex} ![(a) Performance comparison between LightMTS and all baselines; (b) Performance comparison between variants of LightMTS; (c) Performance of DLinear+LIFT under different numbers of the selected leading indicators (*i.e.*, $K$) and the states (*i.e.*, $N$).](ablation2.png "fig:"){#fig:LightMTS width=".21\\linewidth"} `\subcaptionbox{\label{fig:hyper}}`{=latex} ![(a) Performance comparison between LightMTS and all baselines; (b) Performance comparison between variants of LightMTS; (c) Performance of DLinear+LIFT under different numbers of the selected leading indicators (*i.e.*, $K$) and the states (*i.e.*, $N$).](hyper.png "fig:"){#fig:LightMTS width=".465\\linewidth"}

Hyperparameter Study
--------------------

Our method introduces merely two additional hyperparameters: the number of selected leading indicators $K$ and the number of states $N$. Hyperparameter selection thus requires little effort.

With DLinear as the backbone and $H$ set to 96, we study the hyperparameter sensitivity of LIFT. As shown in Figure `\ref{fig:hyper}`{=latex}, LIFT achieves lower MSE with an increasing $K$ on most datasets. Nevertheless, LIFT may well include more noise when $K$ is too large (*e.g.*, on the Wind dataset), resulting in performance degradation. Besides, LIFT does not enjoy significant improvement with a larger $K$ on the Electricity dataset, where the lead-lag relationships are perhaps sparser. As for variate states, LIFT achieves lower MSE with an increasing $N$ in most cases. We observe the most significant performance drop on Weather when ignoring the variate states. It is noteworthy that the variates of Weather (*e.g.*, wind speed, humidity, and air temperature) are recorded by various kinds of sensors, and the lead-lag patterns naturally vary with the variate states.

Conclusion
==========

In this work, we rethink the channel dependence in MTS and highlight the locally stationary lead-lag relationship between variates. We propose a novel method called `\model `{=latex}that efficiently estimates the relationships and dynamically incorporates leading indicators in the frequency domain for MTS forecasting. `\model `{=latex}can work as a plug-and-play module and is generally applicable to arbitrary forecasting models. We further introduce LightMTS as a lightweight yet strong baseline for MTS forecasting, which keeps similar parameter efficiency to linear models and shows considerable performance. We anticipate that the lead-lag relationship can offer a novel cross-time perspective on the channel dependence in MTS, which is a promising direction for the future development of channel-dependent Transformers or other complex neural networks.

Acknowledgements {#acknowledgements .unnumbered}
================

This work is supported by the National Key Research and Development Program of China (2022YFE0200500), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and SJTU Global Strategic Partnership Fund (2021 SJTU-HKUST).

```{=latex}
\bibliographystyle{iclr2024_conference}
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
Mathematical Proofs {#appendix:math}
===================

Discrete Fourier Transform
--------------------------

Discrete Fourier Transform (DFT) provides a frequency-domain view of discrete time series. Given a particular frequency $f \in \{0, \frac{1}{L}, \cdots, \frac{\lfloor L/2\rfloor}{L}\}$, the corresponding frequency component $U_f$ of univariate time series $u_{t-L+1:t}$ is derived by $$\label{eq:DTFT}
    U_f = \mathcal{F}(u_{t-L+1:t})_f = \sum_{\ell=0}^{L-1} u_{t-L+1+\ell} \cdot e^{-i 2 \pi f \ell},$$ where $i$ is the imaginary unit and $U_f\in \mathbb C$ is a complex number. The inverse DFT is defined by $$\mathcal{F}^{-1}\left(U\right)_\ell=\frac{1}{L}\sum_{f} U_f \cdot e^{i 2 \pi f \ell}.$$
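As a sanity check, the DFT defined in Eq. (`\ref{eq:DTFT}`{=latex}) can be computed directly and matches NumPy's FFT convention; a minimal sketch over all $L$ frequency bins (the window values are illustrative):

```python
import numpy as np

L = 8
rng = np.random.default_rng(0)
u = rng.standard_normal(L)  # lookback window u_{t-L+1:t}

def dft(u):
    """Frequency components per Eq. (DFT), with f = k/L for k = 0..L-1."""
    L = len(u)
    ell = np.arange(L)
    return np.array([np.sum(u * np.exp(-2j * np.pi * (k / L) * ell))
                     for k in range(L)])

U = dft(u)
assert np.allclose(U, np.fft.fft(u))        # same convention as np.fft
assert np.allclose(np.fft.ifft(U).real, u)  # inverse DFT recovers u
```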

We can calculate the amplitude $|U_f|$ and the phase $\phi(U_f)$ of the corresponding cosine signal by $$\begin{gathered}
    \left|U_f\right|=\sqrt{\mathfrak{R}\left\{U_f\right\}^2+\mathfrak{I}\left\{U_f\right\}^2} \quad \text{and} \quad 
    \phi\left(U_f\right)=\tan ^{-1}\left(\frac{\mathfrak{I}\left\{U_f\right\}}{\mathfrak{R}\left\{U_f\right\}}\right),\end{gathered}$$ where $\mathfrak{R}\{U_f\}$ and $\mathfrak{I}\{U_f\}$ denote the real and imaginary parts of $U_f$. Since our proposed real-valued filters scale $\mathfrak{R}\{U_f\}$ and $\mathfrak{I}\{U_f\}$ at the same rate, they only scale the amplitude and keep the phase unchanged.
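The amplitude and phase above, and the phase-preserving property of real-valued scaling, can be checked numerically (the filter weight `h` is an illustrative positive scalar, not a parameter from the paper):

```python
import numpy as np

U = np.fft.fft(np.array([0.3, -1.2, 0.7, 2.0, -0.5, 0.1]))

amp = np.sqrt(U.real**2 + U.imag**2)   # |U_f|
phase = np.arctan2(U.imag, U.real)     # phi(U_f), quadrant-aware tan^{-1}

# A positive real-valued filter scales Re{U_f} and Im{U_f} at the same
# rate, so the amplitude is scaled while the phase stays unchanged.
h = 0.42
assert np.allclose(np.abs(h * U), h * amp)
assert np.allclose(np.angle(h * U), phase)
```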

Efficient Cross-correlation Estimation
--------------------------------------

Given another time series $v$ that lags behind $u$ by $\delta$ steps, we denote its frequency components as $V$ and define the cross-correlation between their lookback windows by $$\label{eq:corr}
\begin{aligned}
    R(\delta)&\triangleq\frac{1}{L}\sum_{\ell=0}^{L-1}u_{t-L+1+\ell-\delta}\cdot v_{t-L+1+\ell}
    =\frac{1}{L}\sum_{\ell=0}^{L-1}u(\ell-\delta) v(\ell).
\end{aligned}$$ With $u(\ell-\delta)$ denoted as $\check u(\delta-\ell)$ and $(\delta-\ell)$ denoted as $\delta'$, we derive the Fourier Transform of $\displaystyle\left\{\sum_{\ell=0}^{L-1}v(\ell)u(\ell-\delta)\right\}_{\delta=0}^{L-1}$ as follows. $$\begin{aligned}
    \mathcal F\left(\sum_{\ell=0}^{L-1}v(\ell) \check u(\delta-\ell)\right)_f
    =&\sum_{\delta=0}^{L-1}\left(\sum_{\ell=0}^{L-1}v(\ell) \check u(\delta-\ell)\right)e^{-i2\pi f\delta}\\
    =&\sum_{\ell=0}^{L-1}v(\ell)\left(\sum_{\delta=0}^{L-1} \check u(\delta-\ell)e^{-i2\pi f\delta}\right)\\
    =&\sum_{\ell=0}^{L-1}v(\ell)e^{-i2\pi f\ell}\left(\sum_{\delta=0}^{L-1} \check u(\delta-\ell)e^{-i2\pi f(\delta-\ell)}\right)\\
    =&V_f \left(\sum_{\delta'=-\ell}^{L-1-\ell} \check u(\delta')e^{-i2\pi f\delta'}\right).\\
\end{aligned}$$ Assuming $\check u$ is $L$-periodic, we have $$\begin{aligned}
    \sum_{\delta'=-\ell}^{L-1-\ell} \check u(\delta')e^{-i2\pi f\delta'}
    =&\sum_{\delta'=-\ell}^{-1} \check u_{t-L+1+\delta'}\cdot e^{-i2\pi f(\delta'+L)} +\sum_{\delta'=0}^{L-1-\ell} \check u_{t-L+1+\delta'}\cdot e^{-i2\pi f\delta'} \\ 
    =&\sum_{\delta'=-\ell}^{-1} \check u_{t-L+1+\delta'+L}\cdot e^{-i2\pi f(\delta'+L)}+\sum_{\delta'=0}^{L-1-\ell} \check u_{t-L+1+\delta'}\cdot e^{-i2\pi f\delta'}\\
    =&\sum_{\delta'=L-\ell}^{L-1} \check u_{t-L+1+\delta'}\cdot e^{-i2\pi f\delta'}+\sum_{\delta'=0}^{L-1-\ell} \check u_{t-L+1+\delta'}\cdot e^{-i2\pi f\delta'}\\
    =&\sum_{\delta'=0}^{L-1} \check u_{t-L+1+\delta'}\cdot e^{-i2\pi f\delta'}\\
    =&\mathcal F(\check u)_f.
\end{aligned}$$ Due to the conjugate symmetry of the DFT of real-valued signals, we have $$\mathcal F(\check u)_f=\overline{\mathcal F(u)_f}=\overline{U_f},$$ where the bar denotes complex conjugation. Thereby, we obtain $$\begin{aligned}
    \mathcal F\left(\sum_{\ell=0}^{L-1}v(\ell)  u(\ell-\delta)\right)_f
    =V_f \overline{U_f}.
\end{aligned}$$ Finally, we can estimate Eq. (`\ref{eq:corr}`{=latex}) as $$\label{eq:corr_freq}
    R(\delta)\approx \frac{1}{L}\mathcal F^{-1}(\mathcal F(v_{t-L+1:t})\odot \overline{\mathcal F(u_{t-L+1:t})})_\delta.$$ Note that $-1 \le R(\delta) \le 1$ when $u_{t-L+1:t}$ and $v_{t-L+1:t}$ have been normalized.
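Eq. (`\ref{eq:corr_freq}`{=latex}) can be verified against the direct $O(L^2)$ circular computation; a minimal sketch with synthetic windows:

```python
import numpy as np

rng = np.random.default_rng(1)
L = 16
u = rng.standard_normal(L)
v = rng.standard_normal(L)

# Direct circular cross-correlation, R(delta) = (1/L) sum_l u(l-delta) v(l),
# with u extended L-periodically as assumed in the derivation above.
R_direct = np.array([np.sum(v * u[(np.arange(L) - d) % L]) / L
                     for d in range(L)])

# FFT-based estimate, Eq. (corr_freq): O(L log L) instead of O(L^2).
R_fft = np.fft.ifft(np.fft.fft(v) * np.conj(np.fft.fft(u))).real / L

assert np.allclose(R_direct, R_fft)
```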

To obtain more accurate results, one can first obtain an approximate leading step $\delta$ by Eq. (`\ref{eq:corr_freq}`{=latex}) and Eq. (`\ref{eq:argmax}`{=latex}), and then compute Eq. (`\ref{eq:cross corr}`{=latex}) with $\{\tau\in \mathbb N\mid \delta-\epsilon\le \tau \le \delta+\epsilon\}$, where $\epsilon \ll L$. We would like to leave this improvement as future work.
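A sketch of this two-stage refinement, using a hypothetical `refine_delta` helper (not part of the released code) that re-evaluates exact lagged correlations within $\pm\epsilon$ of the coarse estimate:

```python
import numpy as np

def refine_delta(u, v, delta0, eps=3):
    """Re-evaluate exact (non-circular) lagged correlations for tau near the
    coarse FFT-based estimate delta0 and return the best tau (eps << L)."""
    L = len(u)
    best_tau, best_r = delta0, -np.inf
    for tau in range(max(1, delta0 - eps), min(L - 1, delta0 + eps) + 1):
        a, b = u[:L - tau], v[tau:]      # u leads v by tau steps
        r = np.corrcoef(a, b)[0, 1]      # exact lagged correlation
        if abs(r) > best_r:
            best_tau, best_r = tau, abs(r)
    return best_tau

# Hypothetical example: v lags u by 5 steps; a coarse estimate of 4 is refined.
rng = np.random.default_rng(2)
u = rng.standard_normal(64)
v = np.roll(u, 5)
assert refine_delta(u, v, delta0=4) == 5
```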

Details of Lead Estimator
=========================

Given the cross-correlation coefficients $\{R^{(j)}_{i,t}(\tau)\mid 0\le \tau \le L-1\}$ between variate $i$ and variate $j$, we identify the leading step by $$\delta^{(j)}_{i,t}= \arg\max_{1\le \tau\le L-2} |R^{(j)}_{i,t}(\tau)|,$$ $$\text{s.t.} \quad |{R}^{(j)}_{i,t}(\tau-1)| < |{R}^{(j)}_{i,t}(\tau)| \quad \text{and} \quad |{R}^{(j)}_{i,t}(\tau)| > |{R}^{(j)}_{i,t}(\tau+1)|, \label{eq:global}$$ which targets the globally maximal absolute cross-correlation among local peaks. Note that Eq. (`\ref{eq:corr_freq}`{=latex}) only estimates cross-correlations with $\tau$ in $\{0, \cdots, L-1\}$. If the real leading step were greater than $L-1$ (*e.g.*, if $|{R}^{(j)}_{i,t}(L)|>|{R}^{(j)}_{i,t}(L-1)|$), we could mistakenly estimate $\delta$ as $L-1$. Therefore, we only consider the peak values as constrained by Eq. (`\ref{eq:global}`{=latex}).
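The peak-constrained selection above can be sketched as follows (the coefficient values are illustrative):

```python
import numpy as np

def leading_step(R):
    """Pick the leading step with the largest |R(tau)| among local peaks,
    i.e., |R(tau-1)| < |R(tau)| and |R(tau)| > |R(tau+1)| (a sketch)."""
    A = np.abs(R)
    peaks = [t for t in range(1, len(R) - 1) if A[t - 1] < A[t] > A[t + 1]]
    if not peaks:                 # no interior peak found
        return None
    return max(peaks, key=lambda t: A[t])

R = np.array([1.0, 0.2, 0.5, 0.3, 0.4, 0.9, 0.8, 0.95])
# tau=2 and tau=5 are peaks; tau=7 (boundary, |R| still rising) is excluded,
# guarding against a true leading step beyond L-1.
assert leading_step(R) == 5
```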

Besides, we further normalize the cross-correlation coefficients $|R^{(j)*}_{i, t}|_{i \in \mathcal I^{(j)}_t}$. As the evolution of the target variate is affected by both itself and the $K$ leading indicators, it is desirable to evaluate the relative leading effects. Specifically, we derive a normalized coefficient for each leading indicator $i\in \mathcal I^{(j)}_t$ by: $${\widetilde{R}}_{i, t}^{(j)} = \frac{\exp |R^{(j)*}_{i, t}|}{\exp R^{(j)}_{j,t}(0) + \sum_{i^\prime\in \mathcal I^{(j)}_t}\exp |R^{(j)*}_{i^\prime, t}|},$$ where $R^{(j)}_{j,t}(0)\equiv1$. Though $\mathcal I^{(j)}_t$ may also involve variate $j$ itself in periodic data, Eq. (`\ref{eq:global}`{=latex}) implies that variate $j$ can only enter via its *last period*. Note that a time series contains not only seasonality (*i.e.*, periodicity) but also a trend. Thus we include $R^{(j)}_{j,t}(0)$ in the denominator to account for the *current* evolution effect of variate $j$ itself beyond its periodicity.
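This softmax-style normalization can be sketched as follows (`normalize_coeffs` is a hypothetical helper and the coefficient values are illustrative):

```python
import numpy as np

def normalize_coeffs(R_star):
    """Normalize |R*| of the K leading indicators against the target's own
    current effect R_jj(0) = 1, per the equation above (a sketch)."""
    a = np.abs(np.asarray(R_star, dtype=float))
    denom = np.exp(1.0) + np.sum(np.exp(a))  # exp(R_jj(0)) + sum_i exp|R*_i|
    return np.exp(a) / denom

w = normalize_coeffs([0.9, 0.6, 0.3])
# The weights are positive and sum to less than 1; the residual mass is
# the leading effect attributed to variate j itself.
assert np.all(w > 0) and w.sum() < 1
```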

In terms of the proposed filter factory, we generate filters based on $\boldsymbol{\widetilde{R}}_t^{(j)}=\{{\widetilde{R}}_{i, t}^{(j)} \mid i\in \mathcal I^{(j)}_t\} \in \mathbb R^{K}$, which represents the proportion of leading effects.

Experimental Details
====================

Dataset Descriptions {#appendix:dataset}
--------------------

```{=latex}
\centering
```
```{=latex}
\resizebox{\linewidth}{!}{
    \begin{tabular}{l|ccccccccc}
    \toprule
    Datasets         & Weather & Electricity & Traffic & Solar  & Wind   & PeMSD8 & ETTm1 & ETTh1 & ILI\\
    \midrule
    \# of variates   & 21      & 321         & 862     & 137    & 28     & 510  &  7 & 7 & 7  \\
\# of timestamps & 52,696  & 26,304      & 17,544  & 52,560 & 50,000 & 17,856 & 69,680 & 17,420 & 966 \\
Sampling Rate    & 10 min  & 1 hour      & 1 hour  & 10 min & 1 hour & 5 min & 15 min & 1 hour & 1 week \\  \bottomrule
    \end{tabular}
    }
```
`\label{tab:dataset}`{=latex}

We provide the statistics of the nine popular MTS datasets in Table `\ref{tab:dataset}`{=latex}. The detailed descriptions are listed as follows.

-   Electricity[^3] includes the hourly electricity consumption (kWh) of 321 clients from 2012 to 2014.

-   Weather[^4] includes 21 weather features, *e.g.*, air temperature and humidity, recorded every 10 min throughout 2020 in Germany.

-   Traffic[^5] includes the hourly road occupancy rates recorded by the sensors of San Francisco freeways from 2015 to 2016.

-   Solar[^6] includes the solar power output of 137 PV plants in Alabama, collected every 10 minutes in 2007.

-   Wind[^7] includes hourly wind energy potential in 28 European countries. We collect the latest 50,000 records (about six years) before 2015.

-   PeMSD8[^8] includes the traffic flow, occupancy, and speed in San Bernardino from July to August 2016, recorded every 5 min by 170 detectors. We treat the dataset as an MTS of 510 channels in most experiments, while only MTGNN models the 170 detectors with three features each.

-   ETT (Electricity Transformer Temperature)[^9] includes seven oil and load features of electricity transformers from July 2016 to July 2018. ETTm1 is 15-minutely collected and ETTh1 is hourly collected.

-   ILI[^10] includes the ratio of patients seen with influenza-like illness to the total number of patients. It includes weekly data from the Centers for Disease Control and Prevention of the United States from 2002 to 2021.

To evaluate the forecasting performance of the baselines, we divide each dataset into the training set, validation set, and test set by the ratio of 7:1:2.
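A sketch of this chronological 7:1:2 split (boundary handling, *e.g.*, whether lookback windows may cross split boundaries, varies across implementations):

```python
def chrono_split(n, ratios=(0.7, 0.1, 0.2)):
    """Chronological train/val/test split by 7:1:2 (no shuffling)."""
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (slice(0, n_train),
            slice(n_train, n_train + n_val),
            slice(n_train + n_val, n))

tr, va, te = chrono_split(26304)  # e.g., the Electricity dataset
assert (tr.stop, va.stop, te.stop) == (18412, 21042, 26304)
```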

Implementation Details
----------------------

All experiments are conducted on a single Nvidia A100 40GB GPU. We use the official implementations of all baselines and follow their recommended hyperparameters. Typically, the batch size is set to 32 for most baselines, while PatchTST recommends 128 for Weather. We adopt the Adam optimizer and search for the optimal learning rate in {0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001}. As for LIFT, we continue the grid search with $K$ in {1, 2, 4, 8, 12, 16} and $N$ in {1, 2, 4, 8, 12, 16}. Note that we stop tuning along one hyperparameter dimension once the performance drops consistently, *i.e.*, we only search a subset of the grid.
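The pruned search along one hyperparameter dimension can be sketched as follows; the validation errors, the `patience` threshold, and the exact stopping rule are illustrative assumptions:

```python
def search_dimension(values, evaluate, patience=2):
    """Search one hyperparameter dimension, pruning the rest of the grid
    once the validation error worsens for `patience` consecutive steps."""
    best_v, best_err = None, float("inf")
    prev_err, n_worse = float("inf"), 0
    for v in values:
        err = evaluate(v)
        if err < best_err:
            best_v, best_err = v, err
        n_worse = n_worse + 1 if err > prev_err else 0
        if n_worse >= patience:  # consistent performance drop: stop early
            break
        prev_err = err
    return best_v, best_err

# Hypothetical validation MSE over K in {1, 2, 4, 8, 12, 16}:
errs = {1: 0.30, 2: 0.27, 4: 0.25, 8: 0.26, 12: 0.28, 16: 0.29}
best_K, best_err = search_dimension([1, 2, 4, 8, 12, 16], errs.__getitem__)
assert (best_K, best_err) == (4, 0.25)  # K=16 is pruned, never evaluated
```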

As the Lead-aware Refiner depends on the preliminary predictions from the backbone, LIFT is sometimes hard to train in the early epochs, especially with complex CD backbones. To speed up convergence, an alternative is to pretrain the backbone for a few epochs and then train the whole framework jointly. In our experiments, we report the best result in Table 2, while DLinear+LIFT and LightMTS are always trained in an end-to-end manner.

Additional Experiments
======================

Efficiency Study {#appendix:parameter}
----------------

```{=latex}
\begin{table*}[t]\centering
\caption{Comparison of practical efficiency of PatchTST, Crossformer, and LIFT with $H=720$. MACs are the number of multiply-accumulate operations per sample. The batch size is set to 1.}
    \resizebox{\linewidth}{!}{
    \begin{tabular}{ll|ccc|ccc}
\toprule
 &  & PatchTST & +LIFT & \makecell[c]{Relative\\ additional cost} & Crossformer & +LIFT & \makecell[c]{Relative\\ additional cost}  \\ \midrule
\multirow{4}{*}{Weather} & Parameter  (M) & 4.3 & 4.7 & 9.3\% & 11.7 & 12.5 & 7.1\% \\
 & MACs (G) & 0.5 & 0.5 & 1.8\% & 4.4 & 4.4 & 0.2\% \\
 & Memory (MB) & 43 & 46 & 7.7\% & 206 & 209 & 1.6\% \\
 & Time (ms) & 4.7 & 6.0 & 29.6\% & 37.7 & 40.6 & 7.6\% \\ \midrule
\multirow{4}{*}{Electricity} & Parameter (M) & 4.3 & 4.7 & 9.3\% & 2.4 & 2.8 & 16.7\% \\
 & MACs (G) & 7.1 & 7.2 & 1.8\% & 8.6 & 8.7 & 1.4\% \\
 & Memory (MB) & 0.4 & 0.4 & 1.2\% & 873 & 878 & 0.6\% \\
 & Time (ms) & 5.1 & 6.0 & 17.4\% & 31.5 & 33.1 & 5.1\% \\ \midrule
\multirow{4}{*}{Traffic} & Parameter (M) & 4.3 & 4.7 & 9.3\% & 3.2 & 3.6 & 12.4\% \\
 & MACs (G) & 19.0 & 19.4 & 1.8\% & 10.6 & 10.9 & 3.2\% \\
 & Memory (MB) & 1061 & 1092 & 2.9\% & 1558 & 1578 & 1.3\% \\
 & Time (ms) & 5.1 & 5.8 & 14.5\% & 32.7 & 36.4 & 11.3\% \\ \midrule
\multirow{4}{*}{Solar} & Parameter (M) & 2.0 & 2.9 & 41.2\% & 1.9 & 2.3 & 23.4\% \\
 & MACs (G) & 0.9 & 1.0 & 5.3\% & 3.7 & 3.7 & 1.6\% \\
 & Memory (MB) & 92 & 96 & 4.3\% & 377 & 381 & 1.0\% \\
 & Time (ms) & 4.9 & 6.0 & 23.4\% & 31.9 & 33.4 & 5.0\% \\ \midrule
\multirow{4}{*}{Wind} & Parameter (M) & 2.0 & 2.5 & 20.6\% & 11.8 & 12.2 & 3.4\% \\
 & MACs (G) & 0.2 & 0.2 & 5.6\% & 5.7 & 5.8 & 0.2\% \\
 & Memory (MB) & 25 & 29 & 13.1\% & 256 & 259 & 1.3\% \\
 & Time (ms) & 4.7 & 6.1 & 28.9\% & 45.2 & 47.2 & 4.4\% \\ \midrule
\multirow{4}{*}{PeMSD8} & Parameter (M) & 2.0 & 2.8 & 35.8\% & 2.9 & 4.0 & 36.5\% \\
 & MACs (G) & 3.5 & 3.7 & 5.6\% & 13.6 & 13.8 & 1.5\% \\
 & Memory (MB) & 319 & 325 & 1.9\% & 1381 & 1387 & 0.4\% \\
 & Time (ms) & 5.6 & 6.0 & 7.0\% & 32.4 & 33.7 & 4.3\% \\  \bottomrule
\end{tabular}
}
\label{tab:efficiency}
\end{table*}
```
Following [@DLinear], we compare the backbones and LIFT by the number of parameters, the number of operations, the GPU memory consumption, and the inference time. As shown in Table `\ref{tab:efficiency}`{=latex}, LIFT requires, on average, 10.7% more parameters, 1.7% more MACs, 2.5% more GPU memory, and 14.2% more inference time than its backbone. It is also noteworthy that our Lead Estimator is non-parametric, and its estimates can be pre-computed once on the training data, reducing the practical training time.

Moreover, we compare the parameter efficiency of LightMTS and all baselines on the six datasets. As shown in Figure `\ref{fig:param}`{=latex}, LightMTS keeps parameter efficiency similar to DLinear, a simple linear model. On average, the parameter size of LightMTS is 1/5 that of PatchTST, 1/25 that of Crossformer, 1/50 that of MTGNN, and 1/70 that of Informer and Autoformer. It is noteworthy that a larger $H$ widens the gap between PatchTST and LightMTS, because PatchTST employs a fully connected layer to decode the $H$-length sequence of high-dimensional hidden states. Although the parameter sizes of Informer and Autoformer are independent of $H$, they remain the most parameter-heavy due to their high-dimensional representations throughout encoding and decoding.

```{=latex}
\centering
```
![The number of model parameters on six datasets with horizon $H$ in {24, 48, 96, 192, 336, 720}. For Crossformer, we follow its recommended lookback length $L$. For Informer and Autoformer, $L$ is 96. For other methods, $L$ is 336.](param.png){#fig:param width="\\textwidth"}

Distribution Shifts {#appendix:visualization}
-------------------

To investigate the dynamic variation of lead-lag relationships, we adopt the proposed lead estimator ($K=2$) to count the lead-lag relationships in the training data and test data. As shown in Figure `\ref{fig:indicator}`{=latex}, some leading indicators of a specific variate in the training data do not maintain their relationships in the test data, while new patterns also emerge in the test data. GNN-based [@MTGNN] and MLP-based [@MTSMixer] methods are susceptible to such distribution shifts of leading indicators due to their static parameter weights in modeling channel dependence.
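The counting procedure behind these frequency maps can be sketched as follows, assuming a hypothetical `lead_estimator` callable that returns the indices of the top-$K$ leading indicators of a target variate in one window:

```python
import numpy as np

def count_relationships(windows, lead_estimator, K=2):
    """Count how often variate i is picked as a top-K leading indicator of
    variate j across sliding windows (a sketch; each window is (C, L))."""
    C = len(windows[0])
    freq = np.zeros((C, C))
    for x in windows:
        for j in range(C):
            for i in lead_estimator(x, j, K):
                freq[i, j] += 1
    return freq / len(windows)  # occurrence frequency per (leader, lagged) cell

# Toy usage: a dummy estimator that always picks variates 0 and 1.
windows = [np.zeros((3, 8)) for _ in range(4)]
freq = count_relationships(windows, lambda x, j, K: [0, 1], K=2)
assert freq.shape == (3, 3)
assert np.allclose(freq[0], 1.0) and np.allclose(freq[2], 0.0)
```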

```{=latex}
\begin{figure*}[b]\centering
    \includegraphics[width=\textwidth]{distribution2.pdf}
    \caption{Distributions of leading indicators ($K=2$) in the training data (\textit{left}) and test data (\textit{mid}) on the Weather dataset, where each cell represents the occurrence frequency of the lead-lag relationship between each pair of variates. The \textit{right} shows the changes in occurrence frequency from training data to test data.}
    \label{fig:indicator}
\end{figure*}
```
Furthermore, we visualize the distribution of leading steps between a pair of variates. We choose a lagged variate and the leading indicator that is most commonly observed across the training and test data. As shown in Figure `\ref{fig:leading step}`{=latex}, some of the leading steps (*e.g.*, 250) observed in the training data rarely reoccur in the test data. By contrast, the leading indicator shows new leading steps (*e.g.*, 40 and 125) in the test data. Moreover, the leading step is not fixed but varies dynamically across phases, increasing the difficulty of modeling channel dependence.

```{=latex}
\centering
```
![The histogram of the leading step between a selected pair of variates in the training data and test data on the Weather dataset. We also estimate the distributions with a kernel density estimator.](Weather_step.png){#fig:leading step width="0.7\\linewidth"}

Performance on Other Datasets {#appendix:performance}
-----------------------------

We conduct more experiments on the Illness, ETTm1, and ETTh1 datasets. As PatchTST and DLinear perform the best on these benchmarks, we employ them as backbones. As shown in Table `\ref{tab:additional}`{=latex}, LIFT fails to bring improvements in some cases. We conjecture that these datasets, composed of only 7 variates, offer insufficient leading indicators for each variate. Nevertheless, it is worth mentioning that abundant variates are available in many practical applications, paving the way for adopting LIFT.

```{=latex}
\centering
```
```{=latex}
\resizebox{.7\linewidth}{!}{
    \begin{tabular}{cc|cc|cc|cc|cc}
\toprule
\multicolumn{2}{c|}{\multirow{2}{*}{Method}} & \multicolumn{2}{c|}{PatchTST} & \multicolumn{2}{c|}{\textbf{+ LIFT}} & \multicolumn{2}{c|}{DLinear} & \multicolumn{2}{c}{\textbf{+ LIFT}} \\
\multicolumn{2}{c|}{} & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE \\ \midrule
\multirow{6}{*}{ETTh1} & 24 & 0.307 & 0.358 & \textbf{0.303} & \textbf{0.356} & 0.323 & 0.367 & \textbf{0.316} & \textbf{0.360} \\
 & 48 & 0.338 & 0.374 & \textbf{0.337} & 0.374 & 0.345 & 0.378 & \textbf{0.342} & \textbf{0.376} \\
 & 96 & 0.375 & 0.399 & \textbf{0.370} & \textbf{0.395} & 0.375 & 0.399 & \textbf{0.372} & \textbf{0.395} \\
 & 192 & 0.414 & 0.421 & \textbf{0.410} & \textbf{0.419} & \textbf{0.405} & \textbf{0.416} & 0.413 & 0.423 \\
 & 336 & 0.431 & 0.436 & 0.433 & \textbf{0.435} & \textbf{0.439} & \textbf{0.443} & 0.453 & 0.453 \\
 & 720 & 0.449 & 0.466 & \textbf{0.447} & \textbf{0.464} & \textbf{0.472} & \textbf{0.490} & 0.509 & 0.512 \\ \midrule
\multirow{6}{*}{ETTm1} & 24 & 0.193 & 0.270 & \textbf{0.190} & \textbf{0.269} & 0.211 & 0.285 & \textbf{0.196} & \textbf{0.275} \\
 & 48 & 0.254 & 0.319 & \textbf{0.252} & \textbf{0.316} & 0.272 & 0.326 & \textbf{0.258} & \textbf{0.321} \\
 & 96 & 0.290 & 0.342 & \textbf{0.287} & \textbf{0.338} & 0.299 & 0.343 & \textbf{0.293} & \textbf{0.342} \\
 & 192 & 0.332 & 0.369 & \textbf{0.329} & \textbf{0.367} & 0.335 & 0.365 & \textbf{0.334} & 0.365 \\
 & 336 & 0.366 & 0.392 & \textbf{0.365} & \textbf{0.390} & 0.369 & \textbf{0.386} & 0.369 & 0.387 \\
 & 720 & 0.420 & 0.424 & \textbf{0.412} & \textbf{0.420} & \textbf{0.425} & \textbf{0.421} & 0.426 & 0.424 \\ \midrule
\multirow{6}{*}{ILI} & 6 & 0.909 & 0.562 & \textbf{0.819} & \textbf{0.543} & 1.115 & 0.713 & \textbf{0.974} & \textbf{0.645} \\
 & 12 & 1.523 & 0.764 & \textbf{1.378} & \textbf{0.726} & 1.844 & 0.941 & \textbf{1.705} & \textbf{0.904} \\
 & 24 & 1.767 & \textbf{0.830} & \textbf{1.677} & 0.841 & 2.215 & 1.081 & \textbf{2.083} & \textbf{1.002} \\
 & 36 & 1.651 & 0.857 & \textbf{1.629} & \textbf{0.830} & 2.301 & 1.076 & \textbf{2.172} & \textbf{1.033} \\
 & 48 & 1.711 & 0.859 & \textbf{1.668} & \textbf{0.844} & 2.316 & 1.091 & \textbf{2.187} & \textbf{1.052} \\
 & 60 & 1.816 & 0.884 & \textbf{1.770} & \textbf{0.880} & 2.445 & 1.123 & \textbf{2.342} & \textbf{1.094} \\ \bottomrule
\end{tabular}
}
```
`\label{tab:additional}`{=latex}

Performance of LightMTS
-----------------------

In Table `\ref{tab:lightmts2}`{=latex}, we compare the forecasting errors of LightMTS with all baselines. The lookback length $L$ is set to 336 for LightMTS, PatchTST, DLinear, Crossformer, and MTGNN. We borrow the results of the other baselines from the PatchTST paper. As shown in Table `\ref{tab:lightmts2}`{=latex}, LightMTS achieves performance comparable to the best baselines in all cases.

```{=latex}
\centering
```
```{=latex}
\resizebox{\linewidth}{!}{
\begin{tabular}{cc|cc|cc|cc|cc|cc|cc|cc|cc}
\toprule
\multicolumn{2}{c|}{Models} & \multicolumn{2}{c|}{LightMTS} & \multicolumn{2}{c|}{PatchTST} & \multicolumn{2}{c|}{DLinear} & \multicolumn{2}{c|}{Crossformer} & \multicolumn{2}{c|}{MTGNN} & \multicolumn{2}{c|}{FEDformer} & \multicolumn{2}{c|}{Autoformer} & \multicolumn{2}{c}{Informer} \\ \midrule
\multicolumn{2}{c|}{Metric} & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE & MSE & MAE \\ \midrule
\multirow{4}{*}{\begin{sideways}Weather\end{sideways}} & 96 & 0.147 & \textbf{0.198} & 0.152 & \underline{0.199} & 0.176 & 0.237 & \textbf{\underline{0.145}} & 0.209 & 0.157 & 0.216 & 0.238 & 0.314 & 0.249 & 0.329 & 0.354 & 0.405 \\
 & 192 & \textbf{0.190} & \textbf{0.238} & \underline{0.197} & \underline{0.243} & 0.220 & 0.282 & \underline{0.197} & 0.264 & 0.205 & 0.269 & 0.275 & 0.329 & 0.325 & 0.370 & 0.419 & 0.434 \\
 & 336 & \textbf{0.240} & \textbf{0.278} & \underline{0.249} & \underline{0.283} & 0.265 & 0.319 & 0.246 & 0.309 & 0.258 & 0.312 & 0.339 & 0.377 & 0.351 & 0.391 & 0.583 & 0.543 \\
 & 720 & \textbf{0.318} & \textbf{0.332} & \underline{0.320} & \underline{0.335} & 0.323 & 0.362 & 0.323 & 0.364 & 0.335 & 0.369 & 0.389 & 0.409 & 0.415 & 0.426 & 0.916 & 0.705 \\  \midrule
\multirow{4}{*}{\begin{sideways}Electricity\end{sideways}} & 96 & \underline{0.131} & \underline{0.224} & \textbf{0.130} & \textbf{0.222} & 0.140 & 0.237 & 0.142 & 0.243 & 0.139 & 0.242 & 0.186 & 0.302 & 0.196 & 0.313 & 0.304 & 0.393 \\
 & 192 & \textbf{0.148} & \textbf{0.240} & \textbf{0.148} & \textbf{0.240} & 0.153 & 0.249 & 0.159 & 0.259 & 0.160 & 0.261 & 0.197 & 0.311 & 0.211 & 0.324 & 0.327 & 0.417 \\
 & 336 & \textbf{0.165} & \textbf{0.257} & \underline{0.167} & \underline{0.261} & 0.169 & 0.267 & 0.192 & 0.293 & 0.193 & 0.284 & 0.213 & 0.328 & 0.214 & 0.327 & 0.333 & 0.422 \\
 & 720 & \underline{0.203} & \textbf{0.288} & \textbf{0.202} & \underline{0.291} & \underline{0.203} & 0.301 & 0.225 & 0.316 & 0.242 & 0.327 & 0.233 & 0.344 & 0.236 & 0.342 & 0.351 & 0.427 \\ \midrule
\multirow{4}{*}{\begin{sideways}Traffic\end{sideways}} & 96 & \underline{0.386} & \underline{0.268} & \textbf{0.367} & \textbf{0.251} & 0.410 & 0.282 & 0.519 & 0.293 & 0.479 & 0.289 & 0.576 & 0.359 & 0.597 & 0.371 & 0.733 & 0.410 \\
 & 192 & \underline{0.405} & \underline{0.276} & \textbf{0.385} & \textbf{0.259} & 0.423 & 0.287 & 0.522 & 0.296 & 0.507 & 0.307 & 0.610 & 0.380 & 0.607 & 0.382 & 0.777 & 0.435 \\
 & 336 & \underline{0.417} & \underline{0.282} & \textbf{0.398} & \textbf{0.265} & 0.436 & 0.296 & 0.530 & 0.300 & 0.539 & 0.314 & 0.608 & 0.375 & 0.623 & 0.387 & 0.776 & 0.434 \\
 & 720 & \underline{0.444} & \underline{0.298} & \textbf{0.434} & \textbf{0.287} & 0.466 & 0.315 & 0.584 & 0.369 & 0.616 & 0.352 & 0.621 & 0.375 & 0.639 & 0.395 & 0.827 & 0.466 \\ \midrule
\multirow{4}{*}{\begin{sideways}ETTh1\end{sideways}} & 96 & \textbf{0.369} & \textbf{0.391} & \underline{0.375}    & \underline{0.399} & \underline{0.375}    & \underline{0.399}  & 0.423 & 0.448 & 0.440 & 0.450 & 0.326 & 0.390 & 0.510 & 0.492 & 0.626 & 0.560 \\
 & 192 & \underline{0.407}    & \textbf{0.416} & 0.414          & 0.421       & \textbf{0.405} & \textbf{0.416} & 0.471 & 0.474 & 0.449 & 0.433 & 0.365 & 0.415 & 0.514 & 0.495 & 0.725 & 0.619 \\
 & 336 & \underline{0.437}    & \textbf{0.435} & \textbf{0.431} & \underline{0.436} & 0.439          & 0.443 & 0.570 & 0.546 & 0.598 & 0.554 & 0.392 & 0.425 & 0.510 & 0.492 & 1.005 & 0.741 \\
 & 720 & \underline{0.451}    & \textbf{0.464} & \textbf{0.449} & \underline{0.466} & 0.472          & 0.490 & 0.653 & 0.621 & 0.685 & 0.620 & 0.446 & 0.458 & 0.527 & 0.493 & 1.133 & 0.845 \\ \midrule
\multirow{4}{*}{\begin{sideways}ETTm1\end{sideways}} & 96 & \textbf{0.285} & \textbf{0.340} & \underline{0.290} & \underline{0.342} & 0.299 & 0.343 & 0.315 & 0.370 & 0.330 & 0.388 & 0.326 & 0.390 & 0.510 & 0.492 & 0.626 & 0.560 \\
 & 192 & \textbf{0.325} & \textbf{0.364} & \underline{0.332} & 0.369 & 0.335 & \underline{0.365} & 0.348 & 0.390 & 0.376 & 0.419 & 0.365 & 0.415 & 0.514 & 0.495 & 0.725 & 0.619 \\
 & 336 & \textbf{0.365} & \textbf{0.384} & \underline{0.366} & 0.392 & 0.369 & \underline{0.386} & 0.414 & 0.432 & 0.432 & 0.461 & 0.392 & 0.425 & 0.510 & 0.492 & 1.005 & 0.741 \\
 & 720 & \underline{0.422} & \underline{0.423} & \textbf{0.420} & 0.424 & 0.425 & \textbf{0.421} & 0.511 & 0.552 & 0.485 & 0.488 & 0.446 & 0.458 & 0.527 & 0.493 & 1.133 & 0.845 \\ 
\bottomrule
\end{tabular}
}
```
`\label{tab:lightmts2}`{=latex}


[^1]: We use the terms "*variate*" and "*channel*" interchangeably.

[^2]: We use bold symbols to denote matrices of multiple variates.

[^3]: <https://archive.ics.uci.edu/dataset/321/electricityloaddiagrams20112014/>

[^4]: <https://www.bgc-jena.mpg.de/wetter/>

[^5]: <http://pems.dot.ca.gov/>

[^6]: <https://www.nrel.gov/grid/solar-power-data/>

[^7]: <https://www.kaggle.com/datasets/sohier/30-years-of-european-wind-generation/>

[^8]: <https://github.com/wanhuaiyu/ASTGCN/>

[^9]: <https://github.com/zhouhaoyi/ETDataset>

[^10]: <https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html>
