TSMixer: An All-MLP Architecture for Time Series Forecasting

Source

Core Claim

TSMixer argues that an all-MLP forecasting architecture can compete with more elaborate recurrent, convolutional, and attention-based forecasters by alternating MLP mixing along the time dimension and the feature dimension.

Key Contributions

  • Introduces Time-Series Mixer, a stack of residual MLP blocks that mix temporal positions and cross-variate features without self-attention.
  • Separates a historical-only multivariate setting from an extended setting that can use static features and known future exogenous variables.
  • Emphasizes that time-step-dependent weights can be a useful forecasting prior, in contrast to data-dependent attention or recurrent gates (see the sketch after this list).
  • Reports competitive long-term forecasting results on common academic datasets and strong results on the M5 retail benchmark.
  • Serves as an architectural ancestor for compact pretrained models such as Tiny Time Mixers (TTM).
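
To make the time-step-dependent-weights point concrete, here is a minimal sketch, assuming PyTorch; it is not the paper's code, and the single-channel view, shapes, and toy attention are illustrative assumptions. The time-mixing matrix is learned once and indexed only by time position, whereas attention recomputes its mixing weights from each input.

```python
import torch
import torch.nn as nn

L = 96  # lookback length (illustrative assumption)

# Time-step-dependent weights: one learned L x L matrix shared across all
# inputs. Entry (i, j) says "output step i reads input step j" regardless
# of the observed values.
time_mix = nn.Linear(L, L)

x = torch.randn(8, L)              # (batch, time), a single channel
y_fixed = time_mix(x)              # mixing weights do not depend on x

# Data-dependent weights: a toy single-feature attention derives its
# L x L mixing matrix from the input itself, so it differs per sample.
q = k = v = x.unsqueeze(-1)        # (batch, time, 1)
attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)  # (batch, L, L)
y_attn = (attn @ v).squeeze(-1)    # (batch, time)
```

The fixed matrix encodes a purely positional prior, e.g. recency or fixed seasonal lags, which is the property the third bullet highlights.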

Method Notes

TSMixer is a passive forecasting model: it predicts future observations from historical observations and optional exogenous variables, without an action, control input, or intervention interface.

The core block alternates time mixing (an MLP applied along the L time steps of each channel) with feature mixing (an MLP applied across the C channels at each time step). Because each MLP mixes along one axis at a time, its weights scale with that axis alone, roughly O(L²) for time mixing and O(C²) for feature mixing, rather than the O(L²C²) of a fully connected mapping over every time-channel entry.
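
A minimal sketch of one such block, assuming PyTorch and an input of shape (batch, L, C); the hidden width, depth, norm placement, and forecast head are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, seq_len: int, n_channels: int, hidden: int = 64):
        super().__init__()
        self.time_norm = nn.LayerNorm(n_channels)
        self.time_mlp = nn.Linear(seq_len, seq_len)  # mixes across time
        self.feat_norm = nn.LayerNorm(n_channels)
        self.feat_mlp = nn.Sequential(               # mixes across channels
            nn.Linear(n_channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Time mixing: MLP along the time axis of each channel, with residual.
        h = self.time_norm(x).transpose(1, 2)        # (batch, C, L)
        x = x + self.time_mlp(h).transpose(1, 2)     # back to (batch, L, C)
        # Feature mixing: MLP across channels at each time step, with residual.
        return x + self.feat_mlp(self.feat_norm(x))

class TSMixerSketch(nn.Module):
    def __init__(self, seq_len: int, horizon: int, n_channels: int, depth: int = 2):
        super().__init__()
        self.blocks = nn.Sequential(*[MixerBlock(seq_len, n_channels) for _ in range(depth)])
        # Forecast head: per-channel linear map from L input steps to H future
        # steps, matching the history-in / horizon-out interface noted above.
        self.head = nn.Linear(seq_len, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.blocks(x)                                    # (batch, L, C)
        return self.head(x.transpose(1, 2)).transpose(1, 2)  # (batch, H, C)

model = TSMixerSketch(seq_len=96, horizon=24, n_channels=7)
print(model(torch.randn(32, 96, 7)).shape)  # torch.Size([32, 24, 7])
```

In this sketch the time-mixing layer holds L² weights and the feature-mixing MLP about 2·C·hidden, while a single fully connected map over the flattened L·C input would need (L·C)² weights.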

TSMixer does not use attention. Its relevance to this wiki is the compact sequence-mixing prior: some time-series tasks may not need attention, with its cost quadratic in sequence length, if the model can mix local temporal structure and cross-channel information directly.

Evidence And Results

  • On long-term forecasting benchmarks, TSMixer is reported as comparable to specialized state-of-the-art models.
  • On M5, the paper reports stronger performance than the compared alternatives, emphasizing the value of auxiliary and cross-variate information.
  • Ablations separate the role of time mixing, feature mixing, static features, and known future features.

Limitations

  • TSMixer is trained per task or dataset in the paper, not released as a broad pretrained time-series foundation model.
  • The evidence is forecasting-centered and does not cover classification, anomaly detection, imputation, or reasoning tasks as first-class targets.
  • The architecture can use exogenous variables, but that is not the same as modeling actions, control inputs, or interventions for counterfactual planning.

Open Questions

  • How much of TTM’s transfer comes from TSMixer-style inductive bias versus the pretraining data mixture and resolution-conditioning recipe?
  • When does feature mixing over channels scale poorly enough that hierarchy, grouping, or token-wise channel modeling becomes necessary?
  • Can mixer-style backbones be extended to explicit action-conditioned world models without losing their serving simplicity?