Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

Source

Core Claim

Time-MoE argues that sparse mixture-of-experts scaling can make time-series forecasting foundation models larger and more capable without forcing inference cost to grow with total parameter count.

Key Contributions

  • Introduces a decoder-only autoregressive time-series forecasting foundation model with sparse temporal mixture-of-experts layers, causal attention, point-wise tokenization, and multi-resolution forecasting heads.
  • Builds Time-300B, a cleaned large-scale pretraining corpus spanning more than nine domains and about 309B time points.
  • Scales the model family up to a 2.4B total-parameter model with 1.1B activated parameters, while also training smaller 50M and 200M activated-parameter variants for lower-cost inference.
  • Reports that Time-MoE reduces average forecasting error relative to existing time-series foundation models in zero-shot settings, and relative to strong full-shot forecasting baselines after a single epoch of downstream fine-tuning.

Benchmarked Models

| Model | Role In Paper | Notes | Official Artifact |
| --- | --- | --- | --- |
| Time-MoE-50M | Base benchmarked Time-MoE model | 12 layers, 12 heads, 8 experts, top-2 routing, 384 hidden size, 50M activated parameters, 113M total parameters | Maple728/TimeMoE-50M |
| Time-MoE-200M | Large benchmarked Time-MoE model | 12 layers, 12 heads, 8 experts, top-2 routing, 768 hidden size, 200M activated parameters, 453M total parameters | Maple728/TimeMoE-200M |
| Time-MoE-Ultra | Largest benchmarked Time-MoE model | 36 layers, 16 heads, 8 experts, top-2 routing, 1024 hidden size, 1.1B activated parameters, 2.4B total parameters | No public checkpoint listed |

Method Notes

Time-MoE is a passive forecasting model rather than an action-conditioned world model: it consumes observed time-series histories and predicts future numeric values, without explicit action, control input, or intervention channels. The paper handles multivariate time series through channel independence, turning each channel into a univariate sequence, so cross-channel dynamics are not the primary modeling target.
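
The channel-independence step above can be sketched as follows: a multivariate series of shape (T, C) is split into C univariate sequences that the model forecasts independently. The function name is illustrative, not from the Time-MoE codebase.

```python
def channel_independent(series):
    """Split a (T, C) multivariate series into C univariate sequences."""
    num_channels = len(series[0])
    return [[row[c] for row in series] for c in range(num_channels)]

# Example: 3 time steps, 2 channels -> 2 univariate sequences of length 3.
multivariate = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(channel_independent(multivariate))  # [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
```

Each resulting sequence is then tokenized point-wise and forecast on its own, which is why cross-channel dynamics fall outside the model's primary target.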

The architecture replaces dense feed-forward layers with sparse mixture-of-experts layers consisting of routed experts plus one always-active shared expert. Each time point is routed to a small top-k subset of the routed experts, and an auxiliary load-balancing loss discourages routing collapse onto a few experts.
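
A minimal sketch of the routing step, under the description above: a softmax gate scores the routed experts, only the top-k are evaluated, and a shared expert is always added. The renormalization of the top-k gate weights and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import math

def route_top_k(x, experts, shared_expert, gate_logits, k=2):
    """Combine an always-active shared expert with the top-k routed experts."""
    # Softmax over routing logits for this token (time point).
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]
    probs = [e / sum(exps) for e in exps]
    # Keep only the k highest-probability experts and renormalize their weights.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    routed = sum(probs[i] / norm * experts[i](x) for i in top)
    return shared_expert(x) + routed

# Toy experts: simple scalings of the input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
out = route_top_k(1.0, experts, lambda x: 0.5 * x, gate_logits=[0.0, 0.0, 1.0, 2.0])
```

Because only k experts run per time point, inference cost tracks the activated-parameter budget rather than the total parameter count, which is the scaling argument of the paper.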

Multi-resolution forecasting heads predict horizons of 1, 8, 32, and 64 time steps during training. At inference, a greedy scheduling procedure combines these heads autoregressively so the model can forecast flexible horizons rather than only one fixed output length.
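
The greedy scheduling above can be sketched as a horizon decomposition: repeatedly pick the largest head that does not overshoot the remaining horizon, then apply heads autoregressively chunk by chunk. The function name is illustrative.

```python
def schedule_horizons(target, head_sizes=(64, 32, 8, 1)):
    """Greedily decompose a target horizon into head-sized forecast chunks."""
    plan, remaining = [], target
    while remaining > 0:
        # Largest available head that still fits in the remaining horizon.
        step = next(h for h in head_sizes if h <= remaining)
        plan.append(step)
        remaining -= step
    return plan

print(schedule_horizons(96))   # [64, 32]
print(schedule_horizons(100))  # [64, 32, 1, 1, 1, 1]
```

Since the smallest head predicts a single step, any positive horizon can be covered, which is what lets one trained model serve the 96/192/336/720 evaluation horizons.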

Evidence And Results

  • Zero-shot forecasting is evaluated on ETTh1, ETTh2, ETTm1, ETTm2, Weather, and Global Temp with horizons 96, 192, 336, and 720; the paper reports more than 20% average MSE reduction over the most competitive zero-shot baselines.
  • In-distribution forecasting fine-tunes pretrained Time-MoE models for one epoch on the same benchmark family; the paper reports 24% average MSE reduction over recent full-shot forecasting baselines.
  • Sparse-vs-dense scaling experiments report that Time-MoE reduces training cost by 78% and inference cost by 39% relative to dense variants with comparable activated-parameter budgets.
  • Ablations report worse average MSE when removing mixture-of-experts layers, multi-resolution heads, Huber loss, or the auxiliary routing-balance loss.
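
For reference, the Huber loss ablated above is the standard robust objective, quadratic near zero and linear in the tails; the delta threshold is a hyperparameter, and the paper's exact setting is not restated here.

```python
def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)

print(huber(0.5, 0.0))  # 0.125 (quadratic regime)
print(huber(3.0, 0.0))  # 2.5 (linear regime)
```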

Limitations

  • The paper is focused on forecasting accuracy and scaling behavior, not time-series reasoning, causal discovery, action-conditioned simulation, or intervention planning.
  • Channel-independent handling of multivariate time series improves universality but may miss important cross-channel structure in domains where coupled dynamics matter.
  • The largest (ultra) model has no publicly listed Hugging Face checkpoint, so the released artifacts most directly support the 50M and 200M variants.

Open Questions

  • How much of Time-MoE’s gain comes from sparse expert routing versus Time-300B data scale and cleaning?
  • Would explicit multivariate coupling improve results on domains where channel interactions are central?
  • Can sparse forecasting models like Time-MoE become useful backbones for reasoning-oriented systems such as TimeOmni-1?