Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts

Core Claim

Moirai-MoE argues that time-series foundation models should specialize by token-level learned routing instead of hand-assigned frequency buckets: sparse mixture-of-experts layers can let similar local time-series patterns share experts even when their sampling frequencies differ.

Key Contributions

  • Introduces Moirai-MoE, a sparse mixture-of-experts time-series foundation model built on the Moirai family.
  • Replaces Moirai’s multiple frequency-specific input and output projections with a single projection layer plus token-level expert routing inside Transformer blocks.
  • Uses a decoder-only forecasting objective so one training update can supervise multiple context lengths more efficiently than the earlier masked-encoder setup.
  • Proposes a token-cluster gating function, where expert routing is initialized from k-means centroids over pretrained Moirai token representations.
  • Evaluates on 39 datasets across in-distribution Monash forecasting and zero-shot forecasting settings.
  • Releases the implementation through Uni2TS and public small/base Moirai-MoE checkpoints.
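The token-cluster gating idea above can be sketched in a few lines: cluster pretrained token representations with k-means, then score experts by a token's proximity to each centroid and keep the top-k. This is a minimal illustration of the concept, not the authors' implementation; the clustering procedure, distance metric, and softmax weighting here are assumptions.

```python
import numpy as np

def kmeans_centroids(token_reprs, num_experts, iters=10, seed=0):
    """Toy k-means over pretrained token representations.
    The paper initializes expert routing from such centroids;
    the exact procedure here is illustrative only."""
    rng = np.random.default_rng(seed)
    centroids = token_reprs[rng.choice(len(token_reprs), num_experts, replace=False)]
    for _ in range(iters):
        # assign every token to its nearest centroid
        dists = np.linalg.norm(token_reprs[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)
        for e in range(num_experts):
            members = token_reprs[assign == e]
            if len(members):
                centroids[e] = members.mean(axis=0)
    return centroids

def route(token, centroids, top_k=2):
    """Score experts by negative distance to their centroid,
    pick top_k, and softmax-normalize the selected scores."""
    scores = -np.linalg.norm(centroids - token, axis=-1)
    top = np.argsort(scores)[-top_k:][::-1]
    weights = np.exp(scores[top] - scores[top].max())
    return top, weights / weights.sum()
```

Routing against fixed centroids (rather than a freshly learned gate) gives the gating function a structure-aware starting point, which is the motivation the paper gives for this initialization.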

Benchmarked Models

| Model | Role In Paper | Notes | Official Artifact |
| --- | --- | --- | --- |
| Moirai-MoE-1.0-R-Small | Small released benchmark model | 6 layers, 384 hidden size, 512 feed-forward size, 2 activated experts out of 32, about 11M activated parameters, about 117M total parameters | Salesforce/moirai-moe-1.0-R-small |
| Moirai-MoE-1.0-R-Base | Base released benchmark model | 12 layers, 768 hidden size, 1024 feed-forward size, 2 activated experts out of 32, about 86M activated parameters, about 935M total parameters | Salesforce/moirai-moe-1.0-R-base |
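The gap between activated and total parameters in the table follows directly from the sparse-expert arithmetic: all experts count toward total size, but only the top-k run per token. The breakdown below is a back-of-envelope sketch; `expert_params` and `shared_params` are placeholder values chosen to land near the small model's reported scale, not the paper's actual layer-by-layer accounting.

```python
def moe_param_split(num_experts, top_k, expert_params, shared_params):
    """Total capacity counts every expert; activated compute
    counts only the top_k experts a token is routed to."""
    total = shared_params + num_experts * expert_params
    activated = shared_params + top_k * expert_params
    return total, activated

# Placeholder sizes, not the real Moirai-MoE breakdown:
total, act = moe_param_split(num_experts=32, top_k=2,
                             expert_params=3_500_000,
                             shared_params=5_000_000)
# → (117000000, 12000000): total grows with all 32 experts,
#   activated with only the 2 that fire per token
```

This is why serving memory scales with total parameters even though per-token FLOPs stay close to a dense model of the activated size.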

Method Notes

Moirai-MoE is a passive dynamics model for forecasting: it predicts future numeric observations from observed histories and does not expose an explicit action, control input, intervention, or treatment channel. It is best read as a strong probabilistic time-series foundation model baseline, not as an action-conditioned world model.

The model keeps Moirai’s patch-based handling of multivariate time series, flattening variates into a causal Transformer sequence, but moves specialization into sparse experts. Each MoE layer routes a token to 2 of 32 experts, so activated compute stays close to dense Moirai at the same size class while total model capacity is much larger.
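The routing described above can be sketched as a standard sparse top-k MoE forward pass: a gate scores all experts per token, but only the two highest-scoring expert MLPs execute. The shapes, linear gate, and per-token loop below are assumptions for illustration, not Moirai-MoE's actual code.

```python
import numpy as np

def moe_forward(x, w_gate, experts, top_k=2):
    """Sparse MoE layer sketch.
    x:       (tokens, dim) token representations
    w_gate:  (dim, num_experts) gating matrix
    experts: list of callables, each mapping (dim,) -> (dim,)
    Only top_k experts run per token; their outputs are mixed
    by softmax weights over the selected gate logits."""
    logits = x @ w_gate                          # (tokens, num_experts)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-top_k:]     # indices of top_k experts
        w = np.exp(logits[t, top] - logits[t, top].max())
        w /= w.sum()
        for e, g in zip(top, w):
            out[t] += g * experts[e](x[t])       # sparse: 2 of 32 experts fire
    return out
```

With 2 of 32 experts firing, per-token compute is roughly that of a dense layer one-sixteenth the expert capacity, which is the compute/capacity trade-off the paragraph above describes.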

The paper’s main design bet is that frequency is a weak proxy for temporal structure. The authors show examples where different frequencies have similar patterns, same-frequency series have different patterns, and non-stationarity changes distribution within a short context window. Moirai-MoE therefore uses a single patch size and projection path, then lets routing adapt at the token level.

Evidence And Results

  • On 29 in-distribution Monash datasets, the paper reports that Moirai-MoE beats the compared Monash baselines, TimesFM, Chronos, and dense Moirai variants on aggregate normalized MAE.
  • The small Moirai-MoE model is reported as 17% better than dense Moirai small on the Monash aggregate, while also outperforming dense Moirai base and large in that setting.
  • On 10 zero-shot datasets outside LOTSA, Moirai-MoE base reports the best average CRPS and MASE among the compared foundation-model and full-shot baselines.
  • Against dense Moirai variants in zero-shot forecasting, Moirai-MoE small reports 3%-14% better CRPS and 8%-16% better MASE while using about 11M activated parameters.
  • Ablations indicate that switching from masked encoder to decoder-only training helps, but the larger gain comes from replacing frequency-level projections with MoE token specialization.
  • Expert analysis suggests that shallow layers use more diverse routing for local and periodic patterns, while deeper layers converge toward more frequency-invariant representations.

Limitations

  • The paper does not introduce an action-conditioned interface, so it does not directly address control, intervention planning, counterfactual simulation, or causal discovery.
  • Total parameter counts are much larger than activated parameter counts, especially for the base model, so memory and serving constraints still matter even when per-token compute is sparse.
  • The strongest comparison is within forecasting; broader time-series understanding, reasoning, classification, and generation tasks require separate evidence.
  • Some expert-routing analyses are based on visualization and aggregate behavior, so they should be treated as mechanistic hypotheses rather than complete explanations.

Open Questions

  • Can Moirai-MoE’s token-level specialization help with high-dimensional telemetry or observability event streams where frequency is less informative than regime, workload, or incident phase?
  • How stable are expert assignments under distribution shift, missingness, and rare interventions?
  • Would sparse expert routing become more useful for world-model-style systems if actions, control inputs, or interventions were encoded as future-conditioning variables?