Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts
Source
- Raw Markdown: paper_moirai-moe-2024.md
- PDF: paper_moirai-moe-2024.pdf
- Preprint: arXiv 2410.10469
- Official code: SalesforceAIResearch/uni2ts
- Official checkpoint: Salesforce/moirai-moe-1.0-R-small
- Official checkpoint: Salesforce/moirai-moe-1.0-R-base
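A minimal loading sketch for these checkpoints, following the Moirai usage pattern from the Uni2TS README; the `MoiraiMoEForecast`/`MoiraiMoEModule` names and arguments below should be verified against the repo, and the context, horizon, and batch settings are placeholders:

```python
from uni2ts.model.moirai_moe import MoiraiMoEForecast, MoiraiMoEModule

# Wrap the pretrained module in a GluonTS-style probabilistic predictor.
# All numeric settings here are illustrative placeholders.
model = MoiraiMoEForecast(
    module=MoiraiMoEModule.from_pretrained("Salesforce/moirai-moe-1.0-R-small"),
    prediction_length=96,
    context_length=512,
    patch_size=16,            # Moirai-MoE uses a single patch size
    num_samples=100,          # samples from the predictive distribution
    target_dim=1,
    feat_dynamic_real_dim=0,
    past_feat_dynamic_real_dim=0,
)
predictor = model.create_predictor(batch_size=32)
```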
Core Claim
Moirai-MoE argues that time-series foundation models should specialize through token-level learned routing rather than hand-assigned frequency buckets: sparse mixture-of-experts layers let similar local time-series patterns share experts even when the underlying series differ in sampling frequency.
Key Contributions
- Introduces Moirai-MoE, a sparse mixture-of-experts time-series foundation model built on the Moirai family.
- Replaces Moirai’s multiple frequency-specific input and output projections with a single projection layer plus token-level expert routing inside Transformer blocks.
- Uses a decoder-only forecasting objective so that a single training update supervises forecasts at multiple context lengths, improving training efficiency over Moirai's earlier masked-encoder setup.
- Proposes a token-cluster gating function in which expert routing is initialized from k-means centroids over pretrained Moirai token representations (see the routing-initialization sketch after this list).
- Evaluates on 39 datasets across in-distribution Monash forecasting and zero-shot forecasting settings.
- Releases the implementation through Uni2TS and public small/base Moirai-MoE checkpoints.
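The token-cluster gating idea can be sketched as follows: collect token representations from a pretrained dense Moirai, cluster them with k-means, and use the centroids to initialize the router's weight rows, so the routing logit for expert e is the token's similarity to centroid e. The helper below is a minimal illustration under those assumptions; the function name, the scikit-learn KMeans call, and the dot-product routing form are ours, not the Uni2TS implementation.

```python
import torch
from sklearn.cluster import KMeans

def init_gate_from_centroids(gate: torch.nn.Linear, token_reps: torch.Tensor):
    """Initialize an MoE router from k-means centroids of pretrained token
    representations (hypothetical helper sketching the paper's idea).

    gate:       nn.Linear(d_model, num_experts, bias=False); one weight row per expert.
    token_reps: (num_tokens, d_model) hidden states collected from a pretrained
                dense Moirai on held-out series.
    """
    num_experts = gate.out_features
    km = KMeans(n_clusters=num_experts, n_init=10)
    km.fit(token_reps.detach().cpu().numpy())
    centroids = torch.as_tensor(km.cluster_centers_, dtype=gate.weight.dtype)
    with torch.no_grad():
        gate.weight.copy_(centroids)  # routing logit = <token, centroid_e>
```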
Benchmarked Models
| Model | Role In Paper | Notes | Official Artifact |
|---|---|---|---|
| Moirai-MoE-1.0-R-Small | Small released benchmark model | 6 layers, 384 hidden size, 512 feed-forward size, 2 activated experts out of 32, about 11M activated parameters, and about 117M total parameters. | Salesforce/moirai-moe-1.0-R-small |
| Moirai-MoE-1.0-R-Base | Base released benchmark model | 12 layers, 768 hidden size, 1024 feed-forward size, 2 activated experts out of 32, about 86M activated parameters, and about 935M total parameters. | Salesforce/moirai-moe-1.0-R-base |
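A back-of-envelope check of the activated-versus-total gap in the table, assuming a gated (SwiGLU-style) expert feed-forward with three d_model x d_ff matrices, 4 * d_model^2 attention parameters per layer, and ignoring embeddings, norms, and the router; this is our arithmetic, not the paper's, but it lands close to the reported counts.

```python
def moe_param_estimate(layers, d_model, d_ff, num_experts=32, top_k=2):
    """Rough parameter estimate (in millions) for a sparse-MoE Transformer.

    Assumes 4 * d_model^2 attention parameters per layer and a gated FFN with
    three d_model x d_ff matrices per expert; embeddings, norms, and the
    router are ignored.
    """
    attn = 4 * d_model**2
    expert = 3 * d_model * d_ff
    activated = layers * (attn + top_k * expert)
    total = layers * (attn + num_experts * expert)
    return activated / 1e6, total / 1e6

print(moe_param_estimate(6, 384, 512))    # ~(10.6, 116.8); reported ~11M / ~117M
print(moe_param_estimate(12, 768, 1024))  # ~(84.9, 934.3); reported ~86M / ~935M
```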
Method Notes
Moirai-MoE is a passive dynamics model for forecasting: it predicts future numeric observations from observed histories and does not expose an explicit action, control input, intervention, or treatment channel. It is best read as a strong probabilistic time-series foundation model baseline, not as an action-conditioned world model.
The model keeps Moirai’s patch-based handling of multivariate time series, flattening variates into a causal Transformer sequence, but moves specialization into sparse experts. Each MoE layer routes a token to 2 of 32 experts, so activated compute stays close to dense Moirai at the same size class while total model capacity is much larger.
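A minimal sketch of such a top-2-of-32 MoE feed-forward layer is below, assuming a plain two-layer GELU expert FFN and softmax weights renormalized over the selected experts; the actual Uni2TS layer, load-balancing loss, and batched expert dispatch are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    """Minimal top-k mixture-of-experts feed-forward layer (illustrative sketch)."""

    def __init__(self, d_model=384, d_ff=512, num_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # router logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                     # x: (tokens, d_model), variates flattened
        logits = self.gate(x)                 # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the selected experts
        out = torch.zeros_like(x)
        # Send each token to its k selected experts; real systems batch this dispatch.
        for k in range(self.top_k):
            for e in idx[:, k].unique():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k, None] * self.experts[int(e)](x[mask])
        return out
```

Because only top_k of num_experts expert FFNs run per token, activated compute tracks a dense model of the same width while total capacity scales with num_experts.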
The paper’s main design bet is that frequency is a weak proxy for temporal structure. The authors show examples where series at different frequencies exhibit similar patterns, series at the same frequency exhibit different patterns, and non-stationarity shifts the distribution within a single context window. Moirai-MoE therefore uses a single patch size and projection path and lets routing adapt at the token level.
Evidence And Results
- On 29 in-distribution Monash datasets, the paper reports that Moirai-MoE beats the compared Monash baselines, TimesFM, Chronos, and dense Moirai variants on aggregate normalized MAE.
- The small Moirai-MoE model is reported as 17% better than dense Moirai small on the Monash aggregate, while also outperforming dense Moirai base and large in that setting.
- On 10 zero-shot datasets outside LOTSA, Moirai-MoE base reports the best average CRPS and MASE among the compared foundation-model and full-shot baselines (both metrics are sketched after this list).
- Against dense Moirai variants in zero-shot forecasting, Moirai-MoE small reports 3%-14% better CRPS and 8%-16% better MASE while using about 11M activated parameters.
- Ablations indicate that switching from masked encoder to decoder-only training helps, but the larger gain comes from replacing frequency-level projections with MoE token specialization.
- Expert analysis suggests that shallow layers use more diverse routing for local and periodic patterns, while deeper layers converge toward more frequency-invariant representations.
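For reference on the metrics above, both CRPS and MASE can be computed from forecast samples with standard estimators; the helpers below use the sample-based energy form of CRPS and the in-sample seasonal-naive scale for MASE (helper names are ours, not Uni2TS code).

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate, averaged over the horizon.

    samples: (num_samples, horizon) draws from the predictive distribution.
    y:       (horizon,) realized values.
    Uses CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|.
    """
    term1 = np.abs(samples - y).mean(axis=0)
    term2 = np.abs(samples[:, None] - samples[None, :]).mean(axis=(0, 1))
    return (term1 - 0.5 * term2).mean()

def mase(forecast, y, history, season=1):
    """MAE of the forecast scaled by the in-sample seasonal-naive MAE."""
    scale = np.abs(history[season:] - history[:-season]).mean()
    return np.abs(forecast - y).mean() / scale
```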
Limitations
- The paper does not introduce an action-conditioned interface, so it does not directly address control, intervention planning, counterfactual simulation, or causal discovery.
- Total parameter counts are much larger than activated parameter counts, especially for the base model, so memory and serving constraints still matter even when per-token compute is sparse.
- The strongest comparison is within forecasting; broader time-series understanding, reasoning, classification, and generation tasks require separate evidence.
- Some expert-routing analyses are based on visualization and aggregate behavior, so they should be treated as mechanistic hypotheses rather than complete explanations.
Links Into The Wiki
- Moirai
- Time-Series Foundation Models
- Mixture Of Experts
- Time-Series Scaling And Efficiency
- Time-Series Benchmark Hygiene
- Time-MoE
- TimesFM
Open Questions
- Can Moirai-MoE’s token-level specialization help with high-dimensional telemetry or observability event streams where frequency is less informative than regime, workload, or incident phase?
- How stable are expert assignments under distribution shift, missingness, and rare interventions?
- Would sparse expert routing become more useful for world-model-style systems if actions, control inputs, or interventions were encoded as future-conditioning variables?