Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
Source
- Raw Markdown: paper_time-moe-2024.md
- PDF: paper_time-moe-2024.pdf
- Preprint: arXiv 2409.16040
- Official code: Time-MoE/Time-MoE
- Official checkpoint: Maple728/TimeMoE-50M
- Official checkpoint: Maple728/TimeMoE-200M
Core Claim
Time-MoE argues that sparse mixture-of-experts scaling can make time-series forecasting foundation models larger and more capable without forcing inference cost to grow with total parameter count.
Key Contributions
- Introduces a decoder-only autoregressive time-series forecasting foundation model with sparse temporal mixture-of-experts layers, causal attention, point-wise tokenization, and multi-resolution forecasting heads.
- Builds Time-300B, a cleaned large-scale pretraining corpus spanning more than nine domains and about 309B time points.
- Scales the model family up to a 2.4B total-parameter model with 1.1B activated parameters, while also training smaller 50M and 200M activated-parameter variants for lower-cost inference.
- Reports that Time-MoE lowers average forecasting error relative to existing time-series foundation models in zero-shot settings, and relative to strong full-shot forecasting baselines after a single epoch of downstream fine-tuning.
Benchmarked Models
| Model | Role In Paper | Notes | Official Artifact |
|---|---|---|---|
| Time-MoE-50M | Base benchmarked Time-MoE model | 12 layers, 12 heads, 8 experts, top-2 routing, 384 hidden size, 50M activated parameters, and 113M total parameters. | Maple728/TimeMoE-50M |
| Time-MoE-200M | Large benchmarked Time-MoE model | 12 layers, 12 heads, 8 experts, top-2 routing, 768 hidden size, 200M activated parameters, and 453M total parameters. | Maple728/TimeMoE-200M |
| Time-MoE-Ultra | Largest benchmarked Time-MoE model | 36 layers, 16 heads, 8 experts, top-2 routing, 1024 hidden size, 1.1B activated parameters, and 2.4B total parameters. | No official checkpoint listed above. |
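The two released checkpoints above can be pulled from Hugging Face. Below is a minimal forecasting sketch, assuming the checkpoints expose a causal-LM style `generate` interface through `trust_remote_code` (as the model cards suggest); the context length, horizon, and instance-normalization step are illustrative rather than copied from the official README.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the released 50M-activated-parameter checkpoint; trust_remote_code pulls in
# the Time-MoE modeling code shipped alongside the weights (assumed behavior).
model = AutoModelForCausalLM.from_pretrained("Maple728/TimeMoE-50M", trust_remote_code=True)

context = torch.randn(1, 512)            # [batch, context_length] univariate history
prediction_length = 96                   # illustrative forecast horizon

# Instance-normalize the history, forecast autoregressively, then de-normalize.
mean = context.mean(dim=-1, keepdim=True)
std = context.std(dim=-1, keepdim=True) + 1e-8
normed = (context - mean) / std

with torch.no_grad():
    generated = model.generate(normed, max_new_tokens=prediction_length)

forecast = generated[:, -prediction_length:] * std + mean
```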
Method Notes
Time-MoE is a passive forecasting model rather than an action-conditioned world model: it consumes observed time-series histories and predicts future numeric values, without explicit action, control input, or intervention channels. The paper handles multivariate time series through channel independence, turning each channel into a univariate sequence, so cross-channel dynamics are not the primary modeling target.
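A minimal sketch of the channel-independence convention, assuming nothing beyond standard NumPy: a multivariate series of shape (time, channels) is split into independent univariate sequences before being handed to the model, so cross-channel structure is never modeled explicitly. The helper name is hypothetical.

```python
import numpy as np

def split_channels(series: np.ndarray) -> list[np.ndarray]:
    """Split a (time, channels) array into per-channel univariate sequences.

    Hypothetical helper illustrating channel independence: each channel becomes
    its own training/inference sample.
    """
    return [series[:, c] for c in range(series.shape[1])]

multivariate = np.random.randn(1024, 7)     # e.g. an ETT-style 7-channel series
univariate_samples = split_channels(multivariate)
assert len(univariate_samples) == 7 and univariate_samples[0].shape == (1024,)
```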
The architecture replaces dense feed-forward layers with sparse mixture-of-experts layers made up of routed experts plus a shared expert that is always active. Each time point is routed to a small top-k subset of the routed experts, and an auxiliary load-balancing loss discourages expert routing collapse.
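The sketch below illustrates this layer type under stated assumptions: a top-k router over a pool of experts, one always-active shared expert, and a Switch-style load-balancing auxiliary loss. Layer internals, the activation choice, and the exact auxiliary-loss form are illustrative, not copied from the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseTimeMoELayer(nn.Module):
    """Illustrative sparse MoE feed-forward layer: top-k routed experts plus a
    shared expert, with a load-balancing auxiliary loss (not the official code)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor):
        # x: [batch, seq, d_model]; every time point is routed independently.
        probs = F.softmax(self.router(x), dim=-1)            # [B, T, E]
        top_p, top_idx = probs.topk(self.top_k, dim=-1)      # [B, T, k]
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)      # renormalize over chosen experts

        out = self.shared_expert(x)                          # shared expert is always active
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[..., k] == e                  # time points routed to expert e
                if mask.any():
                    out[mask] += top_p[..., k][mask].unsqueeze(-1) * expert(x[mask])

        # Load-balancing auxiliary loss: pushes routing mass and assignments toward
        # a uniform distribution over experts to avoid routing collapse.
        importance = probs.mean(dim=(0, 1))                                      # avg router probability per expert
        load = F.one_hot(top_idx, probs.size(-1)).float().mean(dim=(0, 1, 2))    # fraction of assignments per expert
        aux_loss = probs.size(-1) * (importance * load).sum()
        return out, aux_loss
```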
Multi-resolution forecasting heads predict horizons of 1, 8, 32, and 64 time steps during training. At inference, a greedy scheduling procedure combines these heads autoregressively so the model can forecast flexible horizons rather than only one fixed output length.
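A small sketch of the greedy scheduling idea described above: at each autoregressive step, the largest trained head that does not overshoot the remaining horizon is used. This is a standalone illustration and omits the model call that would happen between steps.

```python
def greedy_horizon_schedule(horizon: int, head_lengths=(64, 32, 8, 1)) -> list[int]:
    """Greedily cover `horizon` future steps with the trained head lengths.

    Illustrative scheduling only: the real inference loop appends each partial
    forecast to the context before invoking the next head.
    """
    plan, remaining = [], horizon
    for length in sorted(head_lengths, reverse=True):
        while remaining >= length:
            plan.append(length)
            remaining -= length
    return plan

# Example: a 100-step forecast is served by 6 autoregressive calls: 64 + 32 + 1 + 1 + 1 + 1.
assert greedy_horizon_schedule(100) == [64, 32, 1, 1, 1, 1]
```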
Evidence And Results
- Zero-shot forecasting is evaluated on ETTh1, ETTh2, ETTm1, ETTm2, Weather, and Global Temp with horizons 96, 192, 336, and 720; the paper reports more than 20% average MSE reduction over the most competitive zero-shot baselines.
- In-distribution evaluation fine-tunes the pretrained Time-MoE models for one epoch on the same benchmark family; the paper reports a 24% average MSE reduction over recent full-shot forecasting baselines.
- Sparse-vs-dense scaling experiments report that Time-MoE reduces training cost by an average of 78% and inference cost by an average of 39% relative to dense variants with comparable activated-parameter budgets.
- Ablations report worse average MSE when removing mixture-of-experts layers, multi-resolution heads, the Huber training loss, or the auxiliary routing-balance loss (a sketch of the combined training objective follows this list).
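As noted in the ablation bullet above, training combines a Huber forecasting loss with the routing-balance term. A minimal sketch of such a combined objective follows; the auxiliary weight is chosen purely for illustration and is not the paper's setting.

```python
import torch
import torch.nn.functional as F

def combined_training_loss(pred: torch.Tensor, target: torch.Tensor,
                           aux_balance_loss: torch.Tensor,
                           aux_weight: float = 0.02) -> torch.Tensor:
    """Huber (outlier-robust) forecasting loss plus a weighted load-balancing term.

    `aux_weight` is an illustrative value only.
    """
    forecast_loss = F.huber_loss(pred, target, delta=1.0)
    return forecast_loss + aux_weight * aux_balance_loss
```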
Limitations
- The paper is focused on forecasting accuracy and scaling behavior, not time-series reasoning, causal discovery, action-conditioned simulation, or intervention planning.
- Channel-independent handling of multivariate time series improves universality but may miss important cross-channel structure in domains where coupled dynamics matter.
- The largest (ultra) model is not among the public Hugging Face checkpoints listed above, so the released artifacts most directly support the 50M and 200M variants.
Links Into The Wiki
- Time-Series Foundation Models
- Mixture Of Experts
- Time-Series Scaling And Efficiency
- Time-Series Benchmark Hygiene
- MOMENT
- TimeOmni-1
Open Questions
- How much of Time-MoE’s gain comes from sparse expert routing versus Time-300B data scale and cleaning?
- Would explicit multivariate coupling improve results on domains where channel interactions are central?
- Can sparse forecasting models like Time-MoE become useful backbones for reasoning-oriented systems such as TimeOmni-1?