Unified Training of Universal Time Series Forecasting Transformers

Source

Woo et al., “Unified Training of Universal Time Series Forecasting Transformers,” ICML 2024. arXiv:2402.02592 (https://arxiv.org/abs/2402.02592).

Core Claim

Moirai argues that a single masked-encoder Transformer can become a universal time-series forecaster when pretrained across heterogeneous domains, frequencies, variate counts, context lengths, and forecast horizons, with a distribution head flexible enough to cover varied predictive distributions.

Key Contributions

  • Introduces Moirai, a masked encoder-based universal time-series forecasting Transformer with multi-patch-size projections, Any-variate Attention, and a flexible mixture distribution head.
  • Builds the Large-scale Open Time Series Archive, or LOTSA, with more than 27B observations across nine domains for open pretraining.
  • Trains small, base, and large models with 14M, 91M, and 311M parameters, respectively.
  • Evaluates Moirai as a single zero-shot model against full-shot forecasting baselines on in-distribution Monash tasks, out-of-distribution probabilistic forecasting tasks, and long sequence forecasting tasks.
  • Releases the Uni2TS library, data pipeline, and public Moirai checkpoints through official Salesforce artifacts; a minimal loading sketch follows this list.
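
A minimal loading sketch, assuming the Uni2TS interface as shown in the repository README at the time of writing (MoiraiForecast, MoiraiModule.from_pretrained, create_predictor); the argument values are placeholders, and the API may change between releases.

```python
# Sketch only: mirrors the Uni2TS README; check the repo for the current API.
from uni2ts.model.moirai import MoiraiForecast, MoiraiModule

model = MoiraiForecast(
    module=MoiraiModule.from_pretrained("Salesforce/moirai-1.0-R-small"),
    prediction_length=24,         # forecast horizon (placeholder)
    context_length=200,           # history fed to the masked encoder (placeholder)
    patch_size="auto",            # or a fixed size, e.g. 32
    num_samples=100,              # samples from the mixture predictive distribution
    target_dim=1,                 # univariate target
    feat_dynamic_real_dim=0,      # no known future covariates in this sketch
    past_feat_dynamic_real_dim=0,
)
predictor = model.create_predictor(batch_size=32)  # GluonTS-style predictor
```

The resulting predictor follows the GluonTS interface, so calling its predict method on a test split yields sample-based probabilistic forecasts.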

Benchmarked Models

| Model | Role In Paper | Notes | Official Artifact |
| --- | --- | --- | --- |
| Moirai-1.0-R-Small | Small released benchmark model | Matches the paper’s small Moirai scale: 6 layers, 384 hidden size, 6 attention heads, about 14M parameters. | Salesforce/moirai-1.0-R-small |
| Moirai-1.0-R-Base | Base released benchmark model | Matches the paper’s base Moirai scale: 12 layers, 768 hidden size, 12 attention heads, about 91M parameters. | Salesforce/moirai-1.0-R-base |
| Moirai-1.0-R-Large | Large released benchmark model | Matches the paper’s large Moirai scale: 24 layers, 1024 hidden size, 16 attention heads, about 311M parameters. | Salesforce/moirai-1.0-R-large |
| Moirai-1.1-R-Small | Updated released benchmark model | Same public Moirai family and small size class, released as the 1.1-R checkpoint line. | Salesforce/moirai-1.1-R-small |
| Moirai-1.1-R-Base | Updated released benchmark model | Same public Moirai family and base size class, released as the 1.1-R checkpoint line. | Salesforce/moirai-1.1-R-base |
| Moirai-1.1-R-Large | Updated released benchmark model | Same public Moirai family and large size class, released as the 1.1-R checkpoint line. | Salesforce/moirai-1.1-R-large |

Method Notes

Moirai is a passive dynamics model for forecasting: it predicts future observations from historical observations and known covariates, without a controllable action, control input, treatment, or intervention channel. It is therefore a strong time-series foundation model baseline, but not an action-conditioned world model by itself.

The architecture patchifies each variate, flattens the time and variate axes into a single Transformer sequence, and applies Any-variate Attention: rotary position embeddings encode the time index, while learned binary attention biases let the model distinguish same-variate from cross-variate token pairs yet remain invariant to arbitrary variate orderings and counts. Multi-patch-size input and output projections handle different sampling frequencies through a heuristic frequency-to-patch-size mapping.
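
The binary bias is the distinctive piece: it adds one learned scalar to the attention logits for same-variate pairs and another for cross-variate pairs. A minimal single-head sketch, with illustrative names (`any_variate_scores`) and toy shapes; the real model also rotates queries and keys with RoPE along the time axis before this step.

```python
import torch
import torch.nn.functional as F

def any_variate_scores(q, k, variate_ids, u_same, u_diff):
    """Attention logits with the binary variate bias of Any-variate Attention.

    q, k: (num_tokens, head_dim) queries/keys for one head, after patches from
    all variates have been flattened into a single token sequence.
    variate_ids: (num_tokens,) integer id of the variate each token came from.
    u_same, u_diff: learned scalars biasing same- vs. cross-variate pairs.
    """
    scores = q @ k.T / q.shape[-1] ** 0.5                  # scaled dot products
    same = variate_ids[:, None] == variate_ids[None, :]    # (T, T) boolean mask
    return scores + torch.where(same, u_same, u_diff)      # add the binary bias

# Toy example: 2 variates x 3 patch tokens each, flattened to 6 tokens.
torch.manual_seed(0)
tokens = torch.randn(6, 16)
variate_ids = torch.tensor([0, 0, 0, 1, 1, 1])
u_same, u_diff = torch.tensor(0.5), torch.tensor(-0.5)     # learned in practice
attn = F.softmax(any_variate_scores(tokens, tokens, variate_ids,
                                    u_same, u_diff), dim=-1)
```

Because the bias depends only on whether two tokens share a variate id, relabeling or permuting variates leaves the attention pattern unchanged, which is what lets one set of weights serve any variate count.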

The output head predicts a mixture over parametric distributions, including Student’s t, negative binomial, log-normal, and low-variance normal components. This makes Moirai a probabilistic forecaster rather than only a point forecaster and helps it cover positive counts, skewed values, heavy tails, and high-confidence regimes.
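
The predictive likelihood is a weighted sum of heterogeneous component densities. A sketch of evaluating it with standard torch.distributions families, where the helper `mixture_log_prob` and all parameter values are illustrative; in Uni2TS the head emits each component's parameters and the mixture weights from the Transformer output.

```python
import torch
from torch.distributions import StudentT, NegativeBinomial, LogNormal, Normal

def mixture_log_prob(y, logits, components):
    """Log-likelihood of y under a mixture of heterogeneous distributions.

    logits: (K,) unnormalized mixture weights from the output head.
    components: K torch.distributions objects, one per parametric family.
    """
    log_w = torch.log_softmax(logits, dim=-1)                         # (K,)
    log_p = torch.stack([c.log_prob(y) for c in components], dim=-1)  # (..., K)
    return torch.logsumexp(log_w + log_p, dim=-1)

# Illustrative components echoing the paper's four families.
y = torch.tensor(2.0)  # integer-valued target so NegativeBinomial is valid
components = [
    StudentT(df=torch.tensor(3.0)),                      # heavy tails
    NegativeBinomial(total_count=torch.tensor(5.0),
                     logits=torch.tensor(0.0)),          # positive counts
    LogNormal(torch.tensor(0.0), torch.tensor(1.0)),     # right-skewed values
    Normal(torch.tensor(0.0), torch.tensor(1e-3)),       # low-variance normal
]
nll = -mixture_log_prob(y, torch.zeros(4), components)   # training loss term
```

Training minimizes this negative log-likelihood; sampling the mixture at inference yields the predictive quantiles behind CRPS and MSIS evaluation.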

Evidence And Results

  • On the Monash in-distribution benchmark, the paper reports that all three Moirai sizes outperform the benchmark baselines under normalized MAE while remaining a single shared model across datasets.
  • On out-of-distribution probabilistic forecasting, the base and large models are reported as consistently strong zero-shot forecasters, often first or second by CRPS and MSIS against full-shot DeepAR, PatchTST, TiDE, and TFT baselines.
  • On long sequence forecasting, Moirai is competitive with full-shot baselines across ETTh1, ETTh2, ETTm1, ETTm2, Electricity, and Weather, though the large model is not uniformly better than the base model.
  • Ablations attribute performance to the combined design: removing multi-patch-size handling, Any-variate Attention, the mixture distribution, LOTSA pretraining, or sequence packing worsens normalized MAE.

Limitations

  • The frequency-to-patch-size mapping is hand-designed; the authors explicitly identify it as a heuristic (illustrated in the sketch after this list) that should be replaced by more flexible cross-frequency modeling.
  • High-dimensional multivariate time series remain challenging: flattening all variates into one Transformer sequence lengthens the input, and attention cost grows quadratically with sequence length.
  • The evidence is forecasting-centered; Moirai does not directly address causal discovery, natural-language reasoning, action-conditioned simulation, or intervention planning.
  • Later benchmark papers should be checked for train/test overlap, because broad pretraining corpora can create leakage-like ambiguity for nominal zero-shot evaluations.
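
For concreteness, the heuristic reduces to a lookup from sampling frequency to candidate patch sizes, each with its own input/output projection. A hypothetical sketch; the frequency keys and patch-size values below are illustrative placeholders, not the paper's exact table.

```python
# Hypothetical lookup; keys and values are illustrative, not the paper's table.
PATCH_SIZES: dict[str, list[int]] = {
    "yearly":    [8],
    "quarterly": [8],
    "monthly":   [8, 16, 32],
    "daily":     [16, 32],
    "hourly":    [32, 64],
    "minutely":  [32, 64, 128],
}

def candidate_patch_sizes(freq: str) -> list[int]:
    """Patch sizes whose projections would be trained for series at `freq`."""
    return PATCH_SIZES.get(freq, [32])  # fallback: a mid-range patch size
```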

Open Questions

  • How much of Moirai’s transfer comes from architecture versus LOTSA coverage and data cleaning?
  • Can Any-variate Attention scale to high-dimensional telemetry, observability, or physiological event streams without sparse or factorized attention?
  • Would Moirai’s known-covariate interface become useful for action-conditioned world modeling if actions, control inputs, or interventions were represented as explicit future-conditioning channels?