Moirai 2.0: When Less Is More for Time Series Forecasting
Source
- Raw Markdown: paper_moirai-2-2025.md
- PDF: paper_moirai-2-2025.pdf
- Preprint: arXiv 2511.11698
- Official code: SalesforceAIResearch/uni2ts
- Official checkpoint: Salesforce/moirai-2.0-R-small
Core Claim
The Moirai 2.0 paper argues that a smaller, simpler decoder-only time-series foundation model can outperform the earlier masked-encoder Moirai family by combining quantile forecasting, multi-token prediction, recursive multi-quantile decoding, and a larger but still leakage-conscious pretraining corpus.
Benchmarked Models
| Model | Role In Paper | Notes | Official Artifact |
|---|---|---|---|
| Moirai-2.0-R-Small | Released benchmark model | Corresponds to the paper’s small model: 11.4M parameters, normalized MASE 0.728, and normalized CRPS 0.516 on GIFT-Eval. | Salesforce/moirai-2.0-R-small |
| Moirai 2.0 base | Scaling ablation | 87.1M parameters; worse than the small model on the reported GIFT-Eval aggregate, with MASE 0.732 and CRPS 0.525. | Not linked in the paper. |
| Moirai 2.0 large | Scaling ablation | 305M parameters; worse than the small model on the reported GIFT-Eval aggregate, with MASE 0.743 and CRPS 0.530. | Not linked in the paper. |
Key Contributions
- Replaces Moirai 1.0’s masked-encoder architecture with a decoder-only Transformer that computes the training loss at every position of the causal token sequence, not only at masked positions.
- Replaces multi-patch inputs and mixture-distribution outputs with a single patch size and direct quantile forecasts trained with pinball loss (see the loss sketch after this list).
- Introduces autoregressive multi-quantile decoding, which expands multiple quantile paths during rollout and collapses them back to the target quantile grid (sketched after this list).
- Uses multi-token prediction and patch-level random masking to improve long-horizon efficiency and robustness (a head-shape sketch follows the list).
- Trains on a new corpus of about 36M time series and roughly 295B observations, combining GIFT-Eval pretraining data, GIFT-Eval train splits, Chronos-Mixup, KernelSynth, and internal Salesforce CloudOps telemetry.
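The pinball (quantile) loss itself is standard. A minimal PyTorch sketch, where the function name and tensor shapes are illustrative rather than the uni2ts API:

```python
import torch

def pinball_loss(pred: torch.Tensor, target: torch.Tensor, quantiles: torch.Tensor) -> torch.Tensor:
    """Pinball (quantile) loss, averaged over all elements.

    pred:      (..., Q) predicted values, one per quantile level
    target:    (...,)   observed values
    quantiles: (Q,)     levels in (0, 1), e.g. torch.linspace(0.1, 0.9, 9)
    """
    diff = target.unsqueeze(-1) - pred  # (..., Q)
    return torch.maximum(quantiles * diff, (quantiles - 1.0) * diff).mean()
```

Minimizing this loss at level q drives the prediction toward the q-th conditional quantile, which is why a single head can emit a calibrated quantile grid without a parametric output distribution.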
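For recursive multi-quantile decoding, the paper's exact collapse rule is not reproduced here. A hedged sketch, assuming the model returns one value per quantile level for the next step and assuming an expand-then-collapse rule based on empirical quantiles of the newest values:

```python
import torch

@torch.no_grad()
def multi_quantile_rollout(model, context, quantiles, steps):
    """Hedged sketch of recursive multi-quantile decoding.

    Assumes `model(history) -> (Q,)` returns one value per quantile level
    for the next step; the collapse rule below is an illustrative
    reconstruction, not the uni2ts implementation.
    """
    paths = context.unsqueeze(0)  # (1, T): start with one path
    for _ in range(steps):
        # Expand: every path spawns Q continuations, one per quantile level.
        next_q = torch.stack([model(p) for p in paths])  # (P, Q)
        expanded = torch.cat(
            [torch.cat([paths, next_q[:, j:j + 1]], dim=1)  # (P, T+1)
             for j in range(quantiles.numel())]
        )  # (P*Q, T+1)
        # Collapse: keep Q representative paths, chosen so their newest
        # values sit at the target quantile levels across all expanded paths.
        order = expanded[:, -1].argsort()
        idx = (quantiles * (expanded.shape[0] - 1)).round().long()
        paths = expanded[order][idx]  # back to (Q, T+1)
    return paths  # (Q, T + steps)
```

The point of the collapse step is to keep the rollout cost linear in the horizon: without it, the path set would grow as Q per step.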
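Multi-token prediction amounts to widening the output head so each position forecasts several future patches at once. A hedged sketch of such a head, where the class name, shapes, and parameter names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiPatchHead(nn.Module):
    """Hedged sketch of a multi-token prediction head: each position emits
    K future patches x Q quantile levels at once, so long horizons need
    fewer autoregressive steps. Not the uni2ts implementation."""

    def __init__(self, d_model: int, patch_size: int, n_quantiles: int, k_patches: int):
        super().__init__()
        self.out_shape = (k_patches, n_quantiles, patch_size)
        self.proj = nn.Linear(d_model, k_patches * n_quantiles * patch_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, d_model) -> (B, T, K, Q, patch_size)
        return self.proj(h).reshape(*h.shape[:-1], *self.out_shape)
```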
Method Notes
Moirai 2.0 is a passive dynamics model for forecasting. It predicts future observations from historical observations, but this version explicitly drops support for multivariate forecasting and covariates. In the knowledge-base terminology, it does not model action, control input, intervention, or treatment channels.
The key simplification is architectural: the model treats each variate as an independent univariate series, normalizes each series using statistics computed only from its observed context (so no future values leak into the decoder-only inputs), patches the series into tokens, and trains a causal Transformer to emit future quantile patches. This brings Moirai 2.0 closer to LLM-style autoregressive forecasting than the original masked-encoder design. A minimal sketch of the preprocessing follows.
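A minimal sketch of the per-variate preprocessing, assuming a plain mean/std scaler and a `patch_size` of 16; both are illustrative choices, not the uni2ts defaults:

```python
import torch

def normalize_and_patch(context: torch.Tensor, patch_size: int = 16):
    """Hedged sketch of the per-variate preprocessing described above.

    context: (T,) one univariate series. Statistics come only from this
    observed context, never from future targets, so nothing leaks across
    the causal boundary.
    """
    mean = context.mean()
    std = context.std().clamp_min(1e-8)  # guard against constant series
    scaled = (context - mean) / std
    usable = context.numel() - context.numel() % patch_size  # drop ragged tail
    patches = scaled[:usable].reshape(-1, patch_size)  # (usable // patch_size, patch_size)
    return patches, (mean, std)  # stats are needed to de-normalize forecasts
```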
Evidence And Results
- On GIFT-Eval, the paper reports Moirai 2.0 ranking 5th by normalized MASE and 6th by normalized CRPS among pretrained foundation models in its filtered comparison.
- The small released model is reported as stronger than the base and large Moirai 2.0 variants, suggesting that parameter scaling alone does not help under the paper’s current data and architecture setup.
- The model is reported as approximately 30x smaller and 2x faster than Moirai-Large while also improving accuracy on the benchmark aggregate.
- Domain and horizon breakdowns show strong short-range results, but the model’s relative rank weakens for longer prediction horizons and Nature-domain tasks.
- The ablation path attributes most of the gain to the decoder-only backbone, quantile loss, recursive multi-quantile decoding, multi-token prediction, and the residual output projection.
Limitations
- Moirai 2.0 drops multivariate forecasting and covariate support, so it is less expressive than models that can use cross-variate structure or known future exogenous variables.
- Scaling from small to base and large worsens the reported aggregate GIFT-Eval metrics.
- Long-horizon forecasting remains weaker than short-horizon forecasting.
- The corpus includes internal Salesforce telemetry, so that part of the pretraining data is not independently reproducible.
- It is a forecasting model, not an action-conditioned world model or a causal intervention model.
Links Into The Wiki
- Moirai
- Time-Series Foundation Models
- Time-Series Scaling And Efficiency
- Time-Series Benchmark Hygiene
- Chronos-2
- Sundial
- TiRex
Open Questions
- Is the small model’s advantage a true architecture-data sweet spot, or mainly an artifact of the current pretraining corpus and training budget?
- Can the decoder-only quantile interface recover useful multivariate and covariate behavior if synthetic or curated real multivariate corpora improve?
- How should time-series foundation models trade off direct quantile outputs against full probabilistic generative rollouts for world-model use cases?