Moirai 2.0: When Less Is More for Time Series Forecasting
Source
- Raw Markdown: paper_moirai-2-2025.md
- PDF: paper_moirai-2-2025.pdf
- Preprint: arXiv 2511.11698
- Official code: SalesforceAIResearch/uni2ts
- Official checkpoint: Salesforce/moirai-2.0-R-small
Core Claim
The Moirai 2.0 paper argues that a smaller, simpler decoder-only time-series foundation model can outperform the earlier masked-encoder Moirai family by combining quantile forecasting, multi-token prediction, recursive multi-quantile decoding, and a larger but still leakage-conscious pretraining corpus.
Benchmarked Models
| Model | Role In Paper | Notes | Official Artifact |
|---|---|---|---|
| Moirai-2.0-R-Small | Released benchmark model | Corresponds to the paper’s small model: 11.4M parameters, normalized MASE 0.728, and normalized CRPS 0.516 on GIFT-Eval. | Salesforce/moirai-2.0-R-small |
| Moirai 2.0 base | Scaling ablation | 87.1M parameters; worse than the small model on the reported GIFT-Eval aggregate, with MASE 0.732 and CRPS 0.525. | Not linked in the paper. |
| Moirai 2.0 large | Scaling ablation | 305M parameters; worse than the small model on the reported GIFT-Eval aggregate, with MASE 0.743 and CRPS 0.530. | Not linked in the paper. |
Key Contributions
- Replaces Moirai 1.0’s masked-encoder architecture with a decoder-only Transformer that computes the training loss at every position of the causal token sequence, not only at masked positions.
- Replaces multi-patch inputs and mixture-distribution outputs with a single patch size and direct quantile forecasts trained with pinball loss (see the loss sketch after this list).
- Introduces autoregressive multi-quantile decoding, which expands multiple quantile paths during rollout and collapses them back to the target quantile grid (sketched after this list).
- Uses multi-token prediction and patch-level random masking to improve long-horizon efficiency and robustness (a head-shape sketch follows the list).
- Trains on a new corpus of about 36M time series and roughly 295B observations, combining GIFT-Eval pretraining data, GIFT-Eval train splits, Chronos-Mixup, KernelSynth, and internal Salesforce CloudOps telemetry.
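The pinball (quantile) loss itself is standard. A minimal PyTorch sketch, where the function name and tensor shapes are illustrative rather than the uni2ts API:

```python
import torch

def pinball_loss(pred: torch.Tensor, target: torch.Tensor, quantiles: torch.Tensor) -> torch.Tensor:
    """Pinball (quantile) loss, averaged over all elements.

    pred:      (..., Q) predicted values, one per quantile level
    target:    (...,)   observed values
    quantiles: (Q,)     levels in (0, 1), e.g. torch.linspace(0.1, 0.9, 9)
    """
    diff = target.unsqueeze(-1) - pred  # (..., Q)
    return torch.maximum(quantiles * diff, (quantiles - 1.0) * diff).mean()
```

Minimizing this loss at level q drives the prediction toward the q-th conditional quantile, which is why a single head can emit a calibrated quantile grid without a parametric output distribution.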
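For recursive multi-quantile decoding, the paper's exact collapse rule is not reproduced here. A hedged sketch, assuming the model returns one value per quantile level for the next step and assuming an expand-then-collapse rule based on empirical quantiles of the newest values:

```python
import torch

@torch.no_grad()
def multi_quantile_rollout(model, context, quantiles, steps):
    """Hedged sketch of recursive multi-quantile decoding.

    Assumes `model(history) -> (Q,)` returns one value per quantile level
    for the next step; the collapse rule below is an illustrative
    reconstruction, not the uni2ts implementation.
    """
    paths = context.unsqueeze(0)  # (1, T): start with one path
    for _ in range(steps):
        # Expand: every path spawns Q continuations, one per quantile level.
        next_q = torch.stack([model(p) for p in paths])  # (P, Q)
        expanded = torch.cat(
            [torch.cat([paths, next_q[:, j:j + 1]], dim=1)  # (P, T+1)
             for j in range(quantiles.numel())]
        )  # (P*Q, T+1)
        # Collapse: keep Q representative paths, chosen so their newest
        # values sit at the target quantile levels across all expanded paths.
        order = expanded[:, -1].argsort()
        idx = (quantiles * (expanded.shape[0] - 1)).round().long()
        paths = expanded[order][idx]  # back to (Q, T+1)
    return paths  # (Q, T + steps)
```

The point of the collapse step is to keep the rollout cost linear in the horizon: without it, the path set would grow as Q per step.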
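Multi-token prediction amounts to widening the output head so each position forecasts several future patches at once. A hedged sketch of such a head, where the class name, shapes, and parameter names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiPatchHead(nn.Module):
    """Hedged sketch of a multi-token prediction head: each position emits
    K future patches x Q quantile levels at once, so long horizons need
    fewer autoregressive steps. Not the uni2ts implementation."""

    def __init__(self, d_model: int, patch_size: int, n_quantiles: int, k_patches: int):
        super().__init__()
        self.out_shape = (k_patches, n_quantiles, patch_size)
        self.proj = nn.Linear(d_model, k_patches * n_quantiles * patch_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, d_model) -> (B, T, K, Q, patch_size)
        return self.proj(h).reshape(*h.shape[:-1], *self.out_shape)
```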
Method Notes
Moirai 2.0 is a passive dynamics model for forecasting. It predicts future observations from historical observations, but this version explicitly drops support for multivariate forecasting and covariates. In the knowledge-base terminology, it does not model action, control input, intervention, or treatment channels.
The key simplification is architectural: the model treats each variate as an independent univariate series, normalizes each series using statistics computed only from its observed context (so no future values leak into the decoder-only inputs), patches the series into tokens, and trains a causal Transformer to emit future quantile patches. This brings Moirai 2.0 closer to LLM-style autoregressive forecasting than the original masked-encoder design. A minimal sketch of the preprocessing follows.
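A minimal sketch of the per-variate preprocessing, assuming a plain mean/std scaler and a `patch_size` of 16; both are illustrative choices, not the uni2ts defaults:

```python
import torch

def normalize_and_patch(context: torch.Tensor, patch_size: int = 16):
    """Hedged sketch of the per-variate preprocessing described above.

    context: (T,) one univariate series. Statistics come only from this
    observed context, never from future targets, so nothing leaks across
    the causal boundary.
    """
    mean = context.mean()
    std = context.std().clamp_min(1e-8)  # guard against constant series
    scaled = (context - mean) / std
    usable = context.numel() - context.numel() % patch_size  # drop ragged tail
    patches = scaled[:usable].reshape(-1, patch_size)  # (usable // patch_size, patch_size)
    return patches, (mean, std)  # stats are needed to de-normalize forecasts
```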
Evidence And Results
- On GIFT-Eval, the paper reports Moirai 2.0 ranking 5th by normalized MASE and 6th by normalized CRPS among pretrained foundation models in its filtered comparison.
- The small released model is reported as stronger than the base and large Moirai 2.0 variants, suggesting that parameter scaling alone does not help under the paper’s current data and architecture setup.
- The model is reported as approximately 30x smaller and 2x faster than Moirai-Large while also improving accuracy on the benchmark aggregate.
- Domain and horizon breakdowns show strong short-range results, but the model’s relative rank weakens for longer prediction horizons and Nature-domain tasks.
- The ablation path attributes most of the gain to the decoder-only backbone, quantile loss, recursive multi-quantile decoding, multi-token prediction, and the residual output projection.
Limitations
- Moirai 2.0 drops multivariate forecasting and covariate support, so it is less expressive than models that can use cross-variate structure or known future exogenous variables.
- Scaling from small to base and large worsens the reported aggregate GIFT-Eval metrics.
- Long-horizon forecasting remains weaker than short-horizon forecasting.
- The corpus includes internal Salesforce telemetry, so that part of the pretraining data is not independently reproducible.
- It is a forecasting model, not an action-conditioned world model or a causal intervention model.
Links Into The Wiki
- Moirai
- Time-Series Foundation Models
- Time-Series Scaling And Efficiency
- Time-Series Benchmark Hygiene
- Chronos-2
- Sundial
- TiRex
Open Questions
- Is the small model’s advantage a true architecture-data sweet spot, or mainly an artifact of the current pretraining corpus and training budget?
- Can the decoder-only quantile interface recover useful multivariate and covariate behavior if synthetic or curated real multivariate corpora improve?
- How should time-series foundation models trade off direct quantile outputs against full probabilistic generative rollouts for world-model use cases?