Reverso: Efficient Time Series Foundation Models for Zero-shot Forecasting
Source
- Raw Markdown: paper_reverso-2026.md
- PDF: paper_reverso-2026.pdf
- Preprint: arXiv 2602.17634
- Official code: shinfxh/reverso
- Official checkpoint: shinfxh/reverso
Core Claim
Reverso argues that zero-shot time-series foundation models do not need transformer-scale parameter counts: a small hybrid backbone that alternates long-convolution and DeltaNet sequence mixing can match or beat much larger forecasting models, landing on a better efficiency-performance trade-off across standard benchmarks.
Benchmarked Model Entry
- Model: Reverso-Small-550K
- Family: Reverso efficient time-series foundation models
- Organization: MIT, Allen Institute for AI, and Qube Research and Technologies
- Parameters: 550K
- Architecture: four layers with hidden dimension 64, alternating long convolution and DeltaNet sequence mixing, MLP channel mixing, and an attention-based decoder head.
- Primary task surface: zero-shot univariate point forecasting from past numeric values only.
- Context and prediction interface: context length 2048 with patch prediction length 48, rolled out autoregressively for longer horizons (see the rollout sketch after this list).
- Training data: GIFT-Eval Pretrain plus synthetic Gaussian-process, spike, trapezoidal, trend, seasonality, and irregularity sequences.
- Official artifact: the shinfxh/reverso code repository and the shinfxh/reverso Hugging Face checkpoint.
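The fixed interface above implies a simple inference loop: instance-normalize the most recent 2048 points, predict one 48-step patch, denormalize, append, and repeat until the horizon is covered. A minimal sketch, assuming a hypothetical `model` callable that maps a normalized length-2048 context to the next 48 values; names and the padding choice are illustrative, not the official API:

```python
import numpy as np

CONTEXT_LEN = 2048  # fixed context window from the model entry above
PATCH_LEN = 48      # one forward pass emits a 48-step patch

def rollout_forecast(model, history, horizon):
    """Autoregressive patch rollout: normalize the context, predict one
    patch, denormalize, append, repeat. `model` is a hypothetical callable
    from a (CONTEXT_LEN,) normalized context to a (PATCH_LEN,) patch."""
    series = list(np.asarray(history, dtype=np.float64))
    preds = []
    while len(preds) < horizon:
        context = np.asarray(series[-CONTEXT_LEN:])
        if len(context) < CONTEXT_LEN:  # left-pad short histories (illustrative)
            context = np.pad(context, (CONTEXT_LEN - len(context), 0))
        mu, sigma = context.mean(), context.std() + 1e-8
        patch = model((context - mu) / sigma) * sigma + mu  # undo normalization
        preds.extend(patch.tolist())
        series.extend(patch.tolist())  # feed predictions back as context
    return np.asarray(preds[:horizon])
```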
Key Contributions
- Shows that Reverso-Small at 550K parameters can outperform several similarly small or larger zero-shot forecasting baselines on GIFT-Eval while remaining orders of magnitude smaller than 100M- to billion-parameter transformer-based time-series foundation models.
- Uses a minimal input interface that normalizes a univariate numeric sequence and avoids calendar metadata, frequency tags, known future exogenous variables, or multivariate channel structure.
- Combines long convolutions with DeltaNet linear RNN layers, using a state-weaving strategy where the last hidden state from the previous layer is added to the first hidden state of the next DeltaNet layer (sketched after this list).
- Adds a training recipe with balanced sampling from GIFT-Eval Pretrain, standard time-series augmentations, and synthetic sequences generated from Gaussian-process kernels, spike processes, trapezoidal patterns, trends, seasonality, and irregularity (see the synthetic-data sketch after this list).
- Adds inference-time flip equivariance and FFT-guided downsampling to improve robustness and let long-period structure fit inside the fixed context window (both wrappers are sketched after this list).
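The state-weaving bullet above can be written in a few lines: the final recurrent state of one layer seeds the initial state of the next, so late-context information re-enters at position 0. A minimal sketch in which a generic linear recurrence with random weights stands in for DeltaNet; `simple_rnn_scan` is a placeholder, not the paper's recurrence:

```python
import torch

def simple_rnn_scan(x, h0, w_in, w_rec):
    """Placeholder linear recurrence standing in for a DeltaNet layer:
    h_t = tanh(x_t W_in + h_{t-1} W_rec). Returns states for all steps."""
    states, h = [], h0
    for t in range(x.shape[1]):
        h = torch.tanh(x[:, t] @ w_in + h @ w_rec)
        states.append(h)
    return torch.stack(states, dim=1)  # (B, L, D)

def weave_layers(x, n_layers=4, d=64):
    """State weaving: the last hidden state of layer k is added to the
    initial hidden state of layer k+1, so late-context information
    re-enters the next layer at position 0. Weights are random here."""
    carry = torch.zeros(x.shape[0], d)
    for _ in range(n_layers):
        w_in = torch.randn(d, d) / d ** 0.5
        w_rec = torch.randn(d, d) / d ** 0.5
        states = simple_rnn_scan(x, carry, w_in, w_rec)
        carry = states[:, -1]  # woven into the next layer's first state
        x = x + states         # residual path between layers (illustrative)
    return x
```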
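The Gaussian-process component of the synthetic recipe is straightforward to reproduce in spirit. A minimal sketch assuming an RBF kernel with additive trend and one seasonal tone; kernel choices, scales, and mixing weights are guesses, and the spike, trapezoidal, and irregularity generators are omitted:

```python
import numpy as np

def rbf_kernel(t, length_scale=20.0):
    """Squared-exponential kernel over integer time stamps."""
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def sample_synthetic(n=512, seed=0):
    """One synthetic series: GP draw + linear trend + one seasonal tone.
    Spike, trapezoidal, and irregularity generators are not shown."""
    rng = np.random.default_rng(seed)
    t = np.arange(n, dtype=np.float64)
    cov = rbf_kernel(t) + 1e-6 * np.eye(n)  # jitter keeps the draw stable
    gp = rng.multivariate_normal(np.zeros(n), cov)
    trend = rng.normal(0.0, 0.01) * t
    season = rng.normal(0.0, 1.0) * np.sin(2 * np.pi * t / rng.integers(12, 96))
    return gp + trend + season
```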
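Both inference-time tricks are model-agnostic wrappers around a point forecaster. The sketch below takes one plausible reading: symmetrize predictions over a sign flip (assuming "flip" means negating the normalized series, which the summary above does not pin down), and choose an integer downsampling stride from the dominant FFT period so that several full cycles fit inside the 2048-step window:

```python
import numpy as np

def flip_equivariant_forecast(model, context):
    """Symmetrize over a sign flip of the (already normalized) context:
    an exactly flip-equivariant forecaster satisfies f(-x) = -f(x),
    so averaging the two views enforces that symmetry at test time."""
    return 0.5 * (model(context) - model(-context))

def fft_downsample_factor(context, context_len=2048, min_cycles=4):
    """Choose an integer stride so that at least `min_cycles` of the
    dominant FFT period fit inside the fixed context window."""
    x = context - context.mean()
    spectrum = np.abs(np.fft.rfft(x))
    spectrum[0] = 0.0  # ignore the DC bin
    freqs = np.fft.rfftfreq(len(x))
    period = 1.0 / max(freqs[np.argmax(spectrum)], 1e-9)
    return max(int(np.ceil(min_cycles * period / context_len)), 1)
```

A factor of 1 means the window already covers enough cycles; otherwise the caller would forecast on `context[::factor]` and upsample the predicted patch back to the original resolution.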
Method Notes
Reverso is best read as a compact recipe for passive time-series forecasting rather than an action-conditioned world model. It predicts future observations from past observations only; actions, control inputs, interventions, treatments, known future exogenous variables, and cross-channel coupling are not first-class inputs.
The architecture keeps each time step as a numeric observation rather than tokenizing into a large discrete vocabulary. Sequence mixing alternates FFT-friendly long convolutions with DeltaNet linear RNN layers, while the decoder attends from learned horizon queries over the final contextualized history representation.
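A minimal sketch of the two mixing pieces just described, assuming the long convolution is evaluated in the frequency domain and the decoder cross-attends from learned per-horizon-step queries to the contextualized history; shapes and scaling are illustrative rather than the paper's exact modules:

```python
import torch
import torch.nn.functional as F

def fft_long_conv(x, kernel):
    """Causal long convolution evaluated in the frequency domain,
    O(L log L) rather than O(L^2). x: (B, L, D); kernel: (L, D),
    one full-sequence-length filter per channel."""
    B, L, D = x.shape
    n = 2 * L  # zero-pad so circular convolution equals linear convolution
    y = torch.fft.irfft(
        torch.fft.rfft(x, n=n, dim=1) * torch.fft.rfft(kernel, n=n, dim=0),
        n=n, dim=1,
    )
    return y[:, :L]  # keep the causal prefix

def horizon_query_decoder(history, queries):
    """Cross-attention from learned horizon queries to history states.
    history: (B, L, D); queries: (H, D) -> (B, H, D), one vector per
    future step, which a linear head would map to scalar forecasts."""
    q = queries.unsqueeze(0).expand(history.shape[0], -1, -1)
    scores = q @ history.transpose(1, 2) / history.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ history
```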
Evidence And Results
On GIFT-Eval, the paper reports overall MASE of 0.711 for Reverso-2.6M, 0.726 for Reverso-Small, and 0.760 for Reverso-Nano. The paper emphasizes the Pareto comparison: Reverso-Small is close to or ahead of much larger baselines while using only 550K parameters.
Across the GIFT-Eval horizon split, Reverso-Small reports MASE of 0.648 on short horizons, 0.728 on medium horizons, and 0.754 on long horizons. The full Reverso-2.6M model is the strongest Reverso variant on medium and long horizons.
On the LTSF transfer benchmark, averaged across ETTh1, ETTh2, ETTm1, ETTm2, Electricity, and Weather over prediction lengths 96, 192, 336, and 720, the paper reports average MAE of 0.325 for Reverso-Small and 0.322 for Reverso-2.6M.
Ablations attribute performance to the hybrid sequence mixer and attention decoder: the convolution-plus-DeltaNet variant outperforms single-mixer alternatives in the reported smaller-scale ablation, and the attention decoder improves long-range dependency capture over a simple linear output head.
Limitations
- Reverso is primarily a univariate point-forecasting model and does not natively model multivariate time-series interactions.
- The model does not expose explicit action, control input, intervention, treatment, or known future exogenous-variable channels.
- Short-horizon performance lags some larger time-series foundation models even when medium and long horizons are strong.
- The paper focuses on point predictions; uncertainty estimates would need an additional layer such as conformal calibration or another probabilistic wrapper (a minimal conformal sketch follows this list).
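One concrete way to add the missing uncertainty layer is split conformal prediction: measure absolute residuals on a held-out calibration set and pad every point forecast with their empirical quantile. A minimal sketch, not from the paper:

```python
import numpy as np

def conformal_interval(point_preds, calib_preds, calib_truth, alpha=0.1):
    """Split conformal prediction around point forecasts: the interval
    half-width is the finite-sample-corrected (1 - alpha) quantile of
    absolute calibration residuals, giving marginal coverage >= 1 - alpha
    under exchangeability of calibration and test points."""
    resid = np.abs(np.asarray(calib_truth) - np.asarray(calib_preds))
    n = len(resid)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    half_width = np.quantile(resid, level)
    return point_preds - half_width, point_preds + half_width
```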
Links Into The Wiki
- Time-Series Foundation Models
- Synthetic Data For Time Series
- Time-Series Scaling And Efficiency
- Time-Series Benchmark Hygiene
- Chronos-2
- TempoPFN
- TiRex
- TimesFM
Open Questions
- Does the long-convolution plus DeltaNet recipe remain efficient when extended from univariate forecasting to native multivariate time-series modeling?
- Can Reverso’s downsampling and flip-equivariant inference tricks be reused by larger foundation models without retraining?
- How much of the performance-efficiency gain comes from architecture versus the GIFT-Eval balancing, augmentation, and synthetic-data recipe?