Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series

Source

Core Claim

The Tiny Time Mixers (TTM) paper shows that small TSMixer-derived models, starting at roughly 1M parameters, can be pretrained on public time-series datasets and still deliver strong zero-shot and few-shot forecasting for multivariate time series.

Key Contributions

  • Introduces Tiny Time Mixers, compact pretrained forecasting models built from TSMixer-style mixing blocks rather than self-attention-heavy Transformer stacks (a minimal sketch of this block structure follows the list).
  • Uses adaptive patching, diverse resolution sampling, and resolution prefix tuning to handle heterogeneous pretraining data with different temporal resolutions.
  • Pretrains with a direct forecasting objective over roughly 1B public samples from Monash and LibCity-derived data, while excluding evaluation datasets from pretraining.
  • Separates a frozen channel-independent backbone from a small fine-tuned head, allowing the decoder and exogenous mixer to model cross-channel correlations and exogenous variables in target data.
  • Reports strong zero-shot and few-shot forecasting results while reducing parameter count, inference cost, and deployment requirements relative to larger time-series foundation models.
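
The block structure referenced above can be made concrete with a small PyTorch sketch of a channel-independent patch-mixing layer. This is an illustrative assumption of the general TSMixer/TTM pattern, not the released architecture: the actual TTM backbone uses adaptive patch lengths, gating, and resolution prefix tuning, and its module names and shapes differ.

```python
# Illustrative sketch only: a channel-independent patch-mixing block in the
# spirit of TSMixer/TTM. Not the released implementation.
import torch
import torch.nn as nn


class PatchMixingBlock(nn.Module):
    """Mixes across patches and across hidden features with MLPs, applied
    independently to each channel (channel independence)."""

    def __init__(self, num_patches: int, d_model: int, expansion: int = 2):
        super().__init__()
        self.patch_norm = nn.LayerNorm(d_model)
        self.patch_mlp = nn.Sequential(           # mixes along the patch axis
            nn.Linear(num_patches, expansion * num_patches),
            nn.GELU(),
            nn.Linear(expansion * num_patches, num_patches),
        )
        self.feature_norm = nn.LayerNorm(d_model)
        self.feature_mlp = nn.Sequential(         # mixes along the feature axis
            nn.Linear(d_model, expansion * d_model),
            nn.GELU(),
            nn.Linear(expansion * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * channels, num_patches, d_model); channels are folded into
        # the batch dimension, so no cross-channel mixing happens here.
        y = self.patch_norm(x).transpose(1, 2)           # (B*C, d_model, num_patches)
        x = x + self.patch_mlp(y).transpose(1, 2)        # residual patch mixing
        x = x + self.feature_mlp(self.feature_norm(x))   # residual feature mixing
        return x


# Toy usage: a 512-step context split into 16 patches of 32 steps each.
patch_len, d_model = 32, 64
series = torch.randn(8 * 3, 512)                       # 8 samples x 3 channels
patches = series.unfold(-1, patch_len, patch_len)      # (24, 16, 32)
tokens = nn.Linear(patch_len, d_model)(patches)        # patch embedding
out = PatchMixingBlock(num_patches=16, d_model=d_model)(tokens)
```

Because channels are folded into the batch dimension, the block never mixes information across channels, which is what lets a pretrained backbone of this kind transfer across datasets with different channel counts.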

Benchmarked Models

| Model | Role In Paper | Notes | Official Artifact |
| --- | --- | --- | --- |
| TTM-r2 | Main released Granite TTM checkpoint family for the paper’s latest Base, Enhanced, and Advanced variants | The paper benchmarks TTM variants around 1M, 4M, and 5M parameters; the r2 Granite checkpoint is the preferred Apache-licensed release for these latest variants. | ibm-granite/granite-timeseries-ttm-r2 |
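
As a quick orientation for the artifact above, the sketch below loads the r2 checkpoint for zero-shot prediction. It assumes the tsfm_public package from the ibm-granite/granite-tsfm repository and its TinyTimeMixerForPrediction class as documented on the model card; the input shape, argument name, and output field are taken from that documentation and should be verified against the installed version.

```python
# Hedged sketch: assumes tsfm_public (ibm-granite/granite-tsfm) exposes
# TinyTimeMixerForPrediction as described on the model card. Check the class
# name and the checkpoint's context/forecast lengths before relying on this.
import torch
from tsfm_public import TinyTimeMixerForPrediction

model = TinyTimeMixerForPrediction.from_pretrained(
    "ibm-granite/granite-timeseries-ttm-r2"  # Apache-licensed r2 release
)
model.eval()

# Zero-shot: a context window shaped (batch, context_length, num_channels).
past_values = torch.randn(1, 512, 3)  # 512-step history, 3 channels (illustrative)
with torch.no_grad():
    # output field name assumed from the library's prediction output class
    forecast = model(past_values=past_values).prediction_outputs
print(forecast.shape)  # expected: (batch, forecast_length, num_channels)
```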

Method Notes

TTM is a passive dynamics model for forecasting: it predicts future observations from historical observations, without an action or intervention channel. It does, however, explicitly distinguish target variables from exogenous variables whose future values may be known at forecast time.

The model first pretrains a channel-independent backbone, then adapts to a target multivariate time series by fine-tuning the TTM head. Channel mixing in the decoder handles target-channel correlations, while the exogenous mixer incorporates future exogenous variables into the forecast horizon.
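
A minimal sketch of that frozen-backbone, fine-tuned-head split follows, using placeholder modules rather than the released TTM code: the paper's recipe fine-tunes the TTM decoder with channel mixing and the exogenous mixer, whereas this simply freezes a generic backbone and trains a linear head on a toy few-shot batch.

```python
# Minimal sketch of the frozen-backbone / fine-tuned-head pattern.
# `backbone` and `head` are placeholders, not the released TTM modules.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(512, 256), nn.GELU(), nn.Linear(256, 256))
head = nn.Linear(256, 96)  # maps backbone features to a 96-step forecast

for p in backbone.parameters():      # freeze the pretrained backbone
    p.requires_grad = False

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One toy few-shot step; the 5% few-shot split would be iterated like this.
context = torch.randn(32, 512)       # (batch, context_length), one channel
target = torch.randn(32, 96)         # (batch, forecast_length)

features = backbone(context)          # backbone parameters receive no gradients
pred = head(features)
loss = loss_fn(pred, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```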

Evidence And Results

  • Zero-shot forecasting: TTM variants outperform most of the compared Moirai and TimesFM variants on the averaged D1 benchmarks while being substantially smaller than those models.
  • Last-window zero-shot comparisons: TTM reports strong gains over Chronos and Lag-Llama at shorter forecast lengths, while using far fewer parameters than most Chronos variants.
  • Few-shot forecasting: with 5% of target training data, TTM improves over GPT4TS, Time-LLM, and several non-pretrained forecasting architectures in the paper’s averaged D1 comparison.
  • Head probing: full-data head probing with the backbone frozen is reported as competitive with or better than Moment, GPT4TS, Time-LLM, and several fully trained forecasting architectures.
  • Exogenous-variable setting: TTM with decoder channel mixing and the exogenous mixer outperforms plain TTM fine-tuning and strong trained-from-scratch baselines on the D2 datasets.

Limitations

  • TTM is focused on forecasting, not classification, anomaly detection, generation, or natural-language reasoning over time series.
  • The architecture is sensitive to context length, so the paper trains separate variants for different context-length settings and adds forecast-length adaptation procedures.
  • The released model family is point-forecasting oriented; probabilistic forecasting is listed as future work.
  • Because TTM is a passive forecaster, it is not an action-conditioned world model unless extended with explicit action, control input, or intervention channels.

Open Questions

  • How much of TTM’s transfer comes from TSMixer inductive bias versus the resolution-diverse pretraining recipe?
  • Can a TTM-style backbone be extended from passive forecasting to action-conditioned world modeling with explicit control inputs or interventions?
  • Would probabilistic heads preserve the speed and size advantages while improving uncertainty calibration?