A decoder-only foundation model for time-series forecasting

Source

Das, Kong, Sen, and Zhou, “A decoder-only foundation model for time-series forecasting,” ICML 2024 (Google Research).

Core Claim

The paper argues that a decoder-only Transformer trained directly on a large, diverse time-series corpus can serve as a practical zero-shot forecasting foundation model across domains, context lengths, horizons, and temporal granularities.

Key Contributions

  • Introduces TimesFM, a patched decoder-only time-series model trained to predict future output patches from historical input patches with causal self-attention.
  • Builds a large pretraining corpus from Google Trends, Wikipedia page views, public forecasting datasets, and synthetic time series designed to cover diverse trends, seasonality, and granularities.
  • Uses input patching, longer output patches, and randomized context masking so one model can support flexible context lengths and arbitrary forecast horizons through autoregressive decoding (see the decoding sketch after this list).
  • Reports competitive zero-shot forecasting on Monash, Darts, and Informer-style ETT evaluations, often close to or better than supervised baselines trained per dataset.
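
As a rough illustration of the horizon arithmetic behind those longer output patches, here is a minimal sketch in plain Python; `forecast_next_patch` is a hypothetical stand-in for a model call, not a TimesFM API. It shows why an output patch of length 128 covers a 512-step horizon in 4 autoregressive steps, versus 16 steps for an output patch of length 32.

```python
import math

def num_decode_steps(horizon: int, output_patch_len: int) -> int:
    """Autoregressive steps needed to cover a forecast horizon."""
    return math.ceil(horizon / output_patch_len)

assert num_decode_steps(512, 128) == 4   # paper's output patch length
assert num_decode_steps(512, 32) == 16   # shorter output patches need more steps

def autoregressive_forecast(history, horizon, output_patch_len, forecast_next_patch):
    """Generic patch-wise decoding loop: predict a patch, append it to the
    context, and repeat until the horizon is covered, then truncate."""
    context = list(history)
    predicted = []
    for _ in range(num_decode_steps(horizon, output_patch_len)):
        patch = forecast_next_patch(context)  # hypothetical model call
        predicted.extend(patch)
        context.extend(patch)                 # feed predictions back as context
    return predicted[:horizon]
```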

Benchmarked Models

| Model | Role In Paper | Notes | Official Artifact |
| --- | --- | --- | --- |
| TimesFM-200M | Main benchmarked paper model | Decoder-only Transformer with 20 layers, model dimension 1280, 16 attention heads, input patch length 32, output patch length 128, and dropout 0.2. | google-research/timesfm |
| TimesFM-2.5-200M | Requested official released checkpoint | Public PyTorch checkpoint in the TimesFM family; useful as the current official artifact for reproducing or extending the 200M-scale TimesFM line. | google/timesfm-2.5-200m-pytorch |
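
For anyone reproducing with the released checkpoint listed above, the artifact can be fetched from the Hugging Face Hub as sketched below. This snippet only downloads the files; the actual loading and forecasting entry points live in the timesfm package and should be checked against google-research/timesfm, since they are not specified in the paper.

```python
# Minimal fetch of the TimesFM 2.5 200M PyTorch checkpoint files.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

checkpoint_dir = snapshot_download(repo_id="google/timesfm-2.5-200m-pytorch")
print(checkpoint_dir)  # local directory containing the downloaded checkpoint files
```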

Method Notes

TimesFM is a passive dynamics model for numeric time series: it forecasts future observations from historical observations and does not include an explicit action, control input, intervention, or exogenous-variable channel during pretraining.

The model treats patches as the time-series analogue of tokens. Input patches are mapped through a residual block into Transformer tokens; causal self-attention processes the token sequence; an output residual block predicts the next output patch. The output patch can be longer than the input patch, which reduces the number of autoregressive steps needed for long horizons.
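
A minimal PyTorch sketch of that layout follows, with the paper's reported hyperparameters plugged in as defaults. It is not the official implementation: positional encodings and input scaling/masking are omitted, the feed-forward width and pre-norm layers are assumptions made here, and `ResidualBlock` and `PatchedDecoderSketch` are names introduced only for illustration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two-layer MLP with a linear skip connection, used to map input patches
    into model tokens and tokens back into output patches."""
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.SiLU(),
                                 nn.Linear(hidden_dim, out_dim))
        self.skip = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.mlp(x) + self.skip(x)

class PatchedDecoderSketch(nn.Module):
    """Patched decoder-only stack: input patches -> tokens -> causal
    self-attention -> one output patch predicted per input position."""
    def __init__(self, input_patch_len=32, output_patch_len=128,
                 d_model=1280, n_heads=16, n_layers=20, dropout=0.2):
        super().__init__()
        self.embed = ResidualBlock(input_patch_len, d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,  # assumed width
                                           dropout=dropout, batch_first=True,
                                           norm_first=True)
        self.stack = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = ResidualBlock(d_model, d_model, output_patch_len)

    def forward(self, patches):
        # patches: (batch, num_patches, input_patch_len)
        tokens = self.embed(patches)
        n = tokens.size(1)
        # Additive causal mask: each position may attend only to earlier patches.
        causal = torch.triu(torch.full((n, n), float("-inf"),
                                       device=tokens.device), diagonal=1)
        hidden = self.stack(tokens, mask=causal)
        # Each position predicts the output_patch_len values that follow its
        # input patch; the output patch is longer than the input patch.
        return self.head(hidden)  # (batch, num_patches, output_patch_len)
```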

The pretraining mixture combines real and synthetic data. The paper describes a training loader that draws roughly 80% real and 20% synthetic series, with the real portion balanced across hourly/sub-hourly, daily, weekly, and monthly granularity groups. Synthetic data is included to improve generalization to underrepresented temporal granularities and periodicities.
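
A minimal sketch of that kind of weighted loader, assuming hypothetical granularity-group labels and uniform sampling within each group:

```python
import random

REAL_GROUPS = ["hourly_or_finer", "daily", "weekly", "monthly"]  # assumed labels

def sample_training_series(real_datasets, synthetic_series, rng=random):
    """real_datasets: dict mapping a granularity group to a list of series;
    synthetic_series: list of synthetic series. Roughly 80% of draws come from
    real data, balanced across granularity groups; the remaining 20% are synthetic."""
    if rng.random() < 0.8:
        group = rng.choice(REAL_GROUPS)          # equal weight per granularity group
        return rng.choice(real_datasets[group])  # uniform within the chosen group
    return rng.choice(synthetic_series)
```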

Evidence And Results

  • Monash: the paper reports TimesFM as the top model on geometric-mean scaled MAE among the compared methods, while remaining zero-shot.
  • Darts: TimesFM is statistically close to the best methods, including seasonal ARIMA and llmtime, on the average scaled MAE view.
  • Informer-style ETT tasks: TimesFM and PatchTST are the strongest methods in the reported 96-step and 192-step horizon comparisons.
  • Ablations support the design choices: larger compute improves Monash scaled MAE, longer output patches improve 512-step ETT forecasting, input patch length 16 or 32 performs best among tested choices, and adding synthetic data helps underrepresented granularities.

Limitations

  • The paper focuses on point forecasting; probabilistic forecasting heads are described as possible future extensions rather than the main evaluated setting.
  • The pretrained model does not use covariates, so exogenous variables and known future features require additional inference-time or fine-tuning strategies.
  • It is a forecasting foundation model, not an action-conditioned world model; it cannot directly answer counterfactual questions about interventions without additional causal structure.
  • The paper reports that foundation-model forecasts can fail on individual inputs, so high-stakes deployment still needs broad task-specific evaluation and human review.

Open Questions

  • How much of TimesFM’s zero-shot transfer comes from decoder-only patch training versus corpus scale and mixture design?
  • Can the TimesFM line incorporate covariates or multivariate coupling without losing broad zero-shot behavior?
  • How should TimesFM-style passive forecasting models be combined with action-conditioned world models when interventions or control inputs are central?