Timer: Generative Pre-trained Transformers Are Large Time Series Models
Source
- Raw Markdown: paper_timer-2024.md
- PDF: paper_timer-2024.pdf
- Preprint: arXiv 2402.02368
- Benchmark source: thuml/Timer
- Paper source: thuml/Large-Time-Series-Model
- Official checkpoint: thuml/timer-base-84m
Core Claim
Timer argues that a GPT-style decoder-only Transformer trained directly on large time-series corpora can become a large time-series model with useful few-shot, zero-shot, scaling, and task-general behavior across forecasting, imputation, and anomaly detection.
Key Contributions
- Curates the Unified Time Series Dataset (UTSD), a hierarchical pretraining corpus spanning seven domains and up to 1B time points in the main paper setup.
- Introduces single-series sequence (S3) formatting, which normalizes and pools individual variates into fixed-context single-series sequences so that heterogeneous univariate, multivariate, and irregular time series can be used for pretraining (see the preprocessing sketch after this list).
- Uses segment tokens, causal attention, and next-token mean-squared-error supervision in a decoder-only Transformer, adapting the generative pretraining recipe from language models to time series.
- Casts forecasting, segment-level imputation, and predictive anomaly detection as generative tasks over future or masked segments.
- Establishes a zero-shot forecasting comparison among Timer, Moirai, MOMENT, Chronos, TimesFM, Lag-Llama, and TimeGPT-1, while noting that zero-shot scaling behavior remains uneven.
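A minimal sketch of S3-style preprocessing as described in the list above: each variate is normalized independently and split into fixed-context single-series windows that can be pooled into one corpus. The function name, the 672-point context, and the NumPy implementation are illustrative assumptions, not the paper's released preprocessing code.

```python
import numpy as np

def to_single_series_sequences(series, context_len=672):
    """Decompose a multivariate series of shape (T, C) into normalized,
    fixed-length single-variate windows (S3-style pooling). The context
    length and per-variate z-score normalization are assumptions."""
    windows = []
    for c in range(series.shape[1]):
        x = series[:, c].astype(np.float64)
        # Normalize each variate so heterogeneous scales can share one corpus.
        x = (x - x.mean()) / (x.std() + 1e-8)
        # Split into non-overlapping fixed-context single-series sequences.
        for start in range(0, len(x) - context_len + 1, context_len):
            windows.append(x[start:start + context_len])
    return np.stack(windows) if windows else np.empty((0, context_len))

# Usage: a 2-variate series of 2,000 points yields pooled univariate windows.
corpus = to_single_series_sequences(np.random.randn(2000, 2))
print(corpus.shape)  # (4, 672): two non-overlapping windows per variate
```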
Benchmarked Models
| Model | Role In Paper | Notes | Official Artifact |
|---|---|---|---|
| Timer-Base-84M | Requested benchmarked Timer checkpoint | Official Hugging Face release for the Timer line: decoder-only causal Transformer with 84M parameters, patch length 96, context length up to 2880, and univariate pretraining on 260B time points. The paper's own results report Timer variants at 29M, 50M, and 67M parameters, plus zero-shot variants scaled by pretraining corpus size; this 84M checkpoint is the current requested public artifact for benchmarking (a minimal loading sketch follows the table). | thuml/timer-base-84m |
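A hedged loading sketch for this checkpoint, assuming the Hugging Face `transformers` causal-LM interface with remote code enabled; the `generate` call, normalization step, and output shape follow the checkpoint's model card as best understood here and should be confirmed against thuml/timer-base-84m before benchmarking.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumption: the checkpoint exposes a causal-LM-style generate() over raw
# time points via trust_remote_code; verify against the official model card.
model = AutoModelForCausalLM.from_pretrained("thuml/timer-base-84m", trust_remote_code=True)

lookback, horizon = 672, 96         # horizon of 96 matches the patch length
seqs = torch.randn(1, lookback)     # placeholder univariate history (batch, lookback)

# Normalize per series before generation, then de-normalize the forecast.
mean, std = seqs.mean(dim=-1, keepdim=True), seqs.std(dim=-1, keepdim=True)
normed = (seqs - mean) / (std + 1e-8)
forecast = model.generate(normed, max_new_tokens=horizon) * (std + 1e-8) + mean
print(forecast.shape)  # expected (1, 96); confirm against the model card
```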
Method Notes
Timer is a passive dynamics model: it predicts or reconstructs future observations from observed time-series history, but it does not include an explicit action, control input, or intervention channel. Its S3 format handles multivariate time series by decomposing channels into single-variate sequences, which improves dataset unification but does not directly model cross-channel dynamics as first-class structure.
The paper’s central architectural bet is that autoregressive generation over time-series segment tokens is a better scaling target than the encoder-only, direct multi-step prediction pattern common in smaller forecasting models. The resulting model can reuse the same next-token mechanism for variable lookback lengths, rolling forecasts, segment imputation, and predictive anomaly detection.
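A minimal PyTorch sketch of the generative recipe described above: segment (patch) tokens, causal self-attention, and next-token mean-squared-error supervision. The layer sizes, the 96-point patch length, and the use of `nn.TransformerEncoder` with an additive causal mask are illustrative assumptions, not the released Timer architecture.

```python
import torch
import torch.nn as nn

class TinySegmentDecoder(nn.Module):
    """Illustrative decoder-only Transformer over time-series segment tokens."""
    def __init__(self, patch_len=96, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_len, d_model)   # segment -> token embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, patch_len)    # token -> predicted next segment

    def forward(self, x):
        # x: (batch, n_patches, patch_len) segment tokens from a single series.
        n = x.size(1)
        causal = torch.full((n, n), float("-inf"), device=x.device).triu(1)
        h = self.blocks(self.embed(x), mask=causal)  # causal attention only
        return self.head(h)

# Next-token MSE: every segment is trained to predict the segment that follows it.
model = TinySegmentDecoder()
series = torch.randn(8, 672)                 # pooled single-series (S3) windows
patches = series.unfold(-1, 96, 96)          # (8, 7, 96) segment tokens
pred = model(patches[:, :-1])                # predictions for segments 2..7
loss = nn.functional.mse_loss(pred, patches[:, 1:])
loss.backward()
```

The same next-token head can then be reused for rolling forecasts (feeding generated segments back as context), variable lookback lengths, and masked-segment imputation, which is the reuse the paragraph above describes.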
Evidence And Results
- Few-shot forecasting: fine-tuned pretrained Timer matches or beats strong full-data small-model baselines in several low-data settings, including 1% ETTh1, 5% Traffic, 3% PEMS03, and 25% PEMS04.
- Imputation: in segment-level imputation, Timer is reported to outperform TimesNet in all 44 scenarios with 5% downstream samples, 86.4% of scenarios with 20% samples, and 56.8% of scenarios with full samples.
- Anomaly detection: Timer uses predictive anomaly detection on UCR Anomaly Archive tasks, comparing forecasted normal segments with observed future segments rather than reconstructing the same input window (a scoring sketch follows this list).
- Scaling: model-size and data-scale experiments report improved PEMS forecasting as Timer grows from small decoder-only variants toward larger pretrained variants.
- Zero-shot forecasting: Timer is one of the top-ranked models in the paper’s large time-series model comparison, alongside Moirai and TimesFM, but the paper cautions that stronger zero-shot generalization and synchronized data/model scaling are still open problems.
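A small sketch of the predictive anomaly detection recipe referenced in the list above: score each observed future segment by its error against the model's forecast and flag high-error segments. The segment length and the quantile threshold are illustrative assumptions, not the paper's evaluation protocol.

```python
import numpy as np

def segment_anomaly_scores(observed, forecast, seg_len=96):
    """Per-segment MSE between observed future values and forecasted
    'normal' values; higher scores mean more anomalous segments.
    The segment length is an assumption for illustration."""
    n = len(observed) // seg_len
    obs = observed[: n * seg_len].reshape(n, seg_len)
    fc = forecast[: n * seg_len].reshape(n, seg_len)
    return ((obs - fc) ** 2).mean(axis=1)

scores = segment_anomaly_scores(np.random.randn(960), np.random.randn(960))
threshold = np.quantile(scores, 0.9)   # e.g. flag the top decile of segments
anomalous = np.where(scores > threshold)[0]
print(scores.shape, anomalous)         # (10,) plus the flagged segment indices
```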
Limitations
- Timer is not an action-conditioned world model; it is a passive time-series model unless extended with explicit actions, control inputs, or interventions.
- S3 channel decomposition makes heterogeneous pretraining easier, but it weakens direct modeling of coupled multivariate structure.
- The paper does not unify time-series classification in the same generative formulation and does not support probabilistic forecasting.
- UTSD is useful infrastructure but still small relative to later claims of tens or hundreds of billions of time points in time-series foundation model pretraining.
Links Into The Wiki
- Time-Series Foundation Models
- Synthetic Data For Time Series
- Time-Series Scaling And Efficiency
- Time-Series Benchmark Hygiene
- MOMENT
- Time-MoE
Open Questions
- How much of Timer’s transfer comes from decoder-only autoregression versus UTSD curation and S3 channel decomposition?
- Can Timer-style segment generation be extended to model coupled multivariate dynamics without losing the benefits of heterogeneous pretraining?
- What changes are needed to turn Timer from a passive dynamics model into an action-conditioned world model with explicit control inputs or interventions?