Scaling-laws for Large Time-series Models
Source
- Raw Markdown: paper_scaling-laws-large-time-series-models-2024.md
- PDF: paper_scaling-laws-large-time-series-models-2024.pdf
- Preprint: arXiv 2405.13867
Core Claim
Large decoder-only time-series Transformers exhibit LLM-like power-law scaling with parameter count, dataset size, and training compute. This is one of the key papers making time-series foundation models look like a justified scaling program rather than merely a collection of benchmark tricks.
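As a reference point, the generic power-law ansatz behind such claims is shown below. The form and symbols are the standard neural-scaling-law convention, not a quotation of the paper; the fitted exponents and constants are the paper's and are not reproduced here.

```latex
% Generic power-law ansatz for test loss versus scale (illustrative form only;
% see the paper for the fitted exponents and constants).
% N = parameter count, D = dataset size, C = training compute.
L(N) \propto N^{-\alpha_N}, \qquad
L(D) \propto D^{-\alpha_D}, \qquad
L(C) \propto C^{-\alpha_C}
```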
Key Contributions
- Trains decoder-only forecasting Transformers across roughly five orders of magnitude in model size.
- Builds a heterogeneous univariate time-series corpus with about 8 billion data points, 30,211,687 individual series, and 38 data sources.
- Measures scaling behavior with MSE, CRPS, and log-likelihood rather than only a single point-forecast metric (a sample-based CRPS sketch follows this list).
- Finds that architecture shape choices such as aspect ratio and number of attention heads matter relatively little compared with overall scale, across broad ranges.
- Uses a Student-t distribution head to handle heavy-tailed temporal observations more stably than a simple Gaussian or MSE-only head (a minimal head sketch also follows this list).
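To make the probabilistic metrics concrete, here is a minimal sample-based CRPS estimator using the standard energy form CRPS(F, y) ≈ E|X − y| − ½ E|X − X′|. This is an illustrative sketch, not the paper's evaluation code, and the example numbers are made up.

```python
import numpy as np

def crps_from_samples(samples: np.ndarray, y: float) -> float:
    """Monte-Carlo CRPS estimate for a scalar observation y.

    samples: 1-D array of draws from the predictive distribution.
    Uses CRPS(F, y) ~= E|X - y| - 0.5 * E|X - X'| with X, X' i.i.d. from F.
    """
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    # Pairwise expectation E|X - X'| estimated over all sample pairs.
    term2 = np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - 0.5 * term2

# Example: CRPS of a heavy-tailed predictive distribution against a true value of 1.0.
rng = np.random.default_rng(0)
draws = rng.standard_t(df=3, size=2000) * 0.5 + 1.2
print(crps_from_samples(draws, y=1.0))
```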
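And a minimal sketch of what a Student-t output head with a negative log-likelihood training signal can look like. The layer names, parameterization, and use of torch.distributions.StudentT are my illustration under assumed shapes; the paper's actual head may differ.

```python
import torch
import torch.nn as nn
from torch.distributions import StudentT

class StudentTHead(nn.Module):
    """Maps a hidden state to per-step Student-t parameters (df, loc, scale)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 3)  # raw (df, loc, scale)

    def forward(self, h: torch.Tensor) -> StudentT:
        raw_df, loc, raw_scale = self.proj(h).unbind(-1)
        df = 2.0 + nn.functional.softplus(raw_df)        # df > 2 keeps variance finite
        scale = nn.functional.softplus(raw_scale) + 1e-6  # strictly positive scale
        return StudentT(df=df, loc=loc, scale=scale)

# Training signal: negative log-likelihood of the next observation.
head = StudentTHead(d_model=64)
h = torch.randn(8, 32, 64)      # (batch, time, d_model) from the decoder
target = torch.randn(8, 32)     # next-step values
nll = -head(h).log_prob(target).mean()
nll.backward()
```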
Method Notes
The model family is a passive forecasting model. It predicts future numeric observations from historical observations and does not expose actions, control inputs, interventions, or counterfactual rollout channels.
The paper focuses on univariate time series. It explicitly leaves multivariate scaling laws, exogenous variables, and richer distribution heads for future work. That boundary matters: the result supports TSFM scale-up, but not yet native high-dimensional multivariate world models.
Evidence And Results
The strongest durable evidence is the power-law fit across parameters, data, and compute on in-sequence next-step test losses. The paper also shows that data scaling only becomes clear when dataset diversity is preserved while scaling the amount of data.
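For intuition about what fitting such a curve involves, here is a toy sketch on synthetic data using a simple log-log linear fit to recover a parameter-count exponent. All numbers are invented for illustration; none come from the paper, and the paper's fitting procedure may differ.

```python
import numpy as np

# Synthetic losses following L(N) = A * N**(-alpha) plus noise, mimicking a
# parameter-count scaling curve; these values are illustrative only.
rng = np.random.default_rng(0)
n_params = np.logspace(5, 8, num=10)              # ~1e5 to ~1e8 parameters
true_A, true_alpha = 5.0, 0.08
loss = true_A * n_params ** (-true_alpha) * np.exp(rng.normal(0, 0.01, n_params.size))

# In log space, log L = log A - alpha * log N, so a linear fit recovers (A, alpha).
slope, intercept = np.polyfit(np.log(n_params), np.log(loss), deg=1)
alpha_hat, A_hat = -slope, np.exp(intercept)
print(f"fitted alpha ~ {alpha_hat:.3f}, fitted A ~ {A_hat:.2f}")
```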
The paper’s argument is not “bigger always wins on every leaderboard.” It is narrower and more important: time-series forecasting appears to have predictable neural scaling behavior under broad, heterogeneous pretraining.
Alex Notes
- Alex flagged this paper, together with Scaling Law for Time Series Forecasting, as evidence that TSFMs have a right to exist: both show a scaling-law substrate analogous to LLMs.
Limitations
- Univariate-only study.
- Largest models are around 100M parameters, so extrapolation to billion-scale TSFMs is still an extrapolation.
- Forecasting is evaluated mainly through next-step or in-sequence loss; long-horizon rollout scaling is not the central experiment.
- Does not address text context, action conditioning, causal interventions, or native high-dimensional channel structure.
Links Into The Wiki
- Time-Series Foundation Models
- Time-Series Scaling And Efficiency
- Time-Series Benchmark Hygiene
- TimesFM
- Moirai
Open Questions
- Do the same scaling exponents hold for multivariate time series with channel coupling?
- How do scaling laws change when the model supports known future exogenous variables, actions, or interventions?
- Can compact models such as Tiny Time Mixers keep their advantage once compared under explicit data/model/compute scaling curves?