Time-Series Scaling And Efficiency

Summary

The time-series foundation model cluster does not have one settled scaling path. Some papers argue for larger dense or sparse models, while others argue that compact backbones, adaptive tokenization, recurrent state, or compression can match larger systems under the right benchmark and horizon.

Large-Model And Sparse-Capacity Direction

Toto 2.0 explicitly frames forecasting as entering a scaling era, with open-weights checkpoints from 4M to 2.5B parameters and reported monotonic gains through the largest released size. Its scaling claim is tied to pretraining on observability and synthetic data, contiguous patch masking, and fast long-horizon inference.

Time-MoE scales total capacity with sparse temporal mixture-of-experts layers, keeping activated parameters lower than total parameters. Moirai-MoE uses token-level expert routing inside the Moirai family and argues that learned pattern specialization is better than hand-assigned frequency buckets.
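
As a rough illustration of the sparse-capacity mechanism, the sketch below routes each temporal token to its top-k experts so that only a fraction of the total parameters is active per token. The dimensions, expert count, and top_k value are illustrative assumptions, not the configurations used by Time-MoE or Moirai-MoE.

```python
# Minimal sketch of a sparse mixture-of-experts layer over temporal tokens.
# Dimensions, top_k, and the expert MLP shape are illustrative assumptions,
# not the configurations used by Time-MoE or Moirai-MoE.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseTemporalMoE(nn.Module):
    def __init__(self, d_model=256, d_hidden=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (batch, tokens, d_model)
        scores = self.router(x)                # (B, T, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so activated compute
        # stays near top_k / num_experts of the dense equivalent.
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                  # (B, T, top_k) boolean
            if not mask.any():
                continue
            b, t, slot = mask.nonzero(as_tuple=True)
            out[b, t] += weights[b, t, slot].unsqueeze(-1) * expert(x[b, t])
        return out

tokens = torch.randn(4, 96, 256)               # e.g. 96 patch tokens per series
print(SparseTemporalMoE()(tokens).shape)       # torch.Size([4, 96, 256])
```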

Sundial, Timer, and TimesFM continue the dense sequence-modeling direction through decoder-only Transformers, continuous flow-matching forecast heads, segment generation, patching, and large pretraining corpora.
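
Of these components, the continuous flow-matching forecast head is the least conventional; the following is a minimal, generic flow-matching objective that learns a velocity field carrying noise toward the future window. The linear interpolation path, network shape, and conditioning interface are illustrative assumptions, not Sundial's exact TimeFlow head.

```python
# Generic flow-matching objective for a forecast head: learn a velocity field
# v(x_t, t | h) that transports noise x0 toward the future window x1.
# The straight-line path and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VelocityHead(nn.Module):
    def __init__(self, horizon=32, d_cond=256, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(horizon + d_cond + 1, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, horizon))

    def forward(self, x_t, t, h):              # x_t: (B, H), t: (B, 1), h: (B, d_cond)
        return self.net(torch.cat([x_t, t, h], dim=-1))

def flow_matching_loss(head, h, x1):
    x0 = torch.randn_like(x1)                  # noise sample
    t = torch.rand(x1.size(0), 1)              # random interpolation time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                # point on the straight path
    target = x1 - x0                           # velocity of that path
    return ((head(x_t, t, h) - target) ** 2).mean()

head = VelocityHead()
h = torch.randn(8, 256)                        # encoder summary of the context window
x1 = torch.randn(8, 32)                        # future values to be modelled
print(flow_matching_loss(head, h, x1))         # scalar training loss
```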

Compact And Specialized Direction

Tiny Time Mixers (TTM) shows that 1M to 5M parameter mixer-style forecasters can be strong zero-shot and few-shot baselines, especially when the backbone is pretrained channel-independently and the head handles target adaptation and exogenous variables.
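
A minimal sketch of the channel-independent pattern this relies on: the same tiny backbone processes every channel separately, and only a light head folds in known future exogenous inputs. The layer sizes and the crude pooling of exogenous channels are assumptions for illustration, not TTM's actual head design.

```python
# Channel-independent backbone + light adaptation head (illustrative sizes).
# The backbone never sees channel identity; only the head mixes targets with
# exogenous inputs, which keeps pretraining corpus-agnostic.
import torch
import torch.nn as nn

class ChannelIndependentForecaster(nn.Module):
    def __init__(self, context=96, horizon=24, d_hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(            # shared across all channels
            nn.Linear(context, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, d_hidden))
        self.head = nn.Linear(d_hidden + horizon, horizon)

    def forward(self, y, x_future):
        # y: (B, C, context) target history; x_future: (B, E, horizon) known exogenous
        B, C, L = y.shape
        z = self.backbone(y.reshape(B * C, L)).reshape(B, C, -1)
        exo = x_future.mean(dim=1, keepdim=True).expand(B, C, -1)  # crude pooling
        return self.head(torch.cat([z, exo], dim=-1))              # (B, C, horizon)

model = ChannelIndependentForecaster()
print(model(torch.randn(2, 7, 96), torch.randn(2, 3, 24)).shape)   # (2, 7, 24)
```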

Reverso pushes the compact argument further with a 550K-parameter model built from long convolutions, DeltaNet-style linear recurrence, an attention decoder, flip equivariance, and FFT-guided downsampling.
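
The FFT-guided downsampling idea can be illustrated roughly as choosing a stride from the dominant spectral period so that the downsampled series still resolves it; the rule of four samples per period below is an assumption, not Reverso's procedure.

```python
# Rough illustration of FFT-guided downsampling: pick a stride from the
# dominant spectral period so the downsampled series still resolves it.
# The "4 samples per period" rule is an assumption, not Reverso's procedure.
import numpy as np

def fft_guided_stride(x, samples_per_period=4, max_stride=8):
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x))
    k = spectrum[1:].argmax() + 1               # skip the DC bin
    period = 1.0 / freqs[k]                     # dominant period in samples
    stride = max(1, int(period // samples_per_period))
    return min(stride, max_stride)

t = np.arange(1024)
x = np.sin(2 * np.pi * t / 64) + 0.1 * np.random.randn(1024)
stride = fft_guided_stride(x)
print(stride, x[::stride].shape)                # stride 8 keeps ~8 samples per cycle
```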

Kairos argues for adaptive temporal abstraction rather than parameter count alone. Its mixture-of-size encoder, dynamic RoPE, and multi-patch decoder let small models adapt token granularity to local sequence structure.
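
A toy version of adaptive token granularity: smoother windows are covered by longer patches and volatile windows by shorter ones, so fewer tokens are spent where there is little structure. The variance thresholds and patch sizes are placeholders, not Kairos's learned mixture-of-size routing.

```python
# Toy adaptive patching: smoother windows get longer patches (fewer tokens),
# volatile windows get shorter ones. Thresholds are placeholders, not Kairos's
# learned mixture-of-size routing.
import numpy as np

def adaptive_patches(x, window=64, sizes=(8, 16, 32)):
    patches = []
    for start in range(0, len(x) - window + 1, window):
        seg = x[start:start + window]
        vol = np.std(np.diff(seg))              # local variability proxy
        size = sizes[0] if vol > 1.0 else sizes[1] if vol > 0.3 else sizes[2]
        patches += [seg[i:i + size] for i in range(0, window, size)]
    return patches

x = np.concatenate([np.zeros(256), np.random.randn(256) * 2])
patches = adaptive_patches(x)
print(len(patches), "tokens instead of", 512 // 8, "at the finest granularity")
```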

TiRex uses xLSTM recurrent state plus contiguous patch masking to preserve long-horizon forecast state without relying only on larger Transformer capacity.
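
Contiguous patch masking hides one run of adjacent patches rather than scattered ones, which forces the model to carry state across a gap instead of interpolating between neighbors; a minimal sketch follows, with the patch count and span chosen arbitrarily.

```python
# Minimal contiguous patch masking: hide one run of adjacent patches so the
# model must reconstruct a gap, not isolated tokens. Span length is illustrative.
import torch

def contiguous_patch_mask(num_patches, span, batch_size):
    mask = torch.zeros(batch_size, num_patches, dtype=torch.bool)
    starts = torch.randint(0, num_patches - span + 1, (batch_size,))
    for b, s in enumerate(starts):
        mask[b, s:s + span] = True              # True = masked / to predict
    return mask

mask = contiguous_patch_mask(num_patches=32, span=8, batch_size=4)
print(mask.int())                                # each row has one block of 8 ones
```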

FlowState uses an SSM encoder, functional basis decoder, and parallel forecasts to make small models adapt context length, target length, and sampling rate without relying on patching or large Transformer capacity.
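
The functional basis decoder idea is what decouples the forecast from a fixed grid: the model predicts coefficients of a continuous function of time, and forecasts at any target length or sampling rate come from evaluating that function on the desired grid. The Fourier basis and sizes below are illustrative assumptions, not FlowState's exact decoder.

```python
# Illustrative functional basis decoder: predict coefficients of a continuous
# function of time, then evaluate it on any output grid. The Fourier basis and
# sizes are assumptions, not FlowState's exact decoder.
import torch
import torch.nn as nn

class FourierBasisDecoder(nn.Module):
    def __init__(self, d_state=128, n_freqs=16):
        super().__init__()
        self.n_freqs = n_freqs
        self.coeffs = nn.Linear(d_state, 2 * n_freqs + 1)    # sines, cosines, bias

    def forward(self, state, t):
        # state: (B, d_state) encoder summary; t: (T,) times in [0, 1] at any rate
        c = self.coeffs(state)                                # (B, 2*n_freqs + 1)
        k = torch.arange(1, self.n_freqs + 1, dtype=t.dtype)
        basis = torch.cat([torch.sin(2 * torch.pi * t[:, None] * k),
                           torch.cos(2 * torch.pi * t[:, None] * k),
                           torch.ones(len(t), 1)], dim=-1)    # (T, 2*n_freqs + 1)
        return c @ basis.T                                    # (B, T) forecast

decoder = FourierBasisDecoder()
state = torch.randn(4, 128)
coarse = decoder(state, torch.linspace(0, 1, 24))     # 24-step horizon
fine = decoder(state, torch.linspace(0, 1, 96))       # same forecast, 4x sampling rate
print(coarse.shape, fine.shape)                       # (4, 24) (4, 96)
```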

Moirai 2.0 is another efficiency counterpoint: the released small model is reported to be stronger than larger Moirai 2.0 variants on the paper's GIFT-Eval aggregate, while also simplifying the original Moirai interface.

TabPFN-3 is mostly a static tabular model, but it is useful as a scaling analogy because its report combines row compression, reduced KV caching, row chunking, and a specialized TabPFN-TS-3 checkpoint. The lesson to port carefully is context compression for large structured inputs, not that static table rows are temporal tokens.

Compression And Rank Structure

FlowRanks studies low-rank structure in time-series Transformers and uses that structure to compress Chronos-style models. It suggests that time-series representations may have stronger rank decay than language or vision features, which would make after-the-fact compression or rank-aware architecture design unusually valuable.
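
The compression move that rank decay enables can be shown directly: factor a pretrained weight matrix with a truncated SVD and replace one linear layer with two thin ones. The fixed rank below is arbitrary; a rank-aware method would choose it per layer from the singular value spectrum rather than using the guess shown here.

```python
# Truncated-SVD compression of one pretrained linear layer: replace W (out x in)
# with two thin factors. The rank here is arbitrary; a rank-decay analysis would
# choose it per layer from the singular value spectrum.
import torch
import torch.nn as nn

def low_rank_compress(linear, rank):
    W = linear.weight.data                       # (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]                   # (out, rank)
    B = Vh[:rank]                                # (rank, in)
    down = nn.Linear(W.shape[1], rank, bias=False)
    up = nn.Linear(rank, W.shape[0], bias=linear.bias is not None)
    down.weight.data.copy_(B)
    up.weight.data.copy_(A)
    if linear.bias is not None:
        up.bias.data.copy_(linear.bias.data)
    return nn.Sequential(down, up)

layer = nn.Linear(512, 512)
compressed = low_rank_compress(layer, rank=64)
x = torch.randn(8, 512)
err = (layer(x) - compressed(x)).abs().max()
params = sum(p.numel() for p in compressed.parameters())
print(f"params {params} vs {sum(p.numel() for p in layer.parameters())}, max err {err:.3f}")
```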

U-Cast adds a channel-dimension scaling case: full channel attention can become impractical when a multivariate time series has thousands of channels, so the model uses hierarchical latent queries and upsampling to reduce channel compute while retaining cross-channel structure.
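
A single-level sketch of the latent-query mechanism: a small set of learned queries attends over all channels, and channels then read back from those summaries, so channel mixing costs O(C·K) rather than O(C²). The hierarchy, query count, and upsampling path of U-Cast are simplified away here.

```python
# Single-level sketch of latent-query channel mixing: K learned queries attend
# over C channels, then channels read back from the K summaries, so cost is
# O(C*K) rather than O(C^2). Hierarchy and upsampling are simplified away.
import torch
import torch.nn as nn

class LatentChannelMixer(nn.Module):
    def __init__(self, d_model=64, num_latents=16, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d_model))
        self.compress = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.expand = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, d_model) channel embeddings
        q = self.latents.expand(x.size(0), -1, -1)
        summary, _ = self.compress(q, x, x)      # (B, K, d_model): channels -> latents
        mixed, _ = self.expand(x, summary, summary)  # latents -> back to each channel
        return x + mixed                         # residual keeps channel-specific detail

x = torch.randn(2, 4096, 64)                     # thousands of channels
print(LatentChannelMixer()(x).shape)             # torch.Size([2, 4096, 64])
```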

Architecture Tradeoffs To Track

  • Dense decoder-only Transformers scale naturally but can be costly at long context and long horizon.
  • Sparse MoE models increase total capacity while keeping activated compute lower, but memory footprint, routing stability, and serving complexity remain concerns.
  • Mixer, convolution, xLSTM, and linear-RNN hybrids can be much smaller, but may need carefully matched training and inference recipes.
  • Continuous basis decoders can expose flexible sampling rates and horizons, but they make the coefficient-to-observation interface part of the modeling contract.
  • Adaptive tokenization can reduce wasted tokens, but it complicates position encoding, batching, and multivariate alignment.
  • Point-wise numeric value embeddings preserve temporal resolution, but they may increase token count relative to patching and need careful treatment of exogenous variables and control inputs.
  • Channel-independent univariate modeling improves corpus unification and serving simplicity, but it can miss native multivariate dynamics.
  • Hierarchical channel compression can reduce cost in high-dimensional multivariate forecasting, but it must preserve channel-specific deviations rather than only global shared trends.
  • Direct multi-patch, contiguous-mask, or one-pass horizon prediction reduces sequential decoding cost, but may trade off long-horizon uncertainty propagation (see the sketch after this list).
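
To make the last tradeoff concrete, the sketch below contrasts a one-pass multi-horizon head, which emits the whole horizon from a single forward call, with autoregressive decoding, which needs one call per step and feeds each prediction back in. The GRU backbone and sizes are arbitrary illustration choices.

```python
# Contrast in decoding cost: a one-pass multi-horizon head emits all H steps
# from one forward call, while autoregressive decoding needs H calls (and each
# call sees its own earlier predictions). Shapes are illustrative.
import torch
import torch.nn as nn

d_model, horizon = 64, 24
encoder = nn.GRU(1, d_model, batch_first=True)
one_pass_head = nn.Linear(d_model, horizon)      # all H steps from the last state
step_head = nn.Linear(d_model, 1)                # one step at a time

context = torch.randn(8, 96, 1)
_, h = encoder(context)                          # h: (1, 8, d_model)

# One-pass: a single forward call for the whole horizon.
forecast_direct = one_pass_head(h[-1])           # (8, 24)

# Autoregressive: H sequential calls, feeding each prediction back in.
state, preds = h, []
last = context[:, -1:, :]
for _ in range(horizon):
    out, state = encoder(last, state)
    last = step_head(out)                        # (8, 1, 1) next-step prediction
    preds.append(last)
forecast_ar = torch.cat(preds, dim=1).squeeze(-1)  # (8, 24)
print(forecast_direct.shape, forecast_ar.shape)
```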

Evidence

The evidence is not a single scaling law. Toto 2.0 reports monotonic parameter scaling; Time-MoE and Moirai-MoE report sparse-routing gains; TTM, Reverso, Kairos, TiRex, FlowState, Moirai 2.0, and TabPFN-3 argue that architecture and inference design can beat raw parameter count in specific benchmark regimes. U-Cast adds that the channel dimension can become the scaling bottleneck even before parameter count dominates. EIDOS adds that point-wise scalar tokenization and latent prediction can improve representation geometry. Cross-paper comparisons should be routed through Time-Series Benchmark Hygiene before treating any ranking as settled.

Open Questions

  • Where does parameter scaling saturate for forecasting once benchmark leakage and fine-tuned or ensemble entries are separated?
  • Which compact architectures keep their advantage when native multivariate coupling and known future exogenous variables are required?
  • Which channel-compression mechanisms scale to tens of thousands of channels without erasing local deviations?
  • Can sparse expert routing specialize by regime, horizon, frequency, covariate structure, or incident phase in an interpretable way?
  • Are rank-aware designs better built into the model from the start, or applied as compression after pretraining?