Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models

Core Claim

Kairos argues that time-series foundation models can gain zero-shot forecasting generalization from adaptive temporal abstraction rather than from sheer parameter count: dynamic patching, mixture-of-size encoding, and dynamic RoPE let the model adapt token granularity and positional scale to heterogeneous time-series structure.

Key Contributions

  • Introduces a Mixture-of-Size Encoder that routes each coarse segment to a sparse set of patch-size experts, with null experts allowing the model to skip unnecessary granularities.
  • Adds Dynamic Rotary Position Embedding (DRoPE), which modulates RoPE frequencies from instance-level spectral features and calibrates token positions for mixed patch sizes.
  • Uses a Multi-Patch Decoder with learnable forecast tokens to predict multiple future patches in parallel, reducing the amount of autoregressive rollout needed for longer horizons.
  • Builds the Predictability-Stratified Time Series (PreSTS) pretraining corpus, over 300B time points sampled to prioritize predictable real-world sequences while adding complementary synthetic data.
  • Reports zero-shot forecasting results on GIFT-Eval and Time-Series-Library, plus frozen-representation transfer results on UCR classification tasks.
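
The multi-patch decoding idea in the contributions above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the `decode_multi_patch` helper, the linear map standing in for the transformer decoder, and all shapes are assumptions. The point it shows is that H learnable forecast tokens appended to the encoded history let one forward pass emit H future patches instead of H autoregressive rollout steps.

```python
import numpy as np

# Minimal sketch of parallel multi-patch decoding (illustrative only):
# H learnable "forecast tokens" are appended to the encoded history so a
# single forward pass yields H future patches, versus H rollout steps.
def decode_multi_patch(history_tokens, forecast_tokens, weight):
    """One linear 'decoder' pass mapping each forecast token to a patch."""
    seq = np.concatenate([history_tokens, forecast_tokens], axis=0)
    out = seq @ weight                    # stand-in for transformer blocks
    return out[-len(forecast_tokens):]    # predictions for the H tokens

rng = np.random.default_rng(1)
d_model, patch_len, horizon = 16, 32, 4
history = rng.normal(size=(10, d_model))
forecast = rng.normal(size=(horizon, d_model))   # learnable in practice
w = rng.normal(size=(d_model, patch_len))
patches = decode_multi_patch(history, forecast, w)
print(patches.shape)  # (4, 32): four future patches in one pass
```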

Benchmarked Models

| Model | Role In Paper | Notes | Official Artifact |
| --- | --- | --- | --- |
| Kairos-10M | Mini benchmarked checkpoint | 4 layers, 4 heads, 256 model width, 10M parameters, patch sizes {32, 64, 128}. | mldi-lab/Kairos_10m |
| Kairos-23M | Small benchmarked checkpoint | 4 layers, 8 heads, 384 model width, 23M parameters, patch sizes {32, 64, 128}. On GIFT-Eval, reported ahead of several larger zero-shot TSFMs by normalized MASE. | mldi-lab/Kairos_23m |
| Kairos-50M | Base released checkpoint | Reported as 53M parameters with 6 layers, 8 heads, 512 model width, and patch sizes {32, 64, 128, 256}; the official artifact is released as the 50m checkpoint. | mldi-lab/Kairos_50m |

Method Notes

Kairos is a passive forecasting model: it predicts future numeric observations from historical observations and does not introduce an explicit action, control input, treatment, or intervention channel. It handles multivariate time series with channel-independent modeling, so each variable is treated as an individual sequence rather than through native cross-channel dynamics.
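
Channel independence amounts to splitting the channel axis before modeling. A minimal sketch, with a hypothetical helper name (not from the paper):

```python
import numpy as np

# Illustrative sketch: channel-independent handling of a multivariate
# series. A (T, C) array becomes C separate univariate sequences, each
# forecast on its own; cross-channel dynamics are not modeled.
def to_univariate_channels(series: np.ndarray) -> list[np.ndarray]:
    """Split a (timesteps, channels) array into per-channel sequences."""
    return [series[:, c] for c in range(series.shape[1])]

multivariate = np.arange(12, dtype=float).reshape(6, 2)  # 6 steps, 2 vars
channels = to_univariate_channels(multivariate)
print(len(channels), channels[0].shape)  # 2 sequences, each of length 6
```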

The Mixture-of-Size Encoder first partitions a sequence into coarse segments, then routes each segment to selected patch-size experts. This creates a variable effective tokenization: stable regions can use coarser tokens, while volatile or high-information regions can use finer tokens.
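
The segment-to-expert routing can be illustrated with a toy top-k router. Everything here is an assumption for illustration, not the paper's implementation: the `route_segments` helper, the use of `None` as the null expert, and the random logits standing in for a learned router.

```python
import numpy as np

# Illustrative sketch: route each coarse segment to its top-k patch-size
# "experts"; the None entry plays the role of a null expert that lets the
# model skip tokenizing a segment at any extra granularity.
PATCH_SIZES = [32, 64, 128, None]  # None = null expert (skip)

def route_segments(scores: np.ndarray, k: int = 2):
    """scores: (num_segments, num_experts) router logits; pick top-k."""
    choices = []
    for row in scores:
        top = np.argsort(row)[::-1][:k]          # highest-scoring experts
        choices.append([PATCH_SIZES[i] for i in top])
    return choices

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, len(PATCH_SIZES)))  # 3 segments, 4 experts
assignments = route_segments(scores)
print(assignments)
```

A real sparse router would also weight each selected expert's output by its normalized score; the sketch keeps only the selection step.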

DRoPE addresses the fact that mixed patch sizes break the usual assumption that token index is a uniform proxy for elapsed time. It combines instance-specific spectral modulation of RoPE frequencies with granularity-aware position calibration, so attention can reflect both periodic structure and physical time distance.
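
Both DRoPE ingredients can be sketched under simplifying assumptions (the helper names and the single scalar `spectral_scale` are illustrative; in the paper the spectral modulation is derived per instance, not hand-set):

```python
import numpy as np

# Hedged sketch of the two DRoPE ideas, not the paper's exact formulation:
# (1) token positions are calibrated to physical time via cumulative patch
#     sizes rather than token index,
# (2) base RoPE frequencies are scaled by an instance-level spectral factor.
def calibrated_positions(patch_sizes: list[int]) -> np.ndarray:
    """Token positions at patch centers, in raw-time units."""
    ends = np.cumsum(patch_sizes)
    starts = ends - np.asarray(patch_sizes)
    return (starts + ends) / 2.0

def rope_angles(pos: np.ndarray, dim: int, spectral_scale: float = 1.0):
    """Rotation angles theta = scale * pos / 10000^(2i/dim), per pair i."""
    inv_freq = spectral_scale / (10000.0 ** (np.arange(0, dim, 2) / dim))
    return np.outer(pos, inv_freq)  # (num_tokens, dim/2)

pos = calibrated_positions([32, 32, 64, 128])  # mixed patch sizes
angles = rope_angles(pos, dim=8, spectral_scale=0.5)
print(pos, angles.shape)  # centers [16, 48, 96, 192], angles (4, 4)
```

Note how the third and fourth tokens sit much farther apart in calibrated position than their token indices suggest, which is exactly the mismatch the calibration is meant to fix.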

Evidence And Results

  • On GIFT-Eval, the paper reports Kairos-Base achieving the best normalized MASE among the compared methods and the second-best CRPS, while Kairos-Small also reports a stronger MASE than larger zero-shot TSFMs such as Toto and Sundial.
  • On Time-Series-Library zero-shot forecasting, Kairos-Mini is reported to outperform recent TSFMs and most full-shot deep learning baselines in the paper’s aggregate comparison.
  • Ablations attribute the GIFT-Eval gains to the combined architecture: replacing the adaptive encoder with fixed patching, removing DRoPE, or reverting to single-patch autoregressive decoding all worsen normalized MASE.
  • Routing interventions support the segment-level adaptation claim: uniform granularity weights and shuffled routing decisions degrade performance substantially compared with the full model.
  • Matched-data comparisons suggest architecture is the primary source of the reported gains, with PreSTS adding a smaller but useful contribution.

Limitations

  • The model is focused on forecasting; anomaly detection, imputation, and broader task support are left for future versions, though the paper includes a classification-transfer appendix.
  • Channel-independent modeling means Kairos does not explicitly capture inter-variable dependencies in multivariate time series.
  • The benchmark story is centered on GIFT-Eval and selected TSLib datasets, so downstream users should check domain, frequency, and horizon match before treating the reported zero-shot results as general.
  • Because Kairos is not action-conditioned, it is not directly a world model for interventions or controllable dynamics without adding explicit control-input structure.

Open Questions

  • How much of Kairos’s advantage survives when native multivariate channel mixing is added without losing the parameter-efficiency gains from adaptive tokenization?
  • Can the segment-level router become a useful interpretability signal for regime changes, anomaly boundaries, or forecast difficulty?
  • Would explicit covariates, actions, control inputs, or interventions fit naturally into the mixture-of-size tokenization scheme, or would they require a separate event-stream representation?