Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models
Source
- Raw Markdown: paper_kairos-2025.md
- PDF: paper_kairos-2025.pdf
- Preprint: arXiv 2509.25826
- Official project page: foundation-model-research.github.io/Kairos
- Official code: foundation-model-research/Kairos
- Official checkpoint: mldi-lab/Kairos_10m
- Official checkpoint: mldi-lab/Kairos_23m
- Official checkpoint: mldi-lab/Kairos_50m
Core Claim
Kairos argues that time-series foundation models can gain zero-shot forecasting generalization primarily through adaptive temporal abstraction rather than through larger parameter counts: dynamic patching, mixture-of-size encoding, and dynamic RoPE let the model adapt token granularity and positional scale to heterogeneous time-series structure.
Key Contributions
- Introduces a Mixture-of-Size Encoder that routes each coarse segment to a sparse set of patch-size experts, with null experts allowing the model to skip unnecessary granularities.
- Adds Dynamic Rotary Position Embedding (DRoPE), which modulates RoPE frequencies from instance-level spectral features and calibrates token positions for mixed patch sizes.
- Uses a Multi-Patch Decoder with learnable forecast tokens to predict multiple future patches in parallel, reducing the amount of autoregressive rollout needed for longer horizons (see the sketch after this list).
- Builds the Predictability-Stratified Time Series (PreSTS) pretraining corpus, over 300B time points sampled to prioritize predictable real-world sequences while adding complementary synthetic data.
- Reports zero-shot forecasting results on GIFT-Eval and Time-Series-Library, plus frozen-representation transfer results on UCR classification tasks.
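The multi-patch decoding idea can be illustrated with a minimal sketch: learnable forecast tokens attend to the encoded context and each token is projected to one future patch, so several patches come out of a single forward pass. The module names, layer counts, and dimensions below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MultiPatchDecoderSketch(nn.Module):
    """Sketch of a multi-patch decoder: learnable forecast tokens attend to the
    encoded context and each is projected to one future patch, so several
    patches are predicted in parallel. All names and sizes are hypothetical."""

    def __init__(self, d_model=512, n_heads=8, num_forecast_tokens=4, patch_size=128):
        super().__init__()
        # One learnable query per future patch predicted in parallel.
        self.forecast_tokens = nn.Parameter(torch.randn(num_forecast_tokens, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        # Project each decoded forecast token back to a patch of raw values.
        self.head = nn.Linear(d_model, patch_size)

    def forward(self, context_tokens):
        # context_tokens: (batch, num_context_tokens, d_model) from the encoder.
        batch = context_tokens.size(0)
        queries = self.forecast_tokens.unsqueeze(0).expand(batch, -1, -1)
        decoded = self.decoder(tgt=queries, memory=context_tokens)
        # (batch, num_forecast_tokens * patch_size): one contiguous horizon chunk.
        return self.head(decoded).flatten(start_dim=1)
```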
Benchmarked Models
| Model | Role In Paper | Notes | Official Artifact |
|---|---|---|---|
| Kairos-10M | Mini benchmarked checkpoint | The paper’s mini configuration uses 4 layers, 4 heads, a model width of 256, 10M parameters, and patch sizes {32, 64, 128}. | mldi-lab/Kairos_10m |
| Kairos-23M | Small benchmarked checkpoint | The paper’s small configuration uses 4 layers, 8 heads, a model width of 384, 23M parameters, and patch sizes {32, 64, 128}. On GIFT-Eval, it is reported ahead of several larger zero-shot TSFMs by normalized MASE. | mldi-lab/Kairos_23m |
| Kairos-50M | Base released checkpoint | The paper’s base configuration is reported as 53M parameters with 6 layers, 8 heads, a model width of 512, and patch sizes {32, 64, 128, 256}; the released official artifact is the Kairos_50m checkpoint. | mldi-lab/Kairos_50m |
Method Notes
Kairos is a passive forecasting model: it predicts future numeric observations from historical observations and does not introduce an explicit action, control input, treatment, or intervention channel. It handles multivariate time series with channel-independent modeling, so each variable is treated as an individual sequence rather than through native cross-channel dynamics.
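A minimal sketch of what channel-independent handling means in practice, assuming a hypothetical forecaster API `model(univariate_series, horizon)`: each variable is folded into the batch dimension and forecast as its own univariate sequence, then the outputs are reassembled.

```python
import torch

def forecast_channel_independent(model, series, horizon):
    """Illustration of channel-independent modeling (hypothetical API): each
    variable in a multivariate series is forecast as its own univariate
    sequence, so the model never mixes information across channels.

    series: (batch, time, num_vars) -> returns (batch, horizon, num_vars)
    """
    batch, time, num_vars = series.shape
    # Fold the variable dimension into the batch: every channel is a sample.
    univariate = series.permute(0, 2, 1).reshape(batch * num_vars, time)
    preds = model(univariate, horizon)  # assumed to return (batch * num_vars, horizon)
    return preds.reshape(batch, num_vars, horizon).permute(0, 2, 1)
```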
The Mixture-of-Size Encoder first partitions a sequence into coarse segments, then routes each segment to selected patch-size experts. This creates a variable effective tokenization: stable regions can use coarser tokens, while volatile or high-information regions can use finer tokens.
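A rough sketch of the routing step under these assumptions: each coarse segment embedding is scored against one expert per patch size plus a null expert, and only the top-k experts are kept. The class name, shapes, and top-k renormalization are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSizeRouterSketch(nn.Module):
    """Sketch of segment-level routing over patch-size experts. The expert at
    index len(patch_sizes) is a "null" expert: selecting it means skipping
    that granularity for the segment. Names and shapes are assumptions."""

    def __init__(self, d_model=256, patch_sizes=(32, 64, 128), top_k=2):
        super().__init__()
        self.patch_sizes = patch_sizes
        self.top_k = top_k
        # One routing logit per patch-size expert, plus one for the null expert.
        self.router = nn.Linear(d_model, len(patch_sizes) + 1)

    def forward(self, segment_embeddings):
        # segment_embeddings: (batch, num_segments, d_model), one per coarse segment.
        weights = F.softmax(self.router(segment_embeddings), dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        # Renormalize over the selected experts; downstream encoding would run
        # only the chosen patch sizes and weight their outputs by top_w.
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)
        return top_idx, top_w
```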
DRoPE addresses the fact that mixed patch sizes break the usual assumption that token index is a uniform proxy for elapsed time. It combines instance-specific spectral modulation of RoPE frequencies with granularity-aware position calibration, so attention can reflect both periodic structure and physical time distance.
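A hedged sketch of the two ingredients follows; the exact formulas are assumptions rather than the paper's. Token positions are measured in cumulative raw time steps so tokens of mixed patch sizes stay comparable, and the RoPE inverse frequencies are rescaled by an instance-level factor derived from a dominant spectral period estimated via FFT.

```python
import torch

def dominant_period_fft(x):
    """Estimate a series' dominant period from the FFT magnitude peak."""
    spectrum = torch.fft.rfft(x - x.mean()).abs()
    spectrum[0] = 0.0  # drop the zero-frequency (mean) bin
    k = int(spectrum.argmax())
    return len(x) / max(k, 1)

def drope_like_angles(patch_sizes, dim=64, base=10000.0,
                      dominant_period=None, reference_period=24.0):
    """Hypothetical DRoPE-style rotation angles for tokens of mixed patch sizes.

    patch_sizes: 1D tensor of per-token patch lengths.
    Returns angles of shape (num_tokens, dim // 2).
    """
    # Granularity-aware calibration: token positions are measured in raw time
    # steps (cumulative patch length), so a 32-step token and a 128-step token
    # are separated by physical time distance rather than token index.
    lengths = patch_sizes.float()
    positions = torch.cumsum(lengths, dim=0) - lengths

    # Standard RoPE inverse frequencies.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

    # Instance-level spectral modulation (assumed form): rescale all
    # frequencies by a factor tied to the instance's dominant period.
    if dominant_period is not None:
        inv_freq = inv_freq * (reference_period / dominant_period)

    return positions[:, None] * inv_freq[None, :]
```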
Evidence And Results
- On GIFT-Eval, the paper reports Kairos-Base with the best normalized MASE among the compared methods and the second-best CRPS, while Kairos-Small also achieves a stronger MASE than larger zero-shot TSFMs such as Toto and Sundial.
- On Time-Series-Library zero-shot forecasting, Kairos-Mini is reported to outperform recent TSFMs and most full-shot deep learning baselines in the paper’s aggregate comparison.
- Ablations attribute the GIFT-Eval gains to the combined architecture: replacing the adaptive encoder with fixed patching, removing DRoPE, or reverting to single-patch autoregressive decoding all worsens normalized MASE.
- Routing interventions support the segment-level adaptation claim: uniform granularity weights and shuffled routing decisions degrade performance substantially compared with the full model.
- Matched-data comparisons suggest architecture is the primary source of the reported gains, with PreSTS adding a smaller but useful contribution.
Limitations
- The model is focused on forecasting; anomaly detection, imputation, and broader task support are left for future versions, though the paper includes a classification-transfer appendix.
- Channel-independent modeling means Kairos does not explicitly capture inter-variable dependencies in multivariate time series.
- The benchmark evidence is centered on GIFT-Eval and selected Time-Series-Library datasets, so downstream users should check domain, frequency, and horizon match before treating the reported zero-shot results as general.
- Because Kairos is not action-conditioned, it is not directly a world model for interventions or controllable dynamics without adding explicit control-input structure.
Links Into The Wiki
- Time-Series Foundation Models
- Synthetic Data For Time Series
- Time-Series Scaling And Efficiency
- Time-Series Benchmark Hygiene
- Sundial
- Tiny Time Mixers
- TiRex
Open Questions
- How much of Kairos’s advantage survives when native multivariate channel mixing is added without losing the parameter-efficiency gains from adaptive tokenization?
- Can the segment-level router become a useful interpretability signal for regime changes, anomaly boundaries, or forecast difficulty?
- Would explicit covariates, actions, control inputs, or interventions fit naturally into the mixture-of-size tokenization scheme, or would they require a separate event-stream representation?