Time-Series Benchmark Hygiene
Summary
Time-series foundation model rankings are brittle unless task, protocol, context length, horizon, covariates, leakage controls, and adaptation mode all match across the results being compared. Future source pages should link here instead of repeating the same benchmark caveats.
Required Separations
- Zero-shot base-model results should be separated from few-shot adaptation, linear probing, full fine-tuning, and dataset-specific training.
- Fine-tuned and ensemble entries should be separated from base released checkpoints. Toto 2.0, for example, reports base models, a fine-tuned 2.5B variant, and a family-and-friends ensemble.
- Frozen feature extraction for classification should be separated from label-free zero-shot prediction. Many classification “zero-shot” results still train a Random Forest, SVM, logistic-regression head, or similar downstream classifier on target labels.
- Single-model results should be separated from representation fusion, such as MantisV2 plus TiViT features.
- Point forecasting, quantile forecasting, probabilistic forecasting, imputation, anomaly detection, classification, and reasoning benchmarks should not be ranked as if they measure one ability.
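One way to keep these separations from silently collapsing is to record them explicitly for every reported number. The sketch below is a hypothetical record schema; every class, field, and enum name is illustrative, not this wiki's actual data model. The point is only that adaptation mode, ensembling, and label use become mandatory fields rather than footnotes.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical schema: illustrative names only, not an existing data model.

class Task(Enum):
    POINT_FORECAST = "point_forecast"
    QUANTILE_FORECAST = "quantile_forecast"
    PROBABILISTIC_FORECAST = "probabilistic_forecast"
    IMPUTATION = "imputation"
    ANOMALY_DETECTION = "anomaly_detection"
    CLASSIFICATION = "classification"
    REASONING = "reasoning"

class Adaptation(Enum):
    ZERO_SHOT = "zero_shot"          # no target data touched at all
    FEW_SHOT = "few_shot"
    LINEAR_PROBE = "linear_probe"    # frozen features + trained head
    FINE_TUNED = "fine_tuned"
    DATASET_SPECIFIC = "dataset_specific"

@dataclass(frozen=True)
class ReportedResult:
    model: str
    task: Task
    adaptation: Adaptation
    is_ensemble: bool = False              # separates ensembles from base checkpoints
    trains_on_target_labels: bool = False  # True for RF/SVM/logistic heads trained on labels
    representation_fusion: bool = False    # True when features from several models are combined
```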
Benchmark Families
Forecasting benchmark names in this wiki include BOOM, GIFT-Eval, TIME, Chronos-ZS, fev-bench, Monash, LSF/LTSF, Time-Series-Library, Informer-style ETT tasks, Darts, Chronos Benchmark II, and Time-HD. These differ in horizon, frequency, metric, target/covariate interface, channel count, channel-dependency structure, and leakage policy.
Classification and representation-learning benchmarks include UCR and UEA. They test labeled shape or sequence discrimination, not direct future-value forecasting.
Static tabular benchmarks such as TALENT or small-data TabPFN-style suites are adjacent but should be kept separate from time-series evaluations because rows are not temporal histories by default.
TabPFN-3 makes this separation especially important. Its technical report includes static TabArena results, API/enterprise TabPFN-3-Plus and Thinking entries, and a specialized TabPFN-TS-3 time-series checkpoint; those are not one benchmark protocol.
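To make the tabular/temporal boundary concrete, here is a minimal sketch of the kind of featurization that turns a datetime-indexed series into static rows. The lag and calendar choices are assumptions for illustration, not the published TabPFN-TS recipe; the point is that a tabular model sees constructed rows, not temporal histories.

```python
import numpy as np
import pandas as pd

# Illustrative lag/calendar featurization; not any paper's actual pipeline.

def series_to_tabular(y: pd.Series, lags=(1, 2, 3, 7)) -> pd.DataFrame:
    """Turn a datetime-indexed series into one static row per time step."""
    frame = pd.DataFrame({"target": y})
    for lag in lags:
        frame[f"lag_{lag}"] = y.shift(lag)   # past values flattened into columns
    frame["dayofweek"] = y.index.dayofweek   # calendar context as static features
    frame["month"] = y.index.month
    return frame.dropna()                    # rows without a full lag history are dropped

idx = pd.date_range("2024-01-01", periods=60, freq="D")
table = series_to_tabular(pd.Series(np.sin(np.arange(60) / 3.0), index=idx))
# `table` is now static-tabular: each row stands alone, and temporal order
# survives only through whatever features were engineered into it.
```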
Leakage And Overlap Risks
Broad pretraining corpora can include public datasets, training splits, or near-duplicates that later appear in benchmark reports. Chronos-2 explicitly discusses GIFT-Eval overlap and reports a synthetic-only ablation for stricter zero-shot evidence. TiRex flags overlap risks for some baselines and reports an appendix update intended to remove full GIFT-Eval overlap. FlowState makes data-overlap handling part of its GIFT-style zero-shot claim and separately evaluates unseen sampling rates. Toto 2.0 says public forecasting datasets were excluded from pretraining, making leakage control part of its claim.
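As a rough illustration of what a leakage control actually has to do, the sketch below hashes z-normalized, coarsely quantized windows so that exact and near-exact copies of benchmark series can be flagged inside a pretraining corpus. It is a deliberately crude stand-in under simple assumptions, not the deduplication pipeline of Chronos-2, TiRex, FlowState, or Toto 2.0.

```python
import hashlib
import numpy as np

# Crude overlap check: z-normalization gives scale/shift invariance,
# rounding gives tolerance to tiny numeric differences. Real pipelines
# need fuzzier matching (shingling, embeddings, subsequence search).

def window_signatures(series: np.ndarray, window: int = 64, decimals: int = 2) -> set:
    sigs = set()
    for start in range(0, len(series) - window + 1, window):
        w = series[start:start + window].astype(float)
        if w.std() == 0:
            continue  # constant windows carry no identifying shape
        z = np.round((w - w.mean()) / w.std(), decimals)
        sigs.add(hashlib.sha1(z.tobytes()).hexdigest())
    return sigs

def overlap_fraction(pretrain: np.ndarray, benchmark: np.ndarray) -> float:
    """Fraction of benchmark windows whose signature also occurs in pretraining."""
    bench = window_signatures(benchmark)
    if not bench:
        return 0.0
    return len(bench & window_signatures(pretrain)) / len(bench)
```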
Synthetic data does not eliminate benchmark risk by itself. Template families can resemble benchmark dynamics, and synthetic generators can encode artifacts that make downstream ranks look stronger than real transfer would be.
Cross-Paper Comparison Checklist
Before comparing two reported ranks, check:
- Is the task forecasting, classification, imputation, anomaly detection, reasoning, or generation?
- Is the model zero-shot, few-shot, linear-probed, fine-tuned, or ensembled?
- Are context length, horizon, patch size, frequency, and rollout method comparable?
- Are known future exogenous variables, covariates, or grouped series available to both models?
- Are metrics aligned, such as MASE, CRPS, weighted quantile loss, MSE, MAE, accuracy, macro-F1, or rank?
- Does either pretraining corpus include benchmark train or test data, private telemetry, or unreleased synthetic generators?
- Is the model univariate, channel-independent, or native multivariate?
- Is the benchmark channel count high enough to test high-dimensional multivariate behavior rather than ordinary low-channel forecasting?
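A minimal way to mechanize this checklist is to diff explicit protocol records, as sketched below. The `Protocol` fields are hypothetical, and real reports rarely expose every axis cleanly; an empty issue list is necessary for a fair comparison, not sufficient.

```python
from dataclasses import dataclass, fields

# Hypothetical protocol record; extend with patch size, rollout method,
# grouped-series availability, and so on as reports allow.

@dataclass(frozen=True)
class Protocol:
    context_length: int
    horizon: int
    metric: str                    # e.g. "MASE", "CRPS", "WQL", "MSE"
    uses_future_covariates: bool
    multivariate_mode: str         # "univariate" | "channel_independent" | "native_multivariate"
    possible_pretrain_overlap: bool

def comparability_issues(a: Protocol, b: Protocol) -> list:
    """Return the protocol axes on which two reported results differ."""
    issues = [
        f.name for f in fields(Protocol)
        if f.name != "possible_pretrain_overlap" and getattr(a, f.name) != getattr(b, f.name)
    ]
    if a.possible_pretrain_overlap or b.possible_pretrain_overlap:
        issues.append("pretraining overlap suspected for at least one model")
    return issues
```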
Evidence
The source pages already show why this page is needed:
- Toto 2.0 mixes base, fine-tuned, and ensemble leaderboard entries.
- TiRex, FlowState, and Chronos-2 discuss overlap or stricter zero-shot settings.
- Moirai 2.0 reports a smaller model outperforming larger variants under one aggregate.
- U-Cast argues that low-channel forecasting benchmarks under-test native multivariate dependency modeling.
- TabPFN-v2, TabPFN-3, and TabICL operate mainly on static tabular tasks.
- TabPFN-TS-3, TempoPFN, MantisV2, and UniShape each change the task or adaptation mode again.
Open Questions
- Should the wiki maintain a normalized benchmark table only for results that share protocol and metrics?
- How should private training corpora or private observability benchmarks be weighted relative to fully reproducible public benchmarks?
- Should high-dimensional time-series forecasting (HDTSF) benchmarks report channel-count, correlation, hierarchy, memory, and training-time axes next to forecasting error?
- What is the right benchmark for action-conditioned time-series world models where interventions, not only forecasts, matter?