MantisV2: Closing the Zero-Shot Gap in Time Series Classification with Synthetic Data and Test-Time Strategies
Source
- Raw Markdown: paper_mantisv2-2026.md
- PDF: paper_mantisv2-2026.pdf
- Preprint: https://arxiv.org/abs/2602.17868
- Official code: https://github.com/vfeofanov/mantis
- MantisPlus checkpoint: https://huggingface.co/paris-noah/MantisPlus
- MantisV2 checkpoint: https://huggingface.co/paris-noah/MantisV2
Core Claim
The paper argues that zero-shot time-series classification can close much of the gap to fine-tuned encoders by combining synthetic-data pretraining, a lighter refined Mantis architecture, and test-time representation strategies.
Key Contributions
- Introduces MantisPlus, the original Mantis architecture retrained only on 2M synthetic time series generated by CauKer.
- Introduces MantisV2, a smaller refined encoder with a larger convolution kernel, smaller Transformer head dimension, RoPE, RMS normalization, and SwiGLU feed-forward layers.
- Shows that intermediate Transformer layers can be better frozen feature extractors than the final layer, especially as synthetic pretraining scale grows.
- Uses test-time strategies including class-token plus mean-token aggregation, multi-scale interpolation self-ensembling, first-difference embeddings, logistic-regression probing, and cross-model embedding fusion.
- Benchmarks on UCR, UEA, human activity recognition, and EEG classification datasets against time-series, tabular, forecasting, and vision-adapted foundation-model baselines.
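The aggregation and self-ensembling strategies above are embedding-space operations and can be sketched compactly. A minimal numpy sketch, where `encode` is a hypothetical stand-in (a fixed random projection) for a frozen Mantis-style encoder, not the paper's model:

```python
import numpy as np

def encode(x, dim=8, seed=0):
    """Hypothetical stand-in for a frozen encoder: a fixed random
    projection of a length-L series to a dim-dimensional embedding."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((len(x), dim))
    return x @ w

def resize(x, length):
    """Linearly interpolate a 1-D series to a target length."""
    old = np.linspace(0.0, 1.0, len(x))
    new = np.linspace(0.0, 1.0, length)
    return np.interp(new, old, x)

def self_ensemble(x, scales=(256, 512, 1024)):
    """Multi-scale self-ensembling: embed several interpolated views
    of the same series, plus a first-difference view, and concatenate."""
    views = [resize(x, s) for s in scales]
    views.append(np.diff(resize(x, scales[-1])))  # first-difference view
    return np.concatenate([encode(v) for v in views])

x = np.sin(np.linspace(0, 6 * np.pi, 300))
z = self_ensemble(x)
print(z.shape)  # (32,) = 4 views x 8 dims each
```

The concatenated embedding then feeds the test-time classifier (the paper probes with logistic regression); note the dimensionality grows linearly with the number of views.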
Benchmarked Models
- MantisPlus: Original Mantis architecture pretrained on 2M CauKer synthetic time series for 200 epochs; the paper reports it as a strong zero-shot frozen encoder and releases the checkpoint on Hugging Face.
- MantisV2: Refined Mantis encoder with about 4.2M original parameters and 2.2M parameters after layer pruning; the paper reports stronger UCR performance than MantisPlus while remaining lightweight.
Method Notes
This source extends the Mantis and CauKer line of work from evidence that synthetic pretraining helps into a more complete classification foundation-model recipe. It keeps the focus on frozen feature extraction: input time series are resized, encoded channel-by-channel for multivariate inputs, and passed to a downstream classifier trained on the task labels.
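The frozen-feature pipeline can be sketched end to end. Everything below is a stand-in under stated assumptions: `encode_channel` is a hypothetical fixed random projection in place of the pretrained encoder, and the nearest-centroid classifier replaces the paper's random forests and logistic regression; only the structure (resize, per-channel encoding, concatenation, downstream classifier) follows the source:

```python
import numpy as np

def encode_channel(x, dim=16, seed=0):
    """Hypothetical frozen encoder: fixed random projection plus a
    nonlinearity, standing in for the pretrained Mantis-style encoder."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((len(x), dim))
    return np.tanh(x @ w)

def resize(x, length=512):
    """Resize a 1-D series to the encoder's expected input length."""
    return np.interp(np.linspace(0, 1, length),
                     np.linspace(0, 1, len(x)), x)

def embed(series):
    """Multivariate series of shape (channels, length): encode each
    channel independently and concatenate, as the paper describes."""
    return np.concatenate([encode_channel(resize(c)) for c in series])

def fit_centroids(X, y):
    """Toy downstream classifier (nearest class centroid); the paper
    trains random forests or logistic regression on the embeddings."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, z):
    return min(centroids, key=lambda c: np.linalg.norm(z - centroids[c]))

rng = np.random.default_rng(1)

def make(freq):
    """Synthetic 3-channel example: a noisy sinusoid of a given frequency."""
    base = np.sin(freq * np.linspace(0, 2 * np.pi, 300))
    return np.tile(base, (3, 1)) + 0.05 * rng.standard_normal((3, 300))

X = np.stack([embed(make(f)) for f in (1, 1, 1, 5, 5, 5)])
y = np.array([0, 0, 0, 1, 1, 1])
centroids = fit_centroids(X, y)
print(predict(centroids, embed(make(5))))  # classify an unseen series
```

Note the encoder is never updated: only the downstream classifier sees the task labels, which is what makes the layer and token choices below pure test-time decisions.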
The most reusable modeling lesson is that the final contrastive layer is not necessarily the best representation for zero-shot classification. The paper treats layer selection and token aggregation as test-time choices rather than architectural afterthoughts.
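The layer-selection idea can be illustrated by probing every layer's frozen features on held-out labels and keeping the best one. A hypothetical numpy sketch where the per-layer features are simulated (class signal peaking at an intermediate layer) and a nearest-centroid probe stands in for the paper's logistic-regression probing:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_samples, dim = 6, 200, 32
y = rng.integers(0, 2, n_samples)

# Simulated per-layer features: the class signal is strongest at an
# intermediate layer (index 3), mimicking the paper's observation.
signal = np.array([0.2, 0.5, 1.0, 3.0, 1.0, 0.4])

def make_layer(s):
    X = rng.standard_normal((n_samples, dim))
    X[:, 0] += s * y  # class-dependent shift in one coordinate
    return X

layers = [make_layer(s) for s in signal]

def probe_accuracy(X, y, split=100):
    """Nearest-centroid probe: fit on the first `split` samples,
    score on the rest (a stand-in for logistic-regression probing)."""
    Xtr, ytr, Xte, yte = X[:split], y[:split], X[split:], y[split:]
    cents = np.stack([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
    pred = np.argmin(np.linalg.norm(Xte[:, None] - cents, axis=2), axis=1)
    return float((pred == yte).mean())

scores = [probe_accuracy(X, y) for X in layers]
best = int(np.argmax(scores))
print(best, scores[best])  # the intermediate layer wins the probe
```

The same probe-and-select loop covers token aggregation: compute class-token, mean-token, and concatenated variants per layer and keep whichever scores best on held-out data.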
Within the Mantis lineage, MantisV2 changes the data and inference strategy, while UTICA changes the objective by adapting self-distillation to the same Mantis-style classification backbone.
Evidence And Results
With random-forest classifiers on frozen embeddings, the paper reports average UCR accuracy of 0.8061 for MantisPlus and 0.8195 for MantisV2, with MantisV2 leading the compared methods on the 128-dataset UCR average.
On UEA-27 with random forests, MantisPlus and MantisV2 are the top two reported deep feature extractors, with averages of 0.7449 and 0.7420 respectively.
In the final UCR comparison using logistic regression for deep methods, the paper reports average accuracy of 0.8360 for MantisV2, 0.8369 for SE-MantisPlus, 0.8397 for SE-MantisV2, 0.8466 for MantisV2 plus TiViT-H, and 0.8494 for MantisV2 plus TiConvNext, compared with 0.8500 for fine-tuned MantisV2.
Limitations
The method is specialized for classification, not forecasting or action-conditioned world modeling. The self-ensembling and model-fusion results improve accuracy but also increase feature dimensionality and inference cost. Multivariate time series are handled by per-channel encoding and concatenation rather than native cross-channel modeling.
Links Into The Wiki
- Mantis
- Time-Series Foundation Models
- Time-Series Classification Foundation Models
- Synthetic Data For Time Series
- Time-Series Benchmark Hygiene
- UTICA
- CauKer
Open Questions
- How much of the gain comes from synthetic data diversity versus the contrastive objective and test-time classifier choice?
- Can MantisV2-style intermediate-layer selection transfer to forecasting or action-conditioned world-model settings?
- Would native multivariate tokenization improve UEA, HAR, and EEG performance without losing the small-model advantage?