NuTime: Numerically Multi-Scaled Embedding for Large-Scale Time-Series Pretraining

Source

Core Claim

NuTime argues that large-scale time-series representation learning needs an explicit numerical-scale interface: window-level normalized shapes can be encoded conventionally, but raw window means and standard deviations need a numerically multi-scaled embedding so one pretrained encoder can transfer across domains with very different amplitudes.

Key Contributions

  • Introduces NuTime, a Transformer encoder for time-series representation learning where each token represents a non-overlapping window through normalized shape, window mean, and window standard deviation.
  • Proposes numerically multi-scaled embedding (NME), which embeds scalar mean and standard-deviation values by ensembling linear-plus-LayerNorm blocks across bias multipliers from 1e-4 to 1e4 (see the sketch after this list).
  • Builds a large heterogeneous pretraining set from UCR, UEA, and additional public datasets, yielding about 1.89 million univariate training sequences and about 60 million tokens after cropping and windowing.
  • Uses a simple BYOL-style self-supervised objective with random resized crops, then transfers the pretrained model to univariate classification, multivariate classification, few-shot learning, clustering, and sample-level anomaly detection.
  • Reports state-of-the-art classification results against self-supervised and domain-specific baselines on UCR and UEA benchmark suites.

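To make the NME contribution concrete, here is a minimal PyTorch sketch, assuming each numerical scale contributes a linear-plus-LayerNorm block whose bias is pinned to a fixed power-of-ten multiplier and that the per-scale outputs are summed. The class name, initialization, and exact parameterization are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn


class MultiScaleScalarEmbedding(nn.Module):
    """Illustrative sketch of a numerically multi-scaled embedding for one
    scalar statistic (a window mean or std): one Linear + LayerNorm block per
    numerical scale, with each block's bias pinned to a fixed power-of-ten
    multiplier; the per-scale embeddings are summed into a single vector."""

    def __init__(self, embed_dim: int = 32, scales=range(-4, 5)):
        super().__init__()
        # 1e-4 ... 1e4: nine scales, matching the Bias9 default setting.
        self.multipliers = [10.0 ** k for k in scales]
        n = len(self.multipliers)
        self.weights = nn.Parameter(torch.randn(n, embed_dim) * 0.02)
        self.biases = nn.Parameter(torch.randn(n, embed_dim) * 0.02)
        self.norms = nn.ModuleList(nn.LayerNorm(embed_dim) for _ in self.multipliers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_windows) holding one raw scalar (mean or std) per window.
        outs = []
        for k, (mult, norm) in enumerate(zip(self.multipliers, self.norms)):
            # Affine map whose bias sits at this block's numerical scale, then
            # LayerNorm so every scale yields a well-conditioned embedding.
            affine = x.unsqueeze(-1) * self.weights[k] + mult * self.biases[k]
            outs.append(norm(affine))
        return torch.stack(outs, dim=0).sum(dim=0)  # (batch, num_windows, embed_dim)
```
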
Benchmarked Models

  • Model: NuTime-Bias9
  • Role in paper: main released pretrained NuTime checkpoint
  • Notes: Transformer encoder with 6 layers, 8 heads, 128-dimensional latent vectors, window size 16, mean/std embedding dimension 32, and 9 numerical scales for NME; the paper uses this 9-scale setting as the default for large-scale transfer experiments.
  • Official artifact: checkpoint_bias9.pth

Method Notes

NuTime is a passive time-series representation model rather than an action-conditioned world model: it learns latent representations from observed time-series samples and does not include an explicit action, control input, intervention, or treatment channel.

The key modeling move is to separate local shape from local scale. Each time series is split into fixed windows; each window is normalized by its own mean and standard deviation for the shape path, while the mean and standard deviation are preserved as scalar inputs. This avoids instance normalization’s loss of scale information while still giving the Transformer well-conditioned token embeddings.
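
A minimal sketch of this shape/scale split is below, assuming non-overlapping windows (length 16 by default) and simple truncation of any leftover samples; the function name and truncation policy are illustrative assumptions rather than the paper's exact preprocessing.

```python
import torch


def tokenize_series(x: torch.Tensor, window: int = 16, eps: float = 1e-5):
    """Split a series into non-overlapping windows and separate local shape
    from local scale: normalized window shapes feed the shape path, while the
    raw per-window mean and std are kept as scalar inputs for the multi-scaled
    embedding. Sketch only; leftover samples are simply truncated here."""
    length = (x.shape[-1] // window) * window
    windows = x[..., :length].reshape(*x.shape[:-1], -1, window)  # (..., num_windows, window)
    mean = windows.mean(dim=-1, keepdim=True)
    std = windows.std(dim=-1, keepdim=True)
    shape = (windows - mean) / (std + eps)           # scale-free shape per window
    return shape, mean.squeeze(-1), std.squeeze(-1)  # raw scale kept as scalars
```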

For multivariate time series, the paper encodes each channel window independently with shared parameters, concatenates the channel embeddings, and projects them back to the Transformer feature dimension. This makes multivariate transfer possible, but cross-channel dynamics are not modeled as a first-class temporal structure in the pretraining interface.
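
A rough sketch of that channel-wise extension, assuming a shared per-window token encoder and a linear projection back to the Transformer width; the class name, the `token_encoder` stand-in, and the tensor layout are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class ChannelwiseTokenProjector(nn.Module):
    """Illustrative sketch: embed each channel's windows with one shared
    token encoder, concatenate the channel embeddings per window position,
    and project back to the Transformer feature dimension."""

    def __init__(self, token_encoder: nn.Module, num_channels: int, dim: int = 128):
        super().__init__()
        self.token_encoder = token_encoder            # shared across channels
        self.proj = nn.Linear(num_channels * dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, channels, num_windows, window) raw per-channel windows.
        b, c, w, _ = tokens.shape
        emb = self.token_encoder(tokens.reshape(b * c, w, -1))   # (b*c, w, dim)
        emb = emb.reshape(b, c, w, -1).permute(0, 2, 1, 3)       # (b, w, c, dim)
        return self.proj(emb.reshape(b, w, -1))                  # (b, w, dim)
```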

Evidence And Results

  • UCR supervised comparison: NuTime ranks first on the reported 112-dataset UCR archive comparison against HIVE-COTE 2.0, MultiRocket, InceptionTime, and other specialized classifiers.
  • UCR self-supervised comparison: across 125 UCR datasets plus Epilepsy, FD-B, and EMG, NuTime reports the best average accuracy or macro-F1 in the paper’s comparison against TNC, T-Loss, TS-TCC, TS2Vec, TF-C, Ti-MAE, and SimMTM.
  • UEA multivariate classification: the same pretrained model transfers to UEA-style multivariate classification, ranking first in the supervised comparison over 26 datasets and reporting the best average accuracy/rank in the self-supervised comparison over 29 datasets.
  • Few-shot learning: with 5-shot finetuning over 41 UCR datasets, NuTime reports higher average accuracy than 1NN, DTW, BOSS, ResNet trained from scratch, and the FS-1/FS-2 meta-learning baselines.
  • Ablations support the numerical-scale design: the 9-scale NME setting outperforms single-scale and smaller-scale transfer variants, and large-scale pretraining improves substantially over pretraining on each individual UCR dataset alone.

Limitations

  • NuTime is strongest as a representation and classification pretraining model; the paper explicitly says it is not directly applicable to forecasting because decoding latent representations back to raw numerical values at arbitrary scales remains unsolved.
  • The model is pretrained mainly on univariate sequences, and multivariate support is added through channel-wise encoding plus projection rather than through natively coupled multivariate dynamics.
  • The checkpoint is tied to specific hyperparameters such as window size and predefined numerical scales, and the paper notes that retraining on other datasets may require tuning those choices.
  • The learned representation may inherit biases or inequalities from the pretraining data and can behave unexpectedly on data outside its training distribution.

Open Questions

  • Can NME-style scalar encoding be extended into a decoder that forecasts raw values across arbitrary numerical scales?
  • How much of NuTime’s transfer comes from the numerical embedding itself versus large heterogeneous pretraining data and BYOL-style augmentation?
  • Would a native multivariate version that models cross-channel coupling outperform the channel-wise extension without losing broad transfer behavior?