Distribution Priors In Self-Supervised Learning
Summary
SSL objectives often encode assumptions about the desired distribution of embeddings, clusters, prototypes, or latent states. These assumptions can prevent collapse, but they can also bias which factors are preserved when the pretraining data is imbalanced.
What The Wiki Currently Believes
- The Hidden Uniform Cluster Prior in Self-Supervised Learning shows that SimCLR, VICReg, SwAV, and MSN can impose a hidden uniform cluster or feature prior through volume maximization, whitening, entropy maximization, or prototype balancing.
- The same source shows that this prior can be useful on class-balanced ImageNet-like data and harmful on long-tailed data; power-law or class-matched priors help when they match the data distribution and hurt when they do not.
- A Cookbook of Self-Supervised Learning gives the practical context: batch statistics, projectors, predictors, augmentations, collapse prevention, and evaluation protocol are method-defining choices, not implementation trivia.
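The prototype-balancing mechanism referenced above can be made concrete. The following is a minimal NumPy sketch of Sinkhorn-style assignment balancing in the spirit of SwAV/MSN (not the papers' actual implementations; the function name and parameters are illustrative): whatever the raw prototype scores are, iterative row/column normalization forces roughly equal total mass onto every prototype, which is exactly the hidden uniform cluster prior.

```python
import numpy as np

def sinkhorn_uniform(scores, n_iters=10, eps=1.0):
    """Balance soft prototype assignments with Sinkhorn iterations.

    scores: (B, K) similarity of B samples to K prototypes.
    Returns Q with rows summing to 1 (one soft assignment per sample)
    and column mass pushed toward B/K, i.e. a UNIFORM prior over
    prototypes, regardless of how imbalanced the batch really is.
    """
    Q = np.exp((scores - scores.max()) / eps)  # stabilized positive matrix
    B, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)  # equalize mass per prototype
        Q /= Q.sum(axis=1, keepdims=True)  # one unit of mass per sample
    return Q

rng = np.random.default_rng(0)
scores = rng.normal(size=(256, 8))
Q = sinkhorn_uniform(scores)
col_mass = Q.sum(axis=0)  # ~ 256 / 8 = 32 per prototype
```

Note the design point this illustrates: the balancing is applied to whatever batch arrives, so if the batch itself is long-tailed, the uniform column constraint actively fights the data distribution.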
Evidence
The Hidden Uniform Cluster Prior paper gives both theoretical and empirical arguments: K-means-like assumptions appear inside several modern SSL losses, and changing the mini-batch class distribution changes semantic transfer for methods with explicit volume-maximization regularizers. The Cookbook explains why this belongs in the broader SSL checklist: collapse-prevention mechanisms, batch construction, and projector heads are part of the representation objective itself.
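The "volume-maximization regularizer" family can also be sketched directly. Below is a minimal NumPy illustration of VICReg-style variance and covariance terms (a sketch of the general mechanism, not VICReg's reference implementation): the variance hinge pushes every embedding dimension toward a target spread and the covariance term decorrelates dimensions, which together impose a whitening prior on batch statistics.

```python
import numpy as np

def var_cov_terms(z, gamma=1.0, eps=1e-4):
    """VICReg-style regularizer terms on a batch of embeddings z (n, d).

    var_loss:  hinge pushing each dimension's std up toward gamma,
               so no dimension is allowed to collapse.
    cov_loss:  squared off-diagonal covariance, pushing dimensions
               to be decorrelated. Together: a whitening prior.
    """
    z = z - z.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, gamma - std))
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss, cov_loss

rng = np.random.default_rng(1)
v_ok, c_ok = var_cov_terms(rng.normal(size=(4096, 16)))   # near zero
v_bad, c_bad = var_cov_terms(np.ones((256, 16)))          # collapsed batch
```

Because both terms are computed from mini-batch statistics, batch composition is part of the objective: the same regularizer sees different "data distributions" depending on how the sampler builds batches.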
Implications For Time-Series And World Models
Temporal data is usually long-tailed. Rare events (incidents, regime changes, interventions, treatments, control inputs) and sparsely represented entities (asset types, sensors, patients, users, environments) may be infrequent yet central to the task. A uniform latent-cluster prior can make the model overrepresent common, balanced-looking factors or suppress naturally imbalanced factors that matter downstream.
For time-series JEPA, contrastive, or non-contrastive SSL, agents should ask which prior the loss imposes over latent regimes. If the prior is accidental, benchmark results may depend on sampler design, batch composition, horizon selection, and whether the evaluation set hides or exposes tail regimes.
Gotchas
- Uniformity is not the opposite of collapse; it is one specific anti-collapse prior, and alternatives such as power-law or class-matched priors also prevent collapse.
- Balanced mini-batches are not automatically more faithful to the data distribution.
- Long-tailed priors are not universally better; they help only when they match the relevant semantic or regime structure.
- Downstream aggregate scores can hide rare-regime damage. Evaluation should report tail regimes, anomaly windows, intervention windows, and domain-specific slices when available.
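The last gotcha can be operationalized with a per-slice report. The sketch below (function and slice names are hypothetical) shows how an aggregate accuracy of 0.98 can coexist with total failure on a rare-regime slice, which is exactly what tail-aware evaluation is meant to expose.

```python
import numpy as np

def slice_report(y_true, y_pred, slices):
    """Accuracy overall AND per named slice, so aggregate numbers
    cannot hide damage on rare regimes or anomaly windows."""
    correct = (y_true == y_pred)
    report = {"overall": float(correct.mean())}
    for name, mask in slices.items():
        report[name] = float(correct[mask].mean()) if mask.any() else float("nan")
    return report

# Hypothetical: 1000 windows; the 20 anomaly windows are all misclassified.
y_true = np.zeros(1000, dtype=int)
y_pred = np.zeros(1000, dtype=int)
y_true[:20] = 1                         # rare regime the model misses
anomaly = np.zeros(1000, dtype=bool)
anomaly[:20] = True
rep = slice_report(y_true, y_pred, {"anomaly_windows": anomaly})
# rep["overall"] is 0.98 while rep["anomaly_windows"] is 0.0
```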
Open Questions
- Can weak metadata, clustering, textual descriptions, or causal structure estimate useful SSL priors without labels?
- Which collapse-prevention mechanisms preserve long-tailed temporal regimes best?
- Should time-series foundation models expose sampler and batch-composition details as part of the model card?