A Cookbook of Self-Supervised Learning
Source
- Raw Markdown: paper_cookbook-self-supervised-learning-2023.md
- PDF: paper_cookbook-self-supervised-learning-2023.pdf
- Preprint: arXiv:2304.12210
Core Claim
This is an entry-point survey and practical guide to SSL as of early 2023. It does not introduce a new SSL objective; instead, it organizes the field’s method families, theoretical underpinnings, implementation choices, evaluation protocols, and common training pitfalls in one place.
Why It Matters
Alex recommends this source for anyone starting work on SSL. Its value is pedagogical and taxonomic: it gives newcomers a shared map before they dive into specific methods such as SimCLR, MoCo, BYOL, SimSiam, DINO, VICReg, Barlow Twins, SwAV, MAE, iBOT, and DINOv2.
Taxonomy
The paper groups modern visual SSL into four broad families:
- Deep metric learning and contrastive methods, including SimCLR, MoCo, NNCLR, MeanShift, and supervised contrastive variants.
- Self-distillation methods, including BYOL, SimSiam, DINO, iBOT, and DINOv2.
- CCA-style or covariance-regularized methods, including VICReg, Barlow Twins, SwAV, and W-MSE.
- Masked image modeling, including BEiT, MAE, SimMIM, iBOT-style latent targets, and later masked generative variants.
It also surveys older SSL roots such as information restoration, temporal relationships in video, spatial-context pretexts, clustering, and generative models.
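As a concrete anchor for the first family above, the NT-Xent (SimCLR-style) contrastive objective can be sketched in a few lines. This is an illustrative NumPy sketch, not code from the paper; the function name and the temperature default are my own choices:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """SimCLR-style NT-Xent: pull two augmented views of the same image
    together, push every other batch entry apart."""
    z = np.concatenate([z1, z2], axis=0)                   # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)       # unit-normalize rows
    sim = z @ z.T / tau                                    # cosine-similarity logits
    np.fill_diagonal(sim, -np.inf)                         # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index per row
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(-(sim[np.arange(2 * n), pos] - logsumexp).mean())
```

The self-distillation and covariance-regularized families replace the explicit negatives in the denominator with an asymmetric teacher/predictor or with variance and covariance penalties, respectively, which is the main axis along which the taxonomy separates them.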
Practical Takeaways
- Augmentations define the invariances a joint-embedding SSL model learns. This is both a feature and a risk: the ImageNet-style augmentation recipe may not match every downstream task.
- Projectors are central engineering components. The paper summarizes why they can improve SSL performance, absorb augmentation noise, and interact with Guillotine Regularization.
- Collapse has multiple meanings. Constant-output collapse, dimensional collapse, covariance/rank pathologies, and loss-specific degeneracy are distinct failure modes and should not be conflated.
- Evaluation protocol matters. KNN, linear probing, MLP probing, full fine-tuning, unsupervised rank diagnostics, dense prediction, and visual decoding measure different properties.
- Training details are not incidental: batch size, learning-rate schedule, optimizer, weight decay, EMA teacher, predictor, ViT patch size, stochastic depth, LayerScale, and distributed all-gather behavior can decide whether a recipe works.
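One of the unsupervised rank diagnostics mentioned above can be made concrete: an entropy-based effective-rank score (in the spirit of RankMe-style diagnostics) flags dimensional collapse without any labels. A minimal NumPy sketch, with names and defaults of my own choosing:

```python
import numpy as np

def effective_rank(z, eps=1e-12):
    """Entropy-based effective rank of an embedding matrix (N, d).
    Near d: features spread across many directions.
    Near 1: dimensional collapse onto a single direction."""
    s = np.linalg.svd(z - z.mean(axis=0), compute_uv=False)  # centered singular values
    p = s / (s.sum() + eps)                                  # normalize to a distribution
    p = p[p > eps]                                           # drop numerical zeros
    return float(np.exp(-(p * np.log(p)).sum()))             # exp(Shannon entropy)
```

Tracked over training, a steadily shrinking effective rank is an early warning that the encoder output is degenerating before the loss value makes it obvious.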
Gotchas
- Do not cite this as evidence for a new method. Use it as a structured map and implementation guide.
- Do not treat its taxonomy as final after 2023. Later sources such as DINOv3, JEPA/NEPA variants, and Perception Encoder extend or complicate the map.
- Do not copy an SSL recipe across domains without checking which invariances the augmentations create. For time series and world models, augmentations can erase scale, phase, channel identity, local structure, action information, or exogenous variables.
- Do not compare SSL papers only by a single ImageNet linear-probe number. The cookbook itself stresses protocol differences and the need to evaluate beyond classification.
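To make "evaluate beyond a single linear-probe number" actionable, a frozen-feature k-NN probe is one cheap complement to linear probing. A hedged cosine-similarity sketch (names and defaults are my choices, not the cookbook's):

```python
import numpy as np

def knn_probe(train_z, train_y, test_z, test_y, k=5):
    """k-NN accuracy over frozen embeddings with cosine similarity,
    a common label-efficient complement to linear probing."""
    a = train_z / np.linalg.norm(train_z, axis=1, keepdims=True)
    b = test_z / np.linalg.norm(test_z, axis=1, keepdims=True)
    nn = np.argsort(-(b @ a.T), axis=1)[:, :k]   # k nearest train indices per test point
    votes = train_y[nn]                          # neighbor labels, shape (n_test, k)
    preds = np.array([np.bincount(v).argmax() for v in votes])
    return float((preds == test_y).mean())
```

Because it has no trainable head, the k-NN probe is insensitive to probe-training hyperparameters, which is exactly the protocol confound this gotcha warns about.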
Links Into The Wiki
- Self-Supervised Representation Learning
- Representation Collapse
- Intermediate-Layer Representations
- Vision Foundation Models
Open Questions
- Which parts of the early-2023 SSL taxonomy still hold after scaled JEPA-style, DINOv3-style, and Perception Encoder-style systems?
- Which SSL cookbook choices have direct analogs for multivariate time-series representation learning?
- Can the wiki maintain a compact beginner path that maps from this cookbook to current vision and time-series sources?