A Cookbook of Self-Supervised Learning
Source
- Raw Markdown: paper_cookbook-self-supervised-learning-2023.md
- PDF: paper_cookbook-self-supervised-learning-2023.pdf
- Preprint: arXiv:2304.12210
Core Claim
This is an entry-point survey and practical guide to SSL as of early 2023. It does not introduce a new SSL objective; instead, it organizes the field’s method families, theoretical underpinnings, implementation choices, evaluation protocols, and common training pitfalls in one place.
Why It Matters
Alex recommends this source for anyone starting work on SSL. Its value is pedagogical and taxonomic: it gives newcomers a shared map before they dive into specific methods such as SimCLR, MoCo, BYOL, SimSiam, DINO, VICReg, Barlow Twins, SwAV, MAE, iBOT, and DINOv2.
Taxonomy
The paper groups modern visual SSL into four broad families:
- Deep metric learning and contrastive methods, including SimCLR, MoCo, NNCLR, MeanShift, and supervised contrastive variants.
- Self-distillation methods, including BYOL, SimSiam, DINO, iBOT, and DINOv2.
- CCA-style or covariance-regularized methods, including VICReg, Barlow Twins, SwAV, and W-MSE.
- Masked image modeling, including BEiT, MAE, SimMIM, iBOT-style latent targets, and later masked generative variants.
It also surveys older SSL roots such as information restoration, temporal relationships in video, spatial-context pretexts, clustering, and generative models.
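As a concrete anchor for the first family above, the NT-Xent (SimCLR-style) contrastive objective can be sketched in a few lines. This is an illustrative NumPy sketch, not code from the paper; the function name and the temperature default are my own choices:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """SimCLR-style NT-Xent: pull two augmented views of the same image
    together, push every other batch entry apart."""
    z = np.concatenate([z1, z2], axis=0)                   # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)       # unit-normalize rows
    sim = z @ z.T / tau                                    # cosine-similarity logits
    np.fill_diagonal(sim, -np.inf)                         # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index per row
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(-(sim[np.arange(2 * n), pos] - logsumexp).mean())
```

The self-distillation and covariance-regularized families replace the explicit negatives in the denominator with an asymmetric teacher/predictor or with variance and covariance penalties, respectively, which is the main axis along which the taxonomy separates them.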
Practical Takeaways
- Augmentations define the invariances a joint-embedding SSL model learns. This is both a feature and a risk: the ImageNet-style augmentation recipe may not match every downstream task.
- Projectors are central engineering components. The paper summarizes why they can improve SSL performance, absorb augmentation noise, and interact with Guillotine Regularization.
- Collapse has multiple meanings. Constant-output collapse, dimensional collapse, covariance/rank pathologies, and loss-specific degeneracy are distinct failure modes and should not be conflated.
- Evaluation protocol matters. KNN, linear probing, MLP probing, full fine-tuning, unsupervised rank diagnostics, dense prediction, and visual decoding measure different properties.
- Training details are not incidental: batch size, learning-rate schedule, optimizer, weight decay, EMA teacher, predictor, ViT patch size, stochastic depth, LayerScale, and distributed all-gather behavior can decide whether a recipe works.
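One of the unsupervised rank diagnostics mentioned above can be made concrete: an entropy-based effective-rank score (in the spirit of RankMe-style diagnostics) flags dimensional collapse without any labels. A minimal NumPy sketch, with names and defaults of my own choosing:

```python
import numpy as np

def effective_rank(z, eps=1e-12):
    """Entropy-based effective rank of an embedding matrix (N, d).
    Near d: features spread across many directions.
    Near 1: dimensional collapse onto a single direction."""
    s = np.linalg.svd(z - z.mean(axis=0), compute_uv=False)  # centered singular values
    p = s / (s.sum() + eps)                                  # normalize to a distribution
    p = p[p > eps]                                           # drop numerical zeros
    return float(np.exp(-(p * np.log(p)).sum()))             # exp(Shannon entropy)
```

Tracked over training, a steadily shrinking effective rank is an early warning that the encoder output is degenerating before the loss value makes it obvious.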
Gotchas
- Do not cite this as evidence for a new method. Use it as a structured map and implementation guide.
- Do not treat its taxonomy as final after 2023. Later sources such as DINOv3, JEPA/NEPA variants, and Perception Encoder extend or complicate the map.
- Do not copy an SSL recipe across domains without checking which invariances the augmentations create. For time series and world models, augmentations can erase scale, phase, channel identity, local structure, action information, or exogenous variables.
- Do not compare SSL papers only by a single ImageNet linear-probe number. The cookbook itself stresses protocol differences and the need to evaluate beyond classification.
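To make "evaluate beyond a single linear-probe number" actionable, a frozen-feature k-NN probe is one cheap complement to linear probing. A hedged cosine-similarity sketch (names and defaults are my choices, not the cookbook's):

```python
import numpy as np

def knn_probe(train_z, train_y, test_z, test_y, k=5):
    """k-NN accuracy over frozen embeddings with cosine similarity,
    a common label-efficient complement to linear probing."""
    a = train_z / np.linalg.norm(train_z, axis=1, keepdims=True)
    b = test_z / np.linalg.norm(test_z, axis=1, keepdims=True)
    nn = np.argsort(-(b @ a.T), axis=1)[:, :k]   # k nearest train indices per test point
    votes = train_y[nn]                          # neighbor labels, shape (n_test, k)
    preds = np.array([np.bincount(v).argmax() for v in votes])
    return float((preds == test_y).mean())
```

Because it has no trainable head, the k-NN probe is insensitive to probe-training hyperparameters, which is exactly the protocol confound this gotcha warns about.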
Links Into The Wiki
- Self-Supervised Representation Learning
- Representation Collapse
- Intermediate-Layer Representations
- Vision Foundation Models
Open Questions
- Which parts of the early-2023 SSL taxonomy still hold after scaled JEPA-style, DINOv3-style, and Perception Encoder-style systems?
- Which SSL cookbook choices have direct analogs for multivariate time-series representation learning?
- Can the wiki maintain a compact beginner path that maps from this cookbook to current vision and time-series sources?