---
author:
- Randall Balestriero
- Mark Ibrahim
- '*Vlad Sobal*'
- '*Ari Morcos*'
- '*Shashank Shekhar*'
- '*Tom Goldstein*'
- '*Florian Bordes*'
- '*Adrien Bardes*'
- '*Gregoire Mialon*'
- '*Yuandong Tian*'
- '*Avi Schwarzschild*'
- '*Andrew Gordon Wilson*'
- '*Jonas Geiping*'
- '*Quentin Garrido*'
- '*Pierre Fernandez*'
- '*Amir Bar*'
- '*Hamed Pirsiavash*'
- '*Yann LeCun*'
- '*Micah Goldblum*'
bibliography:
- references.bib
title: 'A Cookbook of Self-Supervised Learning'
---

```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\sigmoid{{\textnormal{sigmoid}}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\def\vol{{\text{Vol}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\Corr}{\mathrm{Corr}}
```
```{=latex}
\newcommand{\HSIC}{\mathrm{HSIC}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\norm}{\mathrm{renorm}}
```
```{=latex}
\newcommand{\CE}{\mathrm{CrossEnt}}
```
```{=latex}
\newcommand{\cent}{\mathrm{center}}
```
```{=latex}
\newcommand{\sg}{\mathrm{sg}}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\DeclareMathOperator{\rank}{rank}
```
```{=latex}
\DeclareMathOperator{\mean}{mean}
```
```{=latex}
\DeclareMathOperator{\spann}{span}
```
```{=latex}
\DeclareMathOperator{\CoSim}{CoSim}
```
```{=latex}
\DeclareMathOperator{\diag}{diag}
```
```{=latex}
\DeclareMathOperator{\NN}{NN}
```
```{=latex}
\DeclareMathOperator{\relu}{ReLU}
```
```{=latex}
\DeclareMathOperator{\View}{View}
```
```{=latex}
\def\vsigma{{\bm{\sigma}}}
```
```{=latex}
\def\vlambda{{\bm{\lambda}}}
```
```{=latex}
\def\Indic{{\bm{1}}}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\bb}[0]{
\vspace{-0.25cm}}
```
```{=latex}
\newcommand{\volume}[4]{
\draw[step=1cm,black,very thin] (#1,#2) grid (#3,#4);
\draw[black,fill=gray,fill opacity=0.2] (#1,#2) -- (#1,#4) -- (#3,#4) -- (#3,#2) -- cycle;
}
```
```{=latex}
\newcommand{\Cvolume}[5]{
\draw[step=1cm,gray,very thin] (#1,#2) grid (#3,#4);
\draw[draw=none,fill=#5,fill opacity=0.2] (#1,#2) -- (#1,#4) -- (#3,#4) -- (#3,#2) -- cycle;
}
```
```{=latex}
\newcommand{\lsComment}[1]{\color{codegreen}{#1}}
```
```{=latex}
\newcommand{\DrawCard}[7]{
\draw[rounded corners=\cardroundingradius] (#3,#4) rectangle  (#3+\cardwidth,#4+\cardheight*#7);
\fill[#5,rounded corners=\striproundingradius] (#3+\strippadding,#4+\cardheight*#7-\strippadding) rectangle (#3+\cardwidth-\strippadding,#4+\cardheight*#7-\stripheight) node[rotate=90,black] {};
\fill (#3+\strippadding+\cardwidth*0.5,#4+\cardheight*#7-\strippadding*0.5)   
node [below, rounded corners=\cardroundingradius*0.7,inner sep=2]{{\footnotesize \#}#6};
\node[text width=(\cardwidth-2*\textpadding)*1cm,below left,inner sep=0] at (#3+\cardwidth-\textpadding,#4+\cardheight*#7-2*\strippadding-\stripheight) 
    { 
        {\topsize #1}\\[-0.5em]
        \tikz{\fill (#3+\textpadding,#4) rectangle (#3+\cardwidth-\textpadding,#4+\ruleheight);}\\[0em]
        {\bottomsize #2}
    };
    }
```
```{=latex}
\newcommand{\SummaryCard}[3]{
\draw[rounded corners=\cardroundingradius] (0,0) rectangle  (\textwidth,#1);
\fill[gray!40,rounded corners=\striproundingradius] (\strippadding,#1-\strippadding) rectangle (\textwidth-\strippadding*1cm,#1-\stripheight-\strippadding) node[rotate=90,black] {};
\node at (\textwidth*0.5,#1-\stripheight*0.5-\strippadding)  {#2};
\node[below] at (\textwidth*0.5,#1-\stripheight-\strippadding-\textpadding) {{\topsize #3}};
}
```
```{=latex}
\newcommand{\Quote}[2]{
% \draw[rounded corners=\cardroundingradius] (0,0) rectangle  (\textwidth,#1);
% \fill[green!20,rounded corners=\striproundingradius] (0,#1) rectangle (\textwidth,#1) node[black] {};
% \node[rectangle,fill=green!20,rounded corners=\striproundingradius,draw=black,text width=0.92\linewidth]  at (0,0)  {\centering ``#1'', \textit{#2}};
}
```
```{=latex}
\newcommand{\cmark}{\ding{51}}
```
```{=latex}
\newcommand{\xmark}{\ding{55}}
```
```{=latex}
\newcommand{\emptybox}{\makebox[0pt][l]{$\square$}\raisebox{.15ex}{\hspace{0.1em}{\color{red} \xmark}}}
```
```{=latex}
\newcommand{\checkedbox}{\makebox[0pt][l]{$\square$}\raisebox{.15ex}{\hspace{0.1em}{\color{blue} $\checkmark$}}}
```
```{=latex}
\newcommand{\florian}[1]{{\color{blue}[Florian: #1]}}
```
```{=latex}
\newcommand{\Mark}[1]{{\color{blue}[Mark: #1]}}
```
```{=latex}
\newcommand{\ari}[1]{{\color{magenta}[\textbf{Ari}: #1]}}
```
```{=latex}
\newcommand{\shashank}[1]{{\color{brown}[\textbf{Shashank}: #1]}}
```
```{=latex}
\newcommand{\agw}[1]{{\color{red}[\textbf{Andrew}: #1]}}
```
```{=latex}
\newcommand{\tom}[1]{{\color{red}[\textbf{Tom}: #1]}}
```
```{=latex}
\newcommand{\jog}[1]{{\color{teal}[\textbf{Jonas}: #1]}}
```
```{=latex}
\newcommand{\hamed}[1]{{\color{green}[\textbf{Hamed}: #1]}}
```
```{=latex}
\newcommand{\quentin}[1]{{\color{orange}[\textbf{Quentin}: #1]}}
```
```{=latex}
\newcommand{\greg}[1]{{\color{violet}[\textbf{Gregoire}: #1]}}
```
```{=latex}
\newcommand{\micah}[1]{{\color{red}[\textbf{Micah}: #1]}}
```
```{=latex}
\renewcommand\Authands{ and }
```
```{=latex}
\pgfmathsetmacro{\cardroundingradius}{3mm}
```
```{=latex}
\pgfmathsetmacro{\striproundingradius}{2mm}
```
```{=latex}
\pgfmathsetmacro{\cardwidth}{5}
```
```{=latex}
\pgfmathsetmacro{\cardheight}{8}
```
```{=latex}
\pgfmathsetmacro{\stripheight}{0.4}
```
```{=latex}
\pgfmathsetmacro{\strippadding}{0.1}
```
```{=latex}
\pgfmathsetmacro{\textpadding}{0.2}
```
```{=latex}
\pgfmathsetmacro{\ruleheight}{0.05}
```
```{=latex}
\newcommand{\topsize}{\footnotesize}
```
```{=latex}
\newcommand{\bottomsize}{\tiny}
```
```{=latex}
\maketitle
```
```{=latex}
\newpage
```
```{=latex}
\tableofcontents
```
```{=latex}
\newpage
```
What is Self-Supervised Learning and Why Bother?
================================================

*Self-supervised learning*, dubbed "the dark matter of intelligence" [^1], is a promising path to advance machine learning. As opposed to *supervised learning*, which is limited by the availability of labeled data, self-supervised approaches can learn from vast unlabeled data [@chen2020simple; @misra2020self]. Self-supervised learning (SSL) underpins deep learning's success in natural language processing, leading to advances ranging from automated machine translation to large language models trained on web-scale corpora of unlabeled text [@brown2020language; @popel2020transforming]. In computer vision, SSL pushed new bounds on data size with models such as SEER, trained on one billion images [@goyal2021self]. SSL methods for computer vision have been able to match or, in some cases, surpass models trained on labeled data, even on highly competitive benchmarks like ImageNet [@tomasev2022pushing; @he2020momentum; @deng2009imagenet]. SSL has also been successfully applied across other modalities such as video, audio, and time series [@wickstrom2022mixing; @liu2022audio; @schiappa2022self].

Self-supervised learning defines a pretext task based on unlabeled inputs to produce descriptive and intelligible representations [@hastie2009overview; @goodfellow2016deep]. In natural language, a common SSL objective is to mask a word in the text and predict the surrounding words. This objective of predicting the context surrounding a word encourages the model to capture relationships among words in the text without the need for any labels. The same SSL model representations can be used across a range of downstream tasks, such as translating text across languages, summarizing, or even generating text, along with many others. In computer vision, analogous objectives exist, with models such as MAE or BYOL learning to predict masked patches of an image or representation [@grill2020bootstrap; @he2022masked]. Other SSL objectives encourage two views of the same image, formed by, say, adding color jitter or cropping, to be mapped to similar representations.
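
To make the multi-view objective concrete, here is a minimal, self-contained NumPy sketch. The `augment` and `encoder` functions below are illustrative stand-ins we introduce for this example, not components of any specific published method:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, rng):
    # Toy stand-in for the augmentations mentioned above (color jitter,
    # cropping, ...): perturb the input slightly to create a second "view".
    return x + 0.01 * rng.standard_normal(x.shape)

def encoder(x, W):
    # Toy linear encoder followed by L2 normalization of each embedding.
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def invariance_loss(z1, z2):
    # Mean squared distance between the two views' representations.
    return float(np.mean(np.sum((z1 - z2) ** 2, axis=-1)))

x = rng.standard_normal((8, 32))   # a batch of 8 flattened "images"
W = rng.standard_normal((32, 16))  # encoder weights
z1 = encoder(augment(x, rng), W)   # representation of view 1
z2 = encoder(augment(x, rng), W)   # representation of view 2
loss = invariance_loss(z1, z2)     # small when both views map close together
```

Minimizing such a loss alone would collapse all representations to a single point; the method families discussed below differ largely in how they prevent this collapse.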

With the power to train on vast unlabeled data come many benefits. While traditional supervised learning methods are trained on a specific task often known a priori based on the available labeled data, SSL learns generic representations useful across many tasks. SSL can be especially useful in domains such as medicine, where labels are costly or the specific task cannot be known a priori [@krishnan2022self; @CIGA2022100198]. There is also evidence that SSL models learn representations that are more robust to adversarial examples, label corruption, and input perturbations---and that are more fair---than their supervised counterparts [@hendrycks2019using; @goyal2022vision]. Consequently, SSL is a field garnering growing interest. Yet, much like cooking, training SSL methods is a delicate art with a high barrier to entry.

Why a Cookbook for Self-Supervised Learning?
--------------------------------------------

While many components of SSL are familiar to researchers, successfully training an SSL method involves a dizzying set of choices, from the pretext tasks to training hyper-parameters. SSL research has a high barrier to entry due to (i) its computational cost, (ii) the absence of fully transparent papers detailing the intricate implementations required to fully enable SSL's potential, and (iii) the absence of a unified vocabulary and theoretical view of SSL. Because SSL established a paradigm distinct from traditional *reconstruction-based* unsupervised learning methods such as (denoising, variational) autoencoders [@vincent2008extracting; @vincent2010stacked; @kingma2013auto], our vocabulary for understanding SSL in a unified view is limited. In fact, attempts at unifying SSL methods under a single viewpoint have only started to emerge in the last year [@haochen2021provable; @balestriero2022contrastive; @shwartz2022we; @garrido2022duality]. Without a common ground to characterize the different components of SSL methods, it is more challenging for researchers to start working on SSL methods. Meanwhile, SSL research is in dire need of new researchers, since SSL is now deployed throughout the real world. Yet many open research questions remain regarding SSL's generalization guarantees, fairness properties, and robustness to adversarial attacks or even naturally occurring variations. Such questions are crucial to the reliability of SSL methods.

Furthermore, SSL---which is empirically driven---comes with many moving pieces (mostly hyper-parameters) that may impact key properties of the final representations and are not necessarily well detailed in published work. That is, to start studying SSL methods, one must first exhaustively probe those methods empirically to fully grasp the impact and behaviors of all those moving pieces. Such empirical blind spots are strong limitations, as they demand large computational resources and pre-existing hands-on experience. All in all, the co-occurrence of state-of-the-art performance from seemingly different yet overlapping methods, the scarcity of theoretical research, and widespread real-world deployment makes the need for a cookbook unifying the techniques and their recipes essential to lower SSL's research barrier to entry.

Our goal is to lower the barrier to entry into SSL research by laying the foundations and latest SSL recipes in the style of a cookbook. To successfully cook, you must first learn the basic techniques: chopping, sautéing, etc. We begin in `\Cref{sec:methods}`{=latex} with the fundamental techniques of self-supervised learning using a common vocabulary. Specifically, we describe the families of methods along with theoretical threads to connect their objectives in a unified perspective. We highlight key concepts such as loss terms or training objectives in concept boxes. Next, a cook must learn to skillfully apply the techniques to form a delicious dish. This requires learning existing recipes, assembling ingredients, and evaluating the dish. In `\Cref{sec:practical_matters}`{=latex} we introduce the practical considerations to implementing SSL methods successfully. We discuss common training recipes including hyperparameter choices, how to assemble components such as architectures and optimizers, as well as how to evaluate SSL methods. We also share practical tips from leading researchers on common training configurations and pitfalls. We hope this cookbook serves as a practical foundation for successfully training and exploring self-supervised learning.

The Families and Origins of SSL {#sec:methods}
===============================

SSL methods have enjoyed a renaissance since 2020, thanks in large part to the availability of extremely large datasets and high-memory GPUs. However, the origins of SSL go back to the very beginning of the deep learning era.

Origins of SSL
--------------

Contemporary methods build upon the knowledge we gained from early experiments. In this section, we give a brief overview of the main ideas of SSL prior to 2020. While many of the specific methods have fallen out of mainstream use because they no longer provide state-of-the-art performance on benchmark problems, and so will not be discussed in great detail, the ideas from these papers form the foundation for many of the modern methods. For example, the core objectives of restoring missing or distorted parts of an input, or of contrasting two views of the same image, underlie many modern SSL methods. Early progress in SSL focused on the development of methods that fell into the following (sometimes overlapping) categories:

```{=latex}
\vspace{4pt}
```
**1. Information restoration:** A wide range of methods have been developed that mask or remove something from an image and then train a neural network to restore the missing information. Colorization-based SSL methods convert an image to grayscale and then train a network to predict the original RGB values [@zhang2016colorful; @larsson2016learning; @vondrick2018tracking]. Because colorization requires understanding object semantics and boundaries, it was demonstrated as an early SSL method for object segmentation. The most straightforward application of information restoration is to mask (i.e., remove) a portion of an image and then train a network to inpaint the missing pixel values [@pathak2016context]. This idea evolved into masked auto-encoding methods [@he2022masked], in which the masked region is a union of image patches that can be predicted using a transformer.
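
As a rough illustration of the masking-and-restoration setup, the sketch below masks square patches of a toy single-channel image and scores a reconstruction only on the hidden patches. The names `mask_patches` and `reconstruction_loss` are ours, and this is a simplification, not MAE's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(img, patch=4, mask_ratio=0.5, rng=rng):
    # Split the image into non-overlapping patches and zero out a random
    # subset; return the masked image and which patches were hidden.
    h, w = img.shape
    gh, gw = h // patch, w // patch
    hidden = rng.random((gh, gw)) < mask_ratio
    masked = img.copy()
    for i in range(gh):
        for j in range(gw):
            if hidden[i, j]:
                masked[i*patch:(i+1)*patch, j*patch:(j+1)*patch] = 0.0
    return masked, hidden

def reconstruction_loss(pred, target, hidden, patch=4):
    # MAE-style objective: mean squared error on the hidden patches only.
    errs = []
    gh, gw = hidden.shape
    for i in range(gh):
        for j in range(gw):
            if hidden[i, j]:
                p = pred[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
                t = target[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
                errs.append(np.mean((p - t) ** 2))
    return float(np.mean(errs)) if errs else 0.0

img = rng.standard_normal((16, 16))
masked, hidden = mask_patches(img)
# A real method would run `masked` through a network; here we score the
# trivial "predict zeros" baseline to show how the loss is computed.
loss = reconstruction_loss(masked, img, hidden)
```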

```{=latex}
\vspace{4pt}
```
**2. Using temporal relationships in video:** While the focus of this review is on image (and not video) processing, a range of specialized methods have been developed for learning single-image representations by pre-training on videos. Note that information restoration methods are particularly useful for videos, which contain multiple modalities of information that can be masked. @wang2015unsupervised pre-train a model using a triplet loss that promotes similarity between representations of an object in two different frames. The resulting model performed well for object detection. @pathak2017learning train a model to predict the motion of objects in a single frame, and adapt the resulting features to solve single-frame detection problems. @agrawal2015learning predict the ego-motion of a camera given multiple frames. @owens2016ambient propose to remove the audio track from a video and then predict the missing sound. For specialized applications like depth mapping, self-supervised methods have been proposed that learn monocular depth models from unlabeled image pairs [@eigen2014depth] and, later, from the frames of a single-camera video [@zhou2017unsupervised]. Such methods remain an active area of research.

```{=latex}
\vspace{4pt}
```
**3. Learning spatial context:** This category of methods trains a model to understand the relative positions and orientations of objects within a scene. RotNet [@gidaris2018unsupervised] masks the direction of gravity by applying a random rotation and then asks the model to predict the rotation. @doersch2015unsupervised introduce one of the first SSL methods of this kind, which simply predicts the relative location of two randomly sampled patches in an image. This strategy was superseded by "jigsaw" methods [@pathak2016context; @noroozi2018boosting] that break an image into an array of disjoint patches and predict the relative location of each. A different spatial task is learning to count [@noroozi2017representation]: the model is trained to output the number of objects in an image in a self-supervised way.
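
The rotation pretext task is simple enough to sketch directly. The snippet below is a toy version of the RotNet idea (not the paper's implementation): the transformed image is the input and the rotation index is a free label:

```python
import numpy as np

rng = np.random.default_rng(0)

def rotation_pretext(img, rng):
    # Rotate by a random multiple of 90 degrees; the rotation index
    # k in {0, 1, 2, 3} becomes the classification label, so the
    # "annotation" comes for free from the transform itself.
    k = int(rng.integers(0, 4))
    return np.rot90(img, k), k

img = rng.standard_normal((8, 8))
rotated, label = rotation_pretext(img, rng)
# Sanity check: undoing the labeled rotation recovers the original image,
# confirming the label fully describes the applied transform.
restored = np.rot90(rotated, -label)
```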

```{=latex}
\vspace{4pt}
```
**4. Grouping similar images together:** One can learn rich features by grouping semantically similar images together. K-means clustering is one of the most widely used methods from classical machine learning, and a number of studies have adapted it to perform SSL with neural models. Deep clustering alternates between assigning labels to images by performing k-means in the feature space and updating the model to respect these assigned class labels [@caron2018deep]. More recent treatments of this approach use mean-shift updates to push features towards their cluster center, and have been shown to complement BYOL, a method based on two networks in which one network learns to predict the representations produced by the other [@koohpayegani2021mean] (discussed in Section `\ref{sec:self-distillation}`{=latex}). Other improvements to deep clustering include using optimal transport methods in feature space to create more informative clusters [@asano2019self].
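
The k-means pseudo-labeling step at the heart of deep clustering can be sketched as follows. This is a simplified, single-pass version; a real method would alternate this step with training the encoder to predict the resulting labels:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_pseudo_labels(features, k=3, iters=10, rng=rng):
    # Run plain k-means in feature space and return one pseudo-label per
    # sample. Deep clustering would treat these labels as classification
    # targets for the next round of encoder training.
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every sample to every center, shape (n, k).
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # skip empty clusters
                centers[c] = features[labels == c].mean(axis=0)
    return labels

feats = rng.standard_normal((30, 8))   # stand-in for encoder features
labels = kmeans_pseudo_labels(feats)   # one pseudo-label per sample
```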

```{=latex}
\vspace{4pt}
```
**5. Generative models:** An early influential SSL method is greedy layer-wise pretraining [@bengio2006greedy], in which the layers of a deep network are trained one at a time using an autoencoder loss. An analogous approach from the same period used Restricted Boltzmann Machines (RBMs), which could be trained layer-wise and stacked to create deep belief nets [@hinton2006fast]. While these methods were abandoned in favor of simpler initialization strategies and longer training runs, they were historically impactful uses of SSL, as they enabled the training of the first \`\`deep" networks. Later advancements improved on the representation learning ability of autoencoders, including denoising autoencoders [@vincent2008extracting], cross-channel prediction [@zhang2017split], and deep canonically correlated autoencoders [@wang2015deep]. Nonetheless, it was ultimately found that representation transferability is better when the autoencoder is asked to restore a missing part of its input, resulting in the \`\`information restoration" category of SSL methods.

Generative Adversarial Networks (GANs) [@goodfellow2014generative] consist of an image generator and a discriminator that differentiates real images from generated images. Both components of this model pair can be trained without supervision, and both potentially contain knowledge useful for transfer learning. Early GAN papers [@salimans2016improved] experimented with downstream image classification using GAN components. Specialized feature learning routines have also been developed that modify the discriminator [@springenberg2015unsupervised], add a generator [@dai2017good], or learn additional mappings from image to latent space [@donahueadversarial] to improve transfer learning.

```{=latex}
\vspace{4pt}
```
**6. Multi-view invariance:** Many modern SSL methods, especially those that we focus on in this article, use contrastive learning to create feature representations that are invariant to simple transforms. The idea of contrastive learning is to encourage a model to represent two augmented versions of an input similarly. A number of methods led the charge in this direction by enforcing invariance in various ways before contrastive learning was widely adopted.

One of the most popular frameworks for learning from unlabeled data is to use a weakly trained network to apply pseudolabels to images, and then train using these labels in a standard supervised fashion [@lee2013pseudo]. This approach was later improved by enforcing invariance to transformations. Virtual adversarial training [@miyato2018virtual] trains a network on images using their pseudolabels, and additionally performs adversarial training so that learned features are nearly invariant to small perturbations to the input image. Later works focused on maintaining invariance to data augmentation transforms. Important early methods in this category include MixMatch [@berthelot2019mixmatch], which chooses pseudolabels by averaging outputs of a network on several different random augmentations of the training images, resulting in labels that are augmentation invariant. Around the same time, it was discovered that good SSL performance could be achieved by training a network to maximize the mutual information between the representations of an image under different views [@bachman2019learning]. These augmentation-based methods formed a bridge between the older methods described above and the contemporary methods that are the focus of this paper.

With these origins, we now turn to categorizing SSL into four broad families: The Deep Metric Learning Family, The Self-Distillation Family, The Canonical Correlation Analysis Family, and the Masked Image Modeling Family.

The Deep Metric Learning Family: SimCLR/NNCLR/MeanSHIFT/SCL
-----------------------------------------------------------

The Deep Metric Learning (DML) family of methods is based on the principle of encouraging similarity between semantically transformed versions of an input. DML originated with the idea of a *contrastive loss*, which transforms this principle into a learning objective. The contrastive loss was first introduced in [@bromley1993signature] and then more formally defined in [@chopra2005learning; @hadsell2006dimensionality]. In DML, one trains a network to predict whether two inputs are from the same class (or not) by making their embeddings close (or far apart). Since the data is unlabeled, similar inputs are often obtained by forming variants of a single input using known semantics-preserving transformations. The variants of an input are called *positive pairs* or examples; the samples we wish to make dissimilar are called *negatives*. Often there is a margin parameter, $m$, imposing that the distance between examples from different classes be larger than $m$. The triplet loss [@weinberger2009distance; @chechik2010large; @schroff2015facenet] shares a similar spirit but is composed of triplets: a query, a positive example, and a negative example (see `\cref{eq:triplet}`{=latex}). Compared to the contrastive loss, the triplet loss only requires the difference of (dis-)similarities between the positive and negative examples to the query point to be larger than a margin $m$.
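As a concrete illustration, the two margin-based objectives can be sketched in numpy. This is our own toy sketch operating on raw vectors; in practice the inputs are embeddings produced by an encoder, and the margin $m$ is a hyperparameter.

```python
import numpy as np

def contrastive_loss(z1, z2, same, m=1.0):
    """Pairwise contrastive loss: pull positives together (distance term),
    push negatives apart up to a margin m (squared hinge term)."""
    d = float(np.linalg.norm(z1 - z2))
    return d if same else max(m - d, 0.0) ** 2

def triplet_loss(anchor, pos, neg, m=1.0):
    """Triplet loss: the anchor-positive distance must be smaller than
    the anchor-negative distance by at least the margin m."""
    return max(float(np.linalg.norm(anchor - pos))
               - float(np.linalg.norm(anchor - neg)) + m, 0.0)

# toy embeddings: the positive is near the anchor, the negative far away
anchor = np.array([0.0, 0.0])
pos, neg = np.array([0.1, 0.0]), np.array([2.0, 0.0])
```

Note that a negative already beyond the margin contributes zero loss, which is what makes hard negatives (those still inside the margin) the informative ones.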

The shift from DML to what is now referred to as SSL might have occurred when @sohn2016improved introduced the (N+1)-tuple loss, a loss similar to the contrastive predictive coding (CPC) loss from [@oord2018representation]. Reusing the positive views of other samples as negatives for a given pair is introduced there as an efficient strategy, coined the *N-pair-mc loss*. @ni2021close show that contrastive learning is a special case of meta-learning, and that existing meta-learners can be directly applied to SSL with competitive performance. CPC was extended to images in [@henaff2020data]. A key ingredient in CPC was the introduction of the InfoNCE loss described in `\ref{fig:infonce_plus}`{=latex} [@oh2016deep], which became central to SSL.

To summarize, the main paradigm shift between DML and contrastive SSL arises from a few key changes, namely using data augmentation instead of sampling to obtain the positive/negative pairs, the use of deeper networks, and the use of a predictor network, which we note in `\Cref{fig:dml_ssl}`{=latex}. One of the most prominent methods emerging from this paradigm shift in the deep metric learning family is SimCLR.

**SimCLR** learns visual representations by encouraging similarity between two augmented views of an image. In SimCLR, the two views are formed by applying a combination of transformations including random resizing, cropping, color jittering, and random blurring. After encoding each view, SimCLR uses a *projector*, often an MLP (multi-layer perceptron) with a ReLU (rectified linear unit) activation, to map the initial embeddings into another space where the contrastive loss is applied to encourage similarity between the views. For downstream tasks, extracting the representation before the projector has been shown to improve performance. Further discussions of the role of the projector are in sections `\ref{sec:projector_theory}`{=latex} and `\ref{sec:projector}`{=latex}.

Another key ingredient, along with the InfoNCE loss used in SimCLR, is the non-parametric softmax introduced by @wu2018unsupervised. The name reflects the removal of the need for a \"parametrized\" linear layer on top of the representation to compute the softmax: representations are instead compared with each other directly. This loss formulation already contained a *temperature parameter* in the softmax, which controls the sharpness of the predicted distribution. Other noteworthy developments include @schroff2015facenet, who use a triplet loss with active triplet selection (hard positives, hard negatives), either online from the current mini-batch or from a past checkpoint, akin to momentum networks (discussed in section `\ref{sec:self-distillation}`{=latex}). @weinberger2009distance introduced push-pull weighting in a triplet loss, pushing negatives apart while pulling positives together, to increase the margin of K-NN based models. @tian2020contrastive introduced the possibility of many positive views.
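Putting these ingredients together, the temperature-scaled InfoNCE/NT-Xent objective on a batch of paired views can be sketched in numpy. This is an illustrative sketch of ours, not SimCLR's implementation: the encoder and projector are abstracted away, and the rows of `z1`, `z2` stand in for the projected embeddings of the two views.

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """InfoNCE/NT-Xent: each embedding must identify its augmented view
    among all other embeddings in the batch."""
    z = np.concatenate([z1, z2]).astype(float)      # (2N, d)
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity below
    sim = z @ z.T / tau                             # temperature-scaled sims
    np.fill_diagonal(sim, -np.inf)                  # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    m = sim.max(1, keepdims=True)                   # stable log-sum-exp
    log_z = m[:, 0] + np.log(np.exp(sim - m).sum(1))
    return float(-(sim[np.arange(2 * n), pos] - log_z).mean())

# toy check: matched views should score much better than shuffled ones
views = np.eye(4)            # four orthogonal toy embeddings
matched = nt_xent(views, views.copy())
```

Lowering `tau` sharpens the softmax, which is the mechanism referred to above as controlling the sharpness of the predicted distribution.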

Aside from forming positives using semantics-preserving transformations, one can also mine positive pairs that arise naturally in the data. An iconic example is the triplet loss of @sermanet2018time, coined Time-Contrastive (TC) learning, where positive pairs come from nearby video frames while negatives come from far-away frames. Nonlinear ICA [@hyvarinen2016unsupervised] proved that the log-PDF can be learned by solving classification tasks. @alexey2015discriminative trains on a classification pretext task in which each image patch and its transformed versions form their own class. One disadvantage is that this setup can involve too many classes, causing performance to degrade on downstream tasks. To overcome this, NCE was successfully employed in @mnih2012fast [@mnih2013learning] to modify the denominator so that it does not loop over all classes; this is an alternative to sampling-based estimation of the gradient, which was found to be less stable [@bengio2003quick; @bengio2008adaptive]. This line of work also introduced what would become the momentum encoder, by imposing that feature maps do not vary quickly, an idea referred to as a proximal algorithm [@parikh2014proximal]. One other consideration in SSL motivated by DML is \`\`hard negative mining", where negative samples are intentionally selected to be close to, yet distinct from, the positives, forming a more challenging learning objective. Next, we describe an alternative to deep metric learning based on self-distillation.

```{=latex}
\centering
```
```{=latex}
\begin{tikzpicture}
    \SummaryCard{9.5}{Noise Contrastive Estimation: Learning Unnormalized Densities}{\begin{minipage}{0.95\textwidth}
    \begin{itemize}
        \item introduced by \citet{gutmann2010noise} to learn unnormalized probability distributions given i.i.d observations $\vx_1,\dots,\vx_N$ from the distribution $X\sim p_{X}$. NCE enables approximation of $p_{X}$ by a parametrized function $f_{\theta}$ without enforcing $\int f_{\theta}(\vx)d\vx=1$ during training
        \item let's first introduce a noise variable $\epsilon \sim p_{\epsilon}$ and let's consider the following mixture distribution
        $$T\sim \mathcal{B}(s),s \in (0,1)$$ $$ V \sim X1_{\{T=1\}}+\epsilon 1_{\{T=0\}}$$
        \item using Bayes rule and denoting $\eta=(1-s)/s$ we have $$p_{T|V}(T=1|V=\vv)=\frac{p_{V|T}(V=\vv|T=1)}{p_{V|T}(V=\vv|T=1)+\eta p_{V|T}(V=\vv|T=0)},$$
        \item parametrize $p_{V|T}(V=\vv|T=1)=f_{\theta}(\vv)\exp(c)$ with $f_{\theta}>0$ and learnable parameters $\{\theta,c\}$
        \item minimize the NLL of logistic regression (usual binary classification set-up)
        \begin{align*}
            \mathcal{L}(\theta,c)=-\mathbb{E}_{(\vv,t)\sim(V,T)}\log[p_{T|V}(T=t|V=\vv)]
        \end{align*}
        \item the minimum is attained at $f_{\theta^*}\exp(c^*)=p_{X}$ if $p_{X}(\vv)>0\implies p_{\epsilon}(\vv)>0$. If $f_{\theta}$ is powerful enough, one can set $c=0$ and the model will self-normalize \citep{mnih2012fast}
        \item \citet{ceylan2018conditional} extends NCE to nonindependent noise realization i.e. $\epsilon$ depends on $X$, \citet{ma2018noise} considers conditional distribution $X|Y$, \citet{dyer2014notes} compares NCE and Negative Sampling \citep{mikolov2013distributed} (the latter being a special case of the former) both extending Importance Sampling estimation \citep{bengio2003quick} of the partition function (normalization factor)
        \end{itemize}
    \end{minipage}
    }
\end{tikzpicture}
```
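To illustrate the card above, a small numpy experiment (our own toy, with $s=1/2$ so that $\eta=1$): we evaluate the binary NLL with the true data density plugged in as $f_{\theta}$ versus deliberately wrong densities, and the true density achieves the lowest loss, as NCE's consistency result predicts.

```python
import numpy as np

def nce_posterior(v, f, p_noise, eta=1.0):
    """P(T=1 | V=v): probability that v is a data sample rather than
    noise, given unnormalized model f and noise density p_noise."""
    return f(v) / (f(v) + eta * p_noise(v))

def nce_nll(f, data, noise, p_noise):
    """Binary logistic NLL of the NCE classifier (s = 1/2, eta = 1)."""
    p1 = nce_posterior(data, f, p_noise)
    p0 = 1.0 - nce_posterior(noise, f, p_noise)
    return float(-(np.log(p1).mean() + np.log(p0).mean()) / 2)

def gaussian(mu, sigma=1.0):
    """Normalized Gaussian density, used here as a candidate f_theta."""
    return lambda v: (np.exp(-0.5 * ((v - mu) / sigma) ** 2)
                      / (sigma * np.sqrt(2 * np.pi)))

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 5000)    # observations from p_X = N(0, 1)
noise = rng.normal(0.0, 2.0, 5000)   # noise samples from p_eps = N(0, 4)
p_noise = gaussian(0.0, 2.0)
```

In an actual NCE fit, one would minimize `nce_nll` over the parameters of an unnormalized $f_{\theta}$; here we only compare candidate densities to show the loss is discriminative.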
```{=latex}
\centering
```
```{=latex}
\begin{tikzpicture}
    \SummaryCard{14}{A Brief History of the infoNCE loss}{\begin{minipage}{0.95\textwidth}
    In the descriptions below $z_i$ denotes the model representation of sample i, $\sP$ denotes the set of positive samples, and $\tau$ is the temperature hyperparameter.
    
    \begin{itemize}
        \item \citet{bromley1993signature,chopra2005learning} introduces the {\bf contrastive loss} for Deep Metric Learning
        \begin{align}
            \mathcal{L}_{\rm cont}(\mZ)=\sum_{(i,j)\in \sP}\|\vz_j-\vz_i\|_2+\sum_{(i,j)\not \in \sP}\relu(m-\|\vz_i-\vz_j\|_2)^2,m>0,\label{eq:contrastive}
        \end{align}
        \item \citet{goldberger2004neighbourhood} introduced {\bf Neighbourhood Component Analysis} to improve maximum margin of NN-classifiers by learning a quadratic distance (Mahalanobis distance is a special case of such a distance) using
        \begin{align}          
            \mathcal{L}_{\rm NCA}(\mZ)=-\sum_{(i,j)\in\sP}\frac{e^{-\|\vz_i-\vz_j\|_2^2}}{\sum_{(k,l)\in[N]^2}e^{-\|\vz_k-\vz_l\|_2^2}},
        \end{align}
        \item \citet{weinberger2009distance,chechik2010large} extends \cref{eq:contrastive} to a {\bf triplet loss}
        \begin{align}
            \mathcal{L}_{\rm triplet}(\mZ)=\sum_{(i,j) \in \sP}\sum_{k:(i,k)\not \in \sP}\relu(\|\vz_i-\vz_j \|_2-\|\vz_i-\vz_k \|_2+m),m>0,\label{eq:triplet}
        \end{align}
        \item \citet{sohn2016improved} extends the triplet and NCA losses to form the {\bf (N+1)-tuple loss}
        \begin{align}          
            \mathcal{L}_{\rm tuple}(\mZ)=-\sum_{(i,j)\in\sP}\log\left(\frac{e^{\langle\vz_i,\vz_j\rangle}}{\sum_{(k,l)\in \sP}e^{\langle\vz_i,\vz_l\rangle}}\right)+\beta \|\mZ\|_F^2,
        \end{align}
        where the denominator sum only runs through one view of the other samples, and the negative distance is replaced by the inner product \underline{and} an $\ell_2$-penalty on the feature maps $\mZ$. Explicit normalization was found to be unstable, yet it was later introduced (along with a temperature parameter) in \citet{yu2019deep}.
        \item \citet{wu2018unsupervised} introduces the {\bf Noise-Contrastive Estimation} (NCE) loss without positive pairs
        \begin{align}          
            \mathcal{L}_{\rm NCE}=-\sum_{n=1}^{N}\log\left(\frac{e^{\CoSim( \vz_i,\vz_i^{(t-1)})/\tau}}{\sum_{k=1}^{N}e^{\CoSim(\vz_i,\vz_k)/\tau}}\right)+\beta \|\mZ-\mZ^{(t-1)}\|_F^2,\label{eq:wu_nce}
        \end{align}
        also coining the term non-parametric softmax. The NCE loss introduces explicit normalization, a temperature parameter $\tau$, and the idea of a momentum encoder (via a proximal optimization method), and employs NCE to approximate the denominator when $N$ is large
        \item \citet[{\bf CPC}]{oord2018representation} coins the name {\bf infoNCE} by removing the proximal constraint and using positive pairs
        \begin{align}          
            \mathcal{L}_{\rm infoNCE}=-\sum_{(i,j)\in\sP}\log\left(\frac{e^{\CoSim( \vz_i,\vz_j)/\tau}}{\sum_{k=1}^{N}e^{\CoSim(\vz_i,\vz_k)/\tau}}\right),
        \end{align}
        \end{itemize}
    \end{minipage}
    }
\end{tikzpicture}
```
```{=latex}
\centering
```
```{=latex}
\begin{tikzpicture}
    \SummaryCard{10}{The infoNCE Offsprings}{\begin{minipage}{0.95\textwidth}
    \begin{itemize}
        \item \citet[{\bf MoCo}]{he2020momentum} introduces momentum encoder as an alternative to the memory bank regularization of \cref{eq:wu_nce} and introduces a queue to store many negative samples from previous batches; \citep[{\bf MoCoV2}]{chen2020improved} adds a projector, \citep[{\bf MoCoV3}]{chen2021empirical} adds ViTs
        \item \citet[{\bf SimCLR}]{chen2020simple} removes the momentum encoder and the $i^{\rm th}$ term from the denominator coining it {\bf NT-Xent} (Normalized Temperature-scaled cross entropy)
        \begin{align*}          
            \mathcal{L}_{\rm NT-Xent}(\mZ)=-\sum_{(i,j)\in \sP}\log\left(\frac{e^{\CoSim( \vz_i,\vz_j)/\tau}}{\sum_{k=1}^{N}\1_{\{k\not = i\}}e^{\CoSim(\vz_i,\vz_k)/\tau}}\right),
        \end{align*}
        \item \citet[{\bf DCL}]{yeh2021decoupled} additionally removes the positive pair in the denominator
        \begin{align*}          
            \mathcal{L}_{\rm DCL}(\mZ)=-\sum_{(i,j)\in \sP}\log\left(\frac{e^{\CoSim( \vz_i,\vz_j)/\tau}}{\sum_{k=1}^{N}\1_{\{k\not = i \wedge (i,k)\not \in \sP\}}e^{\CoSim(\vz_i,\vz_k)/\tau}}\right),
        \end{align*}
        \item \citet[{\bf NNCLR}]{dwibedi2021little} uses nearest neighbors from a queue $\sQ$
        \begin{align*}          
            \mathcal{L}_{\rm NNCLR}(\mZ)=-\sum_{(i,j)\in \sP}\log\left(\frac{e^{\CoSim(\NN(\vz_i,\sQ), \vz_j)/\tau}}{\sum_{(k,l)\in \sP}e^{\CoSim(\NN(\vz_i,\sQ),\vz_{l})/\tau}}\right),
        \end{align*}
        \item \citet[{\bf RELIC}]{mitrovic2020representation} adds a regularization term to enforce invariance
        \begin{align*}          
            \mathcal{L}_{\rm RELIC}(\mZ)=-\sum_{(i,j)\in \sP}\log\left(\frac{e^{\CoSim( \vz_i,\vz_j)/\tau}}{\sum_{k=1}^{N}\1_{\{k\not = i\}}e^{\CoSim(\vz_i,\vz_k)/\tau}}\right)+KL(p(\vz_i),p(\vz_j)),
        \end{align*}
        \item \citet[{\bf PCL}]{li2020prototypical} uses prototypes 
        \end{itemize}
    \end{minipage}
    }
\end{tikzpicture}
```
```{=latex}
\centering
```
```{=latex}
\begin{tikzpicture}
    \SummaryCard{4.8}{Paradigm Shift Between Deep Metric Learning and Contrastive SSL}{
    \def\arraystretch{1.5}
        \begin{tabular}{rcl}
             \multicolumn{1}{c}{\bf Deep Metric Learning} &  & \multicolumn{1}{c}{\bf Contrastive SSL}\\
             \parbox{6.2cm}{positive/negative pairs come from labels or fixed transforms e.g. two halves of an image}& $\implies$&\parbox{6.45cm}{positive pairs come from designed DAs that are continuously sampled, negative pairs are all non-positive pairs regardless of class membership}\\
             Hard-Negative Sampling for each mini-batch& $\implies$  & random sampling\\
             encoder DN & $\implies$ & encoder DN $+$ projector MLP\\ 
             small dataset ($N<200$k)& $\implies$&large dataset\\
             zero-shot k-NN validation & $\implies $&
             
             \parbox{5.5cm}{-zero-shot k-NN validation\\ -zero/few-shot/fine-tuning linear probing}
        \end{tabular}
    }
\end{tikzpicture}
```
The Self-Distillation Family: BYOL/SimSIAM/DINO {#sec:self-distillation}
-----------------------------------------------

```{=latex}
\centering
```
```{=latex}
\begin{tikzpicture}
    \SummaryCard{10.5}{A Brief History of the Self-Distillation Family}{\begin{minipage}{0.95\textwidth}
    \begin{itemize}
    \item \citet[{\bf MMC}]{xu2004maximum,joulin2010discriminative} search for pseudo-labels such that a classifier trained on them would have a good margin (on the true labels)
        \item \citet[{\bf NaT}]{bojanowski2017unsupervised} introduces Noise as Targets, i.e. $C$ real {\em frozen} targets $\mM\triangleq [\vm_1,\dots,\vm_C] \in \mathbb{R}^{D \times C}$ with {\em assignment constraints} on $P\triangleq [\vp_1,\dots,\vp_N] \in \{0,1\}^{C \times N}$ with
        \begin{align}
            \mathcal{L}_{\rm NaT}=\min_{P:P\1\leq \1,\,P^{T}\1=\1}-\sum_{n=1}^{N}CosSim(f_{\theta}(\vx_n),\mM \vp_n),
        \end{align}
        \item \citet[{\bf DeepCluster}]{caron2018deep} extends NaT by allowing learning of the targets in a K-means fashion with various cluster sampling and reallocation tricks to prevent collapse
        \begin{align}
            \mathcal{L}_{\rm DeepCluster}=\text{CrossEntropy}\left(f_{\theta}(\vx),\argmin_{k}\|f_{\theta}(\vx)-\vm_k\|_2^2\right) + \text{K-means}(f_{\theta}(\mX),\mM),
        \end{align}
        \item \citet[{\bf SLSC}]{YM.2020Self-labelling} further prevents collapse in DeepCluster through {\em constrained clustering membership} using Sinkhorn to infer the cluster membership probabilities
        \item \citet[{\bf BYOL}]{grill2020bootstrap} introduces BYOL, removing the clustering step, introducing a {\em predictor} and a projector network, defining the continuous targets as the output of a momentum network, renormalizing each sample representation by its $\ell_2$-norm, and leveraging positive pairs. The predictor acts as a whitening operator preventing collapse \citep{tian2021understanding}, and the momentum network can be applied only to the projector \citep{pham2022pros}
        \item \citet[{\bf SimSIAM}]{chen2021exploring}  replaces the BYOL moving average encoder by a stop-gradient
        \item \citet[{\bf DINO}]{caron2021emerging} introduces DINO which extends BYOL and SimSIAM to discrete representations/targets and still relies on momentum encoder
        \item \citet[{\bf iBOT}]{zhou2021ibot} and \citet[{\bf DINOv2}]{oquab2023dinov2} build upon DINO by combining its objective with a latent space masked-image modeling one, combining the best of both families 
        \end{itemize}
    \end{minipage}
    }
\end{tikzpicture}
```
Self-distillation methods such as BYOL [@grill2020bootstrap], SimSIAM [@chen2021exploring], and DINO [@caron2021emerging], along with their variants, rely on a simple mechanism: feeding two different views to two encoders, and mapping one to the other by means of a predictor. To prevent the encoders from *collapsing* by predicting a constant for any input, various techniques are employed. A common approach is to update one of the two encoders' weights with a running average of the other encoder's weights. We discuss the particularities of each method below.

**BYOL** (bootstrap your own latent) first introduced self-distillation as a means to avoid collapse. BYOL uses two networks along with a predictor to map the outputs of one network to the other. The network predicting the output is called the *online* or *student* network while the network producing the target is called the *target* or *teacher* network. Each network receives a different view of the same image formed by image transformations including random resizing, cropping, color jittering, and brightness alterations. The student network is updated throughout training using gradient descent. The teacher network is updated with exponential moving average (EMA) updates of the weights of the online network. The slow updates induced by the exponential moving average create an asymmetry that is crucial to BYOL's success. The loss can be defined as $$\mathcal{L}_{\rm BYOL}\left(\theta_{\rm s},\gamma\right)=\mathbb{E}_{(\vx,t_1,t_2)\sim(X,T_1,T_2)}\left[ \left\| \norm(p_{\gamma}(f_{\theta_{\rm s}}(t_1(\vx))))-\norm(f_{\theta_{\rm t}}(t_2(\vx)))\right\|_2^2\right]$$ where the two vectors in representation space are $\ell_2$-normalized, i.e. $$\begin{aligned}
    \norm(\vv) = \frac{\vv}{\max(\|\vv\|_2,\eps)},\end{aligned}$$ where $\epsilon$ is often set to $10^{-12}$. $f_{\theta_{\rm s}}$ is the online encoder network, often denoted as the *student*, parametrized by $\theta_{\rm s}$, and $p_{\gamma}$ is the predictor network parameterized by $\gamma$. $\vx \sim X$ is the input sampled from the data distribution $X$, and $t_1(\vx), t_2(\vx)$ are two augmented views of $\vx$ where $t_1 \sim T_1 , t_2 \sim T_2$ are two data augmentations. The target network $f_{\theta_{\rm t}}$ has the same architecture as the student and is updated by EMA, with $\xi$ controlling to what degree the target network preserves its history, as in $$\theta_{\rm t}\leftarrow\xi \theta_{\rm t}+(1 -\xi)\theta_{\rm s}$$ with initialization $\theta_{\rm t}=\theta_{\rm s}$.
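In numpy, the normalization, the regression loss, and the EMA update can be sketched as follows. This is a toy sketch of ours on raw vectors; the encoder, projector, and predictor networks are abstracted away.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """norm(v) = v / max(||v||_2, eps)."""
    return v / max(np.linalg.norm(v), eps)

def byol_loss(student_pred, teacher_out):
    """Squared distance between the normalized predictor output and the
    normalized teacher embedding (a stop-gradient target in practice);
    for unit vectors this equals 2 - 2 cos(angle between them)."""
    diff = l2_normalize(student_pred) - l2_normalize(teacher_out)
    return float((diff ** 2).sum())

def ema_update(theta_t, theta_s, xi=0.996):
    """Teacher weights: theta_t <- xi * theta_t + (1 - xi) * theta_s."""
    return xi * theta_t + (1 - xi) * theta_s
```

Because only directions matter after normalization, the loss is zero whenever the predictor output and the teacher embedding are positively collinear, and maximal (4) when they point in opposite directions.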

**SimSiam** is aimed at understanding which components in BYOL are most important. SimSiam showed that the EMA was not strictly necessary in practice, even though it led to a small boost in performance. This enabled the use of a simplified loss defined by $$\mathcal{L}_{\rm SimSIAM}\left(\theta_{\rm s},\gamma\right)=\mathbb{E}_{(\vx,t_1,t_2)}\left[ \| \norm(p_{\gamma}(f_{\theta_{\rm s}}(t_1(\vx))))-\sg(\norm(f_{\theta_{\rm s}}(t_2(\vx))))\|_2^2\right],$$ where for clarity we omit the distributions from which $x,t_1,t_2$ are sampled. Several works have aimed at understanding how BYOL and SimSiam avoid collapse, such as [@tian2021understanding] and [@halvagal2022predictor], which found that the asymmetry between the two branches is key, as are the training dynamics, which implicitly regularize the variance of the embeddings.

**DINO** performs a centering of the output of the teacher network using a running mean (to avoid sensitivity to mini-batch size) and discretizes (smoothly) the representations by means of a softmax with a temperature $\tau$, usually taken to be around $0.1$, as in $$\mathcal{L}_{\rm DINO}\left(\theta_{\rm s},\gamma\right)=\mathbb{E}_{(\vx,t_1,t_2)}\left[\CE\left(\softmax(f_{\theta_{\rm s}}(t_1(\vx))/\tau),\sg(\softmax(\cent(f_{\theta_{\rm t}}(t_2(\vx)))/\tau))\right)\right],$$ where, akin to BYOL, the teacher maintains a moving average of the student network's weights, usually with the value $\xi$ following a cosine schedule from $0.996$ to $1$ during training. The discretization in DINO caused by the softmax can be interpreted as an online clustering mechanism, where the weights of the last layer before the softmax act as the clustering prototypes. As such, the output of the penultimate layer is clustered using the weights of the last layer.
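The centering, sharpening, and cross-entropy steps can be sketched in numpy as below. This is an illustrative toy of ours: the networks are abstracted away as logit arrays, and the distinct teacher temperature of $0.04$ is one illustrative choice, not a prescribed value.

```python
import numpy as np

def softmax(x, tau):
    """Temperature-scaled softmax over the last axis."""
    x = x / tau
    x = x - x.max(-1, keepdims=True)     # numerical stability
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    (a stop-gradient target in practice) and the student distribution."""
    t = softmax(teacher_logits - center, tau_t)
    s = softmax(student_logits, tau_s)
    return float(-(t * np.log(s + 1e-12)).sum(-1).mean())

def update_center(center, teacher_logits, m=0.9):
    """Running mean of teacher outputs; subtracting it discourages any
    single output dimension from dominating (one collapse mode)."""
    return m * center + (1 - m) * teacher_logits.mean(0)
```

Centering alone would push the teacher toward a uniform distribution, while sharpening alone would push it toward one-hot outputs; DINO relies on the balance of the two to avoid both collapse modes.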

**iBOT** builds on DINO and combines its objective with a masked image modeling objective applied in latent space directly. Here, the target reconstruction is not the image pixels but the same patches embedded through the teacher network.

**DINOv2** further builds on iBOT and improves its performance significantly in both linear and k-NN evaluations by improving the training recipe, the architecture, and by introducing additional regularizers such as KoLeo [@sablayrolles2018spreading]. In addition, DINOv2 curates a larger pretraining dataset consisting of 142 million images (further discussion in `\Cref{sec:weakly-curated-data}`{=latex}).

Many other methods belong to this self-distillation family. MoCo is another popular method based on building a dictionary look-up that was shown, in some cases, to surpass supervised learning on segmentation and object detection benchmarks [@he2020momentum]. Originally, the momentum encoder was introduced as a substitute for a queue in contrastive learning [@he2020momentum], which extends the result of [@dosovitskiy2014discriminative]. MoCo's moving average uses a relatively large momentum with a default value of $\xi=0.999$; this higher momentum value works much better than a smaller value of, say, $\xi=0.9$. When SimCLR introduced the use of a projector and stronger data augmentations, MoCoV2 [@chen2020improved] followed suit with stronger data augmentations and a projector head to boost performance. In a similar spirit, ISD [@tejankar2021isd] compares a query distribution to anchors from the student distribution using a KL-divergence that relaxes the binary distinction between positive and negative samples. MSF [@koohpayegani2021mean] compares a query's nearest-neighbor representation to the student target's representation and then minimizes the $\ell_2$ distance between them with renormalization (akin to cosine similarity maximization). Another approach, SSCD, adapts the contrastive objective to the task of copy detection, outperforming dedicated copy detection models and other contrastive methods [@pizzi2022selfsupervised]. Aside from the widespread use of the contrastive objective, many more methods employ similar running average updates as part of their training mechanism: for example, self-distillation [@hinton2015distilling; @furlanello2018born], Deep Q Networks in reinforcement learning [@mnih2013playing], Mean Teacher in semi-supervised learning [@tarvainen2017mean], and model averaging in supervised and generative modeling [@jean2014using].

The Canonical Correlation Analysis Family: VICReg/BarlowTwins/SWAV/W-MSE
------------------------------------------------------------------------

The SSL canonical correlation analysis family originates with the Canonical Correlation Framework (CCA) [@hotelling1992relations]. The high-level goal of CCA is to infer the relationship between two variables by analyzing their cross-covariance matrices. Specifically, let $\mX\in \mathbb{R}^{N\times D}$ and $\mY\in\mathbb{R}^{N\times D}$ be two sets of $N$ paired observations. The CCA framework seeks two transformations $\mU = f_{x}(\mX)$ and $\mV = f_{y}(\mY)$ such that $$\begin{gathered}
\mathcal{L} = -\sum_{n=1}^{N}\langle \mU_n,\mV_n\rangle,\nonumber\\
 \text{ such that }
\underbrace{\frac{1}{N}\sum_{n=1}^{N}\mU_n=\frac{1}{N}\sum_{n=1}^{N}\mV_n=\mathbf{0}}_{\text{zero-mean representations}},\underbrace{\frac{1}{N}\mU^T\mU=\frac{1}{N}\mV^T\mV=\mI}_{\text{identity covariance representations}},\label{eq:CCA}\end{gathered}$$ with $d$ (the dimension of the output mappings) such that $d \leq \min (\dim(\mX),\dim(\mY))$. Linear CCA [@hotelling1992relations] considers the two mappings to be linear, in which case the optimal parameters can be found through the SVD of $\Sigma_{x}^{-\frac{1}{2}} \Sigma_{xy}\Sigma_{y}^{-\frac{1}{2}}$, involving the covariance matrices of $\mX,\mY$ and their cross-covariance. A major advance in the study of nonlinear CCA was achieved by @breiman1985estimating in the univariate output setting, and by @makur2015efficient in the multivariate output setting, by connecting the solution of `\cref{eq:CCA}`{=latex} to the Alternating Conditional Expectation (ACE) method. @painsky2020nonlinear study the optimal representation for nonlinear CCA through ACE, proving new theoretical bounds that lead to further refinements of CCA.
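The linear case admits a closed-form solution via the SVD of the whitened cross-covariance, which can be sketched in numpy as below. The small ridge term `reg` is our addition for numerical stability, not part of the classical formulation.

```python
import numpy as np

def linear_cca(X, Y, d, reg=1e-6):
    """Linear CCA via the SVD of Sigma_x^{-1/2} Sigma_xy Sigma_y^{-1/2}.
    Returns the two projection matrices and the canonical correlations."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    Sx = X.T @ X / n + reg * np.eye(X.shape[1])   # covariance of X
    Sy = Y.T @ Y / n + reg * np.eye(Y.shape[1])   # covariance of Y
    Sxy = X.T @ Y / n                             # cross-covariance

    def inv_sqrt(S):
        # symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    U, corrs, Vt = np.linalg.svd(inv_sqrt(Sx) @ Sxy @ inv_sqrt(Sy))
    return inv_sqrt(Sx) @ U[:, :d], inv_sqrt(Sy) @ Vt.T[:, :d], corrs[:d]

# toy check: if Y is an (almost surely invertible) linear function of X,
# all canonical correlations should be close to 1
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
Y = X @ rng.normal(size=(3, 3))
Wx, Wy, corrs = linear_cca(X, Y, d=3)
```

Projecting the centered data through `Wx` yields representations with (approximately) identity covariance, matching the whitening constraint in the objective above.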

These ideas were extended to deep learning in Deep Canonically Correlated Autoencoders (DCCAE), an autoencoder regularized via CCA. @hsieh2000nonlinear and @andrew2013deep introduce the objective of jointly learning parameters for two networks, $f_1, f_2$, such that their outputs are maximally correlated. The inputs to these networks are two views $X_1$ and $X_2$. Specifically, the objective is to find parameters $\theta_1, \theta_2$ for each network such that $$\begin{aligned}
(\theta_1^*, \theta_2^*) = \text{argmax}_{(\theta_1, \theta_2)} \text{corr}(f_1(X_1; \theta_1), f_2(X_2; \theta_2)).\end{aligned}$$

This DCCAE objective was extended to multivariate outputs and arbitrary DNNs in @wang2015deep.

From these origins stem SSL methods such as VICReg [@bardes2021vicreg], Barlow Twins [@zbontar2021barlow], SWAV [@caron2020unsupervised], and W-MSE [@ermolov2021whitening]. **VICReg**, the most recent among these methods, balances three objectives based on co-variance matrices of representations from two views: variance, invariance, and co-variance, as shown in Figure `\ref{fig:vicreg-diagram}`{=latex}. Regularizing the variance along each dimension of the representation prevents collapse, the invariance term ensures two views are encoded similarly, and the co-variance term encourages different dimensions of the representation to capture different features.

```{=latex}
\centering
```
![**VICReg**: penalizes variance, invariance, and co-variance terms to learn representations from unlabeled data.](figures/vicreg_archi.png){#fig:vicreg-diagram width="\\textwidth"}
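The three VICReg terms can be sketched directly from their definitions. This is a toy sketch of ours: the coefficients weighting the three terms in the full loss are hyperparameters and are omitted, and `gamma` is the target standard deviation of the variance hinge.

```python
import numpy as np

def vicreg_terms(z1, z2, gamma=1.0, eps=1e-4):
    """The three VICReg terms for two batches of embeddings (rows =
    samples): invariance (MSE between views), variance (hinge on the
    per-dimension std), covariance (squared off-diagonal covariances)."""
    inv = float(((z1 - z2) ** 2).sum(1).mean())

    def variance(z):
        std = np.sqrt(z.var(0) + eps)
        return float(np.maximum(gamma - std, 0.0).mean())

    def covariance(z):
        zc = z - z.mean(0)
        c = zc.T @ zc / (len(z) - 1)
        off = c - np.diag(np.diag(c))
        return float((off ** 2).sum() / z.shape[1])

    return inv, variance(z1) + variance(z2), covariance(z1) + covariance(z2)
```

A collapsed batch (every row identical) has zero invariance and covariance terms but a large variance penalty, which is exactly the failure mode the variance term exists to rule out.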

Masked Image Modeling`\label{sec:mim}`{=latex}
----------------------------------------------

A number of prominent early self-supervised pre-training algorithms for computer vision applied degradations to training images, such as decolorization [@zhang2016colorful], noise [@vincent2008extracting], or shuffling image patches [@noroozi2016unsupervised], and taught models to undo these degradations. Context encoders instead mask out large portions of an image and replace their pixel values with white, teaching an autoencoder to inpaint the white patches [@pathak2016context]. This early attempt at masked image modeling does not achieve competitive performance with supervised learning on downstream tasks, and pre-dates vision transformer architectures which modern masked training routines build upon. Subsequently, BERT [@devlin2019bert] shook up the natural language processing world by replacing text tokens input to a transformer language model with learnable mask tokens and teaching the model to recover the original text. This paradigm, termed *masked language modeling* (MLM), can also be interpreted as a form of the above strategy, degrading a sample via masking and teaching a model to undo the masking degradation. MLM, along with span-infilling techniques, remains popular as a SSL objective for large language models [@raffel2020exploring; @wang_what_2022; @tay_unifying_2022].

We can similarly mask out portions of an image and teach a model to inpaint them. This vision pre-training strategy is known as masked image modeling (MIM). Inspired by BERT, @dosovitskiyimage exploit the vision transformer architecture by masking out patch tokens and replacing them with learned mask tokens. They then teach their model to predict pixel values directly, but find that this pre-training strategy is significantly less effective than supervised pre-training.

@bao2021beit note that applying the BERT strategy directly to images is difficult because whereas text tokens can only take on a small number of values that can be predicted as a classification problem, image patches can assume considerably more possible values and hence more classes than would be suitable for classification. Instead, the authors cast MIM as a classification problem over a fixed visual vocabulary: they first use an autoencoder to encode image patches as discrete tokens, and then pre-train their transformer to predict the discrete token values of the masked patches. BEiT achieves significantly improved performance on downstream image classification and semantic segmentation over previous supervised and self-supervised baselines, but its training pipeline is complex since it requires a powerful autoencoder for converting image patches to discrete tokens.

In order to streamline MIM pre-training, two concurrent works [@he2022masked; @xie2022simmim] propose simplified algorithms, masked autoencoders (MAE) and SimMIM respectively, which directly reconstruct masked image patches rather than discrete image tokens extracted from an encoder as in BEiT. Moreover, these simplified pre-training strategies achieve superior performance to BEiT on downstream image classification, semantic segmentation, and object detection tasks. Since then, masked image modeling has achieved competitive performance on a wide variety of vision tasks [@zhou2021ibot; @woo2023convnext; @oquab2023dinov2] and even vision-language representation learning [@fang2022eva]. The most successful approaches under frozen-encoder evaluation, iBOT [@zhou2021ibot] and DINOv2 [@oquab2023dinov2], employ a mix of masked image modeling and more classical approaches such as self-distillation. However, their masked image modeling objective reconstructs in latent space, with a teacher network providing the targets instead of the original image.
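To make the masking step concrete, here is a minimal numpy sketch of MAE-style random masking (the default 75% ratio follows @he2022masked; names and structure are illustrative, not the reference implementation):

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patch tokens; only these are fed to the encoder.

    patches: (num_patches, dim) array of patch embeddings for one image.
    Returns (visible_patches, visible_idx, masked_idx)."""
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    perm = rng.permutation(n)
    visible_idx = np.sort(perm[:n_keep])
    masked_idx = np.sort(perm[n_keep:])
    return patches[visible_idx], visible_idx, masked_idx
```

The decoder then receives the encoded visible tokens together with learned mask tokens at `masked_idx`, and the reconstruction loss is computed only over the masked positions.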

Consider that MIM is fundamentally a generative modeling task. Such models are trained to generate missing image parts conditional on the observed ones. Note that BEiT, MAE, and SimMIM are deployed on downstream prediction problems by removing the decoder and replacing it with a prediction head. However, masked image models can also achieve strong generative modeling [@chang2022maskgit], including text-conditional generation [@chang2023muse]. Compared to autoregressive models for image generation [@yuscaling] which generate patches sequentially, MIM-based generative models are significantly more efficient, since they can generate patches in parallel.

In `\Cref{subsec:techniques_masked}`{=latex}, we will discuss various techniques harnessed by state-of-the-art masked image modeling systems to achieve such competitive performance.

```{=latex}
\centering
```
```{=latex}
\begin{tikzpicture}
    \SummaryCard{7}{A Brief History of Masked Image Modeling}{\begin{minipage}{0.95\textwidth}
    \begin{itemize}
    \item \citet{pathak2016context} implement a masked pre-training strategy where large portions of an image are replaced with white pixels and inpainted by an encoder-decoder model.
    \item \citet{devlin2019bert} propose the masked language modeling SSL task.  BERT achieves state-of-the-art performance on a variety of downstream language problems.
    \item \citet{dosovitskiyimage} adapt the BERT pre-training strategy for the vision transformer architecture.
    \item \citet{bao2021beit} propose BEiT which replaces the pixelwise reconstruction loss by predicting discrete visual tokens extracted by a discrete VAE encoder.
    \item \citet{he2022masked} simplify BEiT by removing the VAE encoder in favor of the pixelwise reconstruction loss, but tune the pipeline for superior performance.  Masked autoencoders (MAE) achieve state-of-the-art ImageNet 1k performance among competitors that don’t use extra data.
    \item SimMIM \citep{xie2022simmim} concurrently simplifies masked autoencoding in a similar fashion, achieving similar image classification performance along with state-of-the-art object detection, action recognition, and semantic segmentation results.
    \item Muse reaches state-of-the-art text conditional image generation with a masked transformer approach \citep{chang2023muse}.
    \end{itemize}
    \end{minipage}
    }
\end{tikzpicture}
```
A Theoretical Unification Of Self-Supervised Learning
-----------------------------------------------------

### Theoretical Study of SSL

Numerous works have attempted to unify various SSL methods. In @huang2021towards, Barlow Twins' criterion is shown to be linked to an upper bound of a contrastive loss, suggesting a link between contrastive and covariance-based methods. This direction was further pursued in @garrido2022duality, where covariance-based and contrastive criteria are shown to be equivalent up to normalization, by deriving the precise gap between the two approaches. These results were further validated empirically, as methods were shown to exhibit similar performance and representation properties at ImageNet scale (1.2 million samples). The similarities among methods were also studied in @tao2021unigrad, where this unification was tackled through a study of the losses' gradients.

#### Relationship between Contrastive Learning and Other Objectives. {#sec:contrastive}

Initially, InfoNCE was suggested as a variational approximation to the mutual information between two views [@aitchison2023infonce; @wang2020understanding; @oord2018representation]. @li2021self explain the role of InfoNCE in contrastive learning through the lens of the Hilbert-Schmidt Independence Criterion (HSIC), which was used to present a variational lower bound on the mutual information (MI) between different transformations. @tschannen2020mutual show that the performance of InfoNCE cannot be explained in terms of mutual information alone; other factors, such as the feature extractor and the formulation of the mutual information estimator, matter and can lead to drastically different performance [@guo2022tight]. Alternative theories suggest that InfoNCE balances alignment of "positive" examples against uniformity of the overall feature representation [@wang2022understanding], or that (under strong assumptions) it can identify the latent structure in a hypothesized data-generating process, akin to nonlinear ICA [@khemakhem2020variational]. Theorem 1 of @wang2020understanding shows that contrastive learning with an RBF kernel (an expressive map of features into a higher-dimensional space) converges to a uniform distribution on the sphere with matched pairs. @tian2022understandinga shows that contrastive learning with a deep linear network is equivalent to Principal Component Analysis (PCA), and @tian2023understanding further analyzes the role played by nonlinearity in the architecture when trained with a contrastive loss, showing that nonlinearity leads to many local optima that can host diverse patterns in the training data, while linear networks only allow a single dominant pattern to be learned. @hjelm2019learning introduced Deep InfoMax (DIM), which maximizes the mutual information between the input and output of a deep neural network encoder using local features from the input, an idea that was extended to graphs in @veličković2018deep.

```{=latex}
\def\cL{\mathcal{L}}
```
![`\small `{=latex}Problem Setting. **Left**: Data points ($i$-th sample $\vx[i]$ and its augmented version $\vx[i']$, $j$-th sample $\vx[j]$) are sent to networks with weights $\vtheta$, to yield outputs $\vz[i]$, $\vz[i']$ and $\vz[j]$. From the outputs $\vz$, we compute pairwise squared distance $d^2_{ij}$ between $\vz[i]$ and $\vz[j]$ and intra-class squared distance $d^2_i$ between $\vz[i]$ and $\vz[i']$ for contrastive learning with a general family of contrastive loss $\cL_{\phi, \psi}$ (Eqn. `\ref{eq:general-loss}`{=latex}). **Right**: Different existing loss functions correspond to different monotonic functions $\phi$ and $\psi$. Here $[x]_+ := \max(x, 0)$.](figures/ssl_pca_setting2-crop.png "fig:"){#tab:loss-funcs width=".32\\textwidth"} `\hfill `{=latex} `\footnotesize`{=latex} `\setlength`{=latex}`\tabcolsep{2pt}`{=latex}

  Contrastive Loss                                                  $\phi(x)$                  $\psi(x)$
  ----------------------------------------------------------------- -------------------------- -------------------------
  InfoNCE `\tiny`{=latex}[@oord2018representation]                  $\tau\log(\epsilon + x)$   $e^{x/\tau}$
  MINE `\tiny`{=latex}[@belghazi2018mutual]                         $\log(x)$                  $e^x$
  Triplet `\tiny`{=latex}[@schroff2015facenet]                      $x$                        $[x + \epsilon]_+$
  Soft Triplet `\tiny`{=latex}[@tian2020understanding]              $\tau\log(1 + x)$          $e^{x/\tau + \epsilon}$
  N+1 Tuplet `\tiny`{=latex}[@sohn2016improved]                     $\log(1+x)$                $e^x$
  Lifted Structured `\tiny`{=latex}[@oh2016deep]                    $[\log(x)]^2_+$            $e^{x + \epsilon}$
  Modified Triplet `\tiny `{=latex}Eqn. 10 [@coria2020comparison]   $x$                        $\mathrm{sigmoid}(c x)$
  Triplet Contrastive `\tiny `{=latex}Eqn. 2 [@ji2021power]         linear                     linear

#### Unified contrastive losses.

@tian2022understandinga unified contrastive losses as minimizing a general family of loss functions $\cL_{\phi,\psi}$, where $\phi$ and $\psi$ are monotonically increasing and differentiable scalar functions: $$\min_{\vtheta} \cL_{\phi,\psi}(\vtheta) = \sum_{i=1}^N \phi\left(\sum_{j\neq i} \psi(\|\vz_i-\vz_{i'}\|_2^2 - \|\vz_i-\vz_{j}\|_2^2)\right), \label{eq:general-loss}$$ where $\vz_i$ is the representation of sample $i$, with indices $i$ and $j$ running from $1$ to $N$. With different $\phi$ and $\psi$, Eqn. `\ref{eq:general-loss}`{=latex} covers many loss functions (`\Cref{tab:loss-funcs}`{=latex}). In particular, setting $\phi(x) = \tau\log(\epsilon + x)$ and $\psi(x) = \exp(x/\tau)$ gives a generalized version of the InfoNCE loss [@oord2018representation]: $$\!\!\cL_{nce}\!:=\!-\tau \sum_{i=1}^N\log\frac{e^{-\|\vz_i-\vz_{i'}\|_2^2/\tau}}{\epsilon e^{-\|\vz_i-\vz_{i'}\|_2^2/\tau}+\sum_{j\neq i} e^{-\|\vz_i-\vz_{j}\|_2^2/\tau}},$$ where $\epsilon > 0$ is a constant: $\epsilon = 1$ has been used in @He2020MomentumCF [@tian2020contrastive], while $\epsilon = 0$ yields the DCL loss [@yeh2021decoupled], a slight variation of SimCLR [@chen2020simple].
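As a sanity check of this equivalence, the following numpy sketch (illustrative, not from any official codebase) evaluates the general family directly from its definition and verifies numerically that the choice $\phi(x)=\tau\log(\epsilon+x)$, $\psi(x)=e^{x/\tau}$ recovers the generalized InfoNCE loss above:

```python
import numpy as np

def general_contrastive_loss(z, z_pos, phi, psi):
    """L_{phi,psi} = sum_i phi( sum_{j != i} psi(d_i^2 - d_ij^2) ), with
    d_i^2 = ||z_i - z_i'||^2 and d_ij^2 = ||z_i - z_j||^2."""
    n = z.shape[0]
    d_pos = np.sum((z - z_pos) ** 2, axis=1)
    d_pair = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=2)
    loss = 0.0
    for i in range(n):
        inner = sum(psi(d_pos[i] - d_pair[i, j]) for j in range(n) if j != i)
        loss += phi(inner)
    return loss

def info_nce(z, z_pos, tau=0.5, eps=1.0):
    """Generalized InfoNCE written directly from its definition."""
    d_pos = np.sum((z - z_pos) ** 2, axis=1)
    d_pair = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=2)
    loss = 0.0
    for i in range(len(z)):
        num = np.exp(-d_pos[i] / tau)
        den = eps * num + sum(np.exp(-d_pair[i, j] / tau)
                              for j in range(len(z)) if j != i)
        loss += -tau * np.log(num / den)
    return loss
```

Multiplying numerator and denominator of each InfoNCE term by $e^{\|\vz_i-\vz_{i'}\|_2^2/\tau}$ shows the two expressions are identical term by term.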

#### Hard negative sampling.

Negative mining has been thoroughly studied in (deep) metric learning. Recently, some works have focused on putting more weight on hard samples [@robinson2020contrastive]. Yet, @kalantidis2020hard [@tian2022understandinga] showed that contrastive SSL losses with $\psi=e^{x/\tau}$ already have such a mechanism at the batch level, focusing on *hard-negative pairs* without explicit \"hard-negative sampling\". This means that *contrastive losses need large batch sizes to ensure that hard negative samples are observed*, which comes at an additional memory cost.
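This implicit focus is easy to see numerically: with $\psi(x)=e^{x/\tau}$, the contribution of negative $j$ to the loss for anchor $i$ scales as $e^{(d_i^2 - d_{ij}^2)/\tau}$, a softmax-style weighting that concentrates on the closest (hardest) negatives. A small illustrative sketch (function name and values are ours, for demonstration only):

```python
import numpy as np

def negative_weights(d_pos_sq, d_neg_sq, tau=0.1):
    """Normalized weight each negative receives inside psi(x) = exp(x/tau):
    proportional to exp((d_pos^2 - d_neg^2)/tau), so negatives close to the
    anchor ("hard" negatives) dominate the signal."""
    w = np.exp((d_pos_sq - np.asarray(d_neg_sq, dtype=float)) / tau)
    return w / w.sum()

w = negative_weights(0.2, [0.3, 1.0, 2.0], tau=0.1)
# the closest negative captures almost all of the weight
```

With a small batch the closest negative in the batch may still be far from the anchor, which is why large batches help: they make it likely that genuinely hard negatives appear.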

`\label{sec:projector_theory}`{=latex}

#### Study of the projector.

The projector network, first introduced by @chen2020simple, maps the representations into another space where the loss is computed. Despite strong empirical evidence that the projector improves performance, few theoretical works have attempted to explain its role. @jing2022understanding study the role of linear projectors in contrastive learning; they argue that the projector prevents dimensional collapse in the representation space and that it only needs to be diagonal and low-rank to do so. Although their projector-free method outperforms SimCLR with a one-layer linear projector, the performance of 2- and 3-layer MLP projectors remains out of reach. @cosentino2022toward study the interplay of the projector and data augmentations when the augmentations are Lie group transformations, and, like @mialon2022variance, provide an explanation of the effect of the width and depth of the projector. Further empirical investigations of the role of the projector are presented in section `\ref{sec:projector}`{=latex}.

### Dimensional Collapse of Representations

```{=latex}
\centering
```
![Illustration of dimensional collapse before the projector (Left), and after the projector (Right). Methods suffer from different levels of collapse after the projector; while no such collapse occurs for representations before the projector.](./figures/collapse.png){#fig:collapse width="100%"}

While the goal of joint self-supervised methods is to learn meaningful representations, a significant fraction of approaches suffer from what is called *dimensional collapse*. Dimensional collapse occurs when the information encoded across different dimensions of the representation is redundant. In other words, the embeddings at the output of the projector are rank-deficient, which can be estimated via the singular value spectrum of the embeddings, as illustrated in `\Cref{fig:collapse}`{=latex}.

This phenomenon was first illustrated by [@hua2021feature] where the use of a whitening batch normalization helped alleviate collapse. Dimensional collapse was also studied from a theoretical point of view by [@jing2022understanding] with a focus on contrastive methods. Several following works linked dimensional collapse to an impact on performance [@he2022exploring; @ghosh2022investigating; @li2022understanding; @garrido2022rankme]. Some works focused on unsupervised evaluation [@ghosh2022investigating; @garrido2022rankme] where dimensional collapse was found to be a good proxy for downstream performance.\
Different measures of dimensional collapse have been introduced such as the entropy of the singular value distribution [@garrido2022rankme], the classical rank estimator [@jing2022understanding], fitting a power law to the singular value distribution [@ghosh2022investigating] or the AUC of the singular value distribution [@li2022understanding]. Nonetheless, all of these measures focus on evaluating the rank of the representations to measure dimensional collapse in the learned representations.
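As an example, the entropy-based measure of [@garrido2022rankme] can be sketched in a few lines of numpy (an illustrative reimplementation from the paper's description, not the authors' code):

```python
import numpy as np

def soft_rank(embeddings, eps=1e-7):
    """RankMe-style soft rank of an (N, D) embedding matrix: the exponential
    of the entropy of the normalized singular value distribution.  It is close
    to D for full-rank embeddings and close to 1 under severe dimensional
    collapse, without requiring any labels."""
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / s.sum() + eps  # normalize singular values into a distribution
    return float(np.exp(-np.sum(p * np.log(p))))
```

Because it needs only the embeddings, such a measure can be tracked during pretraining as an unsupervised proxy for downstream performance.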

Pretraining Data
----------------

#### Curated (standard)

: The most common practice is to pretrain SSL models on curated datasets such as ImageNet and alternatives such as PASS [@asano2021pass]. These datasets tend to be class-balanced and contain object-centric images, where the object is prominently featured, often in the center of the photo.

#### Training with data from the wild

: Even though ImageNet has been the dataset of choice for pretraining, it is definitely not the only option. Its simplicity (object-centric, single object, balanced classes) makes it a very good playground, but most datasets in the wild are not as clean. If we want to leverage large uncurated datasets, SSL methods need to translate well outside of ImageNet. To this effect, some works have explored pretraining on large uncurated datasets [@goyal2021self], or on datasets that differ from ImageNet such as COCO [@el2021large] or iNaturalist [@uniform_prior]. While these works have shown promising results, ImageNet (or similarly curated dataset) pretraining has remained the norm.

To provide further insights, we pretrained methods on Places205 [@zhou2014places] and iNaturalist18 [@vanhorni2018naturalist] without changing the augmentation strategy but heavily tuning loss-related coefficients. The goal is to see whether the setups used on ImageNet transfer well to other datasets. Places205 has the advantage of not being object-centric, and iNaturalist18 of having a power-law distribution of classes as well as requiring a lot of fine-grained information. We report our results in `\cref{tab:inat-pre}`{=latex}. As we can see, most methods achieve similar performance whether pretraining on ImageNet or on the target dataset. This suggests that the protocol developed on ImageNet transfers decently, since we noticed that hyperparameters that were optimal on ImageNet also tended to be optimal on the other datasets. There is one visible exception: SimCLR and MSN perform poorly on iNaturalist18 when pretraining on it directly. While precise conclusions cannot be drawn here, this suggests that certain methods are more sensitive to the pretraining dataset than others.

```{=latex}
\centering
```
```{=latex}
\resizebox{0.75\linewidth}{!}{
  \begin{tabular}{lccccccc}
    \toprule
    Target Dataset  & \multicolumn{4}{c}{iNaturalist18} &  \multicolumn{3}{c}{Places205} \\ 
 \cmidrule(lr){2-5} \cmidrule(lr){6-8}  
 Method & VICReg & SimCLR & DINO & MSN & VICReg & SimCLR & DINO\\ 
  \midrule 
 ImageNet pretraining & 38.8 & 39.2 & 46.3 & 40.5 & 52.6 & 51.8 & 54.4\\
 Target dataset pretraining & 37.0 & 28.6 & 41.9 & 29.1 & 53.4 & 51.6 & 57.2 \\
    \bottomrule
  \end{tabular}
  }
```
`\label{sec:weakly-curated-data}`{=latex}

#### Weakly-curated training data

: A successful approach to leveraging large uncurated datasets is to perform retrieval in them based on curated data. This means that the resulting dataset will contain images similar to a curated or smaller source dataset such as ImageNet, while being much larger and more diverse. This strategy was used in DINOv2 [@oquab2023dinov2], where LVD-142M was built using a wide variety of small and domain-specific datasets. While this does not lead to large performance boosts in classification on ImageNet, it can lead to significant boosts in performance on other tasks such as image retrieval.

A Cook's Guide to Successful SSL Training and Deployment {#sec:practical_matters}
========================================================

Role of Data-Augmentation {#sec:DA}
-------------------------

Many SSL methods, especially joint embedding methods derived from @chen2020simple, require a way to define positive views from a given image in order to learn invariances. The proxy used in these SSL methods is to leverage data augmentation to define these invariances. For example, by using different crops of a given image as positive views, the SSL model will be trained to produce a representation that is invariant to these crops. When using a grayscale or color jitter operation to generate positive views, the representation will have to be invariant to color information. Thus, what SSL models learn is in large part defined by the data augmentation pipeline. It is worth noting that perfect invariance is not achieved, thanks to the projector [@bordes2022guillotine], which helps improve performance on tasks that are not entirely invariant. @chen2020simple study how much influence specific data augmentations have on SimCLR's performance over ImageNet. They show that simpler data augmentations such as adding noise are not beneficial for downstream ImageNet classification; instead, cropping combined with color jittering operations leads to results competitive with a supervised baseline. This data-augmentation recipe has been largely reused in subsequent SSL works [@chen2020improved; @bardes2021vicreg; @zbontar2021barlow] without significant changes. The only variant that is sometimes used is adding smaller crops in addition to the bigger crops when learning invariances. We discuss this use of big and smaller crops, called multi-crop, in the coming subsections.
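As a toy illustration (not a full SimCLR pipeline, which also includes flips, color jitter, and blur), two positive views can be generated from one image with random crops and a grayscale conversion:

```python
import numpy as np

def two_views(img, crop=96, rng=None):
    """Generate two positive views of an (H, W, 3) image: two independent
    random crops, the second converted to grayscale.  The SSL loss will push
    their representations together, encouraging crop and color invariance."""
    rng = rng or np.random.default_rng()
    h, w, _ = img.shape

    def rand_crop():
        y = int(rng.integers(0, h - crop + 1))
        x = int(rng.integers(0, w - crop + 1))
        return img[y:y + crop, x:x + crop].astype(float)

    v1 = rand_crop()
    v2 = rand_crop()
    gray = v2 @ np.array([0.299, 0.587, 0.114])  # standard luminance weights
    v2 = np.repeat(gray[..., None], 3, axis=2)   # keep a 3-channel layout
    return v1, v2
```

Whatever transformation is applied here, the model is implicitly told to discard: with this pipeline, the learned representation is encouraged to ignore position and color.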

However, this specific combination of data augmentations was specifically designed to reach good performance on ImageNet. @demo study the impact of different choices of data augmentation on different downstream tasks and find that even if the addition of ColorJitter seems beneficial for many classification tasks, it might not always be the case for other downstream tasks. Similarly, @ericsson2021selfsupervised show that different augmentations lead to learning different types of invariances, some of which are better suited to certain downstream tasks than others. The authors suggest merging representations learned with different augmentations to improve transferability across a wider range of downstream tasks. There is also a hidden cost when using a complex data augmentation pipeline: the data preprocessing time, which might significantly slow down training. Thus, when the training budget matters, it might be preferable to use only random cropping along with a grayscale operation when training an SSL model. We discuss common approaches for speeding up the training pipeline in `\Cref{sec:ffcv}`{=latex}. @ni2021close further show that contrastive learners can benefit from very aggressive data augmentations such as large rotations when explicitly trained not to be invariant to them, as in meta-learning [@ni2021data].

Another line of work attempts to remove the need for these handcrafted data augmentations. One approach is to use a reconstruction-based objective such as MAE [@he2022masked], which relies on a reconstruction loss in pixel space and avoids the need for defining precise invariances. Another approach is based on a joint embedding where, given random parts of an image, the goal is to predict the representations of the missing parts in representation space. Examples of such methods are I-JEPA [@assran2023selfsupervised] and Data2Vec 2.0 [@baevski2022efficient], which use a context part of an image to predict small missing parts of the image. Another line of work tries to retain style information about the augmentations, to improve downstream performance on tasks requiring style information such as color, by predicting style information [@xiao2020should; @dangovski2021equivariant; @gidaris2018unsupervised; @scherr2022selfsupervised]. Encoding true equivariance to augmentations (which requires a mapping between embeddings) is an active line of work, with approaches such as EquiMod [@dangovski2021equivariant], SEN [@park2022learning], or [@marchetti2022equivariant], which also aims at splitting the representations into class and pose. This idea of splitting representations into invariant and equivariant parts was also explored in SIE [@garrido2023sie] and using a Lie group formalism in [@ibrahim2022robust].

### Role of multi-crop

While works such as MoCo [@He2020MomentumCF] focused on increasing the number or quality of negative pairs, another direction to improve performance is to increase the number of positives for a given image. Multi-crop, which was introduced with SwAV [@caron2020unsupervised], tackles this problem by introducing smaller crops ($96\times 96$) on top of the usual two large ones ($224\times 224$). Instead of only comparing the two large crops together, or all pairs of crops, the two large crops are each compared to all other crops (big or small). As such, if we have 2 large crops and $N$ small crops, the invariance loss is computed $2(N+1)$ times, increasing the positive-pair related signal. The use of smaller crops, as well as not comparing all pairs of crops, helps reduce the computational cost of these additional crops. While the number of additional crops can vary (10 in Mugs [@zhou2022mugs] compared to 6 in SwAV), it always leads to an increase in training time and memory usage if used as is. To mitigate this cost, SwAV uses two $160\times 160$ large crops and four $96\times 96$ small crops, limiting the memory overhead and increasing training time by only $25\%$ compared to the classical setting using two crops of size $224\times 224$, while yielding a 4 point performance boost. As such, multi-crop is a very useful strategy to boost performance for a marginal additional compute cost. It has thus become almost ubiquitous in recent works [@caron2021emerging; @zhou2021ibot; @zhou2022mugs; @bardes2022vicregl; @oquab2023dinov2]. It is worth pointing out that some works have noticed only minor gains [@wang2021solving], where it led to a 0.3 point performance increase.
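Under this pairing convention, where each large crop is compared against every other crop, the loss terms are easy to enumerate; a tiny sketch (crop labels are ours, purely for illustration):

```python
def multicrop_pairs(n_large=2, n_small=6):
    """Enumerate the (anchor, other) comparisons in SwAV-style multi-crop:
    each of the large crops is compared to every other crop (large or small),
    while small crops are never compared to each other."""
    crops = [f"L{i}" for i in range(n_large)] + [f"S{i}" for i in range(n_small)]
    return [(a, b) for a in crops[:n_large] for b in crops if b != a]
```

With 2 large crops and 6 small ones this yields 14 invariance terms, versus only 2 when using the two large crops alone, which is where the extra positive-pair signal comes from.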

Other approaches have emerged to avoid the computational burden of feeding additional crops to the encoder by using nearest neighbors in embedding space. While in NNCLR [@dwibedi2021little] the matched positive crop is replaced by its nearest neighbor in latent space, in MSF [@koohpayegani2021mean] a $k$-NN graph is built in embedding space to provide an effect similar to multi-crop and increase the positive-pair related signal. This strategy was further employed in UniVCL [@tang2022unifying], which combines augmentation strategies such as edge or node masking with a $k$-NN graph in latent space. All of these approaches show significant performance boosts at a smaller computational cost than multi-crop; in MSF, the use of the $k$-NN graph only increases training time by 6%.

Role of the Projector {#sec:projector}
---------------------

Most joint embedding SSL methods include a projector (usually a 2- or 3-layer MLP with ReLU) after the encoder. The SSL loss is applied to the projector's output, and the projector is usually discarded after training. This crucial component was introduced in SimCLR [@chen2020simple] and, although not responsible for avoiding collapse, allows significant top-1 accuracy gains on ImageNet. For example, in a 100-epoch training, the projector adds around $20 \%$ of top-1 accuracy in SimCLR and VICReg (from around $50 \%$ to $68 \%$ and from $48 \%$ to $68 \%$ respectively).\
@bordes2022guillotine show that adding a projector is not only useful for SSL but is also highly beneficial in a supervised training setting when there is a misalignment between the training and downstream tasks (which was also demonstrated by @sariyildiz2022improving). In fact, it is well known from @features_transfert that cutting layers off a trained deep neural network is beneficial for transfer learning, mostly to avoid the training task's overfitting bias. Through the lens of transfer learning, it becomes easy to understand why a projector is needed in SSL, since the training task is always different from the downstream task. To bridge the gap between the terms used in the SSL and transfer learning literatures, @bordes2022guillotine suggested coining the method of probing intermediate representations or cutting layers as *Guillotine Regularization* (GR). They also highlight how crucial it is to dissociate GR from the addition of a projector in SSL, because the optimal layer at which one should probe the representation might not always be the backbone (it could be an intermediate projector layer, as demonstrated in @chen2020big). Lastly, @bordes2022guillotine demonstrated that reducing the misalignment between the pretext and downstream tasks (by using class labels to find the positive pairs in contrastive learning) leads to learning a network for which the best linear probe performance on ImageNet is obtained at the last projector layer (instead of the backbone), as shown in Figure `\ref{fig:diff_projector_backbone}`{=latex}.
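A minimal numpy forward pass for such a projector is sketched below. The widths are kept small here for illustration (methods like VICReg use hidden widths up to 8192), and the random weights stand in for trained ones:

```python
import numpy as np

class MLPProjector:
    """Sketch of the usual SSL projector: an MLP with ReLU hidden layers.

    The SSL loss is computed on its output during pretraining; downstream
    tasks then discard it and probe the backbone representation (or, per
    Guillotine Regularization, whichever intermediate layer probes best)."""

    def __init__(self, dims=(128, 512, 512, 256), rng=None):
        rng = rng or np.random.default_rng(0)
        # He initialization, suitable for the ReLU nonlinearities.
        self.weights = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))
                        for m, n in zip(dims[:-1], dims[1:])]

    def __call__(self, x):
        for i, w in enumerate(self.weights):
            x = x @ w
            if i < len(self.weights) - 1:  # ReLU on hidden layers only
                x = np.maximum(x, 0.0)
        return x

proj = MLPProjector()
z = proj(np.ones((4, 128)))  # backbone features -> 256-dim embeddings for the loss
```

Because each layer of this stack specializes a bit more toward the pretext task, probing earlier layers (the backbone, or the first projector layer) is what recovers transferable features.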

```{=latex}
\centering
```
![ Figure from [@bordes2022guillotine] showing the accuracy difference between the backbone and projector representations across several downstream tasks. When using traditional SSL positive pairs (in blue), the backbone accuracy is always much higher than the projector accuracy. However, when using class label information to define the positive pairs (in green), thus reducing the misalignment between the pretext and downstream tasks, the projector representation leads to higher accuracy than the backbone representation on ImageNet.](./figures/diff_projector_acc.png){#fig:diff_projector_backbone width="90%"}

```{=latex}
\centering
```
::: {#tab:projector_oracle}
   Projector   Oracle  Top-1           Top-5
  ----------- -------- --------------- -------
       ✗         ✗     50.1            75.8
       ✗         ✓     56.4$^{+6.3}$   80.2
       ✓         ✗     68.9            88.2
       ✓         ✓     69.5$^{+0.6}$   88.8

  : The projector may handle noise that originates from random data augmentations. Training VICReg without a projector can benefit from filtering semantically inconsistent augmented views using an oracle. With a projector, using an oracle provides only minor gains. Top-1 and Top-5 correspond to linear probing performance on IN-1k.
:::

#### Using a projector to handle noisy image augmentations.

The projector may also be necessary to mitigate the noise of data augmentation. As described in `\Cref{sec:DA}`{=latex}, SSL methods typically randomly augment input images to generate two different views of the same image. In some cases, enforcing invariance over two very different views might be a very strong constraint that could harm performance, for instance when the content of the two views differs. To demonstrate how the projector can mitigate this, we pretrain VICReg [@bardes2021vicreg] with and without a projector using image augmentations that are semantically similar according to an \`\`oracle", e.g., a ResNet-50 pretrained on ImageNet with full supervision. We pretrain for $100$ epochs and include the linear probing results of these experiments in `\Cref{tab:projector_oracle}`{=latex}. Without a projector, using an oracle improves Top-1 performance by $6.3\%$ compared to not using one. However, equipped with a projector, using an oracle to remove noisy views only boosts Top-1 performance by $0.6\%$. This might imply that the projector has a role in handling inconsistent or noisy augmented views during SSL training.

#### Influence of the projector's output dimension.

Similarly to how large batch sizes were seen as a requirement for contrastive methods, a large projector output dimension was seen as a requirement for covariance-based methods. This is illustrated by Figure 4 in [@zbontar2021barlow] and Table 12 in [@bardes2021vicreg], where drops of up to $15\%$ in ImageNet top-1 accuracy can be observed. As pointed out in [@garrido2022duality], this was due to the projector's intermediate layers scaling with the output dimension, as well as to loss weights that needed to be rescaled. By tuning these parameters, VICReg's top-1 accuracy increases from $55.9\%$ to $65.1\%$ with 256-dimensional embeddings. Peak performance is achieved at 1024 dimensions and plateaus afterwards. While VICReg remains more sensitive to the projector's output dimension than SimCLR, it is significantly more robust than originally thought, and very large output dimensions are not a requirement. Comparable results should be achievable for Barlow Twins due to the similarities between the two methods.

```{=latex}
\centering
```
![Impact of different projector architectures and output dimensions on popular methods. $x$-$y$-$z$ denotes an MLP with layers of output dimension $x$, $y$ and $z$ respectively. From [@garrido2022duality].](./figures/projector_architecture.png){#fig:projector_architecture width="100%"}

#### Influence of the backbone's output dimension.

Recent works have also investigated the effect of the backbone dimension. @dubois_improving_2022 observed that larger backbone representations lead to better linear probe performance when using CISSL. @bordes2023surprisingly investigated more deeply the impact of the backbone dimension across common SSL methods like VICReg, SimCLR, and BYOL. They show that traditional supervised methods decline in performance when the dimension of the backbone is increased, whereas SSL methods benefit greatly from wider backbone representations, as shown in `\Cref{fig:sup_vs_ssl_backbone}`{=latex}. In fact, when training a ResNet with SSL it is much more beneficial to increase the backbone's output dimension than to increase the width or depth of the ResNet, as illustrated in `\Cref{fig:params_vs_acc}`{=latex}. This observation highlights that the current architectures used in SSL, which are often the same as those used in supervised training, might not be optimal.

```{=latex}
\centering
```
```{=latex}
\centering
```
![](figures/supp_acc_small.png){#fig:sup_vs_ssl_backbone}

```{=latex}
\centering
```
![](figures/params_vs_acc.png){#fig:params_vs_acc}

#### Properties of the representation induced by the projector.

@mialon2022variance argue that the projector enforces pairwise independence of the features in the representation and provide a demonstration for random projectors in the context of VICReg, Barlow Twins and W-MSE [@bardes2021vicreg; @zbontar2021barlow; @ermolov2021whitening]. In particular, higher degrees of independence are reached with wider projectors. Pairwise independence, or a soft notion thereof, can be more appropriate than mutual independence for learning unsupervised representations from "real world" datasets such as ImageNet [@li2019learning]. Conversely, if mutual independence is sought, SSL regularizers other than VCReg are needed. The optimization dynamics resulting from applying VCReg (the anti-collapse term in VICReg) at the projector's output are also worth noting: minimizing VCReg with respect to the projector parameters is not necessary; rather, VCReg is optimized with respect to the encoder parameters. Whether this analysis fully extends to other SSL methods is an open question.

#### Training an SSL model without a projector.

@jing2022understanding propose DirectCLR, which shows that applying the InfoNCE SimCLR objective directly on sub-vectors of the representation, without a trainable projector, is sufficient to outperform SimCLR with a linear projector in terms of ImageNet top-1 accuracy.

The Uniform Prior in SSL or the Failure of SSL on Unbalanced Data
-----------------------------------------------------------------

Despite their recent successes, SSL methods have an important limitation: poor performance on unbalanced datasets. Since real-world data is imbalanced, this limitation has made the use of SSL methods on vast amounts of uncurated data challenging. @uniform_prior explain this limitation by a hidden uniform prior that is common to many SSL methods. By distributing the data uniformly in the representation space, SSL methods learn to find the most discriminative features in a given mini-batch. When data is uniformly distributed across class labels, the most discriminative features the model learns will be class-specific. However, with imbalanced data, the most discriminative features inside the mini-batch might no longer be the class but rather lower-level information, which decreases performance on downstream classification tasks. To alleviate this issue, @uniform_prior introduce an additional regularization term on top of the SSL method MSN [@msn] to change the distribution of the SSL clustering.

Teacher-Student Architecture Specific Tricks
--------------------------------------------

### Role of the Moving Average Teacher

While the original BYOL method is based on exponential moving average (EMA) updates of the weights of the target (teacher) network, it was later confirmed that EMA is not necessary (i.e., the online and target networks can be identical). This is also confirmed with SimSiam [@chen2021exploring], as long as the predictor is updated more often or has a larger learning rate than the backbone. In the case of DQN, the target network with EMA is shown to remove bias [@fan2020theoretical], and @piche2021beyond showed that the EMA could be removed from the target network by using the correct regularizer. For BYOL, a stop gradient of the online network, meaning a decay rate of 0 for the target network, leads to collapse as shown in Table 5 of [@grill2020bootstrap]. @pham2022pros show that exponential moving averages provide training stability that can even be exploited in non teacher-student frameworks such as SimCLR. Specifically, they show that applying EMA updates to the projector of SimCLR can boost performance. @wang2022importance show that training can also benefit from other kinds of asymmetries in the teacher-student setting (e.g., stronger augmentation on the student side).

### Role of the Predictor in Self-Labeling SSL {#sec:predictor}

The predictor network plays a central role in BYOL's success by predicting the representation of the teacher network from the student network's representation. @shi2020run show that removing the predictor leads to a performance drop from 68% to 21% top-1 accuracy on ImageNet (compared to the original two-layer MLP predictor in BYOL). In Figure 1 of @shi2020run, they demonstrate that even a linear predictor leads to good performance and can recover from poor initialization in 10-20 epochs of training. For SimSiam, Table 1 of @chen2021exploring shows that removing the predictor also leads to collapse, with a top-1 accuracy of \< 1% on ImageNet. @tian2021understanding, whose implementation can be found online[^2], prove that in the presence of the predictor, the training dynamics of BYOL and SimSiam contain nontrivial stable fixed points, and thus avoid being trapped in trivial solutions during training, even when these trivial solutions are globally optimal. They further propose DirectPred, a method that directly sets the predictor via eigenvalue decomposition during training and leads to comparable performance on ImageNet. Its follow-up work (DirectSet [@wang2021towards]) further removes the overhead of the eigenvalue decomposition.

Role of Standard Hyper-Parameters
---------------------------------

A common issue in SSL research is that each method has a different configuration of hyper-parameters. Hence, directly comparing different SSL methods or models is often challenging. In this section, we describe the impact of each hyper-parameter to help SSL practitioners identify which are most important depending on their setup.

```{=latex}
\begin{tikzpicture}
    \Quote{When debugging don’t trust the value of the loss, and first play with loss hyper-parameters (not DA/optimizer)}{Adrien, Quentin}
\end{tikzpicture}
```
### Role of Mini-Batch Size

It was originally thought that contrastive methods such as SimCLR or MoCo require large batch sizes or memory banks to work. This turns out to be misleading, as both methods can be made to work at small batch sizes. A square-root scaling of the learning rate was discussed in the appendix of [@chen2020simple], which already gave a significant increase in performance of up to 5 points in top-1 accuracy on ImageNet for a 100-epoch training. Similarly, @demo investigated the impact of the learning rate with small batch sizes and showed how one can train SimCLR on ImageNet using a single GPU without a significant drop in performance. Furthermore, some works such as DCL [@yeh2021decoupled] show that top performance can be reached with a batch size of only 256 for SimCLR, and a queue size of only 256 for MoCo, by simply removing the positive pair from the denominator of the softmax and with more careful hyperparameter tuning. Similarly, [@zhang2022dual] showed that by decomposing the dictionary in MoCo and using different temperatures for the positive and negative pairs, it is possible to increase robustness to the dictionary size.

```{=latex}
\begin{tikzpicture}
    \Quote{Set batch size to maximum that fits on your GPU}{Adrien}
\end{tikzpicture}
```
### Role of Learning Rate (Schedulers) and Optimizers

Here we overview typical settings for learning rate schedulers and optimizers across methods. To determine the learning rate, methods often scale a base learning rate according to the batch size, following the heuristic of [@goyal2017accurate]: learning rate = $\frac{\text{batch size}}{256} \times \text{base learning rate}$. For ImageNet pretraining, VICReg, Barlow Twins, BYOL, and SimCLR use a base learning rate of $0.2$-$0.3$ with the LARS optimizer [@you2017large]. Additionally, some methods such as Barlow Twins use a much smaller learning rate ($0.0048$) to update the bias terms and batch norm parameters. Other methods such as MAE, DINO, and iBOT use the AdamW optimizer [@loshchilov2017decoupled] with a smaller base learning rate of $10^{-5}$ to $5\times 10^{-4}$. For a discussion of weight decay see `\Cref{subsec:weightdecay}`{=latex}. The most common training schedule involves a warmup period, usually 10 epochs, during which the learning rate is linearly increased to its base value. After the warmup period, most methods use cosine decay.
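The scaling heuristic and warmup/cosine schedule above can be sketched in a few lines. This is a minimal illustration, not any particular method's implementation; the `base_lr=0.3` default and the reference batch size of 256 are example values in the range quoted above.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, batch_size, base_lr=0.3):
    """Linearly-scaled peak LR with linear warmup, then cosine decay.

    Illustrative sketch: base_lr and the reference batch size of 256
    follow the linear scaling heuristic discussed in the text.
    """
    peak_lr = base_lr * batch_size / 256  # linear scaling rule
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup to peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

For instance, with a batch size of 1024 the peak learning rate becomes $0.3 \times 1024 / 256 = 1.2$, reached at the end of warmup, and the rate decays to zero at the final step.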

```{=latex}
\begin{tikzpicture}
    \Quote{AdamW/LARS with the standard linear warmup/cosine annealing learning rate schedule is a safe choice}{Adrien, Quentin}
\end{tikzpicture}
```
### Role of Weight-Decay {#subsec:weightdecay}

Weight decay is an important component of the training recipe of many SSL methods. Table 15 in BYOL [@grill2020bootstrap] indicates that removing weight decay may lead to unstable results. A recent blog post[^3] also mentions that using weight decay leads to stable learning in BYOL. In Figure 4 of @tian2020understanding, the effect of weight decay is explained in terms of its effect on the memory of the initial conditions. The hypothesis is that weight decay allows the online network and predictor to better model invariance to augmentations regardless of the initial conditions. For further reading, @zhang2022does provides a good review of our understanding of collapse in SimSiam, and @shi2020run does the same for BYOL.

### Vision Transformers Considerations

Training Vision Transformers (ViT) [@dosovitskiyimage] requires special care. They are more prone to collapse and instability, and are more sensitive to the setting of hyper-parameters [@touvron2021training].

**Batch size.** [@chen2021empirical] found that large-batch (e.g., 4096) training for joint-embedding ViT SSL methods can be unstable. This instability does not manifest as a large drop in final accuracy, but appears as dips in kNN probe accuracy during training when the $L_\infty$-norm of the gradient spikes. Using a random (versus a learned) patch projection layer to embed pixel patches into input tokens stabilizes training for MoCo-V3, SimCLR, and BYOL, and also improves the final accuracy. A learning rate warm-up period of 10k iterations [@goyal2017accurate; @dosovitskiyimage] also improves training stability. On the other hand, [@caron2021emerging] noted a drop in final k-NN accuracy when training with very small batch sizes (128). A batch size of 1024 or 2048 therefore seems to be the sweet spot for SSL pre-training of ViTs.

While the ViT architecture does not have any BatchNorm layers, training a MoCo-V3 model with BN layers in the projector heads improved the linear probing accuracy of the ViT [@chen2021empirical]. Note that for joint-embedding methods, batch statistics can be computed either jointly over all samples and crops in one batch, or separately for each batch of crops. SimCLR adopts the former, while BYOL and MoCo-V3 adopt the latter.

**Patch size.** [@caron2021emerging] found that training with smaller patch sizes ($5\times5$ or $8\times8$ instead of $16\times16$) leads to improved linear probing accuracy in DINO ViT pre-training. Note that while decreasing the patch size improves accuracy without adding parameters, it comes at the cost of running time and memory usage (which makes it hard to train on patches smaller than $8\times8$).

**Stochastic depth** [@huang2016deep] was introduced to train deeper models and was subsequently adopted in vision transformers [@touvron2021going]. It randomly drops residual blocks of the ViT as a regularization. The per-block drop rate may increase linearly with depth or be uniform, as suggested in recent works [@touvron2021going]. It is crucial when training larger models (ViT-L, ViT-H, etc.). For instance, [@touvron2022deit] use a $0.5$ drop-path rate for ViT-H models. Conversely, when training smaller models like ViT-B, such regularization usually hurts performance [@steiner2021train].
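The two drop-rate schedules mentioned above (linearly increasing with depth, or uniform) can be sketched as follows; the function name and signature are illustrative, not taken from any specific codebase.

```python
def drop_path_rates(depth, final_rate, linear=True):
    """Per-block stochastic depth (drop-path) rates for a ViT with
    `depth` residual blocks. Illustrative sketch: with `linear=True`
    the rate grows from 0 at the first block to `final_rate` at the
    last; otherwise the same rate is applied to every block."""
    if linear:
        return [final_rate * i / (depth - 1) for i in range(depth)]
    return [final_rate] * depth
```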

**LayerDecay** [@clark2020electra] decreases the learning rate geometrically across layers. Put differently, the last layer is not affected, while the first has a very small learning rate. In SSL vision models, LayerDecay increases performance when fine-tuning on downstream tasks [@beit; @zhou2021ibot; @he2022masked]. Depending on the model size, the parameter is set between $0.65$ and $0.85$ -- larger models usually need higher values because they have more layers. The underlying principle is that SSL builds strong model backbones, so we mainly need to fine-tune the last layers.
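The geometric scheme above can be sketched in one line of Python; the helper name is hypothetical and the indexing convention (layer 0 closest to the input) is one common choice.

```python
def layerwise_lrs(base_lr, num_layers, decay=0.75):
    """Geometric layer-wise LR decay (sketch). Layer i, with 0 the
    layer closest to the input, gets base_lr * decay ** (num_layers - i):
    the last layer keeps base_lr, earlier layers get exponentially
    smaller learning rates."""
    return [base_lr * decay ** (num_layers - i) for i in range(num_layers + 1)]
```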

**LayerScale** [@touvron2021going] is a per-channel multiplication of the vector produced by each residual block of the transformer. It increases the stability of the optimization and permits deeper ViT (larger than ViT-B).

**`[cls]` token.** When it is not explicitly needed by the method, using the average of the patch tokens instead of the class token saves memory without much change in the accuracy of the network [@zhai2022scaling].

Techniques for High Performance Masked Image Modeling {#subsec:techniques_masked}
-----------------------------------------------------

While there are several approaches to masked pretraining, the state-of-the-art systems that employ them tend to pair MIM with other techniques. For example, the ConvNextV2 architecture, which was state of the art on ImageNet (for models trained with only public data) when released, employs MAE pretraining [@woo2023convnext]. Interestingly, the authors point out that simply pretraining a ConvNextV2 with the MAE framework is subpar. They propose adding a novel normalization layer, called \`\`global response normalization," that proves vital to reaching state-of-the-art results [@woo2023convnext].

In other works that claim state-of-the-art performance on image classification and semantic segmentation, MIM pretraining is paired with distillation. While some MIM routines involve reconstructing the masked portion of the input in pixel space, another option is to use a teacher network to generate target representations of the unmasked image. @zhou2021ibot propose iBOT, which uses ViTs for both the teacher and the student in distillation-based MIM and outperforms prior methods on ImageNet classification. Subsequently, @liu2022exploring propose dBOT, an updated distillation-based MIM approach which also achieves state-of-the-art results on image classification and semantic segmentation. A major finding of their work is that the teacher model does not have to be chosen carefully if the distillation is done in stages, where the teacher is periodically updated to match the student's weights and the student is reinitialized. @oquab2023dinov2 employ similar distillations to train smaller models from a ViT-g teacher with much better performance than training from scratch. This line of work highlights that pairing distillation with MIM is extremely effective.

For object detectors that utilize MIM to outperform prior work, techniques that allow MIM to work with recent, high-performing pyramid ViTs like Swin are critical. Since pyramid ViTs collapse patches, random masking can leave some local windows with no information. @li2022uniform propose an approach to masking that accounts for the hierarchical structure of these models, called \`\`uniform masking." It constrains the masking to hide equal amounts of information in each local window, ensuring that each window has some information intact. This technique helps self-supervised models (trained on ImageNet1K) outperform supervised models (even those trained on ImageNet22K) on object detection benchmarks [@li2022uniform].

Evaluating Your SSL Models
--------------------------

### Evaluation with labels

Self-supervised pre-training is mainly evaluated on image classification, since it has been at the core of computer vision for decades. The three most common protocols are $k$-nearest neighbors (KNN), linear, and full fine-tuning evaluations (ranked by order of complexity). These are offline evaluations, meaning that they are done independently of the self-supervised training procedure, as opposed to online evaluations, which are performed during training. While online evaluation can provide a useful signal of downstream performance, it can be misleading because it is optimized alongside the varying self-supervised learning objective. In addition to these procedures, which require labels for the downstream task, RankMe [@garrido2022rankme] has recently appeared as a viable alternative to costly evaluations, and is used as an oracle for final accuracy without requiring any training.

#### KNN

is one of the best-known algorithms in machine learning and has been used extensively across fields. For image classification, a KNN classifier determines the label of a data point from the labels of its neighbors.

Formally, the model is first used to extract frozen features $\mathcal{X} = x_1, ..., x_n$ (often $l_2$-normalized) from all the images in the training dataset. To classify a new image, we extract its feature representation $x'$ and retrieve its $k$ nearest neighbors: the $k$ vectors of the training set $\mathcal X$ that have the highest cosine similarity with $x'$. The vanilla approach then applies a majority voting scheme: every neighbor counts as $+1$ for its corresponding label, and we choose the label with the most votes. More sophisticated approaches use a weighted voting scheme: instead of counting $+1$ for its label, every neighbor counts with a weight $w = f(x^Tx')$; for instance, the DINO implementation employs $w = e^{x^Tx'/T}$ [@caron2021emerging]. This accounts for imbalanced training sets and non-i.i.d. features, and usually gives more accurate results, at the cost of introducing an additional hyperparameter $T$.

KNN classifiers have the great advantage of relying on few hyperparameters and being fast and light to deploy, without requiring any domain adaptation.
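The weighted voting scheme described above can be sketched in PyTorch. This is a minimal illustration of a DINO-style weighted KNN probe, not the actual DINO code; the function name and the single-query-batch interface are ours.

```python
import torch
import torch.nn.functional as F

def weighted_knn_predict(train_feats, train_labels, query_feats, k=20, T=0.07):
    """Sketch of a weighted k-NN probe: cosine similarity on
    l2-normalized features, with neighbor weights w = exp(sim / T)."""
    train_feats = F.normalize(train_feats, dim=1)
    query_feats = F.normalize(query_feats, dim=1)
    sims = query_feats @ train_feats.T             # (Q, N) cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=1)      # k nearest neighbors per query
    weights = (topk_sims / T).exp()                # DINO-style weighting
    num_classes = int(train_labels.max().item()) + 1
    votes = torch.zeros(query_feats.size(0), num_classes)
    votes.scatter_add_(1, train_labels[topk_idx], weights)  # weighted votes per label
    return votes.argmax(dim=1)
```

Setting `weights = torch.ones_like(topk_sims)` instead recovers the vanilla majority-voting scheme.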

#### Linear

In the context of SSL evaluation, training a linear classifier on top of pre-trained feature representations, a.k.a. linear probing, was introduced by [@zhang2016colorful; @zhang2017split]. It is the most popular protocol for several reasons: it achieves high accuracy; since a linear classifier has low discriminative power, its performance relies heavily on the quality of the representation; it imitates how the features can be used in practice; and, last but not least, it is not very computationally expensive.

Most of the time, it is done simply by appending a linear layer on top of the frozen backbone and optimizing its parameters for a few epochs (around $100$). Sometimes, as introduced by [@beit], we can benefit from the fact that the linear evaluation is lightweight and evaluate multiple linear heads at once, to test many hyper-parameters at the same time (learning rate, averaging features or using a class token for ViT-like architectures, number of features, etc.). A linear probe can also be trained online, by simply cutting the gradient flowing from the probe into the representations. Though only an approximation, an online linear probe is extremely cheap as it reuses the computations of the SSL pretraining, and gives a good indication of downstream performance, as shown in `\Cref{fig:linear_vs_mlp}`{=latex}.
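The gradient-cutting trick for an online probe amounts to a single `detach()` call. Below is a hypothetical minimal sketch (the probe dimensions, optimizer, and function name are illustrative, not from any specific codebase):

```python
import torch
import torch.nn.functional as F

# Hypothetical online linear probe: trained on detached representations,
# so no gradient from the probe's loss flows into the SSL backbone.
probe = torch.nn.Linear(2048, 1000)  # assumed backbone dim and class count
probe_opt = torch.optim.SGD(probe.parameters(), lr=0.1)

def online_probe_step(representations, labels):
    logits = probe(representations.detach())  # stop-gradient toward the backbone
    loss = F.cross_entropy(logits, labels)
    probe_opt.zero_grad()
    loss.backward()
    probe_opt.step()
    return loss.item()
```

This step is called on the representations produced during the normal SSL forward pass, so the probe adds almost no compute on top of pretraining.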

#### MLP

Instead of simple linear probing, a multi-layer perceptron (with two or three layers) can also be used to probe which information is learned by an SSL model. Nonlinear evaluation is rarely present in work on SSL, but it is needed when the learned features are not linearly separable, or when it is too difficult to extract the information present in the features with a linear model. In fact, comparing results between a linear and a nonlinear probe can give some idea of how well structured a representation is. @demo present results comparing different evaluation regimes using a linear or a nonlinear probe. In `\Cref{fig:linear_vs_mlp}`{=latex}, one can observe that it is possible to gain some accuracy by using a multi-layer perceptron instead of a linear probe. However, the main issue with adding capacity to the probe is overfitting: the best MLP head might not be the one you get after 100 epochs, as shown in `\Cref{fig:linear_vs_mlp}`{=latex}.

```{=latex}
\centering
```
![Figure from @demo. Depiction of the classifier probe trained to predict the Imagenet-1k labels from the output of a Resnet50 backbone during SimCLR training (**online**) and post-training (**offline**), using a linear or an MLP classifier. The red cross corresponds to the best accuracy. In the offline setting no data augmentation is employed. We clearly observe that (i) when employing an MLP, only a few epochs are needed and regularization or early stopping should be employed; in the popular linear case, however, there are limited differences between the online and offline performances, and overfitting never occurs in either training case. ](figures/linear_versus_mlp_probing.png){#fig:linear_vs_mlp}

```{=latex}
\begin{tikzpicture}
    \Quote{If labels are available, evaluate your model using an online linear probe}{Adrien, Quentin, Florian}
\end{tikzpicture}
```
#### Full Fine-tuning

The Masked Auto-encoders (MAE) paper [@he2022masked] re-introduced fine-tuning as the main evaluation metric. The main arguments are that linear probing is uncorrelated with fine-tuning and transfer-learning performance, and that small MLP heads do not evaluate a method's ability to create strong but non-linear features. The majority of works that followed [@beit; @zhou2021ibot; @dong2021peco] focused on this type of evaluation (and sometimes do not report linear/MLP results). It has been shown that contrastive methods show inferior fine-tuning performance compared to masked image modeling because they are less \`\`optimization friendly" [@wei2022contrastive], which explains the overall interest in MIM. It is by far the most computationally expensive of the evaluation methods, since it re-trains the whole network. The most common benchmark on ImageNet runs the optimization for $100$ epochs for ViTs smaller than ViT-B, and for $50$ epochs for larger models [@he2022masked]. Other works [@beit; @peng2022beit; @wang2022image] first fine-tune on ImageNet-21k for $60$ epochs and then further fine-tune on ImageNet-1k, which represents between $1/5$ and $2$ times the cost of the pre-training phase.

### Evaluation without labels

As just discussed, most evaluations rely on labels and on training an auxiliary model. This can make evaluations expensive and sensitive to hyperparameters and their optimization. To help alleviate these issues, multiple methods have been proposed to evaluate models, or to tune their hyperparameters, without relying on labels. Using a pretext task such as rotation prediction can facilitate performance evaluation without labels, as demonstrated in [@reed2021selfaugment] for data augmentation policy selection. However, drawbacks of this approach are the requirement of training a classifier for the pretext task, and the assumption that rotations were not part of the pretraining augmentations, as otherwise the model would be invariant to them. The eigenspectrum of the representations is used in conjunction with the loss value to evaluate performance in [@li2022understanding]. While a correlation with performance is shown, it requires training a performance classifier from the rank and loss value, making it hard to use for unsupervised evaluation. In [@agrawal2022alphareq], $\alpha$-ReQ is introduced to evaluate methods by looking at the eigenspectrum decay of representations before the projector.


```{=latex}
\centering
```
```{=latex}
\begin{table}
\centering
\caption{Hyper-parameter selection without labels using $\alpha$-ReQ and RankMe: top-1 accuracy of the selected models on ImageNet and out-of-distribution (OOD) datasets, compared to an oracle selection using ImageNet labels. From [@garrido2022rankme].}
\label{tab:rankme}
\resizebox{0.75\linewidth}{!}{
  \begin{tabular}{llccccccccc}
    \toprule
     \multirow{2}{*}{Dataset} & \multirow{2}{*}{Method}  & \multicolumn{2}{c}{VICReg} &  \multicolumn{1}{c}{SimCLR} &  \multicolumn{2}{c}{DINO} \\ 
 \cmidrule(lr){3-4} \cmidrule(lr){5-5} \cmidrule(lr){6-7}&  & cov. & inv. & temp. & t-temp. & s-temp.\\ 
 \midrule 
\multirow{3}{*}{ImageNet} & \textcolor{gray}{ImageNet Oracle} & \textcolor{gray}{68.2} & \textcolor{gray}{68.2} & \textcolor{gray}{68.5} & \textcolor{gray}{72.3} & \textcolor{gray}{72.4}\\ 
 & $\alpha$-ReQ & \textbf{67.9} & 67.5 & 63.5  & 71.7 & 66.2\\ 
 & RankMe & 67.8 &\textbf{ 67.9} & \textbf{67.1}  & \textbf{72.2} & \textbf{72.4}\\ 
  \midrule 
\multirow{3}{*}{OOD} & \textcolor{gray}{ImageNet Oracle} &  \textcolor{gray}{68.7} & \textcolor{gray}{68.7} & \textcolor{gray}{68.7} & \textcolor{gray}{71.9} & \textcolor{gray}{72.5}\\ 
 & $\alpha$-ReQ  & \textbf{68.1} & 67.8  & 65.1 & \textbf{71.8} & 68.5\\ 
 & RankMe  & 67.7 & \textbf{68.3} & \textbf{67.6} & \textbf{71.8} & \textbf{72.5}\\  
    \bottomrule
  \end{tabular}
  }
\end{table}
```
Another simple way to evaluate SSL methods, called RankMe, was introduced by [@garrido2022rankme]. The idea is to use the effective rank of the representations, defined via the entropy of the singular value distribution of the embeddings. It can be computed as: $$\text{RankMe}(\mZ) = \exp\left(-\sum_{k=1}^{\min(N,K)} p_k \log p_k\right),\; p_k = \frac{\sigma_k(\mZ)}{\|\sigma(\mZ)\|_1}+\epsilon$$ A high effective rank is shown to be a necessary condition for good performance, though full-rank representations can be achieved with degenerate results (e.g., a random matrix with entries sampled i.i.d. from a Gaussian distribution). While this means RankMe cannot be used to compare different methods, it works well for hyperparameter selection, as shown in `\Cref{tab:rankme}`{=latex}.
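The formula above is a few lines of PyTorch. The following is a minimal sketch (not the reference implementation) for an $N \times K$ embedding matrix:

```python
import torch

def rankme(Z, eps=1e-7):
    """Effective rank of an (N, K) embedding matrix Z, following the
    RankMe formula: exponential of the entropy of the (normalized)
    singular value distribution."""
    sigma = torch.linalg.svdvals(Z)       # singular values of Z
    p = sigma / sigma.sum() + eps         # normalized spectrum, eps for stability
    return torch.exp(-(p * p.log()).sum()).item()
```

As a sanity check, an orthogonal matrix yields an effective rank close to its dimension, while a rank-one matrix (a collapsed representation) yields a value close to 1.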

```{=latex}
\begin{tikzpicture}
    \Quote{To debug without labels, looking at the rank of the representations is a good start e.g. with RankMe}{Quentin}
\end{tikzpicture}
```
### Going beyond classification

While classification is a commonly used performance metric for evaluating self-supervised learning models, it is important to consider other types of vision tasks as well. Tasks such as object detection and semantic segmentation have gained popularity as they require models to learn more complex representations of visual information. Recent works [@caron2021emerging; @zhou2021ibot; @bardes2022vicregl] have demonstrated the effectiveness of self-supervised learning for these tasks. However, a limitation is that there is currently no standardized protocol for evaluating self-supervised models on these tasks. Various evaluation methods exist, such as finetuning the encoder on a downstream task or using the encoder as a feature extractor. Further research is needed to establish a standardized evaluation protocol for these tasks in the context of self-supervised learning.

### Visual Evaluation

![`\small `{=latex}Figure from @bordes2022high. RCDM visualization of **what is encoded inside various representations.** The first to fourth rows show samples conditioned on the usual ResNet-50 backbone representation (size 2048), while the fifth to eighth rows show samples conditioned on the projector/head representation of various SSL models. (Note that a separate generative model was trained specifically for each representation.) *Common/stable aspects* among a set of generated images reveal *what is encoded* in the conditioning representation. *Aspects that vary* show *what is not encoded* in the representation. We clearly see that the projector representation only keeps global information and not its context, contrary to the backbone representation. This indicates that invariances in SSL models are mostly achieved in the projector representation, not the backbone. Furthermore, it also confirms the linear classification results of Table a), which show that backbone representations are better for classification since they contain more information about an input than those at the projector level.](figures/AllSSLClass.png){#fig:rcdm_proj_vs_backbone width="0.9\\linewidth"}

```{=latex}
\vspace{-0.4cm}
```
Another way to evaluate which information is or is not contained in a representation is to use a decoder that maps the representation back to pixel space. Some methods like [@he2022masked] are built with a specific decoder, which makes such visual analysis easy; however, most SSL methods do not ship with a decoder. To alleviate this issue and allow researchers to visualize what can be learned by any type of SSL method, @bordes2022high suggest training a conditional generative diffusion model using an SSL representation as conditioning. By analyzing which information remains constant across different samples generated with a given conditioning, and which information varies (because of the stochasticity in the generative model), one can get hints about which information is contained in the representation. If a representation encoded all the information about each pixel, the conditional generative model would exploit every bit of this information to perform a perfect reconstruction, leading to no variance across samples. If the representation encodes only the class information, the conditional generative model can only use that to reconstruct an image belonging to this class, which means that across different samples the object class will remain constant while the background/context/color will change. In `\Cref{fig:rcdm_proj_vs_backbone}`{=latex}, we show how RCDM was used by @bordes2022high to compare the representations learned at the projector level with those learned at the backbone level. We observe that the representations at the projector level are much more invariant, since the color/background information does not remain constant across different samples, while this is not the case at the backbone level.

```{=latex}
\begin{tikzpicture}
    \Quote{You often only need to train for a few epochs to test collapse (no more than 5 epochs)}{Adrien}
\end{tikzpicture}
```
Speeding up Training
--------------------

### Distributed Training

Training self-supervised models often requires large batch sizes [@chen2020simple; @He2020MomentumCF], or can be considerably sped up by increasing the batch size, which is ultimately limited by the memory capacity of the device the model is trained on. Distributed training divides batches across several devices that run in parallel, which increases the overall batch size. This is mainly done with DDP (Distributed Data Parallel) or FSDP (Fully Sharded Data Parallel), available in libraries like FairScale [@FairScale2021] or Apex [@apex]. However, some self-supervised methods rely on the statistics of the current batch to compute their loss value [@chen2020simple; @zbontar2021barlow; @bardes2021vicreg], which has to be taken into account when distributing the training across multiple devices. In this section, we present the elements that need to be taken into account to correctly distribute the training of common self-supervised learning methods. We call the *effective batch size* the size of the full batch distributed over the devices, and the *per-device batch size* the size of each sub-batch on a single device.

**Synchronized batch normalization.** Batch normalization is one of the most common techniques for stabilizing neural network training and improving the performance of the network. It is present in most convolutional backbones used in self-supervised learning, in particular ResNets. Batch norm uses the statistics of the current batch, which need to be aggregated for distributed training. This can be done easily in PyTorch by wrapping your distributed model the following way: `model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)`. This replaces all the BatchNorm modules in the network with a custom BatchNorm class that aggregates the statistics automatically.

**Aggregate batches for exact loss computation.** Batch norm is not the only operation that operates on batches; several self-supervised loss functions do as well, such as SimCLR [@chen2020simple], which uses the examples in the current batch as negative examples for its contrastive loss, or VICReg [@bardes2021vicreg], which computes the covariance matrix of its embeddings. In these cases the sub-batches from each device need to be aggregated into the full batch manually. This can be done using the `all_gather` operation from PyTorch; however, this operation does not support back-propagation through it. We therefore implement a custom gather operation that does; the code is provided below:

```{=latex}
\begin{algorithm}
  \caption{}
  \label{alg:method}
    \definecolor{codeblue}{rgb}{0.25,0.5,0.5}
    \definecolor{codekw}{rgb}{0.85, 0.18, 0.50}
    \newcommand{\algofontsize}{8.5pt}
    \lstset{
      backgroundcolor=\color{white},
      basicstyle=\fontsize{\algofontsize}{\algofontsize}\ttfamily\selectfont,
      columns=fullflexible,
      breaklines=true,
      captionpos=b,
      commentstyle=\fontsize{\algofontsize}{\algofontsize}\color{green!50!black},
      keywordstyle=\fontsize{\algofontsize}{\algofontsize}\bfseries\color{blue!90!black},
    }
\begin{lstlisting}[language=python]
class GatherLayer(torch.autograd.Function):
    """
    Gather tensors from all processes and support backward propagation
    of the gradients across processes.
    """

    @staticmethod
    def forward(ctx, x):
        output = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(output, x)
        return tuple(output)

    @staticmethod
    def backward(ctx, *grads):
        all_gradients = torch.stack(grads)
        dist.all_reduce(all_gradients)
        return all_gradients[dist.get_rank()]
\end{lstlisting}
\end{algorithm}
```
We use an `all_reduce` operation on the gradients, which sums them, because DDP will later divide them by the number of devices. One can use the operation by simply calling `GatherLayer.apply(x)` on the input `x`. In practice, for the methods above, this needs to be done on the embeddings just before the loss computation.
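To see why summing the gradients with `all_reduce` and then indexing by rank recovers the correct gradient, consider this toy single-process simulation of two devices, each holding a scalar input (pure Python; the values and function names are ours, chosen only for illustration):

```python
def local_grads(rank, x):
    # Per-rank loss L_r = (rank + 1) * x[0] * x[1]; every rank computes
    # the gradient of its own loss w.r.t. each gathered input.
    return [(rank + 1) * x[1], (rank + 1) * x[0]]

def gather_backward(x, world_size):
    # Mimics GatherLayer.backward: stack the local gradients, all_reduce
    # (element-wise sum across ranks), then each rank keeps its own entry.
    grads = [local_grads(r, x) for r in range(world_size)]
    summed = [sum(g[i] for g in grads) for i in range(len(x))]
    return summed  # entry r is d(sum_r L_r)/dx_r, the gradient rank r needs

x = [3.0, 5.0]  # inputs held by rank 0 and rank 1
g = gather_backward(x, world_size=2)
# total loss = 1*x0*x1 + 2*x0*x1 = 3*x0*x1, so the gradients are (3*x1, 3*x0)
assert g == [3 * x[1], 3 * x[0]]
```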

**Additional tricks.** We advise always using the effective batch size as the argument to the training script, as well as when comparing runs. The `DataLoader` class takes the per-device batch size as argument, which can be obtained by dividing the effective batch size by the number of devices, called `world_size` in PyTorch. We also advise using a learning rate scaled with the effective batch size, for example `effective_lr = base_lr * effective_batch_size / 256`, where `base_lr` is the argument of the training script. This reduces the learning rate search range when changing the batch size. When using small batch sizes, @chen2020simple recommend a square-root scaling instead: `effective_lr = base_lr * sqrt(effective_batch_size) / 256`.
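These conventions fit in a short helper (a minimal sketch; the function names are ours):

```python
import math

def per_device_batch_size(effective_batch_size, world_size):
    # The DataLoader receives each device's share of the effective batch.
    assert effective_batch_size % world_size == 0
    return effective_batch_size // world_size

def scaled_lr(base_lr, effective_batch_size, rule="linear"):
    # Linear scaling rule by default; square-root variant for small batches.
    if rule == "linear":
        return base_lr * effective_batch_size / 256
    return base_lr * math.sqrt(effective_batch_size) / 256

assert per_device_batch_size(2048, 8) == 256
assert scaled_lr(0.3, 256) == 0.3  # linear rule leaves the base lr unchanged at 256
```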

`\label{sec:ffcv}`{=latex}

### Even Faster Training with FFCV and Other Speedups

Since most joint-embedding SSL methods require different sets of handcrafted data augmentations, data processing can become a real bottleneck when training SSL models. Some approaches[^4] have used DALI as an alternative data loader to torchvision, while others have relied on FFCV-SSL[^5], which is based on the FFCV library [@leclerc2022ffcv]. FFCV-SSL [@demo] shows that one can train SimCLR on ImageNet in less than 2 days on a single GPU, or in just a few hours using 8 GPUs (Figure `\ref{fig:ffcv_vs_torchvision}`{=latex}).

```{=latex}
\centering
```
![Figure from @demo. ImageNet validation accuracy (y-axis) during training of SimCLR with respect to the training time (x-axis). FFCV-SSL is a library that is specifically optimized for Self-Supervised Learning and that extends the original FFCV library [@leclerc2022ffcv]. FFCV-SSL allows a 3x speedup with respect to torchvision and enables the training of SSL models in less than 2 days on a single GPU. ](figures/ffcv_vs_torchvision.png){#fig:ffcv_vs_torchvision width="\\linewidth"}

### Speeding Up Training of Vision Transformers

Training ViTs can be made more efficient for two reasons. First, ViTs can easily skip processing a subset of the patches. This is especially helpful when using masked prediction pre-training objectives such as MAE [@he2022masked] or Masked Siamese Networks [@assran2022masked], since the masked patches can simply be dropped from the encoder input. For instance, with ViTs and such objectives, Data2vec 2.0 [@baevski2022efficient] achieves $84\%$ top-1 accuracy after pre-training for only $3$ hours on $32$ GPUs.
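The token-dropping idea behind this speedup can be sketched as follows (a toy illustration, not MAE's actual implementation; the function name is ours):

```python
import random

def visible_patch_indices(num_patches, mask_ratio, rng):
    # MAE-style random masking: keep a random subset of patch indices and
    # feed only the corresponding tokens to the encoder.
    num_keep = int(num_patches * (1 - mask_ratio))
    return sorted(rng.sample(range(num_patches), num_keep))

rng = random.Random(0)
keep = visible_patch_indices(196, 0.75, rng)  # 14x14 patches, 75% masked
assert len(keep) == 49  # the encoder processes only a quarter of the tokens
```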

The second reason is linked to the architecture. Since transformers [@vaswani2017attention] are employed in almost all domains of computer science, many works aim to reduce the compute and memory requirements of the attention mechanism. One approach uses low-rank and/or sparse approximations of attention [@kitaev2020reformer; @choromanski2020rethinking; @wang2020linformer; @chen2021scatterbrain; @zaheer2020big]. For instance, @li2022efficient use sparse self-attention to improve efficiency in the context of SSL vision models. Another approach is to resort to IO-aware optimizations [@ivanov2021data], perhaps the best known being FlashAttention [@dao2022flashattention].

These speed-ups are available in open-source libraries: Fairseq [@ott2019fairseq], FairScale [@FairScale2021], XFormers [@xFormers2022], Apex [@apex], etc.

Another simple way to speed up the training of vision transformers is to use PyTorch's bfloat16 data type, which allows faster training while keeping the same dynamic range as float32 (this is useful to avoid the numerical instability issues one can encounter when training vision transformers in float16).

Extending Self-Supervised Learning Beyond Images and Classification
===================================================================

Strategies for Other Data Domains
---------------------------------

Pre-training large models with self-supervision objectives is popular not only for vision systems, but also for audio, text, and tabular data. The performance of existing SSL methods varies across these domains -- yielding state-of-the-art language models but limited success on tabular data -- which may either reflect the varying suitability of self-supervision across domains, or alternatively the wildly differing amounts of attention paid to the various domains in the SSL literature.

Applying SSL techniques to any of these data domains requires care, as unique challenges arise in each domain which necessitate special consideration. For example, SSL for vision often revolves around data augmentations that may not naturally apply to speech signals. The \`positive pairs' available for contrastive learning vary from slightly different views of the same image to totally different segments of an audio recording. Nonetheless, both contrastive and generative objectives can be applied to these other data domains. One generically useful technique across data types is masking. Whether predicting missing words in a sentence, pixels in an image, or entries of a row in a table, masking is an effective component of SSL approaches across domains.

This section is not intended as a thorough survey of self-supervision for other data modalities, as each of those fields is vast. Domain-specific surveys can be found in @liu2022audio (audio), @schiappa2022video (video), @min2021recent (text), and @rubachev2022revisiting (tabular data). Rather, this section provides a discussion of the interesting similarities and differences in how SSL is applied to audio, text, and tabular data.

**Audio data.** Audio signals, both raw audio and mel spectrograms, have a lot in common with images. As inputs to a neural network, there are strong similarities; for example, convolutions can be useful in both cases [@oord2016wavenet; @schneider2019wav2vec; @baevski2021unsupervised]. But as data for SSL, major differences arise. For example, horizontally flipping an image does not usually change its semantic meaning (and is a wildly popular data augmentation), but for speech recordings this would completely distort the data. Similarly, while masking images is often done with random pixels, the two dimensions of a spectrogram represent time and frequency, and masking with horizontal and/or vertical bands is more effective [@wang2020unsupervised]. Additionally, the existence of sounds other than speech (background noise, room tone) presents a unique challenge when looking for positive pairs for contrastive learning: preventing the learned representations from overfitting to the noise within a given clip [@oord2018representation; @wang2020unsupervised]. In fact, these high-frequency noisy artifacts, which are generally unrelated to the semantic meaning, mean that reconstruction in input space is more complicated than in other domains (e.g. text). Multi-modal models, on the other hand, can consider a soundbite and its transcript [@sermanet2018time; @chung2018unsupervised] or some frames of a video and the corresponding sound clip [@zhao2018sound; @alwassel2020self] as different views to be used as positive pairs for contrastive learning.
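The band-masking idea can be sketched in a few lines (a toy pure-Python illustration in the spirit of time/frequency masking; the function name and default widths are ours):

```python
import random

def mask_bands(spec, num_time_bands=1, num_freq_bands=1, width=2, rng=None):
    # spec: 2D list indexed as spec[freq][time]. Rather than masking random
    # pixels, zero out whole frequency (horizontal) and time (vertical) bands.
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # leave the input untouched
    for _ in range(num_freq_bands):
        f0 = rng.randrange(max(1, n_freq - width + 1))
        for f in range(f0, min(f0 + width, n_freq)):
            out[f] = [0.0] * n_time
    for _ in range(num_time_bands):
        t0 = rng.randrange(max(1, n_time - width + 1))
        for row in out:
            for t in range(t0, min(t0 + width, n_time)):
                row[t] = 0.0
    return out
```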

**Video data.** Most SSL image methods have a video SSL counterpart. For instance, @feichtenhofer2021large generalized SimCLR, MoCo, SwAV and BYOL to space-time video data. Indeed, all these methods can incorporate the notion of similarity between different temporal clips of the same video. More recently, masked auto-encoding objectives for video have been built around the same idea as for images, but masking patches/tubes of patches along the temporal axis as well [@feichtenhofer2022masked; @tong2022videomae; @girdhar2022omnimae]. Moreover, it is common practice to use SSL vision pre-trained models for video downstream tasks like action recognition. With ViTs, for instance, the patch embedding convolutional layer can be transferred from 2D to 3D by repeating the weights along the temporal axis [@feichtenhofer2022masked]. Vision models can then transfer to video models by serving as initialization for fine-tuning on video tasks [@fang2022eva]. The frame features can also be used directly, by appending a linear layer on top of the features [@radford2019language], or by using more complex heads [@ni2022expanding; @arnab2021vivit]. In this case, the visual backbone is frozen and the temporal information is learned afterward.

**Text data.** In contrast to audio data, text is a relatively clean input signal, and representations that are useful for reconstruction do not overfit to a noisy part of the signal. In fact, the most popular large language models are all trained with reconstruction objectives as opposed to the contrastive objectives popular in other data domains [@radford2018improving; @radford2019language; @brown2020language; @devlin2018bert]. The Word2Vec objective [@mikolov2013distributed], which predicts a masked-out portion of the training text, has served as a foundational objective for self-supervised learning in natural language. While uncommon, language modeling can be done with contrastive learning for word or character representations [@chen2022clower]. One other difference between text and images is that masked token prediction for text is done over an entire dictionary. This approach is not dominant for images, but it has been tried at the pixel level [@chen2020generative]. While there are few augmentations for language data that do not change the semantic meaning, large-scale systems generally use enough data and various types of masking to overcome this. Specifically, next-token prediction [@radford2018improving; @radford2019language; @brown2020language] is akin to masking the last token in a string, while bidirectional encoders mask tokens anywhere in the string [@devlin2018bert] or fill larger spans of missing text [@raffel2020exploring; @tay_unifying_2022]. This choice of unidirectional next-token prediction versus bidirectional approaches leads to meaningful differences in downstream text applications [@artetxe_role_2022]. For contrastive learning, positive pairs often come from masking and/or cropping input sequences [@meng2021coco; @giorgi2021declutr]. They can also be generated using dropout so that one input has two different latent representations [@gao2021simcse].
Additionally, some methods for both contrastive and reconstructive pretraining corrupt the input with several other augmentations, including document rotation, sentence permutation, and token deletion [@lewis2020bart; @raffel2020exploring; @wu2018unsupervised].

**Tabular data.** Unlike for text, audio, and images, classical machine learning tools are still popular for processing tabular data. However, while deep learning for tabular data is a comparatively small field, finding sensible data augmentation strategies is already a much-studied topic. Several SSL methods for tabular data utilize masking in various ways, and some techniques creatively employ other augmentations developed for images, like mixup [@zhang2018mixup]. As with images and audio, some algorithms aim to generate the missing or corrupted values while others employ contrastive learning. In combinatorial optimization problems such as Mixed Integer Programming (MIP), the objective function is used as guidance to generate positive solution pairs with comparable objective values and negative solution pairs whose objective values differ drastically despite tiny changes to a few variables [@huang2023searching]. Similar approaches are also used in guided language generation [@yang2022doc].

The masked reconstruction approaches employ a variety of masking tactics. Furthermore, it is common with tabular data to predict the mask vectors themselves as a pretext task [@yoon2020vime; @iida2021tabbie]. Since predicting the mask itself is part of the pretraining objective, the masked entries in the input must be filled, and typically this is done by sampling from the empirical distribution of that column or feature.
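A minimal sketch of this corruption scheme (pure Python; the function name is ours, and real implementations operate on array-backed tables):

```python
import random

def corrupt_rows(rows, p_mask, rng):
    # For each entry, with probability p_mask replace the value by one drawn
    # from the empirical distribution of its column; return the corrupted
    # table and the binary mask vectors the model must predict.
    n_cols = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_cols)]
    corrupted, masks = [], []
    for row in rows:
        new_row, mask = [], []
        for j, value in enumerate(row):
            if rng.random() < p_mask:
                new_row.append(rng.choice(columns[j]))  # empirical sample
                mask.append(1)
            else:
                new_row.append(value)
                mask.append(0)
        corrupted.append(new_row)
        masks.append(mask)
    return corrupted, masks

table = [["a", 1.0], ["b", 2.0], ["a", 3.0]]
corrupted, masks = corrupt_rows(table, p_mask=0.3, rng=random.Random(0))
assert len(corrupted) == len(table) and len(masks[0]) == 2
```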

With the same augmentation, i.e. masking and sampling from the empirical marginal distribution, @bahri2021scarf propose pretraining with a contrastive loss. Specifically, they propose using the InfoNCE loss [@gutmann2010noise; @ceylan2018conditional] to compare the representations of the clean and corrupted inputs.

Several other works outline ways to augment the data for a combination of generation and contrastive learning. For example, tabular data can be split into groups of columns so each sample (row) has several views available [@ucar2021subtab]. Borrowing from vision systems, a combination of CutMix [@yun2019cutmix] in input space and mixup [@zhang2018mixup] in embedding space is also an effective augmentation for tabular data [@somepalli2021saint]. These methods generate augmented views that are used along with the clean input for contrastive learning. However, contrastive pretraining for both the SAINT model [@somepalli2021saint] and SubTab [@ucar2021subtab] seems to work best when this is paired with a reconstructive loss term.
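Both augmentations can be sketched in a few lines (toy versions; the actual methods operate on batched tensors and learned embeddings):

```python
import random

def cutmix_row(x_a, x_b, p_swap, rng):
    # CutMix-style corruption in input space: copy a random subset of
    # features of row x_b into row x_a.
    return [b if rng.random() < p_swap else a for a, b in zip(x_a, x_b)]

def mixup(z_a, z_b, lam):
    # mixup in embedding space: a convex combination of two embeddings.
    return [lam * a + (1 - lam) * b for a, b in zip(z_a, z_b)]

assert mixup([1.0, 0.0], [0.0, 1.0], 0.75) == [0.75, 0.25]
```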

In their work comparing SSL methods for tabular data, @rubachev2022revisiting find that pretraining objectives generally do help boost the performance of tabular models. More specifically, however, they find that pretraining objectives that use the labels are best, implying that SSL for tabular data has yet to reach the state of the art in its domain [@rubachev2022revisiting]. Similarly, @levin2023transfer show that, unlike in computer vision, existing SSL pre-training routines yield less transferable features than supervised pre-training.

**Reinforcement learning.** SSL has been used to improve reinforcement learning (RL) on visual inputs. This setting is similar to video, except that in addition to the sequence of images, we also have access to the sequence of actions. The most common way to apply SSL here is to use contrastive learning to train a model to match the current state's representation with the next time step's representation, or to match representations of the same state under different augmentations. One of the earliest examples is CURL [@Srinivas_Laskin_Abbeel_2020]. Recently, SSL has been used to improve sample efficiency on the challenging Atari100k benchmark [@simple]. Recent works have modified BYOL [@grill2020bootstrap] or Barlow Twins [@zbontar2021barlow] by feeding images of consecutive timesteps' observations to the two branches of the siamese network: SGI [@sgi] and Barlow Balance [@Zhang_GX-Chen_Sobal_LeCun_Carion_2022] did this for offline pretraining, while SPR [@spr] uses it as an additional objective in the online setting. The best-performing method in this line is EfficientZero [@Ye_Liu_Kurutach_Abbeel_Gao_2021], which modifies MuZero [@muzero] by, among other changes, adding the SimSiam [@Chen_He_2020] objective to train the encoder and the forward model, and sets a new state of the art on Atari100k. @Parisi_Rajeswaran_Purushwalkam_Gupta_2022 propose PVR, a method based on MoCo [@he2020momentum] that improves sample efficiency on control tasks. @Eysenbach_Zhang_Salakhutdinov_Levine_2022 show that contrastive learning in the RL setting is directly linked to goal-conditioned RL, and demonstrate that a method based on InfoNCE [@oord2018representation] achieves strong performance on robotic arm control tasks.

SSL has been shown to yield good representations for behavior cloning. @Pari_Shafiullah_Arunachalam_Pinto_2022 show that an ImageNet-pretrained model fine-tuned with BYOL [@grill2020bootstrap] can be used very effectively for visual imitation on robotic open, push, and stack tasks, while @Arunachalam_Guzey_Chintala_Pinto_2022 use a similar method to successfully learn from a small manipulation dataset collected using VR. @tdex present a method that uses BYOL to extract information from tactile sensors on robotic arms and improve dexterous manipulation. @cbet show that BYOL representations of visual inputs are also useful when modeling goal-conditioned trajectories with a transformer architecture.

There are a few additional challenges when applying SSL to RL. First, if the data is recorded online, individual observations are highly correlated with each other and are not IID (independent and identically distributed), so sampling from the replay buffer should be done carefully. One failure mode of SSL objectives applied to RL agents' data is the proclivity to latch onto \`slow features' [@slowfeats]. The contrastive objective may, for example, learn to look only at the cloud patterns in the sky to tell apart frames in a self-driving dataset, so one must be careful to design augmentations that remove useless static features in the image, or to sample data accordingly.

SSL has been used not only to improve sample efficiency, but also to improve exploration. @byolexplore propose BYOL-Explore, which uses BYOL [@grill2020bootstrap] to learn the encoder and the forward model, and uses the forward model's disagreement as the exploration objective. Follow-up work by @Jarrett_Tallec_Altché_Mesnard_Munos_Valko_2022 addresses the problem of BYOL-Explore latching onto a noisy TV. @Yarats_Fergus_Lazaric_Pinto_2021 propose using a clustering method akin to SwAV [@caron2020unsupervised] for unsupervised exploration, i.e. exploration with only intrinsic rewards.

A few works have explored using the vast natural video data available to pre-train representations for RL agents. @Xiao_Radosavovic_Darrell_Malik_2022 introduce MVP, which uses a masked autoencoder to pre-train a transformer encoder for robotic control, while @Ma_Sodhani_Jayaraman_Bastani_Kumar_Zhang_2022 propose VIP, a method that learns universal features for RL using a ResNet-50 backbone with an objective based on the time between frames in the observations as the supervision signal. Another method for training foundation models for RL, R3M [@Nair_Rajeswaran_Kumar_Finn_Gupta_2022], combines time-contrastive and video-language alignment objectives. VIP and R3M are trained on the large Ego4D dataset [@ego4d], while MVP combines ImageNet, Ego4D, and additional hand-manipulation data. @Majumdar_Yadav_Arnaud_Ma_Chen_Silwal_Jain_Berges_Abbeel_Malik propose VC-1, a method based on masked auto-encoding. The authors test the proposed method and other foundation models on a new test suite called CortexBench. The benchmark includes control, object manipulation, and navigation tasks, with different methods excelling at different parts of the benchmark.

There are also unsupervised methods for learning representations that are specific to RL and are not commonly used for images, e.g. Laplacian eigenmaps [@Machado_Bellemare_Bowling_2017] and forward-backward representations [@Touati_Rapin_Ollivier]. @Zhang_McAllister_Calandra_Gal_Levine_2021 propose learning representations by making them identical for states that lead to the same rewards, and different otherwise.

Incorporating Multiple Modalities into SSL Training {#subsec:multimodal}
---------------------------------------------------

Self-supervised learning need not be based on a single modality; multi-modal vision-language models have recently demonstrated this to great effect. Contrastive Language--Image Pre-training (CLIP) [@radford_learning_2021] and ALIGN [@jia_scaling_2021] are self-supervised learning approaches that use image-caption pairs to learn a joint embedding space for images and captions. The objective here is contrastive: an image and its caption are fed through separate encoder models that encode each modality into a fixed-length embedding vector. The embeddings of a training image-caption pair are aligned, whereas other combinations in a batch are repelled.
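The symmetric contrastive objective can be sketched as follows (a toy pure-Python version operating on a precomputed similarity matrix; real implementations work on batched tensors of normalized embeddings, and the function name is ours):

```python
import math

def info_nce_clip(sim, tau=0.07):
    """Symmetric image-text contrastive loss on a similarity matrix.

    sim[i][j] is the similarity between image i and caption j; matched
    pairs sit on the diagonal.
    """
    n = len(sim)

    def cross_entropy(row, target):
        logits = [s / tau for s in row]
        m = max(logits)  # stabilize the log-sum-exp
        lse = m + math.log(sum(math.exp(l - m) for l in logits))
        return lse - logits[target]

    # image-to-text: each image must pick out its own caption (rows)
    loss_i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # text-to-image: each caption must pick out its own image (columns)
    loss_t2i = sum(cross_entropy([sim[j][i] for j in range(n)], i)
                   for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

A perfectly aligned batch (large diagonal similarities) drives the loss toward zero, while an uninformative similarity matrix gives a loss of log(batch size).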

This approach is especially interesting in comparison to contrastive SSL based on pure vision, as discussed in `\Cref{sec:contrastive}`{=latex}. The use of a second modality, here text, anchors the entire SSL training. It is no longer necessary to generate multiple augmented views to form a notion of robust representation as the joint approach learns semantically meaningful representations simply by observing similar captions re-occurring with similar images.

As a result, image encoders arising from such a joint pre-training are especially robust to visual changes that leave semantic meaning unchanged, such as sketches of objects as evaluated in ImageNet-Sketch [@wang_learning_2019; @radford_learning_2021], and are strong on out-of-domain generalization tasks. Yet, this is not always a desired representation, as visualizations in @ghiasi_what_2022 show that these models also group features that are visually dissimilar, but semantically, or literally, alike. This can be mitigated, and overall performance, e.g. in linear probing, can even be improved by combining both image-text and image-image SSL as done in @mu_slip_2022, who combine CLIP and SimCLR [@radford_learning_2021; @chen2020simple].

Recent work has pushed these vision-language systems to larger scales [@ding_cogview_2021; @yuan_florence_2021; @singh_flava_2022; @wang_simvlm_2022; @fang_eva_2022], based on freely available image-caption pairs collected from the internet, such as in [@schuhmann_laion-5b_2022]. These modern SSL models are capable of representing both vision and text, and can be used in a number of applications that are multimodal, from visual-question answering to multimodal generation [@alayrac_flamingo_2022; @li_blip_2022; @nichol_glide_2022; @rao_denseclip_2022].

The future of vision-language pre-training, as an alternative to robust visual representations learned on vision alone, remains to be further explored. While its advantages in vision-language downstream applications are evident [@shen_how_2022; @dou_empirical_2022], shared embedding spaces can also be constructed by training solely the vision encoder first, fixing it, and then training a matching language encoder, as described in [@zhai_lit_2022]. Ultimately, vision-language models are only the first step to self-supervised learning from multiple modalities at scale. Prototypes, such as @reed_generalist_2022, train self-supervised on arbitrary input streams, ranging from vision and text to tables and agent actions, and so learn re-usable representations that are helpful for general tasks.

Building Feature Extractors with Localization for Dense Prediction Tasks
------------------------------------------------------------------------

Aside from semantic understanding, popular computer vision tasks from object detection to segmentation to depth estimation require models which extract localized features, in other words, ones which contain information indicating the locations of objects within the input image. Self-supervised learning may be particularly valuable for these dense prediction tasks since collecting segmentation masks or bounding box annotations for training images is significantly more expensive than classification labels. However, learning frameworks which are carefully tuned on image classification benchmarks may lack traits which are valuable for such dense prediction tasks. Several works, which we note perform their experiments in different settings and on different architectures and learning algorithms, express seemingly contradictory findings, namely that existing self-supervised learning strategies are or are not effective for downstream dense prediction tasks [@goyal2019scaling; @purushwalkam2020demystifying; @zhao2021distilling; @ericsson2021well; @shwartzpre]. We now delve further into this discussion.

**Limitations of self-supervised learners for localization.** SSL approaches which rely on augmented views or jigsaw transformations, such as MoCo [@He2020MomentumCF] and PIRL [@misra2020self], learn occlusion invariance since they are trained with random crops on ImageNet, where foreground objects are often large so that different crops contain different parts of the same object [@purushwalkam2020demystifying]. On the other hand, they lack viewpoint invariance and category-instance invariance. Further, @zhao2021distilling argue that self-supervised learners also lack localization information because the models are able to use all parts of the image, both foreground and background, to make their predictions. The above works conduct experiments principally on convolutional architectures. It is worth noting that @ericsson2021well suggest that the best of the popular SSL algorithms they test, which are CNN-based, can still achieve performance competitive with their supervised learning counterparts in some detection and segmentation settings. Interestingly, older pretext tasks such as `jigsaw` or `colorization`, which predate the recent SSL craze sparked by MoCo and SimCLR, can also achieve performance competitive with supervised learning backbones when the pretext task is made \`\`hard'' enough [@goyal2019scaling].

**CNNs or ViTs?** Recent works suggest that vision transformers (ViTs) contain superior localization information in their learned representations compared to convolutional architectures [@caron2021emerging]. Whereas CNNs require specially designed segmentation pipelines to extract localization information from their features, this information arises naturally in the patchwise features of ViTs. Existing SSL methods designed specifically for transformers confirm that the trained models are effective for downstream detection and segmentation tasks, especially when fine-tuned [@li2021mst; @he2022masked]. However, it should be noted that these SSL algorithms explicitly demand localization in their objective functions, for example via masked autoencoding where patch features should contain information regarding the contents of the corresponding section of the image [@he2022masked]. More recently, masked autoencoding pre-training strategies have been adapted for convolutional architectures to great effect, where they achieve competitive performance on downstream object detection and instance segmentation [@woo2023convnext]. Moreover, we will see below that a variety of pre-training strategies designed specifically for localization can be effective on transformers and convolutional networks alike.

**So how do we learn localized features without annotations?** In order to tailor representations for downstream dense prediction tasks, numerous works propose modifying SSL routines specifically to enhance the localization in their features. Since these SSL pre-training algorithms do not use segmentation or detection annotations, they instead rely on carefully chosen unsupervised object priors.

One style of object prior enforces relationships between features extracted from locations within a single image, just as self-supervised learning procedures often enforce relationships between distinct images. One such prior uses the fact that adjacent ViT patches often contain the same objects. Unlike popular contrastive objectives which encourage augmented views of an image to produce similar features, SelfPatch encourages adjacent patches within a single image to produce similar features [@yun2022patch]. A related method, DenseCL [@wang2021dense], matches the most similar pixel-wise features extracted from augmented samples to automatically handle the case in which augmentations move objects around in an image, and we only want to match features corresponding to the same object. More recently, VICRegL [@bardes2022vicregl] applies a similar principle by combining geometric and learned matching, with a non-contrastive criterion. Just as clustering-based methods cluster related images, Leopart [@ziegler2022self] fine-tunes a pre-trained model to cluster patch-level features.
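The adjacency prior behind SelfPatch-style objectives can be sketched by enumerating neighbouring patch indices in the ViT grid (a toy helper; the function name is ours, and the real method selects neighbours adaptively by feature similarity):

```python
def adjacent_patch_pairs(h, w):
    # Indices of 4-neighbour patch pairs in an h x w grid of ViT patches;
    # each pair is a candidate positive pair for a patch-level objective.
    pairs = []
    for i in range(h):
        for j in range(w):
            idx = i * w + j
            if j + 1 < w:
                pairs.append((idx, idx + 1))  # right neighbour
            if i + 1 < h:
                pairs.append((idx, idx + w))  # bottom neighbour
    return pairs

# A 2x2 grid has 2 horizontal and 2 vertical neighbouring pairs.
assert len(adjacent_patch_pairs(2, 2)) == 4
```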

In addition to modifying the training loss to improve localization, we can also augment the data with this objective in mind by placing an object in multiple settings, so that the resulting models extract the same features from an object irrespective of its location. Instance Localization [@yang2021instance] leverages RoIAlign [@he2017mask], an algorithm designed for object detectors which extracts features corresponding to a specific image patch. To this end, Instance Localization pastes a randomly chosen patch cut from the foreground of one image onto two other images and extracts features corresponding to only the pasted foreground patch, using a contrastive loss to ensure that the foreground patch generates similar features regardless of the background present and regardless of its location within an image. A competing approach estimates the location of an object within the training image using saliency maps, then cuts and pastes these objects onto backgrounds and optimizes a similar objective [@zhao2021distilling]. Instead of using augmentations to move objects around, @purushwalkam2020demystifying note that nearby video frames contain the same object but in different positions or from different viewpoints, so that contrastive learning on video data can serve much the same purpose.

Recently, UP-DETR [@dai2021up] and DETReg [@bar2022detreg] proposed end-to-end SSL pretraining of DETR-family detectors. UP-DETR detects the bounding boxes of randomly selected patch regions in images conditioned on their pixel values while predicting their corresponding SwAV [@caron2020unsupervised] embeddings. In DETReg, detection targets are obtained using the Selective Search algorithm, which does not require human annotations; the detector similarly predicts an associated SwAV [@caron2020unsupervised] embedding for each target bounding box.

**Vision-language models for dense prediction tasks.** In `\Cref{subsec:multimodal}`{=latex}, we saw that vision-language models extract semantically meaningful features. These features are also leveraged by recent works for open-vocabulary object detection [@kamath2021mdetr; @gu2021open; @zareian2021open; @minderer2022simple]. These works leverage vision and language backbones pre-trained as previously discussed on captioned image databases and fine-tune on object detection data. Crucially, pre-trained language models, paired with image feature extractors, allow open-vocabulary object detectors to detect new objects never seen during their fine-tuning stage simply by querying the language model with an appropriate prompt.

Conclusion
==========

*Self-supervised learning* (SSL) has established a new paradigm for advancing machine intelligence. Despite many successes, SSL remains a daunting field with a dizzying array of methods, each with intricate implementations. Due to the fast-moving research and the breadth of SSL methods, navigating the field remains a challenge. This becomes an issue for researchers and practitioners who joined the field only recently, in turn creating a high barrier to entry for SSL research and deployment. We hope our cookbook will help lower these barriers by enabling the curious researcher of any background to navigate the terrain of methods, understand the role of the various knobs, and gain the know-how required to be successful with SSL.

```{=latex}
\newpage
```
```{=latex}
\bibliographystyle{abbrvnat}
```

[^1]: <https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/>

[^2]: <https://github.com/facebookresearch/luckmatters/tree/main/ssl>

[^3]: <https://generallyintelligent.ai/blog/2020-08-24-understanding-self-supervised-contrastive-learning/>

[^4]: <https://github.com/vturrisi/solo-learn>

[^5]: <https://github.com/facebookresearch/FFCV-SSL>
