---
abstract: |
  One unexpected technique that emerged in recent years consists of training a Deep Network (DN) with a Self-Supervised Learning (SSL) method, and using this network on downstream tasks but *with its last few projector layers entirely removed*. This *trick of throwing away the projector* is actually critical for SSL methods to achieve competitive performance on ImageNet, where more than $30$ percentage points can be gained that way. This is a little vexing, as one would hope that the network layer at which invariance is explicitly enforced by the SSL criterion during training (the last projector layer) should be the one to use for best generalization performance downstream. But it seems not to be, and this study sheds some light on why. This trick, which we name Guillotine Regularization (GR), is in fact a generically applicable method that has been used to improve generalization performance in transfer learning scenarios. In this work, we identify the underlying reasons behind its success and show that the optimal layer to use might change significantly depending on the training setup, the data, or the downstream task. Lastly, we give some insights on how to reduce the need for a projector in SSL by aligning the pretext SSL task with the downstream task.
author:
- |
  `\name `{=latex}Florian Bordes `\email `{=latex}florian.bordes\@umontreal.ca\
  `\addr `{=latex}Meta AI Research\
  Mila, Université de Montréal `\AND`{=latex} `\name `{=latex}Randall Balestriero\
  `\addr `{=latex}Meta AI Research `\AND`{=latex} `\name `{=latex}Quentin Garrido\
  `\addr `{=latex}Meta AI Research\
  Université Gustave Eiffel,\
  CNRS, LIGM `\AND`{=latex} `\name `{=latex}Adrien Bardes\
  `\addr `{=latex}Meta AI Research\
  Inria `\AND`{=latex} `\name `{=latex}Pascal Vincent\
  `\addr `{=latex}Meta AI Research\
  Mila, Université de Montréal\
  CIFAR
bibliography:
- refs.bib
title: 'Guillotine Regularization: Why removing layers is needed to improve generalization in Self-Supervised Learning'
---

```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\gr}{Guillotine Regularization }
```
```{=latex}
\newcommand{\pvcomment}[1]{{\color{red} [P.V.: {#1} ]}}
```
```{=latex}
\newcommand{\pvreplace}[2]{{\color{lightgray}\sout{#1} \color{red}{#2}}}
```
```{=latex}
\newcommand{\change}[2]{{\color{blue}\sout{#1} \color{blue}{#2}}}
```
```{=latex}
\newcommand{\changenew}[2]{{\color{red}\sout{#1} \color{red}{#2}}}
```
```{=latex}
\newcommand{\fbcomment}[1]{{\color{lightgray} [F.B.: {#1} ]}}
```
```{=latex}
\newcommand\red[1]{\textcolor{red}{#1}}
```
```{=latex}
\newcommand{\florian}[1]{{\color{blue}[Florian: #1]}}
```
```{=latex}
\newcommand{\fix}{\marginpar{FIX}}
```
```{=latex}
\newcommand{\new}{\marginpar{NEW}}
```
```{=latex}
\def\month{05}
```
```{=latex}
\def\year{2023}
```
```{=latex}
\def\openreview{\url{https://openreview.net/forum?id=ZgXfXSz51n}}
```
```{=latex}
\maketitle
```
Introduction
============

Many recent self-supervised learning (SSL) methods consist of learning invariances to specific chosen relations between samples -- implemented through data augmentations -- while using a regularization strategy to avoid collapse of the representations [@chen2020simclr; @chen2020mocov2; @grill2020byol; @lee2021cbyol; @caron2020swav; @zbontar2021barlow; @bardes2016vicreg; @tomasev2022relicv2; @caron2021dino; @chen2021mocov3; @li2022esvit; @zhou2022ibot; @zhou2022mugs]. Incidentally, SSL frameworks also heavily rely on a simple trick to improve downstream task performance: *removing the last few layers of the trained deep network*, as depicted in Figure `\ref{fig:cartoon}`{=latex}. From a practical viewpoint, this technique emerged naturally [@chen2020simclr] through the search for ever-increasing SSL performance. In fact, on ImageNet [@deng2009imagenet], this technique can improve classification performance by around 30 percentage points (Figure `\ref{fig:intro_plot}`{=latex}).

Although it improves performance in practice, not using the layer on which the SSL training criterion was applied is unfortunate. It means throwing away the representation that was explicitly trained to be invariant to the chosen set of data augmentations, thus breaking the implied promise of using a more structured, controlled, invariant representation. By instead picking a representation produced an arbitrary number of layers earlier, SSL practitioners end up relying on a representation that likely contains much more information about the input [@RCDM] than should be necessary to robustly solve downstream tasks.

Although the use of this technique emerged independently in SSL, using intermediate layers of a neural network -- instead of the deepest layer where the initial training criterion was applied -- has long been known to be useful in transfer learning scenarios [@features_transfert]. Features in earlier layers often appear more general and transferable to various downstream tasks than those at the deepest layers, which are too specialized towards the initial training objective. This strongly suggests a related explanation for its success in SSL: does removing the last layers of a trained SSL model improve performance *because of a misalignment between the SSL training task (source domain) and the downstream task (target domain)?*

In this paper, we examine that question thoroughly. We first place the SSL *trick of removing the projector post-training* under the umbrella of a generically applicable method that we call **Guillotine Regularization**. We argue that it is important to distinguish the action of removing layers during evaluation from architecture modifications, because the optimal layer to use for a given downstream task is not always the backbone and could be an intermediate projector layer. Then, we explore how changes in the training optimization, training data, and downstream task impact the optimal layer in both supervised and self-supervised settings. Lastly, we demonstrate that increasing the *alignment* between the pretext and downstream tasks in SSL decreases the need for a projector in SSL.
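
The trick can be stated concretely as evaluating the network with its last few layers removed. Below is a minimal NumPy sketch of this idea; the architecture, dimensions, and random weights are purely illustrative and not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(dim_in, dim_out):
    """One random fully connected layer with ReLU (illustrative weights)."""
    W = rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_in)
    return lambda x: np.maximum(x @ W, 0.0)

# A trunk (backbone) followed by a 3-layer MLP head, as in Figure 1a.
trunk = [layer(32, 64), layer(64, 64)]
head = [layer(64, 128), layer(128, 128), layer(128, 128)]
network = trunk + head

def features(x, n_guillotined=0):
    """Forward pass, removing the last `n_guillotined` layers at evaluation time."""
    kept = network[:len(network) - n_guillotined]
    for f in kept:
        x = f(x)
    return x

x = rng.standard_normal((4, 32))
z_head = features(x)                              # where the SSL loss was applied
z_trunk = features(x, n_guillotined=len(head))    # backbone representation used downstream
```

During SSL training the loss is computed on `z_head`; Guillotine Regularization consists of discarding the head at evaluation time and feeding `z_trunk` to the downstream task instead.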

```{=latex}
\centering
```
![a) An illustration of the projector head trick used in SSL. During training, a small neural network named the *Head* (also coined the *projector* in the SSL literature [@chen2020simclr]) is added on top of another deep network referred to as the *Trunk*. This *Head* can be viewed as a buffer between the training loss and the Trunk that can absorb any bias related to an imperfect optimization. When using such a network on downstream tasks, we throw away the Head. b) We measure with linear probes the accuracy at different layers of a ResNet-50 (as Trunk) on which we added a small 3-layer MLP (as Head), for various supervised and self-supervised methods (see Figure `\ref{fig:acc_VIT}`{=latex} for vision transformers). For each method, we show the mean and standard deviation across 3 runs (the standard deviation across runs is low). With traditional supervised learning, there is a significant drop in performance when using the trunk layer instead of the last projector layer. However, for self-supervised methods, the gap in performance between linear probes trained at the trunk and at the projector can be as high as 30 percentage points.](images/Guillotine_reg_new2.png "fig:"){#fig:cartoon_plot width="\\linewidth"} `\subcaption{}`{=latex} `\label{fig:cartoon}`{=latex}

![a) An illustration of the projector head trick used in SSL. During training, a small neural network named the *Head* (also coined the *projector* in the SSL literature [@chen2020simclr]) is added on top of another deep network referred to as the *Trunk*. This *Head* can be viewed as a buffer between the training loss and the Trunk that can absorb any bias related to an imperfect optimization. When using such a network on downstream tasks, we throw away the Head. b) We measure with linear probes the accuracy at different layers of a ResNet-50 (as Trunk) on which we added a small 3-layer MLP (as Head), for various supervised and self-supervised methods (see Figure `\ref{fig:acc_VIT}`{=latex} for vision transformers). For each method, we show the mean and standard deviation across 3 runs (the standard deviation across runs is low). With traditional supervised learning, there is a significant drop in performance when using the trunk layer instead of the last projector layer. However, for self-supervised methods, the gap in performance between linear probes trained at the trunk and at the projector can be as high as 30 percentage points.](images/valiod_acc_fig3.png "fig:"){#fig:cartoon_plot width="\\linewidth"} `\subcaption{}`{=latex} `\label{fig:intro_plot}`{=latex}
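
The per-layer evaluation in the figure can be sketched as follows: freeze the network, collect the representation after each layer, and fit a linear probe on each. This toy NumPy version uses a least-squares probe on random data as a stand-in for the logistic-regression probes used in the actual experiments; all dimensions and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_probe_accuracy(feats, labels):
    """Fit a linear map to one-hot labels by least squares; return train accuracy."""
    onehot = np.eye(labels.max() + 1)[labels]
    W, *_ = np.linalg.lstsq(feats, onehot, rcond=None)
    return float((np.argmax(feats @ W, axis=1) == labels).mean())

# Toy frozen network: random ReLU layers standing in for trunk + head.
dims = [16, 32, 32, 8]
layers = [rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
          for d_in, d_out in zip(dims[:-1], dims[1:])]

x = rng.standard_normal((200, 16))
y = rng.integers(0, 4, size=200)

# Probe the representation after every layer, as in Figure 1b.
h = x
accs = []
for i, W in enumerate(layers):
    h = np.maximum(h @ W, 0.0)
    accs.append(linear_probe_accuracy(h, y))
    print(f"layer {i}: probe accuracy = {accs[-1]:.2f}")
```

Comparing `accs` across depths is exactly the experiment reported in the figure: the depth with the best probe accuracy is the optimal layer at which to guillotine.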

To summarize, this paper's main contributions are the following:

-   Since the optimal layer to use in self-supervised learning might not always be the backbone, we suggest naming the action of removing layers a general method, Guillotine Regularization, to distinguish it from the architectural modification of adding a projector.

-   We show through experiments that the optimal layer at which to cut depends heavily on the training optimization, training data, and downstream task for both supervised and self-supervised models. We hope that this result will encourage the research community to run more systematic evaluations across different layers.

-   We show that the need for Guillotine Regularization in SSL depends heavily on how the positive views are defined. When these views are aligned with the downstream task, the optimal layer to use becomes closer to the last layer.

```{=latex}
\vspace{-0.1cm}
```
Related work
============

```{=latex}
\vspace{-0.1cm}
```
#### Self-supervised learning

Many recent works on self-supervised learning [@chen2020simclr; @chen2020mocov2; @grill2020byol; @lee2021cbyol; @caron2020swav; @zbontar2021barlow; @bardes2016vicreg; @tomasev2022relicv2; @caron2021dino; @chen2021mocov3; @li2022esvit; @zhou2022ibot; @zhou2022mugs] rely on the addition of a few nonlinear layers (an MLP) -- termed the *projection head* -- on top of a well-established neural network -- termed the *backbone* -- during training. This addition is done regardless of the neural network used as backbone, be it a ResNet-50 [@he2016resnet] or a Vision Transformer [@dosovitskiy2021vit]. After training, the projector is usually thrown away and the model is evaluated using the backbone representation. Even though @chen2020simclrv2 demonstrated that the optimal layer to use might not always be the backbone when few labelled data are available, most recent works introducing new SSL methods have continued to use only the backbone for evaluation. Some works have also tried to understand why a projection head is needed in self-supervised learning. @appalaraju2020good argue that the nonlinear projection head acts as a filter that can separate the information used for the downstream task from the information useful for the contrastive loss. To support this claim, they used deep image prior [@ulyanov2018dip] to perform feature inversion and visualize the features at the backbone level as well as at the projector level. They observe that features at the backbone level seem visually more suitable for a downstream classification task than those at the projector level. Another related work [@RCDM] similarly maps representations back to the input space, this time using a conditional diffusion generative model. The authors present visual evidence confirming that much of the information about a given input is lost at the projector level while most of it is still present at the backbone level. Another line of work tries to train self-supervised models without a projector. @jing2022directclr show that removing the projector and cutting the representation vector in two parts -- such that the SSL criterion is applied to the first part while no criterion is applied to the second -- considerably improves performance compared to applying the SSL criterion directly on the entire representation vector. This, however, works mostly thanks to the residual connections of the ResNet. In contrast with these approaches, our work focuses on identifying which components of traditional SSL training pipelines explain why performance at the final layers of the network is so much worse than at the backbone level. This identification will be key for designing future SSL setups in which generalization performance does not drop drastically when using the embedding that the SSL criterion actually learns.
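
The splitting scheme attributed to @jing2022directclr above amounts to a simple slice of the representation vector. As a minimal sketch (the dimensions here are arbitrary, not those of the cited work):

```python
import numpy as np

def split_representation(z, d0):
    """Split a batch of representations along the feature axis:
    the SSL criterion is applied only to the first d0 dimensions,
    while the remaining dimensions are left unconstrained."""
    return z[:, :d0], z[:, d0:]

# A batch of 8 representations of dimension 64; apply the loss to 16 of them.
z = np.random.default_rng(0).standard_normal((8, 64))
z_loss, z_free = split_representation(z, d0=16)
```

Only `z_loss` would enter the SSL loss; `z_free` is trained purely through shared earlier layers, which is why the ResNet's residual connections matter for this scheme to work.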

#### Transfer learning

The idea of using the intermediate layers of a neural network is very well known in the transfer learning community. Works like Deep Adaptation Network [@dan_network] freeze the first layers of a neural network and fine-tune the last layers while adding a head specific to each target domain. The justification for this strategy is that deep networks learn general features [@caruna_1994; @bengio_transfer; @bengio_ood], especially in the first layers, that may be reused across different domains [@features_transfert]. @oquab2014learning demonstrate that when a limited amount of training data is available for the target task, using frozen features extracted from the intermediate layers of a deep network trained on classification can help solve object and action classification tasks on other datasets. Another line of work on training with random or noisy labels also studied how the use of intermediate layers significantly improves downstream performance [@Maennel2020], while @Baldock2021DeepLT introduced a measure of example difficulty that leverages the number of intermediate layers aligned towards a given prediction. In this paper, we show that SSL-trained models fall under the realm of transfer learning; consequently, we can expect all the observations made in the transfer learning literature about intermediate layers to also hold for SSL. When viewing the SSL projector trick and the removal of layers for transfer as one general machine learning technique to improve generalization, it is no longer surprising that works such as @wang2022 [@sar] have shown that adding a projector can also be highly beneficial for supervised training.

#### Out of distribution (OOD) generalization

@lastlayerretraining demonstrate that retraining only the last layer with a specific reweighting helps \"forget\" the spurious correlations that were learned during training. This work emphasizes that most of the spurious correlations due to the training objective are contained in the last layers of the network; retraining those layers is thus essential to remove such correlations and generalize better on downstream tasks. Similarly, @domain_adjusted_ERM show that retraining only the last layers is most of the time as good as retraining the entire network on a subset of downstream tasks. Lastly, @head_toe demonstrate the usefulness of intermediate layers for OOD generalization. Our study also confirms that `\gr `{=latex}exhibits important properties with respect to OOD generalization.

```{=latex}
\vspace{-0.1cm}
```
Guillotine Regularization: A regularization scheme to improve generalization of deep networks
=============================================================================================

```{=latex}
\vspace{-0.1cm}
```
In this section, we provide a definition of Guillotine Regularization. Then, through experiments, we show that the optimal layer to use changes significantly depending on several factors. Finally, we show that the performance at a given layer is not always correlated with the performance obtained at another layer.

(Re)Introducing Guillotine Regularization From First Principles {#sec:IT}
---------------------------------------------------------------

We distinguish between a **source** *training task* with its associated training set, and a **target** *downstream task* with its associated dataset[^1]. It is the performance on the downstream task that is ultimately of interest. In the simplest of cases both tasks could be the same, with their datasets sampled i.i.d. from the same distribution. But more generally they may differ, as in SSL or transfer learning scenarios. In SSL we typically have an *unsupervised* training task, that uses a training set with no labels, while the downstream task can be a supervised classification task. Also note that while the bulk of training the model's parameters happens with the training task, transferring to a different downstream task will require some additional, typically lighter, training, at least of a final layer specific for that task. In our study we will focus on the use of a representation computed by the network trained on the training task and then frozen, which gets fed to a simple linear layer that will be tuned for the downstream task. This \"linear evaluation\" procedure is typical in SSL and aims to evaluate the quality/usefulness of an unsupervised-trained *representation*. Our focus is to ensure good generalization to the downstream task. Note that training and downstream tasks may be misaligned in several different ways.

```{=latex}
\newcommand{\trunk}{\ensuremath{f_\theta^{1:t}}}
```
```{=latex}
\newcommand{\trunkopt}{\ensuremath{f_{\hat{\theta}}^{1:t}}}
```
```{=latex}
\newcommand{\head}{\ensuremath{f_\phi^{t+1:L}}}
```
```{=latex}
\newcommand{\newhead}{\ensuremath{s_{w}}}
```
```{=latex}
\newcommand{\newheadopt}{\ensuremath{s_{\hat{w}}}}
```
```{=latex}
\newcommand{\f}{f_{\theta,\phi}}
```
Informally, `\gr `{=latex}consists in the following: for the *downstream task*, rather than using the last layer (layer $L$) representation from the network trained on the *training task*, instead use the representation from a few layers above (layer $t$, with $t<L$). We thus *remove* a small multilayer \"head\" (layers $t+1$ to $L$) of the initially trained network, hence the name of the technique. We call the remaining part (layers 1 to $t$) the *trunk*[^2].

Formally, we consider a deep network that takes an input $X$ and computes a sequence of intermediate representations $H_1, \ldots, H_L$ through layer functions $f^{(1)}, \ldots f^{(L)}$ such that $H_\ell = f^{(\ell)}(H_{\ell - 1})$, starting from $H_0=X$. The entire computation from input $X$ to last layer representation $H_L$ is thus a composition of layer functions[^3]:

$$
H_{L} = \f(X)= (
\underbrace{f^{(L)} \circ \dots \circ f^{(t+1)}}_{\text{head } \head} \circ 
\underbrace{f^{(t)} \circ \dots \circ f^{(1)}}_{\text{trunk } \trunk} )(X)$$ The parameters $\theta$ and $\phi$ of trunk $\trunk$ and head $\head$ are then trained on the entire training set of examples $\mathbf{X}^{\rm source}$ of the training task (optionally with associated targets $\mathbf{Y}^{\rm source}$ that we may have in transfer scenarios, but will typically be absent in SSL), to minimize the training task objective $L^{\rm source}$: $$\hat{\theta}, \hat{\phi} = \argmin_{\theta, \phi} L^{\rm source}(\head(\trunk(\mathbf{X}^{\rm source})), \mathbf{Y}^{\rm source})$$ Then the multilayer head $\head$ is discarded, we add to the trunk a (usually shallow) new head $\newhead$ and we train its parameters $w$, using the training set of examples for the downstream task $(\mathbf{X}^{\rm target}, \mathbf{Y}^{\rm target})$, to minimize the downstream task objective $L^{\rm target}$: $$\hat{w}  = \argmin_{w} L^{\rm target}(\newhead(\underbrace{\trunkopt(\mathbf{X^{\rm target}})}_{\text{representation }\mathbf{H^{\rm target}}}), \mathbf{Y}^{\rm target})$$
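The two-stage procedure above can be sketched in a few lines. The following is a minimal numpy illustration on a toy random network, where the layer sizes, the least-squares readout and the choice $t = L - 2$ are illustrative assumptions rather than our actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Toy stand-ins for the trained layer functions f^{(1)}, ..., f^{(L)}.
L, d = 5, 8
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]

def trunk(x, t):
    """Forward pass through layers 1..t, i.e. the trunk f^{1:t}."""
    h = x
    for W in weights[:t]:
        h = relu(h @ W)
    return h

# Guillotine Regularization: discard the head f^{t+1:L} (here the last two
# layers) and keep the layer-t representation H_t instead of H_L.
X_target = rng.standard_normal((16, d))
H_target = trunk(X_target, t=L - 2)

# Fit a new shallow head s_w on the frozen representation (here a simple
# least-squares linear readout on a toy downstream target).
y_target = rng.standard_normal((16, 1))
w, *_ = np.linalg.lstsq(H_target, y_target, rcond=None)
pred = H_target @ w
```

In practice the trunk is a large pretrained network and the new head is trained by gradient descent, but the structure of the computation is the same.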

```{=latex}
\centering
```
![Training a linear regression to predict latent variables from pooled intermediate representations of a network trained with a self-supervised objective (using SimCLR) or a supervised objective (trained to predict 3D rotations of an object). The data used consists of renderings of 3D objects from 3D Warehouse [@3dwarehouse], where we control the floor, lighting and object pose with latent variables; see samples on the right. The dimension of the intermediate representations increases throughout the layers and is kept constant in the head, if there is one. In the supervised setting, when looking at the validation mean squared error for object rotation prediction, the lowest error is obtained with the linear probe at the last layer of the network. In contrast, the lowest errors for other attributes, like the Spot $\theta$ prediction, are obtained with linear probes located 3, 4 or 5 layers before the output of the network. In the self-supervised setting, we also see that the projector is responsible for a lot of the invariance to augmentation, and that the information is most easily retrievable before it. These results highlight the need to use `\gr `{=latex}, i.e., removing the last layers of the neural network, to generalize better on other tasks.](images/simclr_intermediate.png "fig:"){#fig:object_dataset width="39%"} `\hfill`{=latex} ![Training a linear regression to predict latent variables from pooled intermediate representations of a network trained with a self-supervised objective (using SimCLR) or a supervised objective (trained to predict 3D rotations of an object). The data used consists of renderings of 3D objects from 3D Warehouse [@3dwarehouse], where we control the floor, lighting and object pose with latent variables; see samples on the right. The dimension of the intermediate representations increases throughout the layers and is kept constant in the head, if there is one. 
In the supervised setting, when looking at the validation mean squared error for object rotation prediction, the lowest error is obtained with the linear probe at the last layer of the network. In contrast, the lowest errors for other attributes, like the Spot $\theta$ prediction, are obtained with linear probes located 3, 4 or 5 layers before the output of the network. In the self-supervised setting, we also see that the projector is responsible for a lot of the invariance to augmentation, and that the information is most easily retrievable before it. These results highlight the need to use `\gr `{=latex}, i.e., removing the last layers of the neural network, to generalize better on other tasks.](images/resnet_intermediate.png "fig:"){#fig:object_dataset width="39%"} `\hfill`{=latex} ![Training a linear regression to predict latent variables from pooled intermediate representations of a network trained with a self-supervised objective (using SimCLR) or a supervised objective (trained to predict 3D rotations of an object). The data used consists of renderings of 3D objects from 3D Warehouse [@3dwarehouse], where we control the floor, lighting and object pose with latent variables; see samples on the right. The dimension of the intermediate representations increases throughout the layers and is kept constant in the head, if there is one. In the supervised setting, when looking at the validation mean squared error for object rotation prediction, the lowest error is obtained with the linear probe at the last layer of the network. In contrast, the lowest errors for other attributes, like the Spot $\theta$ prediction, are obtained with linear probes located 3, 4 or 5 layers before the output of the network. In the self-supervised setting, we also see that the projector is responsible for a lot of the invariance to augmentation, and that the information is most easily retrievable before it. 
These results highlight the need to use `\gr `{=latex}, i.e., removing the last layers of the neural network, to generalize better on other tasks.](images/samples_shapenet.png "fig:"){#fig:object_dataset width="17.5%"} `\vspace{-0.1cm}`{=latex}

An empirical analysis of situations in which cutting layers is beneficial
-------------------------------------------------------------------------

There are several situations that can create a misalignment between a training and a downstream task. Here we name a few:

**Misalignment between the training (source) and downstream (target) task while using the same input data distribution.** The potential effectiveness of GR for transfer is not surprising, since this technique has been used for years in the transfer learning literature [@features_transfert] to improve generalization across different tasks. As a simple illustration, Figure `\ref{fig:object_dataset}`{=latex} shows how much performance on a given task can vary depending on which layer is chosen as feature extractor. In this figure, we used an artificially created object dataset in which we are able to play with different factors of variation. The dataset consists of renderings of 3D models from 3D Warehouse [@3dwarehouse]. Each scene is built from a 3D object, a floor and a spotlight placed above the object. This allows us to control every factor of variation and produce complex transformations of the scene. We vary the rotation of the object, defined as a quaternion, the hue of the floor, and the spot hue as well as its position on a sphere using spherical coordinates. We provide more details on the dataset and rendering samples in the appendix. We observe in Figure `\ref{fig:object_dataset}`{=latex} that when training a supervised model on the object rotation prediction task and evaluating the linear probe on the same task across different layers, the best results are obtained at the last layer. However, when using the same frozen neural network to predict other attributes like the Spot $\theta$, the best performances are obtained a few layers before the last one. Similarly, when training with a self-supervised objective (SimCLR), we can see that the different factors of variation are most easily retrievable before the projector. This means that representations before the projector will be more versatile, as they contain information that is removed by the pretraining task. For example, if the downstream task is to predict the rotation, the representation at block 4 will be optimal, while if the downstream task is to predict the spot hue, the representation at block 3 will be optimal. Such results highlight the need to use `\gr `{=latex}when there is a shift in the prediction task; moreover, the optimal layer depends on the downstream task.
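The layer-wise probing protocol used in Figure `\ref{fig:object_dataset}`{=latex} can be sketched as follows. This toy numpy version replaces the trained network and scene latents with random stand-ins, and fits a least-squares linear probe on the frozen representation at every depth:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

# Random stand-ins for a trained network and a scene latent (e.g. Spot theta).
depth, d, n = 4, 16, 200
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal((d, 1))

# Fit a linear probe at every layer and record its mean squared error.
errors = {}
h = X
for t in range(1, depth + 1):
    h = relu(h @ weights[t - 1])               # representation at layer t
    w, *_ = np.linalg.lstsq(h, y, rcond=None)  # least-squares linear probe
    errors[f"layer_{t}"] = float(np.mean((h @ w - y) ** 2))

best_layer = min(errors, key=errors.get)
```

With a real pretrained network and held-out data, comparing `errors` across layers is exactly how the optimal layer for a given latent variable is identified.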

```{=latex}
\centering
```
![Training setups](images/fig_training.png){#fig:exp_supervised_optimization width="\\linewidth"}

![Random split](images/supervised_random_splits_new.png){#fig:exp_supervised_transfert_250 width="\\linewidth"}

![Downstream tasks](images/fig_downstream.png){#fig:supervised_downstreams width="\\linewidth"}

**Misalignment due to a badly optimized network.** It can be expected that the optimal layer for training a downstream readout function might differ depending on how much the pretrained network overfits the pretext task.

```{=latex}
\begin{wrapfigure}{r}{0.4\textwidth}
    \centering
    \includegraphics[width=0.27\textwidth]{images/fig_downstream_simclr.pdf}
    \caption{SimCLR: Linear probe accuracy on several downstream tasks. \textbf{The optimal layer to cut is not the same for different downstream tasks.}}
    \label{fig:ssl_downstream}
\end{wrapfigure}
```
To test this hypothesis, we train a headed supervised Resnet50 on ImageNet with two different types of optimization. The first uses only AdamW with a small learning rate of $10^{-4}$ and no additional regularization. The second uses SGD with the recommended hyper-parameters for supervised training (cyclic learning rate, weight decay and momentum). In Figure `\ref{fig:exp_supervised_optimization}`{=latex}, we observe that the AdamW-trained network, which overfits the classification task, has readout performances that are very close across different layers. However, when looking at the well-regularized SGD model, which does not overfit the task, the readout performances vary significantly across layers. In a second experiment, we study the effect of overfitting in more depth by training the Resnet50 on only a random subset of 250 classes. We then use the remaining 750 classes as an OOD validation set, split randomly into further subsets of 250 classes. In Figure `\ref{fig:exp_supervised_transfert_250}`{=latex}, we clearly see that the training readout overfits the training set, while the readout performances across layers are similar on the corresponding in-distribution validation set (similar to the previous experiment over the full ImageNet). We then train linear probes over the OOD splits and observe that the performances are radically different from the in-distribution validation set. In fact, in this instance the best layer to use for each of these splits is the backbone layer, whereas the best layer for the in-distribution split is the projector layer. **This result highlights that the optimal layers to discard can vary depending on the optimization technique and downstream data distribution, even when the same training objective is used.**

**Misalignment between the training and downstream tasks while using different data distributions.** When using a pretrained model to predict new classes, there is a bias in the data distribution as well as in the fine-tuning objective (with respect to the training setting). In a first experiment, shown in Figure `\ref{fig:supervised_downstreams}`{=latex}, we train a supervised Resnet50 on ImageNet. We then freeze the weights of the model and train a linear probe at different layers over ImageNet [@deng2009imagenet], CIFAR10 [@cifar10], Place205 [@place205], CLEVR [@johnson2017clevr] and Eurosat [@Eurosat]. We observe that the readout performances on ImageNet are best at the last layer, but for datasets like CLEVR or Place205 the best performances are obtained at the second projector layer. In Figure `\ref{fig:ssl_downstream}`{=latex}, we perform the same experiment using SimCLR. In this instance, the best performances for ImageNet are obtained at the backbone, whereas the best performances for Eurosat, CLEVR and ImageNet 1% are obtained at the first projector layer. **This result challenges the common practice of discarding the entire projector in SSL, since the layers to cut depend on the downstream task.**

**Misalignment between the training input data distribution and the testing input data distribution while using the same training and downstream tasks.** Another type of bias can arise when the model is evaluated on a data distribution that differs from the one seen during training.

```{=latex}
\begin{wraptable}{r}{0.4\textwidth} 
    \begin{tabular}{c|c|c|c|c}
         Head 3 & Head 2 & Head 1 & Trunk  \\
         \hline
         59.0 & 58.8 & \textbf{58.0} & 63.3 \\
    \end{tabular}
    \caption{ImageNet-C mCE (unnormalized) across layers.}
    \label{tab:imagenet_c}
\end{wraptable}
```
This scenario is often referred to as Out-of-Distribution (OOD), since the distribution of the data seen by the model differs from the one seen during training. We took the supervised model trained on ImageNet, along with the linear probes trained at different layers, and evaluated these readouts on ImageNet-C [@hendrycks2019robustness], a modified version of the ImageNet validation set to which different data transformations were applied. Our experiment in Table `\ref{tab:imagenet_c}`{=latex} demonstrates that the performances are better after cutting two layers from the head of the network, which highlights that it might be good practice to probe intermediate representations when evaluating on OOD tasks.

The readout performances at the projector and backbone level are not always correlated
--------------------------------------------------------------------------------------

```{=latex}
\centering
```
```{=latex}
\footnotesize
```
```{=latex}
\centering
```
![SimCLR](images/hyper_params_simclr.png){width="\\linewidth"}

```{=latex}
\centering
```
![Byol](images/hyper_params_byol.png){width="\\linewidth"}

`\label{fig:hyper_parameters_comp}`{=latex} `\centering`{=latex} `\footnotesize`{=latex}

```{=latex}
\centering
```
![Barlow Twins](images/barlowtwins_adamw_lr.png){width="\\linewidth"}

```{=latex}
\centering
```
![Byol](images/byol_robust_lr.png){width="\\linewidth"}

In Figure `\ref{fig:comp_hyper_params_small}`{=latex}, we study the effect of `\gr `{=latex}with respect to a hyper-parameter grid search for various SSL methods (SimCLR, Barlow Twins and Byol). When looking at the performances on ImageNet using a linear probe at the backbone level, one observes an almost stable classification performance across different hyper-parameters, such as the SimCLR temperature or the Barlow Twins and Byol learning rates, while the corresponding performances at the projector level change significantly. This highlights that the performances at the projector level are not always correlated with the performances at the backbone level. In consequence, knowing the performance of a linear probe at the projector level cannot reliably predict the performance at the backbone level.
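To make the role of the temperature hyper-parameter concrete, here is a minimal numpy sketch of an InfoNCE-style contrastive loss in the spirit of SimCLR's NT-Xent. It is simplified: each first view is contrasted only against the batch of second views, whereas the actual SimCLR loss also uses the other augmented views in the batch as negatives:

```python
import numpy as np

rng = np.random.default_rng(2)

def nt_xent(z1, z2, temperature):
    """InfoNCE-style loss over a batch of positive pairs (z1[i], z2[i])."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # cosine similarities / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))     # positives on the diagonal

# Toy embeddings: the second "view" is a noisy copy of the first.
z1 = rng.standard_normal((32, 8))
z2 = z1 + 0.1 * rng.standard_normal((32, 8))

loss_low_t = nt_xent(z1, z2, temperature=0.1)
loss_high_t = nt_xent(z1, z2, temperature=1.0)
```

Scanning `temperature` over a grid with such a loss is the kind of hyper-parameter search whose effect on backbone-level versus projector-level probes is compared in the figure.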

```{=latex}
\vspace{-0.1cm}
```
Reducing the Need for a Projector in Self-Supervised Learning by increasing the alignment with the downstream task
==================================================================================================================

```{=latex}
\vspace{-0.1cm}
```
Self-Supervised Learning is often considered a distinct learning paradigm, in between supervised and unsupervised learning. In reality, the distinction is not as sharp, and much of SSL can be understood as solving a pretext task akin to a supervised task [@wu2018discrimination; @sup_contrastive], merely with pseudo-labels obtained by means other than human annotation. In this section, we show that different data selection processes in SSL influence the alignment between the downstream and pretext task, which heavily impacts the need for a projector head in SSL.

```{=latex}
\centering
```
![`\small `{=latex}Difference in linear probing accuracy between the projector and backbone representations for different alignments with respect to the classification downstream task. In this experiment we used SimCLR and changed how the positive pairs are defined so as to better align with a classification downstream task. In blue, our baseline: we trained SimCLR with the traditional SSL data augmentations, which define the positive views as two augmentations of the same image. In orange, we define the positive pair as two nearest neighbors under a pretrained model's embedding (while using the same data augmentations as the baseline). In green, we use supervised class labels to define the positive examples. In this scenario, SimCLR should learn to produce similar embeddings for all images belonging to a given class. All three models are trained on ImageNet (IN1K); we then evaluate them with a linear probe across a wide range of downstream tasks at the backbone and projector level and show the difference in accuracy between both. **When the difference is positive, the accuracy at the backbone level is higher than the one at the projector level, highlighting the benefits of Guillotine Regularization. In contrast, when the difference is negative, the accuracy at the projector level is higher than the one at the backbone level. In this instance, Guillotine Regularization is not needed.** When positive pairs are defined as belonging to a given class, there is no misalignment with the ImageNet classification downstream task. Thus on ImageNet-1K, ImageNet1k-10P (10% of the training set to train the linear probe) and ImageNet1k-1P (1% of the training set to train the linear probe), we observe that the performances at the projector level are much higher than the ones at the backbone level. Interestingly, the nearest neighbors heuristic considerably reduces the impact of Guillotine Regularization across several downstream tasks. 
](images/diff_projector_acc.png){#fig:sup_NN_SSL width="\\linewidth"}

To confirm the hypothesis that SSL methods need a projector because of a misalignment between the pretext and downstream task, we have to verify that reducing this misalignment results in reducing the performance gap between the Trunk and Head representations. Ideally, we would like to get close to the supervised scenario in Figure `\ref{fig:cartoon_plot}`{=latex}, for which the optimal readout function is obtained at the last layer. To do so, we devise two experimental setups in which we replace the traditional data augmentation pipeline used in SSL, which consists of applying handcrafted augmentations to each image to create a set of pairwise positive samples.

```{=latex}
\centering
```
![SimCLR trained with SSL augmentations.](images/Simclr_ssl_RCDM.png){width="0.96\\linewidth"}

![SimCLR trained with class labels.](images/Simclr_sup_RCDM.png){width="\\linewidth"}

**In the first setup, while using the exact same SSL criterion (SimCLR), we use as positive examples pairs of images that belong to the same class, and as negative examples images that do not belong to the same class.** Note that the SSL training criterion will push towards a collapse, in representation space, of all the images belonging to the same class, while pushing the different class clusters further apart. By doing so, the SSL training objective becomes perfectly aligned with the downstream classification task, despite using an SSL training criterion instead of a traditional cross-entropy loss.

**In the second setup, we use as positive pairs the closest neighbors found by a pretrained SSL model trained with the traditional SSL handcrafted data augmentation pipeline.** The reasoning is that if, instead of considering each image of the dataset as its own specific class, we use clusters of many images to define the positive pairs, we might be able to close the gap with respect to a supervised baseline without the need for labels.
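The two positive-pair selection strategies can be sketched as follows, with random embeddings and labels as stand-ins for a pretrained model's representation and for human annotations:

```python
import numpy as np

rng = np.random.default_rng(3)

n, d = 60, 10
embeddings = rng.standard_normal((n, d))  # stand-in for pretrained embeddings
labels = rng.integers(0, 5, size=n)       # stand-in for class labels

def class_based_positive(i):
    """First setup: a positive is any other image with the same label."""
    candidates = np.flatnonzero((labels == labels[i]) & (np.arange(n) != i))
    return int(rng.choice(candidates))

def nearest_neighbor_positive(i):
    """Second setup: the positive is the closest other image in embedding space."""
    dist = np.linalg.norm(embeddings - embeddings[i], axis=1)
    dist[i] = np.inf  # exclude the anchor itself
    return int(np.argmin(dist))

j_class = class_based_positive(0)
j_nn = nearest_neighbor_positive(0)
```

In both cases the selected index replaces the usual "second augmented view" of the anchor image when forming the positive pair fed to the SSL criterion.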

In Figure `\ref{fig:sup_NN_SSL}`{=latex}, we show the differences in accuracy between the backbone and the projector for these two new data augmentation scenarios. The baseline, using the traditional SimCLR positive pairs based on data augmentations, is in blue; the nearest neighbors setup is in orange and the class-based setup in green. We observe for SimCLR that the nearest-neighbors heuristic helps reduce the gap between the pretext and downstream task, while a purely supervised heuristic for defining the positive pairs removes the need to perform Guillotine Regularization across several downstream tasks. This confirms the hypothesis that the effectiveness of a projector depends on the alignment between the pretext and downstream task in self-supervised learning.

Visualizing the information across layers for different alignments
------------------------------------------------------------------

In this section, we use RCDM [@bordes2022high], a conditional generative model, to visualize what information is retained or not in the representation. We train RCDM on ImageNet with blurred faces [@yang2021imagenetfaces], using the representations given by a SimCLR model trained on handcrafted SSL views and by another trained on class-based views. In Figure `\ref{fig:RCDM}`{=latex}, we show that when looking at decodings corresponding to different layers in the network, the information encoded varies considerably depending on the layer used. When going deeper, RCDM is not able to reconstruct as much information about the images as when using the backbone representation (which contains many more low-level features). When looking at the generated samples conditioned on the representation of the model trained with supervised views, we observe that the breed of the dog stays the same across layers. However, when using traditional data augmentations, the information about the specific golden retriever breed is lost in the last projector layers. This is correlated with the fact that this model gets lower classification performances when using the projector.

Experimental details
--------------------

We use Pytorch [@pytorch] and FFCV-SSL [@demo; @leclerc2022ffcv] as data loader. All the experiments were performed with a Resnet50 [@he2016resnet] backbone (except if mentioned otherwise). For each model, we use a batch size of 2048 and AdamW [@adamw] as optimizer with an adaptive learning rate schedule. We run the training for 100 epochs. For each model, we add as head a small MLP of 3 layers of size 2048 (the same dimension as the backbone output) with ReLU [@Relu] activations and batch normalization [@batch_norm]. When training the different SSL methods, we always used the same set of data augmentations (cropping, color jitter, random grayscale, Gaussian blur and solarization).
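As an illustration of the head architecture described above, here is a minimal numpy sketch of a 3-layer Linear-BatchNorm-ReLU MLP. It uses training-mode batch normalization without learned scale and shift, and a reduced width of 256 instead of 2048, purely for brevity:

```python
import numpy as np

rng = np.random.default_rng(4)

def batch_norm(h, eps=1e-5):
    """Training-mode batch normalization, without learned scale and shift."""
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def head(h, weights):
    """3-layer MLP head: Linear -> BatchNorm -> ReLU at every layer."""
    for W in weights:
        h = np.maximum(batch_norm(h @ W), 0.0)
    return h

d = 256  # the actual head uses 2048, matching the Resnet50 trunk output
weights = [rng.standard_normal((d, d)) * np.sqrt(2.0 / d) for _ in range(3)]
out = head(rng.standard_normal((8, d)), weights)
```

In the actual experiments this head is implemented in Pytorch with learnable batch-norm parameters and trained jointly with the backbone.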

Conclusion
==========

Through empirical evaluations, we demonstrated that the optimal layer to use for downstream evaluation varies depending on several factors: optimization, data and downstream task. These results highlight the need for SSL practitioners to run systematic evaluations at several layers instead of always using the backbone as reference. We also demonstrated that the usefulness of a projector in SSL depends on the alignment between the downstream and pretext task. Despite its usefulness, having to rely on a *trick* like `\gr `{=latex}to increase performances reveals an important shortcoming of current self-supervised learning methods: the inability to design experimental setups and training criteria that learn structured and truly invariant representations with respect to an appropriate set of factors of variation. As future work, in order to escape from Guillotine Regularization, we should focus on finding new training criteria and data augmentations that are more *aligned* with the downstream tasks of interest.

```{=latex}
\bibliographystyle{tmlr}
```
```{=latex}
\appendix
```
Datasets
========

In this work, we use ImageNet [@deng2009imagenet] (terms of license at https://www.image-net.org/download.php) for our experiments. We also use a synthetic 3D dataset, described in the next subsection.

3D models dataset
-----------------

```{=latex}
\centering
```
![Rendered views of a skateboard generated by randomly sampling latent variables. The influence of each parameter is easily visible, which is expected to make their prediction easier.](images/samples_shapenet_full.png){#fig:samples_ds width="\\textwidth"}

We now describe the dataset used for Figure `\ref{fig:object_dataset}`{=latex}. As previously mentioned, this dataset consists of 3D models from 3D Warehouse [@3dwarehouse], freely available under a General Model License, and rendered with Blender's Python API. We alter the scene by uniformly varying the latent variables described in Table `\ref{tab:latent}`{=latex}.

```{=latex}
\centering
```
::: {#tab:latent}
  Latent variable    Min. value   Max. value
  ----------------- ------------ ------------
  Object yaw          $-\pi/2$     $\pi/2$
  Object pitch        $-\pi/2$     $\pi/2$
  Object roll         $-\pi/2$     $\pi/2$
  Floor hue             $0$          $1$
  Spot $\theta$         $0$        $\pi/4$
  Spot $\phi$           $0$         $2\pi$
  Spot hue              $0$          $1$

  : Latent variables used to generate views of 3D objects. All variables are sampled from a uniform distribution.
:::

The variety of scenes that can be generated is illustrated in Figure `\ref{fig:samples_ds}`{=latex}. We can see that each latent variable can significantly impact the scene, giving significant variety in the rendered images.

Reproducibility
===============

Our work does not introduce a novel algorithm nor a significant modification over already existing algorithms. Thus, to reproduce our results, one can simply use the public GitHub repositories of the following models: SimCLR, Barlow Twins, VicReg, or the PyTorch ImageNet example (for supervised learning), with the following twist: adding a linear probe at each layer of the projector (and backbone) when evaluating the model. However, since many of these models can have different hyper-parameters or data augmentations, especially the SSL models, we recommend using a single code base with a given optimizer and a given set of data augmentations, so that comparisons between models are fair and focus on the effect of Guillotine Regularization. In this paper, except if mentioned otherwise, we use as Head an MLP with 3 layers of dimension 2048 each (which matches the number of dimensions at the trunk of a Resnet50), along with batch normalization and ReLU activations.

Additional experimental results
===============================

In this section, we present additional experimental results. The first, in Figure `\ref{fig:projector_acc_resnet50}`{=latex}, is an extended version of Figure `\ref{fig:cartoon_plot}`{=latex} with additional results on the training set. Figure `\ref{fig:acc_VIT}`{=latex} uses a setup similar to the one in Figure `\ref{fig:projector_acc_resnet50}`{=latex}, in which we compare the performances at different layers for SSL methods and a supervised one, except that we use a VIT-B instead of a Resnet50. We observe an important gap in the classification performances reached with a linear probe at different layers of the VIT-B when using SSL methods.

In Figure `\ref{fig:epochs_resnet50}`{=latex}, we show how the performances at different layers change during training by using online linear probing. At the beginning of training, the performance gap between layers is small; however, it increases significantly after 10 epochs.

In Figure `\ref{fig:sup_NN_SSL_projector}`{=latex}, we show the accuracy computed with linear probes trained on the projector and backbone representations. This figure is similar to Figure `\ref{fig:sup_NN_SSL}`{=latex}, except that we present the absolute accuracy values instead of the difference in accuracy with respect to the backbone.

```{=latex}
\centering
```
![We measure with linear probes the accuracy at different layers of a Resnet50 (as Trunk) on which we added a small 3-layer MLP (as Head), for various supervised and self-supervised methods, on the training and validation set. For each method, we show the mean and standard deviation across 3 runs (the std between different runs is low). When looking at self-supervised methods, the gap in performance between the linear probes trained at different levels can be as high as 30 percentage points.](images/valiod_train_fig3.png "fig:"){#fig:projector_acc_resnet50 width="\\linewidth"} `\subcaption{Accuracy on the training set}`{=latex}

![We measure with linear probes the accuracy at different layers of a Resnet50 (as Trunk) on which we added a small 3-layer MLP (as Head), for various supervised and self-supervised methods, on the training and validation set. For each method, we show the mean and standard deviation across 3 runs (the std between different runs is low). When looking at self-supervised methods, the gap in performance between the linear probes trained at different levels can be as high as 30 percentage points.](images/valiod_acc_fig3.png "fig:"){#fig:projector_acc_resnet50 width="\\linewidth"} `\subcaption{Accuracy on the validation set}`{=latex}

```{=latex}
\centering
```
![Same experiment as in Figure `\ref{fig:projector_acc_resnet50}`{=latex}, but this time we measure, with linear probes, the accuracy at different layers of a ViT-B (as trunk) on which we added a small 3-layer MLP (as head), for various supervised and self-supervised methods. Since the output of the ViT-B has fewer dimensions than that of a ResNet, we added on top of the trunk of the ViT-B a linear layer with ReLU activation to project into a 2048-dimensional vector. In the supervised learning setting, the best performance is obtained when using the last layers of the model. For self-supervised methods, however, the gap in performance between linear probes trained at different levels can be as high as 20 percentage points. Interestingly, with the ViT-B the best performance for SimCLR is obtained at Head 1, whereas with the ResNet it was obtained at the trunk. It is likely that the optimal number of layers on which to apply `\gr `{=latex}will vary across architectures.](images/vit_projector_train.png "fig:"){#fig:acc_VIT width="\\linewidth"} `\subcaption{Accuracy on the training set}`{=latex} `\label{fig:training_vit}`{=latex}

![Same experiment as in Figure `\ref{fig:projector_acc_resnet50}`{=latex}, but this time we measure, with linear probes, the accuracy at different layers of a ViT-B (as trunk) on which we added a small 3-layer MLP (as head), for various supervised and self-supervised methods. Since the output of the ViT-B has fewer dimensions than that of a ResNet, we added on top of the trunk of the ViT-B a linear layer with ReLU activation to project into a 2048-dimensional vector. In the supervised learning setting, the best performance is obtained when using the last layers of the model. For self-supervised methods, however, the gap in performance between linear probes trained at different levels can be as high as 20 percentage points. Interestingly, with the ViT-B the best performance for SimCLR is obtained at Head 1, whereas with the ResNet it was obtained at the trunk. It is likely that the optimal number of layers on which to apply `\gr `{=latex}will vary across architectures.](images/vit_projector.png "fig:"){#fig:acc_VIT width="\\linewidth"} `\subcaption{Accuracy on the validation set}`{=latex} `\label{fig:validation_vit}`{=latex}

```{=latex}
\centering
```
```{=latex}
\centering
```
![Accuracy of Barlow Twins across training epochs, computed with online linear probing at different layers. At the beginning of training the performance gap between the probes is small; however, after 10 epochs the gap grows increasingly large on both the training and validation sets.](images/ecc_epochs_barlow.png "fig:"){#fig:epochs_resnet50 width="\\linewidth"} `\subcaption{Barlow Twins}`{=latex}

![Accuracy of VICReg across training epochs, computed with online linear probing at different layers. At the beginning of training the performance gap between the probes is small; however, after 10 epochs the gap grows increasingly large on both the training and validation sets.](images/ecc_epochs_vicreg.png "fig:"){#fig:epochs_resnet50 width="\\linewidth"} `\subcaption{VICReg}`{=latex}

```{=latex}
\centering
```
![**Backbone and projector accuracy with linear probing, for different degrees of alignment between the pretext task and the classification downstream task.** In this experiment we use SimCLR and change how the positive pairs are defined, to better align the pretext task with a classification downstream task. In blue, our baseline: SimCLR trained with the traditional SSL data augmentations, which define the positive pair as two augmentations of the same image. In orange, we use the embedding of a pretrained model to define the positive pair as two nearest neighbors under this pretrained model (while using the same data augmentations as the baseline). In green, we use supervised class-label selection to define the positive example; in this scenario, SimCLR should learn to produce similar embeddings for all images belonging to a given class. All three models are trained on ImageNet (IN1K); we then evaluate them with a linear probe across a wide range of downstream tasks, at the projector and backbone levels. When positive pairs are defined as images belonging to the same class, there is no misalignment with the ImageNet classification downstream task. Thus, on ImageNet-1K, ImageNet1k-10P (10% of the training set used to train the linear probe) and ImageNet1k-1P (1% of the training set used to train the linear probe), the performance at the projector level is much higher than when using the traditional SSL augmentations. Interestingly, the nearest-neighbor heuristic considerably reduces the impact of Guillotine Regularization across several downstream tasks.](images/backbone_acc.png "fig:"){#fig:sup_NN_SSL_projector width="\\linewidth"} ![**Backbone and projector accuracy with linear probing, for different degrees of alignment between the pretext task and the classification downstream task.** In this experiment we use SimCLR and change how the positive pairs are defined, to better align the pretext task with a classification downstream task. In blue, our baseline: SimCLR trained with the traditional SSL data augmentations, which define the positive pair as two augmentations of the same image. In orange, we use the embedding of a pretrained model to define the positive pair as two nearest neighbors under this pretrained model (while using the same data augmentations as the baseline). In green, we use supervised class-label selection to define the positive example; in this scenario, SimCLR should learn to produce similar embeddings for all images belonging to a given class. All three models are trained on ImageNet (IN1K); we then evaluate them with a linear probe across a wide range of downstream tasks, at the projector and backbone levels. When positive pairs are defined as images belonging to the same class, there is no misalignment with the ImageNet classification downstream task. Thus, on ImageNet-1K, ImageNet1k-10P (10% of the training set used to train the linear probe) and ImageNet1k-1P (1% of the training set used to train the linear probe), the performance at the projector level is much higher than when using the traditional SSL augmentations. Interestingly, the nearest-neighbor heuristic considerably reduces the impact of Guillotine Regularization across several downstream tasks.](images/projector_acc.png "fig:"){#fig:sup_NN_SSL_projector width="\\linewidth"}
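The three positive-pair definitions compared in this figure can be sketched as follows. The labels and pretrained embeddings below are hypothetical stand-ins for the real ImageNet labels and the real pretrained model; each function returns, for an anchor image `i`, the index of its positive partner:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 100
labels = rng.integers(0, 10, size=n)   # hypothetical class labels
embed = rng.normal(size=(n, 16))       # hypothetical pretrained embeddings

def positive_augmentation(i):
    """Baseline: the positive is another augmentation of image i itself."""
    return i

def positive_nearest_neighbor(i):
    """Positive = nearest neighbor of i under the pretrained embedding."""
    d = np.linalg.norm(embed - embed[i], axis=1)
    d[i] = np.inf                      # exclude the image itself
    return int(np.argmin(d))

def positive_same_class(i):
    """Supervised selection: positive = a random image with the same label."""
    same = np.flatnonzero(labels == labels[i])
    same = same[same != i]
    return int(rng.choice(same))

i = 0
print(positive_augmentation(i), positive_nearest_neighbor(i), positive_same_class(i))
```

In the actual experiments, both views of the chosen pair are still passed through the same SSL data augmentations; only the pairing rule changes, which controls how well the pretext invariance matches the class structure of the downstream task.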

```{=latex}
\centering
```
```{=latex}
\footnotesize
```
```{=latex}
\centering
```
![VICReg](images/hyper_params_vicreg.png){#fig:simclr_supervised width="\\linewidth"}

```{=latex}
\centering
```
![SimCLR](images/hyper_params_simclr.png){#fig:simclr_supervised width="\\linewidth"}

```{=latex}
\centering
```
![Barlow Twins](images/hyper_params_barlow.png){#fig:exp_supervised_co3d width="\\linewidth"}

```{=latex}
\centering
```
![Byol](images/hyper_params_byol.png){#fig:exp_supervised_co3d width="\\linewidth"}

```{=latex}
\centering
```
```{=latex}
\footnotesize
```
```{=latex}
\centering
```
![VICReg](images/vicreg_lars_lr.png){#fig:simclr_supervised width="\\linewidth"}

```{=latex}
\centering
```
![SimCLR](images/simclr_robust_lr.png){#fig:simclr_supervised width="\\linewidth"}

```{=latex}
\centering
```
![Barlow Twins](images/barlowtwins_adamw_lr.png){#fig:exp_supervised_co3d width="\\linewidth"}

```{=latex}
\centering
```
![Byol](images/byol_robust_lr.png){#fig:exp_supervised_co3d width="\\linewidth"}

```{=latex}
\centering
```
```{=latex}
\footnotesize
```
```{=latex}
\centering
```
![VICReg](images/vicreg_bs.png){#fig:vicreg_robust_bs width="\\linewidth"}

```{=latex}
\centering
```
![SimCLR](images/simclr_bs.png){#fig:simclr_supervised width="\\linewidth"}

```{=latex}
\centering
```
![Barlow Twins](images/barlow_bs.png){#fig:exp_supervised_co3d width="\\linewidth"}

```{=latex}
\centering
```
![Byol](images/byol_robust_bs_gr.png){#fig:exp_supervised_co3d width="\\linewidth"}

Limitations
===========

In this work, we focused mostly on analyzing the use of `\gr `{=latex}in the context of Self-Supervised Learning. However, this kind of regularization might be useful for a variety of other training methods, which we do not investigate in this paper. We also focus mostly on generalization for classification tasks, but other tasks would also be worth exploring.

[^1]: The terminology pretext-training / downstream comes from SSL, while source / target is used in transfer learning.

[^2]: Head / trunk are also known as projection head / backbone in the SSL literature.

[^3]: Precisely, a "layer function" $f^{(\ell)}$ can correspond to a standard neural network layer (fully-connected, convolutional) with no residual or shortcut connections between layers, or to an entire block (as in DenseNet or Transformers) which may have internal shortcut connections, but none between blocks.
