---
bibliography:
- bibliography.bib
---

```{=latex}
\renewcommand{\sfdefault}{lmss}
```
```{=latex}
\renewcommand{\familydefault}{\rmdefault}
```
```{=latex}
\newcommand{\simpleicml@headingfont}{}
```
```{=latex}
\newcommand{\icmlpdfinfo}[1]{\hypersetup{#1}}
```
```{=latex}
\renewcommand{\headrulewidth}{0.3pt}
```
```{=latex}
\renewcommand{\headrule}{\color{ICMLAccent}\rule{\headwidth}{\headrulewidth}}
```
```{=latex}
\newcommand{\icmlrunningtitle}[1]{\def\@icmlrunningtitle{#1}}
```
```{=latex}
\newcommand{\@icmlrunningtitle}{}
```
```{=latex}
\newcommand{\theicmltitle}{%
  \ifx\@icmlrunningtitle\empty
    \@icmltitle
  \else
    \@icmlrunningtitle
  \fi
}
```
```{=latex}
\newenvironment{compacttable}
  {\begingroup\setlength{\tabcolsep}{4pt}}
  {\endgroup}
```
```{=latex}
\def\@icmltitle{}
```
```{=latex}
\def\@icmlauthors{}
```
```{=latex}
\def\@icmlaffiliations{}
```
```{=latex}
\def\@icmlabstract{}
```
```{=latex}
\newcommand{\icmltitle}[1]{\gdef\@icmltitle{#1}}
```
```{=latex}
\newcommand{\icmlauthors}[1]{\gdef\@icmlauthors{#1}}
```
```{=latex}
\newcommand{\icmlaffiliations}[1]{\gdef\@icmlaffiliations{#1}}
```
```{=latex}
\newcommand{\icmlabstract}[1]{\gdef\@icmlabstract{#1}}
```
```{=latex}
\newbox\@icmltitlebox % allocate the box register before \global\setbox uses it
\newcommand{\icmlmaketitle}{%
  % Build title in a global box FIRST
  \global\setbox\@icmltitlebox=\vbox{%
    \begin{center}%
      {\LARGE\bfseries \@icmltitle \par}%
      \vspace{0.6em}%
      {\large \@icmlauthors \par}%
      \vspace{0.3em}%
      {\normalsize
        \begin{minipage}{0.96\textwidth}\centering
          \@icmlaffiliations\par
        \end{minipage}%
      }\par
      \vspace{0.6em}
      \noindent
      {\setlength{\fboxsep}{10pt}%
          \colorbox{ICMLAccent!10}{%
            \begin{minipage}{0.95\textwidth}%
              \@icmlabstract
            \end{minipage}%
          }%
      }\par
    \end{center}%
  }%
  % Now use the box in twocolumn
  \thispagestyle{empty}%
  \twocolumn[\unvbox\@icmltitlebox]
}
```
```{=latex}
\newcommand{\@runningtitle}{}
```
```{=latex}
% Keep \@icmlrunningtitle in sync so \theicmltitle sees the running title
\renewcommand{\icmlrunningtitle}[1]{\gdef\@runningtitle{#1}\gdef\@icmlrunningtitle{#1}}
```
```{=latex}
\newcommand{\sectionheaderline}[1]{\gdef\@sectionline{#1}}
```
```{=latex}
\newcommand{\@sectionline}{}
```
```{=latex}
\newcommand{\seclink}[3]{%
  \ifnum#2=\value{currentsection}
    \textbf{\hyperref[#1]{Sec #2: #3}}%
  \else
    \hyperref[#1]{Sec #2: #3}%
  \fi
}
```
```{=latex}
\providecommand\@ifxundefined[1]{%
 \ifx#1\@undefined\expandafter\@firstoftwo\else\expandafter\@secondoftwo\fi
}
```
```{=latex}
\providecommand\@ifnum[1]{%
 \ifnum#1\expandafter\@firstoftwo\else\expandafter\@secondoftwo\fi
}
```
```{=latex}
\providecommand\@ifx[1]{%
 \ifx#1\expandafter\@firstoftwo\else\expandafter\@secondoftwo\fi
}
```
```{=latex}
\providecommand\appdef[2]{%
 \toks@\expandafter{#1}\@temptokena{#2}%
 \edef#1{\the\toks@\the\@temptokena}%
}
```
```{=latex}
\newcommand\bibstyle@chicago{\bibpunct{(}{)}{;}{a}{,}{,}}
```
```{=latex}
\newcommand\bibstyle@named{\bibpunct{[}{]}{;}{a}{,}{,}}
```
```{=latex}
\newcommand\bibstyle@agu{\bibpunct{[}{]}{;}{a}{,}{,~}}
```
```{=latex}
\newcommand\bibstyle@copernicus{\bibpunct{(}{)}{;}{a}{,}{,}}
```
```{=latex}
\let\bibstyle@egu=\bibstyle@copernicus
```
```{=latex}
\let\bibstyle@egs=\bibstyle@copernicus
```
```{=latex}
\newcommand\bibstyle@agsm{\bibpunct{(}{)}{,}{a}{}{,}\gdef\harvardand{\&}}
```
```{=latex}
\newcommand\bibstyle@kluwer{\bibpunct{(}{)}{,}{a}{}{,}\gdef\harvardand{\&}}
```
```{=latex}
\newcommand\bibstyle@dcu{\bibpunct{(}{)}{;}{a}{;}{,}\gdef\harvardand{and}}
```
```{=latex}
\newcommand\bibstyle@aa{\bibpunct{(}{)}{;}{a}{}{,}}
```
```{=latex}
\newcommand\bibstyle@pass{\bibpunct{(}{)}{;}{a}{,}{,}}
```
```{=latex}
\newcommand\bibstyle@anngeo{\bibpunct{(}{)}{;}{a}{,}{,}}
```
```{=latex}
\newcommand\bibstyle@nlinproc{\bibpunct{(}{)}{;}{a}{,}{,}}
```
```{=latex}
\newcommand\bibstyle@cospar{\bibpunct{/}{/}{,}{n}{}{}%
     \gdef\bibnumfmt##1{##1.}}
```
```{=latex}
\newcommand\bibstyle@esa{\bibpunct{(Ref.~}{)}{,}{n}{}{}%
     \gdef\bibnumfmt##1{##1.\hspace{1em}}}
```
```{=latex}
\newcommand\bibstyle@nature{\bibpunct{}{}{,}{s}{}{\textsuperscript{,}}%
     \gdef\bibnumfmt##1{##1.}}
```
```{=latex}
\newcommand\bibstyle@plain{\bibpunct{[}{]}{,}{n}{}{,}}
```
```{=latex}
\let\bibstyle@alpha=\bibstyle@plain
```
```{=latex}
\let\bibstyle@abbrv=\bibstyle@plain
```
```{=latex}
\let\bibstyle@unsrt=\bibstyle@plain
```
```{=latex}
\newcommand\bibstyle@plainnat{\bibpunct{[}{]}{,}{a}{,}{,}}
```
```{=latex}
\let\bibstyle@abbrvnat=\bibstyle@plainnat
```
```{=latex}
\let\bibstyle@unsrtnat=\bibstyle@plainnat
```
```{=latex}
\let\NAT@merge\z@
```
```{=latex}
\def\NAT@sort{\z@}
```
```{=latex}
\def\NAT@cmprs{\z@}
```
```{=latex}
\def\NAT@nmfmt#1{{\NAT@up#1}}
```
```{=latex}
\renewcommand\bibstyle[1]{\csname bibstyle@#1\endcsname}
```
```{=latex}
\let\@citestyle\bibstyle
```
```{=latex}
\newcommand\citestyle[1]{\@citestyle{#1}\let\bibstyle\@gobble}
```
```{=latex}
\newcommand\bibpunct[7][, ]%
  {\gdef\NAT@open{#2}\gdef\NAT@close{#3}\gdef
   \NAT@sep{#4}\global\NAT@numbersfalse
     \ifx #5n\global\NAT@numberstrue\global\NAT@superfalse
   \else
     \ifx #5s\global\NAT@numberstrue\global\NAT@supertrue
   \fi\fi
   \gdef\NAT@aysep{#6}\gdef\NAT@yrsep{#7}%
   \gdef\NAT@cmt{#1}%
   \NAT@@setcites
  }
```
```{=latex}
\newcommand\setcitestyle[1]{
 \@for\@tempa:=#1\do
 {\def\@tempb{round}\ifx\@tempa\@tempb
    \renewcommand\NAT@open{(}\renewcommand\NAT@close{)}\fi
  \def\@tempb{square}\ifx\@tempa\@tempb
    \renewcommand\NAT@open{[}\renewcommand\NAT@close{]}\fi
  \def\@tempb{angle}\ifx\@tempa\@tempb
    \renewcommand\NAT@open{$<$}\renewcommand\NAT@close{$>$}\fi
  \def\@tempb{curly}\ifx\@tempa\@tempb
    \renewcommand\NAT@open{\{}\renewcommand\NAT@close{\}}\fi
  \def\@tempb{semicolon}\ifx\@tempa\@tempb
    \renewcommand\NAT@sep{;}\fi
  \def\@tempb{colon}\ifx\@tempa\@tempb
    \renewcommand\NAT@sep{;}\fi
  \def\@tempb{comma}\ifx\@tempa\@tempb
    \renewcommand\NAT@sep{,}\fi
  \def\@tempb{authoryear}\ifx\@tempa\@tempb
    \NAT@numbersfalse\fi
  \def\@tempb{numbers}\ifx\@tempa\@tempb
    \NAT@numberstrue\NAT@superfalse\fi
  \def\@tempb{super}\ifx\@tempa\@tempb
    \NAT@numberstrue\NAT@supertrue\fi
  \expandafter\NAT@find@eq\@tempa=\relax\@nil
  \if\@tempc\relax\else
    \expandafter\NAT@rem@eq\@tempc
    \def\@tempb{open}\ifx\@tempa\@tempb
     \xdef\NAT@open{\@tempc}\fi
    \def\@tempb{close}\ifx\@tempa\@tempb
     \xdef\NAT@close{\@tempc}\fi
    \def\@tempb{aysep}\ifx\@tempa\@tempb
     \xdef\NAT@aysep{\@tempc}\fi
    \def\@tempb{yysep}\ifx\@tempa\@tempb
     \xdef\NAT@yrsep{\@tempc}\fi
    \def\@tempb{notesep}\ifx\@tempa\@tempb
     \xdef\NAT@cmt{\@tempc}\fi
    \def\@tempb{citesep}\ifx\@tempa\@tempb
     \xdef\NAT@sep{\@tempc}\fi
  \fi
 }%
 \NAT@@setcites
}
```
```{=latex}
\def\NAT@find@eq#1=#2\@nil{\def\@tempa{#1}\def\@tempc{#2}}
```
```{=latex}
\def\NAT@rem@eq#1={\def\@tempc{#1}}
```
```{=latex}
\def\NAT@@setcites{\global\let\bibstyle\@gobble}
```
```{=latex}
\newcommand\NAT@open{(}
```
```{=latex}
\newcommand\NAT@close{)}
```
```{=latex}
\newcommand\NAT@sep{;}
```
```{=latex}
\newcommand\NAT@aysep{,}
```
```{=latex}
\newcommand\NAT@yrsep{,}
```
```{=latex}
\newcommand\NAT@cmt{, }
```
```{=latex}
\newcommand\NAT@cite%
    [3]{\ifNAT@swa\NAT@@open\if*#2*\else#2\NAT@spacechar\fi
        #1\if*#3*\else\NAT@cmt#3\fi\NAT@@close\else#1\fi\endgroup}
```
```{=latex}
\newcommand\NAT@citenum%
    [3]{\ifNAT@swa\NAT@@open\if*#2*\else#2\NAT@spacechar\fi
        #1\if*#3*\else\NAT@cmt#3\fi\NAT@@close\else#1\fi\endgroup}
```
```{=latex}
\newcommand\NAT@citesuper[3]{\ifNAT@swa
\if*#2*\else#2\NAT@spacechar\fi
\unskip\kern\p@\textsuperscript{\NAT@@open#1\NAT@@close}%
   \if*#3*\else\NAT@spacechar#3\fi\else #1\fi\endgroup}
```
```{=latex}
\providecommand\textsuperscript[1]{\mbox{$^{\mbox{\scriptsize#1}}$}}
```
```{=latex}
\providecommand\@firstofone[1]{#1}
```
```{=latex}
\newcommand\NAT@citexnum{}
```
```{=latex}
\def\NAT@citexnum[#1][#2]#3{%
  \NAT@reset@parser
  \NAT@sort@cites{#3}%
  \NAT@reset@citea
  \@cite{\def\NAT@num{-1}\let\NAT@last@yr\relax\let\NAT@nm\@empty
    \@for\@citeb:=\NAT@cite@list\do
    {\@safe@activestrue
     \edef\@citeb{\expandafter\@firstofone\@citeb\@empty}%
     \@safe@activesfalse
     \@ifundefined{b@\@citeb\@extra@b@citeb}{%
       {\reset@font\bfseries?}
        \NAT@citeundefined\PackageWarning{natbib}%
       {Citation `\@citeb' on page \thepage \space undefined}}%
     {\let\NAT@last@num\NAT@num\let\NAT@last@nm\NAT@nm
      \NAT@parse{\@citeb}%
      \ifNAT@longnames\@ifundefined{bv@\@citeb\@extra@b@citeb}{%
        \let\NAT@name=\NAT@all@names
        \global\@namedef{bv@\@citeb\@extra@b@citeb}{}}{}%
      \fi
      \ifNAT@full\let\NAT@nm\NAT@all@names\else
        \let\NAT@nm\NAT@name\fi
      \ifNAT@swa
       \@ifnum{\NAT@ctype>\@ne}{%
        \@citea
        \NAT@hyper@{\@ifnum{\NAT@ctype=\tw@}{\NAT@test{\NAT@ctype}}{\NAT@alias}}%
       }{%
        \@ifnum{\NAT@cmprs>\z@}{%
         \NAT@ifcat@num\NAT@num
          {\let\NAT@nm=\NAT@num}%
          {\def\NAT@nm{-2}}%
         \NAT@ifcat@num\NAT@last@num
          {\@tempcnta=\NAT@last@num\relax}%
          {\@tempcnta\m@ne}%
         \@ifnum{\NAT@nm=\@tempcnta}{%
          \@ifnum{\NAT@merge>\@ne}{}{\NAT@last@yr@mbox}%
         }{%
           \advance\@tempcnta by\@ne
           \@ifnum{\NAT@nm=\@tempcnta}{%
             \ifx\NAT@last@yr\relax
               \def@NAT@last@yr{\@citea}%
             \else
               \def@NAT@last@yr{--\NAT@penalty}%
             \fi
           }{%
             \NAT@last@yr@mbox
           }%
         }%
        }{%
         \@tempswatrue
         \@ifnum{\NAT@merge>\@ne}{\@ifnum{\NAT@last@num=\NAT@num\relax}{\@tempswafalse}{}}{}%
         \if@tempswa\NAT@citea@mbox\fi
        }%
       }%
       \NAT@def@citea
      \else
        \ifcase\NAT@ctype
          \ifx\NAT@last@nm\NAT@nm \NAT@yrsep\NAT@penalty\NAT@space\else
            \@citea \NAT@test{\@ne}\NAT@spacechar\NAT@mbox{\NAT@super@kern\NAT@@open}%
          \fi
          \if*#1*\else#1\NAT@spacechar\fi
          \NAT@mbox{\NAT@hyper@{{\citenumfont{\NAT@num}}}}%
          \NAT@def@citea@box
        \or
          \NAT@hyper@citea@space{\NAT@test{\NAT@ctype}}%
        \or
          \NAT@hyper@citea@space{\NAT@test{\NAT@ctype}}%
        \or
          \NAT@hyper@citea@space\NAT@alias
        \fi
      \fi
     }%
    }%
      \@ifnum{\NAT@cmprs>\z@}{\NAT@last@yr}{}%
      \ifNAT@swa\else
        \@ifnum{\NAT@ctype=\z@}{%
          \if*#2*\else\NAT@cmt#2\fi
        }{}%
        \NAT@mbox{\NAT@@close}%
      \fi
  }{#1}{#2}%
}
```
```{=latex}
\def\NAT@citea@mbox{%
 \@citea\mbox{\NAT@hyper@{{\citenumfont{\NAT@num}}}}%
}
```
```{=latex}
\def\NAT@hyper@#1{%
 \hyper@natlinkstart{\@citeb\@extra@b@citeb}#1\hyper@natlinkend
}
```
```{=latex}
\def\NAT@hyper@citea#1{%
 \@citea
 \NAT@hyper@{#1}%
 \NAT@def@citea
}
```
```{=latex}
\def\NAT@hyper@citea@space#1{%
 \@citea
 \NAT@hyper@{#1}%
 \NAT@def@citea@space
}
```
```{=latex}
\def\def@NAT@last@yr#1{%
 \protected@edef\NAT@last@yr{%
  #1%
  \noexpand\mbox{%
   \noexpand\hyper@natlinkstart{\@citeb\@extra@b@citeb}%
   {\noexpand\citenumfont{\NAT@num}}%
   \noexpand\hyper@natlinkend
  }%
 }%
}
```
```{=latex}
\def\NAT@last@yr@mbox{%
 \NAT@last@yr\let\NAT@last@yr\relax
 \NAT@citea@mbox
}
```
```{=latex}
\newcommand\NAT@test[1]{%
 \@ifnum{#1=\@ne}{%
  \ifx\NAT@nm\NAT@noname
   \begingroup\reset@font\bfseries(author?)\endgroup
   \PackageWarning{natbib}{%
    Author undefined for citation `\@citeb' \MessageBreak on page \thepage%
   }%
  \else \NAT@nm
  \fi
 }{%
  \if\relax\NAT@date\relax
   \begingroup\reset@font\bfseries(year?)\endgroup
   \PackageWarning{natbib}{%
    Year undefined for citation `\@citeb' \MessageBreak on page \thepage%
   }%
  \else \NAT@date
  \fi
 }%
}
```
```{=latex}
\let\citenumfont=\@empty
```
```{=latex}
\newcommand\NAT@citex{}
```
```{=latex}
\def\NAT@spacechar{\ }
```
```{=latex}
\def\NAT@separator{\NAT@sep\NAT@penalty}
```
```{=latex}
\def\NAT@reset@citea{\c@NAT@ctr\@ne\let\@citea\@empty}
```
```{=latex}
\def\NAT@def@citea{\def\@citea{\NAT@separator\NAT@space}}
```
```{=latex}
\def\NAT@def@citea@space{\def\@citea{\NAT@separator\NAT@spacechar}}
```
```{=latex}
\def\NAT@def@citea@close{\def\@citea{\NAT@@close\NAT@separator\NAT@space}}
```
```{=latex}
\def\NAT@def@citea@box{\def\@citea{\NAT@mbox{\NAT@@close}\NAT@separator\NAT@spacechar}}
```
```{=latex}
\newcommand\NAT@@open{\ifNAT@par\NAT@open\fi}
```
```{=latex}
\newcommand\NAT@@close{\ifNAT@par\NAT@close\fi}
```
```{=latex}
\newcommand\NAT@alias{\@ifundefined{al@\@citeb\@extra@b@citeb}{%
  {\reset@font\bfseries(alias?)}\PackageWarning{natbib}
  {Alias undefined for citation `\@citeb'
  \MessageBreak on page \thepage}}{\@nameuse{al@\@citeb\@extra@b@citeb}}}
```
```{=latex}
\let\NAT@up\relax
```
```{=latex}
\newcommand\NAT@Up[1]{{\let\protect\@unexpandable@protect\let~\relax
  \expandafter\NAT@deftemp#1}\expandafter\NAT@UP\NAT@temp}
```
```{=latex}
\newcommand\NAT@deftemp[1]{\xdef\NAT@temp{#1}}
```
```{=latex}
\newcommand\NAT@UP[1]{\let\@tempa\NAT@UP\ifcat a#1\MakeUppercase{#1}%
  \let\@tempa\relax\else#1\fi\@tempa}
```
```{=latex}
\newcommand\shortcites[1]{%
  \@bsphack\@for\@citeb:=#1\do
  {\@safe@activestrue
   \edef\@citeb{\expandafter\@firstofone\@citeb\@empty}%
   \@safe@activesfalse
   \global\@namedef{bv@\@citeb\@extra@b@citeb}{}}\@esphack}
```
```{=latex}
\newcommand\NAT@biblabel[1]{\hfill}
```
```{=latex}
\newcommand\NAT@biblabelnum[1]{\bibnumfmt{#1}}
```
```{=latex}
\let\bibnumfmt\@empty
```
```{=latex}
\providecommand\@biblabel[1]{[#1]}
```
```{=latex}
\newcommand\NAT@bibsetnum[1]{\settowidth\labelwidth{\@biblabel{#1}}%
   \setlength{\leftmargin}{\labelwidth}\addtolength{\leftmargin}{\labelsep}%
   \setlength{\itemsep}{\bibsep}\setlength{\parsep}{\z@}%
   \ifNAT@openbib
     \addtolength{\leftmargin}{\bibindent}%
     \setlength{\itemindent}{-\bibindent}%
     \setlength{\listparindent}{\itemindent}%
     \setlength{\parsep}{0pt}%
   \fi
}
```
```{=latex}
\newcommand\NAT@bibsetup%
   [1]{\setlength{\leftmargin}{\bibhang}\setlength{\itemindent}{-\leftmargin}%
       \setlength{\itemsep}{\bibsep}\setlength{\parsep}{\z@}}
```
```{=latex}
\newcommand\NAT@set@cites{%
  \ifNAT@numbers
    \ifNAT@super \let\@cite\NAT@citesuper
       \def\NAT@mbox##1{\unskip\nobreak\textsuperscript{##1}}%
       \let\citeyearpar=\citeyear
       \let\NAT@space\relax
       \def\NAT@super@kern{\kern\p@}%
    \else
       \let\NAT@mbox=\mbox
       \let\@cite\NAT@citenum
       \let\NAT@space\NAT@spacechar
       \let\NAT@super@kern\relax
    \fi
    \let\@citex\NAT@citexnum
    \let\@biblabel\NAT@biblabelnum
    \let\@bibsetup\NAT@bibsetnum
    \renewcommand\NAT@idxtxt{\NAT@name\NAT@spacechar\NAT@open\NAT@num\NAT@close}%
    \def\natexlab##1{}%
    \def\NAT@penalty{\penalty\@m}%
  \else
    \let\@cite\NAT@cite
    \let\@citex\NAT@citex
    \let\@biblabel\NAT@biblabel
    \let\@bibsetup\NAT@bibsetup
    \let\NAT@space\NAT@spacechar
    \let\NAT@penalty\@empty
    \renewcommand\NAT@idxtxt{\NAT@name\NAT@spacechar\NAT@open\NAT@date\NAT@close}%
    \def\natexlab##1{##1}%
  \fi}
```
```{=latex}
\DeclareRobustCommand\citet
   {\begingroup\NAT@swafalse\let\NAT@ctype\z@\NAT@partrue
     \@ifstar{\NAT@fulltrue\NAT@citetp}{\NAT@fullfalse\NAT@citetp}}
```
```{=latex}
\newcommand\NAT@citetp{\@ifnextchar[{\NAT@@citetp}{\NAT@@citetp[]}}
```
```{=latex}
\newcommand\NAT@@citetp{}
```
```{=latex}
\def\NAT@@citetp[#1]{\@ifnextchar[{\@citex[#1]}{\@citex[][#1]}}
```
```{=latex}
\DeclareRobustCommand\citep
   {\begingroup\NAT@swatrue\let\NAT@ctype\z@\NAT@partrue
         \@ifstar{\NAT@fulltrue\NAT@citetp}{\NAT@fullfalse\NAT@citetp}}
```
```{=latex}
\DeclareRobustCommand\cite
    {\begingroup\let\NAT@ctype\z@\NAT@partrue\NAT@swatrue
      \@ifstar{\NAT@fulltrue\NAT@cites}{\NAT@fullfalse\NAT@cites}}
```
```{=latex}
\newcommand\NAT@cites{\@ifnextchar [{\NAT@@citetp}{%
     \ifNAT@numbers\else
     \NAT@swafalse
     \fi
    \NAT@@citetp[]}}
```
```{=latex}
\DeclareRobustCommand\citealt
   {\begingroup\NAT@swafalse\let\NAT@ctype\z@\NAT@parfalse
         \@ifstar{\NAT@fulltrue\NAT@citetp}{\NAT@fullfalse\NAT@citetp}}
```
```{=latex}
\DeclareRobustCommand\citealp
   {\begingroup\NAT@swatrue\let\NAT@ctype\z@\NAT@parfalse
         \@ifstar{\NAT@fulltrue\NAT@citetp}{\NAT@fullfalse\NAT@citetp}}
```
```{=latex}
\DeclareRobustCommand\citenum
   {\begingroup
     \NAT@swatrue\let\NAT@ctype\z@\NAT@parfalse\let\textsuperscript\NAT@spacechar
     \NAT@citexnum[][]}
```
```{=latex}
\DeclareRobustCommand\citeauthor
   {\begingroup\NAT@swafalse\let\NAT@ctype\@ne\NAT@parfalse
    \@ifstar{\NAT@fulltrue\NAT@citetp}{\NAT@fullfalse\NAT@citetp}}
```
```{=latex}
\DeclareRobustCommand\Citet
   {\begingroup\NAT@swafalse\let\NAT@ctype\z@\NAT@partrue
     \let\NAT@up\NAT@Up
     \@ifstar{\NAT@fulltrue\NAT@citetp}{\NAT@fullfalse\NAT@citetp}}
```
```{=latex}
\DeclareRobustCommand\Citep
   {\begingroup\NAT@swatrue\let\NAT@ctype\z@\NAT@partrue
     \let\NAT@up\NAT@Up
         \@ifstar{\NAT@fulltrue\NAT@citetp}{\NAT@fullfalse\NAT@citetp}}
```
```{=latex}
\DeclareRobustCommand\Citealt
   {\begingroup\NAT@swafalse\let\NAT@ctype\z@\NAT@parfalse
     \let\NAT@up\NAT@Up
         \@ifstar{\NAT@fulltrue\NAT@citetp}{\NAT@fullfalse\NAT@citetp}}
```
```{=latex}
\DeclareRobustCommand\Citealp
   {\begingroup\NAT@swatrue\let\NAT@ctype\z@\NAT@parfalse
     \let\NAT@up\NAT@Up
         \@ifstar{\NAT@fulltrue\NAT@citetp}{\NAT@fullfalse\NAT@citetp}}
```
```{=latex}
\DeclareRobustCommand\Citeauthor
   {\begingroup\NAT@swafalse\let\NAT@ctype\@ne\NAT@parfalse
     \let\NAT@up\NAT@Up
    \@ifstar{\NAT@fulltrue\NAT@citetp}{\NAT@fullfalse\NAT@citetp}}
```
```{=latex}
\DeclareRobustCommand\citeyear
   {\begingroup\NAT@swafalse\let\NAT@ctype\tw@\NAT@parfalse\NAT@citetp}
```
```{=latex}
\DeclareRobustCommand\citeyearpar
   {\begingroup\NAT@swatrue\let\NAT@ctype\tw@\NAT@partrue\NAT@citetp}
```
```{=latex}
\newcommand\citetext[1]{\NAT@open#1\NAT@close}
```
```{=latex}
\DeclareRobustCommand\citefullauthor
   {\citeauthor*}
```
```{=latex}
\newcommand\defcitealias[2]{%
   \@ifundefined{al@#1\@extra@b@citeb}{}
   {\PackageWarning{natbib}{Overwriting existing alias for citation #1}}
   \@namedef{al@#1\@extra@b@citeb}{#2}}
```
```{=latex}
\DeclareRobustCommand\citetalias{\begingroup
   \NAT@swafalse\let\NAT@ctype\thr@@\NAT@parfalse\NAT@citetp}
```
```{=latex}
\DeclareRobustCommand\citepalias{\begingroup
   \NAT@swatrue\let\NAT@ctype\thr@@\NAT@partrue\NAT@citetp}
```
```{=latex}
\renewcommand\nocite[1]{\@bsphack
  \@for\@citeb:=#1\do{%
    \@safe@activestrue
    \edef\@citeb{\expandafter\@firstofone\@citeb\@empty}%
    \@safe@activesfalse
    \if@filesw\immediate\write\@auxout{\string\citation{\@citeb}}\fi
    \if*\@citeb\else
    \@ifundefined{b@\@citeb\@extra@b@citeb}{%
       \NAT@citeundefined \PackageWarning{natbib}%
       {Citation `\@citeb' undefined}}{}\fi}%
  \@esphack}
```
```{=latex}
\newcommand\NAT@parse[1]{%
  \begingroup
   \let\protect=\@unexpandable@protect
   \let~\relax
   \let\active@prefix=\@gobble
   \edef\NAT@temp{\csname b@#1\@extra@b@citeb\endcsname}%
   \aftergroup\NAT@split
   \expandafter
  \endgroup
  \NAT@temp{}{}{}{}{}@@%
  \expandafter\NAT@parse@date\NAT@date??????@@%
  \ifciteindex\NAT@index\fi
}
```
```{=latex}
\def\NAT@split#1#2#3#4#5@@{%
  \gdef\NAT@num{#1}\gdef\NAT@name{#3}\gdef\NAT@date{#2}%
  \gdef\NAT@all@names{#4}%
  \ifx\NAT@num\@empty\gdef\NAT@num{0}\fi
  \ifx\NAT@noname\NAT@all@names \gdef\NAT@all@names{#3}\fi
}
```
```{=latex}
\def\NAT@reset@parser{%
  \global\let\NAT@num\@empty
  \global\let\NAT@name\@empty
  \global\let\NAT@date\@empty
  \global\let\NAT@all@names\@empty
}
```
```{=latex}
\newcommand\NAT@parse@date{}
```
```{=latex}
\def\NAT@parse@date#1#2#3#4#5#6@@{%
  \ifnum\the\catcode`#1=11\def\NAT@year{}\def\NAT@exlab{#1}\else
  \ifnum\the\catcode`#2=11\def\NAT@year{#1}\def\NAT@exlab{#2}\else
  \ifnum\the\catcode`#3=11\def\NAT@year{#1#2}\def\NAT@exlab{#3}\else
  \ifnum\the\catcode`#4=11\def\NAT@year{#1#2#3}\def\NAT@exlab{#4}\else
    \def\NAT@year{#1#2#3#4}\def\NAT@exlab{{#5}}\fi\fi\fi\fi}
```
```{=latex}
\newcommand\NAT@index{}
```
```{=latex}
\let\NAT@makeindex=\makeindex
```
```{=latex}
\renewcommand\makeindex{\NAT@makeindex
  \renewcommand\NAT@index{\@bsphack\begingroup
     \def~{\string~}\@wrindex{\NAT@idxtxt}}}
```
```{=latex}
\newcommand\NAT@idxtxt{\NAT@name\NAT@spacechar\NAT@open\NAT@date\NAT@close}
```
```{=latex}
\newcommand\citeindextype{default}
```
```{=latex}
\newcommand\NAT@index@alt{{\let\protect=\noexpand\let~\relax
  \xdef\NAT@temp{\NAT@idxtxt}}\expandafter\NAT@exp\NAT@temp\@nil}
```
```{=latex}
\newcommand\NAT@exp{}
```
```{=latex}
\def\NAT@exp#1\@nil{\mbox{}\index[\citeindextype]{#1}}
```
```{=latex}
\newcommand\NAT@ifcmd{\futurelet\NAT@temp\NAT@ifxcmd}
```
```{=latex}
\newcommand\NAT@ifxcmd{\ifx\NAT@temp\relax\else\expandafter\NAT@bare\fi}
```
```{=latex}
\def\NAT@bare#1(#2)#3(@)#4\@nil#5{%
  \if @#2
    \expandafter\NAT@apalk#1, , \@nil{#5}%
  \else
    \stepcounter{NAT@ctr}%
    \NAT@wrout{\arabic{NAT@ctr}}{#2}{#1}{#3}{#5}%
  \fi
}
```
```{=latex}
\newcommand\NAT@wrout[5]{%
\if@filesw
      {\let\protect\noexpand\let~\relax
       \immediate
       \write\@auxout{\string\bibcite{#5}{{#1}{#2}{{#3}}{{#4}}}}}\fi
\ignorespaces}
```
```{=latex}
\def\NAT@noname{{}}
```
```{=latex}
\renewcommand\bibitem{\@ifnextchar[{\@lbibitem}{\@lbibitem[]}}
```
```{=latex}
\let\NAT@bibitem@first@sw\@secondoftwo
```
```{=latex}
\def\@lbibitem[#1]#2{%
  \if\relax\@extra@b@citeb\relax\else
    \@ifundefined{br@#2\@extra@b@citeb}{}{%
     \@namedef{br@#2}{\@nameuse{br@#2\@extra@b@citeb}}%
    }%
  \fi
  \@ifundefined{b@#2\@extra@b@citeb}{%
   \def\NAT@num{}%
  }{%
   \NAT@parse{#2}%
  }%
  \def\NAT@tmp{#1}%
  \expandafter\let\expandafter\bibitemOpen\csname NAT@b@open@#2\endcsname
  \expandafter\let\expandafter\bibitemShut\csname NAT@b@shut@#2\endcsname
  \@ifnum{\NAT@merge>\@ne}{%
   \NAT@bibitem@first@sw{%
    \@firstoftwo
   }{%
    \@ifundefined{NAT@b*@#2}{%
     \@firstoftwo
    }{%
     \expandafter\def\expandafter\NAT@num\expandafter{\the\c@NAT@ctr}%
     \@secondoftwo
    }%
   }%
  }{%
   \@firstoftwo
  }%
  {%
   \global\advance\c@NAT@ctr\@ne
   \@ifx{\NAT@tmp\@empty}{\@firstoftwo}{%
    \@secondoftwo
   }%
   {%
    \expandafter\def\expandafter\NAT@num\expandafter{\the\c@NAT@ctr}%
    \global\NAT@stdbsttrue
   }{}%
   \bibitem@fin
   \item[\hfil\NAT@anchor{#2}{\NAT@num}]%
   \global\let\NAT@bibitem@first@sw\@secondoftwo
   \NAT@bibitem@init
  }%
  {%
   \NAT@anchor{#2}{}%
   \NAT@bibitem@cont
   \bibitem@fin
  }%
  \@ifx{\NAT@tmp\@empty}{%
    \NAT@wrout{\the\c@NAT@ctr}{}{}{}{#2}%
  }{%
    \expandafter\NAT@ifcmd\NAT@tmp(@)(@)\@nil{#2}%
  }%
}
```
```{=latex}
\def\bibitem@fin{%
 \@ifxundefined\@bibstop{}{\csname bibitem@\@bibstop\endcsname}%
}
```
```{=latex}
\def\NAT@bibitem@init{%
 \let\@bibstop\@undefined
}
```
```{=latex}
\def\NAT@bibitem@cont{%
 \let\bibitem@Stop\bibitemStop
 \let\bibitem@NoStop\bibitemContinue
}
```
```{=latex}
\def\BibitemOpen{%
 \bibitemOpen
}
```
```{=latex}
\def\BibitemShut#1{%
 \bibitemShut
 \def\@bibstop{#1}%
 \let\bibitem@Stop\bibitemStop
 \let\bibitem@NoStop\bibitemNoStop
}
```
```{=latex}
\mathchardef\@mmm=3000 % space factor used by \bibitemNoStop/\bibitemContinue
\def\bibitemStop{}
```
```{=latex}
\def\bibitemNoStop{.\spacefactor\@mmm\space}
```
```{=latex}
\def\bibitemContinue{\spacefactor\@mmm\space}
```
```{=latex}
\providecommand{\bibAnnote}[3]{%
  \BibitemShut{#1}%
  \def\@tempa{#3}\@ifx{\@tempa\@empty}{}{%
   \begin{quotation}\noindent
    \textsc{Key:}\ #2\\\textsc{Annotation:}\ \@tempa
   \end{quotation}%
  }%
}
```
```{=latex}
\providecommand{\bibAnnoteFile}[2]{%
  \IfFileExists{#2}{%
    \bibAnnote{#1}{#2}{\input{#2}}%
  }{%
    \bibAnnote{#1}{#2}{}%
  }%
}
```
```{=latex}
\let\bibitemOpen\relax
```
```{=latex}
\let\bibitemShut\relax
```
```{=latex}
\def\bibfield{\@ifnum{\NAT@merge>\tw@}{\@bibfield}{\@secondoftwo}}
```
```{=latex}
\def\@bibfield#1#2{%
 \begingroup
  \let\Doi\@gobble
  \let\bibinfo\relax
  \let\restore@protect\@empty
  \protected@edef\@tempa{#2}%
  \aftergroup\def\aftergroup\@tempa
 \expandafter\endgroup\expandafter{\@tempa}%
 \expandafter\@ifx\expandafter{\csname @bib#1\endcsname\@tempa}{%
  \expandafter\let\expandafter\@tempa\csname @bib@X#1\endcsname
 }{%
  \expandafter\let\csname @bib#1\endcsname\@tempa
  \expandafter\let\expandafter\@tempa\csname @bib@Y#1\endcsname
 }%
 \@ifx{\@tempa\relax}{\let\@tempa\@firstofone}{}%
 \@tempa{#2}%
}
```
```{=latex}
\def\bibinfo#1{%
 \expandafter\let\expandafter\@tempa\csname bibinfo@X@#1\endcsname
 \@ifx{\@tempa\relax}{\@firstofone}{\@tempa}%
}
```
```{=latex}
\def\@bib@Xauthor#1{\let\@bib@Xjournal\@gobble}
```
```{=latex}
\def\@bib@Xjournal#1{\begingroup\let\bibinfo@X@journal\@bib@Z@journal#1\endgroup}
```
```{=latex}
\def\@bibibid@#1{\textit{ibid}.}
```
```{=latex}
% Patch \@lbibitem for showkeys only when that package is actually loaded
\AtBeginDocument{%
  \@ifpackageloaded{showkeys}{%
    \let\SK@lbibitem\@lbibitem
    \def\@lbibitem[#1]#2{%
      \SK@lbibitem[#1]{#2}\SK@\SK@@label{#2}\ignorespaces}%
  }{}%
}
```
```{=latex}
\newcommand\NAT@force@numbers{%
  \ifNAT@numbers\else
  \PackageError{natbib}{Bibliography not compatible with author-year
  citations.\MessageBreak
  Press <return> to continue in numerical citation style}
  {Check the bibliography entries for non-compliant syntax,\MessageBreak
   or select author-year BibTeX style, e.g. plainnat}%
  \global\NAT@numberstrue\fi}
```
```{=latex}
\providecommand\bibcite{}
```
```{=latex}
\renewcommand\bibcite[2]{%
 \@ifundefined{b@#1\@extra@binfo}{\relax}{%
   \NAT@citemultiple
   \PackageWarningNoLine{natbib}{Citation `#1' multiply defined}%
 }%
 \global\@namedef{b@#1\@extra@binfo}{#2}%
}
```
```{=latex}
\newcommand\NAT@testdef[2]{%
  \def\NAT@temp{#2}%
  \expandafter \ifx \csname b@#1\@extra@binfo\endcsname\NAT@temp
  \else
    \ifNAT@swa \NAT@swafalse
      \PackageWarningNoLine{natbib}{%
        Citation(s) may have changed.\MessageBreak
        Rerun to get citations correct%
      }%
    \fi
  \fi
}
```
```{=latex}
\newcommand\NAT@apalk{}
\def\NAT@apalk#1, #2, #3\@nil#4{%
  \if\relax#2\relax
    \global\NAT@stdbsttrue
    \NAT@wrout{#1}{}{}{}{#4}%
  \else
    \stepcounter{NAT@ctr}%
    \NAT@wrout{\arabic{NAT@ctr}}{#2}{#1}{}{#4}%
  \fi
}
```
```{=latex}
\newcommand\citeauthoryear{}
```
```{=latex}
\def\citeauthoryear#1#2#3(@)(@)\@nil#4{%
  \stepcounter{NAT@ctr}%
  \if\relax#3\relax
    \NAT@wrout{\arabic{NAT@ctr}}{#2}{#1}{}{#4}%
  \else
    \NAT@wrout{\arabic{NAT@ctr}}{#3}{#2}{#1}{#4}%
  \fi
}
```
```{=latex}
\newcommand\citestarts{\NAT@open}
```
```{=latex}
\newcommand\citeends{\NAT@close}
```
```{=latex}
\newcommand\betweenauthors{and}
```
```{=latex}
\newcommand\astroncite{}
```
```{=latex}
\def\astroncite#1#2(@)(@)\@nil#3{%
  \stepcounter{NAT@ctr}%
  \NAT@wrout{\arabic{NAT@ctr}}{#2}{#1}{}{#3}%
}
```
```{=latex}
\newcommand\citename{}
```
```{=latex}
\def\citename#1#2(@)(@)\@nil#3{\expandafter\NAT@apalk#1#2, \@nil{#3}}
```
```{=latex}
\newcommand\harvarditem[4][]{%
 \if\relax#1\relax
   \bibitem[#2(#3)]{#4}%
 \else
   \bibitem[#1(#3)#2]{#4}%
 \fi
}
```
```{=latex}
\newcommand\harvardleft{\NAT@open}
```
```{=latex}
\newcommand\harvardright{\NAT@close}
```
```{=latex}
\newcommand\harvardyearleft{\NAT@open}
```
```{=latex}
\newcommand\harvardyearright{\NAT@close}
```
```{=latex}
\newcommand\harvardurl[1]{\textbf{URL:} \textit{#1}}
```
```{=latex}
\providecommand\bibsection{}
```
```{=latex}
\renewenvironment{thebibliography}[1]{%
 \bibsection
 \parindent\z@
 \bibpreamble
 \bibfont
 \list{\@biblabel{\the\c@NAT@ctr}}{\@bibsetup{#1}\global\c@NAT@ctr\z@}%
 \ifNAT@openbib
   \renewcommand\newblock{\par}%
 \else
   \renewcommand\newblock{\hskip .11em \@plus.33em \@minus.07em}%
 \fi
 \sloppy\clubpenalty4000\widowpenalty4000
 \sfcode`\.\@m
 \let\NAT@bibitem@first@sw\@firstoftwo
    \let\citeN\cite \let\shortcite\cite
    \let\citeasnoun\cite
}{%
 \bibitem@fin
 \bibpostamble
 \def\@noitemerr{%
  \PackageWarning{natbib}{Empty `thebibliography' environment}%
 }%
 \endlist
 \bibcleanup
}
```
```{=latex}
\let\bibfont\@empty
```
```{=latex}
\let\bibpreamble\@empty
```
```{=latex}
\let\bibpostamble\@empty
```
```{=latex}
\def\bibcleanup{\vskip-\lastskip}
```
```{=latex}
\providecommand\reset@font{\relax}
```
```{=latex}
\providecommand\bibname{Bibliography}
```
```{=latex}
\providecommand\refname{References}
```
```{=latex}
\newcommand\NAT@citeundefined{\gdef \NAT@undefined {%
    \PackageWarningNoLine{natbib}{There were undefined citations}}}
```
```{=latex}
\let \NAT@undefined \relax
```
```{=latex}
\newcommand\NAT@citemultiple{\gdef \NAT@multiple {%
    \PackageWarningNoLine{natbib}{There were multiply defined citations}}}
```
```{=latex}
\let \NAT@multiple \relax
```
```{=latex}
\providecommand\@mkboth[2]{}
```
```{=latex}
\providecommand\MakeUppercase{\uppercase}
```
```{=latex}
\providecommand{\@extra@b@citeb}{}
```
```{=latex}
\def\NAT@anchor#1#2{%
 \hyper@natanchorstart{#1\@extra@b@citeb}%
  \def\@tempa{#2}\@ifx{\@tempa\@empty}{}{\@biblabel{#2}}%
 \hyper@natanchorend
}
```
```{=latex}
\providecommand\hyper@natanchorstart[1]{}
```
```{=latex}
\providecommand\hyper@natanchorend{}
```
```{=latex}
\providecommand\hyper@natlinkstart[1]{}
```
```{=latex}
\providecommand\hyper@natlinkend{}
```
```{=latex}
\providecommand\hyper@natlinkbreak[2]{#1}
```
```{=latex}
\providecommand\@safe@activestrue{}
```
```{=latex}
\providecommand\@safe@activesfalse{}
```
```{=latex}
\newcommand\NAT@sort@cites[1]{%
  \let\NAT@cite@list\@empty
  \@for\@citeb:=#1\do{\expandafter\NAT@star@cite\@citeb\@@}%
  \if@filesw
    \expandafter\immediate\expandafter\write\expandafter\@auxout
      \expandafter{\expandafter\string\expandafter\citation\expandafter{\NAT@cite@list}}%
  \fi
  \@ifnum{\NAT@sort>\z@}{%
    \expandafter\NAT@sort@cites@\expandafter{\NAT@cite@list}%
  }{}%
}
```
```{=latex}
\def\NAT@star@cite{%
  \let\NAT@star@sw\@secondoftwo
  \@ifnum{\NAT@merge>\z@}{%
   \@ifnextchar*{%
    \let\NAT@star@sw\@firstoftwo
    \NAT@star@cite@star
   }{%
    \NAT@star@cite@nostar
   }%
  }{%
   \NAT@star@cite@noextension
  }%
}
```
```{=latex}
\def\NAT@star@cite@star*{%
 \NAT@star@cite@nostar
}
```
```{=latex}
\def\NAT@star@cite@nostar{%
 \let\nat@keyopt@open\@empty
 \let\nat@keyopt@shut\@empty
 \@ifnextchar[{\NAT@star@cite@pre}{\NAT@star@cite@pre[]}%
}
```
```{=latex}
\def\NAT@star@cite@pre[#1]{%
 \def\nat@keyopt@open{#1}%
 \@ifnextchar[{\NAT@star@cite@post}{\NAT@star@cite@post[]}%
}
```
```{=latex}
\def\NAT@star@cite@post[#1]#2\@@{%
  \def\nat@keyopt@shut{#1}%
  \NAT@star@sw{%
    \expandafter\global\expandafter\let\csname NAT@b*@#2\endcsname\@empty
  }{}%
  \NAT@cite@list@append{#2}%
}
```
```{=latex}
\def\NAT@star@cite@noextension#1\@@{%
  \let\nat@keyopt@open\@empty
  \let\nat@keyopt@shut\@empty
  \NAT@cite@list@append{#1}%
}
```
```{=latex}
\def\NAT@cite@list@append#1{%
  \edef\@citeb{\@firstofone#1\@empty}%
  \if@filesw\@ifxundefined\@cprwrite{}{\expandafter\@cprwrite\@citeb=}\fi
  \if\relax\nat@keyopt@open\relax\else
   \global\expandafter\let\csname NAT@b@open@\@citeb\endcsname\nat@keyopt@open
  \fi
  \if\relax\nat@keyopt@shut\relax\else
   \global\expandafter\let\csname NAT@b@shut@\@citeb\endcsname\nat@keyopt@shut
  \fi
  \toks@\expandafter{\NAT@cite@list}%
  \ifx\NAT@cite@list\@empty
    \@temptokena\expandafter{\@citeb}%
  \else
    \@temptokena\expandafter{\expandafter,\@citeb}%
  \fi
  \edef\NAT@cite@list{\the\toks@\the\@temptokena}%
}
```
```{=latex}
\newcommand\NAT@sort@cites@[1]{%
  \count@\z@
  \@tempcntb\m@ne
  \let\@celt\delimiter
  \def\NAT@num@list{}%
  \let\NAT@cite@list\@empty
  \let\NAT@nonsort@list\@empty
  \@for \@citeb:=#1\do{\NAT@make@cite@list}%
  \ifx\NAT@nonsort@list\@empty\else
   \protected@edef\NAT@cite@list{\NAT@cite@list\NAT@nonsort@list}%
  \fi
  \ifx\NAT@cite@list\@empty\else
   \protected@edef\NAT@cite@list{\expandafter\NAT@xcom\NAT@cite@list @@}%
  \fi
}
```
```{=latex}
\def\NAT@make@cite@list{%
  \advance\count@\@ne
  \@safe@activestrue
  \edef\@citeb{\expandafter\@firstofone\@citeb\@empty}%
  \@safe@activesfalse
  \@ifundefined{b@\@citeb\@extra@b@citeb}%
   {\def\NAT@num{A}}%
   {\NAT@parse{\@citeb}}%
  \NAT@ifcat@num\NAT@num
   {\@tempcnta\NAT@num \relax
    \@ifnum{\@tempcnta<\@tempcntb}{%
      \let\NAT@@cite@list=\NAT@cite@list
      \let\NAT@cite@list\@empty
      \begingroup\let\@celt=\NAT@celt\NAT@num@list\endgroup
      \protected@edef\NAT@num@list{%
       \expandafter\NAT@num@celt \NAT@num@list \@gobble @%
      }%
    }{%
      \protected@edef\NAT@num@list{\NAT@num@list \@celt{\NAT@num}}%
      \protected@edef\NAT@cite@list{\NAT@cite@list\@citeb,}%
      \@tempcntb\@tempcnta
    }%
   }%
   {\protected@edef\NAT@nonsort@list{\NAT@nonsort@list\@citeb,}}%
}
```
```{=latex}
\def\NAT@celt#1{%
  \@ifnum{#1>\@tempcnta}{%
    \xdef\NAT@cite@list{\NAT@cite@list\@citeb,\NAT@@cite@list}%
    \let\@celt\@gobble
  }{%
    \expandafter\def@NAT@cite@lists\NAT@@cite@list\@@
  }%
}
```
```{=latex}
\def\NAT@num@celt#1#2{%
 \ifx#1\@celt
  \@ifnum{#2>\@tempcnta}{%
    \@celt{\number\@tempcnta}%
    \@celt{#2}%
  }{%
    \@celt{#2}%
    \expandafter\NAT@num@celt
  }%
 \fi
}
```
```{=latex}
\def\def@NAT@cite@lists#1,#2\@@{%
  \xdef\NAT@cite@list{\NAT@cite@list#1,}%
  \xdef\NAT@@cite@list{#2}%
}
```
```{=latex}
\def\NAT@nextc#1,#2@@{#1,}
```
```{=latex}
\def\NAT@restc#1,#2{#2}
```
```{=latex}
\def\NAT@xcom#1,@@{#1}
```
```{=latex}
\renewcommand\lstlistingname{Algorithm}
```
```{=latex}
\def\lstfloatautorefname{Listing}
```
```{=latex}
\newcommand{\1}{\mathbf{1}}
```
```{=latex}
\newcommand{\norm}[1]{\left\lVert #1\right\rVert}
```
```{=latex}
\newcommand{\grad}{\nabla}
```
```{=latex}
\newcommand{\Hess}{H}
```
```{=latex}
\newcommand{\tr}{\mathrm{tr}}
```
```{=latex}
\newcommand{\Ball}{\mathrm{B}}
```
```{=latex}
\newcommand{\vol}{v_d}
```
```{=latex}
\newcommand{\surf}{s_{d-1}}
```
```{=latex}
\newcommand{\RR}{\mathbb{R}}
```
```{=latex}
\newcommand{\PP}{\mathbb{P}}
```
```{=latex}
\newcommand{\EE}{\mathbb{E}}
```
```{=latex}
\newcommand{\ind}{\mathbbm{1}}
```
```{=latex}
\newcommand{\Law}{\mathcal{L}}
```
```{=latex}
\newcommand{\sphere}{S^{d-1}}
```
```{=latex}
\newcommand{\inner}[2]{\left\langle #1, #2 \right\rangle}
```
```{=latex}
\newcommand{\BL}{\mathrm{BL}}
```
```{=latex}
\newcommand{\dBL}{d_{\mathrm{BL}}}
```
```{=latex}
\newcommand{\RightarrowDist}{\Rightarrow}
```
```{=latex}
\newcommand{\toProb}{\xrightarrow{\,\mathbb{P}\,}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\ReLU}{\text{ReLU}}
```
```{=latex}
\newcommand{\flatten}{\text{vec}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D}_{\mathrm{valid}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D}_{\mathrm{test}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\cst{{\rm cst}}
```
```{=latex}
\newcommand{\rmat}[2]{\mathcal{M}_{#1,#2}(\mathbb{R})}
```
```{=latex}
\newcommand{\romat}[2]{\mathcal{O}_{#1}(\mathbb{R})}
```
```{=latex}
\DeclareMathOperator{\spn}{span}
```
```{=latex}
\DeclareMathOperator{\diag}{diag}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\newcommand{\Trp}[1]{\Tr\left(#1\right)}
```
```{=latex}
\DeclareMathOperator{\eigvec}{eigvec}
```
```{=latex}
\newcommand{\eigvecp}[1]{\eigvec\left(#1\right)}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vsigma{{\bm{\sigma}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\vepsilon{{\bm{\epsilon}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\let\ab\allowbreak
```

![image](toy_figures/teaser/teaser_manifold_0.png){width="\\linewidth"}

![image](toy_figures/teaser/teaser_manifold_1.png){width="\\linewidth"}

Introduction {#sec:intro}
============

Learning manipulable representations of the world and its dynamics is a long‑standing question in AI, with roots dating back centuries [@von1867handbuch; @tolman1948cognitive; @gregory1980perceptions; @sutton1991dyna; @friston2010free]. Across domains, e.g., image recognition, robotics, physics, space exploration, the unifying question is *how to learn an organized and actionable high‑dimensional embedding space from observations?* Using Deep Networks--parameterized nonlinear operators $f_{\vtheta}$--to map observations to embeddings is a standard first piece of that puzzle [@lecun2015deep; @goodfellow2016deep]. The second, less standardized, piece is *how to train $f_{\vtheta}$*. Joint-Embedding Predictive Architectures (JEPAs) train $f_{\vtheta}$ by maximizing predictive agreement between the embeddings of semantically related *views* [@bromley1993signature; @lecun2022path; @balestriero2023cookbook]. Views come in two forms: transformations or corruptions. They can involve masking, cropping, blurring, temporal or spatial translations, geometric or photometric transformations, viewpoint changes, views from different sensor modalities, etc. Supervised forms involve human-produced components such as image-caption pairs, text-code pairs, etc. [@tian2020makes]. In any case, views are expected to share some degree of semantic relationship so that the prediction task can align $f_{\vtheta}$'s embeddings with the underlying knowledge present in the data.

Alas, JEPA's prediction task admits failure modes, such as representation collapse, where $f_{\vtheta}$ maps all inputs to nearly identical embeddings (*complete collapse*) or to a low-dimensional subspace (*dimensional collapse*) [@jing2021understanding; @cosentino2022toward; @balestriero2022contrastive]. To mitigate such shortcut solutions, state‑of‑the‑art recipes rely on heuristics--stop‑gradient [@chen2020simple], asymmetric view generation [@wang2022importance], teacher--student networks with carefully tuned EMA schedules [@caron2021emerging; @tian2021understanding], explicit normalization and whitening layers [@ermolov2021whitening; @chen2021empirical]--and a delicate balance of hyperparameters. As a result, today's JEPA training is brittle, and most research has shifted toward scaling data [@vo2024automatic], models [@fan2025scaling], and even post-training [@rodas2025diet], leaving the theoretical foundations of JEPAs largely unexplored.
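Both collapse modes can be made concrete by inspecting the spectrum of the embedding covariance. The following numpy sketch is purely illustrative--the toy embeddings, the tolerance, and the function name are our assumptions, not part of any method discussed in this paper:

```python
import numpy as np

def collapse_diagnostic(Z, tol=1e-6):
    """Classify a batch of embeddings Z of shape (n_samples, K) by the
    rank of its covariance: rank 0 ~ complete collapse, rank < K ~
    dimensional collapse, rank K ~ no collapse detected."""
    Zc = Z - Z.mean(axis=0, keepdims=True)            # center embeddings
    eig = np.linalg.eigvalsh(np.cov(Zc, rowvar=False))
    rank = int(np.sum(eig > tol * eig.max())) if eig.max() > 0 else 0
    if rank == 0:
        return "complete collapse"
    return "dimensional collapse" if rank < Z.shape[1] else "full rank"

rng = np.random.default_rng(0)
healthy = rng.normal(size=(512, 8))                    # well-spread embeddings
flat = healthy @ np.diag([1, 1, 0, 0, 0, 0, 0, 0])     # only 2 live directions
point = np.ones((512, 8))                              # every input -> same point
```

On these toy batches the diagnostic returns "full rank", "dimensional collapse", and "complete collapse", respectively.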

Our study proposes to break that cycle by questioning some of the fundamental design principles underpinning JEPAs. That introspection starts by asking *what are the necessary conditions that JEPAs should abide by?* Those minimal conditions then act as *axioms* for us to design a novel and lean JEPA. We identify two axioms: (i) solving the prediction task while (ii) enforcing an isotropic Gaussian distribution of the embeddings ([3](#sec:gaussian){reference-type="ref" reference="sec:gaussian"}). While (i) follows standard practice [@balestriero2022contrastive], we introduce in [4](#sec:bcs){reference-type="ref" reference="sec:bcs"} a novel distribution matching objective--Sketched Isotropic Gaussian Regularization (SIGReg)--to enforce (ii). SIGReg not only removes the need for the numerous heuristics previously employed to prevent representation collapse, but also exhibits favorable scaling properties, as its *memory and computational complexity is linear in dimension and sample size*. Crucially, SIGReg's isotropic Gaussian enforcement rules out collapsed shortcut solutions and provably minimizes the model's expected risk over the space of downstream tasks to be encountered post-training. The resulting JEPA solution--coined Latent-Euclidean JEPA (LeJEPA)--is introduced in [5](#sec:lejepa){reference-type="ref" reference="sec:lejepa"}. Beyond theoretical optimality, LeJEPA offers numerous benefits such as (i) provable statistical guarantees, (ii) removal of heuristics such as teacher-student networks, (iii) linear memory and computational complexity, and most importantly (iv) a unified design with a single trade-off parameter that works out of the box across datasets, architectures and scales (see [6](#sec:experiments){reference-type="ref" reference="sec:experiments"}). We summarize our contributions below.

**Contribution 1: We prove the optimal embedding distribution for foundation models.** We establish that the isotropic Gaussian uniquely minimizes downstream prediction risk across broad task families. In [3](#sec:gaussian){reference-type="ref" reference="sec:gaussian"}, we derive this result rigorously for both linear ([3.1](#sec:linear_probing){reference-type="ref" reference="sec:linear_probing"}) and nonlinear probes ([3.2](#sec:nonlinear_probing){reference-type="ref" reference="sec:nonlinear_probing"}), providing the first principled answer to what distribution $f_{\vtheta}$'s embeddings should follow. This theoretical result transforms JEPA design from heuristic exploration to targeted optimization. **Contribution 2: We introduce SIGReg, a distribution matching objective that uniquely combines provable correctness with computational efficiency at scale.** We present *Sketched Isotropic Gaussian Regularization* (SIGReg), a novel objective that enforces distributional alignment via random projections and characteristic-function matching ([\[sec:bcs,fig:bcs\_teaser\]](#sec:bcs,fig:bcs_teaser){reference-type="ref" reference="sec:bcs,fig:bcs_teaser"}). SIGReg provides statistical guarantees ([\[sec:general\_test,sec:tests\]](#sec:general_test,sec:tests){reference-type="ref" reference="sec:general_test,sec:tests"}) while achieving linear complexity and bounded gradients---a combination that existing distribution matching methods do not offer. Critically, its projection-based construction defeats the curse of dimensionality ([4.3](#sec:dimension){reference-type="ref" reference="sec:dimension"}), making it both theoretically sound and practically efficient for high-dimensional embeddings.
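To build intuition for "random projections and characteristic-function matching"--without claiming to reproduce the actual SIGReg statistic--the sketch below projects embeddings onto random unit directions and compares each one-dimensional empirical characteristic function to that of $\mathcal{N}(0,1)$, namely $e^{-t^2/2}$. Every constant (number of directions, evaluation grid) is an illustrative assumption:

```python
import numpy as np

def sketched_gaussian_discrepancy(Z, n_dirs=16, seed=0):
    """Illustrative only (not the paper's SIGReg): project embeddings
    Z of shape (n, K) onto random unit directions, then compare each
    1-D empirical characteristic function against exp(-t^2/2), the
    characteristic function of a standard Gaussian. Both memory and
    compute are linear in n and K."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(Z.shape[1], n_dirs))
    A /= np.linalg.norm(A, axis=0, keepdims=True)   # unit-norm directions
    P = Z @ A                                       # (n, n_dirs) projections
    ts = np.linspace(0.1, 3.0, 8)                   # evaluation points
    total = 0.0
    for t in ts:
        ecf = np.mean(np.exp(1j * t * P), axis=0)   # empirical c.f. per direction
        total += np.mean(np.abs(ecf - np.exp(-0.5 * t**2)) ** 2)
    return total / len(ts)

rng = np.random.default_rng(1)
gaussian = rng.normal(size=(2048, 16))   # matches the isotropic Gaussian target
collapsed = np.zeros((2048, 16))         # complete collapse: c.f. stuck at 1
```

As expected, the discrepancy is near zero for isotropic Gaussian embeddings and large for collapsed ones, while never forming any $K \times K$ matrix.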

**Contribution 3: We design LeJEPA, a statistically optimal JEPA that eliminates collapse by construction.** By combining JEPA's predictive objective with SIGReg targeting the isotropic Gaussian, we introduce *LeJEPA*---Latent-Euclidean JEPA ([5](#sec:lejepa){reference-type="ref" reference="sec:lejepa"}). LeJEPA requires only a single hyperparameter, eliminates representational collapse without stop-gradients or teacher-student architectures, and transfers across architectures and datasets without hyperparameter tuning. This demonstrates that principled theory directly yields practical simplicity.

**Contribution 4: We validate LeJEPA at scale across diverse architectures and establish in-domain pretraining as viable.** Our experiments ([6](#sec:experiments){reference-type="ref" reference="sec:experiments"}) span ViTs, ConvNeXts, ResNets, MaxViTs, and Swin Transformers at scales approaching 1 billion parameters, where LeJEPA matches or exceeds state-of-the-art methods while maintaining training simplicity and robustness. Critically, on domain-specific datasets (Galaxy10, Food101), LeJEPA outperforms DINOv2-based transfer learning when pretrained directly on target data. This challenges the transfer learning paradigm and demonstrates that principled SSL can unlock effective in-domain pretraining---previously considered impractical for small datasets.

Background and Notations {#sec:background}
========================

We start by introducing some of the notations we will be using throughout our manuscript ([2.1](#sec:notations){reference-type="ref" reference="sec:notations"}), followed by a review of JEPAs ([2.2](#sec:JEPA){reference-type="ref" reference="sec:JEPA"}), and existing literature studying their design ([2.3](#sec:mi){reference-type="ref" reference="sec:mi"}).

Notations and Definitions {#sec:notations}
-------------------------

**Definition (JEPA).** $$\begin{aligned}
    {\rm JEPA}(\vx) \iff& {\rm Enc}\left(\vx_{n,t+1,.}\right) \text{ is predictable from }{\rm Enc}\left(\vx_{n,t,.}\right), \forall n,t\text{ and } {\rm Enc}\left(\vx_{.,.,.}\right) \text{ is not degenerate}.\label{def:SSL}\end{aligned}$$

![image](toy_figures/teaser/teaser_JEPA_0.png){width="\\linewidth"}

![image](toy_figures/teaser/teaser_JEPA_1.png){width="\\linewidth"}

![image](toy_figures/teaser/teaser_JEPA_2.png){width="\\linewidth"}

**Data.** We are in possession of a dataset of shape $(N, V, D) \in {\mathbb{N}^*}^3$ where $N$ is the number of samples, $V$ is the number of views, and $D$ is the dimension. One entry of this dataset is accessed via $\vx_{n,v,d}$. Those dimensions are often interpreted as follows: (**N**) is the number of independent samples, e.g., different images or different videos, (**V**) is the number of *views*, e.g., data-augmentations for images, frames for videos, and (**D**) is the dimension of each $\vx_{n,v}$, e.g., number of RGB pixels for images. In many cases the ordering over $V$ is given by *time*--but in some cases, e.g., data-augmentation of an image, ordering becomes irrelevant. Our study does not require any particular choice to organize one's dataset into an $(N,V,D)$ tensor--*and none of our theory and implementation assumes a particular design decision for that tensor*. However, we will rely on the following two properties: (*independence*) the samples $\vx_{n},\vx_{n'}$ have been obtained independently from each other $\forall n\not = n'$, and (*identically distributed*) the sampling process was identical among $\vx_{n}, \forall n$.
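The $(N,V,D)$ convention can be sketched with a toy synthetic dataset. The shapes and the additive-noise view model below are illustrative assumptions, not the paper's data pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
N, V, D = 100, 4, 32          # samples, views per sample, dimensions per view

# Toy dataset: samples are i.i.d. draws across n, and the V views of a
# sample are noisy corruptions of the same underlying signal.
signal = rng.normal(size=(N, 1, D))         # one latent signal per sample
noise = 0.1 * rng.normal(size=(N, V, D))    # view-specific corruption
x = signal + noise                          # the (N, V, D) tensor

assert x.shape == (N, V, D)
# Two views of the same sample are far closer than views across samples.
within = np.linalg.norm(x[0, 0] - x[0, 1])
across = np.linalg.norm(x[0, 0] - x[1, 0])
```

This construction satisfies both required properties: rows $\vx_n$ are sampled independently and by an identical process.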

**Deep Networks.** Today's AI solutions rely on *Deep (Neural) Networks* (DNs), which are compositions of a large number of parameterized linear and nonlinear operators. We denote the DN's mapping as $f_\vtheta: \R^D \rightarrow \R^K$ with $K$ the dimension of the embedding space. The internals of $f_\vtheta$ are designed by the researcher to incorporate as much prior knowledge about the data as possible. The details of $f_\vtheta$ are irrelevant to our study--as we will see, the proposed LeJEPA works out-of-the-box on any $f_\vtheta$. In any case, all the *learnable parameters* are gathered in the vector $\vtheta \in \R^P$, with $P$ counting the total number of parameters. A central challenge in AI research is to design the right architecture and training objective so that $\vtheta$ can be learned by gradient descent to ultimately produce a useful system, or foundation model, $f_{\vtheta}$.

**JEPAs.** A foundation model is any system, e.g., a DN, able to solve numerous downstream tasks without requiring any change in its internal parameters $\vtheta$. This is in sharp contrast with a supervised model that only considers its training task. JEPAs were formally introduced by @lecun2022path as a vehicle to produce foundation models. The core building blocks of JEPAs rely on numerous well-established techniques such as siamese networks [@bromley1993signature] and predictive coding [@helmholtz1867handbook; @bruner1949perception]. While the exact blueprint of JEPAs varies greatly between use-cases, they all rely on two core principles: (i) being able to predict the embedding of a view $\vx_{n,v}$ from the embedding of another view $\vx_{n,v'}, v' \not = v$, all while (ii) ensuring that the embeddings do not become degenerate. Concretely, once a JEPA is designed and trained, it should be able to solve numerous downstream tasks in zero or few shots. The JEPA objective function, along with some examples for $\vx$, is provided in [\[def:SSL\]](#def:SSL){reference-type="ref" reference="def:SSL"}. The *predictability* criterion can be enforced by directly comparing the embeddings of the partial views ${\rm Enc}(\vx_{n,v,.})$ and ${\rm Enc}(\vx_{n,v',.})$ with a metric, e.g., $\ell_p$. In some cases, an additional DN, coined ${\rm Pred}$, is employed to compare ${\rm Pred}({\rm Enc}(\vx_{n,v,.}))$ against ${\rm Enc}(\vx_{n,v',.})$--which is only justified when there exists an asymmetry between the information content of the different views, e.g., when conditioning the predictions on observed actions from robotics data [@khazatsky2024droid].

The Need for Reliable Pretraining {#sec:JEPA}
---------------------------------

The JEPA's prediction task is designed based on a priori knowledge of the data. Its design is often quite natural since it is relatively intuitive to form $\vx$ so that its views share the relevant information content one hopes to capture. On the other hand, the design of the "anti-collapse" criterion is much closer to a game of Whac-A-Mole. Today's designs rely on many different under-specified safeguards which are carefully combined in the hope that degenerate shortcut solutions are avoided during training. Such mechanisms include (i) feature whitening [@ermolov2021whitening; @bardes2021vicreg], (ii) negative samples [@chen2020simple; @he2020momentum], and (iii) asymmetric views and teacher-student networks with stop-gradient [@caron2021emerging; @assran2023self]. These mechanisms all suffer from at least two of the following limitations: (i) under-specification, i.e., the criteria can be minimized while the embeddings are in a degenerate configuration, (ii) quadratic time and memory complexity in mini-batch size and/or embedding dimension, (iii) sensitivity to data distribution, hyperparameters, and architecture, and (iv) lack of theoretical understanding and guarantees.
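As one concrete instance of safeguard (i), a feature-whitening regularizer in the spirit of VICReg [@bardes2021vicreg] can be sketched as follows. The hinge form and the coefficients are illustrative assumptions; the point is that forming the full $K \times K$ covariance already costs $O(nK^2)$, one of the scaling limitations noted above:

```python
import numpy as np

def whitening_penalty(Z, eps=1e-4):
    """VICReg-style anti-collapse terms: keep each embedding dimension's
    std above 1 and push off-diagonal covariance entries toward zero.
    Requires the full K x K covariance matrix, hence O(n K^2) compute."""
    Zc = Z - Z.mean(axis=0, keepdims=True)
    std = np.sqrt(Zc.var(axis=0) + eps)
    var_term = np.mean(np.clip(1.0 - std, 0.0, None))   # hinge on per-dim std
    C = (Zc.T @ Zc) / (len(Z) - 1)                      # K x K covariance
    off = C - np.diag(np.diag(C))
    cov_term = np.sum(off ** 2) / Z.shape[1]
    return var_term + cov_term

rng = np.random.default_rng(0)
spread = rng.normal(size=(256, 8))                       # well-spread embeddings
collapsed = np.tile(rng.normal(size=(1, 8)), (256, 1))   # complete collapse
```

Collapsed embeddings incur a large penalty while well-spread ones do not, yet this criterion remains under-specified: many non-Gaussian, poorly organized embedding distributions also attain a near-zero penalty.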

The Need for Actionable Theory {#sec:mi}
------------------------------

For decades, the two major solutions for AI were supervised learning [@lecun2015deep] and learning by reconstruction [@rumelhart1986learning]--sometimes combined, e.g., for semi-supervised learning [@kingma2014semi]. In supervised learning, the labels both ensure that semantically similar samples are close to each other in embedding space and prevent complete representation collapse. In particular, it is possible to measure the amount of collapse in supervised learning as a function of the number of classes [@papyan2020prevalence]. The reconstruction objective is similarly well suited to prevent representation collapse as the original input must be recovered from the embeddings, i.e., the embeddings must be as informative about the input as possible--up to some optional denoising tasks that users can set up as part of the training [@vincent2010stacked].

Because supervised and reconstruction-based learning have been widely studied for decades, there exists a large body of work to explain and inform practical designs--as well as to study their limitations in producing foundation models [@balestriero2024learning; @van2025joint]. This is not the case for the more recent JEPAs, where empirical advances quickly outpace anyone hoping to delve into their inner workings. This dynamic led the community to focus on post-hoc theoretical justification of already found solutions [@liu2021self; @shwartz2024compress; @shwartz2022we; @zhang2023matrix]. In most cases, those studies involve the *Mutual Information (MI)* [@shannon1948mathematical; @cover1999elements], whose different bounds recover established methods [@gutmann2010noise; @ma2018noise; @oord2018representation; @poole2019variational; @hjelm2018learning; @mcallester2020formal]. Because existing studies focus on explaining and interpreting already developed JEPAs, too little principled guidance and innovation have been brought forward. Instead, most of the recent empirical advances take the form of collecting larger datasets, scaling up pre-existing training recipes [@goyal2019scaling; @chen2020big; @oquab2023dinov2; @fan2025scaling], and deriving novel data curation processes [@vo2024automatic; @kerdreux2025efficient].

In contrast, our goal in the following [\[sec:gaussian,sec:bcs,sec:lejepa\]](#sec:gaussian,sec:bcs,sec:lejepa){reference-type="ref" reference="sec:gaussian,sec:bcs,sec:lejepa"} will be to derive a novel JEPA solution from first principles, i.e., one whose design relies on proven necessary conditions for optimality, and with a pretraining recipe that can finally reconcile exploratory research, scalability, and state-of-the-art performance.

Latent Euclidean: Embeddings Should be Isotropic Gaussian {#sec:gaussian}
=========================================================

We address a fundamental question: *which distribution should ${\rm Enc}(\vx)$ follow to minimize empirical risk on any downstream task?* We prove that the isotropic Gaussian is the unique optimal distribution for both linear ([3.1](#sec:linear_probing){reference-type="ref" reference="sec:linear_probing"}) and nonlinear probing ([3.2](#sec:nonlinear_probing){reference-type="ref" reference="sec:nonlinear_probing"}), with geometric intuition provided in [3.3](#sec:le){reference-type="ref" reference="sec:le"}. This theoretical result establishes the necessary design principle for our JEPA; [4](#sec:bcs){reference-type="ref" reference="sec:bcs"} then provides the practical implementation to achieve it.

Linear Probing {#sec:linear_probing}
--------------

We begin by identifying the optimal distribution for $f_\vtheta$'s embeddings by analyzing linear probes--one of the most popular methods for frozen encoder evaluation. Specifically, we ask: *which distribution for $f_\vtheta(\vx)$ would be most favorable for solving arbitrary downstream tasks, i.e., for any realization of targets $\vy$?*

Denote as $\mZ \in \mathbb{R}^{N \times K}$ the matrix of $N$ embeddings, each $K$-dimensional, from $f_{\vtheta}(\vx_n)$. The *unknown* corresponding labels are denoted as $\vy \in \mathbb{R}^{N}$. Without loss of generality, we consider univariate targets; the following analysis extends to multivariate targets. The linear probe minimizes the following least-squares problem [@bishop2006pattern] $$\hat{\beta} = \underset{\beta \in \mathbb{R}^K}{\arg\min} \|\vy - \mZ\beta\|_2^2+\lambda \|\beta\|_2^2,\tag{OLS}\label{eq:OLS}$$ where $\hat{\beta}$ is the optimal probe parameter vector, and $\lambda \geq 0$ is a hyperparameter controlling the Tikhonov regularizer strength [@bishop1995training; @golub1999tikhonov]. Despite not knowing $\vy$, it is possible to describe the bias and variance of the estimator $\hat{\beta}$ as a function of the distribution of $\mZ$. Consider two embeddings with identical column spans, $\mZ_{\rm aniso}$ and $\mZ_{\rm iso}$. $\mZ_{\rm aniso}$'s covariance matrix eigenvalues are given by $\{\lambda_k\}_{k=1}^K$ with at least two distinct values, while $\mZ_{\rm iso}$'s covariance matrix eigenvalues are all equal to $\frac{1}{K}\sum_{k=1}^{K}\lambda_k$. Hence, the two candidate embeddings $\mZ_{\rm aniso}, \mZ_{\rm iso}$ capture the same intrinsic features and have the same energy, but different geometries.

Anisotropy amplifies bias Whenever $\lambda_K>\lambda_1$, there always exists a downstream task ($\vy$) for which $\mZ_{\rm aniso}$ produces a higher bias estimator than $\mZ_{\rm iso}$ for $\lambda>0$. (Proof in [9.1](#proof:linear_probe_bias){reference-type="ref" reference="proof:linear_probe_bias"}.)

Anisotropy amplifies variance With $\lambda=0$, the total variance of $\hat{\beta}$ [\[eq:OLS\]](#eq:OLS){reference-type="eqref" reference="eq:OLS"} is minimized for $\mZ_{\rm iso}$ with $\text{tr}(\text{Var}(\hat{\boldsymbol{\beta}}_{\text{aniso}})) > \text{tr}(\text{Var}(\hat{\boldsymbol{\beta}}_{\text{iso}}))$. (Proof in [9.2](#proof:linear_probe_variance){reference-type="ref" reference="proof:linear_probe_variance"}.)

From the above [\[thm:linear\_probe\_variance,thm:linear\_probe\_bias\]](#thm:linear_probe_variance,thm:linear_probe_bias){reference-type="ref" reference="thm:linear_probe_variance,thm:linear_probe_bias"} we obtain that the distribution of features must be isotropic. We now move to nonlinear probing where the standard Gaussian will emerge as the unique optimum.
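A quick numerical illustration of the variance statement, using the closed form $\mathrm{Var}(\hat{\beta}) = \sigma^2 (\mZ^\top\mZ)^{-1}$ at $\lambda=0$; the eigenvalues below are illustrative choices, not values from the paper:

```python
# At lambda = 0 the probe's covariance is Var(beta_hat) = sigma^2 (Z^T Z)^{-1},
# so with covariance eigenvalues {lambda_k} the total variance is proportional
# to sum_k 1/lambda_k. By the AM-HM inequality, this sum is minimized (for a
# fixed sum of eigenvalues) when all eigenvalues are equal, i.e., isotropy.
eig_aniso = [3.0, 1.0]   # illustrative: two distinct eigenvalues
eig_iso = [2.0, 2.0]     # same total energy (sum = 4), isotropic

tr_var_aniso = sum(1.0 / l for l in eig_aniso)  # proportional trace, aniso
tr_var_iso = sum(1.0 / l for l in eig_iso)      # proportional trace, iso
```

Here `tr_var_aniso` $= 1/3 + 1 > 1/2 + 1/2 =$ `tr_var_iso`, matching the theorem's ordering.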

Nonlinear Probing {#sec:nonlinear_probing}
-----------------

To allow for more flexible evaluation of the pretrained encoder $f_{\vtheta}$, it has become increasingly common to work with a nonlinear probe. We analyze two widely-used nonlinear methods: radius-based k-NN [@taunk2019brief; @sun2010adaptive; @zhang2017efficient; @abu2019effects] for its simplicity and kernel methods [@nadaraya1964estimating; @watson1964smooth] for their theoretical tractability.

As in [3.1](#sec:linear_probing){reference-type="ref" reference="sec:linear_probing"}, we ask ourselves which distribution of embeddings would be preferable for a foundation model. We first define our prediction function. The training data consists of the $N$ embeddings along with their training labels $\{(\vz_n,\vy_{n})\}_{n=1}^{N}$. The prediction, using radius-based k-NN for a query vector $\vq$ is formed as $$\begin{aligned}
\widehat{\vy}(\vq) := \frac{1}{|\mathcal{N}_{r_0}(\vq)|}\sum_{n \in \mathcal{N}_{r_0}(\vq)}\vy_n,
\tag{kNN}\label{eq:kNN}\end{aligned}$$ where $\mathcal{N}_{r_0}(\vq) = \{n : \|\vz_n - \vq\| \le r_0\}$. The specific choice of radius $r_0$ controls how many neighbors' labels are averaged to form the query's prediction. The kernel's prediction at a query $\vq\in\mathbb{R}^K$ is given by $$\begin{aligned}
\widehat \vy(\vq)\triangleq \frac{\sum_{n=1}^N K_h(\vq-\vz_n)\vy_n}{\sum_{n=1}^N K_h(\vq-\vz_n)}.\tag{Kernel}\label{eq:NW}\end{aligned}$$
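Both predictors can be sketched in a few lines of NumPy; the toy two-cluster embeddings and the Gaussian choice for $K_h$ are assumptions made for illustration:

```python
import numpy as np

def knn_predict(q, Z, y, r0):
    """Radius-based k-NN (Eq. kNN): average the labels y_n of all
    embeddings z_n within distance r0 of the query q."""
    mask = np.linalg.norm(Z - q, axis=1) <= r0
    return y[mask].mean()

def kernel_predict(q, Z, y, h):
    """Nadaraya-Watson prediction (Eq. Kernel) with a Gaussian K_h."""
    w = np.exp(-0.5 * np.sum((Z - q) ** 2, axis=1) / h**2)
    return (w * y).sum() / w.sum()

# Toy embeddings: two well-separated clusters with labels 0 and 1.
Z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
q = np.array([0.05, 0.0])  # query close to the first cluster
```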

We search over all distributions of $\mZ$ subject to a fixed total-variance constraint, e.g., $\Tr(\Cov(\mZ)) = \kappa_1$ or $\|\Cov(\mZ)\|_F=\kappa_2$. The specific values of $\kappa_1, \kappa_2$ do not affect the optimal distribution shape. Following the same type of derivations as in the linear regime--with the exception of some additional regularity conditions--we are able to precisely identify the isotropic Gaussian as the unique bias-minimizing optimum, as formalized below.

Isotropic Gaussian Optimality The integrated square bias (ISB) over query points is given by $$\begin{aligned}
\text{ISB}_{k\text{-NN}} =\frac{r_0^4}{(K+2)^2}\tau_g^2J(p)+o(r_0^4),&&\text{(k-NN)}\\
\text{ISB}_{\text{kernel}} \le \Big(\frac{h^2\mu_2(K)}{2}\Big)^2 \Big(2 B^2 + 8 L^2J(p)\Big)+o(h^4),&&\text{(kernel)}\end{aligned}$$ and among distributions with a scalar-based covariance constraint, the isotropic Gaussian is the unique minimizer of the integrated square bias. (Proof in [\[proof:knn\_optimal,proof:kernel\_optimal\]](#proof:knn_optimal,proof:kernel_optimal){reference-type="ref" reference="proof:knn_optimal,proof:kernel_optimal"}.)

Numerous additional details and discussions on the regularity assumptions we employed are provided in [8](#sec:additional_nonlinear){reference-type="ref" reference="sec:additional_nonlinear"}. Together, these results establish the isotropic Gaussian distribution as the optimal design to minimize the worst-case risk of a foundation model across downstream tasks.

![Illustration of [\[thm:linear\_probe\_variance\]](#thm:linear_probe_variance){reference-type="ref" reference="thm:linear_probe_variance"} showcasing how anisotropic (**right**) embeddings lead to higher-variance estimators compared to isotropic embeddings (**left**). We sample $100$ training points for the $2$-class classification task and fit a logistic regression--repeating the process over numerous training set samples. Each sampling results in a decision boundary (**purple**).](toy_figures/le/decision_boundaries.pdf){#fig:boundaries width="\\linewidth"}

Geometric and Practical Insights {#sec:le}
--------------------------------

We now empirically validate that the isotropic Gaussian is optimal when no information about downstream tasks is available. We focus on linear probing ([3.1](#sec:linear_probing){reference-type="ref" reference="sec:linear_probing"}), where all considered distributions have the same total variance.

When employing a linear probe, an anisotropic distribution increases both bias (with Tikhonov regularization) and variance. Examining bias first ([\[thm:linear\_probe\_bias\]](#thm:linear_probe_bias){reference-type="ref" reference="thm:linear_probe_bias"}), we present in [24](#fig:ols_bias){reference-type="ref" reference="fig:ols_bias"} visualizations for both continuous regression and discrete classification tasks. We observe that the cosine similarity between estimated and ground-truth parameters equals 1 only for isotropic distributions, degrading for anisotropic cases regardless of sample size or regularization strength. Regarding variance ([\[thm:linear\_probe\_variance\]](#thm:linear_probe_variance){reference-type="ref" reference="thm:linear_probe_variance"}), we show in [1](#fig:boundaries){reference-type="ref" reference="fig:boundaries"} that learned parameters vary significantly more across training sets when the covariance is anisotropic (right) compared to isotropic (left)---even when using logistic regression instead of OLS. [18](#fig:beta_distributions){reference-type="ref" reference="fig:beta_distributions"} further illustrates this effect, showing the distribution of learned $\beta$ parameters across different training samples for both cases. The anisotropic distribution clearly produces higher-variance estimators.

These theoretical and empirical results establish our design principle for LeJEPA: *embeddings $f_{\vtheta}(\vx)$ should follow an isotropic Gaussian distribution to minimize worst-case risk across downstream tasks encountered post-training*. The following section introduces a novel regularizer to achieve this distribution.

SIGReg: Reliable Isotropic Gaussian Regularization in High-Dimension {#sec:bcs}
====================================================================

Having established the isotropic Gaussian as the optimal embedding distribution ([3](#sec:gaussian){reference-type="ref" reference="sec:gaussian"}), we now introduce *Sketched Isotropic Gaussian Regularization* (SIGReg)--a distribution matching objective that is simultaneously (i) *differentiable*, (ii) *scalable*, (iii) *provable*, and (iv) *interpretable*. SIGReg builds on three key innovations. First, we formulate distribution matching as a statistical test under the null hypothesis $P_{\vtheta}=Q$ ([4.1](#sec:general_test){reference-type="ref" reference="sec:general_test"}). Second, we identify a test that guarantees bounded gradients and curvature while maintaining linear complexity and efficient multi-GPU scaling ([4.2](#sec:CF_better){reference-type="ref" reference="sec:CF_better"}). Third, SIGReg bypasses the curse of dimensionality, eliminating collapsed shortcut solutions entirely ([4.3](#sec:dimension){reference-type="ref" reference="sec:dimension"}).

Hypothesis Testing as a Judge {#sec:general_test}
-----------------------------

Asking for $f_{\vtheta}(\vx)$'s distribution $P_{\vtheta}$ to match a target distribution $Q$ is typically done by creating various measures of distance or divergence, and estimating them in high-dimension. We propose a different starting point grounded in statistics. Consider the hypothesis testing framework [@fisher1928statistical; @neyman1933ix] given by $$\begin{aligned}
H_0: P_{\vtheta}=Q \quad \text{vs.} \quad H_1: P_{\vtheta}\neq Q,    \label{eq:null}\end{aligned}$$ with $H_0$ being referred to as the *null hypothesis*. That is, we are asking in [\[eq:null\]](#eq:null){reference-type="ref" reference="eq:null"} if there is enough empirical evidence to reject the null. To answer that question, one (i) employs a *test-statistic*, i.e., a single scalar value summarizing the evidence from the empirical samples, (ii) determines a critical value $\tau_{\alpha}$ for the test-statistic based on the probability $\alpha$ of Type I error, i.e., of mistakenly rejecting a true null hypothesis, (iii) compares the test-statistic to the critical value $\tau_{\alpha}$; if the test-statistic exceeds $\tau_{\alpha}$, reject the null hypothesis. If the null is not rejected, we can only claim that *there is not sufficient empirical evidence against $P_{\vtheta}=Q$*.

As it stands, [\[eq:null\]](#eq:null){reference-type="ref" reference="eq:null"} remains impractical in large dimension as existing tests have at least quadratic complexity with the number of samples considered (more details in [13](#sec:multivariate_tests){reference-type="ref" reference="sec:multivariate_tests"}). We thus propose to derive a sketching strategy by decomposing [\[eq:null\]](#eq:null){reference-type="ref" reference="eq:null"} into simpler univariate tests. Denoting the push-forward distributions $P_\vtheta^{(\va)}\triangleq (\va^\top)_\# P_\vtheta$ and $Q^{(\va)}\triangleq (\va^\top)_\# Q$, we can define the following *directional* univariate test $$\begin{aligned}
    H_0(\va):  P_\vtheta^{(\va)} = Q^{(\va)}\;\;\text{vs.} \;\; H_1(\va): P_\vtheta^{(\va)} \not = Q^{(\va)},   \label{eq:HUV}\end{aligned}$$ for a given directional unit-norm vector $\va \in \mathcal{S}^{K-1}$. The corresponding *directional test-statistic* of [\[eq:HUV\]](#eq:HUV){reference-type="ref" reference="eq:HUV"} is computed as $T(\{\va^\top f_{\vtheta}(\vx_n)\}_{n=1}^N)$. Examples of tests $T$ will be provided in the later [\[sec:tests\]](#sec:tests){reference-type="ref" reference="sec:tests"}. Repeating that process over a set of $M$ directions $\sA\triangleq \{ \va_1,\dots,\va_M\}$ and aggregating the individual values leads to the following *global test-statistic* $$\begin{aligned}
T_{\sA}(\{f_{\vtheta}(\vx_n)\}_{n=1}^N)\triangleq \max_{\va \in \sA} T(\{\va^\top f_{\vtheta}(\vx_n)\}_{n=1}^N).\label{eq:T_max}\end{aligned}$$ We now provide a formal statement asserting the consistency of [\[eq:T\_max\]](#eq:T_max){reference-type="ref" reference="eq:T_max"} to test the original multivariate null hypothesis from [\[eq:null\]](#eq:null){reference-type="ref" reference="eq:null"}. Our result leverages the well-known union-intersection principle [@roy1953heuristic], and a slightly modified Cramér-Wold theorem. We denote by $\stackrel{d}{=}$ equality in distribution.

Hyperspherical Cramér-Wold Let $X,Y$ be $\mathbb{R}^d$-valued random vectors, then $$\langle \vu,X\rangle \stackrel{d}{=} \langle \vu,Y\rangle, \forall \vu \in \mathbb{S}^{d-1}\iff X \stackrel{d}{=} Y.$$ Convergence in distribution also holds. (Proof in [9.8](#proof:spherical_cramer){reference-type="ref" reference="proof:spherical_cramer"}.)

Sufficiency of directional tests The statistic in [\[eq:T\_max\]](#eq:T_max){reference-type="eqref" reference="eq:T_max"} yields a valid statistical test for [\[eq:null\]](#eq:null){reference-type="ref" reference="eq:null"} as $$\begin{aligned}
P=Q&\implies\limsup_{n\to\infty}\Pr\left(T_{\sA}(\{f_{\vtheta}(\vx_n)\}_{n=1}^N) \ge \tau_\alpha\right)\le \alpha,\text{\bf (level)}
\\
P\neq Q& \implies\limsup_{n\to\infty}\Pr\left(T_{\sA}(\{f_{\vtheta}(\vx_n)\}_{n=1}^N) \ge \tau_\alpha\right)=1,\text{\bf (power)}\end{aligned}$$ (Proof in [9.9](#proof:bcs){reference-type="ref" reference="proof:bcs"}.)

The assumptions required in the proof of [\[thm:bcs\]](#thm:bcs){reference-type="ref" reference="thm:bcs"} hold for classical consistent univariate tests $T$ such as the ones presented in the following [\[sec:tests\]](#sec:tests){reference-type="ref" reference="sec:tests"}.
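As a concrete sketch of the max-aggregated statistic in [\[eq:T\_max\]](#eq:T_max){reference-type="eqref" reference="eq:T_max"}, the snippet below uses a placeholder two-moment discrepancy as a stand-in for $T$ (an assumption for illustration, not one of the recommended tests); a collapsed embedding is flagged in every direction:

```python
import numpy as np

rng = np.random.default_rng(0)

def T(z):
    """Placeholder univariate statistic (for illustration only): squared
    discrepancy of the first two moments from those of N(0, 1)."""
    return len(z) * (z.mean() ** 2 + (z.var() - 1.0) ** 2)

def T_global(Z, M=64):
    """Global statistic of Eq. (T_max): worst case of the directional
    statistic over M random unit-norm directions on S^{K-1}."""
    A = rng.standard_normal((Z.shape[1], M))
    A /= np.linalg.norm(A, axis=0)
    return max(T(Z @ A[:, m]) for m in range(M))

Z_good = rng.standard_normal((4096, 8))  # approximately isotropic Gaussian
Z_bad = np.zeros((4096, 8))              # fully collapsed embeddings
t_good, t_bad = T_global(Z_good), T_global(Z_bad)
```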

![Examples of distributions living on the surface of the sphere with varying Sobolev smoothness coefficients $\alpha$. As per [\[thm:spherical\_bounds\]](#thm:spherical_bounds){reference-type="ref" reference="thm:spherical_bounds"}, the greater $\alpha$ is, the more global the impact of SIGReg will be for a given number of directions $M$. Practically, this represents the distribution of the encoder's output. Because the target density (isotropic Gaussian) is smooth, the $\alpha$ coefficients of the embedding will quickly grow, thereby making SIGReg ([\[def:bcs\]](#def:bcs){reference-type="ref" reference="def:bcs"}) immune to the curse of dimensionality.](toy_figures/3d_sphere_0.png){#fig:sobolev width="\\linewidth"}

![Examples of distributions living on the surface of the sphere with varying Sobolev smoothness coefficients $\alpha$. As per [\[thm:spherical\_bounds\]](#thm:spherical_bounds){reference-type="ref" reference="thm:spherical_bounds"}, the greater $\alpha$ is, the more global the impact of SIGReg will be for a given number of directions $M$. Practically, this represents the distribution of the encoder's output. Because the target density (isotropic Gaussian) is smooth, the $\alpha$ coefficients of the embedding will quickly grow, thereby making SIGReg ([\[def:bcs\]](#def:bcs){reference-type="ref" reference="def:bcs"}) immune to the curse of dimensionality.](toy_figures/3d_sphere_1.png){#fig:sobolev width="\\linewidth"}

![image](toy_figures/2d_slicing_example_2.png){width="\\linewidth"}

SIGReg: Sketching the Epps-Pulley Test is Stable and Scalable {#sec:CF_better}
-------------------------------------------------------------

[\[sec:tests\]]{#sec:tests label="sec:tests"}

Our proposed regularizer--coined Sketched Isotropic Gaussian Regularization (SIGReg)--follows directly from [\[thm:bcs\]](#thm:bcs){reference-type="ref" reference="thm:bcs"} using any statistical test $T$ targeted towards the isotropic Gaussian, illustrated in [\[fig:bcs\_teaser,fig:toy\_2d\]](#fig:bcs_teaser,fig:toy_2d){reference-type="ref" reference="fig:bcs_teaser,fig:toy_2d"}, and formalized below.

SIGReg (PyTorch code in [\[lst:epps-pulley-pytorch\]](#lst:epps-pulley-pytorch){reference-type="ref" reference="lst:epps-pulley-pytorch"}) SIGReg sketches a statistical test $T$ towards isotropic Gaussian $$\begin{gathered}
    {\rm SIGReg}_{T}(\sA,\{f_{\vtheta}(\vx_n)\}_{n=1}^{N})\triangleq\frac{1}{|\sA|}\sum_{\va\in\sA}T(\{\va^\top f_{\vtheta}(\vx_n)\}_{n=1}^{N}),\tag{SIGReg}\label{eq:BCS}\end{gathered}$$ where we recommend the Epps-Pulley test ([4.2.3](#sec:cf_tests){reference-type="ref" reference="sec:cf_tests"}) for $T$.

We replace the maximum over $\va\in\sA$ in [\[thm:bcs\]](#thm:bcs){reference-type="ref" reference="thm:bcs"} by an average in [\[eq:BCS\]](#eq:BCS){reference-type="eqref" reference="eq:BCS"} to avoid sparse gradients over the directions in $\sA$. We now delve into the choice of $T$, comparing well-known candidate tests from statistics categorized into (i) moment-based ([4.2.1](#sec:moment_tests){reference-type="ref" reference="sec:moment_tests"}), (ii) CDF-based ([4.2.2](#sec:cdf_tests){reference-type="ref" reference="sec:cdf_tests"}), and (iii) CF-based ([4.2.3](#sec:cf_tests){reference-type="ref" reference="sec:cf_tests"}) statistics--ultimately justifying our choice of the Epps-Pulley statistic.

### Moments are Unstable and Insufficient {#sec:moment_tests}

The first family of statistics we consider is moment-based. Taking the standard Gaussian as an instantiation for the moments, we can define the Jarque-Bera [@jarque1980efficient] test that compares the third and fourth moments, i.e., skewness and kurtosis, as $$\begin{aligned}
    {\rm JB}(\vu)\triangleq&\frac{N}{6}\left(\widehat{\rm skew}(\vu)^{2}+\left(\frac{\widehat{\rm kurt}(\vu)-3}{2}\right)^{2}\right)\tag{Jarque-Bera}\label{eq:jarque},\end{aligned}$$ where $\widehat{\rm skew}$ is the skewness computed from the data as $\frac{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\hat{\mu}\right)^{3}}{\hat{\sigma}^{3}}$ and $\widehat{\rm kurt}$ is the kurtosis $\frac{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\hat{\mu}\right)^{4}}{\hat{\sigma}^4}$. Typically, the [\[eq:jarque\]](#eq:jarque){reference-type="eqref" reference="eq:jarque"} test is used to see if a density follows a Gaussian distribution of any mean and variance--hence it only looks at moments 3 and 4. In our case we aim for a standard Gaussian test and thus add the usual statistics on the first two moments, leading to the extended test $$\begin{aligned}
{\rm EJB}(\vu)\triangleq&\frac{N\hat{\mu}(\vu)^2}{\hat{\sigma}(\vu)^2}+\frac{(N-1)\left(\hat{\sigma}(\vu)^2-1\right)^2}{2}+{\rm JB}(\vu)\tag{Extended Jarque-Bera}\label{eq:ejarque}.\end{aligned}$$ The [\[eq:ejarque\]](#eq:ejarque){reference-type="eqref" reference="eq:ejarque"} test acts as a moment matching problem over the first four moments. Such moment matching methods have proven powerful not only for statistical tests but also as a means to learn parametric and nonparametric models of data.
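A direct NumPy transcription of [\[eq:ejarque\]](#eq:ejarque){reference-type="eqref" reference="eq:ejarque"}; the sample size and the uniform alternative are illustrative assumptions:

```python
import numpy as np

def extended_jarque_bera(u):
    """Extended Jarque-Bera: the two first-moment terms targeting
    N(0, 1), plus the classical JB skewness/kurtosis terms."""
    N = len(u)
    mu, sigma = u.mean(), u.std()
    skew = ((u - mu) ** 3).mean() / sigma**3
    kurt = ((u - mu) ** 4).mean() / sigma**4
    jb = N / 6.0 * (skew**2 + (kurt - 3.0) ** 2 / 4.0)
    return N * mu**2 / sigma**2 + (N - 1) * (sigma**2 - 1.0) ** 2 / 2.0 + jb

rng = np.random.default_rng(0)
ejb_gauss = extended_jarque_bera(rng.standard_normal(10_000))  # matches target
ejb_unif = extended_jarque_bera(rng.uniform(-3, 3, 10_000))    # wrong law
```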

**The Stability and Identifiability Conundrum.** We now explain why moment-based tests--albeit powerful--are not suited for LeJEPA. The $k^{\rm th}$ moment of a distribution $P$ is denoted $m_k(P)$. The first observation is that well-behaved distributions satisfying Carleman's condition $\sum_{k=1}^\infty m_{2k}(Q)^{-1/(2k)} = \infty$ [@carleman1926fonctions], such as the Gaussian, or distributions supported on a finite interval [@hausdorff1923momentprobleme], are uniquely determined by their moments. However, using a finite number of moments creates the following non-identifiability issue, which is well-known in statistics and often used as a motivation to use *all* moments [@lehmann2005testing].

Insufficiency of K Moments Minimizing the following objective with $c_k>0, \forall k$ $$\sum_{k=1}^K c_k\left(m_k\left(P_{\vtheta}^{(\va)}\right)-m_k\left(Q^{(\va)}\right)\right)^2,$$ for finite $K$ does not imply $P_{\vtheta}^{(\va)}=Q^{(\va)}$. (Proof in [9.11](#proof:moment_conendrum){reference-type="ref" reference="proof:moment_conendrum"}.)

Hence [\[thm:moment\_conendrum\]](#thm:moment_conendrum){reference-type="ref" reference="thm:moment_conendrum"} prescribes employing as large a $K$ as possible to remove collapsed shortcut solutions by making the distribution matching accurate. Yet, doing so leads to unstable gradient-based training: the gradient norm scales as $O(k)$ and the variance of Monte Carlo gradient estimates grows as $O(k^2 m_{2(k-1)})$ for the $k$-th moment, since $\big\|\nabla_\theta m_k(P_{\vtheta}^{(\va)})\big\|=\|\mathbb{E}\big[k(\va^\top f_{\vtheta}(\vx))^{k-1}\va^\top J_{f_\vtheta}(\vx)\big]\|$, with $J_{f_\vtheta}(\vx)\in\mathbb{R}^{K\times P}$ the Jacobian matrix--thereby creating an impractical situation where training stability and identifiability cannot be achieved simultaneously.
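The gradient growth can be checked numerically with autograd; the grid of samples below is an illustrative assumption:

```python
import torch

# The gradient of the k-th empirical moment w.r.t. each sample is
# k * z^{k-1} / N, so its norm explodes with k whenever |z| > 1.
z = torch.linspace(-3.0, 3.0, 101, requires_grad=True)

def moment_grad_norm(k):
    m_k = (z ** k).mean()               # k-th empirical moment
    (g,) = torch.autograd.grad(m_k, z)  # gradient w.r.t. the samples
    return g.norm().item()

low, high = moment_grad_norm(2), moment_grad_norm(8)  # high >> low
```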

![image](toy_figures/nonparametric/2d_slicing_dim_1024_N_100_slices_10.png){width="\\linewidth"}

### Cumulative Density Functions are Impractical {#sec:cdf_tests}

The second family of tests acts upon the CDF. Because these tests require sorting, denote the $k^{\rm th}$ order-statistic of $N$ samples by $x_{k:N}$. Two highly standard tests are quadratic Empirical Distribution Function statistics with different weighting, known as Cramér-von Mises [@cramer1928composition; @von1981probability] and Anderson-Darling [@anderson1952asymptotic], given by $$\begin{aligned}
    T_{w}&=N \int_{-\infty}^{\infty}\left(F_{N}(x)-F(x)\right)^{2} w(x) d F(x)\nonumber\\
    w(x)&=1,\tag{Cramér-von Mises}\label{eq:cramer}\\
    w(x)&=[F(x)(1-F(x))]^{-1}, \tag{Anderson-Darling}\label{eq:anderson_darling}\end{aligned}$$ where $w(x)$ is a weighting function. Adding the $U^2$ statistic on top of [\[eq:cramer\]](#eq:cramer){reference-type="ref" reference="eq:cramer"} recovers the Watson test [@watson1961goodness] $$U^{2}=T_{w}-N\left(\bar{F}-\frac{1}{2}\right)^{2}.\tag{Watson} \label{eq:watson}$$ We do not consider the Kolmogorov-Smirnov test [@kolmogorov1933] as it employs the $\ell_{\infty}$-norm instead of the $\ell_2$-norm, thereby producing sparse gradients. Another common test is the Shapiro-Wilk test [@shapiro1965analysis], which we found to be unstable in practice--details are provided in [12](#sec:shapiro_wilk){reference-type="ref" reference="sec:shapiro_wilk"}.
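For concreteness, a NumPy sketch of the Cramér-von Mises statistic against a standard normal target (the target and sample sizes are illustrative), via its standard computable form $W^2 = \frac{1}{12N} + \sum_{k=1}^{N}\big(F(x_{k:N}) - \frac{2k-1}{2N}\big)^2$; note the sort, precisely the operation that hinders differentiability and multi-GPU scaling:

```python
import numpy as np
from math import erf, sqrt

def cramer_von_mises(x):
    """Cramer-von Mises statistic against a N(0, 1) target, using the
    standard computable form based on order statistics x_{k:N}."""
    x = np.sort(x)  # the very sort that hinders differentiability
    N = len(x)
    F = 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))  # Phi(x_{k:N})
    k = np.arange(1, N + 1)
    return 1.0 / (12 * N) + np.sum((F - (2 * k - 1) / (2 * N)) ** 2)

rng = np.random.default_rng(0)
w2_gauss = cramer_von_mises(rng.standard_normal(2000))        # matches target
w2_shift = cramer_von_mises(rng.standard_normal(2000) + 1.0)  # mean shift
```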

**Lack of Scalability and Differentiability.** CDF-based tests require sorting which, despite highly optimized $\mathcal{O}(N \log N)$ algorithms such as Quicksort [@quicksort], breaks the embarrassingly parallel nature of SGD--especially on multi-GPU setups [@tanasic2013comparison; @maltenberger2022evaluating] due to synchronization requirements. Moreover, these tests involve non-differentiable operations (sorting and order statistics), making them unsuitable for gradient-based optimization without relaxations [@cuturi2019differentiable; @grover2019stochastic; @petersen2022monotonic]. While intricate sketching solutions exist [@dunning2019computing; @masson2019ddsketch; @dunning2021t], each of them introduces numerous additional hyper-parameters--going against our first motivation for LeJEPA.

### Characteristic Functions are Stable, Scalable and Identifiable {#sec:cf_tests}

The third family of tests is concerned with Empirical Characteristic Functions (ECF), which are the Fourier transform of the density function. The Epps--Pulley test [@epps1983test] is one of the most popular tests and simply compares, in weighted $\ell_2$-norm, the ECF of the data against a target CF $$EP = N \int_{-\infty}^{\infty} \left| \hat{\phi}_X(t) - \phi(t) \right|^2 w(t)  dt.\tag{Epps–Pulley}
    \label{eq:epps_pulley}$$ The first crucial observation is that the ECF, defined as $\hat{\phi}_X(t) = \frac{1}{n} \sum_{j=1}^{n} e^{itX_j}$, is naturally differentiable and easily computed in distributed settings via efficient `all_reduce` operations, as it is a simple average of complex exponentials. The weight function is typically Gaussian, e.g., $w(t) = e^{- t^2/\sigma^2}$ with $\sigma$ commonly set to $1$.

Other tests, e.g., based on entropy [@szekely2005new], are not considered here as they require numerous additional design choices for univariate entropy estimation [@silverman2018density; @beirlant1997nonparametric], e.g., using kernels [@joe1989estimation] or M-estimators [@miller2003new].

**Epps-Pulley has bounded loss, gradient and curvature.** We now compare the two families of tests still in contention: moment-based and CF-based. First, recall that moments are polynomial in the data, with extreme growth rates for higher moments--assuming they even exist. Even for well-behaved distributions, raising values to a power of $k$ can quickly lead to exploding gradients. This comes in sharp contrast with the ECF, which is always bounded and has bounded gradients for any input distribution of the projected samples $z_n = \va^\top f_\theta(\vx_n)$, $n=1,\ldots,N$.

Stability of Epps-Pulley Test [\[eq:epps\_pulley\]](#eq:epps_pulley){reference-type="eqref" reference="eq:epps_pulley"} satisfies for samples $z_1,\dots,z_N$ $$\left|\frac{\partial EP(\mathbf{a})}{\partial z_i}\right| \le \frac{4\sigma^2}{N}, 
\quad
\left|\frac{\partial^2 EP(\mathbf{a})}{\partial z_i^2}\right| \le \frac{C\sqrt{\pi}\sigma^3}{2N},$$ with constant $C$, and bandwidth $\sigma$. (Proof in [9.12](#proof:ecf_stability){reference-type="ref" reference="proof:ecf_stability"}.)

``` {#lst:epps-pulley-pytorch caption="SIGReg with Epps-Pulley statistic with DDP support and $\\mathcal{O}(N)$ time and memory complexity. x is a (N, K) tensor, num\\_slices is $|\\sA|$ in \\cref{def:bcs}, `global\\_step` is used for sync. sampling across GPUs and can be omitted for single-GPU training. An optimized implementation with caching is also provided in our official codebase, computation times provided in \\cref{tab:times}." label="lst:epps-pulley-pytorch"}
import torch
import torch.distributed as dist

def SIGReg(x, global_step, num_slices=256):
    # slice sampling -- synced across devices --
    dev = dict(device=x.device)
    g = torch.Generator(**dev)
    g.manual_seed(global_step)
    proj_shape = (x.size(1), num_slices)
    A = torch.randn(proj_shape, generator=g, **dev)
    A /= A.norm(p=2, dim=0)
    # -- Epps-Pulley stat. see Sec. 4.3 for alt. --
    # integration points
    t = torch.linspace(-5, 5, 17, **dev)
    # theoretical CF for N(0, 1) and Gauss. window
    exp_f = torch.exp(-0.5 * t**2)
    # empirical CF -- averaged across devices --
    x_t = (x @ A).unsqueeze(2) * t  # (N, M, T)
    ecf = (1j * x_t).exp().mean(0)
    if dist.is_available() and dist.is_initialized():
        # average the per-device ECFs (sum, then divide)
        dist.all_reduce(ecf, op=dist.ReduceOp.SUM)
        world_size = dist.get_world_size()
        ecf = ecf / world_size
    else:
        world_size = 1
    # weighted L2 distance to the target CF
    err = (ecf - exp_f).abs().square().mul(exp_f)
    N = x.size(0) * world_size
    return torch.trapz(err, t, dim=1) * N
```

By the chain rule, [\[thm:ecf\_stability\]](#thm:ecf_stability){reference-type="ref" reference="thm:ecf_stability"} directly gives $\left\|\nabla_\theta EP(\mathbf{a})\right\| \le \frac{4\sigma^2}{N} \sum_{i=1}^N \|\mathbf{a}^\top \nabla_\theta f_\theta(\mathbf{x}_i)\|$, providing stable gradients. The limitations of moment-based and CDF-based tests coupled with [\[thm:ecf\_stability\]](#thm:ecf_stability){reference-type="ref" reference="thm:ecf_stability"} justify our choice of the [\[eq:epps\_pulley\]](#eq:epps_pulley){reference-type="eqref" reference="eq:epps_pulley"} test: (i) DDP-friendly and scalable, (ii) uniformly bounded gradients and curvature regardless of input distribution, and (iii) a hyper-parameter-free implementation. Lastly, we highlight that *our implementation has a linear memory and computational complexity of $\mathcal{O}(N)$, with $N$ the minibatch size*. The implementation of SIGReg using this statistical test is provided in [\[lst:epps-pulley-pytorch\]](#lst:epps-pulley-pytorch){reference-type="ref" reference="lst:epps-pulley-pytorch"}, along with computation times of the forward-backward pass in [7](#tab:times){reference-type="ref" reference="tab:times"}.
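A minimal single-GPU sketch of how such a regularizer could slot into a pretraining loop; the tiny encoder, predictive loss, direction count, and $0.5$ weighting are hypothetical placeholders, and `epps_pulley` is a compact non-distributed variant of the statistic, not the official implementation:

```python
import torch
import torch.nn as nn

def epps_pulley(z, t=torch.linspace(-5, 5, 17)):
    """Single-GPU Epps-Pulley term for projected samples z of shape (N,)."""
    exp_f = torch.exp(-0.5 * t**2)                  # CF of N(0, 1), also window
    ecf = (1j * z.unsqueeze(1) * t).exp().mean(0)   # empirical CF
    err = (ecf - exp_f).abs().square() * exp_f
    return torch.trapz(err, t) * z.size(0)

enc = nn.Linear(8, 4)                               # hypothetical tiny encoder
opt = torch.optim.SGD(enc.parameters(), lr=0.1)
for step in range(5):
    x = torch.randn(64, 8)
    z = enc(x + 0.1 * torch.randn_like(x))          # embedding of a noisy view
    z_tgt = enc(x).detach()                         # target view, stop-gradient
    A = torch.randn(4, 16)
    A = A / A.norm(dim=0)                           # 16 random unit directions
    sigreg = torch.stack([epps_pulley(z @ A[:, m]) for m in range(16)]).mean()
    loss = (z - z_tgt).pow(2).mean() + 0.5 * sigreg # prediction + regularizer
    opt.zero_grad(); loss.backward(); opt.step()
```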

As a last step before introducing LeJEPA, we ought to study the requirements on the number of directions ($|\sA|$) for [\[def:bcs\]](#def:bcs){reference-type="eqref" reference="def:bcs"} to be effective in high-dimension.

How SIGReg Beats the Curse of Dimensionality {#sec:dimension}
--------------------------------------------

This last section seeks to characterize how many slices in $\sA$ one must sample for [\[eq:BCS\]](#eq:BCS){reference-type="eqref" reference="eq:BCS"} to be an effective statistical test. That design is crucial if we hope for LeJEPA to successfully converge towards isotropic Gaussian embeddings.

### Smoothness Beats the Curse of Dimensionality {#smoothness-beats-the-curse-of-dimensionality .unnumbered}

Our first argument for a favorable scaling of $|\sA|$ with the embedding dimension $K$ relies on the smoothness of $P_{\vtheta}$ as measured by its Sobolev regularity $\alpha$ [@adams2003sobolev]. We formalize below a bound on the directional test from [\[eq:HUV\]](#eq:HUV){reference-type="ref" reference="eq:HUV"} over all possible directions $\va$ when the test statistic is minimized over $|\sA|=M$ directions. While we state the bound on the expected discrepancy over random directions $\va$ when the EP test is satisfied (equals zero) on a finite set of directions, the provided proof covers moment-based and CDF-based tests as well.

![Expected directional statistic at the end of training (**y-axis**) for varying $M$ (number of directions used at each training step, **x-axis**). The $M$ directions are either resampled (**green**) or kept fixed (**blue**) at each training step. While for fixed directions we benefit from the bound of [\[thm:spherical\_bounds\]](#thm:spherical_bounds){reference-type="ref" reference="thm:spherical_bounds"}, where increasing $M$ reduces the overall expected loss, resampling at every step provides a significant coverage boost for free.](toy_figures/conjecture_scaling.pdf){#fig:resampling width="\\linewidth"}

**Unified Error Bounds.** Let $p_{\vtheta}\in H^\alpha(\mathbb{R}^K)$, $\va \sim \mathcal{U}(\mathcal{S}^{K-1})$, and [\[eq:epps\_pulley\]](#eq:epps_pulley){reference-type="eqref" reference="eq:epps_pulley"}$=0$, i.e., $P_\theta^{(\mathbf{a})} = Q^{(\mathbf{a})}, \forall \va \in \sA$, then $$\begin{gathered}
        \mathbb{E}_{\va} \left[ \int_{\mathbb{R}} \left| \varphi_a(t) - \varphi_{\mathcal{N}}(t) \right|^2 dt \right]
    \leq
    C(K, \alpha) |\sA|^{-2\alpha/(K-1)} \\\times\int_0^\infty \left\| \varphi_{\cdot}(r) - \varphi_{\mathcal{N}}(r) \right\|_{H^\alpha(\mathcal{S}^{K-1})}^2 dr,
    \end{gathered}$$ (Proof in [9.10](#proof:spherical_bounds){reference-type="ref" reference="proof:spherical_bounds"}.)

As $|\sA| \to \infty$, the bound decays as $|\sA|^{-2\alpha/(K-1)}$, showing that $|\sA| = O(K)$ directions suffice for $\epsilon$-approximation when $\alpha$ is large. Some examples of embedding densities with varying $\alpha$ are provided in [3](#fig:sobolev){reference-type="ref" reference="fig:sobolev"}. The bound thus characterizes how the $M$ sampled directions constrain the entire space as a function of $\alpha$. The constant $C(K, \alpha) = \frac{2^{2\alpha} \pi^{(K-1)/2} \Gamma\left(\alpha + \frac{K-1}{2}\right)}{(K-1) \Gamma(\alpha) \Gamma\left(\frac{K-1}{2}\right)}$ is visualized in [\[fig:bound\_example\]](#fig:bound_example){reference-type="ref" reference="fig:bound_example"} (left), depicting how $\alpha$ and $|\sA|$ interact. In words, thanks to the natural smoothness of DNs--stemming either from the architecture or from the implicit and explicit regularizers used during training--applying SIGReg on $|\sA|$ directions can be sufficient to tightly constrain the entire space. We note that considering the worst case over $\va$ or using low-discrepancy sequences for $\va$ does not impact the asymptotic bounds; details are provided in [11](#sec:low_discrepancy){reference-type="ref" reference="sec:low_discrepancy"}.
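To build intuition for how $C(K,\alpha)$ and the $|\sA|^{-2\alpha/(K-1)}$ decay interact, the bound can be evaluated numerically in log-space. The sketch below is our own illustration (not part of the official code), with `log_bound` a hypothetical helper:

```python
import math

def log_C(K: int, alpha: float) -> float:
    """log of C(K, alpha) = 2^{2a} pi^{(K-1)/2} Gamma(a + (K-1)/2)
    / ((K-1) Gamma(a) Gamma((K-1)/2)), computed via lgamma for stability."""
    h = (K - 1) / 2
    return (2 * alpha * math.log(2) + h * math.log(math.pi)
            + math.lgamma(alpha + h)
            - math.log(K - 1) - math.lgamma(alpha) - math.lgamma(h))

def log_bound(K: int, alpha: float, M: int) -> float:
    """log of the prefactor C(K, alpha) * M^{-2 alpha / (K-1)}."""
    return log_C(K, alpha) - (2 * alpha / (K - 1)) * math.log(M)

# Doubling the number of slices M always shrinks the bound by a
# constant factor 2^{-2 alpha/(K-1)}, larger for smoother densities:
K, alpha = 512, 100.0
gain = log_bound(K, alpha, 1024) - log_bound(K, alpha, 2048)
```

For fixed $K$, the per-doubling gain grows linearly with $\alpha$, which is the quantitative sense in which smoothness beats the curse of dimensionality.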

### SGD Beats the Curse of Dimensionality {#sgd-beats-the-curse-of-dimensionality .unnumbered}

Our second argument leverages the iterative nature of DN training. Although we may use only a few hundred directions per step, the cumulative number of sampled directions grows linearly with training time. This resampling effect (illustrated in [4](#fig:resampling){reference-type="ref" reference="fig:resampling"}, bottom) enables rapid convergence. Even a small $|\sA|$ achieves tight distributional matching compared to keeping the set $\sA$ fixed across minibatches (recall [\[thm:spherical\_bounds\]](#thm:spherical_bounds){reference-type="ref" reference="thm:spherical_bounds"}). Our experiments show that $|\sA|$ as low as $16$ can easily outperform a fixed set with $|\sA|$ in the thousands, thanks to the compounding effect of resampling at each minibatch.

### Empirical Validation on Synthetic Data {#sec:exp_tests .unnumbered}

We conclude this section with a controlled experiment applying [\[eq:BCS\]](#eq:BCS){reference-type="eqref" reference="eq:BCS"} with gradient-based training to produce isotropic embeddings. In this setup, we directly treat the embeddings $\mZ$ as parameters that we differentiate and optimize to minimize [\[eq:BCS\]](#eq:BCS){reference-type="eqref" reference="eq:BCS"}. By optimizing the embeddings directly, we can observe the impact of the loss without any constraint or regularization that the architecture would induce. We sample $N$ i.i.d. samples $\vx_{n}$ in a $D$-dimensional space from an isotropic Gaussian distribution--but the first two dimensions are again set to the adversarial \`\`X" shape. That is, among the $D$ dimensions, only two must be transformed, as all the others already obey the isotropic Gaussian target distribution. We then make the samples $\vx_{n}$ differentiable and optimize them to minimize the value of the different statistical tests computed on $M$ random directions. Those directions are resampled after each gradient step--following the procedure we will employ in LeJEPA. We present the results in [\[fig:nonparametric\]](#fig:nonparametric){reference-type="ref" reference="fig:nonparametric"}, demonstrating that even in the challenging case of $D=512$ and $M=16$, SIGReg detects the two degenerate dimensions and unfolds them to match the target distribution.
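A minimal sketch of this controlled experiment, using an Epps-Pulley-style sliced discrepancy with trapezoidal quadrature; the dimension, optimizer, and step count here are our own choices for a quick toy run (the paper scales $D$ up to 512):

```python
import torch

def sliced_ep(x, M, t, target_cf):
    """Epps-Pulley-style discrepancy averaged over M random slices of x (N, D)."""
    a = torch.nn.functional.normalize(torch.randn(x.size(1), M), dim=0)
    arg = (x @ a).unsqueeze(-1) * t                        # (N, M, T)
    err = ((torch.cos(arg).mean(0) - target_cf) ** 2
           + torch.sin(arg).mean(0) ** 2) * target_cf      # weighted |ecf - cf|^2
    return torch.trapz(err, t, dim=-1).mean()

torch.manual_seed(0)
N, D = 2048, 8                          # toy sizes; the paper goes up to D=512
x = torch.randn(N, D)
x[:, 1] = torch.where(torch.rand(N) < 0.5, x[:, 0], -x[:, 0])  # adversarial "X"
x.requires_grad_(True)

t = torch.linspace(-5.0, 5.0, 17)       # quadrature knots
target_cf = torch.exp(-0.5 * t ** 2)    # characteristic function of N(0, 1)
opt = torch.optim.Adam([x], lr=1e-2)

with torch.no_grad():
    l_init = sliced_ep(x, 256, t, target_cf).item()
for step in range(500):                 # directions resampled at every step
    loss = sliced_ep(x, 16, t, target_cf)
    opt.zero_grad(); loss.backward(); opt.step()
with torch.no_grad():
    l_final = sliced_ep(x, 256, t, target_cf).item()
```

Even with only 16 slices per step, the cumulative resampling drives the discrepancy on fresh evaluation slices well below its initial value.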

LeJEPA: Stable and Scalable Implementation {#sec:lejepa}
==========================================

Having established that isotropic Gaussians are the optimal embedding distribution for foundation models ([3](#sec:gaussian){reference-type="ref" reference="sec:gaussian"}) and introduced SIGReg to achieve this distribution ([\[def:bcs\]](#def:bcs){reference-type="ref" reference="def:bcs"}), we now present the complete LeJEPA framework. We first evaluate candidate statistical tests ([\[sec:moment\_tests,sec:cdf\_tests\]](#sec:moment_tests,sec:cdf_tests){reference-type="ref" reference="sec:moment_tests,sec:cdf_tests"}) and identify characteristic function-based tests as optimal for gradient-based training ([4.2.3](#sec:cf_tests){reference-type="ref" reference="sec:cf_tests"}). The full LeJEPA implementation follows in [5.1](#sec:lejepa_implementation){reference-type="ref" reference="sec:lejepa_implementation"}.

LeJEPA: SIGReg + Prediction Loss {#sec:lejepa_implementation}
--------------------------------

``` {#code:lejepa caption="LeJEPA implementation--works out-of-the-box on any dataset, with DDP, with any backbone, e.g., torchvision or timm. For non-ViT architectures (e.g., ResNet), set global\\_views = all\\_views. We use bs for the minibatch size, SIGReg is from \\cref{lst:epps-pulley-pytorch}." label="code:lejepa"}
def LeJEPA(global_views, all_views, lambd):
    """global_views and all_views are lists of tensors, lambd is a scalar"""

    # embedding of global views
    g_emb = forward(torch.cat(global_views))
    # embedding of local views
    # if resnet: skip with a_emb=g_emb
    a_emb = forward(torch.cat(all_views))

    # LeJEPA loss
    centers = g_emb.view(-1, bs, K).mean(0)
    a_emb = a_emb.view(-1, bs, K)
    sim = (centers - a_emb).square().mean()
    sigreg = sum(SIGReg(emb, global_step) for emb in a_emb) / len(a_emb)
    return (1 - lambd) * sim + lambd * sigreg
```

We now discuss the implementation of LeJEPA starting with SIGReg and followed by the prediction and total losses.

**The SIGReg Loss.** We chose [\[eq:epps\_pulley\]](#eq:epps_pulley){reference-type="eqref" reference="eq:epps_pulley"} for its provable boundedness ([\[thm:ecf\_stability\]](#thm:ecf_stability){reference-type="ref" reference="thm:ecf_stability"}) and its scalability. Its implementation follows the equation exactly, except for the integral, which is estimated using a quadrature approximation. We find that the simple trapezoidal quadrature rule is sufficient even with as few knots as $17$, as ablated in [31](#fig:quadrature){reference-type="ref" reference="fig:quadrature"}. In particular, we leverage the symmetry of the integrand to double the number of knots for free, see the official code. On the other hand, the use of minibatches introduces a bias vanishing at rate $\mathcal{O}(1/N)$, as formalized below.
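The quadrature can be sketched as follows: since the integrand is even in $t$, one can integrate over $[0, T]$ and double the result, getting the two-sided value at half the evaluations (a sketch under our assumptions; the official code's knot placement may differ):

```python
import torch

def ep_integral(z: torch.Tensor, T: float = 5.0, knots: int = 17) -> torch.Tensor:
    """Trapezoidal estimate of the Epps-Pulley integral for 1-D samples z.
    The integrand is even in t, so we quadrature over [0, T] and double."""
    t = torch.linspace(0.0, T, knots)
    target = torch.exp(-0.5 * t ** 2)            # CF of N(0, 1): real and even
    arg = z.unsqueeze(-1) * t                    # (N, knots)
    err = ((torch.cos(arg).mean(0) - target) ** 2
           + torch.sin(arg).mean(0) ** 2) * target
    return 2.0 * torch.trapz(err, t)
```

For standard-normal samples the value sits near its small $\mathcal{O}(1/N)$ floor, while non-Gaussian samples yield a visibly larger statistic.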

**Vanishing gradient bias.** The expectation of [\[eq:epps\_pulley\]](#eq:epps_pulley){reference-type="eqref" reference="eq:epps_pulley"} satisfies $$\mathbb{E}\left[\widehat{L}_N(\theta)\right]
=
L(\theta)+\frac{1}{N}\int_{\mathbb{R}} w_s(t)\big(1-|\varphi_P(t)|^2\big)dt,$$ therefore both the loss and its derivative have a bias of order $\mathcal{O}(1/N)$. (Proof in [9.13](#proof:gradient_bias){reference-type="ref" reference="proof:gradient_bias"}.)

Hence, the gradients we obtain from using [\[eq:epps\_pulley\]](#eq:epps_pulley){reference-type="eqref" reference="eq:epps_pulley"} are biased by an explicit $\mathcal{O}(1/N)$ term. We found this bias to be minimal and not a concern even for minibatches as small as 16. Unbiased alternatives include using U-statistic debiasing of $|\phi_\theta|^2$ or sample splitting, which we do not explore in this study. Our final implementation of the SIGReg term with Epps-Pulley statistic is provided in [\[lst:epps-pulley-pytorch\]](#lst:epps-pulley-pytorch){reference-type="ref" reference="lst:epps-pulley-pytorch"}.

**The Prediction Loss.** To standardize notations, we adopt the DINO [@caron2021emerging] setup of generating $V_{g}$ global views and $V_{l}$ local views, leading to a total of $V=V_{g}+V_{l}$ views. We let the indices $v=1,\dots,V_{g}$ of $\vz_{n,v}$ denote the global views; for cases without local views, simply set $V_{l}=0$. The prediction loss is then given by having all views predict the global views as $$\begin{aligned}
    \mathcal{L}_{\rm pred}(\{\vz_{n,v}\}_{v=1}^{V})=&\frac{1}{V_g}\sum_{v=1}^{V_g}\frac{1}{V}\sum_{v'=1}^{V}\left\| \vz_{n,v} - \vz_{n,v'} \right\|_2^2\label{eq:pred1}\\
    =&\frac{1}{V}\sum_{v'=1}^{V}\left\| \frac{1}{V_{\rm g}}\sum_{v=1}^{V_{\rm g}}\vz_{n,v} - \vz_{n,v'} \right\|_2^2\label{eq:pred2}\\
    \triangleq& \frac{1}{V}\sum_{v'=1}^{V}\left\| \bm{\mu}_{n} - \vz_{n,v'} \right\|_2^2,\label{eq:pred_as}\end{aligned}$$ where we denote $\bm{\mu}_n \triangleq \frac{1}{V_g}\sum_{v=1}^{V_g}\vz_{n,v}$; the derivation from [\[eq:pred1\]](#eq:pred1){reference-type="ref" reference="eq:pred1"} to [\[eq:pred2\]](#eq:pred2){reference-type="ref" reference="eq:pred2"} is detailed in [9.6](#proof:pred){reference-type="ref" reference="proof:pred"}.
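In tensor form, with the view embeddings of a minibatch stacked as a $(V, B, K)$ tensor whose first $V_g$ entries hold the global views, the $\bm{\mu}_n$ form above is a couple of lines (our sketch, with hypothetical shapes):

```python
import torch

def pred_loss(z: torch.Tensor, V_g: int) -> torch.Tensor:
    """z: (V, B, K) view embeddings; views 0..V_g-1 are the global views.
    Every view predicts mu_n, the per-sample mean of the global views."""
    mu = z[:V_g].mean(dim=0)                       # (B, K) global-view centers
    return (mu.unsqueeze(0) - z).square().sum(dim=-1).mean()

z = torch.randn(8, 32, 16)                         # V=8 views, B=32, K=16
loss = pred_loss(z, V_g=2)
```

The loss vanishes exactly when every view embedding equals its sample's global-view mean, matching the invariance term's intent.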

**LeJEPA Loss.** The final total loss simply combines the above prediction loss with SIGReg on each view as per $$\begin{gathered}
    \mathcal{L}_{\rm LeJEPA}(\{\vx_{n,v}\}_{n,v=1}^{B,V})=\frac{\lambda}{V}\sum_{v=1}^{V}{\rm SIGReg}(\{\{\vz_{n,v}\}_{n=1}^{B}\})\\
    +\frac{1-\lambda}{B}\sum_{n=1}^{B}\mathcal{L}^{(V_{\rm g})}_{\rm pred}(\{\vz_{n,v}\}_{v=1}^{V}).\tag{LeJEPA}\label{eq:lejepa}\end{gathered}$$

We present [\[eq:lejepa\]](#eq:lejepa){reference-type="eqref" reference="eq:lejepa"}'s implementation in [\[code:lejepa\]](#code:lejepa){reference-type="ref" reference="code:lejepa"}. Altogether, the entire implementation--besides the usual model definitions, optimizers, and data loaders--only takes a few dozen lines of PyTorch ([\[lst:epps-pulley-pytorch,code:lejepa\]](#lst:epps-pulley-pytorch,code:lejepa){reference-type="ref" reference="lst:epps-pulley-pytorch,code:lejepa"}). The absence of prototypes, stop-gradients, and teacher-student networks makes [\[eq:lejepa\]](#eq:lejepa){reference-type="eqref" reference="eq:lejepa"} appealing as it only contains one hyperparameter, $\lambda$, balancing the trade-off between the prediction and isotropic Gaussian terms.

Relation to Prior Work
----------------------

Prior to presenting our experiments ([6](#sec:experiments){reference-type="ref" reference="sec:experiments"}), we conclude by discussing how our proposed LeJEPA and SIGReg objective relate to existing frameworks in the literature.

While no existing solution employs such slicing and distribution matching for JEPAs, similar pipelines exist for generative models and optimal transport. Notably, Sliced Score Matching [@song2020sliced] leverages univariate slicing of the space to ease density estimation for generative models. In a similar vein, the sliced Wasserstein distance [@bonneel2015sliced; @nguyen2023energy] uses such a strategy to speed up and improve optimal transport. Furthermore, when the integral of the [\[eq:epps\_pulley\]](#eq:epps_pulley){reference-type="eqref" reference="eq:epps_pulley"} test is computed exactly, as opposed to our quadrature, each slice's loss value recovers the kernel MMD [@sriperumbudur2010hilbert; @gretton2012kernel; @chwialkowski2016kernel] measuring the distance between two distributions--albeit with quadratic complexity. Lastly, it is possible to recover some existing SSL frameworks in the limit by employing LeJEPA with a particular test--instead of the preferred [\[eq:epps\_pulley\]](#eq:epps_pulley){reference-type="eqref" reference="eq:epps_pulley"}. For example, setting $T(\{x_n\}_{n=1}^{B})={\rm mean}(\{x_n\}_{n=1}^{B})^2+({\rm std}(\{x_n\}_{n=1}^{B})-1)^2$ and using that $T$ with SIGReg in LeJEPA recovers the VICReg SSL method in the limit of a large number of slices. In fact, SIGReg will enforce in expectation that $\mathbb{E}[\mathbf{Z}] = \mathbf{0}$ and $\mathrm{Cov}(\mathbf{Z}) = \mathbf{I}_d$, where $\mathbf{I}_d$ denotes the $d \times d$ identity matrix--derivations provided in [9.14](#proof:vcreg){reference-type="ref" reference="proof:vcreg"}. And since our invariance term is simply the $\ell_2$ distance between the views' embeddings, LeJEPA recovers VICReg for this degenerate statistical test.
Based on [\[thm:moment\_conendrum\]](#thm:moment_conendrum){reference-type="ref" reference="thm:moment_conendrum"}, we however strongly advocate against such a setting as it would lead to shortcut solutions--a phenomenon already observed in VICReg.
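For concreteness, the degenerate moment test discussed above can be written in a few lines (our sketch of the limiting construction, shown for illustration only since we advocate against it):

```python
import torch

def sigreg_moment(z: torch.Tensor, M: int = 256) -> torch.Tensor:
    """SIGReg with T({x_n}) = mean({x_n})^2 + (std({x_n}) - 1)^2 averaged
    over M random slices of z (B, K) -- the degenerate moment test that
    recovers VICReg-style mean/covariance constraints in the many-slice limit."""
    a = torch.nn.functional.normalize(torch.randn(z.size(1), M), dim=0)
    p = z @ a                                       # (B, M) sliced embeddings
    return (p.mean(0) ** 2 + (p.std(0) - 1.0) ** 2).mean()
```

The statistic is near zero for standardized isotropic embeddings and large for collapsed ones, but, as [\[thm:moment\_conendrum\]](#thm:moment_conendrum){reference-type="ref" reference="thm:moment_conendrum"} warns, matching only two moments leaves ample room for shortcut solutions.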

LeJEPA: Empirical Validation {#sec:experiments}
============================

We now use the LeJEPA implementation described in [5.1](#sec:lejepa_implementation){reference-type="ref" reference="sec:lejepa_implementation"} to demonstrate its effectiveness through comprehensive experiments. We show that LeJEPA: (i) trains reliably across diverse architectures and datasets ([6.1](#sec:hparams){reference-type="ref" reference="sec:hparams"}), (ii) provides an informative training loss for model selection ([6.2](#sec:loss_corr){reference-type="ref" reference="sec:loss_corr"}), (iii) outperforms frontier vision models on small-scale in-domain pretraining ([6.3](#sec:galaxy){reference-type="ref" reference="sec:galaxy"}), (iv) scales successfully to nearly 1 billion parameters on ImageNet-1k ([6.4](#sec:scale){reference-type="ref" reference="sec:scale"}), and (v) learns rich semantic segmentation features without explicit supervision.

LeJEPA's Stability Across Hyper-Parameters and Architectures {#sec:hparams}
------------------------------------------------------------

We now demonstrate LeJEPA's stability across hyperparameters, architectures, and experimental setups. Additional cross-domain stability results are presented in [6.3](#sec:galaxy){reference-type="ref" reference="sec:galaxy"}.

![Inet100 with $400$ pretraining epochs and a ResNet-50 backbone. We depict linear probe performance as a function of $\lambda$ and the number of views $V$ (recall [\[eq:lejepa\]](#eq:lejepa){reference-type="eqref" reference="eq:lejepa"}). We observe that performance is stable over $\lambda$--with **peak performance obtained by slightly adjusting $\lambda$ proportionally to the number of views**. The corresponding performance values are provided in [8](#tab:lambda_perf){reference-type="ref" reference="tab:lambda_perf"}.](toy_figures/performance_vs_lambda.pdf){#fig:lambda_views width="\\linewidth"}

**(a) [\[eq:epps\_pulley\]](#eq:epps_pulley){reference-type="eqref" reference="eq:epps_pulley"} parameters**

::: {#tab:ablations}
  ------------- ------------- ---------- ---------- ----------
  integration   num\_slices   5 knots    17 knots   41 knots
  $[-1,1]$      512           71.82      72.13      72.04
                2048          72.88      72.30      72.69
  $[-3,3]$      512           73.95      74.16      74.04
                2048          75.02      74.68      74.77
  $[-5,5]$      512           73.71      74.21      74.15
                2048          74.50      74.80      74.77
  ------------- ------------- ---------- ---------- ----------

  : ViT-Large/14, inet1k pretraining for 100 epochs, evaluated with frozen-backbone linear probing (top-1 accuracy, %). **LeJEPA's performance is stable across all its hyperparameters**, and while some choices may slightly improve performance, e.g., the number of slices $|\sA|$ and the projector sizes, none lead to a catastrophic collapse.
:::

\
**(b) Number of local/global views**

::: {#tab:ablations}
  ------------------------------------ ------- ------- -------
  \# global\_views ($V_{\rm g}$)       1       2       4
  \# views ($V=V_{\rm g}+V_{\rm l}$)                   
  4                                    53.06   72.26   --
  6                                    58.65   73.07   73.68
  8                                    64.46   74.24   73.94
  10                                   68.97   74.06   75.08
  ------------------------------------ ------- ------- -------

  : Same protocol as (a): ViT-Large/14, inet1k, 100-epoch pretraining, frozen-backbone linear probing (top-1 accuracy, %).
:::

\
**(c) Mini-batch size**

::: {#tab:ablations}
  ------------- ------- ------- ------- -------
  batch\_size   128     256     512     1024
  top-1 (%)     72.20   74.15   74.72   74.07
  ------------- ------- ------- ------- -------

  : Same protocol as (a): ViT-Large/14, inet1k, 100-epoch pretraining, frozen-backbone linear probing (top-1 accuracy, %).
:::

\
**(d) Embedding/Projector dim.**

::: {#tab:ablations}
  ------------- ------- ------- ------- -------
  num\_slices                           
  emb. dim.     512     2048    512     2048
  proj. dim.                            
  64            75.29   75.32   75.50   75.65
  128           74.77   75.09   75.26   75.47
  256           74.56   74.66   75.08   75.02
  512           73.94   74.11   74.81   74.65
  1024          73.65   73.94   74.71   74.79
  ------------- ------- ------- ------- -------

  : Same protocol as (a): ViT-Large/14, inet1k, 100-epoch pretraining, frozen-backbone linear probing (top-1 accuracy, %).
:::

\
**(e) Register tokens**

::: {#tab:ablations}
  ------------- ------- ------- ------- ------- -------
  reg\_tokens   0       1       2       4       8
  num\_slices                                   
  1024          75.14   75.18   75.08   75.34   75.23
  4096          75.61   75.58   75.67   75.63   75.84
  ------------- ------- ------- ------- ------- -------

  : Same protocol as (a): ViT-Large/14, inet1k, 100-epoch pretraining, frozen-backbone linear probing (top-1 accuracy, %).
:::

![image](toy_figures/exps/archs_inet10.png){width="\\linewidth"}

**Stability across standard hyperparameters.** We begin by evaluating LeJEPA on ImageNet-100 and ImageNet-1K. On ImageNet-100, we train a ResNet-50 and vary the number of views and the loss weighting $\lambda$ ([5](#fig:lambda_views){reference-type="ref" reference="fig:lambda_views"}). Performance remains stable across both dimensions, leading us to recommend $\lambda=0.05$ as a robust default. On ImageNet-1K, we train a ViT-Large/14 and explore batch size, as well as the number of global ($V_{\rm g}$) and local ($V_{\rm l}$) views ([5](#tab:ablations){reference-type="ref" reference="tab:ablations"}b). We find that the configuration commonly used in prior work ($V_{\rm g}=2, V_{\rm l}=8$) transfers well to LeJEPA. Notably, LeJEPA achieves competitive performance with batch sizes as small as 128 on ImageNet-1K ([5](#tab:ablations){reference-type="ref" reference="tab:ablations"}c), suggesting reduced memory requirements compared to existing methods. *We thus recommend to use $\lambda=0.05$, $V_{\rm g}=2$, $V_{\rm l}=8$, and batch size $\geq 128$ as starting points*.

**Stability across Epps-Pulley hyperparameters.** We next examine hyperparameters specific to LeJEPA: the number of slices $|\mathcal{A}|$ in SIGReg, the integration domain for the Epps-Pulley test [\[eq:epps\_pulley\]](#eq:epps_pulley){reference-type="eqref" reference="eq:epps_pulley"}, and the number of quadrature points for numerical integration. [5](#tab:ablations){reference-type="ref" reference="tab:ablations"}a shows ablations on ImageNet-1K with ViT-Large/14. Both the integration domain and the number of quadrature points have negligible impact on performance. This is expected: the behavior of the characteristic function around zero already encodes the low-order moments, so the distribution is well-characterized even with a modest integration range. The number of slices $|\mathcal{A}|$ has a modest effect---while more slices slightly improve performance, even 512 slices yield competitive results. *We thus recommend 17 integration points, an integration domain of $[-5,5]$, and 1024 slices as starting points*.

**Stability across architectures.** A key advantage of LeJEPA over recent methods (e.g., IJEPA, DINOv2) is its architecture-agnostic design. While most modern self-supervised methods are tailored to Vision Transformers, LeJEPA works across diverse architecture families without modification. To validate this claim, we pretrain approximately 50 architectures from 8 different families on ImageNet-10, selecting all models in the timm library with fewer than 20M parameters. All models learn high-quality representations, reaching between 91.5% and 95% top-1 accuracy with frozen-backbone linear probing. Models performing well in supervised learning setups, such as ResNets and ViTs, are also the ones to favor for LeJEPA. *We thus recommend standard architectures such as ResNets and ViTs over specialized models like EfficientNet as starting points.*

**Removal of popular heuristics.** In addition to providing reliable performance across models and datasets, LeJEPA's provable construction enables us to *remove* many heuristics traditionally used to prevent collapse. First, prior work has shown both empirically and theoretically that predictors in image JEPA (without asymmetric information) and teacher-student architectures serve primarily to prevent collapse [@grill2020bootstrap; @jing2021understanding; @tian2021understanding; @caron2021emerging; @chen2021empirical]. Removing these components produces *collapsed* encoders, i.e., with performance at chance level. Thanks to LeJEPA's SIGReg loss, we can remove both the predictor and the teacher-student architecture without suffering from collapse, as shown in [\[tab:proj\_pred\]](#tab:proj_pred){reference-type="ref" reference="tab:proj_pred"}. While a teacher-student configuration does provide a small performance boost for ViT models---consistent with observations in supervised learning via Stochastic Weight Averaging [@izmailov2019averagingweightsleadswider]---it is not necessary to prevent collapse. In our setup, we apply SWA on the encoder producing $\mu$ in [\[eq:pred2\]](#eq:pred2){reference-type="ref" reference="eq:pred2"}. Second, recent work demonstrated that register tokens are needed to prevent training instabilities in vision models [@oquab2023dinov2; @simeoni2025dinov3; @darcet2023vision]. We show in [5](#tab:ablations){reference-type="ref" reference="tab:ablations"} that such instabilities likely stem from poorly conditioned training objectives. In contrast, LeJEPA *does not* require register tokens and achieves stable performance with or without them. *We thus recommend training without a predictor or register tokens, and optionally applying SWA with ViTs for a possible performance gain.*

We strive for **simplicity** and thus adopt a unified pretraining pipeline. The following parameters apply to *all* experiments and figures unless stated otherwise in the corresponding caption and come from [6.1](#sec:hparams){reference-type="ref" reference="sec:hparams"}:

-   LeJEPA's implementation is given in [\[code:lejepa\]](#code:lejepa){reference-type="ref" reference="code:lejepa"} with hyperparameter $\lambda$

-   All backbones are from ${\rm timm}$ and all optimizers/schedulers are from ${\rm PyTorch}$ without modifications

-   We employ eight views ($V=8$): two global views ($V_{\rm g}=2$) at resolution 224×224, with local views at 96×96

-   AdamW optimizer with ${\rm lr} \in \{5\cdot10^{-3},5\cdot10^{-4}\}$ and ${\rm wd} \in \{10^{-1},10^{-2},10^{-5}\}$--no scheduler on weight decay, standard linear warm-up with cosine annealing for ${\rm lr}$
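This recipe maps directly onto standard PyTorch components; a sketch where the warm-up length, total steps, and the stand-in model are our assumptions:

```python
import torch

model = torch.nn.Linear(128, 64)                 # stand-in for a timm backbone
opt = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-2)

total_steps, warmup_steps = 10_000, 1_000        # assumed schedule lengths
sched = torch.optim.lr_scheduler.SequentialLR(
    opt,
    schedulers=[
        # linear warm-up from ~0 to the base lr ...
        torch.optim.lr_scheduler.LinearLR(opt, start_factor=1e-3,
                                          total_iters=warmup_steps),
        # ... then cosine annealing for the remaining steps
        torch.optim.lr_scheduler.CosineAnnealingLR(
            opt, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)
for step in range(5):                            # training loop placeholder
    opt.step(); sched.step()
```

Note that, per the recipe, no scheduler is attached to the weight decay itself.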

LeJEPA's Training Loss is Informative of Downstream Performance {#sec:loss_corr}
---------------------------------------------------------------

![image](toy_figures/exps/loss_corr/selection_resnet50_galaxy10.png){width="0.33\\linewidth"} ![image](toy_figures/exps/loss_corr/selection_resnet50_inet10.png){width="0.33\\linewidth"} ![image](toy_figures/exps/loss_corr/selection_ViT-base-14_inet1k.png){width="0.33\\linewidth"}

![Spearman correlation (**y-axis**) between LeJEPA's training loss and downstream accuracy on the dataset's classification task with a frozen backbone and linear evaluation. The **x-axis** varies $\alpha$ in [\[eq:corr\]](#eq:corr){reference-type="ref" reference="eq:corr"} following our scaling law of the loss w.r.t. $\lambda$. Using $\alpha=0$ recovers the plain training loss. We clearly observe a very high correlation already for $\alpha=0$, which further increases up to $99\%$ for $\alpha=0.4$. The entire set of points is obtained across numerous hyper-parameters such as learning rate, weight decay, number of epochs, $\lambda$--demonstrating how **LeJEPA's training loss is strongly predictive of downstream performance**, which can be used for label-free cross-validation.](toy_figures/exps/loss_corr/loss_ablations.pdf){#fig:alignment_loss width="\\linewidth"}

A major challenge in SSL pretraining is the lack of reliable signals conveying the quality of the learned representation. As a result, it is common to monitor a supervised downstream task performance, sometimes supplemented with unsupervised embedding statistics [@agrawal2022alpha; @garrido2023rankme; @thilak2023lidar]. This process is highly limiting since it requires labeled data that is costly and overly specialized. This is further exacerbated in the latest JEPA models where training losses exhibit low correlation with downstream performance--and may not even decrease monotonically during training.

In contrast, we find that LeJEPA's training loss behaves much more favorably--providing us with a meaningful signal on model quality. First, we provide in [\[fig:heatmap\_loss\]](#fig:heatmap_loss){reference-type="ref" reference="fig:heatmap_loss"} the 2D plane spanned by the SIGReg and prediction losses, where a clear trend with downstream task accuracy can be observed. More strikingly, the combined training loss [\[eq:lejepa\]](#eq:lejepa){reference-type="eqref" reference="eq:lejepa"} with mixing coefficient $\lambda$ exhibits a very high Spearman correlation [@spearman1961proof], denoted $\rho_s$, of about $85\%$ with downstream accuracy--which is considered a strong signal. This relationship holds across datasets and architectures. As a result, a lower LeJEPA training loss reliably indicates better downstream performance.

We can further improve this correlation through a simple scaling law based upon the trade-off weighting hyperparameter $\lambda$ $$\begin{aligned}
C^{(\alpha)} = \rho_s\left(\frac{\text{train\_loss}}{\lambda^{\alpha}}, 
\text{test\_accuracy}\right). \label{eq:corr}\end{aligned}$$ By setting $\alpha \approx 0.4$, LeJEPA's training loss is able to achieve nearly 99% correlation with downstream performance across multiple datasets and models. We depict the changes in $C^{(\alpha)}$ as a function of $\alpha$ on multiple datasets and models in [6](#fig:alignment_loss){reference-type="ref" reference="fig:alignment_loss"}, as well as the training LeJEPA loss against downstream performance in [30](#fig:corr_loss_extra){reference-type="ref" reference="fig:corr_loss_extra"}. **The strong alignment between LeJEPA's training loss and model quality enables label-free SSL model selection and cross-validation**.
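The correlation metric $C^{(\alpha)}$ is straightforward to compute; below is a self-contained sketch with a rank-based Spearman implementation (ties ignored; `C_alpha` is our hypothetical helper, not from the official code):

```python
import torch

def spearman(a: torch.Tensor, b: torch.Tensor) -> float:
    """Spearman rank correlation as the Pearson correlation of ranks
    (no tie handling -- sufficient for continuous losses/accuracies)."""
    ra = a.argsort().argsort().float()
    rb = b.argsort().argsort().float()
    ra = (ra - ra.mean()) / ra.std()
    rb = (rb - rb.mean()) / rb.std()
    return (ra * rb).mean().item() * len(a) / (len(a) - 1)

def C_alpha(train_loss, lambd, test_acc, alpha: float) -> float:
    """Rank-correlate the lambda-rescaled training loss with accuracy."""
    return spearman(train_loss / lambd ** alpha, test_acc)
```

Sweeping `alpha` over a grid and picking the value maximizing $|C^{(\alpha)}|$ reproduces the selection procedure described above.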

In-Domain LeJEPA Outperforms Frontier Model Transfer Learning {#sec:galaxy}
-------------------------------------------------------------

![image](toy_figures/exps/galaxy10.png){width="\\linewidth"}

  ---------------- ---------------------- -------- ---------- -------- ----------- ----------- ----------- ----------- ----------- ------------ ----------- ----------- ---------
                                                                                                                                                                        
  shots            model                  params   pretrain   epochs   DTD              aircr.        cars     cifar10    cifar100   flowers102        food        pets      avg.
                   LeJEPA ViT-L           304M     IN-1K      100      **33.21**          9.37        3.40       51.65       27.01        48.53       17.14       46.11     29.55
                   LeJEPA ConvNeXtV2-H    660M     IN-1K      100      32.15              8.07        4.28       50.95       31.48    **48.74**   **17.95**       58.98     31.58
                   I-JEPA ViT-H           632M     IN-1K      300      27.71              9.86        4.33   **56.52**       30.58        44.69       14.53       53.38     30.20
                   I-JEPA ViT-H + STOP    632M     IN-1K      300      26.60         **11.18**    **4.75**       56.27   **35.20**        47.17       15.75   **59.47**     32.05
                   *I-JEPA ViT-H (22K)*   *632M*   *IN-22K*   *900*    *27.98*         *13.00*      *3.45*     *61.84*     *34.70*      *89.72*     *19.62*     *30.86*   *35.15*
                   LeJEPA ViT-L           304M     IN-1K      100      **64.72**         35.25       22.25       85.15       59.77    **92.53**   **50.90**       77.00     60.95
                   LeJEPA ConvNeXtV2-H    660M     IN-1K      100      61.84             30.67       24.46       85.74       63.29        91.78       49.32       78.53     60.70
                   I-JEPA ViT-H           632M     IN-1K      300      57.68             33.82       21.96       88.77       66.42        88.24       43.97       83.23     60.51
                   I-JEPA ViT-H + STOP    632M     IN-1K      300      57.00         **39.77**   **25.21**   **90.09**   **70.32**        90.16       45.68   **85.13**     62.92
                   *I-JEPA ViT-H (22K)*   *632M*   *IN-22K*   *900*    *58.74*         *43.52*     *18.27*     *94.83*     *75.23*      *98.94*     *49.06*     *67.66*     63.28
                   LeJEPA ViT-L           304M     IN-1K      100      **78.30**         57.01       57.28       96.50       83.71    **91.21**   **82.05**       89.74     79.48
                   LeJEPA ConvNeXtV2-H    660M     IN-1K      100      76.60             52.99       54.88       96.15       81.34        91.11       77.64       89.76     77.56
                   I-JEPA ViT-H           632M     IN-1K      300      73.32             56.61       54.47       97.54       86.42        86.47       81.02       92.11     78.50
                   I-JEPA ViT-H + STOP    632M     IN-1K      300      73.87         **61.95**   **61.27**   **98.02**   **87.78**        88.08       81.72   **92.88**     80.70
                   *I-JEPA ViT-H (22K)*   *632M*   *IN-22K*   *900*    *75.67*         *65.39*     *49.79*     *98.46*     *89.95*      *98.54*     *81.58*     *87.19*   *80.82*
  ---------------- ---------------------- -------- ---------- -------- ----------- ----------- ----------- ----------- ----------- ------------ ----------- ----------- ---------

![image](toy_figures/videos/dog_video.png){width="0.33\\linewidth"} ![image](toy_figures/videos/snowbike_video.png){width="0.33\\linewidth"} ![image](toy_figures/videos/bird_video.png){width="0.33\\linewidth"}

![image](toy_figures/pca/n01818515_5551_original.png){width="0.49\\linewidth"} ![image](toy_figures/pca/n01818515_5551_pca.png){width="0.49\\linewidth"}\
![image](toy_figures/pca/n01818515_5770_original.png){width="0.49\\linewidth"} ![image](toy_figures/pca/n01818515_5770_pca.png){width="0.49\\linewidth"}\
![image](toy_figures/pca/n02099601_3788_original.png){width="0.49\\linewidth"} ![**LeJEPA learns rich semantic representations through self-supervised learning.** PCA visualization of last-layer features from LeJEPA (ViT-Large, 100 epochs on ImageNet-1K). For each image, features are independently projected to RGB using the first three principal components. Without any supervision, LeJEPA spontaneously develops semantically meaningful representations: warm colors (red/magenta/pink) consistently capture foreground objects (parrot bodies, dog face), while cool colors (cyan/green/yellow) represent backgrounds and foliage. This emergent object-background separation and perceptual grouping is discovered purely from unlabeled data.](toy_figures/pca/n02099601_3788_pca.png "fig:"){#fig:attention_vis width="0.49\\linewidth"}

A key promise of self-supervised learning is to learn universal representations that generalize across tasks and domains. However, current frontier foundation models (e.g., DINOv2/v3, I-JEPA) are pretrained on natural images, forcing practitioners in specialized domains to collect large amounts of labeled data for supervised finetuning. In fact, most frontier models cannot be trained directly on those domains, as the number of samples may be small and searching again for hyper-parameters would be cumbersome yet necessary [@assran2022hidden].

To demonstrate LeJEPA's versatility and its ability to resolve this pain point, we pretrain directly on a new domain without any change to the loss or the pretraining pipeline. We select the Galaxy10 dataset, a galaxy morphology classification task that differs significantly from natural images in both visual structure and statistical properties [@balestriero2025gaussian]. The dataset contains 11,000 training samples across 10 galaxy types. For LeJEPA, we use the default hyper-parameters and pretrain a variety of backbones for 400 epochs. We compare against the latest DINOv2, DINOv3, and I-JEPA. We report in [\[fig:galaxy10\]](#fig:galaxy10){reference-type="ref" reference="fig:galaxy10"} the top-1 accuracy for linear probing with a frozen backbone and for full finetuning. We observe that **in-domain pretraining with LeJEPA substantially outperforms state-of-the-art frontier models (DINOv2, DINOv3) on both linear probing and full finetuning**. Additional datasets and backbones are provided in [\[tab:in\_domain\]](#tab:in_domain){reference-type="ref" reference="tab:in_domain"}, demonstrating LeJEPA's ability to train in-domain even on a dataset with only $1000$ samples (flowers102). Coupled with LeJEPA's stability across architectures and hyper-parameters, this result offers a promising alternative in domains not yet covered by the latest frontier models.

LeJEPA Scales Across Data and Models {#sec:scale}
------------------------------------

We now apply LeJEPA to a larger pretraining dataset, ImageNet-1K, and to larger backbones such as ViT-Large (0.3B) and ConvNeXtV2-Huge (0.6B). These two models reach an online linear-probe accuracy on ImageNet-1K of 77.1% and 78.5%, respectively. Beyond in-distribution performance, we also explore transfer learning. For those experiments, our baseline is I-JEPA with a ViT-Huge (0.6B), which is the closest to our setup; we also include a recent improved version of I-JEPA with additional stochastic prediction tasks [@bar2023stochastic], coined I-JEPA + STOP. For LeJEPA, we employ the same recipe as described in [6.1](#sec:hparams){reference-type="ref" reference="sec:hparams"} and report transfer-learning performance with a frozen backbone in [\[tab:transfer\]](#tab:transfer){reference-type="ref" reference="tab:transfer"}. We observe that we consistently outperform I-JEPA while employing a smaller model and a shorter training schedule. Beyond top-1 accuracy, we also echo our findings from [6.2](#sec:loss_corr){reference-type="ref" reference="sec:loss_corr"} about the quality of LeJEPA's training loss. In our setup, we observe a very stable and smooth training curve, indicating a stable optimization landscape and removing the need for careful hyper-parameter selection (recall [\[thm:ecf\_stability\]](#thm:ecf_stability){reference-type="ref" reference="thm:ecf_stability"}). We provide an example on a ViT-gigantic (1.8B parameters) in [\[fig:teaser\]](#fig:teaser){reference-type="ref" reference="fig:teaser"}.

Emergent Semantic Structure in LeJEPA Representations
-----------------------------------------------------

A hallmark of successful self-supervised learning is the emergence of semantically meaningful attention patterns without explicit supervision [@caron2021emerging]. To assess whether LeJEPA learns such structure, we visualize the attention maps of the learned representations. Following DINO [@caron2021emerging], we apply PCA to the embeddings and visualize the first principal components, which reveal clear correspondence to object boundaries and salient regions ([12](#fig:attention_vis){reference-type="ref" reference="fig:attention_vis"}). Furthermore, we explore whether these attention patterns can enable unsupervised video segmentation---a challenging task requiring temporal consistency and object understanding. By thresholding the self-attention maps of the \[CLS\] token, we obtain binary masks that track objects across frames without any segmentation labels during training. As shown in [\[fig:video\_seg\]](#fig:video_seg){reference-type="ref" reference="fig:video_seg"}, **LeJEPA's attention naturally segments foreground objects from background with remarkable temporal coherence**, suggesting that the learned representations capture both spatial semantics and temporal structure. This emergent capability demonstrates that LeJEPA's stability-focused objective does not sacrifice the semantic richness of learned features.
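The PCA-to-RGB projection used for these visualizations can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name, array shapes, and the per-image min-max normalization are our assumptions.

```python
import numpy as np

def pca_rgb(patch_features: np.ndarray) -> np.ndarray:
    """Project per-patch features onto their first 3 principal
    components and rescale each component to [0, 1] for RGB display.

    patch_features: hypothetical array of shape (num_patches, dim).
    Returns an array of shape (num_patches, 3).
    """
    # Center the features; rows of Vt from the SVD are the principal
    # directions of the centered feature matrix.
    X = patch_features - patch_features.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    rgb = X @ Vt[:3].T                      # (num_patches, 3) PC scores
    rgb -= rgb.min(axis=0, keepdims=True)   # min-max normalize per channel
    rgb /= rgb.max(axis=0, keepdims=True) + 1e-8
    return rgb
```

The resulting array can be reshaped to the patch grid and displayed as an image; each image is projected independently, as in the figure captions.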

Conclusion
==========

We have established a principled theoretical framework for JEPA-based self-supervised learning that fundamentally resolves its core pathologies. Our contributions span theory and practice: we proved that isotropic Gaussian embeddings uniquely minimize worst-case downstream risk, introduced SIGReg as a tractable and provably correct method to enforce this distribution, and demonstrated that this approach eliminates representational collapse by design--and not through ad-hoc combinations of teacher-student networks, stop-gradients, or asymmetric architectures.

We validate LeJEPA across domains and over $60$ architectures, including gigantic variants with 1.8B parameters. Despite its simplicity, LeJEPA matches state-of-the-art performance while requiring fewer than 50 lines of core implementation. Critically, our approach provides what SSL has long needed: a mathematically rigorous foundation that directly informs practical algorithm design.

Acknowledgments {#acknowledgments .unnumbered}
===============

We would like to thank Mike Rabbat and Lucas Maes for providing valuable feedback on the manuscript.

```{=latex}
\newcommand{\FancyAppendixTitle}[1]{%
    \thispagestyle{empty}
    \begin{center}
        % Thick decorative line
        \noindent
        \begin{tikzpicture}
            \draw[line width=2pt] (0,0) -- (\textwidth,0);
        \end{tikzpicture}
        \\[0.5em]
        % Thin decorative line
        \begin{tikzpicture}
            \draw[line width=0.5pt] (0,0) -- (\textwidth,0);
        \end{tikzpicture}
        \\[2.5em]
        % Main title
        {\fontsize{22pt}{26pt}\selectfont\bfseries #1}\\[1.2em]
        % Appendix label
        {\fontsize{16pt}{20pt}\selectfont\bfseries Appendix}\\[2.5em]
        % Bottom lines
        \begin{tikzpicture}
            \draw[line width=0.5pt] (0,0) -- (\textwidth,0);
        \end{tikzpicture}
        \\[0.5em]
        \begin{tikzpicture}
            \draw[line width=2pt] (0,0) -- (\textwidth,0);
        \end{tikzpicture}
    \end{center}
    \vspace{2cm}
}
```
Additional Details on Nonlinear Probing {#sec:additional_nonlinear}
=======================================

kNN Probing
-----------

To allow for more flexible evaluation of the pretrained encoder $f_{\vtheta}$, it is standard to use a $k$-NN prober [@taunk2019brief], both for regression and classification. We rely on the radial $k$-NN variant, which leverages a sample-dependent $k$, improving performance for non-uniform sample distributions [@sun2010adaptive; @zhang2017efficient; @abu2019effects].

We denote the underlying embedding density as $p_{z}\in C^3$ with derivatives of order up to $3$ bounded, and finite Fisher information and covariance. This regularity condition is fulfilled by current encoders. The *unknown* labels come from the target function $\eta:\R^K\to\R$, assumed $C^2$. We handle classification tasks by setting $\eta(\vz)=\mathbb{P}(Y=1\mid \vz)$. The training set consists of the $N$ embeddings along with their training labels $\{(\vz_n,\eta(\vz_n))\}_{n=1}^{N}$, where we denote $\vy_{n}\triangleq \eta(\vz_n)$. The prediction for a query vector $\vq$ is formed as $$\begin{aligned}
\widehat{\vy}(\vq)
:=
\frac{1}{N(\vq)}\sum_{n:\norm{\vz_{n}-\vq}\le r_0}\vy_n,\tag{kNN}\label{eq:kNN}\end{aligned}$$ with $N(\vq)\triangleq\#\{n:\norm{\vz_{n}-\vq}\le r_0\}$ counting the number of samples within the $r_0$-radius ball around $\vq$. The radius $r_0$ controls how many neighbors' labels are averaged to form the query's prediction. As for linear probing's [\[thm:linear\_probe\_bias\]](#thm:linear_probe_bias){reference-type="ref" reference="thm:linear_probe_bias"}, we can characterize the bias of the estimator [\[eq:kNN\]](#eq:kNN){reference-type="ref" reference="eq:kNN"} at a particular query point, as formalized below.
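The estimator above amounts to a masked average over a ball. A minimal NumPy sketch (ours, for illustration; variable names are not from the paper):

```python
import numpy as np

def radial_knn_predict(Z: np.ndarray, y: np.ndarray,
                       q: np.ndarray, r0: float) -> float:
    """Radial k-NN prediction: average the labels of all training
    embeddings lying within a ball of radius r0 around the query q.

    Z: (N, K) training embeddings, y: (N,) labels, q: (K,) query.
    """
    mask = np.linalg.norm(Z - q, axis=1) <= r0
    if not mask.any():            # empty ball: no neighbors to average
        return float("nan")
    return float(y[mask].mean())
```

Note that the effective number of neighbors adapts to the local sample density around $\vq$, which is the source of the improvement over fixed-$k$ variants mentioned above.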

k-NN Pointwise Bias The [\[eq:kNN\]](#eq:kNN){reference-type="eqref" reference="eq:kNN"} estimator has bias at query $\vq$ given by $$\begin{gathered}
    \mathrm{Bias}(\vq)=
\frac{r_0^2}{d+2}\Big(\grad\eta(\vq)^\top\grad\log p_{z}(\vq)+\tfrac{1}{2}\Delta\eta(\vq)\Big)\\+o(r_0^2),\end{gathered}$$ where the remainder $o(r_0^2)$ is uniform in $\vq$. (Proof in [9.3](#proof:knn_bias){reference-type="ref" reference="proof:knn_bias"}.)

To obtain the integrated bias, i.e., the bias averaged over the distribution of query points, we consider the following two properties. First, the query points follow the training distribution, i.e., $\vq \sim p_{z}$; second, the target function $\eta$ has a gradient that is mean-zero and isotropic, with $\E\big[\grad\eta(\vz)\grad\eta(\vz)^\top\big]=\tau_g^2I_d$ for some $\tau_g^2\in(0,\infty)$, uniformly in $\vz$. We also impose a finite scalar constraint on the covariance of the embeddings, such as $\Tr(\Sigma)=c$ or $\|\Sigma\|_F=c$ for a finite constant $c$.

k-NN Isotropic Gaussian Optimality The integrated squared bias of [\[eq:kNN\]](#eq:kNN){reference-type="eqref" reference="eq:kNN"} satisfies $$\E_{\vz}\big[\mathrm{Bias}(\vz)^2\big]
=
\frac{r_0^4}{(K+2)^2}\tau_g^2J(p)+O(r_0^4),$$ and the isotropic Gaussian is the unique minimizer of the integrated squared bias. (Proof in [9.4](#proof:knn_optimal){reference-type="ref" reference="proof:knn_optimal"}.)

As a result, we now have a unique minimizer for the optimal embedding density for both the linear and k-NN probes.

Kernel Probing
--------------

As an alternative to [\[eq:kNN\]](#eq:kNN){reference-type="eqref" reference="eq:kNN"}, it is also common to leverage kernel methods, which we consider in this section.

Consider a kernel $K:\mathbb{R}^d\to\mathbb{R}$ with the following standard properties $$\begin{aligned}
    &\int_{\mathbb{R}^d} K(u)du=1,\tag{normalized}\\
    &\int_{\mathbb{R}^d} uK(u)du=0,\tag{symmetric}\\
    &\int_{\mathbb{R}^d} u u^\top K(u)du=\mu_2(K)I_d,\tag{isotropic}\\
    &R(K)\triangleq\int_{\mathbb{R}^d} K(u)^2du<\infty,\tag{finite roughness}\end{aligned}$$ for some $\mu_2(K)\in(0,\infty)$. Given a bandwidth $h>0$ and denoting $K_h(t)\triangleq h^{-d}K(t/h)$, we remind the reader that the Nadaraya-Watson estimator, introduced in @nadaraya1964estimating [@watson1964smooth], at a query $\vq\in\mathbb{R}^d$ is $$\begin{aligned}
\widehat \vy(\vq)\triangleq \frac{\sum_{n=1}^N K_h(\vq-\vz_n)\vy_n}{\sum_{n=1}^N K_h(\vq-\vz_n)}.\tag{NW}\label{eq:NW}\end{aligned}$$ Similarly to [\[eq:kNN\]](#eq:kNN){reference-type="eqref" reference="eq:kNN"}, we will see that the performance of [\[eq:NW\]](#eq:NW){reference-type="eqref" reference="eq:NW"} depends crucially on the distribution of the training points. We have access to our dataset of inputs drawn from $p_z$, and for each sample $\vz_n$ the corresponding target is given by $\eta(\vz_n)=\mathbb{E}[Y_n\mid \vz_n]$. We also denote the conditional variance of the target at a point as $v(\vz)=\mathrm{Var}(Y_n\mid \vz_n=\vz)$. We follow the regularity conditions of the k-NN probing derivations and additionally assume that $p$ has sufficiently light tails so that for each coordinate $j$, $\lim_{\|x\|\to\infty} p(x)=0$ and $\lim_{\|x\|\to\infty} x_jp(x)=0$. We first derive the pointwise bias and variance of $\widehat \vy(\vq)$.
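The [eq:NW] estimator is a kernel-weighted average. A minimal NumPy sketch (ours, for illustration); the Gaussian kernel is our choice of a kernel satisfying the four properties above, not one prescribed by the text:

```python
import numpy as np

def nadaraya_watson(Z: np.ndarray, y: np.ndarray,
                    q: np.ndarray, h: float) -> float:
    """Nadaraya-Watson estimate at query q with a Gaussian kernel.

    Z: (N, d) training inputs, y: (N,) targets, h: bandwidth > 0.
    """
    sq = np.sum((Z - q) ** 2, axis=1)
    # K_h(t) = h^{-d} K(t / h); the h^{-d} factor cancels in the
    # numerator/denominator ratio, so we omit it.
    w = np.exp(-sq / (2.0 * h ** 2))
    return float(np.dot(w, y) / w.sum())
```

As the bandwidth $h$ shrinks, the estimate concentrates on the nearest training points, mirroring the role of $r_0$ in [eq:kNN].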

Kernel Bias and Variance For any fixed $\vq\in\mathbb{R}^d$ with $p(\vq)>0$, as $h\to 0$ and $N h^d\to\infty$, $$\begin{aligned}
\mathrm{Bias}\big[\widehat \vy(\vq)\big]
&=\frac{h^2\mu_2(K)}{2}\Big(\Delta \eta(\vq)+2\nabla \eta(\vq)^\top \nabla\log p(\vq)\Big)+o(h^2),\\
\mathrm{Var}\big[\widehat \vy(\vq)\big]
&=\frac{R(K)}{N h^d}\frac{v(\vq)}{p(\vq)}+o\big((N h^d)^{-1}\big).\end{aligned}$$ The $o(\cdot)$ terms are uniform over compact sets where $p$ is bounded away from zero. (Proof in [9.5](#proof:kernel_bias){reference-type="ref" reference="proof:kernel_bias"}.)

We now show that, under a fixed mean and total-covariance constraint on $p_z$, the isotropic Gaussian distribution uniquely minimizes the bias and variance of the kernel regression estimator at any test point. We restrict the smoothness class of the target function using $$\begin{gathered}
    \mathcal{M}(L,B)\triangleq\Big\{m\in C^2(\mathbb{R}^d):\|\nabla m(\vq)\|\le L,\\ |\Delta m(\vq)|\le B,\ \forall  \vq\in\mathbb{R}^d\Big\},\end{gathered}$$ allowing us to formalize below the worst-case integrated bias and the optimal density for $z$.

Proofs
======

Proof of [\[thm:linear\_probe\_bias\]](#thm:linear_probe_bias){reference-type="ref" reference="thm:linear_probe_bias"} {#proof:linear_probe_bias}
----------------------------------------------------------------------------------------------------------------------

Our proof follows standard derivations for studying the bias of an estimator. Consider the ridge regression problem (Tikhonov regularized least squares) with closed-form estimator $$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X} + \lambda_{\rm wd} \mathbf{I})^{-1} \mathbf{X}^T \mathbf{Y}.$$ The labels are formed from the ground-truth parameter $\boldsymbol{\beta}_{\text{true}}$ with centered error, as per $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta}_{\text{true}} + \boldsymbol{\varepsilon}$ where $\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}$. We can now look at the bias of our estimator given by $$\begin{aligned}
\text{Bias}(\hat{\boldsymbol{\beta}}) &= \mathbb{E}[\hat{\boldsymbol{\beta}}] - \boldsymbol{\beta}_{\text{true}} \\
&=(\mathbf{X}^T \mathbf{X} + \lambda_{\rm wd} \mathbf{I})^{-1} \mathbf{X}^T \mathbf{X}\boldsymbol{\beta}_{\text{true}}-\boldsymbol{\beta}_{\text{true}}\\
&= -\lambda_{\rm wd}(\mathbf{X}^T \mathbf{X} + \lambda_{\rm wd} \mathbf{I})^{-1} \boldsymbol{\beta}_{\text{true}}\\
&= -\lambda_{\rm wd} \mathbf{Q}(\boldsymbol{\Lambda} + \lambda_{\rm wd} \mathbf{I})^{-1}\mathbf{Q}^T \boldsymbol{\beta}_{\text{true}},\end{aligned}$$ where $\mathbf{X}^T\mathbf{X}=\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$ is the eigendecomposition. We now compare that bias when $\mX$ has an isotropic versus an anisotropic covariance with the same total variance: $$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_p}{p} = \bar{\lambda}.$$ For any anisotropic covariance matrix of $\mX$, denote by $\vq_p$ the eigenvector associated with the smallest eigenvalue $\lambda_p$, and let $\kappa>0$ be a positive constant. We now define $$\boldsymbol{\beta}_{\text{true}} = \kappa \cdot \mathbf{q}_p,$$ leading to $$\begin{aligned}
\|\text{Bias}(\hat{\boldsymbol{\beta}})\|_{\text{isotropic}} = \frac{\lambda_{\rm wd}}{\bar{\lambda} + \lambda_{\rm wd}} \|\boldsymbol{\beta}_{\text{true}}\|,\\
\|\text{Bias}(\hat{\boldsymbol{\beta}})\|_{\text{non-isotropic}} = \frac{\lambda_{\rm wd}}{\lambda_p + \lambda_{\rm wd}} \|\boldsymbol{\beta}_{\text{true}}\|.\end{aligned}$$ Since $\lambda_p < \bar{\lambda}$ (strict inequality when not isotropic): $$\frac{\lambda_{\rm wd}}{\lambda_p + \lambda_{\rm wd}} > \frac{\lambda_{\rm wd}}{\bar{\lambda} + \lambda_{\rm wd}},$$ we obtain that $$\|\text{Bias}(\hat{\boldsymbol{\beta}})\|_{\text{non-isotropic}} > \|\text{Bias}(\hat{\boldsymbol{\beta}})\|_{\text{isotropic}}.$$

As a result, whenever the covariance matrix of $\mX$ is anisotropic, there will be downstream tasks for which the estimator bias is increased compared to having isotropic covariance matrix. Anisotropic covariance structure thus amplifies regularization bias when the true parameter vector aligns unfavorably with the data's covariance structure.
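The two bias norms derived above can be compared numerically in the population limit where $\mathbf{X}^T\mathbf{X}/n\to\Sigma$. A short sketch with hypothetical eigenvalues (ours, for illustration only):

```python
import numpy as np

# Ridge bias norms under isotropic vs anisotropic covariance with the
# same total variance, with beta_true aligned to the smallest eigenvector.
lam_wd = 0.1                                   # hypothetical ridge penalty
eigs_aniso = np.array([2.5, 1.0, 0.4, 0.1])    # hypothetical spectrum
lam_bar = eigs_aniso.mean()                    # matched isotropic level

beta_norm = 1.0                                # ||beta_true|| (kappa = 1)
bias_iso = lam_wd / (lam_bar + lam_wd) * beta_norm
bias_aniso = lam_wd / (eigs_aniso.min() + lam_wd) * beta_norm

# Anisotropy amplifies the regularization bias for this beta_true.
assert bias_aniso > bias_iso
```

Here the anisotropic bias norm is $0.1/0.2 = 0.5$ versus $0.1/1.1 \approx 0.09$ for the matched isotropic spectrum, a more than fivefold amplification.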

Proof of [\[thm:linear\_probe\_variance\]](#thm:linear_probe_variance){reference-type="ref" reference="thm:linear_probe_variance"} {#proof:linear_probe_variance}
----------------------------------------------------------------------------------------------------------------------------------

We use the same formula as in [9.1](#proof:linear_probe_bias){reference-type="ref" reference="proof:linear_probe_bias"} with $\lambda_{\rm wd}=0$. We first see that the estimator is unbiased. We will now leverage that result to compute the covariance matrix of the estimator $$\begin{aligned}
\text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) &= \mathbb{E}[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T|\mathbf{X}]\\
&= \mathbb{E}[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}|\mathbf{X}]\\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T|\mathbf{X}]\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\sigma^2\mathbf{I}_n)\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\\
&= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\end{aligned}$$ leading to the total variance $$\text{tr}(\text{Var}(\hat{\boldsymbol{\beta}})) = \sigma^2\text{tr}(\mathbf{G}^{-1})=\sigma^2 \sum_{j=1}^p \frac{1}{\lambda_j},$$ where we used the eigendecomposition of $\mathbf{G} \triangleq \mathbf{X}^T\mathbf{X} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$.

The function $f(x) = \frac{1}{x}$ is strictly convex on $(0, \infty)$, allowing us to leverage Jensen's inequality: $$\begin{aligned}
    \frac{1}{p}\sum_{j=1}^p \frac{1}{\lambda_j} > \frac{1}{\frac{1}{p}\sum_{j=1}^p \lambda_j}\\
    \iff \sum_{j=1}^p \frac{1}{\lambda_j} > \sum_{j=1}^{p}\frac{1}{\frac{1}{p}\sum_{i=1}^p \lambda_i}\\
    \iff \text{tr}(\text{Var}(\hat{\boldsymbol{\beta}}))_{\text{aniso}} > \text{tr}(\text{Var}(\hat{\boldsymbol{\beta}}))_{\text{iso}}.\end{aligned}$$ The inequality is strict whenever the eigenvalues $\{\lambda_j\}_{j=1}^p$ are not all equal.
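The Jensen step can be checked numerically: for a fixed eigenvalue budget, equalizing the spectrum strictly reduces $\sum_j 1/\lambda_j$. A short sketch with hypothetical eigenvalues:

```python
import numpy as np

# Total variance tr(Var(beta_hat)) = sigma^2 * sum_j 1/lambda_j is
# minimized, for a fixed trace, by an isotropic (equalized) spectrum.
sigma2 = 1.0
eigs_aniso = np.array([3.0, 0.5, 0.3, 0.2])    # hypothetical spectrum
eigs_iso = np.full(4, eigs_aniso.mean())       # same trace, equalized

var_aniso = sigma2 * np.sum(1.0 / eigs_aniso)
var_iso = sigma2 * np.sum(1.0 / eigs_iso)

assert var_iso < var_aniso    # Jensen: 1/x is strictly convex
```

With these values the anisotropic total variance is roughly $10.7$ against $4.0$ for the isotropic spectrum of identical trace.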

Proof of [\[thm:knn\_bias\]](#thm:knn_bias){reference-type="ref" reference="thm:knn_bias"} {#proof:knn_bias}
------------------------------------------------------------------------------------------

Under a Poisson point process (PPP) sampling model, the conditional expectation of $\widehat{\eta}(x)$ coincides with the normalized ball average $$\E\big[\widehat{\eta}(x)\big]
\;=\;
\frac{\int_{\Ball(0,r_0)} \eta(x+z)p(x+z)dz}{\int_{\Ball(0,r_0)} p(x+z)dz}
\quad\text{to second order in }r_0,$$ which is the key surrogate used below. **Ball integrals.** For computations we use (by symmetry), for any $r>0$ and with $v_d$ denoting the volume of the unit ball in $\R^d$: $$\int_{\Ball(0,r)} zdz=0,\qquad
\int_{\Ball(0,r)} zz^\top dz=\frac{v_d r^{d+2}}{d+2}I_d,\qquad
\int_{\Ball(0,r)} \norm{z}^2dz=\frac{d v_d r^{d+2}}{d+2}.$$

Fix $x\in\R^d$ and write $z\in\Ball(0,r_0)$ for local displacements. Assume $p\in C^3$, $\eta\in C^2$ with bounded derivatives on the region of interest, and perform a second-order Taylor expansion: $$\begin{aligned}
p(x+z)&=p(x)+\nabla p(x)^\top z+\tfrac12 z^\top H p(x)z + R_p(x;z),\\
\eta(x+z)&=\eta(x)+\nabla\eta(x)^\top z+\tfrac12 z^\top H\eta(x)z + R_\eta(x;z),\end{aligned}$$ with remainders satisfying $|R_\eta(x;z)|\le C_\eta\norm{z}^3$ and $|R_p(x;z)|\le C_p\norm{z}^3$ uniformly for $\norm{z}\le r_0$. Using the ball identities $\int_{B(0,r)} zdz=0$ and $\int_{B(0,r)} zz^\top dz=\frac{v_d r^{d+2}}{d+2}I_d$ and collecting terms up to order $r_0^{d+2}$, we simplify the denominator as $$\begin{aligned}
    \mathcal{D}(x)&\triangleq\int_{\Ball(0,r_0)} p(x+z)dz\\
    &= \int_{\Ball(0,r_0)} \Big[p(x) + \grad p(x)^\top z + \tfrac{1}{2}z^\top \Hess p(x)z + R_p(x;z)\Big]dz\\
    &= v_d r_0^dp(x)\;+\;\frac{v_d r_0^{d+2}}{2(d+2)}\tr\big(\Hess p(x)\big)\;+\;O(r_0^{d+3}),\end{aligned}$$ since $\int zdz=0$ and $\int z^\top \Hess pzdz=\tr(\Hess p)\frac{v_d r_0^{d+2}}{d+2}$, and the numerator as $$\begin{aligned}
\mathcal{N}(x)&\triangleq \int_{\Ball(0,r_0)} \eta(x+z)p(x+z)dz\\
    &= \int \Big[\eta(x)+\grad\eta(x)^\top z+\tfrac{1}{2}z^\top \Hess\eta(x)z\Big]
\Big[p(x)+\grad p(x)^\top z+\tfrac{1}{2}z^\top \Hess p(x)z\Big]dz+O(r_0^{d+3})\\
&= \eta(x)p(x)v_d r_0^d+\eta(x)\frac{v_d r_0^{d+2}}{2(d+2)}\tr\big(\Hess p(x)\big)+\frac{v_d r_0^{d+2}}{d+2}\grad\eta(x)\cdot\grad p(x)+ \frac{v_d r_0^{d+2}}{2(d+2)}p(x)\tr\big(\Hess\eta(x)\big)
+O(r_0^{d+3}).\end{aligned}$$

Cubic terms vanish by symmetry, and quartic terms are $O(r_0^{d+4})$. Subtract $\eta(x)\mathcal{D}(x)$ to obtain the bias numerator: $$\mathcal{N}(x)-\eta(x)\mathcal{D}(x)
= \frac{v_dr_0^{d+2}}{d+2}\Big(\nabla\eta(x)\cdot\nabla p(x) + \tfrac{1}{2}p(x)\Delta\eta(x)\Big) + O(r_0^{d+3}).$$ Write $\mathcal{D}(x)=v_d r_0^d p(x)\big(1+\alpha(x)r_0^2+O(r_0^3)\big)$ where $\alpha(x):=\frac{1}{2(d+2)p(x)}\mathrm{tr}(H p(x))$. Then $$\begin{aligned}
\frac{\mathcal{N}(x)}{\mathcal{D}(x)}-\eta(x)
&= \frac{ \frac{v_d r_0^{d+2}}{d+2}\left(\nabla\eta\cdot\nabla p + \frac{1}{2}p\Delta\eta\right) + O(r_0^{d+3})}{ v_d r_0^d p \left(1+\alpha r_0^2+O(r_0^3)\right)}\\[0.5ex]
&= \frac{r_0^2}{d+2}\left(\frac{\nabla\eta\cdot\nabla p}{p} + \frac{1}{2}\Delta\eta\right)\Big(1-\alpha r_0^2+O(r_0^3)\Big)\ +\ O(r_0^3)\\[0.5ex]
&= \frac{r_0^2}{d+2}\Big(\nabla\eta(x)\cdot\nabla\log p(x) + \tfrac{1}{2}\Delta\eta(x)\Big)\ +\ o(r_0^2),\end{aligned}$$ uniformly on compact sets where $p$ is bounded away from zero. This gives the bias formula $$\mathbb{E}\big[\widehat{\eta}(x)\big]-\eta(x)
=
\frac{r_0^2}{d+2}\Big(\nabla\eta(x)\cdot\nabla\log p(x) + \tfrac{1}{2}\Delta\eta(x)\Big)\ +\ o(r_0^2),$$

completing the proof.
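The bias formula can be sanity-checked by numerical integration in $d=1$ with $p=\mathcal{N}(0,1)$ and $\eta(x)=x^2$; the formula then predicts $\mathrm{Bias}(x)\approx \tfrac{r_0^2}{3}(1-2x^2)$. A sketch of this check (ours; the test point, radius, and tolerance are illustrative choices):

```python
import numpy as np

def trap(f: np.ndarray, z: np.ndarray) -> float:
    """Trapezoidal rule for samples f on grid z."""
    return float(np.sum((f[1:] + f[:-1]) * np.diff(z)) / 2)

# d = 1, p = N(0,1), eta(x) = x^2:
#   Bias(x) ~ r0^2/(d+2) * (eta'(x)(log p)'(x) + eta''(x)/2)
#           = r0^2/3 * (2x * (-x) + 1) = r0^2/3 * (1 - 2x^2).
x, r0 = 0.5, 0.05
z = np.linspace(-r0, r0, 20001)
phi = np.exp(-((x + z) ** 2) / 2) / np.sqrt(2 * np.pi)  # p(x + z)
eta = (x + z) ** 2                                      # eta(x + z)

ball_avg = trap(eta * phi, z) / trap(phi, z)            # E[eta_hat(x)]
bias_numeric = ball_avg - x ** 2
bias_formula = r0 ** 2 / 3 * (1 - 2 * x ** 2)

# Agreement up to the o(r0^2) remainder (loose 5% tolerance).
assert abs(bias_numeric - bias_formula) < 0.05 * abs(bias_formula)
```

The residual discrepancy shrinks as $r_0\to 0$, consistent with the $o(r_0^2)$ remainder.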

Proof of [\[thm:knn\_optimal\]](#thm:knn_optimal){reference-type="ref" reference="thm:knn_optimal"} {#proof:knn_optimal}
---------------------------------------------------------------------------------------------------

Recall from [9.3](#proof:knn_bias){reference-type="ref" reference="proof:knn_bias"} that the bias term at sample $\vx$ is given by $$\begin{aligned}
\mathrm{Bias}(\vx)
=&\frac{r_0^2}{d+2}\Big(\grad\eta(x)\cdot\grad\log p(x)\Big)\;+\;\frac{r_0^2}{2(d+2)}\Delta\eta(x)\;+\;o(r_0^2)\\
=&\frac{r_0^2}{d+2}\big(A(x)+C(x)\big)+o(r_0^2),\end{aligned}$$ where we defined $A(x)\triangleq \nabla\eta(x)\cdot\nabla\log p(x)$ and $C(x)\triangleq\frac{1}{2}\Delta\eta(x)$. We now square and take the expectation over $X\sim p$ and the isotropic gradient prior $$\begin{aligned}
\mathbb{E}\big[\mathrm{Bias}(X)^2\big]
&=\mathbb{E}\big[\left(\frac{r_0^2}{d+2}\right)^2\big(A(X)^2 + 2A(X)C(X) + C(X)^2\big)
+ o(r_0^4)\big]\\
&=\left(\frac{r_0^2}{d+2}\right)^2
\Big\{\underbrace{\mathbb{E}\big[A(X)^2\big]}_{\text{score-gradient term}}
+ \underbrace{2\mathbb{E}\big[A(X)C(X)\big]}_{\text{cross term}}
+ \underbrace{\mathbb{E}\big[C(X)^2\big]}_{\text{curvature term}}\Big\}
+ o(r_0^4).
\label{eq:three-terms}\end{aligned}$$ We will derive each term separately, recalling that we assume an isotropic gradient prior for $\eta$, i.e., $\mathbb{E}\big[\nabla\eta(x)\big]=0$ and $\mathbb{E}\big[\nabla\eta(x)\nabla\eta(x)^\top\big]=\tau_g^2I_d$, for some $\tau_g^2\in(0,\infty)$.

#### 1) The score-gradient term $\mathbb{E}[A(X)^2]$.

Using $v(x):=\nabla\log p(x)$ for brevity: $$\begin{aligned}
\mathbb{E}\big[A(X)^2\big]
=&\mathbb{E}_X\big[\mathbb{E}_\eta[A(X)^2]\big]\\
=&\mathbb{E}_X\big[\mathbb{E}_\eta[\big(\nabla\eta(x)^\top v(x)\big)^2]\big]\\
=&\mathbb{E}_X\big[\mathbb{E}_\eta[\nabla\eta(x)^\top\Big(v(x)v(x)^\top\Big)\nabla\eta(x)]\big]\\
=&\mathbb{E}_X\big[\mathbb{E}_\eta[\mathrm{tr}\Big(v(x)v(x)^\top\nabla\eta(x)\nabla\eta(x)^\top\Big)]\big]\\
=&\mathbb{E}_X\big[\mathrm{tr}\Big(v(x)v(x)^\top\mathbb{E}_\eta[\nabla\eta(x)\nabla\eta(x)^\top]\Big)\big]\\=&\mathbb{E}_X\big[\tau_g^2\|v(x)\|^2\big]\\
=&\tau_g^2\mathbb{E}_X\big[\|v(X)\|^2\big]\\
=&\tau_g^2\int_{\mathbb{R}^d} \|\nabla\log p(x)\|^2p(x)dx,\end{aligned}$$ recovering the Fisher-information functional $J(p)$, scaled by $\tau_g^2$.

#### 2) The cross term $2\mathbb{E}[A(X)C(X)]$.

We have $$A(x)C(x)=\frac{1}{2}\big(\nabla\eta(x)^\top v(x)\big)\Delta\eta(x).$$ Under the prior, $\nabla\eta$ is mean-zero and isotropic; if, additionally, $\Delta\eta$ is uncorrelated with $\nabla\eta$ and has zero mean (or is bounded and mean-zero after centering), then $\mathbb{E}_\eta[A(x)C(x)]=0$. If one does *not* assume the orthogonality/vanishing covariance above, then $\mathbb{E}[A(X)C(X)]$ is a finite constant (depending on the joint law of derivatives of $\eta$), and the cross term contributes $$\left(\frac{r_0^2}{d+2}\right)^2\cdot 2\mathbb{E}[A(X)C(X)]
=\ O(r_0^4),$$ not $o(r_0^4)$. In that general case, the leading $p$-dependent term of $\mathbb{E}[\mathrm{Bias}(X)^2]$ is still the *score-gradient* $\tau_g^2J(p)$.

#### 3) The curvature term $\mathbb{E}[C(X)^2]$.

$$\begin{aligned}
\mathbb{E}\big[C(X)^2\big]
=&\mathbb{E}_X\big[\mathbb{E}_\eta[C(X)^2]\big]\\
=&\frac{1}{4}\mathbb{E}_X\big[\mathbb{E}_\eta[(\Delta\eta(X))^2]\big],\end{aligned}$$ which is independent of $p$, hence $\mathbb{E}\big[C(X)^2\big]=O(1)$.

#### Putting it together.

Substituting into [\[eq:three-terms\]](#eq:three-terms){reference-type="eqref" reference="eq:three-terms"}: $$\begin{aligned}
\mathbb{E}\big[\mathrm{Bias}(X)^2\big]
&=\left(\frac{r_0^2}{d+2}\right)^2\Big\{\tau_g^2J(p) + O(1)\Big\} + o(r_0^4)\\
&=\frac{r_0^4}{(d+2)^2}\tau_g^2J(p)\;+\;O(r_0^4).\end{aligned}$$

We show that, among all mean-zero distributions $p$ on $\mathbb{R}^d$ with a given *scalar* constraint on the covariance (trace, determinant, Frobenius norm, or spectral radius), the density that minimizes the Fisher-information functional $$J(p)\;:=\;\int_{\mathbb{R}^d}\|\nabla\log p(x)\|^2p(x)dx$$ is the Gaussian with *isotropic* covariance satisfying the same scalar constraint. We proceed in two steps: (i) for fixed covariance matrix $\Sigma\succ 0$, $J(p)$ is minimized by the Gaussian $\mathcal{N}(0,\Sigma)$ and attains the value $\mathrm{tr}(\Sigma^{-1})$; (ii) for each scalar constraint, $\mathrm{tr}(\Sigma^{-1})$ is minimized by $\Sigma=sI_d$ for the appropriate scalar $s>0$.

Special case (recovery of VCReg): Let $p$ be a mean-zero probability density on $\mathbb{R}^d$ with covariance $\Sigma=\mathbb{E}[X X^\top]\succ 0$. Then $$J(p)\;\ge\;\mathrm{tr}(\Sigma^{-1}),$$ with equality if and only if $p=\mathcal{N}(0,\Sigma)$.

Consider the location family $p_\theta(x):=p(x-\theta)$, $\theta\in\mathbb{R}^d$. Its Fisher-information matrix at $\theta$ is $$\mathcal{I}(\theta)
\;=\;
\mathbb{E}\big[\nabla_\theta\log p_\theta(X)\nabla_\theta\log p_\theta(X)^\top\big]
\;=\;
\mathbb{E}\big[\nabla\log p(X)\nabla\log p(X)^\top\big],$$ so that $J(p)=\mathrm{tr}\mathcal{I}(\theta)$. The estimator $T(X)\equiv X$ is unbiased for $\theta$ under $p_\theta$, with $\mathrm{Cov}(T)=\Sigma$. The matrix Cramér--Rao bound gives $\mathrm{Cov}(T)\succeq \mathcal{I}(\theta)^{-1}$, i.e., $\mathcal{I}(\theta)\succeq \Sigma^{-1}$. Taking traces yields $J(p)\ge \mathrm{tr}(\Sigma^{-1})$. Equality in the matrix Cramér--Rao bound holds if and only if the score is an *affine* function of $X-\theta$, i.e., $\nabla\log p_\theta(X)=A(X-\theta)$ a.s. for some matrix $A$; integrating this identity shows $p_\theta$ is Gaussian with precision matrix $-A$, hence $p=\mathcal{N}(0,\Sigma)$.

Step 2: Optimizing over covariance shapes under scalar constraints {#step-2-optimizing-over-covariance-shapes-under-scalar-constraints .unnumbered}
------------------------------------------------------------------

Write the eigenvalues of $\Sigma$ as $\lambda_1,\ldots,\lambda_d>0$. Then $$\mathrm{tr}(\Sigma^{-1})=\sum_{i=1}^d \frac{1}{\lambda_i}.$$ We now solve $\min \sum_i 1/\lambda_i$ under each scalar constraint; in every case the minimum is attained when all $\lambda_i$ are equal, i.e., $\Sigma=sI_d$.

#### (a) Trace constraint.

Given $\mathrm{tr}(\Sigma)=\sum_i \lambda_i=t>0$, by Cauchy--Schwarz, $$\left(\sum_{i=1}^d \frac{1}{\lambda_i}\right)\left(\sum_{i=1}^d \lambda_i\right)\;\ge\;\left(\sum_{i=1}^d 1\right)^2=d^2,$$ with equality if and only if $\lambda_1=\cdots=\lambda_d$. Hence $$\min_{\Sigma\succ 0:\ \mathrm{tr}(\Sigma)=t}\ \mathrm{tr}(\Sigma^{-1})
\;=\;\frac{d^2}{t},
\quad\text{attained at}\quad
\Sigma=\frac{t}{d}I_d.$$

#### (b) Determinant constraint.

Given $\det(\Sigma)=\prod_i \lambda_i=\delta>0$, set $\mu_i:=1/\lambda_i$ so that $\prod_i \mu_i=\delta^{-1}$. By the AM--GM inequality, $$\frac{1}{d}\sum_{i=1}^d \mu_i\;\ge\;\left(\prod_{i=1}^d \mu_i\right)^{1/d}=\delta^{-1/d},$$ with equality iff $\mu_1=\cdots=\mu_d$, i.e., $\lambda_1=\cdots=\lambda_d$. Thus $$\min_{\Sigma\succ 0:\ \det(\Sigma)=\delta}\ \mathrm{tr}(\Sigma^{-1})
\;=\;d\delta^{-1/d},
\quad\text{attained at}\quad
\Sigma=\delta^{1/d}I_d.$$

#### (c) Frobenius-norm constraint.

Given $\|\Sigma\|_F^2=\sum_i \lambda_i^2=c^2>0$, minimize $f(\lambda):=\sum_i 1/\lambda_i$ over $\lambda_i>0$ subject to $g(\lambda):=\sum_i \lambda_i^2=c^2$. The Lagrangian $$\mathcal{L}(\lambda,\nu)=\sum_{i=1}^d \frac{1}{\lambda_i} + \nu\left(\sum_{i=1}^d \lambda_i^2 - c^2\right)$$ has first-order conditions $-\lambda_i^{-2}+2\nu\lambda_i=0$ for all $i$, i.e., $\lambda_i^3=\frac{1}{2\nu}$, so all $\lambda_i$ are equal. Imposing $\sum \lambda_i^2=c^2$ yields $\lambda_i=c/\sqrt{d}$, hence $$\min_{\Sigma\succ 0:\ \|\Sigma\|_F=c}\ \mathrm{tr}(\Sigma^{-1})
\;=\;\sum_{i=1}^d \frac{1}{\lambda_i}
\;=\;\frac{d^{3/2}}{c},
\quad\text{attained at}\quad
\Sigma=\frac{c}{\sqrt{d}}I_d.$$

#### (d) Spectral-radius constraint.

Let the spectral radius be constrained by $\rho(\Sigma)=\max_i \lambda_i\le r$ for some $r>0$. Since $x\mapsto 1/x$ is strictly decreasing on $(0,\infty)$, $$\sum_{i=1}^d \frac{1}{\lambda_i}
\;\ge\;\sum_{i=1}^d \frac{1}{r}
\;=\;\frac{d}{r},$$ with equality if and only if $\lambda_i=r$ for all $i$. Therefore $$\min_{\Sigma\succ 0:\ \rho(\Sigma)\le r}\ \mathrm{tr}(\Sigma^{-1})
\;=\;\frac{d}{r},
\quad\text{attained at}\quad
\Sigma=rI_d.$$ (The same conclusion holds if the constraint is $\rho(\Sigma)=r$, since one may take all eigenvalues equal to $r$.)

Conclusion: Isotropic Gaussian is optimal {#conclusion-isotropic-gaussian-is-optimal .unnumbered}
-----------------------------------------

Combining Lemma [\[lem:fixed-Sigma\]](#lem:fixed-Sigma){reference-type="ref" reference="lem:fixed-Sigma"} with the solutions (a)--(d), we obtain:

Special case (recovery of VCReg): Fix one of the following scalar covariance constraints for a mean-zero distribution $p$ on $\mathbb{R}^d$:

-   trace: $\mathrm{tr}(\mathrm{Cov}(X))=t$,

-   determinant: $\det(\mathrm{Cov}(X))=\delta$,

-   Frobenius norm: $\|\mathrm{Cov}(X)\|_F=c$,

-   spectral radius upper bound: $\rho(\mathrm{Cov}(X))\le r$.

Then the Fisher-information functional $J(p)$ is minimized over all such $p$ by the isotropic Gaussian $p_G=\mathcal{N}(0,sI_d)$ with $s$ chosen to satisfy the constraint. The minimal values are: $$\begin{array}{ll}
\text{trace } t: & J_{\min}=\dfrac{d^2}{t},\quad s=\dfrac{t}{d},\\[1.2ex]
\text{determinant } \delta: & J_{\min}=d\delta^{-1/d},\quad s=\delta^{1/d},\\[1.2ex]
\text{Frobenius } c: & J_{\min}=\dfrac{d^{3/2}}{c},\quad s=\dfrac{c}{\sqrt{d}},\\[1.2ex]
\text{spectral radius } r: & J_{\min}=\dfrac{d}{r},\quad s=r.
\end{array}$$ In each case, $p_G$ is the unique minimizer (up to null sets).

For any admissible $p$ with covariance $\Sigma$, Lemma [\[lem:fixed-Sigma\]](#lem:fixed-Sigma){reference-type="ref" reference="lem:fixed-Sigma"} gives $J(p)\ge \mathrm{tr}(\Sigma^{-1})$. Minimizing the right-hand side under the stated scalar constraint yields $\Sigma=sI_d$ by the calculations in (a)--(d). Equality in Lemma [\[lem:fixed-Sigma\]](#lem:fixed-Sigma){reference-type="ref" reference="lem:fixed-Sigma"} holds if and only if $p$ is Gaussian with that covariance, hence $p_G$ uniquely attains the bound.

Proof of [\[thm:kernel\_bias\]](#thm:kernel_bias){reference-type="ref" reference="thm:kernel_bias"} {#proof:kernel_bias}
---------------------------------------------------------------------------------------------------

Write the numerator and denominator of $\widehat m(x)$ as $$B_n(x):=\sum_{i=1}^n K_h(x-X_i)Y_i,\qquad A_n(x):=\sum_{i=1}^n K_h(x-X_i),$$ so that $\widehat m(x)=\frac{B_n(x)}{A_n(x)}$. *Bias.* Compute expectations using independence and change of variables. For the denominator, $$\begin{aligned}
\mathbb{E}[A_n(x)]
&=n\mathbb{E}\big[K_h(x-X)\big]\\
&=n\int_{\mathbb{R}^d} h^{-d}K\Big(\frac{x-u}{h}\Big)p(u)du\\
&=n\int_{\mathbb{R}^d} K(t)p(x-h t)dt\qquad (t:=(x-u)/h)\\
&=n\int_{\mathbb{R}^d} K(t)\Big(p(x)-ht^\top \nabla p(x)+\frac{h^2}{2}t^\top \nabla^2 p(x)t+o(h^2)\Big)dt\\
&=n\Big(p(x)+\frac{h^2}{2}\underbrace{\int t^\top \nabla^2 p(x)tK(t)dt}_{=\mu_2(K)\Delta p(x)}+o(h^2)\Big),\end{aligned}$$ where we used symmetry $\int t K(t)dt=0$ and isotropy $\int t t^\top K(t)dt=\mu_2(K) I_d$, which implies $\int t^\top \nabla^2 p(x)tK(t)dt=\mu_2(K)\mathrm{tr}(\nabla^2 p(x))=\mu_2(K)\Delta p(x)$. Similarly, for the numerator, $$\begin{aligned}
\mathbb{E}[B_n(x)]
&=n\mathbb{E}\big[K_h(x-X)Y\big]
=n\int K(t)(m p)(x-h t)dt\\
&=n\int K(t)\Big((mp)(x)-ht^\top \nabla(mp)(x)+\frac{h^2}{2}t^\top \nabla^2(mp)(x)t+o(h^2)\Big)dt\\
&=n\Big(m(x)p(x)+\frac{h^2}{2}\mu_2(K)\mathrm{tr}\big(\nabla^2(mp)(x)\big)+o(h^2)\Big)\\
&=n\Big(m(x)p(x)+\frac{h^2\mu_2(K)}{2}\big(p\Delta m + m\Delta p + 2\nabla m^\top \nabla p\big)(x)+o(h^2)\Big),\end{aligned}$$ where the last step uses the fact that $\mathrm{tr}\big(\nabla^2(mp)\big)=p\Delta m + m\Delta p + 2\nabla m^\top \nabla p$ by the product rule and symmetry of mixed derivatives.

Now expand the ratio $\frac{\mathbb{E}[B_n(x)]}{\mathbb{E}[A_n(x)]}$ using the identity $$\frac{a_0+h^2 a_2+o(h^2)}{b_0+h^2 b_2+o(h^2)}
=\frac{a_0}{b_0}
+h^2\frac{a_2 b_0-a_0 b_2}{b_0^2}
+o(h^2),$$ with $a_0=m(x)p(x)$, $a_2=\frac{\mu_2(K)}{2}\big(p\Delta m + m\Delta p + 2\nabla m^\top \nabla p\big)(x)$, $b_0=p(x)$, and $b_2=\frac{\mu_2(K)}{2}\Delta p(x)$. This yields $$\begin{aligned}
\frac{\mathbb{E}[B_n(x)]}{\mathbb{E}[A_n(x)]}
&=m(x)
+\frac{h^2\mu_2(K)}{2}\frac{\big(p\Delta m + m\Delta p + 2\nabla m^\top \nabla p\big)p - m p\Delta p}{p^2}\Big|_{x}+o(h^2)\\
&=m(x)
+\frac{h^2\mu_2(K)}{2}\Big(\Delta m(x)+2\nabla m(x)^\top \frac{\nabla p(x)}{p(x)}\Big)+o(h^2),\end{aligned}$$ which recovers our statement. *Variance.* Linearize $\widehat m(x)=B_n(x)/A_n(x)$ around $(\mathbb{E}[B_n(x)],\mathbb{E}[A_n(x)])$ and use independence. To leading order, $$\mathrm{Var}[\widehat m(x)]\approx \frac{\mathrm{Var}[B_n(x)]}{(\mathbb{E}[A_n(x)])^2}.$$ Compute $$\begin{aligned}
\mathrm{Var}[B_n(x)]
&= \sum_{i=1}^n \mathrm{Var}\big(K_h(x-X_i)Y_i\big)\quad\text{(independence)}\\
&= n\mathbb{E}\big[K_h(x-X)^2\mathrm{Var}(Y\mid X)\big]
= n\mathbb{E}\big[K_h(x-X)^2v(X)\big]\\
&= n\int h^{-2d}K\Big(\frac{x-u}{h}\Big)^2 v(u)p(u)du\\
&= n h^{-d}\int K(t)^2v(x-h t)p(x-h t)dt
= n h^{-d}\Big(R(K)v(x)p(x)+o(1)\Big),\end{aligned}$$ while $$\mathbb{E}[A_n(x)]=n\big(p(x)+o(1)\big).$$ Therefore, $$\mathrm{Var}[\widehat m(x)]
\approx \frac{n h^{-d}R(K)v(x)p(x)}{n^2p(x)^2}
=\frac{R(K)}{n h^d}\frac{v(x)}{p(x)}+o\big((n h^d)^{-1}\big),$$ completing the proof.
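The one algebraic step above that is easy to get wrong is the ratio expansion; a quick numerical check (with arbitrary assumed coefficients) confirms that the remainder is indeed $o(h^2)$:

```python
import numpy as np

# Check: (a0 + h^2 a2) / (b0 + h^2 b2)
#        = a0/b0 + h^2 (a2 b0 - a0 b2) / b0^2 + o(h^2).
a0, a2, b0, b2 = 1.3, -0.7, 2.1, 0.4       # arbitrary test coefficients
for h in [1e-1, 1e-2, 1e-3]:
    lhs = (a0 + h**2 * a2) / (b0 + h**2 * b2)
    rhs = a0 / b0 + h**2 * (a2 * b0 - a0 * b2) / b0**2
    print(h, (lhs - rhs) / h**2)            # remainder / h^2 shrinks with h
```

The printed ratios decay like $h^2$, i.e., the neglected remainder is $O(h^4)$, consistent with the $o(h^2)$ claim.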

Proof of [\[eq:pred1\]](#eq:pred1){reference-type="ref" reference="eq:pred1"} to [\[eq:pred2\]](#eq:pred2){reference-type="ref" reference="eq:pred2"} {#proof:pred}
-----------------------------------------------------------------------------------------------------------------------------------------------------

Let $\bar{\mathbf{z}} = \frac{1}{V_g}\sum_{v=1}^{V_g}\mathbf{z}_{n,v}$ denote the mean of the first $V_g$ vectors.

We prove that: $$\frac{1}{V_g}\sum_{v=1}^{V_g}\frac{1}{V}\sum_{v'=1}^{V}\| \mathbf{z}_{n,v} - \mathbf{z}_{n,v'} \|_2^2 = \frac{1}{V}\sum_{v'=1}^{V}\left\| \bar{\mathbf{z}} - \mathbf{z}_{n,v'} \right\|_2^2$$

Expanding the left-hand side: $$\begin{aligned}
\text{LHS} &= \frac{1}{V_g V}\sum_{v=1}^{V_g}\sum_{v'=1}^{V}\| \mathbf{z}_{n,v} - \mathbf{z}_{n,v'} \|_2^2 \\
&= \frac{1}{V_g V}\sum_{v=1}^{V_g}\sum_{v'=1}^{V}\left(\|\mathbf{z}_{n,v}\|_2^2 - 2\mathbf{z}_{n,v}^T\mathbf{z}_{n,v'} + \|\mathbf{z}_{n,v'}\|_2^2\right) \\
&= \frac{1}{V_g}\sum_{v=1}^{V_g}\|\mathbf{z}_{n,v}\|_2^2 - \frac{2}{V_g V}\sum_{v=1}^{V_g}\sum_{v'=1}^{V}\mathbf{z}_{n,v}^T\mathbf{z}_{n,v'} + \frac{1}{V}\sum_{v'=1}^{V}\|\mathbf{z}_{n,v'}\|_2^2 \\
&= \frac{1}{V_g}\sum_{v=1}^{V_g}\|\mathbf{z}_{n,v}\|_2^2 - \frac{2}{V}\bar{\mathbf{z}}^T\sum_{v'=1}^{V}\mathbf{z}_{n,v'} + \frac{1}{V}\sum_{v'=1}^{V}\|\mathbf{z}_{n,v'}\|_2^2\end{aligned}$$

Expanding the right-hand side: $$\begin{aligned}
\text{RHS} &= \frac{1}{V}\sum_{v'=1}^{V}\left(\|\bar{\mathbf{z}}\|_2^2 - 2\bar{\mathbf{z}}^T\mathbf{z}_{n,v'} + \|\mathbf{z}_{n,v'}\|_2^2\right) \\
&= \|\bar{\mathbf{z}}\|_2^2 - \frac{2}{V}\bar{\mathbf{z}}^T\sum_{v'=1}^{V}\mathbf{z}_{n,v'} + \frac{1}{V}\sum_{v'=1}^{V}\|\mathbf{z}_{n,v'}\|_2^2\end{aligned}$$

To complete the proof, we verify that: $$\frac{1}{V_g}\sum_{v=1}^{V_g}\|\mathbf{z}_{n,v}\|_2^2 = \|\bar{\mathbf{z}}\|_2^2$$

Expanding the right-hand side: $$\begin{aligned}
\|\bar{\mathbf{z}}\|_2^2 &= \left\|\frac{1}{V_g}\sum_{v=1}^{V_g}\mathbf{z}_{n,v}\right\|_2^2 \\
&= \frac{1}{V_g^2}\sum_{v=1}^{V_g}\sum_{v''=1}^{V_g}\mathbf{z}_{n,v}^T\mathbf{z}_{n,v''} \\
&= \frac{1}{V_g}\sum_{v=1}^{V_g}\|\mathbf{z}_{n,v}\|_2^2,\end{aligned}$$ where the final step uses $\mathbf{z}_{n,v}^T\mathbf{z}_{n,v''}=\|\mathbf{z}_{n,v}\|_2^2$ for all $v,v''\le V_g$, i.e., that the first $V_g$ embeddings coincide (the invariance regime in which the identity is applied).

Therefore, LHS = RHS, completing the proof.

Proof of [\[thm:kernel\_optimal\]](#thm:kernel_optimal){reference-type="ref" reference="thm:kernel_optimal"} {#proof:kernel_optimal}
------------------------------------------------------------------------------------------------------------

For each $x$, $$\mathrm{Bias}[\widehat m(x)]=\frac{h^2\mu_2(K)}{2}\Big(\Delta m(x)+2\nabla m(x)^\top \nabla\log p(x)\Big)+o(h^2).$$ Square and integrate against $p(x)$: $$\begin{aligned}
\mathcal{B}^2(h;p,m)
&=\Big(\frac{h^2\mu_2(K)}{2}\Big)^2\int \Big(\Delta m(x)+2\nabla m(x)^\top \nabla\log p(x)\Big)^2p(x)dx+o(h^4)\\
&\le \Big(\frac{h^2\mu_2(K)}{2}\Big)^2
\int \Big(2(\Delta m(x))^2+2(2\nabla m(x)^\top \nabla\log p(x))^2\Big)p(x)dx+o(h^4)\\
&=\Big(\frac{h^2\mu_2(K)}{2}\Big)^2\Big(2\int (\Delta m(x))^2p(x)dx+8\int (\nabla m(x)^\top \nabla\log p(x))^2p(x)dx\Big)+o(h^4),\end{aligned}$$ where we used $(a+b)^2\le 2 a^2+2 b^2$ pointwise. Since $|\Delta m(x)|\le B$ for all $x$, we have $$\int (\Delta m)^2p \le \int B^2p = B^2.$$ For the second term, first use Cauchy--Schwarz and then integrate against $p(x)$ to obtain $$\begin{aligned}
(\nabla m(x)^\top \nabla\log p(x))^2 \le \|\nabla m(x)\|^2\|\nabla\log p(x)\|^2 \le L^2\|\nabla\log p(x)\|^2\\
\implies \int (\nabla m(x)^\top \nabla\log p(x))^2p(x)dx
\le L^2\int \|\nabla\log p(x)\|^2p(x)dx
= L^2J(p).\end{aligned}$$ Combining this with the bounds above yields the desired result. Similarly, for the integrated variance, $$\begin{aligned}
\mathcal{V}(h;p)
&=\int \Big(\frac{R(K)}{n h^d}\frac{v(x)}{p(x)}+o\big((n h^d)^{-1}\big)\Big)p(x)dx=\frac{R(K)}{n h^d}\int v(x)dx+o\big((n h^d)^{-1}\big),\end{aligned}$$ which is independent of $p$.

Proof of [\[thm:spherical\_cramer\]](#thm:spherical_cramer){reference-type="ref" reference="thm:spherical_cramer"} {#proof:spherical_cramer}
------------------------------------------------------------------------------------------------------------------

We first recall the original Cramér--Wold theorem, which quantifies over all directions (not only unit-norm ones).

Cramér-Wold [@cramer1936some] Let $X$ and $Y$ be random vectors in $\mathbb{R}^D$: $$\begin{aligned}
    X \overset{d}{=} Y \iff \langle X, a \rangle \overset{d}{=} \langle Y, a \rangle, \quad \forall a \in \mathbb{R}^D.\end{aligned}$$

Our proof follows the same argument as the proof of [\[def:cramer\]](#def:cramer){reference-type="ref" reference="def:cramer"}. Necessity is immediate: if $X \stackrel{d}{=} Y$, then every measurable function of $X$ has the same distribution as the corresponding function of $Y$, of which the linear mapping $x \mapsto \langle u,x\rangle$ for $u \in \mathbb{S}^{d-1}$ is a special case. For sufficiency, assume $\langle u,X\rangle \stackrel{d}{=} \langle u,Y\rangle$ for all $u \in \mathbb{S}^{d-1}$. Let $\varphi_X(t) := \mathbb{E}\big[e^{i\langle t,X\rangle}\big]$ and $\varphi_Y(t) := \mathbb{E}\big[e^{i\langle t,Y\rangle}\big]$ denote the characteristic functions of $X$ and $Y$. Fix an arbitrary $t \in \mathbb{R}^d$; if $t=0$, then $\varphi_X(0)=\varphi_Y(0)=1$. If $t \neq 0$, write $t = s u$ with $s := \|t\| > 0$ and $u := t/\|t\| \in \mathbb{S}^{d-1}$. By the assumption, $\langle u,X\rangle \stackrel{d}{=} \langle u,Y\rangle$, hence for this $u$ and $s$ we have $$\varphi_X(t) \;=\; \mathbb{E}\big[e^{i\langle t,X\rangle}\big]
\;=\; \mathbb{E}\big[e^{i s \langle u,X\rangle}\big]
\;=\; \mathbb{E}\big[e^{i s \langle u,Y\rangle}\big]
\;=\; \mathbb{E}\big[e^{i\langle t,Y\rangle}\big]
\;=\; \varphi_Y(t).$$ Thus $\varphi_X(t) = \varphi_Y(t)$ for all $t \in \mathbb{R}^d$, i.e., $\varphi_X \equiv \varphi_Y$ on $\mathbb{R}^d$. By the uniqueness theorem for characteristic functions, this implies $X \stackrel{d}{=} Y$. (ii) Define $\psi_{n,t} := \mathbb{E}\big[e^{i\langle t,X_n\rangle}\big]$ and $\psi_{t} := \mathbb{E}\big[e^{i\langle t,X\rangle}\big]$. Fix $t \in \mathbb{R}^d$ and decompose $t = s u$ with $s := \|t\| \ge 0$ and $u \in \mathbb{S}^{d-1}$ (take, e.g., $u = t/\|t\|$ if $t \neq 0$, and any $u$ if $t=0$). The map $g_s:\mathbb{R}\to\mathbb{R}$, $g_s(x)=s x$, is continuous. By the continuous mapping theorem applied to the real-valued random variables $\langle u,X_n\rangle \xrightarrow{d} \langle u,X\rangle$, we obtain $$\langle t,X_n\rangle \;=\; s \langle u,X_n\rangle \xrightarrow{d} s \langle u,X\rangle \;=\; \langle t,X\rangle.$$ Hence, for every fixed $t \in \mathbb{R}^d$, the one-dimensional projections satisfy $\langle t,X_n\rangle \xrightarrow{d} \langle t,X\rangle$, which in turn yields pointwise convergence of characteristic functions: $$\psi_{n,t} \;=\; \mathbb{E}\big[e^{i\langle t,X_n\rangle}\big] \;\longrightarrow\; \mathbb{E}\big[e^{i\langle t,X\rangle}\big] \;=\; \psi_{t}, \qquad \text{for all } t \in \mathbb{R}^d.$$ Therefore, by Lévy's continuity theorem, $X_n \xrightarrow{d} X$. This completes the proof.
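A small numerical illustration of why *all* directions are needed (a sketch with assumed distributions): two distinct Gaussian laws can agree along one projection while another projection separates them.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
X = rng.multivariate_normal([0, 0], np.eye(2), size=n)            # N(0, I)
Y = rng.multivariate_normal([0, 0], np.diag([1.0, 4.0]), size=n)  # a different law

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# Along e1 both projections are N(0, 1): indistinguishable ...
print(np.var(X @ e1), np.var(Y @ e1))
# ... but along e2 the variances are 1 vs 4, so X and Y differ in law.
print(np.var(X @ e2), np.var(Y @ e2))
```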

Proof of [\[thm:bcs\]](#thm:bcs){reference-type="ref" reference="thm:bcs"} {#proof:bcs}
--------------------------------------------------------------------------

We first formulate the assumptions required for the proof; all of these are satisfied by typical univariate statistical tests.

$P=Q$ if and only if $P_a=Q_a$ for all $a\in S^{d-1}$ (population-level equivalence of laws).

$A_n$ are finite sets with mesh $\Delta(A_n):=\sup_{u\in S^{d-1}} \min_{a\in A_n}\|u-a\| \to 0$ as $n\to\infty$.

If $P\neq Q$, there exists a separating direction $a^\star\in S^{d-1}$ and a neighborhood $U$ of $a^\star$ such that $$\inf_{a\in U}\lim_{n\to\infty}\Pr\big(T_{a,n} \ge u_n(\alpha)\big)=1.$$ (Intuitively: near a truly separating direction, the 1D statistic eventually exceeds the global null threshold with probability $\to 1$.)

\(i\) Under $H_0:P=Q$, our assumption implies no separating direction exists at the population level, and the calibration of $u_n(\alpha)$ ensures $\Pr(M_n \ge u_n(\alpha)) \le \alpha$ for all $n$, hence $\limsup_{n\to\infty}\Pr(\Psi_n=1)\le \alpha$. (ii) Suppose $P\neq Q$. Our assumption guarantees that there exists at least one separating direction $a^\star$ with $P_{a^\star}\neq Q_{a^\star}$. Our assumption guarantees a neighborhood $U$ of $a^\star$ in which the projection statistics exceed the global null threshold with probability tending to 1: $$\inf_{a\in U}\lim_{n\to\infty}\Pr\big(T_{a,n} \ge u_n(\alpha)\big) \;=\; 1.$$ By assumption, for all large $n$ the set $A_n$ contains at least one direction $a_n\in U$ (dense coverage). Therefore, $$\Pr(\Psi_n=1) \;=\; \Pr\big(M_n \ge u_n(\alpha)\big)
\;\ge\; \Pr\big( T_{a_n,n} \ge u_n(\alpha) \big)
\;\longrightarrow\; 1,$$ which proves consistency.
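A minimal sketch of such a max-projection test, using the two-sample Kolmogorov--Smirnov statistic as the one-dimensional statistic $T_{a,n}$ (the choice of 1D test, sample sizes, and constants here are illustrative assumptions, not the paper's specific test):

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic for 1D samples."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(3)
n, d, M = 2000, 5, 256
P1 = rng.standard_normal((n, d))                 # two samples from P = N(0, I)
P2 = rng.standard_normal((n, d))
Q = rng.standard_normal((n, d)); Q[:, 0] *= 3.0  # Q differs from P along e_1

dirs = rng.standard_normal((M, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # finite set A_n on S^{d-1}

M_null = max(ks_stat(P1 @ u, P2 @ u) for u in dirs)  # P = Q: stays small
M_alt = max(ks_stat(P1 @ u, Q @ u) for u in dirs)    # P != Q: a direction fires
print(M_null, M_alt)
```

Directions close to the separating direction $e_1$ produce large 1D statistics, so the max statistic $M_n$ cleanly separates the two regimes, mirroring the consistency argument.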

Proof of [\[thm:spherical\_bounds\]](#thm:spherical_bounds){reference-type="ref" reference="thm:spherical_bounds"} {#proof:spherical_bounds}
------------------------------------------------------------------------------------------------------------------

For each case, consider the function $g(a)$ on $\mathbb{S}^{D-1}$ defined by the quantity of interest (CF, CDF, or moment) at a fixed $t$ or $k$. Since $f \in H^\alpha(\mathbb{R}^D)$, the mapping $a \mapsto g(a)$ is in $H^\alpha(\mathbb{S}^{D-1})$ for each fixed $t$ or $k$.

Given $M$ samples $\{a_i\}_{i=1}^M$ on the sphere, the best possible reconstruction of $g$ from its values at these points is given by spherical interpolation. By classical results on Sobolev spaces and spherical harmonics (see, e.g., [@narcowich2006localized]), the $L^2$ interpolation error for functions in $H^\alpha(\mathbb{S}^{D-1})$ using $M$ points is bounded by $$\mathbb{E}_b \left[ |g(b) - g^*(b)|^2 \right] \leq C(D, \alpha) M^{-2\alpha/(D-1)} \| g \|_{H^\alpha(\mathbb{S}^{D-1})}^2,$$ where $g^*$ is the interpolant matching $g$ at the $M$ sampled points. The interpolation error bound on the sphere follows from the theory of spherical harmonics and Marcinkiewicz--Zygmund (MZ) inequalities; in what follows we write $d := D-1$, so that $\mathbb{S}^d = \mathbb{S}^{D-1}$. Any $f \in H^\alpha(\mathbb{S}^d)$ admits a spherical harmonics expansion, and the best $L^2$ approximation by harmonics of degree at most $L$ satisfies $$\|f - P_L f\|_{L^2(\mathbb{S}^d)} \leq (1 + L^2)^{-\alpha/2} \|f\|_{H^\alpha(\mathbb{S}^d)},$$ where $P_L f$ is the projection onto harmonics of degree $\leq L$ [@narcowich2006localized Lemma 2.1]. If $M$ points are distributed quasi-uniformly on $\mathbb{S}^d$, then for $L \sim c M^{1/d}$, the set forms an MZ set for degree $L$ [@mhaskar2001spherical Theorem 1.1]. This allows reconstruction of any function in the space of harmonics of degree at most $L$ from its values at these points, and the $L^2$ interpolation error for $f$ is bounded by $$\|f - I_M f\|_{L^2(\mathbb{S}^d)} \leq C (1 + L^2)^{-\alpha/2} \|f\|_{H^\alpha(\mathbb{S}^d)},$$ where $I_M f$ is any interpolant matching $f$ at the $M$ points [@narcowich2006localized Theorem 3.1]. Substituting $L \sim c M^{1/d}$ yields the rate $M^{-\alpha/d}$, and thus $$\mathbb{E}_{\omega} |f(\omega) - I_M f(\omega)|^2 \leq C(d, \alpha) M^{-2\alpha/d} \|f\|_{H^\alpha(\mathbb{S}^d)}^2,$$ with explicit $C(d, \alpha)$ as in the main theorem.
Integrating (or summing) over $t$ (for CF and CDF) or $k$ (for moments, with weights $w_k$) yields the stated bounds. The explicit constant $C(D, \alpha)$ arises from the theory of spherical Sobolev spaces and is given above.

For the moment case, the sum over $k$ is weighted to ensure convergence, as higher moments may grow rapidly. The weights $w_k$ can be chosen, for example, as $w_k = 1/k!$.

This completes the proof.

Proof of [\[thm:moment\_conendrum\]](#thm:moment_conendrum){reference-type="ref" reference="thm:moment_conendrum"} {#proof:moment_conendrum}
------------------------------------------------------------------------------------------------------------------

Pick distinct $x_0,\dots,x_{K+1}\in\mathbb{R}$ and consider the linear map $A:\mathbb{R}^{K+2}\to\mathbb{R}^{K+1}$, $(Ap)_r=\sum_{j=0}^{K+1} p_j x_j^r$ for $r=0,\dots,K$. Then $\mathrm{rank}(A)\le K+1$, so $\ker(A)\neq\{0\}$. Let $v\in\ker(A)\setminus\{0\}$; from $(Ap)_0=\sum_j p_j$, we get $\sum_j v_j=0$, hence $v$ has positive and negative entries. Choose a strictly positive probability vector $p$ and $\varepsilon>0$ small such that $p^\pm:=p\pm\varepsilon v$ remain probability vectors. Then $Ap^+=Ap^-$, so the distributions supported on $\{x_j\}$ with masses $p^\pm$ are distinct yet match moments up to order $K$.
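The construction is fully explicit and can be carried out numerically (a sketch; the support points and base masses are arbitrary choices):

```python
import numpy as np

K = 3
x = np.linspace(-1.0, 1.0, K + 2)             # K+2 distinct support points
A = np.vstack([x**r for r in range(K + 1)])   # (A p)_r = sum_j p_j x_j^r
v = np.linalg.svd(A)[2][-1]                   # a null vector: rank(A) <= K+1

p = np.full(K + 2, 1.0 / (K + 2))             # strictly positive base masses
eps = 0.4 * np.min(p / np.abs(v))             # keep p +/- eps*v positive
p_plus, p_minus = p + eps * v, p - eps * v

# Both are probability vectors with identical moments up to order K,
# yet they define distinct distributions on the same support.
print(np.max(np.abs(A @ p_plus - A @ p_minus)))  # ~ 0: moments 0..K match
print(np.max(np.abs(p_plus - p_minus)))          # > 0: the laws differ
```

The row $r=0$ of $A$ is all ones, so $A v=0$ forces $\sum_j v_j=0$ automatically, exactly as in the proof.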

Proof of [\[thm:ecf\_stability\]](#thm:ecf_stability){reference-type="ref" reference="thm:ecf_stability"} {#proof:ecf_stability}
---------------------------------------------------------------------------------------------------------

Fix the Gaussian weight $$w_s(t)=e^{-s^2 t^2},\qquad s>0,$$ and define the population CF distance $$D(P,G)=\int_{\mathbb{R}} w_s(t)\big|\varphi_P(t)-\varphi_G(t)\big|^2dt.$$ Let the empirical CF be $$\widehat{\varphi}_N(t)=\frac{1}{N}\sum_{i=1}^N e^{itX_i},$$ and consider the V-statistic estimator $$\widehat{D}_V=\int_{\mathbb{R}} w_s(t)\big|\widehat{\varphi}_N(t)-\varphi_G(t)\big|^2dt.$$ We use only that $|e^{itX}|=1$, $|\varphi_P(t)|\le 1$, $|\varphi_G(t)|\le 1$, and integrability of $w_s$. For each $i$ differentiate under the integral (dominated convergence applies because the integrand and its derivative are bounded) $$\begin{aligned}
\frac{\partial \widehat{D}_V}{\partial X_i}
=& \int_{\mathbb{R}} w_s(t)2\Re\!\Big(\big(\widehat{\varphi}_N(t)-\varphi_G(t)\big)\overline{\frac{\partial \widehat{\varphi}_N(t)}{\partial X_i}}\Big)dt,\\
\frac{\partial \widehat{\varphi}_N(t)}{\partial X_i}
=& \frac{1}{N}i te^{itX_i},\end{aligned}$$ since $|\widehat{\varphi}_N(t)|\le 1$ and $|\varphi_G(t)|\le 1$, $$\begin{aligned}
\left|\frac{\partial \widehat{D}_V}{\partial X_i}\right|
&\le \frac{2}{N}\int w_s(t)|t|\big(|\widehat{\varphi}_N(t)|+|\varphi_G(t)|\big)dt\\
&\le \frac{4}{N}\int w_s(t)|t|dt\\
&= \frac{4}{Ns^2},\end{aligned}$$ using $\int_{\mathbb{R}} e^{-s^2 t^2}|t|dt=1/s^2$. $$\Bigg|\frac{\partial \widehat{D}_V}{\partial X_i}\Bigg|
\;\le\;\frac{4}{N}\int_{\mathbb{R}} w_s(t)|t|dt
\;=\; \frac{4}{Ns^2}.$$ Moreover, differentiating once more in $X_i$ and using $|\widehat{\varphi}_N(t)|\le 1$, $|\varphi_G(t)|\le 1$ gives a global Lipschitz bound $$\Bigg|\frac{\partial^2 \widehat{D}_V}{\partial X_i^2}\Bigg|
\;\le\; \frac{C}{N}\int_{\mathbb{R}} w_s(t)t^2dt
\;=\; \frac{C}{N}\cdot \frac{\sqrt{\pi}}{2s^3},$$ for some absolute constant $C$ arising from bounded factors and product rule. Hence ECF gradients are uniformly bounded and Lipschitz, with scale controlled only by $(N,s)$.

(Moment sample-gradients are polynomial in $X_i$ and unbounded for $k\ge 2$.) Let $\widehat{D}_V$ be as above. Define the moment objective $$\widehat{D}_k
\;=\;
(\bar{\phi}-\mu)^\top W(\bar{\phi}-\mu),\qquad
\bar{\phi}:=\frac{1}{N}\sum_{i=1}^N \phi(X_i),\quad
\phi(x)=(x,x^2,\dots,x^k)^\top,$$ for a symmetric positive semidefinite $W\in\mathbb{R}^{k\times k}$ and Gaussian target moments $\mu=\mathbb{E}_G[\phi(Y)]$. For each $i$, $$\begin{aligned}
\frac{\partial \widehat{D}_k}{\partial X_i}
=& \frac{2}{N}(\bar{\phi}-\mu)^\top W\frac{\partial \phi(X_i)}{\partial X_i},\\
\frac{\partial \phi(X)}{\partial X}
=& \big(1,2X,3X^2,\dots,k X^{k-1}\big)^\top.\end{aligned}$$ The gradient formula follows by the chain rule and linearity of $\bar{\phi}$. Let $c:=W(\bar{\phi}-\mu)$ and write $c_r$ for its $r$-th coordinate. Then $$\frac{\partial \widehat{D}_k}{\partial X_i}
= \frac{2}{N}\sum_{r=1}^k c_r\,r\,X_i^{r-1},$$ which is a polynomial in $X_i$ of degree $\deg=\max\{r-1: c_r\neq 0\}\le k-1$. Whenever some $c_r\neq 0$ with $r\ge 2$, this polynomial is nonconstant, so the gradient cannot be uniformly bounded on $\mathbb{R}$. In particular, if $c_k\neq 0$ (the generic case when the top-weighted deviation is nonzero), the leading term dominates and $$\left|\frac{\partial \widehat{D}_k}{\partial X_i}\right|
\;\xrightarrow[|X_i|\to\infty]{}\; \infty
\quad\text{like}\quad |X_i|^{k-1},$$ proving unboundedness for $k\ge 2$.

Proof of [\[thm:gradient\_bias\]](#thm:gradient_bias){reference-type="ref" reference="thm:gradient_bias"} {#proof:gradient_bias}
---------------------------------------------------------------------------------------------------------

We proceed by direct calculation. Fix $t \in \mathbb{R}^d$ and abbreviate $Z_j \coloneqq e^{\mathrm{i} t^\top X_j}$, so that $\phi_n(t) = \frac{1}{n}\sum_{j=1}^n Z_j$. Note that $|Z_j|=1$ almost surely (since $t^\top X_j\in\mathbb{R}$), and $\mathbb{E}[Z_j]=\phi_\theta(t)$ for all $j$. We start from the algebraic identity $$\big|\phi_n(t) - \psi(t)\big|^2
=
\phi_n(t)\overline{\phi_n(t)}
-
\psi(t)\overline{\phi_n(t)}
-
\overline{\psi(t)}\phi_n(t)
+
\big|\psi(t)\big|^2.$$ Taking expectations term by term gives $$\begin{aligned}
\mathbb{E}\left[ \big|\phi_n - \psi\big|^2 \right]
=&
\mathbb{E}\left[ |\phi_n|^2 \right]
-
\psi\mathbb{E}\left[ \overline{\phi_n} \right]
-
\overline{\psi}\mathbb{E}\left[ \phi_n \right]
+
|\psi|^2,\\
=&
\mathbb{E}\left[ |\phi_n|^2 \right]
- \psi\overline{\mathbb{E}[\phi_n]}-\overline{\psi}\frac{1}{n}\sum_{j=1}^n \mathbb{E}[Z_j] +
|\psi|^2,\\
=&
\mathbb{E}\left[ |\phi_n|^2 \right]
-\psi\overline{\phi_\theta} - \overline{\psi}\phi_\theta+
|\psi|^2,\\
=&\mathbb{E}\left[ |\phi_n|^2 \right]
-2\mathrm{Re}\big( \overline{\psi}\phi_\theta \big)+
|\psi|^2,\\
=&\mathbb{E}\left[ \left|\frac{1}{n}\sum_{j=1}^n Z_j\right|^2 \right]-2\mathrm{Re}\big( \overline{\psi}\phi_\theta \big)+
|\psi|^2,\\
=&\frac{1}{n^2}\sum_{j=1}^n \sum_{l=1}^n \mathbb{E}\left[ Z_j\overline{Z_l} \right]-2\mathrm{Re}\big( \overline{\psi}\phi_\theta \big)+
|\psi|^2.\end{aligned}$$ Since the $Z_j$ are i.i.d., $$\mathbb{E}\left[ Z_j\overline{Z_l} \right]
=
\begin{cases}
\mathbb{E}\left[ |Z_1|^2 \right] = 1, & \text{if } j=l,\\[4pt]
\mathbb{E}[Z_j]\overline{\mathbb{E}[Z_l]} = \phi_\theta\overline{\phi_\theta} = |\phi_\theta|^2, & \text{if } j\neq l,
\end{cases}$$ hence $$\begin{aligned}
    \mathbb{E}\left[ |\phi_n|^2 \right]
=&
\frac{1}{n^2}\Big( n + n(n-1) |\phi_\theta|^2 \Big)\\
=&
\frac{1}{n} + \left(1-\frac{1}{n}\right)|\phi_\theta|^2\\
=&
|\phi_\theta|^2 + \frac{1-|\phi_\theta|^2}{n}.\end{aligned}$$ Plugging these expressions in, we obtain $$\begin{aligned}
\mathbb{E}\left[ \big|\phi_n - \psi\big|^2 \right]
&=
\left( |\phi_\theta|^2 + \frac{1-|\phi_\theta|^2}{n} \right)
-
2\mathrm{Re}\big( \overline{\psi}\phi_\theta \big)
+
|\psi|^2
\\[4pt]
&=
\big( |\phi_\theta|^2 - 2\mathrm{Re}\big( \overline{\psi}\phi_\theta \big) + |\psi|^2 \big)
+
\frac{1-|\phi_\theta|^2}{n}
\\[4pt]
&=
\big|\phi_\theta - \psi\big|^2 + \frac{1-|\phi_\theta|^2}{n}.\end{aligned}$$ By dominated convergence, $\E[\nabla_\theta D_n(t)] = \nabla_\theta \E[D_n(t)]$, hence $$\E\left[\nabla_\theta D_n(t)\right]
= \nabla_\theta \big|\phi_\theta(t)-\psi(t)\big|^2
+ \nabla_\theta \frac{1-|\phi_\theta(t)|^2}{n},$$ concluding the proof.
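The key identity $\mathbb{E}\big|\phi_n-\psi\big|^2=\big|\phi_\theta-\psi\big|^2+(1-|\phi_\theta|^2)/n$ is easy to confirm by simulation (the Gaussian model and target below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, t, sigma = 20, 1.3, 2.0
phi_theta = np.exp(-(sigma * t)**2 / 2)       # CF of the model N(0, sigma^2) at t
psi = np.exp(-t**2 / 2)                       # target CF: N(0, 1)

trials = 200_000
X = sigma * rng.standard_normal((trials, n))
phi_n = np.mean(np.exp(1j * t * X), axis=1)   # empirical CF at t, one per trial
lhs = np.mean(np.abs(phi_n - psi)**2)
rhs = abs(phi_theta - psi)**2 + (1 - phi_theta**2) / n
print(lhs, rhs)                               # agree up to Monte Carlo error
```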

In practice one replaces $\int_{\mathbb{R}} w(t)(\cdot)dt$ by a deterministic quadrature on a uniform grid $t_k\in[-T,T]$ with weights $\omega_k$ (e.g. trapezoidal rule) and a Gaussian window $w(t)=e^{-\alpha t^2}$. All statements above remain valid with the integral replaced by $\sum_k \omega_k (\cdot)$: $$L(\theta) \approx \sum_{k} \omega_k\big|\phi_\theta(t_k)-\psi(t_k)\big|^2,
\quad
\widehat{L}_n(\theta) \approx \sum_{k} \omega_k\big|\phi_n(t_k)-\psi(t_k)\big|^2,$$ and the bias term becomes $$\text{Bias}(\theta) = -\frac{1}{n}\sum_k \omega_k\nabla_\theta \big|\phi_\theta(t_k)\big|^2.$$ Since the grid and weights are deterministic, they do not affect unbiasedness with respect to sampling; they only introduce a deterministic approximation error to the target functional $L(\theta)$.

Proof of VICReg's Recovery {#proof:vcreg}
--------------------------

We prove this result in two parts.

#### Part I: $\mathbb{E}[\mathbf{X}] = \mathbf{0}$

Given that $\mathbb{E}[\langle \mathbf{X}, \mathbf{a} \rangle] = 0$ for all unit vectors $\mathbf{a}$, and noting that $\langle \mathbf{X}, \mathbf{a} \rangle = \mathbf{a}^T \mathbf{X}$, we have: $$\label{eq:mean_condition}
\mathbb{E}[\mathbf{a}^T \mathbf{X}] = 0 \quad \text{for all } \mathbf{a} \in \mathbb{R}^d \text{ with } \|\mathbf{a}\| = 1$$ By linearity of expectation: $$\mathbf{a}^T \mathbb{E}[\mathbf{X}] = 0 \quad \text{for all unit vectors } \mathbf{a}$$ Let $\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}]$. We claim that $\boldsymbol{\mu} = \mathbf{0}$. Suppose, for the sake of contradiction, that $\boldsymbol{\mu} \neq \mathbf{0}$. Then $\|\boldsymbol{\mu}\|_2 > 0$. Define the unit vector: $$\mathbf{a}^* = \frac{\boldsymbol{\mu}}{\|\boldsymbol{\mu}\|_2}$$ Since $\mathbf{a}^*$ is a unit vector, equation [\[eq:mean\_condition\]](#eq:mean_condition){reference-type="eqref" reference="eq:mean_condition"} implies: $$(\mathbf{a}^*)^T \boldsymbol{\mu} = 0$$ However, substituting the definition of $\mathbf{a}^*$: $$(\mathbf{a}^*)^T \boldsymbol{\mu} = \left(\frac{\boldsymbol{\mu}}{\|\boldsymbol{\mu}\|_2}\right)^T \boldsymbol{\mu} = \frac{\boldsymbol{\mu}^T \boldsymbol{\mu}}{\|\boldsymbol{\mu}\|_2} = \frac{\|\boldsymbol{\mu}\|_2^2}{\|\boldsymbol{\mu}\|_2} = \|\boldsymbol{\mu}\|_2 > 0$$ This contradiction establishes that $\boldsymbol{\mu} = \mathbf{0}$.

#### Part II: $\mathrm{Cov}(\mathbf{X}) = \mathbf{I}_d$

Since $\mathbb{E}[\mathbf{X}] = \mathbf{0}$, we have: $$\mathrm{Var}(\langle \mathbf{X}, \mathbf{a} \rangle) = \mathbb{E}[(\langle \mathbf{X}, \mathbf{a} \rangle)^2] = \mathbb{E}[(\mathbf{a}^T \mathbf{X})^2]$$ Expanding the quadratic form: $$\mathbb{E}[(\mathbf{a}^T \mathbf{X})^2] = \mathbb{E}[\mathbf{a}^T \mathbf{X} \mathbf{X}^T \mathbf{a}] = \mathbf{a}^T \mathbb{E}[\mathbf{X} \mathbf{X}^T] \mathbf{a}$$ Since $\mathbb{E}[\mathbf{X}] = \mathbf{0}$, the covariance matrix is $\mathrm{Cov}(\mathbf{X}) = \mathbb{E}[\mathbf{X} \mathbf{X}^T]$. Let $\boldsymbol{\Sigma} = \mathrm{Cov}(\mathbf{X})$. The variance condition gives us: $$\label{eq:quadratic_form}
\mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a} = 1 \quad \text{for all unit vectors } \mathbf{a}$$ We now show that $\boldsymbol{\Sigma} = \mathbf{I}_d$. *Step 1: Diagonal entries.* For $i \in \{1, 2, \ldots, d\}$, let $\mathbf{e}_i$ denote the $i$-th standard basis vector. Setting $\mathbf{a} = \mathbf{e}_i$ in equation [\[eq:quadratic\_form\]](#eq:quadratic_form){reference-type="eqref" reference="eq:quadratic_form"}: $$\mathbf{e}_i^T \boldsymbol{\Sigma} \mathbf{e}_i = \Sigma_{ii} = 1$$ Therefore, all diagonal entries of $\boldsymbol{\Sigma}$ equal 1. *Step 2: Off-diagonal entries.* For distinct indices $i, j \in \{1, 2, \ldots, d\}$, consider the unit vector: $$\mathbf{a} = \frac{\mathbf{e}_i + \mathbf{e}_j}{\|\mathbf{e}_i + \mathbf{e}_j\|_2} = \frac{\mathbf{e}_i + \mathbf{e}_j}{\sqrt{2}}$$ Applying equation [\[eq:quadratic\_form\]](#eq:quadratic_form){reference-type="eqref" reference="eq:quadratic_form"}: $$\mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a} = \frac{1}{2}(\mathbf{e}_i + \mathbf{e}_j)^T \boldsymbol{\Sigma} (\mathbf{e}_i + \mathbf{e}_j) = 1$$ Expanding the quadratic form and using the symmetry of $\boldsymbol{\Sigma}$: $$\begin{aligned}
\frac{1}{2}(\mathbf{e}_i^T \boldsymbol{\Sigma} \mathbf{e}_i + 2\mathbf{e}_i^T \boldsymbol{\Sigma} \mathbf{e}_j + \mathbf{e}_j^T \boldsymbol{\Sigma} \mathbf{e}_j) &= 1\\
\frac{1}{2}(\Sigma_{ii} + 2\Sigma_{ij} + \Sigma_{jj}) &= 1\\
\frac{1}{2}(1 + 2\Sigma_{ij} + 1) &= 1\\
1 + \Sigma_{ij} &= 1\\
\Sigma_{ij} &= 0\end{aligned}$$ Therefore, all off-diagonal entries of $\boldsymbol{\Sigma}$ equal zero, establishing that $\boldsymbol{\Sigma} = \mathbf{I}_d$.
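The two parts above admit a quick numerical sanity check: under the identity covariance every unit direction has projection variance exactly $1$, while an anisotropic covariance spreads the directional variances out, which is exactly why the "variance $1$ in every direction" premise pins $\boldsymbol{\Sigma}$ to the identity. A minimal NumPy sketch (all names ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

def directional_variances(cov, n_dirs=1000):
    """Variance a^T Sigma a of the 1d projection along random unit directions a."""
    a = rng.standard_normal((n_dirs, cov.shape[0]))
    a /= np.linalg.norm(a, axis=1, keepdims=True)   # project onto the unit sphere
    return np.einsum('nd,de,ne->n', a, cov, a)      # a^T Sigma a, one value per direction

# Identity covariance: every unit direction has variance exactly 1 (Part II).
v_iso = directional_variances(np.eye(d))

# Anisotropic covariance (random Wishart): directional variances spread out.
A = rng.standard_normal((d, d))
v_aniso = directional_variances(A @ A.T)
```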

Background
==========

**Foundation: The Linear Regression Model** We start with the standard linear regression model: $$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$ where:

-   $\mathbf{y} = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^n$ is the response vector

-   $\mathbf{X} \in \mathbb{R}^{n \times p}$ is the design matrix with $\mathbf{X}_{ij} = x_{ij}$

-   $\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_p]^T \in \mathbb{R}^p$ is the parameter vector

-   $\boldsymbol{\varepsilon} = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n]^T \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)$ is the error vector

The error assumption means: $$\mathbb{E}[\varepsilon_i] = 0, \quad \text{Var}(\varepsilon_i) = \sigma^2, \quad \text{Cov}(\varepsilon_i, \varepsilon_j) = 0 \text{ for } i \neq j$$ **Step 1: Deriving the OLS Estimator** To find the OLS estimator, we minimize the sum of squared residuals: $$\text{SSR}(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$ Expanding this quadratic form: $$\begin{aligned}
\text{SSR}(\boldsymbol{\beta}) &= \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}\end{aligned}$$ Taking the derivative with respect to $\boldsymbol{\beta}$: $$\frac{\partial \text{SSR}}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$ Setting equal to zero and solving: $$-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$$ $$\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$$ Assuming $\mathbf{X}^T\mathbf{X}$ is invertible: $$\boxed{\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}}$$
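The closed form can be verified numerically; a minimal NumPy sketch (data and names illustrative) solving the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 0.1
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + sigma * rng.standard_normal(n)

# Closed form beta_hat = (X^T X)^{-1} X^T y. In practice, solve the normal
# equations rather than forming the explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With a well-conditioned design this agrees with `np.linalg.lstsq` to machine precision; for ill-conditioned designs a QR- or SVD-based solver is preferred over the explicit normal equations.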

![image](toy_figures/bound_constant.png){width="0.49\\linewidth"}

dimension=128, slices=10\
![image](toy_figures/nonparametric/2d_slicing_dim_128_N_100_slices_10.pdf){width="0.9\\linewidth"} dimension=128, slices=100\
![image](toy_figures/nonparametric/2d_slicing_dim_128_N_100_slices_100.pdf){width="0.9\\linewidth"} dimension=1024, slices=100\
![Reprise of [\[fig:nonparametric\]](#fig:nonparametric){reference-type="ref" reference="fig:nonparametric"} for additional dimensions and numbers of 1d projections.](toy_figures/nonparametric/2d_slicing_dim_1024_N_100_slices_100.pdf "fig:"){#fig:extra_nonparametric width="0.9\\linewidth"}

![image](toy_figures/le/beta_distributions_binary.pdf){width="0.8\\linewidth"} ![image](toy_figures/le/beta_distributions_linear.pdf){width="0.8\\linewidth"} ![Depiction of the distribution of optimized $\beta$ values from OLS when comparing $\mZ_{\rm iso}$ and $\mZ_{\rm aniso}$ from [\[thm:linear\_probe\_bias,thm:linear\_probe\_variance\]](#thm:linear_probe_bias,thm:linear_probe_variance){reference-type="ref" reference="thm:linear_probe_bias,thm:linear_probe_variance"}. We clearly observe that the anisotropic version (**blue**) provides much lower variance than the isotropic case (**red**). We consider binary classification with linearly separable classes (**top row**), a linear regression task (**middle row**), and a nonlinear regression task with smooth targets (**bottom row**). For each case, we resample the training samples numerous times and produce an estimate of $\beta$ each time. Because the data is $2$-dimensional, we can visualize the $\beta$ distribution directly.](toy_figures/le/beta_distributions_smooth.pdf "fig:"){#fig:beta_distributions width="0.8\\linewidth"}

![image](toy_figures/le/acc_vs_lambda_N50_linear_classification.pdf){width="0.32\\linewidth"} ![image](toy_figures/le/acc_vs_lambda_N200_linear_classification.pdf){width="0.32\\linewidth"} ![image](toy_figures/le/acc_vs_lambda_N1000_linear_classification.pdf){width="0.32\\linewidth"}\
![image](toy_figures/le/cosine_vs_lambda_N50_linear_regression.pdf){width="0.32\\linewidth"} ![image](toy_figures/le/cosine_vs_lambda_N200_linear_regression.pdf){width="0.32\\linewidth"} ![Depiction of accuracy (**top**) and cosine similarity between the estimated and true estimator (**bottom**) for the OLS setting with varying strength of Tikhonov regularization (**x-axis**), comparing isotropic and anisotropic embeddings. As per [\[thm:gradient\_bias\]](#thm:gradient_bias){reference-type="ref" reference="thm:gradient_bias"}, the anisotropic distribution creates a bias in the OLS estimation for nonzero regularization.](toy_figures/le/cosine_vs_lambda_N1000_linear_regression.pdf "fig:"){#fig:ols_bias width="0.32\\linewidth"}

![image](toy_figures/exps/loss_corr/loss_corr_resnet50_galaxy10.pdf){width="0.33\\linewidth"} ![image](toy_figures/exps/loss_corr/loss_corr_resnet50_inet10.pdf){width="0.33\\linewidth"} ![image](toy_figures/exps/loss_corr/loss_corr_ViT-base-8_inet1k.pdf){width="0.33\\linewidth"} ![image](toy_figures/exps/loss_corr/loss_corr_ViT-s-8_galaxy10.pdf){width="0.33\\linewidth"} ![image](toy_figures/exps/loss_corr/loss_corr_ViT-s-8_inet10.pdf){width="0.33\\linewidth"} ![Additional loss-correlation figures for further backbone and dataset combinations.](toy_figures/exps/loss_corr/loss_corr_resnet18_flowers102.pdf "fig:"){#fig:corr_loss_extra width="0.33\\linewidth"}

::: {#tab:galaxy10}
  --------------------- ------------------ ----------- ----------- ----------- ----------- ----------- ----------- -----------
  **Freeze Backbone**   **Model Name**         **All**       **1**       **2**       **5**      **10**     **100**    **1000**
                                                                                                                   
                        ConvNeXt-V2 Nano         82.72   **29.42**   **36.65**   **50.94**   **59.85**   **75.34**       81.97
                        LeViT-128                79.41       18.45       24.08       33.11       41.76       64.59       77.59
                        ResNet-18                82.15       23.34       31.56       43.82       54.64       73.53       81.41
                        ResNet-34            **83.28**       24.27       31.51       44.23       53.95       74.93   **82.32**
  
                        DINOv2 Small             78.34       21.05       21.71       30.33       36.23       60.81       75.55
                        DINOv3 ViT-S/16          81.60       24.71       29.43       37.71       44.71       69.87       80.54
                                                                                                                   
                        ConvNeXt-V2 Nano         76.52       28.74       36.65       50.60       59.50       72.62       77.24
                        LeViT-128                69.00       25.85       33.30       45.52       52.43       64.37       69.39
                        ResNet-18                75.95       30.48       38.22       50.85       58.86       72.70       76.39
                        ResNet-34            **78.17**   **31.08**   **38.33**   **52.26**   **60.63**   **74.77**   **78.62**
  
                        DINOv2 Small             67.62       27.68       32.22       40.72       47.72       62.49       67.89
                        DINOv3 ViT-S/16          71.38       30.17       36.65       45.74       51.51       65.90       71.35
  --------------------- ------------------ ----------- ----------- ----------- ----------- ----------- ----------- -----------

  : Performance metrics across different sample sizes from [\[fig:galaxy10\]](#fig:galaxy10){reference-type="ref" reference="fig:galaxy10"}
:::

![Proposed trapezoid quadrature for the Epps-Pulley statistic as implemented in [\[lst:epps-pulley-pytorch\]](#lst:epps-pulley-pytorch){reference-type="ref" reference="lst:epps-pulley-pytorch"}. We depict the approximation error of the integral for various distributions, demonstrating rapid convergence (faster than quadratic, shown as the **grey line**) across possible embedding distributions.](toy_figures/quadrature_convergence.pdf){#fig:quadrature width="\\linewidth"}

![image](toy_figures/exps/loss_corr/selection_ViT-s-8_galaxy10.png){width="0.33\\linewidth"} ![image](toy_figures/exps/loss_corr/selection_ViT-s-8_inet10.png){width="0.33\\linewidth"} ![image](toy_figures/exps/loss_corr/selection_resnet18_flowers102.png){width="0.33\\linewidth"}

  -------------- ----------- --------- --------- --------- --------- --------- --------- --------- --------- ---------
                 Backbone                                                                                    
                 Projector   1-layer   2-layer   3-layer   1-layer   2-layer   3-layer   1-layer   2-layer   3-layer
  w/ predictor   w/ SWA                                                                                      
                 False       79.71     82.44     83.93     76.59     80.77     81.07     71.79     76.87     80.37
                 True        79.79     82.69     83.50     79.96     83.63     84.12     75.86     82.36     80.50
                 False       79.41     82.44     83.57     77.58     79.41     81.91     67.74     77.64     80.73
                 True        78.87     82.04     82.82     77.11     81.77     82.58     69.53     78.27     79.77
  -------------- ----------- --------- --------- --------- --------- --------- --------- --------- --------- ---------

  ------------------------------- ------------------- ------------ ---------- --------- -------- --------- ----------
                                  Pretraining         flowers102   cifar100   food101   inet10   cifar10   galaxy10
                                  \# train. samples   1020         50000      75750     13000    50000     11008
  LeJEPA (convnextv2\_nano) 14M   in-domain           64.34        69.26      69.59     90.81    92.22     76.05
  LeJEPA (resnet18) 11M           in-domain           74.57        69.94      73.57     92.36    92.51     75.32
  LeJEPA (resnet34) 21M           in-domain           71.85        70.44      74.95     92.80    93.16     77.29
  LeJEPA (resnext26ts) 8M         in-domain           82.19        69.10      76.77     92.82    91.59     73.78
  LeJEPA (swin\_tiny) 27M         in-domain           63.94        65.08      78.40     92.87    92.67     74.89
  IJEPA-inet22k (ViT-H/14) 630M   inet1k              85.76        86.93      81.06     98.65    97.77     62.93
  ------------------------------- ------------------- ------------ ---------- --------- -------- --------- ----------

::: {#tab:times}
  -------- ----------- ---------- ----------- ----------
         N           M     points    mean (ms)   std (ms)
       512         512         16    0.465236   0.011642
       512         512         64    0.461317   0.003894
       512         512        256    0.627644   0.003337
      2048         512         16    1.406441   0.002415
      8192         512         16    6.188304   0.007226
      8192        8192         16    8.685009   0.038829
     32768         512         16   26.373118   0.012732
       512        2048         16    0.465614   0.005274
       512        8192         16    0.670379   0.006854
  -------- ----------- ---------- ----------- ----------

  : Time (in milliseconds) to compute the proposed SIGReg loss from [\[lst:epps-pulley-pytorch\]](#lst:epps-pulley-pytorch){reference-type="ref" reference="lst:epps-pulley-pytorch"} on a Tesla V100-SXM2-16GB for varying mini-batch size ($N$), number of slices ($M$), and number of integration points. Results are computed over $10$ runs.
:::

::: {#tab:lambda_perf}
  ----------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
                                                                                                      
  $\lambda$   0.001   0.005   0.010   0.020   0.025   0.050   0.100   0.150   0.200   0.300   0.400   0.500
  \#views                                                                                             
  2           81.41   82.73   83.49   82.99   82.23   \-      \-      \-      \-      \-      \-      \-
  4           79.88   83.04   84.36   84.68   84.33   83.00   82.91   81.05   78.58   \-      \-      \-
  8           76.67   81.58   83.59   83.49   83.76   84.32   83.66   83.07   82.16   81.00   79.25   77.72
  ----------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------

  : Numerical results corresponding to [5](#fig:lambda_views){reference-type="ref" reference="fig:lambda_views"}.
:::

Details on Low-Discrepancy Sequences {#sec:low_discrepancy}
====================================

Quasi-Monte Carlo (QMC) methods, such as the Sobol sequence, are widely used to generate low-discrepancy samples in the unit hypercube, providing improved uniformity over purely random sampling. To obtain samples uniformly distributed on the hypersphere, each QMC point is mapped to a standard normal vector via the inverse cumulative distribution function (CDF), and then projected onto the sphere by normalization. This approach leverages the rotational invariance of the multivariate normal distribution, ensuring that the resulting directions are uniformly distributed on the sphere's surface. While the low-discrepancy property is not strictly preserved under this nonlinear mapping, the resulting samples are empirically more uniform than random samples and are standard in high-dimensional applications [@marsaglia1972choosing; @dick2010digital; @caflisch1998monte].

**Input:** number of points $N$, dimension $d$. **Output:** points $\{\mathbf{y}_i\}_{i=1}^N$ quasi-uniformly distributed on $\mathbb{S}^{d-1}$.

1.  Generate $\mathbf{x}_i \in [0,1]^d$ as the $i$-th point of a Sobol sequence.
2.  Transform each component: $z_{i,j} = \Phi^{-1}(x_{i,j})$ for $j = 1, \ldots, d$.
3.  Normalize: $\mathbf{y}_i = \mathbf{z}_i / \|\mathbf{z}_i\|_2$.
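The steps above can be sketched with SciPy's quasi-Monte Carlo module (`sobol_sphere` is our name, not from the paper):

```python
import numpy as np
from scipy.stats import norm, qmc

def sobol_sphere(n_points, dim, seed=0):
    """Quasi-uniform directions on S^{d-1}: Sobol points -> inverse normal CDF -> normalize."""
    sobol = qmc.Sobol(d=dim, scramble=True, seed=seed)
    x = sobol.random(n_points)                   # low-discrepancy points in (0, 1)^d
    z = norm.ppf(x)                              # componentwise Phi^{-1}: maps to N(0, I_d)
    return z / np.linalg.norm(z, axis=1, keepdims=True)   # project onto the sphere

directions = sobol_sphere(128, 16)               # powers of 2 preserve Sobol balance
```

Scrambling keeps the Sobol points away from $0$ and $1$, so $\Phi^{-1}$ stays finite.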

Shapiro-Wilk Test {#sec:shapiro_wilk}
=================

Let $X_1 < X_2 < \cdots < X_n$ denote an ordered random sample of size $n$ from a standard normal distribution. Also, let $\mathbf{m} = (m_1, m_2, \ldots, m_n)^{\prime}$ be the vector of expected values of the standard normal order statistics, and let $\mathbf{V} = (v_{ij})$ be the corresponding $n \times n$ covariance matrix, so that $$E\left(X_{i}\right)=m_{i} \quad \text { and } \quad \operatorname{cov}\left(X_{i}, X_{j}\right)=v_{i j}, \quad i, j=1,2, \ldots, n$$ The W test statistic [@shapiro1965analysis] for normality is then given by $$\begin{array}{l}
W=\frac{\left(\sum_{i=1}^{n} a_{i} Y_{i}\right)^{2}}{\sum_{i=1}^{n}\left(Y_{i} -\bar{Y}\right)^{2}}=\frac{\left(\mathbf{a}^{\prime} \mathbf{Y}\right)^{2}}{S^{2}}\\
\mathbf{a}^{\prime}=\left(a_{1}, a_{2}, \ldots, a_{n}\right)=\mathbf{m}^{\prime} \mathbf{V}^{-1}\left(\mathbf{m}^{\prime} \mathbf{V}^{-1} \mathbf{V}^{-1} \mathbf{m}\right)^{-1 / 2}\\
\mathrm{S}^{2}=\sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}
\end{array}$$

[@shapiro1972approximate] suggested replacing the covariance matrix $\mathbf{V}$ by the identity matrix $\mathbf{I}$, because for large samples the observations $Y_i$ may be treated as if they were independent (see [@gupta1952estimation]). Another asymptotic extension was suggested by [@weisburg1975approximate]: $$E\left(X_{i}\right)=m_{i} \approx\Phi^{-1}\left(\frac{i-\frac{3}{8}}{n+\frac{1}{4}}\right) \quad i=1,2, \ldots, n$$ building atop [@elfving1947asymptotical]'s approximation but using $3/8$ instead of $\pi/8$.
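Combining the $\mathbf{V} = \mathbf{I}$ simplification with the $3/8$ scores above gives an approximate statistic that can be checked against SciPy's exact implementation; a minimal sketch (helper name ours):

```python
import numpy as np
from scipy.stats import norm, shapiro

def w_statistic_approx(y):
    """Approximate Shapiro-Wilk W using V = I and the (i - 3/8)/(n + 1/4) scores."""
    y = np.sort(y)
    n = y.size
    i = np.arange(1, n + 1)
    m = norm.ppf((i - 3.0 / 8.0) / (n + 0.25))   # approximate order-statistic means
    a = m / np.linalg.norm(m)                    # coefficients reduce to m / ||m|| when V = I
    return (a @ y) ** 2 / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
sample = rng.standard_normal(500)
w_approx = w_statistic_approx(sample)
w_exact, _ = shapiro(sample)                     # SciPy's Shapiro-Wilk for comparison
```

For Gaussian samples of this size, both statistics sit close to $1$ and agree to a few decimal places.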

[@rahman1997modification] proposed another variation using the approximation for the expected values of order statistics given by [@blom1958statistical] and the approximations for the elements of the variance-covariance matrix given by [@blom1958statistical; @mosteller2006some]. These approximations are $$E\left(X_{i}\right)=m_{i} \approx \Phi^{-1}\left(\frac{i}{n+1}\right), \quad i=1,2, \ldots, n$$ $$\operatorname{cov}\left(X_{i}, X_{j}\right)=v_{i j} \approx \frac{p_{i}\left(1-p_{j}\right)}{(n+2) f\left(m_{i}\right) f\left(m_{j}\right)}, \quad i \leq j, \quad i, j=1,2, \ldots, n$$ $$p_{i}=\frac{i}{n+1}$$

We know (see [@hammersley1954estimation; @plackett1958linear]) $$\begin{array}{l}
\mathbf{V}^{-1}=(n+1)(n+2) \\
\times\left(\begin{array}{cccccc}
2 \phi^{2}\left(m_{1}\right) & -\phi\left(m_{1}\right) \phi\left(m_{2}\right) & 0 & 0 & \ldots & 0 \\
-\phi\left(m_{1}\right) \phi\left(m_{2}\right) & 2 \phi^{2}\left(m_{2}\right) & -\phi\left(m_{2}\right) \phi\left(m_{3}\right) & 0 & \ldots & 0 \\
0 & -\phi\left(m_{2}\right) \phi\left(m_{3}\right) & 2 \phi^{2}\left(m_{3}\right) & -\phi\left(m_{3}\right) \phi\left(m_{4}\right) & \ldots & 0 \\
\vdots & & & & & \\
0 & 0 & 0 & 0 & \ldots & 2 \phi^{2}\left(m_{n}\right)
\end{array}\right)
\end{array}$$

Multivariate Statistics {#sec:multivariate_tests}
=======================

We would ideally like to compare the distributions directly. One slight variation is to compare the characteristic functions of the distributions. Given samples $\vx_1,\dots,\vx_N$, the Empirical Characteristic Function (ECF) is defined as $$\begin{aligned}
    \hat{\psi}_{N}(\vt) = \frac{1}{N}\sum_{n=1}^{N} e^{-i \vt^\top \vx_n}.\end{aligned}$$ We can now compare our ECF to that of the target distribution and build the statistic $$\begin{aligned}
    N\int |\hat{\psi}_{N}(\vt) - \psi_{0}(\vt) |^2\omega(\vt) d\vt=N\int |\hat{\psi}_{N}(\vt) - e^{-\|\vt\|_2^2/2} |^2\omega(\vt) d\vt,\end{aligned}$$ where $\psi_0$ is the characteristic function of the standard Gaussian. If the weighting function is given by $\omega(\vt) = (2\pi \beta ^2 )^{-d/2}e^{-\frac{\|\vt\|_2^2}{2\beta^2}}$ then the following simplification can be made $$\begin{aligned}
\mathrm{BHEP}_{n, \beta}= & \frac{1}{n} \sum_{j, k=1}^{n} \exp \left(-\frac{\beta^{2}\left\|Y_{n, j}-Y_{n, k}\right\|^{2}}{2}\right) \\
& -\frac{2}{\left(1+\beta^{2}\right)^{d / 2}} \sum_{j=1}^{n} \exp \left(-\frac{\beta^{2}\left\|Y_{n, j}\right\|^{2}}{2\left(1+\beta^{2}\right)}\right)+\frac{n}{\left(1+2 \beta^{2}\right)^{d / 2}} .\end{aligned}$$ with $\beta>0$; this is the Baringhaus-Henze-Epps-Pulley (BHEP) statistic. Following the bandwidth selection rule of [^1], the HZ test [^2] uses $$\beta_{n}=2^{-1 / 2}((2 d+1) n / 4)^{1 /(d+4)}$$
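The BHEP statistic reduces to pairwise-distance sums; a minimal NumPy sketch assuming the samples are already standardized (function names ours):

```python
import numpy as np

def hz_beta(n, d):
    """Henze-Zirkler's recommended bandwidth beta_n."""
    return 2 ** -0.5 * ((2 * d + 1) * n / 4) ** (1 / (d + 4))

def bhep_statistic(y, beta):
    """BHEP statistic for an (n, d) array of (assumed standardized) samples."""
    n, d = y.shape
    sq = np.sum(y ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2 * (y @ y.T)   # pairwise squared distances
    t1 = np.exp(-beta ** 2 * dist2 / 2).sum() / n
    t2 = 2 / (1 + beta ** 2) ** (d / 2) * np.exp(-beta ** 2 * sq / (2 * (1 + beta ** 2))).sum()
    t3 = n / (1 + 2 * beta ** 2) ** (d / 2)
    return t1 - t2 + t3

rng = np.random.default_rng(0)
y = rng.standard_normal((500, 3))                  # null: standard Gaussian samples
stat_null = bhep_statistic(y, hz_beta(500, 3))
stat_shift = bhep_statistic(y + 2.0, hz_beta(500, 3))   # far from N(0, I): larger statistic
```

Being a weighted squared ECF distance, the statistic is nonnegative and grows with departures from the standard Gaussian.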

The same can be done with the moment generating function [^3]: $$\begin{aligned}
T_{n, \beta}=\pi^{d / 2}\left(\frac{1}{n} \sum_{i, j=1}^{n} \frac{1}{\beta^{d / 2}} \exp \left(\frac{\left\|Y_{n, i}+Y_{n, j}\right\|^{2}}{4 \beta}\right)+\frac{n}{(\beta-1)^{d / 2}}\right. \\
\left.-2 \sum_{j=1}^{n} \frac{1}{(\beta-1 / 2)^{d / 2}} \exp \left(\frac{\left\|Y_{n, j}\right\|^{2}}{4 \beta-2}\right)\right),\end{aligned}$$ here with $\beta>2$.

There is also a statistic combining both [^4]: $$\begin{array}{l}
T_{n, \gamma}:=\int_{\mathbb{R}^{d}} U_{n}^{2}(t) w_{\gamma}(t) \mathrm{d} t\\
U_{n}(t):=\sqrt{n}\left(R_{n}(t) M_{n}(t)-1\right)
\end{array}$$ $$\begin{aligned}
T_{n, \gamma}= & \left(\frac{\pi}{\gamma}\right)^{d / 2}\left\{\frac { 1 } { 2 n ^ { 3 } } \sum _ { j , k , l , m = 1 } ^ { n } \left[\exp \left(\frac{\left\|Y_{j k}^{+}\right\|^{2}-\left\|Y_{\ell m}^{-}\right\|^{2}}{4 \gamma}\right) \cos \left(\frac{Y_{j k}^{+\top} Y_{\ell m}^{-}}{2 \gamma}\right)\right.\right. \\
+ & \left.\exp \left(\frac{\left\|Y_{j k}^{+}\right\|^{2}-\left\|Y_{\ell m}^{+}\right\|^{2}}{4 \gamma}\right) \cos \left(\frac{Y_{j k}^{+\top} Y_{\ell m}^{+}}{2 \gamma}\right)\right] \\
& \left.-\frac{2}{n} \sum_{j, k=1}^{n} \exp \left(\frac{\left\|Y_{n, j}\right\|^{2}-\left\|Y_{n, k}\right\|^{2}}{4 \gamma}\right) \cos \left(\frac{Y_{n, j}^{\top} Y_{n, k}}{2 \gamma}\right)+n\right\},
\end{aligned}$$ and its simplified version $$\widetilde{T}_{n, \gamma}:=\int_{\mathbb{R}^{d}} U_{n}(t) w_{\gamma}(t) \mathrm{d} t .$$ $$\widetilde{T}_{n, \gamma}=\left(\frac{\pi}{\gamma}\right)^{d / 2} \sqrt{n}\left(\frac{1}{n^{2}} \sum_{j, k=1}^{n} \exp \left(\frac{\left\|Y_{n, j}\right\|^{2}-\left\|Y_{n, k}\right\|^{2}}{4 \gamma}\right) \cos \left(\frac{Y_{n, j}^{\top} Y_{n, k}}{2 \gamma}\right)-1\right)$$

There is also a statistic based on the derivative of the moment generating function [^5]:

$$\mathrm{HV}_{n, \gamma}:=n \int\left\|\nabla M_{n}(t)-t M_{n}(t)\right\|^{2} \widetilde{w}_{\gamma}(t) \mathrm{d} t$$ $$\mathrm{HV}_{n, \gamma}=\frac{1}{n}\left(\frac{\pi}{\gamma}\right)^{d / 2} \sum_{j, k=1}^{n} \exp \left(\frac{\left\|Y_{n, j, k}^{+}\right\|^{2}}{4 \gamma}\right)
\left(Y_{n, j}^{\top} Y_{n, k}-\frac{\left\|Y_{n, j, k}^{+}\right\|^{2}}{2 \gamma}+\frac{d}{2 \gamma}+\frac{\left\|Y_{n, j, k}^{+}\right\|^{2}}{4 \gamma^{2}}\right) .$$

Mardia's skewness [^6]: $$b_{1, d}=\frac{1}{n^{2}} \sum_{j, k=1}^{n}\left(Y_{n, j}^{\top} Y_{n, k}\right)^{3}$$ and an alternative skewness measure [^7]: $$\widetilde{b}_{1, d}=\frac{1}{n^{2}} \sum_{j, k=1}^{n} Y_{n, j}^{\top} Y_{n, k}\left\|Y_{n, j}\right\|^{2}\left\|Y_{n, k}\right\|^{2}$$ both of which should be $0$ for a Gaussian, and the kurtosis, which should be $d(d+2)$: $$b_{2, d}=\frac{1}{n} \sum_{j=1}^{n}\left\|Y_{n, j}\right\|^{4}$$
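These moment statistics take only a few lines of NumPy; a sketch assuming the $Y_{n,j}$ are already standardized (function name ours):

```python
import numpy as np

def mardia_statistics(y):
    """Mardia's skewness b_{1,d} and kurtosis b_{2,d} for (n, d) standardized samples."""
    n = y.shape[0]
    g = y @ y.T                        # Gram matrix of inner products Y_j^T Y_k
    b1 = (g ** 3).sum() / n ** 2       # sum of squared empirical third moments, -> 0 for Gaussian
    b2 = (np.diag(g) ** 2).sum() / n   # mean of ||Y_j||^4, -> d(d + 2) for Gaussian
    return b1, b2

rng = np.random.default_rng(0)
d = 3
b1, b2 = mardia_statistics(rng.standard_normal((4000, d)))
# For d = 3, b2 should be close to d(d + 2) = 15 at this sample size.
```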

[^1]: <https://www.routledge.com/Density-Estimation-for-Statistics-and-Data-Analysis/Silverman/p/book/9780412246203?srsltid=AfmBOoodlL-CtlqL0JVC-LcP6mOWw6VTt51_YstdZOW4W3iuicu1VFyg>

[^2]: <https://www.tandfonline.com/doi/abs/10.1080/03610929008830400>

[^3]: <https://arxiv.org/pdf/1711.07199>

[^4]: <https://arxiv.org/pdf/1706.03029>

[^5]: <https://arxiv.org/pdf/1901.03986>

[^6]: <https://www.jstor.org/stable/2334770>

[^7]: <https://link.springer.com/article/10.1007/s13171-020-00211-6>
