---
bibliography:
- SciPost_Example_BiBTeX_File.bib
---

`\Large `{=latex} **Introduction to Latent Variable Energy-Based Models:\
A Path Towards Autonomous Machine Intelligence\
**

Anna Dawid^1,2^ and Yann LeCun^3,4$\star$^

**1** ICFO - Institut de Ciències Fotòniques, The Barcelona Institute of Science and Technology, Av. Carl Friedrich Gauss 3, 08860 Castelldefels (Barcelona), Spain\
**2** Faculty of Physics, University of Warsaw, Pasteura 5, 02-093 Warsaw, Poland\
**3** Courant Institute of Mathematical Sciences, New York University\
**4** Meta - Fundamental AI Research\
${}^\star$ `\small `{=latex}`\sf `{=latex}yann\@cs.nyu.edu

```{=latex}
\today
```
Abstract {#abstract .unnumbered}
========

**Current automated systems have crucial limitations that need to be addressed before artificial intelligence can reach human-like levels and bring about new technological revolutions. Among others, our societies still lack Level 5 self-driving cars, domestic robots, and virtual assistants that learn reliable world models, reason, and plan complex action sequences. In these notes, we summarize the main ideas behind the architecture for autonomous intelligence of the future proposed by Yann LeCun. In particular, we introduce energy-based and latent variable models and combine their advantages in the building block of LeCun's proposal, that is, in the hierarchical joint embedding predictive architecture (H-JEPA).**

```{=latex}
\vspace{10pt}
```
`\noindent`{=latex}

------------------------------------------------------------------------

```{=latex}
\setcounter{tocdepth}{2}
```
```{=latex}
\tableofcontents
```
```{=latex}
\thispagestyle{fancy}
```
`\noindent`{=latex}

------------------------------------------------------------------------

```{=latex}
\vspace{10pt}
```
Introduction {#sec:intro}
============

Over the past decade, [ML]{acronym-label="ML" acronym-form="singular+short"} methods have exploded in popularity and played a crucial role in many substantial technological advancements. We have witnessed [ML]{acronym-label="ML" acronym-form="singular+short"} models that achieve expert performance in strategic games like Go, Chess, and Shogi [@silver2017chess; @Silver2016AlphaGo], that can solve challenging physical simulation problems like protein folding with groundbreaking accuracy [@Senior2020AlphaFold], and that translate text between over 200 languages [@costa-jussa2022NLLB]. These new technologies all have one technique in common: [DL]{acronym-label="DL" acronym-form="singular+short"} or the use of deep [NNs]{acronym-label="NN" acronym-form="plural+short"} that may have hundreds or thousands of layers. Perhaps surprisingly, training deep [NNs]{acronym-label="NN" acronym-form="plural+short"} on extensive datasets has seemingly become a straightforward recipe to achieve human-like performance on generic computational tasks, even demonstrating what may seem to be intelligent and creative problem-solving!

This groundbreaking performance comes at a cost: creating [DL]{acronym-label="DL" acronym-form="singular+short"} models can require training on massive datasets, which is an extreme computational expense. Human learning, in contrast, is often highly efficient. With only a few examples, we can quickly find an intuitive way to complete a task and easily generalize what we have learned to other tasks. For instance, babies quickly acquire an intuitive understanding of physics, which allows them to predict the outcomes of their motions and adjust them accordingly. Conversely, robots trained through [RL]{acronym-label="RL" acronym-form="singular+short"} struggle with this task, requiring years of simulated interactions with the world to achieve intuitive motion [@Silver2016AlphaGo; @silver2017chess].

In these lecture notes, we explore the concept of `\stress{autonomous intelligence}`{=latex}, which can learn efficiently and automatically to predict the state of the world, much like human learners. Eventually, we hope to achieve a `\stress{fully autonomous AI}`{=latex}, which can perform well on generic tasks by transferring its knowledge and automatically adapting to new situations without first trying out many solutions. The content of this paper follows a series of lectures given by Yann LeCun in July 2022 as part of the Summer School on Statistical Physics and Machine Learning [@SummerSchool] at the École de Physique des Houches, organized by Florent Krzakala and Lenka Zdeborová. We aim to explain the limitations of current [ML]{acronym-label="ML" acronym-form="singular+short"} approaches and to introduce the central concepts needed to understand the possibly autonomous [AI]{acronym-label="AI" acronym-form="singular+short"} of the future that Yann LeCun proposed in his 2022 paper \`\`A Path Towards Autonomous Machine Intelligence" [@LeCunnPathTowardsAI], as well as the main ideas behind its design.

```{=latex}
\input{lecture1}
```
```{=latex}
\input{lecture2}
```
```{=latex}
\input{lecture3}
```
Conclusion {#sec:conclusions}
==========

`\Acf{DL}`{=latex} and, more broadly, [AI]{acronym-label="AI" acronym-form="singular+full"} have undoubtedly revolutionized industry in the past years and have started reshaping science as well. However, before [AI]{acronym-label="AI" acronym-form="singular+short"} gets a chance to take our civilization to the next level with the Level 5 self-driving cars, virtual assistants, and domestic robots that we know from science fiction, it needs to be freed from its current limitations. For example, [SL]{acronym-label="SL" acronym-form="singular+full"} and [RL]{acronym-label="RL" acronym-form="singular+full"}, which dominate modern real-world applications, are highly inefficient compared to human learning: they require an enormous number of either labeled samples or trials. More importantly, current automated systems still lack capabilities crucial for future [AI]{acronym-label="AI" acronym-form="singular+short"} systems, such as a basic understanding of the world and of humans that may be called \`\`common sense", which we understand here as the ability to use models of the world to fill in information that is unavailable from perception or memory (e.g., to predict the future).

Here we summarized the main ideas of LeCun from Ref. [@LeCunnPathTowardsAI] that address those limitations. In section `\ref{sec:EBMs}`{=latex}, we explained that, because real-world data (such as video or text) is usually high-dimensional, [EBMs]{acronym-label="EBM" acronym-form="plural+full"} may be a more promising approach to future human-like intelligence than probabilistic models, which cease to be tractable in continuous high-dimensional domains. In section `\ref{sec:train-EBMs}`{=latex}, we introduced contrastive and regularized methods for training [EBMs]{acronym-label="EBM" acronym-form="plural+short"} and explained that, owing to the large cost of generating contrastive samples in high dimensions, regularized methods seem more promising for training the [EBMs]{acronym-label="EBM" acronym-form="plural+short"} of the future. We gave examples of [EBMs]{acronym-label="EBM" acronym-form="plural+short"} of historical and practical relevance in section `\ref{sec:examples-EBMs}`{=latex}.
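In symbols, the obstacle is the normalization: if $E(x, y)$ denotes an energy function and $\beta$ an inverse temperature, turning energies into a normalized conditional probability requires a Gibbs distribution,

$$p(y \mid x) = \frac{e^{-\beta E(x, y)}}{\int e^{-\beta E(x, y')} \, dy'},$$

and it is the integral over all possible $y'$ in the denominator that becomes intractable in continuous high-dimensional domains. [EBMs]{acronym-label="EBM" acronym-form="plural+short"} sidestep this computation by working with the unnormalized energies directly.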

Finally, section `\ref{sec:JEPAandHJEPA}`{=latex} focused on the fact that a human-like decision process relies on data of various formats and representations, whose structure often needs to be decoded to make a prediction and which may contain information that is redundant for the task at hand. Such multimodality can be addressed, on three levels, with a new architecture proposed by LeCun in Ref. [@LeCunnPathTowardsAI]: [JEPAs]{acronym-label="JEPA" acronym-form="plural+full"}. Firstly, [JEPAs]{acronym-label="JEPA" acronym-form="plural+short"} are trained to capture the dependencies between two input objects, which may have different formats (e.g., video and audio). Secondly, [JEPAs]{acronym-label="JEPA" acronym-form="plural+short"} make predictions in representation space, which allows the encoders to discard data features irrelevant to the task at hand. Thirdly, the latent variables of [JEPAs]{acronym-label="JEPA" acronym-form="plural+short"} can encode additional features not readily present in the input data, allowing the model to handle uncertainty in the perceived data.
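To make this concrete, here is a minimal numerical sketch of a JEPA-style energy: the random linear maps below are placeholders standing in for trained deep encoders and a predictor (illustrative assumptions of ours, not the architecture of Ref. [@LeCunnPathTowardsAI]). The energy is a distance in representation space, and the latent variable $z$ is minimized over at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and random linear maps standing in for trained networks.
d_x, d_y, d_s, d_z = 8, 6, 4, 2          # inputs x and y, representation s, latent z
W_x = rng.normal(size=(d_x, d_s))        # encoder for x (e.g., a past observation)
W_y = rng.normal(size=(d_y, d_s))        # encoder for y (e.g., a future observation)
W_p = rng.normal(size=(d_s + d_z, d_s))  # predictor acting on (s_x, z)

def energy(x, y, z):
    """JEPA-style energy: distance between the predicted and actual representation of y."""
    s_x = np.tanh(x @ W_x)               # representation of x
    s_y = np.tanh(y @ W_y)               # representation of y
    s_pred = np.tanh(np.concatenate([s_x, z]) @ W_p)  # prediction of s_y from (s_x, z)
    return np.sum((s_pred - s_y) ** 2)

# Inference: minimize over the latent z (here by crude random search) to obtain
# F(x, y) = min_z E(x, y, z), which absorbs the uncertainty about y into z.
x, y = rng.normal(size=d_x), rng.normal(size=d_y)
zs = [rng.normal(size=d_z) for _ in range(100)]
F = min(energy(x, y, z) for z in zs)
```

In a trained model, the minimization over $z$ would be done with gradient descent rather than random search, and the encoders would be regularized so that the energy does not collapse.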

The final challenge is to allow the autonomous [AI]{acronym-label="AI" acronym-form="singular+short"} of the future to predict the state of the world on various time scales and levels of abstraction. Such multi-level predictions may be achieved with a [H-JEPA]{acronym-label="H-JEPA" acronym-form="singular+full"}. Its architecture is simply a series of stacked [JEPAs]{acronym-label="JEPA" acronym-form="plural+short"}: lower-level [JEPAs]{acronym-label="JEPA" acronym-form="plural+short"} encode the data and feed their representations to higher-level [JEPAs]{acronym-label="JEPA" acronym-form="plural+short"}, creating multi-level representations. This architecture, trained with regularized methods, may be a starting point for designing predictive world models capable of hierarchical planning under uncertainty, which would constitute a breakthrough in the development of an autonomous [AI]{acronym-label="AI" acronym-form="singular+short"} of the future.
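The stacking itself can be sketched in a few lines. The random encoders and the simple pairwise pooling in time below are illustrative assumptions of ours, not the design of Ref. [@LeCunnPathTowardsAI]; the point is only that each level re-encodes the representations of the level below it, so higher levels operate on coarser time scales and higher levels of abstraction.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_encoder(d_in, d_out):
    """Random linear map followed by tanh, standing in for a trained encoder."""
    W = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)
    return lambda s: np.tanh(s @ W)

# Two stacked levels: level 1 encodes raw observations at every time step;
# level 2 encodes pairs of consecutive level-1 representations, i.e., it
# sees the sequence at half the temporal resolution.
enc1 = make_encoder(16, 8)
enc2 = make_encoder(2 * 8, 4)

obs = rng.normal(size=(6, 16))                 # a short sequence of raw observations
s1 = np.stack([enc1(o) for o in obs])          # level-1 representations: 6 steps
s2 = np.stack([enc2(np.concatenate(pair))      # level-2 representations: 3 steps
               for pair in s1.reshape(3, 2, 8)])
```

Each level of a trained H-JEPA would additionally host its own predictor and latent variables, so that planning can be performed at whichever level of abstraction suits the task.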

Acknowledgements {#acknowledgements .unnumbered}
================

The content of these lecture notes follows a series of lectures given by Yann LeCun in July 2022 as part of the Summer School on Statistical Physics and Machine Learning at the École de Physique des Houches, organized by Florent Krzakala and Lenka Zdeborová. We thank Alfredo Canziani, Lucas Clarte, and Max Daniels for helpful discussions.

#### Funding information

A.D. acknowledges financial support from the Foundation for Polish Science and the National Science Centre, Poland, within the Etiuda grant No. 2020/36/T/ST2/00588. The ICFO group acknowledges support from: ERC AdG NOQIA; Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación (PGC2018-097027-B-I00/10.13039/5011-00011033, CEX2019-000910-/10.13039/501100011033, Plan Nacional FIDEUA PID2019-`\-`{=latex}106901GB-I00, FPI, QUANTERA MAQS PCI2019-111828-2, QUANTERA DYNAMITE PCI2022-132919, Proyectos de I+D+I \`\`Retos Colaboración" QUSPIN RTC2019-007196-7); MICIIN with funding from European Union NextGenerationEU (PRTR-C17.I1) and by Generalitat de Ca`\-`{=latex}ta`\-`{=latex}lun`\-`{=latex}ya; Fundació Cellex; Fundació Mir-Puig; Generalitat de Ca`\-`{=latex}ta`\-`{=latex}lun`\-`{=latex}ya (European Social Fund FEDER and CERCA program, AGAUR Grant No. 2021 SGR 01452, QuantumCAT / U16-011424, co-funded by ERDF Operational Program of Catalonia 2014-2020); Barcelona Supercomputing Center Mare`\-`{=latex}Nos`\-`{=latex}trum (FI-2022-1-0042); EU (PASQuanS2.1, 101113690); EU Horizon 2020 FET-OPEN OPTOlogic (Grant No. 899794); EU Horizon Europe Program (Grant Agreement 101080086 --- NeQST); National Science Centre, Poland (Symfonia Grant No. 2016/20/W/`\-`{=latex}ST4/00314); ICFO Internal \`\`QuantumGaudi" project; European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreements No. 101029393 (STREDCH) and No. 847648 (\`\`La Caixa" Junior Leaders fellowships ID100010434: LCF/BQ/PI19/11690013, LCF/BQ/PI20/11760031, LCF/BQ/PR20/11770012, LCF/BQ/PR21/11840013). Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union, the European Commission, the European Climate, Infrastructure and Environment Executive Agency (CINEA), or any other granting authority. Neither the European Union nor any granting authority can be held responsible for them.

List of acronyms {#sec:acronyms .unnumbered}
================

```{=latex}
\addcontentsline{toc}{section}{List of acronyms}
```
```{=latex}
\sectionmark{LIST OF ACRONYMS}
```
```{=latex}
\input{acronyms.tex}
```
```{=latex}
\begin{appendix}
\renewcommand\thefigure{\thesection\arabic{figure}}    
\setcounter{figure}{0}
\renewcommand\theequation{\thesection\arabic{equation}}    
\setcounter{equation}{0}
\input{appendixA}

\end{appendix}
```
```{=latex}
\nolinenumbers
```
