---
abstract: |
  `\label{sec:abstract}`{=latex}

  Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performances. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question *\`\`Is it possible to **generalize** these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?*" Interestingly, we find the answer is **yes**, by learning to explicitly **verify** the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train **Energy-Based Transformers (EBTs)**---a new class of Energy-Based Models (EBMs)---to assign an **energy** (unnormalized probability) value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. This formulation enables System 2 Thinking to emerge from unsupervised learning, making it modality and problem agnostic. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving an up to 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking (i.e., extra computation) by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that System 2 Thinking with EBTs yields larger performance improvements on data that is farther out-of-distribution, and that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the **learning** and **thinking** capabilities of models.
author:
- |
  **Alexi Gladstone**^1,2^, **Ganesh Nanduru**^1^, **Md Mofijul Islam**^1,3^, **Peixuan Han**^2^,\
  **Hyeonjeong Ha**^2^, **Aman Chadha**^3,4^, **Yilun Du**^5^, **Heng Ji**^3^, **Jundong Li**^1^, **Tariq Iqbal**^1^\
  ^1^UVA `\quad`{=latex} ^2^UIUC `\quad`{=latex} ^3^Amazon GenAI^$\dagger$^ `\quad`{=latex} ^4^Stanford University `\quad`{=latex} ^5^Harvard University\
  **[`\faGlobe`{=latex}`\enspace{energy-based-transformers.github.io}`{=latex}](https://energy-based-transformers.github.io) `\quad`{=latex} [`\faGithub`{=latex}`\enspace{github.com/alexiglad/EBT}`{=latex}](https://github.com/alexiglad/ebt)**
bibliography:
- references.bib
title: 'Energy-Based Transformers are Scalable Learners and Thinkers'
---

```{=latex}
\pdfoutput=1
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\newcommand{\pa}{EBT}
```
```{=latex}
\newcommand{\parch}{EBT}
```
```{=latex}
\renewcommand{\thefacet}{\arabic{facet}}
```
```{=latex}
\newcommand{\facet}[2][]{%
  \refstepcounter{facet}%               % increment and make labelable
  \textbf{Facet \thefacet: #2.}%     % print unnumbered paragraph heading
  \ifx&#1&\else\label{#1}\fi            % optional label
}
```
```{=latex}
\newcommand{\cmark}{\textcolor{green}{\ding{51}}}
```
```{=latex}
\newcommand{\xmark}{\textcolor{red}{\ding{55}}}
```
```{=latex}
\newcommand\ac[1]{\todo[author=AC,color=blue!40]{#1}}
```
```{=latex}
\newcommand\acil[1]{\todo[author=AC,color=blue!40,inline]{#1}}
```
```{=latex}
\newcommand\acb[1]{\textcolor{blue}{#1}}
```
```{=latex}
\newcommand{\blfootnote}[1]{%
  \begingroup
    \renewcommand\thefootnote{}% empty marker
    \footnote{#1}%
    \addtocounter{footnote}{-1}% keep the counter unchanged
  \endgroup}
```
```{=latex}
\maketitle
```
```{=latex}
\blfootnote{Correspondence to Alexi Gladstone:  \href{mailto:alexigladstone@gmail.com}{\faEnvelope\enspace{alexigladstone@gmail.com}}. \textsuperscript{$\dagger$}Work does not relate to position at Amazon.}
```
Introduction {#sec:intro}
============

In psychology, human thinking is often classified into two different types: System 1 (thinking fast) and System 2 (thinking slow) [@kahneman2011thinking; @evans2011dual; @kahneman2002representativeness; @frankish2010dual]. System 1 thinking is characterized by quick, intuitive and automatic responses, relying on previous experience to solve simple or familiar problems. Alternatively, System 2 Thinking is slow, deliberate and analytical, requiring conscious effort and logical reasoning to process more complex information. System 2 Thinking is essential for complex problems that go beyond automatic pattern recognition, such as in mathematics, programming, multistep reasoning, or novel out-of-distribution situations [@neys2006dual; @goel2000dissociation], where precision and depth of understanding are important. Although current models perform well on tasks suitable for System 1 thinking [@li2025system], they continue to struggle with tasks that demand System 2 capabilities [@mirzadeh2024gsm; @yan2025phd; @EscapeBench2025].

Consequently, the recent pursuance of System 2 Thinking capabilities has become a growing interest among AI researchers, leading to the rise of several foundation models such as O1 [@jaech2024openai], R1 [@guo2025deepseek], Grok3 [@xai2025grok3], and Claude 3.7 Sonnet [@anthropic2025claude37sonnet]. These \`\`reasoning models" excel on math and coding benchmarks, particularly by increasing the time models spend thinking. However, publicly available information on training methods, particularly from the open-source R1 model [@guo2025deepseek], suggests that the Reinforcement Learning (RL) based training approach for these models only works in domains where rule-based rewards can easily verify answers, such as math and coding. This limitation reduces applicability to a small range of problem types, and as a consequence, often deteriorates performance in other tasks such as writing [@openai2024learning; @su2025expanding; @illusionofthinking]. Further, recent evidence suggests this RL-based approach may not induce new reasoning patterns, but rather just increase the probability of reasoning patterns already known to the base model [@yue2025does], which limits performance on problems requiring exploration.

Along similar lines, there has been strong interest in achieving System 2 Thinking in both diffusion models and Recurrent Neural Networks (RNNs). Diffusion models support iterative inference through denoising steps, where increasing denoising steps can improve performance. However, they typically fail to benefit from denoising steps beyond what they were trained on [@ma2025inference], and aside from increasing denoising steps, diffusion models require an external verifier to improve System 2 Thinking capabilities [@ma2025inference; @liu2025video; @singhal2025general]. RNNs also offer iterative computation via recurrent state updates [@gu2023mamba; @peng2023rwkv; @hochreiter1997long], however, most modern RNNs only update their internal state with new information, meaning they cannot be used for thinking longer. Additionally, RNNs that do support recurrent depth still lack mechanisms for explicit verification [@geiping2025scaling], resulting in limited adoption for System 2 Thinking.

![AR Transformer](images/transformer_arch.png){#fig:transformer_arch height="3.5cm"}

`\label{subfig:ar_transformer}`{=latex}

```{=latex}
\hfill
```
![RNN](images/rnn_arch.png){#subfig:rnn_arch height="3.5cm"}

```{=latex}
\hfill
```
![Diffusion Transformer](images/diffusion_arch.png){#fig:diffusion_arch height="3.5cm"}

`\label{subfig:diffusion_arch}`{=latex}

```{=latex}
\hfill
```
![`\pa`{=latex}](images/ebwm_arch.png){#fig:ebwm_arch height="3.5cm"}

```{=latex}
\begin{flushleft}
      
      
        
        \rule{\textwidth}{0.4pt}\\[-3pt]
        Existing Autoregressive Approaches
      
    \end{flushleft}
```
As one of the primary goals of AI is to figure out how we can create systems that **learn to think on their own** on any type of problem, these approaches ultimately bring about the following core research question: \`\`*Can we **rely entirely** on **unsupervised learning** to develop **System 2 Thinking?***" Such a capability would enable *generalization* of current System 2 Thinking approaches to *any problem*, *any modality*, and *avoid the reliance on external human, reward, or model supervision*.

In this work, we argue and demonstrate empirically that the answer to this core research question is **yes**, but that there are several limitations in existing model architectures that prevent this type of unsupervised System 2 Thinking from emerging. Particularly, when comparing the qualities of human System 2 Thinking with the current modeling paradigms (Figure `\ref{fig:model_comparison}`{=latex}, Table `\ref{tab:cognitive_facets}`{=latex}), we observe several key differences, outlined below as three key Facets of System 2 Thinking[^1]:\

```{=latex}
\begin{table*}[t]

\caption{\textbf{Architectures and Cognitive Facets.} For each prediction, Feed Forward (FF) Transformers and RNNs generally\protect\footnotemark~have a finite amount of computation. While diffusion models have potentially more computation during inference by increasing the number of denoising steps, they do not learn to explicitly verify or estimate uncertainty for predictions. EBMs can use a dynamic amount of computation during inference by iterating for any number of steps, and give an energy scalar that can be used to evaluate uncertainty and verify the strength of predictions.}
\small
    \begin{tabular}{lccc}
            \toprule
            \textbf{Architecture} & \makecell{\textbf{Dynamic Compute} \\ \textbf{Allocation (Facet~\ref{facet:dynamic-computation})}} & \makecell{\textbf{Modeling} \\ \textbf{Uncertainty (Facet~\ref{facet:prediction-uncertainty})}} & \makecell{\textbf{Prediction} \\ \textbf{Verification (Facet~\ref{facet:prediction-verification})}} \\
            & & & \\
            \toprule
            FF Transformers & \xmark & \xmark & \xmark \\
            RNNs & \xmark & \xmark & \xmark \\
            Diffusion Transformers & \cmark & \xmark & \xmark \\
            {\pa}s & \cmark & \cmark & \cmark \\
            \bottomrule
        \end{tabular}
 \label{tab:cognitive_facets}
 
\end{table*}
```
```{=latex}
\footnotetext{Recent works attempt to enable dynamic computation per prediction~\cite{hao2024training, geiping2025scaling, goyal2023think}, but these approaches are generally not modality agnostic and have not been widely adopted. While RNNs can theoretically support dynamic computation, they are typically updated only with new state information, except in specialized cases such as in~\cite{geiping2025scaling}.}
```
```{=latex}
\facet[facet:dynamic-computation]{Dynamic Allocation of Computation}
```
Humans naturally allocate varying amounts of effort to different tasks depending on difficulty, which is widely supported by psychology and neuroscience [@kahneman2011thinking; @ditterich2006evidence; @rougier2005prefrontal]. As the difficulty of tasks humans face varies widely, the ability to adjust the magnitude of computational resources allocated towards a task is fundamental to success.[^2] For example, a decision regarding whether to change careers generally takes people much more time than deciding what to eat for lunch.

```{=latex}
\facet[facet:prediction-uncertainty]{Modeling Uncertainty in Continuous State Spaces}
```
While thinking longer is important for improving performance, humans also weigh how uncertain they are before committing to a decision. In language, LLMs can simulate this through token-level probabilities [@tomani2024uncertainty]. In the context of continuous state spaces, such as in vision, without the usage of discretization schemes such as Vector Quantization [@van2017neural] or pseudo losses/objectives (such as ELBO [@kingma2013auto]), standard implementations of the most successfully used approaches with Transformers, RNNs, or Diffusion models generally do not provide strong or reliable uncertainty estimates [@sankararaman2022bayesformer; @heng2024out; @nalisnick2018deep; @serra2019input].[^3] EBMs can naturally model uncertainty without having to model exact likelihoods by modeling the relative unnormalized likelihoods of predictions [@dawid2024introduction], as demonstrated in Figure `\ref{fig:energy_landscape_minimization}`{=latex}. As the real world often contains many inherently unpredictable elements, for instance, when a pedestrian might emerge from behind a parked vehicle, the ability to express uncertainty in predictions is essential to being cautious, and is a natural capability of humans [@Peters2017Uncertainty; @Vilares2012Differential; @Sarinopoulos2010Uncertainty].

```{=latex}
\facet[facet:prediction-verification]{Verification of Predictions}
```
In addition to allocating computation and modeling uncertainty, effective thinking also benefits from the ability to verify predictions, which can guide decisions about when to stop thinking or to select the most accurate predictions. There is strong evidence that such verification is a core component of human thinking [@loesche2018paving; @alkouri2016using]. In fact, it can be shown that verifying solutions is exponentially easier than generating solutions [@du2022learning], which means that verifiers can often generalize better than explicit generators [@du2022learning]. Training an explicit verifier means that at each step of the thinking process an estimate of the current prediction's quality can be extracted. This supports more dynamic inference time behavior such as early stopping when a prediction is known to be correct or allocating more compute when a problem is difficult, which can intuitively be seen as adjusting resources according to the difficulty of a problem [@kahneman2011thinking]. Training an explicit verifier also allows for capabilities such as Monte Carlo Tree Search [@silver2017mastering] or sampling many times and choosing the best prediction, which traditionally have involved training new models on more data [@lightman2023let; @ma2025inference].

For more information on additional Facets, please refer to Section `\ref{sec:cognitive_facet_details}`{=latex}.

To achieve all facets described, we propose viewing thinking as an optimization procedure with respect to a learned verifier, which evaluates the compatibility (unnormalized probability) between an input and candidate prediction (Figure `\ref{fig:model_architecture}`{=latex}). More concretely, we train Energy-Based Models (EBMs) to learn an energy (unnormalized probability) landscape over all possible input-prediction pairs, where lower energy indicates higher compatibility. Then, thinking corresponds to starting from an initial random prediction, and progressively refining it through energy minimization until convergence (visualized in Figure `\ref{fig:energy_landscape_minimization}`{=latex}). Optimizing over a learned energy landscape naturally allows for dynamic compute allocation (Facet `\ref{facet:dynamic-computation}`{=latex}), allowing models to think for longer on harder problems by iteratively performing more steps. Additionally, at each step of this thinking process, EBMs act as verifiers in the forward pass, giving an energy scalar which represents the *compatibility* of the context with the prediction. This energy scalar directly addresses Facets `\ref{facet:prediction-uncertainty}`{=latex} and `\ref{facet:prediction-verification}`{=latex} by serving as an unnormalized likelihood as well as the score that verifies whether a prediction is correct (Figure `\ref{fig:energy_landscape_minimization}`{=latex}).

![**`\pa`{=latex} for Autoregressive Modeling.** Each blue box corresponds to a different prediction based on the current step of the thinking process, where the initial prediction starts as random. At each step, a new prediction is fed into the model, which gives an energy scalar for the prediction's current *compatibility* (unnormalized likelihood) with the context (Facets `\ref{facet:prediction-uncertainty}`{=latex} and `\ref{facet:prediction-verification}`{=latex}). Then, the gradient of this energy with respect to the prediction is calculated and used to update the prediction. This gradient descent update is done iteratively to refine the prediction until convergence of the predicted energy, which allows for dynamic use of computation (Facet `\ref{facet:dynamic-computation}`{=latex}).](images/proposed_model.png){#fig:model_architecture width="\\textwidth"}

While viewing thinking through the lens of inference-time energy minimization is a promising perspective, EBMs have traditionally struggled with scalability [@du2019implicit], with zero publicly known Foundation EBMs. This can be attributed to issues with EBM training stability [@du2019implicit; @du2020improved; @li2023learning; @arbel2020generalized] and long training times [@li2023learning; @arbel2020generalized]. To address these challenges and establish a scalable EBM training paradigm, we introduce Energy-Based Transformers (EBTs), Transformer implementations specifically for EBMs. We release two variants: a decoder-only EBT inspired by the GPT architecture [@radford2018improving], which parallelizes all predictions simultaneously; and a bidirectional EBT, enabling bidirectional attention across entire sequences, similar to BERT [@devlin2019bert] and Diffusion Transformers (DiT) [@peebles2023scalable].

Additionally, we identify practical enhancements that improve the training efficiency of EBTs and provide theoretical insights into why our optimization-based training approach achieves strong scalability. Finally, we introduce *energy landscape regularization* techniques, which are methods that enhance the convexity and smoothness of learned energy landscapes, thereby fostering the emergence of strong System 2 Thinking capabilities during pretraining.

To investigate the learning and thinking scalability of EBTs, we compare EBTs to the Transformer++ for autoregressive modeling and the Diffusion Transformer (DiT) for bidirectional modeling, across both discrete and continuous modalities. During pretraining, we find that EBTs achieve an up to $35\%$ higher scaling rate than the Transformer++ across several axes, including data, batch size, parameters, FLOPs, and depth. Notably, EBTs are the first approach to out-scale the standard feed-forward Transformer++ recipe across several modalities and axes, including improved data efficiency. At inference, EBTs outperform existing paradigms on System 2 Thinking, or solving more challenging problems with additional computation. For example, EBTs can improve language modeling performance $29\%$ more than the Transformer++, and for image denoising EBTs exhibit a higher performance-compute scaling rate and improved performance over DiTs with $99\%$ fewer forward passes. In investigations of the effect of System 2 Thinking on generalization, we consistently observe two phenomena. First, with the same or worse pretraining performance, `\pa`{=latex}s still outperform existing models at inference by performing System 2 Thinking, demonstrating its importance and how EBTs often generalize better than existing models. Second, System 2 Thinking with EBTs yields more substantial performance gains on data that is further out-of-distribution, aligning with observations in psychology where humans use System 2 Thinking on challenging, unseen tasks.

We believe the EBT implementations, along with novel techniques for EBMs to maximize the learning and thinking scalability, will advance the EBM paradigm by addressing key challenges in stable, parallelizable, and efficient training.

Energy-Based Transformers (`\pa`{=latex}) Intuition {#sec:intuition}
===================================================

Energy-Based Models (EBMs) learn an energy function that assigns a scalar value to each input configuration, with lower energy indicating higher compatibility or likelihood between input variables [@du2019implicit], and high energy indicating lower compatibility or likelihood. Accordingly, the energy function acts as a **verifier** of input data coherence. Leveraging this principle, Energy-Based Transformers (`\pa`{=latex}s) are trained to **verify** the compatibility of a given context-prediction pair (give an energy value), and then make predictions by optimizing with respect to this verifier (minimizing the energy), which is shown in Figure `\ref{fig:model_architecture}`{=latex}. Starting from an initial prediction, such as random noise, `\pa`{=latex}s iteratively refine their output by progressively minimizing the energy, ultimately converging to predictions that are consistent with the given context. Performing energy minimization through this process simulates the thinking process during pretraining, unlike with traditional models, enabling each prediction (e.g., a token for LLMs) to have its own thinking process.

![**Thinking Process Visualization.** A learned energy landscape and its optimization through gradient descent, interpreted as a thinking process. In this example, the model predicts a distribution over text tokens, progressively shifting from an initial random distribution toward the target distribution. At each step, the EBM assigns an energy scalar indicating how **compatible** the current prediction is with the context, visualized as the landscape's height (Facet `\ref{facet:prediction-verification}`{=latex}). This scalar's convergence allows the model to determine whether the prediction is adequate or if further thinking is necessary. Uncertainty (Facet `\ref{facet:prediction-uncertainty}`{=latex}) can be represented by landscapes that are harder to optimize or by landscapes with many local minima, allowing the model to know when it requires more steps to think (Facet `\ref{facet:dynamic-computation}`{=latex}). Adapted from [@li2018visualizing].](images/energy_landscape_minimization.png){#fig:energy_landscape_minimization width=".75\\textwidth"}

Learning to Verify
------------------

Verifying solutions is often substantially more tractable than generating them, a distinction well-established in complexity theory and foundational to developments in knowledge proofs and learning algorithms [@cook2023complexity; @goldwasser2019knowledge; @godel1956letter]. For example, in solving a maze, verifying the correctness of a given path is significantly easier than discovering such a path. This asymmetry has been recognized and utilized for several decades, notably in the field of cryptography [@goldwasser2019knowledge; @lavin2024survey; @rivest1978method]. EBMs are built on this principle that *verification is easier than generation*: rather than learning to generate directly, as in most existing paradigms, EBMs learn to generate by optimizing predictions with respect to this learned verification (energy function); this is depicted in Figure `\ref{fig:energy_landscape_minimization}`{=latex}.

Recent works have attempted to leverage this characteristic of verifiers [@Team_2023; @yao2023tree; @ma2025inference; @ouyang2022training], but these approaches decouple the verifier and generator, resulting in adversarial dynamics [@ma2025inference] and challenges in scalability [@yao2023tree]. For example, researchers combining tree search and LLMs required thousands or even millions of samples to achieve optimal performance [@Team_2023]. In contrast, EBMs **combine** the verifier and generator into a **single model**, where the generator is defined implicitly by the gradient of the verifier [@du2019implicit]. We show that this coupling resolves scalability and adversarial issues (Figures `\ref{subfig:thinking_scaling}`{=latex} and `\ref{subfig:self_verification_non_adversarial}`{=latex}).

An additional advantage of verifiers is generalization. Because verification is usually easier than generation [@swamy2025all], *prediction verification on Out-Of-Distribution (OOD) data is often easier than explicit prediction generation for OOD data* [@du2022learning]. This characteristic often results in better generalization of verifiers than explicit generators [@du2022learning]. As an example, models trained to explicitly solve mazes on a small grid may not generalize to larger grids, whereas verifiers for maze solutions learn whether a path is correct, which more easily transfers to larger mazes. This characteristic may be why EBMs often generalize better than existing models [@du2022learning; @du2024learning], which we further support in our larger-scale experiments (Figure `\ref{subfig:ood_thinking_comparison}`{=latex} and Table `\ref{tab:img_denoising_performance}`{=latex}).

Learning to Understand
----------------------

This verifier-centric perspective also relates to a deeper limitation in contemporary generative models, referred to as \`\`The Generative AI Paradox [@west2023generative]." Although current generative models achieve strong generative capabilities, they frequently lack basic discrimination skills, such as the ability to assess the plausibility or coherence of their own predictions [@west2023generative; @stojnic2023commonsense]. These limitations impede their ability to engage in reasoning, planning, and decision-making [@yan2025phd; @kambhampati2024position]. In contrast, EBMs offer a potential solution to this challenge: as EBMs generate by learning a verifier (which is conceptually similar to a discriminator), they develop strong discrimination skills [@wang2023energy]. Experimental results further support this observation (Table `\ref{tab:img_denoising_performance}`{=latex}).

Energy-Based Transformers (`\pa`{=latex}) Approach
==================================================

Energy-Based Models (EBM) Background {#sec:learning_approach}
------------------------------------

Energy-Based Models (EBMs) assign a scalar energy value to each configuration of input variables, enabling them to model the **compatibility** and **interactions** between variables, such as between a context and candidate-prediction. In the case of probabilistic EBMs, this defines a probability distribution using a Boltzmann distribution $p_\theta(x) = \frac{e^{-E_{\theta}(x)}}{Z(\theta)}$ where $Z(\theta) = \int e^{-E_{\theta}(x)} dx$ is the intractable partition function involving an integral over all possible values of $x$. Due to the negative exponential, lower energy corresponds to higher probability, and higher energy corresponds to lower probability. To avoid the intractability of the partition function, it is common to work with **unnormalized EBMs**, which dispense of the partition function in favor of representing relative unnormalized probabilities. This formulation shifts the focus from addressing the partition function, to simply assigning low energy to the true data manifold and high energy elsewhere [@du2019implicit; @dawid2024introduction], offering benefits such as scalability to spaces where the true data manifold is thin and therefore a probabilistic EBMs would have an infinite score [@dawid2024introduction]. In supervised or predictive self-supervised learning (e.g., classification, autoregressive modeling, masked modeling), unnormalized EBMs can be formulated as: $p_\theta(x, \hat{y}) \propto e^{-E_{\theta}(x, \hat{y})}$, where the goal of the EBM is to learn to predict $\hat{y}$ given $x$.[^4] For an accessible and beginner-friendly introduction to EBMs, including intuitive explanations and pseudocode, please refer to Section `\ref{sec:ebm_intro}`{=latex}.

```{=latex}
\begin{algorithm}[H]\small
      \caption{Training}\label{alg:ebt_training}
      \Inputs{Context $x$, Target $y$, EBM $E_\theta(x,\hat y)$}
      \Hparams{Steps $N$, Step Size $\alpha$, Loss $J(\cdot)$}
      Sample $\hat y_0 \sim \mathcal{N}(0,I)$\;
      \For{$i = 0, \dots, N-1$}{
        $\hat y_{i+1} \gets \hat y_i - \alpha\nabla_{\hat y_i}E_\theta(x,\hat y_i)$\;
      }
      $\mathcal{L} \gets J(\hat y_N,y)$\;
      \Return{$\mathcal{L}$, update $E_\theta$}\;
    \end{algorithm}
```
```{=latex}
\hfill
```
```{=latex}
\begin{algorithm}[H]\small
      \caption{Inference with Verification}\label{alg:ebt_inference}
      \Inputs{Context $x$, EBM $E_\theta(x,\hat y)$}
      \Hparams{Steps $N$, Step Size $\alpha$, Samples $M$}
      \For{$j = 1, \dots, M$}{
        Sample $\hat y_{0, j} \sim \mathcal{N}(0,I)$\;
        \For{$i = 0, \dots, N-1$}{
          $\hat y_{i+1, j} \gets \hat y_{i, j} - \alpha \nabla_{\hat y_{i, j}}E_\theta(x,\hat y_{i, j})$\;
        }
      }
      \Return {$\hat y^* = \operatorname*{argmin}_j E_\theta\!\bigl(x,\hat y_{N, j}\bigr)$}\;
    \end{algorithm}
```
Scalable EBM Learning
---------------------

While EBMs offer a flexible modeling framework, training them scalably remains an open research problem. Two primary training approaches exist---contrastive and regularized methods [@lecun2022path]. Contrastive methods increase the energy of negative samples while decreasing the energy of positive samples. Due to the curse of dimensionality, where the volume of spaces grows exponentially with their dimension, contrastive methods struggle to scale because they must increase the energy of an exponentially higher number of negative samples [@dawid2024introduction].

An alternative is to frame EBM learning as an optimization problem [@du2022learning; @wang2023energy], which avoids the curse of dimensionality by implicitly regularizing the energy landscape, enabling scalable learning. In this approach, EBMs are trained to optimize an initial prediction to the ground truth solution through gradient descent, as shown in Figure `\ref{fig:energy_landscape_minimization}`{=latex}. This pushes the energy landscape to be convex surrounding the ground truth solution, thereby regularizing the energy landscape to only have low energy on the true data manifold. Intuitively, this training approach is similar to GANs [@goodfellow2014generative]. During the forward pass, EBMs can be seen as a GAN discriminator by giving an energy \`\`verification"; on the backward pass they can be seen a GAN generator by optimizing predictions through energy minimization to try and fool the discriminator.

Training EBMs to perform optimization can be formalized as follows. We begin with an EBM $E_{\theta}$, a prior (initial prediction) $\hat{y}_0$, an input (context) for the model $x$, and seek to predict $y$. We aim to find the minimum energy (most compatible/likely) $\hat{y}$ given an $x$, which we search for using gradient descent: $$\begin{aligned}
\hat{y}_{i+1} = \hat{y}_i - \alpha \nabla_{\hat{y}_i}E_{\theta}(x, \hat{y}_i),\end{aligned}$$ where $\alpha$ is the step size (this is formalized in Algorithm `\ref{alg:ebt_training}`{=latex}). Then, the loss can be computed using any standard objective function. For this work, we use the same loss functions as existing papers to simplify experiments, so we use categorical cross-entropy for language modeling [@bengio2003neural] and mean squared error for image denoising [@ho2020denoising]. Importantly, this loss is backpropagated through the entire optimization process, requiring second-order derivatives (i.e., gradients of gradients). These are computed efficiently via Hessian-vector products, which scale linearly with model size [@dagréou2024howtocompute], similar to standard first-order backpropagation in feed-forward models. More details and pseudocode can be found in Section `\ref{sec:how_to_learn_ebt}`{=latex}.

Scalable EBM Thinking {#sec:thinking_approach}
---------------------

While this training approach is scalable, achieving smooth and convex energy landscapes, such as the landscape visualized in Figure `\ref{fig:energy_landscape_minimization}`{=latex}, remains challenging on real-world problems. Because $y$ is high-dimensional, the energy landscape spans a high-dimensional space, and must remain well-shaped throughout. To address this, we found three key *energy landscape regularization* techniques to be essential in ensuring the smoothness and convexity of learned energy landscapes, enabling strong thinking capabilities to be learned during training.

First, we found a replay buffer (following existing EBM works [@du2022learning; @du2019implicit]) helps simulate longer optimization trajectories, enabling energy landscapes to be well defined near their minimum. Second, a variant of Langevin Dynamics [@du2019implicit] (random noise), was found to be helpful for encouraging **exploration** of the energy landscape: $$\begin{aligned}
\hat y_{i+1}
&= \hat y_i - \alpha\nabla_{\hat y_i}E_\theta\bigl(x,\hat y_i\bigr) + \eta_i,\quad
   \eta_i \sim \mathcal{N}(0, \sigma),\end{aligned}$$ where $\sigma$ is the magnitude of the noise $\eta$. Without this random noise term, exploration is often limited to paths leading directly to the energy minimum, leaving other regions poorly defined. Third, varying the paths taken towards predicting solutions, by randomizing the gradient descent step size $\alpha$ and number of optimization steps, significantly improved generalization. Together, these techniques substantially improved the System 2 Thinking capabilities of models, as confirmed by ablation experiments in Table `\ref{tab:system_2_ablations}`{=latex}.

With these energy landscape regularization techniques established, to understand the System 2 Thinking capabilities of `\pa`{=latex}s, we explored two main thinking approaches, corresponding to two of the cognitive facets described. First, corresponding to dynamic computation allocation (Facet `\ref{facet:dynamic-computation}`{=latex}), we conduct experiments that involve changing the number of steps taken for optimization of a single prediction. This is conceptually similar to increasing the denoising steps performed with a diffusion model. Second, corresponding to the ability to verify predictions (Facet `\ref{facet:prediction-verification}`{=latex}), we generate N predictions from an EBM and then choose the minimum energy prediction. This is conceptually similar to Best of N (BoN) sampling using language models [@stiennon2020learning]. However, EBMs generalize this approach to both discrete and continuous modalities, don't require additional supervision (e.g., an external reward/verification model), and perform it on *every single prediction*, not just to entire sequences. We demonstrate performance improvements gained from both of these techniques in several Thinking experiments (Figures `\ref{fig:best_result}`{=latex}, `\ref{fig:ood_generalization_thinking}`{=latex}, and `\ref{fig:thinking_fwd_passes_img}`{=latex}), which further confirm the importance of the described cognitive facets. This thinking process is formalized in Algorithm `\ref{alg:ebt_inference}`{=latex} and we more formally define and justify usage of the term thinking in Section `\ref{sec:thinking_formalization}`{=latex}.

Energy-Based Transformers (`\pa`{=latex}s) Architecture {#sec:ebt_intro}
-------------------------------------------------------

Transformers have demonstrated exceptional performance across numerous domains [@openai2023gpt4; @guo2025deepseek; @oquab2023dinov2; @he2021masked; @borsos2023audiolm]. Three primary advantages of Transformers include their parallelizability [@radford2019language; @vaswani2017attention], their stability [@grattafiori2024llama], and their scalability [@kaplan2020scaling]. Because Energy-Based Models (EBMs) have encountered difficulties with all three of these characteristics [@du2019implicit; @du2020improved; @li2023learning; @arbel2020generalized], Transformers represent a particularly suitable architecture for scaling EBMs. Consequently, to advance the EBM paradigm, we introduced Energy-Based Transformers (EBTs), which are Transformer implementations designed for EBMs. We developed two variants: a decoder-only EBT, inspired by the GPT architecture [@radford2018improving] for autoregressive modeling, as well as a bidirectional EBT with bidirectional attention across sequences, enabling capabilities such as infilling and masked modeling [@peebles2023scalable; @devlin2019bert]. Although the bidirectional EBT implementation is relatively straightforward, the autoregressive EBT presents greater implementation challenges, primarily due to the potential for information leakage in naïve implementations. Comprehensive details of this implementation are discussed in Section `\ref{sec:ebt_full}`{=latex}.

Experimentation and Results {#sec:experimentation}
===========================

We experiment with `\pa`{=latex} across both Autoregressive (AR) [@radford2019language] as well as Bidirectional models [@devlin2019bert] in discrete and continuous spaces.[^5] In discrete spaces, we focus primarily on the language modeling objective. In continuous spaces, we focus on vision tasks of next frame prediction and image denoising. We perform the most comprehensive experiments using Autoregressive Language Models trained to predict the next token, as this is the most extensively studied pretraining approach [@li2025mis]. All models are pretrained from scratch under a tightly constrained compute budget, as the architecture of `\pa`{=latex}s is incompatible with existing foundation models, making them incompatible for adaptation via fine-tuning. We focus on two primary types of results. First, we examine **learning scalability**, investigating how quickly models can fit the pretraining data, which is standard in the pretraining literature [@gu2023mamba; @touvron2023llama; @li2025mis; @alabdulmohsin2022revisiting; @kaplan2020scaling; @henighan2020scaling; @hoffmann2022training].

Second, we study **thinking scalability**, or how the performance of models changes as we scale the System 2 Thinking of models (Definition `\ref{def:system_2_thinking}`{=latex}). To measure the thinking scalability, we use the Number of Function Evaluations (NFEs) [@chen2018neural; @ma2025inference], which we deem to be one for every forward pass completed (for EBMs this is one function evaluation per optimization step). By scaling the number of forward passes during inference, we can determine whether model performance improves with increased thinking.

Autoregressive Language Modeling Experiments
--------------------------------------------

In this section, we detail and discuss the results for all Natural Language Processing (NLP) experiments using Autoregressive (AR) Language Models trained to predict the next discrete token in a text sequence [@radford2019language]. All language models are pretrained on the RedPajamaV2 text corpus [@weber2024redpajama; @together2023redpajama] $100B$ sample from HuggingFace using the GPT-NeoX tokenizer [@black2022gpt] (following [@gu2023mamba]) to predict the next token. Following existing pretraining work, we compare AR `\pa`{=latex} with the standard Transformer++ recipe [@gu2023mamba; @touvron2023llama; @sun2024learning]. We manually created a training and validation split of $66$ million and $33$ thousand samples, respectively.

```{=latex}
\scriptsize
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_learning_nlp_ar_data.svg}
```
```{=latex}
\hfill
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_learning_nlp_ar_bs.svg}
```
```{=latex}
\hfill
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_learning_nlp_ar_depth.svg}
```
For downstream evaluation, we used four key datasets in addition to the pretraining dataset, spanning reasoning, question answering, and syntax understanding. Ordered roughly by increasing perplexity difficulty, these include GSM8K [@cobbe2021training], SQuAD [@rajpurkar2016squad], BigBench Elementary Math QA  [@srivastava2022beyond], and BigBench Dyck Languages [@srivastava2022beyond]. We intentionally design the evaluation towards reasoning benchmarks due to their alignment with System 2 Thinking. Additionally, we focus on reporting perplexity as our relatively small models trained from scratch, when compared to current foundation models, do not achieve high accuracies on many of these benchmarks. Furthermore, perplexity often functions as a more linear metric than accuracy [@schaeffer2023emergent; @gu2023mamba], enabling a more comparable analysis of downstream performance as we scale compute during inference. Further details on the exact hyperparameters and setup used for all experiments are in Section `\ref{sec:experimental_details}`{=latex}.

### Autoregressive Language Model Learning Scalability

The scaling trends related to the learning speed of models, commonly referred to as \`\`scaling laws," are challenging to measure. For example, a recent survey [@li2025mis] found that the \`\`scaling laws" observed often depend on several implementation details and axes to measure, which can result in several different conclusions.[^6] Therefore, to be as comprehensive as possible in determining how `\pa`{=latex}s scales compared to the Transformer++, we conduct scaling experiments for six different axes--including data, batch size, depth, parameters, FLOPs,[^7] and embedding dimension. The results for the data, batch size, and depth scaling are shown in Figure `\ref{fig:nlp_ar_learn_scale_1}`{=latex}; and the results for parameters, FLOPs, and embedding dimension are visualized in Figure `\ref{fig:nlp_ar_learn_scale_2}`{=latex}. Across all axes `\pa`{=latex} consistently out-scales (has a higher scaling rate) the Transformer++ recipe, becoming the first model to achieve such a feat without using a different tokenizer ([@pagnoni2024byte] was the first and only work to our knowledge to out-scale the Transformer++ recipe, but they use a different tokenizer, and do not out-scale the Transformer++ across multiple axes unlike `\pa`{=latex}s). These results suggest that `\pa`{=latex}s are more data efficient, batch size efficient, parameter efficient, depth efficient, and compute efficient than the Transformer++ recipe. Thus, at the scale of modern foundation models trained on $1,000\times$ more data with models $1,000\times$ larger (following [@grattafiori2024llama]), we expect the pretraining performance of `\pa`{=latex}s to be significantly better than that of the Transformer++ recipe.

```{=latex}
\scriptsize
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_learning_nlp_ar_params.svg}
```
```{=latex}
\hfill
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_learning_nlp_ar_flops.svg}
```
```{=latex}
\hfill
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_learning_nlp_ar_width.svg}
```
```{=latex}
\small
```
  **Model**                                   **Thinking Longer $\uparrow$**   **Thinking Longer and Self-Verification $\uparrow$**
  ------------------------------------------ -------------------------------- ------------------------------------------------------
  No Random Step Size                                     -1.47                                        0.19
  No Random Num. Steps                                     0.00                                        9.65
  No Langevin Dynamics                                   **17.2**                                      17.0
  No Replay Buffer                                         14.8                                        17.8
  **Full System 2 Configuration**                          7.19                                      **18.7**
  `\label{tab:system_2_ablations}`{=latex}                                    

  : **System 2 Thinking Ablations.** All energy landscape regularization techniques described in Section `\ref{sec:thinking_approach}`{=latex} and their impact on System 2 Thinking performance, measured by percent perplexity improvement. Thinking Longer denotes more optimization steps and Self-Verification denotes optimizing many predictions and choosing the best. Bolded highlights the default System 2 Hyperparameters, leveraging all energy landscape regularization techniques described. This configuration results in the best performance when thinking longer and doing self-verification. Removing regularization, such as Langevin Dynamics, results in less energy landscape exploration, which improves single path performance (thinking longer) at the expense of self-verification performance.

### Autoregressive Language Model Thinking Scalability

Building on the learning results, we investigate `\pa`{=latex}s for thinking at inference time. We found that the thinking capabilities of `\pa`{=latex} emerge with a sufficiently large data scale, and therefore, due to limited resources, we focus on conducting thinking experiments with smaller models trained on substantial amounts of data. We test thinking capabilities along two axes: thinking longer, which denotes more optimization steps, and self-verification, which denotes generating many candidate predictions and selecting the minimum energy prediction. In Table `\ref{tab:system_2_ablations}`{=latex}, we conduct ablation studies to confirm the benefits of our *energy landscape regularization* techniques for System 2 Thinking on Out-of-Distribution Data from the BigBench Dyck Languages benchmark [@srivastava2022beyond]. We find that using all techniques yields the best System 2 Thinking performance when combining extended thinking and self-verification. Additionally, the results show that randomizing the step size is critical---removing it nearly eliminates Thinking gains, while disabling Langevin Dynamics degrades combined performance but improves results without verification, offering a performance-compute tradeoff.

Having established the importance of these landscape regularization techniques, in Figure `\ref{fig:best_result}`{=latex}, we analyze the scalability of thinking with `\pa`{=latex}s, where the results yield two main insights. First, as shown in Figure `\ref{subfig:ood_thinking_comparison}`{=latex}, EBTs are able to improve performance by as much as $29\%$ by increasing the amount of forward passes (thinking time), whereas the Transformer++ cannot improve performance at all.[^8] This aligns with our claims that because traditional feed-forward Transformers cannot dynamically allocate additional computation for each prediction being made, they are unable to improve performance for each token by thinking for longer.

Second, as demonstrated in Figure `\ref{subfig:thinking_scaling}`{=latex}, the thinking capabilities of `\pa`{=latex}s scale, showing that as `\pa`{=latex}s are trained for longer, their ability to achieve improvements from verification improves, increasing up to $10\%-14\%$ from $4\%-8\%$. This suggests that `\pa`{=latex}s trained at the same scale as modern foundation models, such as the $15$T tokens Llama3 [@grattafiori2024llama] was trained on ($\approx1000 \times$ the current data scale), would have significantly more substantial results from self-verifying. Lastly, we visualize results from `\pa`{=latex} at representing uncertainty while predicting tokens in Figure `\ref{fig:qual_nlp_uncertainty}`{=latex}. The results demonstrate that for easier to predict tokens, such as \`\`the" or \`\`but", `\pa`{=latex}s optimize to lower energies faster, whereas for harder to predict tokens, such as \`\`fox" or \`\`problem" `\pa`{=latex}s have higher energy that does not converge across steps. This suggests that during pretraining `\pa`{=latex}s learn to capture uncertainty regarding which tokens are harder or easier to predict, achieving Facet `\ref{facet:prediction-uncertainty}`{=latex}.

```{=latex}
\scriptsize
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_thinking_nlp_ar_fwd_advanced.svg}
```
```{=latex}
\hfill
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_thinking_nlp_ar_data_ood_linear.svg}
```
### Autoregressive Language Model Generalization

```{=latex}
\begin{wrapfigure}[25]{r}{0.5\textwidth}
  
  \includesvg[width=\linewidth]{images/scaling_thinking_nlp_ar_ood.svg}
  \caption{\textbf{OOD Thinking Performance.} As the data becomes more OOD, thinking leads to greater performance improvements, with a roughly linear trend. These findings highlight that {\pa}s thinking is especially critical for robust generalization to OOD data. Performance is measured on $5$ different datasets varying in Out-of-Distribution (OOD) magnitude shift, which is measured as the ratio of downstream dataset perplexity to pretraining perplexity. Max Thinking denotes combining thinking longer and self-verification. 
  }
  \label{fig:ood_generalization_thinking}
\end{wrapfigure}
```
As System 2 Thinking in humans is associated with generalization to novel scenarios, we conduct experiments directly aimed at measuring the effects of System 2 Thinking on generalization. In Figure `\ref{fig:ood_generalization_thinking}`{=latex}, we visualize the performance of `\pa`{=latex}s on the datasets described, which have varying levels of Out-of-Distribution (OOD) shift (measured as the ratio of downstream task perplexity to pretraining perplexity). We observe a strong linear trend: as the data becomes more OOD, thinking leads to greater performance improvements. Therefore, these findings suggest that the benefits of EBTs' thinking are not uniform across all data but scale positively with the magnitude of distributional shifts, highlighting thinking as a critical mechanism for robust generalization beyond training distributions. These findings align with observations in psychology where humans rely on deliberate System 2 Thinking to tackle challenging OOD tasks.

Next, we investigate the relation between OOD generalization and pretraining performance. Pretraining performance is strongly correlated with downstream task performance in language models [@gadre2024language; @gururangan2020don; @thrush2024improving], and can even be a predictor of downstream performance [@chen2024scaling; @isik2024scaling; @grattafiori2024llama]. Because we know that EBTs scale at a faster rate than the Transformer++, as demonstrated in Figures `\ref{fig:nlp_ar_learn_scale_1}`{=latex} and `\ref{fig:nlp_ar_learn_scale_2}`{=latex}, it is reasonable to hypothesize that they may also perform better on downstream tasks at scale. To investigate this, we compare models with identical training setups, where EBTs have slightly worse pretraining perplexity than Transformer++ models. As shown in Table `\ref{tab:generalization_exps}`{=latex}, despite achieving a higher pretraining perplexity, EBTs achieve lower (better) perplexity on most downstream tasks, suggesting stronger generalization, particularly to Out-of-Distribution (OOD) data. Together, with the better learning scalability results, and knowing that improved pretraining performance usually leads to improved downstream task performance [@chen2024scaling; @isik2024scaling; @grattafiori2024llama], these results suggest that at scale `\pa`{=latex}s would outperform the Transformer++.

```{=latex}
\small
```
  **Model**                                    **Pretrain**   **GSM8K ↓**   **SQuAD ↓**   **BB Math QA ↓**   **BB Dyck ↓**
  ------------------------------------------- -------------- ------------- ------------- ------------------ ---------------
  Transformer++                                 **31.36**        49.6        **52.3**           79.8             131.5
  EBT                                             33.43        **43.3**        53.1           **72.6**         **125.3**
  `\label{tab:generalization_exps}`{=latex}                                                                 

  : **Language Model Task Generalization Comparison.** Despite exhibiting a slightly higher pretraining perplexity, `\pa`{=latex}s usually achieve lower perplexity on downstream tasks than the Transformer++. This suggests that `\pa`{=latex}s generalize better than the Transformer++. Additionally, because `\pa`{=latex}s scale better than the Transformer++ during pretraining (Figure `\ref{fig:nlp_ar_learn_scale_1}`{=latex}), these findings suggest that `\pa`{=latex}s would outperform Transformer++ at foundation model scale. BB stands for BigBench.

```{=latex}
\scriptsize
```
```{=latex}
\includesvg[width=\linewidth]{images/token_energy_seq5.svg}
```
```{=latex}
\hfill
```
```{=latex}
\includesvg[width=\linewidth]{images/token_energy_seq3.svg}
```
Autoregressive Video Experiments
--------------------------------

To assess how `\pa`{=latex}s scale in continuous domains, we train models to predict the next image in a video conditioned on all previous frames---a common pretraining objective for video models [@deng2024autoregressive; @weissenborn2019scaling; @rakhimov2020latent; @ye2023video; @gu2025long]. Unlike in the NLP experiments, where models see each sample only once due to the dataset size, current popular video datasets are relatively small, requiring models to train repeatedly on the same data. As a result, this setting probes a different question: *\`\`how well can models fit a fixed dataset?"* rather than how efficiently models scale under non-data-bound regimes. This distinction is especially important given the recent scarcity of high-quality datasets [@villalobos2022will; @Data-centricML-benchmarking] and the perspective that data will increasingly become a bottleneck.

For experiments, we encode all $224 \times 224$ images into $3136$ dimensional features with the frozen SD-XL VAE [@rombach2022high; @stabilityai/sd-vae-ft-mse]. Then, all models are trained using a Smooth L1 loss with $\beta = 1.0$ on the Something Something V2 dataset [@goyal2017something], where we report the minimum validation loss achieved. In Figure `\ref{fig:cv_scaling_ssv2}`{=latex} we report scaling results for the embedding dimension and non-embedding parameter count, as we found these axes behaved the most linearly. The results demonstrate that, despite achieving a higher initial loss, `\pa`{=latex}s scale at a more than $33\%$ faster rate than the Transformer++. This suggests that at foundation model scale `\pa`{=latex}s would achieve significantly better performance than the Transformer++.

We believe this large scaling rate gap can be linked to the fact that `\pa`{=latex}s more seamlessly model continuous distributions than standard feed forward transformers due to being able to express uncertainty (Facet `\ref{facet:prediction-uncertainty}`{=latex}) through their energy scalar. To confirm this, we visualize results for different energies when predicting video frames in Figure `\ref{fig:qual_vid_uncertainty}`{=latex}. The results demonstrate that `\pa`{=latex}s successfully learn to capture uncertainty---where frames earlier on in the video have higher energy (higher uncertainty) due to no large objects being within the frame, and then as the major object in the scene becomes revealed more `\pa`{=latex} predicts lower energy (lower uncertainty). `\pa`{=latex}s learn to exhibit this behavior without any supervision using a Smooth L1 loss, whereas the standard feed-forward Transformer++ would require discretization schemes such as Vector Quantization [@islam2023eqa] with a categorical loss, or other tricks to achieve the same effect.

```{=latex}
\scriptsize
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_learning_vid_ar_embed.svg}
```
```{=latex}
\hfill
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_learning_vid_ar_params.svg}
```
```{=latex}
\captionsetup[table]{position=top}
```
```{=latex}
\small
```
  ------------------------------------------------------------------------------------- -------------------------------------- -------------------------- ---------------------------- ------------------------ ----------------------- -----------------------
                                                                                         In Distribution Noise $\sigma = 0.1$   OOD Noise $\sigma = 0.2$   ImageNet-1k Classification                                                   
  `\cmidrule`{=latex}(r)2-3 `\cmidrule`{=latex}(r)4-5 `\cmidrule`{=latex}(r)6-7 Model              PSNR $\uparrow$               MSE Pixel $\downarrow$         PSNR $\uparrow$         MSE Pixel $\downarrow$   Top 1 Acc. $\uparrow$   Top 5 Acc. $\uparrow$
  DiT                                                                                                   26.58                            142.98                      19.56                      718.7                    0.31%                   1.36%
  EBT                                                                                                 **27.25**                        **122.55**                  **23.29**                  **305.2**                **5.32%**               **13.2%**
  `\label{tab:img_denoising_performance}`{=latex}                                                                                                                                                                                       
  ------------------------------------------------------------------------------------- -------------------------------------- -------------------------- ---------------------------- ------------------------ ----------------------- -----------------------

  : **Image Denoising and Classification Comparison**. For image denoising, EBTs significantly outperform DiTs [@peebles2023scalable] in Peak Signal to Noise Ratio (PSNR), as well as MSE, on both in-distribution and Out-Of-Distribution (OOD) data, while using $99\%$ less forward passes. This suggests that EBTs generalize better than DiTs while using less computation. On image classification, EBTs also perform better than DiTs, yielding around $10\times$ higher accuracy, suggesting that EBTs learn better image representations and therefore understand images better than DiTs.

Bidirectional Image Experiments
-------------------------------

In addition to investigating autoregressive `\pa`{=latex}s, we also explore the performance of `\pa`{=latex}s trained bidirectionally. These experiments allow for a fairer comparison with diffusion models, which are not commonly trained autoregressively. Following [@du2022learning; @chen2020learning], models are trained to denoise images that have been noised. To make these denoising experiments compatible with diffusion models, we deviate from the original noising schemes performed in these works and use a noising scheme based on the noising schedule from diffusion models. Specifically, we follow [@peebles2023scalable], and use a linear variance schedule ranging from $1\times10^{-4}$ to $2\times10^{-2}$. To control the noise level, we use a hyperparameter denoted $\sigma$ representing the percentage of the diffusion schedule to noise samples; $\sigma$ was set to $0.1$ during training, and $0.2$ during testing to test generalization. We use the COCO 2014 dataset [@lin2014microsoft; @AbdoTW/COCO_2014DatasetsHF] with $128$ by $128$ images, a patch size of $16$, and the Diffusion Transformer implementation from [@peebles2023scalable].

![**Qualitative OOD Image Denoising.** `\pa`{=latex}s achieve better denoising quality during inference while using one step for every $100$ denoising steps of a DiT. The overall image quality of `\pa`{=latex} denoised images is less blurry than images denoised by DiT.](images/qualitative_image_denoising.png){#fig:qualitative_image_denoising width="95%"}

Following [@du2022learning; @chen2020learning], during inference we investigate denoising on the same level of noise performed during training ($\sigma = 0.1$), as well as at a higher noise level ($\sigma = 0.2$), representing Out-of-Distribution (OOD) noisier images. The results are in Table `\ref{tab:img_denoising_performance}`{=latex}, where we observe that `\pa`{=latex}s perform better than DiTs at both in and out of distribution image denoising across various metrics, by as much as $3.5$ in Peak Signal to Noise Ratio (PSNR). Following [@chen2018neural], we also plot the performance based on the number of forward passes (Number of Function Evaluations NFEs) in Figure `\ref{fig:thinking_fwd_passes_img}`{=latex}. These results demonstrate that `\pa`{=latex}s perform better at denoising while using $99\%$ less denoising steps than DiTs, and that the System 2 Thinking scaling rate for `\pa`{=latex}s is higher than for DiTs. Lastly, qualitative results for denoised out-of-distribution images for `\pa`{=latex} compared to the DiT baseline are shown in Figure `\ref{fig:qualitative_image_denoising}`{=latex}. These results further demonstrate that the visual quality of EBT denoised images is much better than the visual quality of DiT denoised images, while using less compute.

```{=latex}
\begin{wrapfigure}[32]{r}{0.5\textwidth}
    
    \includesvg[width=\linewidth]{images/video_energy_seq0003.svg}
    \caption{\textbf{Learning Uncertainty on Video Results.} In line with cognitive Facet~\ref{facet:prediction-uncertainty}, EBTs learn to express uncertainty across continuous video frames without supervision. At the start of the video, uncertainty is high (high energy) because the frame is mostly empty and the scene is highly unpredictable. As a blue garment is placed into the frame, uncertainty decreases (low energy), reflecting the greater predictability of the scene. When the blue garment is removed from the scene, uncertainty increases again, indicating a return to higher unpredictability. Such a capability is significantly more difficult to achieve in continuous spaces with traditional feed-forward transformers without discretization schemes~\cite{pei2022transformer}.}
    \label{fig:qual_vid_uncertainty}
\end{wrapfigure}
```
In an effort to understand whether the representations learned from denoising captured useful visual features, we perform a linear probe evaluation on ImageNet-1k [@russakovsky2015imagenet], following common practice in visual representation learning [@oquab2023dinov2; @he2021masked]. For both models, we take the average of all the final patch tokens, and for DiTs we feed in $T = 0$. The results are shown in Table `\ref{tab:img_denoising_performance}`{=latex}, where the performance of EBTs is much higher than DiTs, achieving a Top-1 and Top-5 accuracy around $10 \times$ higher than that of DiTs. This suggests that EBTs learn better image representations than DiTs, and therefore that EBTs offer promise in generative models with an **improved understanding** of what they generate.

Discussion
==========

Across both discrete (text) and continuous (video) autoregressive models, the results show that `\pa`{=latex}s scale at a faster rate than the standard Transformer++ approach during pretraining across all measured axes, including data, batch size, depth, parameters, FLOPs, and width. This is especially apparent with data and batch scaling for text, as well as width and parameter scaling for video, where the scaling rate was over $30\%$ higher. These results are particularly important for two reasons: first, they suggest that at the scale of modern foundation models, `\pa`{=latex}s would outperform the current Transformer++ approach *even without their System 2 Thinking capabilities*. Second, `\pa`{=latex}s appear to be the first approach that has *better data efficiency* than the Transformer++. As data has become one of the major limiting factors in further scaling [@openaiYouTubegpt4_5], this makes `\pa`{=latex}s especially appealing.

With System 2 Thinking, the EBT-Transformer++ performance gap would be expected to increase, as we found that System 2 Thinking could improve performance by as much as $29\%$ for text (Figure `\ref{subfig:ood_thinking_comparison}`{=latex}). Further, we found that the thinking capabilities of EBTs scale well, as they improve during pretraining (Figure `\ref{subfig:thinking_scaling}`{=latex}) and perform better for data that is more OOD (Figure `\ref{fig:ood_generalization_thinking}`{=latex}). These findings imply that OOD generalization benefits from System 2 Thinking will further increase at larger scales, paving the way for principled generalization to OOD data. In addition, EBTs become increasingly robust to self-generated errors during verification (Figure `\ref{subfig:self_verification_non_adversarial}`{=latex}), indicating that their self-verification process scales reliably and sidesteps the adversarial instabilities reported in prior works [@ma2025inference; @huang2025best].

```{=latex}
\begin{wrapfigure}[24]{r}{0.50\textwidth}
  
  \includesvg[width=\linewidth]{images/scaling_thinking_img_bi_fwd.svg}
  \caption{\textbf{Image Denoising Thinking Scalability.} A comparison between EBT and DiT on image denoising given a different number of forward passes. EBTs require only $1\%$ of the forward passes used by DiT to achieve comparable or better PSNR. Further, the scaling rate of PSNR improvement given more forward passes is much higher for EBTs than it is for DiTs. These results suggest EBTs have superior thinking capabilities than DiTs on OOD data.}
  \label{fig:thinking_fwd_passes_img}
  
\end{wrapfigure}
```
We hypothesize that the superior scaling of `\pa`{=latex}s compared to the Transformer++ can be attributed to `\pa`{=latex}s learning to verify (Facet `\ref{facet:prediction-verification}`{=latex}) rather than solely learning to predict. As discussed in Section `\ref{sec:intuition}`{=latex}, verification often generalizes better than amortized generation, which we believe leads to improved learning efficiency. The results in our generalization experiments further support this idea of verification leading to improved generalization, as we found that *given a slightly worse pretraining performance*, `\pa`{=latex}s still outperform the Transformer++ recipe on downstream tasks. Additionally, corresponding to Facet `\ref{facet:prediction-uncertainty}`{=latex} regarding prediction uncertainty, we find that the energy values learned by `\pa`{=latex}s correlate strongly with more challenging to predict data, as shown in Figures `\ref{fig:qual_nlp_uncertainty}`{=latex} and `\ref{fig:qual_vid_uncertainty}`{=latex}. This is a promising characteristic of `\pa`{=latex}s, as it enables uncertainty estimation within continuous state spaces, which would allow for principled inference-time behavior adaptation (Facet `\ref{facet:dynamic-computation}`{=latex}) when models determine a problem is more challenging.

Lastly, the results from our bidirectional experiments indicate that `\pa`{=latex}s hold strong promise in bidirectional modeling. Particularly, we find that bidirectional EBTs outperform the DiT baseline in image denoising on both In and Out-of-Distribution performance, while using significantly fewer forward passes. This suggests that EBTs offer promise in tasks such as image generation or bidirectional text generation. Further, when evaluating the representations from DiTs and EBTs, we find that EBTs perform significantly better, achieving up to a $10 \times$ improvement in accuracy, suggesting EBTs offer promise in developing a *better understanding* of what is being generated when performing generative modeling.

Related Work
============

Traditional Transformers
------------------------

The Transformer architecture [@vaswani2017attention] has become ubiquitous across various domains [@radford2019language; @touvron2023llama; @latif2023transformers; @oquab2023dinov2]. The most commonly used transformer variant of today makes predictions directly in the output space with a single forward pass, demonstrated in Figure `\ref{subfig:ar_transformer}`{=latex}. Because these models have a finite depth and width, and make predictions in a single forward pass, they are unable to dynamically allocate more computation to *each* prediction being made. Furthermore, they cannot model uncertainty in continuous state spaces in the same way they can in discrete state spaces, because the normalization process for continuous state spaces is not as well-defined as it is for discrete spaces using softmax [@dawid2024introduction]. Rather, training these models to express uncertainty relies on tricks such as Vector Quantization [@van2017neural] or pseudo losses/objectives (e.g., ELBO [@kingma2013auto]). Finally, because these models are not trained to explicitly verify samples, improving inference-time performance at a per-prediction level often requires external models [@lightman2023let].

RNNs
----

Recently, several RNN variants (Figure `\ref{subfig:rnn_arch}`{=latex}) have emerged to alleviate memory bottlenecks and achieve faster inference [@gu2023mamba; @peng2023rwkv]. These approaches have scaled similarly to Transformers in autoregressive sequence modeling and achieve better memory efficiency and reduced latency. However, traditional RNNs that update their internal state based solely on new information/data [@gu2023mamba; @peng2023rwkv] are not capable of allocating additional computation during inference, and thus suffer from the same flaws as traditional transformers in achieving human-like System 2 Thinking.

To resolve these issues, people have equipped RNNs with the ability to allocate computation dynamically, with architectures such as the Universal Transformer [@dehghani2018universal]. Recently, this type of RNN has also been applied to LLMs [@geiping2025scaling; @saunshi2025reasoning], allowing LLMs to reason using additional computation in a continuous latent space through the depth of an unrolled RNN. However, like Diffusion models, these models learn to amortize gradient prediction of the energy function [@geiping2025scaling], meaning they cannot model uncertainty or explicitly verify predictions. Consequently, EBMs generalize these RNN-based architectures by offering explicit prediction verification capabilities [@ma2025inference]. Further discussion on this relationship is provided in Section `\ref{sec:additional_related_work}`{=latex}.

Dynamic Computation with LLMs
-----------------------------

The ability to leverage a dynamic amount of computation in LLMs has been emulated using chain-of-thought prompting [@wei2022chain] and continuous latent space reasoning [@hao2024training]. While these approaches can improve performance, they don't seamlessly transfer to continuous modalities, and LLM chain-of-thought has been shown to be unreliable for reasoning [@lin2025implicit; @turpin2023language; @agarwal2024faithfulness]. More recently, models have been explicitly trained to perform reasoning using Reinforcement Learning [@jaech2024openai; @guo2025deepseek; @xai2025grok3; @anthropic2025claude37sonnet]. These approaches allow LLMs to simulate additional computational depth based on the number of tokens decoded before making a prediction, and as a result, significantly improve performance [@jaech2024openai; @guo2025deepseek]. The main limitations of these approaches are that they currently apply only to discrete domains (i.e., LLMs), are effective on a narrow set of problems that are easily verifiable (e.g., math and coding), and require additional supervision, typically in the form of reward signals, making them incompatible with purely unsupervised pretraining [@guo2025deepseek].

Dynamic Computation with Diffusion
----------------------------------

The most common instance of a model architecture specifically created to leverage dynamic computation is diffusion models (Figure `\ref{subfig:diffusion_arch}`{=latex}), where using multiple forward passes to generate a prediction is a core aspect of both training and inference [@höppe2022diffusion; @rombach2022high]. Although diffusion models implicitly define a likelihood through the reverse process [@ho2020denoising], which could theoretically be used to verify predictions, in practice an external verifier is necessary to improve performance at inference time beyond increasing denoising steps [@ma2025inference; @liu2025video; @singhal2025general]. This requirement limits the generalizability and scalability of diffusion models as an approach for System 2 Thinking, as they do not have two of the cognitive facets discussed in Table `\ref{tab:cognitive_facets}`{=latex}: the ability to model uncertainty in continuous state spaces and the ability explicitly verify predictions without additional models [@ma2025inference; @liu2025video; @singhal2025general]. Furthermore, diffusion models rely on a fixed denoising schedule, which restricts their ability to adaptively halt or extend computation---unlike EBMs. Additionally, diffusion models can be seen as predicting the gradient of the data density/energy function [@du2023reduce], and therefore EBMs are a generalization of diffusion models that learn to explicitly verify predictions. More on this connection is in Section `\ref{sec:additional_related_work}`{=latex}, and a side-by-side comparison of diffusion models and EBMs is in Figure `\ref{fig:diffusion_vs_ebm}`{=latex}.

Energy-Based Models (EBMs)
--------------------------

The perspective of energy minimization as thinking/reasoning has been known for some time [@lecun2006tutorial]. Therefore, the most similar approaches to `\pa`{=latex}s also train EBMs to do reasoning/thinking [@du2022learning; @du2024learning]. While these works achieved impressive generalization results, they only focus on small-scale problems, and did not scale EBMs to high-dimensional real-world problems such as language or video. Additionally, these works did not perform an in-depth analysis on the types of System 2 Thinking that emerge with EBMs, more complex inference time procedures beyond just increasing the number of gradient descent steps, approaches towards improving EBM scalability, and required techniques for enhancing System 2 Thinking in EBMs.

Limitations and Conclusion
==========================

In this work, we proposed Energy-Based Transformers (`\pa`{=latex}s), a new paradigm that frames System 2 Thinking as an optimization procedure with respect to a learned verifier (an Energy-Based Model), enabling System 2 Thinking to emerge across any problem or modality entirely from unsupervised learning.

#### Limitations.

Despite demonstrating strong performance, `\pa`{=latex}s have several current limitations. First, because `\pa`{=latex}s generate predictions through an optimization process, they introduce additional hyperparameters, such as the optimization step size and the number of optimization steps. Tuning these hyperparameters is crucial for training stability, as we found poorly chosen values often lead to unstable training. Second, training and inference are more computationally expensive than standard feed-forward models, requiring additional gradient computations. Third, while `\pa`{=latex}s scale well up to 800M parameters, we have not explored larger models due to resource constraints. However, experimental scaling trends demonstrate that `\pa`{=latex}s scale faster than existing paradigms during pretraining, suggesting that `\pa`{=latex}s would perform better at foundation-model scale. Finally, `\pa`{=latex}s currently struggle with data distributions that have many modes, such as class conditional image generation, likely due to the convex energy landscape assumption made during training.

#### Conclusion.

`\pa`{=latex}s are the first instance of an approach that scales at a *faster rate* than the Transformer++ during pretraining across both continuous and discrete modalities. Additionally, our results suggest that `\pa`{=latex}s scale better than existing approaches during inference at System 2 Thinking by dynamically allocating computational resources and self-verifying their own predictions; these System 2 Thinking capabilities enable improved generalization to out-of-distribution data. Ultimately, the superior scaling in both training and inference, coupled with improved generalization, positions EBTs as a promising new paradigm shift for advancing the capabilities of future foundation models.

Acknowledgements
================

We extend special thanks to Jeonghwan Kim and Cheng Qian for their helpful discussions. This research used the Delta and DeltaAI advanced computing and data resources, which are supported by the National Science Foundation (award OAC 2320345 and award OAC 2005572) and the State of Illinois. Delta and DeltaAI are joint efforts of the University of Illinois Urbana-Champaign and its National Center for Supercomputing Applications.

```{=latex}
\small
```
```{=latex}
\bibliographystyle{unsrtnat}
```
```{=latex}
\appendix
```
```{=latex}
\renewcommand{\thesection}{\Alph{section}}
```
```{=latex}
\renewcommand{\thefigure}{\Alph{section}\arabic{figure}}
```
```{=latex}
\renewcommand{\thetable}{\Alph{section}\arabic{table}}
```
```{=latex}
\setcounter{section}{0}
```
`\counterwithin{figure}{section}`{=latex} `\counterwithin{table}{section}`{=latex}

```{=latex}
\clearpage
```
In this appendix, we provide additional insight and details on EBTs. First, we provide more insight on the broader impact/future work of EBTs in Section `\ref{sec:future_broader}`{=latex}. Next, in Section `\ref{sec:additional_exp}`{=latex}, we include additional experiments. Then, we include additional approach details in Section `\ref{sec:training_details}`{=latex}, as well as additional experimental details in Section `\ref{sec:experimental_details}`{=latex}. After that, we include a comprehensive related work in Section `\ref{sec:additional_related_work}`{=latex}, additional facets of cognition in Section `\ref{sec:cognitive_facet_details}`{=latex}, and a discussion of counterarguments in Section `\ref{sec:counterarguments}`{=latex}. Finally, in the hopes of making EBTs more accessible to general audiences, in Section `\ref{sec:ebm_intro}`{=latex} we contain a general easier-to-understand intro to EBMs, and in Section `\ref{sec:tutorial}`{=latex} we describe a tutorial for getting started with EBTs.

Future Works and Broader Impact  {#sec:future_broader}
===============================

`\pa`{=latex}s, being qualitatively different from existing approaches, open several future research directions.

Reversal Curse
--------------

Recently, a phenomenon known as \`\`The Reversal Curse" has been observed where LLMs fail to learn a symmetric mapping [@berglund2023reversal] \`\`B is A" despite learning \`\`A is B". For example, LLMs trained on an example such as \`\`Q: Who is Tom Cruise's mother? A: Mary Lee Pfeiffer" often fail to generalize to know the answer to the reverse question of \`\`Who is Mary Lee Pfeiffer's son?" Remarkably, the Reversal Curse has manifested itself in LLMs regardless of the size or scale [@berglund2023reversal]---probing researchers to investigate whether there are fundamental limitations to traditional feed-forward LLMs. One predicted cause of The Reversal Curse is the nature of gradient updates, where backpropagation only updates tokens within context. That is, while learning the mapping \`\`A is B", none of B's tokens are within context, meaning they do not receive gradient updates when updating A's tokens. We hypothesize that LLMs trained with `\pa`{=latex}s rather than standard feed-forward Transformers could help reduce this phenomenon, as with `\pa`{=latex}s the tokens of A and B are within context during gradient updates due to predictions being made in the input space. Therefore, an exciting research direction would be investigating whether this hypothesis is correct in LLMs trained with `\pa`{=latex}s, allowing improved generalization.

Improved Stability
------------------

In this work, we primarily trained `\pa`{=latex}s with either two or three optimization steps. While these parameters worked well, we suspect that increasing the number of steps would improve the System 2 Thinking capabilities and the scaling during pretraining of `\pa`{=latex}s, as more steps enable a longer \`\`thinking process" before needing to converge. However, because of challenges in stability when training with more steps, due to a larger gradient computation graph, we were unable to successfully increase past two or three steps. Future work could focus more on extending the number of steps by studying ways to improve the training stability of EBMs.

World Models
------------

In this work, we focus on autoregressive and bidirectional models over just state information (no actions). `\pa`{=latex}s offer high promise in modeling states and actions due to the nature of EBMs learning a distribution over all possible inputs. Particularly, given a model trained to estimate the unnormalized joint distribution of the current context, future, as well as future actions, such world models could implicitly be used as policies to generate actions to achieve a specific state, similar to [@janner2022planning; @chi2023diffusion; @zhou2024dino]. This would involve holding the current context (past states) constant, and minimizing the energy by propagating the gradient back to the action inputs and future state predictions. Thus, world models trained in this manner become capable of more than just predicting the future, but also in decision making to achieve a specific goal state.

`\pa`{=latex}s as Complementary Models {#sec:complementary}
--------------------------------------

As demonstrated in [@bakhtin2021residual; @bhattacharyya2020energy], EBMs can be used to improve the quality of generated text from language models. It's possible `\pa`{=latex}s could be used in a similar manner for a broad variety of tasks, serving as the verifier of predictions initialized by standard feed-forward models. Therefore, although we do a side-by-side comparison to existing models in this work, `\pa`{=latex}s could be **complementary** to existing model paradigms---being used as the System 2 Backbone for helping lighter models that perform System 1 thinking.

There exist several current real-world use cases, such as low-latency LLM serving, where doing a single forward pass is sufficient, and where the added inference overhead of gradients with `\pa`{=latex}s would not be worth the extra computation. However, we also envision a world in which people use `\pa`{=latex}s for long-term System 2 Thinking to solve challenging problems. How much computation would it be worth dedicating to prove a long-standing mathematical conjecture, or figuring out a cure to cancer?

Recurrent Energy-Based Models
-----------------------------

While `\pa`{=latex}s scale well, for latency-driven use cases, Transformers require significantly more memory than Recurrent Neural Networks. Additionally, there is strong evidence for recurrence in the human brain [@douglas2007recurrent]. Therefore, we anticipate that recurrent Energy-Based Models, possibly leveraging the Mamba architecture [@gu2023mamba], will eventually become common.

Improved Thinking Algorithms
----------------------------

The EBM thinking algorithms described have strong connections to or are derived from Markov Chain Monte Carlo (MCMC) sampling. Therefore, we broadly expect known MCMC samplers with more advanced techniques for traversing the energy landscape to be successful, such as Hamiltonian Monte Carlo [@betancourt2017conceptual] or annealed Langevin dynamics [@du2019implicit]. Additionally, we did not explore more advanced search algorithms such as Monte Carlo Tree Search, which we suspect could offer performance improvements and leave for future work.

Multimodal Energy-Based Models
------------------------------

We did not experiment with multimodal EBMs, however, EBMs offer several advantages for learning over multiple modalities. For example, multimodal EBMs would enable a single energy scalar to represent the alignment between modalities, and would simplify joint training across modalities by providing a unified objective that naturally captures inter‑modal dependencies.

Thinking Scalability {#sec:search}
--------------------

Due to a lack of computational resources, we were unable to train models with more than $10^{21}$ FLOPs ($\approx 1300$ A100 GPU Hours). Therefore, training and thinking with `\pa`{=latex}s remains untested at larger foundation model scale. We leave it to future work to scale with more GPUs and investigate the qualitative differences in training and thinking with `\pa`{=latex}s.

Learning Multimodal Distributions
---------------------------------

We found that `\pa`{=latex}s, with the current training approach, struggle to capture distributions with many modes (e.g., unconditional image generation). Therefore, future work could explore approaches to improve the learning of distributions with many modes. More info is in Section `\ref{sec:failure_cases}`{=latex}.

Understanding Predictions
-------------------------

We believe that there exists a fundamental distinction between the internal representations associated with model inputs and outputs. Specifically, models generate internal representations **of** inputs, as these serve as the foundation that dictate the model's behavior at any given point in time. Conversely, as outputs do not affect a model's behavior at a given point in time, we contend that models construct representations **for predicting** (and not necessarily understanding) outputs. This distinction leads us to an interesting insight: *models may not achieve a genuine understanding of outputs in the same way they understand inputs.* This implies that while models may develop an intricate understanding of input data, such understanding does not naturally extend to predictions that are made in the output space. Therefore, existing feed-forward models primarily making predictions in the output space may not **understand** their predictions in the same way they understand their inputs. This intuition further supports the principles behind `\pa`{=latex}, where predictions are made in the input space, enabling representations **of predictions** to be developed. We leave the investigation of this hypothesis to future work.

Additional Experimentation {#sec:additional_exp}
==========================

Additional Natural Language Processing Experiments
--------------------------------------------------

We conduct experiments to confirm hypotheses on thinking results obtained in the main paper. First, we confirm that `\pa`{=latex}s become less adversarial with scale by comparing the performance of using BoN with $2$ versus $10$ samples. The results in Figure `\ref{subfig:self_verification_non_adversarial}`{=latex} demonstrate that when models are trained on less tokens, there is little performance improvement by verifying $10$ samples instead of just $2$. In fact, verifying $10$ samples occasionally leads to *worse* performance than verifying $2$ samples, likely because the EBT found an adversarial sample (a sample with low energy that is in fact not a good prediction). However, as data scale increases we observe that performance improvements from BoN-10 versus BoN-2 increase, and that these adversarial dynamics decrease. Together, these results suggest that with scale `\pa`{=latex}s become less adversarial due to an improved energy landscape. In an effort to understand the impacts of thinking at the scale of modern foundation models, we project results from Figure `\ref{subfig:thinking_scaling}`{=latex} to the scale of modern foundation models [@grattafiori2024llama] to extrapolate a projected performance gain from self-verification based thinking. The results are visualized in Figure `\ref{subfig:thinking_scaling_extrapolated}`{=latex}, where they demonstrate that, because of the $1000 \times$ data scale of modern foundation models, the performance improvement from self-verification increases drastically.

Additionally, in an effort to understand whether `\pa`{=latex}s can capture epistemic uncertainty (uncertainty related to knowledge), in addition to aleatoric uncertainty (uncertainty related to inherent data randomness), we visualize the energies of different token sequences that are in versus Out-Of-Distribution (OOD) in Figure `\ref{fig:qualitative_epistemic_uncertainty}`{=latex}. The results demonstrate that, for a more in-distribution sequence, `\pa`{=latex}s have lower energy (less uncertainty), than for an OOD sequence. This suggests that `\pa`{=latex}s learn to *know what they don't know*, as they learn to have higher energy for OOD sequences signifying harder predictability.

```{=latex}
\scriptsize
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_thinking_nlp_ar_data_linear.svg}
```
```{=latex}
\hfill
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_thinking_nlp_ar_data_ood_linear_projected.svg}
```
```{=latex}
\includesvg[width=0.95\textwidth]{images/token_energy_pair3.svg}
```
To confirm some of the scaling trends in the paper, as well as experiment with additional datasets, we conduct larger-scale experiments on the FineWeb dataset [@penedo2024fineweb]. Particularly, for Figure `\ref{subfig:data_scaling}`{=latex}, we confirm that the generally better data scaling of EBTs compared to the Transformer++ holds at increased scale. We confirm this by training small-sized models with a batch size of $256$, a context length of $1024$, and for $500,000$ training steps. These results are visualized in Figure `\ref{fig:data_scaling_larger_scale}`{=latex}, where we observe that EBTs still continue to outscale the Transformer++ by as much as $35\%$ in scaling rate. These results further reinforce that EBTs are more data-efficient than the Transformer++.

```{=latex}
\scriptsize
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_learning_nlp_ar_data_smallm1.svg}
```
```{=latex}
\hfill
```
```{=latex}
\includesvg[width=\linewidth]{images/scaling_learning_nlp_ar_data_smallm2.svg}
```
`\pa`{=latex} Failure Cases {#sec:failure_cases}
---------------------------

During experimentation, along with image denoising experiments, we also conducted small-scale experiments with text-to-image generation. We found that, in datasets such as COCO with many different modes for a single condition (e.g., hundreds of images with a caption similar to \`\`giraffe with a long neck"), that `\pa`{=latex}s did not learn to generate high-quality novel images. Instead, `\pa`{=latex}s often generated blurred images similar to the training distribution. We believe this is caused by the training approach pushing the energy landscape to be convex surrounding the training examples (modes). Therefore, when there are many different modes within the same region (same condition), this convex energy landscape \`\`merges" to one landscape averaged around the different modes, resulting in blurriness. We believe that this is not a fundamental limitation of `\pa`{=latex}s, and that future work could address this issue, such as by adding more conditioning.

Additional `\pa`{=latex} Details {#sec:training_details}
================================

Formalizing Thinking {#sec:thinking_formalization}
--------------------

Due to the recent surge of interest in scaling the performance of models during inference/test time, there are several common terms used to refer to these ideas. These terms include scaling the thinking capabilities of models [@jaech2024openai], inference-time scaling [@ma2025inference], inference-time compute [@manvi2024adaptive], and test-time compute [@jaech2024openai; @snell2024scaling]. Therefore, to reduce confusion stemming from a wide variety of terminology and unite the community, in this work we broadly define these concepts as **System Two Thinking** or more concisely **Thinking**. We formalize improvements made by Thinking as the following:\

```{=latex}
\begin{definition}[System 2 Thinking]\label{def:system_2_thinking}
\textit{Given a problem with data $x$, a model $\theta$, and additional computational resources in the form of function evaluations $F$ greater than the minimum number of function evaluations to get a valid prediction from the model $F_0$, \textbf{System Two Thinking} $\text{STT}(\cdot)$ quantifies the expected percentage improvement in performance as $F$ increases. Let $P(x, \theta, F)$ be the performance on input $x$ when the model $\theta$ uses $F$ function evaluations:}
\[
    \text{STT}(x, \theta, F) \;=\; \mathbb{E}_x \left[
        \frac{P(x, \theta, F)}{P(x, \theta, F_0)} \;-\; 1
    \right],
\]
\end{definition}
```
This formalization is compatible with any type of metric (e.g., Accuracy, Perplexity, FID, etc), and uses more psychology-aligned terminology [@kahneman2011thinking]. Further, avoiding terms such as \`\`inference" or \`\`test-time" makes the idea of Thinking more compatible with domains where the line between inference and training is blurry, such as real-world continual learning, domain adaptation, or actual human learning/thinking processes [@parr2022active]. Just as **learning** has become a flexible term across machine learning representing several different ideas, we intend for **thinking** to similarly unify many diverse ideas under a common framework. For a greater justification on this perspective, please see Section `\ref{sec:counterarguments}`{=latex}

Energy-Based Transformer (`\pa`{=latex}) Thinking Types {#sec:ebt_types}
-------------------------------------------------------

All experiments in the paper are conducted with two main variants of `\pa`{=latex}s, which we call System 1 (S1) and System 2 (S2) `\pa`{=latex}s. S1-`\pa`{=latex}s have hyperparameters specifically optimized for stability and learning convergence, whereas S2-`\pa`{=latex}s have hyperparameters optimized for System 2 Thinking capabilities. Many of the pretraining scaling experiments conducted in Section `\ref{sec:experimentation}`{=latex} are with S1 models as to reduce the computational resources required for experimentation. To confirm that these results hold for S2 models, we plot the scaling trends of S1 and S2 models side by side; in Figure `\ref{fig:s1_vs_s2}`{=latex}, we find that S2 models scale *at the same or a higher rate* than S1 models during training, but have a higher Y-intercept. This Y-intercept offset does not affect asymptotic scaling behavior (as asymptotically the scaling rate dominates), and hence, scaling trends that hold for S1 models should generally hold for S2 models. In fact, S2 models may even perform better than S1 models during pretraining asymptotically because of the higher scaling rate. Intuitively, switching from S1 to S2 models allows for a compute trade-off between the model's pretraining performance and the model's System 2 Thinking capabilities.

S1 models have the gradient of predictions detached between optimization steps to increase training stability. Conversely, following [@du2024learning], the S2 models truncate backpropagation and avoid detaching prediction tensors between optimization steps. Additionally, the S2 models have all the energy landscape regularization techniques described in Section `\ref{sec:thinking_approach}`{=latex}, whereas the S1 models have none. We also find that S2 models require a different value for the optimization step size, that the optimization step size not be learned, and to perform a minimum number of optimization steps greater than one.

```{=latex}
\scriptsize
```
`\includesvg[width=0.48\columnwidth]{images/scaling_learning_nlp_ar_s1_s2.svg}`{=latex}\

Autoregressive Causal Energy-Based Transformers Efficient Implementation {#sec:ebt_full}
------------------------------------------------------------------------

GPT-style decoder-only autoregressive Transformers are parallelizable due to making all next-state predictions simultaneously [@radford2019language]. In this section, we detail the implementation of decoder-only autoregressive EBTs, which also parallelize all next-state predictions simultaneously, but involve more complexity due to EBMs making predictions in the input space rather than the output space.[^9] To demonstrate why this poses a challenge, for the decoder-only autoregressive Transformer, consider the case of the $N \times N$ attention scores matrix after the causal mask has been applied: $$\text{scores} = 
\begin{bmatrix}
\alpha_{z_1,z_1} & 0 & \ldots & 0 \\
\alpha_{z_2,z_1} & \alpha_{z_2,z_2} & \ldots & 0 \\
\ldots & \ldots & \ldots & \ldots \\
\alpha_{z_n,z_1} & \alpha_{z_n,z_2} & \ldots & \alpha_{z_n,z_n} \\
\end{bmatrix},$$ where $\alpha_{z_i,z_j}$ represents the attention score (probability mass) from observed state $z_i$ to observed state $z_j$. Now, in the case of an EBM, where predictions are made in the input space, the desired $N \times (N + 1)$ attention scores matrix would look like the following: $$\text{scores} = 
\begin{bmatrix}
\alpha_{z_1,z_1} & \alpha_{z_1,\hat{z}_2} & 0 & \ldots & 0 \\
\alpha_{z_2,z_1} & \alpha_{z_2,z_2} & \alpha_{z_2,\hat{z}_3} & \ldots & 0 \\
\ldots & \ldots & \ldots & \ldots & \ldots \\
\alpha_{z_n,z_1} & \alpha_{z_n,z_2} & \alpha_{z_n,z_3} & \ldots & \alpha_{z_n,\hat{z}_{n+1}} \\
\end{bmatrix},      \label{math:scores_ebm}$$ where $\hat{z}_j$ denotes a predicted state (as opposed to an observed state). This is challenging to compute because each $\hat{z}_j$ along the superdiagonal is unique for its row, as we chose for each prediction to be made independently from each other prediction.[^10] Consequently, unlike standard attention in feed-forward Transformers, this attention matrix cannot be computed with a matrix multiplication ($\text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right)$), as every element of the superdiagonal is a prediction and not an observed state.

Additionally, in traditional transformers, if the context length is $N$, the size of manipulated tensors will generally be $B \times N \times D$ where $B$ is the batch size and $D$ is the embedding dimension. However, because EBMs make predictions in the input space, the input tensor needs to be different to allow for inputting predictions. Therefore, to distinguish these tensors, for a context length of $N$ tokens (meaning we are given $N$ tokens), we define a sequence of the first $N$ elements as $z_o$, or the observed sequence representations, and the final $N$ elements as $z_p$, or the predicted representations (for each next element in the sequence). This means that the manipulated tensors within the EBT are of size $B \times 2 N \times D$ for a context length of size $N$.

The values for $z_o$, or the observed states, are computed identically as in the original transformer [@vaswani2017attention], as the attention scores of the observed states do not depend on the predicted states. This can be formalized as the following: $$\text{Attention}(Q_o, K_o, V_o)_{z_o} = \text{softmax}\left(\frac{Q_oK_o^T}{\sqrt{d_k}}\right)V_o,$$ where $Q_o$, $K_o$, and $V_o$ are the Query, Key, and Value matrices of the observed states $z_o$. In every block of the transformer, the representations of all observed states are updated in this manner, independent of the representations of the predictions.\
For the computation of prediction representations, three matrices are computed, which we notate as $Q_p$, $K_p$, and $V_p$.[^11] First, we compute the self-attention scores of all predicted representations to all observed representations: $$\tilde{S}_{z_o\leftarrow z_p} = \frac{Q_pK_o^T}{\sqrt{d_k}},$$ where $\tilde{S}_{z_o\leftarrow z_p}$ denotes the raw unnormalized attention scores of the predicted states to the observed states. Note, however, that the self-attention scores of each prediction with itself have not yet been calculated---due to the key matrix being from the observed states. Therefore, the superdiagonal of the $\tilde{S}_{z_o\leftarrow z_p}$ matrix needs to be replaced with the self-attention scores of each prediction with itself to achieve the attention score matrix shown in Equation `\ref{math:scores_ebm}`{=latex}.

To achieve this matrix, we first need to append a column to the right side of the $\tilde{S}_{z_o\leftarrow z_p}$ matrix, as the size of the matrix is currently $N \times N$, but the matrix needs to be $N \times (N+1)$ ($N$ for the observed states and then one for the predicted states along the second dimension). After doing this, we first mask out the superdiagonal with zeros to ensure that the probabilities in the superdiagonal of the score matrix only correspond to the values of predicted states with themselves. To make this operation differentiable, this masking operation is done through elementwise multiplication of a matrix with ones everywhere except the superdiagonal, which has zeros. Then, we compute the self-attention scores of each predicted state with itself, using the following equation:\
$$\tilde{S}_{z_p \leftarrow z_p} = \frac{\text{sum}(Q_p * K_p)}{\sqrt{d_k}},$$ where the $*$ indicates the Hadamard product and the sum is across the feature dimension. Using a superdiagonal mask again, we set the diagonal of the $\tilde{S}_{z_o\leftarrow z_p}$ to these values. We denote the updated matrix as $\tilde{S}_{z_p}$, representing that this matrix is still unnormalized scores but takes into account interaction between $z_p$ and $z_o$ as well as from $z_p$ to themselves. Now, after applying the softmax:\
$$S_{z_p} = \text{softmax}\left( \tilde{S}_{z_p} \right),$$ we have the intended scores matrix shown in Equation `\ref{math:scores_ebm}`{=latex}. Note that for this softmax operation we adjust the standard causal attention mask to ensure all future information is masked out and the superdiagonal we added is not masked out.

One more barrier towards finally extracting all updated $z_p$ representations is the fact that we cannot simply multiply this resulting scores matrix by the values matrix, as each element of the superdiagonal corresponds to a different predicted state. Thus, using similar techniques to before, we first clone and then extract the superdiagonal from this scores matrix using a diagonal mask.

After extracting out the superdiagonal, zeroing it out after extraction, and removing the earlier appended right-most column, we can multiply the resulting scores matrix by the $V_o$ matrix to get all of the representations summed together of each predicted state with all observed states. This is computed using the following matrix multiplication:\
$$z_p = S_{z_p} \cdot V_o.$$ As we also need to add the representation of each predicted state weighted with its own attention score (what was extracted on the superdiagonal), we perform another Hadamard product of the $V_p$ matrix with the cloned superdiagonal to get these values, and then add these element wise to the $z_p$ representations. Now, we have computed the intended representations involving the scores matrix shown in Equation `\ref{math:scores_ebm}`{=latex}. Now, $z_o$ and $z_p$ are updated by multiplying these tensors by the shared output weight matrix $W_o$.

Autoregressive Energy-Based Transformers Simplified Implementation
------------------------------------------------------------------

A more simplified implementation involves the entire attention matrices and a generalized causal mask, as described in [@deng2024causal]. However, because this implementation involves a matrix multiplication with $2$ times the sequence length, this results in $4$ times the number of FLOPs as normal attention, which is around double the number of FLOPs of our more efficient implementation.

Experimentation Details {#sec:experimental_details}
=======================

Tables `\ref{tab:hparams_baseline}`{=latex} and `\ref{tab:hparams_ebt}`{=latex} specify general model information and hyperparameters. We utilized the Llama 2 transformer implementation [@touvron2023llama] for the Transformer++ and used this implementation as the backbone upon which we built causal decoder-only `\parch`{=latex}s. We seed all libraries using PyTorch Lightning [@falcon2019pytorch] for all experiments with a seed of $33$. For the Diffusion Transformer, we use the implementation from [@peebles2023scalable]---for the bidirectional EBT we build upon this implementation.

Unless otherwise stated, for all scaling experiments we copy the popular Mamba paper pretraining settings for the model configuration (we use different batch sizes/data due to computational constraints), shown in Table `\ref{tab:scaling_model_sizes}`{=latex}. Because we are compute-constrained, we also include two extra model sizes in this table, extra extra small (xxs) and extra small (xs), which allow us to collect more data points for the extrapolation of scaling trends. Because the hyperparameters used in these experiments were tuned for feed-forward Transformers in [@kaplan2020scaling], we broadly expect that hyperparameters tuned for EBTs will further increase the performance gap between EBTs and the Transformer++ in autoregressive modeling. Future research should investigate the optimal width/depth/batch sizes for EBTs.

```{=latex}
\captionsetup[table]{position=top}
```
```{=latex}
\small
```
  Size                                         Non-Embedding Params   \# layers   embed. dim   \# heads
  ------------------------------------------- ---------------------- ----------- ------------ ----------
  xxs                                                 6.18M               6          384          6
  xs                                                  12.4M              12          384          6
  small                                               48.8M              12          768          12
  medium                                               176M              24          1024         16
  large                                                396M              24          1536         16
  xl                                                   708M              24          2048         32
  `\label{tab:scaling_model_sizes}`{=latex}                                                   

  : **Model sizes and hyperparameters for scaling experiments.** For applicable model sizes we follow Mamba [@gu2023mamba].

```{=latex}
\begin{table*}[t]

\caption{\textbf{Hyperparameters for Transformer++.}}

\small
\begin{tabular}{lcc}
\toprule
\textbf{Hyperparameter} & \textbf{CV} & \textbf{NLP} \\
\toprule
% Batch Size per GPU & $64$ & $128$ \\
% Effective Batch Size & $192$ & $384$ \\
% Epochs & $400$ \\
Optimizer & AdamW \\
Optimizer Momentum & $\beta_1,\beta_2=0.9,0.999$ \\
% Freeze Encoder Epochs & $2000$ \\
% Learning Rate (LR) & $1e-3$ & $6e-4$ \\
LR Schedule & Linear warm up cosine decay \\
Warmup steps & $1e4$ \\
% Warmup Base LR Divider & $20$ \\
Minimum LR Scale & $10$ \\
Gradient Clip Value & $1$ \\
% Transformer Blocks & $12$ \\
% Multi-headed Attention Heads & $12$ \\
Weight Decay & $0.01$ \\
Context Length & $16$ & $256$ \\
Encoder & SD-XL VAE~\cite{rombach2022high, stabilityai/sd-vae-ft-mse} & - \\
% ViT Backbone Size & Base & - \\
Image Dimension & $224$x$224$ & - \\
Tokenizer & - & EleutherAI/gpt-neox-20b~\cite{black2022gptneox20b} \\
Vocab Size & - & $50277$ \\
\bottomrule
\end{tabular}
\label{tab:hparams_baseline}
\end{table*}
```
Autoregressive Language Modeling Experimental Details
-----------------------------------------------------

### Autoregressive Language Modeling Learning Scalability Details {#sec:scaling_law_extra_info}

Conducting thorough scaling experiments is extremely challenging---a recent survey on \`\`scaling laws" [@li2025mis] showed just how fragile many of these \`\`scaling laws" are to hyperparameters, data, and other parameters, and how changing these variables slightly can lead to vastly different conclusions. Being bottlenecked by a limited set of computing resources further exacerbates the issue of comparing two models. Therefore, we sought out to conduct controlled experiments that revealed the most information possible regarding the scaling of `\pa`{=latex}s compared to different models.

Most existing works study scaling by varying several factors simultaneously, including depth (number of transformer blocks) [@kaplan2020scaling], width (embedding dimension) [@kaplan2020scaling], possibly batch size [@chen2024scaling; @hu2023predicting], and the amount of data [@kaplan2020scaling]. Therefore, to be more comprehensive in determining when `\pa`{=latex}s scale differently than the Transformer++, we decided to conduct normal scaling experiments over all of these factors at the exact same time (as is standard), *as well as ablating over changing just one of these factors at a time*. Notably, conducting experiments in this manner allows for controlling a single independent variable at a time (i.e., just changing the number of Transformer Blocks), which allows for stronger conclusions regarding what aspects of model scaling different models perform better over (i.e., `\pa`{=latex}s scale better then the Transformer++ when increasing the number of Transformer Blocks). Scaling all factors at once does not allow for such insight, as there are many independent variables changing at once. We believe experiments that involve changing a single independent variable are in broader alignment with traditional scientific principles.

For the parameter and FLOP scalability experiments, all models are pretrained for $105k$ steps. For each model size, following [@chen2024scaling; @hu2023predicting], we scale the batch size. For NLP experiments we use the following batch sizes for the xxs, xs, small, medium, and large models respectively: $32$, $46$, $90$, $170$, and $256$. These batch sizes were determined by scaling the batch size with the square root of the number of parameters and then rounding to an even number, similar to batch size trends in [@hu2023predicting]. All model sizes use the same learning rate as in the Mamba paper [@gu2023mamba] ($0.0006$, $0.0003$, $0.00025$, and $0.0002$ for small, medium, large, and xl models respectively), where for the xxs and xs models we use a learning rate of $0.0012$ and $0.0009$ respectively. For the data scaling and batch size scaling experiments conducted in Figure `\ref{fig:nlp_ar_learn_scale_1}`{=latex}, we use xxs models (we also report small models in Section `\ref{sec:additional_exp}`{=latex}). The data scaling experiment used a batch size of $128$ and the batch size experiments ran for $105k$ steps. All models were trained with a context length of $256$ and no FFN multiplier (FFN dimension being equal to the embedding dimension) due to limited resources.

### Autoregressive Language Modeling Thinking Scalability Details

We train xxs S2-EBT and Transformer++ models with the same setup as above, with the exception that models are trained with a batch size of $128$ for 1M training steps (Table `\ref{tab:hparams_ebt_thinking}`{=latex} contains more hyperparameter details). Increasing the data scale enables us to better understand how thinking scales during pretraining. It's worth noting that since we are training small language models, they could not benefit from modern techniques such as Chain of Thought (CoT) in improving performance. Figures `\ref{fig:best_result}`{=latex}, `\ref{fig:nlp_ar_think_scale}`{=latex}, `\ref{fig:ood_generalization_thinking}`{=latex} and Table `\ref{tab:generalization_exps}`{=latex} use the best checkpoints of these two models. Figures `\ref{subfig:ood_thinking_comparison}`{=latex}, `\ref{fig:ood_generalization_thinking}`{=latex} both use all four downstream datasets shown in Table `\ref{tab:generalization_exps}`{=latex}. Figure `\ref{fig:ood_generalization_thinking}`{=latex} also uses the pretraining dataset in addition to these downstream datasets, where every dataset is represented as a separate OOD point.

Figure `\ref{subfig:thinking_scaling}`{=latex} was solely from the Dyck Languages benchmark. We did not observe this trend in other benchmarks, possibly due to perplexity not being a completely linear metric. Figure `\ref{subfig:self_verification_non_adversarial}`{=latex} was on the RedPajamaV2 validation dataset.

Autoregressive Video Experimental Details
-----------------------------------------

We use the same model parameter scaling, shown in Table `\ref{tab:scaling_model_sizes}`{=latex}, as the NLP experiments. We also used a batch size of $256$ for all models, as we found that it did not affect the scaling performance due to models training for many epochs. We also processed videos with $0.25$ seconds between frames. For the Transformer++ baseline, we use the same learning rates as the NLP experiments. For EBTs, we found that it was necessary to use a lower learning rate by a factor of $3$ for proper training stability. We use the standard SSV2 train and validation split for experiments. Other hyperparameters are shown in Figure `\ref{tab:hparams_baseline}`{=latex} and Figure `\ref{tab:hparams_ebt}`{=latex}.

```{=latex}
\begin{table*}[t]\caption{\textbf{Hyperparameters for \textbf{\pa} scaling experiments.}}

\small
\begin{tabular}{lcc}
\toprule
\textbf{Hyperparameter} & \textbf{CV} & \textbf{NLP} \\
\toprule
% Batch Size per GPU & $64$ & $24$ \\
% Effective Batch Size & $192$ & $72$ \\
% Epochs & $400$ \\
Optimizer & AdamW \\
Optimizer Momentum & $\beta_1,\beta_2=0.9,0.999$ \\
% Learning Rate (LR) & $2e-4$ \\
LR Schedule & Linear warm up cosine decay \\
Warmup Steps & $1e4$ \\
% Warmup Base LR Divider & $10$ \\
Minimum LR Scale & $10$ \\
Gradient Clip Value & $1$ \\
Weight Decay & $0.01$ \\
Context Length & $16$ & $256$ \\
% Transformer Blocks & $12$ \\
% Multi-headed Attention Heads & $12$ \\
% End MLP Layers & $1$ \\

Encoder & SD-XL VAE~\cite{rombach2022high, stabilityai/sd-vae-ft-mse} & - \\
Image Dimension & $224$x$224$ & - \\
Tokenizer & - & EleutherAI/gpt-neox-20b~\cite{black2022gptneox20b} \\
Vocab Size & - & $50277$ \\
Normalize Input Distribution & - & \cmark \\

% optimization Params
Optimization Steps & $2$ & $2$ \\
Optimization Step Size & $30,000$ & $500$ \\
Optimization Step Size LR Multiplier & $90,000$ & $1,500$ \\
Learnable Optimization Step Size & \cmark \\

\bottomrule
\end{tabular}
\label{tab:hparams_ebt}
\end{table*}
```
Bidirectional Image Denoising Experimental Details
--------------------------------------------------

We use the COCO 2014 dataset [@lin2014microsoft; @AbdoTW/COCO_2014DatasetsHF] with $128$ by $128$ images, its train/validation split, a patch size of $16$, and the Diffusion Transformer implementation from [@peebles2023scalable]. All models were trained using the large model size described in Table `\ref{tab:scaling_model_sizes}`{=latex}, with a learning rate of $1e-4$ for $100,000$ steps. For the DiT baseline, we used the same hyperparameters from [@peebles2023scalable], changing only the batch size to $128$ from $256$. We based our bidirectional EBT implementation on the code from this repository. We experimented with several different diffusion inference strategies to ensure a fair comparison, including DDPM, DDIM, increasing the number of diffusion steps at inference, and recursively applying the diffusion model on its own denoised output.[^12] Ultimately, we found that the combination of DDIM recursively applied on its own output performed best, hence we used this as the baseline in all experiments. We used the default denoising schedule from the DiT codebase [@peebles2023scalable].

For both DiTs and EBTs, we found that models performed best on OOD noise levels when denoising their own outputs twice, that is applying the model to denoise the image three times recursively. This is how we are able to achieve the results in Figure `\ref{fig:thinking_fwd_passes_img}`{=latex} demonstrating the performance for $300$ forward passes from DiTs and $3$ forward passes from EBTs. We found that for image denoising, it was not necessary to train EBTs with the S2 hyperparameters for System 2 Capabilities to emerge, although its possible these would further improve performance. Additionally, for image classification, for both models, we take the average of all the final patch tokens, and for DiTs we feed in $T = 0$.

```{=latex}
\begin{table*}[t]\caption{\textbf{Hyperparameters for {\pa} System 2 Thinking experiments.}}

\small
\begin{tabular}{l c}
\toprule
\textbf{Hyperparameter} & \textbf{NLP} \\
\midrule
Optimizer & AdamW \\
Optimizer Momentum ($\beta_1,\beta_2$) & 0.9, 0.999 \\
Model Size & Small \\
Learning Rate (LR) & 0.0012 \\
Learning Rate (LR) Schedule & Linear warm-up + cosine decay \\
Warmup Steps & $1\times10^4$ \\
Minimum LR Scale & 10 \\
Gradient Clip Value & 1 \\
Weight Decay & 0.01 \\
Context Length & 256 \\
Tokenizer & EleutherAI/gpt-neox-20b~\cite{black2022gptneox20b} \\
Vocab Size & 50,277 \\
Normalize Input Distribution & \cmark \\
Optimization Steps & 2-3 (randomized) \\
Optimization Step Size & 5 \\
Optimization Step Size Multiplicative Random Factor & 2 \\
Langevin Dynamics Noise & 3 \\
Learnable Optimization Step Size & \xmark \\
No Detach Between Optimization Steps & \cmark \\
Truncate Optimization & \cmark \\
Replay Buffer & \cmark \\
\bottomrule
\end{tabular}
\label{tab:hparams_ebt_thinking}
\end{table*}
```
Computational Resources
-----------------------

All experiments were conducted on either Nvidia A100s, H100s or GH200s, with the largest scale experiment requiring approximately $\approx 1300$ A100 GPU Hours. The runtime for each experiment was dependent on the model sizes used as well as the amount of data trained on.

FLOP Calculations {#sec:flop_calc}
-----------------

We adopt the standard estimate of $6N$ FLOPs per token for the AR Transformer++ [@casson2023transformerflops], where $N$ denotes the number of non‐embedding parameters. For AR EBTs, however, the per‐token cost varies with the number of training optimization steps and chosen hyperparameters.

To derive the FLOPs for EBT training, we follow [@dagréou2024howtocompute] for the Hessian‐vector product (HVP), which EBTs require to backpropagate through a first‐order derivative. Since an HVP has the same theoretical complexity as a gradient computation, we express the per‐step FLOPs as $$\begin{aligned}
\text{FLOPs} = F + B + B.\end{aligned}$$ Based on [@casson2023transformerflops], the forward and backward passes require approximately $2N$ and $4N$ FLOPs per token, respectively, where $N$ denotes the count of non‑embedding parameters in the Transformer. In the autoregressive `\pa`{=latex} implementation, the effective sequence length becomes twice that of the original Transformer (formally $2S - 2$ for an original sequence length $S$). Owing to the efficient scheme of Section `\ref{sec:ebt_full}`{=latex}, this doubling of sequence length translates roughly into a two‑fold increase in FLOPs, rather than a four-fold increase in FLOPs. Hence, each second‑order optimization step demands roughly $$\begin{aligned}
(F + B + B) \times 2 = (2N + 4N + 4N)\times 2 = 10N \times 2,\end{aligned}$$ making it $\approx3.33\times$ more expensive than a standard feed‑forward Transformer step.

The overall FLOP count also depends on one's choice of hyperparameters. For S1 models, where the loss is evaluated at every iteration without gradient truncation, the total FLOPs simply multiply by the number of steps. Therefore, for our pretraining experiments using two optimization steps, we get that `\pa`{=latex}s used $6.66\times$ the FLOPs of a comparable Transformer++ during training. In contrast, for S2 models, a random number of optimization steps is used, the gradient is truncated, the loss is only calculated at the last step following [@du2022learning], and a Replay Buffer is used. Therefore, the FLOP count varies and can both decrease (as truncating uses less FLOPs for earlier steps) as well as increase (as using more steps and a replay buffer both use more FLOPs). These numbers also vary during inference, where the full EBT implementation parallelizing all predictions at once is not necessary.

Given the scarcity of published methods for computing higher‑order derivative FLOPs and our inability to leverage existing libraries for Hessian‑vector products, these estimates remain approximate. We welcome corrections or additional insights from readers familiar with FLOP calculations for second‑order methods.

Additional Related Works {#sec:additional_related_work}
========================

Energy-Based Models (EBMs) as a Generalization of Diffusion Models and Recurrent Depth Models
---------------------------------------------------------------------------------------------

Both diffusion models [@du2023reduce] and RNNs [@saunshi2025reasoning] can be seen as predicting the score, or the gradient of the energy function/data density, $\nabla_x E_{\theta}(x)$, where diffusion models do this with an additional time condition [@liu2022compositional]. Thus, the largest benefit of explicit EBMs over these approaches, which can be seen as implicit EBMs (due to only implicitly defining an energy function), is that using an explicit EBM allows for explicit verification/likelihood modeling. We show that this enables the use of self-verification to improve predictions, whereas with diffusion models and RNNs an additional verifier model is necessary to achieve this capability [@ma2025inference].

```{=latex}
\scriptsize
```
![Diffusion Model](images/diffusion_training.png){width="\\linewidth"}

```{=latex}
\hfill
```
![Energy-Based Model (EBM)](images/ebm_training.png){width="\\linewidth"}

It's also worth noting that RNN, diffusion models, and EBMs need not be incompatible with one another. For example, [@du2024learning] combines EBMs and diffusion to reason over challenging problems. This can increase stability of the learned energy landscape by adding explicit score supervision.

In Depth EBM and Diffusion Model Comparison {#sec:diffusion_vs_ebm}
-------------------------------------------

Because of the similarity of diffusion models and EBMs, we present a side-by-side comparison of the training and inference approach for both in Figure `\ref{fig:diffusion_vs_ebm}`{=latex}, where the primary difference lie in the supervision they receive during training and the update rule for predictions. Additionally, we provide more information to compare these approaches.

Under the assumption that the energy landscape is well formed and that optimization is well behaved, EBMs offer several distinct advantages over diffusion models. In EBMs, the energy function is trained to represent a meaningful landscape where the energy value of a sample directly corresponds to its relative unnormalized likelihood. Consequently, two samples can be directly compared to determine which is more likely, in a single forward pass. On the other hand, diffusion models require running samples through the entire reverse diffusion process to get likelihoods, which often requires hundreds to thousands of steps, and rely on likelihood approximations such as ELBOs or numerical solvers for SDEs/ODEs. In practice, these result in incomparable likelihoods, as ELBOs only give likelihood lower bounds and numerical solvers result in high approximation error [@ma2025inference].

Furthermore, the learning of an energy landscape over predictions means that any approximation errors at each individual step of the Markov Chain (optimization process) do not result in cumulative error, as the minimum of the energy landscape can still be reached. This differs from diffusion models, where any approximation error at each step will result in increasing accumulated error across the entire Markov Chain [@du2024learning] (demonstrated in Figure `\ref{fig:diffusion_vs_ebm}`{=latex}).

Lastly, EBMs, giving an unnormalized likelihood estimate at each step, are in practice much more flexible for generation than diffusion models. While diffusion models require running the entire reverse diffusion process with a specific denoising schedule to generate a sample, EBMs can be trained to directly predict the next sample in a single step, and give an unnormalized likelihood at each step indicating how likely they think this sample is. This better approximates human-like System 2 thinking, where humans naturally evaluate the strength of current predictions, and on the basis of knowing how good their predictions are, decide to dynamically allocate more or less computational resources.

Additional Energy-Based Models Related Works
--------------------------------------------

One contribution of this work was the design of a custom architecture for EBMs called the Energy-Based Transformer (`\parch`{=latex}). Roughly similar in naming is the work of the Energy Transformer [@hoover2024energy]. However, despite similarity in the names of these architectures, they are very different---with the primary similarity in architectures being the usage of attention mechanisms as well as a global energy function. The existing work integrated ideas from Modern Hopfield Networks, including RNNs, whereas in our work the architecture is non-recurrent and does not use associative memories. Additionally, in this work with EBTs we focused on System 2 Thinking and scalability to high-dimensional problems, which the previous work did not experiment with.

Other loosely related approaches to EBTs involve autoregressive Energy-Based Models, including E-ARM [@wang2022your], EBR [@bhattacharyya2020energy], and Residual EBMs [@bakhtin2021residual]. E-ARM involves adding an objective to the learning process to turn a traditional autoregressive model into an EBM, and as such does not achieve two of the cognitive facets discussed in Section `\ref{sec:intro}`{=latex}. EBR and Residual EBMs involve the training of an EBM on top of an already pretrained autoregressive language model. Both works, however, leverage a contrastive objective, which suffers from the curse of dimensionality.

The optimization procedure used to train `\pa`{=latex}s can be seen as a form of denoising score matching [@vincent2011connection; @wang2023energy]. Particularly, EBTs having predictions initialized at some Gaussian, and then optimizing predictions using the gradient of the energy function, can be seen as training EBMs to denoise by learning the score of the data starting from a Gaussian prior. However, we find the optimization perspective is more intuitive, and this denoising score matching perspective is more similar to the diffusion model training procedure than it is EBMs, involving multiple levels of noise rather than just one level.

Additional Cognitive Facets {#sec:cognitive_facet_details}
===========================

In this section, we describe in more detail the different facets of cognition, in addition to introducing additional facets.

Facet `\ref{facet:dynamic-computation}`{=latex}, Dynamic Allocation of Computation, can be formalized as Turing completeness (assuming an infinite length external memory). Because standard feed-forward Transformers have finite length programs (i.e., they have finite depth), they are unable to be Turing complete at the granularity of each prediction being made. This also holds for common RNN variants that are only updated with new sequence information. Diffusion models and EBMs, being able to iteratively compute functions, have potentially infinite length programs, and therefore when augmented with an infinite external memory can be Turing complete.

```{=latex}
\facet[facet:compositionality]{Compositional Reasoning and Systematicity}
```
Humans routinely solve novel tasks by recombining familiar primitives (e.g., verbs with new arguments or visual parts into unseen objects), a hallmark of compositional generalization well-documented in neuroscience [@friederici2007mapping]. In contrast, state-of-the-art Transformers and diffusion models often fall short when evaluated on compositional generalization [@kobayashi2024can; @huang2023t2i]. Energy-Based Models (EBMs) seamlessly address these limitations: energies for individual factors are composable in several different manners [@du2023reduce], enabling zero-shot generation of novel combinations without retraining, where gradient-based sampling provides an intrinsic mechanism to verify and iteratively correct compositions [@du2023reduce]. Thus, EBMs offer a promising path toward human-like systematicity that remains elusive for existing paradigms and architectures.

Counterarguments {#sec:counterarguments}
================

System 2 Thinking
-----------------

In this paper, strong claims were made regarding the capabilities of current models and their ability to perform System 2 Thinking. However, there are common counterarguments to our claims, which we address here.

### System 2 Thinking and Inference-Time Compute Term Usage

Whether the computational effort spent at inference time fully captures what psychologists term System 2 Thinking is still actively debated. In Section `\ref{sec:thinking_formalization}`{=latex} we outline three reasons for preferring the broader terminology of System 2 Thinking: (i) it naturally extends to settings such as continual learning where terms such as \`\`inference-time compute" becomes ambiguous, (ii) it connects our discussion to a substantial body of cognitive-science work, and (iii) it offers a conceptually straightforward entry point for readers beyond the machine-learning community.

We believe that Human System 2 Thinking encompasses a far greater depth and complexity than the specific approaches explored in this paper, such as \`\`Thinking Longer" and \`\`Self-Verification." We wish to emphasize that our work does not claim current models replicate the full spectrum of human System 2 Thinking. Rather, we view the methods presented here as foundational steps toward that more ambitious long-term goal.

We propose that \`\`System 2 Thinking" offers a useful umbrella term that can effectively encompass and generalize other existing terminologies, including \`\`inference-time compute," \`\`test-time compute," or \`\`reasoning.\" A parallel can be drawn with the term \`\`learning" in our field. \`\`Learning" itself has evolved to describe a wide array of processes, some of which, such as k-Nearest Neighbors (KNNs) or the specific mechanisms of weight matrix updates in Artificial Neural Networks (ANNs), represent distinct facets rather than the entirety of human-like learning, meaning these approaches may not encompass the complexity of true human-like learning. However, despite this breadth and these simplicities, \`\`learning" has become a cornerstone term, upon which our entire field of machine learning is named upon.

In a similar vein, while the \`\`thinking" exhibited by the models discussed in this paper may not yet capture the full nuance and intricacy of human cognition, we believe the term \`\`System 2 Thinking" can still serve a valuable role. It offers a generalizing framework for existing vocabulary and can contribute to making complex concepts within the field more accessible and understandable to newcomers or experts from other fields. Our intention is to contribute to a constructive and unifying dialogue of intelligent systems.

### Is Chain-of-Thought (CoT) Sufficient for System 2 Thinking?

CoT is commonly believed to be sufficient for advanced reasoning to emerge in LLMs. However, we argue that there are several flaws with CoT preventing advanced thinking capabilities. First, Chain-of-Thought (CoT) involves reasoning over a discrete state space, which limits the granularity of \`\`thoughts." Second, CoT is not an intrinsic architectural capability but an external procedure applied to token sequences. Ideally, such reasoning should be embedded within the model and learned during training (unsupervised training, particularly). Third, each token is produced with a fixed computational budget, restricting the depth of reasoning per step. In contrast, humans allocate variable effort across steps when reasoning \`\`step by step". Similarly, models should be able to spend a variable amount of computation per token, as enabled by `\pa`{=latex}s. This aligns with the intuition behind the saying: \`\`a chain is only as strong as its weakest link"---*each step in a chain of thought should receive sufficient computation to avoid failure points that result in bad reasoning*.

Energy-Based Models (EBMs) Introduction {#sec:ebm_intro}
=======================================

Simplified Energy-Based Model (EBM) Introduction
------------------------------------------------

Feed-forward neural networks generally take the form of: given an $x$ predict $y$ (Fig. `\ref{fig:general_feed_forward_model_arch}`{=latex}). Energy-Based Models (EBMs) are a family of models that learn the compatibility (unnormalized probability) over all possible combinations of $x$ and $y$ (Fig. `\ref{fig:general_ebm_arch}`{=latex}). Intuitively, this can be seen as learning to verify the strength of $y$ as a prediction with $x$ as the input. Training models in this manner allows for representing multiple plausible versions of $y$ compatible with a given $x$. The differences between these models is visualized in Fig. `\ref{fig:simple_model_comparison}`{=latex}.

Formulating models in this manner ultimately brings about two primary questions:\
**First question:** Assuming we still care about ultimately predicting $\hat{y}$ *how do we use such an EBM to predict $\hat{y}$?* With feed forward models, generally we can just input $x$ and get the output of the model as $\hat{y}$, but we can't do this with EBMs?

With EBMs, what happens is conceptually similar to diffusion models [@rombach2022high], where we (commonly) initialize $\hat{y}$ as random noise. Then, we input $x$ and $\hat{y}$ into the model, and get a single scalar energy output (our initial energy output) from the model. Now, because our entire model is differentiable, we can get the *gradient from this energy scalar to $\hat{y}$* and perform gradient descent along the *energy landscape* (energy landscapes are surfaces resulting from mapping all possible predictions to scalar values, visualized in Figure `\ref{fig:energy_landscape_minimization}`{=latex}) using this gradient (this is the key)! This process is visualized in Figures `\ref{fig:energy_landscape_minimization}`{=latex}, `\ref{fig:model_architecture}`{=latex}. This gradient can be seen as the opposite of the noise (e.g., denoising) and therefore EBMs have strong relations with Diffusion models predicting the noise. EBMs can be seen as a generalization of diffusion models, where diffusion models are predicting the gradient of the energy function/scalar (more on this in Section `\ref{sec:additional_related_work}`{=latex}).

**Second question:** *How do we train an EBM?* Generally, models are trained over a dataset of $x$ and $y$ pairs, but now there are several different possible $y$ values that can be associated with any given $x$ value---so how does that work?

It turns out that all the training techniques for EBMs boil down to two main categories: contrastive and regularizing approaches [@dawid2024introduction].

Contrastive approaches are more common for EBMs and are easier to rationalize about due to their similarity to GAN discriminators. The idea behind contrastive approaches is to push down on the energy of positive samples (i.e., the true data), and to push up on the energy of negative samples. While these positive samples are easy to rationalize about, as they are just the true data, the difficulty of contrastive EBMs is finding negative samples. Several approach exist, such as GANs, which use a generator to amortize negative sample generation, or running MCMC (similar to optimization) for some time. However, as discussed in Section `\ref{sec:learning_approach}`{=latex}, such approaches don't scale well due to the curse of dimensionality.

Therefore, to achieve a scalable EBM approach, we train EBMs through an optimization procedure (which has strong resemblance to Langevin Dynamics). That is, EBMs are trained to, starting from an initial prediction, optimize predictions to the ground truth solution (shown in Figures `\ref{fig:energy_landscape_minimization}`{=latex}, `\ref{fig:model_architecture}`{=latex}). This pushes the energy landscape to be locally convex surrounding the ground truth solution, thereby regularizing the energy landscape to have low energy only on the true data. As mentioned in [@wang2023energy], this can be seen as being similar to denoising score matching [@vincent2011connection].

Commonly, when people learn about EBMs and the iterative denoising/optimization procedure performed during training, they think of diffusion models, so we include a more in depth comparison between the two in Section `\ref{sec:diffusion_vs_ebm}`{=latex}.

```{=latex}
\scriptsize
```
![Feed Forward Model](images/general_feed_forward_model_arch.png){#fig:general_feed_forward_model_arch height="0.15\\textheight"}

![Energy-Based Model](images/general_ebm_arch.png){#fig:general_ebm_arch height="0.15\\textheight"}

Energy-Based Model Types
------------------------

Because EBMs are a broad modeling framework, and can generalize many existing approaches, we aim to provide precise language to distinguish EBM types. We broadly classify the EBMs described throughout this paper as **explicit EBMs**, meaning they explicitly define an energy function over inputs as the entire function being learned. In other words, explicit EBMs directly map all variables (inputs) to a single scalar energy as the output of the neural network. We define these in contrast to **implicit EBMs**, or EBMs where the energy function is not the learned model but rather some implicit definition of the learned model. For example, with diffusion models, the energy function is implicitly defined by the learned score network ($s_\theta(x,t)$) as the following:\
$$\begin{aligned}
\nabla_x E(x,t) &= -\,s_\theta(x,t)\,\\
E(x,t) &= -\int^x s_\theta(u,t)\,\mathrm{d}u \;+\; C(t)\,.\end{aligned}$$\
Other notable examples of implicit EBMs include Hopfield Networks [@hopfield1982neural], RNNs [@geiping2025scaling], and Boltzmann machines.

Energy-Based Model Frequently Asked Questions (FAQ)
---------------------------------------------------

-   **What is energy/compatibility, what does it represent, and how is it learned? What energy corresponds to what probability?**\
    Energy is just a learned compatibility score between $x$ and $y$ (lower means more likely). The EBMs described in this paper learn it implicitly as described in Section `\ref{sec:learning_approach}`{=latex} such that the true data (good pairs) have low energy and bad pairs (non-data) have higher energy. Probabilities follow:\
    $$\begin{aligned}
            p_\theta(x) = \frac{e^{-E_{\theta}(x, y)}}{Z(\theta)} \\
            p_\theta(x, y)\propto e^{-E_\theta(x,y)}
        \end{aligned}$$

    so the energy $E$ is essentially the (unnormalized) negative log-likelihood up to an additive constant. The term compatibility is just a term used for intuition.

-   **Does training EBMs require a full Hessian calculation?**\
    No---the approach described in the paper only requires Hessian-vector products. That makes training only about a constant $1.66\times$ as expensive as a vanilla feed-forward model given everything else remains constant and you use a single step.

-   **Why is low energy good (and high energy bad)? Why not just use probability?**\
    Low energy is good because of the negative exponential. The reason we don't use probabilities is avoiding normalized probabilities makes the problem much more tractable in real-world high-dimensional continuous state spaces by removing the focus on explicit normalization via regularizing the partition function. EBMs come from a long line of work in statistical physics.

-   **Is it okay that energies are unnormalized probabilities?**\
    Yes, for most real-world applications, you only ever need sample **relative likelihood** comparison; it's significantly less common to need the exact likelihood of samples. An example of this is reward models, which can be seen as EBMs (just multiplying the reward by $-1$), where all that really matters is the relative reward for choosing which sample to use or which behavior to perform.

-   **Is it fine to not do Maximum Likelihood Training?**\
    Contrary to what your intuition may say, the answer is **yes**! For most real-world distributions, data lies completely concentrated on a very thin manifold with no defined distribution outside of this manifold. Thus, directly doing Maximum Likelihood Estimation (MLE) training would push EBMs to have low energy on the true data manifold and then infinite energy off that manifold (as the probability of such samples is $0$). We don't want this as it would make the score (gradient of the energy function) undefined and the energy landscape untraversable---so not doing MLE makes the problem tractable.

Energy-Based Transformers (EBTs) Tutorial {#sec:tutorial}
=========================================

Improving Stability and Scalability
-----------------------------------

Energy-Based Models are notorious for instability during training [@du2019implicit; @du2020improved; @li2023learning; @arbel2020generalized]. Therefore, we experiment with several different hyperparameters to increase the stability and scalability of `\pa`{=latex}s and EBMs in general.

### Optimization Step Size and Stability

We found that the step size for gradient descent updates of predictions ($\alpha$) was one of the primary factors affecting the stability of `\pa`{=latex}s. Thus, for S1 models, we make the step size a learnable parameter (this is not the case for S2 models). We calculate its learning rate by multiplying the model's learning rate by the step size learning rate multiplier. We found that the values for the step size have a large effect on the magnitude of gradients generated for the optimization of predictions. This is because the step size is directly multiplied by the prediction gradients. Particularly, a smaller step size results in larger generated gradients, whereas a larger step size results in smaller gradients. Therefore, the step size needs to be tuned per modality, as the update required for predictions depends on data. It's also worth noting that we do not weight decay the step size in any of the models.

We also found that a relatively high optimization step size was necessary ($30,000$ for video and between $5$ and $500$ for text). Without a high optimization step size, gradient magnitudes continued to increase throughout training, resulting in unstable training dynamics.

### Architecture Stability

For autoregressive models, we found that simply prepending a learnable \`\`step" embedding to sequences significantly improved scalability and stability, especially for S1 models. This step embedding mapped a discrete step index (i.e., step 0, 1, etc.) of the current optimization step to an embedding the same dimension as the model's embedding. We believe this helped improve stability by enabling the accumulation of attention mass, as well as enabling less steep energy landscapes conditioned on the optimization step.

Additionally, we experimented with adaptive layer normalization from the DiT architecture, but we found that the timestep embedding worked better. We also experimented with several different normalization and initialization approaches, where we found that the standard Llama2 [@touvron2023llama] architecture and initialization worked best. This involves using RMSNorm, Xavier init [@glorot2010understanding], SwiGLU MLP [@shazeer2020glu], and RoPE [@su2024roformer].

### System 2 Thinking Hyperparameter Stability

For S2 models, we found certain hyperparameters to be essential for the stability and scalability of models. First, we found that using a lower number of optimization steps resulted in more stability, as using more optimization steps necessitates longer gradient chains. Additionally, we found that the strategy used for the randomization of $\alpha$ to be very important for stability. Particularly, we found that randomizing $\alpha$ with a single value for an entire batch resulted in issues with training convergence. We believe this is because using the same value for every element in the batch resulted in high-variance gradients. Thus, when randomizing $\alpha$ differently for every batch and sequence element, we found training convergence to be more stable. It's possible that randomizing the number of optimization steps would yield similar results in reducing gradient variance, however, we did not experiment with such a configuration.

### Clamping Optimization Gradients for Stability

We found that clamping gradients of the energy function with respect to predictions (or prediction update gradients) could help improve training stability at the cost of some slight reductions in convergence speed. We do not conduct any experiments in the paper with clamped prediction gradients as we found that the other hyperparameters were sufficient for stable training, however, it's a potentially useful trick worth discussing.

### Normalizing Data for Stability

We found that normalizing/standardizing input data was crucial for the stability of EBTs. An example of this was in our NLP experiments, where we ran experiments with and without normalizing probability distributions. The experiments with unnormalized distributions often had extreme activations as well as large loss spikes, whereas the experiments with normalized distributions (by applying softmax) were stable.

How to Train Your EBT {#sec:how_to_learn_ebt}
---------------------

The hyperparameters used for training `\pa`{=latex}s are *extremely* important and can often be highly sensitive towards performance, so here we offer a guide toward hyperparameter tuning. First, we recommend starting off with training S1-EBTs, which are the easiest and most stable variant of EBTs not designed for System 2 Thinking. Then, once S1-EBTs can be trained succesfully and scaled, we recommend changing the hyperparameters gradually towards the hyperparameters used for S2 models (we say gradually here as occasionally these parameters can cause instability and require additional tuning).

When training S1 EBTs, we recommend tuning hyperparameters in the following order:\

-   First, tune standard hyperparameters such as Learning Rate (LR), batch size, etc. Having a high batch size helps with stability by making gradients less noisy (because you initialize predictions from random noise this makes gradients noisier).\

-   Second, start tuning S1-specific hyperparameters---primarily alpha and its LR multiplier (we recommend keeping its LR multiplier around $3x$ the value of alpha) and then tuning the number of optimization steps.

-   Third, potentially tune whether the step size is learnable and try other EBT architectures (inspired by DiT [@peebles2023scalable] we tried a time embedding as well as adaptive layer normalization).

Once you have tuned these, the model should be stable and fine for most use cases. At which point, if you are desiring System 2 capabilities, you can proceed to the S2 models `\ref{sec:how_to_think_ebt}`{=latex}. Some potential metrics to monitor and look out for include the gradient magnitudes (if these increase too much or spike a lot that's a bad sign) and the gap between the initial and final energy after optimization (if this is too high or low it could be a sign your model's alpha value needs to be adjusted).

Following [@wang2023energy], we give pseudocode for training `\pa`{=latex}s in natural language for language modeling (Listing `\ref{lst:ebt_nlp_training}`{=latex}) as well as in computer vision for autoregressive video modeling (Listing `\ref{lst:ebt_cv_training}`{=latex}). The pseudocode is primarily for S2 models without any energy landscape regularization techniques. The first primary design decision in the presented pseudocode is whether or not to detach predictions in between steps. Not detaching predictions in between steps allows for more \`\`Thinking Time" before making predictions, but makes the gradient computation graph longer and therefore increases the likelihood of stability issues with gradients. Similarly, calculating the loss at every step versus solely the last step enables model to \`\`think for longer" before needing to make accurate predictions, and therefore affects the convexity and smoothness of the energy landscape. For S2 models, we found that **not** detaching between steps was best, and similarly that calculating the loss only at the last step was best. For S1 models, we found the opposite to be most stable. Generally, if one is calculating the loss only at the last step, then one should not detach between steps as it's best if the gradient propagates to previous steps in this case. For more details on these techniques, we refer the reader to the source code as well as Section `\ref{sec:how_to_think_ebt}`{=latex}.

``` {#lst:ebt_nlp_training .python language="Python" caption="\\textbf{Autoregressive Language Model Training Pseudocode in PyTorch}" label="lst:ebt_nlp_training"}
# make sure to enable gradient tracking
with torch.set_grad_enabled(True):
    loss_fn = nn.CrossEntropyLoss(weight=None, ignore_index=tokenizer_pad_token_id)

    context_embeddings = self.embeddings(input_ids[:, :-1]) # B, S, D
    next_tokens = input_ids[:, 1:]
    next_embeddings = self.embeddings(next_tokens) # B, S, V; are just used for shaping next tensor
    predicted_distributions = torch.randn_like(next_embeddings) # B, S, V; initialize predictions as random

    for _ in range(num_steps):
        # Can optionally detach predicted distributions so that no gradient flows through past steps
        # predicted_distributions = predicted_distributions.detach()

        predicted_embeddings = self.vocab_to_embed(softmax(predicted_distributions)) # B, S, D; need to proj. to embed space for transformer to work in, use linear layer, weighted sum, etc
        

        all_embeddings = torch.cat((context_embeddings, predicted_embeddings), dim=1) # B, 2S, D
        predicted_energies = self.transformer(all_embeddings) # B, S, 1; this returns only energies for the predicted_embeddings
        
        # Compute the gradient of predicted energies w.r.t. predicted distributions
        predicted_distributions_grad = torch.autograd.grad(
            predicted_energies.sum(), 
            predicted_distributions, 
            create_graph=True
        )[0] # B, S, V
        
        # Perform gradient descent w.r.t. the energy function where self.alpha is the optimization step size
        predicted_distributions = predicted_distributions - self.alpha * predicted_distributions_grad
        
    # Calculate cce loss based on predicted and ground truth distributions, optionally at each optimization step or only at the end
    cce_loss = loss_fn(predicted_distributions, next_tokens)
```

``` {#lst:ebt_cv_training .python language="Python" caption="\\textbf{Autoregressive Video Model Training Pseudocode in PyTorch}" label="lst:ebt_cv_training"}
# make sure to enable gradient tracking
with torch.set_grad_enabled(True):
    loss_fn = torch.nn.SmoothL1Loss(beta=1.0) # use whichever loss function desired

    context_embeddings = embeddings[:, :-1] # B, S, D
    next_embeddings = embeddings[:, 1:] # B, S, D
    predicted_embeddings = torch.randn_like(next_embeddings) # B, S, D; initialize predictions as random

    for _ in range(num_steps):
        # Can optionally detach embeddings so that no gradient flows through past steps
        # predicted_embeddings = predicted_embeddings.detach()

        all_embeddings = torch.cat((context_embeddings, predicted_embeddings), dim=1) # B, 2S, D
        predicted_energies = self.transformer(all_embeddings) # this returns only energies for the predicted_embeddings # B, S, 1
        
        # Compute the gradient of predicted energies w.r.t. predicted embeddings
        predicted_embeddings_grad = torch.autograd.grad(
            predicted_energies.sum(), 
            predicted_embeddings, 
            create_graph=True
        )[0] # B, S, D
        
        # Perform gradient descent w.r.t. the energy function where self.alpha is the optimization step size
        predicted_embeddings = predicted_embeddings - self.alpha * predicted_embeddings_grad
        
    # Calculate reconstruction loss based on predicted and ground truth embeddings, optionally at each optimization step or only at the end
    reconstruction_loss = loss_fn(predicted_embeddings, next_embeddings)
```

How to Think Using Your EBT {#sec:how_to_think_ebt}
---------------------------

Once an S1 EBT has been trained, we recommend tuning the System 2 hyperparameters in the following manner:\

-   Remove detaching tensors between optimization steps and add loss truncation so the loss is only calculated at the final step.

-   Then, tune alpha, as it's by far the most important EBT-specific hyperparameter. But, do not make it learnable.

-   Next, tune the number of optimization steps, including potentially a minimum and maximum number when randomizing the number of steps.

-   Then add a replay buffer, Langevin Dynamics, and eventually a randomized alpha (step size). Tune all of these in tandem while tweaking the earlier parameters (particularly alpha).

It's possible that randomizing the number of steps for each sample within a batch would work better (similar to randomizing the alpha value within a batch). Additionally, it's worth mentioning that all optimization steps are performed along the same energy landscape (same time embedding condition), unlike with S1-EBTs using multiple time steps.

[^1]: We acknowledge that these are not comprehensive for achieving System 2 Thinking, but rather, a good first step (more info is in Section `\ref{sec:thinking_formalization}`{=latex}).

[^2]: It's important to note that we refer to this dynamic computation capability at the granularity of **each** prediction being made, meaning current LLMs built with traditional AR transformers/RNNs cannot dynamically allocate compute per token generated as they have a finite depth/width and are only updated with new text tokens. Please check Section `\ref{sec:counterarguments}`{=latex} for more information.

[^3]: We acknowledge that there are approaches to achieve uncertainty with the models discussed, such as Mixture Density Networks [@bishop1994mixture] as well as score-based diffusion models [@song2020score]. However, these approaches have seen less widespread success and scalability than the current dominant approaches.

[^4]: In the case of self-supervised learning, $x$ is some unmasked portion of the original $x$ and $\hat{y}$ is the masked portion.

[^5]: Here, Autoregressive and Bidirectional refer to the procedure for generation. It's worth noting that autoregressive models are compatible with bidirectional attention, as in [@li2025autoregressive].

[^6]: For more information on this perspective please refer to Section `\ref{sec:scaling_law_extra_info}`{=latex}

[^7]: The FLOP calculation is nuanced and depends on specific hyperparameters. For more information please refer to Section `\ref{sec:flop_calc}`{=latex}.

[^8]: Because we pretrained language models from scratch, and are unable to train models the size of modern foundation models, we find models did not benefit from inference time techniques such as Chain-of-Thought. However, we expect both `\pa`{=latex} and Transformer++ models to benefit equally from existing techniques.

[^9]: For simplicity we often use the term matrix to refer to slices of tensors.

[^10]: We note that it's possible to make each prediction not be independent, however, not making every prediction independent would mean that there is stochasticity for each prediction caused not only by its initial value but also by all initial values of all previous predictions.

[^11]: In practice, we shared the weights for the Q/K/V matrices for both observations and predictions to enable a one-to-one comparison to existing feed-forward transformers. It would be interesting to experiment with using different weight matrices to determine if that improves convergence and generalization at the cost of more parameters.

[^12]: This recursive application was only performed during testing due to the distribution shift.
