Energy-Based Transformers are Scalable Learners and Thinkers

Source

Raw Markdown: paper_ebt-2025.md
PDF: paper_ebt-2025.pdf
Preprint: arXiv 2507.02092
Official project website: energy-based-transformers.github.io
Official blog post: Energy-Based Transformers are Scalable Learners and Thinkers
Official code: github.com/alexiglad/EBT

Core Claim

Energy-Based Transformers (EBTs) train a Transformer as an explicit energy-based model over a context and candidate prediction. Instead of producing the prediction in one forward pass, an EBT assigns an energy score to the candidate and refines the candidate by gradient descent on that energy.

The paper’s central claim is that this makes dynamic compute, uncertainty estimation, and prediction verification part of the unsupervised pretraining interface, rather than something added later through a reward model, chain-of-thought, or a task-specific verifier.

Key Contributions

Defines EBTs as Transformer implementations of explicit EBMs for autoregressive and bidirectional prediction.
Frames “thinking” as optimization of candidate predictions under a learned energy landscape: $\hat{y}_{i+1} = \hat{y}_{i} - \alpha \nabla_{\hat{y}_{i}} E_{\theta}(x, \hat{y}_{i})$.
Uses a single model as both verifier and implicit generator: the forward pass scores compatibility, and the gradient with respect to the prediction updates the candidate.
Adds energy-landscape regularization for scalable thinking: replay buffer, Langevin-style noise, randomized optimization step size, and randomized number of optimization steps.
Evaluates EBTs against Transformer++ for autoregressive language and video prediction, and against DiT for bidirectional image denoising.
Provides a practical tutorial section and PyTorch code for EBT training and inference.
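The regularization ingredients listed above can be illustrated with a toy loop. This is a minimal sketch, not the paper's training code: the quadratic toy energy, the helper names (`toy_energy_grad`, `regularized_refine`, `pick_start`), and every numeric setting are assumptions chosen only for illustration, and no model parameters are trained here.

```python
import random
from collections import deque

def toy_energy_grad(y, target=2.0):
    """Gradient w.r.t. y of the toy energy 0.5 * (y - target)**2 (hypothetical)."""
    return y - target

def regularized_refine(y_init, rng, base_alpha=0.2, alpha_jitter=0.5,
                       min_steps=2, max_steps=6, noise_scale=0.01):
    """Refine a candidate with a randomized step size, a randomized number
    of steps, and Langevin-style Gaussian noise added to each update."""
    alpha = base_alpha * rng.uniform(1.0 - alpha_jitter, 1.0 + alpha_jitter)
    n_steps = rng.randint(min_steps, max_steps)
    y = y_init
    for _ in range(n_steps):
        noise = noise_scale * rng.gauss(0.0, 1.0)
        y = y - alpha * toy_energy_grad(y) + noise
    return y, alpha, n_steps

def pick_start(buffer, rng, replay_prob=0.5):
    """Start from a previously refined candidate (replay) or from fresh noise."""
    if buffer and rng.random() < replay_prob:
        return rng.choice(list(buffer))
    return rng.gauss(0.0, 1.0)

rng = random.Random(0)
buffer = deque(maxlen=8)   # bounded replay buffer of refined candidates
history = []
for _ in range(20):
    y0 = pick_start(buffer, rng)
    y, alpha, n_steps = regularized_refine(y0, rng)
    buffer.append(y)
    history.append((alpha, n_steps, y))
```

In an actual training setup, a loss on the refined candidate would be backpropagated into the energy model; the sketch only shows how the randomization and replay could be wired together.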

Method Notes

EBT changes the output interface. A standard feed-forward Transformer learns context -> prediction. EBT learns context, candidate_prediction -> energy, then searches the candidate-prediction space for a low-energy prediction. This makes each token, image patch, or future frame a small optimization problem.
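The interface change can be sketched with a scalar toy problem. The energy function below is a hand-written stand-in (low when the candidate is near `3 * x`), not a learned Transformer, and the names `energy`, `energy_grad_y`, and `think` are hypothetical; the point is only that prediction becomes a search over candidates rather than a single function evaluation.

```python
def energy(x, y):
    """Toy energy: low when candidate y is compatible with context x (here, y near 3 * x)."""
    return 0.5 * (y - 3.0 * x) ** 2

def energy_grad_y(x, y):
    """Analytic gradient of the toy energy with respect to the candidate y."""
    return y - 3.0 * x

def think(x, y_init, alpha=0.5, n_steps=10):
    """Refine a candidate by gradient descent on the energy; return it with the energy trace."""
    y = y_init
    trace = [energy(x, y)]
    for _ in range(n_steps):
        y = y - alpha * energy_grad_y(x, y)
        trace.append(energy(x, y))
    return y, trace

y_hat, trace = think(x=2.0, y_init=0.0)
```

With a real EBT the gradient would come from automatic differentiation through the model, and the number of steps becomes a knob for how much compute to spend on a given prediction.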

For autoregressive language modeling, the implementation must avoid information leakage while parallelizing predictions. The paper doubles the sequence representation into observed and predicted states and uses a specialized causal attention pattern so every predicted state can attend to the allowed observed context and itself, but not future targets.
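One way to picture the attention pattern is as a boolean mask over a doubled sequence. The layout below is an assumption made for illustration (observed states first, predicted states second, with predicted state `i` standing for the candidate that follows observed token `i`); the paper's exact indexing and implementation may differ.

```python
def ebt_attention_mask(n):
    """Build a 2n x 2n mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout (hypothetical): positions 0..n-1 are observed states,
    positions n..2n-1 are predicted states.
    """
    size = 2 * n
    mask = [[False] * size for _ in range(size)]
    for i in range(n):
        for j in range(i + 1):
            mask[i][j] = True        # observed i attends to observed j <= i (causal)
            mask[n + i][j] = True    # predicted i attends to observed j <= i
        mask[n + i][n + i] = True    # predicted i attends to itself
    return mask

mask = ebt_attention_mask(3)
```

Under this layout, predicted state 0 cannot see observed token 1, which is the target it is trying to predict, and no predicted state sees another predicted state.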

For System 2 style EBTs, training backpropagates through the optimization path and therefore uses gradient-of-gradient computations. The paper argues that Hessian-vector products keep this theoretically linear in model size, but the practical cost remains higher than a standard Transformer pass.
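To see where the gradient-of-gradient term comes from, consider a single optimization step. This is a sketch under simplifying assumptions that are not taken from the paper: one step only, an initial candidate $\hat{y}_0$ that does not depend on $\theta$, and a generic training loss $\mathcal{L}$ against the target $y$.

$$
\hat{y}_1 = \hat{y}_0 - \alpha \nabla_{\hat{y}} E_{\theta}(x, \hat{y}_0)
$$

$$
\nabla_{\theta} \mathcal{L}(\hat{y}_1, y) = -\alpha \, \nabla_{\theta} \left[ v^{\top} \nabla_{\hat{y}} E_{\theta}(x, \hat{y}_0) \right], \qquad v = \frac{\partial \mathcal{L}}{\partial \hat{y}_1} \ \text{held constant}
$$

The bracketed quantity is a scalar, so differentiating it with respect to $\theta$ costs one additional backward pass and never materializes the full matrix of mixed second derivatives. With more than one step, each later candidate also depends on the earlier ones, which adds Hessian-vector products with respect to $\hat{y}$ along the optimization path.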

Evidence And Results

The paper reports language-model scaling experiments on RedPajamaV2 with a GPT-NeoX tokenizer, comparing EBTs to the Transformer++ recipe across data, batch size, depth, parameters, FLOPs, and embedding dimension. It reports that EBTs have higher scaling rates across all measured axes, with the headline result of up to 35% higher scaling rate.

For inference-time thinking, the paper evaluates two mechanisms: running more optimization steps per prediction, and self-verification, in which several candidate predictions are optimized and the lowest-energy one is selected. It reports up to 29% language-model performance improvement from additional forward passes, with larger gains on data that is further out of distribution.
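The self-verification mechanism can be sketched as best-of-N selection over refined candidates. The double-well energy below is a hand-picked toy with two basins of different depth, and the function names and settings are hypothetical; a real EBT would use its learned energy and gradients from automatic differentiation.

```python
def energy(y):
    """Toy double-well energy: minima near y = -1 and y = +1, the left one lower."""
    return (y * y - 1.0) ** 2 + 0.3 * y

def energy_grad(y):
    """Analytic derivative of the toy energy."""
    return 4.0 * y * (y * y - 1.0) + 0.3

def refine(y, alpha=0.05, n_steps=200):
    """Plain gradient descent on the energy from a given starting candidate."""
    for _ in range(n_steps):
        y = y - alpha * energy_grad(y)
    return y

def best_of_n(starts):
    """Refine every starting candidate, then keep the one with the lowest energy."""
    refined = [refine(y0) for y0 in starts]
    energies = [energy(y) for y in refined]
    best_index = min(range(len(refined)), key=lambda i: energies[i])
    return refined[best_index], energies[best_index], refined, energies

best_y, best_energy, refined, energies = best_of_n([-1.5, -0.5, 0.5, 1.5])
```

Candidates that start on the right-hand side settle in the shallower basin, so a single refinement can return a worse answer; scoring all refined candidates with the same energy function picks the deeper basin.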

For downstream language generalization, the paper reports a comparison where EBT has worse pretraining perplexity than Transformer++ but better perplexity on GSM8K, BigBench Elementary Math QA, and BigBench Dyck Languages, with SQuAD as the exception.

For continuous modalities, the paper reports faster scaling than Transformer++ on Something-Something V2 next-frame prediction in SD-XL VAE feature space. In bidirectional image denoising on COCO 2014, it reports better PSNR/MSE than a DiT baseline on in-distribution and out-of-distribution noise, while using far fewer forward passes.
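For readers less familiar with the denoising metrics, PSNR is a logarithmic transform of MSE, so the two always rank results in the same order. The sketch below uses the standard definition; the peak value of 1.0 is an assumption about pixel scaling and is not taken from the paper.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length sequences of pixel values."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value**2 / MSE)."""
    err = mse(a, b)
    if err == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / err)

clean = [0.0, 0.0]
denoised = [0.1, 0.1]
score = psnr(clean, denoised)
```

Here the MSE is 0.01, which corresponds to a PSNR of about 20 dB; lower MSE means higher PSNR.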

Limitations

Training and inference require extra gradient computation and add sensitive hyperparameters, especially optimization step size and number of optimization steps.
The strongest experiments are still far below frontier foundation-model scale; the paper reports scaling up to roughly 800M parameters and explicitly leaves larger training for future work.
Some results are extrapolative: the claim that EBTs would beat Transformer++ at modern foundation-model scale follows measured scaling trends, not a direct trillion-token training run.
The paper reports a many-mode failure case for text-to-image style generation: the current convex-energy-landscape training assumption can merge nearby modes and produce blurry samples.
The current experiments do not include numeric time series, logged actions, treatments, control inputs, interventions, or closed-loop control benchmarks.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Dynamic compute allocation	partially closes	EBT gives an explicit per-prediction compute interface: optimize candidates for more steps or sample several candidates and select the lowest-energy one.	Needs numeric time-series, streaming, and serving-cost evaluations.
Multi-modal future distributions	adjacent	A learned energy score over candidate predictions is a natural compatibility surface for multiple plausible futures.	Current EBT training struggles with many-mode generation; needs calibrated scenario distributions rather than only low-energy point choices.
Control and counterfactuals	adjacent	The paper sketches world models where current context, future state, and future actions are jointly scored, then actions are optimized by energy minimization.	No action-conditioned experiment is provided; needs logged candidate actions, interventions, and outcome rollouts.
Scaling substrate	adjacent	The paper reports better scaling-rate trends than Transformer++ on non-time-series tasks spanning text, video, and image denoising.	Needs time-series-specific scaling laws and comparison against TSFM backbones under matched compute.

Open Questions

Can EBT-style energy minimization become a practical inference-time compute mechanism for dense numeric forecasting, generation, and editing?
How should EBTs represent multiple plausible futures without collapsing nearby modes into one averaged low-energy basin?
Can an action-conditioned EBT world model optimize candidate control inputs directly while preserving safety constraints and uncertainty?
What serving contract makes sense for EBTs: always optimize every prediction, optimize only high-energy predictions, or use EBT as a slow verifier above a fast feed-forward model?
