Improved Baselines with Representation Autoencoders

Source

Raw Markdown: paper_raev2-2026.md
PDF: paper_raev2-2026.pdf
Preprint: https://arxiv.org/abs/2605.18324
Official project page: https://raev2.github.io/
Official code: https://github.com/nanovisionx/RAEv2
Official Hugging Face collection: https://huggingface.co/collections/nyu-visionx/raev2
Official X thread: Jaskirat Singh thread

Credibility

This is a very recent arXiv preprint posted on 2026-05-18. The official project page lists authors from Adobe Research, ANU, and New York University, and the release includes an official project page, code, Hugging Face models, artifacts, and data. It is credible enough to ingest as an important source, but it is not yet peer reviewed, so headline benchmark claims should be treated as author-reported until independently replicated.

Core Claim

RAEv2 simplifies representation autoencoders by aggregating multiple pretrained vision-encoder layers, combining RAE with REPA instead of treating them as substitutes, and reusing the REPA prediction head for internal guidance. The paper reports better reconstruction, better generation, and over 10x faster convergence than the original RAE, with validation on ImageNet-256, text-to-image generation, and action-conditioned navigation world-model rollouts.

Key Contributions

Defines a generalized representation autoencoder where the encoder output is a multi-layer sum of the last $K$ hidden states rather than only the final encoder layer.
Shows RAE and REPA have complementary roles: RAE contributes global semantic latent quality, while REPA improves spatial structure in intermediate diffusion features.
Recasts REPA as x-prediction in RAE latent space, letting the REPA head act as the weaker internal-guidance branch without a second AutoGuidance model or a separate CFG forward pass.
Introduces $EP_{FID@k}$ as an efficiency metric: the number of training epochs needed for unguided gFID to fall below a target threshold $k$.
Extends the recipe to text-to-image generation and to a navigation world-model setting conditioned on egocentric actions.
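The $EP_{FID@k}$ metric listed above can be illustrated with a minimal sketch. It assumes a training log of (epoch, unguided gFID) pairs; the function name `epochs_to_reach_fid` and the toy curve are hypothetical and are not taken from the paper or its code release.

```python
from typing import Optional, Sequence, Tuple


def epochs_to_reach_fid(curve: Sequence[Tuple[int, float]], k: float) -> Optional[int]:
    """Return the first logged epoch whose unguided gFID is strictly below k.

    Returns None if the curve never drops below the threshold.
    """
    for epoch, gfid in sorted(curve):
        if gfid < k:
            return epoch
    return None


# Hypothetical evaluation log (made-up numbers, NOT results from the paper).
toy_curve = [(10, 6.0), (20, 3.1), (30, 2.2), (40, 1.9), (50, 1.7)]

ep_fid_at_2 = epochs_to_reach_fid(toy_curve, k=2.0)
print(ep_fid_at_2)  # 40
```

Under this reading, the metric is only as fine-grained as the evaluation schedule: with checkpoints every 10 epochs, the reported value is rounded up to the next evaluated epoch.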

Method Notes

The core multi-layer-sum representation is:

$$x = \sum_{\ell = L - K + 1}^{L} z_{\ell}$$
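As a minimal sketch of the multi-layer sum, assuming the encoder's per-layer hidden states are already available as equal-length vectors: the helper `multi_layer_sum` and the toy values are hypothetical, and a real implementation would operate on token-by-channel tensors from the pretrained encoder rather than Python lists.

```python
from typing import List

Vector = List[float]


def multi_layer_sum(hidden_states: List[Vector], k: int) -> Vector:
    """Elementwise sum of the last k hidden states: x = sum_{l=L-K+1}^{L} z_l."""
    if not 1 <= k <= len(hidden_states):
        raise ValueError("k must be between 1 and the number of layers")
    last_k = hidden_states[-k:]
    return [sum(values) for values in zip(*last_k)]


# Toy hidden states for L = 4 layers, each a 3-dimensional feature (made up).
z = [
    [1.0, 0.0, 2.0],
    [0.5, 1.0, 0.0],
    [2.0, 2.0, 1.0],
    [1.0, 3.0, 1.0],
]

print(multi_layer_sum(z, k=1))  # [1.0, 3.0, 1.0]  (final layer only)
print(multi_layer_sum(z, k=3))  # [3.5, 6.0, 2.0]
```

Setting `k=1` recovers the final-layer-only representation, so $K$ acts as a single knob between the original choice and deeper aggregation.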

RAEv2’s self-guidance writes the full and REPA-head predictions in the same RAE latent space:

$$\hat{x}_{\text{guided}} = \hat{x}_{\text{full}} + w \, (\hat{x}_{\text{full}} - \hat{x}_{\text{repa}})$$
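The guidance rule above is a plain extrapolation away from the weaker REPA-head prediction. A minimal sketch, assuming both predictions are already expressed in the same latent space and flattened to vectors; `guided_prediction` and the toy numbers are hypothetical and do not reflect the official code.

```python
from typing import List


def guided_prediction(x_full: List[float], x_repa: List[float], w: float) -> List[float]:
    """Compute x_guided = x_full + w * (x_full - x_repa) elementwise."""
    if len(x_full) != len(x_repa):
        raise ValueError("predictions must have the same shape")
    return [f + w * (f - r) for f, r in zip(x_full, x_repa)]


# Toy predictions (made up): the full model and the weaker REPA-head branch.
x_full = [1.0, 2.0]
x_repa = [0.5, 3.0]

print(guided_prediction(x_full, x_repa, w=0.0))  # [1.0, 2.0]  (no guidance)
print(guided_prediction(x_full, x_repa, w=2.0))  # [2.0, 0.0]
```

With `w = 0` the rule returns the full prediction unchanged; larger `w` pushes the output further from wherever the REPA head disagrees with the full model.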

flowchart LR
  VisionEncoder["Pretrained vision encoder layers"]
  MLS["Multi-layer sum over last K layers"]
  RAE["RAE latent space"]
  REPA["REPA head predicts x in RAE space"]
  DiT["DiT / flow model"]
  Guidance["Internal guidance without extra model"]
  NWM["T2I and navigation world-model tests"]

  VisionEncoder --> MLS --> RAE --> DiT --> NWM
  RAE --> REPA --> Guidance --> DiT

For this wiki, the important mechanism is not only “better image generation.” It is the representation-interface lesson: a useful latent for generation and action-conditioned video rollout may need both semantic abstraction and local spatial detail, and the best layer aggregation may be a protocol variable rather than a fixed final-layer choice.

X Discussion Notes

The linked X discussion is useful because it sharpens a mechanism caveat that is easy to miss from the announcement thread alone.

Non-paper discussion: Lucas Beyer reply thread
Lucas Beyer noted that if the summed layers are residual-stream outputs, the operation is not just an unweighted collection of independent block outputs. It effectively reuses earlier residual contributions multiple times, so it behaves like a fixed depth-weighted aggregation, modulo layer norm and implementation details.
Jaskirat Singh confirmed that RAE and RAEv2 treat the representation at the K-th layer as that layer's output on the residual stream, and that the reweighting analogy is directionally correct.
The discussion therefore reframes the simple sum as one point in a larger layer-weighting design space. Plausible next experiments include weighted sums over block outputs, learned coefficients, sparse layer selection, per-encoder sweeps, or depth-aware communication.
A key guardrail from the thread: if the aggregation is learned as a target or loss, it must not collapse to a trivial easy target; it also needs an understanding-preservation constraint such as linear-probe performance.
Saining Xie pointed out that feature-pyramid work explored many fixed and learnable weighting schemes in the 2016-2019 dense-prediction era. That makes the RAEv2 question less “does summing work?” and more “which old feature-pyramid and modern depth-routing recipes transfer to representation autoencoders?”
Alex’s follow-up question connects RAEv2 to MoDA: direct inter-layer communication might encourage layer specialization and potentially reduce the number of layers needed for the same performance.
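The residual-stream point from the thread can be checked numerically. The sketch below assumes an idealized residual stream $h_\ell = h_{\ell-1} + f_\ell$ with no layer norm, which is a simplification and not the encoder's actual architecture. Under that assumption, summing the last $K$ states equals $K \cdot h_{L-K} + \sum_{j=L-K+1}^{L} (L - j + 1)\, f_j$, so earlier block outputs receive larger weights. All names and values are hypothetical; scalars stand in for feature vectors because the identity is linear.

```python
from itertools import accumulate
from typing import List


def residual_stream(h0: int, block_outputs: List[int]) -> List[int]:
    """Return [h_0, h_1, ..., h_L] with h_l = h_{l-1} + f_l (layer norm ignored)."""
    return list(accumulate(block_outputs, initial=h0))


def sum_last_k(h0: int, block_outputs: List[int], k: int) -> int:
    """Direct multi-layer sum over the last k residual-stream states."""
    return sum(residual_stream(h0, block_outputs)[-k:])


def depth_weighted_form(h0: int, block_outputs: List[int], k: int) -> int:
    """Same quantity written as K * h_{L-K} plus depth-weighted block outputs."""
    num_layers = len(block_outputs)
    states = residual_stream(h0, block_outputs)
    base = k * states[num_layers - k]
    weighted = sum(
        (num_layers - j + 1) * block_outputs[j - 1]
        for j in range(num_layers - k + 1, num_layers + 1)
    )
    return base + weighted


# Toy stream (made up): h_0 = 1 and four block outputs f_1..f_4.
h0, f = 1, [2, -1, 3, 5]
print(sum_last_k(h0, f, k=3))           # 17
print(depth_weighted_form(h0, f, k=3))  # 17
```

In this toy case with $K = 3$, block output $f_2$ is counted three times, $f_3$ twice, and $f_4$ once, which is the fixed depth-weighted aggregation described in the thread.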

Evidence And Results

On ImageNet-256, the paper reports gFID 1.06 after 80 epochs, $FD_{r}^{6}$ 2.17 after 80 epochs, and $EP_{FID@2} = 35$ epochs, compared with 177 epochs for the original RAE.
The $K$ sweep shows a controllable reconstruction-generation tradeoff: smaller $K$ can be best for guided generation, while larger $K$ improves reconstruction.
The paper reports that multi-layer-sum RAEv2 preserves ImageNet linear-probe accuracy across $K$, so the reconstruction/generation gains do not obviously destroy global semantic understanding.
In text-to-image experiments, RAEv2 improves the reported GenEval and DPG scores over Flux-VAE and original RAE under the paper’s protocol.
In navigation world modeling on RECON, RAEv2-NWM conditions on four past egocentric frames plus action tokens and reports FVD 105.61, compared with 200.97 for NWM and 312.01 for the RAE baseline.

Limitations

The paper is an arXiv preprint with author-reported results, not a peer-reviewed venue paper yet.
The text-to-image and navigation world-model experiments are at 256x256 resolution, so the source should not be over-read as a solved high-resolution T2I or general robotics world-model result.
The method uses very simple aggregation recipes. The X discussion makes the mechanism less settled: residual-stream summation is a fixed reweighting scheme, not proof that uniform layer aggregation is optimal.
The navigation world-model evidence is visual robotics and RECON-specific. It is useful for the wiki’s action-conditioned world-model analogy, but it does not validate numeric time-series world models, digital-system interventions, or contact-rich manipulation.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Representation quality: semantic state vs dense detail	adjacent	RAEv2 improves the semantic-latent versus reconstruction tradeoff by summing multiple pretrained vision-encoder layers while preserving linear-probe accuracy.	Vision-only evidence; no numeric time-series reconstruction, editing, or calibrated dense-value generation.
Action-conditioned future-state prediction	adjacent	RAEv2-NWM predicts future navigation frames from recent observations and explicit egocentric action tokens.	Visual navigation evidence only; no digital-system actions, broad manipulation validation, or numeric time-series interface.
Counterfactual control and planning	insufficient evidence	The rollout setup conditions on action sequences, but the paper does not evaluate candidate-action optimization or closed-loop planning.	Needs action ranking, counterfactual intervention tests, or planner-in-the-loop evaluation.
Dynamic compute and layer allocation	adjacent	The X discussion identifies fixed residual-layer reweighting and proposes learned, sparse, or depth-attention alternatives.	No experiments yet on learned layer weighting, Mixture-of-Depths-style communication, or matched-FLOPs layer specialization.
Benchmarks: what level of modeling is tested?	warning	$EP_{FID@k}$ separates convergence speed from final gFID and highlights training efficiency.	Need independent replication and stronger decision-facing world-model metrics beyond visual rollout scores.

Links Into The Wiki

Open Questions

Which layer aggregation is best for RAE: fixed residual-stream sums, weighted block outputs, learned coefficients, sparse layer selection, feature-pyramid-style fusion, or depth attention?
Does learned layer selection improve reconstruction and generation without making the target too easy or reducing understanding performance?
Can direct inter-layer communication such as MoDA make pretrained encoders specialize layers more cleanly, reducing the layers needed for RAE-quality generation?
Does the RAEv2-NWM gain transfer from navigation rollouts to contact-rich manipulation or to non-visual action-conditioned time-series world models?

Alex Open Research Wiki
