Next-Embedding Prediction Makes Strong Vision Learners

Source

Core Claim

NEPA trains vision models to autoregressively predict future patch embeddings with a simple Transformer objective, avoiding pixel reconstruction, discrete tokens, contrastive losses, and task-specific heads.

Key Contributions

  • Introduces Next-Embedding Predictive Autoregression for visual SSL.
  • Uses causal masking and a stop-gradient on prediction targets in an embedding-prediction setup (sketched after this list).
  • Reports strong ImageNet-1K fine-tuning accuracy and ADE20K semantic-segmentation transfer.
  • Frames generative pretraining on embeddings as a modality-agnostic alternative to pixel reconstruction and discrete-token objectives.
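
To make the setup concrete, the sketch below shows a next-embedding prediction step with causal masking and a stop-gradient on the targets. It is a minimal PyTorch illustration, assuming a generic ViT-style patch embedder upstream; the predictor depth, the raster-order notion of "next", and the MSE loss are assumptions for illustration, not details confirmed by the source.

```python
# Minimal sketch of a next-embedding prediction objective (not the authors'
# exact implementation); module names, sizes, and the MSE loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextEmbeddingPredictor(nn.Module):
    def __init__(self, dim: int = 768, depth: int = 4, heads: int = 12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, dim), e.g. ViT patch embeddings
        # in raster order so that "next" patch is well defined.
        B, N, D = patch_embeddings.shape

        # Causal mask: position i may only attend to patches 0..i.
        causal_mask = torch.triu(
            torch.full((N, N), float("-inf"), device=patch_embeddings.device),
            diagonal=1,
        )
        pred = self.transformer(patch_embeddings, mask=causal_mask)

        # Predict the embedding of patch i+1 from the context up to patch i.
        # The target branch is detached (stop-gradient), so the loss cannot be
        # minimized by dragging the targets toward the predictions.
        target = patch_embeddings[:, 1:, :].detach()
        return F.mse_loss(pred[:, :-1, :], target)
```

In practice the patch embeddings would come from the backbone being trained, so the stop-gradient on the target branch is what this sketch uses to discourage a trivial collapsed solution.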

Method Notes

NEPA belongs to Latent-Space Predictive Learning, Self-Supervised Representation Learning, and Vision Foundation Models.

Evidence And Results

The abstract reports 83.8% and 85.3% top-1 ImageNet-1K accuracy with ViT-B and ViT-L, respectively, after fine-tuning, plus effective transfer to ADE20K semantic segmentation.

Limitations

NEPA still relies on stop-gradient to prevent collapse, which makes it a useful contrast with SIGReg-centered sources such as LeJEPA.

Open Questions

  • Can NEPA be made collapse-resistant without stop-gradient?
  • How does next-embedding prediction compare to DINOv3 at larger data/model scales?