Next-Embedding Prediction Makes Strong Vision Learners
Source
- Raw Markdown: paper_nepa-2025.md
- PDF: paper_nepa-2025.pdf
Core Claim
NEPA trains visual models to predict future patch embeddings with a simple Transformer objective, avoiding pixel reconstruction, discrete tokens, contrastive loss, and task-specific heads.
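As a sketch of the objective (the exact distance and choice of target layer are assumptions, not taken from the paper): given an image's patch-embedding sequence $e_1, \dots, e_T$, a causal Transformer $f_\theta$ predicts each next embedding from the prefix, with a stop-gradient $\mathrm{sg}(\cdot)$ on the target:

$$\mathcal{L} = \sum_{t=1}^{T-1} \big\lVert f_\theta(e_{1:t}) - \mathrm{sg}(e_{t+1}) \big\rVert_2^2$$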
Key Contributions
- Introduces Next-Embedding Predictive Autoregression for visual SSL.
- Uses causal masking and stop-gradient in an embedding-prediction setup.
- Reports strong ImageNet-1K fine-tuning accuracy and ADE20K semantic-segmentation transfer.
- Frames generative pretraining from embeddings as a modality-agnostic alternative.
Method Notes
NEPA trains a causal Transformer over the sequence of patch embeddings: each position predicts the embedding of the following patch, with causal attention masking and a stop-gradient applied to the target embeddings. There is no pixel reconstruction target, discrete tokenizer, contrastive pairing, or task-specific head; the pretraining signal comes entirely from next-embedding prediction, as in the sketch below.
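A minimal PyTorch-style sketch of one training step under this description; the module sizes, patchification, loss, and choice of targets (raw input patch embeddings rather than deeper representations) are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a NEPA-style training step (assumptions, not the paper's code):
# a causal Transformer over patch embeddings predicts each next-patch embedding,
# with a stop-gradient (detach) on the targets.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NextEmbeddingPredictor(nn.Module):
    def __init__(self, dim=192, depth=2, heads=3, img_size=224, patch_size=16):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patchify with a strided conv: (B, 3, H, W) -> (B, num_patches, dim).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, dim)  # hidden state -> predicted next embedding

    def forward(self, images):
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, T, dim)
        tokens = patches + self.pos_embed
        T = tokens.size(1)
        # Additive causal mask: position t attends only to patches 1..t.
        causal = torch.full((T, T), float("-inf"), device=tokens.device).triu(diagonal=1)
        hidden = self.encoder(tokens, mask=causal)
        preds = self.head(hidden[:, :-1])       # predictions for positions 2..T
        targets = patches[:, 1:].detach()       # stop-gradient on target embeddings
        return preds, targets


if __name__ == "__main__":
    model = NextEmbeddingPredictor()
    images = torch.randn(2, 3, 224, 224)        # dummy batch
    preds, targets = model(images)
    loss = F.mse_loss(preds, targets)           # simple regression distance (assumed)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```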
Evidence And Results
The abstract reports 83.8% (ViT-B) and 85.3% (ViT-L) top-1 accuracy on ImageNet-1K after fine-tuning, plus effective transfer to ADE20K semantic segmentation.
Limitations
NEPA still relies on a stop-gradient to prevent collapse, which makes it a useful point of contrast with SIGReg-centered sources such as LeJEPA.
Links Into The Wiki
- Latent-Space Predictive Learning
- Self-Supervised Representation Learning
- Vision Foundation Models
Open Questions
- Can NEPA be made collapse-resistant without stop-gradient?
- How does next-embedding prediction compare to DINOv3 at larger data/model scales?