Next-Embedding Prediction Makes Strong Vision Learners

Source

Core Claim

NEPA trains vision models to autoregressively predict future patch embeddings with a simple Transformer objective, avoiding pixel reconstruction, discrete tokens, contrastive losses, and task-specific heads.

Key Contributions

  • Introduces Next-Embedding Predictive Autoregression for visual SSL.
  • Uses causal masking and a stop-gradient on prediction targets in an embedding-prediction setup (sketched after this list).
  • Reports strong ImageNet-1K fine-tuning accuracy and ADE20K semantic-segmentation transfer.
  • Frames generative pretraining on embeddings as a modality-agnostic alternative to pixel reconstruction and discrete-token objectives.
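
To make the setup concrete, the sketch below shows a next-embedding prediction step with causal masking and a stop-gradient on the targets. It is a minimal PyTorch illustration, assuming a generic ViT-style patch embedder upstream; the predictor depth, the raster-order notion of "next", and the MSE loss are assumptions for illustration, not details confirmed by the source.

```python
# Minimal sketch of a next-embedding prediction objective (not the authors'
# exact implementation); module names, sizes, and the MSE loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextEmbeddingPredictor(nn.Module):
    def __init__(self, dim: int = 768, depth: int = 4, heads: int = 12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, dim), e.g. ViT patch embeddings
        # in raster order so that "next" patch is well defined.
        B, N, D = patch_embeddings.shape

        # Causal mask: position i may only attend to patches 0..i.
        causal_mask = torch.triu(
            torch.full((N, N), float("-inf"), device=patch_embeddings.device),
            diagonal=1,
        )
        pred = self.transformer(patch_embeddings, mask=causal_mask)

        # Predict the embedding of patch i+1 from the context up to patch i.
        # The target branch is detached (stop-gradient), so the loss cannot be
        # minimized by dragging the targets toward the predictions.
        target = patch_embeddings[:, 1:, :].detach()
        return F.mse_loss(pred[:, :-1, :], target)
```

In practice the patch embeddings would come from the backbone being trained, so the stop-gradient on the target branch is what this sketch uses to discourage a trivial collapsed solution.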

Method Notes

NEPA belongs to Latent-Space Predictive Learning, Self-Supervised Representation Learning, and Vision Foundation Models.

Evidence And Results

The abstract reports 83.8% and 85.3% top-1 ImageNet-1K accuracy with ViT-B and ViT-L, respectively, after fine-tuning, plus effective transfer to ADE20K semantic segmentation.

Limitations

NEPA still relies on stop-gradient to prevent collapse, which makes it a useful contrast with SIGReg-centered sources such as LeJEPA.

Open Questions

  • Can NEPA be made collapse-resistant without stop-gradient?
  • How does next-embedding prediction compare to DINOv3 at larger data/model scales?