Self-Supervised Representation Learning

Summary

The wiki’s SSL thread compares large-scale visual representation learning (DINOv3) with predictive embedding objectives (the JEPA family and NEPA) that avoid reconstructing raw pixels.

What The Wiki Currently Believes

  • DINOv3 is the scaled vision-foundation-model reference point, with strong dense features and broad frozen transfer.
  • LeJEPA argues for a theory-grounded JEPA objective with SIGReg.
  • NEPA shows next-embedding prediction can make strong vision learners without pixels, tokens, contrastive loss, or task-specific heads.
  • VL-JEPA applies predictive embedding learning to vision-language tasks.

Evidence

The corpus suggests a spectrum from large-scale SSL systems to simpler predictive objectives. DINOv3 shows the value of scale and careful training; LeJEPA and NEPA ask whether the objective itself can be simpler and more principled; VL-JEPA extends the predictive approach to vision-language settings.
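The common thread across the JEPA-family objectives above can be sketched in a few lines: embed a context view and a target view, predict the target's embedding from the context's embedding, and take a loss entirely in embedding space, with no pixel reconstruction. This is a minimal illustrative sketch, not any paper's actual method: the linear "encoders" `W_ctx`/`W_tgt`, the predictor `W_pred`, and all dimensions are hypothetical stand-ins for real networks (in practice the target encoder is an EMA copy and receives no gradient).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen purely for illustration.
D_IN, D_EMB = 16, 8

# Linear maps stand in for real encoder networks.
W_ctx = rng.normal(size=(D_IN, D_EMB)) * 0.1   # context encoder
W_tgt = W_ctx.copy()                           # target encoder (EMA copy in practice)
W_pred = np.eye(D_EMB)                         # predictor head (identity here for simplicity)

def jepa_loss(x_context, x_target):
    """Predict the target view's embedding from the context view's embedding.

    The loss lives entirely in embedding space; no pixels are reconstructed.
    """
    z_ctx = x_context @ W_ctx    # embed the visible/context view
    z_tgt = x_target @ W_tgt     # embed the masked/target view (stop-gradient in practice)
    z_hat = z_ctx @ W_pred       # predict the target embedding from the context
    return float(np.mean((z_hat - z_tgt) ** 2))

x = rng.normal(size=(4, D_IN))   # a small batch of "patches"
print(jepa_loss(x, x + 0.01 * rng.normal(size=x.shape)))
```

Without an extra term, objectives like this can collapse to a constant embedding; the design choices the wiki tracks (SIGReg in LeJEPA, next-embedding prediction in NEPA) are different ways of keeping the predicted embedding space non-degenerate.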

Open Questions

  • Which predictive objective best preserves dense spatial structure?
  • How much of DINOv3’s performance comes from scale versus objective design?