Self-Supervised Representation Learning
Summary
The wiki’s SSL thread compares scaled visual representation learning with predictive embedding objectives that avoid raw reconstruction.
What The Wiki Currently Believes
- DINOv3 is the scaled vision-foundation-model reference point, with strong dense features and broad frozen transfer.
- LeJEPA argues for a theory-grounded JEPA objective, with SIGReg regularizing embeddings toward an isotropic Gaussian to prevent collapse.
- NEPA shows that next-embedding prediction can yield strong vision learners without pixel reconstruction, discrete tokens, contrastive losses, or task-specific heads.
- VL-JEPA applies predictive embedding learning to vision-language tasks.
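The common thread in the JEPA-family entries above is that the loss lives in embedding space rather than pixel space. A minimal sketch of that idea, with toy linear encoders and an identity predictor (all names and shapes here are illustrative assumptions, not any paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: linear "encoders" and a predictor.
D_in, D_emb = 32, 16
W_ctx = rng.normal(size=(D_in, D_emb)) / np.sqrt(D_in)  # context encoder
W_tgt = W_ctx.copy()                                    # target encoder (e.g. an EMA copy)
W_pred = np.eye(D_emb)                                  # predictor

def jepa_loss(x_context, x_target):
    """Predict the target view's embedding from the context view's embedding.

    The loss is computed between embeddings: no pixel reconstruction and
    no negative pairs. In a real system the target branch is held fixed
    (stop-gradient / EMA) during training to avoid collapse.
    """
    z_ctx = x_context @ W_ctx
    z_tgt = x_target @ W_tgt      # treated as a constant target
    z_hat = z_ctx @ W_pred
    return np.mean((z_hat - z_tgt) ** 2)

x = rng.normal(size=(8, D_in))                  # 8 "target" views
x_ctx = x + 0.1 * rng.normal(size=x.shape)      # corrupted "context" views
print(jepa_loss(x_ctx, x))
```

NEPA's next-embedding prediction fits the same template if `x_context` is the embedding history up to step t and `x_target` is the embedding at step t+1; LeJEPA would add a regularizer (SIGReg) on the distribution of `z_ctx`.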
Evidence
The corpus suggests a spectrum from large-scale SSL systems to simpler predictive objectives. DINOv3 shows the value of scale and careful training; LeJEPA and NEPA ask whether the objective itself can be simpler and more principled.
Open Questions
- Which predictive objective best preserves dense spatial structure?
- How much of DINOv3’s performance comes from scale versus objective design?