Vision Foundation Models
Summary
Vision foundation models in this wiki are evaluated not only by classification accuracy, but also by dense feature quality, downstream transfer, the usefulness of their latent spaces, and compatibility with generative modeling.
What The Wiki Currently Believes
- DINOv3 is the strongest scaled SSL vision reference, emphasizing frozen versatility and dense feature quality.
- NEPA explores whether next-embedding prediction can be a simple generative-pretraining alternative for vision.
- Prism argues that semantic and pixel encoders occupy different spectral roles.
- Reconstruction or Semantics? shows semantic visual latents can be more useful than reconstruction latents for policy-relevant robotic world models.
- Tuna-2 challenges reliance on pretrained vision encoders by learning pixel embeddings end to end.
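The next-embedding-prediction idea attributed to NEPA above can be sketched in a few lines. This is a toy illustration, not NEPA's actual architecture or training recipe: the linear encoder, the single-step linear predictor, and all shapes (`W_enc`, `W_pred`, 16 patches, 32-dim embeddings) are hypothetical stand-ins chosen to make the objective concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(patches, W_enc):
    """Toy encoder: linear projection of flattened patches to embeddings."""
    return patches @ W_enc

def next_embedding_loss(emb, W_pred):
    """Predict embedding t+1 from embedding t; score with mean-squared error.

    The generative-pretraining analogy: like next-token prediction, but the
    target is the encoder's own embedding of the next patch, not raw pixels.
    """
    pred = emb[:-1] @ W_pred   # predictions for positions 1..T-1
    target = emb[1:]           # ground-truth next embeddings
    return np.mean((pred - target) ** 2)

# Hypothetical sizes: one image as 16 patches of 64 pixel values each.
patches = rng.normal(size=(16, 64))
W_enc = rng.normal(size=(64, 32)) * 0.1   # would be learned in practice
W_pred = rng.normal(size=(32, 32)) * 0.1  # would be learned in practice

emb = encode(patches, W_enc)
loss = next_embedding_loss(emb, W_pred)
```

In a real system both the encoder and the predictor are trained jointly (with some mechanism to prevent embedding collapse); the point here is only that the supervision signal lives in embedding space rather than pixel space.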
Evidence
The corpus does not point to one universal visual representation. Instead, it maps a tradeoff among semantic abstraction, dense spatial fidelity, pixel-level generation, and downstream control.
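One axis of this tradeoff, frozen-feature transfer (the "frozen versatility" credited to DINOv3 above), is commonly measured with a linear probe: freeze the encoder, fit only a linear classifier on its features. A minimal sketch, with a random frozen projection standing in for a pretrained encoder and synthetic data in place of a benchmark; the ridge-regression probe is one standard choice, not the specific protocol of any paper in this wiki.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical benchmark: 200 images as 64-dim inputs, 5 classes.
X = rng.normal(size=(200, 64))
y = rng.integers(0, 5, size=200)

# Stand-in for a pretrained encoder; its weights stay frozen throughout.
W_frozen = rng.normal(size=(64, 32)) * 0.1
feats = np.maximum(X @ W_frozen, 0)  # frozen forward pass (one ReLU layer)

# Linear probe: closed-form ridge regression onto one-hot targets.
# Only W_probe is "trained"; the encoder is never updated.
Y_onehot = np.eye(5)[y]
lam = 1e-2
W_probe = np.linalg.solve(
    feats.T @ feats + lam * np.eye(32),  # regularized Gram matrix
    feats.T @ Y_onehot,
)
acc = np.mean(np.argmax(feats @ W_probe, axis=1) == y)
```

The appeal of this protocol is that it isolates representation quality from fine-tuning capacity: a high probe accuracy means the frozen features are already linearly separable for the task.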
Open Questions
- Can one visual representation support dense prediction, generation, policy evaluation, and VQA without task-specific compromise?
- Is pixel-space end-to-end training a scaling substitute for pretrained semantic encoders?