Vision Foundation Models

Summary

Vision foundation models in this wiki are evaluated not only on classification accuracy but also on dense feature quality, downstream transfer, latent-space usefulness, and compatibility with generation.
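
One way to read that evaluation criterion concretely is frozen-feature probing: keep the pretrained backbone fixed and train only a small head on top. The sketch below is a minimal, hypothetical PyTorch illustration; the encoder, feature dimension, and class count are placeholders and do not correspond to any specific model discussed in the wiki.

    import torch
    import torch.nn as nn

    class LinearProbe(nn.Module):
        """Train only a linear head on top of a frozen vision backbone."""
        def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int):
            super().__init__()
            self.encoder = encoder.eval()
            for p in self.encoder.parameters():
                p.requires_grad_(False)        # backbone stays frozen
            self.head = nn.Linear(feat_dim, num_classes)

        def forward(self, images: torch.Tensor) -> torch.Tensor:
            with torch.no_grad():
                feats = self.encoder(images)   # (B, feat_dim) pooled features
            return self.head(feats)

    # Usage sketch: only the probe head receives gradients.
    # probe = LinearProbe(pretrained_encoder, feat_dim=768, num_classes=1000)
    # optimizer = torch.optim.AdamW(probe.head.parameters(), lr=1e-3)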

What The Wiki Currently Believes

  • DINOv3 is the strongest scaled SSL vision reference, emphasizing frozen-feature versatility and dense feature quality.
  • NEPA explores whether next-embedding prediction can be a simple generative-pretraining alternative for vision (a toy sketch of the objective follows this list).
  • Prism argues that semantic and pixel encoders occupy different spectral roles.
  • Reconstruction or Semantics? shows that semantic visual latents can be more useful than reconstruction latents for policy-relevant robotic world models.
  • Tuna-2 challenges reliance on pretrained vision encoders by learning pixel embeddings end to end.
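
To make the next-embedding-prediction idea from the NEPA item above concrete, the toy sketch below trains a causal transformer to regress embedding t+1 from embeddings up to t. It is an illustrative assumption, not NEPA's actual tokenizer, architecture, or loss; the random embeddings in the usage note stand in for whatever visual embedding sequence the method consumes.

    import torch
    import torch.nn as nn

    class NextEmbeddingPredictor(nn.Module):
        """Causal transformer that predicts the next visual embedding."""
        def __init__(self, dim: int = 256, depth: int = 4, heads: int = 4):
            super().__init__()
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
            # embeddings: (B, T, dim) sequence of visual embeddings
            causal = nn.Transformer.generate_square_subsequent_mask(
                embeddings.size(1))
            return self.backbone(embeddings, mask=causal)

    def next_embedding_loss(model, embeddings):
        # Predict embedding t+1 from embeddings up to t; simple MSE regression.
        preds = model(embeddings[:, :-1])
        targets = embeddings[:, 1:]
        return nn.functional.mse_loss(preds, targets)

    # Usage sketch with random embeddings standing in for real encoder outputs.
    # model = NextEmbeddingPredictor()
    # loss = next_embedding_loss(model, torch.randn(8, 16, 256))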

Evidence

The corpus does not point to one universal visual representation. Instead, it maps a tradeoff among semantic abstraction, dense spatial fidelity, pixel-level generation, and downstream control.

Open Questions

  • Can one visual representation support dense prediction, generation, policy evaluation, and VQA without task-specific compromise?
  • Is pixel-space end-to-end training a scaling substitute for pretrained semantic encoders?