Vision Foundation Models

Summary

Vision foundation models in this wiki are evaluated not only on classification accuracy but also on dense feature quality, downstream transfer, latent-space usefulness, and compatibility with generation.
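
One way to read that evaluation criterion concretely is frozen-feature probing: keep the pretrained backbone fixed and train only a small head on top. The sketch below is a minimal, hypothetical PyTorch illustration; the encoder, feature dimension, and class count are placeholders and do not correspond to any specific model discussed in the wiki.

    import torch
    import torch.nn as nn

    class LinearProbe(nn.Module):
        """Train only a linear head on top of a frozen vision backbone."""
        def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int):
            super().__init__()
            self.encoder = encoder.eval()
            for p in self.encoder.parameters():
                p.requires_grad_(False)        # backbone stays frozen
            self.head = nn.Linear(feat_dim, num_classes)

        def forward(self, images: torch.Tensor) -> torch.Tensor:
            with torch.no_grad():
                feats = self.encoder(images)   # (B, feat_dim) pooled features
            return self.head(feats)

    # Usage sketch: only the probe head receives gradients.
    # probe = LinearProbe(pretrained_encoder, feat_dim=768, num_classes=1000)
    # optimizer = torch.optim.AdamW(probe.head.parameters(), lr=1e-3)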

What The Wiki Currently Believes

  • DINOv3 is the strongest scaled SSL vision reference, emphasizing frozen-feature versatility and dense feature quality.
  • NEPA explores whether next-embedding prediction can be a simple generative-pretraining alternative for vision (a toy sketch of the objective follows this list).
  • Prism argues that semantic and pixel encoders occupy different spectral roles.
  • Reconstruction or Semantics? shows that semantic visual latents can be more useful than reconstruction latents for policy-relevant robotic world models.
  • Tuna-2 challenges reliance on pretrained vision encoders by learning pixel embeddings end to end.
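
To make the next-embedding-prediction idea from the NEPA item above concrete, the toy sketch below trains a causal transformer to regress embedding t+1 from embeddings up to t. It is an illustrative assumption, not NEPA's actual tokenizer, architecture, or loss; the random embeddings in the usage note stand in for whatever visual embedding sequence the method consumes.

    import torch
    import torch.nn as nn

    class NextEmbeddingPredictor(nn.Module):
        """Causal transformer that predicts the next visual embedding."""
        def __init__(self, dim: int = 256, depth: int = 4, heads: int = 4):
            super().__init__()
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
            # embeddings: (B, T, dim) sequence of visual embeddings
            causal = nn.Transformer.generate_square_subsequent_mask(
                embeddings.size(1))
            return self.backbone(embeddings, mask=causal)

    def next_embedding_loss(model, embeddings):
        # Predict embedding t+1 from embeddings up to t; simple MSE regression.
        preds = model(embeddings[:, :-1])
        targets = embeddings[:, 1:]
        return nn.functional.mse_loss(preds, targets)

    # Usage sketch with random embeddings standing in for real encoder outputs.
    # model = NextEmbeddingPredictor()
    # loss = next_embedding_loss(model, torch.randn(8, 16, 256))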

Evidence

The corpus does not point to one universal visual representation. Instead, it maps a tradeoff among semantic abstraction, dense spatial fidelity, pixel-level generation, and downstream control.

Open Questions

  • Can one visual representation support dense prediction, generation, policy evaluation, and VQA without task-specific compromise?
  • Is pixel-space end-to-end training a scaling substitute for pretrained semantic encoders?