VL-JEPA

Summary

VL-JEPA is a vision-language model that predicts continuous target-text embeddings instead of autoregressively generating text tokens.
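
The difference from token generation can be made concrete with a toy sketch. It is illustrative only and not the VL-JEPA implementation: `toy_predictor`, the random weights, the embedding size, and the cosine-distance loss are all assumptions chosen for brevity, and the real training objective may differ. The point is that the model emits one continuous vector, and the error is measured against the target text's embedding rather than against token probabilities.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embedding_loss(predicted, target):
    """Illustrative loss (an assumption): cosine distance in embedding space."""
    return 1.0 - cosine_similarity(predicted, target)


def toy_predictor(visual, query, weights):
    """Hypothetical linear predictor: concatenated inputs -> one text embedding."""
    inputs = visual + query
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


rng = random.Random(0)
dim = 4  # toy embedding size, not taken from the source

visual = [rng.uniform(-1, 1) for _ in range(dim)]   # stand-in for visual features
query = [rng.uniform(-1, 1) for _ in range(dim)]    # stand-in for the text query
target = [rng.uniform(-1, 1) for _ in range(dim)]   # stand-in for the target-text embedding
weights = [[rng.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]

# One forward pass yields a single continuous vector; no token-by-token loop.
predicted = toy_predictor(visual, query, weights)
loss = embedding_loss(predicted, target)
```

No vocabulary softmax or sampling loop appears anywhere in the sketch; turning the predicted vector into text would be a separate step.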

Role In The Wiki

VL-JEPA extends JEPA-style representation prediction to general-domain vision-language tasks and supports selective decoding.
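
One plausible reading of selective decoding is sketched below: on a stream of predicted embeddings, a text decoder runs only when the embedding has drifted far enough from the last one that was decoded. The function `selective_decode`, the `threshold` value, and the stand-in `decode` callable are hypothetical names introduced for illustration; the source does not specify the actual triggering rule.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def selective_decode(embeddings, decode, threshold=0.2):
    """Decode only when the embedding moves away from the last decoded one.

    `threshold` is an assumed hyperparameter, not a value from the source.
    Returns a list of (step, decoded_text) pairs.
    """
    outputs = []
    last_decoded = None
    for step, emb in enumerate(embeddings):
        if last_decoded is None or cosine_distance(emb, last_decoded) > threshold:
            outputs.append((step, decode(emb)))
            last_decoded = emb
    return outputs


def toy_decode(emb):
    """Stand-in decoder: label by the dominant axis of a 2-D embedding."""
    return "A" if emb[0] > emb[1] else "B"


# Steps 1 and 3 stay close to the previously decoded embedding and are skipped.
stream = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
decoded = selective_decode(stream, toy_decode)
```

In this toy stream the decoder is invoked at only two of the four steps, which is the kind of saving selective decoding aims for.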

Evidence