VL-JEPA: Joint Embedding Predictive Architecture For Vision-Language
Source
- Raw Markdown: paper_vl-jepa-2025.md
- PDF: paper_vl-jepa-2025.pdf
Core Claim
VL-JEPA applies JEPA to vision-language learning by predicting continuous target-text embeddings rather than autoregressively generating tokens.
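A minimal sketch of that idea, as I read the abstract: a predictor maps vision features plus a query to a continuous embedding, and the loss pulls the prediction toward the embedding of the target text from a frozen text encoder, instead of a token-level cross-entropy. The module names, dimensions, and the cosine-distance objective below are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical VL-JEPA-style objective: predict a target-text embedding,
# no autoregressive token generation in the training loop.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512  # shared embedding width (assumed)

class ToyPredictor(nn.Module):
    """Predicts a target-text embedding from vision and query embeddings."""
    def __init__(self, dim: int = D):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, vision_emb: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([vision_emb, query_emb], dim=-1))

predictor = ToyPredictor()
target_text_encoder = nn.Linear(D, D)        # stand-in for a frozen text encoder
for p in target_text_encoder.parameters():   # targets provide no gradient
    p.requires_grad_(False)

# Fake batch: precomputed vision/query features and target-text features.
vision_emb = torch.randn(8, D)
query_emb = torch.randn(8, D)
target_feats = torch.randn(8, D)

pred = predictor(vision_emb, query_emb)
with torch.no_grad():
    target = target_text_encoder(target_feats)

# Embedding-space loss: cosine distance between prediction and target embedding.
loss = 1 - F.cosine_similarity(pred, target, dim=-1).mean()
loss.backward()
```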
Key Contributions
- Predicts target text embeddings from vision inputs and textual queries.
- Uses a lightweight decoder only when text output is needed.
- Supports selective decoding, which cuts the number of decoder invocations (see the sketch after this list).
- Enables classification, retrieval, and discriminative VQA directly from the embedding space without architecture changes.
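One plausible reading of how selective decoding and embedding-space prediction fit together: compare the predicted embedding against candidate text embeddings by cosine similarity, and invoke the lightweight decoder only when the best match is not confident enough. The threshold value and the decoder interface here are assumptions for illustration, not the paper's stated mechanism.

```python
# Hypothetical selective-decoding routine over a predicted target-text embedding.
import torch
import torch.nn.functional as F

def classify_or_decode(pred_emb, candidate_embs, candidate_labels, decoder=None, tau=0.6):
    """Return a label from the embedding space, or decode text when unsure.

    pred_emb:         (D,) predicted target-text embedding
    candidate_embs:   (K, D) embeddings of candidate answers / class names
    candidate_labels: list of K strings
    decoder:          optional callable pred_emb -> str (the lightweight decoder)
    tau:              similarity threshold that triggers decoding (assumed value)
    """
    sims = F.cosine_similarity(pred_emb.unsqueeze(0), candidate_embs, dim=-1)
    best = int(sims.argmax())
    if sims[best] >= tau or decoder is None:
        return candidate_labels[best]   # pure embedding-space prediction
    return decoder(pred_emb)            # fall back to generating text

# Usage with random stand-in embeddings:
D = 512
pred_emb = torch.randn(D)
candidate_embs = F.normalize(torch.randn(5, D), dim=-1)
labels = ["cat", "dog", "car", "tree", "person"]
print(classify_or_decode(pred_emb, candidate_embs, labels))
```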
Method Notes
VL-JEPA belongs to the JEPA family and connects to Vision-Language Models and Self-Supervised Representation Learning.
Evidence And Results
The abstract reports that VL-JEPA outperforms a controlled token-space VLM baseline while using 50% fewer trainable parameters, requires about 2.85x fewer decoding operations under selective decoding, and achieves competitive results across video classification, retrieval, and VQA benchmarks.
Limitations
The model still needs a decoder when text output is required. It does not eliminate language generation; it makes generation selective.
Links Into The Wiki
Open Questions
- Can embedding prediction replace autoregressive decoding in broader multimodal assistant tasks?
- How should target text embeddings be trained and regularized to avoid semantic collapse?
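On the collapse question just above: one generic technique from self-supervised learning is a VICReg-style variance penalty that keeps each embedding dimension from shrinking toward a constant. The paper is not stated to use this; the hinge target of 1.0 and the eps value below are conventional choices, shown only to make the question concrete.

```python
# Generic anti-collapse regularizer (VICReg-style variance term), for illustration.
import torch

def variance_regularizer(embeddings: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Penalize per-dimension standard deviations that fall below 1."""
    std = torch.sqrt(embeddings.var(dim=0) + eps)
    return torch.relu(1.0 - std).mean()

batch = torch.randn(64, 512)
print(variance_regularizer(batch))  # near 0 for well-spread embeddings
```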