VL-JEPA: Joint Embedding Predictive Architecture For Vision-Language

Core Claim

VL-JEPA applies JEPA to vision-language learning by predicting continuous target-text embeddings rather than autoregressively generating tokens.
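
As a concrete illustration of the difference, here is a minimal sketch of an embedding-prediction objective in this style. The encoders, dimensions, and cosine loss are assumed stand-ins, not the paper's exact components; the point is that the training signal is a distance in embedding space, with no token-by-token generation on the loss path.

```python
# Illustrative embedding-prediction objective (a sketch, not the paper's code).
# Encoders, dimensions, and the cosine loss are all assumed stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLJEPASketch(nn.Module):
    def __init__(self, dim=768, vision_dim=1024, vocab=32000):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, dim)   # stand-in for a vision encoder head
        self.query_embed = nn.Embedding(vocab, dim)     # stand-in for a query encoder
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, vision_feats, query_ids):
        # Fuse vision features with the textual query and predict one
        # continuous embedding instead of generating tokens autoregressively.
        v = self.vision_proj(vision_feats)                # (B, Nv, dim)
        q = self.query_embed(query_ids)                   # (B, Nq, dim)
        fused = self.predictor(torch.cat([v, q], dim=1))  # (B, Nv+Nq, dim)
        return fused.mean(dim=1)                          # (B, dim)

def embedding_loss(pred, target_text_emb):
    # Regress onto a frozen target-text embedding; cosine distance is one
    # common choice for continuous-target prediction.
    return 1.0 - F.cosine_similarity(pred, target_text_emb.detach(), dim=-1).mean()
```

Because the decoder never appears in this objective, it can stay lightweight and be invoked only when actual text output is required.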

Key Contributions

  • Predicts target text embeddings from vision inputs and textual queries.
  • Uses a lightweight decoder only when text output is needed.
  • Supports selective decoding, invoking the decoder only for inputs that need generated text (see the inference sketch after this list).
  • Enables classification, retrieval, and discriminative VQA from the embedding space without architecture changes.
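
The last two items can be made concrete with a hedged inference sketch. The confidence-threshold gate and the light_decoder and class_embs helpers are assumptions introduced here for illustration; the paper's actual gating criterion may differ.

```python
# Illustrative embedding-space inference with selective decoding. The
# threshold gate and the light_decoder helper are assumptions for this
# sketch, not the paper's documented mechanism.
import torch
import torch.nn.functional as F

def classify_in_embedding_space(pred_emb, class_embs):
    # Discriminative use with no decoder: pick the nearest class-name
    # embedding by cosine similarity.
    sims = F.cosine_similarity(pred_emb.unsqueeze(1), class_embs.unsqueeze(0), dim=-1)
    return sims.argmax(dim=-1), sims                      # (B,), (B, C)

def selective_decode(pred_emb, class_embs, light_decoder, threshold=0.8):
    # Only invoke the (hypothetical) lightweight decoder for samples whose
    # best embedding match falls below the confidence threshold.
    labels, sims = classify_in_embedding_space(pred_emb, class_embs)
    confident = sims.max(dim=-1).values >= threshold
    outputs = []
    for i in range(pred_emb.size(0)):
        if confident[i]:
            outputs.append(("label", labels[i].item()))           # answered from embeddings
        else:
            outputs.append(("text", light_decoder(pred_emb[i])))  # decode text
    return outputs
```

Retrieval works analogously: rank candidate text embeddings by the same similarity score, again without invoking the decoder.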

Method Notes

VL-JEPA sits at the intersection of JEPA-style predictive architectures, vision-language modeling, and self-supervised representation learning.

Evidence And Results

The abstract reports that VL-JEPA outperforms a controlled token-space VLM baseline while using 50% fewer trainable parameters, that selective decoding cuts decoding operations by roughly 2.85x, and that performance is competitive across video classification, retrieval, and VQA datasets.

Limitations

The model still needs a decoder when text output is required. It does not eliminate language generation; it makes generation selective.

Open Questions

  • Can embedding prediction replace autoregressive decoding in broader multimodal assistant tasks?
  • How should target text embeddings be trained and regularized to avoid semantic collapse? (One candidate from the self-supervised literature is sketched below.)
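
On the second question, one well-known family of anti-collapse regularizers from the self-supervised literature is the VICReg-style variance/covariance penalty. The sketch below is offered as a candidate illustration, not VL-JEPA's documented recipe.

```python
# Illustrative anti-collapse regularizer: a VICReg-style variance/covariance
# penalty over a batch of predicted embeddings. This is one known technique
# from the self-supervised literature, not the paper's confirmed method.
import torch

def variance_covariance_penalty(emb, gamma=1.0, eps=1e-4):
    # emb: (B, D) batch of predicted embeddings.
    emb = emb - emb.mean(dim=0)
    # Variance term: keep each dimension's std above gamma so the batch
    # cannot collapse to a single point.
    std = torch.sqrt(emb.var(dim=0) + eps)
    var_loss = torch.relu(gamma - std).mean()
    # Covariance term: penalize off-diagonal covariance so dimensions
    # carry decorrelated information.
    cov = (emb.T @ emb) / (emb.size(0) - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / emb.size(1)
    return var_loss + cov_loss
```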