VL-JEPA: Joint Embedding Predictive Architecture For Vision-Language

Core Claim

VL-JEPA applies JEPA to vision-language learning by predicting continuous target-text embeddings rather than autoregressively generating tokens.
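
As a concrete illustration of the difference, here is a minimal sketch of an embedding-prediction objective in this style. The encoders, dimensions, and cosine loss are assumed stand-ins, not the paper's exact components; the point is that the training signal is a distance in embedding space, with no token-by-token generation on the loss path.

```python
# Illustrative embedding-prediction objective (a sketch, not the paper's code).
# Encoders, dimensions, and the cosine loss are all assumed stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLJEPASketch(nn.Module):
    def __init__(self, dim=768, vision_dim=1024, vocab=32000):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, dim)   # stand-in for a vision encoder head
        self.query_embed = nn.Embedding(vocab, dim)     # stand-in for a query encoder
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, vision_feats, query_ids):
        # Fuse vision features with the textual query and predict one
        # continuous embedding instead of generating tokens autoregressively.
        v = self.vision_proj(vision_feats)                # (B, Nv, dim)
        q = self.query_embed(query_ids)                   # (B, Nq, dim)
        fused = self.predictor(torch.cat([v, q], dim=1))  # (B, Nv+Nq, dim)
        return fused.mean(dim=1)                          # (B, dim)

def embedding_loss(pred, target_text_emb):
    # Regress onto a frozen target-text embedding; cosine distance is one
    # common choice for continuous-target prediction.
    return 1.0 - F.cosine_similarity(pred, target_text_emb.detach(), dim=-1).mean()
```

Because the decoder never appears in this objective, it can stay lightweight and be invoked only when actual text output is required.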

Key Contributions

  • Predicts target text embeddings from vision inputs and textual queries.
  • Uses a lightweight decoder only when text output is needed.
  • Supports selective decoding, invoking the decoder only for inputs that need generated text (see the inference sketch after this list).
  • Enables classification, retrieval, and discriminative VQA from the embedding space without architecture changes.
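
The last two items can be made concrete with a hedged inference sketch. The confidence-threshold gate and the light_decoder and class_embs helpers are assumptions introduced here for illustration; the paper's actual gating criterion may differ.

```python
# Illustrative embedding-space inference with selective decoding. The
# threshold gate and the light_decoder helper are assumptions for this
# sketch, not the paper's documented mechanism.
import torch
import torch.nn.functional as F

def classify_in_embedding_space(pred_emb, class_embs):
    # Discriminative use with no decoder: pick the nearest class-name
    # embedding by cosine similarity.
    sims = F.cosine_similarity(pred_emb.unsqueeze(1), class_embs.unsqueeze(0), dim=-1)
    return sims.argmax(dim=-1), sims                      # (B,), (B, C)

def selective_decode(pred_emb, class_embs, light_decoder, threshold=0.8):
    # Only invoke the (hypothetical) lightweight decoder for samples whose
    # best embedding match falls below the confidence threshold.
    labels, sims = classify_in_embedding_space(pred_emb, class_embs)
    confident = sims.max(dim=-1).values >= threshold
    outputs = []
    for i in range(pred_emb.size(0)):
        if confident[i]:
            outputs.append(("label", labels[i].item()))           # answered from embeddings
        else:
            outputs.append(("text", light_decoder(pred_emb[i])))  # decode text
    return outputs
```

Retrieval works analogously: rank candidate text embeddings by the same similarity score, again without invoking the decoder.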

Method Notes

VL-JEPA sits at the intersection of JEPA-style predictive architectures, vision-language modeling, and self-supervised representation learning.

Evidence And Results

The abstract reports that VL-JEPA outperforms a controlled token-space VLM baseline while using 50% fewer trainable parameters, that selective decoding cuts decoding operations by roughly 2.85x, and that performance is competitive across video classification, retrieval, and VQA datasets.

Limitations

The model still needs a decoder when text output is required. It does not eliminate language generation; it makes generation selective.

Open Questions

  • Can embedding prediction replace autoregressive decoding in broader multimodal assistant tasks?
  • How should target text embeddings be trained and regularized to avoid semantic collapse? (One candidate from the self-supervised literature is sketched below.)
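
On the second question, one well-known family of anti-collapse regularizers from the self-supervised literature is the VICReg-style variance/covariance penalty. The sketch below is offered as a candidate illustration, not VL-JEPA's documented recipe.

```python
# Illustrative anti-collapse regularizer: a VICReg-style variance/covariance
# penalty over a batch of predicted embeddings. This is one known technique
# from the self-supervised literature, not the paper's confirmed method.
import torch

def variance_covariance_penalty(emb, gamma=1.0, eps=1e-4):
    # emb: (B, D) batch of predicted embeddings.
    emb = emb - emb.mean(dim=0)
    # Variance term: keep each dimension's std above gamma so the batch
    # cannot collapse to a single point.
    std = torch.sqrt(emb.var(dim=0) + eps)
    var_loss = torch.relu(gamma - std).mean()
    # Covariance term: penalize off-diagonal covariance so dimensions
    # carry decorrelated information.
    cov = (emb.T @ emb) / (emb.size(0) - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / emb.size(1)
    return var_loss + cov_loss
```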