Vision-Language Models
Summary
The vision-language thread asks whether VLMs should keep generating text tokens, predict text embeddings, or move toward unified multimodal objectives.
What The Wiki Currently Believes
- VL-JEPA predicts text embeddings directly and decodes them to text only selectively, reducing trainable parameters and decoding work (see the sketch after this list).
- Beyond Language Modeling studies from-scratch multimodal pretraining with language, image, video, and action-conditioned data.
- Tuna-2 argues that pixel-space unified modeling can support both understanding and generation without pretrained vision encoders.
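The embedding-prediction-with-selective-decoding idea in the VL-JEPA bullet is concrete enough to sketch. The following is a minimal illustration, not the paper's implementation: the module names, dimensions, and the similarity-gate rule are all assumptions, with small random linear layers standing in for real encoders.

```python
# Minimal sketch of embedding prediction with selective decoding,
# in the spirit of VL-JEPA. Names, shapes, and the gating rule are
# illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingPredictor(nn.Module):
    """Maps pooled visual features into a (frozen) text-embedding space."""
    def __init__(self, vis_dim=512, txt_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, txt_dim), nn.GELU(), nn.Linear(txt_dim, txt_dim)
        )

    def forward(self, vis_feats):
        # Predict a target text embedding in one shot; no token-by-token decoding.
        return F.normalize(self.proj(vis_feats), dim=-1)

def selective_decode(pred_emb, candidate_embs, threshold=0.8):
    """Fall back to a full text decoder only when no cached candidate
    embedding is close enough to the prediction (the gate is an assumption)."""
    sims = pred_emb @ candidate_embs.T                # cosine similarities
    best_sim, best_idx = sims.max(dim=-1)
    needs_decoder = best_sim < threshold              # decode only on low confidence
    return best_idx, needs_decoder

# Toy usage with random stand-ins for real encoders.
predictor = EmbeddingPredictor()
vis_feats = torch.randn(4, 512)                           # pooled frame features
candidates = F.normalize(torch.randn(100, 256), dim=-1)   # cached text embeddings
pred = predictor(vis_feats)
idx, decode_mask = selective_decode(pred, candidates)
print(idx, decode_mask)  # run the expensive decoder only where decode_mask is True
```

The point of the gate is that many predictions can be consumed in embedding space (matching, retrieval), so the expensive autoregressive decoder runs only on the low-confidence cases.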
Evidence
All three sources loosen the standard “vision encoder plus autoregressive text decoder” pattern (sketched below), but each does so differently: VL-JEPA through embedding prediction, Beyond Language Modeling through unified multimodal pretraining, and Tuna-2 through pixel-space modeling.
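For contrast, here is a schematic of the standard pattern itself, a vision encoder conditioning an autoregressive text decoder, so the departure each source makes is easier to see. Everything in it (the GRU stand-in for the decoder, the shapes, the module names) is an illustrative assumption rather than any source's architecture.

```python
# Schematic of the conventional pattern the three sources relax:
# a vision encoder feeding an autoregressive text decoder.
# All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class StandardVLM(nn.Module):
    def __init__(self, vis_dim=512, hid=256, vocab=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, hid)        # stand-in encoder
        self.embed = nn.Embedding(vocab, hid)
        self.decoder = nn.GRU(hid, hid, batch_first=True)    # stand-in AR decoder
        self.lm_head = nn.Linear(hid, vocab)

    def forward(self, image_feats, token_ids):
        # Condition the decoder on the image, then predict every next token:
        # all output must pass through token-by-token text generation.
        h0 = self.vision_encoder(image_feats).unsqueeze(0)   # (1, B, hid)
        x = self.embed(token_ids)                            # (B, T, hid)
        out, _ = self.decoder(x, h0)
        return self.lm_head(out)                             # next-token logits

model = StandardVLM()
logits = model(torch.randn(2, 512), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1000])
```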
Open Questions
- How much text decoding is actually needed for online multimodal tasks?
- Can embedding-space prediction and pixel-space unified modeling be combined?