Vision-Language Models
Summary
The vision-language thread asks whether VLMs should keep generating text tokens, predict text embeddings, or move toward unified multimodal objectives.
What The Wiki Currently Believes
- VL-JEPA predicts text embeddings directly and decodes them to text only selectively, reducing trainable parameters and decoding work (see the sketch after this list).
- Beyond Language Modeling studies from-scratch multimodal pretraining with language, image, video, and action-conditioned data.
- Tuna-2 argues that pixel-space unified modeling can support both understanding and generation without pretrained vision encoders.
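The embedding-prediction-with-selective-decoding idea in the VL-JEPA bullet is concrete enough to sketch. The following is a minimal illustration, not the paper's implementation: the module names, dimensions, and the similarity-gate rule are all assumptions, with small random linear layers standing in for real encoders.

```python
# Minimal sketch of embedding prediction with selective decoding,
# in the spirit of VL-JEPA. Names, shapes, and the gating rule are
# illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingPredictor(nn.Module):
    """Maps pooled visual features into a (frozen) text-embedding space."""
    def __init__(self, vis_dim=512, txt_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, txt_dim), nn.GELU(), nn.Linear(txt_dim, txt_dim)
        )

    def forward(self, vis_feats):
        # Predict a target text embedding in one shot; no token-by-token decoding.
        return F.normalize(self.proj(vis_feats), dim=-1)

def selective_decode(pred_emb, candidate_embs, threshold=0.8):
    """Fall back to a full text decoder only when no cached candidate
    embedding is close enough to the prediction (the gate is an assumption)."""
    sims = pred_emb @ candidate_embs.T                # cosine similarities
    best_sim, best_idx = sims.max(dim=-1)
    needs_decoder = best_sim < threshold              # decode only on low confidence
    return best_idx, needs_decoder

# Toy usage with random stand-ins for real encoders.
predictor = EmbeddingPredictor()
vis_feats = torch.randn(4, 512)                           # pooled frame features
candidates = F.normalize(torch.randn(100, 256), dim=-1)   # cached text embeddings
pred = predictor(vis_feats)
idx, decode_mask = selective_decode(pred, candidates)
print(idx, decode_mask)  # run the expensive decoder only where decode_mask is True
```

The point of the gate is that many predictions can be consumed in embedding space (matching, retrieval), so the expensive autoregressive decoder runs only on the low-confidence cases.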
Evidence
All three sources loosen the standard “vision encoder plus autoregressive text decoder” pattern (sketched below), but each does so differently: VL-JEPA through embedding prediction, Beyond Language Modeling through unified multimodal pretraining, and Tuna-2 through pixel-space modeling.
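For contrast, here is a schematic of the standard pattern itself, a vision encoder conditioning an autoregressive text decoder, so the departure each source makes is easier to see. Everything in it (the GRU stand-in for the decoder, the shapes, the module names) is an illustrative assumption rather than any source's architecture.

```python
# Schematic of the conventional pattern the three sources relax:
# a vision encoder feeding an autoregressive text decoder.
# All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class StandardVLM(nn.Module):
    def __init__(self, vis_dim=512, hid=256, vocab=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, hid)        # stand-in encoder
        self.embed = nn.Embedding(vocab, hid)
        self.decoder = nn.GRU(hid, hid, batch_first=True)    # stand-in AR decoder
        self.lm_head = nn.Linear(hid, vocab)

    def forward(self, image_feats, token_ids):
        # Condition the decoder on the image, then predict every next token:
        # all output must pass through token-by-token text generation.
        h0 = self.vision_encoder(image_feats).unsqueeze(0)   # (1, B, hid)
        x = self.embed(token_ids)                            # (B, T, hid)
        out, _ = self.decoder(x, h0)
        return self.lm_head(out)                             # next-token logits

model = StandardVLM()
logits = model(torch.randn(2, 512), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1000])
```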
Open Questions
- How much text decoding is actually needed for online multimodal tasks?
- Can embedding-space prediction and pixel-space unified modeling be combined?