Unified Multimodal Models
Summary
Unified multimodal models aim to share a single model substrate across understanding and generation, which often exposes a tension between the semantic abstraction that understanding favors and the raw fidelity that generation requires.
What The Wiki Currently Believes
- Beyond Language Modeling studies native multimodal pretraining with Transfusion and finds that visual and language data are complementary, that world-modeling behavior emerges, and that MoE experts specialize (see the sketch after this list).
- Tuna-2 removes pretrained vision encoders and works directly with pixel embeddings for understanding and generation.
- TimeOmni-VL adapts the unified-model idea to time series by mapping time series to images and back.
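To make the shared-substrate idea concrete, below is a minimal PyTorch sketch of a Transfusion-like setup: one transformer trunk over interleaved modalities, a next-token loss on text positions, and a denoising-style regression loss on continuous image-patch embeddings (standing in for pixel embeddings). All names, dimensions, and the MSE stand-in for a diffusion loss are illustrative assumptions, not the papers' implementations.

```python
# Illustrative sketch of a shared-backbone unified model: one trunk, two losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedBackbone(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, patch_dim=768, n_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)   # encoder-free: raw patch/pixel embeddings in
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)     # text prediction (understanding side)
        self.patch_head = nn.Linear(d_model, patch_dim)   # patch denoising (generation side)

    def forward(self, text_ids, noisy_patches):
        # Concatenate modalities along the sequence; a real model would interleave
        # them and use modality embeddings plus appropriate attention masks.
        h_text = self.text_embed(text_ids)
        h_img = self.patch_proj(noisy_patches)
        h = self.trunk(torch.cat([h_text, h_img], dim=1))
        t = text_ids.size(1)
        return self.lm_head(h[:, :t]), self.patch_head(h[:, t:])

model = UnifiedBackbone()
text_ids = torch.randint(0, 32000, (2, 16))            # toy text tokens
clean_patches = torch.randn(2, 64, 768)                 # toy image patch embeddings
noisy_patches = clean_patches + 0.1 * torch.randn_like(clean_patches)

logits, denoised = model(text_ids, noisy_patches)
lm_loss = F.cross_entropy(logits[:, :-1].reshape(-1, 32000), text_ids[:, 1:].reshape(-1))
img_loss = F.mse_loss(denoised, clean_patches)          # stand-in for a diffusion objective
loss = lm_loss + img_loss                               # joint objective over one substrate
```

The point of the sketch is only that both objectives backpropagate into the same trunk, which is where the abstraction-versus-fidelity tension plays out.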
Evidence
The papers agree that unified training is desirable but differ in the representation they use: RAE-style visual representations, raw pixel embeddings, and fidelity-preserving time-series images.
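The time-series case makes the fidelity question tangible. Below is a minimal sketch, assuming a simple rasterization bridge (not necessarily TimeOmni-VL's actual mapping): a series is quantized into a binary image and read back with bounded error, which is the sense in which an image representation can stay faithful to the raw signal.

```python
# One possible time-series <-> image bridge (an assumption for illustration):
# each value becomes a pixel row, so reconstruction error is bounded by the grid.
import numpy as np

def series_to_image(x, height=64):
    """Rasterize a 1-D series into a (height, len(x)) binary image."""
    lo, hi = x.min(), x.max()
    rows = np.round((x - lo) / (hi - lo + 1e-8) * (height - 1)).astype(int)
    img = np.zeros((height, len(x)), dtype=np.uint8)
    img[rows, np.arange(len(x))] = 1
    return img, (lo, hi)

def image_to_series(img, scale):
    """Invert the rasterization up to quantization error."""
    lo, hi = scale
    rows = img.argmax(axis=0)                 # one active pixel per column
    return lo + rows / (img.shape[0] - 1) * (hi - lo)

x = np.sin(np.linspace(0, 4 * np.pi, 256)) + 0.05 * np.random.randn(256)
img, scale = series_to_image(x)
x_rec = image_to_series(img, scale)
print("max reconstruction error:", np.abs(x - x_rec).max())  # shrinks as image height grows
```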
Open Questions
- Does unification require a single representation, or can it use task-specific bridges that remain faithful enough?
- How should generation objectives avoid damaging understanding representations?