Unified Multimodal Models

Summary

Unified multimodal models aim to serve both understanding and generation from a single model substrate, which often exposes a tension between semantic abstraction (useful for understanding) and raw pixel- or signal-level fidelity (needed for generation).

What The Wiki Currently Believes

  • Beyond Language Modeling studies native multimodal pretraining with Transfusion and finds that visual and language data are complementary, that world-modeling abilities emerge, and that mixture-of-experts (MoE) layers specialize.
  • Tuna-2 removes pretrained vision encoders and operates directly on pixel embeddings for both understanding and generation.
  • TimeOmni-VL adapts the unified-model idea to time series by rendering time series as images and decoding generated images back into time series.
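The encoder-free direction in the second bullet can be made concrete with a minimal sketch: instead of a pretrained vision encoder, the model carves the raw image into patches and maps each flattened patch to a token with one learned linear projection. The patch size, embedding dimension, and function name below are illustrative assumptions, not Tuna-2's actual design.

```python
import numpy as np

def pixel_patch_embed(image, weight, patch=4):
    """Project raw non-overlapping patches into token embeddings.

    image:  (H, W, C) array of raw pixels
    weight: (patch*patch*C, d) learned linear projection (assumed shape)
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # Carve the image into non-overlapping patch x patch tiles.
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    # A single linear map turns each flattened tile into a token.
    return tiles @ weight

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))                # toy 8x8 RGB image
W = rng.standard_normal((4 * 4 * 3, 16))   # 4x4 patches -> 16-dim tokens
tokens = pixel_patch_embed(img, W)
print(tokens.shape)  # (4, 16): four patch tokens, 16 dims each
```

The point of the sketch is that the only image-specific machinery is the patchify-and-project step; everything downstream can be the same Transformer used for text.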

Evidence

The papers agree that unified training is desirable but differ in their choice of representation: RAE-style visual representations, raw pixel embeddings, and fidelity-preserving time-series images.
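What "fidelity-preserving" means for the time-series bridge can be illustrated with a toy round trip: quantize each value to an 8-bit pixel intensity and lay the series out row by row, so the resulting image decodes back to the series with error bounded by half a quantization step. The layout and quantization scheme here are illustrative assumptions, not TimeOmni-VL's actual mapping.

```python
import numpy as np

def series_to_image(x, width=16):
    # Min-max normalize to [0, 255] and round to 8-bit pixels (assumed scheme).
    lo, hi = float(x.min()), float(x.max())
    q = np.round((x - lo) / (hi - lo) * 255).astype(np.uint8)
    # Pad so the series fills whole image rows of the chosen width.
    pad = (-len(q)) % width
    q = np.concatenate([q, np.zeros(pad, dtype=np.uint8)])
    return q.reshape(-1, width), (lo, hi, len(x))

def image_to_series(img, meta):
    # Invert the layout and quantization using the stored range and length.
    lo, hi, n = meta
    q = img.reshape(-1)[:n].astype(np.float64)
    return q / 255 * (hi - lo) + lo

t = np.linspace(0, 2 * np.pi, 64)
x = np.sin(t)
img, meta = series_to_image(x)        # (4, 16) grayscale image
x_hat = image_to_series(img, meta)
# Reconstruction error is at most half a quantization step of the value range.
print(np.max(np.abs(x - x_hat)))
```

The round-trip bound is what distinguishes this kind of bridge from a lossy semantic encoding: the image is a faithful, invertible view of the signal rather than an abstraction of it.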

Open Questions

  • Does unification require a single representation, or can it use task-specific bridges that remain faithful enough?
  • How can generation objectives be trained without degrading the representations used for understanding?