Unified Multimodal Models

Summary

Unified multimodal models aim to serve both understanding and generation from a single model substrate, which often exposes a tension between semantic abstraction (useful for understanding) and raw pixel- or signal-level fidelity (needed for generation).

What The Wiki Currently Believes

  • Beyond Language Modeling studies native multimodal pretraining with Transfusion and finds that visual and language data are complementary, that world-modeling abilities emerge, and that mixture-of-experts (MoE) layers specialize.
  • Tuna-2 removes pretrained vision encoders and operates directly on pixel embeddings for both understanding and generation.
  • TimeOmni-VL adapts the unified-model idea to time series by rendering time series as images and decoding generated images back into time series.
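The encoder-free direction in the second bullet can be made concrete with a minimal sketch: instead of a pretrained vision encoder, the model carves the raw image into patches and maps each flattened patch to a token with one learned linear projection. The patch size, embedding dimension, and function name below are illustrative assumptions, not Tuna-2's actual design.

```python
import numpy as np

def pixel_patch_embed(image, weight, patch=4):
    """Project raw non-overlapping patches into token embeddings.

    image:  (H, W, C) array of raw pixels
    weight: (patch*patch*C, d) learned linear projection (assumed shape)
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # Carve the image into non-overlapping patch x patch tiles.
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    # A single linear map turns each flattened tile into a token.
    return tiles @ weight

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))                # toy 8x8 RGB image
W = rng.standard_normal((4 * 4 * 3, 16))   # 4x4 patches -> 16-dim tokens
tokens = pixel_patch_embed(img, W)
print(tokens.shape)  # (4, 16): four patch tokens, 16 dims each
```

The point of the sketch is that the only image-specific machinery is the patchify-and-project step; everything downstream can be the same Transformer used for text.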

Evidence

The papers agree that unified training is desirable but differ in their choice of representation: RAE-style visual representations, raw pixel embeddings, and fidelity-preserving time-series images.
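What "fidelity-preserving" means for the time-series bridge can be illustrated with a toy round trip: quantize each value to an 8-bit pixel intensity and lay the series out row by row, so the resulting image decodes back to the series with error bounded by half a quantization step. The layout and quantization scheme here are illustrative assumptions, not TimeOmni-VL's actual mapping.

```python
import numpy as np

def series_to_image(x, width=16):
    # Min-max normalize to [0, 255] and round to 8-bit pixels (assumed scheme).
    lo, hi = float(x.min()), float(x.max())
    q = np.round((x - lo) / (hi - lo) * 255).astype(np.uint8)
    # Pad so the series fills whole image rows of the chosen width.
    pad = (-len(q)) % width
    q = np.concatenate([q, np.zeros(pad, dtype=np.uint8)])
    return q.reshape(-1, width), (lo, hi, len(x))

def image_to_series(img, meta):
    # Invert the layout and quantization using the stored range and length.
    lo, hi, n = meta
    q = img.reshape(-1)[:n].astype(np.float64)
    return q / 255 * (hi - lo) + lo

t = np.linspace(0, 2 * np.pi, 64)
x = np.sin(t)
img, meta = series_to_image(x)        # (4, 16) grayscale image
x_hat = image_to_series(img, meta)
# Reconstruction error is at most half a quantization step of the value range.
print(np.max(np.abs(x - x_hat)))
```

The round-trip bound is what distinguishes this kind of bridge from a lossy semantic encoding: the image is a faithful, invertible view of the signal rather than an abstraction of it.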

Open Questions

  • Does unification require a single representation, or can it use task-specific bridges that remain faithful enough?
  • How can generation objectives be trained without degrading the representations used for understanding?