Beyond Language Modeling: An Exploration Of Multimodal Pretraining
Source
- Raw Markdown: paper_beyond-language-modeling-2026.md
- PDF: paper_beyond-language-modeling-2026.pdf
Core Claim
Native multimodal pretraining can move foundation models beyond text-only language modeling when visual representations, data mixtures, world-modeling data, and MoE scaling are controlled together.
Key Contributions
- Uses controlled from-scratch Transfusion pretraining over language, images, video, image-text pairs, and action-conditioned video.
- Finds representation autoencoders (RAEs) to be an effective shared visual substrate for both understanding and generation.
- Reports complementarity between visual and language data.
- Treats mixture-of-experts (MoE) layers as a mechanism for modality specialization and for scaling modalities asymmetrically.
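The MoE contribution above can be illustrated with a minimal sketch. This assumes simple top-1 token-choice routing with randomly initialized placeholder weights; the paper's actual gating scheme, expert count, and initialization are not specified here. The point is only the mechanism: a gate sends each token to one expert, so parameters serving image tokens can grow independently of those serving text tokens.

```python
import math
import random

random.seed(0)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


class Top1MoE:
    """Minimal top-1 mixture-of-experts layer (illustrative, not the
    paper's architecture). A linear gate scores each expert per token
    and the token is processed by the single highest-scoring expert."""

    def __init__(self, dim, n_experts):
        # Placeholder random weights; a real layer would learn these.
        self.gate = [[random.gauss(0, 1) for _ in range(dim)]
                     for _ in range(n_experts)]
        self.experts = [[[random.gauss(0, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(n_experts)]

    def forward(self, token):
        # Gate: one score per expert, normalized with softmax.
        scores = softmax([sum(w * x for w, x in zip(row, token))
                          for row in self.gate])
        e = scores.index(max(scores))  # route to the top-scoring expert
        # Apply only that expert's linear map to the token.
        out = [sum(w * x for w, x in zip(row, token))
               for row in self.experts[e]]
        return e, out


moe = Top1MoE(dim=4, n_experts=2)
expert_id, y = moe.forward([1.0, 0.5, -0.3, 0.2])
```

Because only one expert runs per token, total parameters can grow with the number of experts while per-token compute stays roughly constant, which is what makes MoE a plausible answer to asymmetric modality scaling needs.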
Method Notes
The paper is a design-space study rather than a single-method proposal. It connects to Unified Multimodal Models, Mixture Of Experts, Vision-Language Models, and World Models.
Evidence And Results
The source emphasizes IsoFLOP-style comparisons and controlled changes to representation, modality mix, and architecture. Its main synthesis value is the claim that vision is more data-hungry than language, and that MoE can absorb the resulting scaling mismatch.
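The IsoFLOP methodology can be sketched as follows. This is a minimal illustration assuming the common C ≈ 6·N·D approximation for dense-transformer training FLOPs (N parameters, D tokens); the paper's exact FLOP accounting is not reproduced here. Holding the compute budget fixed, each candidate model size determines how many tokens it can be trained on, and configurations are then compared at equal compute.

```python
def isoflop_configs(budget_flops, model_sizes):
    """For a fixed training-compute budget, return the token count each
    model size affords under the C ~= 6 * N * D approximation."""
    return {n: budget_flops / (6 * n) for n in model_sizes}


# Illustrative numbers only, not the paper's actual budgets.
budget = 1e21
sizes = [1e8, 5e8, 1e9, 5e9]
configs = isoflop_configs(budget, sizes)
for n, d in sorted(configs.items()):
    print(f"params={n:.0e}  tokens={d:.2e}")
```

In an IsoFLOP comparison, every row of this sweep costs the same compute, so differences in downstream performance can be attributed to the representation, modality mix, or architecture rather than to one run simply having trained longer.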
Limitations
The conclusions are tied to the Transfusion setup and data mixture. They should be compared against pixel-space approaches such as Tuna-2 and spectral representation arguments in Prism.
Links Into The Wiki
- Unified Multimodal Models
- Mixture Of Experts
- Vision-Language Models
- World Models
Open Questions
- Does RAE remain the best visual substrate when pixel-space end-to-end models scale further?
- How general is the reported world-modeling emergence across action-conditioned datasets?