Tuna-2: Pixel Embeddings Beat Vision Encoders For Multimodal Understanding And Generation
Source
- Raw Markdown: paper_tuna-2-2026.md
- PDF: paper_tuna-2-2026.pdf
Core Claim
Tuna-2 argues that native unified multimodal models can perform understanding and generation directly with pixel embeddings, without relying on pretrained vision encoders or VAE-style latent modules.
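To make the core claim concrete, the sketch below shows one common way "pixel embeddings" are produced: raw image patches are mapped to token embeddings by a single learned projection, with no pretrained encoder in front. This is an illustrative assumption, not the Tuna-2 architecture; the class name, patch size, and embedding width are hypothetical.

```python
# Illustrative sketch only: one way pixel embeddings can be realized, assuming a
# ViT-style patchify followed by a learned linear projection. Names and sizes are
# hypothetical, not taken from the Tuna-2 paper.
import torch
import torch.nn as nn


class PixelPatchEmbedding(nn.Module):
    """Maps raw image pixels to token embeddings with no pretrained vision encoder."""

    def __init__(self, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 1024):
        super().__init__()
        # A strided convolution is equivalent to splitting the image into
        # non-overlapping patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (batch, 3, H, W) -> tokens: (batch, num_patches, embed_dim)
        x = self.proj(pixels)                 # (batch, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)   # (batch, num_patches, embed_dim)


# Example: a 256x256 image becomes a 16x16 grid of 256 pixel tokens.
tokens = PixelPatchEmbedding()(torch.randn(2, 3, 256, 256))
print(tokens.shape)  # torch.Size([2, 256, 1024])
```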
Key Contributions
- Introduces a pixel-space unified multimodal model.
- Compares encoder-based and encoder-free variants.
- Reports state-of-the-art multimodal benchmark performance and strong fine-grained visual perception.
- Suggests pretrained vision encoders are not necessary for scalable multimodal modeling.
Method Notes
Tuna-2 operates directly on pixel embeddings: images enter the model as pixel-level tokens rather than as features from a pretrained vision encoder or a VAE-style latent module. The paper compares encoder-based and encoder-free variants of this design and trains the encoder-free model end-to-end at scale; a minimal sketch of how the two front ends differ follows.
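The sketch below contrasts the encoder-free and encoder-based front ends under the assumption that both feed the same transformer backbone. The patchify helper, the frozen-encoder stand-in, and all dimensions are hypothetical; they only illustrate the structural difference the paper studies, not its actual modules.

```python
# Minimal sketch of the encoder-free vs. encoder-based comparison, assuming both
# variants feed the same backbone. The backbone, dimensions, and the
# frozen-encoder stand-in are hypothetical, not details from the paper.
import torch
import torch.nn as nn

EMBED_DIM, PATCH = 512, 16


def patchify(pixels: torch.Tensor) -> torch.Tensor:
    """Split (B, 3, H, W) images into flattened raw-pixel patches: (B, N, 3*PATCH*PATCH)."""
    b, c, h, w = pixels.shape
    x = pixels.unfold(2, PATCH, PATCH).unfold(3, PATCH, PATCH)   # (B, C, H/P, W/P, P, P)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * PATCH * PATCH)


# Encoder-free front end: a single learned projection from raw pixel patches to tokens.
encoder_free = nn.Linear(3 * PATCH * PATCH, EMBED_DIM)

# Encoder-based front end: a frozen "pretrained" vision encoder (stand-in module here)
# followed by an adapter into the backbone's embedding space.
pretrained_encoder = nn.Sequential(nn.Linear(3 * PATCH * PATCH, 768), nn.GELU()).requires_grad_(False)
adapter = nn.Linear(768, EMBED_DIM)

# Shared backbone consumed by both variants.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(EMBED_DIM, nhead=8, batch_first=True), num_layers=2
)

pixels = torch.randn(2, 3, 256, 256)
patches = patchify(pixels)

# Both variants produce a token sequence for the same backbone; the question the
# paper studies is which front end scales better for understanding and generation.
tokens_free = backbone(encoder_free(patches))
tokens_enc = backbone(adapter(pretrained_encoder(patches)))
print(tokens_free.shape, tokens_enc.shape)  # both torch.Size([2, 256, 512])
```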
Evidence And Results
The abstract reports strong performance across multimodal understanding and generation benchmarks, with the encoder-free design improving understanding at scale.
Limitations
The claim depends on large-scale end-to-end training. The results should also be compared against the semantic-latent results in RSLWM and the spectral harmonization in Prism.
Links Into The Wiki
- Unified Multimodal Models: Tuna-2 is a major source for this page.
- Vision Foundation Models: Tuna-2 is a major source and a counterpoint to semantic-encoder-heavy approaches.
- RSLWM and Prism: points of comparison for semantic-latent and spectral-harmonization results (see Limitations).
Open Questions
- At what scale do pixel embeddings overtake pretrained vision encoders?
- Can pixel-space unification preserve planning-relevant semantics?