Reconstruction Or Semantics? What Makes A Latent Space Useful For Robotic World Models
Source
- Raw Markdown: paper_reconstruction-or-semantics-2026.md
- PDF: paper_reconstruction-or-semantics-2026.pdf
Core Claim
For robotic latent-diffusion world models, semantic latent spaces can be more policy-relevant than reconstruction-oriented autoencoding latents.
Key Contributions
- Compares six reconstruction and semantic encoders for action-conditioned latent diffusion world models.
- Evaluates them along three axes: visual fidelity, planning and downstream policy performance, and latent representation quality.
- Shows that visual fidelity alone is insufficient for selecting a world model.
- Advocates semantic latents such as V-JEPA 2.1, Web-DINO, and SigLIP 2 for policy-relevant robotics world models.
Method Notes
RSLWM (the work summarized here) connects three wiki topics: World Models, Vision Foundation Models, and Latent-Space Predictive Learning.
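The setup the paper studies, a frozen encoder feeding an action-conditioned latent diffusion model, can be sketched as a single training step. This is a toy illustration, not the paper's implementation: the random-projection `encode`, the linear `denoise`, the dimensions, and the DDPM-style noise-prediction loss are all hypothetical stand-ins chosen to keep the example small.

```python
import math
import random

random.seed(0)

FRAME_DIM, LATENT_DIM, ACTION_DIM = 12, 4, 2  # hypothetical sizes

# Stand-in for a frozen pretrained encoder (reconstruction or semantic):
# here just a fixed random projection, which is an assumption.
W_ENC = [[random.gauss(0, 1) / math.sqrt(FRAME_DIM) for _ in range(LATENT_DIM)]
         for _ in range(FRAME_DIM)]

def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n x m)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def encode(frame):
    """Map a flattened frame to a latent vector with the frozen encoder."""
    return matvec(frame, W_ENC)

def add_noise(z, noise, alpha_bar):
    """DDPM-style forward process: blend the clean latent with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * zi + b * ni for zi, ni in zip(z, noise)]

# Hypothetical linear denoiser, initialized to zero. Its input is
# [noisy next latent, current latent, action, noise level].
IN_DIM = 2 * LATENT_DIM + ACTION_DIM + 1
W_DEN = [[0.0] * LATENT_DIM for _ in range(IN_DIM)]

def denoise(z_noisy, z_cur, action, alpha_bar):
    """Predict the noise that was added to the next-frame latent."""
    return matvec(z_noisy + z_cur + action + [alpha_bar], W_DEN)

def training_loss(frame_t, action_t, frame_next, alpha_bar):
    """Noise-prediction MSE for one (frame, action, next frame) transition."""
    z_cur = encode(frame_t)
    z_next = encode(frame_next)
    noise = [random.gauss(0, 1) for _ in range(LATENT_DIM)]
    z_noisy = add_noise(z_next, noise, alpha_bar)
    pred = denoise(z_noisy, z_cur, action_t, alpha_bar)
    return sum((p - n) ** 2 for p, n in zip(pred, noise)) / LATENT_DIM

frame_t = [random.gauss(0, 1) for _ in range(FRAME_DIM)]
frame_next = [random.gauss(0, 1) for _ in range(FRAME_DIM)]
action_t = [random.gauss(0, 1) for _ in range(ACTION_DIM)]
loss = training_loss(frame_t, action_t, frame_next, alpha_bar=0.5)
```

In this sketch only `encode` would change when swapping a reconstruction encoder for a semantic one; the dynamics model never sees pixels, so the choice of latent space determines what it has to predict.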
Evidence And Results
The abstract reports that reconstruction encoders can score best on pixel-level fidelity metrics while semantic encoders perform better on the policy and representation-quality axes.
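A small sketch of why a single axis can mislead encoder selection: pick the best encoder separately on each axis and check whether the winners agree. The scores below are placeholder values for illustration only, not results from the paper, and the encoder and axis names (`encoder_a`, `encoder_b`, `pixel_error`, `policy_success`, `probe_accuracy`) are hypothetical.

```python
# Placeholder scores for illustration only; these are NOT results from the paper.
# Each axis records whether lower or higher values are better.
AXES = {
    "pixel_error": "lower",
    "policy_success": "higher",
    "probe_accuracy": "higher",
}

scores = {
    "encoder_a": {"pixel_error": 0.10, "policy_success": 0.40, "probe_accuracy": 0.55},
    "encoder_b": {"pixel_error": 0.20, "policy_success": 0.60, "probe_accuracy": 0.70},
}

def best_per_axis(scores, axes):
    """Return the best-scoring encoder name for each axis."""
    winners = {}
    for axis, direction in axes.items():
        pick = min if direction == "lower" else max
        winners[axis] = pick(scores, key=lambda name: scores[name][axis])
    return winners

winners = best_per_axis(scores, AXES)
# The axes disagree whenever more than one distinct winner appears.
axes_disagree = len(set(winners.values())) > 1
```

With these placeholder numbers the pixel axis and the policy axis pick different encoders, which is the pattern the abstract describes.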
Limitations
The conclusion is specific to action-conditioned robotic LDMs and BridgeV2-style evaluation, and should not be generalized to all visual generation tasks.
Links Into The Wiki
Open Questions
- Which semantic latent features are most responsible for policy improvements?
- Can semantic latents retain enough geometry for contact-rich manipulation?