Reconstruction Or Semantics? What Makes A Latent Space Useful For Robotic World Models

Source

Core Claim

For robotic latent-diffusion world models, semantic latent spaces can be more policy-relevant than reconstruction-oriented autoencoding latents.

Key Contributions

  • Compares six reconstruction and semantic encoders for action-conditioned latent diffusion world models.
  • Evaluates them along three axes: visual fidelity, planning and downstream policy performance, and latent representation quality.
  • Shows that visual fidelity alone is insufficient for world-model selection.
  • Advocates semantic latents such as V-JEPA 2.1, Web-DINO, and SigLIP 2 for policy-relevant robotics world models.
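To make two of the evaluation axes above concrete, here is a minimal sketch of one pixel-level fidelity metric (PSNR) and one representation-quality check (a held-out linear probe). These are common generic choices, not necessarily the metrics the paper uses. The function names `psnr` and `linear_probe_r2` are hypothetical, and any arrays passed in are assumed to be the caller's own data.

```python
import numpy as np


def psnr(reference, prediction, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    Assumes pixel values lie in [0, max_val]. Higher means closer pixels.
    """
    reference = np.asarray(reference, dtype=np.float64)
    prediction = np.asarray(prediction, dtype=np.float64)
    mse = np.mean((reference - prediction) ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


def linear_probe_r2(train_z, train_y, test_z, test_y):
    """Fit a least-squares linear probe on frozen latents, score R^2 on held-out data.

    train_z, test_z: (n, d) latent features from a frozen encoder.
    train_y, test_y: (n, k) targets, e.g. object or gripper state (an assumption).
    """
    def with_bias(z):
        z = np.asarray(z, dtype=np.float64)
        return np.hstack([z, np.ones((z.shape[0], 1))])

    train_y = np.asarray(train_y, dtype=np.float64)
    test_y = np.asarray(test_y, dtype=np.float64)
    weights, *_ = np.linalg.lstsq(with_bias(train_z), train_y, rcond=None)
    residual = test_y - with_bias(test_z) @ weights
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((test_y - test_y.mean(axis=0)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

An encoder can score well on `psnr` after decoding and still score poorly on a probe like `linear_probe_r2`, which is the kind of disagreement between axes that the contributions above describe.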

Method Notes

RSLWM, the shorthand these notes use for the paper, connects three research threads: world models, vision foundation models, and latent-space predictive learning.
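The interface of an action-conditioned latent world model can be sketched as a one-step predictor rolled forward over an action sequence. The sketch below is illustrative only: the paper's predictor is a latent diffusion model, whereas `make_linear_step` is a hypothetical deterministic linear stand-in used purely to show the data flow, and the initial latent is assumed to come from a frozen encoder that is not shown.

```python
import numpy as np


def rollout_latents(z0, actions, step_fn):
    """Roll a one-step latent predictor forward over a sequence of actions.

    z0: (d,) initial latent, assumed to come from a frozen encoder.
    actions: iterable of (m,) action vectors.
    step_fn: callable (z, a) -> next z; a placeholder for the learned model.
    Returns an array of shape (len(actions) + 1, d), including z0.
    """
    latents = [np.asarray(z0, dtype=np.float64)]
    for action in actions:
        action = np.asarray(action, dtype=np.float64)
        latents.append(step_fn(latents[-1], action))
    return np.stack(latents)


def make_linear_step(A, B):
    """Hypothetical stand-in dynamics z' = A z + B a (not a diffusion model)."""
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)

    def step(z, a):
        return A @ z + B @ a

    return step
```

Under this interface the choice of encoder only changes what `z0` contains, which is why the same world-model setup can be compared across reconstruction and semantic latent spaces.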

Evidence And Results

The abstract reports that reconstruction encoders can win pixel metrics while semantic encoders perform better on policy and representation-quality axes.

Limitations

The conclusion is specific to action-conditioned robotic LDMs and BridgeV2-style evaluation. It should not be generalized to all visual generation tasks.

Open Questions

  • Which semantic latent features are most responsible for policy improvements?
  • Can semantic latents retain enough geometry for contact-rich manipulation?