LeWorldModel: Stable End-To-End Joint-Embedding Predictive Architecture From Pixels

Source

Core Claim

LeWorldModel trains a stable, end-to-end joint-embedding predictive architecture (JEPA) world model from raw pixels, combining next-embedding prediction with Gaussian-distribution regularization of the latent space.

Key Contributions

  • Presents a two-term objective for stable pixel world modeling.
  • Avoids common stabilization crutches: EMA target networks, pretrained encoders, auxiliary supervision, and stacks of heuristic losses.
  • Uses Gaussian-distributed latent embeddings to prevent collapse.
  • Reports fast planning and meaningful physical latent structure on control tasks.
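The two-term objective named above can be sketched in miniature. Everything below is illustrative, not the paper's implementation: the linear "encoder", the moment-matching form of the Gaussian regularizer, and all names (`jepa_two_term_loss`, `reg_weight`) are assumptions. The key structural points it does reflect are that the same online encoder embeds both frames (no EMA target network) and that a single regularization term, rather than a stack of auxiliary losses, discourages collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(W, x):
    """Toy linear 'encoder': flattened pixels -> embedding."""
    return x @ W

def jepa_two_term_loss(W, P, obs_t, act_t, obs_t1, reg_weight=1.0):
    z_t = encode(W, obs_t)                    # embed current frame
    z_t1 = encode(W, obs_t1)                  # embed next frame (same weights, no EMA)
    # Next-embedding prediction from current embedding and action
    z_pred = np.concatenate([z_t, act_t], axis=1) @ P
    pred_loss = np.mean((z_pred - z_t1) ** 2)
    # Stand-in Gaussian regularizer: match batch moments to N(0, I)
    # so embeddings cannot collapse to a constant
    mean, var = z_t1.mean(axis=0), z_t1.var(axis=0)
    reg = np.mean(mean ** 2) + np.mean((var - 1.0) ** 2)
    return pred_loss + reg_weight * reg
```

Note how a collapsed encoder (all embeddings equal) drives the prediction term to zero but makes the variance term large, which is the intuition behind using a distributional constraint instead of architectural tricks.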

Method Notes

LeWorldModel operationalizes ideas from APTAMI, LeJEPA, and World Models.

Evidence And Results

The abstract reports training a model of roughly 15M parameters on a single GPU, planning up to 48× faster than foundation-model-based world models, and competitive control performance across 2D and 3D tasks.
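Planning speed comes from rolling out trajectories entirely in the small latent space rather than in pixels. A minimal picture of this, assuming a simple random-shooting planner (the paper's actual planner may differ, and `plan_random_shooting` with its signature is hypothetical): sample candidate action sequences, roll each out through the learned dynamics in embedding space, and keep the cheapest.

```python
import numpy as np

def plan_random_shooting(dynamics, z0, goal_z, horizon=8,
                         n_samples=256, act_dim=2, rng=None):
    """Sample candidate action sequences, roll each out through the latent
    dynamics, and return the sequence whose final embedding lands closest
    to the goal embedding."""
    if rng is None:
        rng = np.random.default_rng(0)
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, act_dim))
    best_cost, best_seq = np.inf, None
    for seq in candidates:
        z = z0
        for a in seq:
            z = dynamics(z, a)        # one latent-space rollout step
        cost = float(np.sum((z - goal_z) ** 2))
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost
```

With a ~15M-parameter dynamics model, each rollout step is one small forward pass, which is where a large speedup over foundation-model rollouts would plausibly come from.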

Limitations

The paper notes that it remains limited by short planning horizons, by the coverage of its offline training data, and by its reliance on explicit action labels.

Open Questions

  • Can LeWorldModel scale to long-horizon hierarchical planning?
  • Can inverse dynamics reduce dependence on explicit action labels?
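The inverse-dynamics question can be made concrete with a toy experiment (entirely synthetic, not from the paper): if transitions follow a simple linear rule, a linear inverse-dynamics model fit by least squares recovers the actions from pairs of consecutive embeddings alone, suggesting how action labels might be replaced by inferred actions. The dimensions and the rule `z' = z + a @ B` below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
latent_dim, act_dim = 4, 2

# Synthetic transitions under a toy linear rule z' = z + a @ B,
# where the actions a would be unobserved at deployment time.
B = rng.normal(size=(act_dim, latent_dim))
z = rng.normal(size=(500, latent_dim))
a = rng.normal(size=(500, act_dim))
z_next = z + a @ B

# Inverse-dynamics model: a linear map from (z, z') back to a,
# fit by ordinary least squares.
X = np.concatenate([z, z_next], axis=1)
G, *_ = np.linalg.lstsq(X, a, rcond=None)
recon_err = float(np.mean((X @ G - a) ** 2))  # near zero in this toy setup
```

Real embeddings would need a nonlinear inverse model and would only pseudo-label actions approximately, but the same fit-then-relabel structure applies.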