Joint Embedding Predictive Architecture
Summary
JEPA is the wiki’s central pattern for learning by predicting in representation space instead of reconstructing raw observations or generating tokens.
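The predict-in-representation-space idea can be made concrete in a few lines. The toy sketch below is illustrative only (the linear encoder, identity predictor, and all names are assumptions, not code from any cited paper): it encodes a context and a target with a shared encoder, predicts the target's embedding from the context, and scores the prediction in embedding space rather than against raw pixels or tokens.

```python
# Toy JEPA-style objective: predict the target's *embedding*, not the target itself.
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_EMB = 8, 4
W_enc = rng.normal(size=(D_IN, D_EMB)) * 0.1  # shared encoder (toy linear map)
W_pred = np.eye(D_EMB)                        # predictor (toy identity init)

def encode(x):
    return x @ W_enc

def jepa_loss(x_context, x_target):
    s_ctx = encode(x_context)
    s_tgt = encode(x_target)   # real systems stabilize this branch (stop-grad / EMA teacher)
    s_hat = s_ctx @ W_pred     # predict the target embedding from the context embedding
    return float(np.mean((s_hat - s_tgt) ** 2))  # loss lives in representation space

x = rng.normal(size=(16, D_IN))
loss = jepa_loss(x, x + 0.01 * rng.normal(size=x.shape))
print(round(loss, 6))
```

Because the loss compares embeddings, the encoder is free to discard unpredictable detail; this is also why, without an extra stabilizer or regularizer, the trivial constant-embedding solution collapses the loss to zero.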
What The Wiki Currently Believes
- A Path Towards Autonomous Machine Intelligence frames JEPA as a building block for predictive world models and hierarchical planning.
- Introduction to Latent Variable Energy-Based Models presents H-JEPA as a hierarchical stack of joint embedding predictors for multi-level prediction under uncertainty.
- LeJEPA argues that JEPA embeddings should match an explicit target distribution, specifically an isotropic Gaussian, and proposes SIGReg as a scalable way to enforce it.
- LeWorldModel applies JEPA to action-conditioned pixel world modeling with a two-term objective.
- VL-JEPA extends the idea to vision-language learning by predicting target text embeddings rather than generating text tokens autoregressively.
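The SIGReg claim above can be illustrated with a sketch. SIGReg tests whether embeddings look isotropic Gaussian along random one-dimensional projections; the code below substitutes a simple moment-matching penalty for the paper's actual test statistic, so the function name and penalty form are assumptions for illustration, not LeJEPA's implementation.

```python
# Sketched Gaussian penalty (illustrative stand-in for SIGReg): project embeddings
# onto random unit directions and penalize deviation of each 1-D projection from
# a standard Gaussian, here via its first three moments.
import numpy as np

def sketched_gaussian_penalty(z, n_dirs=32, seed=0):
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(z.shape[1], n_dirs))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # random unit directions
    p = z @ dirs                                         # (batch, n_dirs) 1-D projections
    mean, var = p.mean(axis=0), p.var(axis=0)
    skew = ((p - mean) ** 3).mean(axis=0) / np.maximum(var, 1e-8) ** 1.5
    # An isotropic standard Gaussian has mean 0, variance 1, skewness 0
    # along every direction, so each term below vanishes in the ideal case.
    return float((mean ** 2 + (var - 1.0) ** 2 + skew ** 2).mean())

rng = np.random.default_rng(1)
z_gauss = rng.normal(size=(4096, 16))  # embeddings already isotropic Gaussian
z_collapsed = np.ones((4096, 16))      # fully collapsed embeddings
pen_g = sketched_gaussian_penalty(z_gauss)
pen_c = sketched_gaussian_penalty(z_collapsed)
print(pen_g < pen_c)  # expect True: collapse is heavily penalized
```

The design point this illustrates: because collapsed embeddings have zero variance along every projection, a distribution-matching penalty of this kind rules out collapse directly, which is why LeJEPA positions it as an alternative to stop-gradient and teacher-student stabilizers.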
Evidence
The source set shows JEPA moving from architecture proposal to theory, then to domain-specific systems: autonomous intelligence in APTAMI, lecture-note grounding in LVEBM, theory and regularization in LeJEPA, pixel control in LeWorldModel, and vision-language tasks in VL-JEPA.
Open Questions
- Can SIGReg-style Gaussian regularization replace stop-gradient and teacher-student stabilizers at very large multimodal scale?
- Which domains require latent variables beyond deterministic embeddings?