Representation Collapse

Summary

Representation collapse is the failure mode in which a predictive representation learner maps all inputs to uninformative or nearly identical embeddings. Purely predictive objectives admit this as a trivial minimizer: if the encoder outputs a constant, the predictor can match it exactly and the loss drops to zero without any useful structure being learned.
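
To make the definition concrete, the sketch below computes two standard collapse indicators for a batch of embeddings: the average per-dimension variance (near zero under complete collapse) and the effective rank of the covariance spectrum (far below the embedding dimension under dimensional collapse). This is a generic diagnostic, not drawn from the cited papers; the function name and the choice of PyTorch are assumptions.

```python
import torch

def collapse_diagnostics(z: torch.Tensor) -> dict:
    """Rough collapse indicators for a batch of embeddings z of shape (N, D).

    Complete collapse shows up as near-zero variance (every input maps to
    roughly one point); dimensional collapse shows up as a covariance
    spectrum concentrated in a few directions. Thresholds are left to the
    caller; this function is illustrative, not from the cited papers.
    """
    z = z - z.mean(dim=0, keepdim=True)            # center the batch
    cov = (z.T @ z) / (z.shape[0] - 1)             # empirical covariance, (D, D)
    eigvals = torch.linalg.eigvalsh(cov).clamp(min=0.0)
    p = eigvals / eigvals.sum().clamp(min=1e-12)   # normalized spectrum
    # Effective rank: exponentiated entropy of the spectrum; equals D for an
    # isotropic distribution, approaches 1 under dimensional collapse.
    effective_rank = torch.exp(-(p * p.clamp(min=1e-12).log()).sum())
    return {
        "mean_variance": eigvals.mean().item(),    # ~0 => complete collapse
        "effective_rank": effective_rank.item(),
    }
```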

What The Wiki Currently Believes

  • LeJEPA argues that a good JEPA objective should drive embeddings toward an isotropic Gaussian target distribution.
  • LeWorldModel uses Gaussian regularization to stabilize end-to-end pixel world-model training without EMA target networks, pretrained encoders, or auxiliary supervision (a simplified moment-matching sketch follows this list).
  • NEPA uses next-embedding prediction with causal masking and stop-gradient, showing that a simpler visual predictive objective can work without pixel reconstruction or discrete tokens (a stop-gradient sketch appears after the Evidence paragraph below).
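
To make the Gaussian side of the list concrete, here is a minimal moment-matching sketch: it penalizes the batch mean for deviating from zero and the empirical covariance for deviating from the identity, the first two moment conditions of an isotropic Gaussian. It is a simplified stand-in for the regularizers the wiki attributes to LeJEPA and LeWorldModel, not their published objectives; the function name and the squared-error form are assumptions.

```python
import torch

def isotropic_gaussian_penalty(z: torch.Tensor) -> torch.Tensor:
    """Penalize the first two moments of a batch of embeddings z (N, D) for
    deviating from those of an isotropic Gaussian: zero mean, identity
    covariance. A simplified stand-in for the Gaussian regularization the
    wiki attributes to LeJEPA and LeWorldModel, not their exact objectives.
    """
    n, d = z.shape
    mean = z.mean(dim=0)
    centered = z - mean
    cov = (centered.T @ centered) / (n - 1)
    identity = torch.eye(d, device=z.device, dtype=z.dtype)
    # Both terms are zero exactly when the batch moments match N(0, I).
    return mean.pow(2).sum() + (cov - identity).pow(2).sum()
```

In practice a penalty like this would be added to the predictive loss with a tunable weight, so the encoder cannot minimize prediction error by collapsing the embedding distribution.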

Evidence

The sources agree that collapse prevention is central, but they diverge on mechanism: LeJEPA and LeWorldModel match the embedding distribution against an explicit Gaussian target, while NEPA relies on stop-gradient predictive training with no explicit target distribution.
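
For contrast with the moment-matching sketch above, here is a minimal sketch of the stop-gradient side. The encoder, predictor, and cosine loss are illustrative placeholders, and the causal masking attributed to NEPA is simplified to per-step prediction; the one load-bearing detail is the detach on the target branch.

```python
import torch
import torch.nn.functional as F

def next_embedding_loss(encoder, predictor, frames: torch.Tensor) -> torch.Tensor:
    """One stop-gradient next-embedding prediction step, sketched.

    frames: (B, T, ...) observation sequence. The embedding at step t is
    asked to predict the embedding at step t + 1; the target branch is
    detached so gradients flow only through the prediction side. Causal
    masking and the predictor architecture are simplified away here.
    """
    b, t = frames.shape[:2]
    z = encoder(frames.flatten(0, 1)).view(b, t, -1)   # (B, T, D)
    pred = predictor(z[:, :-1])                        # predictions for steps 1..T-1
    target = z[:, 1:].detach()                         # stop-gradient target branch
    # Negative cosine similarity on normalized embeddings; the loss form is
    # an illustrative choice, not taken from the sources.
    return -(F.normalize(pred, dim=-1) * F.normalize(target, dim=-1)).sum(dim=-1).mean()
```

Any encoder and predictor with matching embedding width would exercise the sketch, e.g. a hypothetical predictor = torch.nn.Linear(256, 256). The detach means gradients flow only through the prediction side, which is what empirically discourages the encoder from drifting toward a constant output in this family of methods.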

Open Questions

  • Which collapse-prevention mechanism is most robust at frontier data and model scales?
  • Can a single target embedding distribution work across visual, temporal, and language modalities?