Perception Encoder

Source

Core Claim

Perception Encoder shows that a carefully scaled contrastive vision-language model can learn strong general visual features for classification, VQA, grounding, detection, tracking, and depth, but those features are often hidden in intermediate layers rather than exposed at the final output.
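
The layer-probing setup behind this claim is easy to reproduce on any stacked encoder. Below is a minimal sketch, assuming a toy transformer as a stand-in for PE Core (no PE weights or API are used); `ToyEncoder` and `collect_layer_features` are hypothetical names.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for a frozen ViT-style encoder; PE Core weights are not loaded."""
    def __init__(self, dim=64, depth=8, heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

@torch.no_grad()
def collect_layer_features(model, x):
    """Mean-pooled token features from every block, gathered via forward hooks."""
    feats, handles = [], []
    for blk in model.blocks:
        handles.append(blk.register_forward_hook(
            lambda m, inp, out: feats.append(out.mean(dim=1))  # pool over tokens
        ))
    model.eval()
    model(x)
    for h in handles:
        h.remove()
    return feats  # one (batch, dim) tensor per layer

enc = ToyEncoder()
tokens = torch.randn(2, 16, 64)                # (batch, tokens, dim) dummy patches
layer_feats = collect_layer_features(enc, tokens)
print(len(layer_feats), layer_feats[0].shape)  # 8 torch.Size([2, 64])
```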

Key Contributions

  • Builds PE Core from robust image-text contrastive pretraining plus a video data engine that recaptions videos for image-video contrastive finetuning.
  • Runs layerwise frozen-feature analysis across classification, language, and spatial tasks, finding that PE Core’s intermediate layers can match specialized AIMv2-style language features and DINOv2-style spatial features (a minimal probe sketch follows this list).
  • Shows that the final PE Core layer can be much worse than intermediate layers for some downstream tasks, especially grounding and spatial tasks.
  • Introduces language alignment, which uses MLLM-style training to lift useful intermediate features to the output for multimodal language modeling.
  • Introduces spatial alignment that self-distills semantic PE features and distills SAM 2.1 mask-logit correspondences to improve dense prediction and locality.
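
A hedged sketch of the layerwise frozen-feature analysis: fit a cheap linear probe on each layer’s features and compare accuracy across depth. Synthetic tensors stand in for real task data, and `ridge_probe_accuracy` is a hypothetical helper, not the paper’s evaluation code.

```python
import torch
import torch.nn.functional as F

def ridge_probe_accuracy(feats, labels, num_classes, lam=1e-2):
    """Closed-form ridge regression to one-hot targets; returns train accuracy."""
    X = torch.cat([feats, torch.ones(len(feats), 1)], dim=1)   # add bias column
    Y = F.one_hot(labels, num_classes).float()
    # W = (X^T X + lam * I)^{-1} X^T Y
    W = torch.linalg.solve(X.T @ X + lam * torch.eye(X.shape[1]), X.T @ Y)
    preds = (X @ W).argmax(dim=1)
    return (preds == labels).float().mean().item()

# Sweep every layer; in the paper's setting the best layer is often
# an intermediate one, not the final output.
torch.manual_seed(0)
num_classes, dim, n = 5, 64, 200
labels = torch.randint(0, num_classes, (n,))
fake_layer_feats = [torch.randn(n, dim) for _ in range(8)]     # one per layer
accs = [ridge_probe_accuracy(f, labels, num_classes) for f in fake_layer_feats]
best = max(range(len(accs)), key=accs.__getitem__)
print(f"best layer: {best}, acc: {accs[best]:.3f}")
```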

Main Takeaways

The paper extends the Guillotine Regularization lesson from projector heads to full-scale vision encoders. Strong pretraining can create general internal state that the model’s final interface does not expose. For PE Core, the final layers behave partly like a decoder for the global CLIP-style objective rather than as a universal feature endpoint.
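
One way to make the “decoder-like final layers” claim concrete is a depth-wise similarity diagnostic. The sketch below computes linear CKA between each layer and the final layer; a sharp drop near the top is consistent with late layers specializing for the global objective. This is a generic diagnostic with random placeholder features, not the paper’s protocol.

```python
import torch

def linear_cka(x, y):
    """Linear CKA between two (n, d) feature matrices."""
    x = x - x.mean(dim=0)
    y = y - y.mean(dim=0)
    xty = (x.T @ y).norm() ** 2                 # Frobenius norms throughout
    return (xty / ((x.T @ x).norm() * (y.T @ y).norm())).item()

# `layer_feats` would come from the extraction sketch above.
layer_feats = [torch.randn(200, 64) for _ in range(8)]  # placeholder features
sims = [linear_cka(f, layer_feats[-1]) for f in layer_feats]
for i, s in enumerate(sims):
    print(f"layer {i:2d} vs final: CKA={s:.3f}")
```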

The paper also gives a mitigation path. Instead of only cutting to the best layer at inference time, PE uses alignment tuning to move task-useful internal features to the output: language alignment for MLLM use, and spatial alignment for dense prediction.
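
A minimal sketch of the alignment idea, under loose assumptions: treat a frozen copy’s intermediate layer k as the teacher and finetune only the blocks above it so the student’s final output matches the teacher features. The paper’s actual recipes differ (MLLM-style training for language alignment; SAM 2.1 mask logits for spatial alignment); `align_top_blocks` and the cosine loss are illustrative stand-ins reusing the `ToyEncoder` from the first sketch.

```python
import copy
import torch
import torch.nn.functional as F

def align_top_blocks(model, sample_batch, teacher_layer=5, steps=100, lr=1e-4):
    """Self-distill layer `teacher_layer` of a frozen copy into the final output."""
    teacher = copy.deepcopy(model).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    # Freeze the student up to and including the teacher layer; tune the rest.
    for blk in model.blocks[: teacher_layer + 1]:
        for p in blk.parameters():
            p.requires_grad_(False)
    params = [p for blk in model.blocks[teacher_layer + 1:] for p in blk.parameters()]
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(steps):
        x = sample_batch()                       # (batch, tokens, dim)
        with torch.no_grad():
            t = x
            for blk in teacher.blocks[: teacher_layer + 1]:
                t = blk(t)                       # teacher stops at layer k
        s = model(x)                             # student runs the full depth
        loss = 1 - F.cosine_similarity(s.mean(1), t.mean(1), dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Usage with the ToyEncoder stand-in:
# aligned = align_top_blocks(ToyEncoder(), lambda: torch.randn(8, 16, 64))
```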

Gotchas

  • “Contrastive pretraining is enough” is only true with a strong recipe and careful layer handling. The paper’s strongest general features are not simply the last-layer CLIP embedding.
  • The best layer is task-specific: spatial tracking peaks before global-token behavior appears, semantic dense tasks can use later global-token layers, and language tasks prefer still different layers (see the per-task sweep after this list).
  • Alignment tuning is not free evidence that the original output was good. It is a downstream repair step that changes which layer is useful.
  • The paper is vision-centric. For time-series or world-model use, the analogous lesson is to inspect latent states across depth and objective heads rather than assuming the final embedding preserves all downstream-relevant dynamics.
  • The released PE Video Dataset is important context for the model, but dataset payloads should not be committed into this repository.
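
To operationalize the first two gotchas, sweep a probe over layers separately per task and record where each peaks. A self-contained sketch with a nearest-centroid probe and random placeholder features; all names are hypothetical.

```python
import torch

def centroid_probe_accuracy(feats, labels, num_classes):
    """Nearest-class-centroid probe: a cheap stand-in for a linear probe."""
    centroids = torch.stack([feats[labels == c].mean(0) for c in range(num_classes)])
    preds = torch.cdist(feats, centroids).argmin(dim=1)
    return (preds == labels).float().mean().item()

def best_layer_per_task(layer_feats, task_labels, num_classes):
    """Return {task: index of the layer whose probe accuracy peaks}."""
    return {
        task: max(range(len(layer_feats)),
                  key=lambda i: centroid_probe_accuracy(layer_feats[i], labels,
                                                        num_classes))
        for task, labels in task_labels.items()
    }

torch.manual_seed(0)
layer_feats = [torch.randn(200, 64) for _ in range(8)]          # placeholder
tasks = {t: torch.randint(0, 5, (200,)) for t in ("cls", "track")}
print(best_layer_per_task(layer_feats, tasks, num_classes=5))
```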

Open Questions

  • Can alignment tuning preserve all useful intermediate features, or does each alignment still privilege one downstream family?
  • Which layerwise diagnostics should be standard when evaluating a new visual or temporal encoder?
  • Does the same “hidden general features” pattern appear in time-series foundation models trained with forecasting, reconstruction, contrastive, or JEPA-style objectives?