Intermediate-Layer Representations
Summary
Work on intermediate-layer representations is a recurring warning against treating the final model output as the default state for downstream transfer. The final layer is often an interface optimized for the pretraining objective; the most reusable latent state may live earlier in the network.
What The Wiki Currently Believes
- Guillotine Regularization shows that SSL projector heads can help optimization while making the final representation worse for downstream tasks; the best layer to keep can be the backbone output, an intermediate projector layer, or an earlier trunk layer (see the layerwise probing sketch after this list).
- Perception Encoder shows the same pattern at large vision-language scale: PE Core learns strong general features, but the best language and spatial features are hidden in intermediate layers until alignment tuning lifts them to the output.
- TiViT is a time-series-adjacent example: time series rendered as images can be encoded with frozen vision models, and the useful signal sits in hidden representations rather than only in the final outputs.
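A minimal sketch of the probing pattern these results rely on: tap every block of a frozen encoder with forward hooks and fit one linear probe per layer on the target task. The `model.blocks` attribute and the (batch, tokens, dim) block outputs are assumptions about a ViT-style model, not a specific library API.

```python
import torch
from sklearn.linear_model import LogisticRegression


def collect_layerwise_features(model, loader, device="cpu"):
    """Run the frozen model once, capturing a pooled feature per block."""
    feats = {i: [] for i in range(len(model.blocks))}
    labels, hooks = [], []

    def make_hook(i):
        def hook(_module, _inputs, output):
            # Mean-pool tokens so every layer yields one vector per sample;
            # assumes each block emits a (batch, tokens, dim) tensor.
            feats[i].append(output.mean(dim=1).detach().cpu())
        return hook

    for i, block in enumerate(model.blocks):
        hooks.append(block.register_forward_hook(make_hook(i)))
    model.eval()
    with torch.no_grad():
        for x, y in loader:
            model(x.to(device))
            labels.append(y)
    for h in hooks:
        h.remove()
    ys = torch.cat(labels).numpy()
    return {i: torch.cat(c).numpy() for i, c in feats.items()}, ys


def probe_each_layer(train_feats, train_ys, val_feats, val_ys):
    """One linear probe per layer, scored on the *target* split."""
    scores = {}
    for i in train_feats:
        clf = LogisticRegression(max_iter=1000).fit(train_feats[i], train_ys)
        scores[i] = clf.score(val_feats[i], val_ys)
    return scores
```

The best-scoring index is the "cut point" in Guillotine terms, but it is only valid for this task, split, and pooling rule; see the gotchas below.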
Gotchas
- The best layer is a protocol variable. It can change with the target task, target data distribution, OOD shift, optimizer, architecture, and whether the target task matches the pretraining invariances.
- “Discard the head” is too coarse. A projector or late block can contain useful intermediate states even if its final output is too objective-specific.
- Layerwise probes should use the target evaluation protocol. Source-task or projector-only probes can miss downstream-relevant state.
- Alignment tuning and layer cutting are different moves. Layer cutting selects an already useful internal state; alignment tuning changes the model so that a desired state is exposed at the output (both are contrasted in the sketch after this list).
- Invariance can erase information. For time-series and world-model work, augmentations or objectives that remove scale, phase, local detail, channel identity, action information, or exogenous variables may make final embeddings brittle even when earlier layers still encode those factors.
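To make the alignment-vs-cutting gotcha concrete, here is the difference between the two moves as code. A hedged sketch assuming a ViT-style model with a `blocks` list and a `head`; the truncation index and adapter widths are placeholders.

```python
import copy

import torch
import torch.nn as nn


def cut_at_block(model, k):
    """Layer cutting: keep an existing internal state by truncating at k.

    No training happens; we only stop reading the model earlier.
    Assumes a ViT-style layout with a `blocks` list and a `head`.
    """
    cut = copy.deepcopy(model)
    cut.blocks = nn.ModuleList(list(cut.blocks)[: k + 1])
    cut.head = nn.Identity()  # drop the objective-specific interface
    return cut


class AlignmentAdapter(nn.Module):
    """Alignment tuning: train a small map so the *output* exposes the
    desired state; the frozen encoder's internals are unchanged.
    Widths below are placeholders."""

    def __init__(self, encoder, dim=768, out_dim=512):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.adapter = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, out_dim))

    def forward(self, x):
        with torch.no_grad():
            z = self.encoder(x)
        return self.adapter(z)
```

Cutting costs nothing but can only select states the pretraining already produced; alignment tuning spends an adaptation budget to reshape the interface itself, which is the move the Perception Encoder entry above describes.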
Implications For Time-Series And World Models
For time-series encoders, the analogous question is not “which visual layer is best?” but “which latent state preserves the dynamics needed by the downstream task?” A forecasting head, reconstruction head, contrastive projector, or language-alignment adapter may optimize a useful interface while suppressing variables needed for classification, anomaly detection, counterfactual prediction, or action-conditioned world modeling.
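A toy illustration of that suppression, assuming a contrastive-style encoder that instance-normalizes each window (a common scale invariance). Amplitude, which an anomaly detector may need, is trivially readable before normalization and unrecoverable afterward; everything here is synthetic.

```python
import torch
import torch.nn as nn


class ToyTSEncoder(nn.Module):
    """Toy encoder whose instance-norm front end enforces scale invariance."""

    def __init__(self, window=128, dim=64):
        super().__init__()
        self.backbone = nn.Linear(window, dim)   # stand-in for a real trunk
        self.projector = nn.Linear(dim, dim)

    def forward(self, x):
        early = x  # pre-normalization state: amplitude still present
        z = (x - x.mean(-1, keepdim=True)) / (x.std(-1, keepdim=True) + 1e-8)
        hidden = self.backbone(z)  # scale information is gone from here on
        return early, hidden, self.projector(hidden)


x = torch.randn(256, 128) * (10 * torch.rand(256, 1))  # varied amplitudes
early, hidden, final = ToyTSEncoder()(x)
amplitude = x.std(dim=-1)
for name, state in [("early", early), ("hidden", hidden), ("final", final)]:
    # Crude check: is amplitude still linearly visible in the state's norm?
    r = torch.corrcoef(torch.stack([state.norm(dim=-1), amplitude]))[0, 1]
    print(f"{name:6s} corr(|state|, amplitude) = {r:+.2f}")
```

Run as-is, the early state correlates almost perfectly with amplitude while the hidden and final states do not: the invariance that helps the pretext task is destructive for the amplitude-dependent one.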
When evaluating temporal representation models, the wiki should prefer reports that identify the probed layer, head, pooling rule, adaptation budget, target split, and whether the evaluation is in-distribution or OOD.
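One lightweight way to enforce that preference is to record those fields as a structured object rather than prose. The field names below are suggestions, not an established schema.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ProbeReport:
    """Minimal provenance for a layerwise-transfer claim."""
    model: str         # checkpoint identifier
    layer: int         # which block/state was probed
    head: str          # "none", "projector", "forecast-head", ...
    pooling: str       # "cls", "mean-tokens", "last-step", ...
    adaptation: str    # "linear-probe", "lora-r8", "full-ft", ...
    target_split: str  # dataset and split the score refers to
    distribution: Literal["in-distribution", "ood"]
    score: float


# Illustrative values only; no real result is being reported here.
report = ProbeReport(model="some-vit-l14", layer=17, head="none",
                     pooling="mean-tokens", adaptation="linear-probe",
                     target_split="target-task/val",
                     distribution="ood", score=0.0)
```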
Open Questions
- Can one pretraining objective produce a final representation that is simultaneously good for global semantics, local structure, temporal dynamics, and intervention-sensitive state?
- Which layerwise diagnostics best predict downstream transfer before training many probes? (One cheap candidate is sketched after this list.)
- Should foundation-model releases expose stable intermediate-layer APIs by default?
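On the diagnostics question, one cheap candidate is linear CKA between each layer's features and a target representation (one-hot labels, or features from a task-tuned reference): closed form, no probe training. Whether its layer ranking actually predicts transfer is precisely what remains open; the sketch below only shows the mechanics.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between feature matrices (n x d1, n x d2), after centering.

    Closed form (Kornblith et al., 2019): no probe has to be trained.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(x.T @ y) ** 2  # ||X^T Y||_F^2
    return hsic / (np.linalg.norm(x.T @ x) * np.linalg.norm(y.T @ y))


def rank_layers(layer_feats, labels, n_classes):
    """Rank layers by CKA with one-hot labels (a crude stand-in target)."""
    onehot = np.eye(n_classes)[labels]
    return sorted(layer_feats,
                  key=lambda i: -linear_cka(layer_feats[i], onehot))
```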