Learning is Forgetting: LLM Training As Lossy Compression
Source
- Raw Markdown: paper_learning-is-forgetting-2026.md
- PDF: paper_learning-is-forgetting-2026.pdf
- Preprint: arXiv 2604.07569
- ICLR/OpenReview PDF: ICLR 2026 version
- Code: soft_h
- Technical blog / web version: LLMs are a Lossy Compression of the Internet
- Princeton companion blog: I am, myself, a lossy compression
- X announcement: Seraphina Goldfarb-Tarrant thread
- Gonzo ML discussion: post 5327
- Third-party review: ArxivIQ note
Status And Credibility
Submitted April 8, 2026 as arXiv 2604.07569 and presented as an ICLR 2026 paper. The author affiliations in the paper are Princeton University and Cohere, and the code is public. Treat as credible current evidence for information-theoretic analysis of LLM training, while remembering that entropy estimation of continuous representations is still an approximation.
Core Claim
The paper frames LLM training as lossy compression. During training, a model should retain information from the input that is useful for the objective and discard information that is not. In the Information Bottleneck notation, useful compression pushes representations toward high target information with lower input complexity:
$$\min \; I(X; Z) - \beta \, I(Y; Z)$$

where X is input information, Z is the representation, and Y is the target prediction. In this view, learning is not only adding knowledge. It is also learning what to forget.
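The objective above can be evaluated exactly on a toy discrete problem, which makes the trade-off concrete. The sketch below is illustrative only and is not taken from the paper or its code: the function names `mutual_information` and `ib_objective`, the four-symbol input, and the choice `beta = 2.0` are all assumptions made for the example.

```python
import numpy as np


def mutual_information(joint):
    """Mutual information in bits from a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_a = joint.sum(axis=1, keepdims=True)   # row marginal, shape (A, 1)
    p_b = joint.sum(axis=0, keepdims=True)   # column marginal, shape (1, B)
    mask = joint > 0                          # skip zero cells: 0 * log 0 = 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_a @ p_b)[mask])))


def ib_objective(p_xy, p_z_given_x, beta):
    """I(X;Z) - beta * I(Y;Z) for a discrete encoder p(z|x).

    Assumes the Markov chain Y - X - Z, i.e. Z sees Y only through X.
    """
    p_xy = np.asarray(p_xy, dtype=float)
    p_xy = p_xy / p_xy.sum()
    p_z_given_x = np.asarray(p_z_given_x, dtype=float)
    p_x = p_xy.sum(axis=1)                    # shape (|X|,)
    p_xz = p_x[:, None] * p_z_given_x         # joint p(x, z)
    p_yz = p_xy.T @ p_z_given_x               # p(y, z) = sum_x p(x, y) p(z|x)
    return mutual_information(p_xz) - beta * mutual_information(p_yz)


# Hypothetical toy task: X is uniform over 4 symbols and Y = X // 2.
P_XY = np.array([[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]])

# Encoder that keeps every input distinct (no forgetting).
IDENTITY = np.eye(4)

# Encoder that merges inputs sharing the same target (forgets the rest).
MERGED = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

if __name__ == "__main__":
    for name, enc in [("identity", IDENTITY), ("merged", MERGED)]:
        print(name, ib_objective(P_XY, enc, beta=2.0))
```

Both encoders keep the full 1 bit of target information, but the merged encoder spends 1 bit on the input instead of 2, so it scores lower on the objective. That is the sense in which forgetting target-irrelevant detail counts as better compression.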
Key Contributions
- Applies an Information Bottleneck and rate-distortion framing to LLM representations at practical LLM scale.
- Uses a soft entropy estimator to estimate representation information without expensive clustering or hard binning.
- Reports two-phase pretraining dynamics in OLMo2 checkpoints: an early fitting/expansion phase followed by a compression phase where input information decreases relative to target information.
- Reports that larger OLMo2 models, especially 7B and 32B, approach the compression bound more cleanly than the 1B model.
- Compares many open-weight LLMs and reports convergence near an optimal compression frontier.
- Reports that optimality of compression predicts aggregate benchmark performance across 47 open-weight models, with token-level optimality correlated with performance at r=0.52.
- Reports that preference information in model representations predicts preference-style downstream behavior more strongly, with r=0.76 for the aggregate analysis.
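To make the idea of estimating entropy without clustering or hard binning concrete, here is a minimal sketch of one possible soft-assignment estimator. It is an assumption-laden illustration, not the paper's estimator and not the API of the `soft_h` repository: the function name `soft_entropy`, the random unit-norm anchors, the cosine-similarity softmax, and the defaults `n_anchors=64` and `temperature=0.1` are all hypothetical choices.

```python
import numpy as np


def soft_entropy(reps, n_anchors=64, temperature=0.1, seed=0):
    """Entropy (bits) of the average soft assignment of vectors to random anchors.

    Illustrative only: each representation is softly assigned to anchor
    directions through a softmax over cosine similarities, so no hard bin
    edges or clustering step is needed.
    """
    rng = np.random.default_rng(seed)
    reps = np.asarray(reps, dtype=float)
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)

    # Random unit-norm anchor directions (a hypothetical choice).
    anchors = rng.normal(size=(n_anchors, reps.shape[1]))
    anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)

    # Softmax over cosine similarities, stabilized by subtracting the row max.
    logits = reps @ anchors.T / temperature
    logits -= logits.max(axis=1, keepdims=True)
    soft = np.exp(logits)
    soft /= soft.sum(axis=1, keepdims=True)

    # Average assignment over the sample, then Shannon entropy of that average.
    p = soft.mean(axis=0)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    spread = rng.normal(size=(2000, 16))   # vectors pointing in many directions
    collapsed = np.ones((2000, 16))        # every vector identical
    print("spread:", soft_entropy(spread))
    print("collapsed:", soft_entropy(collapsed))
```

The estimate is bounded by `log2(n_anchors)` and depends on the anchors, the temperature, and the sample, so it is only meaningful for comparing representations measured under the same settings.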
Why It Matters
This source gives a training-dynamics version of the compression argument. The agentic-coding source Scaling Test-Time Compute for Agentic Coding shows compression as a runtime interface: raw rollouts become summaries before selection and reuse. This source shows compression as a training outcome: model representations themselves become better when they discard irrelevant input information and preserve target-relevant structure.
For the wiki agenda, the strong synthesis is:
end-to-end learning should not preserve raw detail by default;
it should learn which detail remains predictive, controllable, or useful.

The catch is that “irrelevant” depends on the objective. If the training target is next-token prediction, the model may compress around text prediction rather than action consequences, rare failure precursors, dense numeric fidelity, or operational safety.
Method Notes
The paper estimates mutual information between model representations and labels derived from token, bigram, trigram, quadgram, and preference conditions. It uses C4 as a broad text-distribution proxy for most training-compression estimates and Tulu preference data for preference information.
The authors explicitly caution that they estimate entropy with respect to a sample and estimator, not the true continuous latent distribution. This matters for wiki use: the result is comparative evidence across models and checkpoints, not an exact physical measurement of all information inside a model.
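The sample-and-estimator caveat can be seen in a small experiment. The sketch below uses a plain plug-in (histogram) mutual information estimate over discrete codes and labels, which is a generic textbook estimator and not the paper's method. The function name `plugin_mi`, the 8-way alphabets, and the sample sizes are assumptions for illustration.

```python
import numpy as np


def plugin_mi(codes, labels):
    """Plug-in mutual information (bits) between two discrete sequences."""
    _, ci = np.unique(np.asarray(codes), return_inverse=True)
    _, li = np.unique(np.asarray(labels), return_inverse=True)
    ci, li = ci.ravel(), li.ravel()

    # Empirical joint distribution from co-occurrence counts.
    joint = np.zeros((ci.max() + 1, li.max() + 1))
    np.add.at(joint, (ci, li), 1.0)
    joint /= joint.sum()

    p_c = joint.sum(axis=1, keepdims=True)
    p_l = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_c @ p_l)[mask])))


def independent_pair(n, k=8, seed=0):
    """Two independent uniform sequences over k symbols; true MI is 0 bits."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, k, size=n), rng.integers(0, k, size=n)


if __name__ == "__main__":
    for n in (50, 500, 50_000):
        codes, labels = independent_pair(n)
        print(n, plugin_mi(codes, labels))
```

The two sequences are independent, yet the plug-in estimate is well above zero at small sample sizes and shrinks as the sample grows. This is why such estimates are best read as comparisons across models and checkpoints measured with the same data and estimator.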
Post-Training Note
The paper’s post-training appendix is useful for LLM Post-Training. It suggests pretraining learns the broad compression of text data, while post-training can change which information survives by increasing preference information with relatively small changes to general compression structure. That matches the wiki’s broader update-geometry lens: post-training is not only “more training”; it edits the representation target.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Representation quality | adjacent | Treats representation quality as goal-relevant lossy compression, not lossless memorization. | Needs analogous information probes for numeric time series, events, topology, actions, and rewards. |
| Training dynamics | partially closes | Shows a measurable expansion-to-compression trajectory in LLM pretraining checkpoints. | Needs TSFM pretraining evidence and optimizer/data-mixture sensitivity. |
| Dynamic tokenization and compression | adjacent | Gives a global learning-theoretic reason that compression can improve generalization. | Does not learn token boundaries or routing policies for time-series streams. |
| Control and counterfactuals | warning | Compression preserves information relevant to the training objective. | If the objective lacks action consequences, compression can erase control-relevant state. |
Links Into The Wiki
- Training Dynamics
- LLM Post-Training
- Hierarchical Modeling with a Fixed FLOPs Budget
- Time-Series Scaling And Efficiency
- Foundation Time-Series Model Research Agenda
- Representation Collapse
Open Questions
- Can an analogous information-plane analysis be defined for multivariate time-series models with dense numeric targets?
- Which TSFM objectives make compression preserve rare events, interventions, and action-relevant state rather than only average forecast accuracy?
- Can compression optimality become a checkpoint-selection criterion for time-series pretraining?
- How should post-training for time-series reasoning change the information that survives without damaging numeric priors?