Learning is Forgetting: LLM Training As Lossy Compression

Source

Raw Markdown: paper_learning-is-forgetting-2026.md
PDF: paper_learning-is-forgetting-2026.pdf
Preprint: arXiv 2604.07569
ICLR/OpenReview PDF: ICLR 2026 version
Code: soft_h
Technical blog / web version: LLMs are a Lossy Compression of the Internet
Princeton companion blog: I am, myself, a lossy compression
X announcement: Seraphina Goldfarb-Tarrant thread
Gonzo ML discussion: post 5327
Third-party review: ArxivIQ note

Status And Credibility

Submitted April 8, 2026 as arXiv 2604.07569 and presented as an ICLR 2026 paper. The author affiliations in the paper are Princeton University and Cohere, and the code is public. Treat it as credible current evidence for information-theoretic analysis of LLM training, while remembering that entropy estimation over continuous representations is still an approximation.

Core Claim

The paper frames LLM training as lossy compression. During training, a model should retain information from the input that is useful for the objective and discard information that is not. In the Information Bottleneck notation, useful compression pushes representations toward high target information with lower input complexity:

minimize I(X; Z) - beta * I(Y; Z)

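where X is the input, Z is the learned representation, Y is the prediction target, I(·;·) is mutual information, and beta sets how much target information is worth relative to input complexity. In this view, learning is not only adding knowledge. It is also learning what to forget.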
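To make the objective concrete, here is a minimal sketch that evaluates I(X; Z) - beta * I(Y; Z) for small discrete distributions. It is a toy illustration, not the paper's estimator: the function names (`mutual_information`, `ib_objective`), the four-symbol input, and the choice beta = 2 are all assumptions made for the example. The target Y depends only on the high bit of X, so an encoder that forgets the low bit loses no target information.

```python
import math

def mutual_information(joint):
    """I(A; B) in bits from a joint table joint[a][b] = p(a, b)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (p_a[a] * p_b[b]))
    return mi

def ib_objective(p_xy, enc, beta):
    """Return (I(X;Z) - beta*I(Y;Z), I(X;Z), I(Y;Z)).

    p_xy[x][y] is the joint p(x, y); enc[x][z] is the encoder p(z | x).
    """
    n_x, n_y, n_z = len(p_xy), len(p_xy[0]), len(enc[0])
    p_x = [sum(row) for row in p_xy]
    # p(x, z) = p(x) * p(z | x)
    p_xz = [[p_x[x] * enc[x][z] for z in range(n_z)] for x in range(n_x)]
    # p(y, z) = sum_x p(x, y) * p(z | x), since Z depends on Y only through X
    p_yz = [[sum(p_xy[x][y] * enc[x][z] for x in range(n_x))
             for z in range(n_z)] for y in range(n_y)]
    i_xz = mutual_information(p_xz)
    i_yz = mutual_information(p_yz)
    return i_xz - beta * i_yz, i_xz, i_yz

# Toy setup (hypothetical): X uniform over {0, 1, 2, 3}, Y = X // 2.
p_xy = [[0.25 if y == x // 2 else 0.0 for y in range(2)] for x in range(4)]

# Encoder A keeps everything: Z = X.
identity = [[1.0 if z == x else 0.0 for z in range(4)] for x in range(4)]
# Encoder B forgets the low bit: Z = X // 2.
forgetful = [[1.0 if z == x // 2 else 0.0 for z in range(2)] for x in range(4)]

score_identity = ib_objective(p_xy, identity, beta=2.0)
score_forgetful = ib_objective(p_xy, forgetful, beta=2.0)
```

Both encoders keep the full 1 bit of target information, but the forgetful encoder carries 1 bit about the input instead of 2, so its objective is lower. That is the sense in which discarding input detail counts as better learning under this framing.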

Key Contributions

Applies an Information Bottleneck and rate-distortion framing to LLM representations at practical LLM scale.
Uses a soft entropy estimator to estimate representation information without expensive clustering or hard binning.
Reports two-phase pretraining dynamics in OLMo2 checkpoints: an early fitting/expansion phase followed by a compression phase where input information decreases relative to target information.
Reports that larger OLMo2 models, especially 7B and 32B, approach the compression bound more cleanly than the 1B model.
Compares many open-weight LLMs and reports convergence near an optimal compression frontier.
Reports that optimality of compression predicts aggregate benchmark performance across 47 open-weight models, with token-level optimality correlated with performance at r=0.52.
Reports that preference information in model representations predicts preference-style downstream behavior more strongly, with r=0.76 for the aggregate analysis.
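As a reading aid for the two correlation figures above, the sketch below computes a correlation coefficient and the share of variance it would explain under a linear fit. It assumes the reported r is Pearson's r, which these notes do not confirm. The helper names (`pearson_r`, `variance_explained`) are hypothetical, and no benchmark data from the paper is reproduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def variance_explained(r):
    """Fraction of variance a linear fit explains for correlation r."""
    return r * r

# The two coefficients quoted in the notes, read as Pearson's r (assumption).
token_level = variance_explained(0.52)   # about 0.27
preference = variance_explained(0.76)    # about 0.58
```

Under that assumption, token-level compression optimality would account for roughly a quarter of the variance in aggregate benchmark scores, and preference information for a little over half of the variance in preference-style behavior. Both are meaningful signals, and neither is a substitute for running the benchmarks.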

Why It Matters

This source gives a training-dynamics version of the compression argument. The agentic-coding source Scaling Test-Time Compute for Agentic Coding shows compression as a runtime interface: raw rollouts become summaries before selection and reuse. This source shows compression as a training outcome: model representations themselves become better when they discard irrelevant input information and preserve target-relevant structure.

For the wiki agenda, the strong synthesis is:

end-to-end learning should not preserve raw detail by default;
it should learn which detail remains predictive, controllable, or useful.

The catch is that “irrelevant” depends on the objective. If the training target is next-token prediction, the model may compress around text prediction rather than action consequences, rare failure precursors, dense numeric fidelity, or operational safety.

Method Notes

The paper estimates mutual information between model representations and labels derived from token, bigram, trigram, quadgram, and preference conditions. It uses C4 as a broad text-distribution proxy for most training-compression estimates and Tulu preference data for preference information.

The authors explicitly caution that they estimate entropy with respect to a sample and estimator, not the true continuous latent distribution. This matters for wiki use: the result is comparative evidence across models and checkpoints, not an exact physical measurement of all information inside a model.
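The general idea of estimating information without hard binning can be sketched as follows. This is a generic soft-assignment estimator written for illustration, not a reproduction of the paper's soft_h code: the anchor points, the cosine-similarity softmax, the `temperature` value, and the function names are all assumptions. Each representation is softly assigned to a set of anchors, the assignments are averaged into one distribution, and mutual information with a label is taken as H(Z) minus the label-conditional entropy.

```python
import math

def soft_entropy(vectors, anchors, temperature=0.1):
    """Entropy in bits of the average soft assignment of vectors to anchors."""
    def unit(v):
        norm = math.sqrt(sum(c * c for c in v)) or 1.0
        return [c / norm for c in v]

    anchors = [unit(a) for a in anchors]
    avg = [0.0] * len(anchors)
    for v in vectors:
        u = unit(v)
        # Cosine similarity to each anchor, sharpened by the temperature.
        sims = [sum(a_i * u_i for a_i, u_i in zip(a, u)) / temperature
                for a in anchors]
        top = max(sims)
        weights = [math.exp(s - top) for s in sims]  # stable softmax
        total = sum(weights)
        for k, w in enumerate(weights):
            avg[k] += w / (total * len(vectors))
    return -sum(p * math.log2(p) for p in avg if p > 0)

def soft_mutual_information(vectors, labels, anchors, temperature=0.1):
    """Approximate I(Z; Y) as H(Z) - sum_y p(y) * H(Z | Y = y)."""
    h_z = soft_entropy(vectors, anchors, temperature)
    h_z_given_y = 0.0
    for y in set(labels):
        group = [v for v, lab in zip(vectors, labels) if lab == y]
        weight = len(group) / len(vectors)
        h_z_given_y += weight * soft_entropy(group, anchors, temperature)
    return h_z - h_z_given_y

# Toy check with hypothetical 2-D "representations" and two anchors.
anchors = [[1.0, 0.0], [0.0, 1.0]]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
informative = soft_mutual_information(vectors, [0, 1, 0, 1], anchors)
uninformative = soft_mutual_information(vectors, [0, 0, 1, 1], anchors)
```

When the label lines up with the direction of the vector, the estimate is close to 1 bit; when the label is unrelated to direction, it is essentially zero. The result also depends on the anchors and temperature chosen, which is the practical face of the authors' caution: these numbers are comparable across models measured the same way, not absolute quantities.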

Post-Training Note

The paper’s post-training appendix is useful for LLM Post-Training. It suggests pretraining learns the broad compression of text data, while post-training can change which information survives by increasing preference information with relatively small changes to general compression structure. That matches the wiki’s broader update-geometry lens: post-training is not only “more training”; it edits the representation target.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Representation quality	adjacent	Treats representation quality as goal-relevant lossy compression, not lossless memorization.	Needs analogous information probes for numeric time series, events, topology, actions, and rewards.
Training dynamics	partially closes	Shows a measurable expansion-to-compression trajectory in LLM pretraining checkpoints.	Needs TSFM pretraining evidence and optimizer/data-mixture sensitivity.
Dynamic tokenization and compression	adjacent	Gives a global learning-theoretic reason that compression can improve generalization.	Does not learn token boundaries or routing policies for time-series streams.
Control and counterfactuals	warning	Compression preserves information relevant to the training objective.	If the objective lacks action consequences, compression can erase control-relevant state.

Links Into The Wiki

Open Questions

Can an analogous information-plane analysis be defined for multivariate time-series models with dense numeric targets?
Which TSFM objectives make compression preserve rare events, interventions, and action-relevant state rather than only average forecast accuracy?
Can compression optimality become a checkpoint-selection criterion for time-series pretraining?
How should post-training for time-series reasoning change the information that survives without damaging numeric priors?
