Learning to Forget: Continual Learning with Adaptive Weight Decay
Source
- Raw Markdown: paper_fade-2026.md
- PDF: paper_fade-2026.pdf
- Preprint: arXiv 2604.27063
- Code: FADE
- Gonzo ML discussion: post 5330
- Review: ArxivIQ note
Status And Credibility
arXiv preprint dated April 29, 2026, from authors at IDSIA, University of Alberta / Amii, and KAUST, including Juergen Schmidhuber. Treat it as credible early continual-learning evidence, but not yet as a peer-reviewed recipe for large models.
Core Claim
Continual learners need controlled forgetting. Fixed scalar weight decay forgets uniformly across time and across parameters, which is mismatched to non-stationary online learning. FADE, short for Forgetting through Adaptive Decay, learns per-parameter decay rates online so stable weights can retain information while weights tied to changing targets can forget faster.
The useful intuition is:
lambda_i high -> shorter memory horizon for parameter i
lambda_i low -> longer memory horizon for parameter i
The method views weight decay as a learnable memory horizon rather than only as a regularizer.
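To make the memory-horizon reading concrete, consider only the decay part of an update, w <- (1 - alpha * lambda) * w. The step size alpha and the helper name `decay_half_life` are assumptions introduced here for illustration, and the gradient term is ignored. This is a back-of-the-envelope sketch, not a formula taken from the paper.

```python
import math

def decay_half_life(lam: float, alpha: float = 0.01) -> float:
    """Steps until a weight shrinks to half under w <- (1 - alpha*lam) * w.

    Assumes 0 < alpha * lam < 1 and ignores the gradient term entirely.
    """
    shrink = 1.0 - alpha * lam
    if not 0.0 < shrink < 1.0:
        raise ValueError("need 0 < alpha * lam < 1")
    return math.log(0.5) / math.log(shrink)

# Higher lambda -> shorter horizon; lower lambda -> longer horizon.
short_horizon = decay_half_life(lam=10.0)  # fast forgetting
long_horizon = decay_half_life(lam=0.1)    # slow forgetting
```

Under these assumptions the half-life scales roughly as 1 / (alpha * lambda) for small decay, which is why a learned lambda_i can be read as a per-parameter memory length.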
Method
FADE parameterizes each decay rate as:
lambda_i = exp(gamma_i)
It updates gamma_i with approximate online meta-gradients, tracking a sensitivity trace for how each weight depends on its decay meta-parameter. The derivation is for online linear regression, and the neural-network experiments apply FADE to the final linear layer while using standard optimizers for the hidden layers.
The method adds two scalar states per parameter on the adapted layer and preserves O(d) per-step cost for the online update.
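The description above can be sketched for online linear regression as follows. This is a minimal illustration, not the paper's exact update: the trace recursion is an IDBD-style diagonal approximation derived here by differentiating the weight update with respect to gamma_i, and the names `adaptive_decay_step`, `alpha`, `meta_lr`, and the cap on gamma are all assumptions. The two per-parameter states are gamma (log decay rate) and h (the sensitivity trace), and every operation is elementwise, so the cost per step stays O(d).

```python
import numpy as np

def adaptive_decay_step(w, gamma, h, x, y, alpha=0.05, meta_lr=0.01):
    """One online step of linear regression with learned per-weight decay.

    Illustrative only: an IDBD-style diagonal approximation chosen for
    this sketch, not the paper's exact update rule.
    """
    delta = y - w @ x                       # prediction error before the update
    # Meta step: d(0.5 * delta^2) / d gamma_i ~= -delta * x_i * h_i.
    gamma = gamma + meta_lr * delta * x * h
    # Sketch-only safeguard (an assumption): keep alpha * lambda_i <= 1.
    gamma = np.minimum(gamma, np.log(1.0 / alpha))
    lam = np.exp(gamma)                     # lambda_i = exp(gamma_i) > 0
    # Sensitivity trace h_i ~= d w_i / d gamma_i, diagonal terms only.
    keep = np.clip(1.0 - alpha * lam - alpha * x * x, 0.0, None)
    h = keep * h - alpha * lam * w          # uses the pre-update weights
    # Weight update: LMS step plus per-weight decay.
    w = (1.0 - alpha * lam) * w + alpha * delta * x
    return w, gamma, h, delta

# Toy stream (hypothetical): only the first target weight changes over time.
rng = np.random.default_rng(0)
d = 3
w, gamma, h = np.zeros(d), np.full(d, -3.0), np.zeros(d)
target = np.array([1.0, -2.0, 0.5])
for t in range(500):
    if t % 100 == 0:
        target[0] = -target[0]              # non-stationary coordinate
    x = rng.normal(size=d)
    w, gamma, h, _ = adaptive_decay_step(w, gamma, h, x, target @ x)
```

Inspecting `np.exp(gamma)` after such a run shows which coordinates ended up with larger decay rates. Whether that matches the paper's reported behavior depends on details not reproduced in this sketch.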
Key Contributions
- Reframes weight decay as a mechanism for selective forgetting in non-stationary continual learning.
- Derives per-parameter adaptive decay with forward-mode meta-gradient approximations.
- Shows FADE complements per-parameter step-size adaptation through IDBD in a linear tracking task.
- Reports that FADE+SGD achieves roughly half the nonlinear tracking error of AdamW in the paper’s teacher-student setup.
- Reports that FADE improves streaming label-permuted EMNIST, with FADE+SGD reaching 0.807 average online accuracy versus 0.612 for the weight-clipping baseline and 0.372 for AdamW.
- Shows fixed decay on only the network head can be a surprisingly strong baseline, but is fragile to the decay coefficient; FADE reduces that sensitivity.
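For readers unfamiliar with the streaming setup, the following is a hedged sketch of how a label-permuted stream and an average online accuracy could be computed. It assumes a predict-then-update (prequential) protocol and a fixed permutation period. Both are assumptions for illustration, as are the function names, and the paper's exact protocol may differ.

```python
import numpy as np

def permuted_label_stream(labels, num_classes, period, seed=0):
    """Yield (step, permuted_label); the mapping is reshuffled every `period` steps."""
    rng = np.random.default_rng(seed)
    perm = np.arange(num_classes)            # start from the identity mapping
    for t, label in enumerate(labels):
        if t > 0 and t % period == 0:
            perm = rng.permutation(num_classes)
        yield t, int(perm[label])

def average_online_accuracy(predict, update, inputs, targets):
    """Score each prediction before the learner sees the label, then update."""
    correct = 0
    for x, y in zip(inputs, targets):
        correct += int(predict(x) == y)      # evaluated before the update
        update(x, y)
    return correct / len(targets)
```

Here `predict` and `update` are placeholders for whatever learner is under test. Because every prediction is scored before the corresponding update, a learner that cannot forget the old label mapping is penalized immediately after each permutation.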
Why It Matters
FADE is a useful counterpoint to the “avoid forgetting” framing. Continual systems should not preserve everything forever. They need a policy for which information stays in long-term weights, which information is cleared, and which information belongs in fast context or external memory.
For the wiki’s agent and world-model agenda, the lesson is:
forgetting is not always failure;
uncontrolled forgetting is failure.
This matters for operational agents, streaming time-series models, and adaptive world models because environments change. Some stale mappings should be removed quickly, while stable dynamics and rare safety knowledge should persist.
Limitations
- FADE is derived for online linear regression and applied only to the final layer in neural networks.
- A naive all-layer extension improves over fixed decay but performs much worse than head-only FADE on EMNIST, so the current method is not a ready all-parameter deep-network recipe.
- Evidence is online tracking and streaming classification, not large LLM post-training, TSFM pretraining, or action-conditioned world-model training.
- The method adapts slow parameter decay, not input-dependent runtime memory gates.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Streaming state and constant updates | adjacent | Gives a parameter-level mechanism for adapting memory horizon under non-stationary online data. | Needs sequence-model or TSFM evidence with continuous streams and retained latent state. |
| Training dynamics | adjacent | Shows weight decay can be treated as learnable forgetting rather than a fixed regularizer. | Needs interaction with AdamW, Muon, normalization layers, and large-scale pretraining. |
| Data diversity and long tail | warning | Selective forgetting can free capacity for changing targets. | Must not erase rare but safety-critical regimes or old interventions that remain relevant. |
| LLM post-training and continual adaptation | adjacent | Complements update-drift and retention concerns by adding controlled decay as another adaptation axis. | Needs LLM-scale retention tests and layerwise decay policies. |
Links Into The Wiki
- FADE
- Training Dynamics
- LLM Post-Training
- Hierarchical Modeling with a Fixed FLOPs Budget
- Foundation Time-Series Model Research Agenda
- Evolution Strategies
Open Questions
- Can adaptive decay be extended beyond final layers without breaking the meta-gradient approximation?
- How should adaptive weight decay interact with optimizer state, momentum, AdamW-style decoupled decay, and normalization?
- Can selective forgetting help TSFMs adapt to schema changes, seasonality shifts, sensor drift, or policy changes without forgetting rare failures?
- Should operational world models separate stable dynamics, local policies, and temporary incident context into different memory horizons?