Learning to Forget: Continual Learning with Adaptive Weight Decay

Source

Raw Markdown: paper_fade-2026.md
PDF: paper_fade-2026.pdf
Preprint: arXiv 2604.27063
Code: FADE
Gonzo ML discussion: post 5330
Review: ArxivIQ note

Status And Credibility

arXiv preprint dated April 29, 2026, from authors at IDSIA, the University of Alberta / Amii, and KAUST, including Juergen Schmidhuber. Treat it as credible early evidence on continual learning, but not yet as a peer-reviewed recipe for large models.

Core Claim

Continual learners need controlled forgetting. Fixed scalar weight decay forgets uniformly across time and across parameters, which is mismatched to non-stationary online learning. FADE, short for Forgetting through Adaptive Decay, learns per-parameter decay rates online so stable weights can retain information while weights tied to changing targets can forget faster.

The useful intuition is:

lambda_i high  -> shorter memory horizon for parameter i
lambda_i low   -> longer memory horizon for parameter i

The method views weight decay as a learnable memory horizon rather than only as a regularizer.
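
To make the memory-horizon reading concrete, here is a minimal sketch. It assumes the decay enters the update as a multiplicative shrink `w <- (1 - alpha * lambda) * w`, which is a common form of weight decay but is not confirmed by this note as the paper's exact parameterization. The function name `memory_horizon` and the step size `alpha` are illustrative. Under that assumption, old content in a weight fades geometrically, and the number of steps until it shrinks by a factor of e is a rough horizon.

```python
import math

def memory_horizon(lam, alpha):
    """Steps until a weight's old content shrinks by a factor of e under pure decay.

    Assumes the (hypothetical) update w <- (1 - alpha * lam) * w with no gradient term.
    """
    shrink = 1.0 - alpha * lam          # per-step multiplier applied to the weight
    if not 0.0 < shrink < 1.0:
        raise ValueError("need 0 < alpha * lam < 1 for a geometric decay")
    return -1.0 / math.log(shrink)

fast = memory_horizon(lam=1.0, alpha=0.1)    # high decay rate
slow = memory_horizon(lam=0.01, alpha=0.1)   # low decay rate
print(f"high lambda: about {fast:.1f} steps; low lambda: about {slow:.1f} steps")
```

For small `alpha * lambda` the horizon is close to `1 / (alpha * lambda)`, so a hundredfold difference in decay rate means roughly a hundredfold difference in how long a parameter remembers.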

Method

FADE parameterizes each decay rate as:

lambda_i = exp(gamma_i)

It updates gamma_i with approximate online meta-gradients, tracking a sensitivity trace for how each weight depends on its decay meta-parameter. The derivation is for online linear regression, and the neural-network experiments apply FADE to the final linear layer while using standard optimizers for the hidden layers.

The method adds two scalar states per parameter on the adapted layer and preserves O(d) per-step cost for the online update.
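
The description above can be sketched for online linear regression as follows. This is an IDBD-style derivation written for this note, not the paper's exact algorithm: the function name `fade_like_step`, the step sizes `alpha` and `eta`, the diagonal approximation of the sensitivity trace, and the clipping of the trace multiplier are all assumptions. It is meant only to show how two extra scalars per parameter (`gamma` and the trace `h`) are enough to adapt the decay at O(d) cost per step.

```python
import numpy as np

def fade_like_step(w, gamma, h, x, y, alpha=0.01, eta=0.001):
    """One online step of an adaptive-decay update for linear regression (illustrative)."""
    delta = y - w @ x                      # prediction error before the update
    # Meta-gradient step on gamma: d(0.5 * delta^2) / d(gamma_i) ~= -delta * x_i * h_i
    gamma = gamma + eta * delta * x * h
    lam = np.exp(gamma)                    # lambda_i = exp(gamma_i), always positive
    # Sensitivity trace h_i ~= d w_i / d gamma_i, keeping only the diagonal terms.
    # Clipping the multiplier at zero follows IDBD practice and is an assumption here.
    keep = np.clip(1.0 - alpha * lam - alpha * x * x, 0.0, None)
    h_new = keep * h - alpha * lam * w
    # Weight update: per-parameter decay plus the usual LMS gradient step
    w_new = (1.0 - alpha * lam) * w + alpha * delta * x
    return w_new, gamma, h_new, delta

# Toy stream: only the first target weight drifts, the others stay fixed.
rng = np.random.default_rng(0)
d = 4
w, gamma, h = np.zeros(d), np.full(d, np.log(1e-3)), np.zeros(d)
target = rng.normal(size=d)
for t in range(500):
    if t % 100 == 0:
        target[0] = rng.normal()
    x = rng.normal(size=d)
    y = target @ x
    w, gamma, h, delta = fade_like_step(w, gamma, h, x, y)
```

The trace update comes from differentiating the weight update with respect to `gamma_i` and dropping cross-parameter terms, which is what keeps the cost linear in the number of adapted weights.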

Key Contributions

Reframes weight decay as a mechanism for selective forgetting in non-stationary continual learning.
Derives per-parameter adaptive decay with forward-mode meta-gradient approximations.
Shows FADE complements per-parameter step-size adaptation through IDBD in a linear tracking task.
Reports that FADE+SGD achieves roughly half the nonlinear tracking error of AdamW in the paper’s teacher-student setup.
Reports that FADE improves streaming label-permuted EMNIST, with FADE+SGD reaching 0.807 average online accuracy versus 0.612 for the weight-clipping baseline and 0.372 for AdamW.
Shows fixed decay on only the network head can be a surprisingly strong baseline, but is fragile to the decay coefficient; FADE reduces that sensitivity.
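
The label-permuted streaming setup mentioned above can be pictured with a small generator. This note only says the benchmark is "streaming label-permuted EMNIST", so the schedule (`permute_every`), the choice to permute all classes at once, and the function name `label_permuted_stream` are assumptions made for illustration; the paper's protocol may differ. The point is that inputs stay the same while the input-to-label mapping changes, so the output layer has to forget stale mappings.

```python
import numpy as np

def label_permuted_stream(features, labels, num_classes, permute_every, seed=0):
    """Yield (x, permuted_label) pairs, redrawing the label permutation periodically."""
    rng = np.random.default_rng(seed)
    mapping = np.arange(num_classes)       # start with the identity mapping
    for t, (x, y) in enumerate(zip(features, labels)):
        if t > 0 and t % permute_every == 0:
            mapping = rng.permutation(num_classes)  # targets change, inputs do not
        yield x, int(mapping[y])

# Tiny synthetic stand-in for an image stream (not real EMNIST data).
features = np.zeros((10, 3))
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
stream = list(label_permuted_stream(features, labels, num_classes=3, permute_every=4))
```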

Why It Matters

FADE is a useful counterpoint to the “avoid forgetting” framing. Continual systems should not preserve everything forever. They need a policy for which information stays in long-term weights, which information is cleared, and which information belongs in fast context or external memory.

For the wiki’s agent and world-model agenda, the lesson is:

forgetting is not always failure;
uncontrolled forgetting is failure.

This matters for operational agents, streaming time-series models, and adaptive world models because environments change. Some stale mappings should be removed quickly, while stable dynamics and rare safety knowledge should persist.

Limitations

FADE is derived for online linear regression and applied only to the final layer in neural networks.
A naive all-layer extension improves over fixed decay but performs much worse than head-only FADE on EMNIST, so the current method is not a ready all-parameter deep-network recipe.
Evidence is online tracking and streaming classification, not large LLM post-training, TSFM pretraining, or action-conditioned world-model training.
FADE adapts slowly changing per-parameter decay rates; it does not provide input-dependent memory gates at runtime.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Streaming state and constant updates	adjacent	Gives a parameter-level mechanism for adapting memory horizon under non-stationary online data.	Needs sequence-model or TSFM evidence with continuous streams and retained latent state.
Training dynamics	adjacent	Shows weight decay can be treated as learnable forgetting rather than a fixed regularizer.	Needs interaction with AdamW, Muon, normalization layers, and large-scale pretraining.
Data diversity and long tail	warning	Selective forgetting can free capacity for changing targets.	Must not erase rare but safety-critical regimes or old interventions that remain relevant.
LLM post-training and continual adaptation	adjacent	Complements update-drift and retention concerns by adding controlled decay as another adaptation axis.	Needs LLM-scale retention tests and layerwise decay policies.

Open Questions

Can adaptive decay be extended beyond final layers without breaking the meta-gradient approximation?
How should adaptive weight decay interact with optimizer state, momentum, AdamW-style decoupled decay, and normalization?
Can selective forgetting help TSFMs adapt to schema changes, seasonality shifts, sensor drift, or policy changes without forgetting rare failures?
Should operational world models separate stable dynamics, local policies, and temporary incident context into different memory horizons?
