Evolutionary Strategies lead to Catastrophic Forgetting in LLMs

Source

Core Claim

This paper argues that Evolution Strategies (ES) can approach GRPO (Group Relative Policy Optimization) in accuracy on new math and reasoning tasks, but at the cost of substantially more catastrophic forgetting of prior LLM capabilities.

Key Contributions

  • Compares ES and GRPO on Countdown, GSM8K, MATH, and OlympiadBench with Qwen2.5-1.5B-Instruct and Llama-3.2-1B-Instruct.
  • Finds ES close to GRPO in new-task accuracy, though GRPO keeps a small edge in most reported task/model combinations.
  • Tracks prior-capability retention with HellaSwag during Countdown fine-tuning and finds that prior-task performance under ES declines as training continues.
  • Attributes the forgetting to ES updates being much denser and much larger in norm than GRPO updates.

Method Notes

The paper is best read as a continual-learning stress test for Evolution Strategies at Scale. It accepts the premise that ES is a plausible gradient-free post-training method and asks whether a model trained this way can preserve prior capabilities while adapting online.
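
For orientation, here is a minimal sketch of what "gradient-free" means in this setting. It is a textbook-style ES step on a toy objective, not the paper's implementation: the function name es_step, the toy reward, and the defaults for pop_size, sigma, and lr are all illustrative assumptions.

```python
import random
import statistics


def es_step(theta, reward_fn, rng, pop_size=32, sigma=0.1, lr=0.02):
    """One vanilla ES update (illustrative; hyperparameters are assumptions).

    Perturb the parameters with Gaussian noise, score each perturbed copy,
    and move along the noise directions weighted by normalized reward.
    No backpropagation is involved; only reward evaluations are needed.
    """
    dim = len(theta)
    noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(pop_size)]
    rewards = [
        reward_fn([t + sigma * e for t, e in zip(theta, eps)]) for eps in noise
    ]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All population members scored the same: no signal, so no update.
        return list(theta)
    adv = [(r - mean) / std for r in rewards]
    # Reward-weighted sum of noise vectors, one entry per parameter.
    grad = [
        sum(a * eps[j] for a, eps in zip(adv, noise)) / (pop_size * sigma)
        for j in range(dim)
    ]
    return [t + lr * g for t, g in zip(theta, grad)]


def toy_reward(x):
    """Hypothetical stand-in for task reward: peak at x = (3, 3, 3, 3, 3)."""
    return -sum((xi - 3.0) ** 2 for xi in x)


rng = random.Random(0)
theta = [0.0] * 5
for _ in range(200):
    theta = es_step(theta, toy_reward, rng)
```

In this sketch the update is a weighted sum of dense Gaussian noise vectors, so every coordinate of theta moves at every step. That is a property of the sketch, not a measurement from the paper.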

Evidence And Results

The key retention experiment fine-tunes Qwen2.5-1.5B-Instruct on Countdown while tracking HellaSwag. ES reaches most of its Countdown gain by roughly 200 iterations, but additional iterations continue to degrade prior-task performance. The update analysis reports that ES parameter drift has Frobenius norms orders of magnitude larger than GRPO's, and much lower sparsity, across layers and parameter groups.
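
The two update statistics mentioned above can be sketched as follows. This assumes a simple definition of sparsity (the fraction of entries whose change is at most a small tolerance); the paper's exact definition and threshold may differ. The function name update_stats, the tol default, and the parameter names are hypothetical.

```python
import math


def update_stats(base, tuned, tol=1e-6):
    """Per-parameter-group drift statistics for the update (tuned - base).

    base, tuned: dicts mapping a parameter name to a non-empty flat list of
    floats. Returns {name: (frobenius_norm, sparsity)}, where sparsity is
    the fraction of entries with |delta| <= tol (an assumed definition).
    """
    stats = {}
    for name, w0 in base.items():
        deltas = [b - a for a, b in zip(w0, tuned[name])]
        norm = math.sqrt(sum(d * d for d in deltas))
        sparsity = sum(abs(d) <= tol for d in deltas) / len(deltas)
        stats[name] = (norm, sparsity)
    return stats


# Toy example: two of four entries change, by 3 and 4.
base = {"layer0.weight": [0.0, 0.0, 0.0, 0.0]}
tuned = {"layer0.weight": [3.0, 0.0, 4.0, 0.0]}
print(update_stats(base, tuned))  # {'layer0.weight': (5.0, 0.5)}
```

Under this definition, a dense high-norm update shows up as a large first number and a second number near zero.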

Alex Context

Alex’s earlier ES note grouped this as a normal, not-yet-read follow-up source. In the wiki it should act as a cautionary counterpoint to the optimistic ES-at-scale papers rather than as a replacement for them.

Open Questions

  • Can ES be regularized to preserve prior capabilities without losing the memory advantages of being gradient-free?
  • Are dense high-norm ES updates inherent to full-parameter ES, or mostly a consequence of population size, noise scale, and update normalization choices?
  • Does low-rank ES, adapter-only ES, or EGGROLL-style perturbation change the forgetting profile?
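
To make the low-rank question concrete, here is a generic sketch of a low-rank perturbation for a single weight matrix. It is not the EGGROLL construction or any paper's method: the function name low_rank_perturbation, the Gaussian factors, and the 1/sqrt(rank) scaling are assumptions chosen for illustration.

```python
import random


def low_rank_perturbation(rows, cols, rank, sigma, rng):
    """Build a rows x cols perturbation sigma * A @ B^T / sqrt(rank).

    A (rows x rank) and B (cols x rank) hold independent standard Gaussian
    entries. Only rank * (rows + cols) random numbers are drawn, instead of
    rows * cols for a full-matrix perturbation.
    """
    a = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(rows)]
    b = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(cols)]
    scale = sigma / rank ** 0.5
    return [
        [scale * sum(a[i][k] * b[j][k] for k in range(rank)) for j in range(cols)]
        for i in range(rows)
    ]


rng = random.Random(0)
delta = low_rank_perturbation(rows=3, cols=4, rank=1, sigma=0.1, rng=rng)
```

A low-rank perturbation is generally still dense entry by entry, so this sketch does not by itself suggest sparser updates. Whether the rank constraint changes update norms or forgetting is the open question.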