---
abstract: |
  One of the biggest missing capabilities in current AI systems is the ability to learn continuously after deployment. Building such continually learning systems poses several challenges, one of which is the large memory requirement of the gradient-based algorithms used to train state-of-the-art LLMs. Evolutionary Strategies (ES) have recently re-emerged as a gradient-free alternative to traditional learning algorithms and have shown encouraging performance on specific LLM tasks. In this paper, we perform a comprehensive analysis of ES and specifically evaluate its forgetting curves when training for an increasing number of update steps. We first find that ES is able to reach performance close to GRPO on math and reasoning tasks with a comparable compute budget. However, and most importantly for continual learning, **the performance gains in ES are accompanied by significant forgetting of prior abilities, limiting its applicability for training models online**. We also explore the reason behind this behavior and show that ES updates are much less sparse and have orders of magnitude larger $\ell_2$ norms than corresponding GRPO updates, explaining the contrasting forgetting curves between the two algorithms. With this study, we aim to highlight the issue of forgetting in gradient-free algorithms like ES and hope to inspire future work to mitigate these issues.
author:
- |
  **Immanuel Abdi[^1], Akshat Gupta`\footnotemark[1]`{=latex}**\
  **Micah Mok, Alexander Lu, Nicholas Lee, Gopala Anumanchipalli**\
  UC Berkeley\
  `{immanuelazn, akshat.gupta}@berkeley.edu`
bibliography:
- custom.bib
title: Evolutionary Strategies lead to Catastrophic Forgetting in LLMs
---

```{=latex}
\maketitle
```
Introduction
============

Despite rapid advances in AI with transformer-based LLMs [@vaswani2017attention; @brown2020language; @deepseekai2024deepseekllm], most state-of-the-art systems remain static after training and lack the ability to learn continually during deployment. In many real-world settings, models need to adapt to new tasks, user preferences, or data distributions to perform optimally. While modern chatbots like ChatGPT do this by taking notes in the form of user memory [@openai_memory_2024] and use in-context learning [@brown2020language] to incorporate this information, we currently lack solutions that can achieve this by modifying the model weights during deployment. One of the reasons that makes this challenging is that current post-training and adaptation methods for LLMs are exclusively gradient-based, including approaches such as SFT [@wei2022finetunedlanguagemodelszeroshot], RLHF [@ouyangTrainingLanguageModels2022], DPO [@rafailov2024directpreferenceoptimizationlanguage], and GRPO [@shao2024deepseekmathpushinglimitsmathematical]. While effective, these methods require storing gradients, optimizer states, or intermediate activations, causing substantial memory overhead.

Evolutionary Strategies (ES) [@qiuEvolutionStrategiesScale2025; @korotyshova2025essaevolutionarystrategiesscalable] have recently re-emerged as a gradient-free alternative for optimizing LLMs. By estimating updates through population-level perturbations rather than backpropagation, ES avoids explicit gradient storage and can significantly reduce memory requirements during deployment. @qiuEvolutionStrategiesScale2025 have shown that ES achieves comparable performance to GRPO on the Countdown task [@panJiayiPanTinyZero2026], presenting ES as a viable candidate for continual learning in LLMs. However, a more comprehensive analysis on task generalization was missing in their work. More importantly from the perspective of continual learning, @qiuEvolutionStrategiesScale2025 do not evaluate the extent to which ES preserves existing capabilities while learning new tasks.

In this work, we present a comprehensive empirical analysis of ES for fine-tuning LLMs, with a focus on continual learning and forgetting. We compare ES against GRPO on multiple math and reasoning benchmarks and evaluate forgetting curves over many update steps. Our results confirm that ES is able to reach performance levels comparable to GRPO on a large suite of tasks; however, contrary to results reported in @qiuEvolutionStrategiesScale2025, we find that GRPO still dominates ES marginally on almost all tasks. Additionally, we show that training LLMs using ES leads to significant model degradation and forgetting of existing abilities when compared to GRPO. To better understand this behavior, we analyze the structure of parameter updates produced by ES and compare them to those obtained using GRPO. We find that ES updates are significantly less sparse and exhibit much larger $\ell_2$ norms, leading to more global parameter changes that interfere with previously learned capabilities.

Our findings highlight that although ES presents a tempting memory-efficient, gradient-free route to inference-time model adaptation, it is accompanied by \`\`catastrophic'' forgetting [@kirkpatrick2017overcoming; @gupta2024model] of the model's prior abilities. We hope these results inspire future advancements in gradient-free algorithms with continual learning and catastrophic forgetting at the forefront of thought. We also release our codebase[^2] and models[^3] for reference.

To summarize, our work makes the following contributions:

1.  We show that ES is able to reach comparable performance to GRPO on several math and reasoning benchmarks with a similar number of update steps.

2.  We show that training models using ES causes significant model degradation when compared to GRPO, leading to catastrophic forgetting of prior abilities.

3.  Finally, we explore the reason behind catastrophic forgetting in ES and show that it occurs because ES model updates are much less sparse and have significantly larger $\ell_2$ norms than corresponding GRPO updates.

Related Work
============

Evolution Strategies are a class of algorithms that search for solutions to first-order optimization problems by randomly modifying population members to find better performing members [@10.1007/978-3-642-83814-9_6; @schwefel1977numerische; @Beyer1995b]. Although implementations such as CMA-ES [@hansen2001completelyderandomizedselfadaptationinevolutionstrategies] and natural ES [@wierstra2011naturalevolutionstrategies; @sun2012efficientnaturalevolutionstrategies] demonstrated success, initial implementations remained at the million-parameter scale [@such2018deepneuroevolutiongeneticalgorithms; @risi2019deepneuroevolutionrecurrentdiscrete; @zhang2017relationshipopenaievolutionstrategy]. However, recent work has brought ES up to scale and into competition with GRPO, leveraging the fact that it is highly parallelizable [@salimansEvolutionStrategiesScalable2017], memory efficient [@malladi2024finetuninglanguagemodelsjust; @korotyshova2025essaevolutionarystrategiesscalable], fast [@sarkar2025evolutionstrategieshyperscale], robust to sparse reward horizons [@salimansEvolutionStrategiesScalable2017], and compatible with LoRA adaptations [@jin2024derivativefreeoptimizationlowrankadaptation; @korotyshova2025essaevolutionarystrategiesscalable; @sarkar2025evolutionstrategieshyperscale]. @qiuEvolutionStrategiesScale2025 recently published a novel implementation of ES and showed that it outperforms GRPO. However, their study lacked a thorough analysis of model degradation during continued training. Additionally, the bulk of their study focused on a single dataset. We extend their analysis to multiple datasets, evaluate model degradation during fine-tuning, and study the difference in weight updates between ES and GRPO.

Experiments
===========

ES vs GRPO Comparison {#subsec:experiment-datasets}
---------------------

We use the ES implementation of @qiuEvolutionStrategiesScale2025 and compare it with the GRPO [@shao2024deepseekmathpushinglimitsmathematical] implementation from the VERL library [@Sheng_2025]. An algorithmic analogy between ES and GRPO can be found in `\ref{appendix:A.1}`{=latex}, while implementation details can be found in `\ref{appendix:A.4}`{=latex}. We extend the analysis of ES and GRPO to three math and reasoning tasks -- GSM8K [@cobbe2021trainingverifierssolvemath], MATH [@hendrycks2021measuringmathematicalproblemsolving] and OlympiadBench [@he2024olympiadbenchchallengingbenchmarkpromoting], in addition to the Countdown dataset, which was extensively studied in prior work [@qiuEvolutionStrategiesScale2025]. We perform this study for two models: Qwen2.5-1.5B-Instruct [@qwenQwen25TechnicalReport2024] and Llama-3.2-1B-Instruct [@grattafiori2024llama3herdmodels]. Following the experimental conditions of @qiuEvolutionStrategiesScale2025, we train our models on 200 examples from each dataset with identical batch size and number of rollouts.

The results for the comparison between ES and GRPO for fine-tuning LLMs can be found in Table `\ref{tab:dataset-accuracy}`{=latex}. We see that for both models, ES is within 3-4 percentage points of GRPO in terms of task performance. These results are in contrast to prior work by @qiuEvolutionStrategiesScale2025, who claim that ES significantly outperforms GRPO on the Countdown task. In our experiments, although ES performance numbers are close to GRPO, GRPO still outperforms ES for all but GSM8K with the Llama-3.2-1B model. Therefore, we find different relative performance trends than those reported in prior work, which may stem from differences in GRPO implementations, hyperparameter choices, or evaluation protocols. We release our codebase and open source our trained models for reference.

```{=latex}
\centering
```
```{=latex}
\begin{table}[t]
\centering
\scalebox{0.8}{
\begin{tabular}{llcc}
\toprule
Model & Task & ES & GRPO \\
\midrule
\multirow{4}{*}{\shortstack[l]{Qwen-2.5-1.5B\\ (Instruct)}}
 & Countdown       & 53.0 & \textbf{56.4} \\
 & GSM8K           & 77.4 & \textbf{80.4} \\
 & MATH            & 59.1 & \textbf{63.2} \\
 & OlympiadBench   & 15.2 & \textbf{18.2} \\
\midrule
\multirow{4}{*}{\shortstack[l]{Llama-3.2-1B\\ (Instruct)}}
 & Countdown       & 15.2 & \textbf{37.6} \\
 & GSM8K           & \textbf{55.2} & 53.8 \\
 & MATH            & 32.2 & \textbf{35.6} \\
 & OlympiadBench   & 5.6  & \textbf{5.9} \\
\bottomrule
\end{tabular}
}
\caption{Task performance (accuracy, \%) of ES and GRPO fine-tuning across models and datasets. Best result per row in bold.}
\label{tab:dataset-accuracy}
\end{table}
```
The fact that the performance of ES is comparable to a state-of-the-art post-training algorithm like GRPO is very encouraging and establishes ES as a potential gradient-free alternative for training LLMs. We also see that for all tasks except Countdown, ES is able to reach peak performance in a similar number of update steps, as shown in Figure `\ref{fig:es-grpo-comparison}`{=latex}. This makes the compute requirements of ES comparable to those of GRPO as well.

ES and Catastrophic Forgetting {#subsec:exp-catastrophic-forgetting}
------------------------------

While Section `\ref{subsec:experiment-datasets}`{=latex} shows that ES performs competitively with GRPO on various downstream tasks, a defining factor in the viability of a fine-tuning algorithm for continual learning is its relationship with catastrophic forgetting. We use Qwen2.5-1.5B-Instruct trained on the Countdown dataset with GRPO and ES to evaluate catastrophic forgetting, using HellaSwag [@zellers2019hellaswagmachinereallyfinish] to measure the models' prior capabilities. In an ideal scenario, performance on previous tasks should be preserved as new capability is gained. We thus evaluate task performance across each checkpoint of our trained models.

```{=latex}
\centering
```
![Pareto front between new task (Countdown) and old task (HellaSwag) performance across fine-tuning with ES and GRPO.](countdown_vs_hellaswag_colored_by_iteration.png){#fig:new-vs-prev-task-performance-iter width="0.9\\linewidth"}

```{=latex}
\centering
```
![Prior task accuracy (%; HellaSwag) vs. training iteration for ES and GRPO fine-tuning. GRPO-trained models exhibit stable prior task accuracy across training iteration, while ES-trained models degrade with continued fine-tuning.](hellaswag_vs_iteration.png){#fig:hellaswag-vs-iteration width="0.8\\columnwidth"}

Figure `\ref{fig:new-vs-prev-task-performance-iter}`{=latex} illustrates the relationship between new-task performance (Countdown) and prior-task performance (HellaSwag) across fine-tuning iterations. When training with ES, prior-task performance systematically deteriorates as fine-tuning proceeds, even after new-task performance has effectively converged. This can be observed in the convex Pareto front traced by ES in Figure `\ref{fig:new-vs-prev-task-performance-iter}`{=latex}. The darker dots, which depict early training iterations, begin with lower new-task accuracy. With enough training iterations, the increase in new-task accuracy for ES is accompanied by a gradual but evident decline in prior-task accuracy. Additionally, ES models reach near-maximum Countdown performance by approximately 200 iterations, after which additional training yields negligible gains on the new task. As shown in Figure `\ref{fig:hellaswag-vs-iteration}`{=latex}, despite this convergence, prior-task performance continues to decline with further iterations, resulting in an approximately 10% drop relative to the best observed prior-task performance. **This pattern indicates that continued ES optimization disproportionately harms previously acquired capabilities, rather than trading off against improvements on the new task.**

In contrast, models fine-tuned with GRPO exhibit markedly different behavior. Across the full range of training iterations, GRPO maintains stable previous-task performance while achieving strong new-task accuracy, as seen in the cluster of crosses in the top-right corner of Figure `\ref{fig:new-vs-prev-task-performance-iter}`{=latex}. **This suggests that GRPO avoids the destructive interference observed with ES.** This property of GRPO has also been observed in prior work [@shenfeld2025rlsrazoronlinereinforcement].

Therefore, we see that although ES-trained models can be competitive to GRPO, they do so at the cost of severe catastrophic forgetting. Notably, this forgetting occurs within a single fine-tuning run rather than across sequential tasks, highlighting a fundamental instability in ES-based continual adaptation. These results suggest that ES is poorly suited for scenarios requiring task generalization or reuse of previously learned capabilities, whereas GRPO provides a substantially more stable fine-tuning regime.

```{=latex}
\centering
```
![ Relationship between Frobenius norm of a model update and number of training iterations on a new task. ES-trained models drift several orders of magnitude more than GRPO-trained models. ](frobenius_vs_iteration.png){#fig:hellaswag-frobenius width="\\linewidth"}

Dissecting ES Updates: Norm and Sparsity
----------------------------------------

In this section, we seek to determine the characteristics of fine-tuning with ES that cause catastrophic forgetting. We do this by analyzing two features: the update norm and sparsity.

#### Norm.

Here we investigate the growth of the update norm as a function of the number of update steps for ES and GRPO. We measure the Frobenius norm of the difference between model checkpoints within a training run, using the Qwen2.5-1.5B-Instruct model trained on the Countdown dataset.

The results are shown in Figure `\ref{fig:hellaswag-frobenius}`{=latex}. The Frobenius norm increases monotonically with the number of training iterations for ES-trained models. A similar trend is also present for GRPO-trained models (Figure `\ref{fig:hellaswag-frobenius-log}`{=latex}); however, the key distinction lies in scale. After just 500 training iterations, the Frobenius norm of the ES-trained model relative to the base model is *three* orders of magnitude larger than that of the GRPO-trained model. Combined with Figure `\ref{fig:hellaswag-vs-iteration}`{=latex}, this shows a clear association between the large increase in the ES Frobenius norm and the decline in prior task accuracy. **Thus, ES updates have significantly larger $\ell_2$ norms, causing parameter shifts that are orders of magnitude larger than those produced by GRPO.**
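
This drift measurement can be sketched in a few lines; `frobenius_drift` and the toy two-layer checkpoints below are illustrative names and data, not the exact analysis code.

```python
import numpy as np

def frobenius_drift(base_params, tuned_params):
    """Total Frobenius norm of the parameter update across all shared tensors,
    computed as sqrt(sum_l ||W_new_l - W_base_l||_F^2)."""
    total_sq = 0.0
    for name, w_base in base_params.items():
        delta = tuned_params[name] - w_base
        total_sq += float(np.sum(delta ** 2))
    return float(np.sqrt(total_sq))

# Toy checkpoints: two "layers" of a model before and after fine-tuning.
base = {"layer0": np.zeros((4, 4)), "layer1": np.zeros((4, 4))}
tuned = {"layer0": np.full((4, 4), 0.5), "layer1": np.zeros((4, 4))}
print(frobenius_drift(base, tuned))  # sqrt(16 * 0.25) = 2.0
```

Tracking this quantity checkpoint-by-checkpoint yields curves like those in the figure above.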

```{=latex}
\centering
```
```{=latex}
\setlength{\belowcaptionskip}{0pt}
```
![ Layerwise sparsity of parameter updates during fine-tuning (higher indicates more sparse updates). ES exhibits broadly distributed, dense updates across components, whereas GRPO updates are overall highly sparse across LLM parameter groups, consistent with more targeted parameter changes. ](sparsity_comparison.png "fig:"){width="\\linewidth"}\
`\vspace{-0.4em}`{=latex}

```{=latex}
\vspace{-0.6em}
```
`\label{fig:update-sparsity}`{=latex}

#### Sparsity.

Each update in ES is constructed from high-variance, global perturbations applied across all parameters, which may affect a large number of stored parameters uniformly. In contrast, GRPO is known to apply much sparser and more targeted updates via backpropagation, limiting the extent of unintended parameter drift [@mukherjee2025reinforcement]. To quantify the difference in the number of parameters affected by these algorithms, we evaluate the update sparsity of ES compared to GRPO.

We analyze Qwen2.5-1.5B-Instruct trained on the Countdown task with both GRPO and ES. We analyze the difference between a base model checkpoint and its corresponding fine-tuned checkpoint. For each shared parameter tensor, we compute the update $\Delta W = W_{new} - W_{base}$. Following prior work [@mukherjee2025reinforcement], we define sparsity as the percentage of elements whose absolute magnitude is below a fixed threshold ($\tau = 10^{-6}$); higher sparsity values therefore indicate that a larger fraction of parameters remain effectively unchanged. Parameters are grouped by architectural component, including attention projections $(Q, K, V)$, the attention output projection $(W_O)$, MLP layers and LayerNorms. Updates are further aggregated by transformer layer index to obtain layerwise sparsity profiles.
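
Under this definition, the per-tensor sparsity computation can be sketched as follows; `update_sparsity` and the toy matrices are illustrative, with the threshold tau = 1e-6 taken from the text.

```python
import numpy as np

TAU = 1e-6  # threshold from the main text

def update_sparsity(w_base, w_new, tau=TAU):
    """Percentage of update entries with |delta| below tau (higher = sparser)."""
    delta = np.abs(w_new - w_base)
    return 100.0 * float(np.mean(delta < tau))

# Toy example: a targeted update touches 2 of 100 entries.
rng = np.random.default_rng(0)
w = rng.normal(size=(10, 10))
w_sparse = w.copy()
w_sparse[0, 0] += 0.1
w_sparse[5, 5] -= 0.1
print(update_sparsity(w, w_sparse))  # 98.0

# A dense, ES-style perturbation touches every entry.
w_dense = w + 1e-3 * rng.normal(size=(10, 10))
print(update_sparsity(w, w_dense))  # typically ~0.0 (nearly every entry changes)
```

Aggregating this score per component and per transformer layer gives the layerwise profiles shown in the figure.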

The results are shown in Figure `\ref{fig:update-sparsity}`{=latex}. **We see that ES updates are substantially less sparse across layers and parameter groups when compared to GRPO.** The sparsity levels for GRPO updates across all parameter types and layers are close to 95%, meaning updates are concentrated in a very small number of parameters. In contrast, updates made using ES have very low sparsity levels, showing that a much larger number of parameters are perturbed when fine-tuning with ES. The sparsest ES updates appear in the LayerNorm parameters; however, these account for the fewest (and a negligible fraction of) parameters compared to other parts of the model. All other layer updates, irrespective of model depth, are highly dense in ES-trained models.

Therefore, GRPO exhibits structured and comparatively sparse updates, aligning with the hypothesis that gradient-based optimization concentrates changes in task-relevant subspaces and mitigates interference with prior capabilities. When combined with KL regularization, these mechanisms provide a natural safeguard against large-scale parameter drift and, consequently, catastrophic forgetting. In contrast, updates made using ES have orders of magnitude larger norms and are much less sparse. This lack of sparsity, together with the large update norms, drives the fine-tuned model further away from the base model, potentially leading to the catastrophic forgetting behavior observed in the previous sections.

Conclusion
==========

We perform an empirical analysis of Evolutionary Strategies for fine-tuning LLMs based on recent work [@qiuEvolutionStrategiesScale2025] and show that it performs competitively with GRPO. However, a critical roadblock persists: **we observe that ES exhibits significant catastrophic forgetting, progressively deteriorating performance on the model's prior skills.** We show that this happens because ES updates have large norms and low sparsity levels (i.e., they are dense), resulting in parameter drifts that are 1000x larger in magnitude than those observed with GRPO for the same number of update steps. These results imply that although recent progress in ES has bridged the performance gap with state-of-the-art learning algorithms like GRPO, the severe model degradation it causes remains a challenge to its widespread adoption.

Limitations {#limitations .unnumbered}
===========

ES has inherent randomness in its updates due to the nature of the algorithm, so models trained with ES exhibit high variance arising from the stochastic perturbations. Although we observed consistent qualitative trends across the settings we tested, where we used a population size of 30 as suggested in prior work [@qiuEvolutionStrategiesScale2025], an increased population size would decrease this variance and improve the statistical stability of our performance numbers.

Additionally, we evaluate catastrophic forgetting by tracking performance on one task during continued fine-tuning on Countdown, which measures retention of a broad prior capability. Doing so does not fully capture the multi-faceted loss of performance that may be occurring in the model; however, it is sufficient to provide strong evidence of the phenomenon.

```{=latex}
\appendix
```
Appendix {#sec:appendix}
========

```{=latex}
\setcounter{figure}{0}
```
```{=latex}
\renewcommand{\thefigure}{A\arabic{figure}}
```
Algorithmic Overview and Analogy Between ES and GRPO {#appendix:A.1}
----------------------------------------------------

### ES Algorithm Overview

@qiuEvolutionStrategiesScale2025 implement a version of evolutionary strategies featuring the following techniques: in-place weight adjustment with noise regenerated from stored random seeds, reward-ranked weight updates, and learning-rate scaling.

Each update step can be understood through the following equations. Each population member at step $t$ has a unique seed. Let $\epsilon_{n,l} \sim \mathcal{N}(0, I)$ denote the noise sampled for the $n$th member at layer $l$, $\theta_t$ the model parameters at step $t$, $\theta_{t,l}$ the parameters of layer $l$ at step $t$, $R$ the reward function, $R_n$ the reward score of the $n$th member, $Z_n$ the z-score of the $n$th member, $\sigma$ the noise coefficient, and $\alpha$ the learning rate.

Reset random seed generator. Sample noise $\epsilon_{n,l} \sim \mathcal{N}(0, I)$. For all layers, perturb in-place: $$\theta_{t-1,l} \leftarrow \theta_{t-1,l} + \sigma \cdot \epsilon_{n,l}.$$

Reward for perturbed model is calculated: $$R_n = R(\theta_{t-1}).$$

Reset random seed generator. Sample noise $\epsilon_{n,l} \sim \mathcal{N}(0, I)$. For all layers, restore in-place: $$\theta_{t-1,l} \leftarrow \theta_{t-1,l} - \sigma \cdot \epsilon_{n,l}.$$

Z-score is calculated per population member: $$Z_n = \frac{R_n - R_{\text{mean}}}{R_{\text{std}}},$$

Reset the random seed generator, resample the noise $\epsilon_{n,l} \sim \mathcal{N}(0, I)$ for each member, and update all layers in-place with the noise weighted by z-score and learning rate: $$\theta_{t,l} \leftarrow \theta_{t-1,l} + \frac{\alpha}{N} \sum_{n=1}^{N} Z_n \, \epsilon_{n,l},$$

where $R_{\text{mean}}$ and $R_{\text{std}}$ are the mean and standard deviation of $R_1, R_2, \dots, R_N$.
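
The steps above can be sketched as a single update function; this is a minimal NumPy illustration of the seed-replay scheme (`es_step`, the dict-of-arrays parameterization, and the toy fitting loop are our own assumptions, not the released implementation; the default `sigma` and `alpha` follow the hyperparameter table in this appendix).

```python
import numpy as np

def es_step(theta, seeds, reward_fn, sigma=0.001, alpha=0.0005):
    """One ES update, replaying stored seeds instead of storing noise tensors.

    theta: dict mapping layer names to parameter arrays (updated in place).
    seeds: one integer seed per population member (N = len(seeds)).
    reward_fn: maps the (perturbed) parameter dict to a scalar reward.
    """
    rewards = []
    for seed in seeds:
        # Perturb every layer in place using noise drawn from this member's seed.
        rng = np.random.default_rng(seed)
        for k in theta:
            theta[k] += sigma * rng.standard_normal(theta[k].shape)
        rewards.append(reward_fn(theta))
        # Reset the generator and replay the same noise to restore the weights.
        rng = np.random.default_rng(seed)
        for k in theta:
            theta[k] -= sigma * rng.standard_normal(theta[k].shape)
    r = np.array(rewards)
    z = (r - r.mean()) / (r.std() + 1e-8)  # z-score per population member
    # Replay each seed once more and apply the z-score-weighted update.
    for z_n, seed in zip(z, seeds):
        rng = np.random.default_rng(seed)
        for k in theta:
            theta[k] += (alpha / len(seeds)) * z_n * rng.standard_normal(theta[k].shape)
    return theta

# Toy usage: fit a 3-parameter "model" to a target, with sigma/alpha
# exaggerated relative to the paper's values so progress is visible quickly.
target = np.ones(3)
theta = {"w": np.zeros(3)}
reward = lambda p: -float(np.sum((p["w"] - target) ** 2))
for t in range(50):
    es_step(theta, seeds=list(range(t * 8, t * 8 + 8)), reward_fn=reward,
            sigma=0.1, alpha=0.1)
```

Replaying each seed three times (perturb, restore, update) is what lets the algorithm avoid materializing all $N$ noise tensors at once.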

### GRPO Algorithm Overview

@shao2024deepseekmathpushinglimitsmathematical implement Group Relative Policy Optimization (GRPO), which eliminates the critic model by estimating advantages from group statistics.

For each prompt $q$, sample a group of $G$ outputs $\{o_1, o_2, \dots, o_G\}$ from the current policy $\pi_{\theta_{old}}$.

Compute rewards $\{r_1, r_2, \dots, r_G\}$ for each output and calculate relative advantages via z-score normalization: $$A_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\})}.$$

The policy $\pi_\theta$ is updated by maximizing the GRPO objective:

$$\rho_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}$$

$$\begin{aligned}
\mathcal{J}_{\text{GRPO}}(\theta)
&= \frac{1}{G} \sum_{i=1}^{G}
\min \Big(
\rho_i(\theta) A_i,\;\\
&
\text{clip}(\rho_i(\theta), 1-\epsilon, 1+\epsilon) A_i
\Big)\\
& - \beta \, \mathbb{D}_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}).
\end{aligned}$$

To penalize divergence from the reference policy $\pi_{ref}$ without additional sampling, the KL term is approximated: $$\mathbb{D}_{KL}(\pi_\theta || \pi_{ref}) = \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - \log \frac{\pi_{ref}(o_i|q)}{\pi_\theta(o_i|q)} - 1.$$

By replacing the value function with group-relative rewards, this implementation reduces computational overhead and memory usage compared to standard PPO.
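
The group-relative advantages, clipped surrogate, and sampling-free KL approximation above can be sketched numerically as follows; `grpo_advantages`, `grpo_objective`, and the `eps` default are illustrative names and choices operating on per-output log-probabilities (the `beta` default matches the KL coefficient used in our experiments).

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages via z-score normalization over G outputs."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_objective(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.001):
    """GRPO objective for one group: clipped surrogate minus the KL penalty,
    with the KL term computed via the sampling-free approximation above."""
    a = grpo_advantages(rewards)
    rho = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # importance ratios
    surrogate = np.minimum(rho * a, np.clip(rho, 1 - eps, 1 + eps) * a)
    ratio_ref = np.exp(np.asarray(logp_ref) - np.asarray(logp_new))
    kl = ratio_ref - np.log(ratio_ref) - 1  # nonnegative KL estimator
    return float(np.mean(surrogate - beta * kl))

print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[ 1, -1,  1, -1]
```

Note that when the new, old, and reference policies coincide, the ratios are 1 and the KL term vanishes, so the objective reduces to the mean advantage.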

### Analogy Between ES Population Size and GRPO Rollout Count

Both GRPO and ES rely on creating different responses and then updating the model parameters via the fitness of those responses. The following section describes why the population size in ES and number of rollouts in GRPO play an analogous role in controlling parameter updates.

Following the algorithm described by @qiuEvolutionStrategiesScale2025, an ES training update comprises $N$ different seeds used to generate perturbations of the baseline model, resulting in $N$ different population members. Each population member is sampled at temperature 0 to generate one of $N$ different responses, which are evaluated by a reward function to determine their fitness; each fitness score is converted into a z-score that weights the contribution of the respective perturbation to the baseline model. Further explanation can be found in `\ref{appendix:A.1}`{=latex}.

Similarly, a GRPO training update samples $N$ candidate outputs from the current policy, evaluates them to obtain relative reward signals, and updates the policy via a policy-gradient objective while constraining deviation from a fixed reference policy through KL regularization [@deepseek-aiDeepSeekR1IncentivizingReasoning2025]. Crucially, although ES simultaneously maintains multiple different versions of a model and GRPO maintains one, ES population size and GRPO number of rollouts both determine the number of samples used to estimate a stochastic update and to form a stochastic gradient or gradient-free estimator that drives the parameter update.

Implementation Details {#appendix:A.4}
----------------------

Our implementations for both GRPO and ES model training and analysis are attached to this submission.

### GRPO

The GRPO setup in this study is built on the VERL library, which employs the HybridFlow engine proposed by @Sheng_2025. Training was conducted on NVIDIA RTX A6000 GPUs using the Fully Sharded Data Parallel (FSDP) protocol. Across all experiments, we maintained 30 rollouts for GRPO to mimic the 30 mutations generated in the original ES study by @qiuEvolutionStrategiesScale2025. For benchmark fine-tuning, we used a batch size of 200 examples along with a mini-batch size of 32 examples. A KL-loss coefficient of $\beta = 0.001$ was used. The trainer was set to run for a total of 500 epochs, although we stopped training early once validation accuracy appeared to plateau.

### ES

We replicated the original authors' implementation of ES with two modifications: we found that using fp16 instead of bf16 improved validation accuracy on certain tasks, and that applying the Qwen chat template to the original task prompts improved validation accuracy on our replication of the Countdown task for Qwen2.5-1.5B while leaving model performance in all other regimes virtually unchanged. Runs were performed both with and without the chat template to assess the effect.

### Reward functions

For the Countdown task, we employ the same reward function used by @qiuEvolutionStrategiesScale2025, adapted to fit the VERL API. An answer reward is calculated, which assigns a reward of $1.0$ if the model's answer uses all numbers exactly once and evaluates to the provided target, and $0.0$ otherwise. A separate format reward ensures that the model's response obeys an XML-style format with `<think>...</think>` thinking tokens followed by `<answer>...</answer>` response tokens. The final reward is a weighted average of the two: $$\mathrm{Reward} = 0.1 \cdot \mathrm{Format\ Reward} + 0.9 \cdot \mathrm{Answer\ Reward}.$$
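
This weighted combination can be sketched as follows; the regex used for the format check and the helper names (`format_reward`, `countdown_reward`) are illustrative approximations of the constraint described, not the exact reward code.

```python
import re

def format_reward(response):
    """1.0 if the response contains <think>...</think> followed by
    <answer>...</answer>, else 0.0 (approximate format check)."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, response, re.DOTALL) else 0.0

def countdown_reward(response, answer_score):
    """Weighted average from the text: 0.1 * format reward + 0.9 * answer reward."""
    return 0.1 * format_reward(response) + 0.9 * answer_score

resp = "<think>4*6=24, 24+3=27</think> <answer>(4*6)+3</answer>"
print(countdown_reward(resp, 1.0))  # 1.0
print(countdown_reward("(4*6)+3", 1.0))  # 0.9 (correct answer, wrong format)
```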

For the GSM8K, MATH, and OlympiadBench benchmarks, we employ a rule-based reward function using a binary evaluation logic. An answer reward is calculated by extracting the model's conclusion from the final 300 characters of the response using a regex pattern. The function first identifies the `#### [number]` format, falling back to `\boxed{...}` tags if necessary, and assigns a reward of $1.0$ if the extraction matches the ground truth and $0.0$ otherwise.
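
A sketch of this extraction logic is given below; the regex patterns and helper names (`extract_answer`, `answer_reward`) are illustrative approximations of the rule described, not the exact implementation.

```python
import re

def extract_answer(response):
    """Pull the final answer from the last 300 characters of a response.

    Tries the `#### <number>` convention first, then falls back to
    \\boxed{...}; returns None if neither is found."""
    tail = response[-300:]
    m = re.findall(r"####\s*(-?[\d,\.]+)", tail)
    if m:
        return m[-1].replace(",", "")
    m = re.findall(r"\\boxed\{([^}]*)\}", tail)
    return m[-1] if m else None

def answer_reward(response, ground_truth):
    """Binary reward: 1.0 on an exact match with the ground truth, else 0.0."""
    return 1.0 if extract_answer(response) == str(ground_truth) else 0.0

print(answer_reward("... so the total is #### 42", 42))   # 1.0
print(answer_reward(r"thus \boxed{7} is the result", 7))  # 1.0
print(answer_reward("no final answer given", 7))          # 0.0
```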

Hyperparameter Values
---------------------

```{=latex}
\centering
```
::: {#tab:es-hyperparameters}
  **Hyperparameter**        **Value**
  ------------------------ -----------
  Population size              30
  Noise scale $\sigma$        0.001
  Learning rate $\alpha$     0.0005
  Max tokens                  1024

  : Hyperparameters used for Evolution Strategies (ES) fine-tuning.
:::

Additional Experiments {#appendix:additional-exp}
----------------------

### Catastrophic Forgetting and KL {#appendix:additional-exp:cat-forgetting-kl}

```{=latex}
\begin{figure*}[t]\centering
  \includegraphics[width=\textwidth]{kl_relationships_2x2.png}
  \vspace{-0.4em}
  \caption{
  Relationship between KL divergence and task performance.
  Top row: new-task accuracy (Countdown).
  Bottom row: prior-task accuracy (HellaSwag).  Training step indicated per sample.
  ES exhibits increasing KL accompanied by degradation on the prior task, whereas GRPO maintains stable performance across a broader KL range.
  }
  \label{fig:kl-combined}
\end{figure*}
```
@shenfeld2025rlsrazoronlinereinforcement previously established a negative correlation between KL-divergence and previous task score. We therefore examined whether this trend is also reflected in ES-trained models. We first looked at the KL-divergence between the trained and base models on the newly trained task (Figure `\ref{fig:kl-combined}`{=latex}). While ES-trained models increase in KL-divergence with subsequent training steps, this behavior was absent in models trained with GRPO. This can be attributed to the explicit KL-regularization term in GRPO, which prevents continuous drift from the base model.

The trends for KL divergence and accuracy continue to diverge when evaluating previously known tasks (Figure `\ref{fig:kl-combined}`{=latex}). ES shows a clear negative correlation between KL-divergence and old-task performance, and the KL-divergence between the fine-tuned and base models increases with the number of training iterations. GRPO, however, continues to show no association between the number of training steps and KL-divergence, or between KL-divergence and previous task accuracy. Therefore, KL-divergence is not a uniformly reliable indicator of catastrophic forgetting across GRPO and ES.

```{=latex}
\centering
```
![ Log relationship between Frobenius norm of a model update and number of training iterations on a new task (Countdown). ES-trained models drift several orders of magnitude more than GRPO-trained models. ](frobenius_vs_iteration_log.png){#fig:hellaswag-frobenius-log width="\\linewidth"}

```{=latex}
\begin{figure*}[t]\centering

\vspace{0.4em}
\begin{minipage}[t]{0.49\textwidth}
  \centering
  \includegraphics[width=\linewidth]{Qwen2.5-1.5B.png}\\[-0.2em]
  {\footnotesize\textbf{(a)} Qwen-2.5-1.5B}
\end{minipage}
\hfill
\begin{minipage}[t]{0.49\textwidth}
  \centering
  \includegraphics[width=\linewidth]{Llama3.2-1B.png}\\[-0.2em]
  {\footnotesize\textbf{(b)} Llama-3.2-1B}
\end{minipage}
\vspace{-0.6em}
\caption{
Mean accuracy curves for ES and GRPO runs across datasets: Countdown, GSM8K, MATH, and OlympiadBench.
}
\label{fig:es-grpo-comparison}
\end{figure*}
```

[^1]: Equal contribution.

[^2]: Our codebase can be found here - <https://github.com/akshat57/es-catastrophic>

[^3]: Our models can be found here - <https://huggingface.co/collections/immanuelabdi/es-at-scale-lead-to-catastrophic-forgetting>
