Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning

Source

Core Claim

This paper argues that evolution strategies (ES) can directly fine-tune all parameters of billion-parameter LLMs without backpropagation, challenging the assumption that ES cannot scale to modern post-training.

Key Contributions

  • Scales ES to full-parameter LLM fine-tuning rather than restricting search to final layers, adapters, or action-space evolution.
  • Uses inference-only perturbation evaluation and small populations to search billion-parameter spaces (see the sketch after this list).
  • Compares ES against PPO, GRPO, and Dr. GRPO-style RL baselines in reasoning and conciseness settings.
  • Reports stronger robustness across base LLMs, more stable training runs, less reward hacking in conciseness tuning, and better tolerance of sparse, delayed rewards.
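A minimal sketch of what seed-based, inference-only perturbation evaluation can look like, in Python with NumPy. Regenerating noise from stored seeds rather than materializing full perturbed copies is the memory trick from the 2017 OpenAI ES line of work; whether the paper uses exactly this scheme, and the reward function, dimensions, population size, and sigma below, are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def evaluate(params: np.ndarray) -> float:
    """Stand-in for a rollout: in the real setting this would load the
    perturbed weights, generate responses, and score them with the task
    reward. A toy quadratic here so the sketch runs end to end."""
    return -float(np.sum((params - 1.0) ** 2))

dim, pop_size, sigma = 10_000, 8, 0.01  # tiny stand-ins; the paper searches billions of dims
theta = np.zeros(dim)                   # mean parameters (the model weights)

seeds = np.random.randint(0, 2**31, size=pop_size)
rewards = []
for seed in seeds:
    # Regenerate each perturbation from its seed instead of storing
    # pop_size full parameter copies -- only theta has to live in memory.
    eps = np.random.default_rng(int(seed)).standard_normal(dim)
    rewards.append(evaluate(theta + sigma * eps))  # forward passes only, no backprop
```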

Method Notes

The paper treats LLM fine-tuning as black-box optimization directly in parameter space. Each perturbation of the weights is scored with a response-level reward from inference-only rollouts, and a reward-weighted combination of the perturbations then moves the mean parameters. This makes ES a post-training path for outcome-only objectives where token-level credit assignment and differentiable losses are awkward.
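Concretely, the population-weighted update described above is the canonical ES step theta <- theta + (alpha / (n * sigma)) * sum_i r_i * eps_i, usually with normalized rewards. A sketch continuing from the evaluation loop above; the z-score normalization and learning rate are my assumptions, not details confirmed by the paper.

```python
alpha = 0.02                           # learning rate (illustrative)
r = np.asarray(rewards)
r = (r - r.mean()) / (r.std() + 1e-8)  # normalize response-level rewards

# Population-weighted update: shift the mean parameters toward
# perturbations that scored well and away from those that scored poorly.
step = np.zeros(dim)
for seed, ri in zip(seeds, r):
    eps = np.random.default_rng(int(seed)).standard_normal(dim)  # regenerated, not stored
    step += ri * eps
theta += (alpha / (pop_size * sigma)) * step
```

Because only seeds and scalar rewards move between workers, the same loop parallelizes with near-zero communication, which is what makes small-population search in billion-parameter spaces practical.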

Evidence And Results

The strongest evidence comes from reasoning and puzzle-style tasks where the reward arrives only after a long generated trajectory. The paper reports ES improvements on Countdown across multiple Qwen- and LLaMA-family base models, a Sudoku test solve rate rising from 2% to 66.5% with Qwen-2.5-3B-Instruct, and ARC-AGI experiments in which the RL attempts are characterized as weak reranking rather than strategy discovery.

Alex Context

Alex marked this source as important and as read. His note of October 9, 2025 frames the paper as a possible revival of ES as an RL alternative for LLMs, explicitly connecting it back to the 2017 OpenAI ES result and to the previously missing bridge from that work to billion-parameter language models.

Open Questions

  • Does ES remain competitive when reward models are noisy, adversarial, or preference-based rather than executable task rewards?
  • Can ES and gradient-based post-training be composed cleanly, or do they optimize incompatible behavior distributions?
  • Is the reduced reward hacking a stable property of population-distribution optimization or a task-specific artifact?