Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning
Source
- Raw Markdown: paper_evolution-strategies-at-scale-2025.md
- PDF: paper_evolution-strategies-at-scale-2025.pdf
- Code: https://github.com/VsonicV/es-fine-tuning-paper
- Review/context: https://arxiviq.substack.com/p/evolution-strategies-at-scale-llm
- Related article from Alex’s note: https://medium.com/@evolutionmlmail/evolution-strategies-at-scale-fine-tuning-harder-tasks-b4f29be26ae7
Core Claim
This paper argues that evolution strategies (ES) can directly fine-tune all parameters of billion-parameter LLMs without backpropagation, challenging the assumption that ES cannot scale to modern post-training.
Key Contributions
- Scales ES to full-parameter LLM fine-tuning rather than restricting search to final layers, adapters, or action-space evolution.
- Uses inference-only perturbation evaluation and small populations to search in billion-parameter spaces.
- Compares ES against PPO, GRPO, and Dr.GRPO-style RL baselines on reasoning and conciseness settings.
- Reports stronger robustness across base LLMs, more stable runs, less reward hacking in conciseness tuning, and better tolerance for sparse delayed rewards.
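The inference-only perturbation evaluation in the second bullet is what makes full-parameter search feasible: workers never exchange billion-parameter noise vectors, only integer seeds from which the same Gaussian perturbation can be regenerated locally. The sketch below illustrates that seed-sharing trick (standard since the 2017 OpenAI ES work); function names are hypothetical and the paper's exact implementation may differ.

```python
import numpy as np

def perturb(theta, seed, sigma=0.001):
    """Regenerate the perturbation for a given seed and return the
    perturbed parameters. Only the seed needs to be communicated."""
    noise = np.random.default_rng(seed).standard_normal(theta.size)
    return theta + sigma * noise

def evaluate_population(theta, reward_fn, seeds, sigma=0.001):
    """Inference-only evaluation: one forward-pass reward per seed,
    no gradients or activations retained."""
    return np.array([reward_fn(perturb(theta, s, sigma)) for s in seeds])
```

Because each worker can rebuild any population member from `(theta, seed)`, the per-perturbation memory cost is one model copy plus an integer, which is what lets small populations search billion-parameter spaces.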
Method Notes
The paper treats LLM fine-tuning as black-box parameter-space optimization. Each perturbation is evaluated with a response-level reward, and the population-weighted update then moves the mean parameters. This makes ES a post-training path for outcome-only objectives where token-level credit assignment and differentiable losses are awkward.
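The update described above can be sketched in a few lines of numpy. This is an illustrative toy under stated assumptions (mirrored sampling, rank-normalized rewards, a quadratic reward standing in for a response-level task reward), not the paper's implementation; all names are hypothetical.

```python
import numpy as np

def es_step(theta, reward_fn, rng, pop_size=30, sigma=0.01, lr=0.02):
    """One ES update: sample mirrored Gaussian perturbations, score each
    with a single scalar reward, then move the mean parameters along the
    reward-weighted noise directions. No backpropagation anywhere."""
    half = pop_size // 2
    eps = rng.standard_normal((half, theta.size))
    eps = np.concatenate([eps, -eps])                  # antithetic sampling
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    ranks = rewards.argsort().argsort().astype(float)  # rank-normalize rewards
    weights = ranks / (len(ranks) - 1) - 0.5           # centered in [-0.5, 0.5]
    grad_est = weights @ eps / (len(eps) * sigma)      # population-weighted direction
    return theta + lr * grad_est

# Toy stand-in for an outcome-only objective: negative distance to a target.
rng = np.random.default_rng(0)
target = np.ones(8)
theta = np.zeros(8)
for _ in range(300):
    theta = es_step(theta, lambda t: -float(np.sum((t - target) ** 2)), rng)
```

Rank normalization makes the update invariant to the reward scale, which is one plausible reason ES tolerates sparse, delayed, outcome-only rewards better than token-level policy gradients.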
Evidence And Results
The strongest evidence is on reasoning and puzzle-style tasks where the reward arrives only after a long generated trajectory. The paper reports ES improvements on Countdown across multiple Qwen/LLaMA-family bases, a Sudoku test solve rate gain from 2% to 66.5% with Qwen-2.5-3B-Instruct, and ARC-AGI experiments where the RL attempts are characterized as weak reranking rather than genuine strategy discovery.
Alex Context
Alex marked this source as important and read. His October 9, 2025 note frames it as a possible revival of ES as an RL alternative for LLMs, explicitly connecting it back to the 2017 OpenAI ES result and the missing bridge to billion-parameter language models.
Links Into The Wiki
- Evolution Strategies
- Evolution Strategies as a Scalable Alternative to Reinforcement Learning
- Evolution Strategies at the Hyperscale
- EGGROLL
Open Questions
- Does ES remain competitive when reward models are noisy, adversarial, or preference-based rather than executable task rewards?
- Can ES and gradient-based post-training be composed cleanly, or do they optimize incompatible behavior distributions?
- Is the reduced reward hacking a stable property of population-distribution optimization or a task-specific artifact?