Evolution Strategies as a Scalable Alternative to Reinforcement Learning

Source

Salimans, Ho, Chen, Sidor, and Sutskever (OpenAI, 2017), "Evolution Strategies as a Scalable Alternative to Reinforcement Learning," arXiv:1703.03864.

Core Claim

The OpenAI 2017 paper argues that evolution strategies (ES) are a scalable black-box optimization alternative to MDP-based RL algorithms such as Q-learning and policy gradients for policy optimization, evaluated on MuJoCo continuous control and Atari.
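
Concretely, ES here means optimizing the Gaussian-smoothed return and estimating its gradient from scalar returns alone. The identity below is the standard score-function form the paper instantiates (notation is conventional, not quoted from the paper):

    \nabla_\theta\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[F(\theta + \sigma\epsilon)\big]
      = \frac{1}{\sigma}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[F(\theta + \sigma\epsilon)\,\epsilon\big]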

Key Contributions

  • Shows ES can solve 3D humanoid walking in about 10 minutes using 1,440 parallel CPU cores, and obtains competitive results on most Atari games after one hour of training.
  • Uses common random numbers: because every machine can regenerate a perturbation from its seed, distributed workers exchange only scalar fitness values and random seeds rather than full gradients or parameter vectors (see the sketch after this list).
  • Emphasizes ES advantages: tolerance of delayed rewards and long horizons, invariance to action frequency, no need for value-function approximation, and no backpropagation through the policy.
  • Frames lower data efficiency as partly offset by easier parallel scaling and lower per-step computation.
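
A minimal sketch of the seed trick, assuming NumPy and a toy fitness function in place of an episode rollout (names and constants here are illustrative, not the paper's code):

    import numpy as np

    def noise_from_seed(seed, dim):
        # The same seed yields a bit-identical noise vector on every machine.
        return np.random.RandomState(seed).randn(dim)

    def fitness(theta):
        # Stand-in for an episode return from a policy rollout.
        return -np.sum(theta ** 2)

    theta, sigma = np.zeros(5), 0.1

    # Worker side: perturb locally, then report only two scalars.
    seed = 123
    ret = fitness(theta + sigma * noise_from_seed(seed, theta.size))

    # Learner side: regenerate the identical perturbation from the seed,
    # so no parameter vectors or gradients ever cross the network.
    eps = noise_from_seed(seed, theta.size)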

Method Notes

This paper is the historical anchor for the modern ES revival thread. It perturbs policy-network parameters directly rather than LLM parameters, but it establishes the design pattern later work reuses: evaluate perturbations independently, communicate scalar rewards, and scale through distributed, inference-like execution. A single-process sketch of that loop follows.
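
The sketch below applies the paper's update direction θ ← θ + (α / (nσ)) Σᵢ Fᵢ εᵢ to a toy objective. The objective, population size, and step sizes are illustrative; the paper additionally uses rank-based fitness shaping, for which plain return centering stands in here:

    import numpy as np

    def fitness(theta):
        # Toy objective standing in for an episode return.
        return -np.sum((theta - 1.0) ** 2)

    def es_step(theta, n=100, sigma=0.1, alpha=0.02, rng=np.random.default_rng(0)):
        # Each row of eps is one worker's independent perturbation; in the
        # paper this loop is distributed and only scalar returns are gathered.
        eps = rng.standard_normal((n, theta.size))
        returns = np.array([fitness(theta + sigma * e) for e in eps])
        # Center and scale returns (a simple stand-in for rank shaping).
        adv = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Gradient estimate: (1 / (n * sigma)) * sum_i F_i * eps_i
        return theta + alpha / (n * sigma) * (adv @ eps)

    theta = np.zeros(5)
    for _ in range(300):
        theta = es_step(theta)
    # theta moves toward the toy optimum at all-ones.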

Evidence And Results

The experiments compare ES with TRPO on MuJoCo continuous-control tasks and with A3C baselines on Atari. The key lesson for the knowledge base is not that ES always dominates RL, but that ES shifts the bottleneck from gradient communication and credit assignment to population evaluation and distributed throughput.

Open Questions

  • Which of the 2017 advantages survive the move from policy networks in simulators to LLM post-training and tool-use trajectories?
  • Does ES mainly win when parallel inference is cheap, or when the reward landscape is hard for differentiable credit assignment?