Evolution Strategies at the Hyperscale

Source

Core Claim

This paper introduces EGGROLL, a low-rank perturbation method that makes evolution strategies hardware-efficient enough for billion-parameter models and very large populations.

Key Contributions

  • Replaces full-rank Gaussian perturbation matrices with low-rank factors, reducing auxiliary perturbation storage from O(mn) to O(r(m+n)) per weight matrix.
  • Uses counter-based deterministic RNG and batched low-rank adapter-style inference to avoid materializing perturbations.
  • Reports speedups of up to 100× for billion-parameter models at large population sizes, reaching up to 91% of the throughput of pure batch inference.
  • Shows EGGROLL can pretrain nonlinear recurrent language models in integer datatypes, compete with GRPO for reasoning post-training, and preserve ES behavior in tabula rasa RL.
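A rough sketch of the storage argument and the counter-based RNG idea follows; the shapes, seeding scheme, and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# A full-rank perturbation for an m x n weight matrix stores m*n values;
# the low-rank factors store only r*(m + n).
m, n, r = 4096, 4096, 4
full_rank_count = m * n               # O(mn)
low_rank_count = r * (m + n)          # O(r(m+n))
print(full_rank_count // low_rank_count)  # -> 512: ~512x fewer values stored

# Counter-based deterministic RNG (illustrative): the factors for a given
# (member, layer) pair are regenerated on demand from a deterministic seed,
# so perturbations never need to be stored or communicated.
def sample_factors(member, layer, m, n, r):
    rng = np.random.default_rng(seed=(member, layer))  # deterministic per counter
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    return A, B

A1, B1 = sample_factors(0, 0, m, n, r)
A2, B2 = sample_factors(0, 0, m, n, r)
assert np.array_equal(A1, A2)  # same counter -> same factors, nothing materialized
```

The same counter trick is standard in distributed ES: workers exchange only seeds and fitness scalars, regenerating perturbations locally.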

Method Notes

EGGROLL samples low-rank factors A and B for each population member, forms the rank-r perturbation AB^T / sqrt(r), and weights each member's perturbation by its scalar fitness when aggregating the update. Individual perturbations are low-rank, but the aggregate update can be full-rank once population size times rank exceeds the matrix dimension.
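A minimal numpy sketch of this update rule, under assumed shapes, a stand-in fitness function, and standardized fitness weights (all illustrative, not the paper's exact estimator):

```python
import numpy as np

# Each member i perturbs W by A_i B_i^T / sqrt(r); the update is a
# fitness-weighted average of those rank-r perturbations.
rng = np.random.default_rng(0)
m, n, r, pop = 8, 6, 2, 32
W = rng.standard_normal((m, n))

def fitness(W_pert):
    # Stand-in objective for illustration: higher is better.
    return -np.sum(W_pert ** 2)

As = rng.standard_normal((pop, m, r))
Bs = rng.standard_normal((pop, n, r))
f = np.array([fitness(W + As[i] @ Bs[i].T / np.sqrt(r)) for i in range(pop)])
weights = (f - f.mean()) / (f.std() + 1e-8)   # standardized fitness scores

lr = 0.01
update = sum(weights[i] * (As[i] @ Bs[i].T) for i in range(pop)) / (pop * np.sqrt(r))
W = W + lr * update

# The aggregate is a sum of pop rank-r terms, so it recovers full rank
# once pop * r exceeds min(m, n): here 32 * 2 = 64 >= 6.
print(np.linalg.matrix_rank(update))  # 6: full rank despite rank-2 perturbations
```

This makes the rank remark concrete: no single member explores more than an r-dimensional subspace, yet the population-level update is unrestricted.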

Evidence And Results

The paper’s most distinctive evidence is systems-level: EGGROLL improves arithmetic intensity by sharing the base matrix multiply and applying perturbations in a LoRA-like batched form. It also reports int8 recurrent LM pretraining from scratch and reasoning experiments on Countdown and GSM8K with RWKV-family models.
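The batched LoRA-like application can be sketched as follows; the shapes and the einsum formulation are assumptions for illustration, not the paper's kernels:

```python
import numpy as np

# The base matmul x @ W is shared across the population; each member's
# perturbation is applied as two thin matmuls, so the perturbed weight
# matrices W + A_i B_i^T are never materialized.
rng = np.random.default_rng(0)
n_in, n_out, r, pop, batch = 16, 12, 2, 4, 3

W = rng.standard_normal((n_in, n_out))
A = rng.standard_normal((pop, n_in, r))
B = rng.standard_normal((pop, n_out, r))
x = rng.standard_normal((pop, batch, n_in))   # one input batch per member

base = x @ W                                              # shared dense matmul
delta = np.einsum('pbi,pir,por->pbo', x, A, B) / np.sqrt(r)  # thin low-rank path
y_fast = base + delta

# Reference path: materialize each perturbed matrix (what is being avoided).
y_ref = np.stack([x[p] @ (W + A[p] @ B[p].T / np.sqrt(r)) for p in range(pop)])
assert np.allclose(y_fast, y_ref)
```

Sharing the dense matmul is what raises arithmetic intensity: the O(mn)-cost operation runs once per batch rather than once per population member.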

Alex Context

Alex marked this source as normal priority, with no tags. His note highlights this as the next step after ES-at-scale work: low-rank factorization turns the simple ES idea into a more practical hyperscale implementation.

Open Questions

  • Does rank-1 or very low-rank perturbation remain sufficient for dense Transformer LLMs, beyond recurrent architectures and selected reasoning tasks?
  • How should EGGROLL be compared against LoRA-style gradient updates when both use low-rank structure but optimize through different signals?
  • Can integer-only or non-differentiable language-model components become a practical advantage rather than a niche demonstration?