Evolution Strategies at the Hyperscale
Source
- Raw Markdown: paper_evolution-strategies-at-the-hyperscale-2025.md
- PDF: paper_evolution-strategies-at-the-hyperscale-2025.pdf
- Code/project: https://eshyperscale.github.io/
- Review/context: https://arxiviq.substack.com/p/evolution-strategies-at-the-hyperscale
Core Claim
This paper introduces EGGROLL, a low-rank perturbation method that makes evolution strategies hardware-efficient enough for billion-parameter models and very large populations.
Key Contributions
- Replaces full-rank Gaussian perturbation matrices with low-rank factors, reducing auxiliary perturbation storage from O(mn) to O(r(m+n)) per matrix layer.
- Uses counter-based deterministic RNG and batched low-rank adapter-style inference to avoid materializing perturbations.
- Reports up to a hundredfold speedup for billion-parameter models at large population sizes, reaching up to 91% of pure batch-inference throughput.
- Shows EGGROLL can pretrain nonlinear recurrent language models in integer datatypes, compete with GRPO for reasoning post-training, and preserve ES behavior in tabula rasa RL.
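The storage reduction in the first bullet is easy to make concrete. A minimal arithmetic sketch (the layer shape and rank here are illustrative choices, not numbers from the paper):

```python
# Illustrative per-layer perturbation storage:
# full-rank O(m*n) vs. low-rank O(r*(m+n)).
m, n, r = 4096, 4096, 4  # hypothetical layer shape and perturbation rank

full_rank_params = m * n        # one dense m x n perturbation matrix
low_rank_params = r * (m + n)   # factors A (m x r) and B (n x r)

print(full_rank_params)                    # 16777216
print(low_rank_params)                     # 32768
print(full_rank_params // low_rank_params) # 512x smaller
```

At this shape the low-rank factors are 512x smaller, and the gap widens as rank shrinks or the layer grows.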
Method Notes
EGGROLL samples factors A and B, forms a rank-r perturbation AB^T / sqrt(r), and weights each member's perturbation by its scalar fitness when aggregating the update. Individual perturbations are low-rank, but the aggregate update can be full-rank when population size times rank exceeds the matrix dimension.
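The step above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: `eggroll_update` and its hyperparameters are hypothetical names, and NumPy's seeded generator stands in for the paper's counter-based deterministic RNG.

```python
import numpy as np

def eggroll_update(W, fitnesses, seeds, r, sigma=0.01, lr=0.1):
    """Sketch of an EGGROLL-style ES step on one weight matrix W (m x n).

    Each population member i gets a rank-r perturbation A_i B_i^T / sqrt(r),
    replayed from its seed instead of being stored. The update is the
    fitness-weighted sum of perturbations; once population size times r
    exceeds min(m, n), the aggregate can be full-rank.
    """
    m, n = W.shape
    pop = len(fitnesses)
    # Normalize fitnesses so the update is invariant to their scale and shift.
    f = (fitnesses - fitnesses.mean()) / (fitnesses.std() + 1e-8)
    update = np.zeros_like(W)
    for seed, fi in zip(seeds, f):
        rng = np.random.default_rng(seed)  # seed deterministically replays the noise
        A = rng.standard_normal((m, r))
        B = rng.standard_normal((n, r))
        update += fi * (A @ B.T) / np.sqrt(r)
    return W + lr * sigma * update / pop
```

Because each perturbation is reconstructed from its seed, workers only ever exchange seeds and scalar fitnesses, never weight matrices.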
Evidence And Results
The paper’s most distinctive evidence is systems-level: EGGROLL improves arithmetic intensity by sharing the base matrix multiply and applying perturbations in a LoRA-like batched form. It also reports int8 recurrent LM pretraining from scratch and reasoning experiments on Countdown and GSM8K with RWKV-family models.
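The LoRA-like batched form can be sketched in a few lines: the base matmul is shared across the population, and each member's rank-r correction is applied without ever forming an m x n perturbation. This is a hedged sketch under my own naming, not the paper's code.

```python
import numpy as np

def batched_perturbed_forward(x, W, A, B, r):
    """Compute y_i = x_i (W + A_i B_i^T / sqrt(r))^T for a whole population
    without materializing any m x n perturbation.

    x: (pop, n) inputs, W: (m, n) shared base weights,
    A: (pop, m, r) and B: (pop, n, r) per-member low-rank factors.
    """
    base = x @ W.T  # one shared base matmul, amortized over the population
    # x (W + A B^T)^T = x W^T + (x B) A^T: push x through the thin factors.
    xb = np.einsum('pn,pnr->pr', x, B)              # (pop, r)
    delta = np.einsum('pr,pmr->pm', xb, A) / np.sqrt(r)  # (pop, m)
    return base + delta
```

Sharing `x @ W.T` is what recovers high arithmetic intensity: the population-specific work touches only the thin (m x r) and (n x r) factors.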
Alex Context
Alex marked this source as normal and none. His note highlights this as the next step after ES-at-scale work: low-rank factorization turns the simple ES idea into a more practical hyperscale implementation.
Links Into The Wiki
- Evolution Strategies
- EGGROLL
- Evolution Strategies at Scale
- Evolution Strategies as a Scalable Alternative to Reinforcement Learning
Open Questions
- Is rank-1 or very low-rank perturbation sufficient for dense Transformer LLMs, beyond the recurrent architectures and selected reasoning tasks tested here?
- How should EGGROLL be compared against LoRA-style gradient updates when both use low-rank structure but optimize through different signals?
- Can integer-only or non-differentiable language-model components become a practical advantage rather than a niche demonstration?