Evolution Strategies at the Hyperscale

Source

Core Claim

This paper introduces EGGROLL, a low-rank perturbation method that makes evolution strategies hardware-efficient enough for billion-parameter models and very large populations.

Key Contributions

  • Replaces full-rank Gaussian perturbation matrices with low-rank factors, reducing auxiliary perturbation storage from O(mn) to O(r(m+n)) per weight matrix.
  • Uses counter-based deterministic RNG and batched low-rank adapter-style inference to avoid materializing perturbations.
  • Reports speedups of up to 100× for billion-parameter models at large population sizes, reaching up to 91% of the throughput of pure batch inference.
  • Shows EGGROLL can pretrain nonlinear recurrent language models in integer datatypes, compete with GRPO for reasoning post-training, and preserve ES behavior in tabula rasa RL.
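A rough sketch of the storage argument and the counter-based RNG idea follows; the shapes, seeding scheme, and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# A full-rank perturbation for an m x n weight matrix stores m*n values;
# the low-rank factors store only r*(m + n).
m, n, r = 4096, 4096, 4
full_rank_count = m * n               # O(mn)
low_rank_count = r * (m + n)          # O(r(m+n))
print(full_rank_count // low_rank_count)  # -> 512: ~512x fewer values stored

# Counter-based deterministic RNG (illustrative): the factors for a given
# (member, layer) pair are regenerated on demand from a deterministic seed,
# so perturbations never need to be stored or communicated.
def sample_factors(member, layer, m, n, r):
    rng = np.random.default_rng(seed=(member, layer))  # deterministic per counter
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    return A, B

A1, B1 = sample_factors(0, 0, m, n, r)
A2, B2 = sample_factors(0, 0, m, n, r)
assert np.array_equal(A1, A2)  # same counter -> same factors, nothing materialized
```

The same counter trick is standard in distributed ES: workers exchange only seeds and fitness scalars, regenerating perturbations locally.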

Method Notes

EGGROLL samples low-rank factors A and B for each population member, forms the rank-r perturbation AB^T / sqrt(r), and weights each member's perturbation by its scalar fitness when aggregating the update. Individual perturbations are low-rank, but the aggregate update can be full-rank once population size times rank exceeds the matrix dimension.
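A minimal numpy sketch of this update rule, under assumed shapes, a stand-in fitness function, and standardized fitness weights (all illustrative, not the paper's exact estimator):

```python
import numpy as np

# Each member i perturbs W by A_i B_i^T / sqrt(r); the update is a
# fitness-weighted average of those rank-r perturbations.
rng = np.random.default_rng(0)
m, n, r, pop = 8, 6, 2, 32
W = rng.standard_normal((m, n))

def fitness(W_pert):
    # Stand-in objective for illustration: higher is better.
    return -np.sum(W_pert ** 2)

As = rng.standard_normal((pop, m, r))
Bs = rng.standard_normal((pop, n, r))
f = np.array([fitness(W + As[i] @ Bs[i].T / np.sqrt(r)) for i in range(pop)])
weights = (f - f.mean()) / (f.std() + 1e-8)   # standardized fitness scores

lr = 0.01
update = sum(weights[i] * (As[i] @ Bs[i].T) for i in range(pop)) / (pop * np.sqrt(r))
W = W + lr * update

# The aggregate is a sum of pop rank-r terms, so it recovers full rank
# once pop * r exceeds min(m, n): here 32 * 2 = 64 >= 6.
print(np.linalg.matrix_rank(update))  # 6: full rank despite rank-2 perturbations
```

This makes the rank remark concrete: no single member explores more than an r-dimensional subspace, yet the population-level update is unrestricted.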

Evidence And Results

The paper’s most distinctive evidence is systems-level: EGGROLL improves arithmetic intensity by sharing the base matrix multiply and applying perturbations in a LoRA-like batched form. It also reports int8 recurrent LM pretraining from scratch and reasoning experiments on Countdown and GSM8K with RWKV-family models.
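The batched LoRA-like application can be sketched as follows; the shapes and the einsum formulation are assumptions for illustration, not the paper's kernels:

```python
import numpy as np

# The base matmul x @ W is shared across the population; each member's
# perturbation is applied as two thin matmuls, so the perturbed weight
# matrices W + A_i B_i^T are never materialized.
rng = np.random.default_rng(0)
n_in, n_out, r, pop, batch = 16, 12, 2, 4, 3

W = rng.standard_normal((n_in, n_out))
A = rng.standard_normal((pop, n_in, r))
B = rng.standard_normal((pop, n_out, r))
x = rng.standard_normal((pop, batch, n_in))   # one input batch per member

base = x @ W                                              # shared dense matmul
delta = np.einsum('pbi,pir,por->pbo', x, A, B) / np.sqrt(r)  # thin low-rank path
y_fast = base + delta

# Reference path: materialize each perturbed matrix (what is being avoided).
y_ref = np.stack([x[p] @ (W + A[p] @ B[p].T / np.sqrt(r)) for p in range(pop)])
assert np.allclose(y_fast, y_ref)
```

Sharing the dense matmul is what raises arithmetic intensity: the O(mn)-cost operation runs once per batch rather than once per population member.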

Alex Context

Alex marked this source as normal priority, with no tags. His note highlights this as the next step after ES-at-scale work: low-rank factorization turns the simple ES idea into a more practical hyperscale implementation.

Open Questions

  • Does rank-1 or very low-rank perturbation remain sufficient for dense Transformer LLMs, beyond recurrent architectures and selected reasoning tasks?
  • How should EGGROLL be compared against LoRA-style gradient updates when both use low-rank structure but optimize through different signals?
  • Can integer-only or non-differentiable language-model components become a practical advantage rather than a niche demonstration?