SGD at the Edge of Stability: The Stochastic Sharpness Gap

Source

Raw Markdown: paper_stochastic-sharpness-gap-2026.md
PDF: paper_stochastic-sharpness-gap-2026.pdf
Preprint: arXiv 2604.21016
Gonzo ML discussion: post 5300
Review: ArxivIQ note

Status And Credibility

Recent April 2026 arXiv preprint. Treat as credible theory and controlled-experiment evidence, but not as peer-reviewed or as a large-scale training recipe until venue status and larger-scale replications are known.

Core Claim

Mini-batch SGD does not sit at the same full-batch sharpness threshold as full-batch gradient descent. Gradient noise projected onto the top Hessian eigenvector increases the variance of the edge-of-stability oscillation, strengthening the cubic self-stabilizing force and moving the equilibrium full-batch sharpness below $2/η$.
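To make "gradient noise projected onto the top Hessian eigenvector" concrete, here is a minimal sketch on a toy least-squares problem. The toy loss, the function name `projected_noise_variance`, and the synthetic data are illustrative assumptions, not taken from the paper. The sketch also assumes mini-batches are drawn i.i.d. with replacement, so the projected variance of a mini-batch gradient is the per-sample projected variance divided by the batch size.

```python
import numpy as np

def projected_noise_variance(X, y, w, batch_size):
    """Toy estimate for L(w) = mean((x_i @ w - y_i)**2) / 2.

    Returns (full-batch sharpness, projected mini-batch noise variance).
    """
    n = X.shape[0]
    # Full-batch Hessian of the least-squares loss and its top eigenvector u.
    hessian = X.T @ X / n
    eigvals, eigvecs = np.linalg.eigh(hessian)  # eigenvalues in ascending order
    u = eigvecs[:, -1]
    # Per-sample gradients are g_i = (x_i @ w - y_i) * x_i; project them onto u.
    residuals = X @ w - y
    projections = residuals * (X @ u)
    per_sample_var = projections.var()
    # Assumption: i.i.d. sampling with replacement, so the variance of a
    # mini-batch mean is the per-sample variance divided by the batch size.
    return eigvals[-1], per_sample_var / batch_size

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 8))
y = X @ rng.normal(size=8) + 0.5 * rng.normal(size=512)
w = np.zeros(8)

sharpness, sigma_u_sq_16 = projected_noise_variance(X, y, w, batch_size=16)
_, sigma_u_sq_64 = projected_noise_variance(X, y, w, batch_size=64)
print(f"sharpness={sharpness:.3f}  var(B=16)={sigma_u_sq_16:.4f}  var(B=64)={sigma_u_sq_64:.4f}")
```

Under this sampling assumption, quadrupling the batch size cuts the projected variance by a factor of four. On a real network the explicit Hessian would normally be replaced by Hessian-vector products and power iteration.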

Key Contributions

Extends edge-of-stability self-stabilization theory from full-batch gradient descent to mini-batch SGD.
Derives a closed-form stochastic sharpness gap, $ΔS = ηβσ_{u}^{2} / (4α)$, where the projected gradient-noise variance $σ_{u}^{2}$ is the batch-size-sensitive term.
Separates full-batch sharpness from batch sharpness: full-batch sharpness can sit below $2/η$, while batch sharpness is the quantity that approaches the stochastic stability edge.
Reports controlled experiments on FC-Tanh, FC-ReLU, CNN, and ResNet settings showing batch-size and noise-variance scaling consistent with the theory.
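The closed-form gap can be evaluated directly. The sketch below is a hedged illustration: it assumes the gap is measured downward from $2/η$, that the projected noise variance scales as one over the batch size, and it uses made-up values for `alpha`, `beta`, and the per-sample variance, since this note does not define or report those coefficients. The function names are hypothetical.

```python
def stochastic_sharpness_gap(eta, alpha, beta, sigma_u_sq):
    """Gap formula from the note: eta * beta * sigma_u^2 / (4 * alpha)."""
    return eta * beta * sigma_u_sq / (4.0 * alpha)

def equilibrium_full_batch_sharpness(eta, alpha, beta, sigma_u_sq):
    """Assumed reading: the equilibrium sits at 2/eta minus the gap."""
    return 2.0 / eta - stochastic_sharpness_gap(eta, alpha, beta, sigma_u_sq)

# Hypothetical coefficients, chosen only to show the scaling.
eta, alpha, beta = 0.02, 0.5, 2.0
per_sample_var = 1.6  # assumed per-sample projected variance

for batch_size in (8, 32, 128):
    sigma_u_sq = per_sample_var / batch_size  # assumed 1/B scaling
    gap = stochastic_sharpness_gap(eta, alpha, beta, sigma_u_sq)
    eq = equilibrium_full_batch_sharpness(eta, alpha, beta, sigma_u_sq)
    print(f"B={batch_size:4d}  gap={gap:.6f}  equilibrium={eq:.6f}  2/eta={2 / eta:.1f}")
```

With these assumptions the gap is linear in the learning rate and in the projected variance, so halving the batch size doubles the gap, and the gap vanishes in the full-batch limit where the projected noise goes to zero.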

Important Caveats

The experiments are controlled optimization studies, mostly on CIFAR-10 subsets and vanilla SGD variants. The paper does not directly test large language models, modern TSFMs, AdamW/Muon-style optimizers, distributed training, or pretraining-scale data mixtures.

The paper also reports that a CNN with cross-entropy on a small dataset does not enter the edge-of-stability regime: sharpness peaks far below $2/η$ and then collapses. That makes the mechanism conditional on the training regime, loss, and sustained progressive sharpening.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Training recipe and optimizer dynamics	adjacent	Batch size, learning rate, projected noise variance, and sharpness can change the effective training equilibrium.	Needs evidence for TSFM pretraining losses, optimizers, data mixtures, and model scales.
Scaling and efficiency	warning	Smaller batches may induce flatter full-batch sharpness through stochastic self-stabilization, but the relevant curvature is batch sharpness.	Need matched compute, throughput, and generalization comparisons, not only curvature diagnostics.
Benchmark hygiene	warning	A reported “sharpness” metric can mean full-batch Hessian sharpness or batch sharpness, with different stability implications.	Need explicit measurement protocol and loss/optimizer context.
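The ambiguity in the benchmark-hygiene row can be illustrated by computing both quantities on the same toy problem. This is a sketch under a loud assumption: batch sharpness is taken here to be the average curvature of each mini-batch loss along its own mini-batch gradient, a definition used in earlier edge-of-stochastic-stability work; this note does not confirm that the paper uses exactly that definition. The least-squares loss, function names, and synthetic data are also illustrative.

```python
import numpy as np

def full_batch_sharpness(X):
    """Top eigenvalue of the full-batch Hessian X^T X / n for least squares."""
    return float(np.linalg.eigvalsh(X.T @ X / X.shape[0])[-1])

def batch_sharpness(X, y, w, batch_size, num_batches, rng):
    """Assumed definition: mean over mini-batches of g^T H_B g / ||g||^2."""
    n = X.shape[0]
    quotients = []
    for _ in range(num_batches):
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch_size      # mini-batch gradient
        hess_grad = Xb.T @ (Xb @ grad) / batch_size   # mini-batch Hessian-vector product
        quotients.append(grad @ hess_grad / (grad @ grad))
    return float(np.mean(quotients))

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 6))
y = X @ rng.normal(size=6) + 0.3 * rng.normal(size=256)
w = np.zeros(6)

lam_full = full_batch_sharpness(X)
bs_small = batch_sharpness(X, y, w, batch_size=8, num_batches=200, rng=rng)
print(f"full-batch sharpness={lam_full:.3f}  batch sharpness (B=8)={bs_small:.3f}")
```

The two numbers generally differ, which is why a reported sharpness value should state the batch size, the direction probed, and the loss it was measured on.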

Links Into The Wiki

Open Questions

Does the stochastic sharpness gap appear in TSFM pretraining with pinball loss, flow matching, masked reconstruction, or latent predictive objectives?
How do AdamW, Muon, momentum, weight decay, gradient clipping, and distributed data parallelism change the projected-noise mechanism?
Can batch sharpness or projected noise variance become useful diagnostics for choosing TSFM batch size and learning rate?
