SGD at the Edge of Stability: The Stochastic Sharpness Gap
Source
- Raw Markdown: paper_stochastic-sharpness-gap-2026.md
- PDF: paper_stochastic-sharpness-gap-2026.pdf
- Preprint: arXiv 2604.21016
- Gonzo ML discussion: post 5300
- Review: ArxivIQ note
Status And Credibility
Recent April 2026 arXiv preprint. Treat as credible theory and controlled-experiment evidence, but not as peer-reviewed or as a large-scale training recipe until venue status and larger-scale replications are known.
Core Claim
Mini-batch SGD does not equilibrate at the same sharpness threshold as full-batch gradient descent. Gradient noise projected onto the top Hessian eigenvector increases the variance of the edge-of-stability oscillation, which strengthens the cubic self-stabilizing force and moves the equilibrium full-batch sharpness below $2/\eta$, where $\eta$ is the learning rate.
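As a reminder of where the $2/\eta$ threshold comes from, the following minimal sketch runs plain gradient descent on a one-dimensional quadratic. It only illustrates the deterministic stability edge; it does not model the cubic self-stabilization term or the gradient noise that the paper analyzes. The function name `gd_final_magnitude` and all constants are hypothetical choices for this illustration.

```python
def gd_final_magnitude(curvature, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return |x| at the end."""
    x = x0
    for _ in range(steps):
        # The gradient is curvature * x, so each step multiplies x by (1 - lr * curvature).
        x -= lr * curvature * x
    return abs(x)


lr = 0.1
threshold = 2.0 / lr  # classical stability edge: |1 - lr * curvature| < 1 iff curvature < 2 / lr

below = gd_final_magnitude(0.9 * threshold, lr)  # curvature just under the edge: iterates shrink
above = gd_final_magnitude(1.1 * threshold, lr)  # curvature just over the edge: iterates grow

print(f"threshold={threshold:.1f}  below-edge |x|={below:.3e}  above-edge |x|={above:.3e}")
```

On a pure quadratic the iterates either contract or diverge. The sustained oscillation at the edge that the note describes needs higher-order terms in the loss, which this toy omits.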
Key Contributions
- Extends edge-of-stability self-stabilization theory from full-batch gradient descent to mini-batch SGD.
- Derives a closed-form expression for the stochastic sharpness gap, in which the projected gradient-noise variance is the batch-size-sensitive term.
- Separates full-batch sharpness from batch sharpness: full-batch sharpness can sit below the $2/\eta$ threshold, while batch sharpness is the quantity that approaches the stochastic stability edge.
- Reports controlled experiments on FC-Tanh, FC-ReLU, CNN, and ResNet settings showing batch-size and noise-variance scaling consistent with the theory.
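The batch-size sensitivity of the projected noise term in the list above can be illustrated on a toy problem. The sketch below is not the paper's setup: it assumes a small least-squares model, mini-batches sampled with replacement, and a power-iteration estimate of the top Hessian eigenvector. Under those assumptions the variance of the mini-batch gradient projected onto that eigenvector is the per-example variance divided by the batch size. All names and constants are hypothetical.

```python
import random
import statistics


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_eigvec(H, iters=200):
    """Power iteration for the top eigenvector of a symmetric PSD matrix."""
    d = len(H)
    v = [1.0] * d
    for _ in range(iters):
        hv = [dot(H[i], v) for i in range(d)]
        norm = dot(hv, hv) ** 0.5
        v = [c / norm for c in hv]
    return v


def projected_noise_variance(proj, batch_size, trials, rng):
    """Monte Carlo variance of the mini-batch mean of per-example projected gradients."""
    means = [statistics.fmean(rng.choices(proj, k=batch_size)) for _ in range(trials)]
    return statistics.pvariance(means)


rng = random.Random(0)
d, n = 3, 512
scales = [3.0, 1.0, 0.5]  # anisotropic inputs give a well-separated top eigenvalue
X = [[s * rng.gauss(0, 1) for s in scales] for _ in range(n)]
w_true = [1.0, -2.0, 0.5]
y = [dot(x, w_true) + rng.gauss(0, 1) for x in X]
w = [0.0] * d  # evaluate the noise at an arbitrary point away from the optimum

# Least-squares Hessian: H = (1/n) * sum_i x_i x_i^T
H = [[sum(x[i] * x[j] for x in X) / n for j in range(d)] for i in range(d)]
v = top_eigvec(H)

# Per-example gradient (w . x_i - y_i) * x_i, projected onto the top eigenvector v
proj = [(dot(x, w) - t) * dot(x, v) for x, t in zip(X, y)]

batch_sizes = (1, 4, 16)
variances = {b: projected_noise_variance(proj, b, 5000, rng) for b in batch_sizes}
exact = {b: statistics.pvariance(proj) / b for b in batch_sizes}

for b in batch_sizes:
    print(f"b={b:2d}  monte_carlo={variances[b]:.3f}  per_example_var/b={exact[b]:.3f}")
```

Whether the same inverse scaling in batch size holds for the networks and sampling schemes in the paper is an empirical question that the paper's experiments address. The sketch only shows the mechanism in the simplest setting.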
Important Caveats
The experiments are controlled optimization studies, mostly on CIFAR-10 subsets and vanilla SGD variants. The paper does not directly test large language models, modern TSFMs, AdamW/Muon-style optimizers, distributed training, or pretraining-scale data mixtures.
The paper also reports that a CNN trained with cross-entropy on a small dataset does not enter the edge-of-stability regime: sharpness peaks far below the $2/\eta$ threshold and then collapses. The mechanism is therefore conditional on the training regime, the loss, and sustained progressive sharpening.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Training recipe and optimizer dynamics | adjacent | Batch size, learning rate, projected noise variance, and sharpness can change the effective training equilibrium. | Needs evidence for TSFM pretraining losses, optimizers, data mixtures, and model scales. |
| Scaling and efficiency | warning | Smaller batches may induce flatter full-batch sharpness through stochastic self-stabilization, but the relevant curvature is batch sharpness. | Need matched compute, throughput, and generalization comparisons, not only curvature diagnostics. |
| Benchmark hygiene | warning | A reported “sharpness” metric can mean full-batch Hessian sharpness or batch sharpness, with different stability implications. | Need explicit measurement protocol and loss/optimizer context. |
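To make the measurement ambiguity in the benchmark-hygiene row concrete, the sketch below computes both quantities on a toy least-squares problem. The definition of batch sharpness used here is an assumption, since this note does not spell it out: the average over mini-batches of the mini-batch Hessian's curvature along the mini-batch gradient, $g_B^\top H_B g_B / \lVert g_B \rVert^2$. Full-batch sharpness is taken to be the top eigenvalue of the full-batch Hessian. All function names and constants are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def directional_curvature(X, y, w, idx):
    """g_B^T H_B g_B / ||g_B||^2 for the least-squares loss on the batch `idx`."""
    m, d = len(idx), len(w)
    g = [0.0] * d
    for i in idx:
        r = dot(X[i], w) - y[i]  # residual of example i
        for k in range(d):
            g[k] += r * X[i][k] / m
    hg = [0.0] * d  # H_B g with H_B = (1/m) * sum_i x_i x_i^T
    for i in idx:
        c = dot(X[i], g)
        for k in range(d):
            hg[k] += c * X[i][k] / m
    return dot(g, hg) / dot(g, g)


def batch_sharpness(X, y, w, batches):
    """Average directional curvature over a list of mini-batches (assumed definition)."""
    vals = [directional_curvature(X, y, w, idx) for idx in batches]
    return sum(vals) / len(vals)


def full_batch_sharpness(X, iters=200):
    """Top eigenvalue of H = X^T X / n, estimated by power iteration."""
    n, d = len(X), len(X[0])
    v = [1.0] * d
    lam = 0.0
    for _ in range(iters):
        hv = [0.0] * d
        for x in X:
            c = dot(x, v)
            for k in range(d):
                hv[k] += c * x[k] / n
        lam = dot(hv, hv) ** 0.5  # converges to the top eigenvalue as v aligns with it
        v = [c / lam for c in hv]
    return lam


rng = random.Random(0)
n, scales = 256, [3.0, 1.0, 0.5]
X = [[s * rng.gauss(0, 1) for s in scales] for _ in range(n)]
y = [dot(x, [1.0, -2.0, 0.5]) + rng.gauss(0, 1) for x in X]
w = [0.0, 0.0, 0.0]

lam_max = full_batch_sharpness(X)
bs_one = batch_sharpness(X, y, w, [[i] for i in range(n)])  # every batch of size 1
bs_full = batch_sharpness(X, y, w, [list(range(n))])        # the single full batch

print(f"full-batch sharpness={lam_max:.3f}  batch sharpness (b=1)={bs_one:.3f}  (b=n)={bs_full:.3f}")
```

In this toy, batch sharpness at batch size 1 equals the trace of the Hessian and so is at least the top eigenvalue, while at full batch it is a Rayleigh quotient and so is at most the top eigenvalue. A reported "sharpness" number is therefore hard to interpret without the batch size and the definition.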
Links Into The Wiki
Open Questions
- Does the stochastic sharpness gap appear in TSFM pretraining with pinball loss, flow matching, masked reconstruction, or latent predictive objectives?
- How do AdamW, Muon, momentum, weight decay, gradient clipping, and distributed data parallelism change the projected-noise mechanism?
- Can batch sharpness or projected noise variance become useful diagnostics for choosing TSFM batch size and learning rate?