SGD at the Edge of Stability: The Stochastic Sharpness Gap

Source

Status And Credibility

Recent April 2026 arXiv preprint. Treat as credible theory and controlled-experiment evidence, but not as peer-reviewed or as a large-scale training recipe until venue status and larger-scale replications are known.

Core Claim

Mini-batch SGD does not equilibrate at the same sharpness threshold as full-batch gradient descent. Gradient noise projected onto the top Hessian eigenvector increases the variance of the edge-of-stability oscillation, which strengthens the cubic self-stabilizing force and moves the equilibrium full-batch sharpness below 2/η, where η is the learning rate.
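
A minimal sketch of the two ingredients in this claim, assuming a one-dimensional quadratic loss 0.5 * curvature * x**2 as a stand-in for the top Hessian direction. It shows the classical gradient-descent stability limit at curvature = 2 / lr, and that additive gradient noise inflates the oscillation variance along that direction. The function names, the Gaussian noise model, and all constants are illustrative assumptions, not taken from the paper. A pure quadratic has no progressive sharpening and no cubic term, so this does not reproduce the self-stabilization or the sharpness gap itself.

```python
import random

def run_gd(curvature, lr, steps=200, x0=1.0):
    """Gradient descent on 0.5 * curvature * x**2; returns the final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x  # multiplies x by (1 - lr * curvature) each step
    return abs(x)

def noisy_oscillation_variance(curvature, lr, noise_std, steps=200_000, seed=0):
    """Empirical mean of x**2 under x <- x - lr * (curvature * x + noise)."""
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for _ in range(steps):
        x -= lr * (curvature * x + rng.gauss(0.0, noise_std))
        total += x * x
    return total / steps

def stationary_variance(curvature, lr, noise_std):
    """Closed-form stationary variance of the linear recursion above."""
    a = 1.0 - lr * curvature
    return (lr * noise_std) ** 2 / (1.0 - a * a)

lr = 0.1  # stability limit is 2 / lr = 20
print("curvature 19:", run_gd(19.0, lr))  # below the limit: |x| shrinks
print("curvature 21:", run_gd(21.0, lr))  # above the limit: |x| grows
for noise_std in (1.0, 2.0):
    print(noise_std,
          noisy_oscillation_variance(19.0, lr, noise_std),
          stationary_variance(19.0, lr, noise_std))
```

In this toy the variance scales with the square of the noise standard deviation and blows up as the curvature approaches 2 / lr, which is the sense in which projected noise amplifies the oscillation near the edge.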

Key Contributions

  • Extends edge-of-stability self-stabilization theory from full-batch gradient descent to mini-batch SGD.
  • Derives a closed-form expression for the stochastic sharpness gap, in which the projected gradient-noise variance is the batch-size-sensitive term.
  • Separates full-batch sharpness from batch sharpness: full-batch sharpness can sit below 2/η, while batch sharpness is the quantity that approaches the stochastic stability edge.
  • Reports controlled experiments on FC-Tanh, FC-ReLU, CNN, and ResNet settings showing batch-size and noise-variance scaling consistent with the theory.
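
The distinction between full-batch and batch sharpness can be made concrete on a toy problem. The sketch below assumes a linear model with squared error, so the Hessian is the mean of x xᵀ and is available in closed form. It also assumes that "batch sharpness" means the expected Rayleigh quotient of the mini-batch Hessian along the mini-batch gradient, a definition borrowed from earlier edge-of-stochastic-stability work and not stated in this note. All function names and the synthetic data are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hessian_vector_product(xs, v):
    """H v for H = mean_i x_i x_i^T (squared-error linear model)."""
    out = [0.0] * len(v)
    for x in xs:
        c = dot(x, v)
        for j, xj in enumerate(x):
            out[j] += c * xj
    return [o / len(xs) for o in out]

def full_batch_sharpness(xs, iters=100, seed=0):
    """Top Hessian eigenvalue on the full dataset via power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in xs[0]]
    for _ in range(iters):
        hv = hessian_vector_product(xs, v)
        norm = dot(hv, hv) ** 0.5
        v = [h / norm for h in hv]
    return dot(v, hessian_vector_product(xs, v))

def gradient(xs, ys, w):
    """Mean gradient of 0.5 * (w.x - y)**2 over the given samples."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        r = dot(w, x) - y
        for j, xj in enumerate(x):
            g[j] += r * xj
    return [gj / len(xs) for gj in g]

def batch_sharpness(xs, ys, w, batch_size, n_batches=500, seed=1):
    """Assumed definition: mean over batches of g_B^T H_B g_B / ||g_B||^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_batches):
        idx = rng.sample(range(len(xs)), batch_size)
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        g = gradient(bx, by, w)
        total += dot(g, hessian_vector_product(bx, g)) / dot(g, g)
    return total / n_batches

# Synthetic data with one dominant input direction (illustrative only).
rng = random.Random(0)
n, d = 128, 5
xs = [[2.0 * rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(d - 1)] for _ in range(n)]
ys = [rng.gauss(0, 1) for _ in range(n)]
w = [0.0] * d

print("full-batch sharpness:", full_batch_sharpness(xs))
for b in (1, 8, n):
    print("batch sharpness, B =", b, ":", batch_sharpness(xs, ys, w, b))
```

With batch size 1 each Rayleigh quotient reduces to the squared norm of one input, so under this definition batch sharpness exceeds the top full-batch eigenvalue. At full batch size it is a Rayleigh quotient of the full Hessian and therefore cannot exceed it. For a real network the closed-form Hessian would be replaced by automatic-differentiation Hessian-vector products.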

Important Caveats

The experiments are controlled optimization studies, mostly on CIFAR-10 subsets with vanilla SGD variants. The paper does not directly test large language models, modern time-series foundation models (TSFMs), AdamW/Muon-style optimizers, distributed training, or pretraining-scale data mixtures.

The paper also reports that a CNN with cross-entropy on a small dataset does not enter the edge-of-stability regime: sharpness peaks far below the 2/η threshold and then collapses. That makes the mechanism conditional on the training regime, loss, and sustained progressive sharpening.
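
One way to make this caveat operational is to check a logged sharpness trace against 2 / learning_rate before applying the theory to a run. The helper below is a hypothetical sketch: the function name, the 10% tolerance, and the "final value below half the peak" collapse rule are arbitrary assumptions, not criteria from the paper.

```python
def edge_of_stability_summary(sharpness_trace, learning_rate, tol=0.1):
    """Compare a sharpness trace with the 2 / learning_rate threshold.

    tol and the 0.5 collapse factor are illustrative choices only.
    """
    threshold = 2.0 / learning_rate
    peak = max(sharpness_trace)
    final = sharpness_trace[-1]
    return {
        "threshold": threshold,
        "peak_ratio": peak / threshold,
        "reached_edge": peak >= (1.0 - tol) * threshold,
        "collapsed": final < 0.5 * peak,
    }

# Made-up traces for illustration, with learning rate 0.1 (threshold 20).
peaks_then_collapses = [2.0, 5.0, 8.0, 6.0, 3.0]
rises_to_edge = [2.0, 10.0, 19.0, 20.5, 19.8]
print(edge_of_stability_summary(peaks_then_collapses, 0.1))
print(edge_of_stability_summary(rises_to_edge, 0.1))
```

Under mini-batch SGD the note's core claim implies that full-batch sharpness settles below the threshold even when the mechanism is active, so a check like this would more plausibly be applied to batch sharpness in that setting.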

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Training recipe and optimizer dynamics | adjacent | Batch size, learning rate, projected noise variance, and sharpness can change the effective training equilibrium. | Needs evidence for TSFM pretraining losses, optimizers, data mixtures, and model scales. |
| Scaling and efficiency | warning | Smaller batches may induce flatter full-batch sharpness through stochastic self-stabilization, but the relevant curvature is batch sharpness. | Need matched compute, throughput, and generalization comparisons, not only curvature diagnostics. |
| Benchmark hygiene | warning | A reported “sharpness” metric can mean full-batch Hessian sharpness or batch sharpness, with different stability implications. | Need explicit measurement protocol and loss/optimizer context. |

Open Questions

  • Does the stochastic sharpness gap appear in TSFM pretraining with pinball loss, flow matching, masked reconstruction, or latent predictive objectives?
  • How do AdamW, Muon, momentum, weight decay, gradient clipping, and distributed data parallelism change the projected-noise mechanism?
  • Can batch sharpness or projected noise variance become useful diagnostics for choosing TSFM batch size and learning rate?
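
As a starting point for the diagnostic question, projected noise variance can be estimated by projecting per-sample gradients onto the top Hessian eigenvector and measuring how the mini-batch mean of that projection varies across batches. The sketch below assumes the same squared-error linear toy (Hessian equal to the mean of x xᵀ) and i.i.d. sampling with replacement, under which the variance falls as 1 / batch size. Names and data are hypothetical, and whether this quantity is informative for TSFM training is exactly what remains open.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_eigenvector(xs, iters=200, seed=0):
    """Power iteration on H = mean_i x_i x_i^T."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in xs[0]]
    for _ in range(iters):
        hv = [0.0] * len(v)
        for x in xs:
            c = dot(x, v)
            for j, xj in enumerate(x):
                hv[j] += c * xj / len(xs)
        norm = dot(hv, hv) ** 0.5
        v = [h / norm for h in hv]
    return v

def projected_noise_variance(proj, batch_size, n_batches=4000, seed=1):
    """Sample variance, across mini-batches, of the batch-mean projected gradient."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_batches):
        batch = rng.choices(proj, k=batch_size)  # i.i.d. draws with replacement
        means.append(sum(batch) / batch_size)
    mu = sum(means) / n_batches
    return sum((m - mu) ** 2 for m in means) / (n_batches - 1)

# Synthetic data (illustrative only).
rng = random.Random(0)
n, d = 256, 5
xs = [[2.0 * rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(d - 1)] for _ in range(n)]
ys = [rng.gauss(0, 1) for _ in range(n)]
w = [0.0] * d

u = top_eigenvector(xs)
# Per-sample gradient of 0.5 * (w.x - y)**2 is (w.x - y) * x; project it onto u.
proj = [(dot(w, x) - y) * dot(u, x) for x, y in zip(xs, ys)]
for b in (4, 16, 64):
    print("B =", b, "projected noise variance:", projected_noise_variance(proj, b))
```

Sampling without replacement, data-parallel sharding, or adaptive optimizers would change the scaling, so the 1 / B behavior here should be read as a property of the toy setup only.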