Training Dynamics
Summary
Training dynamics tracks the optimizer, loss, batch, learning-rate, curvature, compression, forgetting, and noise effects that shape what a model learns before any architecture-level claim is evaluated.
For the wiki’s agenda, this is supporting evidence rather than a model family. It matters because time-series foundation model (TSFM) papers often report gains from architecture, data, or scale while treating optimizer dynamics as incidental. The training recipe can itself change representation geometry, stability, compression, retention, and generalization.
Current Evidence
Learning is Forgetting adds a whole-model representation-dynamics frame. It treats LLM pretraining as lossy compression: representations first expand to capture target-relevant information, then compress input information as they approach an Information Bottleneck bound. The useful wiki lesson is that learning can be a controlled forgetting process, not only accumulation of raw detail.
Treat these compression curves as comparative diagnostics over a fixed sample and estimator, not exact measurements of all latent information. The OLMo2 scale result also suggests compression depends on model capacity relative to data complexity.
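For orientation, the standard Information Bottleneck objective behind this framing can be written as below. The symbols are textbook notation assumed here, not taken from the paper: X is the input, Y the target, Z the representation, and β a trade-off weight.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

Read against the description above, the expansion phase corresponds to raising I(Z; Y) and the compression phase to lowering I(X; Z). Both terms have to be estimated in practice, which is why the curves are only comparable under a fixed sample and estimator.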
FADE adds a parameter-level forgetting mechanism. It adapts per-parameter weight decay online so some weights can retain stable information while others forget stale mappings faster. This is not TSFM evidence yet, but it makes the retention question more precise: continual systems need selective forgetting, not blanket preservation or blanket decay.
The current neural-network evidence is limited to final-layer (head-only) adaptation: the naive all-layer extension underperformed head-only FADE. Treat it as a controlled-forgetting diagnostic and research direction rather than a general optimizer recipe.
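A minimal sketch of what per-parameter decay means at the update level, assuming plain SGD with decoupled weight decay. The function name `sgd_step_per_param_decay` and the fixed `decay` vector are hypothetical. The rule FADE uses to adapt decay online is not reproduced here.

```python
import numpy as np

def sgd_step_per_param_decay(w, grad, lr, decay):
    """One SGD step with decoupled weight decay applied per parameter.

    `decay` has the same shape as `w`. How it would be adapted online
    is deliberately left out; here it is a fixed, hand-picked vector.
    """
    w = np.asarray(w, dtype=float)
    grad = np.asarray(grad, dtype=float)
    decay = np.asarray(decay, dtype=float)
    return w - lr * grad - lr * decay * w

w = np.array([1.0, 1.0])
grad = np.zeros(2)                 # zero gradient isolates the decay effect
decay = np.array([0.0, 0.5])       # hypothetical: weight 0 retains, weight 1 forgets
for _ in range(10):
    w = sgd_step_per_param_decay(w, grad, lr=0.1, decay=decay)
```

With a zero gradient, the first weight is untouched while the second shrinks by a factor of 0.95 per step. This is the selective retention and forgetting contrast in its simplest form.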
SGD at the Edge of Stability adds a concrete optimizer-dynamics warning. In full-batch gradient descent, sharpness (the largest Hessian eigenvalue) rises toward the classical edge-of-stability threshold of 2/η, where η is the learning rate. In mini-batch SGD, full-batch sharpness can stabilize below that threshold because stochastic gradient noise projected onto the top Hessian eigenvector changes the self-stabilizing oscillation.
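The classical threshold is 2/η for learning rate η, and a one-dimensional quadratic is enough to see where it comes from. The sketch below is a toy, not the paper's setup. The function name and constants are illustrative assumptions.

```python
def gd_on_quadratic(sharpness, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2; return final |x|."""
    x = x0
    for _ in range(steps):
        x = x - lr * sharpness * x   # gradient of f is sharpness * x
    return abs(x)

lr = 0.1
threshold = 2.0 / lr                 # classical stability threshold 2 / lr

below = gd_on_quadratic(sharpness=0.9 * threshold, lr=lr)  # contracts toward 0
above = gd_on_quadratic(sharpness=1.1 * threshold, lr=lr)  # oscillates and grows
```

Each step multiplies x by (1 - lr * sharpness), so the iterate shrinks only while that factor has magnitude below one, which is exactly sharpness < 2 / lr.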
The paper’s useful distinction is full-batch sharpness versus batch sharpness. A model can look “below the edge” under one sharpness definition while the batch-conditioned curvature is the quantity that approaches the stochastic stability edge.
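To make the two measurement protocols concrete, here is a sketch on linear regression, where Hessians are available in closed form. The batch-conditioned quantity used below (the Rayleigh quotient of each mini-batch Hessian along that mini-batch's own gradient, averaged over batches) is an assumed definition for illustration. Check it against the paper before logging it under the paper's name.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size = 512, 8, 16
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def hessian(Xs):
    # Hessian of 0.5 * mean((Xs @ w - ys)**2) with respect to w
    return Xs.T @ Xs / len(Xs)

def gradient(Xs, ys, w):
    return Xs.T @ (Xs @ w - ys) / len(Xs)

# Full-batch sharpness: largest eigenvalue of the full-data Hessian.
full_batch_sharpness = float(np.linalg.eigvalsh(hessian(X))[-1])

# Batch-conditioned curvature (assumed definition, see lead-in).
quotients = []
for _ in range(200):
    idx = rng.choice(n, size=batch_size, replace=False)
    g = gradient(X[idx], y[idx], w)
    quotients.append(float(g @ hessian(X[idx]) @ g / (g @ g)))
batch_sharpness = float(np.mean(quotients))
```

The two numbers answer different questions, so a log entry that says only "sharpness" is ambiguous unless it also records which of the two was computed and at what batch size.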
DiffusionBlocks adds a local-objective training-dynamics branch. Instead of backpropagating one global loss through every layer, each block is assigned a denoising interval and learns a local score-matching-derived objective. The open training-dynamics question is whether those local objectives preserve global coordination, rare states, and pretrained representations once the method moves beyond from-scratch experiments.
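The block-to-interval assignment can be pictured with a small routing sketch. The source does not say how the intervals are chosen. The log-uniform split, the function names, and the noise range below are purely illustrative assumptions.

```python
import numpy as np

def block_intervals(sigma_min, sigma_max, num_blocks):
    """Split [sigma_min, sigma_max] into num_blocks log-uniform intervals."""
    edges = np.geomspace(sigma_min, sigma_max, num_blocks + 1)
    return list(zip(edges[:-1], edges[1:]))

def block_for_sigma(sigma, intervals):
    """Index of the block whose interval contains the noise level sigma."""
    upper_edges = np.array([hi for _, hi in intervals])
    index = np.searchsorted(upper_edges, sigma, side="left")
    return int(min(index, len(intervals) - 1))   # clamp sigma at the top edge

intervals = block_intervals(0.01, 100.0, num_blocks=4)
low_noise_block = block_for_sigma(0.05, intervals)
mid_noise_block = block_for_sigma(5.0, intervals)
high_noise_block = block_for_sigma(50.0, intervals)
```

Under this kind of routing, each training sample updates only the block that owns its noise level. That is why coordination across blocks is an open question rather than a given.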
Dragon Hatchling (BDH) adds a fast-state caveat about backpropagation through time (BPTT). Sparse synapse activations suggest possible cheaper gradient-routing approximations, but the paper’s preliminary no-BPTT variant loses cross-language concept matching while retaining some language modeling ability. For this wiki, that makes BDH a reminder that fast-state architecture claims should report what the training path preserves, not only what inference state can represent.
Reading Frame For TSFMs
Use this page when a time-series paper’s gains may depend on the training protocol and not only on model architecture:
- compression pressure can change what information survives in the representation;
- “forgetting” can be destructive capability loss or a useful way to discard stale mappings;
- batch size and learning rate can change the effective curvature regime;
- the relevant noise is directional, not only scalar gradient variance;
- sharpness measurements need protocol labels;
- the loss function can decide whether an edge-of-stability mechanism appears at all;
- optimizer and data-mixture changes can create representation differences that look architectural.
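The point about directional noise can be checked with a small diagnostic: compare the total per-sample gradient variance with the share that falls along the top Hessian eigenvector. The sketch below uses linear regression with hand-picked anisotropic features. All names and constants are illustrative assumptions, not a protocol from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1024, 6
# Anisotropic features so that one curvature direction dominates.
X = rng.normal(size=(n, d)) * np.array([3.0, 1.0, 1.0, 1.0, 1.0, 1.0])
y = rng.normal(size=n)
w = rng.normal(size=d)

residual = X @ w - y
per_sample_grads = X * residual[:, None]   # gradient of 0.5 * (x @ w - y)**2
noise = per_sample_grads - per_sample_grads.mean(axis=0)

hessian = X.T @ X / n
top_direction = np.linalg.eigh(hessian)[1][:, -1]   # unit-norm top eigenvector

# Scalar summary: trace of the per-sample gradient noise covariance.
total_variance = float((noise ** 2).sum(axis=1).mean())
# Directional summary: variance of the noise projected on the top eigenvector.
projected_variance = float(((noise @ top_direction) ** 2).mean())
fraction_along_top = projected_variance / total_variance
```

Two training runs can report the same `total_variance` and still differ in `fraction_along_top`, which is the quantity the edge-of-stability argument depends on.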
Limitations
The current sources are mostly LLM, controlled online-learning, or upstream architecture-training evidence. They should be cited as training-dynamics and representation-dynamics warnings, not as direct recipes for numeric time-series models.
Related Pages
- Time-Series Scaling And Efficiency
- Time-Series Benchmark Hygiene
- LLM Post-Training
- Company-Local Block-Wise Fine-Tuning
- Foundation Time-Series Model Research Agenda
Open Questions
- Which sharpness and noise diagnostics should be logged for TSFM pretraining?
- Can TSFM checkpoint selection use compression diagnostics rather than only validation loss?
- Which information should a time-series model learn to forget under non-stationarity, and which rare state must be protected?
- Do TSFM objectives enter edge-of-stability regimes under practical training recipes?
- Can local denoising objectives preserve global sequence or time-series state as well as end-to-end objectives?
- Can fast-state models avoid full BPTT without losing cross-concept or cross-channel state binding?
- Can optimizer-dynamics probes separate architecture gains from training-recipe gains?
- How do AdamW, Muon, momentum, weight decay, gradient clipping, and distributed data parallelism change projected-noise sharpness dynamics?