LLM Post-Training
Summary
LLM post-training is the wiki’s umbrella for supervised fine-tuning, RL-style optimization, preference optimization, and black-box population search after pretraining. The useful comparison axis is not only benchmark score; it is how much the method moves weights, where those changes land, and which prior capabilities survive.
What The Wiki Currently Believes
- Dynamic Fine-Tuning reframes SFT as a policy-gradient-like update with an implicit sparse reward and inverse-probability weighting, then removes the low-probability amplification by scaling each token loss with the detached target-token probability (see the loss sketch after this list).
- Evolution Strategies at Scale argues that black-box full-parameter ES can optimize LLM behavior through response-level rewards.
- Evolution Strategies at the Hyperscale makes the ES path more plausible at scale through systems work on low-rank perturbations.
- Evolutionary Strategies lead to Catastrophic Forgetting in LLMs warns that ES can match new-task reward while causing dense, high-norm parameter drift and worse retention.
- TimeOmni-1 is the time-series reasoning example of a staged SFT-then-RL curriculum: SFT injects domain reasoning priors, then RL rewards push the model beyond imitation.
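
The DFT entry's core move is small enough to show in code. The sketch below assumes a standard PyTorch token-level cross-entropy setup; `logits`, `target_ids`, and `pad_id` are illustrative names, and this is a minimal reading of the detached-probability rescaling rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, target_ids, pad_id=-100):
    """DFT-style loss sketch: per-token cross-entropy, rescaled by the
    (detached) probability the model assigns to each target token, which
    damps updates on low-confidence expert tokens."""
    # Per-token cross-entropy, no reduction yet.
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        target_ids.view(-1),
        ignore_index=pad_id,
        reduction="none",
    ).view_as(target_ids)

    # Detached probability of the target token under the current model.
    with torch.no_grad():
        probs = F.softmax(logits, dim=-1)
        p_target = probs.gather(
            -1, target_ids.clamp(min=0).unsqueeze(-1)
        ).squeeze(-1)

    mask = (target_ids != pad_id).float()
    # Low-probability targets contribute proportionally less gradient.
    return (ce * p_target * mask).sum() / mask.sum().clamp(min=1.0)
```

Because `p_target` is detached, the rescaling only changes per-token gradient magnitude without opening a new gradient path, which is what the conservative-update-scaling framing below refers to.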
Weight-Update Lens
The post-training cluster should be evaluated through update geometry:
- SFT gives token-level gradients on demonstrations and can overfit exact references.
- DFT changes SFT’s gradient scale by downweighting low-confidence expert tokens, aiming for more stable and less outlier-dominated updates.
- PPO/GRPO/RLVR-style RL uses sampled trajectories and explicit rewards, often with KL or reference constraints to control drift (see the group-relative advantage sketch after this list).
- ES searches directly in parameter space with scalar rewards, making credit assignment simple but risking broad, dense updates (see the ES step sketch after this list).
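
For the sampled-trajectory branch, the group-relative advantage at the heart of GRPO-style updates is easy to sketch. The function below is a minimal reading: it assumes one prompt, a group of scalar response rewards, and omits the surrounding PPO-style clipping and KL terms.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage sketch: center and scale the scalar rewards
    of a group of sampled responses to the same prompt, so every token in a
    trajectory shares one normalized advantage."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```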
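The ES branch can be sketched as a single antithetic update on a flat parameter vector. Here `theta` is a NumPy array standing in for (a slice of) the model's parameters and `reward_fn` is a placeholder for generating responses and scoring them; the hyperscale work layers low-rank perturbations and systems machinery on top of this basic loop, none of which is shown.

```python
import numpy as np

def es_step(theta, reward_fn, sigma=0.02, lr=0.01, pop_size=32, rng=None):
    """One antithetic Evolution Strategies update sketch: sample Gaussian
    perturbations of the parameter vector, score each perturbed model with a
    scalar response-level reward, and move along the reward-weighted noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal((pop_size, theta.size))
    rewards = np.empty(2 * pop_size)
    for i, e in enumerate(eps):
        rewards[2 * i] = reward_fn(theta + sigma * e)
        rewards[2 * i + 1] = reward_fn(theta - sigma * e)
    # Rank-normalize rewards so the step size is insensitive to reward scale.
    ranks = rewards.argsort().argsort() / (rewards.size - 1) - 0.5
    grad = (ranks[0::2] - ranks[1::2]) @ eps / (2 * pop_size * sigma)
    return theta + lr * grad
```

Note that the update touches every coordinate of `theta`, which is exactly the dense, high-norm drift the forgetting entry warns about.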
The ES catastrophic-forgetting source makes this lens concrete: new-task reward can improve while prior capabilities degrade. DFT adds the complementary lesson that conservative update scaling can improve reasoning generalization but may fail when low-probability targets contain genuinely new knowledge.
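
One way to make the drift side of this lens operational is to diff checkpoints directly. The helper below is a sketch, assuming two PyTorch models with identical architecture; `base_model` and `tuned_model` are illustrative names and the 1e-6 "changed" threshold is arbitrary.

```python
import torch

def parameter_drift(base_model, tuned_model):
    """Per-parameter drift report sketch: how far post-training moved each
    weight tensor and how widely the change is spread. Dense, high-norm
    deltas are the warning sign flagged by the ES forgetting source."""
    report = {}
    # Assumes both models expose parameters in the same order.
    for (name, p0), (_, p1) in zip(
        base_model.named_parameters(), tuned_model.named_parameters()
    ):
        delta = (p1.detach() - p0.detach()).float()
        report[name] = {
            "l2": delta.norm().item(),
            "frac_changed": (delta.abs() > 1e-6).float().mean().item(),
        }
    return report
```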
Implications For Time-Series And World Models
For time-series reasoning models, SFT can inject decomposition priors, formatting, and domain procedures, while RL can reward verifiable temporal reasoning or intervention decisions. The DFT/ES contrast suggests agents should track not just task score, but parameter drift, retention of base-model skills, and whether updates preserve the model’s numeric and temporal priors.
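
A matching retention check is just the same evaluation harness run on both checkpoints. In the sketch below, `evaluate` and `suites` are placeholders for whatever scoring function and prior-capability datasets (numeric formatting, base benchmarks, temporal priors) the agent already uses.

```python
def retention_scorecard(evaluate, base_model, tuned_model, suites):
    """Retention-test sketch: score the base and post-trained checkpoints on
    the same prior-capability suites and report the delta alongside any
    new-task gain."""
    report = {}
    for name, dataset in suites.items():
        base_score = evaluate(base_model, dataset)
        tuned_score = evaluate(tuned_model, dataset)
        report[name] = {
            "base": base_score,
            "tuned": tuned_score,
            "delta": tuned_score - base_score,
        }
    return report
```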
Gotchas
- “RL-like” does not mean the same mechanism. DFT gives an RL interpretation of an SFT gradient; PPO/GRPO sample trajectories under explicit rewards; ES perturbs parameters and uses scalar fitness.
- Reward-only success is incomplete without retention tests.
- Smaller updates are not automatically better: a method can preserve priors by refusing to learn rare but important new facts.
- Benchmark gains should be reported alongside adaptation mode: full-parameter SFT, LoRA, DFT, PPO/GRPO, DPO/RFT, ES, or staged mixtures.
Open Questions
- Which post-training methods have the best target-gain-to-parameter-drift ratio?
- Can DFT-like reward rectification, RL KL constraints, and ES low-rank perturbations be composed without fighting each other?
- Which retention tests should be mandatory for time-series reasoning and world-model post-training?