LLM Post-Training

Summary

LLM post-training is the wiki’s umbrella for supervised fine-tuning, reinforcement-learning-style optimization, preference optimization, and black-box population search after pretraining. The useful comparison axis is not benchmark score alone; it is how much a method moves the weights, where those changes land, and which prior capabilities survive.

What The Wiki Currently Believes

  • Dynamic Fine-Tuning reframes SFT as a policy-gradient-like update with an implicit sparse reward and inverse-probability weighting, then removes the low-probability amplification by scaling each token loss with detached target-token probability.
  • Evolution Strategies at Scale argues that black-box full-parameter ES can optimize LLM behavior through response-level rewards.
  • Evolution Strategies at the Hyperscale makes the ES path more plausible at scale through low-rank perturbation systems work.
  • Evolutionary Strategies lead to Catastrophic Forgetting in LLMs warns that ES can match new-task reward while causing dense, high-norm parameter drift and worse retention.
  • TimeOmni-1 is the time-series reasoning example of a staged SFT-then-RL curriculum: SFT injects domain reasoning priors, then RL rewards push beyond imitation.
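DFT’s core move from the first bullet can be written down directly. The following is a minimal numpy sketch with hypothetical `logits`/`targets` shapes; the real method operates on autograd graphs, with the probability weight detached so it scales but does not receive gradients:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sft_and_dft_losses(logits, targets):
    """Per-token SFT cross-entropy and its DFT-weighted variant.

    logits: (T, V) unnormalized scores; targets: (T,) target token ids.
    DFT scales each token's loss by the model's own probability of the
    target token (treated as a constant), damping exactly the
    low-probability tokens whose inverse-probability weighting SFT
    would otherwise amplify.
    """
    probs = softmax(logits)
    p_target = probs[np.arange(len(targets)), targets]
    sft = -np.log(p_target)   # plain SFT cross-entropy per token
    dft = p_target * sft      # DFT: detached target probability as weight
    return sft, dft
```

Because the weight is a probability, the DFT loss never exceeds the SFT loss for a token, and the gap widens precisely on low-confidence targets.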

Weight-Update Lens

The post-training cluster should be evaluated through update geometry:

  • SFT gives token-level gradients on demonstrations and can overfit exact references.
  • DFT changes SFT’s gradient scale by downweighting low-confidence expert tokens, aiming for more stable and less outlier-dominated updates.
  • PPO/GRPO/RLVR-style RL uses sampled trajectories and explicit rewards, often with KL or reference constraints to control drift.
  • ES searches directly in parameter space with scalar rewards, making credit assignment simple but risking broad dense updates.
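The ES entry above can be made concrete with a toy antithetic-sampling loop. This is an illustrative sketch on a plain parameter vector with made-up hyperparameters, not the low-rank, LLM-scale systems the sources describe:

```python
import numpy as np

def es_step(theta, reward_fn, sigma=0.02, lr=0.05, pop=8, rng=None):
    """One antithetic evolution-strategies step on a parameter vector.

    Perturbs theta with +/- sigma*eps, scores each perturbation with a
    scalar reward, and moves theta along the reward-weighted average of
    the perturbations. Note the update is dense: every coordinate of
    theta moves, which is the drift risk the forgetting result flags.
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((pop, theta.size))
    rewards = np.array(
        [reward_fn(theta + sigma * e) - reward_fn(theta - sigma * e) for e in eps]
    )
    grad_est = (rewards[:, None] * eps).mean(axis=0) / (2 * sigma)
    return theta + lr * grad_est
```

Credit assignment is trivial (one scalar per perturbed model), but nothing in the update localizes the change to task-relevant parameters.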

The ES catastrophic-forgetting source makes this lens concrete: new-task reward can improve while prior capabilities degrade. DFT adds the complementary lesson that conservative update scaling can improve reasoning generalization but may fail when low-probability targets contain genuinely new knowledge.

Implications For Time-Series And World Models

For time-series reasoning models, SFT can inject decomposition priors, formatting, and domain procedures, while RL can reward verifiable temporal reasoning or intervention decisions. The DFT/ES contrast suggests agents should track not just task score, but parameter drift, retention of base-model skills, and whether updates preserve the model’s numeric and temporal priors.
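Tracking drift alongside task score could look like the sketch below; the movement threshold and the two summary metrics are illustrative assumptions, not a measurement protocol from the sources:

```python
import numpy as np

def drift_report(theta_base, theta_tuned, eps=1e-8):
    """Summarize how far post-training moved the weights.

    Reports relative L2 drift and the fraction of coordinates that
    moved appreciably (threshold chosen arbitrarily here). Dense,
    high-norm drift is the ES warning sign; near-sparse drift is what
    conservative SFT/DFT-style updates aim for.
    """
    delta = theta_tuned - theta_base
    rel_l2 = np.linalg.norm(delta) / (np.linalg.norm(theta_base) + eps)
    frac_moved = np.mean(np.abs(delta) > 1e-4 * (np.abs(theta_base) + eps))
    return {"rel_l2_drift": float(rel_l2), "frac_params_moved": float(frac_moved)}
```

A full retention check would pair such weight-space numbers with behavioral probes of base-model skills and numeric/temporal priors.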

Gotchas

  • “RL-like” does not mean the same mechanism. DFT gives an RL interpretation of an SFT gradient; PPO/GRPO sample trajectories under explicit rewards; ES perturbs parameters and uses scalar fitness.
  • Reward-only success is incomplete without retention tests.
  • Smaller updates are not automatically better: a method can preserve priors by refusing to learn rare but important new facts.
  • Benchmark gains should be reported alongside adaptation mode: full-parameter SFT, LoRA, DFT, PPO/GRPO, DPO/RFT, ES, or staged mixtures.

Open Questions

  • Which post-training methods have the best target-gain-to-parameter-drift ratio?
  • Can DFT-like reward rectification, RL KL constraints, and ES low-rank perturbations be composed without fighting each other?
  • Which retention tests should be mandatory for time-series reasoning and world-model post-training?