LLM Post-Training

Summary

LLM post-training is the wiki’s umbrella for supervised fine-tuning, reinforcement-learning-style optimization, preference optimization, and black-box population search after pretraining. The useful comparison axis is not benchmark score alone; it is how much a method moves the weights, where those changes land, and which prior capabilities survive.

What The Wiki Currently Believes

  • Dynamic Fine-Tuning reframes SFT as a policy-gradient-like update with an implicit sparse reward and inverse-probability weighting, then removes the low-probability amplification by scaling each token loss with detached target-token probability.
  • Evolution Strategies at Scale argues that black-box full-parameter ES can optimize LLM behavior through response-level rewards.
  • Evolution Strategies at the Hyperscale makes the ES path more plausible at scale through low-rank perturbation systems work.
  • Evolutionary Strategies lead to Catastrophic Forgetting in LLMs warns that ES can match new-task reward while causing dense, high-norm parameter drift and worse retention.
  • TimeOmni-1 is the time-series reasoning example of a staged SFT-then-RL curriculum: SFT injects domain reasoning priors, then RL rewards push beyond imitation.
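DFT’s core move from the first bullet can be written down directly. The following is a minimal numpy sketch with hypothetical `logits`/`targets` shapes; the real method operates on autograd graphs, with the probability weight detached so it scales but does not receive gradients:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sft_and_dft_losses(logits, targets):
    """Per-token SFT cross-entropy and its DFT-weighted variant.

    logits: (T, V) unnormalized scores; targets: (T,) target token ids.
    DFT scales each token's loss by the model's own probability of the
    target token (treated as a constant), damping exactly the
    low-probability tokens whose inverse-probability weighting SFT
    would otherwise amplify.
    """
    probs = softmax(logits)
    p_target = probs[np.arange(len(targets)), targets]
    sft = -np.log(p_target)   # plain SFT cross-entropy per token
    dft = p_target * sft      # DFT: detached target probability as weight
    return sft, dft
```

Because the weight is a probability, the DFT loss never exceeds the SFT loss for a token, and the gap widens precisely on low-confidence targets.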

Weight-Update Lens

The post-training cluster should be evaluated through update geometry:

  • SFT gives token-level gradients on demonstrations and can overfit exact references.
  • DFT changes SFT’s gradient scale by downweighting low-confidence expert tokens, aiming for more stable and less outlier-dominated updates.
  • PPO/GRPO/RLVR-style RL uses sampled trajectories and explicit rewards, often with KL or reference constraints to control drift.
  • ES searches directly in parameter space with scalar rewards, making credit assignment simple but risking broad dense updates.
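The ES entry above can be made concrete with a toy antithetic-sampling loop. This is an illustrative sketch on a plain parameter vector with made-up hyperparameters, not the low-rank, LLM-scale systems the sources describe:

```python
import numpy as np

def es_step(theta, reward_fn, sigma=0.02, lr=0.05, pop=8, rng=None):
    """One antithetic evolution-strategies step on a parameter vector.

    Perturbs theta with +/- sigma*eps, scores each perturbation with a
    scalar reward, and moves theta along the reward-weighted average of
    the perturbations. Note the update is dense: every coordinate of
    theta moves, which is the drift risk the forgetting result flags.
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((pop, theta.size))
    rewards = np.array(
        [reward_fn(theta + sigma * e) - reward_fn(theta - sigma * e) for e in eps]
    )
    grad_est = (rewards[:, None] * eps).mean(axis=0) / (2 * sigma)
    return theta + lr * grad_est
```

Credit assignment is trivial (one scalar per perturbed model), but nothing in the update localizes the change to task-relevant parameters.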

The ES catastrophic-forgetting source makes this lens concrete: new-task reward can improve while prior capabilities degrade. DFT adds the complementary lesson that conservative update scaling can improve reasoning generalization but may fail when low-probability targets contain genuinely new knowledge.

Implications For Time-Series And World Models

For time-series reasoning models, SFT can inject decomposition priors, formatting, and domain procedures, while RL can reward verifiable temporal reasoning or intervention decisions. The DFT/ES contrast suggests agents should track not just task score, but parameter drift, retention of base-model skills, and whether updates preserve the model’s numeric and temporal priors.
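Tracking drift alongside task score could look like the sketch below; the movement threshold and the two summary metrics are illustrative assumptions, not a measurement protocol from the sources:

```python
import numpy as np

def drift_report(theta_base, theta_tuned, eps=1e-8):
    """Summarize how far post-training moved the weights.

    Reports relative L2 drift and the fraction of coordinates that
    moved appreciably (threshold chosen arbitrarily here). Dense,
    high-norm drift is the ES warning sign; near-sparse drift is what
    conservative SFT/DFT-style updates aim for.
    """
    delta = theta_tuned - theta_base
    rel_l2 = np.linalg.norm(delta) / (np.linalg.norm(theta_base) + eps)
    frac_moved = np.mean(np.abs(delta) > 1e-4 * (np.abs(theta_base) + eps))
    return {"rel_l2_drift": float(rel_l2), "frac_params_moved": float(frac_moved)}
```

A full retention check would pair such weight-space numbers with behavioral probes of base-model skills and numeric/temporal priors.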

Gotchas

  • “RL-like” does not mean the same mechanism. DFT gives an RL interpretation of an SFT gradient; PPO/GRPO sample trajectories under explicit rewards; ES perturbs parameters and uses scalar fitness.
  • Reward-only success is incomplete without retention tests.
  • Smaller updates are not automatically better: a method can preserve priors by refusing to learn rare but important new facts.
  • Benchmark gains should be reported alongside adaptation mode: full-parameter SFT, LoRA, DFT, PPO/GRPO, DPO/RFT, ES, or staged mixtures.

Open Questions

  • Which post-training methods have the best target-gain-to-parameter-drift ratio?
  • Can DFT-like reward rectification, RL KL constraints, and ES low-rank perturbations be composed without fighting each other?
  • Which retention tests should be mandatory for time-series reasoning and world-model post-training?