On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Source

Core Claim

The paper reframes supervised fine-tuning as a policy-gradient-like update with an implicit sparse reward and an inverse-probability importance weight, which means standard SFT amplifies updates for low-probability expert tokens. Dynamic Fine-Tuning (DFT) rectifies that implicit reward by multiplying each token's loss by the model's detached probability for the target token, making the update closer to directly maximizing the target-token probability rather than its log-probability.
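
A worked sketch of the rectification at the loss level, with p_t as shorthand for the model's probability of expert token y_t and sg(·) a stop-gradient (notation assumed here, not copied from the paper):

```latex
\mathcal{L}_{\mathrm{SFT}} = -\sum_t \log p_t,
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{SFT}} = -\sum_t \tfrac{1}{p_t}\,\nabla_\theta p_t
\quad \text{(low } p_t \Rightarrow \text{amplified update)}
```
```latex
\mathcal{L}_{\mathrm{DFT}} = -\sum_t \mathrm{sg}(p_t)\,\log p_t,
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{DFT}} = -\sum_t \nabla_\theta p_t
\quad \text{(directly maximizes } p_t\text{)}
```

The 1/p_t factor is the implicit inverse-probability importance weight; multiplying the loss by sg(p_t) cancels it, so the DFT gradient is exactly the gradient of the summed target-token probabilities.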

Why It Matters

Alex flagged this paper because it connects SFT and RL through the effect they have on model weights. The useful wiki framing is not simply “DFT beats SFT”; it is that post-training methods can be compared by the geometry and scale of their parameter updates, the reward signal they imply, and the capabilities they preserve or damage.

This links directly to the ES thread: Evolutionary Strategies lead to Catastrophic Forgetting in LLMs criticizes ES because it can produce dense, high-norm parameter drift. DFT is a cautionary example in the opposite direction: it deliberately damps the extreme gradient amplification caused by low-probability tokens, but that also means underrepresented or unfamiliar targets may receive too little learning signal.

Key Contributions

  • Shows that the SFT gradient can be written as a policy-gradient-like expectation with a sparse exact-match reward and an importance weight proportional to 1 / pi_theta(y | x) (see the sketch after this list).
  • Identifies the inverse-probability term as a source of unstable, over-concentrated updates on low-probability expert tokens.
  • Proposes DFT: token-level loss reweighting by detached target-token probability.
  • Reports improvements over standard SFT on math reasoning, code generation, and multimodal reasoning benchmarks.
  • Tests an offline RL setting and reports DFT competitive with, or stronger than, DPO, RFT, PPO, and GRPO in the paper’s Qwen2.5-Math-1.5B setup.
  • Includes an important negative result: DFT underperforms SFT on Natural Questions, suggesting it is weak when the goal is to absorb new factual knowledge outside the model’s existing competence.
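
A reconstruction of what that expectation looks like for a single token position (notation is assumed here; y* is the expert token and 1[·] the exact-match reward):

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
  = -\nabla_\theta \log \pi_\theta(y^* \mid x)
  = -\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
      \Big[ \underbrace{\tfrac{1}{\pi_\theta(y \mid x)}}_{\text{importance weight}}
            \;\underbrace{\mathbf{1}[y = y^*]}_{\text{sparse reward}}
            \;\nabla_\theta \log \pi_\theta(y \mid x) \Big]
```

DFT's detached-probability weight cancels the 1 / pi_theta factor, leaving the plain exact-match reward.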

Main Takeaways

SFT and RL can be compared through implicit rewards, not only through data format. Standard SFT on positive demonstrations behaves like policy-gradient training under an exact-match reward with inverse-probability weighting, which strongly pushes low-confidence tokens. DFT removes that amplification.

DFT's one-line implementation carries a lot of conceptual weight: loss = loss * p(target_token).detach(). This preserves the SFT workflow but changes the gradient scale so low-probability targets no longer dominate.
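
A minimal sketch of that one-liner inside a standard token-level cross-entropy step (PyTorch; the shapes, masking convention, and the assumption that labels are already shifted are mine, not the paper's reference implementation):

```python
import torch
import torch.nn.functional as F

def dft_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Token-level DFT: scale each token's cross-entropy by its detached target probability.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len), padding marked with ignore_index.
    """
    vocab = logits.size(-1)
    # Per-token cross-entropy, i.e. -log p_theta(target token), without reduction.
    ce = F.cross_entropy(
        logits.reshape(-1, vocab), labels.reshape(-1),
        ignore_index=ignore_index, reduction="none",
    )
    # Detached probability of the target token: p = exp(-ce); no gradient flows through it.
    p_target = torch.exp(-ce).detach()
    # Rectified SFT: weight each token's loss by its own stop-gradient probability.
    mask = (labels.reshape(-1) != ignore_index).float()
    return (p_target * ce * mask).sum() / mask.sum().clamp(min=1.0)
```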

The method seems best suited to tasks where the base model already has enough prior competence and the goal is to improve reasoning trajectories, solution selection, or structured prediction. It is less clearly suited to injecting new factual knowledge, where downweighting low-probability tokens can suppress the very information the model needs to learn.

Weight-Update Lens

For the wiki’s ES discussion, DFT should be filed as a gradient-based counterpoint to full-parameter black-box post-training. ES-at-scale sources ask whether reward-only perturbation search can replace or complement RL; the catastrophic-forgetting source warns that dense, high-norm parameter drift may damage prior capabilities. DFT instead tries to make SFT updates less dominated by low-probability outliers.

The shared research question is: which post-training objective changes weights just enough to gain the target behavior while preserving the pretrained model’s useful prior? DFT, GRPO/PPO-style RL, and ES should be compared not only by benchmark reward, but by update norm, sparsity, layer distribution, retention, and whether changes concentrate in semantically meaningful parts of the model.
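
A rough sketch of the kind of update-geometry measurement this comparison would need (the function name, thresholds, and state-dict convention are illustrative assumptions, not from any of the sources):

```python
import torch

def update_geometry(base_state: dict, tuned_state: dict, tol: float = 1e-6) -> dict:
    """Compare two checkpoints: overall L2 drift, a crude sparsity proxy
    (fraction of weights that barely moved), and per-tensor drift norms
    so the layer distribution of the update is visible."""
    total_sq, total_params, unchanged = 0.0, 0, 0
    per_tensor = {}
    for name, base in base_state.items():
        delta = tuned_state[name].float() - base.float()
        sq = delta.pow(2).sum().item()
        total_sq += sq
        total_params += delta.numel()
        unchanged += (delta.abs() < tol).sum().item()
        per_tensor[name] = sq ** 0.5
    return {
        "update_norm": total_sq ** 0.5,              # global L2 norm of the weight delta
        "frac_unchanged": unchanged / total_params,  # how sparse the update is
        "per_tensor_norm": per_tensor,               # where in the network the drift concentrates
    }
```

Running this on a base checkpoint against DFT-, PPO/GRPO-, and ES-tuned checkpoints would make the "changes weights just enough" question concrete.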

Gotchas

  • DFT is not a universal SFT replacement. The paper’s own Natural Questions case shows standard SFT can be better for factual acquisition.
  • The theory is a lens, not a claim that the practical DFT training loop performs online RL. The final method remains a supervised loss modification.
  • Token-level probability weighting is important: the paper reports sequence-level probability weighting as numerically weak or unstable (see the contrast after this list).
  • The strongest evidence is still concentrated on math, code, and multimodal reasoning with models up to the paper’s tested scale.
  • Because DFT uses the model’s own confidence, it can reinforce existing beliefs and undertrain rare, hard, or out-of-distribution targets.
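
A sketch of the contrast, assuming the sequence-level variant weights the whole loss by the detached product of token probabilities (the exact form, and the scale-collapse reading below, are assumptions here rather than the paper's statement):

```latex
\mathcal{L}_{\mathrm{token}} = -\sum_t \mathrm{sg}(p_t)\,\log p_t,
\qquad
\mathcal{L}_{\mathrm{seq}} = -\,\mathrm{sg}\!\Big(\prod_t p_t\Big)\,\sum_t \log p_t
```

Because the product of token probabilities shrinks roughly geometrically with sequence length, the sequence-level weight collapses toward zero on long outputs, which is one plausible reading of the reported weakness.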

Open Questions

  • Can DFT-style reward rectification be combined with retention regularizers or KL constraints to explicitly control parameter drift?
  • Does DFT preserve broad capabilities better than standard SFT, PPO/GRPO, or ES when update norm and sparsity are measured directly?
  • When should low-probability tokens be treated as harmful outliers versus genuinely missing knowledge?