---
abstract: |
  In this work, we present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for the Large Language Model (LLM), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of model compared to RL. To rectify this, we propose Dynamic Fine-Tuning (`\model`{=latex}), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token. With just a single-line change, the method outperforms standard SFT on multiple difficult benchmarks and base models, from math reasoning to code generation and multi-modal tasks, demonstrating improved generalization. Additionally, `\model`{=latex} achieves competitive results in offline RL settings, providing an effective yet streamlined alternative. By bridging theoretical insights with practical solutions, this work advances the state of SFT. The source code will be available at <https://github.com/yongliang-wu/DFT>.
author:
- |
  Yongliang Wu$^{1}$[^1] `\quad `{=latex}Yizhou Zhou$^{2*\dag}$ `\quad `{=latex}Zhou Ziheng$^{3}$ `\quad `{=latex}Yingzhe Peng$^{1}$ `\quad `{=latex}Xinyu Ye$^{4}$\
   **Xinting Hu$^{5}$** `\quad `{=latex}**Wenbo Zhu$^{6}$** `\quad `{=latex}**Lu Qi$^{7}$** `\quad `{=latex}**Ming-Hsuan Yang$^{8}$** `\quad `{=latex}**Xu Yang$^{1\ddag}$**\
  $^1$Southeast University `\quad `{=latex}$^2$Independent Researcher`\quad `{=latex}$^3$University of California, Los Angeles\
  $^4$Shanghai Jiao Tong University `\quad `{=latex}$^5$Nanyang Technological University\
  $^6$University of California, Berkeley `\quad `{=latex}$^7$Wuhan University `\quad `{=latex}$^8$University of California, Merced\
  `\small `{=latex}` yongliang0223@gmail.com, zyz0205@hotmail.com, xuyang_palm@seu.edu.cn`\
bibliography:
- iclr2025\_conference.bib
title: 'On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification'
---

```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\fix}{\marginpar{FIX}}
```
```{=latex}
\newcommand{\new}{\marginpar{NEW}}
```
```{=latex}
\maketitle
```
```{=latex}
\newcommand{\model}{DFT}
```
Introduction
============

Supervised Fine-Tuning (SFT), which adapts models to expert demonstrations, has become the standard post-training paradigm for Large Language Models (LLMs) [@zhang2025causal; @zhang2024causal; @zhang2025rewatch; @zhang2025vrest; @wang2025seed]. It enables efficient task adaptation and capability enhancement [@chung2022scaling; @zhang2024instruction; @sanh2022multitask; @ouyang2022training; @chen2024hi; @chen2025hifi; @fang2025and], and is popular for its ease of implementation and rapid acquisition of expert-like behaviors [@wei2022finetuned; @zhou2024lima]. Despite these advantages, SFT often shows limited generalization compared to reinforcement learning (RL) [@chu2025sft; @ouyang2022training; @christiano2017deep; @bai2022training; @huan2025math; @swamy2025roads]. RL leverages explicit reward or verification signals to explore diverse strategies and thus generalizes better. However, RL requires substantial computation, careful hyperparameter tuning, and explicit reward signals---conditions often impractical in real-world settings [@schulman2017proximal; @ouyang2022training; @sheng2024hybridflow; @strubell2019energy; @liu2024green; @winsta2025hidden]. Moreover, RL can struggle to recover expert-like behaviors that SFT captures efficiently [@mandlekar2021matters; @chen2025empirical].

To exploit the complementary strengths of both approaches, many hybrid methods combine SFT with RL [@ouyang2022training; @sheng2024hybridflow; @rafailov2024direct; @liu2025unified; @qiu2025metisrise]. Yet a key question remains: can SFT itself be fundamentally improved? This is crucial, as SFT remains the only viable option when datasets contain only positive demonstrations, with no negative samples or reward model available.

In this work, we address this gap with a mathematical analysis of the connection between SFT and RL. We show that the gradient update in SFT can be interpreted as a form of policy gradient with a specific, implicitly defined reward under certain assumptions. Crucially, this reward is (i) sparse, and (ii) inversely proportional to the model's probability of expert actions (see `\eqref{eq:sft-grad-as-rl}`{=latex}). As a result, when the model assigns low probability to expert actions, the gradient becomes excessively large, yielding an ill-posed reward structure and unstable optimization [@pascanu2013difficulty; @yangmean].

Building on this insight, we propose Dynamic Fine-Tuning (`\model`{=latex}), a principled fix. Our method rescales the SFT objective at each token by that token's probability, canceling the distortion introduced by the inverse-probability weighting. This reframing turns the SFT gradient from a potentially unstable and biased estimator into a more stable, more uniformly weighted update rule that behaves closer to an RL-style update.
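The rescaling admits a compact sketch. The toy implementation below (pure Python, with a hypothetical `sft_and_dft_losses` helper; in an autodiff framework the probability factor would be detached from the computation graph) contrasts the standard per-token cross-entropy with the probability-rescaled variant:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sft_and_dft_losses(logits_per_token, target_ids):
    """Mean per-token cross-entropy (SFT) vs. probability-rescaled loss (DFT sketch).

    The rescaled variant multiplies each token's CE term by the model's
    probability of that token, damping terms where the token is unlikely.
    """
    sft, dft = 0.0, 0.0
    for logits, t in zip(logits_per_token, target_ids):
        p = softmax(logits)[t]   # model probability of the target token
        ce = -math.log(p)        # standard SFT term
        sft += ce
        dft += p * ce            # rescale by the (stop-gradient) probability
    n = len(target_ids)
    return sft / n, dft / n

# toy example: one confidently predicted target token, one unlikely one
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.0]]
targets = [0, 2]
sft, dft = sft_and_dft_losses(logits, targets)
```

Since every probability lies in $(0,1)$, the rescaled loss is strictly smaller than the CE loss, with the largest damping applied to the least likely tokens.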

Empirically, `\model`{=latex} delivers substantial improvements. On the Qwen-2.5-Math series [@qwen2025technical] fine-tuned with NuminaMath-CoT [@numina_math_datasets], `\model`{=latex} yields gains several times larger than standard SFT. More importantly, unlike SFT, which often degrades on challenging benchmarks such as OlympiadBench [@he-etal-2024-olympiadbench], AIME 2024 [@aime2024dataset], and AMC 2023 [@amc2023dataset], our method consistently improves performance and generalization. These improvements hold across models, scales, and data sizes (Table `\ref{tab:math}`{=latex}, Figure `\ref{fig:accuracy-curve}`{=latex}), and extend to code generation and multimodal reasoning (Tables `\ref{tab:code}`{=latex}, `\ref{tab:multi_modal}`{=latex}) [@zhao2025envisioning; @luo2025mono; @li2025easier; @li2025speed] [@zhu2024enhancing; @zhu2024selective; @zhu2024boosting; @zhu2020robust].

We further test `\model`{=latex} in off-policy RL settings (Table `\ref{tab:rl}`{=latex}), where dense rewards are available [@levine2020offline]. Our method not only outperforms offline RL approaches such as DPO [@rafailov2024direct] and RAFT [@dong2023raft; @ahn2024large], but also achieves competitive or superior performance to online methods like GRPO and PPO on math tasks with Qwen2.5-Math-1.5B [@he2025tempflow; @tan2026easytune; @zhao2025unsupervised; @zhao2025real]. Unlike these RL methods, `\model`{=latex} requires neither a reference model nor large batch sizes, making it a simpler and more resource-efficient alternative.

To understand its effect, we analyze token probability distributions after training (Figure `\ref{fig:token_distribution}`{=latex}). While traditional SFT uniformly pushes probabilities toward the training set, `\model`{=latex} selectively increases some while reducing others. In particular, the proportion of less strongly fitted tokens rises, suggesting improved regularization. We provide further discussion in Appendix `\ref{sec:discussion}`{=latex}.

The contributions of this work are both theoretical and practical. On the theoretical side, we mathematically establish SFT for LLMs as a special case of RL in the policy-gradient space, pinpoint the underlying reasons for SFT's limited generalization, and derive a method to improve it. On the experimental side, we show that this simple solution, just one line of code, enhances the performance and generalization of SFT across various tasks and models.

Related Work
============

The trade-off between supervised fine-tuning (SFT) and reinforcement learning (RL) is central to the alignment of large language models [@song2024moviechat; @chai2024auroracap; @song2025moviechat+; @song2025video; @xu2025auroralong; @song2025videonsa; @zhu2025hierarchical; @zhu2025enhancing; @zhu2025dynamic]. SFT is widely adopted due to its simplicity and efficiency in imitating expert demonstrations [@chung2022scaling; @zhou2024lima; @wei2022finetuned], analogous to behavioral cloning in robotics [@sammut2011behavioral; @mandlekar2021matters]. However, the literature consistently highlights its limitations, particularly the tendency to overfit and generalize poorly compared to RL, which leverages reward signals to discover more robust policies [@ouyang2022training; @christiano2017deep; @bai2022training; @swamy2025roads; @zhang2025right]. A recent systematic comparison by @chu2025sft across textual and visual domains confirms this distinction, concisely summarized as "SFT memorizes while RL generalizes." They further show that SFT remains indispensable as an initialization step, stabilizing output formatting prior to effective RL training. Nonetheless, RL faces significant practical hurdles, including computational expense, sensitivity to hyperparameters, and the requirement of an explicit reward function, all of which constrain its applicability [@schulman2017proximal; @strubell2019energy; @sheng2024hybridflow].

To combine the strengths of both paradigms, much recent work has pursued hybrid approaches. The most common strategy involves SFT pretraining followed by RL-based refinement with a learned reward model, as popularized by InstructGPT [@ouyang2022training]. More recent methods interleave SFT and RL updates to improve stability and performance [@sheng2024hybridflow; @liu2025unified; @qiu2025metisrise]. Other approaches, such as Direct Preference Optimization (DPO) [@rafailov2024direct], bypass reward modeling entirely by directly optimizing policies on preference data, thereby unifying imitation and reinforcement signals within a single loss function. @chen2025bridging introduce Negative-aware Fine-Tuning (NFT), which models incorrect generations via an implicit negative policy, enabling self-improvement without explicit feedback. While powerful, these methods rely on reward signals, preference pairs, or negative samples. They enrich the training pipeline but do not fundamentally improve SFT in its native setting, where only positive demonstrations are available. Our work instead focuses on enhancing SFT itself without requiring external feedback.

A complementary line of theoretical research seeks to unify SFT and RL under a common formalism. @du2025rewardweighted reinterpret RLHF as a reward-weighted variant of SFT, preserving reliance on an explicit reward. @wang2025implicit show that SFT can be cast as RL with an implicit reward, proposing adjustments such as smaller learning rates to manage the vanishing KL constraint. @abdolmaleki2024negative analyze learning from both positive and negative feedback, studying how their balance affects convergence. @qin2025supervised view SFT as a lower bound of RL and introduce importance weighting based on the data-generating policy. While these works establish connections between SFT and RL through weighting, they do not provide a precise mathematical equivalence between the SFT gradient and the offline policy gradient. Some methods approximate this connection in practice by reweighting training losses. For instance, MixCE [@zhang2023mixce] combines the forward and reverse KL divergences to form a unified objective, while GOLD [@pang2021text] adopts offline RL with demonstrations, introducing reliance on an unknown demonstration distribution $\pi_b$ and a restrictive $1/N$ assumption. @pavan2022anunderstandingof also provide a clear and insightful exposition of GOLD's motivation and mechanics from an alternative perspective, offering useful intuition for understanding its underlying design. @zhao2025mm provide a promising method to combine RL and SFT during training. In contrast, our work offers a more formal perspective on this connection, highlighting the role of the inverse-probability weighting term in shaping the difference between SFT and RL-like updates. This perspective motivates a simple adjustment: multiplying the loss by the model's token probability to neutralize the weighting.

Interestingly, our method modifies the standard cross-entropy (CE) loss in a way that inverts the weighting philosophy of the widely used Focal Loss [@lin2017focal]. Specifically, our modified CE takes the form $-p \log(p)$, whereas focal loss is defined as $-(1-p)^{\gamma} \log(p)$. Focal Loss deliberately downweights well-classified samples to emphasize underrepresented or hard cases, whereas we deliberately downweight poorly classified samples to encourage generalization. This inversion reflects a fundamental shift in the LLM era: while underfitting was once a central challenge, overfitting and memorization now dominate, demanding a rethinking of objective design.
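The contrast in weighting philosophy can be made concrete by evaluating the three per-token losses at a poorly classified and a well-classified probability (a minimal numeric sketch, not code from the paper):

```python
import math

def ce(p):
    """Standard cross-entropy term for a target token with probability p."""
    return -math.log(p)

def focal(p, gamma=2.0):
    """Focal loss: up-weights hard (low-p) samples relative to easy ones."""
    return -((1.0 - p) ** gamma) * math.log(p)

def dft(p):
    """Probability-rescaled CE: down-weights hard samples instead."""
    return -p * math.log(p)

hard, easy = 0.01, 0.9   # poorly vs. well classified token probabilities
```

Relative to plain CE, focal loss keeps nearly all of the hard sample's loss while suppressing the easy one; the $-p\log p$ form does the opposite, retaining only a fraction $p$ of each term.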

Method {#sec:method}
======

Preliminaries
-------------

#### Supervised Fine-Tuning.

Let $\mathcal D=\{(x,y^\star)\}$ denote a corpus of expert demonstrations, where $y^\star$ is the complete reference response to the query $x$. SFT minimizes the sentence-level cross-entropy: $$\mathcal{L}_{\mathrm{SFT}}(\theta)
  \;=\;
  \mathbb{E}_{(x, y^\star)\sim\mathcal D}
  \bigl[-\log \pi_\theta\bigl(y^\star \mid x\bigr)\bigr].
  \label{eq:sft-sentence-loss}$$ Its gradient is: $$\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
  \;=\;
  \mathbb{E}_{(x, y^\star)\sim\mathcal D}
  \bigl[-\nabla_\theta \log \pi_\theta\bigl(y^\star \mid x\bigr)\bigr].
  \label{eq:sft-grad}$$

#### Reinforcement Learning.

Let $y$ denote a response sampled from the policy $\pi_\theta(\cdot\mid x)$ for query $x$. Given a reward function $r(x,y)\in\mathbb R$, the policy objective is $$J(\theta)
  \;=\;
  \mathbb{E}_{x\sim\mathcal D_x,\;y\sim\pi_\theta(\cdot\mid x)}
  \bigl[r(x,y)\bigr].
  \label{eq:rl-sentence-obj}$$ Its policy gradient at the sentence level is $$\nabla_\theta J(\theta)
  \;=\;
  \mathbb{E}_{x\sim\mathcal D_x,\;y\sim\pi_\theta(\cdot\mid x)}
  \bigl[\nabla_\theta \log \pi_\theta(y \mid x)\;r(x,y)\bigr].
  \label{eq:policy-grad}$$

Unifying the SFT and RL Gradient Expressions
--------------------------------------------

#### Rewriting SFT Gradient as Policy Gradient via Importance Sampling.

The SFT gradient in `\eqref{eq:sft-grad}`{=latex} is taken under the *fixed* demonstration distribution. We convert it to an on-policy expectation by inserting an importance weight that compares the expert (Dirac delta) distribution with the model distribution. $$\mathbb{E}_{(x,y^\star)\sim\mathcal D}\,\bigl[-\nabla_\theta \log \pi_\theta\bigl(y^\star \mid x\bigr)\bigr]
=
\mathbb{E}_{x\sim\mathcal D_x}\,
  \underbrace{\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}
    \frac{\mathbf 1[y=y^\star]}{\pi_\theta(y\mid x)}\,\bigl[-\nabla_\theta \log \pi_\theta\bigl(y \mid x\bigr)\bigr]}_{\text{resample + reweight}}\,
\label{eq:importance-sampling}$$

Define the auxiliary variables (importance sampling weight) as $$w(y\mid x)
  = \frac{1}{\pi_\theta(y \mid x)} ,  \quad
r(x,y)=\mathbf 1[y=y^\star].$$ Reorganizing `\eqref{eq:importance-sampling}`{=latex} and rewriting it using the above auxiliary variables, we obtain the form $$\nabla_\theta\mathcal{L}_{\mathrm{SFT}}(\theta)
=
-\mathbb{E}_{x\sim\mathcal D_x,\;y\sim\pi_\theta(\cdot\mid x)}
\bigl[
  \textcolor{blue}{w(y\mid x)}\,\nabla_\theta\log\pi_\theta(y\mid x)\,\textcolor{blue}{ r(x,y)}
\bigr].
\label{eq:sft-grad-as-rl}$$

This form of the SFT gradient closely resembles the policy gradient in Equation `\eqref{eq:policy-grad}`{=latex}. Under this formulation, conventional SFT can be interpreted as an on-policy gradient method, where the reward is a sparse indicator function matching the expert trajectory, but biased by an importance weighting term $1/\pi_\theta$. We emphasize that this RL-style characterization serves solely as a theoretical lens: both the analysis and subsequent modifications are developed within the RL framework, while the final method remains fully implementable in standard SFT form for computational efficiency. Detailed derivations are provided in Appendix `\ref{sec:derivation}`{=latex}.

Due to the inherently sparse reward signal in the SFT setting, we identify the importance weight $1/\pi_\theta$ as a key contributor to SFT's generalization limitations compared to RL. When the model assigns low probability to the expert response, the resulting weight becomes excessively large, introducing an ill-posed reward landscape. This leads to disproportionately large gradients and training instability. The issue is compounded by the fact that the reward function $r(x, y) = \mathbf{1}[y = y^\star]$ is non-zero only for exact matches to the expert output, causing optimization to overfit rare exact-match samples and weakening the model's ability to generalize beyond the training data.
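A small numeric sketch (our illustration, not the paper's code) makes this blow-up concrete: under the rewriting above, each expert token's gradient contribution is effectively scaled by $1/\pi_\theta$:

```python
def implicit_weight(p: float) -> float:
    # Implicit per-token gradient scale in standard SFT under the
    # importance-sampling view: w = 1 / pi_theta(y* | x).
    return 1.0 / p

# A token the model already fits well contributes a modest update...
print(implicit_weight(0.5))   # ~2
# ...while a rare, low-probability expert token dominates the batch.
print(implicit_weight(1e-4))  # ~1e4
```

Four orders of magnitude in probability become four orders of magnitude in effective gradient scale, which is exactly the instability the rectification in the next section removes.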

Proposed Method
---------------

#### Reward Rectification via Dynamic Reweighting.

To neutralize the skewed reward identified when viewing SFT through the RL objective, we dynamically rectify the reward by multiplying it by the corrective inverse ratio $1/w$, given by the policy probability. The resulting "dynamically fine-tuned" gradient is then

$$\nabla_\theta\mathcal{L}_{\mathrm{SFT}}(\theta)
=
-\mathbb{E}_{x\sim\mathcal D_x,\;y\sim\pi_\theta(\cdot\mid x)}
\bigl[
  \operatorname{sg}(\frac{1}{w}) \cdot \textcolor{blue}{w(y\mid x)}\,\nabla_\theta\log\pi_\theta(y\mid x)\,\textcolor{blue}{ r(x,y)}
\bigr].
\label{eq:dr-grad}$$

where $\operatorname{sg}(\cdot)$ denotes the stop-gradient operator, ensuring that gradients do not flow through the reward-scaling term $w$. To ease the transition to later equations, we write $1/w$ directly as $\pi_\theta(y^\star \mid x)$ rather than $\pi_\theta(y \mid x)$, since the indicator function in `\eqref{eq:importance-sampling}`{=latex} or `\eqref{eq:sft-grad-as-rl}`{=latex} zeroes out all cases where $y \neq y^\star$. Because no gradient flows through this term, the corrected SFT loss reduces to a simple reweighted loss, which we call Dynamic Fine-Tuning (`\model`{=latex}).

$$\mathcal L_{\text{DFT}}(\theta)=
  \mathbb E_{(x,y^\star)\sim\mathcal D}
  \Bigl[
    -\operatorname{sg}\big(\pi_\theta(y^\star\mid x)\big)
    \log\pi_\theta(y^\star\mid x)\Bigr].
  \label{eq:dr-loss}$$

However, in practice, computing importance weights over the entire trajectory can induce numerical instability. A common remedy is to apply importance sampling at the token level instead, as adopted in PPO [@schulman2017proximal]. This yields the final DFT loss:

$$\mathcal L_{\text{DFT}}(\theta)=
  \mathbb E_{(x,y^\star)\sim\mathcal D}
  \Bigl[-\!\sum_{t=1}^{|y^\star|}
    \operatorname{sg}\big(\pi_\theta(y^\star_t\mid y^\star_{<t},x)\big)
    \log\pi_\theta(y^\star_t\mid y^\star_{<t},x)\Bigr].
  \label{eq:dr-loss-token-level}$$

Note that the reward of this corrected SFT (in RL form), i.e., DFT, is now uniformly 1 for all expert trajectories. This is akin to the contemporary verification-based reward approach RLVR [@deepseekr12025], which assigns a uniform reward to all correct samples. Consequently, DFT avoids over-concentration on specific low-probability reference tokens, leading to more stable updates and improved generalization without introducing any additional sampling or reward models.
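In implementation terms, Equation `\eqref{eq:dr-loss-token-level}`{=latex} is a one-line change to the cross-entropy objective. A minimal pure-Python sketch over toy per-token probabilities (in a PyTorch implementation, the stop gradient would be realized by multiplying the per-token loss by `probs.detach()`):

```python
import math

def sft_loss(token_probs):
    # Standard SFT: sum of -log p_t over the reference tokens.
    return sum(-math.log(p) for p in token_probs)

def dft_loss(token_probs):
    # DFT: each -log p_t is rescaled by sg(p_t); the factor p_t carries
    # no gradient. Since p_t <= 1, low-confidence tokens are
    # down-weighted rather than dominating the update.
    return sum(-p * math.log(p) for p in token_probs)

probs = [0.9, 0.6, 0.05, 0.8]  # toy probabilities of the reference tokens
print(sft_loss(probs), dft_loss(probs))
```

On this toy sequence, the rare token with $p_t = 0.05$ contributes $-\log 0.05 \approx 3.0$ to the SFT loss but only $-0.05 \log 0.05 \approx 0.15$ to the DFT loss, illustrating the rectified weighting.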

Experiments {#sec:exp}
===========

We design four groups of experiments to comprehensively evaluate `\model`{=latex}. We first study the standard SFT setting on mathematical reasoning tasks to establish its core advantage over SFT (Section `\ref{sec:math_exp_sft}`{=latex}). We then extend to an offline RL setting, comparing `\model`{=latex} with representative offline and online RL methods (Section `\ref{sec:math_exp_rft}`{=latex}). To test cross-domain robustness, we further examine `\model`{=latex} on code generation benchmarks (Section `\ref{sec:code_exp_sft}`{=latex}) and its applicability to multi-modal reasoning math datasets (Section `\ref{sec:multimodal_exp_sft}`{=latex}).

Main Experiment - Mathematical Reasoning Task {#sec:math_exp_sft}
---------------------------------------------

To examine whether `\model`{=latex} can outperform vanilla SFT across tasks, architectures, and scales, we use mathematical reasoning as a representative testbed.

#### Implementation details.

To manage computational resources efficiently, we randomly sample 100,000 instances from the NuminaMath-CoT dataset [@numina_math_datasets] for training. We conduct experiments with multiple models, including Qwen2.5-Math-1.5B, Qwen2.5-Math-7B [@qwen2024technical], LLaMA-3.2-3B, LLaMA-3.1-8B [@dubey2024llama], and DeepSeekMath-7B [@shao2024deepseekmath]. Our implementation builds upon the verl framework [@sheng2024hybridflow], using the recommended SFT hyper-parameters. Specifically, we employ the AdamW optimizer with a learning rate of $5 \times 10^{-5}$ for all models except LLaMA-3.1-8B-Base, for which we adopt a lower learning rate of $2 \times 10^{-5}$. We set the mini-batch size to 256 and the maximum input length to 2048 tokens. The learning rate follows a cosine decay schedule with a warm-up ratio of 0.1. We evaluate on Math500 [@hendrycks2021measuring], Minerva Math [@lewkowycz2022solving], Olympiad Bench [@he-etal-2024-olympiadbench], AIME 2024 [@aime2024dataset], and AMC 2023 [@amc2023dataset] using the official Qwen2.5-Math evaluation pipeline [@qwen2024technical]. Each model uses its default chat template and Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning. All reported results are average accuracy over 16 decoding runs with a temperature of 1.0 and a maximum generation length of 4096 tokens.

```{=latex}
\centering
```
```{=latex}
\renewcommand\arraystretch{1.5}
```
```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccc}
\toprule
 & \textbf{Math500} & \textbf{Minerva Math} & \textbf{Olympiad Bench} & \textbf{AIME24} & \textbf{AMC23} & \textbf{Avg.} \\
\midrule
LLaMA-3.2-3B & 1.63 & 1.36 & 1.01 & 0.41 & 1.56 & 1.19\\
LLaMA-3.2-3B w/SFT & 8.65 & 2.38 & 2.06 & 0.00 & 3.13 &  3.24 \\
\rowcolor{gray!20} LLaMA-3.2-3B w/DFT & \textbf{12.79} & \textbf{2.84} & \textbf{2.90} & \textbf{0.83} & \textbf{3.91} & \textbf{4.65} \\
\midrule
LLaMA-3.1-8B & 1.86 & 0.98 & 0.94 & 0.21 & 1.01 & 1.00 \\
LLaMA-3.1-8B w/SFT & 16.85 & 5.78 & 3.88 & 0.00 & 5.16 & 6.33 \\
\rowcolor{gray!20} LLaMA-3.1-8B w/DFT & \textbf{27.44} & \textbf{8.26} & \textbf{6.94} & \textbf{0.41} & \textbf{12.03} & \textbf{11.02} \\
\midrule
DeepSeekMath-7B & 6.15 & 2.15 & 1.74 & 0.21 & 2.97  & 2.64 \\
DeepSeekMath-7B w/SFT & 26.83 & 7.26 & 6.33 & 0.41 & 8.28 & 9.82  \\
\rowcolor{gray!20} DeepSeekMath-7B w/DFT & \textbf{41.46} & \textbf{16.79} & \textbf{15.00} & \textbf{1.24} & \textbf{16.25} & \textbf{18.15} \\
\midrule
Qwen2.5-Math-1.5B &  31.66 & 8.51 & 15.88 & 4.16 & 19.38 & 15.92 \\
Qwen2.5-Math-1.5B w/SFT & 43.76 & 13.04 & 12.63 & 1.87 & 18.75 & 18.01 \\
\rowcolor{gray!20} Qwen2.5-Math-1.5B w/DFT & \textbf{64.89} & \textbf{20.94}  & \textbf{27.08} & \textbf{6.87} & \textbf{38.13} & \textbf{31.58} \\
\midrule
Qwen2.5-Math-7B & 40.12 & 14.39 & 17.12 & 6.68 & 27.96 & 21.25 \\
Qwen2.5-Math-7B w/SFT & 53.96 & 16.66 & 18.93 & 2.48 & 26.09 & 23.62 \\
\rowcolor{gray!20} Qwen2.5-Math-7B w/DFT & \textbf{68.20} & \textbf{30.16} & \textbf{33.83} & \textbf{8.56} & \textbf{45.00} & \textbf{37.15} \\
\bottomrule
\end{tabular}
}
```
```{=latex}
\vspace{-10pt}
```
`\label{tab:math}`{=latex}

`\model`{=latex} consistently yields larger average improvements over the base models than standard SFT across all benchmarks. Table `\ref{tab:math}`{=latex} shows that, for Qwen2.5-Math-1.5B, `\model`{=latex} achieves an average gain of +15.66 points over the base model, which is over 5.9$\times$ larger than the +2.09 point improvement from SFT. This pattern generalizes across other model families and sizes: LLaMA-3.2-3B benefits from a +3.46 point gain with `\model`{=latex}, exceeding the SFT gain (+2.05) by approximately 1.4$\times$; LLaMA-3.1-8B achieves +10.02 from `\model`{=latex}, surpassing SFT's +5.33 by 1.88$\times$; DeepSeekMath-7B sees a +15.51 point improvement via `\model`{=latex}, which is 1.58$\times$ larger than SFT's +7.18; and Qwen2.5-Math-7B reaches a +15.90 point gain, nearly 3.8$\times$ higher than the SFT improvement of +2.37.

`\model`{=latex} demonstrates generalization and robustness, especially on challenging benchmarks where standard SFT yields minimal or even negative impact. For instance, on Olympiad Bench, SFT degrades performance for Qwen2.5-Math-1.5B, dropping accuracy from 15.88 to 12.63, while `\model`{=latex} boosts it to 27.08, a +11.20 point improvement over the base model. On AIME24, SFT reduces accuracy for Qwen2.5-Math-7B by 4.20 points (from 6.68 to 2.48), whereas `\model`{=latex} improves performance to 8.56, achieving a +1.88 point gain over the base model despite the difficulty of the benchmark. A similar trend is observed on AMC23. SFT reduces the performance of Qwen2.5-Math-1.5B from 19.38 to 18.75, while `\model`{=latex} raises it to 38.13, a +18.75 point gain over the base model. For Qwen2.5-Math-7B, SFT yields only a marginal improvement (+1.86), whereas `\model`{=latex} achieves a +17.04 point gain. These results underscore that `\model`{=latex} not only scales more effectively across models of varying capacities, but also exhibits better resilience on difficult reasoning tasks where traditional SFT struggles.

`\model`{=latex} exhibits better learning efficiency and faster convergence characteristics. Figure `\ref{fig:accuracy-curve}`{=latex} reveals clear differences in learning dynamics between `\model`{=latex} and standard SFT on Qwen2.5-Math-1.5B across all math reasoning benchmarks. Compared to SFT, our method demonstrates three distinct advantages: (1) faster convergence, achieving peak performance within the first 120 training steps on most benchmarks; (2) better early-stage performance, with `\model`{=latex} already outperforming the best final accuracy of SFT within the first 10--20 steps; and (3) higher sample efficiency, consistently requiring fewer updates to reach near-optimal results. This accelerated convergence shows that the dynamic reweighting mechanism in `\model`{=latex} leads to more informative gradient updates, guiding the model toward high-quality solutions early in training. It also suggests that `\model`{=latex} helps avoid the optimization plateaus or noise-prone regions often encountered in standard SFT, thereby enabling more efficient acquisition of complex mathematical reasoning patterns.

We also report results for the parameter-efficient fine-tuning (PEFT) setting [@hu2022lora] and for training on the higher-quality OpenR1-Math dataset [@huggingface2025openr1] in Appendix `\ref{sec:lora-training}`{=latex} and Appendix `\ref{sec:openr1-math}`{=latex}, respectively. A comparison and discussion of the concurrent method iw-SFT [@qin2025supervised] is provided in Appendix `\ref{sec:iw-sft}`{=latex}.

```{=latex}
\centering
```
![Accuracy progression for Qwen2.5-Math-1.5B across mathematical benchmarks, illustrating faster convergence and better performance achieved by `\model`{=latex} relative to SFT.](Figure/step_acc.png){#fig:accuracy-curve width="1\\linewidth"}

```{=latex}
\vspace{-15pt}
```
Exploratory Experiment - Offline RL Setting {#sec:math_exp_rft}
-------------------------------------------

Equation `\ref{eq:dr-grad}`{=latex} shows that SFT suffers from reward sparsity, since in a constructed dataset each query $x$ has only a single reference answer $y^\star$. From the RL perspective, RFT/RAFT [@dong2023raft; @ahn2024large] can be viewed as alleviating the sparse-reward issue by effectively increasing reward density, thereby enhancing model performance. Motivated by this observation, we conduct an exploratory study applying `\model`{=latex} in an offline RL setting, where the reward sparsity problem is inherently less severe than in standard SFT, to further validate its effectiveness.

#### Implementation details.

We sample responses for 100,000 math questions with a temperature of 1.0, generating four responses per question from the base model itself. Correct responses are identified with math-verify and retained as training data, resulting in approximately 140,000 examples. For DPO training, we construct 100,000 positive--negative preference pairs from the generated responses. We compare `\model`{=latex} with representative offline RL methods, including DPO [@rafailov2024direct] and RFT [@dong2023raft; @ahn2024large], as well as the online RL methods PPO [@schulman2017proximal] and GRPO [@shao2024deepseekmath]. For RFT and `\model`{=latex}, the training setup follows the configuration in Section `\ref{sec:math_exp_sft}`{=latex}. For DPO, we use the ms-swift framework [@zhao2024swiftascalablelightweightinfrastructure] with a learning rate of $1 \times 10^{-6}$, a batch size of 128, and a warm-up ratio of 0.05. For PPO and GRPO, training is performed with the verl framework [@sheng2024hybridflow] using a learning rate of $1 \times 10^{-6}$, a batch size of 256, and a warm-up ratio of 0.1. We set the number of responses to $n = 4$ for GRPO.
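The rejection-sampling data construction described above can be sketched as follows; `generate` and `is_correct` are hypothetical stand-ins for the base model's sampler and the math-verify answer check, not functions from any specific library:

```python
def build_rft_dataset(questions, generate, is_correct, n_samples=4):
    """Sample n responses per question; keep only verified-correct ones."""
    dataset = []
    for q in questions:
        for _ in range(n_samples):
            response = generate(q, temperature=1.0)
            if is_correct(q, response):
                dataset.append((q, response))
    return dataset
```

Training DFT (or RFT) on the resulting pairs then proceeds exactly as in the SFT setting, only with self-generated, verified targets in place of the reference solutions.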

```{=latex}
\centering
```
```{=latex}
\small
```
```{=latex}
\renewcommand\arraystretch{1.3}
```
```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{llcccccc}
\toprule
 & \textbf{Setting} & \textbf{Math500} & \textbf{Minerva Math} & \textbf{Olympiad Bench} & \textbf{AIME24} & \textbf{AMC23} & \textbf{Avg.} \\
\hline
\rowcolor{gray!20} Qwen2.5-Math-1.5B w/DFT & SFT & 64.89 & 20.94  & 27.08 & 6.87 & 38.13 & 31.58 \\
\midrule
Qwen2.5-Math-1.5B w/DPO & Offline & 46.89 & 11.53 & 22.86 & 4.58 & 30.16 & 23.20 \\
Qwen2.5-Math-1.5B w/RFT & Offline & 48.23 & 14.19 & 22.29 & 4.37 & 30.78 & 23.97 \\
\midrule
Qwen2.5-Math-1.5B w/PPO & Online & 56.10 & 15.41 & 26.33 & 7.50 & 37.97 & 28.66 \\
Qwen2.5-Math-1.5B w/GRPO & Online & 62.86 & 18.93 & 28.62 & \textbf{8.34} & 41.25 & 32.00 \\
\midrule
\rowcolor{gray!20} Qwen2.5-Math-1.5B w/DFT & Offline & \textbf{64.71} & \textbf{25.16}  & \textbf{30.93} & 7.93 & \textbf{48.44} & \textbf{35.43} \\
\bottomrule
\end{tabular}
}
```
```{=latex}
\vspace{-10pt}
```
`\label{tab:rl}`{=latex}

`\model`{=latex} demonstrates competitive performance in the offline RL setting, outperforming both offline and online RL baselines. Table `\ref{tab:rl}`{=latex} shows `\model`{=latex} achieves an average score of 35.43, exceeding the best offline method RFT by +11.46 points, and even outperforming the strongest online RL algorithm GRPO by +3.43 points. Specifically, on Math500, `\model`{=latex} scores 64.71, slightly ahead of GRPO (62.86) and better than PPO (56.10) and RFT (48.23). The gains are also notable on more challenging benchmarks: on AMC23, `\model`{=latex} achieves 48.44, a +7.19 point margin over GRPO and a +17.66 point gain over RFT. Similarly, on Minerva Math, `\model`{=latex} reaches 25.16, outperforming GRPO by +6.23 points, PPO by +9.75, and all offline baseline methods.

These results highlight the strength of `\model`{=latex} as a simple yet effective fine-tuning strategy. Despite lacking iterative reward modeling or environment interaction, it provides a stronger learning signal than both offline methods such as DPO/RFT and online policy-optimization algorithms such as PPO/GRPO at this training-set scale. This suggests that `\model`{=latex} can serve as a more efficient and scalable alternative to traditional RL pipelines, particularly in domains where preference supervision is available but reward modeling or online response sampling is expensive or impractical.

Exploratory Experiment - Code Generation Task {#sec:code_exp_sft}
---------------------------------------------

```{=latex}
\centering
```
```{=latex}
\renewcommand\arraystretch{1.5}
```
```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lccccccccccc}
\toprule
 & \multicolumn{2}{c}{\textbf{HumanEval}} & \multicolumn{9}{c}{\textbf{MultiPL-E}} \\
 & \textbf{HE} & \textbf{HE+} & \textbf{Python} & \textbf{C++} & \textbf{Java} & \textbf{PHP} & \textbf{TS} & \textbf{C\#} & \textbf{Bash} & \textbf{JS} & \textbf{Avg.} \\
\midrule
Qwen2.5-3B & 43.3 & 36.0 & 43.29 & 40.99 & 37.34 & 37.89 & \textbf{47.17} & \textbf{43.04} & 24.68 & 45.96 & 40.05 \\
Qwen2.5-3B w/SFT & 41.5 &  34.8 & 42.07 & 42.24 & 37.97 & 37.27 & 43.40 & 41.77 & 20.25 & \textbf{47.83} & 39.10 \\
\rowcolor{gray!20} Qwen2.5-3B w/DFT & \textbf{45.7} & \textbf{39.0 }& \textbf{45.73} & \textbf{44.72} & \textbf{41.77} & \textbf{45.34} & 42.14 & \textbf{43.04} & \textbf{27.85} & 44.10 & \textbf{41.84} \\
\midrule
Qwen2.5-Coder-3B & 52.4 & 42.7 & 51.83 &  53.42 & 46.20 & 47.20 & 54.09 & 55.06 & 25.32 & \textbf{54.04} & 48.39\\
Qwen2.5-Coder-3B w/SFT & 51.8 & 43.9 & 51.22 & 51.55 & 48.10 & 54.66 & \textbf{59.12} & 51.27 & \textbf{34.18} & \textbf{54.04} & 50.52\\
\rowcolor{gray!20} Qwen2.5-Coder-3B w/DFT & \textbf{56.7} & \textbf{50.0} & \textbf{57.32} & \textbf{54.66} & \textbf{51.27} & \textbf{58.39} & 58.49 & \textbf{60.76} & 31.01 & 53.42 & \textbf{53.16} \\
\midrule
Qwen2.5-Coder-7B & 62.2 & 53.0 & 63.41 & 63.98 &53.16 &59.01 &62.89 &59.49 &39.24 &60.87& 57.76 \\
Qwen2.5-Coder-7B w/SFT & 54.9 & 48.8 & 54.88 & 64.60 & 51.27 & \textbf{62.11} & 68.55 & 60.76 & 33.54 & \textbf{65.22} & 57.62 \\
\rowcolor{gray!20} Qwen2.5-Coder-7B w/DFT & \textbf{67.7} & \textbf{59.8} &  \textbf{67.68}& \textbf{67.70} &\textbf{54.43} & 60.87 &\textbf{70.44} &\textbf{65.19} &\textbf{48.73} &63.35 &\textbf{62.30} \\
\bottomrule
\end{tabular}}
```
```{=latex}
\vspace{-10pt}
```
`\label{tab:code}`{=latex}

```{=latex}
\centering
```
```{=latex}
\renewcommand\arraystretch{1.5}
```
```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccc}
\toprule
 & \multicolumn{4}{c}{\textbf{MathVerse}} & \multirow{2}{*}{\textbf{MathVision}} & \multirow{2}{*}{\textbf{WeMath}} \\
 & \textbf{Vision Only} & \textbf{Vision Intensive} & \textbf{Vision Dominant} & \textbf{Overall} &  &  \\
\midrule
Qwen2.5-VL-3B & 28.81 & 30.96 & 31.60 & 33.83 & 21.25 & 4.10 \\
Qwen2.5-VL-3B w/SFT & 30.96 & 33.63 & 32.74 & 35.66 & 21.02 & 23.33 \\
\rowcolor{gray!20} Qwen2.5-VL-3B w/DFT & \textbf{32.49} & \textbf{35.91} & \textbf{33.50} & \textbf{37.54} & \textbf{22.30} & \textbf{23.71} \\
\bottomrule
\end{tabular}}
```
```{=latex}
\vspace{-10pt}
```
`\label{tab:multi_modal}`{=latex}

#### Implementation details.

We adopt UltraFeedback [@cui2023ultrafeedback] as the training dataset. From this corpus, we sample 10,000 prompts and, for each prompt, select the response with the highest average score to perform supervised fine-tuning (SFT) [@du2025rewardweighted]. Model performance is assessed on three widely used code generation benchmarks: HumanEval [@chen2021evaluating], HumanEval+ [@liu2023your], and MultiPL-E [@cassano2023multipl]. Training is conducted for one epoch with a learning rate of $5 \times 10^{-5}$, a warm-up ratio of 0.05, and a batch size of 16.

Table `\ref{tab:code}`{=latex} shows `\model`{=latex} achieves improvements in most cases compared to both base models and SFT. For Qwen2.5-3B, DFT raises HumanEval from 43.3 to 45.7 and HumanEval+ from 36.0 to 39.0, with the MultiPL-E average also increasing from 40.05 (base) and 39.10 (SFT) to 41.84. Similar trends are observed for Qwen2.5-Coder-3B, where DFT improves HumanEval to 56.7 and HumanEval+ to 50.0, outperforming both base and SFT. For Qwen2.5-Coder-7B, DFT reaches 67.7 on HumanEval, 59.8 on HumanEval+, and 62.3 average on MultiPL-E, surpassing SFT by +12.8, +11.0, and +4.7 points respectively. The overall trend demonstrates that DFT generally provides stronger performance across different models and languages.

Exploratory Experiment - Multi-Modal Reasoning {#sec:multimodal_exp_sft}
----------------------------------------------

#### Implementation details.

We use the WeThink dataset [@yang2025wethink] for training. The model is fine-tuned using LLaMA-Factory [@zheng2024llamafactory] and evaluated with VLMEvalKit [@duan2024vlmevalkit]. We train the model for 1 epoch with a learning rate of 5e-5. To comprehensively assess reasoning capabilities, we adopt a suite of multi-modal reasoning benchmarks including MathVerse [@zhang2024mathverse], MathVision [@wang2024measuring], and WeMath [@qiao2024we] for evaluation.

`\model`{=latex} achieves consistent improvements over base models and SFT across all multi-modal reasoning benchmarks. Table `\ref{tab:multi_modal}`{=latex} shows that, on MathVerse, `\model`{=latex} boosts Qwen2.5-VL-3B from 33.83 to 37.54 average accuracy, a +3.71 point gain that roughly doubles SFT's +1.83. Consistent improvements are observed across all major vision-related subcategories. On MathVision, `\model`{=latex} improves performance from 21.25 (base) to 22.30, whereas SFT fails to provide gains (21.02). On WeMath, SFT already yields a +19.23 point gain, but `\model`{=latex} pushes performance slightly further to 23.71, maintaining superiority over both base and SFT. These results indicate that `\model`{=latex} not only strengthens text-only reasoning but also extends effectively to multi-modal domains.

Limitations of DFT: A Case Study on Factual Knowledge
-----------------------------------------------------

While DFT consistently outperforms SFT on reasoning-heavy tasks, it may not always be the better choice, particularly in factual knowledge domains. We conduct an exploratory experiment on the Natural Questions dataset [@kwiatkowski2019natural], which consists of real-user, open-domain factual queries grounded in Wikipedia articles.

In this setting, we find that SFT improves performance from 31.24% to 36.62%, while DFT unexpectedly reduces it to 30.14%. This result reveals an important limitation of DFT: because it reweights samples based on the model's own confidence, it tends to reinforce the model's existing beliefs. When the model lacks sufficient factual knowledge, such reinforcement may hinder effective learning instead of facilitating it.

This case suggests that DFT is most effective when the task aligns well with the model's prior competence, such as logical reasoning or structured prediction. In contrast, when the objective is to absorb new factual information, especially in domains beyond the model's current capabilities, SFT remains a more reliable and stable fine-tuning strategy.

```{=latex}
\centering
```
```{=latex}
\renewcommand\arraystretch{1.5}
```
```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccc}
\toprule
 & \textbf{Math500} & \textbf{Minerva Math} & \textbf{Olympiad Bench} & \textbf{AIME24} & \textbf{AMC23} & \textbf{Avg.} \\
\midrule
Qwen2.5-Math-1.5B &  31.66 & 8.51 & 15.88 & 4.16 & 19.38 & 15.92 \\
Sentence-Level Weighting &  31.26 & 8.05 & 16.47 & 3.12 & 19.84 & 15.75 \\
Geometric-Mean Weighting &  42.87 & 12.34 & 13.03 & 1.23 & 16.56 & 17.21 \\
\rowcolor{gray!20} Token-Level Weighting & \textbf{64.89} & \textbf{20.94}  & \textbf{27.08} & \textbf{6.87} & \textbf{38.13} & \textbf{31.58} \\
\bottomrule
\end{tabular}
}
```
```{=latex}
\vspace{-10pt}
```
`\label{tab:math_sentence}`{=latex}

An Empirical Comparison with Sentence-Level Weighting
-----------------------------------------------------

Our framework applies confidence-based weighting at the token level. While this design was primarily motivated by numerical stability, we also compared it against two sentence-level variants to better understand their behavior.

The first variant uses the full sequence probability to scale the loss. However, these values are extremely small in practice, making the loss nearly uninformative and producing a highly skewed weight distribution that is difficult to tune. To address this, we also evaluated a geometric-mean variant inspired by GSPO [@zheng2025group], which rescales sentence probabilities to avoid numerical collapse. Although this version is more stable, it still provides a weak training signal and offers limited performance gains.
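A quick numeric check (our illustration, assuming a hypothetical 500-token response with average token probability 0.9) shows why the raw sequence probability is uninformative while the geometric mean is not:

```python
import math

token_probs = [0.9] * 500  # hypothetical per-token probabilities

# Full sequence probability: a product of 500 terms collapses toward
# zero, so the loss weight is numerically near-useless.
log_sum = sum(math.log(p) for p in token_probs)
seq_prob = math.exp(log_sum)

# Geometric mean (GSPO-style rescaling): stays on a usable scale.
geo_mean = math.exp(log_sum / len(token_probs))

print(seq_prob, geo_mean)
```

The sequence probability lands around $10^{-23}$ while the geometric mean recovers 0.9, yet even the rescaled variant applies one shared weight to every token, which is the weak, undifferentiated signal the table below reflects.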

As shown in Table `\ref{tab:math_sentence}`{=latex}, both sentence-level strategies lead to minimal changes over the base model, while our token-level formulation delivers substantial and consistent improvements, raising average accuracy from 15.92 to 31.58. These results demonstrate that token-level weighting provides a more reliable optimization signal and significantly stronger empirical performance.

Analysis of Probabilities
-------------------------

```{=latex}
\centering
```
![Token probability distributions on the training set before training and after fine-tuning with `\model`{=latex}, SFT, and various RL methods. A logarithmic scale is used on the y-axis for clarity.](Figure/token_distribution.png){#fig:token_distribution width="1\\linewidth"}

```{=latex}
\vspace{-15pt}
```
To understand how a model trained with `\model`{=latex} differs from one trained with SFT or other RL methods, we examine the token probability distribution of the model's output over the training set in Figure `\ref{fig:token_distribution}`{=latex}. SFT tends to uniformly increase token probabilities, shifting the entire distribution towards higher confidence, but mainly lifting the lower and lowest probability tokens; the highest-probability portion barely grows. In stark contrast, `\model`{=latex} exhibits a polarizing effect: it significantly boosts the probabilities of a subset of tokens while actively suppressing the probabilities of others. This leads to a bimodal distribution, with more tokens occupying both the highest and lowest probability bins. Other RL methods such as DPO, GRPO, and PPO show the same trend as `\model`{=latex}, though to a much milder degree. Inspecting the words in the lowest probability bin, we find that they are generally connective words or punctuation marks such as \`the', \`let', \`,', and \`.'. These results suggest that for robust learning, models should not attempt to fit all tokens with uniform confidence; it may be beneficial to deprioritize tokens that serve grammatical functions rather than carrying primary semantic content. This is analogous to human pedagogy, where students are taught to focus on substantive concepts rather than perfecting the usage of common connective words. Further analysis can be found in Appendix `\ref{sec:discussion}`{=latex}.

Conclusion
==========

In this work, we revisit the well-known generalization gap between SFT and RL. We offer a theoretical perspective showing that the standard SFT gradient can be interpreted as a policy gradient with an ill-posed, implicitly defined reward inversely related to model confidence. This formulation helps explain the instability and limited generalization observed in SFT training. Motivated by this analysis, we introduce `\model`{=latex}, a simple yet effective method that dynamically reweights the SFT loss using the token probability. This one-line change improves gradient stability and leads to better generalization. Our empirical results show that `\model`{=latex} consistently improves over standard SFT across a range of models and challenging mathematical reasoning tasks. Beyond supervised settings, we adapt `\model`{=latex} to offline RL scenarios and find that it outperforms several established online and offline RL baselines, suggesting broader applicability. Overall, this work contributes both a refined understanding of SFT's limitations and a lightweight, practical method that helps bridge the gap to more complex RL-based approaches.

#### Limitations.

While our experiments demonstrate the effectiveness of `\model`{=latex} on mathematical reasoning benchmarks and code generation tasks, the evaluation scope remains limited. We have not yet assessed its performance on broader task categories or with larger-scale LLMs, which we leave for future exploration. Moreover, `\model`{=latex} cannot offer universal benefits across all scenarios. In domains that primarily involve the acquisition of factual knowledge, conventional SFT remains the most efficient approach. `\model`{=latex} may also be a poor fit for hard examples or domains underrepresented in the training data, since it assigns low initial probabilities to such samples and thus reduces their learning weight. Our aim is not to assert that `\model`{=latex} universally outperforms SFT, but to offer a new perspective on objective design by analyzing the distinction between RL and SFT. An important future direction is to explore non-uniform or quality-aware reward assignments for demonstrations.

Acknowledgements {#acknowledgements .unnumbered}
================

Supported by Jiangsu Province Carbon Peak Carbon Neutrality Science and Technology Innovation Special Fund Project (Grant No. BT2025029), National Natural Science Foundation of China (Grant No. 62576091), and Big Data Computing Center of Southeast University.

Ethics Statement {#ethics-statement .unnumbered}
================

This work adheres to the ICLR Code of Ethics. Our study does not involve human subjects, personally identifiable information, or proprietary data. All datasets used, including NuminaMath, OpenR1-Math, UltraFeedback, and WeThink, are publicly available and documented in the appendix. The proposed method is a simple training strategy that modifies gradient computation for improved generalization. It does not introduce any new capabilities that could cause harm, nor does it enable misuse beyond the standard capabilities of existing large language models. We are not aware of any potential risks related to bias, fairness, or security that arise specifically from the method proposed. Nonetheless, we acknowledge that like any fine-tuning strategy, DFT may inherit biases present in the underlying data or model, and future research may explore safeguards for these scenarios. No conflicts of interest, legal compliance issues, or sponsorship-related influences are present in this work.

Reproducibility Statement {#reproducibility-statement .unnumbered}
=========================

We have taken multiple steps to ensure the reproducibility of our work. All datasets used in our experiments are publicly available and properly cited in the main text and appendix. Training configurations, including model architectures, hyperparameters, optimizers, and evaluation settings, are described in detail in Section `\ref{sec:exp}`{=latex} and Appendices `\ref{sec:openr1-math}`{=latex}-`\ref{sec:lora-training}`{=latex}. Theoretical claims, including the equivalence between SFT and policy gradient, are formally derived in Appendix `\ref{sec:derivation}`{=latex}. Experimental results include multiple model scales, tasks, and training settings to validate robustness. A complete implementation of our method is included in the supplementary material, along with scripts for reproducing all reported results. We will release the full source code and training logs upon publication to further support reproducibility.

```{=latex}
\bibliographystyle{iclr2025_conference}
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
Appendix
========

Usage of LLMs
-------------

We employ LLMs primarily as writing assistants to refine and polish the manuscript. Their usage was limited to improving clarity, coherence, and presentation, while all conceptual and experimental contributions remain original.

Detailed Derivation of Equation (5) {#sec:derivation}
-----------------------------------

We start from the SFT gradient in Equation (2): $$\nabla_\theta \mathcal{L}_{\text{SFT}}(\theta) 
= \mathbb{E}_{(x,y^\star)\sim \mathcal{D}} \big[ - \nabla_\theta \log \pi_\theta(y^\star \mid x) \big].
\tag{1}$$

For each query $x$, the expectation over expert demonstrations $(x,y^\star)$ can be written explicitly as a summation over all possible outputs $y$: $$\mathbb{E}_{(x,y^\star)\sim \mathcal{D}} \big[ - \nabla_\theta \log \pi_\theta(y^\star \mid x) \big]
= \mathbb{E}_{x \sim \mathcal{D}_x} \sum_{y} \mathbf 1[y=y^\star] \, \big[ - \nabla_\theta \log \pi_\theta(y \mid x) \big].
\tag{2}$$

We insert the model distribution $\pi_\theta(y \mid x)$, which allows us to express the summation in terms of importance weights: $$\mathbb{E}_{x \sim \mathcal{D}_x} \sum_{y} 
\pi_\theta(y \mid x) \cdot 
\frac{\mathbf 1[y=y^\star]}{\pi_\theta(y \mid x)} \,
\big[ - \nabla_\theta \log \pi_\theta(y \mid x) \big].
\tag{3}$$

Here, the term $\tfrac{\mathbf 1[y=y^\star]}{\pi_\theta(y \mid x)}$ serves as an importance weight comparing the expert (Dirac delta) distribution with the model's distribution.

The summation over $y$ can now be rewritten as an expectation under the policy distribution $y \sim \pi_\theta(\cdot \mid x)$: $$\mathbb{E}_{x \sim \mathcal{D}_x} \,
\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} 
\left[ \frac{\mathbf 1[y=y^\star]}{\pi_\theta(y \mid x)} 
\big( - \nabla_\theta \log \pi_\theta(y \mid x) \big) \right].
\tag{4}$$

Thus, we obtain Equation (5): $$\mathbb{E}_{(x,y^\star)\sim\mathcal{D}} \big[ - \nabla_\theta \log \pi_\theta(y^\star \mid x) \big]
= \mathbb{E}_{x \sim \mathcal{D}_x} \,
\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} 
\left[ \frac{\mathbf 1[y=y^\star]}{\pi_\theta(y \mid x)} 
\big( - \nabla_\theta \log \pi_\theta(y \mid x) \big) \right].
\tag{5}$$

This derivation shows that the SFT gradient can be expressed as an on-policy policy gradient with importance sampling, where the expert demonstration distribution is reweighted relative to the model distribution.
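This identity can also be checked numerically on a toy categorical policy. The following sketch (PyTorch; the single-step distribution, seed, and variable names are illustrative, not from our experiments) compares the plain SFT gradient $-\nabla_\theta \log \pi_\theta(y^\star \mid x)$ against the importance-weighted sum on the right-hand side of Equation (5):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, requires_grad=True)  # toy policy over 4 outputs
y_star = 1                                   # the (single) expert output

log_probs = torch.log_softmax(logits, dim=-1)
probs = log_probs.exp().detach()

# Left-hand side: the plain SFT gradient, -grad log pi(y* | x).
sft_grad, = torch.autograd.grad(-log_probs[y_star], logits, retain_graph=True)

# Right-hand side: expectation under pi with importance weight 1[y = y*]/pi(y).
pg_grad = torch.zeros_like(logits)
for y in range(logits.numel()):
    weight = float(y == y_star) / probs[y]
    g, = torch.autograd.grad(-log_probs[y], logits, retain_graph=True)
    pg_grad += probs[y] * weight * g

print(torch.allclose(sft_grad, pg_grad))  # True
```

All terms with $y \ne y^\star$ vanish because the indicator zeroes their importance weight, leaving exactly the SFT gradient, as the derivation predicts.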

Discussions and Insights {#sec:discussion}
------------------------

#### Gradient Analysis of DFT.

We now analyze the gradient induced by the DFT surrogate loss. Recall the sequence-level definition: $$\mathcal L_{\text{DFT}}(\theta)
  = -\,\operatorname{sg}\!\big(\pi_\theta(y^\star \mid x)\big)\;
    \log \pi_\theta(y^\star \mid x),
  \label{eq:dft-seq-loss}$$ where $\operatorname{sg}(\cdot)$ denotes the stop-gradient operator. Since the stop-gradient blocks backpropagation, the detached probability $\operatorname{sg}(\pi_\theta(y^\star \mid x))$ is treated as a constant during differentiation. Consequently, the gradient becomes $$\begin{aligned}
\nabla_\theta \mathcal L_{\text{DFT}}
&= -\,\operatorname{sg}\!\big(\pi_\theta(y^\star \mid x)\big)\;
   \frac{1}{\pi_\theta(y^\star \mid x)}\,
   \nabla_\theta \pi_\theta(y^\star \mid x) \\
&= -\,\Bigl(\tfrac{\operatorname{sg}(\pi_\theta(y^\star \mid x))}{\pi_\theta(y^\star \mid x)}\Bigr)\,
   \nabla_\theta \pi_\theta(y^\star \mid x).\end{aligned}$$

Since $\operatorname{sg}(\pi_\theta(y^\star \mid x))$ equals $\pi_\theta(y^\star \mid x)$ in the forward pass, the prefactor is numerically equal to $1$. Therefore, $$\nabla_\theta \mathcal L_{\text{DFT}}
= -\,\nabla_\theta \pi_\theta(y^\star \mid x).
\label{eq:dft-grad}$$ This shows that DFT is mathematically equivalent to directly maximizing the model probability of the target token, rather than its log-probability as in cross-entropy.

For cross-entropy, the loss is $$\mathcal L_{\text{CE}}(\theta) = -\log \pi_\theta(y^\star \mid x),$$ yielding the gradient $$\nabla_\theta \mathcal L_{\text{CE}}
= -\tfrac{1}{\pi_\theta(y^\star \mid x)} \nabla_\theta \pi_\theta(y^\star \mid x).$$ Thus CE and DFT share the same gradient direction but differ in scaling: CE amplifies updates for low-probability targets (a factor of $1/\pi$), while DFT applies a uniform factor of $1$. As a result, DFT avoids the instability caused by excessively large gradients on unlikely expert tokens, providing more conservative and stable updates.

From the reinforcement learning perspective, the reward of DFT becomes uniformly $1$ across all expert trajectories, equivalent to a verification-style objective that treats all correct references equally. From the optimization perspective, DFT trades off aggressive fitting of rare tokens for better stability and calibration. Practically, this explains why DFT often yields smoother training and stronger generalization, while maintaining alignment with the pre-training distribution.

#### Learning from Noisy Data.

`\model`{=latex} offers a simple yet effective approach, which prompts the question of why it works. One intuitive explanation lies in its ability to learn from noisy data [@freund2009more]. @sasaki2020behavioral propose an imitation learning algorithm for learning from noisy demonstrations, whose core idea is to avoid fitting data that is difficult to model, since such data is likely to originate from suboptimal behaviors, i.e., noise. Their method introduces a weighted behavioral cloning objective, where the weights are derived from a previously trained policy's confidence in each action. The weighting mechanism in `\model`{=latex} shares the same intuition, but instead of relying on a fixed old policy to compute confidence scores, it uses a single policy model to perform confidence-based weighting on the fly during training.
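In implementation terms, this on-the-fly confidence weighting amounts to rescaling each token's cross-entropy term by its own detached probability. A minimal token-level sketch (PyTorch; the function name and shapes are illustrative, and padding/label masking is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def dft_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Per-token cross-entropy rescaled by the detached token probability."""
    log_probs = F.log_softmax(logits, dim=-1)            # (batch, seq, vocab)
    tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # The detached exp(log p) is the confidence weight computed on the fly.
    return -(tok_logp.exp().detach() * tok_logp).mean()

logits = torch.randn(2, 6, 100, requires_grad=True)       # toy batch
targets = torch.randint(0, 100, (2, 6))
loss = dft_loss(logits, targets)
print(loss.dim())  # 0 (a scalar loss, ready for loss.backward())
```

Relative to a standard CE loop, the only change is the multiplicative `tok_logp.exp().detach()` factor, consistent with the single-line modification described in the main text.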

Comparison with Concurrent Work iw-SFT {#sec:iw-sft}
--------------------------------------

We include a concurrent method, Importance-Weighted SFT (iw-SFT) [@qin2025supervised], for comparison. All training settings follow those reported in the original paper, except that we set the number of training epochs to 1.

As shown in Table `\ref{tab:iw-sft}`{=latex}, `\model`{=latex} achieves higher average accuracy than iw-SFT on most model families: LLaMA-3.2-3B (+2.39), LLaMA-3.1-8B (+4.15), DeepSeekMath-7B (+3.34), and Qwen2.5-Math-1.5B (+1.30). Although iw-SFT outperforms our method on Qwen2.5-Math-7B (+2.45), this improvement is not consistent across datasets. In particular, for LLaMA-3.2-3B, iw-SFT underperforms standard SFT on Math500 (5.13 vs. 8.65) and AMC23 (2.03 vs. 3.13). Similarly, for LLaMA-3.1-8B, iw-SFT performs worse than SFT on Minerva Math (4.31 vs. 5.78) and AMC23 (7.34 vs. 8.28). In contrast, `\model`{=latex} consistently improves upon both the base model and SFT across nearly all datasets, including those where iw-SFT fails. These results underscore the stronger generalization of `\model`{=latex} across diverse mathematical reasoning scenarios. Moreover, iw-SFT incurs additional computational overhead by requiring a separate reference model to compute importance weights, whereas `\model`{=latex} derives its weighting on the fly from the model's own token probabilities, resulting in a more efficient training procedure.

```{=latex}
\centering
```
```{=latex}
\renewcommand\arraystretch{1.5}
```
```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccc}
\toprule
 & \textbf{Math500} & \textbf{Minerva Math} & \textbf{Olympiad Bench} & \textbf{AIME24} & \textbf{AMC23} & \textbf{Avg.} \\
\midrule
LLaMA-3.2-3B w/iw-SFT & 5.13 & 2.63 & 1.51 & 0.00 & 2.03 & 2.26 \\
\rowcolor{gray!20} LLaMA-3.2-3B w/DFT & \textbf{12.79} & \textbf{2.84} & \textbf{2.90} & \textbf{0.83} & \textbf{3.91} & \textbf{4.65} \\
\midrule
LLaMA-3.1-8B w/iw-SFT & 18.21 & 4.31 & 4.31 & 0.20 & 7.34 & 6.87\\
\rowcolor{gray!20} LLaMA-3.1-8B w/DFT & \textbf{27.44} & \textbf{8.26} & \textbf{6.94} & \textbf{0.41} & \textbf{12.03} & \textbf{11.02} \\
\midrule
DeepSeekMath-7B w/iw-SFT & 35.32 & 8.75 & 11.11 & 0.61 & \textbf{18.28} & 14.81 \\
\rowcolor{gray!20} DeepSeekMath-7B w/DFT & \textbf{41.46} & \textbf{16.79} & \textbf{15.00} & \textbf{1.24} & 16.25 & \textbf{18.15} \\
\midrule
Qwen2.5-Math-1.5B w/iw-SFT & 59.38 & 17.08 & 26.82 & \textbf{8.13} & \textbf{40.00} & 30.28 \\
\rowcolor{gray!20} Qwen2.5-Math-1.5B w/DFT & \textbf{64.89} & \textbf{20.94}  & \textbf{27.08} & 6.87 & 38.13 & \textbf{31.58} \\
\midrule
Qwen2.5-Math-7B w/iw-SFT & \textbf{70.28} & 25.70 & \textbf{34.46} & \textbf{16.46} & \textbf{51.09} & \textbf{39.60} \\
\rowcolor{gray!20} Qwen2.5-Math-7B w/DFT & 68.20 & \textbf{30.16} & 33.83 & 8.56 & 45.00 & 37.15 \\
\bottomrule
\end{tabular}
}
```
`\label{tab:iw-sft}`{=latex}

```{=latex}
\centering
```
```{=latex}
\small
```
```{=latex}
\renewcommand\arraystretch{1.3}
```
```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{llcccccc}
\toprule
 & \textbf{Setting} & \textbf{Math500} & \textbf{Minerva Math} & \textbf{Olympiad Bench} & \textbf{AIME24} & \textbf{AMC23} & \textbf{Avg.} \\
\midrule
Qwen2.5-Math-1.5B w/iw-SFT & SFT & 59.38 & 17.08 & 26.82 & 8.13 & 40.00 & 30.28 \\
\rowcolor{gray!20} Qwen2.5-Math-1.5B w/DFT & SFT  & 62.50 & 22.94  & 26.87 & 7.31 & 33.75 & 30.67 \\
\midrule
Qwen2.5-Math-1.5B w/iw-SFT & Offline & 60.80 & 18.13 & 27.83 & \textbf{8.33} & 44.21 & 31.86 \\
\rowcolor{gray!20} Qwen2.5-Math-1.5B w/DFT & Offline & \textbf{64.71} & \textbf{25.16}  & \textbf{30.93} & 7.93 & \textbf{48.44} & \textbf{35.43} \\
\bottomrule
\end{tabular}
}
```
`\label{tab:rl-iw-sft}`{=latex}

We also compare against iw-SFT under the offline setting, as shown in Table `\ref{tab:rl-iw-sft}`{=latex}. While iw-SFT performs competitively on certain datasets, achieving 60.80 on Math500 and 44.21 on AMC23, its overall average performance (31.86) remains below that of our method by +3.57 points. Moreover, iw-SFT shows only modest improvements compared to its standard SFT counterpart, with an average score of 31.86 in the offline RL setting versus 30.28 with SFT (+1.58). In contrast, `\model`{=latex} achieves a larger gain of +4.76 (from 30.67 to 35.43). These results indicate that iw-SFT provides limited benefits from reward supervision under offline constraints, whereas `\model`{=latex} is able to more effectively incorporate such signals, leading to better generalization and higher task performance.

Exploratory Experiment - OpenR1-Math Training Dataset {#sec:openr1-math}
-----------------------------------------------------

Inspired by DeepSeek-R1 [@deepseekr12025], several studies have attempted to train open-source models to reproduce its reasoning capabilities [@huggingface2025openr1]. To this end, a high-quality dataset, OpenR1-Math-220k [@huggingface2025openr1], was constructed, where the prompts are drawn from NuminaMath 1.5 and the off-policy reasoning traces are generated by DeepSeek-R1. LUFFY [@yan2025learning] further filtered out sequences longer than 8192 tokens as well as those verified incorrect by Math-Verify, resulting in about 45k prompts paired with off-policy reasoning traces. We adopt this dataset as the training corpus for SFT. All training details remain the same as in the previous experiments, except that the number of epochs is set to 3.

As shown in Table `\ref{tab:open_r1_math}`{=latex}, training on OpenR1-Math-220k consistently improves performance, and the use of higher-quality annotations yields additional gains. SFT on this dataset increases the average accuracy of Qwen2.5-Math-1.5B by +13.24 points compared to the base model, while DFT provides a further +9.03 gain, resulting in a total improvement of +22.27. These results suggest that DFT remains effective even when applied on top of high-quality training data, highlighting its potential as a general fine-tuning paradigm.

```{=latex}
\centering
```
```{=latex}
\renewcommand\arraystretch{1.5}
```
```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccc}
\toprule
 & \textbf{Math500} & \textbf{Minerva Math} & \textbf{Olympiad Bench} & \textbf{AIME24} & \textbf{AMC23} & \textbf{Avg.} \\
\midrule
Qwen2.5-Math-1.5B &  31.66 & 8.51 & 15.88 & 4.16 & 19.38 & 15.92 \\
Qwen2.5-Math-1.5B w/SFT & 61.60 & 20.29 & 24.27 & 4.16 & 35.47 & 29.16 \\
\rowcolor{gray!20} Qwen2.5-Math-1.5B w/DFT & \textbf{71.76} & \textbf{27.00}  & \textbf{33.48} & \textbf{9.79} & \textbf{48.91} & \textbf{38.19} \\
\bottomrule
\end{tabular}
}
```
`\label{tab:open_r1_math}`{=latex}

```{=latex}
\centering
```
```{=latex}
\renewcommand\arraystretch{1.5}
```
```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccc}
\toprule
 & \textbf{Math500} & \textbf{Minerva Math} & \textbf{Olympiad Bench} & \textbf{AIME24} & \textbf{AMC23} & \textbf{Avg.} \\
\midrule
LLaMA-3.2-3B & 1.63 & 1.36 & 1.01 & \textbf{0.41} & 1.56 & 1.19\\
LLaMA-3.2-3B w/SFT & 4.88 & 1.56 & 1.68 & 0.00 & 2.66 & 2.56 \\
\rowcolor{gray!20} LLaMA-3.2-3B w/DFT & \textbf{11.13} & \textbf{5.18} & \textbf{3.87} & 0.00 & \textbf{2.97} & \textbf{4.63} \\
\midrule
Qwen2.5-Math-1.5B &  31.66 & 8.51 & 15.88 & 4.16 & 19.38 & 15.92 \\
Qwen2.5-Math-1.5B w/SFT & 41.47 & 10.85 & 11.56 & 1.45 & 17.03 & 16.87 \\
\rowcolor{gray!20} Qwen2.5-Math-1.5B w/DFT & \textbf{64.85} & \textbf{22.58}  & \textbf{28.45} & \textbf{5.84} & \textbf{40.78} & \textbf{32.90} \\
\bottomrule
\end{tabular}
}
```
`\label{tab:lora_math}`{=latex}

Exploratory Experiment - PEFT Training Setting {#sec:lora-training}
----------------------------------------------

To investigate whether DFT remains effective under parameter-efficient fine-tuning (PEFT) settings with limited compute, we apply DFT using LoRA adapters across two model families: LLaMA-3.2-3B and Qwen2.5-Math-1.5B. All training configurations remain identical to previous full-parameter experiments, except that LoRA is enabled with rank=8 and alpha=16.
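For concreteness, the low-rank update at rank 8 and alpha 16 follows the standard LoRA formulation; the sketch below is a minimal illustration of that formulation in plain PyTorch, not our actual implementation (which uses standard LoRA adapter tooling). The class name and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        # Low-rank factors: B is zero-initialized, so training starts
        # exactly at the base model's behavior.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(32, 32), r=8, alpha=16)
x = torch.randn(2, 32)
out = layer(x)
# Only the rank-8 factors (2 * 8 * 32 = 512 parameters here) are trainable.
print(out.shape)  # torch.Size([2, 32])
```

Since only the rank-$r$ factors receive gradients, the DFT reweighting applies unchanged: the loss modification is orthogonal to which parameters are updated.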

As shown in Table `\ref{tab:lora_math}`{=latex}, DFT provides consistent improvements over both the base and SFT baselines under LoRA-based PEFT. For Qwen2.5-Math-1.5B, DFT increases the average accuracy from 15.92 (base) and 16.87 (SFT) to 32.90. For LLaMA-3.2-3B, DFT improves the average from 1.19 (base) and 2.56 (SFT) to 4.63. These results indicate that DFT can serve as an effective fine-tuning strategy in low-resource or compute-constrained settings where full model updates are impractical.

Training Hyper-Parameters Ablation
----------------------------------

```{=latex}
\centering
```
![Ablation study of training hyper-parameters (learning rate and batch size) for `\model`{=latex} and SFT on the Qwen2.5-Math-1.5B model.](Figure/lr_bs_ablation.png){#fig:lr_ablation width="1\\linewidth"}

To assess the robustness and sensitivity of our approach (`\model`{=latex}) with respect to key training hyper-parameters, we conduct an ablation study focused on learning rate and batch size, using the Qwen2.5-Math-1.5B base model. This analysis aims to answer two central questions: (1) Is the performance gap between `\model`{=latex} and SFT due to a suboptimal hyperparameter configuration in SFT? (2) How sensitive are both methods to changes in learning rate and batch size?

We evaluate both `\model`{=latex} and SFT across four learning rates: 2e-4, 1e-4, 5e-5, and 1e-5. As shown in Figure `\ref{fig:lr_ablation}`{=latex} (left), both methods exhibit a certain degree of sensitivity to the learning rate. `\model`{=latex} consistently outperforms SFT under all configurations, suggesting that the performance gap cannot be attributed solely to suboptimal hyperparameter choices in SFT. For both methods, intermediate learning rates (1e-4 and 5e-5) yield the best results, while both lower (1e-5) and higher (2e-4) values lead to noticeable degradation.

We further assess the impact of batch size, sweeping values from 32 to 256. As shown in Figure `\ref{fig:lr_ablation}`{=latex} (right), both `\model`{=latex} and SFT exhibit relatively stable performance across the full range of batch sizes. While minor fluctuations are observed, there is no consistent trend indicating that larger or smaller batches significantly affect final accuracy. This suggests that batch size is not a dominant factor for either method in this setup, and that default values may suffice in practice.

[^1]: Equal Contribution. $^{\dag}$Project Leader. $^{\ddag}$Corresponding Author.
