On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Abstract

We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we show that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes the gradient update for each token by dynamically rescaling the objective function with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Our approach also achieves competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
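The rescaling described in the abstract can be illustrated with a small numeric sketch. It is not the authors' implementation: the function names are hypothetical, and holding the probability weight constant during differentiation is an assumption that the abstract does not spell out. In an autodiff framework this would roughly correspond to multiplying the per-token cross-entropy by a gradient-free copy of the token probability. The sketch only shows why the rescaling could stabilize updates: the plain log-likelihood gradient with respect to the token probability grows like 1/p for unlikely tokens, while the rescaled version stays bounded.

```python
import math


def sft_token_loss(p: float) -> float:
    """Standard SFT (cross-entropy) loss for a target token with probability p."""
    return -math.log(p)


def dft_token_loss(p: float) -> float:
    """SFT loss rescaled by the token's own probability.

    Assumption: the weight p is treated as a constant, so no gradient
    flows through it; only the log-probability term is differentiated.
    """
    return p * sft_token_loss(p)


def sft_grad_wrt_p(p: float) -> float:
    """d/dp of -log(p), which is -1/p and blows up as p approaches 0."""
    return -1.0 / p


def dft_grad_wrt_p(p: float) -> float:
    """Gradient of the rescaled loss with the weight held constant: p * (-1/p)."""
    return p * sft_grad_wrt_p(p)


if __name__ == "__main__":
    # Hypothetical target-token probabilities, from confident to very unlikely.
    for p in (0.9, 0.5, 0.01):
        print(
            f"p={p:<5} sft_loss={sft_token_loss(p):.4f} "
            f"dft_loss={dft_token_loss(p):.4f} "
            f"sft_grad={sft_grad_wrt_p(p):.2f} dft_grad={dft_grad_wrt_p(p):.2f}"
        )
```

Under these assumptions, a token the model considers very unlikely produces a large gradient under the standard objective but a bounded one after rescaling, which is one way to read the abstract's claim about stabilized updates.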
