UFT: Unifying Supervised and Reinforcement Fine-Tuning

Abstract

Post-training has proven important for enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods fall into two categories: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well suited to small language models, but it can overfit and limit the reasoning abilities of larger models. RFT, in contrast, generally generalizes better but depends heavily on the strength of the base model. To address the limitations of both, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT lets the model explore solutions effectively while incorporating informative supervision signals, bridging the gap between memorizing and thinking that underlies existing methods. Notably, UFT generally outperforms both SFT and RFT regardless of model size. Furthermore, we prove theoretically that UFT breaks RFT's inherent exponential sample-complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.

