

UFT: Unifying Supervised and Reinforcement Fine-Tuning

May 22, 2025
Authors: Mingyang Liu, Gabriele Farina, Asuman Ozdaglar
cs.AI

Abstract

Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between the memorization and the reasoning that underlie existing methods. Notably, UFT outperforms both SFT and RFT overall, regardless of model size. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.
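The abstract does not spell out the training objective, but the core idea of unifying SFT and RFT in a single update can be illustrated with a minimal sketch. The PyTorch snippet below combines a supervised cross-entropy loss on demonstration tokens with a REINFORCE-style policy-gradient loss on a sampled completion; the mixing weight `alpha`, the toy next-token model, the rollout length, and the caller-supplied `reward_fn` are illustrative assumptions, not the paper's actual algorithm.

```python
# Minimal sketch of a unified SFT + RFT update (illustrative only).
# Assumes a single step mixes a supervised cross-entropy term with a
# REINFORCE-style policy-gradient term; `alpha`, the toy model, the
# rollout length, and `reward_fn` are hypothetical stand-ins and are
# not taken from the UFT paper.
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 32
policy = torch.nn.Sequential(torch.nn.Embedding(vocab_size, hidden),
                             torch.nn.Linear(hidden, vocab_size))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def uft_step(prompt_ids, demo_ids, reward_fn, alpha=0.5):
    """One hypothetical unified update: alpha * SFT loss + (1 - alpha) * RFT loss."""
    # SFT term: cross-entropy of the demonstration continuation.
    logits = policy(demo_ids[:-1])                 # predict each next token
    sft_loss = F.cross_entropy(logits, demo_ids[1:])

    # RFT term: sample a short continuation and apply REINFORCE.
    sampled, log_probs = [], []
    token = prompt_ids[-1:]
    for _ in range(8):                             # short rollout for illustration
        dist = torch.distributions.Categorical(logits=policy(token)[-1])
        token = dist.sample().unsqueeze(0)
        log_probs.append(dist.log_prob(token[0]))
        sampled.append(token.item())
    reward = reward_fn(sampled)                    # e.g. 1.0 if the answer is correct
    rft_loss = -reward * torch.stack(log_probs).sum()

    loss = alpha * sft_loss + (1 - alpha) * rft_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a scheme like this, one might anneal `alpha` from 1 toward 0 over training so the model first imitates the supervised traces and then relies increasingly on exploration and reward; whether UFT uses this particular weighting or schedule is an assumption here, not a claim about the paper.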

