UFT: Unifying Supervised and Reinforcement Fine-Tuning
May 22, 2025
Authors: Mingyang Liu, Gabriele Farina, Asuman Ozdaglar
cs.AI
Abstract
Post-training has demonstrated its importance in enhancing the reasoning
capabilities of large language models (LLMs). The primary post-training methods
can be categorized into supervised fine-tuning (SFT) and reinforcement
fine-tuning (RFT). SFT is efficient and well-suited for small language models,
but it may lead to overfitting and limit the reasoning abilities of larger
models. In contrast, RFT generally yields better generalization but depends
heavily on the strength of the base model. To address the limitations of SFT
and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm
that unifies SFT and RFT into a single, integrated process. UFT enables the
model to effectively explore solutions while incorporating informative
supervision signals, bridging the gap between memorizing and thinking
underlying existing methods. Notably, UFT outperforms both SFT and RFT in
general, regardless of model size. Furthermore, we theoretically prove that
UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for
the first time that unified training can exponentially accelerate convergence
on long-horizon reasoning tasks.
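The abstract describes UFT as a single process that mixes informative supervision signals with reward-driven exploration. As a rough illustration of that idea only, the sketch below blends an SFT cross-entropy term with a REINFORCE-style RFT term in one objective; the function name `unified_loss`, the mixing weight `alpha`, and all tensor shapes are assumptions for illustration and are not the paper's actual formulation.

```python
# Minimal sketch (assumption, not the paper's exact algorithm): one way to
# combine a supervised fine-tuning (SFT) cross-entropy term with a
# reinforcement fine-tuning (RFT) policy-gradient term in a single objective.
import torch
import torch.nn.functional as F


def unified_loss(ref_logits, ref_targets, policy_logits, sampled_ids, reward, alpha=0.5):
    # SFT term: token-level cross-entropy against a reference (demonstrated) solution.
    sft_loss = F.cross_entropy(
        ref_logits.view(-1, ref_logits.size(-1)), ref_targets.view(-1)
    )
    # RFT term: REINFORCE-style loss, -E[reward * log pi(sampled trajectory)].
    logprobs = torch.log_softmax(policy_logits, dim=-1)
    sampled_logprobs = logprobs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    rft_loss = -(reward * sampled_logprobs.sum(dim=-1)).mean()
    # Unified objective: a convex combination of supervision and exploration signals.
    return alpha * sft_loss + (1.0 - alpha) * rft_loss


# Toy usage with random tensors standing in for model outputs.
batch, seq, vocab = 2, 8, 32
ref_logits = torch.randn(batch, seq, vocab, requires_grad=True)
ref_targets = torch.randint(0, vocab, (batch, seq))
policy_logits = torch.randn(batch, seq, vocab, requires_grad=True)
sampled_ids = torch.randint(0, vocab, (batch, seq))
reward = torch.tensor([1.0, 0.0])  # e.g. 1 if the sampled solution is judged correct
loss = unified_loss(ref_logits, ref_targets, policy_logits, sampled_ids, reward)
loss.backward()
```

How the supervision signal is injected (e.g., via hints, partial solutions, or an annealed mixing schedule) is a design choice the paper addresses; the fixed-weight combination above is only the simplest stand-in.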