UFT: Unifying Supervised and Reinforcement Fine-Tuning
May 22, 2025
Authors: Mingyang Liu, Gabriele Farina, Asuman Ozdaglar
cs.AI
Abstract
Post-training has demonstrated its importance in enhancing the reasoning
capabilities of large language models (LLMs). The primary post-training methods
can be categorized into supervised fine-tuning (SFT) and reinforcement
fine-tuning (RFT). SFT is efficient and well-suited for small language models,
but it may lead to overfitting and limit the reasoning abilities of larger
models. In contrast, RFT generally yields better generalization but depends
heavily on the strength of the base model. To address the limitations of SFT
and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm
that unifies SFT and RFT into a single, integrated process. UFT enables the
model to effectively explore solutions while incorporating informative
supervision signals, bridging the gap between memorizing and thinking
underlying existing methods. Notably, UFT outperforms both SFT and RFT in
general, regardless of model size. Furthermore, we theoretically prove that
UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for
the first time that unified training can exponentially accelerate convergence
on long-horizon reasoning tasks.
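The abstract describes UFT as a single process that mixes informative supervision signals with reward-driven exploration. As a rough illustration of that idea only, the sketch below blends an SFT cross-entropy term with a REINFORCE-style RFT term in one objective; the function name `unified_loss`, the mixing weight `alpha`, and all tensor shapes are assumptions for illustration and are not the paper's actual formulation.

```python
# Minimal sketch (assumption, not the paper's exact algorithm): one way to
# combine a supervised fine-tuning (SFT) cross-entropy term with a
# reinforcement fine-tuning (RFT) policy-gradient term in a single objective.
import torch
import torch.nn.functional as F


def unified_loss(ref_logits, ref_targets, policy_logits, sampled_ids, reward, alpha=0.5):
    # SFT term: token-level cross-entropy against a reference (demonstrated) solution.
    sft_loss = F.cross_entropy(
        ref_logits.view(-1, ref_logits.size(-1)), ref_targets.view(-1)
    )
    # RFT term: REINFORCE-style loss, -E[reward * log pi(sampled trajectory)].
    logprobs = torch.log_softmax(policy_logits, dim=-1)
    sampled_logprobs = logprobs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    rft_loss = -(reward * sampled_logprobs.sum(dim=-1)).mean()
    # Unified objective: a convex combination of supervision and exploration signals.
    return alpha * sft_loss + (1.0 - alpha) * rft_loss


# Toy usage with random tensors standing in for model outputs.
batch, seq, vocab = 2, 8, 32
ref_logits = torch.randn(batch, seq, vocab, requires_grad=True)
ref_targets = torch.randint(0, vocab, (batch, seq))
policy_logits = torch.randn(batch, seq, vocab, requires_grad=True)
sampled_ids = torch.randint(0, vocab, (batch, seq))
reward = torch.tensor([1.0, 0.0])  # e.g. 1 if the sampled solution is judged correct
loss = unified_loss(ref_logits, ref_targets, policy_logits, sampled_ids, reward)
loss.backward()
```

How the supervision signal is injected (e.g., via hints, partial solutions, or an annealed mixing schedule) is a design choice the paper addresses; the fixed-weight combination above is only the simplest stand-in.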