UFT: Unifying Supervised and Reinforcement Fine-Tuning

Abstract

Post-training has proven important for enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods fall into two categories: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well suited to small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of both, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to explore solutions effectively while incorporating informative supervision signals, bridging the gap between memorizing and thinking that underlies existing methods. Notably, UFT generally outperforms both SFT and RFT, regardless of model size. Furthermore, we prove theoretically that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.

