Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
January 2, 2024
Authors: Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu
cs.AI
Abstract
Harnessing the power of human-annotated data through Supervised Fine-Tuning
(SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we
delve into the prospect of growing a strong LLM out of a weak one without the
need for acquiring additional human-annotated data. We propose a new
fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a
supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism,
where the LLM refines its capability by playing against instances of itself.
More specifically, the LLM generates its own training data from its previous
iterations, refining its policy by discerning these self-generated responses
from those obtained from human-annotated data. Our method progressively
elevates the LLM from a nascent model to a formidable one, unlocking the full
potential of human-annotated demonstration data for SFT. Theoretically, we
prove that the global optimum to the training objective function of our method
is achieved only when the LLM policy aligns with the target data distribution.
Empirically, we evaluate our method on several benchmark datasets including the
HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our
results show that SPIN can significantly improve the LLM's performance across a
variety of benchmarks and even outperform models trained through direct
preference optimization (DPO) supplemented with extra GPT-4 preference data.
This sheds light on the promise of self-play, enabling the achievement of
human-level performance in LLMs without the need for expert opponents.
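As a rough illustration of the self-play mechanism the abstract describes, the sketch below shows one hypothetical SPIN-style update on precomputed sequence log-probabilities: the frozen previous iterate acts as the opponent, and the model being trained is pushed, via a pairwise logistic loss, to rate the human-annotated response above its own self-generated response. The loss form, the `beta` scaling parameter, and all tensor names are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def spin_pair_loss(
    logp_new_human: torch.Tensor,   # log p_theta(y_human | x) under the model being trained
    logp_old_human: torch.Tensor,   # log p_theta_t(y_human | x) under the frozen previous iterate
    logp_new_synth: torch.Tensor,   # log p_theta(y_synth | x) for the self-generated response
    logp_old_synth: torch.Tensor,   # log p_theta_t(y_synth | x)
    beta: float = 0.1,              # assumed scaling hyperparameter (illustrative)
) -> torch.Tensor:
    """Pairwise logistic loss: prefer the human-annotated response over the
    self-generated one, measured relative to the previous iterate."""
    margin = beta * ((logp_new_human - logp_old_human)
                     - (logp_new_synth - logp_old_synth))
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up per-sequence log-probabilities (batch of 3 prompts).
logp_new_human = torch.tensor([-12.0, -9.5, -14.2], requires_grad=True)
logp_old_human = torch.tensor([-13.1, -10.0, -14.0])
logp_new_synth = torch.tensor([-11.0, -9.0, -13.5], requires_grad=True)
logp_old_synth = torch.tensor([-11.2, -9.1, -13.4])

loss = spin_pair_loss(logp_new_human, logp_old_human, logp_new_synth, logp_old_synth)
loss.backward()
print(float(loss))
```

In a full training loop, the self-generated responses would be sampled from the previous iterate on the same prompts as the human demonstrations, and the newly trained model would then serve as the opponent for the next round.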