Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
January 2, 2024
Authors: Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu
cs.AI
Abstract
Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
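To make the self-play step described in the abstract concrete, here is a minimal, illustrative sketch, not the authors' reference implementation. It assumes a DPO-style pairwise logistic loss consistent with the description above: the current model p_theta is trained to raise its likelihood (relative to its previous iteration p_{theta_t}) on human-annotated responses and lower it on responses sampled from that previous iteration. The names `spin_loss` and the weight `lam` are hypothetical placeholders, and the inputs are assumed to be sequence-level log-probabilities computed elsewhere.

```python
# Minimal sketch of a SPIN-style loss on one batch of (prompt, human response,
# self-generated response) triples. Assumptions: sequence log-probabilities are
# already computed; `lam` plays the role of a regularization/temperature weight.
import torch
import torch.nn.functional as F


def spin_loss(
    logp_real_cur: torch.Tensor,   # log p_theta(y | x) for human-annotated y
    logp_real_prev: torch.Tensor,  # log p_{theta_t}(y | x) for the same y
    logp_gen_cur: torch.Tensor,    # log p_theta(y' | x) for self-generated y'
    logp_gen_prev: torch.Tensor,   # log p_{theta_t}(y' | x) for the same y'
    lam: float = 0.1,              # assumed hyperparameter, not from the abstract
) -> torch.Tensor:
    """Pairwise logistic loss preferring human data over self-generated data."""
    real_margin = lam * (logp_real_cur - logp_real_prev)
    gen_margin = lam * (logp_gen_cur - logp_gen_prev)
    # Minimized when human responses are scored above self-generated ones.
    return -F.logsigmoid(real_margin - gen_margin).mean()


if __name__ == "__main__":
    # Dummy log-probabilities standing in for model outputs on a batch of 4.
    b = 4
    loss = spin_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```

In an actual training loop, one would repeat this over iterations: freeze the previous checkpoint theta_t, sample responses from it for each prompt, optimize theta against the human demonstrations with a loss of this shape, then promote theta to become the next opponent.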