Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Abstract

Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
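The discrimination step described above, in which the model learns to tell its own earlier responses apart from human-annotated ones, can be illustrated with a small sketch. The abstract does not state the training objective, so the form used here is an assumption: a logistic loss applied to the gap between two log-probability ratios, each taken between the current model and its previous iteration. The function names, the scaling parameter `lam`, and the numeric log-probabilities are hypothetical and chosen only for illustration.

```python
import math


def logistic_loss(t: float) -> float:
    """Numerically stable evaluation of log(1 + exp(-t))."""
    if t >= 0:
        return math.log1p(math.exp(-t))
    return -t + math.log1p(math.exp(t))


def spin_style_loss(
    logp_new_human: float,  # log-prob of the human response under the current model
    logp_old_human: float,  # log-prob of the human response under the previous iteration
    logp_new_self: float,   # log-prob of the self-generated response under the current model
    logp_old_self: float,   # log-prob of the self-generated response under the previous iteration
    lam: float = 0.1,       # assumed scaling factor (hypothetical default)
) -> float:
    """Assumed self-play objective for a single prompt.

    The loss falls when the current model raises the likelihood of the
    human-annotated response and lowers the likelihood of the response
    generated by its previous iteration.
    """
    human_ratio = logp_new_human - logp_old_human
    self_ratio = logp_new_self - logp_old_self
    return logistic_loss(lam * (human_ratio - self_ratio))


# Current model identical to the previous iteration: the margin is zero.
baseline = spin_style_loss(-5.0, -5.0, -3.0, -3.0)

# Current model shifts probability mass toward the human response.
improved = spin_style_loss(-4.0, -5.0, -4.0, -3.0)
```

Under this assumed form, a model that has not moved from its previous iteration incurs a loss of log 2, and any shift of probability mass from the self-generated response toward the human-annotated one reduces it. Once the two kinds of response are equally likely under the model, no further gain is available, which is consistent with the abstract's claim that the optimum is reached only when the policy matches the target data distribution.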
