SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models

Abstract

Instruction-following is a fundamental capability of language models: the model must recognize even the most subtle requirements in an instruction and reflect them accurately in its output. This ability is well suited to, and often optimized by, preference learning. However, existing methods often build preference pairs by directly sampling multiple independent responses from the model. This practice can introduce content variations that are irrelevant to whether the instruction is precisely followed (e.g., different wordings of the same meaning), which interferes with teaching models to recognize the key differences that lead to improved instruction following. In light of this, we introduce SPaR, a self-play framework that integrates tree-search self-refinement to yield valid and comparable preference pairs free from distractions. By playing against itself, an LLM employs a tree-search strategy to refine its previous responses with respect to the instruction while minimizing unnecessary variations. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities. Furthermore, SPaR demonstrates promising scalability and transferability, greatly enhancing models such as GLM-4-9B and LLaMA3-70B. We also identify how inference scaling in tree search affects model performance. Our code and data are publicly available at https://github.com/thu-coai/SPaR.
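The idea of refining a failed response by tree search, so that the preference pair differs only in what matters for the instruction, can be illustrated with a toy sketch. This is not the implementation from the SPaR repository. Every name here (`tree_search_refine`, `build_preference_pair`, `toy_judge`, `toy_refine`) is hypothetical, the judge and refiner are simple rule-based stand-ins for what would be LLM calls, and breadth-first search is just one search order assumed for illustration.

```python
from collections import deque
from typing import Callable, List, Optional, Tuple

# Hypothetical toy instruction: "Answer in at most 4 words and end with a period."
FILLERS = {"really", "very", "definitely"}


def toy_judge(text: str) -> bool:
    """Rule-based stand-in for an LLM judge: does the text follow the instruction?"""
    return len(text.split()) <= 4 and text.endswith(".")


def toy_refine(text: str) -> List[str]:
    """Rule-based stand-in for an LLM refiner: propose small, local edits."""
    words = text.split()
    candidates = []
    for i, word in enumerate(words):
        if word.lower() in FILLERS:
            # Drop one filler word, leaving the rest of the response untouched.
            candidates.append(" ".join(words[:i] + words[i + 1:]))
    if not text.endswith("."):
        candidates.append(text + ".")
    return candidates


def tree_search_refine(
    response: str,
    judge: Callable[[str], bool],
    refine: Callable[[str], List[str]],
    max_nodes: int = 50,
) -> Optional[str]:
    """Breadth-first search over refinements of `response`.

    Returns the first refinement the judge accepts, or None if the
    budget of `max_nodes` expanded nodes runs out first.
    """
    queue = deque([response])
    seen = {response}
    expanded = 0
    while queue and expanded < max_nodes:
        node = queue.popleft()
        expanded += 1
        for child in refine(node):
            if child in seen:
                continue
            if judge(child):
                return child
            seen.add(child)
            queue.append(child)
    return None


def build_preference_pair(
    response: str,
    judge: Callable[[str], bool],
    refine: Callable[[str], List[str]],
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected), or None when no pair can be formed."""
    if judge(response):
        return None  # already follows the instruction; nothing to contrast
    refined = tree_search_refine(response, judge, refine)
    if refined is None:
        return None
    return refined, response


pair = build_preference_pair("Paris is really the capital", toy_judge, toy_refine)
print(pair)  # ('Paris is the capital.', 'Paris is really the capital')
```

In this toy run the chosen and rejected responses share the same content and differ only in the two edits needed to satisfy the instruction, which is the property the abstract contrasts with pairs built from independently sampled responses.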
