Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Abstract

Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter's out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at https://github.com/dvlab-research/Step-DPO.
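As a rough illustration of what treating a single reasoning step as the unit of preference optimization could look like, the sketch below applies the standard DPO objective to one pair of candidate next steps that share the same prompt and the same preceding steps. It is a minimal sketch under stated assumptions, not the released implementation: the function name step_dpo_loss, the default beta of 0.1, and the idea of passing in pre-computed summed log-probabilities for each step are all hypothetical choices made for this example.

```python
import math


def step_dpo_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
    """DPO-style loss for one step-wise preference pair (illustrative only).

    Each argument is the summed token log-probability of a single candidate
    reasoning step, conditioned on the prompt and the preceding steps:
      policy_win / policy_lose: preferred / rejected step under the policy
      ref_win / ref_lose:       the same two steps under the frozen reference
    beta is an assumed temperature, not a value taken from the abstract.
    """
    # Implicit reward margin between the preferred and the rejected step.
    margin = beta * ((policy_win - ref_win) - (policy_lose - ref_lose))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Only the log-probabilities of the two competing steps enter the loss, so the shared prefix of correct steps contributes context but no gradient signal of its own. When the policy matches the reference on both steps the margin is zero and the loss equals log 2; it falls as the policy shifts probability toward the preferred step.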
