Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
June 26, 2024
Authors: Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia
cs.AI
Abstract
Mathematical reasoning presents a significant challenge for Large Language
Models (LLMs) due to the extensive and precise chain of reasoning required for
accuracy. Ensuring the correctness of each reasoning step is critical. To
address this, we aim to enhance the robustness and factuality of LLMs by
learning from human feedback. However, Direct Preference Optimization (DPO) has
shown limited benefits for long-chain mathematical reasoning, as models
employing DPO struggle to identify detailed errors in incorrect answers. This
limitation stems from a lack of fine-grained process supervision. We propose a
simple, effective, and data-efficient method called Step-DPO, which treats
individual reasoning steps as units for preference optimization rather than
evaluating answers holistically. Additionally, we have developed a data
construction pipeline for Step-DPO, enabling the creation of a high-quality
dataset containing 10K step-wise preference pairs. We also observe that in DPO,
self-generated data is more effective than data generated by humans or GPT-4,
due to the latter's out-of-distribution nature. Our findings demonstrate that
as few as 10K preference data pairs and fewer than 500 Step-DPO training steps
can yield a nearly 3% gain in accuracy on MATH for models with over 70B
parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves
scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively,
surpassing a series of closed-source models, including GPT-4-1106,
Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at
https://github.com/dvlab-research/Step-DPO.
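As a rough illustration of the idea described in the abstract, the sketch below applies the standard DPO objective at the level of a single reasoning step rather than a whole answer. The function name step_dpo_loss, the tensor layout, and the default beta are illustrative assumptions, not the authors' released implementation; the inputs are assumed to be summed log-probabilities of the preferred and dispreferred step, conditioned on the prompt and the preceding correct steps, under the policy and a frozen reference model.

    # Minimal sketch (not the authors' released code): the DPO objective applied
    # to a single reasoning step instead of a complete answer.
    import torch
    import torch.nn.functional as F

    def step_dpo_loss(policy_chosen_logps: torch.Tensor,
                      policy_rejected_logps: torch.Tensor,
                      ref_chosen_logps: torch.Tensor,
                      ref_rejected_logps: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
        """Step-wise preference loss over a batch of step preference pairs.

        Each tensor has shape (batch,), holding log pi(step | prompt, preceding
        correct steps) summed over the tokens of that single step.
        """
        # Log-ratio of policy to reference for the preferred and dispreferred step.
        chosen_logratio = policy_chosen_logps - ref_chosen_logps
        rejected_logratio = policy_rejected_logps - ref_rejected_logps

        # Encourage a positive margin between the two log-ratios, scaled by beta.
        logits = beta * (chosen_logratio - rejected_logratio)
        return -F.logsigmoid(logits).mean()

    if __name__ == "__main__":
        # Toy usage with random log-probabilities standing in for model outputs.
        b = 4
        loss = step_dpo_loss(torch.randn(b), torch.randn(b),
                             torch.randn(b), torch.randn(b))
        print(loss.item())

In this reading, the only difference from vanilla DPO is the unit being contrasted: the preference pair is a correct versus an erroneous next step sharing the same correct prefix, which is what supplies the fine-grained process supervision the abstract argues holistic answer-level DPO lacks.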