Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
June 26, 2024
Authors: Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia
cs.AI
Abstract
Mathematical reasoning presents a significant challenge for Large Language
Models (LLMs) due to the extensive and precise chain of reasoning required for
accuracy. Ensuring the correctness of each reasoning step is critical. To
address this, we aim to enhance the robustness and factuality of LLMs by
learning from human feedback. However, Direct Preference Optimization (DPO) has
shown limited benefits for long-chain mathematical reasoning, as models
employing DPO struggle to identify detailed errors in incorrect answers. This
limitation stems from a lack of fine-grained process supervision. We propose a
simple, effective, and data-efficient method called Step-DPO, which treats
individual reasoning steps as units for preference optimization rather than
evaluating answers holistically. Additionally, we have developed a data
construction pipeline for Step-DPO, enabling the creation of a high-quality
dataset containing 10K step-wise preference pairs. We also observe that in DPO,
self-generated data is more effective than data generated by humans or GPT-4,
due to the latter's out-of-distribution nature. Our findings demonstrate that
as few as 10K preference data pairs and fewer than 500 Step-DPO training steps
can yield a nearly 3% gain in accuracy on MATH for models with over 70B
parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves
scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively,
surpassing a series of closed-source models, including GPT-4-1106,
Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at
https://github.com/dvlab-research/Step-DPO.
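The abstract describes preference optimization at the granularity of individual reasoning steps rather than whole answers. The sketch below is only an illustration of how such a step-level DPO loss could be computed, assuming it mirrors the standard DPO objective with each preference pair consisting of a correct step and an erroneous step conditioned on the same prompt and the shared prefix of preceding correct steps; the function name, signature, and beta value are illustrative assumptions, not taken from the paper or its released code.

import torch
import torch.nn.functional as F

def step_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of a step-level DPO loss (hypothetical helper, not the official API).

    Each *_logps tensor holds the summed log-probability of a single reasoning
    step (the preferred/correct step or the dispreferred/erroneous step),
    conditioned on the same prompt plus the shared prefix of preceding correct
    steps, under either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratio of policy to reference for each step.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO logistic loss, applied with one reasoning step as the unit.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example usage with dummy log-probabilities for a batch of two step pairs:
# loss = step_dpo_loss(torch.tensor([-5.0, -4.2]), torch.tensor([-6.1, -5.7]),
#                      torch.tensor([-5.3, -4.5]), torch.tensor([-5.9, -5.4]))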