Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

June 30, 2024
Authors: Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan
cs.AI

Abstract

Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step. By applying these samples in DPO training, SCDPO can better align the model to understand reasoning errors and output accurate reasoning steps. We apply SCDPO to both code-integrated and chain-of-thought solutions, empirically showing that it consistently improves the performance compared to naive DPO on three different SFT models, including one existing SFT model and two models we finetuned. Qualitative analysis of the credit assignment of SCDPO and DPO demonstrates the effectiveness of SCDPO at identifying errors in mathematical solutions. We then apply SCDPO to an InternLM2-20B model, resulting in a 20B model that achieves high scores of 88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs, showing the great potential of our method.
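
The core data-construction idea, producing a rejected rationale that diverges from a correct one only after a chosen step, can be illustrated with a short sketch. The following Python outline is hypothetical: the helpers sample_solution, sample_continuation, and is_correct, and the temperature values, are assumptions for illustration, not the authors' released implementation. The resulting (prompt, chosen, rejected) triples would then be trained with the standard DPO objective, -log sigma(beta * log pi_theta(y_w|x)/pi_ref(y_w|x) - beta * log pi_theta(y_l|x)/pi_ref(y_l|x)).

```python
import random

def build_scdpo_pair(model, problem, reference_answer,
                     sample_solution, sample_continuation, is_correct,
                     max_tries=8):
    """Hypothetical sketch: build one (chosen, rejected) pair whose rejected
    rationale starts going wrong no earlier than a chosen step."""
    # 1. Sample a correct step-by-step solution to serve as the chosen sample.
    chosen = None
    for _ in range(max_tries):
        candidate = sample_solution(model, problem, temperature=0.7)
        if is_correct(candidate, reference_answer):
            chosen = candidate
            break
    if chosen is None:
        return None  # skip problems the model cannot solve

    steps = chosen.split("\n")  # treat each line as one reasoning step
    if len(steps) < 2:
        return None

    # 2. Keep the first k steps of the correct solution and resample the
    #    continuation at a higher temperature until the final answer is wrong.
    for _ in range(max_tries):
        k = random.randrange(1, len(steps))
        prefix = "\n".join(steps[:k])
        continuation = sample_continuation(model, problem, prefix,
                                           temperature=1.1)
        rejected = prefix + "\n" + continuation
        if not is_correct(rejected, reference_answer):
            # The shared prefix is correct, so the error is localized to the
            # resampled continuation, giving step-level control over where
            # the negative sample first goes wrong.
            return {"prompt": problem, "chosen": chosen, "rejected": rejected}
    return None
```

Pairs produced this way over a math SFT dataset could be fed to any standard DPO trainer; the step-controlled construction is what supplies the stepwise error supervision described above.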
