Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
June 30, 2024
Authors: Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan
cs.AI
Abstract
Direct Preference Optimization (DPO) has proven effective at improving the
performance of large language models (LLMs) on downstream tasks such as
reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO),
a method for automatically providing stepwise error supervision by creating
negative samples of mathematical reasoning rationales that start making errors
at a specified step. By applying these samples in DPO training, SCDPO can
better align the model to understand reasoning errors and output accurate
reasoning steps. We apply SCDPO to both code-integrated and chain-of-thought
solutions, empirically showing that it consistently improves the performance
compared to naive DPO on three different SFT models, including one existing SFT
model and two models we finetuned. Qualitative analysis of the credit
assignment of SCDPO and DPO demonstrates the effectiveness of SCDPO at
identifying errors in mathematical solutions. We then apply SCDPO to an
InternLM2-20B model, resulting in a 20B model that achieves high scores of
88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs, showing
the great potential of our method.
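
The negative-sample construction described in the abstract can be pictured with a short sketch. The code below is a minimal illustration under assumed interfaces, not the authors' released implementation: generate_steps (sampling a rationale as a list of steps, optionally continuing a given prefix) and is_correct (checking a final answer against the reference) are hypothetical helpers, and the temperature values and retry budget are placeholders.

import random

def make_scdpo_pair(model, problem, reference_answer,
                    low_temp=0.2, high_temp=1.1, max_tries=8):
    """Build one (chosen, rejected) pair whose rejected rationale first goes wrong at a sampled step."""
    # Sample a correct rationale at low temperature; this becomes the chosen sample.
    chosen_steps = generate_steps(model, problem, temperature=low_temp)
    if not chosen_steps or not is_correct(chosen_steps[-1], reference_answer):
        return None  # skip problems the model cannot solve correctly

    # Pick the step index k at which the rejected rationale should start making errors.
    k = random.randrange(len(chosen_steps))
    prefix = chosen_steps[:k]  # correct prefix shared by both samples

    # Re-sample the continuation at a higher temperature until the final answer is wrong,
    # so the first error appears at or after step k.
    for _ in range(max_tries):
        continuation = generate_steps(model, problem, prefix=prefix, temperature=high_temp)
        rejected_steps = prefix + continuation
        if not is_correct(rejected_steps[-1], reference_answer):
            return {"prompt": problem,
                    "chosen": "\n".join(chosen_steps),
                    "rejected": "\n".join(rejected_steps)}
    return None  # no error could be induced from step k within the budget

The resulting (chosen, rejected) pairs can then be used with the standard DPO objective, L = -E[log sigma(beta * log(pi_theta(y_w|x) / pi_ref(y_w|x)) - beta * log(pi_theta(y_l|x) / pi_ref(y_l|x)))], so that the preference signal concentrates on the step where the rejected rationale diverges from the correct one.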