Iterative Reasoning Preference Optimization
April 30, 2024
Authors: Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston
cs.AI
Abstract
Iterative preference optimization methods have recently been shown to perform
well for general instruction tuning tasks, but typically make little
improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). In this
work we develop an iterative approach that optimizes the preference between
competing generated Chain-of-Thought (CoT) candidates by optimizing for winning
vs. losing reasoning steps that lead to the correct answer. We train using a
modified DPO loss (Rafailov et al., 2023) with an additional negative
log-likelihood term, which we find to be crucial. We show reasoning improves
across repeated iterations of this scheme. While only relying on examples in
the training set, our approach results in increasing accuracy for
Llama-2-70B-Chat from 55.6% to 81.6% on GSM8K (and 88.7% with majority voting
out of 32 samples), from 12.5% to 20.8% on MATH, and from 77.8% to 86.7% on
ARC-Challenge, which outperforms other Llama-2-based models not relying on
additionally sourced datasets.
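
The abstract names the training ingredients only at a high level. The PyTorch-style sketch below illustrates, under assumptions, how correct and incorrect Chain-of-Thought candidates might be paired by answer correctness, and how a DPO loss with an additional negative log-likelihood term on the winning sequence could be computed. The function names, data layout, length normalisation, and the beta/alpha values are illustrative assumptions, not details taken from the paper.

# Minimal sketch (not the authors' code) of the two ingredients the abstract
# describes: (1) building chosen/rejected pairs from generated CoT candidates
# by checking the final answer, and (2) a DPO loss with an added NLL term on
# the winning sequence. Hyperparameters and normalisation are assumptions.
import torch
import torch.nn.functional as F


def build_preference_pairs(candidates, gold_answer, max_pairs=10):
    """Pair correct CoT candidates (chosen) with incorrect ones (rejected).

    Each candidate is assumed to be a dict like {"cot": str, "answer": str};
    correctness is judged only by whether the extracted final answer matches
    the gold label from the training set.
    """
    winners = [c for c in candidates if c["answer"] == gold_answer]
    losers = [c for c in candidates if c["answer"] != gold_answer]
    pairs = [(w, l) for w in winners for l in losers]
    return pairs[:max_pairs]


def dpo_plus_nll_loss(
    policy_chosen_logps,    # summed log-probs of chosen CoT+answer, current model
    policy_rejected_logps,  # summed log-probs of rejected CoT+answer, current model
    ref_chosen_logps,       # same quantities under the previous iteration's model,
    ref_rejected_logps,     # which is kept frozen as the DPO reference
    chosen_lengths,         # token counts of chosen sequences (for NLL normalisation)
    beta=0.1,               # DPO temperature (assumed value)
    alpha=1.0,              # weight on the extra NLL term (assumed value)
):
    """Standard DPO loss plus a negative log-likelihood term on the winner."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    # Extra NLL term on the chosen (correct-answer) sequence, which the
    # abstract reports to be crucial; normalised here by sequence length.
    nll_loss = -policy_chosen_logps / chosen_lengths
    return (dpo_loss + alpha * nll_loss).mean()

In an iterative scheme along these lines, the model trained at iteration t would generate new CoT candidates, the pairs would be rebuilt, and the frozen model from iteration t would serve as the reference for iteration t+1.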