Iterative Reasoning Preference Optimization
April 30, 2024
Authors: Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston
cs.AI
Abstract
Iterative preference optimization methods have recently been shown to perform
well for general instruction tuning tasks, but typically make little
improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). In this
work we develop an iterative approach that optimizes the preference between
competing generated Chain-of-Thought (CoT) candidates by optimizing for winning
vs. losing reasoning steps that lead to the correct answer. We train using a
modified DPO loss (Rafailov et al., 2023) with an additional negative
log-likelihood term, which we find to be crucial. We show reasoning improves
across repeated iterations of this scheme. While only relying on examples in
the training set, our approach results in increasing accuracy for
Llama-2-70B-Chat from 55.6% to 81.6% on GSM8K (and 88.7% with majority voting
out of 32 samples), from 12.5% to 20.8% on MATH, and from 77.8% to 86.7% on
ARC-Challenge, which outperforms other Llama-2-based models not relying on
additionally sourced datasets.
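
The abstract names the training ingredients only at a high level. The PyTorch-style sketch below illustrates, under assumptions, how correct and incorrect Chain-of-Thought candidates might be paired by answer correctness, and how a DPO loss with an additional negative log-likelihood term on the winning sequence could be computed. The function names, data layout, length normalisation, and the beta/alpha values are illustrative assumptions, not details taken from the paper.

# Minimal sketch (not the authors' code) of the two ingredients the abstract
# describes: (1) building chosen/rejected pairs from generated CoT candidates
# by checking the final answer, and (2) a DPO loss with an added NLL term on
# the winning sequence. Hyperparameters and normalisation are assumptions.
import torch
import torch.nn.functional as F


def build_preference_pairs(candidates, gold_answer, max_pairs=10):
    """Pair correct CoT candidates (chosen) with incorrect ones (rejected).

    Each candidate is assumed to be a dict like {"cot": str, "answer": str};
    correctness is judged only by whether the extracted final answer matches
    the gold label from the training set.
    """
    winners = [c for c in candidates if c["answer"] == gold_answer]
    losers = [c for c in candidates if c["answer"] != gold_answer]
    pairs = [(w, l) for w in winners for l in losers]
    return pairs[:max_pairs]


def dpo_plus_nll_loss(
    policy_chosen_logps,    # summed log-probs of chosen CoT+answer, current model
    policy_rejected_logps,  # summed log-probs of rejected CoT+answer, current model
    ref_chosen_logps,       # same quantities under the previous iteration's model,
    ref_rejected_logps,     # which is kept frozen as the DPO reference
    chosen_lengths,         # token counts of chosen sequences (for NLL normalisation)
    beta=0.1,               # DPO temperature (assumed value)
    alpha=1.0,              # weight on the extra NLL term (assumed value)
):
    """Standard DPO loss plus a negative log-likelihood term on the winner."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    # Extra NLL term on the chosen (correct-answer) sequence, which the
    # abstract reports to be crucial; normalised here by sequence length.
    nll_loss = -policy_chosen_logps / chosen_lengths
    return (dpo_loss + alpha * nll_loss).mean()

In an iterative scheme along these lines, the model trained at iteration t would generate new CoT candidates, the pairs would be rebuilt, and the frozen model from iteration t would serve as the reference for iteration t+1.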