Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning
July 25, 2024
Authors: Tianduo Wang, Shichen Li, Wei Lu
cs.AI
Abstract
Effective training of language models (LMs) for mathematical reasoning tasks
demands high-quality supervised fine-tuning data. Besides obtaining annotations
from human experts, a common alternative is sampling from larger and more
powerful LMs. However, this knowledge distillation approach can be costly and
unstable, particularly when relying on closed-source, proprietary LMs like
GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate
that the reasoning abilities of small-scale LMs can be enhanced through
self-training, a process where models learn from their own outputs. We also
show that the conventional self-training can be further augmented by a
preference learning algorithm called Direct Preference Optimization (DPO). By
integrating DPO into self-training, we leverage preference data to guide LMs
towards more accurate and diverse chain-of-thought reasoning. We evaluate our
method across various mathematical reasoning tasks using different base models.
Our experiments show that this approach not only improves LMs' reasoning
performance but also offers a more cost-effective and scalable solution
compared to relying on large proprietary LMs.
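This listing carries only the abstract, not the authors' code. As a rough, illustrative sketch of the idea the abstract describes (sampling the model's own chain-of-thought outputs, pairing correct against incorrect rationales, and applying the DPO objective to the resulting preference pairs), the snippet below shows one way this could look in PyTorch. The function names (`build_preference_pairs`, `dpo_loss`), the all-pairs construction, and the value `beta=0.1` are assumptions for illustration only, not the paper's released implementation. In practice, the log-probabilities would be summed over response tokens under the policy being trained and a frozen reference model (e.g., the SFT checkpoint).

```python
import itertools
import torch
import torch.nn.functional as F


def build_preference_pairs(samples):
    """Pair self-generated rationales into (chosen, rejected) preference data.

    `samples` is a list of (rationale, is_correct) tuples obtained by sampling
    the model's own chain-of-thought outputs for one question and checking the
    final answer. As a simple heuristic (an assumption, not the paper's recipe),
    every correct rationale is preferred over every incorrect one.
    """
    correct = [r for r, ok in samples if ok]
    incorrect = [r for r, ok in samples if not ok]
    return list(itertools.product(correct, incorrect))


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on summed response log-probabilities.

    `beta` controls how far the policy may drift from the reference model.
    """
    # Implicit rewards: log-ratio of policy to reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss that widens the margin between chosen and rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy usage with hypothetical rationales and dummy log-probabilities.
pairs = build_preference_pairs([
    ("CoT A ... answer 42", True),
    ("CoT B ... answer 41", False),
    ("CoT C ... answer 42", True),
])
print(len(pairs))  # 2 (chosen, rejected) pairs

policy_chosen = torch.tensor([-12.0, -15.5])
policy_rejected = torch.tensor([-14.0, -16.0])
ref_chosen = torch.tensor([-13.0, -15.0])
ref_rejected = torch.tensor([-13.5, -15.8])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```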