Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning
July 25, 2024
Authors: Tianduo Wang, Shichen Li, Wei Lu
cs.AI
Abstract
Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger and more powerful LMs. However, this knowledge distillation approach can be costly and unstable, particularly when relying on closed-source, proprietary LMs like GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate that the reasoning abilities of small-scale LMs can be enhanced through self-training, a process where models learn from their own outputs. We also show that conventional self-training can be further augmented by a preference learning algorithm called Direct Preference Optimization (DPO). By integrating DPO into self-training, we leverage preference data to guide LMs towards more accurate and diverse chain-of-thought reasoning. We evaluate our method across various mathematical reasoning tasks using different base models. Our experiments show that this approach not only improves LMs' reasoning performance but also offers a more cost-effective and scalable solution compared to relying on large proprietary LMs.
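To make the two components the abstract names more concrete, the following is a minimal sketch, not the paper's implementation, of how the standard DPO objective might be combined with a self-training round that filters the model's own chain-of-thought samples by answer correctness. The function names, the sampling count, and the `sample_cot` / `check_answer` helpers are illustrative assumptions.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO objective (Rafailov et al., 2023): push the policy to prefer
    # the chosen chain-of-thought over the rejected one, measured relative to a
    # frozen reference model. Arguments are per-sequence summed log-probs (tensors).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def self_training_round(model, problems, sample_cot, check_answer, n_samples=8):
    # One hypothetical round of DPO-augmented self-training: sample rationales
    # from the model itself, keep answer-verified ones as SFT targets, and pair
    # correct vs. incorrect rationales on the same problem as preference data.
    sft_data, preference_pairs = [], []
    for problem in problems:
        rationales = sample_cot(model, problem, n=n_samples)  # model's own outputs
        correct = [r for r in rationales if check_answer(problem, r)]
        wrong = [r for r in rationales if not check_answer(problem, r)]
        sft_data += [(problem, r) for r in correct]
        preference_pairs += [(problem, c, w) for c in correct for w in wrong]
    return sft_data, preference_pairs
```

In such a pipeline, `sft_data` would drive a supervised fine-tuning step and `preference_pairs` the DPO step, after which the updated model generates the next round's samples; the paper's actual pair construction and training schedule may differ.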