Iterative Reasoning Preference Optimization

Abstract

Iterative preference optimization methods have recently been shown to perform well on general instruction tuning tasks, but they typically yield little improvement on reasoning tasks (Yuan et al., 2024; Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show that reasoning improves across repeated iterations of this scheme. While relying only on examples in the training set, our approach increases the accuracy of Llama-2-70B-Chat from 55.6% to 81.6% on GSM8K (88.7% with majority voting over 32 samples), from 12.5% to 20.8% on MATH, and from 77.8% to 86.7% on ARC-Challenge, outperforming other Llama-2-based models that do not rely on additionally sourced datasets.
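The abstract names a DPO loss with an added negative log-likelihood term but does not give its exact form. As a rough illustration only, the sketch below combines the standard DPO objective on a winning/losing pair with a length-normalized NLL term on the winning sequence. The function name `dpo_nll_loss`, the weights `beta` and `alpha`, their default values, and the choice to normalize by token count are all assumptions made for this example, not details taken from the text above.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_nll_loss(
    policy_logp_win: float,   # log-prob of the winning CoT + answer under the model being trained
    policy_logp_lose: float,  # log-prob of the losing CoT + answer under the model being trained
    ref_logp_win: float,      # same sequences scored by the frozen reference model
    ref_logp_lose: float,
    win_num_tokens: int,      # token count of the winning sequence (assumed normalizer)
    beta: float = 0.1,        # hypothetical DPO temperature
    alpha: float = 1.0,       # hypothetical weight on the NLL term
) -> float:
    # Standard DPO term: push the policy/reference log-ratio of the winner
    # above that of the loser.
    margin = beta * (
        (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    )
    dpo_term = -log_sigmoid(margin)

    # Additional NLL term on the winning sequence, normalized by its length
    # (the normalization is an assumption of this sketch).
    nll_term = -policy_logp_win / win_num_tokens

    return dpo_term + alpha * nll_term


if __name__ == "__main__":
    # Policy identical to the reference: the DPO margin is zero.
    print(dpo_nll_loss(-10.0, -10.0, -10.0, -10.0, win_num_tokens=5))
    # Policy prefers the winner more than the reference does: lower loss.
    print(dpo_nll_loss(-8.0, -12.0, -10.0, -10.0, win_num_tokens=5))
```

In this form the NLL term keeps raising the likelihood of the winning sequence itself, whereas the DPO term alone only constrains the gap between winner and loser.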

