Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Abstract

Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) because accuracy depends on a long and precise chain of reasoning. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models trained with DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter's out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at https://github.com/dvlab-research/Step-DPO.
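The abstract describes the idea of treating a single reasoning step, rather than a whole answer, as the unit of preference optimization, but it does not spell out the objective. The sketch below is one plausible reading, not the authors' implementation: it assumes the standard DPO loss, applied to a preferred and a rejected next step that share the same prompt and the same preceding correct steps. The names `StepPreferencePair`, `step_dpo_loss`, and the default `beta=0.1` are hypothetical, and the log-probabilities are assumed to be computed elsewhere by a policy model and a frozen reference model.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepPreferencePair:
    """Hypothetical container for one step-wise preference pair."""
    prompt: str
    prior_steps: List[str] = field(default_factory=list)  # steps assumed correct so far
    chosen_step: str = ""    # preferred next step
    rejected_step: str = ""  # erroneous next step


def step_dpo_loss(logp_chosen: float, logp_rejected: float,
                  ref_logp_chosen: float, ref_logp_rejected: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss, assumed here to be applied at the step level.

    Each log-probability is the summed token log-likelihood of one step,
    conditioned on the prompt plus the shared prior steps.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Under this reading, the only difference from answer-level DPO is what gets scored: the two candidates diverge at a single step, so the training signal points at the first error instead of being spread over the whole solution.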
