Self-Play Preference Optimization for Language Model Alignment

Abstract

Traditional reinforcement learning from human feedback (RLHF) approaches that rely on parametric models such as the Bradley-Terry model fall short in capturing the intransitivity and irrationality of human preferences. Recent advances suggest that working directly with preference probabilities reflects human preferences more accurately and enables more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment that treats the problem as a constant-sum two-player game and aims to identify the Nash equilibrium policy. Our approach, dubbed Self-Play Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by symmetric pairwise losses such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). In our experiments, SPPO fine-tunes Mistral-7B-Instruct-v0.2 using only 60k prompts (without responses) from the UltraFeedback dataset, no prompt augmentation, and the pre-trained preference model PairRM, which has only 0.4B parameters. The resulting model achieves a state-of-the-art length-controlled win rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.
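To make the game-theoretic framing concrete, the following toy sketch approximates the Nash equilibrium of a small constant-sum preference game by iterated self-play. It is an illustration under stated assumptions, not the SPPO training objective: the abstract does not spell out the update rule, so a standard multiplicative-weights update is used as a stand-in. The policy is a plain probability vector over three candidate responses rather than a language model, and the preference matrix `P`, the step size `eta`, and all function names are hypothetical. The preferences are deliberately intransitive (each response beats one other and loses to the third), which is the kind of structure a single Bradley-Terry score per response cannot represent.

```python
import math


def win_prob_vs_policy(P, policy):
    """For each response y, the probability that y is preferred over a
    response sampled from `policy`: sum_j P[y][j] * policy[j]."""
    return [sum(p_yj * q for p_yj, q in zip(row, policy)) for row in P]


def self_play_step(P, policy, eta):
    """One multiplicative-weights update against the current policy itself
    (assumed stand-in update, not the SPPO loss)."""
    wins = win_prob_vs_policy(P, policy)
    weights = [p * math.exp(eta * w) for p, w in zip(policy, wins)]
    total = sum(weights)
    return [w / total for w in weights]


def approximate_nash(P, policy, eta=0.1, num_iters=2000):
    """Run self-play and return the average of the visited policies,
    which approximates a Nash equilibrium of the constant-sum game."""
    avg = [0.0] * len(policy)
    for _ in range(num_iters):
        avg = [a + p / num_iters for a, p in zip(avg, policy)]
        policy = self_play_step(P, policy, eta)
    return avg


def exploitability(P, policy):
    """How far the best single response beats `policy` above the 0.5 value
    of the game; it is zero exactly at a Nash equilibrium."""
    return max(win_prob_vs_policy(P, policy)) - 0.5


# Hypothetical intransitive preferences: P[i][j] is the probability that
# response i is preferred to response j, with P[i][j] + P[j][i] = 1.
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]
start = [0.7, 0.2, 0.1]
nash = approximate_nash(P, start)
```

The averaged policy is returned rather than the last iterate because, in cyclic games like this one, multiplicative-weights iterates can keep circling the equilibrium while their average settles near it. Comparing `exploitability(P, start)` with `exploitability(P, nash)` shows how much harder the averaged policy is to beat.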
