Self-Play Preference Optimization for Language Model Alignment

May 1, 2024
Authors: Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu
cs.AI

Abstract

Traditional reinforcement learning from human feedback (RLHF) approaches, which rely on parametric models such as the Bradley-Terry model, fall short of capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that working directly with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed Self-Play Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by symmetric pairwise losses such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset, without any prompt augmentation, and leveraging only the 0.4B-parameter pre-trained preference model PairRM, SPPO fine-tunes Mistral-7B-Instruct-v0.2 into a model that achieves a state-of-the-art length-controlled win rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.
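
To make the mechanism in the abstract concrete, the sketch below regresses the policy's per-response log-probability ratio toward a target derived from an estimated preference probability (e.g., from a small preference model such as PairRM), so that preferred responses have their log-likelihood pushed up and dispreferred ones pushed down, independently rather than through a symmetric pairwise loss. This is a minimal, hypothetical sketch of that reading of the abstract, not the authors' reference implementation; the function name, the eta scaling, and the toy numbers are assumptions.

```python
# Hypothetical sketch of an SPPO-style per-response objective:
# regress log pi_theta(y|x) - log pi_t(y|x) toward eta * (P(y wins | x) - 1/2),
# where P(y wins | x) is an estimated preference probability against the
# current policy (e.g., produced by a small preference model such as PairRM).
# Names and constants are illustrative, not the authors' reference code.
import torch


def sppo_style_loss(logp_theta: torch.Tensor,
                    logp_prev: torch.Tensor,
                    pref_prob: torch.Tensor,
                    eta: float = 1.0) -> torch.Tensor:
    """Squared-error loss over a batch of (prompt, response) pairs.

    logp_theta: log pi_theta(y|x) under the policy being trained.
    logp_prev:  log pi_t(y|x) under the frozen policy from the previous iteration.
    pref_prob:  estimated probability that y beats the current policy on x.
    eta:        target scaling; responses with pref_prob > 0.5 get a positive
                target (log-likelihood raised), others a negative one.
    """
    log_ratio = logp_theta - logp_prev
    target = eta * (pref_prob - 0.5)
    return ((log_ratio - target) ** 2).mean()


# Toy example: one preferred (0.8) and one dispreferred (0.2) response.
logp_theta = torch.tensor([-12.0, -15.0], requires_grad=True)
logp_prev = torch.tensor([-12.5, -14.0])
pref_prob = torch.tensor([0.8, 0.2])

loss = sppo_style_loss(logp_theta, logp_prev, pref_prob, eta=5.0)
loss.backward()
# A descent step on this gradient raises the log-likelihood of the first
# response and lowers that of the second.
print(float(loss), logp_theta.grad)
```

Because each response is scored against its own preference-probability target rather than only relative to a paired alternative, the chosen and rejected log-likelihoods can move in opposite directions, which is the behavior the abstract contrasts with DPO- and IPO-style pairwise losses.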
