Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
June 30, 2024
作者: Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu
cs.AI
Abstract
Reinforcement Learning with Human Feedback (RLHF) has achieved great success
in aligning large language models (LLMs) with human preferences. Prevalent RLHF
approaches are reward-based, following the Bradley-Terry (BT) model assumption,
which may not fully capture the complexity of human preferences. In this paper,
we explore RLHF under a general preference framework and approach it from a
game-theoretic perspective. Specifically, we formulate the problem as a
two-player game and propose a novel algorithm, iterative Nash policy
optimization (INPO). The key idea is to let the policy play against itself via
no-regret learning, thereby approximating the Nash policy. Unlike previous
methods, INPO bypasses the need for estimating the expected win rate for
individual responses, which typically incurs high computational or annotation
costs. Instead, we introduce a new loss objective that is directly minimized
over a preference dataset. We provide theoretical analysis for our approach and
demonstrate its effectiveness through experiments on various representative
benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5%
length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on
Arena-Hard, showing substantial improvement over the state-of-the-art iterative
algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our
ablation study highlights the benefits of incorporating KL regularization for
response length control.
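For readers unfamiliar with the game-theoretic framing, the "two-player game" mentioned in the abstract is commonly written as a regularized minimax problem over a general preference oracle. The sketch below follows that general-preference RLHF literature rather than this paper's own notation; τ, π_ref, and ρ are our notational assumptions for the KL strength, reference policy, and prompt distribution.

```latex
% One common way to write the regularized preference game whose Nash policy
% self-play methods such as INPO aim to approximate (notation is ours).
\[
\pi^{*} \;=\; \arg\max_{\pi}\,\min_{\pi'}\;
\mathbb{E}_{x\sim\rho,\; y\sim\pi(\cdot\mid x),\; y'\sim\pi'(\cdot\mid x)}
\bigl[\mathbb{P}(y \succ y' \mid x)\bigr]
\;-\;\tau\,\mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right)
\;+\;\tau\,\mathrm{KL}\!\left(\pi' \,\|\, \pi_{\mathrm{ref}}\right).
\]
```

The abstract also states that INPO minimizes a loss directly over a preference dataset instead of estimating per-response win rates. As an illustration only, here is a minimal DPO-style self-play loss in PyTorch that regularizes toward the frozen previous-round policy π_t; it is a stand-in sketch under our own assumptions (the name self_play_pref_loss, beta, and the logp_* arguments are hypothetical), not the exact INPO objective from the paper.

```python
# Illustrative sketch only: a DPO-style pairwise loss with an implicit KL
# regularizer toward the frozen previous-round policy pi_t (the self-play
# opponent/reference). This is NOT the exact INPO objective from the paper;
# the function name, `beta`, and the logp_* arguments are our own assumptions.
import torch
import torch.nn.functional as F


def self_play_pref_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise loss over (prompt, chosen, rejected) preference triples.

    logp_w, logp_l         -- summed log-probs of the chosen/rejected responses
                              under the policy currently being trained.
    ref_logp_w, ref_logp_l -- the same quantities under the frozen
                              previous-round policy pi_t (the KL anchor).
    beta                   -- strength of the implicit KL regularization.
    """
    # Log-ratio margin relative to the previous-round (self-play) policy.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Logistic loss on the margin: minimizing it makes the new policy prefer
    # the chosen response while staying close to pi_t.
    return -F.logsigmoid(margin).mean()


if __name__ == "__main__":
    # Toy usage with random log-probabilities for a batch of 4 preference pairs.
    lw, ll, rw, rl = (torch.randn(4) for _ in range(4))
    print(self_play_pref_loss(lw, ll, rw, rl).item())
```

In an iterative scheme like the one described above, pi_t would be refreshed each round with the newly trained policy before collecting the next batch of preference pairs.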