Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
June 30, 2024
Authors: Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu
cs.AI
Abstract
Reinforcement Learning with Human Feedback (RLHF) has achieved great success
in aligning large language models (LLMs) with human preferences. Prevalent RLHF
approaches are reward-based, following the Bradley-Terry (BT) model assumption,
which may not fully capture the complexity of human preferences. In this paper,
we explore RLHF under a general preference framework and approach it from a
game-theoretic perspective. Specifically, we formulate the problem as a
two-player game and propose a novel algorithm, iterative Nash policy
optimization (INPO). The key idea is to let the policy play against itself via
no-regret learning, thereby approximating the Nash policy. Unlike previous
methods, INPO bypasses the need for estimating the expected win rate for
individual responses, which typically incurs high computational or annotation
costs. Instead, we introduce a new loss objective that is directly minimized
over a preference dataset. We provide theoretical analysis for our approach and
demonstrate its effectiveness through experiments on various representative
benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5%
length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on
Arena-Hard, showing substantial improvement over the state-of-the-art iterative
algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our
ablation study highlights the benefits of incorporating KL regularization for
response length control.
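
For readers unfamiliar with the game-theoretic setup, the two-player game referenced in the abstract is typically the KL-regularized preference game used in the general-preference RLHF literature; the paper's exact regularization and notation may differ in details. The Nash policy is the symmetric equilibrium of

\max_{\pi} \min_{\pi'} \; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)} \big[ \mathbb{P}(y \succ y' \mid x) \big] \;-\; \tau\, \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) \;+\; \tau\, \mathrm{KL}(\pi' \,\|\, \pi_{\mathrm{ref}}),

where \mathbb{P}(y \succ y' \mid x) is the general preference probability, \pi_{\mathrm{ref}} is the reference (SFT) policy, and \tau controls the KL regularization mentioned in the ablation on response length.

The "loss objective that is directly minimized over a preference dataset" can be pictured as a pairwise log-ratio loss on self-play preference pairs. The Python sketch below is only illustrative, assuming an IPO-style squared loss with a placeholder target constant; it is not the paper's exact INPO objective, and the function name and arguments are hypothetical.

# Illustrative sketch (assumption): a pairwise log-ratio squared loss in the spirit
# of the abstract's description; NOT the exact INPO objective from the paper.
import torch

def pairwise_logratio_loss(logp_w, logp_l, prev_logp_w, prev_logp_l,
                           tau=0.1, target=0.5):
    # logp_w / logp_l: log-probabilities of the preferred / dispreferred responses
    # under the policy being trained; prev_logp_*: the same quantities under the
    # previous-iteration policy, which acts as the KL anchor of the self-play update.
    margin = (logp_w - prev_logp_w) - (logp_l - prev_logp_l)
    # Squared loss pulls the tau-scaled margin toward a fixed target (placeholder value).
    return ((tau * margin - target) ** 2).mean()

# Toy usage with random tensors standing in for per-response log-probabilities.
batch = 4
loss = pairwise_logratio_loss(torch.randn(batch), torch.randn(batch),
                              torch.randn(batch), torch.randn(batch))
print(loss.item())

Roughly, each iteration samples two responses per prompt from the current policy, obtains a preference annotation between them, and minimizes a loss of this kind before the next round of self-play, avoiding any per-response win-rate estimation.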