WPO: Enhancing RLHF with Weighted Preference Optimization
June 17, 2024
Authors: Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, Chenguang Zhu
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) is a promising solution to
align large language models (LLMs) more closely with human values. Off-policy
preference optimization, where the preference data is obtained from other
models, is widely adopted due to its cost efficiency and scalability. However,
off-policy preference optimization often suffers from a distributional gap
between the policy used for data collection and the target policy, leading to
suboptimal optimization. In this paper, we propose a novel strategy to mitigate
this problem by simulating on-policy learning with off-policy preference data.
Our Weighted Preference Optimization (WPO) method adapts off-policy data to
resemble on-policy data more closely by reweighting preference pairs according
to their probability under the current policy. This method not only addresses
the distributional gap problem but also enhances the optimization process
without incurring additional costs. We validate our method on instruction
following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only
outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2
but also establishes a remarkable length-controlled winning rate against
GPT-4-turbo of 48.6% based on Llama-3-8B-Instruct, making it the strongest 8B
model on the leaderboard. We will release the code and models at
https://github.com/wzhouad/WPO.
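To make the reweighting idea concrete, below is a minimal PyTorch sketch of a weighted DPO-style objective, where each preference pair's loss is scaled by how probable its responses are under the current policy. The specific weight form (length-normalized sequence probabilities, detached from the gradient) and all function and argument names are illustrative assumptions, not the paper's exact formulation; see the released code at the repository above for the authoritative implementation.

```python
import torch
import torch.nn.functional as F

def wpo_loss_sketch(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    chosen_lengths, rejected_lengths, beta=0.1):
    """Sketch of a weighted preference-optimization loss.

    Each *_logps tensor holds the summed token log-probabilities of the
    chosen / rejected responses under the policy or reference model
    (shape: [batch]). The weighting scheme used here is an assumption
    for illustration only.
    """
    # Standard DPO logits: difference of implicit rewards between
    # the chosen and rejected responses.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    dpo_losses = -F.logsigmoid(logits)

    # Reweight each off-policy preference pair by the (length-normalized)
    # probability of its responses under the *current* policy, so that
    # pairs the policy is likely to generate dominate the update,
    # simulating on-policy learning. Weights are detached so they do not
    # receive gradients.
    w_chosen = torch.exp(policy_chosen_logps / chosen_lengths).detach()
    w_rejected = torch.exp(policy_rejected_logps / rejected_lengths).detach()
    weights = w_chosen * w_rejected

    # Normalize over the batch so the loss scale stays comparable to DPO.
    return (weights * dpo_losses).sum() / weights.sum().clamp_min(1e-8)
```

In this sketch, setting all weights to one recovers the ordinary DPO loss, which is the intended contrast: WPO changes only how much each off-policy pair contributes, not how the per-pair preference loss is computed.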