WPO: Enhancing RLHF with Weighted Preference Optimization
June 17, 2024
作者: Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, Chenguang Zhu
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) is a promising solution to
align large language models (LLMs) more closely with human values. Off-policy
preference optimization, where the preference data is obtained from other
models, is widely adopted due to its cost efficiency and scalability. However,
off-policy preference optimization often suffers from a distributional gap
between the policy used for data collection and the target policy, leading to
suboptimal optimization. In this paper, we propose a novel strategy to mitigate
this problem by simulating on-policy learning with off-policy preference data.
Our Weighted Preference Optimization (WPO) method adapts off-policy data to
resemble on-policy data more closely by reweighting preference pairs according
to their probability under the current policy. This method not only addresses
the distributional gap problem but also enhances the optimization process
without incurring additional costs. We validate our method on instruction
following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only
outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2
but also establishes a remarkable length-controlled winning rate against
GPT-4-turbo of 48.6% based on Llama-3-8B-Instruct, making it the strongest 8B
model on the leaderboard. We will release the code and models at
https://github.com/wzhouad/WPO.
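To make the reweighting idea concrete, below is a minimal sketch of a per-pair weight applied on top of a DPO-style loss. It is an illustration under stated assumptions, not the paper's implementation: the function name `weighted_preference_loss` and the length-normalized form of the weight are hypothetical choices made here for readability; the exact weighting scheme used by WPO is defined in the paper and its released code.

```python
# Minimal sketch (assumption-based): weight each preference pair by how likely the
# current policy is to produce its responses, then average a DPO-style per-pair loss.
import torch
import torch.nn.functional as F


def weighted_preference_loss(
    policy_chosen_logps,    # (B,) summed token log-probs of chosen responses under the current policy
    policy_rejected_logps,  # (B,) same for rejected responses
    ref_chosen_logps,       # (B,) log-probs under the frozen reference policy
    ref_rejected_logps,     # (B,)
    chosen_lens,            # (B,) response token counts, used only by the illustrative weight
    rejected_lens,          # (B,)
    beta: float = 0.1,
):
    # Standard DPO per-pair loss: -log sigmoid(beta * (policy margin - reference margin)).
    logits = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    per_pair_loss = -F.logsigmoid(logits)

    # Illustrative weight (hypothetical form, not the paper's exact choice):
    # length-normalized probability of the pair under the current policy.
    with torch.no_grad():
        weight = torch.exp(
            policy_chosen_logps / chosen_lens + policy_rejected_logps / rejected_lens
        )

    # Pairs the current policy would rarely generate get small weights, so the
    # weighted objective approximates on-policy learning with off-policy data.
    return (weight * per_pair_loss).sum() / weight.sum().clamp_min(1e-8)
```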