Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
June 30, 2024
作者: Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu
cs.AI
Abstract
Reinforcement Learning with Human Feedback (RLHF) has achieved great success
in aligning large language models (LLMs) with human preferences. Prevalent RLHF
approaches are reward-based, following the Bradley-Terry (BT) model assumption,
which may not fully capture the complexity of human preferences. In this paper,
we explore RLHF under a general preference framework and approach it from a
game-theoretic perspective. Specifically, we formulate the problem as a
two-player game and propose a novel algorithm, iterative Nash policy
optimization (INPO). The key idea is to let the policy play against itself via
no-regret learning, thereby approximating the Nash policy. Unlike previous
methods, INPO bypasses the need for estimating the expected win rate for
individual responses, which typically incurs high computational or annotation
costs. Instead, we introduce a new loss objective that is directly minimized
over a preference dataset. We provide theoretical analysis for our approach and
demonstrate its effectiveness through experiments on various representative
benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5%
length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on
Arena-Hard, showing substantial improvement over the state-of-the-art iterative
algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our
ablation study highlights the benefits of incorporating KL regularization for
response length control.
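For readers unfamiliar with the game-theoretic framing, the "two-player game" mentioned in the abstract is commonly written as a regularized minimax problem over a general preference oracle. The sketch below follows that general-preference RLHF literature rather than this paper's own notation; τ, π_ref, and ρ are our notational assumptions for the KL strength, reference policy, and prompt distribution.

```latex
% One common way to write the regularized preference game whose Nash policy
% self-play methods such as INPO aim to approximate (notation is ours).
\[
\pi^{*} \;=\; \arg\max_{\pi}\,\min_{\pi'}\;
\mathbb{E}_{x\sim\rho,\; y\sim\pi(\cdot\mid x),\; y'\sim\pi'(\cdot\mid x)}
\bigl[\mathbb{P}(y \succ y' \mid x)\bigr]
\;-\;\tau\,\mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right)
\;+\;\tau\,\mathrm{KL}\!\left(\pi' \,\|\, \pi_{\mathrm{ref}}\right).
\]
```

The abstract also states that INPO minimizes a loss directly over a preference dataset instead of estimating per-response win rates. As an illustration only, here is a minimal DPO-style self-play loss in PyTorch that regularizes toward the frozen previous-round policy π_t; it is a stand-in sketch under our own assumptions (the name self_play_pref_loss, beta, and the logp_* arguments are hypothetical), not the exact INPO objective from the paper.

```python
# Illustrative sketch only: a DPO-style pairwise loss with an implicit KL
# regularizer toward the frozen previous-round policy pi_t (the self-play
# opponent/reference). This is NOT the exact INPO objective from the paper;
# the function name, `beta`, and the logp_* arguments are our own assumptions.
import torch
import torch.nn.functional as F


def self_play_pref_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise loss over (prompt, chosen, rejected) preference triples.

    logp_w, logp_l         -- summed log-probs of the chosen/rejected responses
                              under the policy currently being trained.
    ref_logp_w, ref_logp_l -- the same quantities under the frozen
                              previous-round policy pi_t (the KL anchor).
    beta                   -- strength of the implicit KL regularization.
    """
    # Log-ratio margin relative to the previous-round (self-play) policy.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Logistic loss on the margin: minimizing it makes the new policy prefer
    # the chosen response while staying close to pi_t.
    return -F.logsigmoid(margin).mean()


if __name__ == "__main__":
    # Toy usage with random log-probabilities for a batch of 4 preference pairs.
    lw, ll, rw, rl = (torch.randn(4) for _ in range(4))
    print(self_play_pref_loss(lw, ll, rw, rl).item())
```

In an iterative scheme like the one described above, pi_t would be refreshed each round with the newly trained policy before collecting the next batch of preference pairs.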