Nash Learning from Human Feedback

December 1, 2023
作者: Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mésnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, Bilal Piot
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of an LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.
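
The abstract describes Nash-MD only at a high level. The following is a minimal, illustrative tabular sketch of a mirror-descent-style update in the spirit of Nash-MD, not the paper's exact algorithm: the pairwise preference matrix P, the reference policy mu, and the step-size and regularization parameters eta and tau are hypothetical placeholders chosen for the example.

import numpy as np

def nash_md_sketch(P, mu, eta=0.1, tau=0.1, n_iters=500):
    """Illustrative tabular sketch of a Nash-MD-style update.

    P[i, j]  : probability that response i is preferred to response j
               (standing in for a learned pairwise preference model).
    mu       : reference policy used for regularization toward the initial model.
    eta, tau : learning rate and regularization strength (hypothetical values).
    """
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)              # start from the uniform policy
    for _ in range(n_iters):
        # Geometric mixture of the current policy and the reference policy.
        mix = pi ** (1.0 - eta * tau) * mu ** (eta * tau)
        mix /= mix.sum()
        # Expected preference of each response against the mixture policy.
        pref_vs_mix = P @ mix
        # Multiplicative (mirror-descent) step toward more-preferred responses.
        pi = mix * np.exp(eta * pref_vs_mix)
        pi /= pi.sum()
    return pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random preference matrix with P[i, j] + P[j, i] = 1 and P[i, i] = 0.5.
    A = rng.random((5, 5))
    P = A / (A + A.T)
    mu = np.full(5, 0.2)                  # uniform reference policy
    print(nash_md_sketch(P, mu))

In this sketch, the geometric mixture with the reference policy plays the role of the regularization toward the initial model, and the multiplicative update pushes the policy toward responses that the preference model rates above that mixture.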