Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
April 4, 2024
Authors: Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, Tengyang Xie
cs.AI
Abstract
This paper studies post-training large language models (LLMs) using
preference feedback from a powerful oracle to help a model iteratively improve
over itself. The typical approach for post-training LLMs involves Reinforcement
Learning from Human Feedback (RLHF), which traditionally separates reward
learning and subsequent policy optimization. However, such a reward
maximization approach is limited by the nature of "point-wise" rewards (such as
Bradley-Terry model), which fails to express complex intransitive or cyclic
preference relations. While advances on RLHF show reward learning and policy
optimization can be merged into a single contrastive objective for stability,
they still remain tethered to the reward maximization framework. Recently,
a new wave of research sidesteps the reward maximization presumptions in favor
of directly optimizing over "pair-wise" or general preferences. In this paper,
we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm
that marries the simplicity and stability of contrastive learning with
the theoretical generality of optimizing general preferences. Because DNO is a
batched on-policy algorithm using a regression-based objective, its
implementation is straightforward and efficient. Moreover, DNO enjoys monotonic
improvement across iterations, which helps it improve even over a strong teacher
(such as GPT-4). In our experiments, a resulting 7B parameter Orca-2.5 model
aligned by DNO achieves a state-of-the-art 33% win rate against GPT-4-Turbo on
AlpacaEval 2.0 (even after controlling for response length), an absolute
gain of 26% (7% to 33%) over the initializing model. It outperforms models with
far more parameters, including Mistral Large, Self-Rewarding LM (70B
parameters), and older versions of GPT-4.
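
The abstract's claim that point-wise rewards cannot capture intransitive preferences can be made concrete with a standard illustration (the notation below is conventional, not quoted from the paper): a Bradley-Terry reward scores each response with a single scalar, so the preferences it induces always follow a total order.

$$P_{\text{BT}}(y_1 \succ y_2 \mid x) = \sigma\big(r(x, y_1) - r(x, y_2)\big)$$

Under this model, $P(y_1 \succ y_2 \mid x) > 1/2$ exactly when $r(x, y_1) > r(x, y_2)$, so a cyclic preference with $P(a \succ b), P(b \succ c), P(c \succ a) > 1/2$ would require $r(x,a) > r(x,b) > r(x,c) > r(x,a)$, which is impossible. A general pair-wise preference function $\mathcal{P}(y_1 \succ y_2 \mid x)$ carries no such constraint, which is the generality DNO targets.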
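The abstract characterizes DNO as a batched on-policy algorithm with a regression-based objective that keeps the simplicity of contrastive learning. The sketch below is a rough illustration, not the paper's released code: it shows a DPO-style contrastive loss of the kind such an iteration could fit on response pairs that a preference oracle has labeled as winner/loser. The function name, the `beta` value, and the dummy tensors are assumptions made for the example.

```python
# Illustrative sketch (assumed names/values, not the paper's implementation):
# a regression-based, contrastive loss over (winner, loser) response pairs
# annotated by a preference oracle, as one reading of the abstract's description.
import torch
import torch.nn.functional as F


def contrastive_preference_loss(policy_logp_w: torch.Tensor,
                                policy_logp_l: torch.Tensor,
                                ref_logp_w: torch.Tensor,
                                ref_logp_l: torch.Tensor,
                                beta: float = 0.1) -> torch.Tensor:
    """Binary-cross-entropy-style loss on the implicit reward margin.

    Each argument is the summed token log-probability of the oracle-preferred (w)
    or dispreferred (l) response under the current policy or a frozen reference
    (e.g., the previous iterate in a batched on-policy loop).
    """
    # Margin between the policy-vs-reference log-ratios of winner and loser.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Push the margin to be positive: prefer the oracle-chosen response.
    return -F.logsigmoid(margin).mean()


# Usage with dummy log-probabilities standing in for a batch of annotated pairs.
policy_logp_w = torch.tensor([-12.3, -20.1])
policy_logp_l = torch.tensor([-15.7, -19.8])
ref_logp_w = torch.tensor([-13.0, -21.0])
ref_logp_l = torch.tensor([-14.9, -20.5])
loss = contrastive_preference_loss(policy_logp_w, policy_logp_l,
                                   ref_logp_w, ref_logp_l)
print(float(loss))  # scalar training loss for this batch of preference pairs
```

In an iterative setting consistent with the abstract's description, the reference log-probabilities would come from the previous (frozen) iterate, and new response pairs would be re-sampled on-policy and re-annotated by the oracle at each round.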