Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
April 4, 2024
作者: Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, Tengyang Xie
cs.AI
Abstract
This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning from subsequent policy optimization. However, such a reward-maximization approach is limited by the nature of "point-wise" rewards (such as the Bradley-Terry model), which cannot express complex intransitive or cyclic preference relations. While advances in RLHF show that reward learning and policy optimization can be merged into a single contrastive objective for stability, they still remain tethered to the reward-maximization framework. Recently, a new wave of research sidesteps the reward-maximization presumption in favor of directly optimizing over "pair-wise" or general preferences. In this paper, we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm that marries the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences. Because DNO is a batched on-policy algorithm using a regression-based objective, its implementation is straightforward and efficient. Moreover, DNO enjoys monotonic improvement across iterations, which helps it improve even over a strong teacher (such as GPT-4). In our experiments, a resulting 7B-parameter Orca-2.5 model aligned by DNO achieves a state-of-the-art win rate of 33% against GPT-4-Turbo on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26 percentage points (from 7% to 33%) over the initializing model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4.
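
For context on what "optimizing general preferences" means here: the standard formalization in this line of work (our paraphrase, not a quote from the paper) is to seek the Nash equilibrium of a two-player game played over the preference function, so that the aligned policy cannot be beaten in expectation by any competing policy:

\pi^{\ast} = \arg\max_{\pi} \min_{\pi'} \; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)} \big[ \mathcal{P}(y \succ y' \mid x) \big]

Reward-maximization methods instead assume \mathcal{P} factors through a point-wise reward r(x, y) (as in the Bradley-Terry model), which is exactly what rules out intransitive or cyclic preferences.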
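Below is a minimal sketch of the kind of contrastive, regression-style update the abstract alludes to: a DPO-like loss applied inside a batched on-policy loop (sample responses from the current policy, let the preference oracle pick winners, then fit the next iterate against the previous one). This is an illustrative assumption about how such an objective is commonly implemented, not the authors' code; the function name, tensor shapes, and beta value are hypothetical.

import torch
import torch.nn.functional as F

def contrastive_preference_loss(policy_logp_chosen: torch.Tensor,
                                policy_logp_rejected: torch.Tensor,
                                ref_logp_chosen: torch.Tensor,
                                ref_logp_rejected: torch.Tensor,
                                beta: float = 0.1) -> torch.Tensor:
    # Binary logistic regression on the margin of log-probability ratios:
    # push the current policy to rank oracle-preferred responses above
    # rejected ones, relative to the previous iteration's policy (the reference).
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(margin).mean()

# Toy usage: summed token log-probabilities for a batch of 4 preference pairs.
batch = 4
policy_chosen = torch.randn(batch, requires_grad=True)
policy_rejected = torch.randn(batch, requires_grad=True)
ref_chosen = torch.randn(batch)
ref_rejected = torch.randn(batch)
loss = contrastive_preference_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # gradients flow only into the current policy's log-probabilities
print(float(loss))

In the batched on-policy setting the abstract describes, the policy_* terms would be recomputed each iteration from fresh samples of the current model and the ref_* terms would come from the previous iterate, which is what distinguishes this loop from purely offline contrastive training.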