Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
May 29, 2024
作者: Shenao Zhang, Donghan Yu, Hiteshi Sharma, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang
cs.AI
Abstract
Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate RM and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings. Our code and models are available at https://github.com/shenao-zhang/SELM.
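To make the abstract's description more concrete, below is a minimal PyTorch-style sketch of a DPO loss built on the reparameterized (implicit) reward, augmented with an optimism term that biases the policy toward potentially high-reward responses. This is an illustrative assumption, not the authors' released implementation: the exact form of the optimism bonus, the choice of applying it to the chosen response, and the coefficient `alpha` are placeholders; see the linked repository for the actual SELM objective.

```python
# Illustrative sketch of a DPO-style objective with an added optimism term,
# loosely following the abstract's description of SELM. The optimism bonus
# below (the implicit reward of the chosen response, scaled by `alpha`) is
# an assumed form for exposition only.
import torch
import torch.nn.functional as F

def selm_style_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    beta=0.1, alpha=0.01):
    """DPO loss plus an optimistic exploration bonus (illustrative sketch)."""
    # Reparameterized (implicit) rewards: r(x, y) = beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO term: maximize the margin between chosen and rejected rewards.
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # Assumed optimism term: push up the implicit reward of responses that
    # look promising, encouraging exploration beyond the current distribution.
    optimism_bonus = chosen_rewards

    return (dpo_loss - alpha * optimism_bonus).mean()

# Example usage with dummy sequence-level log-probabilities for 4 preference pairs.
policy_chosen = torch.randn(4, requires_grad=True)
policy_rejected = torch.randn(4, requires_grad=True)
ref_chosen = torch.randn(4)
ref_rejected = torch.randn(4)
loss = selm_style_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
```

In this sketch the reward model never appears explicitly: as in DPO, the reward is reparameterized through the policy and reference log-probabilities, which is what lets the method update the LLM directly with a single objective, as the abstract states.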