Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
May 29, 2024
作者: Shenao Zhang, Donghan Yu, Hiteshi Sharma, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang
cs.AI
Abstract
Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate RM and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings. Our code and models are available at https://github.com/shenao-zhang/SELM.
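To make the abstract's description more concrete, below is a minimal PyTorch-style sketch of a DPO loss built on the reparameterized (implicit) reward, augmented with an optimism term that biases the policy toward potentially high-reward responses. This is an illustrative assumption, not the authors' released implementation: the exact form of the optimism bonus, the choice of applying it to the chosen response, and the coefficient `alpha` are placeholders; see the linked repository for the actual SELM objective.

```python
# Illustrative sketch of a DPO-style objective with an added optimism term,
# loosely following the abstract's description of SELM. The optimism bonus
# below (the implicit reward of the chosen response, scaled by `alpha`) is
# an assumed form for exposition only.
import torch
import torch.nn.functional as F

def selm_style_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    beta=0.1, alpha=0.01):
    """DPO loss plus an optimistic exploration bonus (illustrative sketch)."""
    # Reparameterized (implicit) rewards: r(x, y) = beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO term: maximize the margin between chosen and rejected rewards.
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # Assumed optimism term: push up the implicit reward of responses that
    # look promising, encouraging exploration beyond the current distribution.
    optimism_bonus = chosen_rewards

    return (dpo_loss - alpha * optimism_bonus).mean()

# Example usage with dummy sequence-level log-probabilities for 4 preference pairs.
policy_chosen = torch.randn(4, requires_grad=True)
policy_rejected = torch.randn(4, requires_grad=True)
ref_chosen = torch.randn(4)
ref_rejected = torch.randn(4)
loss = selm_style_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
```

In this sketch the reward model never appears explicitly: as in DPO, the reward is reparameterized through the policy and reference log-probabilities, which is what lets the method update the LLM directly with a single objective, as the abstract states.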