Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
May 29, 2024
Authors: Shenao Zhang, Donghan Yu, Hiteshi Sharma, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang
cs.AI
Abstract
Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate RM and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings. Our code and models are available at https://github.com/shenao-zhang/SELM.
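
To make the idea of "DPO plus an optimism bias" concrete, here is a minimal PyTorch-style sketch of a preference loss augmented with an exploration bonus on the reparameterized (implicit) reward. This is an illustrative assumption, not the exact SELM objective derived in the paper: the function name, hyperparameter values, and the specific form of the bonus (the mean implicit reward of the chosen responses) are all placeholders chosen for clarity.

```python
# Illustrative sketch only: a DPO loss with an added optimism bonus, in the
# spirit of SELM's bias towards potentially high-reward responses. The exact
# SELM objective is given in the paper; the bonus form here is an assumption.
import torch
import torch.nn.functional as F


def selm_style_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (B,)
    beta: float = 0.1,                    # KL-regularization strength, as in DPO
    alpha: float = 1e-3,                  # optimism (exploration) coefficient
) -> torch.Tensor:
    # Implicit rewards under the reparameterization r(x, y) = beta * log(pi_theta / pi_ref),
    # which removes the need for a separately trained reward model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO term: -log sigmoid(r_chosen - r_rejected).
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Optimism bonus: push up the implicit reward of responses the current model
    # already rates highly, biasing generation towards unexplored high-reward regions.
    optimism_bonus = chosen_rewards.mean()

    return dpo_loss - alpha * optimism_bonus
```

In the online setting the abstract describes, each iteration would sample fresh responses from the current model, collect preference labels on them, minimize a loss of this kind, and then use the updated model both as the generator and as the reference policy for the next round.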