Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
May 29, 2024
Authors: Shenao Zhang, Donghan Yu, Hiteshi Sharma, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang
cs.AI
Abstract
Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate RM and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings. Our code and models are available at https://github.com/shenao-zhang/SELM.
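
To make the idea of "DPO plus an optimism bias" concrete, here is a minimal PyTorch-style sketch of a preference loss augmented with an exploration bonus on the reparameterized (implicit) reward. This is an illustrative assumption, not the exact SELM objective derived in the paper: the function name, hyperparameter values, and the specific form of the bonus (the mean implicit reward of the chosen responses) are all placeholders chosen for clarity.

```python
# Illustrative sketch only: a DPO loss with an added optimism bonus, in the
# spirit of SELM's bias towards potentially high-reward responses. The exact
# SELM objective is given in the paper; the bonus form here is an assumption.
import torch
import torch.nn.functional as F


def selm_style_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (B,)
    beta: float = 0.1,                    # KL-regularization strength, as in DPO
    alpha: float = 1e-3,                  # optimism (exploration) coefficient
) -> torch.Tensor:
    # Implicit rewards under the reparameterization r(x, y) = beta * log(pi_theta / pi_ref),
    # which removes the need for a separately trained reward model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO term: -log sigmoid(r_chosen - r_rejected).
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Optimism bonus: push up the implicit reward of responses the current model
    # already rates highly, biasing generation towards unexplored high-reward regions.
    optimism_bonus = chosen_rewards.mean()

    return dpo_loss - alpha * optimism_bonus
```

In the online setting the abstract describes, each iteration would sample fresh responses from the current model, collect preference labels on them, minimize a loss of this kind, and then use the updated model both as the generator and as the reference policy for the next round.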