BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
June 18, 2024
Authors: Wenda Xu, Jiachen Li, William Yang Wang, Lei Li
cs.AI
Abstract
Direct alignment from preferences (DAP) has emerged as a promising paradigm
for aligning large language models (LLMs) to human desiderata from
pre-collected, offline preference datasets. While recent studies indicate that
existing offline DAP methods can directly benefit from online training samples,
we highlight the need to develop specific online DAP algorithms to fully
harness the power of online training. Specifically, we identify that the
learned LLM should adhere to the proximity of the behavior LLM, which collects
the training samples. To this end, we propose online Preference Optimization in
proximity to the Behavior LLM (BPO), emphasizing the importance of constructing
a proper trust region for LLM alignment.
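To make the trust-region idea concrete, the sketch below writes a DPO-style pairwise objective in which the reference model is the behavior LLM (denoted here as \pi_b) that sampled the preference pairs, rather than a fixed offline checkpoint; \pi_\theta, \beta, and \sigma follow standard DPO notation. This is an illustrative formalization under those assumptions, not necessarily the exact objective used in the paper.

```latex
% Illustrative DPO-style objective anchored to the behavior LLM \pi_b
% (notation assumed here, not quoted from the paper):
\mathcal{L}(\pi_\theta;\, \pi_b)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}_{\pi_b}}
    \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_b(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_b(y_l \mid x)}
    \right) \right]
```

Keeping the denominator tied to the model that actually generated the samples is what defines the trust region around the behavior LLM.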
We conduct extensive experiments to validate the effectiveness and
applicability of our approach by integrating it with various DAP methods,
resulting in significant performance improvements across a wide range of tasks
when training with the same amount of preference data. Even when only
introducing one additional data collection phase, our online BPO improves its
offline DAP baseline from 72.0% to 80.2% on TL;DR and from 82.2% to 89.1% on
Anthropic Helpfulness in terms of win rate against human reference text.
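As a rough illustration of how such a phase-based online recipe could be wired together, here is a schematic Python loop. All names (sample_fn, rank_fn, dap_update_fn, online_bpo_phases) are hypothetical placeholders rather than APIs from the paper's code; the sketch only shows that each phase's preference pairs are optimized against the behavior LLM snapshot that generated them.

```python
import copy
from typing import Callable, List, Tuple


def online_bpo_phases(
    policy,
    prompts: List[str],
    sample_fn: Callable[[object, str], str],              # draws one response from a model
    rank_fn: Callable[[str, str, str], Tuple[str, str]],  # labels (chosen, rejected) for a prompt
    dap_update_fn: Callable[[object, object, list], None],  # one DAP pass anchored to a reference model
    num_phases: int = 2,
):
    """Schematic online BPO-style recipe (a sketch, not the authors' code).

    Per phase: snapshot the current policy as the behavior LLM, collect
    preference pairs from that snapshot, then run a DAP method (DPO, IPO,
    SLiC, ...) whose trust region is the snapshot itself.
    """
    for _ in range(num_phases):
        # The behavior LLM for this phase is a frozen copy of the policy.
        behavior = copy.deepcopy(policy)

        # Collect on-policy preference pairs with the behavior LLM.
        pairs = []
        for x in prompts:
            y_a, y_b = sample_fn(behavior, x), sample_fn(behavior, x)
            y_w, y_l = rank_fn(x, y_a, y_b)
            pairs.append((x, y_w, y_l))

        # Optimize the policy while staying in proximity to `behavior`.
        dap_update_fn(policy, behavior, pairs)

    return policy
```

With num_phases=2, the loop corresponds to adding a single extra data collection phase on top of standard offline DAP training.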