Direct Language Model Alignment from Online AI Feedback
February 7, 2024
Authors: Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, Mathieu Blondel
cs.AI
Abstract
Direct alignment from preferences (DAP) methods, such as DPO, have recently
emerged as efficient alternatives to reinforcement learning from human feedback
(RLHF) that do not require a separate reward model. However, the preference
datasets used in DAP methods are usually collected ahead of training and never
updated, thus the feedback is purely offline. Moreover, responses in these
datasets are often sampled from a language model distinct from the one being
aligned, and since the model evolves over training, the alignment phase is
inevitably off-policy. In this study, we posit that online feedback is key and
improves DAP methods. Our method, online AI feedback (OAIF), uses an LLM as
annotator: on each training iteration, we sample two responses from the current
model and prompt the LLM annotator to choose which one is preferred, thus
providing online feedback. Despite its simplicity, we demonstrate via human
evaluation in several tasks that OAIF outperforms both offline DAP and RLHF
methods. We further show that the feedback leveraged in OAIF is easily
controllable, via instruction prompts to the LLM annotator.
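To make the described training loop concrete, below is a minimal sketch of one OAIF update step in Python, using the DPO loss as an example DAP objective. The helpers `policy.sample`, `policy.log_prob`, and `annotate` are hypothetical placeholders standing in for response generation, sequence log-likelihood computation, and the LLM annotator call; the abstract does not specify the authors' actual implementation, so this is only an illustration under those assumptions.

```python
import torch
import torch.nn.functional as F

def oaif_step(policy, ref_policy, annotate, prompt, beta=0.1):
    """One hypothetical OAIF update: sample on-policy, label online, apply a DAP loss.

    Assumed (not from the paper) interfaces:
      policy.sample(prompt)         -> a response string sampled from the current model
      policy.log_prob(prompt, y)    -> scalar tensor, log-likelihood of response y
      annotate(prompt, y1, y2)      -> 0 or 1, index of the response the LLM annotator prefers
    """
    # 1) Sample two candidate responses from the *current* policy (on-policy data).
    y1 = policy.sample(prompt)
    y2 = policy.sample(prompt)

    # 2) Query the LLM annotator for a preference (the online AI feedback).
    preferred = annotate(prompt, y1, y2)
    y_w, y_l = (y1, y2) if preferred == 0 else (y2, y1)

    # 3) Apply a DAP update on the freshly labelled pair; here, the standard DPO loss
    #    -log sigmoid(beta * [(log pi - log pi_ref)(y_w) - (log pi - log pi_ref)(y_l)]).
    logp_w = policy.log_prob(prompt, y_w)
    logp_l = policy.log_prob(prompt, y_l)
    ref_logp_w = ref_policy.log_prob(prompt, y_w)
    ref_logp_l = ref_policy.log_prob(prompt, y_l)

    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    loss = -F.logsigmoid(margin)
    return loss
```

The key difference from offline DAP is that both responses come from the current policy and are labelled at training time, so the preference data never becomes stale or off-policy; controllability follows from how the instruction prompt inside `annotate` is written.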