Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level
June 17, 2024
Authors: Jie Liu, Zhanhui Zhou, Jiaheng Liu, Xingyuan Bu, Chao Yang, Han-Sen Zhong, Wanli Ouyang
cs.AI
Abstract
Direct Preference Optimization (DPO), a standard method for aligning language
models with human preferences, is traditionally applied to offline preferences.
Recent studies show that DPO benefits from iterative training with online
preferences labeled by a trained reward model. In this work, we identify a
pitfall of vanilla iterative DPO: improved response quality can lead to
increased verbosity. To address this, we introduce iterative length-regularized
DPO (iLR-DPO) to penalize response length. Our empirical results show that
iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing
verbosity. Specifically, our 7B model achieves a 50.5% length-controlled win
rate against GPT-4 Preview on AlpacaEval 2.0, and excels across
standard benchmarks including MT-Bench, Arena-Hard, and the OpenLLM Leaderboard.
These results demonstrate the effectiveness of iterative DPO in aligning
language models with human feedback.
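The length regularization referred to in the abstract amounts to a penalty on response length added to the DPO objective. The snippet below is a minimal sketch of one such length-regularized DPO loss, assuming the length difference between the chosen and rejected responses is subtracted from the DPO logit with a coefficient alpha; the function name, argument names, and default coefficients are illustrative and may differ from the paper's exact formulation. Setting alpha to zero recovers vanilla DPO, the setting the abstract identifies as prone to verbosity.

```python
import torch
import torch.nn.functional as F

def length_regularized_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                                ref_chosen_logps, ref_rejected_logps,
                                chosen_lengths, rejected_lengths,
                                beta=0.1, alpha=0.01):
    """Sketch of a length-regularized DPO loss (illustrative, not the paper's exact code).

    A length penalty is added to the standard DPO logit so that the policy is not
    rewarded merely for producing longer responses; alpha=0 recovers vanilla DPO.
    """
    # Implicit reward margins from log-probability ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Penalize the length difference between the chosen and rejected responses.
    length_penalty = alpha * (chosen_lengths.float() - rejected_lengths.float())

    logits = chosen_rewards - rejected_rewards - length_penalty
    return -F.logsigmoid(logits).mean()
```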