

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

June 17, 2024
Authors: Jie Liu, Zhanhui Zhou, Jiaheng Liu, Xingyuan Bu, Chao Yang, Han-Sen Zhong, Wanli Ouyang
cs.AI

Abstract

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard, and the OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.
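
The abstract does not spell out the exact form of the length regularizer. One common way to penalize length in a DPO-style objective is to subtract a margin proportional to the difference in length between the chosen and rejected responses, so the policy cannot win preferences simply by being more verbose. The sketch below illustrates that idea; the function name, tensor arguments, and the coefficient `alpha` are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ilr_dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 chosen_lengths: torch.Tensor,
                 rejected_lengths: torch.Tensor,
                 beta: float = 0.1,
                 alpha: float = 0.01) -> torch.Tensor:
    """Sketch of a length-regularized DPO loss for one batch of preference pairs.

    Each *_logps tensor holds the summed token log-probabilities of the chosen
    or rejected response under the policy or the frozen reference model;
    *_lengths holds response lengths in tokens. The penalty form and alpha are
    illustrative assumptions, not the paper's exact formulation.
    """
    # Implicit reward margins, as in standard DPO.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Length penalty: shrink the preference margin when the chosen response
    # is longer than the rejected one, discouraging verbosity.
    length_margin = alpha * (chosen_lengths - rejected_lengths).float()

    logits = chosen_rewards - rejected_rewards - length_margin
    return -F.logsigmoid(logits).mean()
```

In the iterative setup described in the abstract, each round would sample fresh responses from the current policy, label preference pairs with the trained reward model, and minimize a loss of this kind before repeating.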
