mDPO: Conditional Preference Optimization for Multimodal Large Language Models

June 17, 2024
Authors: Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, Muhao Chen
cs.AI

Abstract

Direct preference optimization (DPO) has been shown to be an effective method for large language model (LLM) alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding the decrease in their likelihood -- an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.
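
The abstract describes mDPO as standard DPO augmented with two pieces: an image-conditional preference term, so the model cannot ignore the visual input, and a reward anchor that keeps the chosen response's reward positive. The following is a minimal, hypothetical sketch of how such an objective could be assembled, based only on that description; the function and argument names, the equal weighting of the three terms, the beta value, and the image-corruption scheme are illustrative assumptions rather than the authors' released implementation.

```python
# Hypothetical sketch of an mDPO-style objective (not the authors' code).
# All inputs are per-example summed log-probabilities of a response.
import torch
import torch.nn.functional as F

def mdpo_loss(
    pi_chosen, pi_rejected,      # policy: log p(y_w | image, x), log p(y_l | image, x)
    ref_chosen, ref_rejected,    # same quantities under the frozen reference model
    pi_chosen_corrupt,           # policy: log p(y_w | corrupted image, x)
    ref_chosen_corrupt,          # reference: log p(y_w | corrupted image, x)
    beta=0.1,                    # assumed scaling factor, as in standard DPO
):
    # Implicit rewards, as in DPO: beta * (policy log-prob - reference log-prob).
    r_chosen = beta * (pi_chosen - ref_chosen)
    r_rejected = beta * (pi_rejected - ref_rejected)
    r_chosen_corrupt = beta * (pi_chosen_corrupt - ref_chosen_corrupt)

    # Standard DPO term: prefer the chosen response over the rejected one.
    loss_dpo = -F.logsigmoid(r_chosen - r_rejected)

    # Conditional image-preference term: the chosen response should score
    # higher under the original image than under a corrupted one, so a
    # policy that ignores the image cannot minimize this term.
    loss_image = -F.logsigmoid(r_chosen - r_chosen_corrupt)

    # Reward anchor: encourage the chosen reward to stay positive, i.e. keep
    # the chosen response at least as likely as under the reference model.
    loss_anchor = -F.logsigmoid(r_chosen)

    # Equal weighting of the three terms is an assumption of this sketch.
    return (loss_dpo + loss_image + loss_anchor).mean()

if __name__ == "__main__":
    # Dummy per-example log-probabilities for a batch of 4 preference pairs.
    dummy = [torch.randn(4) for _ in range(6)]
    print(mdpo_loss(*dummy))
```

In this sketch the image-conditioned term compares the chosen response's reward under the original image against its reward under a perturbed image, while the anchor term penalizes the chosen reward whenever it drops below zero, matching the abstract's stated remedy for the decreasing likelihood of chosen responses under purely relative preference optimization.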
