mDPO: Conditional Preference Optimization for Multimodal Large Language Models

Abstract

Direct preference optimization (DPO) has been shown to be an effective method for large language model (LLM) alignment. Recent work has attempted to apply DPO to multimodal scenarios but has found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding a decrease in their likelihood, which is an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.
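
The abstract names three ingredients (the standard DPO term, an image preference term, and a reward anchor) but gives no formulas. The following is a minimal sketch of how such an objective might be assembled for a single example, not the authors' implementation. It assumes that the image preference compares the chosen response under the original image against the same response under a less informative (for example, corrupted) image, that the three terms are summed with equal weight, and that the anchor value is 0. The function names, argument names, and the default `beta` are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO-style implicit reward: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def mdpo_loss_sketch(
    lp_chosen: float, ref_chosen: float,          # chosen response, original image
    lp_rejected: float, ref_rejected: float,      # rejected response, original image
    lp_chosen_bad_img: float, ref_chosen_bad_img: float,  # chosen response, corrupted image (assumed)
    beta: float = 0.1,     # assumed temperature
    anchor: float = 0.0,   # assumed anchor: pushes the chosen reward above 0
) -> float:
    r_chosen = implicit_reward(lp_chosen, ref_chosen, beta)
    r_rejected = implicit_reward(lp_rejected, ref_rejected, beta)
    r_bad_img = implicit_reward(lp_chosen_bad_img, ref_chosen_bad_img, beta)

    # Language preference: chosen vs. rejected response, same image (standard DPO).
    dpo_term = -log_sigmoid(r_chosen - r_rejected)
    # Image preference: same response, original vs. less informative image.
    image_term = -log_sigmoid(r_chosen - r_bad_img)
    # Reward anchor: penalize a chosen-response reward that falls below the anchor.
    anchor_term = -log_sigmoid(r_chosen - anchor)

    return dpo_term + image_term + anchor_term
```

In this sketch the image term is the only one that changes when the image is swapped, so a model that ignores the image cannot reduce it, and the anchor term keeps the absolute reward of the chosen response from drifting negative even when the relative margins are satisfied.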
