Unified Reward Model for Multimodal Understanding and Generation

Abstract

Recent advances in human preference alignment have significantly improved multimodal generation and understanding. A key approach is to train reward models that guide preference optimization. However, existing reward models are often task-specific, which limits their adaptability across diverse visual applications. We also argue that jointly learning to assess multiple tasks may produce a synergistic effect: improved image understanding enhances the assessment of image generation, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment. It supports both pairwise ranking and pointwise scoring and can be used for preference alignment of vision models. Specifically, (1) we first develop UnifiedReward on our constructed large-scale human preference dataset, which covers both image and video generation/understanding tasks. (2) We then use it to automatically construct high-quality preference pair data from the outputs of vision models, filtering those outputs in a fine-grained manner through pair ranking and point sifting. (3) Finally, these data are used to align the vision models via Direct Preference Optimization (DPO). Experimental results demonstrate that jointly learning to assess diverse visual tasks yields substantial mutual benefits. We apply our pipeline to both image and video understanding/generation tasks and significantly improve performance in each domain.
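Step (3) relies on Direct Preference Optimization. As a rough illustration of what that objective computes for a single preference pair, the sketch below implements the standard DPO loss from summed log-probabilities. It is not the authors' training code: the function name, the argument names, and the default beta = 0.1 are assumptions made for this example, and the log-probabilities are assumed to come from a trainable policy model and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy or the frozen reference model. All names and the
    default beta are illustrative assumptions, not taken from the paper.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the chosen response relative to the reference, which is the behavior preference pairs are meant to encourage.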
