Unified Reward Model for Multimodal Understanding and Generation

Abstract

Recent advances in human preference alignment have significantly enhanced multimodal generation and understanding. A key approach is to train reward models that guide preference optimization. However, existing reward models are often task-specific, which limits their adaptability across diverse visual applications. We argue that jointly learning to assess multiple tasks may foster a synergistic effect: improved image understanding enhances the assessment of image generation, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment. It supports both pairwise ranking and pointwise scoring and can be employed for preference alignment of vision models. Specifically, (1) we first develop UnifiedReward on our constructed large-scale human preference dataset, which covers both image and video generation/understanding tasks. (2) We then use it to automatically construct high-quality preference pair data from the outputs of the vision models, progressively filtering those outputs from coarse to fine through pair ranking and point sifting. (3) Finally, these data are used to align the vision models through Direct Preference Optimization (DPO). Experimental results demonstrate that jointly learning to assess diverse visual tasks can lead to substantial mutual benefits. We apply our pipeline to both image and video understanding/generation tasks and significantly improve performance in each domain.
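Steps (2) and (3) can be illustrated with a minimal Python sketch. It is not the paper's implementation: `build_preference_pair`, the `prefers` and `score` callables (stand-ins for the reward model's pairwise and pointwise judgments), the `keep` pool size, and the win-count ranking rule are all assumptions made for illustration. `dpo_loss` is the standard DPO objective for a single pair, written on precomputed log-probabilities.

```python
import math
from typing import Callable, Sequence, Tuple


def build_preference_pair(
    candidates: Sequence[str],
    prefers: Callable[[str, str], bool],   # hypothetical pairwise judgment
    score: Callable[[str], float],         # hypothetical pointwise score
    keep: int = 2,                         # assumed pool size per side
) -> Tuple[str, str]:
    """Return (chosen, rejected) from one prompt's candidate outputs."""
    if len(candidates) < 2 * keep:
        raise ValueError("need at least 2 * keep candidates")
    # Pair ranking: count pairwise wins for every candidate.
    wins = [0] * len(candidates)
    for i, a in enumerate(candidates):
        for j, b in enumerate(candidates):
            if i != j and prefers(a, b):
                wins[i] += 1
    order = sorted(range(len(candidates)), key=lambda i: wins[i], reverse=True)
    top = [candidates[i] for i in order[:keep]]
    bottom = [candidates[i] for i in order[-keep:]]
    # Point sifting: pointwise scores pick the final pair from each pool.
    return max(top, key=score), min(bottom, key=score)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In practice the two callables would wrap reward-model queries, and the log-probabilities would come from the policy being aligned and a frozen reference copy of it.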