Unified Reward Model for Multimodal Understanding and Generation

Abstract

Recent advances in human preference alignment have significantly improved multimodal generation and understanding. A key approach is to train reward models that guide preference optimization. However, existing reward models are often task-specific, which limits their adaptability across diverse visual applications. We also argue that jointly learning to assess multiple tasks may produce a synergistic effect: improved image understanding enhances the assessment of image generation, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment. It supports both pairwise ranking and pointwise scoring, and can be used for preference alignment of vision models. Specifically, (1) we first develop UnifiedReward on our constructed large-scale human preference dataset, which covers both image and video generation and understanding tasks. (2) We then use it to automatically construct high-quality preference pair data from the outputs of vision models, progressively filtering those outputs through pair ranking and point sifting. (3) Finally, these data are used to align the vision models through Direct Preference Optimization (DPO). Experimental results demonstrate that jointly learning to assess diverse visual tasks can lead to substantial mutual benefits. We apply our pipeline to both image and video understanding and generation tasks, significantly improving performance in each domain.
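
Steps (2) and (3) can be made concrete with a small sketch. The abstract does not spell out how pair ranking and point sifting are combined, so the code below is one plausible reading under stated assumptions, not the authors' implementation: candidates are compared two at a time (pair ranking), and the highest-scored winner and lowest-scored loser are kept as the final preference pair (point sifting). The callables `rank_pair` and `score` are hypothetical stand-ins for the reward model's pairwise and pointwise modes. The `dpo_loss` function is the standard per-example DPO objective, with `beta` chosen arbitrarily.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_preference_pair(
    candidates: Sequence[T],
    rank_pair: Callable[[T, T], bool],  # hypothetical: True if first arg is preferred
    score: Callable[[T], float],        # hypothetical: pointwise quality score
) -> Tuple[T, T]:
    """Return (chosen, rejected) from a pool of model outputs.

    Assumed procedure: split candidates into consecutive pairs, rank each
    pair, then sift the winners and losers by pointwise score.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")

    chosen_pool: List[T] = []
    rejected_pool: List[T] = []
    # Pair ranking: compare candidates two at a time.
    # With an odd number of candidates, the last one is ignored.
    for i in range(0, len(candidates) - 1, 2):
        a, b = candidates[i], candidates[i + 1]
        if rank_pair(a, b):
            chosen_pool.append(a)
            rejected_pool.append(b)
        else:
            chosen_pool.append(b)
            rejected_pool.append(a)

    # Point sifting: keep the best winner and the worst loser.
    chosen = max(chosen_pool, key=score)
    rejected = min(rejected_pool, key=score)
    return chosen, rejected


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Per-example DPO loss: -log(sigmoid(margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Toy stand-ins: each candidate is just its own quality value.
    outputs = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
    pair = build_preference_pair(
        outputs,
        rank_pair=lambda a, b: a >= b,
        score=lambda x: x,
    )
    print(pair)
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In practice, both callables would wrap calls to the reward model, and the log-probabilities passed to `dpo_loss` would come from the vision model being aligned and a frozen reference copy of it.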

