BaseReward: A Strong Baseline for Multimodal Reward Model

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear "recipe" for constructing high-performance MRMs. We systematically investigate every crucial component of the MRM development pipeline: reward modeling paradigms (e.g., Naive-RM, Critic-based RM, and Generative RM), reward head architecture, training strategies, data curation (covering more than ten multimodal and text-only preference datasets), backbone model and model scale, and ensemble methods. Based on these experimental insights, we introduce BaseReward, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture: it is built on a Qwen2.5-VL backbone, features an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new state of the art (SOTA) on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, where it improves an MLLM's performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically backed guide for developing robust reward models for the next generation of MLLMs.
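As a rough illustration of what a two-layer reward head trained on preference pairs might look like, the sketch below scores a pooled backbone feature with a small two-layer head and applies a Bradley-Terry pairwise loss. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the hidden size, activation, pooling, or loss, so `make_head`, `reward`, `pairwise_loss`, the ReLU activation, and the toy feature vectors are all hypothetical stand-ins, and plain Python lists take the place of the Qwen2.5-VL backbone output.

```python
import math
import random


def make_head(in_dim, hidden_dim, seed=0):
    """Create weights for a hypothetical two-layer head: Linear -> ReLU -> Linear(1)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
    b2 = 0.0
    return w1, b1, w2, b2


def reward(head, features):
    """Map one pooled feature vector (assumed backbone output) to a scalar reward."""
    w1, b1, w2, b2 = head
    # First layer with ReLU; the activation choice is an assumption.
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w1, b1)
    ]
    # Second layer projects the hidden vector to a single score.
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry loss, -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = r_chosen - r_rejected
    # softplus(-margin) written to avoid overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy usage: the feature vectors stand in for pooled backbone outputs
# of a preferred and a dispreferred response to the same prompt.
head = make_head(in_dim=4, hidden_dim=3)
chosen = [0.9, 0.1, 0.4, 0.7]
rejected = [0.2, 0.8, 0.1, 0.3]
loss = pairwise_loss(reward(head, chosen), reward(head, rejected))
print(f"pairwise loss: {loss:.4f}")
```

The loss shrinks as the head scores the preferred response above the rejected one, which is the usual training signal for a scalar reward model; a real setup would compute the features with the multimodal backbone and update the head (and possibly the backbone) by gradient descent.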
