BaseReward: A Strong Baseline for Multimodal Reward Model
September 19, 2025
作者: Yi-Fan Zhang, Haihua Yang, Huanyu Zhang, Yang Shi, Zezhou Chen, Haochen Tian, Chaoyou Fu, Haotian Wang, Kai Wu, Bo Cui, Xu Wang, Jianfei Pan, Haotian Wang, Zhang Zhang, Liang Wang
cs.AI
Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear "recipe" for constructing high-performance MRMs. We systematically investigate every crucial component in the MRM development pipeline, including reward modeling paradigms (e.g., Naive-RM, Critic-based RM, and Generative RM), reward head architecture, training strategies, data curation (covering over ten multimodal and text-only preference datasets), backbone model and model scale, and ensemble methods.
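Of the paradigms named above, a Naive-RM attaches a scalar reward head to the backbone and is most commonly trained with a pairwise Bradley-Terry objective on chosen/rejected response pairs. The abstract does not state the exact loss used, so the following is a minimal illustrative sketch of that standard objective rather than the authors' implementation; the function name and tensor shapes are assumptions.

```python
# Minimal sketch of the pairwise (Bradley-Terry) preference objective commonly
# used to train scalar reward models. This is an illustrative assumption, not
# the paper's confirmed training loss.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(r_chosen: torch.Tensor,
                             r_rejected: torch.Tensor) -> torch.Tensor:
    """Push the reward of the preferred response above the rejected one.

    r_chosen / r_rejected: scalar rewards of shape (batch,) produced by the
    reward head for the chosen and rejected responses to the same prompt.
    """
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example: rewards for a batch of three preference pairs
loss = pairwise_preference_loss(torch.tensor([1.2, 0.3, 2.0]),
                                torch.tensor([0.4, 0.9, 1.5]))
```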
Based on these experimental insights, we introduce BaseReward, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture: it is built on a Qwen2.5-VL backbone with an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new SOTA on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, successfully enhancing an MLLM's performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically backed guide for developing robust reward models for the next generation of MLLMs.
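To make the "two-layer reward head" concrete, here is a minimal, hypothetical PyTorch sketch of such a head placed on an MLLM backbone's final hidden states. The hidden size (3584, roughly Qwen2.5-VL-7B scale), the GELU activation, last-token pooling, and the class name are all illustrative assumptions; the paper's actual head configuration may differ.

```python
# Hypothetical sketch of a two-layer scalar reward head on top of a
# decoder-only MLLM backbone. Dimensions, activation, and pooling are
# assumptions for illustration, not the paper's confirmed design.
import torch
import torch.nn as nn

class TwoLayerRewardHead(nn.Module):
    def __init__(self, hidden_size: int = 3584, intermediate_size: int = 1024):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),                       # assumed non-linearity
            nn.Linear(intermediate_size, 1)  # single scalar reward
        )

    def forward(self, last_hidden_state: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # Pool the hidden state of the last non-padded token of each sequence,
        # a common choice for decoder-only reward models.
        last_idx = attention_mask.long().sum(dim=1) - 1          # (batch,)
        batch_idx = torch.arange(last_hidden_state.size(0),
                                 device=last_hidden_state.device)
        pooled = last_hidden_state[batch_idx, last_idx]          # (batch, hidden)
        return self.head(pooled).squeeze(-1)                     # (batch,)
```

In use, the scalar output of such a head would be fed into a pairwise objective like the one sketched earlier, with the backbone either frozen or fine-tuned jointly depending on the chosen training strategy.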