BaseReward: A Strong Baseline for Multimodal Reward Model

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear "recipe" for constructing high-performance MRMs. We systematically investigate every crucial component in the MRM development pipeline: reward modeling paradigms (e.g., Naive-RM, Critic-based RM, and Generative RM), reward head architecture, training strategies, data curation (covering over ten multimodal and text-only preference datasets), backbone model and model scale, and ensemble methods. Based on these experimental insights, we introduce BaseReward, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture: it is built on a Qwen2.5-VL backbone, features an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new state of the art on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, successfully enhancing an MLLM's performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically backed guide for developing robust reward models for the next generation of MLLMs.
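As a rough illustration of the architecture described above, the following is a minimal sketch of a two-layer reward head that maps a pooled backbone feature vector to a scalar score, together with a pairwise preference loss. It is not the authors' implementation. The abstract does not specify the activation, the hidden size, the initialization, or the training objective, so the ReLU activation, the Bradley-Terry style loss, and all names here (`make_reward_head`, `reward`, `pairwise_loss`) are assumptions chosen for illustration. The backbone itself is omitted and stands in as a plain list of floats.

```python
import math
import random


def make_reward_head(in_dim, hidden_dim, seed=0):
    """Randomly initialize a two-layer MLP head: in_dim -> hidden_dim -> 1.

    The uniform initialization is an arbitrary choice for this sketch.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return {
        "w1": [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "w2": [rng.uniform(-scale, scale) for _ in range(hidden_dim)],
        "b2": 0.0,
    }


def reward(head, features):
    """Map one pooled backbone feature vector to a scalar reward."""
    hidden = [
        # First layer followed by ReLU (the activation is an assumption).
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(head["w1"], head["b1"])
    ]
    # Second layer projects the hidden vector to a single score.
    return sum(w * h for w, h in zip(head["w2"], hidden)) + head["b2"]


def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    This objective is assumed here; the abstract does not name the loss.
    """
    margin = r_chosen - r_rejected
    # Equivalent to softplus(-margin), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch, a preference pair is scored by calling `reward` once on the features of the chosen response and once on those of the rejected response, then passing both scores to `pairwise_loss`. The loss shrinks as the chosen response's score rises above the rejected one's.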
