Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Abstract

Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning, which often leads to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability. (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start so that it learns the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model. (3) Meanwhile, incorrectly predicted samples are finally used for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.
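As a rough illustration of stages (2) and (3), the sketch below routes model outputs by correctness and computes group-relative advantages in the GRPO style. The `Sample` schema, the function names, the binary reward, and the use of the population standard deviation are all assumptions made for illustration; the abstract does not specify the reward design or any implementation detail, so this should not be read as the authors' code.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Sample:
    """One preference example (hypothetical schema, not from the paper)."""
    prompt: str
    reasoning: str   # chain-of-thought text produced by the model
    predicted: str   # model's final choice, e.g. "A" or "B"
    label: str       # ground-truth preferred choice


def route_samples(samples):
    """Split outputs by correctness.

    Correct outputs are kept for rejection-sampling fine-tuning;
    incorrect ones form the pool used for GRPO-based fine-tuning.
    """
    kept = [s for s in samples if s.predicted == s.label]
    hard = [s for s in samples if s.predicted != s.label]
    return kept, hard


def binary_reward(sample):
    """Assumed reward: 1.0 if the final choice matches the label, else 0.0."""
    return 1.0 if sample.predicted == sample.label else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own group.

    This is the group-relative baseline that GRPO uses in place of a
    learned value function. eps avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, a group is the set of responses sampled for one prompt. Responses that score above the group mean receive positive advantages and are reinforced, while a group with identical rewards yields zero advantages and therefore contributes no gradient signal.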
