Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
May 6, 2025
Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
cs.AI
Abstract
Recent advances in multimodal Reward Models (RMs) have shown significant
promise in delivering reward signals to align vision models with human
preferences. However, current RMs are generally restricted to providing direct
responses or engaging in shallow reasoning of limited depth, often
leading to inaccurate reward signals. We posit that incorporating explicit long
chains of thought (CoT) into the reward reasoning process can significantly
strengthen their reliability and robustness. Furthermore, we believe that once
RMs internalize CoT reasoning, their direct response accuracy can also be
improved through implicit reasoning capabilities. To this end, this paper
proposes UnifiedReward-Think, the first unified multimodal CoT-based reward
model, capable of multi-dimensional, step-by-step long-chain reasoning for both
visual understanding and generation reward tasks. Specifically, we adopt an
exploration-driven reinforcement fine-tuning approach to elicit and incentivize
the model's latent complex reasoning ability: (1) We first use a small amount
of image generation preference data to distill the reasoning process of GPT-4o,
which is then used for the model's cold start to learn the format and structure
of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge
and generalization capabilities, we prepare large-scale unified multimodal
preference data to elicit the model's reasoning process across various vision
tasks. During this phase, correct reasoning outputs are retained for rejection
sampling to refine the model, and (3) incorrectly predicted samples are finally
used for Group Relative Policy Optimization (GRPO)-based reinforcement
fine-tuning, enabling the model to explore diverse reasoning paths and optimize
for correct and robust solutions. Extensive experiments across various vision
reward tasks demonstrate the superiority of our model.
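The abstract describes a three-stage training recipe: cold-start supervised fine-tuning on GPT-4o-distilled CoT judgments, rejection sampling that keeps correctly reasoned rollouts, and GRPO-based reinforcement fine-tuning on the remaining incorrect samples. The sketch below is a minimal illustration of how such a pipeline could be wired together; the data layout (`PreferencePair`), the stub hooks (`generate_cot_judgment`, `sft_update`, `grpo_update`), and the binary verifiable reward are hypothetical stand-ins, not the authors' released implementation.

```python
"""Minimal sketch of the three-stage recipe from the abstract:
(1) cold-start SFT on distilled CoT judgments,
(2) rejection sampling of correct CoT rollouts,
(3) GRPO on the remaining incorrectly judged samples.
All model/data hooks are hypothetical stand-ins."""
import random
import statistics
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str    # task instruction plus image references (placeholder)
    chosen: str    # human-preferred response or generated image
    rejected: str  # dispreferred response or generated image

# --- hypothetical model hooks (stubs so the sketch runs end to end) ---
def generate_cot_judgment(model, pair: PreferencePair) -> tuple[str, str]:
    """Roll out one chain-of-thought judgment; returns (cot_text, verdict)."""
    verdict = random.choice(["A", "B"])  # stub: a real model decodes text here
    return f"<think>step-by-step analysis of {pair.prompt}</think>", verdict

def sft_update(model, examples):
    """Stub supervised fine-tuning step on (pair, cot, verdict) examples."""

def grpo_update(model, pair, rollouts, advantages):
    """Stub GRPO policy update weighted by group-relative advantages."""
# ----------------------------------------------------------------------

def verifiable_reward(verdict: str, ground_truth: str) -> float:
    """Binary reward: 1 if the final verdict matches the labeled preference."""
    return 1.0 if verdict == ground_truth else 0.0

def train(model, cold_start_data, unified_data, group_size=8):
    # Stage 1: cold start on distilled GPT-4o reasoning traces
    # so the model learns the CoT format and structure.
    sft_update(model, cold_start_data)

    # Stage 2: rejection sampling over large-scale unified preference data;
    # keep rollouts whose verdict is correct, set aside the rest.
    retained, hard_cases = [], []
    for pair, label in unified_data:
        cot, verdict = generate_cot_judgment(model, pair)
        if verifiable_reward(verdict, label) == 1.0:
            retained.append((pair, cot, verdict))
        else:
            hard_cases.append((pair, label))
    sft_update(model, retained)

    # Stage 3: GRPO on the incorrectly judged samples, scoring a group of
    # sampled reasoning paths against the verifiable preference label.
    for pair, label in hard_cases:
        rollouts = [generate_cot_judgment(model, pair) for _ in range(group_size)]
        rewards = [verifiable_reward(v, label) for _, v in rollouts]
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
        advantages = [(r - mean) / std for r in rewards]  # group-relative baseline
        grpo_update(model, pair, rollouts, advantages)
```

In this reading, the reward is verifiable because each preference pair carries a ground-truth choice, which lets GRPO score sampled reasoning paths against a group-relative baseline rather than a learned critic; the exact prompts, reward shaping, and optimizer details in the paper may differ.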