Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
May 6, 2025
Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
cs.AI
Abstract
Recent advances in multimodal Reward Models (RMs) have shown significant
promise in delivering reward signals to align vision models with human
preferences. However, current RMs are generally restricted to providing direct
responses or engaging in shallow reasoning processes with limited depth, often
leading to inaccurate reward signals. We posit that incorporating explicit long
chains of thought (CoT) into the reward reasoning process can significantly
strengthen their reliability and robustness. Furthermore, we believe that once
RMs internalize CoT reasoning, their direct response accuracy can also be
improved through implicit reasoning capabilities. To this end, this paper
proposes UnifiedReward-Think, the first unified multimodal CoT-based reward
model, capable of multi-dimensional, step-by-step long-chain reasoning for both
visual understanding and generation reward tasks. Specifically, we adopt an
exploration-driven reinforcement fine-tuning approach to elicit and incentivize
the model's latent complex reasoning ability: (1) We first use a small amount
of image generation preference data to distill the reasoning process of GPT-4o,
which is then used for the model's cold start to learn the format and structure
of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge
and generalization capabilities, we prepare large-scale unified multimodal
preference data to elicit the model's reasoning process across various vision
tasks. During this phase, correct reasoning outputs are retained for rejection
sampling to refine the model, (3) while incorrectly predicted samples are finally
used for Group Relative Policy Optimization (GRPO)-based reinforcement
fine-tuning, enabling the model to explore diverse reasoning paths and optimize
for correct and robust solutions. Extensive experiments across various vision
reward tasks demonstrate the superiority of our model.
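As a rough illustration of the GRPO step described in the abstract, the sketch below shows how a group of CoT rollouts sampled for a single preference prompt could be scored with a verifiable reward and converted into group-relative advantages. This is a minimal sketch under assumptions: the `Rollout` structure, the format bonus, and the reward values are hypothetical and are not taken from the paper.

```python
# Minimal sketch of GRPO-style group-relative advantages for a CoT reward model.
# Hypothetical illustration: the reward is 1 if the rollout's final verdict
# matches the ground-truth preference (plus a small format bonus), which is an
# assumption, not the paper's exact recipe.
from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class Rollout:
    cot_text: str          # sampled chain-of-thought reasoning
    predicted_choice: str  # e.g. "Image 1" or "Image 2"
    well_formatted: bool   # whether the output follows the CoT template


def rollout_reward(r: Rollout, ground_truth: str) -> float:
    """Verifiable reward: correctness of the final verdict plus a format bonus."""
    accuracy = 1.0 if r.predicted_choice == ground_truth else 0.0
    fmt = 0.1 if r.well_formatted else 0.0
    return accuracy + fmt


def group_relative_advantages(rollouts: List[Rollout], ground_truth: str) -> List[float]:
    """For rollouts sampled from the same prompt, the advantage of each rollout
    is its reward standardized by the group's mean and standard deviation,
    so no learned value function is required."""
    rewards = [rollout_reward(r, ground_truth) for r in rollouts]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(rw - mean) / std for rw in rewards]


if __name__ == "__main__":
    group = [
        Rollout("step-by-step analysis ...", "Image 1", True),
        Rollout("shallow guess ...", "Image 2", True),
        Rollout("unstructured answer ...", "Image 1", False),
    ]
    print(group_relative_advantages(group, ground_truth="Image 1"))
```

Standardizing rewards within the sampled group, rather than against a learned critic, is what lets this kind of reinforcement fine-tuning reward correct and well-structured reasoning paths while penalizing incorrect ones using only preference labels.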