Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
May 6, 2025
Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
cs.AI
Abstract
Recent advances in multimodal Reward Models (RMs) have shown significant
promise in delivering reward signals to align vision models with human
preferences. However, current RMs are generally restricted to providing direct
responses or engaging in shallow reasoning processes with limited depth, often
leading to inaccurate reward signals. We posit that incorporating explicit long
chains of thought (CoT) into the reward reasoning process can significantly
strengthen their reliability and robustness. Furthermore, we believe that once
RMs internalize CoT reasoning, their direct response accuracy can also be
improved through implicit reasoning capabilities. To this end, this paper
proposes UnifiedReward-Think, the first unified multimodal CoT-based reward
model, capable of multi-dimensional, step-by-step long-chain reasoning for both
visual understanding and generation reward tasks. Specifically, we adopt an
exploration-driven reinforcement fine-tuning approach to elicit and incentivize
the model's latent complex reasoning ability: (1) We first use a small amount
of image generation preference data to distill the reasoning process of GPT-4o,
which is then used for the model's cold start to learn the format and structure
of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge
and generalization capabilities, we prepare large-scale unified multimodal
preference data to elicit the model's reasoning process across various vision
tasks. During this phase, correct reasoning outputs are retained for rejection
sampling to refine the model, (3) while incorrectly predicted samples are finally
used for Group Relative Policy Optimization (GRPO)-based reinforcement
fine-tuning, enabling the model to explore diverse reasoning paths and optimize
for correct and robust solutions. Extensive experiments across various vision
reward tasks demonstrate the superiority of our model.
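As a rough illustration of the GRPO step described in the abstract, the sketch below shows how a group of CoT rollouts sampled for a single preference prompt could be scored with a verifiable reward and converted into group-relative advantages. This is a minimal sketch under assumptions: the `Rollout` structure, the format bonus, and the reward values are hypothetical and are not taken from the paper.

```python
# Minimal sketch of GRPO-style group-relative advantages for a CoT reward model.
# Hypothetical illustration: the reward is 1 if the rollout's final verdict
# matches the ground-truth preference (plus a small format bonus), which is an
# assumption, not the paper's exact recipe.
from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class Rollout:
    cot_text: str          # sampled chain-of-thought reasoning
    predicted_choice: str  # e.g. "Image 1" or "Image 2"
    well_formatted: bool   # whether the output follows the CoT template


def rollout_reward(r: Rollout, ground_truth: str) -> float:
    """Verifiable reward: correctness of the final verdict plus a format bonus."""
    accuracy = 1.0 if r.predicted_choice == ground_truth else 0.0
    fmt = 0.1 if r.well_formatted else 0.0
    return accuracy + fmt


def group_relative_advantages(rollouts: List[Rollout], ground_truth: str) -> List[float]:
    """For rollouts sampled from the same prompt, the advantage of each rollout
    is its reward standardized by the group's mean and standard deviation,
    so no learned value function is required."""
    rewards = [rollout_reward(r, ground_truth) for r in rollouts]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(rw - mean) / std for rw in rewards]


if __name__ == "__main__":
    group = [
        Rollout("step-by-step analysis ...", "Image 1", True),
        Rollout("shallow guess ...", "Image 2", True),
        Rollout("unstructured answer ...", "Image 1", False),
    ]
    print(group_relative_advantages(group, ground_truth="Image 1"))
```

Standardizing rewards within the sampled group, rather than against a learned critic, is what lets this kind of reinforcement fine-tuning reward correct and well-structured reasoning paths while penalizing incorrect ones using only preference labels.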