R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

May 5, 2025
作者: Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang
cs.AI

Abstract

Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference samples from diverse datasets. Our reward model, R1-Reward, trained with the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an 8.4% improvement on the VL Reward-Bench and a 14.3% improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.
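The abstract only names the three ingredients StableReinforce refines (training loss, advantage estimation strategy, and reward design) without giving their formulas. The sketch below illustrates, under stated assumptions, what a stabilized Reinforce++/PPO-style policy loss of this flavor could look like: the clipping bounds, the outlier rule for advantages, and the function name stable_policy_loss are illustrative choices, not the paper's actual formulation.

```python
# Illustrative sketch only: constants, the advantage-outlier rule, and naming are
# assumptions; the paper's exact StableReinforce loss may differ.
import torch

def stable_policy_loss(logp_new, logp_old, advantages,
                       ratio_clip=0.2, log_ratio_bound=2.0, adv_sigma=3.0):
    """Hypothetical stabilized policy-gradient loss for reward-model RL training.

    - Bounds the log-probability ratio *before* exponentiation, so a single
      outlier token cannot produce an exploding importance weight.
    - Normalizes advantages and zeroes out extreme values beyond `adv_sigma`
      standard deviations instead of letting them dominate the batch.
    """
    # Clamp the log-ratio first, then exponentiate (numerically safer than
    # clamping the ratio itself).
    log_ratio = torch.clamp(logp_new - logp_old, -log_ratio_bound, log_ratio_bound)
    ratio = torch.exp(log_ratio)

    # Normalize advantages and mask outliers (assumed filtering rule).
    mean, std = advantages.mean(), advantages.std().clamp_min(1e-6)
    mask = ((advantages - mean).abs() / std <= adv_sigma).float()
    adv = (advantages - mean) / std * mask

    # Standard PPO-style clipped surrogate applied to the stabilized quantities.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - ratio_clip, 1 + ratio_clip) * adv
    return -torch.min(unclipped, clipped).mean()
```

In practice such a loss would be dropped into an existing Reinforce++/PPO training loop; the rule-based reward itself (for example, checking whether the model's predicted preference matches the ground-truth label from the 200K preference data) is computed outside this function and only enters through the advantages.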
