
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

May 5, 2025
Authors: Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang
cs.AI

Abstract

Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference samples from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an 8.4% improvement on the VL Reward-Bench and a 14.3% improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.
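To make the "rule-based RL task" framing concrete, below is a minimal illustrative sketch (not the authors' code) of how a preference pair can be scored with a rule-based reward: the reward model under training is prompted to judge a pair of candidate responses, and the RL reward is 1 when its verdict matches the human preference label and 0 otherwise. All names here (PreferencePair, judge_prompt, rule_based_reward) are hypothetical; the actual prompt format and reward design in StableReinforce are not specified in the abstract.

```python
# Hypothetical sketch of casting reward modeling as a rule-based RL task.
# Assumption: the policy (the MRM being trained) emits a final verdict "A" or "B"
# after reasoning; the rule-based reward simply checks it against the human label.

from dataclasses import dataclass


@dataclass
class PreferencePair:
    question: str      # (multimodal) query; the image is omitted in this text-only sketch
    response_a: str    # candidate answer A
    response_b: str    # candidate answer B
    preferred: str     # human preference label: "A" or "B"


def judge_prompt(pair: PreferencePair) -> str:
    """Prompt asking the model to reason, then name the better response."""
    return (
        f"Question: {pair.question}\n"
        f"Response A: {pair.response_a}\n"
        f"Response B: {pair.response_b}\n"
        "Think step by step, then answer with exactly 'A' or 'B'."
    )


def rule_based_reward(model_verdict: str, pair: PreferencePair) -> float:
    """Rule-based RL reward: 1.0 if the model's verdict matches the human label."""
    return 1.0 if model_verdict.strip().upper() == pair.preferred else 0.0


# Usage: score one rollout from the policy on a single preference example.
pair = PreferencePair(
    question="Which caption better describes the image?",
    response_a="A dog playing fetch in a park.",
    response_b="A cat sleeping on a sofa.",
    preferred="A",
)
print(judge_prompt(pair))
print(rule_based_reward("A", pair))  # -> 1.0
```

In this framing, the 200K preference samples supply the human labels, and an RL algorithm (StableReinforce in the paper) optimizes the reward model's reasoning-then-verdict policy against this binary rule-based signal.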
