Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

May 12, 2025
Authors: Xiaokun Wang, Chris, Jiangbo Pei, Wei Shen, Yi Peng, Yunzhuo Hao, Weijie Qiu, Ai Jian, Tianyidan Xie, Xuchen Song, Yang Liu, Yahui Zhou
cs.AI

Abstract

We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our technical approach comprises two key components. First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM reasoners. Second, we design a reward model architecture based on Qwen2.5-VL-7B-Instruct, integrating a reward head and applying multi-stage fine-tuning with a pairwise ranking loss on paired preference data. Experimental evaluations show that Skywork-VL Reward achieves state-of-the-art results on the multimodal VL-RewardBench and competitive performance on the text-only RewardBench benchmark. Furthermore, preference data constructed with Skywork-VL Reward proves highly effective for Mixed Preference Optimization (MPO) training, leading to significant improvements in multimodal reasoning capabilities. Our results underscore Skywork-VL Reward as a significant advancement toward general-purpose, reliable reward models for multimodal alignment. Our model has been publicly released to promote transparency and reproducibility.
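
The abstract describes a scalar reward head on top of the VLM backbone trained with a pairwise ranking loss over preference pairs. The sketch below is only a minimal illustration of how such a head and a Bradley-Terry-style pairwise ranking objective are commonly implemented; the class names, shapes, and hidden size here are our own assumptions for illustration, not the released Skywork-VL Reward code.

```python
# Hedged sketch: scalar reward head + pairwise ranking (Bradley-Terry) loss.
# `RewardHead`, tensor shapes, and the hidden size are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardHead(nn.Module):
    """Maps the backbone's final hidden states to a scalar reward per sequence."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Summarize each sequence with the hidden state of its final token.
        return self.score(last_hidden_state[:, -1, :]).squeeze(-1)


def pairwise_ranking_loss(reward_chosen: torch.Tensor,
                          reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# Illustrative usage with dummy backbone outputs for a batch of preference pairs.
hidden_size = 3584  # assumed hidden size of a 7B-scale backbone
head = RewardHead(hidden_size)
h_chosen = torch.randn(4, 128, hidden_size)    # (batch, seq_len, hidden) for preferred responses
h_rejected = torch.randn(4, 128, hidden_size)  # same prompts/images, dispreferred responses
loss = pairwise_ranking_loss(head(h_chosen), head(h_rejected))
loss.backward()
```

In this formulation the model is only trained to rank the preferred response above the rejected one for the same multimodal prompt, which is why the dataset is organized as chosen/rejected pairs rather than absolute scores.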