Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

May 12, 2025
Authors: Xiaokun Wang, Chris, Jiangbo Pei, Wei Shen, Yi Peng, Yunzhuo Hao, Weijie Qiu, Ai Jian, Tianyidan Xie, Xuchen Song, Yang Liu, Yahui Zhou
cs.AI

Abstract

We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our technical approach comprises two key components. First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM reasoners. Second, we design a reward model architecture based on Qwen2.5-VL-7B-Instruct, integrating a reward head and applying multi-stage fine-tuning with a pairwise ranking loss on paired preference data. Experimental evaluations show that Skywork-VL Reward achieves state-of-the-art results on the multimodal VL-RewardBench and competitive performance on the text-only RewardBench benchmark. Furthermore, preference data constructed with Skywork-VL Reward proves highly effective for Mixed Preference Optimization (MPO) training, leading to significant improvements in multimodal reasoning capabilities. Our results underscore Skywork-VL Reward as a significant advancement toward general-purpose, reliable reward models for multimodal alignment. Our model has been publicly released to promote transparency and reproducibility.
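
The abstract describes a scalar reward head on top of the VLM backbone trained with a pairwise ranking loss over preference pairs. The sketch below is only a minimal illustration of how such a head and a Bradley-Terry-style pairwise ranking objective are commonly implemented; the class names, shapes, and hidden size here are our own assumptions for illustration, not the released Skywork-VL Reward code.

```python
# Hedged sketch: scalar reward head + pairwise ranking (Bradley-Terry) loss.
# `RewardHead`, tensor shapes, and the hidden size are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardHead(nn.Module):
    """Maps the backbone's final hidden states to a scalar reward per sequence."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Summarize each sequence with the hidden state of its final token.
        return self.score(last_hidden_state[:, -1, :]).squeeze(-1)


def pairwise_ranking_loss(reward_chosen: torch.Tensor,
                          reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# Illustrative usage with dummy backbone outputs for a batch of preference pairs.
hidden_size = 3584  # assumed hidden size of a 7B-scale backbone
head = RewardHead(hidden_size)
h_chosen = torch.randn(4, 128, hidden_size)    # (batch, seq_len, hidden) for preferred responses
h_rejected = torch.randn(4, 128, hidden_size)  # same prompts/images, dispreferred responses
loss = pairwise_ranking_loss(head(h_chosen), head(h_rejected))
loss.backward()
```

In this formulation the model is only trained to rank the preferred response above the rejected one for the same multimodal prompt, which is why the dataset is organized as chosen/rejected pairs rather than absolute scores.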