Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
May 12, 2025
Authors: Xiaokun Wang, Chris, Jiangbo Pei, Wei Shen, Yi Peng, Yunzhuo Hao, Weijie Qiu, Ai Jian, Tianyidan Xie, Xuchen Song, Yang Liu, Yahui Zhou
cs.AI
Abstract
We propose Skywork-VL Reward, a multimodal reward model that provides reward
signals for both multimodal understanding and reasoning tasks. Our technical
approach comprises two key components: First, we construct a large-scale
multimodal preference dataset that covers a wide range of tasks and scenarios,
with responses collected from both standard vision-language models (VLMs) and
advanced VLM reasoners. Second, we design a reward model architecture based on
Qwen2.5-VL-7B-Instruct, integrating a reward head and applying multi-stage
fine-tuning on paired preference data with a pairwise ranking loss.
Experimental evaluations show that Skywork-VL Reward achieves state-of-the-art
results on the multimodal VL-RewardBench and exhibits competitive performance on
the text-only RewardBench benchmark. Furthermore, preference data constructed
with our Skywork-VL Reward proves highly effective for training with Mixed
Preference Optimization (MPO), leading to significant improvements in
multimodal reasoning capabilities. Our results underscore Skywork-VL Reward as
a significant advancement toward general-purpose, reliable reward models for
multimodal alignment. Our model has been publicly released to promote
transparency and reproducibility.
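The following is a minimal sketch of the reward-head architecture and pairwise ranking objective described in the abstract, not the authors' implementation. It assumes the reward head is a single linear projection over the backbone's last hidden state and that the ranking loss is the standard Bradley-Terry form; the backbone call signature is hypothetical and stands in for a Qwen2.5-VL-7B-Instruct model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Sketch: a VLM backbone with a scalar reward head.

    Assumption: the head is one linear layer over the final hidden state
    of the last token; the actual Skywork-VL Reward head may differ.
    """

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone              # e.g. a Qwen2.5-VL-style VLM (hypothetical interface)
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask, pixel_values=None):
        # Hypothetical backbone call; real VLM APIs differ in signature.
        hidden = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
        ).last_hidden_state                    # (batch, seq_len, hidden)
        last_token = hidden[:, -1, :]          # pool the final token state
        return self.reward_head(last_token).squeeze(-1)  # one scalar reward per sample


def pairwise_ranking_loss(reward_chosen: torch.Tensor,
                          reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise ranking loss:
    -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```

Under this reading, multi-stage fine-tuning would minimize this loss over (chosen, rejected) response pairs drawn from the multimodal preference dataset.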