Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
May 12, 2025
Authors: Xiaokun Wang, Chris, Jiangbo Pei, Wei Shen, Yi Peng, Yunzhuo Hao, Weijie Qiu, Ai Jian, Tianyidan Xie, Xuchen Song, Yang Liu, Yahui Zhou
cs.AI
Abstract
We propose Skywork-VL Reward, a multimodal reward model that provides reward
signals for both multimodal understanding and reasoning tasks. Our technical
approach comprises two key components: First, we construct a large-scale
multimodal preference dataset that covers a wide range of tasks and scenarios,
with responses collected from both standard vision-language models (VLMs) and
advanced VLM reasoners. Second, we design a reward model architecture based on
Qwen2.5-VL-7B-Instruct, integrating a reward head and applying multi-stage
fine-tuning on paired preference data with a pairwise ranking loss.
Experimental evaluations show that Skywork-VL Reward achieves state-of-the-art
results on the multimodal VL-RewardBench and exhibits competitive performance on
the text-only RewardBench benchmark. Furthermore, preference data constructed
with our Skywork-VL Reward proves highly effective for training with Mixed
Preference Optimization (MPO), leading to significant improvements in
multimodal reasoning capabilities. Our results underscore Skywork-VL Reward as
a significant advancement toward general-purpose, reliable reward models for
multimodal alignment. Our model has been publicly released to promote
transparency and reproducibility.
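The following is a minimal sketch of the reward-head architecture and pairwise ranking objective described in the abstract, not the authors' implementation. It assumes the reward head is a single linear projection over the backbone's last hidden state and that the ranking loss is the standard Bradley-Terry form; the backbone call signature is hypothetical and stands in for a Qwen2.5-VL-7B-Instruct model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Sketch: a VLM backbone with a scalar reward head.

    Assumption: the head is one linear layer over the final hidden state
    of the last token; the actual Skywork-VL Reward head may differ.
    """

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone              # e.g. a Qwen2.5-VL-style VLM (hypothetical interface)
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask, pixel_values=None):
        # Hypothetical backbone call; real VLM APIs differ in signature.
        hidden = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
        ).last_hidden_state                    # (batch, seq_len, hidden)
        last_token = hidden[:, -1, :]          # pool the final token state
        return self.reward_head(last_token).squeeze(-1)  # one scalar reward per sample


def pairwise_ranking_loss(reward_chosen: torch.Tensor,
                          reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise ranking loss:
    -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```

Under this reading, multi-stage fine-tuning would minimize this loss over (chosen, rejected) response pairs drawn from the multimodal preference dataset.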