ViLBench: A Suite for Vision-Language Process Reward Modeling
March 26, 2025
Authors: Haoqin Tu, Weitao Feng, Hardy Chen, Hui Liu, Xianfeng Tang, Cihang Xie
cs.AI
Abstract
Process-supervised reward models serve as fine-grained functions that
provide detailed step-wise feedback on model responses, facilitating effective
selection of reasoning trajectories for complex tasks. Despite these advantages,
the evaluation of process reward models (PRMs) remains underexplored, especially
in the multimodal domain. To address this gap, this paper first benchmarks
current vision large language models (VLLMs) as two types of reward models,
output reward models (ORMs) and PRMs, on multiple vision-language benchmarks.
The results reveal that neither ORMs nor PRMs consistently outperform across all
tasks, and that superior VLLMs do not necessarily yield better rewarding
performance. To further advance evaluation, we introduce ViLBench, a
vision-language benchmark designed to require intensive process reward signals.
Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3%
accuracy, indicating the benchmark's difficulty for current VLLMs. Lastly, we
preliminarily showcase a promising pathway toward bridging the gap between
general VLLMs and reward models: by collecting 73.6K vision-language process
reward data points with an enhanced tree-search algorithm, our 3B model achieves
an average improvement of 3.3% over standard CoT and up to 2.5% over its
untrained counterpart on ViLBench by selecting among OpenAI o1's generations. We
release the implementations at https://ucsc-vlaa.github.io/ViLBench with our
code, model, and data.
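The selection mechanism the abstract alludes to, using a process reward model to pick among candidate reasoning trajectories, can be sketched as a simple best-of-N loop. The sketch below is illustrative only and is not the paper's implementation: `prm_score_step` is a hypothetical stand-in for a trained PRM (in the paper, a fine-tuned 3B VLLM) that scores a single reasoning step, here replaced by a toy length-based heuristic.

```python
# Illustrative best-of-N selection with a process reward model (PRM).
# `prm_score_step` is a hypothetical placeholder; a real PRM would be a
# trained model scoring each reasoning step given the question (and image).

def prm_score_step(question: str, step: str) -> float:
    # Toy heuristic standing in for a learned step-level reward in [0, 1]:
    # reward longer, more concrete steps purely for illustration.
    return min(len(step) / 100.0, 1.0)

def select_best_of_n(question: str, candidates: list[list[str]]) -> list[str]:
    """Return the candidate trajectory with the best mean step-wise score."""
    def trajectory_score(steps: list[str]) -> float:
        scores = [prm_score_step(question, s) for s in steps]
        return sum(scores) / len(scores) if scores else 0.0
    return max(candidates, key=trajectory_score)

# Example: two candidate reasoning trajectories for one question.
candidates = [
    ["Short step.", "Done."],
    ["Read the chart axes carefully to identify the units.",
     "Sum the two bar values: 12 + 30 = 42."],
]
best = select_best_of_n("What is the total of the two bars?", candidates)
```

In the paper's setting the candidates would be sampled generations (e.g. from OpenAI o1), and the aggregation over step scores (mean, min, or product) is a design choice this sketch fixes to the mean for simplicity.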