ViLBench: A Suite for Vision-Language Process Reward Modeling
March 26, 2025
Authors: Haoqin Tu, Weitao Feng, Hardy Chen, Hui Liu, Xianfeng Tang, Cihang Xie
cs.AI
Abstract
Process-supervised reward models (PRMs) serve as a fine-grained function that
provides detailed step-wise feedback on model responses, facilitating effective
selection of reasoning trajectories for complex tasks. Despite their advantages,
the evaluation of PRMs remains underexplored, especially in the multimodal
domain. To address this gap, this paper first benchmarks current vision large
language models (VLLMs) as two types of reward models, output reward models
(ORMs) and PRMs, on multiple vision-language benchmarks. The results reveal
that neither ORMs nor PRMs consistently outperform across all tasks, and that
superior VLLMs do not necessarily yield better reward performance. To
further advance evaluation, we introduce ViLBench, a vision-language benchmark
designed to require intensive process reward signals. Notably, OpenAI's GPT-4o
with Chain-of-Thought (CoT) achieves only 27.3% accuracy, underscoring the
benchmark's difficulty for current VLLMs. Lastly, we present a preliminary
demonstration of a promising pathway toward bridging the gap between general
VLLMs and reward models: by collecting 73.6K vision-language process reward
examples with an enhanced tree-search algorithm, our 3B model achieves an
average improvement of 3.3% over standard CoT, and up to 2.5% over its
untrained counterpart, on ViLBench when selecting among OpenAI o1's
generations. We release the implementations at
https://ucsc-vlaa.github.io/ViLBench, including our code, model, and data.
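The abstract's use of a PRM to select among a stronger model's generations is an instance of best-of-N reranking: a process reward model scores each reasoning step, step scores are aggregated per trajectory, and the highest-scoring trajectory is returned. The sketch below is purely illustrative; the `score_step` function, the mean aggregation rule, and all names are assumptions, not the paper's actual implementation.

```python
from typing import Callable, List

def best_of_n(candidates: List[List[str]],
              score_step: Callable[[str], float]) -> List[str]:
    """Return the candidate trajectory with the highest mean step reward.

    Each candidate is a list of reasoning steps; a process reward model
    (here an arbitrary callable) scores every step, and the per-step
    scores are averaged to rank whole trajectories.
    """
    def aggregate(steps: List[str]) -> float:
        return sum(score_step(s) for s in steps) / len(steps)
    return max(candidates, key=aggregate)

# Toy stand-in for a PRM: rewards longer steps (illustration only).
toy_prm = lambda step: float(len(step))

cands = [
    ["short", "steps"],
    ["a much longer reasoning step", "another detailed step"],
]
print(best_of_n(cands, toy_prm))  # picks the longer-step candidate
```

In practice the aggregation rule matters (mean, min, or product of step rewards give different selection behavior); the mean used here is just one common choice.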