ViLBench: A Suite for Vision-Language Process Reward Modeling

Abstract

Process-supervised reward models (PRMs) provide fine-grained, step-wise feedback on model responses, which helps select effective reasoning trajectories for complex tasks. Despite their advantages, the evaluation of PRMs remains underexplored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models, output reward models (ORMs) and process reward models (PRMs), on multiple vision-language benchmarks. The results show that neither ORMs nor PRMs consistently outperform the other across all tasks, and that stronger VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating how challenging the benchmark is for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models: by collecting 73.6K vision-language process reward data using an enhanced tree-search algorithm, our 3B model achieves an average improvement of 3.3% over standard CoT and up to 2.5% over its untrained counterpart on ViLBench when selecting among OpenAI o1's generations. We release the implementations at https://ucsc-vlaa.github.io/ViLBench with our code, model, and data.
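To make the ORM/PRM distinction concrete, the following is a minimal sketch of selecting one response from several candidates with each type of reward model. It is not the paper's implementation: the `Candidate` class, both selection functions, the reward values, and the choice of aggregating step rewards by mean or min are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """One sampled response, split into reasoning steps (hypothetical structure)."""
    steps: List[str]
    step_rewards: List[float]  # assumed PRM output: one score per step
    outcome_reward: float      # assumed ORM output: one score for the whole response


def select_with_orm(candidates: Sequence[Candidate]) -> Candidate:
    """Pick the candidate with the highest single outcome-level score."""
    return max(candidates, key=lambda c: c.outcome_reward)


def select_with_prm(
    candidates: Sequence[Candidate],
    aggregate: Callable[[List[float]], float] = mean,
) -> Candidate:
    """Pick the candidate whose aggregated step-level scores are highest.

    The aggregation function (mean by default, min as an alternative)
    is an assumption of this sketch, not something the abstract specifies.
    """
    return max(candidates, key=lambda c: aggregate(c.step_rewards))


# Made-up scores: the first candidate has a weak middle step but a high
# outcome score; the second is uniformly solid at every step.
candidates = [
    Candidate(["read chart", "misread axis", "compute answer"], [0.9, 0.2, 0.9], 0.8),
    Candidate(["read chart", "read axis", "compute answer"], [0.7, 0.7, 0.7], 0.6),
]

orm_choice = select_with_orm(candidates)
prm_choice = select_with_prm(candidates)
prm_min_choice = select_with_prm(candidates, aggregate=min)
```

With these illustrative numbers the outcome-level selector prefers the first candidate, while both step-level aggregations prefer the second, because the step scores expose the weak intermediate step that the single outcome score hides.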
