UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
March 19, 2025
作者: Qihui Zhang, Munan Ning, Zheyuan Liu, Yanbo Wang, Jiayi Ye, Yue Huang, Shuo Yang, Xiao Chen, Yibing Song, Li Yuan
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have emerged to tackle the
challenges of Visual Question Answering (VQA), sparking a new research focus on
conducting objective evaluations of these models. Existing evaluation methods
face limitations due to the significant human workload required to design Q&A
pairs for visual images, which inherently restricts the scale and scope of
evaluations. Although automated MLLM-as-judge approaches attempt to reduce the
human workload through automatic evaluations, they often introduce biases. To
address these problems, we propose an Unsupervised Peer review MLLM Evaluation
(UPME) framework. It utilizes only image data, allowing models to automatically
generate questions and peer-review the answers of other models, effectively
reducing the reliance on human effort. Additionally, we introduce a
vision-language scoring system to mitigate bias,
which focuses on three aspects: (i) response correctness; (ii) visual
understanding and reasoning; and (iii) image-text correlation. Experimental
results demonstrate that UPME achieves a Pearson correlation of 0.944 with
human evaluations on the MMStar dataset and 0.814 on the ScienceQA dataset,
indicating that our framework closely aligns with human-designed benchmarks and
inherent human preferences.
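The abstract describes the peer-review loop only at a high level. Below is a minimal, hypothetical sketch of such a loop for illustration: the function names (`generate_question`, `answer`, `review`), the pairwise review pattern, and the equal weighting of the three scoring aspects are assumptions for this sketch, not the paper's actual UPME implementation.

```python
# Hypothetical sketch of an unsupervised peer-review evaluation loop in the
# spirit of the abstract. All interfaces and the aggregation scheme are
# assumptions; the real UPME method may weight and optimize scores differently.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MLLM:
    name: str
    generate_question: Callable[[bytes], str]               # image -> question
    answer: Callable[[bytes, str], str]                      # (image, question) -> answer
    review: Callable[[bytes, str, str], Dict[str, float]]    # -> aspect scores in [0, 1]

# The three scoring aspects named in the abstract.
ASPECTS = ("correctness", "visual_reasoning", "image_text_correlation")

def upme_scores(models: List[MLLM], images: List[bytes]) -> Dict[str, float]:
    """Return an average peer-review score per model, using only image data."""
    totals = {m.name: 0.0 for m in models}
    counts = {m.name: 0 for m in models}
    for image in images:
        for asker in models:
            question = asker.generate_question(image)
            for candidate in models:
                if candidate is asker:
                    continue  # a model does not answer its own question
                response = candidate.answer(image, question)
                for reviewer in models:
                    if reviewer is candidate:
                        continue  # a model does not grade its own answer
                    scores = reviewer.review(image, question, response)
                    # Equal weighting of the three aspects is an assumption.
                    totals[candidate.name] += sum(scores[a] for a in ASPECTS) / len(ASPECTS)
                    counts[candidate.name] += 1
    return {name: totals[name] / max(counts[name], 1) for name in totals}
```

In this sketch each model both poses questions and reviews its peers' answers, so no human-written Q&A pairs are required, which mirrors the unsupervised setting the abstract describes.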