VHELM: A Holistic Evaluation of Vision Language Models

October 9, 2024
Authors: Tony Lee, Haoqin Tu, Chi Heem Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin Somerville Roberts, Michihiro Yasunaga, Huaxiu Yao, Cihang Xie, Percy Liang
cs.AI

Abstract

Current benchmarks for assessing vision-language models (VLMs) often focus on their perception or problem-solving capabilities and neglect other critical aspects such as fairness, multilinguality, or toxicity. Furthermore, they differ in their evaluation procedures and the scope of the evaluation, making it difficult to compare models. To address these issues, we extend the HELM framework to VLMs to present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. In doing so, we produce a comprehensive, multi-dimensional view of the capabilities of the VLMs across these important factors. In addition, we standardize the inference parameters, methods of prompting, and evaluation metrics to enable fair comparisons across models. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast. Our initial run evaluates 22 VLMs on 21 existing datasets to provide a holistic snapshot of the models. We uncover new key findings, such as the fact that efficiency-focused models (e.g., Claude 3 Haiku or Gemini 1.5 Flash) perform significantly worse than their full models (e.g., Claude 3 Opus or Gemini 1.5 Pro) on the bias benchmark but not when evaluated on the other aspects. For transparency, we release the raw model generations and complete results on our website (https://crfm.stanford.edu/helm/vhelm/v2.0.1). VHELM is intended to be a living benchmark, and we hope to continue adding new datasets and models over time.
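
The core mechanism the abstract describes, running every model over every dataset with the same inference parameters, prompting, and metrics, and then aggregating scores by aspect, can be illustrated with a minimal sketch. This is not the actual HELM/VHELM API; the names below (Scenario, evaluate, INFERENCE_PARAMS, the dummy model, and the exact-match metric) are hypothetical stand-ins used only to show the shape of the evaluation loop.

```python
# Minimal sketch (assumed, not the HELM/VHELM API) of standardized multi-aspect evaluation:
# every model sees the same instances, prompts, and inference parameters, and scores are
# grouped by aspect so models can be compared fairly across dimensions like bias or toxicity.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List

# One shared set of inference parameters, applied to every model under evaluation.
INFERENCE_PARAMS = {"temperature": 0.0, "max_tokens": 512}

@dataclass
class Scenario:
    name: str
    aspect: str            # e.g. "visual perception", "bias", "toxicity"
    instances: List[dict]  # each instance holds a prompt (and image reference) plus a reference answer

def evaluate(model: Callable[[dict, dict], str],
             scenarios: List[Scenario],
             metric: Callable[[str, dict], float]) -> Dict[str, float]:
    """Run one model over all scenarios and average the metric per aspect."""
    per_aspect: Dict[str, List[float]] = defaultdict(list)
    for scenario in scenarios:
        for instance in scenario.instances:
            prediction = model(instance, INFERENCE_PARAMS)
            per_aspect[scenario.aspect].append(metric(prediction, instance))
    return {aspect: sum(scores) / len(scores) for aspect, scores in per_aspect.items()}

if __name__ == "__main__":
    # Toy usage: one scenario, a stand-in model, and an exact-match metric.
    scenarios = [Scenario("toy-vqa", "visual perception",
                          [{"prompt": "What color is a clear daytime sky?", "reference": "blue"}])]
    dummy_model = lambda instance, params: "blue"
    exact_match = lambda pred, instance: float(pred.strip().lower() == instance["reference"])
    print(evaluate(dummy_model, scenarios, exact_match))  # {'visual perception': 1.0}
```

In the real benchmark the per-aspect aggregation is what surfaces findings like the bias gap between efficiency-focused models and their full counterparts: a model can score well on most aspects while one aspect's average reveals a regression.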
