

VHELM: A Holistic Evaluation of Vision Language Models

October 9, 2024
Authors: Tony Lee, Haoqin Tu, Chi Heem Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin Somerville Roberts, Michihiro Yasunaga, Huaxiu Yao, Cihang Xie, Percy Liang
cs.AI

Abstract

Current benchmarks for assessing vision-language models (VLMs) often focus on their perception or problem-solving capabilities and neglect other critical aspects such as fairness, multilinguality, or toxicity. Furthermore, they differ in their evaluation procedures and the scope of the evaluation, making it difficult to compare models. To address these issues, we extend the HELM framework to VLMs to present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. In doing so, we produce a comprehensive, multi-dimensional view of the capabilities of the VLMs across these important factors. In addition, we standardize the inference parameters, methods of prompting, and evaluation metrics to enable fair comparisons across models. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast. Our initial run evaluates 22 VLMs on 21 existing datasets to provide a holistic snapshot of the models. We uncover new key findings, such as the fact that efficiency-focused models (e.g., Claude 3 Haiku or Gemini 1.5 Flash) perform significantly worse than their full models (e.g., Claude 3 Opus or Gemini 1.5 Pro) on the bias benchmark but not when evaluated on the other aspects. For transparency, we release the raw model generations and complete results on our website (https://crfm.stanford.edu/helm/vhelm/v2.0.1). VHELM is intended to be a living benchmark, and we hope to continue adding new datasets and models over time.
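The abstract describes VHELM as aggregating results from many datasets into scores along nine aspects. The sketch below is a minimal, illustrative take on that aggregation idea only, not the authors' implementation: the dataset names, score values, and the `aspect_means` helper are all invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# The 9 aspects evaluated by VHELM, as listed in the abstract.
ASPECTS = [
    "visual perception", "knowledge", "reasoning", "bias", "fairness",
    "multilinguality", "robustness", "toxicity", "safety",
]

# Hypothetical per-dataset scores for a single model; the names and
# values are placeholders, not figures from the VHELM leaderboard.
dataset_scores = {
    ("visual perception", "dataset_a"): 0.71,
    ("visual perception", "dataset_b"): 0.64,
    ("bias", "dataset_c"): 0.55,
    ("safety", "dataset_d"): 0.82,
}

def aspect_means(scores):
    """Average per-dataset scores within each aspect (illustrative helper)."""
    by_aspect = defaultdict(list)
    for (aspect, _dataset), value in scores.items():
        by_aspect[aspect].append(value)
    return {aspect: mean(values) for aspect, values in by_aspect.items()}

print(aspect_means(dataset_scores))
```

A per-aspect summary like this is what makes the cross-model comparisons in the paper possible: every model is scored on the same aspect axes, so a model can be strong on perception yet weak on, say, bias, as the abstract's Claude 3 Haiku / Gemini 1.5 Flash finding illustrates.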
