NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
October 18, 2024
Authors: Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan
cs.AI
Abstract
Vision-language models (VLMs) have made significant progress in recent
visual-question-answering (VQA) benchmarks that evaluate complex
visio-linguistic reasoning. However, are these models truly effective? In this
work, we show that VLMs still struggle with natural images and questions that
humans can easily answer, which we term natural adversarial samples. We also
find it surprisingly easy to generate these VQA samples from natural image-text
corpora using off-the-shelf models like CLIP and ChatGPT. We propose a
semi-automated approach to collect a new benchmark, NaturalBench, for reliably
evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a
vision-centric design by pairing each question with two images that
yield different answers, preventing blind solutions from answering without
using the images. This makes NaturalBench more challenging than previous
benchmarks that can be solved with commonsense priors. We evaluate 53
state-of-the-art VLMs on NaturalBench, showing that models like
LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o
lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is
hard from two angles: (1) Compositionality: Solving NaturalBench requires
diverse visio-linguistic skills, including understanding attribute bindings,
object relationships, and advanced reasoning like logic and counting. To this
end, unlike prior work that uses a single tag per sample, we tag each
NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2)
Biases: NaturalBench exposes severe biases in VLMs, as models often choose the
same answer regardless of the image. Lastly, we apply our benchmark curation
method to diverse data sources, including long captions (over 100 words) and
non-English languages like Chinese and Hindi, highlighting its potential for
dynamic evaluations of VLMs.
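To make the semi-automated curation claim concrete, below is a minimal sketch (not the authors' released pipeline) of how one might mine candidate "natural adversarial" image-text pairs with an off-the-shelf CLIP model: keep the pairs that CLIP misranks, then hand them to an LLM such as ChatGPT to draft questions and to human annotators for verification. The function name `clip_misranks` and the downstream steps in the comments are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of the mining step, assuming one image-caption pair per index.
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_misranks(image_paths, captions):
    """Return (image, caption) pairs whose ground-truth caption is NOT CLIP's top match.

    These mismatched pairs are candidate natural adversarial samples; in a
    semi-automated pipeline they would then be passed to an LLM (e.g. ChatGPT)
    to generate question-answer pairs, followed by human verification.
    """
    images = [Image.open(p) for p in image_paths]
    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: [num_images, num_captions]
    hard_pairs = []
    for i, row in enumerate(logits):
        if row.argmax().item() != i:  # caption i should be the best match for image i
            hard_pairs.append((image_paths[i], captions[i]))
    return hard_pairs
```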
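The vision-centric design can likewise be illustrated with a short sketch: because each question is paired with two images whose ground-truth answers differ, a "blind" model that ignores the image cannot be correct on both. The helper names `paired_question_score` and `ask_vlm` are assumptions for illustration, not the benchmark's official scoring code.

```python
# Minimal sketch of paired scoring, assuming exact-match answer comparison.
from typing import Callable

def paired_question_score(
    question: str,
    image_a: str,   # path or URL of the first image
    image_b: str,   # path or URL of the second image
    answer_a: str,  # ground-truth answer for image_a (e.g. "yes")
    answer_b: str,  # ground-truth answer for image_b (e.g. "no")
    ask_vlm: Callable[[str, str], str],  # (image, question) -> predicted answer
) -> bool:
    """Return True only if the model answers correctly on BOTH paired images."""
    pred_a = ask_vlm(image_a, question).strip().lower()
    pred_b = ask_vlm(image_b, question).strip().lower()
    return pred_a == answer_a.lower() and pred_b == answer_b.lower()

# A blind baseline that always answers "yes" cannot pass the paired check,
# which is why image-agnostic biases show up as low scores on such a benchmark.
blind = lambda image, question: "yes"
assert paired_question_score(
    "Is the door open?", "img1.jpg", "img2.jpg", "yes", "no", blind
) is False
```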