NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Abstract

Vision-language models (VLMs) have recently made significant progress on visual question answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a vision-centric design: each question is paired with two images that yield different answers, so a blind solution that ignores the images cannot answer both correctly. This makes NaturalBench more challenging than previous benchmarks that can be solved with commonsense priors. We evaluate 53 state-of-the-art VLMs on NaturalBench, showing that models like LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is hard from two angles: (1) Compositionality: Solving NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning like logic and counting. To this end, unlike prior work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2) Biases: NaturalBench exposes severe biases in VLMs, as models often choose the same answer regardless of the image. Lastly, we apply our benchmark curation method to diverse data sources, including long captions (over 100 words) and non-English languages like Chinese and Hindi, highlighting its potential for dynamic evaluations of VLMs.
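To make the paired-image design concrete, the following is a minimal illustrative sketch, not the benchmark's official scoring code. It assumes yes/no answers and a hypothetical "paired accuracy" that credits a sample only when the model is right on both images. The class `PairedSample`, the three metric functions, and the toy questions are all assumptions introduced for illustration. The sketch shows that a blind model that always answers "yes" reaches 50% per-image accuracy yet scores zero under paired scoring, and that its bias shows up as a same-answer rate of 1.0.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PairedSample:
    """One question asked about two images (hypothetical structure)."""
    question: str
    answers: Tuple[str, str]      # ground truth for image 0 and image 1 (they differ)
    predictions: Tuple[str, str]  # model output for image 0 and image 1


def per_image_accuracy(samples: List[PairedSample]) -> float:
    """Fraction of individual (question, image) pairs answered correctly."""
    hits = sum(p == a for s in samples for p, a in zip(s.predictions, s.answers))
    return hits / (2 * len(samples))


def paired_accuracy(samples: List[PairedSample]) -> float:
    """Fraction of questions answered correctly on BOTH images."""
    return sum(s.predictions == s.answers for s in samples) / len(samples)


def same_answer_rate(samples: List[PairedSample]) -> float:
    """Fraction of questions where the model gave the same answer for both images."""
    return sum(s.predictions[0] == s.predictions[1] for s in samples) / len(samples)


# Toy ground truth: the two images of each question always have different answers.
truth = [
    ("Is the dog on the sofa?", ("yes", "no")),
    ("Is the cup full?", ("no", "yes")),
    ("Is the man holding an umbrella?", ("yes", "no")),
    ("Is the door open?", ("no", "yes")),
]

# A blind model that ignores the image and always says "yes".
blind = [PairedSample(q, a, ("yes", "yes")) for q, a in truth]

print(per_image_accuracy(blind))  # 0.5
print(paired_accuracy(blind))     # 0.0
print(same_answer_rate(blind))    # 1.0
```

Because the two ground-truth answers always differ, any model that repeats one answer for both images is guaranteed to miss at least one of them, which is why paired scoring separates image-grounded answers from prior-driven guesses.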
