VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

Abstract

Large vision-language models (LVLMs) have achieved remarkable results, yet they still frequently generate non-factual responses in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks focus primarily on comparing model outputs with ground-truth answers, which offers limited insight into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined, decoupled evaluation of LVLMs in the visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria that guide human annotation and facilitate the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve only 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, decoupled evaluation of these models reveals substantial room for improvement in both the visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.
