

VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

March 9, 2025
Authors: Yanling Wang, Yihan Zhao, Xiaodong Chen, Shasha Guo, Lixin Liu, Haoyang Li, Yong Xiao, Jing Zhang, Qi Li, Ke Xu
cs.AI

Abstract

Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.
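
Since the dataset is hosted on the Hugging Face Hub, it can presumably be loaded with the `datasets` library. The sketch below is illustrative only and is not taken from the paper; the split and column names are assumptions, so inspect the loaded object before relying on any particular schema.

```python
# Minimal sketch: load VisualSimpleQA from the Hugging Face Hub.
# Assumes the `datasets` library is installed (pip install datasets).
# Split names and fields are NOT confirmed by the paper; print the
# dataset object to discover the actual structure.
from datasets import load_dataset

ds = load_dataset("WYLing/VisualSimpleQA")
print(ds)  # shows available splits and their columns/sizes
```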
