

VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

March 9, 2025
Authors: Yanling Wang, Yihan Zhao, Xiaodong Chen, Shasha Guo, Lixin Liu, Haoyang Li, Yong Xiao, Jing Zhang, Qi Li, Ke Xu
cs.AI

Abstract

Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.
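
Since the dataset is hosted on the Hugging Face Hub, it can presumably be loaded with the `datasets` library. The sketch below is illustrative only and is not taken from the paper; the split and column names are assumptions, so inspect the loaded object before relying on any particular schema.

```python
# Minimal sketch: load VisualSimpleQA from the Hugging Face Hub.
# Assumes the `datasets` library is installed (pip install datasets).
# Split names and fields are NOT confirmed by the paper; print the
# dataset object to discover the actual structure.
from datasets import load_dataset

ds = load_dataset("WYLing/VisualSimpleQA")
print(ds)  # shows available splits and their columns/sizes
```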
