ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
May 19, 2025
Authors: Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, Ramya Namuduri, Bodun Hu, Juan Diego Rodriguez, Puyuan Peng, Greg Durrett
cs.AI
Abstract
Chart understanding presents a unique challenge for large vision-language
models (LVLMs), as it requires the integration of sophisticated textual and
visual reasoning capabilities. However, current LVLMs exhibit a notable
imbalance between these skills, falling short on visual reasoning that is
difficult to perform in text. We conduct a case study using a synthetic dataset
solvable only through visual reasoning and show that model performance degrades
significantly with increasing visual complexity, while human performance
remains robust. We then introduce ChartMuseum, a new Chart Question Answering
(QA) benchmark containing 1,162 expert-annotated questions spanning multiple
reasoning types, curated from real-world charts across 184 sources,
specifically built to evaluate complex visual and textual reasoning. Unlike
prior chart understanding benchmarks -- where frontier models perform similarly
and near saturation -- our benchmark exposes a substantial gap between model
and human performance, while effectively differentiating model capabilities:
although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro
attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct
achieves only 38.5%. Moreover, on questions requiring primarily visual
reasoning, all models experience a 35%-55% performance drop from
text-reasoning-heavy question performance. Lastly, our qualitative error
analysis reveals specific categories of visual reasoning that are challenging
for current LVLMs.
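The abstract reports model and human accuracy over expert-annotated chart questions. As a rough illustration only, the sketch below shows how a simple accuracy evaluation loop for a chart QA benchmark of this kind might be structured. The JSONL field names ("image", "question", "answer"), the file name, and the `ask_model` placeholder are assumptions for illustration; they are not the paper's released data format or evaluation protocol, which may use a more relaxed answer-matching scheme.

```python
# Minimal sketch of a chart-QA accuracy evaluation loop (illustrative only).
# Assumptions: items are JSONL records with "image", "question", and "answer"
# fields, and ask_model is a placeholder for the LVLM client under test.
import json


def ask_model(image_path: str, question: str) -> str:
    """Placeholder: send the chart image and question to the LVLM being evaluated."""
    raise NotImplementedError("plug in the model client here")


def normalize(text: str) -> str:
    """Loose normalization before exact-match comparison (a simplification)."""
    return " ".join(text.lower().strip().split())


def evaluate(jsonl_path: str) -> float:
    """Return exact-match accuracy of the model over all benchmark items."""
    correct, total = 0, 0
    with open(jsonl_path) as f:
        for line in f:
            item = json.loads(line)
            prediction = ask_model(item["image"], item["question"])
            correct += normalize(prediction) == normalize(item["answer"])
            total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical file name; replace with the actual benchmark split.
    print(f"accuracy: {evaluate('chartmuseum_test.jsonl'):.1%}")
```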