ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
May 19, 2025
Authors: Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, Ramya Namuduri, Bodun Hu, Juan Diego Rodriguez, Puyuan Peng, Greg Durrett
cs.AI
Abstract
Chart understanding presents a unique challenge for large vision-language
models (LVLMs), as it requires the integration of sophisticated textual and
visual reasoning capabilities. However, current LVLMs exhibit a notable
imbalance between these skills, falling short on visual reasoning that is
difficult to perform in text. We conduct a case study using a synthetic dataset
solvable only through visual reasoning and show that model performance degrades
significantly with increasing visual complexity, while human performance
remains robust. We then introduce ChartMuseum, a new Chart Question Answering
(QA) benchmark containing 1,162 expert-annotated questions spanning multiple
reasoning types, curated from real-world charts across 184 sources,
specifically built to evaluate complex visual and textual reasoning. Unlike
prior chart understanding benchmarks -- where frontier models perform similarly
and near saturation -- our benchmark exposes a substantial gap between model
and human performance, while effectively differentiating model capabilities:
although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro
attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct
achieves only 38.5%. Moreover, on questions requiring primarily visual
reasoning, all models experience a 35%-55% performance drop from
text-reasoning-heavy question performance. Lastly, our qualitative error
analysis reveals specific categories of visual reasoning that are challenging
for current LVLMs.