
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

June 26, 2024
Authors: Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen
cs.AI

Abstract

Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/

