

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

June 26, 2024
Authors: Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen
cs.AI

Abstract

Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/
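To make the two question categories concrete, below is a minimal Python sketch of what a benchmark item and a naive exact-match scorer might look like. The record fields (figure_path, descriptive, reasoning) and the grading rule are illustrative assumptions, not the actual CharXiv data format or evaluation protocol; the real data and grader are available from the project page.

# Hypothetical sketch of a CharXiv-style item and a toy exact-match scorer.
# Field names and the grading rule are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class ChartItem:
    figure_path: str                                  # chart image taken from an arXiv paper
    descriptive: dict = field(default_factory=dict)   # question -> answer about basic chart elements
    reasoning: dict = field(default_factory=dict)     # question -> answer requiring cross-element synthesis

def accuracy(predictions: dict, ground_truth: dict) -> float:
    """Fraction of questions answered exactly (placeholder for the real grader)."""
    if not ground_truth:
        return 0.0
    correct = sum(
        predictions.get(q, "").strip().lower() == a.strip().lower()
        for q, a in ground_truth.items()
    )
    return correct / len(ground_truth)

# Toy usage: one item, one set of model predictions.
item = ChartItem(
    figure_path="figures/example_chart.png",
    descriptive={"What is the label of the x-axis?": "Epoch"},
    reasoning={"Which method improves fastest between epochs 10 and 20?": "Method B"},
)
preds = {
    "What is the label of the x-axis?": "epoch",
    "Which method improves fastest between epochs 10 and 20?": "Method A",
}
print(f"descriptive accuracy: {accuracy(preds, item.descriptive):.1%}")  # 100.0%
print(f"reasoning accuracy:   {accuracy(preds, item.reasoning):.1%}")    # 0.0%

The split mirrors the benchmark's design: descriptive questions probe whether a model can read off basic chart elements, while reasoning questions require combining information across visual elements, which is where the reported gap between proprietary and open-source models is largest.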

