ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Abstract

Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources and built specifically to evaluate complex visual and textual reasoning. On prior chart understanding benchmarks, frontier models perform similarly and near saturation. Our benchmark, by contrast, exposes a substantial gap between model and human performance while effectively differentiating model capabilities: humans achieve 93% accuracy, whereas the best-performing model, Gemini-2.5-Pro, attains only 63.0%, and the leading open-source LVLM, Qwen2.5-VL-72B-Instruct, achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop compared with their performance on text-reasoning-heavy questions. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.
