ChartAB: A Benchmark for Chart Grounding & Dense Alignment

Abstract

Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
