ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats

Abstract

Charts are a primary medium for conveying quantitative and relational information, yet systematically evaluating chart parsing models remains difficult. Existing benchmarks cover only a narrow range of chart types and leave diagrammatic structures such as flowcharts and mind maps largely unaddressed; models produce outputs in mutually incompatible formats; and datasets rarely include the printed or hand-drawn images encountered in practice. To address these issues, we introduce ChartArena, a comprehensive bilingual benchmark covering eight chart families that span both numeric charts and diagrammatic structures, each evaluated across three visual scenarios: digital renderings, printed photos, and hand-drawn photos. The dataset is built via a human-agent collaborative annotation pipeline with multi-stage human verification to ensure annotation reliability. To enable fair cross-model comparison, we further design a format-agnostic evaluation protocol that maps heterogeneous outputs into two canonical semantic spaces, a normalized triple view and a directed graph view, and scores them with structure-aware metrics. Through extensive evaluation of 26 leading MLLMs, we observe three consistent findings: (i) frontier proprietary models such as Gemini 3.1 Pro lead overall, yet the strongest open-source systems are rapidly closing the gap; (ii) document parsing models handle numeric charts reasonably well but fall sharply behind on diagrammatic structures; and (iii) expert chart parsers remain limited to narrow chart families. Across all models, radar charts and hand-drawn scenarios remain especially challenging. These findings show that ChartArena exposes clear capability gaps and provides a unified foundation for future progress. ChartArena is publicly available at https://github.com/pspdada/ChartArena.
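
To make the idea of a format-agnostic protocol concrete, the following is a minimal Python sketch, not the ChartArena implementation. It assumes two hypothetical output formats per view (CSV text and a list of records for numeric charts; arrow notation and an adjacency mapping for diagrams), maps each into a normalized triple set or a directed edge set, and scores it with an exact-match set F1. All function names, the input formats, the rounding precision, and the choice of F1 are illustrative assumptions; the abstract does not specify the actual structure-aware metrics.

```python
from __future__ import annotations

import csv
import io
from typing import Hashable, Iterable, Mapping


def _norm_text(text: str) -> str:
    """Collapse whitespace and ignore case so label formatting does not matter."""
    return " ".join(str(text).split()).casefold()


def normalize_triple(series: str, category: str, value, ndigits: int = 2):
    """Canonical (series, category, value) triple; rounding precision is an assumption."""
    return (_norm_text(series), _norm_text(category), round(float(value), ndigits))


def triples_from_csv(text: str) -> set:
    """Hypothetical format A: CSV with categories in column 0 and one column per series."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    header, body = rows[0], rows[1:]
    return {
        normalize_triple(header[j], row[0], row[j])
        for row in body
        for j in range(1, len(row))
    }


def triples_from_records(records: Iterable[Mapping]) -> set:
    """Hypothetical format B: a list of {"series", "category", "value"} records."""
    return {normalize_triple(r["series"], r["category"], r["value"]) for r in records}


def edges_from_arrows(text: str) -> set:
    """Hypothetical format C: one 'source --> target' edge per line."""
    edges = set()
    for line in text.strip().splitlines():
        if "-->" in line:
            src, dst = line.split("-->", 1)
            edges.add((_norm_text(src), _norm_text(dst)))
    return edges


def edges_from_adjacency(adjacency: Mapping[str, Iterable[str]]) -> set:
    """Hypothetical format D: a mapping from each node to its successors."""
    return {
        (_norm_text(src), _norm_text(dst))
        for src, targets in adjacency.items()
        for dst in targets
    }


def set_f1(pred: Iterable[Hashable], gold: Iterable[Hashable]) -> float:
    """Exact-match F1 between two sets; a stand-in for a structure-aware metric."""
    pred_set, gold_set = set(pred), set(gold)
    if not pred_set and not gold_set:
        return 1.0
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

Because every parser returns a plain set in the same canonical space, a model that emits CSV and a model that emits records (or arrows versus an adjacency mapping) are scored by the same `set_f1` call, which is the property the protocol is designed to provide.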