ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

Abstract

Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, which require precise visual interpretation and offer no textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods with absolute gains of up to 16.07% overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying levels of visual and reasoning complexity, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
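The iterative loop described in the abstract (decompose a query into visual subtasks, then fulfill each with a chart-specific tool) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify any API, so every name here (ToyChart, AgentState, localize_axis, isolate_bar, read_value, run_agent, difference_query) is hypothetical. The sketch also assumes the chart has already been reduced to pixel coordinates and uses a fixed plan where the real framework would have an LLM choose the next subtask.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyChart:
    """Hypothetical stand-in for a chart image, in pixel coordinates."""
    bar_top_px: Dict[str, float]        # bar label -> y pixel of the bar's top edge
    axis_ticks_px: Dict[float, float]   # axis value -> y pixel of that tick mark


@dataclass
class AgentState:
    chart: ToyChart
    notes: dict = field(default_factory=dict)  # intermediate visual observations


Tool = Callable[[AgentState], None]


def localize_axis(state: AgentState) -> None:
    """Fit a linear pixel-to-value mapping from the two lowest-valued ticks."""
    (v0, p0), (v1, p1) = sorted(state.chart.axis_ticks_px.items())[:2]
    scale = (v1 - v0) / (p1 - p0)
    state.notes["px_to_value"] = lambda p: v0 + (p - p0) * scale


def isolate_bar(label: str) -> Tool:
    """Record where one bar's top edge sits (stands in for cropping the bar)."""
    def tool(state: AgentState) -> None:
        state.notes[f"top_px:{label}"] = state.chart.bar_top_px[label]
    return tool


def read_value(label: str) -> Tool:
    """Convert a bar's pixel position to a data value using the axis mapping."""
    def tool(state: AgentState) -> None:
        to_value = state.notes["px_to_value"]
        state.notes[f"value:{label}"] = to_value(state.notes[f"top_px:{label}"])
    return tool


def run_agent(chart: ToyChart, plan: List[Tool]) -> dict:
    """Execute subtasks in order; each tool reads and extends the shared notes."""
    state = AgentState(chart)
    for step in plan:
        step(state)
    return state.notes


def difference_query(chart: ToyChart, a: str, b: str) -> float:
    """Answer 'how much larger is bar a than bar b?' via visual subtasks."""
    plan = [localize_axis, isolate_bar(a), read_value(a), isolate_bar(b), read_value(b)]
    notes = run_agent(chart, plan)
    return notes[f"value:{a}"] - notes[f"value:{b}"]


if __name__ == "__main__":
    # Axis: value 0 at y=400 px, value 100 at y=0 px (image y grows downward).
    chart = ToyChart(
        bar_top_px={"A": 100.0, "B": 250.0},
        axis_ticks_px={0.0: 400.0, 100.0: 0.0},
    )
    print(difference_query(chart, "A", "B"))  # prints 37.5
```

The point of the sketch is that the answer is derived from measured positions plus an axis calibration, not from printed data labels, which is the situation the abstract calls an unannotated chart.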
