ChatPaper.ai


ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

October 6, 2025
作者: Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso
cs.AI

Abstract

Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, i.e., those that require precise visual interpretation rather than reliance on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
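The abstract's core loop, decomposing a query into visual subtasks and dispatching each to a chart-specific vision tool, can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the tool names (`localize_axes`, `crop_region`), the dictionary-based chart representation, and the dispatch loop are all assumptions made for clarity.

```python
# Hypothetical sketch of a ChartAgent-style iterative tool loop.
# All tool names and data structures here are illustrative assumptions,
# not the paper's actual code.

def localize_axes(chart):
    # Placeholder axis localizer: read assumed pixel-space axis bounds
    # from chart metadata instead of running real vision.
    return chart.get("axes", {"x": (0, 100), "y": (0, 100)})

def crop_region(chart, box):
    # Placeholder cropper: record the requested bounding box rather than
    # slicing an actual image.
    return {"crop": box}

# Library of chart-specific vision tools, keyed by action name.
TOOLS = {
    "localize_axes": localize_axes,
    "crop_region": crop_region,
}

def chart_agent(chart, subtasks):
    """Iteratively execute visual subtasks, each mapped to a vision tool.

    `subtasks` is a list of (tool_name, extra_args) pairs, standing in for
    the decomposition an LLM planner would produce from the user's query.
    """
    observations = []
    for name, args in subtasks:
        tool = TOOLS[name]
        observations.append(tool(chart, *args))
    return observations
```

In the actual system, the subtask list would be produced and refined by the underlying multimodal LLM at each iteration, with tool outputs (annotated or cropped images) fed back as new visual context.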