ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

Abstract

Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, which require precise visual interpretation rather than reliance on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by an absolute gain of up to 16.07% overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying levels of visual and reasoning complexity, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
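To make the idea of reading values from an unannotated chart concrete, the sketch below shows one way an axis-localization step could feed a numeric estimate: once two axis ticks have been located in pixel space, a bar's top edge can be converted to a data value by linear interpolation. This is an illustrative assumption, not ChartAgent's actual tool library; the names `AxisTick`, `pixel_to_value`, and `read_bar_values` are hypothetical, and a linear (non-logarithmic) axis is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisTick:
    """A located axis tick (hypothetical structure, not from the paper)."""
    pixel: float  # pixel coordinate along the axis (image row for a y-axis)
    value: float  # data value printed next to the tick


def pixel_to_value(pixel: float, tick_a: AxisTick, tick_b: AxisTick) -> float:
    """Map a pixel coordinate to a data value, assuming a linear axis."""
    if tick_a.pixel == tick_b.pixel:
        raise ValueError("ticks must lie at distinct pixel positions")
    # Data units per pixel; negative for a y-axis, where rows grow downward.
    scale = (tick_b.value - tick_a.value) / (tick_b.pixel - tick_a.pixel)
    return tick_a.value + (pixel - tick_a.pixel) * scale


def read_bar_values(
    bar_top_pixels: dict[str, float], tick_a: AxisTick, tick_b: AxisTick
) -> dict[str, float]:
    """Estimate the value of each bar from the pixel row of its top edge."""
    return {
        name: pixel_to_value(row, tick_a, tick_b)
        for name, row in bar_top_pixels.items()
    }


if __name__ == "__main__":
    # Made-up example: tick "0" sits at row 400, tick "100" at row 100.
    zero = AxisTick(pixel=400.0, value=0.0)
    hundred = AxisTick(pixel=100.0, value=100.0)
    print(read_bar_values({"A": 250.0, "B": 160.0}, zero, hundred))
```

With these made-up tick positions, bar A (top edge at row 250) maps to roughly 50 and bar B (row 160) to roughly 80. In a real pipeline the tick and bar coordinates would come from detection or segmentation tools rather than being hard-coded.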
