ChartArena: Benchmarking Chart Parsing Across Languages, Scenarios, and Formats

Abstract

Charts are a primary medium for conveying quantitative and relational information, yet systematically evaluating chart parsing models remains difficult for three reasons. Existing benchmarks focus on a narrow set of chart types and leave diagrammatic structures such as flowcharts and mind maps largely unaddressed. Models produce outputs in mutually incompatible formats. Datasets rarely include the printed or hand-drawn images encountered in practice. To address these issues, we introduce ChartArena, a bilingual benchmark covering eight chart families that span both numeric charts and diagrammatic structures. Each family is evaluated across three visual scenarios: digital renderings, printed photos, and hand-drawn photos. The dataset is built via a human-agent collaborative annotation pipeline with multi-stage human verification to ensure annotation reliability. To enable fair cross-model comparison, we further design a format-agnostic evaluation protocol that maps heterogeneous outputs into two canonical semantic spaces, a Normalized Triple View and a Directed Graph View, and scores them with structure-aware metrics. Through extensive evaluation of 26 leading multimodal large language models (MLLMs), we observe three consistent findings: (i) frontier proprietary models such as Gemini 3.1 Pro lead overall, yet the strongest open-source systems are rapidly closing the gap; (ii) document parsing models handle numeric charts reasonably well but fall sharply behind on diagrammatic structures; and (iii) expert chart parsers remain limited to narrow chart families. Across all models, radar charts and hand-drawn scenarios remain especially challenging. These findings show that ChartArena exposes clear capability gaps and provides a unified foundation for future progress. ChartArena is publicly available at https://github.com/pspdada/ChartArena.
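
To make the idea of a format-agnostic evaluation protocol more concrete, the following is a minimal illustrative sketch, not the ChartArena implementation. It assumes that a numeric chart can be reduced to (series, category, value) triples and a diagram to directed (source, target) edges, and it scores both with a plain F1. The function names `triple_f1` and `edge_f1`, the label normalization, and the 5% relative tolerance are all hypothetical choices made for this example; the actual canonical views and structure-aware metrics are defined in the paper and repository.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, float]  # (series, category, value): assumed triple layout
Edge = Tuple[str, str]           # (source, target): assumed directed-edge layout


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so label formatting does not matter."""
    return " ".join(text.lower().split())


def _f1(hits: int, n_pred: int, n_gold: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if hits == 0 or n_pred == 0 or n_gold == 0:
        return 0.0
    precision, recall = hits / n_pred, hits / n_gold
    return 2 * precision * recall / (precision + recall)


def triple_f1(pred: Iterable[Triple], gold: Iterable[Triple],
              rel_tol: float = 0.05) -> float:
    """F1 over triples; values match within a relative tolerance (hypothetical 5%)."""
    pred_list = [(_norm(s), _norm(c), v) for s, c, v in pred]
    gold_list = [(_norm(s), _norm(c), v) for s, c, v in gold]
    unmatched = list(gold_list)  # each gold triple may be matched at most once
    hits = 0
    for s, c, v in pred_list:
        for g in unmatched:
            gs, gc, gv = g
            if (s, c) == (gs, gc) and abs(v - gv) <= rel_tol * max(abs(gv), 1e-9):
                unmatched.remove(g)
                hits += 1
                break  # stop scanning once this prediction is matched
    return _f1(hits, len(pred_list), len(gold_list))


def edge_f1(pred: Iterable[Edge], gold: Iterable[Edge]) -> float:
    """F1 over directed edges; a reversed edge does not count as a match."""
    pred_set = {(_norm(a), _norm(b)) for a, b in pred}
    gold_set = {(_norm(a), _norm(b)) for a, b in gold}
    return _f1(len(pred_set & gold_set), len(pred_set), len(gold_set))
```

Because both functions operate only on the canonical triples or edges, any model output (CSV, JSON, Markdown table, Mermaid, and so on) could in principle be compared once a converter into these views exists; writing those converters is the part this sketch leaves out.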