ChartArena: A Chart Parsing Benchmark Across Languages, Scenarios, and Formats

Abstract

Charts are a primary medium for conveying quantitative and relational information, yet systematically evaluating chart parsing models remains difficult. Existing benchmarks cover only a narrow range of chart types and leave diagrammatic structures such as flowcharts and mind maps largely unaddressed; models produce outputs in incompatible formats; and datasets rarely include the printed or hand-drawn images encountered in practice. To address these issues, we introduce ChartArena, a comprehensive bilingual benchmark covering eight chart families that span both numeric charts and diagrammatic structures. Each family is evaluated across three visual scenarios: digital renderings, printed photos, and hand-drawn photos. The dataset is built via a human-agent collaborative annotation pipeline with multi-stage human verification to ensure annotation reliability. To enable fair cross-model comparison, we further design a format-agnostic evaluation protocol that maps heterogeneous outputs into two canonical semantic spaces, a normalized triple view and a directed graph view, and scores them with structure-aware metrics. Through extensive evaluation of 26 leading multimodal large language models (MLLMs), we observe three consistent findings: (i) frontier proprietary models such as Gemini 3.1 Pro lead overall, yet the strongest open-source systems are rapidly closing the gap; (ii) document parsing models handle numeric charts reasonably well but fall sharply behind on diagrammatic structures; and (iii) expert chart parsers remain limited to narrow chart families. Across all models, radar charts and hand-drawn scenarios remain especially challenging. These findings show that ChartArena exposes clear capability gaps and provides a unified foundation for future progress. ChartArena is publicly available at https://github.com/pspdada/ChartArena.
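
The idea of a format-agnostic protocol can be illustrated with a minimal sketch. The abstract does not specify the actual normalization rules or metrics, so everything below is an assumption made for illustration: the function names, the two input shapes (a CSV-like table and a nested dict), the rounding to four significant digits, and the use of exact-match set F1 as a stand-in for the structure-aware metrics.

```python
# Illustrative sketch only; not the ChartArena implementation.
# All names, input shapes, and the F1 scoring rule are hypothetical.

Triple = tuple[str, str, str]   # (series, category, value)
Edge = tuple[str, str]          # (source node, target node)


def norm_text(s: object) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(str(s).strip().lower().split())


def normalize_triple(series: object, category: object, value: object) -> Triple:
    """Canonicalize one record; numeric values are rounded to 4 significant digits."""
    try:
        val = f"{float(str(value)):.4g}"
    except ValueError:
        val = norm_text(value)
    return (norm_text(series), norm_text(category), val)


def triples_from_csv(text: str) -> set[Triple]:
    """Assumed shape: header row holds series names, first column holds categories."""
    rows = [[c.strip() for c in line.split(",")] for line in text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    return {
        normalize_triple(header[i], row[0], row[i])
        for row in body
        for i in range(1, len(row))
    }


def triples_from_dict(data: dict) -> set[Triple]:
    """Assumed shape: {series: {category: value}}."""
    return {
        normalize_triple(series, category, value)
        for series, points in data.items()
        for category, value in points.items()
    }


def normalize_edges(edges) -> set[Edge]:
    """Canonicalize directed edges by normalizing their node labels."""
    return {(norm_text(src), norm_text(dst)) for src, dst in edges}


def set_f1(pred: set, gold: set) -> float:
    """Exact-match F1 between two sets of normalized items."""
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two models answer in different formats; both map to the same triple view.
    pred = triples_from_csv("Year,Sales\n2020,10\n2021,12.0")
    gold = triples_from_dict({"sales": {"2020": 10.0, "2021": 15}})
    print("triple F1:", set_f1(pred, gold))

    # A flowchart prediction scored in the directed graph view.
    pred_edges = normalize_edges([("Start", "Check"), ("Check", "End")])
    gold_edges = normalize_edges(
        [("start", "check"), ("check", "end"), ("check", "retry")]
    )
    print("edge F1:", set_f1(pred_edges, gold_edges))
```

Once both outputs are reduced to the same canonical sets, the scoring step no longer depends on whether a model emitted CSV, JSON, or some other format, which is the property the protocol is meant to provide. A real metric would likely tolerate small numeric errors and fuzzy label matches rather than require exact equality.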