ChartArena: A Chart Parsing Benchmark Across Languages, Scenarios, and Formats

Abstract

Charts are a primary medium for conveying quantitative and relational information, yet systematically evaluating chart parsing models remains difficult. Existing benchmarks focus on a narrow set of chart types and leave diagrammatic structures such as flowcharts and mind maps largely unaddressed. In addition, models produce outputs in mutually incompatible formats, and datasets rarely include the printed or hand-drawn images encountered in practice. To address these issues, we introduce ChartArena, a comprehensive bilingual benchmark covering eight chart families that span both numeric charts and diagrammatic structures, each evaluated under three visual scenarios: digital renderings, printed photos, and hand-drawn photos. The dataset is built with a human-agent collaborative annotation pipeline and multi-stage human verification to ensure annotation reliability. To enable fair cross-model comparison, we further design a format-agnostic evaluation protocol that maps heterogeneous outputs into two canonical semantic spaces, a normalized triple view and a directed graph view, and scores them with structure-aware metrics. Through an extensive evaluation of 26 leading multimodal large language models (MLLMs), we observe three consistent findings: (i) frontier proprietary models such as Gemini 3.1 Pro lead overall, yet the strongest open-source systems are rapidly closing the gap; (ii) document parsing models handle numeric charts reasonably well but fall sharply behind on diagrammatic structures; and (iii) expert chart parsers remain limited to narrow chart families. Across all models, radar charts and hand-drawn scenarios remain especially challenging. These findings show that ChartArena exposes clear capability gaps and provides a unified foundation for future progress. ChartArena is publicly available at https://github.com/pspdada/ChartArena.
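
To make the idea of a format-agnostic protocol concrete, the following is a minimal sketch of how outputs might be compared once they have been mapped into a triple view and a directed graph view. It is not the ChartArena implementation: the (series, category, value) triple layout, the helper names, the two-decimal numeric normalization, and the use of exact-match set F1 as a stand-in for the structure-aware metrics are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

# Hypothetical canonical forms (assumed for illustration, not taken from the paper).
Triple = Tuple[str, str, str]   # (series, category, value)
Edge = Tuple[str, str]          # (source node, target node), direction matters


def normalize_text(s: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not count."""
    return " ".join(s.strip().lower().split())


def normalize_value(v) -> str:
    """Render numeric values with two decimals; fall back to normalized text."""
    try:
        return f"{float(v):.2f}"
    except (TypeError, ValueError):
        return normalize_text(str(v))


def to_triples(rows: Iterable[tuple]) -> Set[Triple]:
    """Map (series, category, value) rows into a normalized triple set."""
    return {(normalize_text(s), normalize_text(c), normalize_value(v))
            for s, c, v in rows}


def to_edges(pairs: Iterable[tuple]) -> Set[Edge]:
    """Map (source, target) pairs into a normalized directed edge set."""
    return {(normalize_text(a), normalize_text(b)) for a, b in pairs}


def set_f1(pred: set, gold: set) -> float:
    """Exact-match F1 between two sets (a simple stand-in metric)."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Numeric chart: one value matches after normalization, one does not.
gold_triples = to_triples([("Sales", "Q1", 120), ("Sales", "Q2", "135.0")])
pred_triples = to_triples([(" sales ", "q1", "120.00"), ("Sales", "Q2", 140)])
triple_score = set_f1(pred_triples, gold_triples)

# Flowchart: the second predicted edge is reversed, so it does not match.
gold_edges = to_edges([("Start", "Check"), ("Check", "End")])
pred_edges = to_edges([("start", "check"), ("End", "Check")])
edge_score = set_f1(pred_edges, gold_edges)
```

In this sketch both scores evaluate to 0.5. The point is only that, once every model's output is reduced to the same canonical sets, the scoring no longer depends on whether the model emitted JSON, a Markdown table, or diagram code.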