From Charts to Code: A Hierarchical Benchmark for Multimodal Models
October 20, 2025
Authors: Jiahao Tang, Henry Hengyuan Zhao, Lijian Wu, Yifei Tao, Dongxing Mao, Yang Wan, Jingru Tan, Min Zeng, Min Li, Alex Jinpeng Wang
cs.AI
Abstract
We introduce Chart2Code, a new benchmark for evaluating the chart
understanding and code generation capabilities of large multimodal models
(LMMs). Chart2Code is explicitly designed from a user-driven perspective,
capturing diverse real-world scenarios and progressively increasing task
difficulty. It consists of three levels: Level 1 (Chart Reproduction)
asks models to reproduce a chart from a reference figure and a user query;
Level 2 (Chart
Editing) involves complex modifications such as changing chart types or adding
elements; and Level 3 (Long-Table to Chart Generation) requires models to
transform long, information-dense tables into faithful charts following user
instructions. To our knowledge, this is the first hierarchical benchmark that
reflects practical chart2code usage while systematically scaling task
complexity. In total, Chart2Code contains 2,023 tasks across 22 chart types,
paired with multi-level evaluation metrics that assess both code correctness
and the visual fidelity of rendered charts. We benchmark 25 state-of-the-art
(SoTA) LMMs, spanning both proprietary models and the latest open-source
models, including GPT-5, Qwen2.5-VL, InternVL3/3.5, MiMo-VL, and
Seed-1.6-VL. Experimental
results demonstrate that even the SoTA model GPT-5 averages only 0.57 on
code-based evaluation and 0.22 on chart-quality assessment across the editing
tasks, underscoring the difficulty of Chart2Code. We anticipate this benchmark
will drive advances in multimodal reasoning and foster the development of more
robust and general-purpose LMMs. Our code and data are available on Chart2Code.
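
The two-level evaluation the abstract describes (code correctness plus visual fidelity of the rendered chart) can be pictured with a minimal scoring sketch. This is an illustrative approximation only, assuming matplotlib-based tasks; the function names, the pixel-difference similarity, and the injected savefig call below are hypothetical stand-ins, not Chart2Code's actual metrics or API.

import subprocess
import sys
import tempfile
from pathlib import Path

import numpy as np
from PIL import Image


def run_generated_code(code: str, out_png: Path) -> bool:
    """Coarse proxy for code-based evaluation: does the model's
    matplotlib script execute cleanly and produce a chart image?"""
    # Append a savefig call so the sandboxed script writes its figure
    # to disk (assumes the generated code builds a matplotlib figure).
    script = (
        code
        + "\nimport matplotlib.pyplot as plt"
        + f"\nplt.savefig({str(out_png)!r})\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=60)
    return result.returncode == 0 and out_png.exists()


def visual_similarity(rendered: Path, reference: Path) -> float:
    """Crude stand-in for chart-quality assessment: mean absolute pixel
    difference mapped to [0, 1], higher meaning more similar. The real
    benchmark presumably scores chart elements far more richly."""
    a = np.asarray(Image.open(rendered).convert("RGB").resize((512, 512)), dtype=float)
    b = np.asarray(Image.open(reference).convert("RGB").resize((512, 512)), dtype=float)
    return 1.0 - float(np.abs(a - b).mean()) / 255.0

Keeping execution success separate from image comparison mirrors why a model can average 0.57 on code-based evaluation yet only 0.22 on chart quality: code that runs is not the same as a chart that faithfully matches the reference.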