From Charts to Code: A Hierarchical Benchmark for Multimodal Models
October 20, 2025
Authors: Jiahao Tang, Henry Hengyuan Zhao, Lijian Wu, Yifei Tao, Dongxing Mao, Yang Wan, Jingru Tan, Min Zeng, Min Li, Alex Jinpeng Wang
cs.AI
Abstract
We introduce Chart2Code, a new benchmark for evaluating the chart
understanding and code generation capabilities of large multimodal models
(LMMs). Chart2Code is explicitly designed from a user-driven perspective,
capturing diverse real-world scenarios and progressively increasing task
difficulty. It consists of three levels: Level 1 (Chart Reproduction) asks
models to reproduce a chart from a reference figure and a user query; Level 2
(Chart
Editing) involves complex modifications such as changing chart types or adding
elements; and Level 3 (Long-Table to Chart Generation) requires models to
transform long, information-dense tables into faithful charts following user
instructions. To our knowledge, this is the first hierarchical benchmark that
reflects practical chart2code usage while systematically scaling task
complexity. In total, Chart2Code contains 2,023 tasks across 22 chart types,
paired with multi-level evaluation metrics that assess both code correctness
and the visual fidelity of rendered charts. We benchmark 25 state-of-the-art
(SoTA) LMMs, covering both proprietary models and the latest open-source
models, such as GPT-5, Qwen2.5-VL, InternVL3/3.5, MiMo-VL, and Seed-1.6-VL.
Experimental
results demonstrate that even the SoTA model GPT-5 averages only 0.57 on
code-based evaluation and 0.22 on chart-quality assessment across the editing
tasks, underscoring the difficulty of Chart2Code. We anticipate this benchmark
will drive advances in multimodal reasoning and foster the development of more
robust and general-purpose LMMs. Our code and data are available on Chart2Code.
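To make the two-part evaluation described above concrete, the sketch below executes model-generated matplotlib code and scores the rendered figure against a reference raster. This is a minimal illustration, not the benchmark's actual harness: the task pair, function names (`render_to_array`, `pixel_agreement`), and the pixel-agreement score are all assumptions standing in for whatever code-correctness checks and chart-quality metrics Chart2Code really uses.

```python
# Hypothetical sketch of a chart2code evaluation loop: run the generated
# code (code-based check), render the figure, and compare it to a reference
# rendering (visual-fidelity check). Assumptions are noted inline.
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np


def render_to_array(code: str) -> np.ndarray:
    """Execute chart code and rasterize the resulting figure to an RGBA array."""
    namespace = {"plt": plt, "np": np}
    exec(code, namespace)  # code-based check: the snippet must at least run
    buf = io.BytesIO()
    plt.gcf().savefig(buf, format="png")
    plt.close("all")
    buf.seek(0)
    return plt.imread(buf)  # PNG loads as float RGBA in [0, 1]


def pixel_agreement(a: np.ndarray, b: np.ndarray) -> float:
    """Crude visual-fidelity proxy: fraction of near-identical pixels."""
    if a.shape != b.shape:
        return 0.0
    return float((np.abs(a - b) < 0.05).mean())


# Hypothetical Level 2 (Chart Editing) item: the instruction asks the
# model to recolor the bars of a reference bar chart.
reference_code = "plt.figure(); plt.bar(['Q1', 'Q2', 'Q3'], [3, 7, 5])"
generated_code = "plt.figure(); plt.bar(['Q1', 'Q2', 'Q3'], [3, 7, 5], color='C1')"

score = pixel_agreement(render_to_array(reference_code),
                        render_to_array(generated_code))
print(f"visual-fidelity proxy: {score:.2f}")
```

A real harness would sandbox the `exec` call and apply stronger scoring (perceptual similarity or judge models) than raw pixel agreement, but the flow shown here, run the code, render the chart, compare against a reference, is the pipeline the abstract describes.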