ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

Abstract

We introduce a new benchmark, ChartMimic, aimed at assessing the visually grounded code generation capabilities of large multimodal models (LMMs). ChartMimic takes information-intensive visual charts and textual instructions as inputs and requires LMMs to generate the corresponding code for chart rendering. ChartMimic includes 1,000 human-curated (figure, instruction, code) triplets, which represent authentic chart use cases found in scientific papers across various domains (e.g., Physics, Computer Science, and Economics). These charts span 18 regular types and 4 advanced types, which divide into 191 subcategories. Furthermore, we propose multi-level evaluation metrics to provide an automatic and thorough assessment of the output code and the rendered charts. Unlike existing code generation benchmarks, ChartMimic emphasizes evaluating LMMs' capacity to combine several cognitive capabilities: visual understanding, code generation, and cross-modal reasoning. The evaluation of 3 proprietary models and 11 open-weight models highlights the substantial challenges posed by ChartMimic. Even the advanced GPT-4V and Claude-3-opus achieve average scores of only 73.2 and 53.7, respectively, indicating significant room for improvement. We anticipate that ChartMimic will inspire the development of LMMs, advancing the pursuit of artificial general intelligence.
