Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
May 13, 2024
Authors: Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo
cs.AI
Abstract
The remarkable progress of Multi-modal Large Language Models (MLLMs) has
attracted significant attention due to their superior performance in visual
contexts. However, their capabilities in turning visual figures into executable
code have not been thoroughly evaluated. To address this, we introduce
Plot2Code, a comprehensive visual coding benchmark designed for a fair and
in-depth assessment of MLLMs. We carefully collect 132 manually selected
high-quality matplotlib plots across six plot types from publicly available
matplotlib galleries. For each plot, we carefully provide its source code and a
descriptive instruction summarized by GPT-4. This approach enables Plot2Code to
extensively evaluate MLLMs' code capabilities across various input modalities.
Furthermore, we propose three automatic evaluation metrics, including code pass
rate, text-match ratio, and GPT-4V overall rating, for a fine-grained
assessment of the output code and rendered images. Instead of simply judging
pass or fail, we employ GPT-4V to make an overall judgement between the
generated and reference images, which has been shown to be consistent with
human evaluation. The evaluation results, which include analyses of 14 MLLMs
such as the proprietary GPT-4V, Gemini-Pro, and the open-sourced Mini-Gemini,
highlight the substantial challenges presented by Plot2Code. With Plot2Code, we
reveal that most existing MLLMs struggle with visual coding for text-dense
plots, heavily relying on textual instruction. We hope that the evaluation
results from Plot2Code on visual coding will guide the future development of
MLLMs. All data involved with Plot2Code are available at
https://huggingface.co/datasets/TencentARC/Plot2Code.
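The abstract names three automatic metrics: code pass rate, text-match ratio, and GPT-4V overall rating. Below is a minimal, hypothetical sketch of how the first two could be computed; the function names and the use of `exec` are illustrative assumptions, not the benchmark's actual implementation, and the GPT-4V overall rating is omitted since it requires a model-based judge.

```python
# Hypothetical sketch of two Plot2Code-style metrics (not the official code).

def code_pass_rate(scripts):
    """Fraction of generated scripts that execute without raising an error."""
    passed = 0
    for src in scripts:
        try:
            # Run each generated script in an isolated namespace.
            exec(compile(src, "<generated>", "exec"), {"__name__": "__plot__"})
            passed += 1
        except Exception:
            pass  # Any failure (syntax or runtime) counts as not passing.
    return passed / len(scripts) if scripts else 0.0

def text_match_ratio(ref_texts, gen_texts):
    """Overlap between text elements (titles, axis labels, tick labels)
    of the reference and generated plots, relative to the reference."""
    ref, gen = set(ref_texts), set(gen_texts)
    return len(ref & gen) / len(ref) if ref else 0.0
```

In practice the text elements would be extracted from the rendered matplotlib figures (e.g., titles, labels, legend entries) before comparison; this sketch only shows the scoring step.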