Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
May 13, 2024
Authors: Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo
cs.AI
Abstract
The remarkable progress of Multi-modal Large Language Models (MLLMs) has
attracted significant attention due to their superior performance in visual
contexts. However, their capabilities in turning visual figures into executable
code have not been thoroughly evaluated. To address this, we introduce
Plot2Code, a comprehensive visual coding benchmark designed for a fair and
in-depth assessment of MLLMs. We carefully collect 132 manually selected
high-quality matplotlib plots across six plot types from publicly available
matplotlib galleries. For each plot, we carefully provide its source code and a
descriptive instruction summarized by GPT-4. This approach enables Plot2Code to
extensively evaluate MLLMs' code capabilities across various input modalities.
Furthermore, we propose three automatic evaluation metrics, including code pass
rate, text-match ratio, and GPT-4V overall rating, for a fine-grained
assessment of the output code and rendered images. Instead of simply judging
pass or fail, we employ GPT-4V to make an overall judgement between the
generated and reference images, which has been shown to be consistent with
human evaluation. The evaluation results, which include analyses of 14 MLLMs
such as the proprietary GPT-4V, Gemini-Pro, and the open-sourced Mini-Gemini,
highlight the substantial challenges presented by Plot2Code. With Plot2Code, we
reveal that most existing MLLMs struggle with visual coding for text-dense
plots, heavily relying on textual instruction. We hope that the evaluation
results from Plot2Code on visual coding will guide the future development of
MLLMs. All data involved with Plot2Code are available at
https://huggingface.co/datasets/TencentARC/Plot2Code.
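The abstract only names the three automatic metrics. As a minimal sketch of how the first two might be computed, assuming the paper's actual implementation differs in detail (the function names and the text-extraction step here are illustrative, not the benchmark's code):

```python
# Illustrative sketch of two Plot2Code-style automatic metrics.
# These are hypothetical helpers, not the benchmark's official implementation.
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display is required
import matplotlib.pyplot as plt

def code_pass(code: str) -> bool:
    """Code pass rate component: does the model-generated snippet
    execute and render a figure without raising an exception?"""
    try:
        exec(code, {})                # run the generated code in isolation
        plt.gcf().canvas.draw()      # force a render to surface draw-time errors
        return True
    except Exception:
        return False
    finally:
        plt.close("all")             # avoid leaking figures between samples

def text_match_ratio(reference_texts, generated_texts) -> float:
    """Text-match ratio component: fraction of text elements from the
    reference plot (e.g. titles, axis labels, tick labels) that are
    recovered verbatim in the generated plot."""
    if not reference_texts:
        return 1.0
    generated = set(generated_texts)
    return sum(t in generated for t in reference_texts) / len(reference_texts)
```

A harness would run `code_pass` over each generated snippet to get the pass rate, then compare the text elements of successfully rendered figures against the reference to get the text-match ratio; the GPT-4V overall rating comes from a separate model-judged image comparison.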