

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

November 16, 2023
Authors: Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Zengxian Yang, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Zhengliang Li, Liang Chen, Yiming Zong, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein
cs.AI

Abstract

Large language models have shown promising performance in code generation benchmarks. However, a considerable divide exists between these benchmark achievements and practical applicability, primarily because real-world programming relies on pre-existing libraries. Instead of evaluating LLMs on coding from scratch, this work proposes a new evaluation setup in which LLMs use open-source libraries to complete machine learning tasks. Therefore, we propose ML-Bench, an expansive benchmark developed to assess the effectiveness of LLMs in leveraging existing functions in open-source libraries. ML-Bench consists of 10,044 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked with generating code that accomplishes the task. This necessitates comprehension of long, language-code interleaved documents as well as understanding of complex cross-file code structures, introducing new challenges. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it manages to accomplish only 39.73% of the tasks, leaving huge room for improvement. We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, yields further improvements. Code, data, and models are available at https://ml-bench.github.io/.
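The evaluation protocol described in the abstract — prompt an LLM with a repository README plus a task instruction, then check whether the generated code runs — can be pictured with the minimal sketch below. This is an illustration only, not the official ML-Bench harness: the prompt wording, the `llm_generate` stand-in, and the simple exit-code executability check are assumptions introduced here.

```python
import subprocess
import tempfile


def build_prompt(readme_text: str, task_instruction: str) -> str:
    """Combine a repository README with a task instruction, mirroring the
    ML-Bench setting where the model should reuse the library's existing
    functionality instead of coding from scratch."""
    return (
        "You are given the README of a machine learning repository.\n\n"
        f"--- README ---\n{readme_text}\n\n"
        f"--- Task ---\n{task_instruction}\n\n"
        "Write a shell command or script that accomplishes the task using "
        "the repository's existing functions."
    )


def llm_generate(prompt: str) -> str:
    """Stand-in for a call to an LLM such as GPT-4; replace this with a real
    model client. Returns the generated code as a string."""
    return 'echo "replace llm_generate() with a real model call"'


def runs_successfully(code: str, repo_dir: str, timeout: int = 600) -> bool:
    """Execute the generated script inside the repository checkout and report
    whether it exits with status 0 -- a simple executability check."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(code)
        script_path = f.name
    result = subprocess.run(
        ["bash", script_path], cwd=repo_dir,
        capture_output=True, text=True, timeout=timeout,
    )
    return result.returncode == 0


# Example usage (assumes a hypothetical local checkout of a benchmark repo):
#   readme = open("path/to/repo/README.md").read()
#   code = llm_generate(build_prompt(readme, "Train the model on the demo data."))
#   print(runs_successfully(code, "path/to/repo"))
```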