
ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

November 16, 2023
Authors: Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Zengxian Yang, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Zhengliang Li, Liang Chen, Yiming Zong, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein
cs.AI

Abstract

Large language models have shown promising performance on code generation benchmarks. However, a considerable divide exists between these benchmark achievements and their practical applicability, primarily because real-world programming relies on pre-existing libraries. Instead of evaluating LLMs on coding from scratch, this work proposes a new evaluation setup in which LLMs use open-source libraries to finish machine learning tasks. To this end, we propose ML-Bench, an expansive benchmark developed to assess the effectiveness of LLMs in leveraging existing functions in open-source libraries. ML-Bench consists of 10,044 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked with generating code that accomplishes the task. This necessitates comprehending long, language-code interleaved documents, as well as understanding complex cross-file code structures, introducing new challenges. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it accomplishes only 39.73% of the tasks, leaving huge room for improvement. We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, yields further improvements. Code, data, and models are available at https://ml-bench.github.io/.
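The evaluation setup described above can be sketched as a simple loop: build a prompt from the repository README and the task instruction, ask a model for code, then execute that code and count the sample as solved only if it runs successfully. This is a minimal illustrative sketch, not the benchmark's actual harness: the function names (`build_prompt`, `evaluate_sample`), the stub model, and the toy README are all assumptions for demonstration.

```python
import subprocess
import sys
import tempfile


def build_prompt(readme: str, instruction: str) -> str:
    # The model must read long, language-code interleaved documentation
    # before it can write code that invokes the library correctly.
    return (
        f"README:\n{readme}\n\n"
        f"Task: {instruction}\n"
        "Write Python code that accomplishes this task."
    )


def evaluate_sample(generated_code: str) -> bool:
    # Execution-based scoring: a sample counts as solved only if the
    # generated program exits without error.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True)
    return result.returncode == 0


def stub_model(prompt: str) -> str:
    # Stand-in for an LLM call (e.g., GPT-4 in the paper's experiments);
    # a real harness would query a model API here.
    return "print('hello from generated code')"


prompt = build_prompt(
    readme="## Usage\npython train.py --epochs 5",
    instruction="train the model for 5 epochs",
)
success = evaluate_sample(stub_model(prompt))
print(f"task solved: {success}")
```

Execution-based checking like this is what makes the setting hard: the generated code must not only look plausible but actually run against the library's real interface.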