
ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

November 16, 2023
Authors: Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Zengxian Yang, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Zhengliang Li, Liang Chen, Yiming Zong, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein
cs.AI

Abstract

Large language models have shown promising performance on code generation benchmarks. However, a considerable divide exists between these benchmark achievements and their practical applicability, primarily because real-world programming relies on pre-existing libraries. Instead of evaluating LLMs on coding from scratch, this work proposes a new evaluation setup in which LLMs use open-source libraries to finish machine learning tasks. To this end, we propose ML-Bench, an expansive benchmark developed to assess the effectiveness of LLMs in leveraging existing functions in open-source libraries. ML-Bench consists of 10,044 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked with generating code that accomplishes the task. This necessitates comprehending long, language-code interleaved documents, as well as understanding complex cross-file code structures, introducing new challenges. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it accomplishes only 39.73% of the tasks, leaving huge room for improvement. We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, yields further improvements. Code, data, and models are available at https://ml-bench.github.io/.
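The evaluation setup described above can be sketched as a simple loop: build a prompt from the repository README and the task instruction, ask a model for code, then execute that code and count the sample as solved only if it runs successfully. This is a minimal illustrative sketch, not the benchmark's actual harness: the function names (`build_prompt`, `evaluate_sample`), the stub model, and the toy README are all assumptions for demonstration.

```python
import subprocess
import sys
import tempfile


def build_prompt(readme: str, instruction: str) -> str:
    # The model must read long, language-code interleaved documentation
    # before it can write code that invokes the library correctly.
    return (
        f"README:\n{readme}\n\n"
        f"Task: {instruction}\n"
        "Write Python code that accomplishes this task."
    )


def evaluate_sample(generated_code: str) -> bool:
    # Execution-based scoring: a sample counts as solved only if the
    # generated program exits without error.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True)
    return result.returncode == 0


def stub_model(prompt: str) -> str:
    # Stand-in for an LLM call (e.g., GPT-4 in the paper's experiments);
    # a real harness would query a model API here.
    return "print('hello from generated code')"


prompt = build_prompt(
    readme="## Usage\npython train.py --epochs 5",
    instruction="train the model for 5 epochs",
)
success = evaluate_sample(stub_model(prompt))
print(f"task solved: {success}")
```

Execution-based checking like this is what makes the setting hard: the generated code must not only look plausible but actually run against the library's real interface.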