

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

November 16, 2023
Authors: Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Zengxian Yang, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Zhengliang Li, Liang Chen, Yiming Zong, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein
cs.AI

Abstract

Large language models have shown promising performance in code generation benchmarks. However, a considerable divide exists between these benchmark achievements and practical applicability, primarily because real-world programming relies on pre-existing libraries. Instead of evaluating LLMs on coding from scratch, this work proposes a new evaluation setup in which LLMs use open-source libraries to complete machine learning tasks. Therefore, we propose ML-Bench, an expansive benchmark developed to assess the effectiveness of LLMs in leveraging existing functions in open-source libraries. ML-Bench consists of 10,044 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked with generating code that accomplishes the task. This necessitates comprehension of long, language-code interleaved documents as well as understanding of complex cross-file code structures, introducing new challenges. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it manages to accomplish only 39.73% of the tasks, leaving huge room for improvement. We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, yields further improvements. Code, data, and models are available at https://ml-bench.github.io/.
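The evaluation protocol described in the abstract — prompt an LLM with a repository README plus a task instruction, then check whether the generated code runs — can be pictured with the minimal sketch below. This is an illustration only, not the official ML-Bench harness: the prompt wording, the `llm_generate` stand-in, and the simple exit-code executability check are assumptions introduced here.

```python
import subprocess
import tempfile


def build_prompt(readme_text: str, task_instruction: str) -> str:
    """Combine a repository README with a task instruction, mirroring the
    ML-Bench setting where the model should reuse the library's existing
    functionality instead of coding from scratch."""
    return (
        "You are given the README of a machine learning repository.\n\n"
        f"--- README ---\n{readme_text}\n\n"
        f"--- Task ---\n{task_instruction}\n\n"
        "Write a shell command or script that accomplishes the task using "
        "the repository's existing functions."
    )


def llm_generate(prompt: str) -> str:
    """Stand-in for a call to an LLM such as GPT-4; replace this with a real
    model client. Returns the generated code as a string."""
    return 'echo "replace llm_generate() with a real model call"'


def runs_successfully(code: str, repo_dir: str, timeout: int = 600) -> bool:
    """Execute the generated script inside the repository checkout and report
    whether it exits with status 0 -- a simple executability check."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(code)
        script_path = f.name
    result = subprocess.run(
        ["bash", script_path], cwd=repo_dir,
        capture_output=True, text=True, timeout=timeout,
    )
    return result.returncode == 0


# Example usage (assumes a hypothetical local checkout of a benchmark repo):
#   readme = open("path/to/repo/README.md").read()
#   code = llm_generate(build_prompt(readme, "Train the model on the demo data."))
#   print(runs_successfully(code, "path/to/repo"))
```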