ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

Abstract

Large language models have shown promising performance on code generation benchmarks. However, a considerable gap exists between these benchmark results and practical applicability, primarily because real-world programming relies on pre-existing libraries. Instead of evaluating LLMs on writing code from scratch, this work proposes a new evaluation setup in which LLMs use open-source libraries to complete machine learning tasks. To that end, we propose ML-Bench, an expansive benchmark developed to assess how effectively LLMs leverage existing functions in open-source libraries. ML-Bench consists of 10,044 samples spanning 130 tasks across 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked with generating code to accomplish the task. This requires comprehending long documents that interleave natural language and code, as well as understanding complex cross-file code structures, which introduces new challenges. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it accomplishes only 39.73% of the tasks, leaving substantial room for improvement. We address these challenges by proposing ML-Agent, which is designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, yields further improvements. Code, data, and models are available at https://ml-bench.github.io/.
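
The evaluation setting described above (an instruction plus a README goes in, generated code comes out, and the fraction of accomplished tasks is reported) can be illustrated with a minimal sketch. Everything in it is hypothetical: the names `Sample`, `build_prompt`, and `completion_rate`, the prompt wording, and the `generate` and `passes` callables are assumptions made for illustration, not ML-Bench's actual data format, prompt, or metric implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Sample:
    """One hypothetical benchmark item: a task instruction plus the repository README."""
    instruction: str
    readme: str


def build_prompt(sample: Sample) -> str:
    """Combine the README and the task instruction into a single model input.

    The wording is an assumption, not the prompt used by the benchmark.
    """
    return (
        "README:\n" + sample.readme + "\n\n"
        "Task:\n" + sample.instruction + "\n\n"
        "Write the code or command that accomplishes the task."
    )


def completion_rate(
    samples: Sequence[Sample],
    generate: Callable[[str], str],
    passes: Callable[[Sample, str], bool],
) -> float:
    """Return the percentage of samples whose generated output passes the check.

    `generate` stands in for any model call, and `passes` stands in for any
    correctness check (for example, executing the output); both are placeholders.
    """
    if not samples:
        return 0.0
    solved = sum(1 for s in samples if passes(s, generate(build_prompt(s))))
    return 100.0 * solved / len(samples)
```

A caller would supply its own model wrapper as `generate` and its own execution-based check as `passes`; the sketch only fixes how the two are combined into a single percentage.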
