ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

Abstract

Large language models (LLMs) have shown promising performance on code generation benchmarks. However, a considerable gap exists between these benchmark results and practical applicability, primarily because real-world programming relies on pre-existing libraries. Instead of evaluating how well LLMs write code from scratch, this work proposes a new evaluation setup in which LLMs use open-source libraries to complete machine learning tasks. To this end, we propose ML-Bench, an expansive benchmark developed to assess the effectiveness of LLMs in leveraging existing functions in open-source libraries. It consists of 10,044 samples spanning 130 tasks across 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked with generating code to accomplish the task. This requires comprehending long documents that interleave natural language and code, as well as understanding complex cross-file code structures, which introduces new challenges. Notably, while GPT-4 shows remarkable improvement over other LLMs, it accomplishes only 39.73% of the tasks, leaving substantial room for improvement. We address these challenges by proposing ML-Agent, which is designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, yields further improvements. Code, data, and models are available at https://ml-bench.github.io/.
