FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

Abstract

Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files so that the solution can be verified. Feature implementation requires LLMs to possess both code completion capabilities for new components and code editing abilities for other relevant parts of the code repository, which provides a more comprehensive evaluation of LLMs' automated software engineering capabilities. Experimental results show that LLMs perform significantly worse on FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
