FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

Abstract

Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance contains code changes and is paired with the relevant unit test files, so that a solution can be verified. Implementing a feature requires an LLM both to complete code for new components and to edit other relevant parts of the repository, which gives a more comprehensive evaluation of LLMs' automated software engineering capabilities. Experimental results show that LLMs perform significantly worse on FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
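The construction pipeline summarized above (collect pull requests, filter by rules and by intent, keep only instances paired with unit tests) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual implementation: the `PullRequest` fields, the test-file naming convention, the specific rule, and the keyword-based stand-in for intent filtering are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PullRequest:
    """Hypothetical, simplified view of a pull request (not the paper's schema)."""
    title: str
    changed_files: List[str]
    # Names of functions/classes the PR newly introduces (assumed to be precomputed).
    added_definitions: List[str] = field(default_factory=list)


def is_test_file(path: str) -> bool:
    # Assumed convention: tests live under a tests/ directory or are named test_*.
    name = path.rsplit("/", 1)[-1]
    return "/tests/" in f"/{path}" or name.startswith("test_")


def passes_rule_filter(pr: PullRequest) -> bool:
    # Assumed rule: the PR touches both source and test files
    # and introduces at least one new definition.
    has_tests = any(is_test_file(p) for p in pr.changed_files)
    has_source = any(not is_test_file(p) for p in pr.changed_files)
    return has_tests and has_source and bool(pr.added_definitions)


def passes_intent_filter(pr: PullRequest) -> bool:
    # Stand-in for intent-based filtering: a whole-word keyword match on the title,
    # used here purely for illustration.
    keywords = {"add", "feature", "support", "implement"}
    return bool(set(pr.title.lower().split()) & keywords)


def build_task_instances(prs: List[PullRequest]) -> List[PullRequest]:
    # Keep only pull requests that pass both filters.
    return [pr for pr in prs if passes_rule_filter(pr) and passes_intent_filter(pr)]


candidates = [
    PullRequest("Add rolling median support",
                ["src/stats.py", "tests/test_stats.py"], ["rolling_median"]),
    PullRequest("Fix typo in docs", ["docs/index.md"]),
    PullRequest("Add caching layer", ["src/cache.py"], ["Cache"]),
]
kept = build_task_instances(candidates)
```

In this toy run only the first pull request survives: the second adds no new component and has no feature intent, and the third has no accompanying test file, so its solution could not be verified.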

