FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
March 9, 2025
Authors: Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, Scarlett Li
cs.AI
Abstract
Implementing new features in repository-level codebases is a crucial
application of code generation models. However, current benchmarks lack a
dedicated evaluation framework for this capability. To fill this gap, we
introduce FEA-Bench, a benchmark designed to assess the ability of large
language models (LLMs) to perform incremental development within code
repositories. We collect pull requests from 83 GitHub repositories and use
rule-based and intent-based filtering to construct task instances focused on
new feature development. Each task instance containing code changes is paired
with relevant unit test files to ensure that the solution can be verified. The
feature implementation requires LLMs to simultaneously possess code completion
capabilities for new components and code editing abilities for other relevant
parts in the code repository, providing a more comprehensive evaluation
of LLMs' automated software engineering capabilities. Experimental results show
that LLMs perform significantly worse on FEA-Bench, highlighting
considerable challenges in such repository-level incremental code development.
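
The rule-based part of the filtering described above might look roughly like the following sketch. It is an illustration only, not the authors' pipeline: the `PullRequest` record, the `is_test_file` naming convention, and the "adds a new `def` or `class`" rule are all assumptions made for this example, and the intent-based filtering step is not modelled at all.

```python
import re
from dataclasses import dataclass, field


# Hypothetical record for a pull request; the field names are illustrative
# and are not taken from the FEA-Bench pipeline.
@dataclass
class PullRequest:
    title: str
    # Maps each changed file path to the lines the pull request adds to it.
    added_lines: dict[str, list[str]] = field(default_factory=dict)


# Assumed rule: an added line that opens a Python function or class definition.
NEW_COMPONENT = re.compile(r"^\s*(?:async\s+def|def|class)\s+\w+")


def is_test_file(path: str) -> bool:
    """Assumed convention: tests live in a tests/ directory or are named test_*."""
    parts = path.split("/")
    return "tests" in parts[:-1] or parts[-1].startswith("test_")


def is_feature_candidate(pr: PullRequest) -> bool:
    """Keep pull requests that add a new component and also touch unit tests."""
    has_tests = any(is_test_file(path) for path in pr.added_lines)
    adds_component = any(
        NEW_COMPONENT.match(line)
        for path, lines in pr.added_lines.items()
        if path.endswith(".py") and not is_test_file(path)
        for line in lines
    )
    return has_tests and adds_component


# Two made-up pull requests: one adds a function with a test, one is a small fix.
feature_pr = PullRequest(
    "Add retry helper",
    {
        "pkg/net.py": ["def retry(fn, attempts=3):", "    return fn()"],
        "tests/test_net.py": ["def test_retry():", "    assert True"],
    },
)
fix_pr = PullRequest(
    "Fix off-by-one",
    {
        "pkg/net.py": ["    return count + 1"],
        "tests/test_net.py": ["    assert count_items([1]) == 1"],
    },
)

print(is_feature_candidate(feature_pr))  # True
print(is_feature_candidate(fix_pr))      # False
```

Requiring both conditions mirrors the abstract's two constraints: the task must involve new feature development, and the change must be verifiable by unit tests.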