REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark

Abstract

The ability of CodeLLMs to generate executable and functionally correct code at the repository level remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating repository-level code generation. RepoExec focuses on three main aspects: executability, functional correctness verified through automatically generated test cases with high coverage, and carefully crafted cross-file contexts that support accurate code generation. Our work explores a controlled scenario in which developers specify the necessary code dependencies, challenging the model to integrate them accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel at utilizing the provided dependencies and at demonstrating debugging capabilities. We also introduce a new instruction-tuning dataset that focuses on code dependencies and demonstrate that CodeLLMs fine-tuned on it leverage these dependencies more effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code can be found at https://github.com/FSoft-AI4Code/RepoExec.
