RepoFusion: Training Code Models to Understand Your Repository

Abstract

Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, and files with similar names), and therefore produce inaccurate code completions. This effect is more pronounced when these assistants are used on repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models such as CodeGen-16B-multi (approximately 73× larger) and closely match the performance of the approximately 70× larger StarCoderBase model, which was trained with the Fill-in-the-Middle objective. We find these results to be a novel and compelling demonstration of the gains that training with repository context can bring. We carry out extensive ablation studies to investigate the impact of design choices such as context type, number of contexts, context length, and initialization within our framework. Lastly, we release Stack-Repo, a dataset of 200 Java repositories with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts. Additionally, we are making available the code and trained checkpoints for our work. Our released resources can be found at https://huggingface.co/RepoFusion.
