RepoFusion: Training Code Models to Understand Your Repository

June 19, 2023
Authors: Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Torsten Scholak
cs.AI

Abstract

Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names), and therefore produce inaccurate code completions. This effect is more pronounced when these assistants are used on repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models such as CodeGen-16B-multi (~73× larger) and closely match the performance of the ~70× larger StarCoderBase model, which was trained with the Fill-in-the-Middle objective. We find these results to be a novel and compelling demonstration of the gains that training with repository context can bring. We carry out extensive ablation studies to investigate the impact of design choices such as context type, number of contexts, context length, and initialization within our framework. Lastly, we release Stack-Repo, a dataset of 200 Java repositories with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts. Additionally, we are making available the code and trained checkpoints for our work. Our released resources can be found at https://huggingface.co/RepoFusion.