RepoFusion: Training Code Models to Understand Your Repository
June 19, 2023
Authors: Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Torsten Scholak
cs.AI
Abstract
Despite the huge success of Large Language Models (LLMs) in coding assistants
like GitHub Copilot, these models struggle to understand the context present in
the repository (e.g., imports, parent classes, files with similar names, etc.),
thereby producing inaccurate code completions. This effect is more pronounced
when using these assistants for repositories that the model has not seen during
training, such as proprietary software or work-in-progress code projects.
Recent work has shown the promise of using context from the repository during
inference. In this work, we extend this idea and propose RepoFusion, a
framework to train models to incorporate relevant repository context.
Experiments on single-line code completion show that our models trained with
repository context significantly outperform much larger code models such as
CodeGen-16B-multi (~73× larger) and closely match the performance of the
~70× larger StarCoderBase model that was trained with the
Fill-in-the-Middle objective. We find these results to be a novel and
compelling demonstration of the gains that training with repository context can
bring. We carry out extensive ablation studies to investigate the impact of
design choices such as context type, number of contexts, context length, and
initialization within our framework. Lastly, we release Stack-Repo, a dataset
of 200 Java repositories with permissive licenses and near-deduplicated files
that are augmented with three types of repository contexts. Additionally, we
are making available the code and trained checkpoints for our work. Our
released resources can be found at https://huggingface.co/RepoFusion.
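To make the idea of "training with repository context" concrete, the sketch below shows one plausible way the inputs could be assembled: each retrieved repository context (an import, a parent class, a similarly named file) is paired with the code surrounding the completion point and truncated to a fixed budget, so the model can attend over several contexts at once. The function name, string format, and truncation scheme here are illustrative assumptions, not the paper's exact implementation; the paper's ablations vary exactly these knobs (context type, number of contexts, context length).

```python
def build_fused_inputs(repo_contexts, surrounding_code, max_context_chars=256):
    """Pair every repository context with the local code around the hole.

    repo_contexts: list of strings drawn from the repository, e.g. imports,
                   parent classes, or files with similar names.
    surrounding_code: the code immediately before the line to complete.
    Returns one combined input string per context; a model trained in this
    style would encode each separately and fuse them when decoding.
    """
    inputs = []
    for ctx in repo_contexts:
        # Truncate each context to the character budget (a stand-in for the
        # token-length limit varied in the paper's ablations).
        truncated = ctx[:max_context_chars]
        inputs.append(f"{truncated}\n{surrounding_code}")
    return inputs


# Toy example: two repository contexts for completing a line in a Java file.
contexts = [
    "import java.util.List;",                  # import context
    "class Shape { int area() { return 0; } }",  # parent-class context
]
prefix = "class Circle extends Shape {\n    int area() {"
fused = build_fused_inputs(contexts, prefix)
for encoded in fused:
    print(encoded)
    print("---")
```

The design point this illustrates is that the number of contexts and the per-context length are independent budgets: adding more contexts widens what the model can see, while the truncation length bounds the cost of each one.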