

Large Language Models of Code Fail at Completing Code with Potential Bugs

June 6, 2023
Authors: Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, George Karypis
cs.AI

Abstract

Large language models of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing works ignore the possible presence of bugs in the code context for generation, which are inevitable in software development. Therefore, we introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion where the code context contains potential bugs -- anti-patterns that can become bugs in the completed program. To systematically study the task, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs. For instance, the passing rates of CodeGen-2B-mono on test cases of buggy-HumanEval drop more than 50% given a single potential bug in the context. Finally, we investigate several post-hoc methods for mitigating the adverse effect of potential bugs and find that there remains a large gap in post-mitigation performance.
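
To make the setting concrete, here is a minimal, hypothetical sketch of a buggy-code completion instance in the style of buggy-HumanEval. The function, the flipped operator, and the completion below are illustrative assumptions, not drawn from the actual dataset: the prompt is a partial Python function whose context contains a single semantics-altering operator change, and a completion that follows the buggy context verbatim fails the task's tests.

# A minimal, hypothetical sketch of a buggy-code completion instance
# (illustrative only; not an actual buggy-HumanEval prompt).
buggy_prefix = (
    "def sum_positives(nums):\n"
    '    """Return the sum of the strictly positive numbers in nums."""\n'
    "    total = 0\n"
    "    for x in nums:\n"
    "        if x < 0:  # potential bug: `>` flipped to `<`\n"
)

# A plausible left-to-right completion that simply extends the buggy context:
completion = (
    "            total += x\n"
    "    return total\n"
)

namespace = {}
exec(buggy_prefix + completion, namespace)  # assemble and run the completed program
print(namespace["sum_positives"]([1, -2, 3]))  # prints -2; the correct output is 4

Because the flipped comparison is syntactically valid, a purely left-to-right completion model gets no signal that the context is wrong, which is consistent with the large pass-rate drops the paper reports when a single potential bug appears in the prompt.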