Large Language Models of Code Fail at Completing Code with Potential Bugs

June 6, 2023
Authors: Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, George Karypis
cs.AI

Abstract

Large language models of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing works ignore the possible presence of bugs in the code context for generation, which are inevitable in software development. Therefore, we introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion where the code context contains potential bugs -- anti-patterns that can become bugs in the completed program. To systematically study the task, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs. For instance, the passing rates of CodeGen-2B-mono on test cases of buggy-HumanEval drop more than 50% given a single potential bug in the context. Finally, we investigate several post-hoc methods for mitigating the adverse effect of potential bugs and find that there remains a large gap in post-mitigation performance.
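To make the setup concrete, here is a minimal, hypothetical illustration in the spirit of buggy-HumanEval's semantics-altering operator changes (not an actual dataset entry): flipping a single comparison operator turns a correct partial solution into a buggy code context that the completion model must then extend.

```python
# Hypothetical illustration: a correct partial solution whose comparison
# operator is flipped, yielding a "potential bug" in the code context.

CORRECT_CONTEXT = '''
def below_threshold(numbers, threshold):
    """Return True iff every element of numbers is below threshold."""
    for n in numbers:
        if n >= threshold:
'''

# A single semantics-altering operator change: ">=" becomes "<=".
BUGGY_CONTEXT = CORRECT_CONTEXT.replace(">=", "<=")

# A Code-LLM that would complete CORRECT_CONTEXT with
#     return False
#     return True
# tends to produce the same continuation for BUGGY_CONTEXT, giving a
# completed program that fails the task's test cases.
print(BUGGY_CONTEXT)
```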
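The reported degradation is measured as a test-case pass rate. Below is a minimal sketch of such an evaluation, assuming a simple exec-based harness; the function names and toy data are illustrative assumptions, not the authors' evaluation code.

```python
# Sketch of test-case pass-rate evaluation (assumed harness): run each
# completed program against its tests and report the fraction that pass.

def passes_tests(program: str, test_code: str) -> bool:
    """Execute the completed program, then its tests; any exception
    (including a failing assert) counts as a failure."""
    namespace: dict = {}
    try:
        exec(program, namespace)    # define the completed function
        exec(test_code, namespace)  # asserts exercising the function
        return True
    except Exception:
        return False

def pass_rate(samples: list[tuple[str, str]]) -> float:
    """samples: (completed_program, test_code) pairs."""
    passed = sum(passes_tests(prog, tests) for prog, tests in samples)
    return passed / len(samples)

# Toy usage with the buggy completion sketched above: the flipped
# operator makes the assertion fail, so this sample does not pass.
buggy_program = (
    "def below_threshold(numbers, threshold):\n"
    "    for n in numbers:\n"
    "        if n <= threshold:\n"
    "            return False\n"
    "    return True\n"
)
tests = "assert below_threshold([1, 2, 3], 10) is True"
print(pass_rate([(buggy_program, tests)]))  # 0.0
```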