Large Language Models of Code Fail at Completing Code with Potential Bugs

Abstract

Large language models of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing work ignores the possible presence of bugs in the code context used for generation, even though bugs are inevitable in software development. We therefore introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion in which the code context contains potential bugs, that is, anti-patterns that can become bugs in the completed program. To study the task systematically, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation performance of high-performing Code-LLMs. For instance, the passing rates of CodeGen-2B-mono on the test cases of buggy-HumanEval drop by more than 50% given a single potential bug in the context. Finally, we investigate several post-hoc methods for mitigating the adverse effect of potential bugs and find that a large gap remains in post-mitigation performance.
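To make the notion of a semantics-altering operator change concrete, the following is a minimal sketch of how a single operator flip can turn a correct function into a buggy one. It is an illustration under stated assumptions, not the procedure used to build buggy-HumanEval: the `OPPOSITE` table, the `flip_first_operator` helper, and the toy function `has_close_pair` are all hypothetical names introduced here.

```python
import io
import tokenize

# Hypothetical operator pairs chosen for illustration; the set of
# operator changes used to build buggy-HumanEval may differ.
OPPOSITE = {
    "+": "-", "-": "+",
    "==": "!=", "!=": "==",
    "<": ">=", ">=": "<",
    ">": "<=", "<=": ">",
}


def flip_first_operator(source: str) -> str:
    """Return `source` with its first flippable operator replaced.

    If no operator from OPPOSITE occurs, the source is returned unchanged.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type == tokenize.OP and tok.string in OPPOSITE:
            row, col = tok.start  # row is 1-based, col is 0-based
            lines = source.splitlines(keepends=True)
            line = lines[row - 1]
            end = col + len(tok.string)
            lines[row - 1] = line[:col] + OPPOSITE[tok.string] + line[end:]
            return "".join(lines)
    return source


# A toy reference solution (hypothetical, for illustration only).
CLEAN = (
    "def has_close_pair(numbers, threshold):\n"
    "    for i, a in enumerate(numbers):\n"
    "        for j, b in enumerate(numbers):\n"
    "            if i != j and abs(a - b) < threshold:\n"
    "                return True\n"
    "    return False\n"
)

# The first flippable operator is `!=`, which becomes `==`.
BUGGY = flip_first_operator(CLEAN)
```

In this sketch the flipped comparison makes every element count as "close" to itself, so the function's behavior changes even though the code still looks plausible. A completion model that receives only the lines up to the flipped operator would have to either work around the potential bug or produce a faulty program.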
