Guiding Language Models of Code with Global Context using Monitors

Abstract

Language models of code (LMs) work well when the code surrounding the point of generation provides sufficient context. This is not the case when the model must use types or functionality defined in another module or library, especially ones not seen during training. LMs have limited awareness of such global context and end up hallucinating, e.g., using types defined in other files incorrectly. Recent work tries to overcome this issue by retrieving global information to augment the local context. However, this either bloats the prompt or requires architecture modifications and additional training. Integrated development environments (IDEs) assist developers by using static analysis to bring the global context to their fingertips. We extend this assistance, enjoyed by developers, to LMs. We propose a notion of monitors that use static analysis in the background to guide decoding. Unlike a priori retrieval, static analysis is invoked iteratively throughout the decoding process, providing the most relevant suggestions on demand. We demonstrate the usefulness of our proposal by monitoring for type-consistent use of identifiers whenever an LM generates code for an object dereference. To evaluate our approach, we curate PragmaticCode, a dataset of open-source projects with their development environments. On models of varying parameter scale, we show that monitor-guided decoding consistently improves not only the ability of an LM to generate identifiers that match the ground truth but also compilation rates and agreement with the ground truth. We find that LMs with fewer parameters, when guided by our monitor, can outperform larger LMs. With monitor-guided decoding, SantaCoder-1.1B achieves a better compilation rate and next-identifier match than the much larger text-davinci-003 model. The datasets and code will be released at https://aka.ms/monitors4codegen.
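As an illustration of the idea only, the following minimal Python sketch shows how a monitor might constrain decoding at an object dereference. It is not the authors' implementation. The static analysis is replaced by a hard-coded table, the language model by a fixed score table, and every name here (`TYPE_MEMBERS`, `valid_members`, `mask_scores`, `guided_decode`, `toy_scores`, the `ServerNode` type and its members) is hypothetical.

```python
import math

# Hypothetical stand-in for static analysis: members that are valid after
# dereferencing a value of the given type. A real monitor would query an
# analysis tool over the whole repository instead of a fixed table.
TYPE_MEMBERS = {
    "ServerNode": ["getHost", "getPort", "withIp"],
}


def valid_members(receiver_type):
    """Return the identifiers that are type-consistent for this receiver."""
    return TYPE_MEMBERS.get(receiver_type, [])


def mask_scores(scores, generated, allowed):
    """Keep a token's score only if it can still lead to an allowed identifier."""
    masked = {}
    for token, score in scores.items():
        candidate = generated + token
        consistent = any(name.startswith(candidate) for name in allowed)
        masked[token] = score if consistent else -math.inf
    return masked


def guided_decode(score_fn, receiver_type, max_steps=10):
    """Greedy decoding of one identifier, with the monitor consulted at every step.

    Simplification: decoding stops as soon as the text equals an allowed
    identifier, so an identifier that is a prefix of another one always wins.
    """
    allowed = valid_members(receiver_type)
    generated = ""
    for _ in range(max_steps):
        if generated in allowed:
            break
        scores = mask_scores(score_fn(generated), generated, allowed)
        token = max(scores, key=scores.get)
        if scores[token] == -math.inf:
            break  # no token is consistent with the analysis result
        generated += token
    return generated


def toy_scores(generated):
    """Toy 'model': fixed token scores that favour the hallucinated name 'host'."""
    return {"get": 1.0, "host": 3.0, "Host": 2.0, "Port": 1.5, "with": 0.5, "Ip": 0.1}


first_scores = toy_scores("")
print(max(first_scores, key=first_scores.get))  # unguided first token: host
print(guided_decode(toy_scores, "ServerNode"))  # guided identifier: getHost
```

In this toy setting the unguided model would start with `host`, which is not a member of the hypothetical `ServerNode` type, while the masked decoding is steered to `getHost`. The model itself is unchanged and the prompt is not enlarged; only the per-step token scores are filtered.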

