On Pretraining for Project-Level Code Completion

Abstract

Repository-level pretraining is commonly used to enable large language models for code to leverage codebase-wide context, which improves their ability to generate accurate, context-aware code completions. In this work, we investigate how different repository-processing strategies affect in-context learning in OpenCoder, a 1.5B-parameter model. We extend its context window from 4,096 to 16,384 tokens by training on an additional 1B tokens of curated repository-level data. Despite relying on a smaller dataset than competing models (which often use hundreds of billions of tokens), our model achieves comparable performance on the Long Code Arena benchmark. We find that the various repository-processing techniques yield similarly strong results, with the primary gain coming from adapting to a new rotary positional embedding (RoPE) scaling parameter. Finally, we show that a simpler file-level training approach at the original sequence length remains highly effective, opening up repository-level code completion research to settings with more constrained data and compute resources.
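To make the RoPE remark concrete, the following is a minimal sketch of rotary positional embeddings with an adjustable base frequency. The abstract does not say which scaling parameter was changed, so treating it as the base frequency is an assumption, and the function names and the two base values (10,000 and 500,000) are illustrative rather than taken from the paper.

```python
import math


def rope_inverse_frequencies(head_dim, base):
    """Return the per-pair rotation frequencies base^(-2i/d), i = 0 .. d/2 - 1."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vector, position, base):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by the angle position * freq_i.

    This uses the interleaved-pair convention; some implementations pair
    element i with element i + d/2 instead.
    """
    freqs = rope_inverse_frequencies(len(vector), base)
    rotated = []
    for i, freq in enumerate(freqs):
        angle = position * freq
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[2 * i], vector[2 * i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated


def longest_wavelength(head_dim, base):
    """Number of positions needed for the slowest-rotating pair to complete one turn."""
    return 2.0 * math.pi / rope_inverse_frequencies(head_dim, base)[-1]


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the effect of a larger base.
    for base in (10_000.0, 500_000.0):
        print(base, round(longest_wavelength(64, base)))
```

Under this assumption, raising the base slows the low-frequency rotations, so distant positions within a longer window remain distinguishable; the model then needs some training to adapt to the changed frequencies, which is the kind of adaptation the abstract credits with most of the gain.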
