LoRACode: LoRA Adapters for Code Embeddings

Abstract

Code embeddings are essential for semantic code search, but current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder are limited in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than 2% of the base model, enabling rapid fine-tuning on large code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments show an increase of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search and up to 86.69% for Text2Code search across multiple programming languages. Separating task-wise from language-wise adaptation lets us explore how sensitive code retrieval is to syntactic and linguistic variation.
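
The two quantities the abstract relies on, the share of trainable parameters under a low-rank adapter and Mean Reciprocal Rank, can be illustrated with a minimal sketch. The function names, the 1024 x 1024 layer size, and the rank of 8 are assumptions chosen for illustration; this section does not state the rank or the adapted layers used in the paper, and the numbers below are not the paper's results.

```python
from typing import Optional, Sequence


def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of trainable parameters for one adapted weight matrix.

    The frozen weight has d_out * d_in entries. A LoRA update is the
    product B @ A with B of shape (d_out, rank) and A of shape (rank, d_in),
    so only rank * (d_out + d_in) entries are trained.
    """
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return adapter / full


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """MRR over a set of queries.

    Each entry is the 1-based rank of the first relevant result for a
    query, or None when no relevant result was retrieved (scored as 0).
    """
    if not ranks:
        raise ValueError("ranks must contain at least one query")
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


if __name__ == "__main__":
    # Hypothetical layer size and rank, for illustration only.
    print(lora_trainable_fraction(1024, 1024, 8))
    # Hypothetical ranks for four queries; the third query found nothing.
    print(mean_reciprocal_rank([1, 2, None, 4]))
```

The fraction here is per adapted matrix; the model-wide share is typically lower still, since only some weight matrices receive adapters while the rest of the model stays frozen.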
