LoRACode: LoRA Adapters for Code Embeddings

March 7, 2025
作者: Saumya Chaturvedi, Aman Chadha, Laurent Bindschaedler
cs.AI

Abstract

Code embeddings are essential for semantic code search; however, current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder exhibit limitations in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than two percent of the base model, enabling rapid fine-tuning on extensive code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments demonstrate an increase of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search, and up to 86.69% for Text2Code search tasks across multiple programming languages. The distinction between task-wise and language-wise adaptation helps explore the sensitivity of code retrieval to syntactic and linguistic variations.
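To illustrate the parameter-efficiency claim, below is a minimal sketch of a LoRA update applied to a single frozen projection matrix. The dimensions, rank `r=8`, and scaling `alpha=16` are illustrative assumptions (the abstract does not state the paper's hyperparameters); the hidden size 768 matches a BERT-style encoder such as CodeBERT. The sketch is not the authors' implementation, only the standard LoRA formulation `W x + (alpha/r) * B A x` with `W` frozen and only the low-rank factors `A`, `B` trainable.

```python
import numpy as np

# Illustrative sizes: hidden dimension of a BERT-style code encoder
# (CodeBERT uses 768); rank and scaling are assumptions, not from the paper.
d = 768      # hidden dimension of one attention projection
r = 8        # LoRA rank (assumption)
alpha = 16   # LoRA scaling factor (assumption)

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor (random init)
B = np.zeros((d, r))                     # trainable low-rank factor (zero init)

def lora_forward(x):
    # LoRA: y = x W^T + (alpha / r) * x A^T B^T, with W kept frozen.
    # Because B starts at zero, the adapter is a no-op at initialization.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# Parameter-efficiency check: the trainable factors A and B are a small
# fraction of the frozen weight they adapt (2 * r * d vs. d * d parameters).
trainable = A.size + B.size
frozen = W.size
print(f"trainable fraction: {trainable / frozen:.2%}")  # → trainable fraction: 2.08%
```

For a single square projection the trainable fraction is `2r/d`, so a rank of 8 at hidden size 768 already lands near the "less than two percent" regime the abstract reports for the full model.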
