Efficient Code Embeddings from Code Generation Models
August 29, 2025
Authors: Daria Kryvosheieva, Saba Sturua, Michael Günther, Scott Martens, Han Xiao
cs.AI
Abstract
jina-code-embeddings is a novel code embedding model suite designed to
retrieve code from natural language queries, perform technical
question-answering, and identify semantically similar code snippets across
programming languages. It makes innovative use of an autoregressive backbone
pre-trained on both text and code, generating embeddings via last-token
pooling. We outline the training recipe and demonstrate state-of-the-art
performance despite the relatively small size of the models, validating this
approach to code embedding model construction.
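
Last-token pooling can be illustrated with a small sketch. In an autoregressive model with causal attention, only the final token attends to the entire input, so its hidden state is a natural summary of the sequence. The code below is a toy illustration, not the authors' implementation: the function names (`last_token_pool`, `l2_normalize`, `cosine_similarity`), the hand-written hidden states, and the right-padding convention are all assumptions made for the example, and each sequence is assumed to contain at least one real token.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token of each sequence.

    hidden_states: list of sequences, each a list of per-token vectors.
    attention_mask: list of 0/1 flag lists (1 = real token), right-padded.
    Assumption: every sequence has at least one real token.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last_index = sum(mask) - 1  # final real token under right padding
        pooled.append(states[last_index])
    return pooled


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Toy batch standing in for a model's final-layer outputs (made-up values).
hidden_states = [
    [[0.1, 0.2], [0.3, 0.4], [3.0, 4.0]],  # three real tokens
    [[1.0, 0.0], [6.0, 8.0], [0.0, 0.0]],  # two real tokens, one pad
]
attention_mask = [[1, 1, 1], [1, 1, 0]]

embeddings = last_token_pool(hidden_states, attention_mask)
similarity = cosine_similarity(embeddings[0], embeddings[1])
```

In a retrieval setting, a query embedding and candidate code embeddings produced this way would typically be ranked by cosine similarity. With a real model, the padding side must match how the mask is interpreted here; left-padded batches would simply take the last position of each sequence.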