Efficient Code Embeddings from Code Generation Models

Abstract

jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
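As a rough illustration of last-token pooling, the sketch below selects, for each sequence in a batch, the hidden state at the final non-padding position and uses it as the embedding. It is a minimal toy under stated assumptions, not the authors' implementation: the hidden states are hand-written numbers rather than outputs of the actual backbone, the function names (`last_token_pool`, `l2_normalize`, `cosine_similarity`) are hypothetical, and the L2 normalization and cosine comparison are a common convention for embedding models that the abstract does not itself specify.

```python
import math
from typing import List, Sequence

Vector = List[float]


def last_token_pool(
    hidden_states: Sequence[Sequence[Sequence[float]]],
    attention_mask: Sequence[Sequence[int]],
) -> List[Vector]:
    """Return, per sequence, the hidden state of its last non-padding token.

    hidden_states: [batch][position][hidden_dim] (toy stand-in for model output)
    attention_mask: [batch][position], 1 for real tokens and 0 for padding
    """
    pooled: List[Vector] = []
    for states, mask in zip(hidden_states, attention_mask):
        real_positions = [i for i, m in enumerate(mask) if m == 1]
        if not real_positions:
            raise ValueError("sequence has no non-padding tokens")
        # Works for left- or right-padded batches: take the last real position.
        pooled.append(list(states[real_positions[-1]]))
    return pooled


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (assumed convention, not from the source)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Hypothetical batch of two sequences, three positions each, hidden size 2.
# The first sequence is right-padded: its third position is padding.
toy_hidden = [
    [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]],
    [[0.5, 0.5], [1.0, 1.0], [3.0, 4.0]],
]
toy_mask = [
    [1, 1, 0],
    [1, 1, 1],
]

embeddings = last_token_pool(toy_hidden, toy_mask)
query_vec, code_vec = embeddings
score = cosine_similarity(query_vec, code_vec)
```

In an autoregressive model each position attends only to earlier tokens, so the last real token is the one position whose state can reflect the whole input; that is the usual motivation for pooling there rather than averaging over all positions.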
