Efficient Code Embeddings from Code Generation Models

August 29, 2025
Authors: Daria Kryvosheieva, Saba Sturua, Michael Günther, Scott Martens, Han Xiao
cs.AI

Abstract

jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
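
The abstract names last-token pooling but gives no implementation details, so the following is only a minimal sketch of the general technique, not the authors' code. It assumes the backbone returns one hidden vector per token plus a 0/1 attention mask, and it takes the vector at the last non-padding position as the sequence embedding. The function names (`last_token_pool`, `l2_normalize`, `cosine_similarity`) and the toy numbers are hypothetical, and plain Python lists stand in for the tensors a real model would produce.

```python
import math
from typing import List, Sequence

Vector = List[float]


def last_token_pool(
    hidden_states: Sequence[Sequence[Sequence[float]]],
    attention_mask: Sequence[Sequence[int]],
) -> List[Vector]:
    """Return, for each sequence, the hidden vector of its last real token.

    hidden_states: batch of sequences, each a list of per-token vectors.
    attention_mask: same batch layout, 1 for real tokens and 0 for padding.
    Works for left- or right-padded batches because it searches for the
    last position whose mask value is 1.
    """
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        real_positions = [i for i, m in enumerate(mask) if m == 1]
        if not real_positions:
            raise ValueError("sequence contains no real tokens")
        pooled.append(list(token_vectors[real_positions[-1]]))
    return pooled


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, the usual score for comparing embeddings."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


if __name__ == "__main__":
    # Toy stand-in for backbone outputs: 2 sequences, 3 positions, 2 dims.
    hidden = [
        [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]],  # third position is padding
        [[0.5, 0.5], [3.0, 0.0], [0.0, 4.0]],
    ]
    mask = [[1, 1, 0], [1, 1, 1]]
    query_emb, code_emb = last_token_pool(hidden, mask)
    print(query_emb, code_emb, cosine_similarity(query_emb, code_emb))
```

Last-token pooling suits an autoregressive backbone because causal attention lets only the final position attend to the whole input, so its hidden state is the one that can summarize the full sequence.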