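Efficient Code Embeddings from Code Generation Models

Abstract

jina-code-embeddings is a novel suite of code embedding models designed to retrieve code from natural language queries, answer technical questions, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, and generates embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the models' relatively small size, validating this approach to building code embedding models.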
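As an illustration of last-token pooling, the following is a minimal sketch in plain Python rather than the authors' implementation. It assumes right-padded batches, a 0/1 attention mask, and toy hidden states; the function names `last_token_pool` and `l2_normalize` are hypothetical, and the normalization step is an added assumption that is common for cosine-similarity retrieval but is not stated in the abstract.

```python
from typing import List

Vector = List[float]


def last_token_pool(hidden_states: List[List[Vector]],
                    attention_mask: List[List[int]]) -> List[Vector]:
    """Return, for each sequence, the hidden state of its last real token.

    Assumes right padding: real tokens come first (mask 1), padding last (mask 0).
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        length = sum(mask)  # number of non-padding tokens
        if length == 0:
            raise ValueError("sequence contains no real tokens")
        pooled.append(list(states[length - 1]))
    return pooled


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length (assumed post-processing step)."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Toy batch: 2 sequences, 3 positions each, hidden size 2.
hidden = [
    [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]],  # all three tokens are real
    [[1.0, 1.0], [0.0, 5.0], [9.0, 9.0]],  # third position is padding
]
mask = [
    [1, 1, 1],
    [1, 1, 0],
]

pooled = last_token_pool(hidden, mask)
embeddings = [l2_normalize(v) for v in pooled]
```

Because an autoregressive model attends only to earlier positions, the final real token is the one position whose hidden state has seen the whole input, which is why it can serve as the sequence embedding.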
