Nomic Embed: Training a Reproducible Long Context Text Embedder

Abstract

This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data English text embedding model with an 8192 context length that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short- and long-context tasks. We release the training code and model weights under an Apache 2 license. In contrast with other open-source models, we release a training data loader with 235 million curated text pairs that allows for the full replication of nomic-embed-text-v1. The code and data to replicate the model are available at https://github.com/nomic-ai/contrastors.
