Nomic Embed: Training a Reproducible Long Context Text Embedder
February 2, 2024
Authors: Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar
cs.AI
Abstract
This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192-context-length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short- and long-context tasks. We release the training code and model weights under an Apache 2 license. In contrast with other open-source models, we release a training data loader with 235 million curated text pairs that allows for the full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.
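
For readers who want to try the released weights, the sketch below loads the model through the sentence-transformers library. The Hugging Face Hub ID nomic-ai/nomic-embed-text-v1, the trust_remote_code flag, and the search_document:/search_query: task prefixes are assumptions based on how the released model is commonly served, not details stated in this abstract; consult the linked repository for authoritative usage.

```python
# Minimal usage sketch, not the authors' training code. Assumes the released
# weights are on the Hugging Face Hub as "nomic-ai/nomic-embed-text-v1" and
# that sentence-transformers >= 2.3 (for trust_remote_code) is installed.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Task prefixes (assumed convention for this model family) tell the encoder
# whether the input is a document to index or a query to match against.
embeddings = model.encode([
    "search_document: Nomic Embed is a long-context English text embedder.",
    "search_query: reproducible 8192-context text embeddings",
])
print(embeddings.shape)  # (2, embedding_dim)
```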