Nomic Embed: Training a Reproducible Long Context Text Embedder
February 2, 2024
Authors: Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar
cs.AI
Abstract
This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192-context-length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short- and long-context tasks. We release the training code and model weights under an Apache 2 license. In contrast with other open-source models, we release a training data loader with 235 million curated text pairs that allows for the full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.
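
For readers who want to try the released weights, the sketch below loads the model through the sentence-transformers library. The Hugging Face Hub ID nomic-ai/nomic-embed-text-v1, the trust_remote_code flag, and the search_document:/search_query: task prefixes are assumptions based on how the released model is commonly served, not details stated in this abstract; consult the linked repository for authoritative usage.

```python
# Minimal usage sketch, not the authors' training code. Assumes the released
# weights are on the Hugging Face Hub as "nomic-ai/nomic-embed-text-v1" and
# that sentence-transformers >= 2.3 (for trust_remote_code) is installed.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Task prefixes (assumed convention for this model family) tell the encoder
# whether the input is a document to index or a query to match against.
embeddings = model.encode([
    "search_document: Nomic Embed is a long-context English text embedder.",
    "search_query: reproducible 8192-context text embeddings",
])
print(embeddings.shape)  # (2, embedding_dim)
```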