Nomic Embed: Training a Reproducible Long Context Text Embedder
February 2, 2024
Authors: Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar
cs.AI
Abstract
This technical report describes the training of nomic-embed-text-v1, the
first fully reproducible, open-source, open-weights, open-data, 8192 context
length English text embedding model that outperforms both OpenAI Ada-002 and
OpenAI text-embedding-3-small on short- and long-context tasks. We release the
training code and model weights under an Apache 2 license. In contrast with
other open-source models, we release a training data loader with 235 million
curated text pairs that allows for the full replication of nomic-embed-text-v1.
You can find code and data to replicate the model at
https://github.com/nomic-ai/contrastors.