Nomic Embed: Training a Reproducible Long Context Text Embedder
February 2, 2024
Authors: Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar
cs.AI
Abstract
This technical report describes the training of nomic-embed-text-v1, the
first fully reproducible, open-source, open-weights, open-data, 8192 context
length English text embedding model that outperforms both OpenAI Ada-002 and
OpenAI text-embedding-3-small on short- and long-context tasks. We release the
training code and model weights under an Apache 2 license. In contrast with
other open-source models, we release a training data loader with 235 million
curated text pairs that allows for the full replication of nomic-embed-text-v1.
You can find code and data to replicate the model at
https://github.com/nomic-ai/contrastors.