Nomic Embed: Training a Reproducible Long Context Text Embedder

Abstract

This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data English text embedding model with an 8192-token context length that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short- and long-context tasks. We release the training code and model weights under an Apache 2 license. In contrast with other open-source models, we release a training data loader with 235 million curated text pairs that allows for the full replication of nomic-embed-text-v1. Code and data to replicate the model are available at https://github.com/nomic-ai/contrastors.
