Contextual Document Embeddings

Abstract

Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and its neighboring documents in context, analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
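For orientation, the biencoder baseline that the abstract contrasts against is commonly trained with an in-batch contrastive (InfoNCE) loss. The sketch below shows only that standard baseline, not the contextual objective or architecture proposed here, whose details the abstract does not give. The function name `in_batch_contrastive_loss`, the `temperature` value, and the toy vectors are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def in_batch_contrastive_loss(query_embs, doc_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper, not from the paper).

    Query i is paired with document i; every other document in the
    batch serves as a negative for that query.
    """
    queries = [normalize(q) for q in query_embs]
    docs = [normalize(d) for d in doc_embs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the correct (i-th) document.
        total += log_z - logits[i]
    return total / len(queries)


# Toy batch: each query points roughly at its own document.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs_aligned = [[1.0, 0.1], [0.1, 1.0]]
docs_swapped = [docs_aligned[1], docs_aligned[0]]

loss_aligned = in_batch_contrastive_loss(queries, docs_aligned)
loss_swapped = in_batch_contrastive_loss(queries, docs_swapped)
print(loss_aligned < loss_swapped)
```

In this baseline the negatives are simply whatever other documents share the batch. One plausible reading of the abstract, stated here as an assumption, is that the proposed objective changes which documents share a batch so that a document's neighbors enter the loss.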
