Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings

Abstract

A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same document independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations. In this work, we introduce ConTEB (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose InSeNT (In-sequence Negative Training), a novel contrastive post-training approach that, combined with late chunking pooling, enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on ConTEB without sacrificing base model performance. We further find that chunks embedded with our method are more robust to suboptimal chunking strategies and larger retrieval corpus sizes. We open-source all artifacts at https://github.com/illuin-tech/contextual-embeddings.
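The abstract names late chunking pooling without spelling it out. As a minimal sketch, assuming the common formulation in which the whole document is encoded once and each chunk embedding is the mean of the token embeddings inside that chunk's span, the pooling step might look like the following. The function name `late_chunk_pool`, the toy token vectors, and the span boundaries are all hypothetical and are not taken from the authors' repository; a real pipeline would obtain the token embeddings from a long-context encoder.

```python
def late_chunk_pool(token_embeddings, chunk_spans):
    """Mean-pool document-level token embeddings over each chunk span.

    token_embeddings: list of per-token vectors produced by encoding the
        WHOLE document in one pass (assumed; toy values are used below).
    chunk_spans: list of (start, end) token indices, end exclusive.
    """
    n_tokens = len(token_embeddings)
    pooled = []
    for start, end in chunk_spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"invalid chunk span: ({start}, {end})")
        span = token_embeddings[start:end]
        dim = len(span[0])
        # Average each dimension over the tokens belonging to this chunk.
        pooled.append([sum(tok[d] for tok in span) / len(span) for d in range(dim)])
    return pooled


# Toy stand-in for encoder output: 4 tokens, 2 dimensions (hypothetical values).
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
chunk_spans = [(0, 2), (2, 4)]  # two chunks of two tokens each

chunk_embeddings = late_chunk_pool(token_embeddings, chunk_spans)
print(chunk_embeddings)
```

Because every token vector is computed with the full document in view, each pooled chunk vector can carry information from outside its own span, which is the property the abstract relies on. The contrastive training objective itself is not sketched here, since the abstract does not specify it.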
