Contextual Document Embeddings

Abstract

Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
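
For readers unfamiliar with the intra-batch loss the abstract refers to, the following is a minimal sketch of a generic in-batch contrastive (InfoNCE-style) loss for a biencoder. It is an illustration only and not the objective proposed in this work: the function name `in_batch_contrastive_loss`, the `temperature` value, and the toy vectors are assumptions, and the sketch does not show how neighboring documents would be selected for a batch, which the abstract does not specify.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def in_batch_contrastive_loss(query_embs, doc_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper, not from the paper).

    Query i is paired with document i as its positive; every other
    document in the same batch acts as a negative.
    """
    queries = [normalize(q) for q in query_embs]
    docs = [normalize(d) for d in doc_embs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive document.
        total += log_z - logits[i]
    return total / len(queries)


# Toy batch: each query matches the document at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = in_batch_contrastive_loss(queries, aligned_docs)
swapped_loss = in_batch_contrastive_loss(queries, swapped_docs)
```

In this toy setup the aligned batch yields a much lower loss than the swapped one. Because the negatives are simply whatever else is in the batch, the composition of the batch determines how informative the loss is, which is the part of training that the first proposed method targets.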
