Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings

Abstract

A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same document independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations. In this work, we introduce ConTEB (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose InSeNT (In-sequence Negative Training), a novel contrastive post-training approach which, combined with late chunking pooling, enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on ConTEB without sacrificing base model performance. We further find that chunks embedded with our method are more robust to suboptimal chunking strategies and larger retrieval corpus sizes. We open-source all artifacts at https://github.com/illuin-tech/contextual-embeddings.
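
The two ideas named in the abstract, late chunking pooling and in-sequence negatives, can be illustrated with a small sketch. This is not the released implementation. It assumes that late chunking means encoding the whole document once and mean-pooling the contextualized token vectors over each chunk span, and that in-sequence negatives means treating the other chunks of the same document as negatives in an InfoNCE-style loss. The function names, the cosine similarity, and the `temperature` value are hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def late_chunk_pool(
    token_vectors: Sequence[Vector], spans: Sequence[Tuple[int, int]]
) -> List[Vector]:
    """Mean-pool token vectors over each [start, end) chunk span.

    Assumption: token_vectors come from a single forward pass over the
    full document, so every token has already seen its neighbours.
    """
    pooled = []
    for start, end in spans:
        if not 0 <= start < end <= len(token_vectors):
            raise ValueError(f"invalid span ({start}, {end})")
        window = token_vectors[start:end]
        dim = len(window[0])
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def in_sequence_contrastive_loss(
    query: Vector,
    chunk_vectors: Sequence[Vector],
    positive_index: int,
    temperature: float = 0.05,  # hypothetical value, not taken from the paper
) -> float:
    """InfoNCE-style loss whose negatives are the other chunks of the
    same document (an illustrative reading of 'in-sequence negatives')."""
    logits = [cosine(query, chunk) / temperature for chunk in chunk_vectors]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[positive_index]


# Toy document: four token vectors, split into two chunks of two tokens each.
token_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
chunks = late_chunk_pool(token_vectors, spans=[(0, 2), (2, 4)])
# chunks == [[1.0, 0.0], [0.0, 1.0]]

query = [1.0, 0.0]
loss_if_first_is_gold = in_sequence_contrastive_loss(query, chunks, 0)
loss_if_second_is_gold = in_sequence_contrastive_loss(query, chunks, 1)
```

In this toy case the query matches the first chunk, so the loss is close to zero when the first chunk is labelled as the gold passage and large when the second one is. A real training setup would compute the token vectors with a transformer encoder and would typically add in-batch negatives from other documents as well.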
