How Does Generative Retrieval Scale to Millions of Passages?

Abstract

Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first empirical study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. We uncover several findings about scaling generative retrieval to millions of passages; notably, the central importance of using synthetic queries as document representations during indexing, the ineffectiveness of existing proposed architecture modifications when accounting for compute cost, and the limits of naively scaling model parameters with respect to retrieval performance. While we find that generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge. We believe these findings will be valuable for the community to clarify the current state of generative retrieval, highlight the unique challenges, and inspire new research directions.
