How Does Generative Retrieval Scale to Millions of Passages?

May 19, 2023
Authors: Ronak Pradeep, Kai Hui, Jai Gupta, Adam D. Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, Vinh Q. Tran
cs.AI

Abstract

Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first empirical study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. We uncover several findings about scaling generative retrieval to millions of passages; notably, the central importance of using synthetic queries as document representations during indexing, the ineffectiveness of existing proposed architecture modifications when accounting for compute cost, and the limits of naively scaling model parameters with respect to retrieval performance. While we find that generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge. We believe these findings will be valuable for the community to clarify the current state of generative retrieval, highlight the unique challenges, and inspire new research directions.