How Does Generative Retrieval Scale to Millions of Passages?

May 19, 2023
Authors: Ronak Pradeep, Kai Hui, Jai Gupta, Adam D. Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, Vinh Q. Tran
cs.AI

Abstract

Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first empirical study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. We uncover several findings about scaling generative retrieval to millions of passages; notably, the central importance of using synthetic queries as document representations during indexing, the ineffectiveness of existing proposed architecture modifications when accounting for compute cost, and the limits of naively scaling model parameters with respect to retrieval performance. While we find that generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge. We believe these findings will be valuable for the community to clarify the current state of generative retrieval, highlight the unique challenges, and inspire new research directions.