Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
June 13, 2025
Authors: Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
cs.AI
Abstract
Language models are trained mainly on massive text data from the Internet,
and it becomes increasingly important to understand this data source.
Exact-match search engines enable searching in large text corpora -- counting
string appearances and retrieving the enclosing documents -- yet the high
storage overhead hinders their application on Internet-scale data. We present
Infini-gram mini, an efficient and scalable system that can make petabyte-level
text corpora searchable. Based on the FM-index data structure (Ferragina and
Manzini, 2000), which simultaneously indexes and compresses text, our system
creates indexes with size only 44% of the corpus. Infini-gram mini greatly
improves upon the best existing implementation of FM-index in terms of indexing
speed (18×) and memory use during both indexing (3.2× reduction)
and querying (down to a negligible amount). We index 46TB of Internet text in
50 days with a single 128-core CPU node (or 19 hours if using 75 such nodes).
We show one important use case of Infini-gram mini in a large-scale analysis of
benchmark contamination. We find several core LM evaluation benchmarks to be
heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead
to overestimating the capabilities of language models if trained on such data.
We host a benchmark contamination bulletin to share the contamination rate of
many core and community-contributed benchmarks. We also release a web interface
and an API endpoint to serve general search queries on Infini-gram mini
indexes.
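To make the count queries above concrete, here is a minimal in-memory sketch of FM-index backward search, the primitive behind exact-match counting. It is illustrative only: the helper names (`build_fm_index`, `count`) are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed, whereas Infini-gram mini relies on external-memory construction and compressed rank structures to achieve its reported index size and memory footprint.

```python
# Toy FM-index: Burrows-Wheeler transform plus backward search for exact
# count queries. Hypothetical sketch; a production system would use
# external-memory suffix-array construction and compressed rank structures.
from collections import Counter

def build_fm_index(text: str):
    """Return (C table, occurrence table, text length) for `text`."""
    text += "\x00"  # unique sentinel, smaller than every other character
    # Naive O(n^2 log n) suffix array; fine for a toy example.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # BWT: char preceding each suffix
    counts = Counter(text)
    c_table, running = {}, 0
    for ch in sorted(counts):               # C[ch] = # chars < ch in text
        c_table[ch] = running
        running += counts[ch]
    # occ[ch][i] = occurrences of ch in bwt[:i] (plain prefix sums here;
    # a real FM-index stores this in compressed, sampled form).
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return c_table, occ, len(bwt)

def count(pattern: str, c_table, occ, n: int) -> int:
    """Backward search: number of exact occurrences of `pattern`."""
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

c, o, n = build_fm_index("mississippi")
print(count("issi", c, o, n))  # 2
print(count("ssi", c, o, n))   # 2
print(count("sip", c, o, n))   # 1
```

The key property shown here is that `count` performs only O(|pattern|) rank lookups and never decompresses the text, which is why query-time memory can stay negligible even over very large corpora; the same counting primitive, applied to long n-grams drawn from benchmark examples, supports contamination analyses like the one described in the abstract.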