

Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

June 13, 2025
作者: Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
cs.AI

Abstract

Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora -- counting string appearances and retrieving the enclosing documents -- yet the high storage overhead hinders their application on Internet-scale data. We present Infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18×) and memory use during both indexing (3.2× reduction) and querying (down to a negligible amount). We index 46TB of Internet text in 50 days with a single 128-core CPU node (or 19 hours if using 75 such nodes). We show one important use case of Infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on Infini-gram mini indexes.
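To make the counting operation concrete, here is a minimal, illustrative Python sketch of FM-index backward search — the core query primitive behind exact-match counting. This is not the paper's implementation: it builds the Burrows-Wheeler transform from a naive suffix array and stores uncompressed rank tables, whereas a production FM-index (as in Infini-gram mini) uses compressed rank/select structures, which is what keeps the index smaller than the corpus.

```python
# Minimal FM-index sketch (illustrative; assumes small inputs).
# Builds the Burrows-Wheeler transform (BWT) via a naive suffix array,
# then counts exact occurrences of a pattern with backward search.

def build_fm_index(text):
    text += "\0"  # unique sentinel, lexicographically smallest
    # Naive O(n^2 log n) suffix array; real systems use linear-time builders.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    # C[c]: number of characters in the text strictly smaller than c
    chars = sorted(set(bwt))
    C, total = {}, 0
    for c in chars:
        C[c] = total
        total += bwt.count(c)
    # occ[c][i]: occurrences of c in bwt[:i] (uncompressed rank table;
    # compressed wavelet trees replace this in practice)
    occ = {c: [0] * (len(bwt) + 1) for c in chars}
    for i, ch in enumerate(bwt):
        for c in chars:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ, len(bwt)

def count_occurrences(pattern, index):
    """Return the number of exact occurrences of pattern in the indexed text."""
    C, occ, n = index
    lo, hi = 0, n  # half-open range of matching suffix-array rows
    for c in reversed(pattern):  # extend the match one character leftward
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

Usage: `idx = build_fm_index("mississippi")` followed by `count_occurrences("ssi", idx)` returns 2. Note that each query touches only the small C and occ structures, never the original text, which is why query-time memory can be negligible.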