Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
June 13, 2025
Authors: Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
cs.AI
Abstract
Language models are trained mainly on massive text data from the Internet,
and it becomes increasingly important to understand this data source.
Exact-match search engines enable searching in large text corpora -- counting
string appearances and retrieving the enclosing documents -- yet the high
storage overhead hinders their application on Internet-scale data. We present
Infini-gram mini, an efficient and scalable system that can make petabyte-level
text corpora searchable. Based on the FM-index data structure (Ferragina and
Manzini, 2000), which simultaneously indexes and compresses text, our system
creates indexes with size only 44% of the corpus. Infini-gram mini greatly
improves upon the best existing implementation of FM-index in terms of indexing
speed (18×) and memory use during both indexing (3.2× reduction)
and querying (down to a negligible amount). We index 46TB of Internet text in
50 days with a single 128-core CPU node (or 19 hours if using 75 such nodes).
We show one important use case of Infini-gram mini in a large-scale analysis of
benchmark contamination. We find several core LM evaluation benchmarks to be
heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead
to overestimating the capabilities of language models if trained on such data.
We host a benchmark contamination bulletin to share the contamination rate of
many core and community-contributed benchmarks. We also release a web interface
and an API endpoint to serve general search queries on Infini-gram mini
indexes.
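To make the count queries above concrete, here is a minimal in-memory sketch of FM-index backward search, the primitive behind exact-match counting. It is illustrative only: the helper names (`build_fm_index`, `count`) are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed, whereas Infini-gram mini relies on external-memory construction and compressed rank structures to achieve its reported index size and memory footprint.

```python
# Toy FM-index: Burrows-Wheeler transform plus backward search for exact
# count queries. Hypothetical sketch; a production system would use
# external-memory suffix-array construction and compressed rank structures.
from collections import Counter

def build_fm_index(text: str):
    """Return (C table, occurrence table, text length) for `text`."""
    text += "\x00"  # unique sentinel, smaller than every other character
    # Naive O(n^2 log n) suffix array; fine for a toy example.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # BWT: char preceding each suffix
    counts = Counter(text)
    c_table, running = {}, 0
    for ch in sorted(counts):               # C[ch] = # chars < ch in text
        c_table[ch] = running
        running += counts[ch]
    # occ[ch][i] = occurrences of ch in bwt[:i] (plain prefix sums here;
    # a real FM-index stores this in compressed, sampled form).
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return c_table, occ, len(bwt)

def count(pattern: str, c_table, occ, n: int) -> int:
    """Backward search: number of exact occurrences of `pattern`."""
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

c, o, n = build_fm_index("mississippi")
print(count("issi", c, o, n))  # 2
print(count("ssi", c, o, n))   # 2
print(count("sip", c, o, n))   # 1
```

The key property shown here is that `count` performs only O(|pattern|) rank lookups and never decompresses the text, which is why query-time memory can stay negligible even over very large corpora; the same counting primitive, applied to long n-grams drawn from benchmark examples, supports contamination analyses like the one described in the abstract.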