Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
June 13, 2025
Authors: Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
cs.AI
Abstract
Language models are trained mainly on massive text data from the Internet,
making it increasingly important to understand this data source.
Exact-match search engines enable searching in large text corpora -- counting
string appearances and retrieving the enclosing documents -- yet the high
storage overhead hinders their application on Internet-scale data. We present
Infini-gram mini, an efficient and scalable system that can make petabyte-level
text corpora searchable. Based on the FM-index data structure (Ferragina and
Manzini, 2000), which simultaneously indexes and compresses text, our system
creates indexes that are only 44% of the corpus size. Infini-gram mini greatly
improves upon the best existing implementation of FM-index in terms of indexing
speed (18×) and memory use during both indexing (3.2× reduction)
and querying (down to a negligible amount). We index 46TB of Internet text in
50 days with a single 128-core CPU node (or 19 hours if using 75 such nodes).
We show one important use case of Infini-gram mini in a large-scale analysis of
benchmark contamination. We find several core LM evaluation benchmarks to be
heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead
to overestimating the capabilities of language models if trained on such data.
We host a benchmark contamination bulletin to share the contamination rate of
many core and community-contributed benchmarks. We also release a web interface
and an API endpoint to serve general search queries on Infini-gram mini
indexes.
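
To make the counting mechanism concrete, here is a minimal Python sketch of FM-index counting via backward search over the Burrows-Wheeler transform of a toy string. This is illustrative only, not Infini-gram mini's implementation: the names (bwt, build_fm_index, count) are hypothetical, and a real petabyte-scale system would replace the plain Python tables below with compressed rank structures and external-memory construction.

```python
# Minimal FM-index sketch: count exact occurrences of a pattern via
# backward search. Illustrative only; a production index stores the
# BWT and rank information in compressed form.

def bwt(text):
    """Burrows-Wheeler transform of `text` (a sentinel is appended;
    assumes the sentinel does not already occur in `text`)."""
    text += "\x00"  # lexicographically smallest character
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def build_fm_index(text):
    last = bwt(text)
    # C[c]: number of characters in the text strictly smaller than c.
    freq = {}
    for ch in last:
        freq[ch] = freq.get(ch, 0) + 1
    C, running = {}, 0
    for ch in sorted(freq):
        C[ch] = running
        running += freq[ch]
    # occ[c][i]: occurrences of c in last[:i] (prefix-sum rank table).
    occ = {ch: [0] * (len(last) + 1) for ch in freq}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(last)

def count(pattern, C, occ, n):
    """Backward search: number of exact occurrences of `pattern`."""
    lo, hi = 0, n  # current half-open range over sorted rotations
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

C, occ, n = build_fm_index("abracadabra")
print(count("abra", C, occ, n))  # -> 2
print(count("cad", C, occ, n))   # -> 1
```

Note how counting never touches the original text: the query walks the pattern right to left, narrowing a range over the sorted rotations, which is why an FM-index can serve count queries directly from the compressed representation. A contamination check of the kind described above then amounts to counting long n-grams drawn from a benchmark: a nonzero count means the passage appears verbatim in the indexed crawl.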