Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
June 13, 2025
Authors: Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
cs.AI
Abstract
Language models are trained mainly on massive text data from the Internet,
making it increasingly important to understand this data source.
Exact-match search engines enable searching in large text corpora -- counting
string appearances and retrieving the enclosing documents -- yet the high
storage overhead hinders their application on Internet-scale data. We present
Infini-gram mini, an efficient and scalable system that can make petabyte-level
text corpora searchable. Based on the FM-index data structure (Ferragina and
Manzini, 2000), which simultaneously indexes and compresses text, our system
creates indexes that are only 44% of the corpus size. Infini-gram mini greatly
improves upon the best existing implementation of FM-index in terms of indexing
speed (18×) and memory use during both indexing (3.2× reduction)
and querying (down to a negligible amount). We index 46TB of Internet text in
50 days with a single 128-core CPU node (or 19 hours if using 75 such nodes).
We show one important use case of Infini-gram mini in a large-scale analysis of
benchmark contamination. We find several core LM evaluation benchmarks to be
heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead
to overestimating the capabilities of language models if trained on such data.
We host a benchmark contamination bulletin to share the contamination rate of
many core and community-contributed benchmarks. We also release a web interface
and an API endpoint to serve general search queries on Infini-gram mini
indexes.
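
To make the counting mechanism concrete, here is a minimal Python sketch of FM-index counting via backward search over the Burrows-Wheeler transform of a toy string. This is illustrative only, not Infini-gram mini's implementation: the names (bwt, build_fm_index, count) are hypothetical, and a real petabyte-scale system would replace the plain Python tables below with compressed rank structures and external-memory construction.

```python
# Minimal FM-index sketch: count exact occurrences of a pattern via
# backward search. Illustrative only; a production index stores the
# BWT and rank information in compressed form.

def bwt(text):
    """Burrows-Wheeler transform of `text` (a sentinel is appended;
    assumes the sentinel does not already occur in `text`)."""
    text += "\x00"  # lexicographically smallest character
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def build_fm_index(text):
    last = bwt(text)
    # C[c]: number of characters in the text strictly smaller than c.
    freq = {}
    for ch in last:
        freq[ch] = freq.get(ch, 0) + 1
    C, running = {}, 0
    for ch in sorted(freq):
        C[ch] = running
        running += freq[ch]
    # occ[c][i]: occurrences of c in last[:i] (prefix-sum rank table).
    occ = {ch: [0] * (len(last) + 1) for ch in freq}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(last)

def count(pattern, C, occ, n):
    """Backward search: number of exact occurrences of `pattern`."""
    lo, hi = 0, n  # current half-open range over sorted rotations
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

C, occ, n = build_fm_index("abracadabra")
print(count("abra", C, occ, n))  # -> 2
print(count("cad", C, occ, n))   # -> 1
```

Note how counting never touches the original text: the query walks the pattern right to left, narrowing a range over the sorted rotations, which is why an FM-index can serve count queries directly from the compressed representation. A contamination check of the kind described above then amounts to counting long n-grams drawn from a benchmark: a nonzero count means the passage appears verbatim in the indexed crawl.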