Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

Abstract

Language models are trained mainly on massive text data from the Internet, and it is becoming increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora (counting string appearances and retrieving the enclosing documents), yet their high storage overhead hinders their application to Internet-scale data. We present Infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes whose size is only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18×) and memory use during both indexing (3.2× reduction) and querying (down to a negligible amount). We index 46 TB of Internet text in 50 days with a single 128-core CPU node (or 19 hours if using 75 such nodes). We show one important use case of Infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on Infini-gram mini indexes.
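To make the counting operation concrete, the following is a minimal sketch of FM-index backward search on a toy string. It is not the Infini-gram mini implementation: the function names `build_fm_index` and `count` are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, so it illustrates only the query logic and not the compression or scalability described in the abstract.

```python
def build_fm_index(text):
    """Build a toy FM-index (naive suffix sorting; only suitable for tiny inputs)."""
    assert "\0" not in text
    s = text + "\0"  # sentinel that sorts before every other character
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
    bwt = [s[i - 1] for i in sa]  # Burrows-Wheeler transform of s

    # C[c]: number of characters in s that are strictly smaller than c
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # occ[c][i]: number of occurrences of c in bwt[:i] (uncompressed table)
    occ = {ch: [0] * (len(s) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return C, occ, len(s)


def count(index, pattern):
    """Count occurrences of pattern via backward search over the BWT."""
    C, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows matching the current suffix
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
print(count(index, "abra"))  # number of times "abra" occurs in the text
```

The query loop touches only the `C` and `occ` tables, one step per pattern character, which is why the original text does not need to be kept alongside the index to answer counting queries.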
