Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Abstract

Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we show their value in both text analysis and improving neural LLMs. Yet this necessitates modernizing n-gram models in two respects. First, we train them at the same data scale as neural LLMs: 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing n-gram models use small n, which hinders their performance; we instead allow n to be arbitrarily large by introducing a new ∞-gram LM with backoff. Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram, powered by suffix arrays, that can compute ∞-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency. The ∞-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text. We find that the ∞-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their language modeling perplexities. When analyzing machine-generated text, we also observe irregularities in the machine–∞-gram agreement level with respect to the suffix length, which indicate deficiencies in neural LLM pretraining and in the positional embeddings of Transformers. We open-source our infini-gram engine in the hope of enabling more study on how to best use verbatim information retrieved from large text corpora.
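To make the suffix-array idea concrete, here is a minimal sketch in Python of how n-gram counts for arbitrary n can be read off a suffix array without any pre-computed count table, and how an unbounded-n estimate can back off to the longest context suffix that occurs in the corpus. This is a toy illustration under stated assumptions, not the infini-gram engine or its API: the function names (`build_suffix_array`, `count_occurrences`, `unbounded_ngram_prob`) are hypothetical, the construction is naive, and the backoff rule (drop leading context tokens until the count is nonzero) is one reading of "backoff" as described in the abstract.

```python
def build_suffix_array(tokens):
    """Naive construction: sort suffix start positions lexicographically.
    Fine for a toy corpus; a real engine would use a scalable algorithm."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, pattern, upper):
    """Binary search over the suffix array, comparing only the first
    len(pattern) tokens of each suffix. Returns the lower or upper bound."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(tokens, sa, pattern):
    """Count occurrences of `pattern` (a list of tokens) in the corpus.
    All matching suffixes form one contiguous range of the suffix array."""
    return _bound(tokens, sa, pattern, True) - _bound(tokens, sa, pattern, False)


def unbounded_ngram_prob(tokens, sa, context, next_token):
    """Assumed backoff rule: use the longest suffix of `context` that appears
    in the corpus, then estimate P(next_token | suffix) from raw counts.
    Returns (probability, length of the suffix actually used)."""
    for start in range(len(context) + 1):
        suffix = context[start:]
        denom = count_occurrences(tokens, sa, suffix)
        if denom > 0:
            numer = count_occurrences(tokens, sa, suffix + [next_token])
            return numer / denom, len(suffix)
    raise ValueError("empty corpus")


# Toy usage: "x a b" never occurs, so the estimate backs off to "a b",
# and 2 of the 3 occurrences of "a b" are followed by "c".
corpus = "a b c a b d a b c".split()
sa = build_suffix_array(corpus)
prob, used = unbounded_ngram_prob(corpus, sa, ["x", "a", "b"], "c")
```

Each query costs a pair of binary searches over the suffix array, so the context length is not fixed in advance; that property is what allows n to grow without building a table for every n.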
