
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

January 30, 2024
Authors: Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi
cs.AI

Abstract

Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we show their values in both text analysis and improving neural LLMs. Yet this necessitates modernizing n-gram models in two aspects. First, we train them at the same data scale as neural LLMs -- 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing n-gram models use small n which hinders their performance; we instead allow n to be arbitrarily large, by introducing a new ∞-gram LM with backoff. Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute ∞-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency. The ∞-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the ∞-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their language modeling perplexities. When analyzing machine-generated text, we also observe irregularities in the machine--∞-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers. We open-source our infini-gram engine in the hopes of enabling more study on how to best use verbatim information retrieved from large text corpora.
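To make the suffix-array and backoff ideas concrete, here is a minimal illustrative sketch in Python: a suffix array lets any n-gram be counted by binary search, and the ∞-gram probability conditions on the longest suffix of the context that occurs in the corpus. All names (`build_suffix_array`, `count`, `infgram_prob`) and the toy corpus are hypothetical stand-ins, not the actual infini-gram engine or its API, which operates at trillion-token scale with a far more efficient implementation.

```python
# Illustrative sketch of suffix-array-based infinity-gram estimation (toy scale).
from bisect import bisect_left, bisect_right

def build_suffix_array(tokens):
    """Sort all suffix start positions of the corpus lexicographically (naive, toy-sized)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count(tokens, sa, query):
    """Count corpus occurrences of `query` by binary search over the suffix array."""
    key = lambda i: tokens[i:i + len(query)]          # compare only the first len(query) tokens
    lo = bisect_left(sa, query, key=key)              # `key=` requires Python >= 3.10
    hi = bisect_right(sa, query, key=key)
    return hi - lo

def infgram_prob(tokens, sa, context, next_token):
    """P(next_token | longest suffix of `context` that has nonzero count in the corpus)."""
    # Backoff: shrink the context from the left until its count is nonzero.
    for start in range(len(context) + 1):
        suffix = context[start:]
        denom = count(tokens, sa, suffix)
        if denom > 0:
            num = count(tokens, sa, suffix + [next_token])
            return num / denom
    return 0.0  # unreachable: the empty suffix matches every position

# Toy usage on a whitespace-tokenized corpus.
corpus = "the cat sat on the mat the cat sat on the hat".split()
sa = build_suffix_array(corpus)
print(infgram_prob(corpus, sa, "sat on the".split(), "mat"))  # -> 0.5
```

In this toy corpus the context "sat on the" occurs twice and is followed by "mat" once, so the sketch returns 0.5; the paper's engine answers the same kind of query in milliseconds over 1.4 trillion tokens without precomputing any n-gram count tables.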