Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
January 30, 2024
Authors: Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi
cs.AI
Abstract
Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we show their value in both text analysis and improving neural LLMs. Yet this necessitates modernizing n-gram models in two aspects. First, we train them at the same data scale as neural LLMs: 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing n-gram models use a small n, which hinders their performance; we instead allow n to be arbitrarily large by introducing a new ∞-gram LM with backoff. Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram, powered by suffix arrays, that can compute ∞-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency. The ∞-gram framework and the infini-gram engine enable many novel and interesting analyses of human-written and machine-generated text: we find that the ∞-gram LM has fairly high accuracy for next-token prediction (47%) and can complement neural LLMs to greatly reduce their language modeling perplexity. When analyzing machine-generated text, we also observe irregularities in the agreement level between machines and the ∞-gram with respect to suffix length, which indicates deficiencies in neural LLM pretraining and in the positional embeddings of Transformers. We open-source our infini-gram engine in the hope of enabling more study of how best to use verbatim information retrieved from large text corpora.
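
To make the suffix-array mechanism described in the abstract concrete, here is a minimal, illustrative Python sketch (not the authors' implementation): it counts any n-gram as the size of a suffix-array interval found by binary search, and estimates an ∞-gram probability by backing off to the longest context suffix with a nonzero corpus count. The function names (build_suffix_array, ngram_count, infgram_prob) and the in-memory token-list representation are assumptions for illustration only; the released infini-gram engine operates on on-disk suffix arrays over trillion-token corpora.

    from bisect import bisect_left, bisect_right

    def build_suffix_array(tokens):
        # All start positions, sorted by the token suffix beginning there.
        # (A toy O(n^2 log n) construction; a real engine builds the
        # suffix array on disk with a scalable algorithm.)
        return sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def ngram_count(tokens, sa, query):
        # Occurrences of `query` = size of the suffix-array interval whose
        # suffixes start with `query`, found with two binary searches.
        # (bisect's `key` argument requires Python >= 3.10.)
        prefix = lambda i: tokens[i:i + len(query)]
        lo = bisect_left(sa, query, key=prefix)
        hi = bisect_right(sa, query, key=prefix)
        return hi - lo

    def infgram_prob(tokens, sa, context, next_token):
        # Back off to the longest suffix of `context` that occurs in the
        # corpus, then estimate P(next_token | suffix) as a count ratio.
        for start in range(len(context) + 1):
            suffix = context[start:]
            denom = ngram_count(tokens, sa, suffix)  # empty suffix -> len(tokens)
            if denom > 0:
                numer = ngram_count(tokens, sa, suffix + [next_token])
                return numer / denom

    corpus = "the cat sat on the mat . the cat sat on the hat .".split()
    sa = build_suffix_array(corpus)
    print(ngram_count(corpus, sa, ["sat", "on"]))                        # 2
    print(infgram_prob(corpus, sa, ["cat", "sat", "on", "the"], "mat"))  # 0.5

The point this sketch tries to capture is that no n-gram count table is ever materialized: every count, for any n, is recovered on demand from the sorted suffix positions, which is what makes arbitrarily large n feasible.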