
Large Language Models as Markov Chains

October 3, 2024
Authors: Oussama Zekri, Ambroise Odonnat, Abdelhakim Benechehab, Linus Bleistein, Nicolas Boullé, Ievgen Redko
cs.AI

Abstract

Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the origins of their impressive performance remains elusive. In this paper, we approach this challenging task by drawing an equivalence between generic autoregressive language models with vocabulary of size T and context window of size K and Markov chains defined on a finite state space of size O(T^K). We derive several surprising findings related to the existence of a stationary distribution of Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on the latter. We then prove pre-training and in-context generalization bounds and show how the drawn equivalence allows us to enrich their interpretation. Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.
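To make the equivalence described in the abstract concrete, the sketch below is a minimal toy illustration, not the paper's construction: with a small vocabulary of size T and context window of size K, an autoregressive next-token distribution induces a Markov chain on the T^K possible contexts, whose stationary distribution can be approximated by power iteration at different temperatures. The random stand-in for the model's logits, the parameter values, and all function names are assumptions made for illustration only.

```python
# Toy sketch (illustrative only): an autoregressive model with vocabulary size T
# and context window K induces a Markov chain on the O(T^K) space of length-K
# contexts. We approximate its stationary distribution and vary the temperature.
import itertools
import numpy as np

T, K = 3, 2                                            # tiny vocabulary and context window
rng = np.random.default_rng(0)
states = list(itertools.product(range(T), repeat=K))   # all T^K contexts
index = {s: i for i, s in enumerate(states)}

# Random logits stand in for an LLM's next-token scores per context (assumption).
logits = rng.normal(size=(len(states), T))

def transition_matrix(temperature: float) -> np.ndarray:
    """Row-stochastic matrix P[s, s'] of the induced chain at a given temperature."""
    probs = np.exp(logits / temperature)
    probs /= probs.sum(axis=1, keepdims=True)           # softmax over next tokens
    P = np.zeros((len(states), len(states)))
    for s, ctx in enumerate(states):
        for tok in range(T):
            nxt = index[ctx[1:] + (tok,)]               # slide the context window
            P[s, nxt] += probs[s, tok]
    return P

def stationary(P: np.ndarray, iters: int = 10_000) -> np.ndarray:
    """Approximate the stationary distribution by power iteration."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
    return pi

for temp in (0.5, 1.0, 2.0):
    pi = stationary(transition_matrix(temp))
    entropy = -(pi * np.log(pi + 1e-12)).sum()
    print(f"temperature={temp}: entropy of stationary distribution = {entropy:.3f}")
```

In this toy setting, raising the temperature flattens the next-token softmax, which typically pushes the stationary distribution toward uniform (higher entropy), echoing the temperature effects the paper analyzes.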
