Large Language Models as Markov Chains
October 3, 2024
Authors: Oussama Zekri, Ambroise Odonnat, Abdelhakim Benechehab, Linus Bleistein, Nicolas Boullé, Ievgen Redko
cs.AI
Abstract
Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the origins of their impressive performance remains elusive. In this paper, we approach this challenging task by drawing an equivalence between generic autoregressive language models with vocabulary of size T and context window of size K and Markov chains defined on a finite state space of size O(T^K). We derive several surprising findings related to the existence of a stationary distribution of Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on the latter. We then prove pre-training and in-context generalization bounds and show how the drawn equivalence allows us to enrich their interpretation. Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.
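To make the central construction concrete, the sketch below (not the paper's exact formalization) treats a toy autoregressive model with vocabulary size T and context window K as a Markov chain whose states are the T^K possible context windows. The softmax table of logits, the variable names, and the power-iteration routine are illustrative assumptions chosen only to show how a transition matrix and its stationary distribution arise from next-token probabilities.

```python
# Minimal sketch: an autoregressive next-token model over a vocabulary of size T
# with context window K, viewed as a Markov chain on the T^K context-window states.
# The logits table is a random stand-in for a real LLM's next-token distribution.
import itertools
import numpy as np

T, K = 3, 2                     # toy vocabulary size and context window
temperature = 1.0               # temperature reshapes the transition probabilities
rng = np.random.default_rng(0)

states = list(itertools.product(range(T), repeat=K))   # O(T^K) states
index = {s: i for i, s in enumerate(states)}

# Toy "LLM": one row of logits per context window (illustrative assumption).
logits = rng.normal(size=(len(states), T))

def next_token_probs(state_idx, temp):
    """Softmax of the logits for a given context window at a given temperature."""
    z = logits[state_idx] / temp
    z -= z.max()                # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Transition matrix P: emitting token t from window (w1, ..., wK) moves the chain
# to the shifted window (w2, ..., wK, t).
P = np.zeros((len(states), len(states)))
for s, i in index.items():
    p = next_token_probs(i, temperature)
    for t in range(T):
        j = index[s[1:] + (t,)]
        P[i, j] += p[t]

# Stationary distribution pi (pi P = pi) via power iteration.
pi = np.full(len(states), 1.0 / len(states))
for _ in range(10_000):
    pi = pi @ P

print("rows of P sum to 1:", np.allclose(P.sum(axis=1), 1.0))
print("stationary distribution over context windows:", np.round(pi, 3))
```

In this toy setting, changing `temperature` changes the transition matrix and hence both the stationary distribution and how quickly the chain approaches it, which is the kind of dependence the abstract's results about convergence speed and temperature concern.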