Large Language Models as Markov Chains

Abstract

Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the origins of their impressive performance remains elusive. In this paper, we approach this challenging task by drawing an equivalence between generic autoregressive language models with a vocabulary of size T and a context window of size K and Markov chains defined on a finite state space of size O(T^K). We derive several surprising findings related to the existence of a stationary distribution of the Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on that convergence. We then prove pre-training and in-context generalization bounds and show how the drawn equivalence allows us to enrich their interpretation. Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.
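
The equivalence described in the abstract can be illustrated with a toy example. The sketch below is not the paper's construction or code. It assumes that a state is any token sequence of length 1 to K, that a transition appends one sampled token and drops the oldest token once the window is full, and that next-token probabilities come from a temperature-scaled softmax. The function toy_logits is a hypothetical stand-in for a real model, and the names build_chain and stationary are likewise chosen only for illustration.

```python
import itertools
import math


def toy_logits(context, T):
    """Hypothetical stand-in for a language model: one logit per token."""
    seed = sum((i + 1) * (tok + 1) for i, tok in enumerate(context))
    return [math.sin(seed * (t + 1)) for t in range(T)]


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def build_chain(T, K, temperature=1.0):
    """Return (states, P), where P[i][j] is the probability of moving i -> j."""
    # States: every token sequence of length 1..K over a vocabulary of size T.
    states = [s for k in range(1, K + 1)
              for s in itertools.product(range(T), repeat=k)]
    index = {s: i for i, s in enumerate(states)}
    P = [[0.0] * len(states) for _ in states]
    for s in states:
        probs = softmax(toy_logits(s, T), temperature)
        for tok, p in enumerate(probs):
            # Append the token; drop the oldest one if the window is full.
            nxt = (s + (tok,))[-K:]
            P[index[s]][index[nxt]] += p
    return states, P


def stationary(P, steps=500):
    """Approximate the stationary distribution by repeated left-multiplication."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(steps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


if __name__ == "__main__":
    T, K = 2, 2
    states, P = build_chain(T, K, temperature=1.0)
    pi = stationary(P)
    for s, p in zip(states, pi):
        print(s, round(p, 4))
```

Under these assumptions the chain has T + T^2 + ... + T^K states, which is O(T^K). Sequences shorter than K are transient, so the stationary mass sits entirely on full-window states. Changing the temperature argument changes the transition probabilities and therefore how quickly the power iteration settles.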

