
Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs

November 3, 2025
Author: Bo Bai
cs.AI

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in numerous real-world applications. While experimentally driven research is progressing rapidly, it demands substantial computational power, data, and other resources. How to open the black box of LLMs from a theoretical standpoint has therefore become a critical challenge. Taking the theory of the rate-distortion function, directed information, and Granger causality as its starting point, this paper investigates the information-theoretic principles behind LLMs and develops a semantic information theory for LLMs in which the fundamental unit is the token, rather than the bit, which carries no semantic meaning. By defining a probabilistic model of LLMs, we discuss structure-agnostic information-theoretic measures: the directed rate-distortion function in pre-training, the directed rate-reward function in post-training, and the semantic information flow in the inference phase. The paper also delves into the theory of token-level semantic embedding and the information-theoretically optimal vectorization method. We then propose a general definition of the autoregressive LLM, from which the Transformer architecture and its performance measures, such as the ELBO, generalization error bound, memory capacity, and semantic information measures, can be derived theoretically. Other architectures, such as Mamba/Mamba2 and LLaDA, are also discussed within this framework. Consequently, the paper provides a theoretical framework for understanding LLMs from the perspective of semantic information theory and offers the necessary theoretical tools for further in-depth research.
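For orientation, the quantities the abstract invokes can be sketched via their standard textbook definitions (these are the classical formulations, not necessarily the paper's own token-level variants): the rate-distortion function, Massey's directed information (the causal analogue of mutual information underlying Granger-causality arguments), and the autoregressive factorization over tokens that a probabilistic model of an autoregressive LLM realizes.

% Rate-distortion function: the minimum rate at which a source X can be
% described within expected distortion D.
R(D) \;=\; \min_{p(\hat{x}\mid x)\,:\;\mathbb{E}[d(X,\hat{X})]\le D} I(X;\hat{X})

% Directed information from X^n to Y^n (Massey, 1990): causal information
% flow, in contrast to the symmetric mutual information I(X^n; Y^n).
I(X^n \to Y^n) \;=\; \sum_{t=1}^{n} I\!\left(X^{t};\, Y_t \mid Y^{t-1}\right)

% Autoregressive factorization over a token sequence w_1, ..., w_T,
% with model parameters \theta; each token is predicted from its prefix.
p_\theta(w_1,\dots,w_T) \;=\; \prod_{t=1}^{T} p_\theta\!\left(w_t \mid w_{<t}\right)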