
Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs

November 3, 2025
Author: Bo Bai
cs.AI

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in numerous real-world applications. While research conducted from an experimental perspective is progressing rapidly, it demands substantial computational power, data, and other resources. How to open the black box of LLMs from a theoretical standpoint has therefore become a critical challenge. This paper takes the theories of the rate-distortion function, directed information, and Granger causality as its starting point to investigate the information-theoretic principles behind LLMs, leading to a semantic information theory for LLMs in which the fundamental unit is the token, rather than the bit, which lacks semantic meaning. By defining the probabilistic model of LLMs, we discuss structure-agnostic information-theoretic measures, such as the directed rate-distortion function in pre-training, the directed rate-reward function in post-training, and the semantic information flow in the inference phase. The paper also delves deeply into the theory of token-level semantic embedding and the information-theoretically optimal vectorization method. Thereafter, we propose a general definition of autoregressive LLMs, from which the Transformer architecture and its performance metrics, such as the ELBO, generalization error bound, memory capacity, and semantic information measures, can be derived theoretically. Other architectures, such as Mamba/Mamba2 and LLaDA, are also discussed within our framework. Consequently, this paper provides a theoretical framework for understanding LLMs from the perspective of semantic information theory and offers the necessary theoretical tools for further in-depth research.
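For readers new to the quantities the abstract builds on, here is a minimal sketch of the classical definitions (standard textbook forms; the paper's contributions are directed, token-level variants of these, which are not reproduced here). The rate-distortion function gives the minimum rate at which a source $X$ can be described under an expected-distortion constraint:

\[ R(D) = \min_{p(\hat{x}\mid x)\,:\;\mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X}). \]

Directed information (Massey) is the causal analogue of mutual information and underlies Granger causality:

\[ I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i;\, Y_i \mid Y^{i-1}). \]

An autoregressive LLM over a token sequence $x_1,\dots,x_n$ factorizes as

\[ p_\theta(x_1,\dots,x_n) = \prod_{t=1}^{n} p_\theta(x_t \mid x_{<t}), \]

and in a latent-variable view with latent $z$, the ELBO lower-bounds the log-evidence:

\[ \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x\mid z)\big] - \mathrm{KL}\!\big(q_\phi(z\mid x)\,\|\,p(z)\big). \]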