LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
May 22, 2026
Authors: Xu Ouyang, Deyi Liu, Yuhang Cai, Jing Liu, Yuan Yang, Chen Zheng, Thomas Hartvigsen, Yiyuan Ma
cs.AI
Abstract
Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute.
We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation.
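For reference, the classical Shannon-Hartley theorem that the framework builds on states that channel capacity is C = B log2(1 + S/N). The sketch below implements only that textbook formula; it is not the paper's fitted scaling law, whose exact functional form is not given in this abstract. Under the mapping described above, the bandwidth argument would play the role of model parameters and the signal power the role of training tokens. The function name and the toy values are illustrative assumptions.

```python
import math


def shannon_hartley_capacity(bandwidth: float, signal_power: float, noise_power: float) -> float:
    """Textbook Shannon-Hartley capacity: C = B * log2(1 + S/N).

    Illustrative only. The abstract maps model parameters to bandwidth and
    training tokens to signal power, but the paper's actual loss formula is
    not reproduced here.
    """
    if bandwidth <= 0 or signal_power < 0 or noise_power <= 0:
        raise ValueError("bandwidth and noise_power must be positive, signal_power non-negative")
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


# Toy values (assumptions, not taken from the paper): with bandwidth and signal
# power held fixed, capacity shrinks as the noise power grows.
toy_noise_levels = [1.0, 3.0, 15.0]
toy_capacities = [shannon_hartley_capacity(1.0, 15.0, n) for n in toy_noise_levels]
```

With these toy values the capacity falls as noise rises, which is the qualitative effect the abstract attributes to scaling without preserving a sufficient SNR.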
We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization, and supervised fine-tuning on math, QA, and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong R² scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on Pythia models of ≤6.9B parameters with ≤180B tokens, it predicts the unseen 12B model up to 307B tokens at a pooled R² of 0.847, while monotonic baselines collapse.
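As a reading aid for the R² figures, the following is a minimal sketch of the standard coefficient of determination, R² = 1 - SS_res / SS_tot. The "pooled" variant shown here assumes that pooling means concatenating all held-out points before computing a single R²; the abstract does not define the term, so this is an assumption, and the helper names and toy numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


def pooled_r_squared(groups: Dict[str, Tuple[List[float], List[float]]]) -> float:
    """Assumed meaning of 'pooled': concatenate every group, then compute one R²."""
    all_observed: List[float] = []
    all_predicted: List[float] = []
    for observed, predicted in groups.values():
        all_observed.extend(observed)
        all_predicted.extend(predicted)
    return r_squared(all_observed, all_predicted)


# Toy numbers (not from the paper): a perfect fit, a mean predictor, and a
# prediction that trends the wrong way.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
wrong_trend = r_squared([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

R² is not bounded below by zero: a predictor that does worse than the mean of the observations yields a negative value, which is one way a monotonic fit can fail badly on held-out points that follow a U shape.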