LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
May 22, 2026
Authors: Xu Ouyang, Deyi Liu, Yuhang Cai, Jing Liu, Yuan Yang, Chen Zheng, Thomas Hartvigsen, Yiyuan Ma
cs.AI
Abstract
Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute.
We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation.
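For reference, the classical Shannon-Hartley theorem that the framework builds on is shown below. It gives the capacity C of a channel with bandwidth B, signal power S, and noise power N. Under the mapping described above, B would correspond to model parameters and S to training tokens. The exact functional form of the Shannon Scaling Law is not stated in this abstract, so the block shows only the classical theorem, and the symbol names are the conventional ones rather than the authors' notation.

```latex
% Classical Shannon-Hartley theorem (not the paper's fitted law).
% C: channel capacity, B: bandwidth, S: signal power, N: noise power.
\[
  C = B \log_2\!\left(1 + \frac{S}{N}\right),
  \qquad
  \mathrm{SNR} = \frac{S}{N}.
\]
```

In the classical theorem, capacity depends on bandwidth and on the ratio S/N together, which is consistent with the abstract's point that scaling parameters or data alone, without preserving SNR, does not guarantee improvement.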
We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization, and supervised fine-tuning on math, QA, and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong R² scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on Pythia models of ≤6.9B parameters trained on ≤180B tokens, it predicts the unseen 12B model up to 307B tokens at a pooled R² of 0.847, while monotonic baselines collapse.
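The R² scores reported above are coefficients of determination. The following is a minimal sketch of how such a score can be computed, assuming "pooled" means that all evaluation points are combined into one set before scoring (the abstract does not define it). The function name `r_squared` and the loss values are hypothetical toy numbers chosen to illustrate a U-shaped curve; they are not results from the paper.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error of the predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Toy U-shaped losses (hypothetical): loss falls, then rises again.
toy_observed = [3.0, 2.5, 2.3, 2.6, 3.1]
# A monotonically decreasing prediction that misses the upturn.
toy_monotonic = [3.0, 2.6, 2.3, 2.1, 2.0]

if __name__ == "__main__":
    print("perfect fit:  ", r_squared(toy_observed, toy_observed))
    print("monotonic fit:", r_squared(toy_observed, toy_monotonic))
```

On this toy data the monotonic prediction scores below zero, meaning it does worse than predicting the mean loss everywhere. This is one way a monotonic fit can fail on a U-shaped curve.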