LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

Abstract

Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization, and supervised fine-tuning on math, QA, and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq$6.9B Pythia models with $\leq$180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2 = 0.847$, while monotonic baselines collapse.
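
As a rough illustration of the analogy, the sketch below implements the standard Shannon-Hartley capacity, C = B log2(1 + S/N), and then applies a toy mapping in which parameter count plays the role of bandwidth B and token count plays the role of signal power S. The abstract does not state the functional form of the Shannon Scaling Law, so the noise model here (noise power growing quadratically with tokens) and the names `toy_capacity`, `base_noise`, and `noise_coeff` are assumptions chosen only to show how a noise term that grows faster than the signal can turn a monotonic trend into a non-monotonic one. It is not the fitted law from the paper.

```python
import math


def shannon_hartley_capacity(bandwidth: float, signal_power: float, noise_power: float) -> float:
    """Standard Shannon-Hartley capacity: C = B * log2(1 + S / N)."""
    if bandwidth <= 0 or signal_power < 0 or noise_power <= 0:
        raise ValueError("bandwidth and noise_power must be > 0, signal_power >= 0")
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


def toy_capacity(n_params: float, n_tokens: float,
                 base_noise: float = 1.0, noise_coeff: float = 1e-4) -> float:
    """Hypothetical mapping, for illustration only (not the paper's law).

    Assumptions:
      - bandwidth    B = n_params
      - signal power S = n_tokens
      - noise power  N = base_noise + noise_coeff * n_tokens**2
    With this noise model the SNR peaks at n_tokens = sqrt(base_noise / noise_coeff)
    and falls afterwards, so capacity is non-monotonic in the token count.
    """
    noise_power = base_noise + noise_coeff * n_tokens ** 2
    return shannon_hartley_capacity(n_params, n_tokens, noise_power)


if __name__ == "__main__":
    # Sweep the (unitless, toy) token count at a fixed toy model size.
    for tokens in (1, 10, 100, 1_000, 10_000):
        print(f"tokens={tokens:>6}  toy capacity={toy_capacity(1.0, tokens):.3f}")
```

Under these assumed constants the toy capacity rises up to `n_tokens = 100` and declines beyond it, which is the inverted counterpart of a U-shaped loss curve. A monotonic power law cannot produce that turning point, whatever its exponents.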