LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

Abstract

Existing scaling laws for large language models (LLMs) are predominantly monotonic power laws, and they fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework grounded in the Shannon-Hartley theorem that models LLM training as information transmission over a noisy channel. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between the learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate the theory through experiments on Pythia and OLMo2 under perturbations that include Gaussian noise, quantization, and supervised fine-tuning on math, QA, and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving high R² scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on Pythia models of ≤6.9B parameters trained on ≤180B tokens, it predicts the unseen 12B model up to 307B tokens with a pooled R² of 0.847, while monotonic baselines collapse.
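
For reference, the classical Shannon-Hartley theorem that the abstract builds on gives the capacity of a band-limited channel with additive white Gaussian noise as C = B log2(1 + S/N), where B is the bandwidth, S the signal power, and N the noise power. The sketch below illustrates only that classical formula. It is not the paper's fitted scaling law, whose exact functional form is not stated in this abstract, and the function names and example values are hypothetical choices made for illustration.

```python
import math


def shannon_hartley_capacity(bandwidth: float, signal_power: float, noise_power: float) -> float:
    """Classical Shannon-Hartley capacity: C = B * log2(1 + S / N).

    With bandwidth in Hz the result is in bits per second.
    """
    if bandwidth < 0 or signal_power < 0:
        raise ValueError("bandwidth and signal_power must be non-negative")
    if noise_power <= 0:
        raise ValueError("noise_power must be positive")
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


def white_noise_capacity(bandwidth: float, signal_power: float, noise_density: float) -> float:
    """Capacity when the noise is white, so total noise power grows with bandwidth.

    Uses N = noise_density * bandwidth, the standard white-noise assumption.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    return shannon_hartley_capacity(bandwidth, signal_power, noise_density * bandwidth)


if __name__ == "__main__":
    # Illustrative values only (hypothetical, not taken from the paper).
    signal_power, noise_density = 1.0, 1.0
    wideband_limit = signal_power / (noise_density * math.log(2))
    for bandwidth in (1.0, 10.0, 100.0, 1000.0):
        capacity = white_noise_capacity(bandwidth, signal_power, noise_density)
        print(f"B={bandwidth:7.1f}  C={capacity:.4f}  (limit {wideband_limit:.4f})")
```

Under the abstract's analogy, B stands in for the number of model parameters and S for the number of training tokens. The second function shows a well-known classical effect: at fixed signal power and noise density, widening the band raises capacity only up to the limit S / (N0 ln 2), because the noise admitted grows with the bandwidth. This is offered as background on why the signal-to-noise ratio matters in the classical setting, not as a statement of the mechanism the paper derives.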