Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?

Abstract

As large language models (LLMs) scale, the question is not only how large they become, but how much of their capacity is effectively utilized. Existing scaling laws relate model size to loss, yet overlook how individual components exploit their latent space. We study feed-forward networks (FFNs) and recast width selection as a spectral utilization problem. Using a lightweight diagnostic suite consisting of Hard Rank (participation ratio), Soft Rank (Shannon rank), Spectral Concentration, and the composite Spectral Utilization Index (SUI), we quantify how many latent directions are meaningfully activated across the LLaMA, GPT-2, and nGPT families. Our key finding is an asymmetric spectral scaling law: soft rank follows an almost perfect power law in FFN width, while hard rank grows only sublinearly and with high variance. This asymmetry suggests that widening FFNs mostly adds low-energy tail directions, while dominant-mode subspaces saturate early. Moreover, at larger widths, variance further collapses into a narrow subspace, leaving much of the latent space under-utilized. These results recast FFN width selection as a principled trade-off between tail capacity and dominant-mode capacity, offering concrete guidance for inference-efficient LLM design.
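To make the two rank diagnostics concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract names the participation ratio and the Shannon rank but does not give formulas, so the sketch assumes their standard definitions over a non-negative eigenvalue spectrum: hard rank as (sum of eigenvalues)^2 / (sum of squared eigenvalues), and soft rank as the exponential of the Shannon entropy of the normalized spectrum. The function names and the toy spectrum (a few dominant modes plus a growing low-energy tail) are hypothetical and chosen only for illustration.

```python
import math


def hard_rank(eigenvalues):
    """Participation ratio: (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Assumed standard definition; dominated by the high-energy modes.
    """
    total = sum(eigenvalues)
    sum_sq = sum(v * v for v in eigenvalues)
    if sum_sq == 0:
        return 0.0
    return total * total / sum_sq


def soft_rank(eigenvalues):
    """Shannon rank: exp(entropy) of the spectrum normalized to sum to 1.

    Assumed standard definition; sensitive to low-energy tail directions.
    """
    total = sum(eigenvalues)
    if total == 0:
        return 0.0
    entropy = 0.0
    for v in eigenvalues:
        p = v / total
        if p > 0:  # zero eigenvalues contribute nothing to the entropy
            entropy -= p * math.log(p)
    return math.exp(entropy)


def toy_spectrum(n_tail, n_dominant=4, dominant=10.0, tail=0.1):
    """Hypothetical spectrum: a few dominant modes plus a low-energy tail."""
    return [dominant] * n_dominant + [tail] * n_tail


if __name__ == "__main__":
    # Growing the tail stands in for widening the FFN in this toy setting.
    for n_tail in (0, 10, 100, 1000):
        spectrum = toy_spectrum(n_tail)
        print(n_tail, round(hard_rank(spectrum), 2), round(soft_rank(spectrum), 2))
```

In this toy setting, adding tail directions raises the soft rank faster than the hard rank, which is one way to picture the asymmetry described above. It says nothing about the exponents or variances measured on real models.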
