Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?

Abstract

As large language models (LLMs) scale, the question is not only how large they become but how much of their capacity they actually use. Existing scaling laws relate model size to loss, yet overlook how individual components exploit their latent space. We study feed-forward networks (FFNs) and recast width selection as a spectral utilization problem. Using a lightweight diagnostic suite consisting of Hard Rank (participation ratio), Soft Rank (Shannon rank), Spectral Concentration, and the composite Spectral Utilization Index (SUI), we quantify how many latent directions are meaningfully activated across the LLaMA, GPT-2, and nGPT families. Our key finding is an asymmetric spectral scaling law: soft rank follows an almost perfect power law in FFN width, while hard rank grows only sublinearly and with high variance. This asymmetry suggests that widening FFNs mostly adds low-energy tail directions, while the dominant-mode subspace saturates early. Moreover, at larger widths, variance collapses further into a narrow subspace, leaving much of the latent space under-utilized. These results recast FFN width selection as a principled trade-off between tail capacity and dominant-mode capacity, offering concrete guidance for inference-efficient LLM design.
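The two rank diagnostics named in the abstract can be sketched from their standard textbook definitions. The sketch below assumes the spectrum is the set of eigenvalues of the covariance of FFN hidden activations; the abstract does not spell this out, so the paper's exact normalization may differ. The function names and the `top_k_energy` measure are hypothetical illustrations, not the paper's definition of Spectral Concentration, and SUI is omitted because its formula is not given here.

```python
import numpy as np


def covariance_spectrum(activations):
    """Eigenvalues of the covariance of an (n_samples, width) activation matrix.

    Assumption: the spectrum of interest is that of the activation covariance.
    """
    x = np.asarray(activations, dtype=np.float64)
    x = x - x.mean(axis=0, keepdims=True)
    # Singular values s of the centered data give covariance eigenvalues s^2 / (n - 1).
    s = np.linalg.svd(x, compute_uv=False)
    return s ** 2 / max(x.shape[0] - 1, 1)


def hard_rank(eigvals):
    """Participation ratio: (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    lam = np.clip(np.asarray(eigvals, dtype=np.float64), 0.0, None)
    return float(lam.sum() ** 2 / (lam ** 2).sum())


def soft_rank(eigvals):
    """Shannon rank: exponential of the entropy of the normalized spectrum."""
    lam = np.clip(np.asarray(eigvals, dtype=np.float64), 0.0, None)
    p = lam / lam.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(np.exp(-(p * np.log(p)).sum()))


def top_k_energy(eigvals, k):
    """Fraction of total variance carried by the k largest eigenvalues.

    Hypothetical stand-in for a concentration measure, not the paper's definition.
    """
    lam = np.sort(np.clip(np.asarray(eigvals, dtype=np.float64), 0.0, None))[::-1]
    return float(lam[:k].sum() / lam.sum())
```

Both ranks equal the number of directions when the spectrum is flat and approach 1 when a single mode dominates. The Shannon rank is never smaller than the participation ratio, and it is more sensitive to many small eigenvalues, which is consistent with the abstract's reading that added width mainly contributes low-energy tail directions.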
