Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?
October 1, 2025
Authors: Nandan Kumar Jha, Brandon Reagen
cs.AI
Abstract
As large language models (LLMs) scale, the question is not only how large
they become, but how much of their capacity is effectively utilized. Existing
scaling laws relate model size to loss, yet overlook how components exploit
their latent space. We study feed-forward networks (FFNs) and recast width
selection as a spectral utilization problem. Using a lightweight diagnostic
suite -- Hard Rank (participation ratio), Soft Rank (Shannon rank), Spectral
Concentration, and the composite Spectral Utilization Index (SUI) -- we
quantify how many latent directions are meaningfully activated across LLaMA,
GPT-2, and nGPT families. Our key finding is an asymmetric spectral scaling
law: soft rank follows an almost perfect power law with FFN width, while hard
rank grows only sublinearly and with high variance. This asymmetry suggests
that widening FFNs mostly adds low-energy tail directions, while dominant-mode
subspaces saturate early. Moreover, at larger widths, variance further
collapses into a narrow subspace, leaving much of the latent space
under-utilized. These results recast FFN width selection as a principled
trade-off between tail capacity and dominant-mode capacity, offering concrete
guidance for inference-efficient LLM design.
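The Hard Rank and Soft Rank diagnostics named in the abstract correspond to well-known spectral measures: the participation ratio and the Shannon (effective) rank of a spectrum. As a minimal sketch, assuming the standard definitions applied to the singular values of an FFN activation matrix (the paper's exact normalization and the SUI composite may differ), they can be computed as:

```python
import numpy as np

def hard_rank(S):
    """Participation ratio of the spectral energies lam_i = sigma_i^2:
    PR = (sum lam_i)^2 / sum lam_i^2. Equals n for a flat spectrum of
    n equal values, and 1 when a single direction carries all energy."""
    lam = np.asarray(S, dtype=float) ** 2
    return lam.sum() ** 2 / np.sum(lam ** 2)

def soft_rank(S):
    """Shannon (effective) rank: exp of the entropy of the normalized
    singular-value distribution p_i = sigma_i / sum_j sigma_j."""
    p = np.asarray(S, dtype=float)
    p = p / p.sum()
    # Convention: 0 * log(0) = 0 for zero singular values.
    h = -np.sum(np.where(p > 0, p * np.log(p), 0.0))
    return np.exp(h)

# Illustrative usage on a random activation matrix (tokens x FFN width);
# in the paper these would be real FFN activations from a trained model.
rng = np.random.default_rng(0)
X = rng.standard_normal((512, 1024))
S = np.linalg.svd(X, compute_uv=False)
print(f"hard rank: {hard_rank(S):.1f}, soft rank: {soft_rank(S):.1f}")
```

Both measures reduce to the matrix rank for a flat spectrum and shrink toward 1 as variance concentrates in few directions; the asymmetry the paper reports is that, as FFN width grows, the soft rank (sensitive to low-energy tail directions) scales as a power law while the hard rank (dominated by the top modes) saturates early.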