Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?
October 1, 2025
Authors: Nandan Kumar Jha, Brandon Reagen
cs.AI
Abstract
As large language models (LLMs) scale, the question is not only how large
they become, but how much of their capacity is effectively utilized. Existing
scaling laws relate model size to loss, yet overlook how components exploit
their latent space. We study feed-forward networks (FFNs) and recast width
selection as a spectral utilization problem. Using a lightweight diagnostic
suite -- Hard Rank (participation ratio), Soft Rank (Shannon rank), Spectral
Concentration, and the composite Spectral Utilization Index (SUI) -- we
quantify how many latent directions are meaningfully activated across LLaMA,
GPT-2, and nGPT families. Our key finding is an asymmetric spectral scaling
law: soft rank follows an almost perfect power law with FFN width, while hard
rank grows only sublinearly and with high variance. This asymmetry suggests
that widening FFNs mostly adds low-energy tail directions, while dominant-mode
subspaces saturate early. Moreover, at larger widths, variance further
collapses into a narrow subspace, leaving much of the latent space
under-utilized. These results recast FFN width selection as a principled
trade-off between tail capacity and dominant-mode capacity, offering concrete
guidance for inference-efficient LLM design.
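To make the diagnostic suite concrete, the following is a minimal NumPy sketch under the standard definitions: hard rank as the participation ratio of the squared singular values, and soft rank as the exponential of the entropy of the normalized spectrum. The top-k reading of Spectral Concentration and the helper name `spectral_diagnostics` are assumptions for illustration, not the paper's released code.

```python
import numpy as np

def spectral_diagnostics(acts: np.ndarray, k: int = 10):
    """Spectral diagnostics for one layer's FFN activations.

    acts: (num_tokens, ffn_width) activation matrix.
    Returns (hard_rank, soft_rank, concentration).
    """
    # Singular values of the centered activation matrix, in descending order.
    s = np.linalg.svd(acts - acts.mean(axis=0), compute_uv=False)
    energy = s**2                  # spectral energy per latent direction
    p = energy / energy.sum()      # normalized spectrum

    # Hard rank: participation ratio, (sum_i e_i)^2 / sum_i e_i^2.
    hard_rank = energy.sum() ** 2 / (energy**2).sum()

    # Soft rank: Shannon rank, exp of the spectral entropy.
    soft_rank = np.exp(-(p * np.log(p + 1e-12)).sum())

    # Spectral concentration, read here as the energy fraction in the
    # top-k modes (an assumption; the abstract gives no formula).
    concentration = energy[:k].sum() / energy.sum()

    return hard_rank, soft_rank, concentration
```

Sweeping such diagnostics over models of different FFN widths would reproduce the kind of comparison the abstract draws: a soft rank that keeps climbing with width while the hard rank plateaus would indicate that the added directions carry mostly low-energy tail variance.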
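The "almost perfect power law" relationship between soft rank and FFN width is also straightforward to test: a power law soft_rank = a * width^b is linear in log-log space, so an ordinary least-squares fit on the logs recovers the exponent. The sketch below uses synthetic placeholder numbers (the widths, the exponent 0.85, and the noise level are illustrative, not the paper's results).

```python
import numpy as np

# A power law soft_rank ~ a * width^b is linear in log-log space,
# so ordinary least squares on the logs recovers the exponent b.
widths = np.array([1024.0, 2048.0, 4096.0, 8192.0, 16384.0])
rng = np.random.default_rng(0)
soft_ranks = 0.7 * widths**0.85 * rng.lognormal(0.0, 0.02, widths.size)  # synthetic

slope, intercept = np.polyfit(np.log(widths), np.log(soft_ranks), deg=1)
print(f"fitted exponent b = {slope:.3f}, prefactor a = {np.exp(intercept):.3f}")
```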