NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks

Abstract

We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Although FFNs dominate the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our key insight is that FFN nonlinearities reinject variance across eigenmodes, fundamentally governing latent dimension utilization, and that optimizer geometry strongly modulates the extent of this variance reinjection. We validate NerVE across model scales and across diverse architectural and optimizer configurations, each of which shapes FFN dynamics in its own way: normalization schemes control variance flow; FFN weight geometries constrain the latent space; positional encodings and activation functions regulate information flow; and optimizer choices redistribute effective capacity across depth. Across these settings, NerVE consistently recovers stable spectral signatures that correlate with a model's generalization ability and respond predictably to design choices. The framework generalizes beyond transformers to MLP-Mixer architectures and provides actionable insights for architectural and optimizer choices beyond trial and error.
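To make the metrics named in the abstract more concrete, here is a minimal sketch of three of them, using their textbook definitions on a list of covariance eigenvalues. The abstract does not give formulas, so the definitions below are assumptions: normalized eigenvalues are treated as a probability distribution, and the function names are hypothetical and not taken from any NerVE code. Eigenvalue Early Enrichment is left out because the abstract does not define it.

```python
import math
from typing import List, Sequence


def normalize_spectrum(eigenvalues: Sequence[float]) -> List[float]:
    """Turn non-negative eigenvalues into a probability distribution."""
    # Clip tiny negative values that can appear from numerical round-off.
    clipped = [max(float(v), 0.0) for v in eigenvalues]
    total = sum(clipped)
    if total <= 0.0:
        raise ValueError("spectrum needs at least one positive eigenvalue")
    return [v / total for v in clipped]


def spectral_entropy(eigenvalues: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the normalized spectrum.

    Assumed definition: higher values mean variance is spread over more modes.
    """
    p = normalize_spectrum(eigenvalues)
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def participation_ratio(eigenvalues: Sequence[float]) -> float:
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Ranges from 1 (a single dominant mode) to d (a flat spectrum of size d).
    """
    p = normalize_spectrum(eigenvalues)
    return 1.0 / sum(pi * pi for pi in p)


def js_divergence(eigs_a: Sequence[float], eigs_b: Sequence[float]) -> float:
    """Jensen-Shannon divergence (in nats) between two normalized spectra."""
    p = normalize_spectrum(eigs_a)
    q = normalize_spectrum(eigs_b)
    if len(p) != len(q):
        raise ValueError("spectra must have the same length")
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl(a: List[float], b: List[float]) -> float:
        # Wherever a_i > 0 the mixture b_i is also > 0, so the ratio is safe.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    flat = [1.0, 1.0, 1.0, 1.0]      # variance spread evenly over 4 modes
    peaked = [1.0, 0.0, 0.0, 0.0]    # all variance in a single mode
    print(spectral_entropy(flat), participation_ratio(flat))
    print(spectral_entropy(peaked), participation_ratio(peaked))
    print(js_divergence(flat, peaked))
```

Under these assumed definitions, a flat spectrum gives the maximum entropy and a participation ratio equal to the dimension, while a spectrum concentrated in one mode gives zero entropy and a participation ratio of 1. One plausible use, consistent with the abstract's mention of "distributional shifts", is to pass `js_divergence` the spectra measured before and after the FFN nonlinearity.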
