NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks
March 6, 2026
Authors: Nandan Kumar Jha, Brandon Reagen
cs.AI
Abstract
We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Although FFNs dominate the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our key insight is that FFN nonlinearities reinject variance across eigenmodes, fundamentally governing how latent dimensions are utilized, and that optimizer geometry strongly modulates the extent of this variance reinjection. We validate NerVE across model scales and diverse architectural and optimizer configurations, each of which uniquely shapes FFN dynamics: normalization schemes control variance flow; FFN weight geometries constrain the latent space; positional encodings and activation functions regulate information flow; and optimizer choices redistribute effective capacity across depth. Across these settings, NerVE consistently recovers stable spectral signatures that correlate with a model's generalization ability and respond predictably to design choices. The framework generalizes beyond Transformers to MLP-Mixer architectures, providing actionable insights for architectural and optimizer choices beyond trial-and-error.
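To make the four metrics concrete, the following is a minimal NumPy sketch of how one might compute them from the eigenvalues of an FFN activation covariance matrix. The top-k early-enrichment proxy, the entropy normalization, and all variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nerve_metrics(eigvals, top_k=10, eps=1e-12):
    """Sketch of three of the four spectral metrics from a covariance eigenspectrum.

    `eigvals`: 1-D array of non-negative eigenvalues, e.g. of the covariance of
    an FFN layer's activations. `top_k` is an illustrative cutoff for the
    early-enrichment proxy; the paper's exact definitions may differ.
    """
    lam = np.clip(np.asarray(eigvals, dtype=np.float64), 0.0, None)
    lam = np.sort(lam)[::-1]
    p = lam / (lam.sum() + eps)              # eigenspectrum as a distribution

    # Spectral Entropy: dispersion of variance across eigenmodes,
    # normalized by log(d) so that a perfectly flat spectrum gives 1.0.
    spectral_entropy = -np.sum(p * np.log(p + eps)) / np.log(len(p))

    # Participation Ratio: effective number of dimensions carrying variance.
    participation_ratio = lam.sum() ** 2 / (np.sum(lam ** 2) + eps)

    # Early enrichment (assumed proxy): fraction of total variance captured
    # by the top-k eigenvalues, i.e. how top-heavy the spectrum is.
    early_enrichment = p[:top_k].sum()

    return spectral_entropy, participation_ratio, early_enrichment

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two normalized eigenspectra,
    e.g. the same layer's spectrum before and after the FFN nonlinearity."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * (kl(p, m) + kl(q, m))

# Usage sketch with stand-in activations of shape (tokens, hidden_dim).
acts = np.random.randn(4096, 768)
cov = np.cov(acts, rowvar=False)
lam = np.linalg.eigvalsh(cov)
print(nerve_metrics(lam))
```

In this sketch the eigenvalues come from the empirical covariance of a single layer's activations; tracking the same quantities layer by layer (and comparing pre- and post-nonlinearity spectra with the JS divergence) is one way to probe the variance-reinjection effect the abstract describes.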