NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks
March 6, 2026
Authors: Nandan Kumar Jha, Brandon Reagen
cs.AI
Abstract
We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Despite FFNs dominating the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our key insight is that FFN nonlinearities reinject variance across eigenmodes, fundamentally governing latent-dimension utilization, and that optimizer geometry strongly modulates the extent of this variance reinjection. We validate NerVE across model scales and diverse architectural and optimizer configurations, each of which uniquely shapes FFN dynamics: normalization schemes control variance flow; FFN weight geometries constrain the latent space; positional encodings and activation functions regulate information flow; and optimizer choices redistribute effective capacity across depth. Across these settings, NerVE consistently recovers stable spectral signatures that correlate with a model's generalization ability and respond predictably to design choices. The framework generalizes beyond Transformers to MLP-Mixer architectures, providing actionable insight for architectural and optimizer choices beyond trial-and-error.
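The four metrics named in the abstract are all standard functions of a (co)variance eigenvalue spectrum. As a minimal sketch, assuming the eigenvalues are normalized into a probability distribution first, they could be computed as follows; the function names and the exact definition of "early enrichment" (here, the variance fraction in the top-k modes) are illustrative assumptions, not the paper's actual code:

```python
import math

def normalize(eigs):
    """Normalize a nonnegative eigenvalue spectrum into a probability distribution."""
    total = sum(eigs)
    return [x / total for x in eigs]

def spectral_entropy(eigs):
    """Shannon entropy of the normalized spectrum: high = dispersed variance."""
    return -sum(p * math.log(p) for p in normalize(eigs) if p > 0)

def participation_ratio(eigs):
    """PR = (sum lam)^2 / sum lam^2: effective number of active eigenmodes."""
    return sum(eigs) ** 2 / sum(x * x for x in eigs)

def early_enrichment(eigs, k):
    """Fraction of total variance in the top-k eigenvalues (a plausible
    'top-heaviness' measure; the paper's exact definition may differ)."""
    p = sorted(normalize(eigs), reverse=True)
    return sum(p[:k])

def js_divergence(eigs_a, eigs_b):
    """Jensen-Shannon divergence between two normalized spectra of equal length."""
    p, q = normalize(eigs_a), normalize(eigs_b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

For a perfectly flat spectrum of n equal eigenvalues, spectral entropy reaches its maximum log(n) and the participation ratio equals n; a rank-one spectrum collapses both toward their minima, which is why these quantities track how many latent dimensions a layer actually uses.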