NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks

Abstract

We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Although FFNs dominate the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our key insight is that FFN nonlinearities reinject variance across eigenmodes, fundamentally governing latent dimension utilization, and that optimizer geometry strongly modulates the extent of this variance reinjection. We validate NerVE across model scales and across diverse architectural and optimizer configurations, each of which shapes FFN dynamics in its own way: normalization schemes control variance flow; FFN weight geometries constrain the latent space; positional encodings and activation functions regulate information flow; and optimizer choices redistribute effective capacity across depth. Across these settings, NerVE consistently recovers stable spectral signatures that correlate with a model's generalization ability and respond predictably to design choices. The framework generalizes beyond transformers to MLP-Mixer architectures, providing actionable insights for architectural and optimizer choices beyond trial and error.
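The abstract names the four metrics but does not give their formulas. As a rough illustration only, the sketch below computes each one from a list of eigenvalues, which are assumed to come from the covariance of FFN activations (obtained elsewhere, for example with `numpy.linalg.eigvalsh`). Spectral entropy, participation ratio, and Jensen-Shannon divergence follow their standard textbook definitions. The Eigenvalue Early Enrichment formula used here (mean gap between the cumulative sorted spectrum and a uniform baseline) is an assumption, not a definition confirmed by this text, and all function names are hypothetical.

```python
import math


def normalize(eigvals):
    """Clip negatives to zero and rescale eigenvalues to sum to 1."""
    clipped = [max(float(v), 0.0) for v in eigvals]
    total = sum(clipped)
    if total <= 0.0:
        raise ValueError("eigenvalues must have positive total variance")
    return [v / total for v in clipped]


def spectral_entropy(eigvals):
    """Shannon entropy (in nats) of the normalized spectrum: dispersion."""
    return -sum(p * math.log(p) for p in normalize(eigvals) if p > 0.0)


def participation_ratio(eigvals):
    """(sum lambda)^2 / sum lambda^2, ranging from 1 to len(eigvals)."""
    return 1.0 / sum(p * p for p in normalize(eigvals))


def early_enrichment(eigvals):
    """ASSUMED definition: mean gap between the cumulative share of the
    descending-sorted spectrum and the uniform baseline k/D.
    It is 0 for a flat spectrum and grows as variance concentrates on top."""
    p = sorted(normalize(eigvals), reverse=True)
    d = len(p)
    cumulative, area = 0.0, 0.0
    for k, share in enumerate(p, start=1):
        cumulative += share
        area += cumulative - k / d
    return area / d


def js_divergence(eig_a, eig_b):
    """Jensen-Shannon divergence (in nats) between two sorted, normalized
    spectra, e.g. before and after the FFN nonlinearity."""
    p = sorted(normalize(eig_a), reverse=True)
    q = sorted(normalize(eig_b), reverse=True)
    if len(p) != len(q):
        raise ValueError("spectra must have the same length")
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0.0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Under these definitions a flat spectrum gives maximal entropy and a participation ratio equal to the dimension, while a spectrum concentrated on a single eigenmode gives zero entropy and a participation ratio of 1. Comparing the metrics before and after the nonlinearity is one way to quantify the variance reinjection the abstract describes.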
