Bug or Feature²: Weight Drift, Activation Sparsity, and Spikes

Abstract

The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to the optimization rather than to the data, and it persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENe) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity of up to 90\% in GPT-nano. We characterize the sparsity--accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above $\sim$70\% activation sparsity. While ReLU^2 achieves a good sparsity--accuracy ratio in GPT-nano, it pathologically amplifies the activation spikes that we identify in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU^2 outperforms its unclipped version, and GELU^2 achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.
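As a concrete illustration of the squared activations and the sparsity measure discussed above, the following is a minimal sketch and not the authors' implementation. It assumes that clipping is applied to the ReLU output before squaring, that GELU^2 means the square of the exact (erf-based) GELU, and that activation sparsity is the fraction of exactly zero outputs. The function names, the threshold `clip=6.0`, and the sample inputs are hypothetical choices made only for this example.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def relu2(x: float) -> float:
    # Squared ReLU: zero for x <= 0, quadratic growth for x > 0.
    return relu(x) ** 2


def clipped_relu2(x: float, clip: float = 6.0) -> float:
    # Assumption: clip the ReLU output first, then square it.
    # The threshold 6.0 is illustrative, not taken from the paper.
    return min(relu(x), clip) ** 2


def gelu(x: float) -> float:
    # Exact GELU based on the Gaussian CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu2(x: float) -> float:
    # Assumption: GELU^2 denotes the squared GELU output.
    return gelu(x) ** 2


def activation_sparsity(activations: list[float]) -> float:
    # Fraction of activations that are exactly zero.
    if not activations:
        raise ValueError("activations must be non-empty")
    return sum(1 for a in activations if a == 0.0) / len(activations)


# Hypothetical pre-activations, including one large "spike" value.
pre_activations = [-2.0, -0.5, 0.0, 0.5, 3.0, 10.0]

unclipped = [relu2(z) for z in pre_activations]
clipped = [clipped_relu2(z) for z in pre_activations]

sparsity = activation_sparsity(unclipped)
spike_unclipped = max(unclipped)
spike_clipped = max(clipped)
```

In this sketch, squaring turns the pre-activation of 10 into an output of 100, while the clipped variant bounds it at 36 and leaves the smaller activations and the zero pattern unchanged. Note that under the definition assumed here the squared GELU is nonzero for negative inputs, so it does not produce exact zeros the way ReLU-based variants do.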