Bug or Feature^2: Weight Drift, Activation Sparsity, and Spikes

Abstract

The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, which drives downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than to the data, and it persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENe) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity of up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above approximately 70% activation sparsity. While ReLU^2 achieves a good sparsity-accuracy ratio in GPT-nano, it pathologically amplifies the activation spikes identified in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU^2 outperforms its unclipped counterpart, and GELU^2 achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.
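The quantities named in the abstract can be illustrated with a minimal pure-Python sketch. It is not taken from the linked repository. It assumes that ReLU^2 means max(0, x)^2, that clipping caps the squared output at a fixed threshold, and that activation sparsity is the fraction of exactly-zero activations. The function names and the threshold value `clip=6.0` are hypothetical choices made only for illustration.

```python
def relu2(x):
    """Squared ReLU, max(0, v)^2, applied elementwise to a list of floats."""
    return [max(0.0, v) ** 2 for v in x]


def clipped_relu2(x, clip=6.0):
    """Squared ReLU with its output capped at `clip`.

    Assumption: clipping is applied to the output, and 6.0 is a
    hypothetical threshold, not a value reported by the authors.
    """
    return [min(max(0.0, v) ** 2, clip) for v in x]


def activation_sparsity(acts):
    """Fraction of activations that are exactly zero."""
    return sum(1 for a in acts if a == 0.0) / len(acts)


# Toy pre-activations with one large outlier standing in for a "spike".
pre_acts = [-2.0, -0.5, 0.0, 1.0, 10.0]

print(relu2(pre_acts))          # the outlier 10.0 is squared to 100.0
print(clipped_relu2(pre_acts))  # the same outlier is capped at the threshold
print(activation_sparsity(relu2(pre_acts)))
```

In this toy input, squaring turns a pre-activation of 10 into 100, while the clipped variant bounds it at the threshold and leaves the zero pattern, and therefore the sparsity, unchanged.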