Bug or Feature Squared: Weight Drift, Activation Sparsity, and Spikes

Abstract

The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard loss functions and positively biased activation functions. We prove that under mean squared error (MSE) or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, which drives downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than to the data, and it persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENe) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity of up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff once activation sparsity exceeds approximately 70%. While ReLU² achieves a good sparsity-accuracy ratio in GPT-nano, it pathologically amplifies the activation spikes that we identify in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU² outperforms its unclipped version, and GELU² achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.
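To make the quantities named in the abstract concrete, the following is a minimal NumPy sketch of a squared ReLU, one possible clipped variant, and an activation-sparsity measurement. It is illustrative only and not taken from the linked repository: the function names, the clip threshold of 6.0, and the choice to clip after squaring are all assumptions, and the constant negative shift is merely a stand-in for the effect of negative weight drift on pre-activations.

```python
import numpy as np


def relu2(x):
    """Squared ReLU: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2


def clipped_relu2(x, clip=6.0):
    """Squared ReLU with its output capped at `clip`.

    Assumption: the cap is applied after squaring and the value 6.0 is
    hypothetical; the paper's exact clipping rule may differ.
    """
    return np.minimum(relu2(x), clip)


def activation_sparsity(a):
    """Fraction of activation entries that are exactly zero."""
    return float(np.mean(a == 0.0))


# Synthetic pre-activations; the -1.0 shift is a toy proxy for drift.
rng = np.random.default_rng(0)
pre = rng.standard_normal((1024, 256))
shifted = pre - 1.0

print("sparsity without shift:", activation_sparsity(relu2(pre)))
print("sparsity with shift:   ", activation_sparsity(relu2(shifted)))
print("max unclipped output:  ", relu2(pre).max())
print("max clipped output:    ", clipped_relu2(pre).max())
```

In this toy setting, shifting the pre-activations toward negative values raises the fraction of exact zeros, and the cap bounds the largest outputs that squaring would otherwise inflate. Clipping leaves the zero pattern unchanged, so sparsity is the same for the clipped and unclipped variants.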