ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models
October 12, 2024
Authors: Nandan Kumar Jha, Brandon Reagen
cs.AI
Abstract
LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization. However, it introduces significant challenges in mechanistic interpretability, outlier feature suppression, faithful signal propagation, and the computational and communication complexity of private inference. This work explores desirable activation functions in normalization-free decoder-only LLMs. Contrary to the conventional preference for GELU in transformer-based models, our empirical findings demonstrate an opposite trend: ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement. We discover a key issue with GELU, where early layers experience entropic overload, leading to the under-utilization of the representational capacity of attention heads. This highlights that smoother activations like GELU are ill-suited for LayerNorm-free architectures, whereas ReLU's geometrical properties (specialization in input space and intra-class selectivity) lead to improved learning dynamics and better information retention in the absence of LayerNorm. This study offers key insights for optimizing transformer architectures where LayerNorm introduces significant challenges.