ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models
October 12, 2024
Authors: Nandan Kumar Jha, Brandon Reagen
cs.AI
Abstract
LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization. However, it introduces significant challenges in mechanistic interpretability, outlier feature suppression, faithful signal propagation, and the computational and communication complexity of private inference. This work explores desirable activation functions in normalization-free decoder-only LLMs. Contrary to the conventional preference for GELU in transformer-based models, our empirical findings demonstrate an opposite trend: ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement. We discover a key issue with GELU, where early layers experience entropic overload, leading to the under-utilization of the representational capacity of attention heads. This highlights that smoother activations like GELU are ill-suited for LayerNorm-free architectures, whereas ReLU's geometrical properties (specialization in input space and intra-class selectivity) lead to improved learning dynamics and better information retention in the absence of LayerNorm. This study offers key insights for optimizing transformer architectures where LayerNorm introduces significant challenges.