
ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models

October 12, 2024
Authors: Nandan Kumar Jha, Brandon Reagen
cs.AI

Abstract

LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization. However, it introduces significant challenges in mechanistic interpretability, outlier feature suppression, faithful signal propagation, and computational and communication complexity of private inference. This work explores desirable activation functions in normalization-free decoder-only LLMs. Contrary to the conventional preference for GELU in transformer-based models, our empirical findings demonstrate an opposite trend -- ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement. We discover a key issue with GELU, where early layers experience entropic overload, leading to the under-utilization of the representational capacity of attention heads. This highlights that smoother activations like GELU are ill-suited for LayerNorm-free architectures, whereas ReLU's geometrical properties -- specialization in input space and intra-class selectivity -- lead to improved learning dynamics and better information retention in the absence of LayerNorm. This study offers key insights for optimizing transformer architectures where LayerNorm introduces significant challenges.
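As a rough illustration of the "entropic overload" diagnostic described in the abstract, the sketch below computes per-head entropy of softmax attention maps: heads whose entropy sits near the uniform-attention upper bound log(key_len) spread their weight almost evenly and thus under-use their representational capacity. This is not the authors' code; the `attention_entropy` helper, the tensor shapes, and the normalization by the uniform bound are assumptions made purely for illustration.

```python
# Minimal sketch (not from the paper's codebase) of a per-head attention-entropy probe.
import torch

def attention_entropy(attn_probs: torch.Tensor) -> torch.Tensor:
    """attn_probs: (batch, heads, query_len, key_len); each row is a softmax distribution.
    Returns the mean Shannon entropy per head, shape (heads,)."""
    eps = 1e-9  # avoid log(0)
    ent = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # entropy per query position
    return ent.mean(dim=(0, 2))  # average over batch and query positions

# Example with a random attention map: 8 heads over a 128-token sequence.
probs = torch.softmax(torch.randn(1, 8, 128, 128), dim=-1)
per_head = attention_entropy(probs)
max_ent = torch.log(torch.tensor(128.0))  # entropy of a uniform distribution over 128 keys
print(per_head / max_ent)  # values near 1.0 indicate near-uniform ("overloaded") heads
```

In this reading, the paper's finding that GELU-based LayerNorm-free models suffer entropic overload in early layers would correspond to many early-layer heads reporting ratios close to 1.0 under such a probe, while ReLU-based models would show lower, more differentiated values.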
