Transformers without Normalization
March 13, 2025
Authors: Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
cs.AI
Abstract
Normalization layers are ubiquitous in modern neural networks and have long
been considered essential. This work demonstrates that Transformers without
normalization can achieve the same or better performance using a remarkably
simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation
DyT(x) = tanh(αx), as a drop-in replacement for normalization
layers in Transformers. DyT is inspired by the observation that layer
normalization in Transformers often produces tanh-like, S-shaped input-output
mappings. By incorporating DyT, Transformers without normalization can match or
exceed the performance of their normalized counterparts, mostly without
hyperparameter tuning. We validate the effectiveness of Transformers with DyT
across diverse settings, ranging from recognition to generation, supervised to
self-supervised learning, and computer vision to language models. These
findings challenge the conventional understanding that normalization layers are
indispensable in modern neural networks, and offer new insights into their role
in deep networks.
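To make the idea concrete, below is a minimal PyTorch-style sketch of how an element-wise DyT layer could be dropped in where a LayerNorm would normally sit. The abstract only specifies the functional form DyT(x) = tanh(αx); treating alpha as a learnable scalar, adding per-channel affine parameters gamma and beta (by analogy with LayerNorm's affine transform), and the initialization values and dimensions used in the usage example are all assumptions for illustration, not details taken from the abstract.

import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: an element-wise, normalization-free stand-in for LayerNorm.

    Computes gamma * tanh(alpha * x) + beta. The abstract only states
    tanh(alpha * x); the learnable scalar alpha and the per-channel affine
    parameters gamma/beta are assumptions made here by analogy with LayerNorm.
    """

    def __init__(self, num_features: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar (init value is illustrative)
        self.gamma = nn.Parameter(torch.ones(num_features))      # per-channel scale (assumed)
        self.beta = nn.Parameter(torch.zeros(num_features))      # per-channel shift (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Purely element-wise: no mean or variance statistics are computed,
        # which is what makes the layer normalization-free.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta


# Usage sketch: swapping a normalization layer for DyT in a Transformer block.
hidden_dim = 768                          # illustrative model width
layer = DyT(hidden_dim)                   # in place of nn.LayerNorm(hidden_dim)
tokens = torch.randn(2, 16, hidden_dim)   # (batch, sequence, channels)
out = layer(tokens)                       # same shape as the input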