Transformers without Normalization
March 13, 2025
Authors: Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
cs.AI
Abstract
Normalization layers are ubiquitous in modern neural networks and have long
been considered essential. This work demonstrates that Transformers without
normalization can achieve the same or better performance using a remarkably
simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation
DyT(x) = tanh(αx), as a drop-in replacement for normalization
layers in Transformers. DyT is inspired by the observation that layer
normalization in Transformers often produces tanh-like, S-shaped input-output
mappings. By incorporating DyT, Transformers without normalization can match or
exceed the performance of their normalized counterparts, mostly without
hyperparameter tuning. We validate the effectiveness of Transformers with DyT
across diverse settings, ranging from recognition to generation, supervised to
self-supervised learning, and computer vision to language models. These
findings challenge the conventional understanding that normalization layers are
indispensable in modern neural networks, and offer new insights into their role
in deep networks.
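To make the idea concrete, below is a minimal PyTorch-style sketch of how an element-wise DyT layer could be dropped in where a LayerNorm would normally sit. The abstract only specifies the functional form DyT(x) = tanh(αx); treating alpha as a learnable scalar, adding per-channel affine parameters gamma and beta (by analogy with LayerNorm's affine transform), and the initialization values and dimensions used in the usage example are all assumptions for illustration, not details taken from the abstract.

import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: an element-wise, normalization-free stand-in for LayerNorm.

    Computes gamma * tanh(alpha * x) + beta. The abstract only states
    tanh(alpha * x); the learnable scalar alpha and the per-channel affine
    parameters gamma/beta are assumptions made here by analogy with LayerNorm.
    """

    def __init__(self, num_features: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar (init value is illustrative)
        self.gamma = nn.Parameter(torch.ones(num_features))      # per-channel scale (assumed)
        self.beta = nn.Parameter(torch.zeros(num_features))      # per-channel shift (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Purely element-wise: no mean or variance statistics are computed,
        # which is what makes the layer normalization-free.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta


# Usage sketch: swapping a normalization layer for DyT in a Transformer block.
hidden_dim = 768                          # illustrative model width
layer = DyT(hidden_dim)                   # in place of nn.LayerNorm(hidden_dim)
tokens = torch.randn(2, 16, hidden_dim)   # (batch, sequence, channels)
out = layer(tokens)                       # same shape as the input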