Transformers without Normalization

Abstract

Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation DyT(x) = tanh(alpha x), as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, S-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
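
The formula DyT(x) = tanh(alpha x) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names `dyt` and `layer_norm` are hypothetical, alpha is treated as a fixed scalar, and its value of 0.5 is an assumption chosen only for illustration. A plain layer normalization over a single vector (without affine parameters) is included for comparison.

```python
import math


def dyt(xs, alpha=0.5):
    """Element-wise Dynamic Tanh: tanh(alpha * x) for each x in xs.

    alpha=0.5 is an assumed illustrative value, not taken from the text.
    """
    return [math.tanh(alpha * x) for x in xs]


def layer_norm(xs, eps=1e-5):
    """Layer normalization over one vector, without affine parameters."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


# Hypothetical activation values, including two large-magnitude entries.
activations = [-6.0, -1.0, 0.0, 1.0, 6.0]

print("DyT:       ", [round(v, 3) for v in dyt(activations)])
print("LayerNorm: ", [round(v, 3) for v in layer_norm(activations)])
```

In this sketch, `dyt` transforms each element independently and squashes large-magnitude inputs into the range (-1, 1), whereas `layer_norm` must first compute the mean and variance of the whole vector.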
