Transformers without Normalization

Abstract

Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation DyT(x) = tanh(alpha x), as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, S-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
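
The formula DyT(x) = tanh(alpha x) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `dyt` and `layer_norm`, the fixed value `alpha=0.5`, and the sample activations are all assumptions chosen for the example. Alpha is treated as a plain constant here, although it would presumably be a trained parameter in a real model. A parameter-free layer normalization is included only for contrast.

```python
import math


def dyt(xs, alpha=0.5):
    """Element-wise Dynamic Tanh: tanh(alpha * x) applied to each entry.

    alpha=0.5 is an arbitrary illustrative value (an assumption),
    not a setting taken from the paper.
    """
    return [math.tanh(alpha * x) for x in xs]


def layer_norm(xs, eps=1e-5):
    """Layer normalization without affine parameters, for comparison only."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


# Hypothetical activations for a single token.
activations = [-4.0, -1.0, 0.0, 1.0, 4.0]

dyt_out = dyt(activations)        # uses no statistics of the vector
ln_out = layer_norm(activations)  # uses the vector's mean and variance

print("DyT:      ", [round(y, 3) for y in dyt_out])
print("LayerNorm:", [round(y, 3) for y in ln_out])
```

Each DyT output depends only on its own input and is bounded by 1 in magnitude, whereas each layer-norm output depends on the mean and variance of the whole vector.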
