Replacing softmax with ReLU in Vision Transformers

Abstract

Previous research observed accuracy degradation when replacing the attention softmax with a point-wise activation such as ReLU. In the context of vision transformers, we find that this degradation is mitigated when dividing by sequence length. Our experiments training small to large vision transformers on ImageNet-21k indicate that ReLU-attention can approach or match the performance of softmax-attention in terms of scaling behavior as a function of compute.
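To make the idea of dividing by sequence length concrete, here is a minimal single-head sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name relu_attention, the 1/sqrt(d) score scaling, and the list-of-lists layout are all choices made for this example and are not taken from the abstract.

```python
import math


def relu_attention(q, k, v):
    """Single-head attention with softmax replaced by ReLU / seq_len.

    q, k, v are lists of vectors (lists of floats). q and k share a
    feature dimension d; k and v have the same number of rows.
    Hypothetical helper written for illustration only.
    """
    seq_len = len(k)
    d = len(q[0])
    out = []
    for qi in q:
        # Dot-product scores against every key. The 1/sqrt(d) scaling is
        # an assumption carried over from standard attention.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # Point-wise ReLU, then divide by sequence length in place of the
        # softmax normalisation.
        weights = [max(s, 0.0) / seq_len for s in scores]
        # Weighted sum of the value vectors.
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


if __name__ == "__main__":
    q = [[2.0, 0.0]]
    k = [[2.0, 0.0], [-2.0, 0.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(relu_attention(q, k, v))
```

Unlike softmax weights, the weights here need not sum to one, and a key with a negative score contributes exactly zero. Dividing by the sequence length keeps the magnitude of the weighted sum from growing with the number of tokens.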
