Replacing softmax with ReLU in Vision Transformers
September 15, 2023
Authors: Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith
cs.AI
Abstract
Previous research observed accuracy degradation when replacing the attention
softmax with a point-wise activation such as ReLU. In the context of vision
transformers, we find that this degradation is mitigated when dividing by
sequence length. Our experiments training small to large vision transformers on
ImageNet-21k indicate that ReLU-attention can approach or match the performance
of softmax-attention in terms of scaling behavior as a function of compute.
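As a rough illustration of the mechanism described in the abstract, the sketch below replaces the attention softmax with a ReLU scaled by the inverse sequence length. This is a minimal, hypothetical reading of the abstract in JAX, not the authors' released code; the function name, argument shapes, and the 1/sqrt(head_dim) logit scaling are assumptions for the sake of a self-contained example.

```python
# Minimal sketch (assumed, not the authors' implementation): ReLU-attention where
# softmax(logits) is replaced by relu(logits) / seq_len, per the abstract.
import jax
import jax.numpy as jnp


def relu_attention(q, k, v):
    """q, k, v: arrays of shape [batch, seq_len, heads, head_dim] (assumed layout)."""
    seq_len = k.shape[1]
    head_dim = q.shape[-1]
    # Scaled dot-product logits: [batch, heads, q_len, k_len].
    logits = jnp.einsum("bqhd,bkhd->bhqk", q, k) / jnp.sqrt(head_dim)
    # ReLU in place of softmax, divided by the sequence length.
    weights = jax.nn.relu(logits) / seq_len
    # Weighted sum of values: [batch, q_len, heads, head_dim].
    return jnp.einsum("bhqk,bkhd->bqhd", weights, v)
```

Unlike softmax, this keeps the attention weights non-negative but unnormalized; the division by sequence length is the ingredient the abstract identifies as mitigating the accuracy drop.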