Replacing softmax with ReLU in Vision Transformers
September 15, 2023
Authors: Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith
cs.AI
Abstract
Previous research observed accuracy degradation when replacing the attention
softmax with a point-wise activation such as ReLU. In the context of vision
transformers, we find that this degradation is mitigated when dividing by
sequence length. Our experiments training small to large vision transformers on
ImageNet-21k indicate that ReLU-attention can approach or match the performance
of softmax-attention in terms of scaling behavior as a function of compute.
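As a rough illustration of the idea in the abstract, the sketch below contrasts softmax-attention with a ReLU variant whose weights are divided by the sequence length. It assumes standard scaled dot-product scores and a single head with no learned projections. The abstract does not give the exact formulation, so the function names and the placement of the 1/L factor are assumptions made for illustration. Plain Python lists are used so that the example runs without third-party libraries.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scaled_scores(q, k):
    """Return q @ k^T / sqrt(d), where d is the feature dimension."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    return [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]


def softmax_attention(q, k, v):
    """Baseline: each row of attention weights is a softmax over the keys."""
    weights = []
    for row in scaled_scores(q, k):
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v)


def relu_attention(q, k, v):
    """Assumed variant: ReLU on the scores, then division by sequence length."""
    seq_len = len(k)
    weights = [[max(s, 0.0) / seq_len for s in row] for row in scaled_scores(q, k)]
    return matmul(weights, v)


# Toy inputs: sequence length 2, feature dimension 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

softmax_out = softmax_attention(q, k, v)
relu_out = relu_attention(q, k, v)
```

Unlike softmax weights, the ReLU weights in this sketch are not normalized to sum to one across the keys. The division by sequence length only keeps their overall magnitude from growing with the number of tokens.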