Replacing softmax with ReLU in Vision Transformers

Abstract

Previous research observed accuracy degradation when replacing the attention softmax with a point-wise activation such as ReLU. In the context of vision transformers, we find that this degradation is mitigated when dividing by sequence length. Our experiments training small to large vision transformers on ImageNet-21k indicate that ReLU-attention can approach or match the performance of softmax-attention in terms of scaling behavior as a function of compute.
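To make the idea concrete, the following is a minimal single-head sketch in NumPy that contrasts softmax-attention with a ReLU variant whose weights are divided by the sequence length. The abstract only states that dividing by sequence length mitigates the degradation; the function names, the 1/sqrt(d) score scaling, the exact placement of the division, and the random test inputs are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard scaled dot-product attention for one head."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def relu_attention(q, k, v):
    """ReLU attention: point-wise ReLU on the scores, divided by sequence length.

    Assumption: the division is applied to the ReLU output, and the
    scores keep the usual 1/sqrt(d) scaling.
    """
    seq_len, d = k.shape[0], q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    weights = np.maximum(scores, 0.0) / seq_len
    return weights @ v

# Hypothetical toy inputs: 16 tokens, head dimension 8.
rng = np.random.default_rng(0)
L, d = 16, 8
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))

out_softmax = softmax_attention(q, k, v)
out_relu = relu_attention(q, k, v)
```

One property of this sketch is that, unlike softmax weights, the ReLU weights in a row do not sum to one. Dividing by the sequence length keeps the output magnitude from growing with the number of tokens: in this sketch, repeating every key and value twice leaves the output of `relu_attention` unchanged.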
