Replacing softmax with ReLU in Vision Transformers
September 15, 2023
Authors: Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith
cs.AI
Abstract
Previous research observed accuracy degradation when replacing the attention
softmax with a point-wise activation such as ReLU. In the context of vision
transformers, we find that this degradation is mitigated when dividing by
sequence length. Our experiments training small to large vision transformers on
ImageNet-21k indicate that ReLU-attention can approach or match the performance
of softmax-attention in terms of scaling behavior as a function of compute.
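As a rough illustration of the mechanism described in the abstract, the sketch below replaces the attention softmax with a ReLU scaled by the inverse sequence length. This is a minimal, hypothetical reading of the abstract in JAX, not the authors' released code; the function name, argument shapes, and the 1/sqrt(head_dim) logit scaling are assumptions for the sake of a self-contained example.

```python
# Minimal sketch (assumed, not the authors' implementation): ReLU-attention where
# softmax(logits) is replaced by relu(logits) / seq_len, per the abstract.
import jax
import jax.numpy as jnp


def relu_attention(q, k, v):
    """q, k, v: arrays of shape [batch, seq_len, heads, head_dim] (assumed layout)."""
    seq_len = k.shape[1]
    head_dim = q.shape[-1]
    # Scaled dot-product logits: [batch, heads, q_len, k_len].
    logits = jnp.einsum("bqhd,bkhd->bhqk", q, k) / jnp.sqrt(head_dim)
    # ReLU in place of softmax, divided by the sequence length.
    weights = jax.nn.relu(logits) / seq_len
    # Weighted sum of values: [batch, q_len, heads, head_dim].
    return jnp.einsum("bhqk,bkhd->bqhd", weights, v)
```

Unlike softmax, this keeps the attention weights non-negative but unnormalized; the division by sequence length is the ingredient the abstract identifies as mitigating the accuracy drop.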