Replacing softmax with ReLU in Vision Transformers

Abstract

Previous research observed accuracy degradation when replacing the attention softmax with a point-wise activation such as ReLU. In the context of vision transformers, we find that this degradation is mitigated when dividing by sequence length. Our experiments training small to large vision transformers on ImageNet-21k indicate that ReLU-attention can approach or match the performance of softmax-attention in terms of scaling behavior as a function of compute.
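
To make the idea in the abstract concrete, the following is a minimal single-head sketch in Python with NumPy. It is not the authors' implementation. The abstract only states that softmax is replaced by ReLU and that the result is divided by the sequence length. The function name `relu_attention`, the 1/sqrt(d) logit scaling, the single-head layout, and the toy input shapes are assumptions made for illustration.

```python
import numpy as np


def relu_attention(q, k, v):
    """Single-head attention with softmax swapped for ReLU / sequence length.

    q, k, v: arrays of shape (seq_len, d). The 1/sqrt(d) scaling of the
    logits is an assumption carried over from standard dot-product attention.
    """
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # (seq_len, seq_len) logits
    weights = np.maximum(scores, 0.0) / seq_len   # point-wise ReLU, then divide by L
    return weights @ v                            # (seq_len, d) output


def softmax_attention(q, k, v):
    """Standard softmax attention, shown only for comparison."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Toy inputs (hypothetical sizes): 16 tokens, 8 features per token.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out = relu_attention(q, k, v)
```

Unlike softmax, the ReLU weights in each row do not sum to one. Dividing by the sequence length keeps the magnitude of the weighted sum from growing with the number of tokens, which is the role the abstract attributes to that factor.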
