Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window

Abstract

Transformer models have shown great potential in computer vision, following their success in language tasks. Swin Transformer is one such model: it outperforms convolution-based architectures in accuracy while being more efficient than Vision Transformer (ViT) and its variants, which have quadratic complexity with respect to the input size. Swin Transformer features shifting windows, which allow cross-window connections while limiting self-attention computation to non-overlapping local windows. However, shifting windows introduces memory copy operations, which account for a significant portion of its runtime. To mitigate this issue, we propose Swin-Free, in which we apply size-varying windows across stages, instead of shifting windows, to achieve cross-connection among local windows. With this simple design change, Swin-Free runs faster than Swin Transformer at inference and achieves better accuracy. Furthermore, we propose a few Swin-Free variants that are faster than their Swin Transformer counterparts.
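
As a rough illustration of the idea described above, the following sketch partitions a feature map into non-overlapping windows without any cyclic shift and checks whether two neighbouring tokens fall into the same window as the window size changes from one stage to the next. It is a simplified sketch under stated assumptions, not the authors' implementation: the 8x8 feature map, the window sizes 2 and 4, and all function names are hypothetical, and the resolution change between stages is ignored.

```python
def window_index(row, col, width, window):
    """Return the id of the non-overlapping window containing token (row, col)."""
    windows_per_row = width // window
    return (row // window) * windows_per_row + (col // window)


def partition(height, width, window):
    """Group token coordinates by window id; no cyclic shift is applied."""
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    groups = {}
    for row in range(height):
        for col in range(width):
            key = window_index(row, col, width, window)
            groups.setdefault(key, []).append((row, col))
    return groups


def share_window(a, b, width, window):
    """True if tokens a and b lie in the same window, i.e. can attend to each other."""
    return window_index(*a, width, window) == window_index(*b, width, window)


# Hypothetical sizes chosen for illustration only (not taken from the paper).
HEIGHT = WIDTH = 8
STAGE_WINDOWS = [2, 4]

# Two adjacent tokens that sit on opposite sides of a window border when window == 2.
token_a, token_b = (0, 1), (0, 2)

for stage, window in enumerate(STAGE_WINDOWS, start=1):
    n_windows = len(partition(HEIGHT, WIDTH, window))
    connected = share_window(token_a, token_b, WIDTH, window)
    print(f"stage {stage}: window={window}, windows={n_windows}, connected={connected}")
```

In this toy setting the two tokens are separated at the smaller window size and grouped together at the larger one, so information can cross the earlier window border without moving any data. A shifted-window design would instead roll the feature map before partitioning, which is the memory copy the abstract refers to.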