Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window
June 23, 2023
Authors: Jinkyu Koo, John Yang, Le An, Gwenaelle Cunha Sergio, Su Inn Park
cs.AI
Abstract
Transformer models have shown great potential in computer vision, following their success in language tasks. Swin Transformer is one such model; it outperforms convolution-based architectures in terms of accuracy while improving efficiency over Vision Transformer (ViT) and its variants, which have quadratic complexity with respect to the input size. Swin Transformer features shifting windows, which allow cross-window connection while limiting self-attention computation to non-overlapping local windows. However, shifting windows introduces memory copy operations that account for a significant portion of its runtime. To mitigate this issue, we propose Swin-Free, in which we apply size-varying windows across stages, instead of shifting windows, to achieve cross-connection among local windows. With this simple design change, Swin-Free runs faster than Swin Transformer at inference while achieving better accuracy. Furthermore, we also propose a few Swin-Free variants that are faster than their Swin Transformer counterparts.
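To make the contrast concrete, below is a minimal sketch in PyTorch, not the authors' implementation, of why shifted windows cost extra memory traffic and how size-varying windows avoid it. The function names, tensor shapes, and window sizes (7 and 14) are illustrative assumptions, not values taken from the paper.

import torch

def window_partition(x, window_size):
    # Split a (B, H, W, C) feature map into non-overlapping windows of
    # shape (num_windows * B, window_size, window_size, C).
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def swin_shifted_partition(x, window_size):
    # Swin-style second block: cyclically shift the feature map before
    # partitioning. torch.roll materializes a shifted copy of the tensor,
    # which is the memory-copy overhead the abstract refers to.
    shift = window_size // 2
    x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    return window_partition(x, window_size)

def swin_free_partition(x, window_size):
    # Swin-Free-style block (as described in the abstract): no shifting;
    # cross-window mixing comes from using a different window size in
    # another block or stage, so no extra copy is needed here.
    return window_partition(x, window_size)

# Hypothetical stage-1 feature map; window sizes below are illustrative.
x = torch.randn(1, 56, 56, 96)
wins_small = swin_free_partition(x, 7)    # 7x7 windows in one block
wins_large = swin_free_partition(x, 14)   # 14x14 windows in a later block

Tokens that fall into separate 7x7 windows in one block share a 14x14 window in the next, which is how size-varying windows can provide cross-window connection without the roll-and-unroll copies required by shifted windows.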