Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows
July 24, 2025
Authors: Simin Huo, Ning Li
cs.AI
Abstract
We introduce Iwin Transformer, a novel position-embedding-free hierarchical
vision transformer, which can be fine-tuned directly from low to high
resolution, through the collaboration of innovative interleaved window
attention and depthwise separable convolution. This approach uses attention to
connect distant tokens and applies convolution to link neighboring tokens,
enabling global information exchange within a single module, overcoming Swin
Transformer's limitation of requiring two consecutive blocks to approximate
global attention. Extensive experiments on visual benchmarks demonstrate that
Iwin Transformer exhibits strong competitiveness in tasks such as image
classification (87.4% top-1 accuracy on ImageNet-1K), semantic segmentation, and
video action recognition. We also validate the effectiveness of the core
component in Iwin as a standalone module that can seamlessly replace the
self-attention module in class-conditional image generation. The concepts and
methods introduced by the Iwin Transformer have the potential to inspire future
research, such as Iwin 3D Attention for video generation. The code and models are
available at https://github.com/cominder/Iwin-Transformer.
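To make the interleaved-window idea concrete, the following is a minimal NumPy sketch of how such a partition can group distant tokens into the same attention window: tokens sampled at a fixed stride across the whole feature map land in one window, so attention within that window already spans the full image (the function name, shapes, and stride convention are illustrative assumptions, not the authors' implementation).

```python
import numpy as np

def interleave_partition(x, stride):
    """Group tokens spaced `stride` apart into the same window.

    x:      (H, W, C) feature map
    stride: (sh, sw)  sampling stride; yields sh*sw windows,
            each holding (H//sh)*(W//sw) tokens that together
            cover the entire map (global receptive field).
    """
    H, W, C = x.shape
    sh, sw = stride
    wh, ww = H // sh, W // sw
    # Split each axis into (window-position, offset-within-stride),
    # then gather equal-offset tokens into one window.
    x = x.reshape(wh, sh, ww, sw, C)
    x = x.transpose(1, 3, 0, 2, 4)          # (sh, sw, wh, ww, C)
    return x.reshape(sh * sw, wh * ww, C)   # (windows, tokens, C)

# On a 4x4 map with stride (2, 2), window 0 holds the tokens at
# positions (0,0), (0,2), (2,0), (2,2) -- spread over the whole map.
x = np.arange(16, dtype=float).reshape(4, 4, 1)
windows = interleave_partition(x, (2, 2))
```

In this sketch, attention applied inside each window connects distant tokens, while a depthwise convolution on the original layout would link the neighboring tokens the partition separates, matching the division of labor the abstract describes.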