Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows
July 24, 2025
Authors: Simin Huo, Ning Li
cs.AI
Abstract
We introduce Iwin Transformer, a novel position-embedding-free hierarchical
vision transformer, which can be fine-tuned directly from low to high
resolution, through the collaboration of innovative interleaved window
attention and depthwise separable convolution. This approach uses attention to
connect distant tokens and applies convolution to link neighboring tokens,
enabling global information exchange within a single module, overcoming Swin
Transformer's limitation of requiring two consecutive blocks to approximate
global attention. Extensive experiments on visual benchmarks demonstrate that
Iwin Transformer exhibits strong competitiveness in tasks such as image
classification (87.4% top-1 accuracy on ImageNet-1K), semantic segmentation, and
video action recognition. We also validate the effectiveness of the core
component in Iwin as a standalone module that can seamlessly replace the
self-attention module in class-conditional image generation. The concepts and
methods introduced by the Iwin Transformer have the potential to inspire future
research, such as Iwin 3D Attention for video generation. The code and models are
available at https://github.com/cominder/Iwin-Transformer.
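To make the interleaved-window idea concrete, the following is a minimal NumPy sketch of how such a partition can group distant tokens into the same attention window: tokens sampled at a fixed stride across the whole feature map land in one window, so attention within that window already spans the full image (the function name, shapes, and stride convention are illustrative assumptions, not the authors' implementation).

```python
import numpy as np

def interleave_partition(x, stride):
    """Group tokens spaced `stride` apart into the same window.

    x:      (H, W, C) feature map
    stride: (sh, sw)  sampling stride; yields sh*sw windows,
            each holding (H//sh)*(W//sw) tokens that together
            cover the entire map (global receptive field).
    """
    H, W, C = x.shape
    sh, sw = stride
    wh, ww = H // sh, W // sw
    # Split each axis into (window-position, offset-within-stride),
    # then gather equal-offset tokens into one window.
    x = x.reshape(wh, sh, ww, sw, C)
    x = x.transpose(1, 3, 0, 2, 4)          # (sh, sw, wh, ww, C)
    return x.reshape(sh * sw, wh * ww, C)   # (windows, tokens, C)

# On a 4x4 map with stride (2, 2), window 0 holds the tokens at
# positions (0,0), (0,2), (2,0), (2,2) -- spread over the whole map.
x = np.arange(16, dtype=float).reshape(4, 4, 1)
windows = interleave_partition(x, (2, 2))
```

In this sketch, attention applied inside each window connects distant tokens, while a depthwise convolution on the original layout would link the neighboring tokens the partition separates, matching the division of labor the abstract describes.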