Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows
July 24, 2025
Authors: Simin Huo, Ning Li
cs.AI
Abstract
We introduce Iwin Transformer, a novel position-embedding-free hierarchical
vision transformer that can be fine-tuned directly from low to high
resolution through the collaboration of innovative interleaved window
attention and depthwise separable convolution. This approach uses attention to
connect distant tokens and applies convolution to link neighboring tokens,
enabling global information exchange within a single module, overcoming Swin
Transformer's limitation of requiring two consecutive blocks to approximate
global attention. Extensive experiments on visual benchmarks demonstrate that
Iwin Transformer exhibits strong competitiveness in tasks such as image
classification (87.4% top-1 accuracy on ImageNet-1K), semantic segmentation, and
video action recognition. We also validate the effectiveness of the core
component in Iwin as a standalone module that can seamlessly replace the
self-attention module in class-conditional image generation. The concepts and
methods introduced by the Iwin Transformer, such as Iwin 3D Attention in video
generation, have the potential to inspire future research. The code and models are
available at https://github.com/cominder/Iwin-Transformer.
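The abstract does not spell out the exact partition scheme, but the stated idea is that attention operates over windows of *distant* tokens while convolution handles neighbors. A minimal sketch of one plausible reading, assuming each interleaved window collects tokens sampled at a fixed stride across the grid (so a single window spans the whole image), is shown below; the function names and the stride-based grouping are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def interleaved_partition(x, stride):
    """Group tokens of an (H, W, C) grid into stride*stride windows.

    Window (a, b) contains tokens at positions (q*stride + a, r*stride + b)
    for all q, r, so every window samples the full image -- attention
    within a window then connects distant tokens, while a depthwise
    convolution (not shown) would link spatial neighbors.
    """
    H, W, C = x.shape
    x = x.reshape(H // stride, stride, W // stride, stride, C)
    x = x.transpose(1, 3, 0, 2, 4)  # -> (a, b, q, r, C)
    return x.reshape(stride * stride, (H // stride) * (W // stride), C)

def interleaved_reverse(windows, stride, H, W):
    """Invert interleaved_partition back to the (H, W, C) grid."""
    C = windows.shape[-1]
    x = windows.reshape(stride, stride, H // stride, W // stride, C)
    x = x.transpose(2, 0, 3, 1, 4)  # -> (q, a, r, b, C)
    return x.reshape(H, W, C)
```

On a 4x4 grid with stride 2, window (0, 0) gathers the tokens at (0,0), (0,2), (2,0), and (2,2), i.e. corners of the grid rather than a contiguous patch, which is what lets a single attention step exchange information globally.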