Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows
July 24, 2025
Authors: Simin Huo, Ning Li
cs.AI
Abstract
We introduce Iwin Transformer, a novel position-embedding-free hierarchical
vision transformer that can be fine-tuned directly from low to high
resolution through the collaboration of innovative interleaved window
attention and depthwise separable convolution. This approach uses attention to
connect distant tokens and applies convolution to link neighboring tokens,
enabling global information exchange within a single module, overcoming Swin
Transformer's limitation of requiring two consecutive blocks to approximate
global attention. Extensive experiments on visual benchmarks demonstrate that
Iwin Transformer exhibits strong competitiveness in tasks such as image
classification (87.4% top-1 accuracy on ImageNet-1K), semantic segmentation, and
video action recognition. We also validate the effectiveness of the core
component in Iwin as a standalone module that can seamlessly replace the
self-attention module in class-conditional image generation. The concepts and
methods introduced by the Iwin Transformer, such as Iwin 3D Attention in video
generation, have the potential to inspire future research. The code and models are
available at https://github.com/cominder/Iwin-Transformer.
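The abstract does not spell out the exact partition scheme, but the stated idea is that attention operates over windows of *distant* tokens while convolution handles neighbors. A minimal sketch of one plausible reading, assuming each interleaved window collects tokens sampled at a fixed stride across the grid (so a single window spans the whole image), is shown below; the function names and the stride-based grouping are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def interleaved_partition(x, stride):
    """Group tokens of an (H, W, C) grid into stride*stride windows.

    Window (a, b) contains tokens at positions (q*stride + a, r*stride + b)
    for all q, r, so every window samples the full image -- attention
    within a window then connects distant tokens, while a depthwise
    convolution (not shown) would link spatial neighbors.
    """
    H, W, C = x.shape
    x = x.reshape(H // stride, stride, W // stride, stride, C)
    x = x.transpose(1, 3, 0, 2, 4)  # -> (a, b, q, r, C)
    return x.reshape(stride * stride, (H // stride) * (W // stride), C)

def interleaved_reverse(windows, stride, H, W):
    """Invert interleaved_partition back to the (H, W, C) grid."""
    C = windows.shape[-1]
    x = windows.reshape(stride, stride, H // stride, W // stride, C)
    x = x.transpose(2, 0, 3, 1, 4)  # -> (q, a, r, b, C)
    return x.reshape(H, W, C)
```

On a 4x4 grid with stride 2, window (0, 0) gathers the tokens at (0,0), (0,2), (2,0), and (2,2), i.e. corners of the grid rather than a contiguous patch, which is what lets a single attention step exchange information globally.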