Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows

Abstract

We introduce Iwin Transformer, a novel hierarchical vision transformer that needs no position embeddings and can be fine-tuned directly from low to high resolution. It achieves this by combining interleaved window attention with depthwise separable convolution: attention connects distant tokens, while convolution links neighboring tokens, so global information exchange happens within a single module. This overcomes a limitation of Swin Transformer, which requires two consecutive blocks to approximate global attention. Experiments on visual benchmarks show that Iwin Transformer is strongly competitive in image classification (87.4% top-1 accuracy on ImageNet-1K), semantic segmentation, and video action recognition. We also validate the core component of Iwin as a standalone module that can seamlessly replace the self-attention module in class-conditional image generation. The concepts and methods introduced by Iwin Transformer have the potential to inspire future research, such as Iwin 3D Attention for video generation. The code and models are available at https://github.com/cominder/Iwin-Transformer.
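The abstract does not spell out how tokens are assigned to interleaved windows, so the following is only a minimal sketch under one plausible reading: each window gathers tokens spaced at a fixed stride across the grid, instead of a contiguous block. The function names `regular_windows` and `interleaved_windows`, and the choice of stride as grid size divided by window size, are assumptions made for illustration and are not taken from the released code.

```python
def regular_windows(height, width, window):
    """Group grid positions into contiguous window x window blocks."""
    groups = {}
    for i in range(height):
        for j in range(width):
            # Neighboring tokens share a window.
            groups.setdefault((i // window, j // window), []).append((i, j))
    return groups


def interleaved_windows(height, width, window):
    """Group grid positions so each window holds tokens spaced at a fixed stride.

    Assumption (not from the source): stride = grid size / window size per axis.
    """
    if height % window or width % window:
        raise ValueError("grid size must be divisible by window size")
    stride_h, stride_w = height // window, width // window
    groups = {}
    for i in range(height):
        for j in range(width):
            # Tokens with the same residue modulo the stride share a window,
            # so members of one window are spread across the whole grid.
            groups.setdefault((i % stride_h, j % stride_w), []).append((i, j))
    return groups


if __name__ == "__main__":
    # On a 4 x 4 grid with 2 x 2 windows:
    print(regular_windows(4, 4, 2)[(0, 0)])      # one contiguous corner block
    print(interleaved_windows(4, 4, 2)[(0, 0)])  # tokens two positions apart
```

Under this reading, attention inside an interleaved window relates tokens that are far apart, while immediate neighbors always fall into different windows. That gap is what the abstract assigns to the depthwise separable convolution.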
