DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

Abstract

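Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs can share information globally. As ViT architectures have developed, they have become increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention makes ViTs computationally expensive, and because they lack the inductive biases of locality and translation equivariance, they need larger models than CNNs to learn visual features effectively. In this paper, we propose DualToken-ViT, a lightweight and efficient vision transformer that leverages the advantages of both CNNs and ViTs. DualToken-ViT fuses tokens carrying local information, obtained by a convolution-based structure, with tokens carrying global information, obtained by a self-attention-based structure, to achieve an efficient attention structure. In addition, position-aware global tokens are used throughout all stages to enrich the global information, which further strengthens the effect of DualToken-ViT. These position-aware global tokens also contain the position information of the image, which makes the model better suited to vision tasks. We demonstrate the effectiveness of DualToken-ViT through extensive experiments on image classification, object detection, and semantic segmentation. On the ImageNet-1K dataset, our models of different scales achieve accuracies of 75.4% and 79.4% with 0.5G and 1.0G FLOPs, respectively, and our 1.0G-FLOPs model outperforms LightViT-T, which also uses global tokens, by 0.7%.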

English

Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing. With the development of various ViT structures, ViTs are increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention renders ViTs computationally intensive, and their lack of the inductive biases of locality and translation equivariance demands larger model sizes than CNNs to learn visual features effectively. In this paper, we propose a light-weight and efficient vision transformer model called DualToken-ViT that leverages the advantages of CNNs and ViTs. DualToken-ViT effectively fuses the token with local information obtained by a convolution-based structure and the token with global information obtained by a self-attention-based structure to achieve an efficient attention structure. In addition, we use position-aware global tokens throughout all stages to enrich the global information, which further strengthens the effect of DualToken-ViT. Position-aware global tokens also contain the position information of the image, which makes our model better suited to vision tasks. We conducted extensive experiments on image classification, object detection, and semantic segmentation tasks to demonstrate the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T, which also uses global tokens, by 0.7%.
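
The quadratic cost of self-attention mentioned above can be made concrete with a rough operation count. The sketch below is illustrative only and is not taken from the paper: the channel width, the patch size, and the idea of pooling keys and values to a fixed 7x7 grid are all assumptions chosen to show how the cost scales, and only the two attention matrix products are counted.

```python
def attention_matmul_macs(num_queries: int, num_keys: int, dim: int) -> int:
    """Multiply-accumulates for the two matrix products of one attention head.

    Q @ K^T costs num_queries * num_keys * dim, and the attention-weighted
    sum over V costs the same, so the total is twice that product.
    Projections, softmax, and the MLP are deliberately ignored.
    """
    return 2 * num_queries * num_keys * dim


def tokens(height: int, width: int, patch: int) -> int:
    """Number of tokens when an image is split into patch x patch cells."""
    return (height // patch) * (width // patch)


if __name__ == "__main__":
    dim = 64  # hypothetical channel width, not a value from the paper
    for side in (224, 448):
        n = tokens(side, side, patch=16)
        # Full self-attention: every token attends to every other token.
        full = attention_matmul_macs(n, n, dim)
        # Hypothetical cheaper variant: keys/values pooled to a fixed 7x7 grid.
        pooled = attention_matmul_macs(n, 49, dim)
        print(f"{side}px: tokens={n}, full={full:,}, pooled_kv={pooled:,}")
```

Under these assumptions, doubling the image side quadruples the token count, so the full-attention count grows 16-fold while the pooled-key variant grows only 4-fold. This is the kind of gap that motivates efficient attention structures in lightweight ViTs.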
