
DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

September 21, 2023
Authors: Zhenzhen Chu, Jiayu Chen, Cen Chen, Chengyu Wang, Ziheng Wu, Jun Huang, Weining Qian
cs.AI

Abstract

Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing. With the development of various ViT structures, ViTs have become increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention renders ViTs computationally intensive, and their lack of the inductive biases of locality and translation equivariance demands larger model sizes than CNNs to learn visual features effectively. In this paper, we propose a lightweight and efficient vision transformer model called DualToken-ViT that leverages the advantages of both CNNs and ViTs. DualToken-ViT effectively fuses tokens carrying local information, obtained by a convolution-based structure, with tokens carrying global information, obtained by a self-attention-based structure, to achieve an efficient attention structure. In addition, we use position-aware global tokens throughout all stages to enrich the global information, which further strengthens the effect of DualToken-ViT. The position-aware global tokens also carry positional information about the image, which makes our model better suited for vision tasks. We conducted extensive experiments on image classification, object detection, and semantic segmentation tasks to demonstrate the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T, which also uses global tokens, by 0.7%.
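To make the dual-token idea concrete, the sketch below shows one plausible way to fuse a convolution-based local branch with a self-attention-based global branch that attends to a small set of learnable, position-aware global tokens. This is a minimal conceptual illustration, not the authors' implementation: the class name, module layout, token count, and fusion via concatenation plus a linear projection are all assumptions made for clarity.

```python
# Minimal conceptual sketch of dual-token fusion (NOT the paper's implementation).
# Assumptions: depthwise conv for the local branch, learnable global tokens queried
# by image tokens via multi-head attention, and channel-wise concatenation + linear
# projection for fusion. All names are illustrative.
import torch
import torch.nn as nn


class DualTokenBlockSketch(nn.Module):
    def __init__(self, dim: int, num_global_tokens: int = 7, num_heads: int = 4):
        super().__init__()
        # Local branch: depthwise + pointwise convolution over the feature map.
        self.local_branch = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, kernel_size=1),
        )
        # Learnable global tokens carried across the block; keeping such tokens
        # through all stages is one way to preserve coarse global/positional context.
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map.
        b, c, h, w = x.shape
        local = self.local_branch(x)                       # (B, C, H, W) local tokens
        tokens = x.flatten(2).transpose(1, 2)              # (B, H*W, C) image tokens
        g = self.global_tokens.expand(b, -1, -1)           # (B, G, C) global tokens
        # Image tokens query the global tokens to pull in global information.
        global_out, _ = self.attn(tokens, g, g)            # (B, H*W, C)
        global_out = global_out.transpose(1, 2).reshape(b, c, h, w)
        # Fuse the local and global streams channel-wise, then project back to dim.
        fused = torch.cat([local, global_out], dim=1)      # (B, 2C, H, W)
        fused = fused.flatten(2).transpose(1, 2)           # (B, H*W, 2C)
        return self.fuse(fused).transpose(1, 2).reshape(b, c, h, w)


# Usage example on a toy feature map.
if __name__ == "__main__":
    block = DualTokenBlockSketch(dim=64)
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])
```

The design choice illustrated here is the key trade-off described in the abstract: global context comes from attending to a handful of persistent global tokens rather than full pairwise self-attention over all image tokens, while fine-grained local detail is supplied cheaply by convolutions.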