DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion
September 21, 2023
Authors: Zhenzhen Chu, Jiayu Chen, Cen Chen, Chengyu Wang, Ziheng Wu, Jun Huang, Weining Qian
cs.AI
Abstract
Self-attention-based vision transformers (ViTs) have emerged as a highly
competitive architecture in computer vision. Unlike convolutional neural
networks (CNNs), ViTs are capable of global information sharing. With the
development of various structures of ViTs, ViTs are increasingly advantageous
for many vision tasks. However, the quadratic complexity of self-attention
renders ViTs computationally intensive, and their lack of inductive biases of
locality and translation equivariance demands larger model sizes compared to
CNNs to effectively learn visual features. In this paper, we propose a
lightweight and efficient vision transformer model called DualToken-ViT that
leverages the advantages of both CNNs and ViTs. DualToken-ViT effectively fuses
tokens carrying local information, obtained by a convolution-based structure,
with tokens carrying global information, obtained by a self-attention-based
structure, to achieve an efficient attention structure. In addition, we use
position-aware global tokens throughout all stages to enrich the global
information, which further strengthens the effect of DualToken-ViT. The
position-aware global tokens also contain the position information of the
image, which makes our model better suited for vision tasks. We conducted
extensive experiments on image
classification, object detection and semantic segmentation tasks to demonstrate
the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of
different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G
FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T,
which also uses global tokens, by 0.7%.
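
Below is a minimal PyTorch sketch of the dual-token idea described in the abstract: a convolution branch extracts local information, a small set of learnable position-aware global tokens aggregates global information via attention, and the two are fused into the output feature map. The class name `DualTokenBlockSketch`, the token count, the fusion-by-addition choice, and all layer sizes are illustrative assumptions, not the paper's actual implementation.

```python
# A hedged sketch of a dual-token block: local branch (depth-wise convolution)
# plus global branch (attention against learnable position-aware global tokens).
import torch
import torch.nn as nn


class DualTokenBlockSketch(nn.Module):
    def __init__(self, dim: int = 64, num_global_tokens: int = 8, num_heads: int = 4):
        super().__init__()
        # Local branch: depth-wise convolution over the feature map.
        self.local_conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        # Learnable global tokens; in the paper's design they are carried across
        # all stages and are position-aware, which is only hinted at here.
        self.global_tokens = nn.Parameter(torch.randn(1, num_global_tokens, dim) * 0.02)
        # Attention used (a) to update global tokens from image tokens and
        # (b) to broadcast global information back to the image tokens.
        self.aggregate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map.
        b, c, h, w = x.shape
        local = self.local_conv(x)                       # local information
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)

        # Global tokens attend to image tokens to collect global information.
        g = self.global_tokens.expand(b, -1, -1)
        g, _ = self.aggregate(g, tokens, tokens)

        # Image tokens attend to the updated global tokens.
        glob, _ = self.broadcast(tokens, g, g)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)  # back to (B, C, H, W)

        # Fuse local and global information (simple addition in this sketch).
        return local + glob + x                          # residual connection


if __name__ == "__main__":
    block = DualTokenBlockSketch(dim=64)
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])
```

The additive fusion is only one plausible choice; the paper's efficient attention structure may combine the two token streams differently, so treat this as a conceptual illustration of fusing convolution-derived local tokens with attention-derived global tokens.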