DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

Abstract

Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs can share information globally. As ViT architectures have diversified, their advantages on many vision tasks have become increasingly pronounced. However, the quadratic complexity of self-attention makes ViTs computationally intensive, and because they lack the inductive biases of locality and translation equivariance, they need larger model sizes than CNNs to learn visual features effectively. In this paper, we propose DualToken-ViT, a lightweight and efficient vision transformer model that leverages the advantages of both CNNs and ViTs. DualToken-ViT fuses tokens carrying local information, obtained from a convolution-based structure, with tokens carrying global information, obtained from a self-attention-based structure, to achieve an efficient attention structure. In addition, we use position-aware global tokens throughout all stages to enrich the global information, which further strengthens the effect of DualToken-ViT. Because the position-aware global tokens also contain the position information of the image, they make our model better suited to vision tasks. We conducted extensive experiments on image classification, object detection, and semantic segmentation to demonstrate the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G FLOPs, respectively, and our 1.0G FLOPs model outperforms LightViT-T, which also uses global tokens, by 0.7%.
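To make the quadratic-complexity point concrete, the following minimal sketch counts the query-key pairs that an attention layer must score. It is an illustration only, not the DualToken-ViT implementation: the function names, the 56x56 feature-map size, and the 7x7 grid of global tokens are all assumptions chosen for the arithmetic, and the abstract does not state how many global tokens the model uses.

```python
# Illustrative arithmetic only: all names and sizes here are hypothetical
# and are not taken from the DualToken-ViT paper.

def full_self_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every image token attends to every other token."""
    num_tokens = height * width
    return num_tokens * num_tokens  # grows quadratically with token count


def global_token_attention_pairs(height: int, width: int, num_global: int) -> int:
    """Query-key pairs when each image token attends only to a fixed set
    of global tokens (num_global is an assumed hyperparameter)."""
    num_tokens = height * width
    return num_tokens * num_global  # grows linearly with token count


def reduction_factor(height: int, width: int, num_global: int) -> float:
    """How many times fewer pairs the global-token variant scores."""
    full = full_self_attention_pairs(height, width)
    reduced = global_token_attention_pairs(height, width, num_global)
    return full / reduced
```

For an assumed 56x56 feature map, `full_self_attention_pairs(56, 56)` counts 9,834,496 pairs, while `global_token_attention_pairs(56, 56, 49)` counts 153,664, a 64-fold reduction. This counts score-matrix entries only; it ignores projection layers and the convolutional branch, so it should not be read as a FLOPs estimate for the model.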
