Scale-Aware Modulation Meet Transformer

Abstract

This paper presents the Scale-Aware Modulation Transformer (SMT), a new vision Transformer that combines convolutional networks and vision Transformers to handle a variety of downstream tasks efficiently. The Scale-Aware Modulation (SAM) proposed in SMT includes two novel designs. First, we introduce the Multi-Head Mixed Convolution (MHMC) module, which captures multi-scale features and expands the receptive field. Second, we propose the Scale-Aware Aggregation (SAA) module, which is lightweight yet effective and enables information fusion across different heads. Together, these two modules further enhance convolutional modulation. Furthermore, in contrast to prior works that use modulation in all stages to build an attention-free network, we propose an Evolutionary Hybrid Network (EHN), which effectively simulates the shift from capturing local dependencies to capturing global dependencies as the network becomes deeper, resulting in superior performance. Extensive experiments demonstrate that SMT significantly outperforms existing state-of-the-art models across a wide range of visual tasks. Specifically, SMT with 11.5M parameters / 2.4 GFLOPs and 32M parameters / 7.7 GFLOPs achieves 82.2% and 84.3% top-1 accuracy on ImageNet-1K, respectively. After pretraining on ImageNet-22K at 224^2 resolution, it attains 87.1% and 88.1% top-1 accuracy when finetuned at 224^2 and 384^2 resolution, respectively. For object detection with Mask R-CNN, the SMT base trained with the 1x and 3x schedules outperforms its Swin Transformer counterpart by 4.2 and 1.3 mAP on COCO, respectively. For semantic segmentation with UPerNet, the SMT base tested at single and multiple scales surpasses Swin by 2.0 and 1.1 mIoU on ADE20K, respectively.
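
The abstract names three ingredients without spelling out their mechanics: heads that convolve at different scales (MHMC), a lightweight step that fuses information across heads (SAA), and modulation. The sketch below is one possible reading, not the authors' implementation. It assumes 1-D signals instead of 2-D feature maps, fixed box filters in place of learned depthwise kernels, a plain pointwise channel mix as the aggregation step, and modulation as an element-wise product with a value branch. All function names, kernel sizes, and weights are hypothetical and chosen only for illustration.

```python
from typing import List

Signal = List[float]  # one channel; 1-D for brevity (assumption: real models use 2-D maps)


def depthwise_conv1d(x: Signal, kernel_size: int) -> Signal:
    """Zero-padded box filter standing in for a learned depthwise convolution."""
    pad = kernel_size // 2
    out = []
    for i in range(len(x)):
        # Taps that fall outside the signal are skipped, i.e. they contribute zero.
        window = [x[j] for j in range(i - pad, i + pad + 1) if 0 <= j < len(x)]
        out.append(sum(window) / kernel_size)
    return out


def mixed_conv(channels: List[Signal], kernel_sizes: List[int]) -> List[Signal]:
    """Split channels into one head per kernel size; each head filters at its own scale."""
    n_heads = len(kernel_sizes)
    assert len(channels) % n_heads == 0, "channels must divide evenly into heads"
    per_head = len(channels) // n_heads
    out = []
    for h, k in enumerate(kernel_sizes):
        for c in channels[h * per_head:(h + 1) * per_head]:
            out.append(depthwise_conv1d(c, k))
    return out


def aggregate(channels: List[Signal], weights: List[List[float]]) -> List[Signal]:
    """Pointwise mix across channels so that heads can exchange information."""
    length = len(channels[0])
    return [
        [sum(w * ch[i] for w, ch in zip(row, channels)) for i in range(length)]
        for row in weights
    ]


def modulate(value: List[Signal], modulator: List[Signal]) -> List[Signal]:
    """Element-wise product: the modulator gates the value branch."""
    return [[v * m for v, m in zip(vc, mc)] for vc, mc in zip(value, modulator)]


# Two channels carrying the same impulse; head 0 uses kernel size 1, head 1 uses 3.
x = [[0.0, 0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0]]
heads = mixed_conv(x, kernel_sizes=[1, 3])
mix = [[0.5, 0.5], [0.5, 0.5]]  # hypothetical weights; learned in a real model
modulator = aggregate(heads, mix)
output = modulate(x, modulator)  # value branch is the raw input here (assumption)
```

In this toy run the head with the larger kernel spreads the impulse over three positions while the other head leaves it untouched, and the mixing step blends the two scales before the result gates the input.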
