Scale-Aware Modulation Meet Transformer
July 17, 2023
Authors: Weifeng Lin, Ziheng Wu, Jiayu Chen, Jun Huang, Lianwen Jin
cs.AI
Abstract
This paper presents a new vision Transformer, Scale-Aware Modulation
Transformer (SMT), which can handle various downstream tasks efficiently by
combining convolutional networks and vision Transformers. The proposed
Scale-Aware Modulation (SAM) in the SMT includes two primary novel designs.
Firstly, we introduce the Multi-Head Mixed Convolution (MHMC) module, which can
capture multi-scale features and expand the receptive field. Secondly, we
propose the Scale-Aware Aggregation (SAA) module, which is lightweight but
effective, enabling information fusion across different heads. By leveraging
these two modules, convolutional modulation is further enhanced. Furthermore,
in contrast to prior works that utilized modulations throughout all stages to
build an attention-free network, we propose an Evolutionary Hybrid Network
(EHN), which can effectively simulate the shift from capturing local to global
dependencies as the network becomes deeper, resulting in superior performance.
Extensive experiments demonstrate that SMT significantly outperforms existing
state-of-the-art models across a wide range of visual tasks. Specifically, SMT
with 11.5M / 2.4GFLOPs and 32M / 7.7GFLOPs can achieve 82.2% and 84.3% top-1
accuracy on ImageNet-1K, respectively. After being pretrained on ImageNet-22K at
224^2 resolution, it attains 87.1% and 88.1% top-1 accuracy when fine-tuned at
resolutions 224^2 and 384^2, respectively. For object detection with Mask R-CNN,
the SMT base trained with the 1x and 3x schedules outperforms the Swin Transformer
counterpart by 4.2 and 1.3 mAP on COCO, respectively. For semantic segmentation
with UPerNet, the SMT base tested at single and multi scale surpasses Swin by
2.0 and 1.1 mIoU on ADE20K, respectively.
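As a concrete illustration of the modulation described in the abstract, the minimal PyTorch sketch below combines a Multi-Head Mixed Convolution (MHMC), where each head applies a depth-wise convolution with a progressively larger kernel, with a Scale-Aware Aggregation (SAA) step, approximated here as a pointwise convolution over the re-concatenated heads, to gate a value projection. The class name, kernel schedule, and layer choices are assumptions for illustration and are not taken from the authors' released code.

```python
# Minimal sketch of a Scale-Aware Modulation (SAM) block, assuming a PyTorch
# implementation. The MHMC splits channels into heads and applies a depth-wise
# convolution with a growing kernel per head; the SAA is approximated as a 1x1
# convolution over the re-concatenated heads; the result gates a value branch.
# All names and hyper-parameters here are illustrative assumptions.
import torch
import torch.nn as nn


class ScaleAwareModulation(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # MHMC: depth-wise convolutions with kernel sizes 3, 5, 7, ... so that
        # different heads capture features at different scales.
        self.mhmc = nn.ModuleList([
            nn.Conv2d(head_dim, head_dim, kernel_size=3 + 2 * i,
                      padding=1 + i, groups=head_dim)
            for i in range(num_heads)
        ])
        # SAA: lightweight cross-head fusion (sketched as a pointwise convolution).
        self.saa = nn.Conv2d(dim, dim, kernel_size=1)
        # Value branch and output projection for the convolutional modulation.
        self.v = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        heads = torch.chunk(x, self.num_heads, dim=1)
        heads = [conv(h) for conv, h in zip(self.mhmc, heads)]
        modulator = self.saa(self.act(torch.cat(heads, dim=1)))
        # Element-wise modulation of the value features, then projection.
        return self.proj(modulator * self.v(x))
```

For example, `ScaleAwareModulation(64)(torch.randn(1, 64, 56, 56))` preserves the (B, C, H, W) shape, since each per-head convolution is padded to keep spatial resolution.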
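The Evolutionary Hybrid Network can likewise be illustrated with a simple stage-layout rule: early stages use modulation (SAM) blocks only, the final stage uses attention blocks only, and a transitional stage mixes the two with attention becoming denser toward the end, mirroring the local-to-global shift as the network deepens. The function below is a hypothetical sketch of such a rule under these assumptions; the paper's exact mixing schedule may differ.

```python
# Hypothetical sketch of an Evolutionary Hybrid Network (EHN) stage layout:
# 'mod' denotes a convolutional-modulation (SAM) block, 'attn' an attention block.
# The mixing rule for the transitional stage is an assumption for illustration.
from typing import List


def ehn_stage_layout(depths: List[int]) -> List[List[str]]:
    """Return a per-stage list of block types ('mod' or 'attn')."""
    layouts = []
    num_stages = len(depths)
    for stage, depth in enumerate(depths):
        if stage < num_stages - 2:
            layouts.append(["mod"] * depth)       # local modeling only
        elif stage == num_stages - 1:
            layouts.append(["attn"] * depth)      # global modeling only
        else:
            # Transitional stage: attention blocks appear in the later half,
            # so the proportion of global modeling grows with depth.
            layouts.append(["attn" if (i + 1) / depth > 0.5 else "mod"
                            for i in range(depth)])
    return layouts


if __name__ == "__main__":
    # Example with a Swin-like depth configuration of (2, 2, 6, 2).
    for stage, layout in enumerate(ehn_stage_layout([2, 2, 6, 2])):
        print(f"stage {stage}: {layout}")
```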