Scale-Aware Modulation Meet Transformer

July 17, 2023
Authors: Weifeng Lin, Ziheng Wu, Jiayu Chen, Jun Huang, Lianwen Jin
cs.AI

Abstract

This paper presents a new vision Transformer, the Scale-Aware Modulation Transformer (SMT), which can handle various downstream tasks efficiently by combining convolutional networks and vision Transformers. The proposed Scale-Aware Modulation (SAM) in SMT includes two primary novel designs. First, we introduce the Multi-Head Mixed Convolution (MHMC) module, which can capture multi-scale features and expand the receptive field. Second, we propose the Scale-Aware Aggregation (SAA) module, which is lightweight but effective, enabling information fusion across different heads. By leveraging these two modules, convolutional modulation is further enhanced. Furthermore, in contrast to prior works that utilize modulation throughout all stages to build an attention-free network, we propose an Evolutionary Hybrid Network (EHN), which can effectively simulate the shift from capturing local to global dependencies as the network becomes deeper, resulting in superior performance. Extensive experiments demonstrate that SMT significantly outperforms existing state-of-the-art models across a wide range of visual tasks. Specifically, SMT with 11.5M / 2.4 GFLOPs and 32M / 7.7 GFLOPs achieves 82.2% and 84.3% top-1 accuracy on ImageNet-1K, respectively. After being pretrained on ImageNet-22K at 224^2 resolution, it attains 87.1% and 88.1% top-1 accuracy when fine-tuned at 224^2 and 384^2 resolution, respectively. For object detection with Mask R-CNN, the SMT base model trained with 1x and 3x schedules outperforms its Swin Transformer counterpart by 4.2 and 1.3 mAP on COCO, respectively. For semantic segmentation with UPerNet, the SMT base model tested at single and multi scale surpasses Swin by 2.0 and 1.1 mIoU, respectively, on ADE20K.
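The abstract only names the SAM components, so the following is a minimal PyTorch sketch of how they could fit together. The module names (MHMC, SAA) follow the text above, but the kernel sizes, the per-head channel split, and the element-wise modulation of a projected value branch are illustrative assumptions, not the authors' reference implementation.

```python
# Hedged sketch of Scale-Aware Modulation (SAM) as described in the abstract.
# Kernel sizes, channel splits, and the exact modulation are assumptions.
import torch
import torch.nn as nn


class ScaleAwareModulation(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # This sketch supports up to 4 heads (one kernel size per head).
        assert dim % num_heads == 0 and num_heads <= 4
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # Multi-Head Mixed Convolution (MHMC): each head applies a depthwise
        # convolution with a different kernel size to capture multi-scale
        # features and enlarge the receptive field.
        self.mhmc = nn.ModuleList(
            nn.Conv2d(head_dim, head_dim, kernel_size=k, padding=k // 2, groups=head_dim)
            for k in (3, 5, 7, 9)[:num_heads]
        )
        # Scale-Aware Aggregation (SAA): a lightweight pointwise convolution
        # that fuses information across the different heads.
        self.saa = nn.Conv2d(dim, dim, kernel_size=1)
        # Value projection; the aggregated context modulates it element-wise.
        self.v = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        heads = torch.chunk(x, self.num_heads, dim=1)
        multi_scale = torch.cat([conv(h) for conv, h in zip(self.mhmc, heads)], dim=1)
        context = self.saa(multi_scale)          # cross-head fusion
        return self.proj(context * self.v(x))    # convolutional modulation


if __name__ == "__main__":
    sam = ScaleAwareModulation(dim=64, num_heads=4)
    print(sam(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```

In the full SMT architecture, blocks built on such a module would occupy the earlier stages, while the later stages mix in self-attention, which is how the Evolutionary Hybrid Network (EHN) models the shift from local to global dependencies described above.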