Scale-Aware Modulation Meet Transformer

Abstract

This paper presents a new vision Transformer, the Scale-Aware Modulation Transformer (SMT), which handles a variety of downstream tasks efficiently by combining convolutional networks with vision Transformers. The proposed Scale-Aware Modulation (SAM) in SMT includes two primary novel designs. First, we introduce the Multi-Head Mixed Convolution (MHMC) module, which captures multi-scale features and expands the receptive field. Second, we propose the Scale-Aware Aggregation (SAA) module, which is lightweight but effective and enables information fusion across different heads. Together, these two modules further enhance convolutional modulation. Furthermore, in contrast to prior works that used modulation throughout all stages to build an attention-free network, we propose an Evolutionary Hybrid Network (EHN), which effectively simulates the shift from capturing local dependencies to capturing global ones as the network becomes deeper, resulting in superior performance. Extensive experiments demonstrate that SMT significantly outperforms existing state-of-the-art models across a wide range of visual tasks. Specifically, SMT with 11.5M parameters / 2.4 GFLOPs and 32M parameters / 7.7 GFLOPs achieves 82.2% and 84.3% top-1 accuracy on ImageNet-1K, respectively. After pretraining on ImageNet-22K at 224^2 resolution, it attains 87.1% and 88.1% top-1 accuracy when fine-tuned at resolutions of 224^2 and 384^2, respectively. For object detection with Mask R-CNN, the SMT base model trained with the 1x and 3x schedules outperforms its Swin Transformer counterpart by 4.2 and 1.3 mAP on COCO, respectively. For semantic segmentation with UPerNet, the SMT base model tested at single and multiple scales surpasses Swin by 2.0 and 1.1 mIoU, respectively, on ADE20K.
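
To make the modulation idea above more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not specify kernel sizes, how channels are assigned to heads, or the exact aggregation operator, so all of the following are assumptions made for illustration: channels are split evenly across heads, each head applies a depthwise convolution with its own kernel size (standing in for MHMC), a full 1x1 channel mix fuses information across heads (standing in for SAA), and the result multiplies a `value` tensor element-wise (the modulation). Function names and the random weights, which are placeholders for learned parameters, are hypothetical.

```python
import random


def depthwise_conv2d(channel, kernel):
    """Zero-padded 'same' filtering of one H x W channel with a k x k kernel."""
    h, w, k = len(channel), len(channel[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:  # outside the image counts as zero
                        acc += channel[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def pointwise_mix(channels, weights):
    """1x1 convolution: each output channel is a weighted sum of all input channels."""
    h, w = len(channels[0]), len(channels[0][0])
    mixed = []
    for row in weights:
        out = [[0.0] * w for _ in range(h)]
        for wt, ch in zip(row, channels):
            for i in range(h):
                for j in range(w):
                    out[i][j] += wt * ch[i][j]
        mixed.append(out)
    return mixed


def scale_aware_modulation(x, value, kernel_sizes, rng):
    """x and value are lists of C channels, each an H x W nested list."""
    c = len(x)
    if c % len(kernel_sizes) != 0:
        raise ValueError("channel count must be divisible by the number of heads")
    per_head = c // len(kernel_sizes)

    # Assumed MHMC stand-in: every head filters its channels at its own scale.
    multi_scale = []
    for head, k in enumerate(kernel_sizes):
        for ch in x[head * per_head:(head + 1) * per_head]:
            kernel = [[rng.uniform(-1, 1) / (k * k) for _ in range(k)] for _ in range(k)]
            multi_scale.append(depthwise_conv2d(ch, kernel))

    # Assumed SAA stand-in: a 1x1 mix lets features from different heads interact.
    mix = [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(c)]
    modulator = pointwise_mix(multi_scale, mix)

    # Modulation: element-wise product of the modulator with the value tensor.
    return [
        [[m * v for m, v in zip(m_row, v_row)] for m_row, v_row in zip(m_ch, v_ch)]
        for m_ch, v_ch in zip(modulator, value)
    ]


# Usage: 4 channels of 5 x 5 features, two heads with 3 x 3 and 5 x 5 kernels.
rng = random.Random(0)
features = [[[rng.random() for _ in range(5)] for _ in range(5)] for _ in range(4)]
modulated = scale_aware_modulation(features, features, [3, 5], rng)
print(len(modulated), len(modulated[0]), len(modulated[0][0]))
```

In this sketch the input doubles as the `value` tensor for brevity; in a real network the value would typically come from a learned projection of the input, and the convolution weights would be trained rather than sampled.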
