Scale-Aware Modulation Meet Transformer

Abstract

This paper presents a new vision Transformer, the Scale-Aware Modulation Transformer (SMT), which handles various downstream tasks efficiently by combining convolutional networks and vision Transformers. The Scale-Aware Modulation (SAM) proposed in SMT includes two primary novel designs. First, we introduce the Multi-Head Mixed Convolution (MHMC) module, which captures multi-scale features and expands the receptive field. Second, we propose the Scale-Aware Aggregation (SAA) module, which is lightweight yet effective and enables information fusion across different heads. Together, these two modules further enhance convolutional modulation. Furthermore, in contrast to prior works that use modulation throughout all stages to build an attention-free network, we propose an Evolutionary Hybrid Network (EHN), which effectively simulates the shift from capturing local dependencies to capturing global dependencies as the network becomes deeper, resulting in superior performance. Extensive experiments demonstrate that SMT significantly outperforms existing state-of-the-art models across a wide range of visual tasks. Specifically, SMT with 11.5M parameters / 2.4 GFLOPs and 32M parameters / 7.7 GFLOPs achieves 82.2% and 84.3% top-1 accuracy on ImageNet-1K, respectively. After pretraining on ImageNet-22K at 224^2 resolution, it attains 87.1% and 88.1% top-1 accuracy when fine-tuned at 224^2 and 384^2 resolution, respectively. For object detection with Mask R-CNN, SMT base trained with the 1x and 3x schedules outperforms its Swin Transformer counterpart by 4.2 and 1.3 mAP on COCO, respectively. For semantic segmentation with UPerNet, SMT base tested at single scale and multi-scale surpasses Swin by 2.0 and 1.1 mIoU on ADE20K, respectively.
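The two SAM components named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify kernel sizes, the internals of SAA, or how the modulator is applied, so the following are assumptions made for illustration: the channels are split into four heads with depth-wise kernels of size 3, 5, 7 and 9; a single 1x1 channel-mixing matrix stands in for SAA; and the aggregated features modulate a linear value projection by element-wise multiplication. The names `depthwise_conv2d`, `scale_aware_modulation`, `w_mix` and `w_value` are hypothetical.

```python
import numpy as np


def depthwise_conv2d(x, kernels):
    """Depth-wise 'same' cross-correlation.

    x: (C, H, W) feature map; kernels: (C, k, k) with odd k.
    Each channel is filtered only by its own kernel.
    """
    c, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += kernels[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out


def scale_aware_modulation(x, head_kernels, w_mix, w_value):
    """Sketch of MHMC followed by an SAA-like aggregation and modulation.

    x: (C, H, W); head_kernels: one (C // n_heads, k, k) array per head,
    each head with its own kernel size (assumption); w_mix, w_value: (C, C).
    """
    # Multi-head mixed convolution: split channels into heads and give
    # every head a different kernel size, hence a different receptive field.
    heads = np.split(x, len(head_kernels), axis=0)
    multi_scale = np.concatenate(
        [depthwise_conv2d(h, k) for h, k in zip(heads, head_kernels)], axis=0
    )
    # Stand-in for scale-aware aggregation (assumption): a 1x1 convolution
    # that mixes channels, so information flows across heads.
    modulator = np.einsum("oc,chw->ohw", w_mix, multi_scale)
    # Modulation (assumption): element-wise product with a value projection.
    value = np.einsum("oc,chw->ohw", w_value, x)
    return modulator * value


rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
kernel_sizes = (3, 5, 7, 9)  # illustrative choice, not taken from the abstract
x = rng.standard_normal((C, H, W))
head_kernels = [
    rng.standard_normal((C // len(kernel_sizes), k, k)) / (k * k)
    for k in kernel_sizes
]
w_mix = rng.standard_normal((C, C)) / np.sqrt(C)
w_value = rng.standard_normal((C, C)) / np.sqrt(C)

y = scale_aware_modulation(x, head_kernels, w_mix, w_value)
print(y.shape)  # (8, 16, 16)
```

Under these assumptions the spatial mixing stays depth-wise and the cross-head fusion is a single 1x1 projection, which is one plausible reading of "lightweight" aggregation. The output keeps the input shape, so the block could be stacked in the way a stage of the network would require.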
