Adaptive Frequency Filters As Efficient Global Token Mixers

Abstract

Recent vision transformers, large-kernel CNNs and MLPs have attained remarkable success in a broad range of vision tasks thanks to their effective information fusion in the global scope. However, their efficient deployment, especially on mobile devices, still faces notable challenges due to the heavy computational costs of self-attention mechanisms, large kernels, or fully connected layers. In this work, we apply the conventional convolution theorem to deep learning to address this and reveal that adaptive frequency filters can serve as efficient global token mixers. With this insight, we propose the Adaptive Frequency Filtering (AFF) token mixer. This neural operator transfers a latent representation to the frequency domain via a Fourier transform and performs semantic-adaptive frequency filtering via an elementwise multiplication, which is mathematically equivalent to a token mixing operation in the original latent space with a dynamic convolution kernel as large as the spatial resolution of this latent representation. We take AFF token mixers as primary neural operators to build a lightweight neural network, dubbed AFFNet. Extensive experiments demonstrate the effectiveness of our proposed AFF token mixer and show that AFFNet achieves superior accuracy and efficiency trade-offs compared to other lightweight network designs on a broad range of visual tasks, including visual recognition and dense prediction tasks.
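The equivalence stated in the abstract, that elementwise multiplication in the frequency domain acts as a convolution whose kernel spans the whole input, can be checked numerically. The following is a minimal sketch and not the AFFNet implementation: it assumes a 1D sequence of scalar tokens instead of a 2D feature map, uses a naive DFT from the standard library, and uses a fixed, hand-picked kernel where AFF would use a filter that adapts to the input. All function and variable names are hypothetical.

```python
import cmath


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def frequency_filter(tokens, mask):
    """Mix tokens by elementwise multiplication in the frequency domain."""
    spectrum = dft(tokens)
    filtered = [s * m for s, m in zip(spectrum, mask)]
    return idft(filtered)


def circular_conv(tokens, kernel):
    """Circular convolution with a kernel as long as the token sequence."""
    n = len(tokens)
    return [
        sum(kernel[j] * tokens[(i - j) % n] for j in range(n))
        for i in range(n)
    ]


# Illustrative values only; in AFF the filter would depend on the input.
tokens = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
kernel = [0.5, -0.25, 0.1, 0.0, 0.3, -0.1, 0.2, 0.05]

# The frequency mask is the DFT of the spatial kernel (convolution theorem).
mask = dft(kernel)

via_frequency = frequency_filter(tokens, mask)
via_convolution = circular_conv(tokens, kernel)

# Both routes should agree up to floating-point rounding.
max_error = max(abs(a - b) for a, b in zip(via_frequency, via_convolution))
print(f"max difference between the two routes: {max_error:.2e}")
```

With a fast Fourier transform in place of the naive DFT used here, the frequency-domain route costs O(N log N) for N tokens, whereas the explicit full-size convolution costs O(N^2), which is the efficiency argument behind using frequency filters as global token mixers.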
