Adaptive Frequency Filters As Efficient Global Token Mixers

Abstract

Recent vision transformers, large-kernel CNNs, and MLPs have achieved remarkable success across a broad range of vision tasks thanks to their effective information fusion at a global scope. However, deploying them efficiently, especially on mobile devices, remains challenging because of the heavy computational cost of self-attention mechanisms, large kernels, or fully connected layers. In this work, we apply the conventional convolution theorem to deep learning to address this problem and reveal that adaptive frequency filters can serve as efficient global token mixers. With this insight, we propose the Adaptive Frequency Filtering (AFF) token mixer. This neural operator transfers a latent representation to the frequency domain via a Fourier transform and performs semantic-adaptive frequency filtering via an elementwise multiplication, which is mathematically equivalent to a token mixing operation in the original latent space with a dynamic convolution kernel as large as the spatial resolution of this latent representation. We take AFF token mixers as primary neural operators to build a lightweight neural network, dubbed AFFNet. Extensive experiments demonstrate the effectiveness of our proposed AFF token mixer and show that AFFNet achieves superior accuracy and efficiency trade-offs compared to other lightweight network designs on a broad range of visual tasks, including visual recognition and dense prediction tasks.
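The equivalence stated in the abstract (elementwise filtering in the frequency domain equals token mixing with a feature-map-sized convolution kernel) can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: `toy_mask` is a hypothetical, hand-written stand-in for whatever learned network produces the semantic-adaptive mask, and the function names, tensor layout `(H, W, C)`, and per-channel filtering are all assumptions made for illustration.

```python
import numpy as np

def toy_mask(spectrum):
    # Hypothetical stand-in for a learned mask generator: the filter
    # depends on the input's own spectrum, so it adapts to the content.
    # A real-valued function of |spectrum| keeps the output real for real inputs.
    mag = np.abs(spectrum)
    return mag / (1.0 + mag)

def aff_mix(x, mask_fn=toy_mask):
    # x: real latent representation of shape (H, W, C).
    spectrum = np.fft.fft2(x, axes=(0, 1))               # to the frequency domain
    mask = mask_fn(spectrum)                             # input-dependent filter
    mixed = np.fft.ifft2(spectrum * mask, axes=(0, 1))   # elementwise filtering, then back
    return mixed.real, mask

def circular_conv(x, kernel):
    # Direct per-channel circular convolution with an H x W kernel,
    # i.e. a kernel as large as the spatial resolution of x.
    h, w, _ = x.shape
    out = np.zeros_like(x)
    for p in range(h):
        for q in range(w):
            out += kernel[p, q] * np.roll(x, shift=(p, q), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))

y_freq, mask = aff_mix(x)
# The spatial kernel equivalent to the frequency mask (convolution theorem).
kernel = np.fft.ifft2(mask, axes=(0, 1)).real
y_conv = circular_conv(x, kernel)

max_diff = np.abs(y_freq - y_conv).max()
print("max abs difference:", max_diff)
```

The frequency-domain path costs O(HW log HW) per channel through the FFT, while the explicit convolution above costs O((HW)^2), which is the efficiency argument the abstract makes for global token mixing.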
