Adaptive Frequency Filters As Efficient Global Token Mixers

Abstract

Recent vision transformers, large-kernel CNNs, and MLPs have achieved remarkable success across a broad range of vision tasks thanks to their effective information fusion in the global scope. However, their efficient deployment, especially on mobile devices, still faces notable challenges due to the heavy computational cost of self-attention mechanisms, large kernels, or fully connected layers. In this work, we apply the conventional convolution theorem to deep learning to address this problem and reveal that adaptive frequency filters can serve as efficient global token mixers. With this insight, we propose the Adaptive Frequency Filtering (AFF) token mixer. This neural operator transfers a latent representation to the frequency domain via a Fourier transform and performs semantic-adaptive frequency filtering via an elementwise multiplication, which is mathematically equivalent to a token mixing operation in the original latent space with a dynamic convolution kernel as large as the spatial resolution of this latent representation. We take AFF token mixers as the primary neural operators to build a lightweight neural network, dubbed AFFNet. Extensive experiments demonstrate the effectiveness of the proposed AFF token mixer and show that AFFNet achieves superior accuracy and efficiency trade-offs compared to other lightweight network designs on a broad range of visual tasks, including visual recognition and dense prediction tasks.
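The equivalence the abstract relies on, that an elementwise multiplication in the frequency domain matches a circular convolution with a kernel as large as the feature map, can be checked numerically. The following is a minimal NumPy sketch under stated assumptions, not the AFFNet implementation: the function names `circular_conv2d` and `frequency_filter` are hypothetical, the map is a single 8 x 8 channel, and the kernel is random and fixed, whereas in AFF the filter is described as semantic-adaptive, that is, derived from the input.

```python
import numpy as np


def circular_conv2d(x, k):
    """Direct 2D circular convolution; k has the same H x W size as x."""
    H, W = x.shape
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            # np.roll(k, (i, j))[m, n] == k[(m - i) % H, (n - j) % W],
            # so each token x[i, j] contributes to every output position.
            out += x[i, j] * np.roll(k, shift=(i, j), axis=(0, 1))
    return out


def frequency_filter(x, k):
    """Same token mixing done as an elementwise product in the frequency domain."""
    mask = np.fft.fft2(k)  # frequency-domain filter (fixed here, an assumption)
    return np.real(np.fft.ifft2(np.fft.fft2(x) * mask))


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))  # one channel of a latent feature map
k = rng.standard_normal((8, 8))  # global kernel, as large as the map itself

y_spatial = circular_conv2d(x, k)
y_freq = frequency_filter(x, k)

# The two results should agree up to floating-point error.
max_abs_diff = float(np.max(np.abs(y_spatial - y_freq)))
print(max_abs_diff)
```

For a map with N tokens, the direct form costs on the order of N^2 multiply-adds, while the FFT route costs on the order of N log N, which is the efficiency argument for doing global mixing in the frequency domain.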
