Frequency Dynamic Convolution for Dense Image Prediction

Abstract

Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism. However, the frequency responses of these weights tend to be highly similar, so the method incurs a high parameter cost while offering limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. FDConv divides this budget into frequency-based groups with disjoint Fourier indices, enabling the construction of frequency-diverse weights without increasing the parameter cost. To further enhance adaptability, we propose Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). KSM dynamically adjusts the frequency response of each filter at the spatial level, while FBM decomposes weights into distinct frequency bands in the frequency domain and modulates them dynamically based on local content. Extensive experiments on object detection, segmentation, and classification validate the effectiveness of FDConv. When applied to ResNet-50, FDConv achieves superior performance with a modest increase of +3.6M parameters, outperforming previous methods that require substantially larger parameter budgets (e.g., CondConv +90M, KW +76.5M). Moreover, FDConv integrates seamlessly into a variety of architectures, including ConvNeXt and Swin-Transformer, offering a flexible and efficient solution for modern vision tasks. The code is publicly available at https://github.com/Linwei-Chen/FDConv.
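
The idea of dividing one Fourier-domain parameter budget into groups with disjoint indices can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository above for that). The function name `split_fourier_budget`, the 2D spectrum shape, the radial low-to-high ordering, and the equal-size grouping are all assumptions made for illustration, and taking the real part of the inverse transform is a simplification of this sketch.

```python
import numpy as np

def split_fourier_budget(spectrum, n_groups):
    """Hypothetical helper: split one 2D set of Fourier coefficients into
    n_groups spatial-domain weights that use disjoint Fourier indices."""
    h, w = spectrum.shape
    # Radial frequency of every Fourier index (assumed ordering criterion).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    # Sort indices from low to high frequency and cut them into groups.
    order = np.argsort(radius.ravel(), kind="stable")
    weights, masks = [], []
    for idx in np.array_split(order, n_groups):
        mask = np.zeros(h * w, dtype=bool)
        mask[idx] = True
        mask = mask.reshape(h, w)
        # Keep this group's coefficients, zero the rest, and return to the
        # spatial domain. The real part is kept (sketch simplification).
        weights.append(np.fft.ifft2(np.where(mask, spectrum, 0)).real)
        masks.append(mask)
    return weights, masks

# Example: one fixed budget of 8 x 8 coefficients yields 4 weights.
rng = np.random.default_rng(0)
spectrum = np.fft.fft2(rng.standard_normal((8, 8)))
weights, masks = split_fourier_budget(spectrum, n_groups=4)

# Every Fourier index belongs to exactly one group, so the number of
# learnable coefficients does not grow with the number of weights.
assert (np.sum(masks, axis=0) == 1).all()
```

Because the groups partition the index set, the stored coefficient count stays at the size of the original spectrum no matter how many weights are built from it, and each weight is confined to its own part of the spectrum. This is the property the abstract describes as frequency diversity at no extra parameter cost.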
