Frequency Dynamic Convolution for Dense Image Prediction

Abstract

Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism. However, the frequency responses of these weights tend to be highly similar, so the high parameter cost buys only limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. FDConv divides this budget into frequency-based groups with disjoint Fourier indices, enabling the construction of frequency-diverse weights without increasing the parameter cost. To further enhance adaptability, we propose Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). KSM dynamically adjusts the frequency response of each filter at the spatial level, while FBM decomposes weights into distinct frequency bands in the frequency domain and modulates them dynamically based on local content. Extensive experiments on object detection, segmentation, and classification validate the effectiveness of FDConv. When applied to ResNet-50, FDConv achieves superior performance with a modest increase of +3.6M parameters, outperforming previous methods that require substantial increases in parameter budgets (e.g., CondConv +90M, KW +76.5M). Moreover, FDConv integrates seamlessly into a variety of architectures, including ConvNeXt and Swin-Transformer, offering a flexible and efficient solution for modern vision tasks. The code is publicly available at https://github.com/Linwei-Chen/FDConv.
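The idea of splitting a fixed Fourier-domain parameter budget into groups with disjoint Fourier indices can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository above for that). The function names `disjoint_frequency_groups` and `weights_from_budget`, the 8x8 size, the radial low-to-high ordering of indices, and the use of a random array as a stand-in for learned parameters are all assumptions made for illustration.

```python
import numpy as np


def disjoint_frequency_groups(h, w, n_groups):
    """Partition the h*w flattened Fourier indices into n_groups disjoint sets.

    Assumption: indices are ordered from low to high radial frequency.
    """
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2).ravel()
    order = np.argsort(radius, kind="stable")
    return np.array_split(order, n_groups)


def weights_from_budget(spectrum, groups):
    """Build one spatial weight per group from a shared Fourier budget.

    Each weight keeps only its own group's coefficients, zeroes the rest,
    and is mapped back to the spatial domain with an inverse 2-D DFT.
    """
    flat = spectrum.ravel()
    weights = []
    for idx in groups:
        masked = np.zeros_like(flat)
        masked[idx] = flat[idx]
        # Keeping the real part is a simplification of this sketch.
        weights.append(np.fft.ifft2(masked.reshape(spectrum.shape)).real)
    return np.stack(weights)


rng = np.random.default_rng(0)
h, w, n_groups = 8, 8, 4
base = rng.standard_normal((h, w))   # stand-in for learned parameters
spectrum = np.fft.fft2(base)         # the fixed Fourier-domain budget
groups = disjoint_frequency_groups(h, w, n_groups)
weights = weights_from_budget(spectrum, groups)  # shape (n_groups, h, w)
```

Because the groups are disjoint and together cover every Fourier index, the four weights draw on the same 64 coefficients rather than four independent sets, and by linearity of the inverse DFT they sum back to `base`. Each weight nevertheless occupies a different part of the spectrum, which is the sense in which the abstract calls the weights frequency-diverse.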
