iFormer: Integrating ConvNet and Transformer for Mobile Application

Abstract

We present iFormer, a new family of mobile hybrid vision networks designed to optimize latency and accuracy for mobile applications. iFormer integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived by transforming a standard convolutional network, ConvNeXt, into a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in multi-head attention (MHA) and employs an efficient modulation mechanism to boost dynamic global representational capacity. Comprehensive experiments demonstrate that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves a Top-1 accuracy of 80.4% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements on downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while maintaining low latency on mobile devices for high-resolution inputs in these scenarios.
