iFormer: Integrating ConvNet and Transformer for Mobile Application
January 26, 2025
Author: Chuanyang Zheng
cs.AI
Abstract
We present a new family of mobile hybrid vision networks, called iFormer,
with a focus on optimizing latency and accuracy on mobile applications. iFormer
effectively integrates the fast local representation capacity of convolution
with the efficient global modeling ability of self-attention. The local
interactions are derived from transforming a standard convolutional network,
i.e., ConvNeXt, to design a more lightweight mobile network. Our newly
introduced mobile modulation attention removes memory-intensive operations in
MHA and employs an efficient modulation mechanism to boost dynamic global
representational capacity. We conduct comprehensive experiments demonstrating
that iFormer outperforms existing lightweight networks across various tasks.
Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k
with a latency of only 1.10 ms on an iPhone 13, surpassing the recently
proposed MobileNetV4 under similar latency constraints. Additionally, our
method shows significant improvements in downstream tasks, including COCO
object detection, instance segmentation, and ADE20k semantic segmentation,
while still maintaining low latency on mobile devices for high-resolution
inputs in these scenarios.
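The abstract notes that iFormer's local interactions are derived by transforming ConvNeXt into a lighter mobile network. For reference, below is a minimal PyTorch sketch of the standard ConvNeXt block that serves as the starting point, not iFormer's slimmed-down variant; LayerScale and stochastic depth are omitted for brevity, and all names are illustrative.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Standard ConvNeXt block: depthwise 7x7 conv, channels-last LayerNorm,
    and a pointwise MLP with 4x expansion. LayerScale/DropPath omitted."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # pointwise expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # pointwise projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to channels-first
        return residual + x
```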
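The mobile modulation attention is described here only at a high level: it removes memory-intensive MHA operations and uses a modulation mechanism to boost dynamic global capacity. The sketch below is a hypothetical rendering of that idea, assuming a single-head attention whose context is gated elementwise by a cheap linear branch; the class and layer names (`ModulationAttention`, `mod`) are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ModulationAttention(nn.Module):
    """Hypothetical single-head attention with an elementwise modulation
    branch; an illustrative sketch, not the paper's exact design."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)  # one head: no split/concat
        self.mod = nn.Linear(dim, dim)                  # cheap parallel gating branch
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        ctx = attn @ v                                  # global context, single head
        # Modulate the attention context with a per-channel gate derived from x,
        # adding dynamic capacity without multi-head reshape/concat overhead.
        return self.proj(ctx * self.mod(x))
```

For example, `ModulationAttention(64)(torch.randn(1, 196, 64))` returns a `(1, 196, 64)` tensor. Keeping a single head avoids the per-head reshape, transpose, and concatenation traffic of standard MHA, which is one plausible reading of the "memory-intensive operations" the abstract refers to.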