iFormer: Integrating ConvNet and Transformer for Mobile Application
January 26, 2025
Author: Chuanyang Zheng
cs.AI
Abstract
We present a new family of mobile hybrid vision networks, called iFormer,
with a focus on optimizing latency and accuracy on mobile applications. iFormer
effectively integrates the fast local representation capacity of convolution
with the efficient global modeling ability of self-attention. The local
interactions are derived from transforming a standard convolutional network,
i.e., ConvNeXt, to design a more lightweight mobile network. Our newly
introduced mobile modulation attention removes memory-intensive operations in
MHA and employs an efficient modulation mechanism to boost dynamic global
representational capacity. We conduct comprehensive experiments demonstrating
that iFormer outperforms existing lightweight networks across various tasks.
Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k
with a latency of only 1.10 ms on an iPhone 13, surpassing the recently
proposed MobileNetV4 under similar latency constraints. Additionally, our
method shows significant improvements in downstream tasks, including COCO
object detection, instance segmentation, and ADE20k semantic segmentation,
while still maintaining low latency on mobile devices for high-resolution
inputs in these scenarios.
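The abstract notes that iFormer's local interactions are derived by transforming ConvNeXt into a lighter mobile network. For reference, below is a minimal PyTorch sketch of the standard ConvNeXt block that serves as the starting point, not iFormer's slimmed-down variant; LayerScale and stochastic depth are omitted for brevity, and all names are illustrative.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Standard ConvNeXt block: depthwise 7x7 conv, channels-last LayerNorm,
    and a pointwise MLP with 4x expansion. LayerScale/DropPath omitted."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # pointwise expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # pointwise projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to channels-first
        return residual + x
```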
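The mobile modulation attention is described here only at a high level: it removes memory-intensive MHA operations and uses a modulation mechanism to boost dynamic global capacity. The sketch below is a hypothetical rendering of that idea, assuming a single-head attention whose context is gated elementwise by a cheap linear branch; the class and layer names (`ModulationAttention`, `mod`) are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ModulationAttention(nn.Module):
    """Hypothetical single-head attention with an elementwise modulation
    branch; an illustrative sketch, not the paper's exact design."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)  # one head: no split/concat
        self.mod = nn.Linear(dim, dim)                  # cheap parallel gating branch
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        ctx = attn @ v                                  # global context, single head
        # Modulate the attention context with a per-channel gate derived from x,
        # adding dynamic capacity without multi-head reshape/concat overhead.
        return self.proj(ctx * self.mod(x))
```

For example, `ModulationAttention(64)(torch.randn(1, 196, 64))` returns a `(1, 196, 64)` tensor. Keeping a single head avoids the per-head reshape, transpose, and concatenation traffic of standard MHA, which is one plausible reading of the "memory-intensive operations" the abstract refers to.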