EMOv2: Pushing 5M Vision Model Frontier

Abstract

This work develops parameter-efficient, lightweight models for dense prediction, balancing parameters, FLOPs, and performance. Our goal is to establish a new frontier for lightweight models at the 5M-parameter scale across various downstream tasks. The Inverted Residual Block (IRB) serves as the basic building block of lightweight CNNs, but no counterpart has been established for attention-based designs. We rethink the lightweight infrastructure of the efficient IRB and the practical components of the Transformer from a unified perspective, extending the CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following simple but effective design criteria, we derive a modern Improved Inverted Residual Mobile Block (i2RMB) and build an improved hierarchical Efficient MOdel (EMOv2) without elaborate, complex structures. To keep download latency over 4G/5G bandwidth imperceptible to mobile users while still ensuring model performance, we investigate the performance upper limit of lightweight models at the 5M scale. Extensive experiments on vision recognition, dense prediction, and image generation tasks demonstrate the superiority of EMOv2 over state-of-the-art methods. For example, EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1, respectively, significantly surpassing CNN- and attention-based models of comparable size. In addition, RetinaNet equipped with EMOv2-5M achieves 41.5 mAP on object detection, surpassing the previous EMO-5M by +2.6. With a more robust training recipe, EMOv2-5M ultimately reaches 82.9 Top-1 accuracy, raising the performance of 5M-scale models to a new level. Code is available at https://github.com/zhangzjn/EMOv2.
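The abstract's download-latency argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes weights stored as 32-bit floats, treats the model names as nominal parameter counts (1M, 2M, 5M), and uses hypothetical sustained bandwidths for 4G and 5G. None of these figures come from the paper, and the helper names are made up for this example.

```python
def model_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate weight size in megabytes (10**6 bytes); ignores file overhead."""
    return num_params * bytes_per_param / 1e6


def download_seconds(size_mb: float, bandwidth_mbps: float) -> float:
    """Transfer time in seconds for size_mb megabytes at bandwidth_mbps megabits per second."""
    return size_mb * 8 / bandwidth_mbps


# ASSUMPTION: hypothetical sustained bandwidths in megabits per second, not from the paper.
ASSUMED_BANDWIDTH_MBPS = {"4G": 20.0, "5G": 200.0}

# ASSUMPTION: nominal parameter counts implied by the model names, not measured values.
NOMINAL_PARAMS = {"EMOv2-1M": 1_000_000, "EMOv2-2M": 2_000_000, "EMOv2-5M": 5_000_000}

if __name__ == "__main__":
    for model, params in NOMINAL_PARAMS.items():
        size = model_size_mb(params)
        for network, mbps in ASSUMED_BANDWIDTH_MBPS.items():
            seconds = download_seconds(size, mbps)
            print(f"{model}: {size:.1f} MB over {network} ({mbps:.0f} Mbit/s) ~ {seconds:.2f} s")
```

Under these assumptions a 5M-parameter model occupies about 20 MB in float32. Storing the weights in half precision (bytes_per_param=2) would halve both the size and the estimated transfer time.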
