EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Abstract

To deploy neural networks in resource-constrained environments, prior works have built lightweight architectures that use convolution and attention to capture local and global dependencies, respectively. Recently, the state space model (SSM) has emerged as an effective means of global token interaction, thanks to its linear computational cost in the number of tokens. Yet efficient vision backbones built with SSMs remain less explored. In this paper, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies at a further reduced computational cost. In the HSM-SSD layer, we redesign the previous SSD layer to enable the channel mixing operation within hidden states. Additionally, we propose multi-stage hidden state fusion to further reinforce the representation power of hidden states, and we provide a design that alleviates the bottleneck caused by memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, offering up to a 0.7% performance improvement over the second-best model, SHViT, at faster speed. Further, we observe significant improvements in throughput and accuracy compared to prior works when scaling images or employing distillation training. Code is available at https://github.com/mlvlab/EfficientViM.
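The idea of mixing channels within hidden states can be illustrated with a small NumPy sketch. This is not taken from the EfficientViM code: it assumes a simplified, non-causal, single-head reading in which L tokens are first aggregated into N shared hidden states (with N much smaller than L), and it uses a purely linear channel mixer so that the two orderings can be compared exactly. All names (`shared_hidden_state`, `mix_on_tokens`, `mix_on_hidden_states`, `B`, `C`, `a`, `W`) and all sizes are hypothetical.

```python
import numpy as np


def shared_hidden_state(x, B, a):
    """Aggregate L tokens into N hidden states: h = (a * B)^T x, shape (N, D)."""
    return (a[:, None] * B).T @ x


def mix_on_tokens(x, B, C, a, W):
    """Read out per-token features first, then mix channels on all L tokens."""
    h = shared_hidden_state(x, B, a)
    return (C @ h) @ W  # channel mixing costs about L * D * D multiply-adds


def mix_on_hidden_states(x, B, C, a, W):
    """Mix channels on the N hidden states first, then read out per-token features."""
    h = shared_hidden_state(x, B, a)
    return C @ (h @ W)  # channel mixing costs about N * D * D multiply-adds


rng = np.random.default_rng(0)
L, N, D = 196, 16, 64  # tokens, hidden states, channels (illustrative sizes only)
x = rng.standard_normal((L, D))  # token features
B = rng.standard_normal((L, N))  # input projection onto hidden states
C = rng.standard_normal((L, N))  # output projection back to tokens
a = rng.random(L)                # per-token weights (assumed scalar per token)
W = rng.standard_normal((D, D))  # linear channel mixer (assumption: no gating)

y_tokens = mix_on_tokens(x, B, C, a, W)
y_hidden = mix_on_hidden_states(x, B, C, a, W)

print("outputs match:", np.allclose(y_tokens, y_hidden))
print("mixer multiply-adds on tokens:", L * D * D)
print("mixer multiply-adds on hidden states:", N * D * D)
```

With a purely linear mixer the two orderings are mathematically equivalent by associativity of matrix multiplication, so the only difference is where the D x D projection is paid for. Once a nonlinearity such as gating is applied to the hidden states, the two orderings are no longer equivalent, and the layer described in the abstract is likely to include components that this sketch omits.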
