Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts
September 8, 2023
Authors: Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du
cs.AI
Abstract
Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due
to their ability to decouple model size from inference efficiency by only
activating a small subset of the model parameters for any given input token. As
such, sparse MoEs have enabled unprecedented scalability, resulting in
tremendous successes across domains such as natural language processing and
computer vision. In this work, we instead explore the use of sparse MoEs to
scale down Vision Transformers (ViTs) to make them more attractive for
resource-constrained vision applications. To this end, we propose a simplified
and mobile-friendly MoE design where entire images rather than individual
patches are routed to the experts. We also propose a stable MoE training
procedure that uses super-class information to guide the router. We empirically
show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off
between performance and efficiency than the corresponding dense ViTs. For
example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense
counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with only
54M FLOPs inference cost, our MoE achieves an improvement of 4.66%.
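
The following is a minimal, hypothetical PyTorch sketch (not the authors' implementation) of the two ideas in the abstract: a mobile-friendly MoE block whose router scores come from the CLS token, so one routing decision is made per image rather than per patch, and an auxiliary cross-entropy loss that guides the router using super-class labels. Names and hyperparameters such as MobileMoEBlock, num_experts, and top_k are illustrative, and the one-expert-per-super-class pairing is a simplifying assumption.

```python
# Minimal sketch of per-image routing with super-class-guided router training.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MobileMoEBlock(nn.Module):
    """MoE feed-forward block in which the whole image is routed to experts.

    The router scores are computed from the CLS token only, so a single
    routing decision is made per image rather than per patch, which keeps
    the number of activated experts (and hence inference cost) small.
    """

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 1):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # per-image gating scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, num_tokens, dim); token 0 is assumed to be the CLS token.
        gate_logits = self.router(x[:, 0])                       # (batch, num_experts)
        weights, expert_ids = gate_logits.softmax(-1).topk(self.top_k, dim=-1)

        out = torch.zeros_like(x)
        for b in range(x.shape[0]):                              # dispatch one image at a time
            for w, e in zip(weights[b], expert_ids[b]):
                out[b] = out[b] + w * self.experts[int(e)](x[b])
        # gate_logits are returned so a super-class loss can supervise the router.
        return out, gate_logits


def router_superclass_loss(gate_logits: torch.Tensor, superclass: torch.Tensor) -> torch.Tensor:
    # Auxiliary loss sketch: push the router to send each image to the expert
    # associated with its semantic super-class (assumes one expert per
    # super-class, an illustrative simplification).
    return F.cross_entropy(gate_logits, superclass)
```

One plausible way to use such a block is to swap it in for the dense MLP of some Transformer layers and add the auxiliary router loss to the standard classification objective during training, so that the router learns a stable, semantically meaningful assignment of images to experts.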