

Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

September 8, 2023
Authors: Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du
cs.AI

Abstract

Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by only activating a small subset of the model parameters for any given input token. As such, sparse MoEs have enabled unprecedented scalability, resulting in tremendous successes across domains such as natural language processing and computer vision. In this work, we instead explore the use of sparse MoEs to scale down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications. To this end, we propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts. We also propose a stable MoE training procedure that uses super-class information to guide the router. We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs. For example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with only 54M FLOPs inference cost, our MoE achieves an improvement of 4.66%.
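
The per-image routing idea lends itself to a compact illustration. Below is a minimal PyTorch sketch (not the authors' implementation; the class and variable names such as `PerImageMoE` are hypothetical) of an MoE feed-forward layer in which the router scores one pooled representation per image and dispatches the whole token sequence to the selected expert(s), instead of routing each patch independently.

```python
# Minimal sketch of per-image (rather than per-token) MoE routing.
# Hypothetical names; this is an illustration, not the paper's code.
import torch
import torch.nn as nn


class PerImageMoE(nn.Module):
    """MoE feed-forward layer that routes each whole image to its top-k experts."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 1):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # one score vector per image, not per patch
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) -- token sequence inside a ViT block
        image_repr = x.mean(dim=1)                      # pooled summary vector per image
        gate_logits = self.router(image_repr)           # (batch, num_experts)
        weights, expert_idx = gate_logits.softmax(-1).topk(self.top_k, dim=-1)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e            # images assigned to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None, None] * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = PerImageMoE(dim=192, hidden_dim=384)        # ViT-Tiny-like width
    tokens = torch.randn(4, 196, 192)                   # 4 images, 14x14 patches
    print(layer(tokens).shape)                          # torch.Size([4, 196, 192])
```

In the training procedure described in the abstract, the router is additionally guided by super-class information; one plausible way to realize that would be an auxiliary classification loss on the gate logits, which is omitted from this sketch for brevity.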