Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

Abstract

Sparse Mixture-of-Experts models (MoEs) have recently gained popularity because they decouple model size from inference efficiency by activating only a small subset of the model parameters for any given input token. As a result, sparse MoEs have enabled unprecedented scalability, leading to major successes in domains such as natural language processing and computer vision. In this work, we instead explore the use of sparse MoEs to scale down Vision Transformers (ViTs) and make them more attractive for resource-constrained vision applications. To this end, we propose a simplified, mobile-friendly MoE design in which entire images, rather than individual patches, are routed to the experts. We also propose a stable MoE training procedure that uses super-class information to guide the router. We show empirically that our sparse Mobile Vision MoEs (V-MoEs) achieve a better trade-off between performance and efficiency than the corresponding dense ViTs. For example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with an inference cost of only 54M FLOPs, our MoE achieves an improvement of 4.66%.
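To make the idea of routing entire images rather than individual patches concrete, the following is a minimal NumPy sketch, not the authors' implementation. Several details are assumptions made only for illustration: the router input is taken to be the mean of an image's patch tokens, the experts are two-layer ReLU MLPs, and the names `per_image_moe`, `router_w`, `expert_w1`, and `expert_w2` are hypothetical. The super-class-guided training procedure is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def per_image_moe(tokens, router_w, expert_w1, expert_w2, k=1):
    """Route each image (not each patch) to its top-k experts.

    tokens:    (batch, patches, dim)
    router_w:  (dim, n_experts)
    expert_w1: (n_experts, dim, hidden)
    expert_w2: (n_experts, hidden, dim)
    """
    # Assumption: one routing decision per image, from mean-pooled tokens.
    pooled = tokens.mean(axis=1)                   # (batch, dim)
    probs = softmax(pooled @ router_w)             # (batch, n_experts)
    top_k = np.argsort(-probs, axis=-1)[:, :k]     # (batch, k)

    out = np.zeros_like(tokens)
    for b in range(tokens.shape[0]):
        for e in top_k[b]:
            # Every patch of image b passes through the same expert e.
            hidden = np.maximum(tokens[b] @ expert_w1[e], 0.0)
            out[b] += probs[b, e] * (hidden @ expert_w2[e])
    return out, top_k

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
batch, patches, dim, hidden, n_experts = 4, 16, 8, 32, 5
tokens = rng.normal(size=(batch, patches, dim))
router_w = rng.normal(size=(dim, n_experts))
expert_w1 = rng.normal(size=(n_experts, dim, hidden))
expert_w2 = rng.normal(size=(n_experts, hidden, dim))

out, chosen = per_image_moe(tokens, router_w, expert_w1, expert_w2, k=1)
```

With `k=1`, only one expert's weights are touched per image, so the compute per image matches a single MLP even though the layer holds `n_experts` sets of parameters. A per-patch router would instead make one decision per token and could activate several experts for the same image.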
