EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Abstract

Vision transformers have shown great success due to their high model capacity. However, their remarkable performance comes with heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory-inefficient operations, especially the tensor reshaping and element-wise functions in multi-head self-attention (MHSA). Therefore, we design a new building block with a sandwich layout, i.e., a single memory-bound MHSA placed between efficient feed-forward network (FFN) layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that the attention maps share high similarities across heads, leading to computational redundancy. To address this, we present a cascaded group attention module that feeds attention heads with different splits of the full feature, which not only saves computation cost but also improves attention diversity. Comprehensive experiments demonstrate that EfficientViT outperforms existing efficient models, striking a good trade-off between speed and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy, while achieving 40.4% and 45.2% higher throughput on an Nvidia V100 GPU and an Intel Xeon CPU, respectively. Compared to the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% higher accuracy, while running 5.8x/3.7x faster on the GPU/CPU, and 7.4x faster when converted to ONNX format. Code and models are available at https://github.com/microsoft/Cream/tree/main/EfficientViT.
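The idea of feeding each attention head a different split of the full feature can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation (see the repository linked above for that). The function names, the weight shapes, and the toy sizes are hypothetical. The abstract states only that heads receive different splits; the cascade step used here (adding the previous head's output to the next head's split) and the final output projection are assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the tokens in x."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)


def cascaded_group_attention(x, head_weights, w_out):
    """Hypothetical sketch: head j attends over the j-th channel split of x.

    Assumption: from the second head on, the previous head's output is added
    to the current split before attention (the "cascade").
    """
    num_heads = len(head_weights)
    split = len(x[0]) // num_heads
    outputs, prev = [], None
    for j, (wq, wk, wv) in enumerate(head_weights):
        # Each head sees only its own slice of the channels.
        chunk = [row[j * split:(j + 1) * split] for row in x]
        if prev is not None:
            chunk = [[a + b for a, b in zip(r, p)] for r, p in zip(chunk, prev)]
        prev = self_attention(chunk, wq, wk, wv)
        outputs.append(prev)
    # Concatenate head outputs per token, then apply the assumed projection.
    concat = [[v for out in outputs for v in out[i]] for i in range(len(x))]
    return matmul(concat, w_out)


def random_matrix(rows, cols, rng):
    """Small random matrix used only to drive the toy example."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens, dim, heads, qk_dim = 4, 8, 2, 3  # toy sizes, chosen arbitrarily
    split = dim // heads
    x = random_matrix(tokens, dim, rng)
    head_weights = [
        (random_matrix(split, qk_dim, rng),   # query projection
         random_matrix(split, qk_dim, rng),   # key projection
         random_matrix(split, split, rng))    # value projection keeps the split width
        for _ in range(heads)
    ]
    w_out = random_matrix(dim, dim, rng)
    y = cascaded_group_attention(x, head_weights, w_out)
    print(len(y), len(y[0]))  # prints "4 8": same shape as the input
```

Because each head projects only dim / heads channels instead of the full feature, the per-head projection cost shrinks with the number of heads, which is one way to read the computation saving the abstract describes. In this sketch the value projection keeps the split width so that one head's output can be added to the next head's input.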
