EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
May 11, 2023
Authors: Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, Yixuan Yuan
cs.AI
Abstract
Vision transformers have shown great success due to their high model
capacity. However, their remarkable performance is accompanied by heavy
computation costs, which makes them unsuitable for real-time applications. In
this paper, we propose a family of high-speed vision transformers named
EfficientViT. We find that the speed of existing transformer models is commonly
bounded by memory-inefficient operations, especially the tensor reshaping and
element-wise functions in MHSA. Therefore, we design a new building block with
a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN
layers, which improves memory efficiency while enhancing channel communication.
Moreover, we discover that the attention maps share high similarities across
heads, leading to computational redundancy. To address this, we present a
cascaded group attention module feeding attention heads with different splits
of the full feature, which not only saves computation cost but also improves
attention diversity. Comprehensive experiments demonstrate that EfficientViT
outperforms existing efficient models, striking a good trade-off between speed
and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by
1.9% in accuracy, while achieving 40.4% and 45.2% higher throughput on Nvidia
V100 GPU and Intel Xeon CPU, respectively. Compared to the recent efficient
model MobileViT-XXS, EfficientViT-M2 achieves 1.8% higher accuracy, while
running 5.8x/3.7x faster on the GPU/CPU, and 7.4x faster when converted to ONNX
format. Code and models are available at
https://github.com/microsoft/Cream/tree/main/EfficientViT.
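
To make the abstract's two ideas concrete, below is a minimal PyTorch sketch of a sandwich-layout block built around cascaded group attention. The class names (CascadedGroupAttention, SandwichBlock), the per-head projection shapes, and the FFN expansion ratio are illustrative assumptions, not the authors' implementation; the reference code is in the linked repository.

import torch
import torch.nn as nn


class CascadedGroupAttention(nn.Module):
    """Each head attends over its own channel split of the input, and the
    output of head i is added to the input of head i+1 (the cascade)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # One q/k/v projection per head, each seeing only its own split.
        self.qkv = nn.ModuleList(
            [nn.Linear(self.head_dim, 3 * self.head_dim) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, channels)
        splits = x.chunk(self.num_heads, dim=-1)
        outs, carry = [], 0
        for i, head_in in enumerate(splits):
            head_in = head_in + carry  # cascade the previous head's output
            q, k, v = self.qkv[i](head_in).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            carry = attn.softmax(dim=-1) @ v
            outs.append(carry)
        return self.proj(torch.cat(outs, dim=-1))


class SandwichBlock(nn.Module):
    """Sandwich layout: a single memory-bound attention layer placed
    between cheap FFN layers, each with a residual connection."""

    def __init__(self, dim: int, num_heads: int = 4, ffn_ratio: int = 2):
        super().__init__()

        def ffn() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(dim, ffn_ratio * dim),
                nn.ReLU(),
                nn.Linear(ffn_ratio * dim, dim),
            )

        self.ffn_in, self.ffn_out = ffn(), ffn()
        self.attn = CascadedGroupAttention(dim, num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.ffn_in(x)      # pre-attention FFN
        x = x + self.attn(x)        # the single attention layer
        return x + self.ffn_out(x)  # post-attention FFN


if __name__ == "__main__":
    block = SandwichBlock(dim=128, num_heads=4)
    y = block(torch.randn(2, 196, 128))
    print(y.shape)  # torch.Size([2, 196, 128])

The sketch only illustrates the head-wise cascade and the FFN-attention-FFN ordering; normalization and the paper's other refinements are omitted, so consult the repository for the exact architecture.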