EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
May 11, 2023
作者: Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, Yixuan Yuan
cs.AI
Abstract
Vision transformers have shown great success due to their high model
capabilities. However, their remarkable performance is accompanied by heavy
computation costs, which makes them unsuitable for real-time applications. In
this paper, we propose a family of high-speed vision transformers named
EfficientViT. We find that the speed of existing transformer models is commonly
bounded by memory-inefficient operations, especially the tensor reshaping and
element-wise functions in multi-head self-attention (MHSA). Therefore, we
design a new building block with
a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN
layers, which improves memory efficiency while enhancing channel communication.
Moreover, we discover that the attention maps share high similarities across
heads, leading to computational redundancy. To address this, we present a
cascaded group attention module feeding attention heads with different splits
of the full feature, which not only saves computation cost but also improves
attention diversity. Comprehensive experiments demonstrate EfficientViT
outperforms existing efficient models, striking a good trade-off between speed
and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by
1.9% in accuracy, while getting 40.4% and 45.2% higher throughput on Nvidia
V100 GPU and Intel Xeon CPU, respectively. Compared to the recent efficient
model MobileViT-XXS, EfficientViT-M2 achieves 1.8% higher accuracy, while
running 5.8x/3.7x faster on the GPU/CPU, and 7.4x faster when converted to ONNX
format. Code and models are available at
https://github.com/microsoft/Cream/tree/main/EfficientViT.
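To make the sandwich layout concrete, here is a minimal PyTorch-style sketch: several FFN layers, a single self-attention layer, then several more FFN layers. It uses torch.nn.MultiheadAttention as a stand-in for the paper's MHSA, and the layer counts and widths are illustrative placeholders, not the released EfficientViT configuration (see the repository above for the official code).

```python
import torch.nn as nn

class FFN(nn.Module):
    """A plain two-layer feed-forward block with a residual connection (illustrative)."""
    def __init__(self, dim, hidden_ratio=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * hidden_ratio),
            nn.ReLU(),
            nn.Linear(dim * hidden_ratio, dim),
        )

    def forward(self, x):
        return x + self.net(x)

class SandwichBlock(nn.Module):
    """Sandwich layout: N FFNs, one memory-bound self-attention layer, N FFNs.

    A sketch of the idea only; hyper-parameters are assumptions, not the
    paper's exact settings.
    """
    def __init__(self, dim, num_heads=4, num_ffn=2):
        super().__init__()
        self.pre_ffns = nn.ModuleList(FFN(dim) for _ in range(num_ffn))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.post_ffns = nn.ModuleList(FFN(dim) for _ in range(num_ffn))

    def forward(self, x):  # x: (batch, tokens, dim)
        for ffn in self.pre_ffns:
            x = ffn(x)
        attn_out, _ = self.attn(x, x, x)
        x = x + attn_out  # the single MHSA, with a residual connection
        for ffn in self.post_ffns:
            x = ffn(x)
        return x
```

Using fewer attention layers per block shifts compute toward the FFNs, which are friendlier to memory bandwidth; that is the trade-off the layout exploits.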
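The cascaded group attention idea (each head attends over a different split of the full feature, and each head's output is added to the next head's input split) can likewise be sketched. The per-head projection layout and names below are illustrative assumptions; the repository above holds the authoritative implementation.

```python
import torch
import torch.nn as nn

class CascadedGroupAttention(nn.Module):
    """Sketch of cascaded group attention.

    The feature is split across heads, and each head's output feeds into the
    next head's split (the cascade), which reduces the redundancy observed
    when all heads see the same full feature.
    """
    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # one Q/K/V projection per head, acting only on that head's split
        self.qkvs = nn.ModuleList(
            nn.Linear(self.head_dim, self.head_dim * 3) for _ in range(num_heads)
        )
        self.proj = nn.Linear(dim, dim)
        self.scale = self.head_dim ** -0.5

    def forward(self, x):  # x: (batch, tokens, dim)
        splits = x.chunk(self.num_heads, dim=-1)  # one split per head
        outs, feat = [], 0
        for split, qkv in zip(splits, self.qkvs):
            feat = split + feat  # cascade: add the previous head's output
            q, k, v = qkv(feat).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            feat = attn.softmax(dim=-1) @ v
            outs.append(feat)
        return self.proj(torch.cat(outs, dim=-1))
```

Because each head processes only dim/num_heads channels, the per-head QKV projections are cheaper than full-width ones, while the cascade gives later heads progressively refined inputs.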