EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Abstract

Vision transformers have shown great success due to their high model capacity. However, their strong performance comes with heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory-inefficient operations, especially the tensor reshaping and element-wise functions in multi-head self-attention (MHSA). Therefore, we design a new building block with a sandwich layout, i.e., a single memory-bound MHSA placed between efficient feed-forward network (FFN) layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that attention maps are highly similar across heads, leading to computational redundancy. To address this, we present a cascaded group attention module that feeds each attention head a different split of the full feature, which not only saves computation cost but also improves attention diversity. Comprehensive experiments demonstrate that EfficientViT outperforms existing efficient models, striking a good trade-off between speed and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy, while achieving 40.4% and 45.2% higher throughput on an Nvidia V100 GPU and an Intel Xeon CPU, respectively. Compared to the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% higher accuracy, while running 5.8x/3.7x faster on the GPU/CPU, and 7.4x faster when converted to ONNX format. Code and models are available at https://github.com/microsoft/Cream/tree/main/EfficientViT.
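To make the cascaded group attention idea more concrete, the following is a minimal sketch in plain Python, not the released implementation. It assumes that each head attends over its own channel split of the input and that the output of one head is added to the input split of the next head before the head outputs are concatenated. The function names, the random matrices standing in for learned projection weights, and the equal query/key/value widths are illustrative assumptions; the final output projection is omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row of a matrix."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for a learned projection matrix."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def cascaded_group_attention(x, num_heads, rng):
    """x holds N tokens with C channels each; C must be divisible by num_heads."""
    n_tokens, channels = len(x), len(x[0])
    assert channels % num_heads == 0
    d = channels // num_heads
    head_outputs = []
    carry = None  # output of the previous head, if any
    for h in range(num_heads):
        # Each head sees only its own slice of the channels, not the full feature.
        split = [tok[h * d:(h + 1) * d] for tok in x]
        # Cascade (assumed): add the previous head's output to this head's input.
        if carry is not None:
            split = [[a + b for a, b in zip(s, c)] for s, c in zip(split, carry)]
        wq, wk, wv = (random_matrix(d, d, rng) for _ in range(3))
        q, k, v = matmul(split, wq), matmul(split, wk), matmul(split, wv)
        k_t = [list(col) for col in zip(*k)]
        scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
        carry = matmul(softmax_rows(scores), v)
        head_outputs.append(carry)
    # Concatenate the per-head outputs along the channel axis.
    return [sum((head[i] for head in head_outputs), []) for i in range(n_tokens)]


rng = random.Random(0)
tokens = random_matrix(4, 8, rng)  # 4 tokens, 8 channels
output = cascaded_group_attention(tokens, num_heads=4, rng=rng)
print(len(output), len(output[0]))
```

Because every head projects only C/h channels instead of all C, the per-head projection cost shrinks with the number of heads, which is one way to read the computation saving the abstract describes.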
