FasterViT: Fast Vision Transformers with Hierarchical Attention
June 9, 2023
Authors: Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov
cs.AI
Abstract
We design a new family of hybrid CNN-ViT neural networks, named FasterViT,
with a focus on high image throughput for computer vision (CV) applications.
FasterViT combines the benefits of fast local representation learning in CNNs
and global modeling properties in ViT. Our newly introduced Hierarchical
Attention (HAT) approach decomposes global self-attention with quadratic
complexity into a multi-level attention with reduced computational costs. We
benefit from efficient window-based self-attention. Each window has access to
dedicated carrier tokens that participate in local and global representation
learning. At a high level, global self-attentions enable the efficient
cross-window communication at lower costs. FasterViT achieves a SOTA
Pareto-front in terms of accuracy vs. image throughput. We have extensively
validated its effectiveness on various CV tasks including classification,
object detection and segmentation. We also show that HAT can be used as a
plug-and-play module for existing networks and enhance them. We further
demonstrate significantly faster and more accurate performance than competitive
counterparts for images with high resolution. Code is available at
https://github.com/NVlabs/FasterViT.
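
The abstract's two-level scheme can be sketched as follows: carrier tokens first attend to each other globally (cross-window communication), then each window attends locally over its own tokens plus its carrier token. This is a minimal numpy illustration of the attention flow only; the actual FasterViT uses learned Q/K/V projections, multiple heads, and additional components, and all names here (`hierarchical_attention`, shapes, seed) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # plain scaled dot-product attention (no learned projections here)
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def hierarchical_attention(windows, carriers):
    """windows: (W, N, D) local tokens per window; carriers: (W, D),
    one dedicated carrier token per window (simplified)."""
    # 1) global self-attention among carrier tokens only:
    #    O(W^2) instead of O((W*N)^2) for full global attention
    carriers = attention(carriers, carriers, carriers)
    out = np.empty_like(windows)
    new_carriers = np.empty_like(carriers)
    # 2) local window attention over [window tokens ++ its carrier token],
    #    so global context flows back into each window
    for w in range(windows.shape[0]):
        tokens = np.concatenate([windows[w], carriers[w:w + 1]], axis=0)
        mixed = attention(tokens, tokens, tokens)
        out[w], new_carriers[w] = mixed[:-1], mixed[-1]
    return out, new_carriers

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 8))  # 4 windows of 16 tokens, dim 8
c = rng.standard_normal((4, 8))      # one carrier token per window
y, c2 = hierarchical_attention(x, c)
print(y.shape, c2.shape)  # (4, 16, 8) (4, 8)
```

The cost reduction comes from replacing one attention over all W*N tokens with a small global attention over W carrier tokens plus W independent local attentions over N+1 tokens each.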