
FasterViT: Fast Vision Transformers with Hierarchical Attention

June 9, 2023
作者: Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications. FasterViT combines the benefits of fast local representation learning in CNNs and global modeling properties in ViTs. Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention, which has quadratic complexity, into a multi-level attention with reduced computational cost. We benefit from efficient window-based self-attention: each window has access to dedicated carrier tokens that participate in local and global representation learning. At a high level, global self-attention over carrier tokens enables efficient cross-window communication at lower cost. FasterViT achieves a SOTA Pareto front in terms of accuracy vs. image throughput. We have extensively validated its effectiveness on various CV tasks including classification, object detection, and segmentation. We also show that HAT can be used as a plug-and-play module for existing networks and enhances them. We further demonstrate significantly faster and more accurate performance than competitive counterparts for high-resolution images. Code is available at https://github.com/NVlabs/FasterViT.
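To make the decomposition concrete, the sketch below illustrates the general idea of hierarchical attention as described in the abstract: tokens attend only within their window, each window is summarized by a carrier token, and the carrier tokens attend globally among themselves, replacing O(N²) full attention with window-local attention plus O(W²) carrier attention. This is a minimal NumPy sketch under simplifying assumptions (single head, one mean-pooled carrier token per window, no projections or residuals), not the paper's implementation — see the FasterViT repository for the actual method.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention (batched over leading dims)
    d = q.shape[-1]
    return softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d)) @ v

def hierarchical_attention(x, num_windows, tokens_per_window, dim):
    """x: (num_windows * tokens_per_window, dim) flattened local tokens."""
    windows = x.reshape(num_windows, tokens_per_window, dim)
    # one carrier token per window; mean pooling is a simplifying assumption
    carriers = windows.mean(axis=1)                    # (num_windows, dim)
    # global attention among carrier tokens only: O(num_windows^2), not O(N^2)
    carriers = attention(carriers, carriers, carriers)
    # prepend each window's carrier so local attention sees global context
    aug = np.concatenate([carriers[:, None, :], windows], axis=1)
    out = attention(aug, aug, aug)                     # window-local attention
    return out[:, 1:, :].reshape(-1, dim)              # drop the carrier slots

rng = np.random.default_rng(0)
x = rng.standard_normal((4 * 16, 32))
y = hierarchical_attention(x, num_windows=4, tokens_per_window=16, dim=32)
print(y.shape)  # (64, 32): same token layout in, same out
```

With 4 windows of 16 tokens each, every attention matrix here is at most 17×17 or 4×4, versus the single 64×64 matrix full global self-attention would require; the carrier tokens are the only channel for cross-window communication.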