
FMViT: A multiple-frequency mixing Vision Transformer

November 9, 2023
Authors: Wei Tan, Yifeng Geng, Xuansong Xie
cs.AI

Abstract

The Transformer model has recently gained widespread adoption in computer vision tasks. However, because the time and memory complexity of self-attention is quadratic in the number of input tokens, most existing Vision Transformers (ViTs) struggle to achieve efficient performance in practical industrial deployment scenarios such as TensorRT and CoreML, where traditional CNNs excel. Although some recent attempts have been made to design CNN-Transformer hybrid architectures to tackle this problem, their overall performance has not met expectations. To address these challenges, we propose an efficient hybrid ViT architecture named FMViT. This approach enhances the model's expressive power by blending high-frequency and low-frequency features at varying frequencies, enabling it to capture both local and global information effectively. Additionally, we introduce deployment-friendly mechanisms such as Convolutional Multigroup Reparameterization (gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and the Convolutional Fusion Block (CFB) to further improve the model's performance and reduce computational overhead. Our experiments demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of the latency/accuracy trade-off across various vision tasks. On the TensorRT platform, FMViT outperforms ResNet101 by 2.5% (83.3% vs. 80.8%) in top-1 accuracy on the ImageNet dataset while maintaining similar inference latency. Moreover, FMViT achieves accuracy comparable to EfficientNet-B5 with a 43% improvement in inference speed. On CoreML, FMViT outperforms MobileOne by 2.6% in top-1 accuracy on the ImageNet dataset (78.5% vs. 75.9%), with inference latency comparable to MobileOne. Our code can be found at https://github.com/tany0699/FMViT.
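
The abstract only names the frequency-mixing idea; the actual architecture is defined in the paper and the linked repository. As a rough, unofficial illustration of how a block might blend a high-frequency (local, convolutional) path with a low-frequency (global, attention-over-pooled-tokens) path, the PyTorch sketch below may help. The class name FreqMixBlock, the pooling size, and the fusion scheme are assumptions made for this example, not FMViT's actual design.

```python
# Hypothetical sketch, not the official FMViT code: mix a high-frequency
# local branch (depthwise convolution) with a low-frequency global branch
# (self-attention over a pooled, low-resolution copy of the feature map).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FreqMixBlock(nn.Module):  # illustrative name, not from the paper
    def __init__(self, dim: int, pool_size: int = 7, heads: int = 4):
        super().__init__()
        # High-frequency path: local detail via a depthwise 3x3 convolution.
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim, bias=False),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        # Low-frequency path: global context via attention on pooled tokens.
        self.pool_size = pool_size
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fusion: 1x1 convolution over the two concatenated branches.
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        hi = self.local(x)                                 # local / high-frequency features
        lo = F.adaptive_avg_pool2d(x, self.pool_size)      # downsample to keep low frequencies
        tokens = self.norm(lo.flatten(2).transpose(1, 2))  # (B, pool*pool, C)
        tokens, _ = self.attn(tokens, tokens, tokens)      # global token mixing
        lo = tokens.transpose(1, 2).reshape(b, c, self.pool_size, self.pool_size)
        lo = F.interpolate(lo, size=(h, w), mode="bilinear", align_corners=False)
        return x + self.fuse(torch.cat([hi, lo], dim=1))   # residual fusion of both paths

if __name__ == "__main__":
    block = FreqMixBlock(dim=64)
    print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```

Running attention only over the pooled, low-resolution copy keeps the quadratic token cost small, which is one plausible way to address the deployment-latency concern the abstract raises; the paper's deployment-friendly modules (gMLP, RLMHSA, CFB) are not reproduced here.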