FMViT: A multiple-frequency mixing Vision Transformer
November 9, 2023
Authors: Wei Tan, Yifeng Geng, Xuansong Xie
cs.AI
Abstract
The Transformer model has gained widespread adoption in computer vision tasks
in recent years. However, because the time and memory complexity of
self-attention grows quadratically with the number of input tokens, most
existing Vision Transformers (ViTs) struggle to achieve efficient performance
in practical industrial deployment scenarios, such as TensorRT and CoreML,
where traditional CNNs excel. Although recent attempts have been made to design
CNN-Transformer hybrid architectures to address this problem, their overall
performance has not met expectations. To tackle these challenges,
we propose an efficient hybrid ViT architecture named FMViT. This approach
enhances the model's expressive power by blending high-frequency features and
low-frequency features with varying frequencies, enabling it to capture both
local and global information effectively. Additionally, we introduce
deploy-friendly mechanisms such as Convolutional Multigroup Reparameterization
(gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and Convolutional
Fusion Block (CFB) to further improve the model's performance and reduce
computational overhead. Our experiments demonstrate that FMViT surpasses
existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of
latency/accuracy trade-offs for various vision tasks. On the TensorRT platform,
FMViT outperforms ResNet101 by 2.5% (83.3% vs. 80.8%) in top-1 accuracy on the
ImageNet dataset while maintaining similar inference latency. Moreover, FMViT
achieves performance comparable to EfficientNet-B5, but with a 43%
improvement in inference speed. On CoreML, FMViT outperforms MobileOne by 2.6%
in top-1 accuracy on the ImageNet dataset (78.5% vs. 75.9%), with inference
latency comparable to that of MobileOne. Our code can be found at
https://github.com/tany0699/FMViT.
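
The abstract does not describe the FMViT block in detail, so the following is only a minimal PyTorch sketch of one way to mix high-frequency (local, convolutional) and low-frequency (global, attention over downsampled tokens) features. The module name FrequencyMixingBlock, the pooling factor, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: combines a convolutional (high-frequency, local)
# branch with a downsampled self-attention (low-frequency, global) branch as
# one plausible reading of "multiple-frequency mixing". Names and settings
# here are assumptions, not FMViT's actual design.
import torch
import torch.nn as nn


class FrequencyMixingBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, pool_size: int = 4):
        super().__init__()
        # High-frequency branch: depthwise conv keeps fine-grained local detail.
        self.local_branch = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        # Low-frequency branch: average-pool to a coarse grid, run multi-head
        # self-attention over the few remaining tokens, then upsample back.
        self.pool = nn.AvgPool2d(pool_size)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Fusion: 1x1 conv mixes the concatenated branches back to `dim` channels.
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local_branch(x)

        # Global branch operates on the downsampled feature map (cheap attention).
        g = self.pool(x)                              # (B, C, h', w')
        gh, gw = g.shape[2], g.shape[3]
        tokens = g.flatten(2).transpose(1, 2)         # (B, h'*w', C)
        tokens, _ = self.attn(*(self.norm(tokens),) * 3)
        g = tokens.transpose(1, 2).reshape(b, c, gh, gw)
        g = nn.functional.interpolate(g, size=(h, w), mode="nearest")

        # Mix high- and low-frequency information and keep a residual path.
        return x + self.fuse(torch.cat([local, g], dim=1))


if __name__ == "__main__":
    block = FrequencyMixingBlock(dim=64)
    out = block(torch.randn(1, 64, 56, 56))
    print(out.shape)  # torch.Size([1, 64, 56, 56])
```

Running self-attention only on the pooled, low-frequency tokens keeps its quadratic cost small while the convolutional branch preserves local detail, which is consistent with the latency-oriented deployment goals stated in the abstract; the exact block structure, reparameterization (gMLP), RLMHSA, and CFB details are given in the paper itself.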