FMViT: A multiple-frequency mixing Vision Transformer

Abstract

Transformer models have recently been widely adopted in computer vision. However, because the time and memory cost of self-attention grows quadratically with the number of input tokens, most existing Vision Transformers (ViTs) struggle to run efficiently in practical industrial deployment scenarios such as TensorRT and CoreML, where traditional CNNs excel. Some recent work has tried to address this problem with CNN-Transformer hybrid architectures, but their overall performance has not met expectations. To tackle these challenges, we propose an efficient hybrid ViT architecture named FMViT. It enhances the model's expressive power by mixing high-frequency and low-frequency features at multiple frequencies, enabling it to capture both local and global information effectively. Additionally, we introduce deployment-friendly mechanisms such as Convolutional Multigroup Reparameterization (gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and Convolutional Fusion Block (CFB) to further improve the model's performance and reduce computational overhead. Our experiments demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in latency/accuracy trade-off across various vision tasks. On the TensorRT platform, FMViT outperforms ResNet101 by 2.5% in top-1 accuracy on the ImageNet dataset (83.3% vs. 80.8%) while maintaining similar inference latency. Moreover, FMViT achieves performance comparable to EfficientNet-B5 with a 43% improvement in inference speed. On CoreML, FMViT outperforms MobileOne by 2.6% in top-1 accuracy on the ImageNet dataset (78.5% vs. 75.9%) at comparable inference latency. Our code is available at https://github.com/tany0699/FMViT.
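
The quadratic cost of self-attention mentioned in the abstract can be illustrated with a small back-of-the-envelope sketch. This is not taken from the FMViT code. The patch size of 16 and the image sizes 224 and 448 are assumptions chosen for illustration, and the helper names `num_tokens` and `attention_entries` are hypothetical.

```python
# Illustrative sketch only: counts tokens and attention-score entries for a
# plain ViT-style patch tokenization. Patch size and image sizes are assumed.


def num_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches (tokens) in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


def attention_entries(tokens: int) -> int:
    """Entries in one head's token-by-token attention score matrix."""
    return tokens * tokens


if __name__ == "__main__":
    for size in (224, 448):
        n = num_tokens(size)
        print(f"image {size}x{size}: {n} tokens, {attention_entries(n)} score entries")
```

Under these assumptions, doubling the image side length quadruples the token count and multiplies the size of each attention score matrix by sixteen, which is the scaling that makes high-resolution inputs costly for full self-attention.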
