FMViT: A multiple-frequency mixing Vision Transformer

Abstract

Transformer models have recently been widely adopted for computer vision tasks. However, because self-attention has time and memory complexity quadratic in the number of input tokens, most existing Vision Transformers (ViTs) struggle to run efficiently in practical industrial deployment scenarios, such as TensorRT and CoreML, where traditional CNNs excel. Although recent work has attempted to address this problem with CNN-Transformer hybrid architectures, their overall performance has not met expectations. To tackle these challenges, we propose an efficient hybrid ViT architecture named FMViT. This approach enhances the model's expressive power by blending high-frequency features and low-frequency features with varying frequencies, enabling it to capture both local and global information effectively. Additionally, we introduce deploy-friendly mechanisms such as Convolutional Multigroup Reparameterization (gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and Convolutional Fusion Block (CFB) to further improve the model's performance and reduce computational overhead. Our experiments demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks. On the TensorRT platform, FMViT outperforms ResNet101 by 2.5% (83.3% vs. 80.8%) in top-1 accuracy on the ImageNet dataset while maintaining similar inference latency. Moreover, FMViT achieves performance comparable to EfficientNet-B5, but with a 43% improvement in inference speed. On CoreML, FMViT outperforms MobileOne by 2.6% (78.5% vs. 75.9%) in top-1 accuracy on the ImageNet dataset at comparable inference latency. Our code can be found at https://github.com/tany0699/FMViT.
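To make the quadratic cost of self-attention concrete, the following minimal sketch counts the tokens a ViT-style model would produce for a square image and the number of entries in the resulting token-by-token attention score matrix. The patch size of 16, the image sizes, and the helper names `num_tokens` and `attention_matrix_entries` are illustrative assumptions; they are not taken from the FMViT paper or its code.

```python
def num_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches (tokens) in a square image.

    patch_size=16 is an assumed, commonly used ViT setting, not FMViT's.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def attention_matrix_entries(tokens: int) -> int:
    """Entries in one head's token-by-token attention score matrix."""
    return tokens * tokens


if __name__ == "__main__":
    # Doubling the image side quadruples the token count and
    # multiplies the attention matrix size by sixteen.
    for size in (224, 448, 896):
        n = num_tokens(size)
        print(f"image {size}x{size}: {n} tokens, "
              f"{attention_matrix_entries(n)} attention entries per head")
```

Under these assumptions, a 224x224 input yields 196 tokens and a 448x448 input yields 784, so the per-head attention matrix grows by a factor of 16 when the image side doubles. This growth is the deployment cost that hybrid designs such as the one described above aim to reduce.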
