RMT: Retentive Networks Meet Vision Transformers

Abstract

The Transformer first appeared in natural language processing and was later adapted to computer vision, where it has shown excellent performance on vision tasks. Recently, however, the Retentive Network (RetNet) has emerged as an architecture with the potential to replace the Transformer, attracting widespread attention in the NLP community. We therefore ask whether transferring RetNet's idea to vision can bring similarly strong performance to vision tasks. To address this question, we combine RetNet and the Transformer to propose RMT. Inspired by RetNet, RMT introduces explicit decay into the vision backbone, giving the vision model prior knowledge about spatial distances. This distance-related spatial prior allows explicit control over the range of tokens that each token can attend to. In addition, to reduce the computational cost of global modeling, we decompose the modeling process along the two coordinate axes of the image. Extensive experiments demonstrate that RMT performs exceptionally well across a variety of computer vision tasks. For example, RMT achieves 84.1% top-1 accuracy on ImageNet-1k using only 4.5G FLOPs. To the best of our knowledge, RMT achieves the highest top-1 accuracy among all models of similar size trained with the same strategy. Moreover, RMT significantly outperforms existing vision backbones on downstream tasks such as object detection, instance segmentation, and semantic segmentation. Our work is still in progress.
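
To make the distance-related prior and the two-axis decomposition concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that distance is measured as Manhattan distance between token positions on the image grid and that a single decay rate `gamma` is used; the abstract specifies neither. The function names `axis_decay`, `manhattan_decay`, and `separable_decay` are hypothetical, chosen only for illustration.

```python
def axis_decay(length, gamma):
    """1D decay matrix: entry [i][j] is gamma ** |i - j|."""
    return [[gamma ** abs(i - j) for j in range(length)] for i in range(length)]


def manhattan_decay(height, width, gamma):
    """2D decay matrix over a height x width token grid, flattened row by row.

    Assumption: entry [n][m] is gamma ** (|y_n - y_m| + |x_n - x_m|), so tokens
    that are farther apart on the grid receive a smaller weight.
    """
    coords = [(y, x) for y in range(height) for x in range(width)]
    return [
        [gamma ** (abs(y1 - y2) + abs(x1 - x2)) for (y2, x2) in coords]
        for (y1, x1) in coords
    ]


def separable_decay(height, width, gamma):
    """Rebuild the same 2D matrix from two 1D matrices, one per image axis."""
    rows = axis_decay(height, gamma)
    cols = axis_decay(width, gamma)
    coords = [(y, x) for y in range(height) for x in range(width)]
    return [
        [rows[y1][y2] * cols[x1][x2] for (y2, x2) in coords]
        for (y1, x1) in coords
    ]


if __name__ == "__main__":
    full = manhattan_decay(2, 3, 0.5)
    # Decay weights seen by the top-left token of a 2 x 3 grid.
    print(full[0])  # [1.0, 0.5, 0.25, 0.5, 0.25, 0.125]
    # The 2D prior factors exactly into a row term times a column term.
    print(full == separable_decay(2, 3, 0.5))  # True
```

Under these assumptions the decay prior separates into one factor per image axis, which suggests why modeling along each axis in turn can preserve the spatial prior while avoiding a full token-by-token computation. How RMT applies this inside its attention layers is not described in the abstract.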
