RMT: Retentive Networks Meet Vision Transformers
September 20, 2023
Authors: Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, Ran He
cs.AI
Abstract
The Transformer first appeared in the field of natural language processing and was later migrated to the computer vision domain, where it demonstrates excellent performance on vision tasks. Recently, however, the Retentive Network (RetNet) has emerged as an architecture with the potential to replace the Transformer, attracting widespread attention in the NLP community. We therefore ask whether transferring RetNet's ideas to vision can also bring outstanding performance to vision tasks. To address this question, we combine RetNet and the Transformer to propose RMT. Inspired by RetNet, RMT introduces explicit decay into the vision backbone, bringing prior knowledge related to spatial distance into the vision model. This distance-related spatial prior allows explicit control over the range of tokens that each token can attend to. Additionally, to reduce the computational cost of global modeling, we decompose the modeling process along the two coordinate axes of the image. Extensive experiments demonstrate that RMT exhibits exceptional performance across various computer vision tasks. For example, RMT achieves 84.1% top-1 accuracy on ImageNet-1k using merely 4.5G FLOPs. To the best of our knowledge, RMT achieves the highest top-1 accuracy among all models of similar size trained with the same strategy. Moreover, RMT significantly outperforms existing vision backbones on downstream tasks such as object detection, instance segmentation, and semantic segmentation. Our work is still in progress.
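As a rough illustration of the mechanism the abstract describes, the sketch below builds a Manhattan-distance decay mask over a flattened H×W token grid and uses it to modulate standard softmax attention, so that each token's effective attention range shrinks with spatial distance. This is a minimal PyTorch sketch under assumed names (manhattan_decay_mask, spatially_decayed_attention) and an assumed decay factor gamma = 0.9; it is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def manhattan_decay_mask(height, width, gamma=0.9):
    """Decay mask D[i, j] = gamma ** ManhattanDistance(token_i, token_j).

    Tokens that lie far apart on the 2D grid receive exponentially smaller
    weights -- the distance-related spatial prior described in the abstract.
    gamma = 0.9 is an illustrative value, not the paper's setting.
    """
    ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().sum(-1)       # (N, N) Manhattan distances
    return gamma ** dist

def spatially_decayed_attention(q, k, v, decay_mask):
    """Softmax attention whose scores are modulated by the explicit spatial decay.

    q, k, v: (batch, N, dim) tokens of a flattened H*W feature map.
    """
    scores = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    scores = scores * decay_mask                      # suppress distant tokens
    scores = scores / scores.sum(-1, keepdim=True)    # renormalize each row
    return scores @ v

# Usage example: a 14x14 feature map with 64-dim tokens.
B, H, W, C = 2, 14, 14, 64
q = k = v = torch.randn(B, H * W, C)
out = spatially_decayed_attention(q, k, v, manhattan_decay_mask(H, W))
print(out.shape)  # torch.Size([2, 196, 64])
```

The decomposed form mentioned in the abstract would replace the single (H·W)×(H·W) mask with two 1D decay masks, applying the same decayed attention first along the rows and then along the columns of the image, which avoids the quadratic cost of fully global modeling.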