RMT: Retentive Networks Meet Vision Transformers

Abstract

The Transformer first appeared in natural language processing and was later brought to computer vision, where it has demonstrated excellent performance on vision tasks. Recently, however, the Retentive Network (RetNet) has emerged as an architecture with the potential to replace the Transformer, attracting widespread attention in the NLP community. We therefore ask whether transferring RetNet's idea to vision can bring similarly strong performance to vision tasks. To address this question, we combine RetNet and the Transformer to propose RMT. Inspired by RetNet, RMT introduces explicit decay into the vision backbone, giving the vision model prior knowledge about spatial distances. This distance-related spatial prior allows explicit control over the range of tokens that each token can attend to. In addition, to reduce the computational cost of global modeling, we decompose the modeling process along the two coordinate axes of the image. Extensive experiments demonstrate that RMT performs exceptionally well across a range of computer vision tasks. For example, RMT achieves 84.1% top-1 accuracy on ImageNet-1k using only 4.5G FLOPs. To the best of our knowledge, RMT achieves the highest top-1 accuracy among all models of similar size trained with the same strategy. Moreover, RMT significantly outperforms existing vision backbones on downstream tasks such as object detection, instance segmentation, and semantic segmentation. Our work is still in progress.
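
To make the two ideas above (a distance-related decay and a decomposition along the two image axes) more concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not state the decay formula, so the sketch assumes, purely for illustration, a decay of the form gamma raised to the Manhattan distance between two token positions. All function names and the parameter `gamma` are hypothetical. Under that assumption, the 2D decay mask factors exactly into the product of two 1D masks, one per axis, which is one reason an axis-wise decomposition is natural.

```python
# Illustrative sketch only. The decay form gamma ** (Manhattan distance) and
# every name below are assumptions made for this example; they are not taken
# from the abstract.

def decay_mask_1d(n, gamma):
    """1D decay mask: entry [i][j] = gamma ** |i - j|."""
    return [[gamma ** abs(i - j) for j in range(n)] for i in range(n)]


def grid_coords(height, width):
    """(row, column) of every token in a height x width grid, row-major."""
    return [(y, x) for y in range(height) for x in range(width)]


def decay_mask_2d(height, width, gamma):
    """2D decay mask over the flattened token grid.

    Entry [a][b] = gamma ** (|ya - yb| + |xa - xb|), so nearby tokens get a
    weight close to 1 and distant tokens are attenuated.
    """
    coords = grid_coords(height, width)
    return [
        [gamma ** (abs(ya - yb) + abs(xa - xb)) for (yb, xb) in coords]
        for (ya, xa) in coords
    ]


def separable_mask_2d(height, width, gamma):
    """The same mask rebuilt from two 1D masks, one per image axis."""
    rows = decay_mask_1d(height, gamma)  # decay along the vertical axis
    cols = decay_mask_1d(width, gamma)   # decay along the horizontal axis
    coords = grid_coords(height, width)
    return [
        [rows[ya][yb] * cols[xa][xb] for (yb, xb) in coords]
        for (ya, xa) in coords
    ]


if __name__ == "__main__":
    full = decay_mask_2d(2, 3, 0.5)
    factored = separable_mask_2d(2, 3, 0.5)
    print(full == factored)  # the two constructions agree for gamma = 0.5
    print(full[0][5])        # weight between opposite corners of the grid
```

With `gamma = 0.5` on a 2 x 3 grid, the two opposite corners are at Manhattan distance 3, so their mask entry is 0.5 ** 3 = 0.125. A smaller `gamma` narrows the range of tokens that receive a non-negligible weight, which corresponds to the explicit control over attention range described in the abstract.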

