

RMT: Retentive Networks Meet Vision Transformers

September 20, 2023
Authors: Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, Ran He
cs.AI

Abstract

The Transformer first appeared in the field of natural language processing and was later migrated to the computer vision domain, where it demonstrates excellent performance on vision tasks. Recently, however, the Retentive Network (RetNet) has emerged as an architecture with the potential to replace the Transformer, attracting widespread attention in the NLP community. We therefore ask whether transferring RetNet's ideas to vision can also bring outstanding performance to vision tasks. To address this, we combine RetNet and the Transformer to propose RMT. Inspired by RetNet, RMT introduces explicit decay into the vision backbone, bringing prior knowledge related to spatial distance into the vision model. This distance-related spatial prior allows explicit control over the range of tokens that each token can attend to. Additionally, to reduce the computational cost of global modeling, we decompose the modeling process along the two coordinate axes of the image. Extensive experiments demonstrate that RMT performs exceptionally well across a variety of computer vision tasks. For example, RMT achieves 84.1% Top1-acc on ImageNet-1k using merely 4.5G FLOPs. To the best of our knowledge, RMT achieves the highest Top1-acc among all models of similar size trained with the same strategy. Moreover, RMT significantly outperforms existing vision backbones on downstream tasks such as object detection, instance segmentation, and semantic segmentation. Our work is still in progress.
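The abstract sketches two mechanisms: a spatial-distance-dependent explicit decay applied to attention, and a decomposition of global modeling along the image's two coordinate axes. The PyTorch snippet below is a minimal sketch of one plausible reading, not the paper's actual implementation: the function names (`manhattan_decay_mask`, `axial_decayed_attn`), the decay rate `gamma`, and the way the two axial passes are chained are all our assumptions.

```python
import torch

def decay_mask_1d(n, gamma=0.9):
    # Entry (i, j) = gamma ** |i - j|: nearer tokens get higher weight.
    idx = torch.arange(n, dtype=torch.float32)
    return gamma ** (idx[:, None] - idx[None, :]).abs()

def manhattan_decay_mask(h, w, gamma=0.9):
    # Full 2-D mask: D[n, m] = gamma ** (|x_n - x_m| + |y_n - y_m|),
    # one reading of the distance-related spatial prior in the abstract.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], -1).float()  # (h*w, 2)
    manhattan = (coords[:, None] - coords[None, :]).abs().sum(-1)   # (h*w, h*w)
    return gamma ** manhattan

def decayed_attn(q, k, v, decay):
    # Softmax attention elementwise-weighted by the decay mask -- our
    # hedged reading of "explicit decay" transplanted from RetNet.
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return (attn * decay) @ v

def axial_decayed_attn(q, k, v, gamma=0.9):
    # q, k, v: (batch, H, W, dim). Global modeling decomposed along the
    # two image axes -- rows first, then columns -- so the cost falls
    # from O((H*W)^2) to roughly O(H*W*(H + W)).
    b, h, w, d = q.shape
    v = decayed_attn(q, k, v, decay_mask_1d(w, gamma))   # within each row
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))     # swap H and W
    v = decayed_attn(q, k, v, decay_mask_1d(h, gamma))   # within each column
    return v.transpose(1, 2)

# Toy usage: one 8x8 feature map with 16-dim tokens.
x = torch.randn(1, 8, 8, 16)
print(axial_decayed_attn(x, x, x).shape)  # torch.Size([1, 8, 8, 16])
```

Chaining the two 1-D passes is only an approximation of full 2-D attention with the Manhattan mask, but it illustrates why the decomposition reduces the cost of global modeling.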