MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
July 5, 2023
Authors: Jakob Drachmann Havtorn, Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi
cs.AI
Abstract
The input tokens to Vision Transformers carry little semantic meaning as they
are defined as regular equal-sized patches of the input image, regardless of
its content. However, processing uniform background areas of an image should
not necessitate as much compute as dense, cluttered areas. To address this
issue, we propose a dynamic mixed-scale tokenization scheme for ViT, MSViT. Our
method introduces a conditional gating mechanism that selects the optimal token
scale for every image region, such that the number of tokens is dynamically
determined per input. The proposed gating module is lightweight, agnostic to
the choice of transformer backbone, and trained within a few epochs (e.g., 20
epochs on ImageNet) with little training overhead. In addition, to enhance the
conditional behavior of the gate during training, we introduce a novel
generalization of the batch-shaping loss. We show that our gating module is
able to learn meaningful semantics despite operating locally at the coarse
patch-level. We validate MSViT on the tasks of classification and segmentation,
where it leads to an improved accuracy-complexity trade-off.
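The conditional gating idea can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch module, not the authors' implementation; the name `MixedScaleGate`, the 32-pixel coarse scale, and the thresholding scheme are all illustrative assumptions. It embeds each coarse patch, predicts a scalar gate probability, and thresholds it to decide which regions should be re-tokenized at a finer scale:

```python
# Hypothetical sketch of MSViT-style per-region gating (not the authors' code).
# Each coarse patch gets a score; high-scoring (cluttered) regions are split
# into fine-scale tokens, while low-scoring (uniform) regions keep one coarse token.
import torch
import torch.nn as nn

class MixedScaleGate(nn.Module):
    def __init__(self, coarse_patch: int = 32, embed_dim: int = 64):
        super().__init__()
        # Lightweight scorer: one strided conv embeds each coarse patch,
        # then a linear head predicts a keep-vs-split logit per patch.
        self.embed = nn.Conv2d(3, embed_dim, kernel_size=coarse_patch,
                               stride=coarse_patch)
        self.scorer = nn.Linear(embed_dim, 1)

    def forward(self, images: torch.Tensor, threshold: float = 0.5):
        # images: (B, 3, H, W), with H and W divisible by the coarse patch size.
        feats = self.embed(images)                    # (B, D, Hc, Wc)
        tokens = feats.flatten(2).transpose(1, 2)     # (B, Hc*Wc, D)
        split_prob = torch.sigmoid(self.scorer(tokens)).squeeze(-1)
        split_mask = split_prob > threshold  # True -> tokenize region finely
        return split_prob, split_mask

gate = MixedScaleGate()
probs, mask = gate(torch.randn(2, 3, 224, 224))
print(mask.shape, mask.float().mean())  # (2, 49) coarse regions; split ratio
```

In a full model, the hard threshold would be replaced during training by a differentiable relaxation (e.g., a straight-through or Gumbel-style estimator) so the gate can be learned end to end, and fine-scale patches would only be extracted for the selected regions, which is what makes the token count dynamic per input.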
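The batch-shaping loss the paper generalizes can also be sketched. Batch-shaping pushes the empirical distribution of gate activations across a batch toward a chosen prior by penalizing a Cramer-von-Mises-style distance between the two CDFs; MSViT's novel generalization of this loss is not reproduced here. The sketch below assumes a Uniform(0, 1) prior, whose CDF is the identity, so the example stays self-contained:

```python
import torch

def batch_shaping_loss(gate_probs: torch.Tensor) -> torch.Tensor:
    """Cramer-von-Mises-style distance between the empirical CDF of the gate
    activations and a Uniform(0, 1) prior (whose CDF is the identity).
    Illustrative stand-in only, not the paper's generalized loss."""
    x, _ = torch.sort(gate_probs.flatten())  # order statistics of activations
    n = x.numel()
    # Empirical CDF evaluated at the sorted samples: (i - 0.5) / n, i = 1..n
    ecdf = (torch.arange(1, n + 1, dtype=x.dtype, device=x.device) - 0.5) / n
    prior_cdf = x  # Uniform(0, 1); substitute another prior's CDF to reshape
    return torch.mean((prior_cdf - ecdf) ** 2)

loss = batch_shaping_loss(torch.rand(2, 49))  # gate probs pooled over a batch
```

Matching a distribution over the whole batch, rather than penalizing each gate independently, is what encourages the gate to behave conditionally: some regions are pushed toward fine tokenization and others toward coarse, instead of all collapsing to one scale.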