Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing
June 30, 2023
Authors: Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz
cs.AI
Abstract
Vision transformers (ViTs) have significantly changed the computer vision
landscape and have periodically exhibited superior performance in vision tasks
compared to convolutional neural networks (CNNs). Although the jury is still
out on which model type is superior, each has unique inductive biases that
shape their learning and generalization performance. For example, ViTs have
interesting properties with respect to early layer non-local feature
dependence, as well as self-attention mechanisms which enhance learning
flexibility, enabling them to ignore out-of-context image information more
effectively. We hypothesize that this power to ignore out-of-context
information (which we name patch selectivity), while integrating
in-context information in a non-local manner in early layers, allows ViTs to
more easily handle occlusion. In this study, our aim is to see whether we can
have CNNs simulate this ability of patch selectivity by effectively
hardwiring this inductive bias using Patch Mixing data augmentation, which
consists of inserting patches from another image onto a training image and
interpolating labels between the two image classes. Specifically, we use Patch
Mixing to train state-of-the-art ViTs and CNNs, assessing its impact on their
ability to ignore out-of-context patches and handle natural occlusions. We find
that ViTs neither improve nor degrade when trained using Patch Mixing, but CNNs
acquire new capabilities to ignore out-of-context information and improve on
occlusion benchmarks, leaving us to conclude that this training method is a way
of simulating in CNNs the abilities that ViTs already possess. We will release
our Patch Mixing implementation and proposed datasets for public use. Project
page: https://arielnlee.github.io/PatchMixing/
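
The Patch Mixing augmentation described above replaces a subset of patches in a training image with patches drawn from another image and interpolates the labels in proportion to the number of replaced patches. Below is a minimal PyTorch sketch of that idea; the function name `patch_mixing` and the `patch_size` and `mix_ratio` parameters are illustrative assumptions, not the authors' released implementation.

```python
import torch

def patch_mixing(images, labels, num_classes, patch_size=16, mix_ratio=0.3):
    """Replace a random subset of patches in each image with patches from
    another image in the batch, and mix the one-hot labels in proportion
    to the fraction of replaced patches. (Illustrative sketch.)"""
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size       # patch grid dimensions
    num_patches = gh * gw
    num_replaced = int(mix_ratio * num_patches)

    perm = torch.randperm(b)                        # source image for each target
    mixed = images.clone()

    for i in range(b):
        # pick which patch positions to overwrite in image i
        idx = torch.randperm(num_patches)[:num_replaced]
        for p in idx.tolist():
            row, col = divmod(p, gw)
            ys, xs = row * patch_size, col * patch_size
            mixed[i, :, ys:ys + patch_size, xs:xs + patch_size] = \
                images[perm[i], :, ys:ys + patch_size, xs:xs + patch_size]

    # interpolate labels by the fraction of patches taken from the other image
    lam = num_replaced / num_patches
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    mixed_labels = (1 - lam) * one_hot + lam * one_hot[perm]
    return mixed, mixed_labels
```

The label interpolation follows the same area-proportional mixing used by CutMix-style augmentations, here computed over the patch grid rather than a single rectangular region; training would then use a soft-label loss (e.g. soft-target cross-entropy) on the mixed labels.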