Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing
June 30, 2023
Authors: Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz
cs.AI
Abstract
Vision transformers (ViTs) have significantly changed the computer vision
landscape and have periodically exhibited superior performance in vision tasks
compared to convolutional neural networks (CNNs). Although the jury is still
out on which model type is superior, each has unique inductive biases that
shape their learning and generalization performance. For example, ViTs have
interesting properties with respect to early layer non-local feature
dependence, as well as self-attention mechanisms which enhance learning
flexibility, enabling them to ignore out-of-context image information more
effectively. We hypothesize that this power to ignore out-of-context
information (which we name patch selectivity), while integrating
in-context information in a non-local manner in early layers, allows ViTs to
more easily handle occlusion. In this study, our aim is to see whether we can
have CNNs simulate this ability of patch selectivity by effectively
hardwiring this inductive bias using Patch Mixing data augmentation, which
consists of inserting patches from another image onto a training image and
interpolating labels between the two image classes. Specifically, we use Patch
Mixing to train state-of-the-art ViTs and CNNs, assessing its impact on their
ability to ignore out-of-context patches and handle natural occlusions. We find
that ViTs neither improve nor degrade when trained using Patch Mixing, but CNNs
acquire new capabilities to ignore out-of-context information and improve on
occlusion benchmarks, leaving us to conclude that this training method is a way
of simulating in CNNs the abilities that ViTs already possess. We will release
our Patch Mixing implementation and proposed datasets for public use. Project
page: https://arielnlee.github.io/PatchMixing/
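
For concreteness, below is a minimal sketch of the Patch Mixing augmentation as described in the abstract: a random subset of patches in each training image is overwritten with patches from a donor image in the batch, and the labels are interpolated by the fraction of patches replaced. The function name `patch_mixing` and the `patch_size`, `mix_ratio`, and `num_classes` parameters are illustrative assumptions, not the authors' reference implementation (which is to be released at the project page).

```python
import torch
import torch.nn.functional as F


def patch_mixing(images, labels, patch_size=16, mix_ratio=0.3, num_classes=1000):
    """Replace a random subset of patches in each image with patches from
    another image in the batch, and interpolate the one-hot labels in
    proportion to the fraction of patches replaced.

    images: (B, C, H, W) tensor; H and W must be divisible by patch_size.
    labels: (B,) tensor of integer class indices.
    """
    B, C, H, W = images.shape
    gh, gw = H // patch_size, W // patch_size        # patch grid dimensions
    num_patches = gh * gw

    # Pair each image with a randomly chosen donor image from the batch.
    perm = torch.randperm(B)

    # Unfold each image into a flat list of non-overlapping patches:
    # (B, num_patches, C, patch_size, patch_size).
    patches = (images.unfold(2, patch_size, patch_size)
                     .unfold(3, patch_size, patch_size)
                     .permute(0, 2, 3, 1, 4, 5)
                     .reshape(B, num_patches, C, patch_size, patch_size))

    # Pick k random patch positions per image and overwrite them with the
    # donor's patches at the same positions.
    k = int(mix_ratio * num_patches)
    idx = torch.rand(B, num_patches).argsort(dim=1)[:, :k]          # (B, k)
    batch_idx = torch.arange(B).unsqueeze(1).expand(B, k)
    patches[batch_idx, idx] = patches[perm][batch_idx, idx]

    # Fold the patch grid back into full images.
    mixed = (patches.reshape(B, gh, gw, C, patch_size, patch_size)
                    .permute(0, 3, 1, 4, 2, 5)
                    .reshape(B, C, H, W))

    # Interpolate labels between the two image classes by the mixed fraction.
    lam = k / num_patches
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_labels = (1 - lam) * one_hot + lam * one_hot[perm]
    return mixed, mixed_labels
```

In a training loop one would pair this with a soft-target loss, e.g. `-(mixed_labels * logits.log_softmax(-1)).sum(-1).mean()`, since the interpolated labels are no longer hard class indices.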