Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing

Abstract

Vision transformers (ViTs) have significantly changed the computer vision landscape and have periodically exhibited superior performance in vision tasks compared to convolutional neural networks (CNNs). Although the jury is still out on which model type is superior, each has unique inductive biases that shape its learning and generalization performance. For example, ViTs have interesting properties with respect to early-layer non-local feature dependence, as well as self-attention mechanisms that enhance learning flexibility, enabling them to ignore out-of-context image information more effectively. We hypothesize that this power to ignore out-of-context information (which we name patch selectivity), combined with the ability to integrate in-context information in a non-local manner in early layers, allows ViTs to handle occlusion more easily. In this study, our aim is to see whether we can have CNNs simulate this ability of patch selectivity by effectively hardwiring this inductive bias using Patch Mixing data augmentation, which consists of inserting patches from another image onto a training image and interpolating labels between the two image classes. Specifically, we use Patch Mixing to train state-of-the-art ViTs and CNNs, assessing its impact on their ability to ignore out-of-context patches and handle natural occlusions. We find that ViTs neither improve nor degrade when trained using Patch Mixing, but CNNs acquire new capabilities to ignore out-of-context information and improve on occlusion benchmarks, leading us to conclude that this training method is a way of simulating in CNNs the abilities that ViTs already possess. We will release our Patch Mixing implementation and proposed datasets for public use. Project page: https://arielnlee.github.io/PatchMixing/
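
The abstract describes Patch Mixing only at a high level: patches from a second image are inserted into a training image, and the label is interpolated between the two classes. The following is a minimal NumPy sketch of that idea, not the authors' released implementation. The function name `patch_mix`, the grid-aligned patch layout, the `patch` and `ratio` parameters, and the choice to weight labels by the fraction of replaced patches are all assumptions made for illustration.

```python
import numpy as np


def patch_mix(img_a, img_b, label_a, label_b, patch=16, ratio=0.3, rng=None):
    """Replace a random subset of grid patches in img_a with patches from img_b.

    Hypothetical helper (assumed design, not the paper's code).
    img_a, img_b: arrays of shape (H, W, C), with H and W divisible by `patch`.
    label_a, label_b: one-hot or soft label vectors of equal length.
    Returns the mixed image, the interpolated label, and the patch-level mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = img_a.shape
    grid_h, grid_w = h // patch, w // patch
    n_total = grid_h * grid_w
    n_swap = int(round(ratio * n_total))

    # Choose which grid cells receive a patch from the second image.
    mask = np.zeros(n_total, dtype=bool)
    mask[rng.choice(n_total, size=n_swap, replace=False)] = True
    mask = mask.reshape(grid_h, grid_w)

    # Expand the patch-level mask to pixel resolution.
    pixel_mask = mask.repeat(patch, axis=0).repeat(patch, axis=1)
    mixed = np.where(pixel_mask[..., None], img_b, img_a)

    # Assumption: the label weight equals the fraction of replaced patches.
    lam = n_swap / n_total
    label = (1.0 - lam) * label_a + lam * label_b
    return mixed, label, mask
```

In a training loop, a function like this would typically be applied per sample (or per batch with a shuffled partner image) before the forward pass, with the soft label used in a cross-entropy loss. The actual patch size, mixing ratio, and label weighting used in the study are not stated in the abstract.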
