MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers

Abstract

The input tokens to Vision Transformers (ViT) carry little semantic meaning, as they are defined as regular equal-sized patches of the input image, regardless of its content. However, processing uniform background areas of an image should not require as much compute as dense, cluttered areas. To address this issue, we propose MSViT, a dynamic mixed-scale tokenization scheme for ViT. Our method introduces a conditional gating mechanism that selects the optimal token scale for every image region, such that the number of tokens is determined dynamically per input. The proposed gating module is lightweight, agnostic to the choice of transformer backbone, and trained within a few epochs (e.g., 20 epochs on ImageNet) with little training overhead. In addition, to enhance the conditional behavior of the gate during training, we introduce a novel generalization of the batch-shaping loss. We show that our gating module is able to learn meaningful semantics despite operating locally at the coarse patch level. We validate MSViT on the tasks of classification and segmentation, where it leads to an improved accuracy-complexity trade-off.
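The idea of choosing a token scale per image region can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the function name `mixed_scale_tokenize`, the 32/16 patch sizes, the average pooling of coarse regions, and especially the variance threshold, which merely stands in for the learned gating module described in the abstract.

```python
import numpy as np

def mixed_scale_tokenize(image, coarse=32, fine=16, threshold=1e-3):
    """Return a list of (scale, row, col, patch) tokens for one image.

    Hypothetical stand-in gate: a coarse region is split into fine patches
    when its pixel variance exceeds `threshold`. MSViT learns this decision
    with a gating module; the variance rule is only an illustration.
    """
    assert image.ndim == 3, "expected an (H, W, C) array"
    h, w = image.shape[:2]
    assert h % coarse == 0 and w % coarse == 0 and coarse % fine == 0
    k = coarse // fine  # fine patches per coarse side
    tokens = []
    for r in range(0, h, coarse):
        for c in range(0, w, coarse):
            region = image[r:r + coarse, c:c + coarse]
            if region.var() > threshold:
                # "Busy" region: emit k * k fine-scale tokens.
                for fr in range(0, coarse, fine):
                    for fc in range(0, coarse, fine):
                        patch = region[fr:fr + fine, fc:fc + fine]
                        tokens.append(("fine", r + fr, c + fc, patch))
            else:
                # Uniform region: emit one coarse token, average-pooled to
                # the fine patch size so all tokens share one shape
                # (an assumed design choice, not taken from the abstract).
                pooled = region.reshape(fine, k, fine, k, -1).mean(axis=(1, 3))
                tokens.append(("coarse", r, c, pooled))
    return tokens

# Usage: a 64x64 image whose top-left quadrant is noisy and the rest uniform.
rng = np.random.default_rng(0)
img = np.zeros((64, 64, 3))
img[:32, :32] = rng.random((32, 32, 3))
tokens = mixed_scale_tokenize(img)
print(len(tokens), [scale for scale, *_ in tokens])
```

With this toy gate, the token count depends on the input: a fully uniform 64x64 image yields 4 coarse tokens, while a fully cluttered one yields 16 fine tokens.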
