MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers

Abstract

The input tokens to Vision Transformers carry little semantic meaning because they are defined as regular, equal-sized patches of the input image, regardless of its content. However, processing uniform background areas of an image should not require as much compute as processing dense, cluttered areas. To address this issue, we propose MSViT, a dynamic mixed-scale tokenization scheme for ViT. Our method introduces a conditional gating mechanism that selects the optimal token scale for every image region, so that the number of tokens is determined dynamically per input. The proposed gating module is lightweight, agnostic to the choice of transformer backbone, and trained within a few epochs (e.g., 20 epochs on ImageNet) with little training overhead. In addition, to enhance the conditional behavior of the gate during training, we introduce a novel generalization of the batch-shaping loss. We show that our gating module is able to learn meaningful semantics despite operating locally at the coarse patch level. We validate MSViT on classification and segmentation tasks, where it leads to an improved accuracy-complexity trade-off.
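To make the idea of per-region scale selection concrete, the following is a minimal illustrative sketch, not the MSViT implementation. It assumes two scales (a coarse 4x4 patch and a fine 2x2 patch) and replaces the learned gating module with a hypothetical variance threshold; the function name `mixed_scale_tokens` and all parameter values are assumptions made for this example only.

```python
from statistics import pvariance


def mixed_scale_tokens(image, coarse=4, fine=2, threshold=0.01):
    """Split a grayscale image into mixed-scale tokens.

    `image` is a list of equal-length rows of floats. Returns a list of
    (row, col, size) tuples, one per token.
    """
    height, width = len(image), len(image[0])
    assert height % coarse == 0 and width % coarse == 0
    assert coarse % fine == 0
    tokens = []
    for r in range(0, height, coarse):
        for c in range(0, width, coarse):
            pixels = [image[y][x]
                      for y in range(r, r + coarse)
                      for x in range(c, c + coarse)]
            # Hypothetical stand-in for the learned gate: a region counts as
            # "cluttered" when its pixel variance exceeds the threshold.
            if pvariance(pixels) > threshold:
                # Cluttered region: emit several fine-scale tokens.
                for fr in range(r, r + coarse, fine):
                    for fc in range(c, c + coarse, fine):
                        tokens.append((fr, fc, fine))
            else:
                # Uniform region: a single coarse token suffices.
                tokens.append((r, c, coarse))
    return tokens


# An 8x8 image with a checkerboard in the top-left 4x4 region and a flat
# background elsewhere.
image = [[float((x + y) % 2) if (x < 4 and y < 4) else 0.0
          for x in range(8)] for y in range(8)]
tokens = mixed_scale_tokens(image)
print(len(tokens))  # 7: four fine tokens for the textured region, three coarse
```

In this toy setting, tokenizing everything at the fine scale would yield 16 tokens and tokenizing everything at the coarse scale would yield 4, so the token count varies with image content. In the method described above, the scale decision comes from a trained gating module rather than a fixed threshold.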
