Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

Abstract

The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT), which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation, and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed input and modeling pipeline used by most computer vision models, and represents a promising direction for ViTs.
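
To make the idea of sequence packing concrete, here is a minimal sketch of how images at their native resolutions could be turned into token counts and grouped into fixed-length sequences. It is an illustration only: the abstract does not describe the packing procedure, so the first-fit strategy, the patch size of 16, the token budget of 256, and the helper names `num_patches` and `pack_first_fit` are all assumptions rather than the authors' method.

```python
import math


def num_patches(height, width, patch=16):
    """Token count for one image kept at its native resolution.

    Assumption: edges are padded up to a whole number of patches.
    """
    return math.ceil(height / patch) * math.ceil(width / patch)


def pack_first_fit(token_counts, max_len):
    """Place each image in the first sequence that still has room.

    Returns the image indices per sequence and the unused (padding)
    token budget left in each sequence.
    """
    sequences = []  # image indices grouped per packed sequence
    free = []       # remaining token budget per packed sequence
    for idx, n in enumerate(token_counts):
        if n > max_len:
            raise ValueError(f"image {idx} needs {n} tokens, budget is {max_len}")
        for s, room in enumerate(free):
            if n <= room:
                sequences[s].append(idx)
                free[s] -= n
                break
        else:
            # No existing sequence can hold this image: open a new one.
            sequences.append([idx])
            free.append(max_len - n)
    return sequences, free


# Hypothetical image sizes (height, width) with different aspect ratios.
sizes = [(224, 224), (128, 384), (64, 64), (320, 160)]
counts = [num_patches(h, w) for h, w in sizes]
packed, padding = pack_first_fit(counts, max_len=256)
print(counts)   # [196, 192, 16, 200]
print(packed)   # [[0, 2], [1], [3]]
print(padding)  # [44, 64, 56]
```

In this sketch the small 64x64 image shares a sequence with the 224x224 image instead of occupying one of its own. A real implementation would presumably also need to keep images in the same sequence from attending to each other, for example with an attention mask, which is not shown here.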
