Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

Abstract

The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT), which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation, and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed input and modeling pipeline used by most computer vision models, and represents a promising direction for ViTs.
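To make the idea of sequence packing concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical patch size of 16, a greedy first-fit packing rule, and a block-diagonal attention mask that keeps tokens from different images from attending to each other; none of these details are stated in the abstract. The helper names (`num_patches`, `pack_first_fit`, `example_ids`, `attention_mask`) are illustrative only.

```python
import math

def num_patches(height, width, patch=16):
    """Token count for one image at native resolution.

    Assumption: edges are padded up to a whole patch, hence the ceil.
    """
    return math.ceil(height / patch) * math.ceil(width / patch)

def pack_first_fit(token_counts, max_len):
    """Greedy first-fit: put each example in the first sequence with room."""
    sequences = []  # each entry is a list of (example_index, n_tokens)
    for idx, n in enumerate(token_counts):
        if n > max_len:
            raise ValueError(f"example {idx} has {n} tokens, more than max_len={max_len}")
        for seq in sequences:
            if sum(t for _, t in seq) + n <= max_len:
                seq.append((idx, n))
                break
        else:
            # No existing sequence had room, so open a new one.
            sequences.append([(idx, n)])
    return sequences

def example_ids(seq, max_len, pad_id=-1):
    """Per-token example index for one packed sequence, padded to max_len."""
    ids = []
    for idx, n in seq:
        ids.extend([idx] * n)
    ids.extend([pad_id] * (max_len - len(ids)))
    return ids

def attention_mask(ids, pad_id=-1):
    """mask[i][j] is True only when tokens i and j come from the same image."""
    return [[a == b and a != pad_id for b in ids] for a in ids]

# Hypothetical (height, width) pairs with different aspect ratios.
sizes = [(64, 48), (32, 96), (80, 80), (16, 16)]
counts = [num_patches(h, w) for h, w in sizes]
packed = pack_first_fit(counts, max_len=32)
ids = example_ids(packed[0], max_len=32)
mask = attention_mask(ids)
```

In this sketch, three of the four images share one 32-token sequence while the largest gets its own, so fewer padding tokens are wasted than if every image were padded to the longest length. The same `num_patches` helper also illustrates the inference-time trade-off mentioned in the abstract: lowering the input resolution shrinks the token count, and with it the compute cost.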
