Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

July 12, 2023
作者: Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, Neil Houlsby
cs.AI

Abstract

The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.
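The core idea, patchifying each image at its native resolution and aspect ratio and then packing the resulting variable-length token sequences into fixed-length training rows, can be illustrated with a minimal sketch. This is not the paper's implementation; the helper names (`patchify`, `pack_sequences`), the greedy packing policy, and the per-token example IDs (used to mask attention between different images in the same row) are illustrative assumptions.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an H x W x C image into non-overlapping square patches.
    H and W need not be equal (any aspect ratio), but this simple sketch
    assumes both are divisible by patch_size.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    # -> (num_patches, patch_size * patch_size * c) flat tokens
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * c)

def pack_sequences(patch_seqs, max_len):
    """Greedily pack variable-length patch sequences into fixed-length rows.
    Returns packed tokens, a per-token example id (-1 = padding), and a
    boolean padding mask. Each sequence is assumed to fit within max_len.
    """
    dim = patch_seqs[0].shape[1]
    rows, ids = [], []
    row = np.zeros((max_len, dim))
    row_id = np.full(max_len, -1)
    used, eid = 0, 0
    for seq in patch_seqs:
        n = len(seq)
        if used + n > max_len:          # current row is full: start a new one
            rows.append(row); ids.append(row_id)
            row = np.zeros((max_len, dim))
            row_id = np.full(max_len, -1)
            used = 0
        row[used:used + n] = seq        # append this image's tokens
        row_id[used:used + n] = eid     # tag tokens with their source image
        used += n; eid += 1
    rows.append(row); ids.append(row_id)
    ids = np.stack(ids)
    return np.stack(rows), ids, ids >= 0
```

For example, three images of different shapes, `(32, 16, 3)`, `(16, 16, 3)`, and `(48, 32, 3)`, yield 2, 1, and 6 patches at patch size 16; with `max_len=8` the first two images share one packed row and the third fills a second row. In a real model, the example-id tensor would drive a block-diagonal attention mask so tokens only attend within their own image.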