

ViTAR: Vision Transformer with Any Resolution

March 27, 2024
作者: Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang
cs.AI

Abstract

This paper tackles a significant challenge faced by Vision Transformers (ViTs): their constrained scalability across different image resolutions. Typically, ViTs experience a performance decline when processing resolutions different from those seen during training. Our work introduces two key innovations to address this issue. Firstly, we propose a novel module for dynamic resolution adjustment, designed with a single Transformer block, specifically to achieve highly efficient incremental token integration. Secondly, we introduce fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions, thereby preventing overfitting to any single training resolution. Our resulting model, ViTAR (Vision Transformer with Any Resolution), demonstrates impressive adaptability, achieving 83.3% top-1 accuracy at a 1120x1120 resolution and 80.4% accuracy at a 4032x4032 resolution, all while reducing computational costs. ViTAR also shows strong performance in downstream tasks such as instance and semantic segmentation and can easily be combined with self-supervised learning techniques like Masked AutoEncoder. Our work provides a cost-effective solution for enhancing the resolution scalability of ViTs, paving the way for more versatile and efficient high-resolution image processing.
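To make the fuzzy positional encoding idea concrete: instead of assigning each token the embedding at its exact grid coordinate, the coordinate is jittered with small random noise during training and the embedding is interpolated from a learnable table, so no single resolution's exact positions are memorized. The sketch below is an illustrative NumPy implementation under that reading of the abstract; the function name, the uniform noise range [-0.5, 0.5), and the bilinear interpolation are assumptions, not the paper's reference code.

```python
import numpy as np

def fuzzy_positional_encoding(pos_table, rows, cols, training=True, rng=None):
    """Sample positional embeddings for a rows x cols token grid.

    pos_table: (H, W, D) array standing in for a learnable embedding table.
    During training each token's (row, col) coordinate is perturbed by
    uniform noise, then its embedding is bilinearly interpolated from the
    table. At evaluation time the exact grid coordinates are used.
    """
    H, W, D = pos_table.shape
    rng = rng or np.random.default_rng(0)
    # Integer grid coordinates for each token, in row-major order.
    r = np.repeat(np.arange(rows, dtype=np.float64), cols)
    c = np.tile(np.arange(cols, dtype=np.float64), rows)
    if training:
        # Fuzz the coordinates so the model cannot overfit to exact positions.
        r = r + rng.uniform(-0.5, 0.5, size=r.shape)
        c = c + rng.uniform(-0.5, 0.5, size=c.shape)
    r = np.clip(r, 0, H - 1)
    c = np.clip(c, 0, W - 1)
    # Bilinear interpolation between the four neighbouring table entries.
    r0, c0 = np.floor(r).astype(int), np.floor(c).astype(int)
    r1, c1 = np.minimum(r0 + 1, H - 1), np.minimum(c0 + 1, W - 1)
    wr, wc = (r - r0)[:, None], (c - c0)[:, None]
    emb = ((1 - wr) * (1 - wc) * pos_table[r0, c0]
           + (1 - wr) * wc * pos_table[r0, c1]
           + wr * (1 - wc) * pos_table[r1, c0]
           + wr * wc * pos_table[r1, c1])
    return emb  # (rows * cols, D)
```

Because the interpolation is defined for any (rows, cols), the same table can serve token grids produced by images of different resolutions, which is the property the abstract highlights.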

