MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
July 5, 2023
Authors: Jakob Drachmann Havtorn, Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi
cs.AI
Abstract
The input tokens to Vision Transformers carry little semantic meaning as they
are defined as regular equal-sized patches of the input image, regardless of
its content. However, processing uniform background areas of an image should
not necessitate as much compute as dense, cluttered areas. To address this
issue, we propose a dynamic mixed-scale tokenization scheme for ViT, MSViT. Our
method introduces a conditional gating mechanism that selects the optimal token
scale for every image region, such that the number of tokens is dynamically
determined per input. The proposed gating module is lightweight, agnostic to
the choice of transformer backbone, and trained within a few epochs (e.g., 20
epochs on ImageNet) with little training overhead. In addition, to enhance the
conditional behavior of the gate during training, we introduce a novel
generalization of the batch-shaping loss. We show that our gating module is
able to learn meaningful semantics despite operating locally at the coarse
patch level. We validate MSViT on the tasks of classification and segmentation,
where it leads to an improved accuracy-complexity trade-off.
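
To make the gating idea concrete, below is a minimal PyTorch sketch of a per-region gate that keeps each coarse patch as a single token or splits it into a grid of fine tokens. All names and choices here (MixedScaleTokenizer, the 32/16 pixel scales, the pixel-level gate network, the straight-through estimator) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MixedScaleTokenizer(nn.Module):
    """Gate each coarse patch: keep it as one token, or split into fine tokens."""
    def __init__(self, coarse=32, fine=16, embed_dim=192, in_chans=3):
        super().__init__()
        assert coarse % fine == 0
        self.coarse, self.fine = coarse, fine
        # One patch-embedding head per scale.
        self.embed_coarse = nn.Conv2d(in_chans, embed_dim, coarse, stride=coarse)
        self.embed_fine = nn.Conv2d(in_chans, embed_dim, fine, stride=fine)
        # Lightweight gate: one logit per coarse patch, computed from raw pixels.
        self.gate = nn.Sequential(
            nn.Conv2d(in_chans, 16, coarse, stride=coarse),
            nn.ReLU(),
            nn.Conv2d(16, 1, 1),
        )

    def forward(self, x):
        B, _, H, W = x.shape
        gc_h, gc_w = H // self.coarse, W // self.coarse
        r = self.coarse // self.fine              # fine patches per coarse side
        probs = torch.sigmoid(self.gate(x).view(B, gc_h, gc_w))
        # Straight-through estimator: hard 0/1 decision forward, soft gradient back.
        mask = ((probs > 0.5).float() - probs).detach() + probs   # 1 = "use fine"
        coarse_tok = self.embed_coarse(x).flatten(2).transpose(1, 2)  # (B, Nc, D)
        fine_tok = self.embed_fine(x).flatten(2).transpose(1, 2)      # (B, Nf, D)
        # Broadcast each coarse decision to the r x r fine patches it covers.
        fine_mask = mask.repeat_interleave(r, 1).repeat_interleave(r, 2)
        keep_coarse = (1.0 - mask).flatten(1).unsqueeze(-1)
        keep_fine = fine_mask.flatten(1).unsqueeze(-1)
        # The sketch keeps every token and zeroes the unselected ones; a real
        # implementation would drop them, so the token count varies per input.
        tokens = torch.cat([coarse_tok * keep_coarse, fine_tok * keep_fine], 1)
        return tokens, mask

# Usage: a 224x224 input yields 49 coarse + 196 fine token slots.
tokens, mask = MixedScaleTokenizer()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 245, 192])
```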
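
The batch-shaping loss that the abstract generalizes is a known regularizer from earlier work on conditional gating: it fits the batch distribution of gate activations to a prior CDF with a Cramér-von-Mises-style criterion. The abstract does not describe MSViT's generalization, so the sketch below shows only that base formulation, and swaps in a uniform prior (CDF(x) = x) to stay dependency-free, since the Beta prior used in the original requires the regularized incomplete beta function, which PyTorch does not provide.

```python
import torch

def batch_shaping_loss(gate_probs: torch.Tensor) -> torch.Tensor:
    """Fit the empirical CDF of gate activations to a prior CDF.

    gate_probs: gate activations in (0, 1), pooled over the batch.
    Uniform prior here for simplicity; the original loss uses a Beta prior.
    """
    x, _ = torch.sort(gate_probs.flatten())
    n = x.numel()
    # Empirical CDF evaluated at the sorted activations: i / n, i = 1..n.
    ecdf = torch.arange(1, n + 1, dtype=x.dtype, device=x.device) / n
    prior_cdf = x  # uniform prior on [0, 1]; gradients flow through x
    return ((prior_cdf - ecdf) ** 2).mean()
```

Pulling the gate outputs toward a prior in distribution, rather than penalizing each gate independently, discourages the degenerate solution where every region receives the same scale, which is what makes the gate's behavior conditional on the input.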