Your ViT is Secretly an Image Segmentation Model
March 24, 2025
Authors: Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, Daan de Geus
cs.AI
Abstract
Vision Transformers (ViTs) have shown remarkable performance and scalability
across various computer vision tasks. To apply single-scale ViTs to image
segmentation, existing methods adopt a convolutional adapter to generate
multi-scale features, a pixel decoder to fuse these features, and a Transformer
decoder that uses the fused features to make predictions. In this paper, we
show that the inductive biases introduced by these task-specific components can
instead be learned by the ViT itself, given sufficiently large models and
extensive pre-training. Based on these findings, we introduce the Encoder-only
Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct
image segmentation. With large-scale models and pre-training, EoMT obtains a
segmentation accuracy similar to state-of-the-art models that use task-specific
components. At the same time, EoMT is significantly faster than these methods
due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a
range of model sizes, EoMT demonstrates an optimal balance between segmentation
accuracy and prediction speed, suggesting that compute resources are better
spent on scaling the ViT itself rather than adding architectural complexity.
Code: https://www.tue-mps.org/eomt/.
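
The abstract does not spell out how a plain ViT, without a convolutional adapter, pixel decoder, or separate Transformer decoder, produces segmentation masks. The sketch below is one plausible reading in PyTorch, assuming Mask2Former-style mask classification in which learnable query tokens are appended to the ViT's token sequence partway through the encoder and masks come from a dot product between query embeddings and patch features. All module and parameter names (`EoMTSketch`, `queries_at_block`, etc.) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the encoder-only idea: learnable queries join the ViT's
# patch tokens in the final encoder blocks; two small heads then predict a
# class per query and a mask per query. Names and hyperparameters are
# hypothetical placeholders, not the official EoMT implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EoMTSketch(nn.Module):
    def __init__(self, image_size=640, patch_size=16, dim=1024, depth=24,
                 num_heads=16, num_queries=100, num_classes=133,
                 queries_at_block=20):
        super().__init__()
        self.patch_size = patch_size
        self.grid = image_size // patch_size
        self.queries_at_block = queries_at_block

        # Plain ViT: patch embedding + positional embedding + Transformer blocks.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, num_heads, 4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dim)

        # Learnable queries, inserted into the token sequence of the last blocks.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))

        # Lightweight heads: class logits per query, and a mask embedding that
        # is matched against patch features via a dot product.
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, images):
        b = images.shape[0]
        x = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos_embed

        for i, block in enumerate(self.blocks):
            if i == self.queries_at_block:
                # From here on, queries and patch tokens attend jointly inside
                # the ViT; no adapter, pixel decoder, or separate decoder.
                x = torch.cat([self.queries.expand(b, -1, -1), x], dim=1)
            x = block(x)
        x = self.norm(x)

        q, patches = x[:, :self.queries.shape[1]], x[:, self.queries.shape[1]:]
        class_logits = self.class_head(q)  # (B, Q, C+1)

        # Mask logits: dot product between query and patch embeddings, reshaped
        # to the patch grid and upsampled to image resolution.
        mask_logits = torch.einsum('bqd,bpd->bqp', self.mask_head(q), patches)
        mask_logits = mask_logits.view(b, -1, self.grid, self.grid)
        masks = F.interpolate(mask_logits, scale_factor=self.patch_size,
                              mode='bilinear', align_corners=False)
        return class_logits, masks


if __name__ == "__main__":
    model = EoMTSketch(image_size=224, dim=256, depth=8, num_heads=8,
                       queries_at_block=6)
    cls, masks = model(torch.randn(1, 3, 224, 224))
    print(cls.shape, masks.shape)  # (1, 100, 134) and (1, 100, 224, 224)
```

The point of the sketch is architectural: everything between the patch embedding and the two small prediction heads is an unmodified ViT stack, which is where the abstract argues compute is better spent than on task-specific components.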