Your ViT is Secretly an Image Segmentation Model

Abstract

Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.
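The abstract does not spell out how a plain ViT is repurposed for segmentation, so the following is only a toy sketch of what an encoder-only design could look like, not the authors' implementation. It assumes (hypothetically) that a few query tokens are concatenated with the patch tokens, that the joint sequence passes through ordinary self-attention blocks with no adapter or decoder, and that masks are read out as dot products between the final query and patch tokens. The names `self_attention` and `encoder_only_masks` are illustrative, the attention uses identity projections instead of learned weights, and the tokens are random.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections and a residual.

    A real ViT block has learned projections, an MLP, and normalization;
    they are omitted here to keep the sketch short.
    """
    dim = len(tokens[0])
    scores = matmul(tokens, transpose(tokens))
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    mixed = matmul(weights, tokens)
    return [[t + m for t, m in zip(t_row, m_row)] for t_row, m_row in zip(tokens, mixed)]


def encoder_only_masks(patch_tokens, query_tokens, num_blocks=2):
    """Assumed encoder-only flow: queries and patches share the same blocks."""
    num_queries = len(query_tokens)
    tokens = query_tokens + patch_tokens  # one joint token sequence
    for _ in range(num_blocks):
        tokens = self_attention(tokens)
    queries, patches = tokens[:num_queries], tokens[num_queries:]
    # One mask logit per (query, patch) pair.
    mask_logits = matmul(queries, transpose(patches))
    # Assign each patch to the query with the highest logit.
    assignment = [
        max(range(num_queries), key=lambda q: mask_logits[q][p])
        for p in range(len(patches))
    ]
    return mask_logits, assignment


random.seed(0)
patch_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]  # 4x4 patch grid
query_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]   # 3 segments
mask_logits, assignment = encoder_only_masks(patch_tokens, query_tokens)
```

Here `mask_logits` has one row per query and one column per patch, and `assignment` labels each of the 16 patches with a query index. The point of the sketch is structural: the only components are the token sequence and the shared attention blocks, which is the kind of simplicity the abstract credits for the speed advantage.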
