An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Abstract

This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of locality, an inductive bias built into modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and still achieve strong results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g., by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision.
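To make the contrast between patch tokens and pixel tokens concrete, the following is a minimal sketch, not the authors' implementation. It assumes a NumPy image of shape (H, W, C) and an illustrative 224x224 RGB input; the helper name `tokenize` and the input size are hypothetical choices made for this example only.

```python
import numpy as np


def tokenize(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping square patches.

    Returns an array of shape (num_tokens, patch_size * patch_size * C),
    one flattened patch per token. patch_size=1 yields one token per pixel.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


# Hypothetical 224x224 RGB input, chosen only for illustration.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)

patch_tokens = tokenize(image, patch_size=16)  # one token per 16x16 patch
pixel_tokens = tokenize(image, patch_size=1)   # one token per pixel

# How many times longer the pixel-level sequence is than the patch-level one.
length_ratio = pixel_tokens.shape[0] // patch_tokens.shape[0]
```

Under these assumptions the patch-level sequence has 196 tokens of dimension 768, while the pixel-level sequence has 50,176 tokens of dimension 3, which is 256 times longer. Because self-attention compares every pair of tokens, its cost grows quadratically with sequence length, which is one way to read the abstract's remark that operating on individual pixels is less computationally practical.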
