An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
June 13, 2024
作者: Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain, Martin R. Oswald, Cees G. M. Snoek, Xinlei Chen
cs.AI
Abstract
This work does not introduce a new method. Instead, we present an interesting
finding that questions the necessity of the inductive bias -- locality in
modern computer vision architectures. Concretely, we find that vanilla
Transformers can operate by directly treating each individual pixel as a token
and achieve highly performant results. This is substantially different from the
popular design in Vision Transformer, which maintains the inductive bias from
ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a
token). We mainly showcase the effectiveness of pixels-as-tokens across three
well-studied tasks in computer vision: supervised learning for object
classification, self-supervised learning via masked autoencoding, and image
generation with diffusion models. Although directly operating on individual
pixels is less computationally practical, we believe the community must be
aware of this surprising piece of knowledge when devising the next generation
of neural architectures for computer vision.
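To make the contrast concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation) of the two tokenizations the abstract compares: the standard ViT scheme, where each 16x16 patch becomes one token, and the pixels-as-tokens scheme, where every individual pixel is a token. The helper names `patchify`, `pixelify`, and `TinyEncoder` are illustrative assumptions, and details such as positional embeddings, class tokens, and task heads are omitted.

```python
# Sketch under assumptions: contrasts patch tokens vs. pixel tokens fed to a
# vanilla Transformer encoder. Not the paper's code; names are illustrative.
import torch
import torch.nn as nn


def patchify(images, patch_size=16):
    """Standard ViT tokenization: each 16x16 patch becomes one token."""
    b, c, h, w = images.shape
    p = patch_size
    tokens = images.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
    tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
    return tokens                                             # (B, H*W/p^2, C*p*p)


def pixelify(images):
    """Pixels-as-tokens: each individual pixel becomes one token."""
    b, c, h, w = images.shape
    return images.permute(0, 2, 3, 1).reshape(b, h * w, c)    # (B, H*W, C)


class TinyEncoder(nn.Module):
    """A plain Transformer encoder applied to whatever token sequence it is given."""

    def __init__(self, token_dim, d_model=192, depth=4, heads=3):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):
        return self.encoder(self.embed(tokens))


x = torch.randn(2, 3, 32, 32)                 # small images keep the example cheap
patch_tokens = patchify(x, patch_size=16)     # 4 tokens per image
pixel_tokens = pixelify(x)                    # 1024 tokens per image
print(TinyEncoder(patch_tokens.shape[-1])(patch_tokens).shape)  # (2, 4, 192)
print(TinyEncoder(pixel_tokens.shape[-1])(pixel_tokens).shape)  # (2, 1024, 192)
```

The sketch also makes the practicality caveat from the abstract visible: with self-attention scaling quadratically in sequence length, going from 4 patch tokens to 1024 pixel tokens on even a 32x32 image is far more expensive, which is why the finding is framed as informative for future architecture design rather than as a drop-in recipe.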