An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
June 13, 2024
作者: Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain, Martin R. Oswald, Cees G. M. Snoek, Xinlei Chen
cs.AI
Abstract
This work does not introduce a new method. Instead, we present an interesting
finding that questions the necessity of the inductive bias -- locality -- in
modern computer vision architectures. Concretely, we find that vanilla
Transformers can operate by directly treating each individual pixel as a token
and achieve highly performant results. This is substantially different from the
popular design in Vision Transformer, which maintains the inductive bias from
ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a
token). We mainly showcase the effectiveness of pixels-as-tokens across three
well-studied tasks in computer vision: supervised learning for object
classification, self-supervised learning via masked autoencoding, and image
generation with diffusion models. Although directly operating on individual
pixels is less computationally practical, we believe the community must be
aware of this surprising piece of knowledge when devising the next generation
of neural architectures for computer vision.
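The contrast the abstract draws can be made concrete with a minimal tokenization sketch: ViT-style patchification turns an H x W x C image into (H/16)(W/16) tokens of dimension 16*16*C, while pixels-as-tokens yields HW tokens of dimension C, at the cost of a much longer sequence for self-attention. The helper names `patchify` and `pixelify` below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """ViT-style tokenization: each p x p patch becomes one token."""
    h, w, c = img.shape
    # Split rows and columns into p-sized blocks, then flatten each block.
    tokens = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return tokens.reshape(-1, p * p * c)

def pixelify(img):
    """Pixels-as-tokens: each individual pixel is one token of dimension c."""
    h, w, c = img.shape
    return img.reshape(h * w, c)

img = rng.standard_normal((32, 32, 3))
patch_tokens = patchify(img, 16)  # shape (4, 768): 4 tokens of dim 16*16*3
pixel_tokens = pixelify(img)      # shape (1024, 3): 1024 tokens of dim 3
```

Note the trade-off the abstract alludes to: the pixel sequence here is 256x longer than the patch sequence, so quadratic self-attention cost is what makes the pixel variant "less computationally practical" despite its strong results.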