Rethinking Patch Dependence for Masked Autoencoders
January 25, 2024
Authors: Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg
cs.AI
Abstract
In this work, we re-examine inter-patch dependencies in the decoding
mechanism of masked autoencoders (MAE). We decompose this decoding mechanism
for masked patch reconstruction in MAE into self-attention and cross-attention.
Our investigations suggest that self-attention between mask patches is not
essential for learning good representations. To this end, we propose a novel
pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE).
CrossMAE's decoder leverages only cross-attention between masked and visible
tokens, with no degradation in downstream performance. This design also enables
decoding only a small subset of mask tokens, boosting efficiency. Furthermore,
each decoder block can now leverage different encoder features, resulting in
improved representation learning. CrossMAE matches MAE in performance with 2.5
to 3.7× less decoding compute. It also surpasses MAE on ImageNet
classification and COCO instance segmentation under the same compute. Code and
models: https://crossmae.github.io
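
The cross-attention-only decoder described in the abstract can be illustrated with a short sketch. The block below is not the authors' implementation; it is a minimal PyTorch sketch, assuming mask tokens act as queries and visible encoder tokens as keys and values, so mask tokens never attend to one another. The class name CrossAttentionDecoderBlock and the dimensions (dim=512, num_heads=8) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (assumed, not the authors' code) of a cross-attention-only
# decoder block in the spirit of CrossMAE: mask tokens are queries, visible
# encoder tokens are keys/values, and there is no self-attention among
# mask tokens.
import torch
import torch.nn as nn


class CrossAttentionDecoderBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, mask_tokens: torch.Tensor, visible_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come only from mask tokens; keys/values come only from the
        # visible tokens, so decoding one mask token is independent of the others.
        q = self.norm_q(mask_tokens)
        kv = self.norm_kv(visible_tokens)
        attn_out, _ = self.cross_attn(q, kv, kv, need_weights=False)
        x = mask_tokens + attn_out
        x = x + self.mlp(self.norm_mlp(x))
        return x


if __name__ == "__main__":
    # Because mask tokens do not interact, only a small subset of them needs
    # to be decoded. The batch size, token counts, and width here are hypothetical.
    visible = torch.randn(2, 49, 512)   # hypothetical visible-token features from the encoder
    masked = torch.randn(2, 49, 512)    # hypothetical subset of mask-token queries
    block = CrossAttentionDecoderBlock()
    out = block(masked, visible)
    print(out.shape)  # torch.Size([2, 49, 512])
```

In this sketch, the independence of mask-token queries is what allows decoding only a small subset of mask tokens, which is the source of the efficiency gain the abstract reports; feeding each decoder block features from a different encoder layer would be a separate change on top of this structure.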