Rethinking Patch Dependence for Masked Autoencoders
January 25, 2024
Authors: Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg
cs.AI
Abstract
In this work, we re-examine inter-patch dependencies in the decoding
mechanism of masked autoencoders (MAE). We decompose this decoding mechanism
for masked patch reconstruction in MAE into self-attention and cross-attention.
Our investigations suggest that self-attention between mask patches is not
essential for learning good representations. To this end, we propose a novel
pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE).
CrossMAE's decoder leverages only cross-attention between masked and visible
tokens, with no degradation in downstream performance. This design also enables
decoding only a small subset of mask tokens, boosting efficiency. Furthermore,
each decoder block can now leverage different encoder features, resulting in
improved representation learning. CrossMAE matches MAE in performance with 2.5
to 3.7× less decoding compute. It also surpasses MAE on ImageNet
classification and COCO instance segmentation under the same compute. Code and
models: https://crossmae.github.io
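
The cross-attention-only decoder described in the abstract can be illustrated with a short sketch. The block below is not the authors' implementation; it is a minimal PyTorch sketch, assuming mask tokens act as queries and visible encoder tokens as keys and values, so mask tokens never attend to one another. The class name CrossAttentionDecoderBlock and the dimensions (dim=512, num_heads=8) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (assumed, not the authors' code) of a cross-attention-only
# decoder block in the spirit of CrossMAE: mask tokens are queries, visible
# encoder tokens are keys/values, and there is no self-attention among
# mask tokens.
import torch
import torch.nn as nn


class CrossAttentionDecoderBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, mask_tokens: torch.Tensor, visible_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come only from mask tokens; keys/values come only from the
        # visible tokens, so decoding one mask token is independent of the others.
        q = self.norm_q(mask_tokens)
        kv = self.norm_kv(visible_tokens)
        attn_out, _ = self.cross_attn(q, kv, kv, need_weights=False)
        x = mask_tokens + attn_out
        x = x + self.mlp(self.norm_mlp(x))
        return x


if __name__ == "__main__":
    # Because mask tokens do not interact, only a small subset of them needs
    # to be decoded. The batch size, token counts, and width here are hypothetical.
    visible = torch.randn(2, 49, 512)   # hypothetical visible-token features from the encoder
    masked = torch.randn(2, 49, 512)    # hypothetical subset of mask-token queries
    block = CrossAttentionDecoderBlock()
    out = block(masked, visible)
    print(out.shape)  # torch.Size([2, 49, 512])
```

In this sketch, the independence of mask-token queries is what allows decoding only a small subset of mask tokens, which is the source of the efficiency gain the abstract reports; feeding each decoder block features from a different encoder layer would be a separate change on top of this structure.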