Rethinking Patch Dependence for Masked Autoencoders

January 25, 2024
Authors: Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg
cs.AI

Abstract

In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose this decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest that self-attention between mask patches is not essential for learning good representations. To this end, we propose a novel pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE). CrossMAE's decoder leverages only cross-attention between masked and visible tokens, with no degradation in downstream performance. This design also enables decoding only a small subset of mask tokens, boosting efficiency. Furthermore, each decoder block can now leverage different encoder features, resulting in improved representation learning. CrossMAE matches MAE in performance with 2.5 to 3.7× less decoding compute. It also surpasses MAE on ImageNet classification and COCO instance segmentation under the same compute. Code and models: https://crossmae.github.io
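Below is a minimal PyTorch sketch of the decoder idea the abstract describes, not the authors' released implementation: mask-token queries attend to visible-token encoder features via cross-attention only, so there is no self-attention among mask tokens and any subset of them can be decoded independently. The class name, layer sizes, and the partial decoding ratio in the usage example are illustrative assumptions.

```python
# Sketch of a cross-attention-only decoder block, assuming a standard
# pre-norm transformer layout. Queries are mask tokens; keys/values are
# encoder features of visible patches. In the full design, different decoder
# blocks may receive features from different encoder layers.
import torch
import torch.nn as nn


class CrossAttentionDecoderBlock(nn.Module):
    """Hypothetical decoder block: cross-attention followed by an MLP."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, mask_tokens: torch.Tensor, visible_feats: torch.Tensor) -> torch.Tensor:
        # mask_tokens:   (B, N_masked_subset, dim) -- queries only
        # visible_feats: (B, N_visible, dim)       -- keys/values from the encoder
        q = self.norm_q(mask_tokens)
        kv = self.norm_kv(visible_feats)
        attn_out, _ = self.cross_attn(q, kv, kv, need_weights=False)
        x = mask_tokens + attn_out
        x = x + self.mlp(self.norm_mlp(x))
        return x


if __name__ == "__main__":
    # Decode only a subset of mask tokens (the subset size here is arbitrary),
    # which is possible because mask tokens never attend to each other.
    B, dim = 2, 512
    visible = torch.randn(B, 49, dim)       # encoder outputs for visible patches
    mask_subset = torch.randn(B, 37, dim)   # positional mask-token queries for a subset
    block = CrossAttentionDecoderBlock(dim)
    print(block(mask_subset, visible).shape)  # torch.Size([2, 37, 512])
```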