Rethinking Patch Dependence for Masked Autoencoders

Abstract

In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose the decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest that self-attention between mask tokens is not essential for learning good representations. Based on this finding, we propose a novel pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE). CrossMAE's decoder uses only cross-attention between masked and visible tokens, with no degradation in downstream performance. This design also enables decoding only a small subset of mask tokens, which improves efficiency. Furthermore, each decoder block can now leverage different encoder features, resulting in improved representation learning. CrossMAE matches MAE in performance with 2.5 to 3.7× less decoding compute. It also surpasses MAE on ImageNet classification and COCO instance segmentation under the same compute. Code and models: https://crossmae.github.io
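The claim that a cross-attention-only decoder can decode just a subset of mask tokens can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it assumes a single attention head with no learned projections, and the names `cross_attention`, `visible_tokens`, and `mask_queries` are hypothetical, with random vectors standing in for real embeddings. The point is only that each masked-token query attends to the visible tokens and never to other mask tokens, so decoding a subset gives the same outputs for those tokens as decoding all of them.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, visible):
    """Single-head cross-attention without learned projections (a simplification).

    Each query (one masked-token embedding) attends over the visible tokens
    only; queries never interact with one another.
    """
    dim = len(visible[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every visible token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in visible]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible tokens.
        out = [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        outputs.append(out)
    return outputs


# Toy sizes chosen for illustration only (assumption, not from the paper).
random.seed(0)
dim = 8
visible_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
mask_queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(12)]

# Decode every masked position, then decode only a subset of them.
full = cross_attention(mask_queries, visible_tokens)
subset_idx = [1, 5, 9]
partial = cross_attention([mask_queries[i] for i in subset_idx], visible_tokens)

# The subset outputs match the corresponding rows of the full decode.
matches = partial == [full[i] for i in subset_idx]
print(matches)
```

With self-attention among mask tokens this equality would generally not hold, because dropping some mask tokens would change the context seen by the remaining ones.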
