Rethinking Patch Dependence for Masked Autoencoders

Abstract

In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose the decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest that self-attention between mask tokens is not essential for learning good representations. To this end, we propose a novel pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE). CrossMAE's decoder leverages only cross-attention between masked and visible tokens, with no degradation in downstream performance. This design also enables decoding only a small subset of mask tokens, boosting efficiency. Furthermore, each decoder block can now leverage different encoder features, resulting in improved representation learning. CrossMAE matches MAE in performance with 2.5 to 3.7× less decoding compute. It also surpasses MAE on ImageNet classification and COCO instance segmentation under the same compute. Code and models are available at https://crossmae.github.io
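The decoding scheme described in the abstract can be illustrated with a minimal sketch. It is not the released CrossMAE implementation: it is a single-head, pure-Python illustration in which mask-token queries attend only to visible-token keys and values, together with a rough count of query-key pairs. The function names, the 196-patch input, the 0.75 mask ratio, and the `decode_fraction` parameter are assumptions chosen for illustration. The pair count ignores projection and MLP costs, so it should not be read as a reproduction of the reported 2.5 to 3.7× figure.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Each query (a mask token) attends only to the keys/values
    (visible tokens); queries never attend to each other.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out_dim = len(values[0])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs


def attention_pairs(num_patches, mask_ratio, decode_fraction=1.0):
    """Count query-key pairs per attention layer (illustrative only).

    Returns (pairs for self-attention over all tokens,
             pairs for cross-attention from decoded mask tokens to visible tokens).
    decode_fraction is a hypothetical knob for decoding only some mask tokens.
    """
    num_masked = round(num_patches * mask_ratio)
    num_visible = num_patches - num_masked
    num_decoded = round(num_masked * decode_fraction)
    return num_patches * num_patches, num_decoded * num_visible


if __name__ == "__main__":
    # Assumed setting: 196 patches, 75% of them masked.
    full_self, cross_only = attention_pairs(196, 0.75)
    print("self-attention pairs:", full_self)
    print("cross-attention pairs:", cross_only)
```

In this sketch, lowering `decode_fraction` shrinks only the number of queries, which mirrors the abstract's point that decoding a subset of mask tokens becomes possible once mask tokens no longer attend to one another.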
