R-MAE: Regions Meet Masked Autoencoders
June 8, 2023
Authors: Duy-Kien Nguyen, Vaibhav Aggarwal, Yanghao Li, Martin R. Oswald, Alexander Kirillov, Cees G. M. Snoek, Xinlei Chen
cs.AI
Abstract
Vision-specific concepts such as "region" have played a key role in extending
general machine learning frameworks to tasks like object detection. Given the
success of region-based detectors for supervised learning and the progress of
intra-image methods for contrastive learning, we explore the use of regions for
reconstructive pre-training. Starting from Masked Autoencoding (MAE) both as a
baseline and an inspiration, we propose a parallel pretext task tailored to
address the one-to-many mapping between images and regions. Since such regions
can be generated in an unsupervised way, our approach (R-MAE) inherits the wide
applicability from MAE, while being more "region-aware". We conduct thorough
analyses during the development of R-MAE, and converge on a variant that is
both effective and efficient (1.3% overhead over MAE). Moreover, it shows
consistent quantitative improvements when generalized to various pre-training
data and downstream detection and segmentation benchmarks. Finally, we provide
extensive qualitative visualizations to enhance the understanding of R-MAE's
behaviour and potential. Code will be made available at
https://github.com/facebookresearch/r-mae.
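
To make the core idea more concrete, below is a minimal, hypothetical PyTorch sketch of an MAE-style encoder whose features feed two heads: the usual pixel-reconstruction decoder and a parallel head that predicts per-patch membership in unsupervised region maps. This is only an illustration under assumed names (`TinyMAEWithRegions`, `region_targets`, etc.); the actual R-MAE architecture, including how it resolves the one-to-many image-to-region mapping, differs and is described in the paper and the linked repository.

```python
# Conceptual sketch only: pixel reconstruction plus a parallel region-prediction
# pretext task. Not the authors' implementation; all names are assumptions.
import torch
import torch.nn as nn


class TinyMAEWithRegions(nn.Module):
    def __init__(self, num_patches=196, patch_dim=768, embed_dim=256, num_regions=8):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Pixel decoder: reconstruct raw patch values (the standard MAE target).
        self.pixel_decoder = nn.Linear(embed_dim, patch_dim)
        # Region decoder: predict, per patch, which unsupervised region it
        # belongs to -- a stand-in for the parallel "region" pretext task.
        self.region_decoder = nn.Linear(embed_dim, num_regions)

    def forward(self, patches, mask, region_targets):
        # patches:        (B, N, patch_dim) flattened image patches
        # mask:           (B, N) bool, True = masked (to be reconstructed)
        # region_targets: (B, N) long, per-patch region id from an
        #                 unsupervised region generator
        x = self.patch_embed(patches)
        # For brevity this toy encodes all patches; real MAE encodes only the
        # visible ones and inserts mask tokens before decoding.
        z = self.encoder(x)
        pixel_pred = self.pixel_decoder(z)
        region_pred = self.region_decoder(z)

        pixel_loss = ((pixel_pred - patches) ** 2)[mask].mean()
        region_loss = nn.functional.cross_entropy(
            region_pred[mask], region_targets[mask]
        )
        return pixel_loss + region_loss


# Toy usage with random data.
model = TinyMAEWithRegions()
patches = torch.randn(2, 196, 768)
mask = torch.rand(2, 196) < 0.75                  # 75% masking ratio
region_targets = torch.randint(0, 8, (2, 196))
loss = model(patches, mask, region_targets)
loss.backward()
```

The sketch is meant only to show how a region-aware objective can run in parallel with pixel reconstruction on shared encoder features, which is the high-level recipe the abstract describes.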